If you are using Avisynth+ and DGDecNV, then you have DGSharpen(), which is a very fast CUDA implementation with functionality like LSFmod. It works at 8- or 16-bit depth.
You really should get away from all the high-bit-depth hacks, IMHO, and go for native support.