Offloading to GPU

FranceBB · 23rd January 2021, 14:17

Hi there, folks,
Out of curiosity, this is one of the scripts I'm currently running automatically on several files on a farm of two servers, each with an Intel Xeon 28c/56th and 64 GB of RAM. I'm quite happy with the output, but it's currently running at 0.6fps, and encoding isn't the issue: Avisynth is. Is there anything (besides the indexer) that I can offload to the GPU? I know about DGDecodeNV and I'm also a "customer" (Donald knows), but what about the other filters I'm using? As for the encoding itself, I'm not planning to use GPU encoding for the output due to quality concerns, so... is there anything inside Avisynth that I can offload to the GPU?

(side note: Input is v210 lossless .avi 720x576 25i so nothing hard to index, you know...)

Code:

#Indexing
video=FFVideoSource("\\mibcssda001\Media Ingest\00_INGEST_MAM\A.R.C.A\00_FILE_DA_ENCODARE\file.avi")
ch1=FFAudioSource("\\mibcssda001\Media Ingest\00_INGEST_MAM\A.R.C.A\00_FILE_DA_ENCODARE\file.avi", track=1)
ch2=FFAudioSource("\\mibcssda001\Media Ingest\00_INGEST_MAM\A.R.C.A\00_FILE_DA_ENCODARE\file.avi", track=2)
ch3=FFAudioSource("\\mibcssda001\Media Ingest\00_INGEST_MAM\A.R.C.A\00_FILE_DA_ENCODARE\file.avi", track=3)
ch4=FFAudioSource("\\mibcssda001\Media Ingest\00_INGEST_MAM\A.R.C.A\00_FILE_DA_ENCODARE\file.avi", track=4)
audio=MergeChannels(ch1, ch2, ch3, ch4, ch1, ch2, ch3, ch4)
AudioDub(video, audio)

#Bob-deinterlacing
AssumeTFF()
QTGMC( Preset="Placebo")

#Bring everything to 16bit planar (from the implicit last clip, i.e. the QTGMC output)
HBD=ConvertBits(bits=16)

#Convert to 4:2:2 planar 16bit
c=Converttoyuv422(HBD, matrix="Rec601")

#De-Spot 16bit planar
SpotLess(c)

#Degrain in 16bit planar
super = MSuper(pel=2, sharp=1)
bv1 = MAnalyse(super, isb = true, delta = 1, overlap=4)
fv1 = MAnalyse(super, isb = false, delta = 1, overlap=4)
bv2 = MAnalyse(super, isb = true, delta = 2, overlap=4)
fv2 = MAnalyse(super, isb = false, delta = 2, overlap=4)
degrain=MDegrain2(super,bv1,fv1,bv2,fv2,thSADC=1200, thSAD=1200)

#Spatial denoise 16bit planar
denoise=dfttest(degrain, sigma=64, tbsize=1, lsb_in=false, lsb=false, Y=true, U=true, V=true, dither=0)

#Adding borders for 1.33 PB 4:3 with 16bit planar precision
borders=AddBorders(denoise, 152, 0, 152, 0)

#Upscale to FULL HD with Spline64 + NNEDI and 16bit planar precision
resized=nnedi3_rpow2(borders, cshift="Spline64ResizeMT", rfactor=2, fwidth=1920, fheight=1080, nsize=4, nns=4, qual=1, etype=0, pscrn=2, threads=56, csresize=true, mpeg2=true, threads_rs=0, logicalCores_rs=true, MaxPhysCore_rs=true, SetAffinity_rs=false)

#From 16bit planar to 16bit interleaved
interleaved=ConvertToDoubleWidth(resized)

#Matrix Conversion from BT601 to BT709 with 16bit interleaved precision
color=Matrix(interleaved, from=601, to=709, rg=1.0, gg=1.0, bg=1.0, a=16, b=235, ao=16, bo=235, bitdepth=16)

#From 16bit interleaved to 16bit planar
planar=ConvertFromDoubleWidth(color)

#Dithering from 16bit planar to 8bit planar with the Floyd-Steinberg error diffusion
dithered=ConvertBits(planar, bits=8, dither=1)

#Limiter TV Range 0.0 - 0.7V
m_clip=Limiter(dithered, min_luma=16, max_luma=235, min_chroma=16, max_chroma=240)


Return m_clip

Atak_Snajpera · 23rd January 2021, 15:32

By the way, no Prefetch in the script?

Code:

#Indexing
video=FFVideoSource("\\mibcssda001\Media Ingest\00_INGEST_MAM\A.R.C.A\00_FILE_DA_ENCODARE\file.avi")
ch1=FFAudioSource("\\mibcssda001\Media Ingest\00_INGEST_MAM\A.R.C.A\00_FILE_DA_ENCODARE\file.avi", track=1)
ch2=FFAudioSource("\\mibcssda001\Media Ingest\00_INGEST_MAM\A.R.C.A\00_FILE_DA_ENCODARE\file.avi", track=2)
ch3=FFAudioSource("\\mibcssda001\Media Ingest\00_INGEST_MAM\A.R.C.A\00_FILE_DA_ENCODARE\file.avi", track=3)
ch4=FFAudioSource("\\mibcssda001\Media Ingest\00_INGEST_MAM\A.R.C.A\00_FILE_DA_ENCODARE\file.avi", track=4)
audio=MergeChannels(ch1, ch2, ch3, ch4, ch1, ch2, ch3, ch4)
AudioDub(video, audio)

#Bob-deinterlacing
AssumeTFF()
QTGMC( Preset="Placebo")

#Bring everything to 16bit planar (from the implicit last clip, i.e. the QTGMC output)
HBD=ConvertBits(bits=16)

#Convert to 4:2:2 planar 16bit
c=Converttoyuv422(HBD, matrix="Rec601")

#De-Spot 16bit planar
SpotLess(c)

#Degrain in 16bit planar
super = MSuper(pel=2, sharp=1)
bv1 = MAnalyse(super, isb = true, delta = 1, overlap=4)
fv1 = MAnalyse(super, isb = false, delta = 1, overlap=4)
bv2 = MAnalyse(super, isb = true, delta = 2, overlap=4)
fv2 = MAnalyse(super, isb = false, delta = 2, overlap=4)
degrain=MDegrain2(super,bv1,fv1,bv2,fv2,thSADC=1200, thSAD=1200)

#Spatial denoise 16bit planar
denoise=dfttest(degrain, sigma=64, tbsize=1, lsb_in=false, lsb=false, Y=true, U=true, V=true, dither=0)

#Adding borders for 1.33 PB 4:3 with 16bit planar precision
borders=AddBorders(denoise, 152, 0, 152, 0)

#Upscale to FULL HD with Spline64 + NNEDI and 16bit planar precision
resized=nnedi3_rpow2(borders, cshift="Spline64ResizeMT", rfactor=2, fwidth=1920, fheight=1080, nsize=4, nns=4, qual=1, etype=0, pscrn=2, threads=56, csresize=true, mpeg2=true, threads_rs=0, logicalCores_rs=true, MaxPhysCore_rs=true, SetAffinity_rs=false)

#From 16bit planar to 16bit interleaved
interleaved=ConvertToDoubleWidth(resized)

#Matrix Conversion from BT601 to BT709 with 16bit interleaved precision
color=Matrix(interleaved, from=601, to=709, rg=1.0, gg=1.0, bg=1.0, a=16, b=235, ao=16, bo=235, bitdepth=16)

#From 16bit interleaved to 16bit planar
planar=ConvertFromDoubleWidth(color)

#Dithering from 16bit planar to 8bit planar with the Floyd-Steinberg error diffusion
dithered=ConvertBits(planar, bits=8, dither=1)

#Limiter TV Range 0.0 - 0.7V
m_clip=Limiter(dithered, min_luma=16, max_luma=235, min_chroma=16, max_chroma=240)

#Prefetch
m_clip=Prefetch(m_clip,28)

Return m_clip

Furthermore, personally I wouldn't use such high values in MDegrain2 (thSADC=1200, thSAD=1200) because that is a recipe for ugly ghosting artefacts.
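
For comparison, here is a sketch of the same degrain stage with gentler thresholds. The value 400 is only an illustrative starting point (it is the MVTools2 default for thSAD as far as I know), "input.avi" is a placeholder path, and the right value depends entirely on the footage.

```avisynth
# Assumptions: MVTools2 and FFMS2 are loaded; "input.avi" is a hypothetical path.
# thSAD/thSADC = 400 is a starting point for experiments, not a recommendation.
c = FFVideoSource("input.avi").ConvertBits(bits=16)

super = MSuper(c, pel=2, sharp=1)
bv1 = MAnalyse(super, isb=true,  delta=1, overlap=4)
fv1 = MAnalyse(super, isb=false, delta=1, overlap=4)
bv2 = MAnalyse(super, isb=true,  delta=2, overlap=4)
fv2 = MAnalyse(super, isb=false, delta=2, overlap=4)

# Lower thresholds blend fewer poorly matched blocks, which is what limits ghosting.
degrain = MDegrain2(c, super, bv1, fv1, bv2, fv2, thSAD=400, thSADC=400)
return degrain
```

Raising the two thresholds step by step from there and checking fast-moving scenes would show where ghosting starts on a given tape.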

real.finder · 23rd January 2021, 16:14

There is https://github.com/nekopanda/AviSynthPlus/releases (I think avs+ can do some of the same work after the 3.6 update)

and there are some CUDA plugins here: https://github.com/nekopanda/AviSynthCUDAFilters (I don't know whether they work in avs+ 3.6)

Aside from all that, maybe we need someone to backport the OpenCL versions of plugins from VS, like https://github.com/HomeOfVapourSynth...Synth-NNEDI3CL (the avs one by SEt is closed source and no one can update it since SEt is no longer active). There are also some plugins that have both CL and CPU functions in the same plugin, like https://github.com/HomeOfVapourSynth...urSynth-TCanny; Asd has already backported it to avs, but only the CPU function!

Frank62 · 23rd January 2021, 19:08

I wouldn't use QTGMC with "placebo". It looks like the bottleneck to me. And "slow" often leaves more detail. Try it.

FranceBB · 24th January 2021, 10:47

The only reason why I didn't add Prefetch is that I know that some of the plugins create their own thread pool, like plugins_JPSDR which I'm using in my filterchain, so I don't know how it's gonna behave, but if you think it's gonna behave nicely, I'll add it.

Quote:

Originally Posted by Atak_Snajpera

Personally I wouldn't use such high values in MDegrain2 (thSADC=1200, thSAD=1200) because that is a recipe for ugly ghosting artefacts.

True, however there's so much noise and grain on those tapes that I don't have much choice. Those are 25i truly interlaced recordings of live feeds from RAI from the 70s on U-Matic and believe me, they have no details whatsoever, tons of grain and far too much noise (due to the transmission methods employed at the time).

Note: not my pictures

We've also noticed a deterioration of the binder in the magnetic tapes, which holds the iron oxide magnetic coating to its plastic carrier. Some people suggested dehydrating them in a carefully controlled manner, but we don't have the tools to do that. For now they still play, somehow, but it might well be the last time they do. They're in horrible condition and a very strong denoise and degrain is needed (oh, and I checked: I don't get ghosting, except that the ball is sometimes removed in tennis matches, but I encode those with different parameters to solve the problem, so it's not a big deal).

Quote:

Originally Posted by Frank62

I wouldn't use QTGMC with "placebo". It looks like the bottleneck to me. And "slow" often leaves more detail. Try it.

Not that there are many details in those contents, but I'll give it a shot.

Frank62 · 24th January 2021, 13:50

Ok... with so much grain it really will make no difference. Then better try "fast"...

For electronic noise like this we have been using NeatVideo for many years. It is also quite slow, though.

FranceBB · 24th January 2021, 14:01

Quote:

Originally Posted by Frank62

For electronic noise like this we have been using NeatVideo for many years. It is also quite slow, though.

Yeah, Jean Philippe also suggested it to me two years ago (although it's a paid solution).

Frank62 · 24th January 2021, 18:54

Just in case you are interested:
We use NeatVideo as the best solution for this kind of noise, but I forgot to mention: VERY carefully...
In almost all cases we set it to only 5% for the spatial highs (mids and lows at zero!) and to 2 or 3 frames temporally. Used that way, it is the best temporal noise remover I know of so far.
In many cases we also mix back some of the original noise (overlay, transparency ~0.3) to avoid a waxy look.
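
The "mix back some of the original" step can be sketched inside AviSynth as a plain weighted blend. This is only an illustration under assumptions: ColorBars stands in for the real source, the built-in TemporalSoften stands in for the real denoiser, and "~0.3" is read as the opacity of the original layer.

```avisynth
# Stand-in clips so the sketch runs with core filters only.
original = ColorBars(width=720, height=576, pixel_type="YV12")
denoised = TemporalSoften(original, 2, 4, 8, 15, 2)   # placeholder denoiser

# Blend 30% of the untouched source back over the denoised result
# to keep a little texture and avoid the waxy look.
mixed = Merge(denoised, original, weight=0.3)
return mixed
```

Merge is used here instead of Overlay because it is a simple per-pixel weighted average; with a real denoiser the weight would be tuned by eye.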

Atak_Snajpera · 24th January 2021, 19:06

Yeah, forget about QTGMC placebo and just use medium. Anything above that is a waste of time and electricity. Regarding Prefetch, I recommend starting with the number of physical cores instead of going straight to the total number of supported threads. You may also reduce the number of threads in nnedi to 2 or even 1.
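
Put together, these suggestions might look like the following reduced script. It is a sketch, not the full chain from the first post: "input.avi" is a placeholder path, 28 is the physical core count of the Xeon described above, and the other nnedi3_rpow2 arguments are left at their defaults for brevity.

```avisynth
# Assumptions: FFMS2, QTGMC (with its dependencies) and NNEDI3 are loaded;
# "input.avi" is a hypothetical path.
FFVideoSource("input.avi")
AssumeTFF()
QTGMC(Preset="Medium")              # instead of "Placebo"

# Keep the filter's own thread pool small so it does not fight with Prefetch.
nnedi3_rpow2(rfactor=2, threads=1)

# Start with the number of physical cores (28), not the 56 logical threads.
Prefetch(28)
```

Prefetch has to stay on the last line of the script, and the thread count is something to measure rather than assume.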

FranceBB · 28th January 2021, 11:17

Ok, I tried with Prefetch and I gotta say, I'm not impressed at all...
If anything, I'm surprised 'cause it's even slower than without it...
I tried limiting NNEDI to 1 thread and also removing it completely from the filter chain, but nothing, in all my tests, I dropped from 0.3-0.5fps without Prefetch to 0.1fps with Prefetch at 28...

EDIT: Lowering Prefetch down to 8 or 6 allows me to get the very same speed I usually get without Prefetch, so 0.3fps... It's not really worth it... I'm not gonna be using Prefetch! (Keep in mind that it's a 28c/56th Xeon, so I expected much better from it...)

pinterf · 28th January 2021, 12:10

Quote:

Originally Posted by FranceBB

Ok, I tried with Prefetch and I gotta say, I'm not impressed at all...
If anything, I'm surprised 'cause it's even slower than without it...
I tried limiting NNEDI to 1 thread and also removing it completely from the filter chain, but nothing, in all my tests, I dropped from 0.3-0.5fps without Prefetch to 0.1fps with Prefetch at 28...

EDIT: Lowering Prefetch down to 8 or 6 allows me to get the very same speed I usually get without Prefetch, so 0.3fps... It's not really worth it... I'm not gonna be using Prefetch! (Keep in mind that it's a 28c/56th Xeon, so I expected much better from it...)

Have you adjusted SetMemoryMax? A large thread count needs more memory. Too little memory kills the caches and the speed. Set it to a huge value, then check the actual memory consumption with AVSMeter.
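
A minimal sketch of where that call goes, assuming a placeholder source path and an arbitrary limit of 32768 MB that would be adjusted once the real consumption has been measured:

```avisynth
# Assumptions: "input.avi" is a hypothetical path; 32768 (MB) is an arbitrary
# generous value, to be tuned after checking the real usage in AVSMeter.
SetMemoryMax(32768)                 # frame cache limit in MB, set before the filters

FFVideoSource("input.avi")
AssumeTFF()
QTGMC(Preset="Medium")

Prefetch(28)
```

Running the script through AVSMeter then shows the frame rate together with the memory actually used, which tells whether the limit is being hit.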

FranceBB · 28th January 2021, 13:53

With SetMemoryMax(128000), so 128 GB, which is the maximum available RAM on the other server, and Prefetch at 28, memory usage climbs to 21 GB, drops to 14 GB, climbs back to 21 GB and drops to 14 GB again, in a loop.
The speed however is the same: 0.1fps.
With Prefetch 2 the RAM is steady and way lower and the speed is 0.3fps, so about the same as I get without Prefetch.
This is definitely weird...

pinterf · 28th January 2021, 15:53

The bottleneck is TemporalMedian in Spotless.
TemporalMedian works internally with histograms, so bit depth heavily affects the speed. Checking only 256 levels is much quicker than working with a histogram array of size 65536.

First I modded the plugin to use SSE2 for 16 bit videos (untested, and not put into the live code).
Presently only 8 bit videos have SSE2 in TemporalMedian; 10+ bit depths use plain C.
It got quicker, but not by much.

Then I tried feeding MedianBlur with only a 10 bit clip. I recommend trying this option.

EDIT:
Specify threads=1 directly for dfttest when using Prefetch. Its default value is 0, which means that it uses num_processors internal threads. When the thread count is not 1, this filter has MT_SERIALIZED behaviour instead of MT_MULTI_INSTANCE.
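
Both suggestions can be sketched in a few lines. This is one possible reading of the advice, with assumptions: "input.avi" is a placeholder path, the 10-bit detour is applied only around SpotLess, and the dfttest arguments other than threads are taken from the original script.

```avisynth
# Assumptions: FFMS2, SpotLess (with MedianBlur2 and MVTools2) and dfttest are
# loaded; "input.avi" is a hypothetical path.
FFVideoSource("input.avi")

ConvertBits(bits=10)                # 1024 histogram levels instead of 65536
SpotLess()
ConvertBits(bits=16)                # back to 16 bit for the later stages

# threads=1 lets Prefetch run several dfttest instances in parallel
# (MT_MULTI_INSTANCE) instead of serializing the filter.
dfttest(sigma=64, tbsize=1, threads=1)

Prefetch(28)
```

Since the v210 source is 10 bit to begin with, running SpotLess before any conversion to 16 bit costs no precision at that stage.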

FranceBB · 30th January 2021, 14:19

Quote:

Originally Posted by pinterf

Specify threads=1 directly for dfttest when using Prefetch. Its default value is 0, which means that it uses num_processors internal threads. When the thread count is not 1, this filter has MT_SERIALIZED behaviour instead of MT_MULTI_INSTANCE.

Ok, I'll try with threads=1 on dfttest as well, but question: I just noticed that it can't handle more than 16 threads if I use it normally without Prefetch.

Code:

	if (threads < 0 || threads > 16)
		env->ThrowError("dfttest:  threads must be between 0 and 16 (inclusive)!");

Those are lines 1345-1346 of dfttest.cpp. Why is that?

pinterf · 30th January 2021, 14:42

I don't know.
Back to Spotless: the way TemporalMedian is used (radius=0, temporal radius=1) is highly suboptimal in the present plugin; I'm considering optimizing this special case.
You could also try z_ConvertFormat instead of Matrix; it can combine the colorspace conversion, the bit depth conversion and the dithering.
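
A sketch of what that replacement might look like. Assumptions: the avsresize plugin is loaded, BlankClip stands in for the upscaled 16-bit clip, and the colorspace_op string (matrix:transfer:primaries:range on each side) changes only the matrix from BT.601 (470bg) to BT.709, mirroring what Matrix(from=601, to=709) did; the exact strings should be checked against the avsresize documentation.

```avisynth
# Stand-in for the 16-bit 4:2:2 1920x1080 clip produced by the upscale step.
resized = BlankClip(width=1920, height=1080, pixel_type="YUV422P16")

# One call: 601 -> 709 matrix change, 16 -> 8 bit conversion, error diffusion dither.
# Transfer and primaries are declared identical on both sides so only the matrix changes.
z_ConvertFormat(resized, pixel_type="YUV422P8", \
                colorspace_op="470bg:709:709:l=>709:709:709:l", \
                dither_type="error_diffusion")
```

This single call would stand in for the ConvertToDoubleWidth, Matrix, ConvertFromDoubleWidth and ConvertBits(bits=8, dither=1) steps of the original script; Limiter can still follow it.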

FranceBB · 30th January 2021, 14:58

Quote:

Originally Posted by pinterf

I don't know.
Back to Spotless: the way TemporalMedian is used (radius=0, temporal radius=1) is highly suboptimal in the present plugin; I'm considering optimizing this special case.
You could also try z_ConvertFormat instead of Matrix; it can combine the colorspace conversion, the bit depth conversion and the dithering.

Gotcha.
I'll try to replace it with z_ConvertFormat so that I don't have to go to 16bit interleaved and come back. That should speed things up even further.

pinterf · 30th January 2021, 16:45

I've just tried the above-mentioned special use case (radius=0, temporal radius=1) with an optimized TemporalMedian version.
Breaking the script after Spotless:
With the original DLL version the script ran at 0.37fps.
Then I added AVX2 code to TemporalMedian (still the generic approach) and it reached 0.57fps. Good.
But handling this special case separately resulted in a huge speed gain: now I'm getting 3.08fps. A significant change.
AAA+ Green Label

I'm doing some more checks, then I'll release it in a few days.

pinterf · 30th January 2021, 21:25

Please test with this one, the new MedianBlur2 version:
https://github.com/pinterf/MedianBlur2/releases/tag/1.1

Code:

- 1.1 (20210130) - pinterf
  - Speed: SSE2 and AVX2 for 10+ bits (generic case, MedianBlur)
  - Speed: SSE2 and AVX2 for TemporalMedianBlur
  - Speed: Much-much quicker: TemporalMedianBlur special case: temporal radius=1 or 2, spatial radius=0 (C, SSE4.1, AVX2)
  - Pass frame properties when Avisynth interface>=8
  - Debug helper parameter 'opt': integer default -1
    <0: autodetect CPU
    0: C only (disable SSE2 and AVX2)
    1: SSE2 (disable SSE4.1 and AVX2)
    2: SSE4 (disable AVX2)
    3: AVX2
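
To compare the code paths listed in the changelog, something like the following could be timed with AVSMeter. The function and parameter names (MedianBlurTemporal, radiusY/radiusU/radiusV, temporalRadius) are assumed from the plugin's README and should be verified there; BlankClip is only a stand-in source.

```avisynth
# Assumptions: MedianBlur2 1.1 is loaded; names below are taken from the
# plugin README as I understand it, not from this thread.
src = BlankClip(width=720, height=576, pixel_type="YUV422P16")

# Special case from the changelog: spatial radius 0, temporal radius 1.
fast  = MedianBlurTemporal(src, radiusY=0, radiusU=0, radiusV=0, temporalRadius=1)          # opt=-1: autodetect CPU
plain = MedianBlurTemporal(src, radiusY=0, radiusU=0, radiusV=0, temporalRadius=1, opt=0)   # C only

return fast   # swap for "plain" to measure the non-SIMD path
```

On real footage the two outputs should be identical; only the speed should differ.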

FranceBB · 30th January 2021, 21:26

Quote:

Originally Posted by pinterf

I've just tried the above mentioned special use case (radius=0, temporal radius=1) with an optimized TemporalMedian version.
Breaking the script after Spotless:
With the original DLL version the script ran at 0.37fps.
Then I added AVX2 code to TemporalMedian (still the generic approach) and it reached 0.57fps. Good.
But handling this special case separately resulted in a huge speed gain: now I'm getting 3.08fps. A significant change.
AAA+ Green Label

I'm doing some more checks, then I'll release it in a few days.

Wow! 3FPS? That would be a dream!! *_*
It would speed things up a lot, considering that this filterchain is here to stay on our servers for the foreseeable future! Thanks!!

I'm really looking forward to trying it and putting it in production!

Frank62 · 30th January 2021, 23:15

Thanks from me, too! Will save a lot of time in the future!
