Old 23rd January 2021, 14:17   #1  |  Link
FranceBB
Broadcast Encoder
 
 
Join Date: Nov 2013
Location: Royal Borough of Kensington & Chelsea, UK
Posts: 2,904
Offloading to GPU

Hi there, folks,
out of curiosity: this is one of the scripts I'm currently running automatically on several files on a farm of two servers, each with a 28c/56th Intel Xeon and 64 GB of RAM. I'm quite happy with the output, but it currently runs at 0.6fps, and encoding is not the issue: Avisynth is. Is there anything (besides the indexer) that I can offload to the GPU? I know about DGDecodeNV and I'm also a "customer" (Donald knows), but what about the other filters I'm using? As for the encoding itself, I'm not planning to use GPU encoding for the output due to quality concerns, so... is there anything inside Avisynth that I can offload to the GPU?

(side note: Input is v210 lossless .avi 720x576 25i so nothing hard to index, you know...)

Code:
#Indexing
video=FFVideoSource("\\mibcssda001\Media Ingest\00_INGEST_MAM\A.R.C.A\00_FILE_DA_ENCODARE\file.avi")
ch1=FFAudioSource("\\mibcssda001\Media Ingest\00_INGEST_MAM\A.R.C.A\00_FILE_DA_ENCODARE\file.avi", track=1)
ch2=FFAudioSource("\\mibcssda001\Media Ingest\00_INGEST_MAM\A.R.C.A\00_FILE_DA_ENCODARE\file.avi", track=2)
ch3=FFAudioSource("\\mibcssda001\Media Ingest\00_INGEST_MAM\A.R.C.A\00_FILE_DA_ENCODARE\file.avi", track=3)
ch4=FFAudioSource("\\mibcssda001\Media Ingest\00_INGEST_MAM\A.R.C.A\00_FILE_DA_ENCODARE\file.avi", track=4)
audio=MergeChannels(ch1, ch2, ch3, ch4, ch1, ch2, ch3, ch4)
AudioDub(video, audio)

#Bob-deinterlacing
AssumeTFF()
QTGMC( Preset="Placebo")

#Bring everything to 16bit planar
#(m_clip is not defined yet at this point, so take the QTGMC output from "last")
HBD=ConvertBits(last, bits=16)

#Convert to 4:2:2 planar 16bit
c=Converttoyuv422(HBD, matrix="Rec601")

#De-spot 16bit planar
SpotLess(c)

#Degrain in 16bit planar
super = MSuper(pel=2, sharp=1)
bv1 = MAnalyse(super, isb = true, delta = 1, overlap=4)
fv1 = MAnalyse(super, isb = false, delta = 1, overlap=4)
bv2 = MAnalyse(super, isb = true, delta = 2, overlap=4)
fv2 = MAnalyse(super, isb = false, delta = 2, overlap=4)
degrain=MDegrain2(super,bv1,fv1,bv2,fv2,thSADC=1200, thSAD=1200)

#Spatial denoise 16bit planar
denoise=dfttest(degrain, sigma=64, tbsize=1, lsb_in=false, lsb=false, Y=true, U=true, V=true, dither=0)

#Adding borders for 1.33 PB 4:3 with 16bit planar precision
borders=AddBorders(denoise, 152, 0, 152, 0)

#Upscale to FULL HD with Spline64 + NNEDI and 16bit planar precision
resized=nnedi3_rpow2(borders, cshift="Spline64ResizeMT", rfactor=2, fwidth=1920, fheight=1080, nsize=4, nns=4, qual=1, etype=0, pscrn=2, threads=56, csresize=true, mpeg2=true, threads_rs=0, logicalCores_rs=true, MaxPhysCore_rs=true, SetAffinity_rs=false)

#From 16bit planar to 16bit interleaved
interleaved=ConvertToDoubleWidth(resized)

#Matrix Conversion from BT601 to BT709 with 16bit interleaved precision
color=Matrix(interleaved, from=601, to=709, rg=1.0, gg=1.0, bg=1.0, a=16, b=235, ao=16, bo=235, bitdepth=16)

#From 16bit interleaved to 16bit planar
planar=ConvertFromDoubleWidth(color)

#Dithering from 16bit planar to 8bit planar with the Floyd-Steinberg error diffusion
dithered=ConvertBits(planar, bits=8, dither=1)

#Limiter TV Range 0.0 - 0.7V
m_clip=Limiter(dithered, min_luma=16, max_luma=235, min_chroma=16, max_chroma=240)


Return m_clip
Old 23rd January 2021, 15:32   #2  |  Link
Atak_Snajpera
RipBot264 author
 
 
Join Date: May 2006
Location: Poland
Posts: 7,815
By the way, no Prefetch in the script?

Code:
#Indexing
video=FFVideoSource("\\mibcssda001\Media Ingest\00_INGEST_MAM\A.R.C.A\00_FILE_DA_ENCODARE\file.avi")
ch1=FFAudioSource("\\mibcssda001\Media Ingest\00_INGEST_MAM\A.R.C.A\00_FILE_DA_ENCODARE\file.avi", track=1)
ch2=FFAudioSource("\\mibcssda001\Media Ingest\00_INGEST_MAM\A.R.C.A\00_FILE_DA_ENCODARE\file.avi", track=2)
ch3=FFAudioSource("\\mibcssda001\Media Ingest\00_INGEST_MAM\A.R.C.A\00_FILE_DA_ENCODARE\file.avi", track=3)
ch4=FFAudioSource("\\mibcssda001\Media Ingest\00_INGEST_MAM\A.R.C.A\00_FILE_DA_ENCODARE\file.avi", track=4)
audio=MergeChannels(ch1, ch2, ch3, ch4, ch1, ch2, ch3, ch4)
AudioDub(video, audio)

#Bob-deinterlacing
AssumeTFF()
QTGMC( Preset="Placebo")

#Bring everything to 16bit planar
#(m_clip is not defined yet at this point, so take the QTGMC output from "last")
HBD=ConvertBits(last, bits=16)

#Convert to 4:2:2 planar 16bit
c=Converttoyuv422(HBD, matrix="Rec601")

#De-spot 16bit planar
SpotLess(c)

#Degrain in 16bit planar
super = MSuper(pel=2, sharp=1)
bv1 = MAnalyse(super, isb = true, delta = 1, overlap=4)
fv1 = MAnalyse(super, isb = false, delta = 1, overlap=4)
bv2 = MAnalyse(super, isb = true, delta = 2, overlap=4)
fv2 = MAnalyse(super, isb = false, delta = 2, overlap=4)
degrain=MDegrain2(super,bv1,fv1,bv2,fv2,thSADC=1200, thSAD=1200)

#Spatial denoise 16bit planar
denoise=dfttest(degrain, sigma=64, tbsize=1, lsb_in=false, lsb=false, Y=true, U=true, V=true, dither=0)

#Adding borders for 1.33 PB 4:3 with 16bit planar precision
borders=AddBorders(denoise, 152, 0, 152, 0)

#Upscale to FULL HD with Spline64 + NNEDI and 16bit planar precision
resized=nnedi3_rpow2(borders, cshift="Spline64ResizeMT", rfactor=2, fwidth=1920, fheight=1080, nsize=4, nns=4, qual=1, etype=0, pscrn=2, threads=56, csresize=true, mpeg2=true, threads_rs=0, logicalCores_rs=true, MaxPhysCore_rs=true, SetAffinity_rs=false)

#From 16bit planar to 16bit interleaved
interleaved=ConvertToDoubleWidth(resized)

#Matrix Conversion from BT601 to BT709 with 16bit interleaved precision
color=Matrix(interleaved, from=601, to=709, rg=1.0, gg=1.0, bg=1.0, a=16, b=235, ao=16, bo=235, bitdepth=16)

#From 16bit interleaved to 16bit planar
planar=ConvertFromDoubleWidth(color)

#Dithering from 16bit planar to 8bit planar with the Floyd-Steinberg error diffusion
dithered=ConvertBits(planar, bits=8, dither=1)

#Limiter TV Range 0.0 - 0.7V
m_clip=Limiter(dithered, min_luma=16, max_luma=235, min_chroma=16, max_chroma=240)

#Prefetch
m_clip=Prefetch(m_clip,28)

Return m_clip
Furthermore, I personally wouldn't use such high values in MDegrain2 (thSADC=1200, thSAD=1200) because that is a recipe for ugly ghosting artefacts.

Last edited by Atak_Snajpera; 23rd January 2021 at 15:47.
Old 23rd January 2021, 16:14   #3  |  Link
real.finder
Registered User
 
Join Date: Jan 2012
Location: Mesopotamia
Posts: 2,587
There is https://github.com/nekopanda/AviSynthPlus/releases (I think avs+ can do some of the same work since the 3.6 update)

and there are some CUDA plugins here: https://github.com/nekopanda/AviSynthCUDAFilters (I don't know if they work in avs+ 3.6).

Aside from all that, maybe we need someone to backport the OpenCL versions of plugins from VS, like https://github.com/HomeOfVapourSynth...Synth-NNEDI3CL (the SEt avs one is closed source and no one can update it since SEt is no longer active). There are also some plugins that have both CL and CPU functions in the same plugin, like https://github.com/HomeOfVapourSynth...urSynth-TCanny; Asd already backported it to avs, but only the CPU function!

Last edited by real.finder; 23rd January 2021 at 16:17.
Old 23rd January 2021, 19:08   #4  |  Link
Frank62
Registered User
 
Join Date: Mar 2017
Location: Germany
Posts: 234
I wouldn't use QTGMC with "placebo". It seems to be the bottleneck to me. And "slow" often leaves more detail. Try it.
Old 24th January 2021, 10:47   #5  |  Link
FranceBB
Broadcast Encoder
 
 
Join Date: Nov 2013
Location: Royal Borough of Kensington & Chelsea, UK
Posts: 2,904
The only reason why I didn't add Prefetch is that I know that some of the plugins create their own thread pool, like plugins_JPSDR which I'm using in my filterchain, so I don't know how it's gonna behave, but if you think it's gonna behave nicely, I'll add it.


Quote:
Originally Posted by Atak_Snajpera
I personally wouldn't use such high values in MDegrain2 (thSADC=1200, thSAD=1200) because that is a recipe for ugly ghosting artefacts.
True, however there's so much noise and grain on those tapes that I don't have much choice. Those are 25i truly interlaced recordings of live feeds from RAI from the 70s on U-Matic and believe me, they have no details whatsoever, tons of grain and far too much noise (due to the transmission methods employed at the time).





(Images not shown.) Note: not my pictures

We've also noticed deterioration of the binder in the magnetic tapes, which holds the iron oxide magnetic coating to its plastic carrier. Some people suggested dehydrating them in a carefully controlled manner, but we don't have the tools to do that. Anyway, for now they still seem to play somehow, though this might well be the last time they play. They're in horrible condition and a very strong denoise and degrain is needed (oh, and I checked: I don't get ghosting, except that the ball is sometimes removed in tennis matches, but I encode those with different parameters to solve the problem, so it's not a big deal).

Quote:
Originally Posted by Frank62
I wouldn't use QTGMC with "placebo". It seems to be the bottleneck to me. And "slow" often leaves more detail. Try it.
Not that there are many details in those contents, but I'll give it a shot.

Last edited by FranceBB; 24th January 2021 at 10:51.
Old 24th January 2021, 13:50   #6  |  Link
Frank62
Registered User
 
Join Date: Mar 2017
Location: Germany
Posts: 234
Ok... with so much grain it really will make no difference. Then better try "fast"...
For electronic grain like this we have been using NeatVideo for many years, and still do. But it is also quite slow.
Old 24th January 2021, 14:01   #7  |  Link
FranceBB
Broadcast Encoder
 
 
Join Date: Nov 2013
Location: Royal Borough of Kensington & Chelsea, UK
Posts: 2,904
Quote:
Originally Posted by Frank62
For electronic grain like this we have been using NeatVideo for many years, and still do. But it is also quite slow.
Yeah, Jean Philippe also suggested it to me two years ago (although it's a paid solution).
Old 24th January 2021, 18:54   #8  |  Link
Frank62
Registered User
 
Join Date: Mar 2017
Location: Germany
Posts: 234
Just in case you are interested:
We use NeatVideo as the best solution for this kind of noise, but I forgot to say: VERY carefully...
In almost all cases we set it to only 5% on the spatial highs (mids and lows at zero!), and temporally to 2 or 3 frames. Used like this, it is the best temporal noise remover I know of so far.
In many cases we also mix back some of the original noise (overlay, transparency ~0.3) to avoid the wax effect.
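
In Avisynth terms, the mix-back step could be sketched with the built-in Merge filter. This is only an illustration, not the NeatVideo workflow itself: "source" and "denoised" are hypothetical clip names, and reading "transparency ~0.3" as a 30% share of the original is an assumption.

Code:
#Hypothetical clips: "source" is the noisy original, "denoised" the filtered result
#Merge weight=0.3 gives 70% denoised + 30% original, keeping a little of the noise texture
mixed=Merge(denoised, source, weight=0.3)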
Old 24th January 2021, 19:06   #9  |  Link
Atak_Snajpera
RipBot264 author
 
 
Join Date: May 2006
Location: Poland
Posts: 7,815
Yeah, forget about QTGMC placebo and just use medium. Anything above that is a waste of time and electricity. Regarding Prefetch, I recommend starting with the number of physical cores instead of going straight to the total number of supported threads. You may also reduce the number of threads in nnedi to 2 or even 1.
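
As a rough sketch, these are the only lines of the script above that would change. The values are starting points to benchmark, not measured optima, and 28 assumes the 28 physical cores of the Xeon mentioned in the first post.

Code:
#Bob-deinterlacing: "Medium" instead of "Placebo"
AssumeTFF()
QTGMC(Preset="Medium")

#Upscale: same call as before, but with a single internal nnedi3 thread
resized=nnedi3_rpow2(borders, cshift="Spline64ResizeMT", rfactor=2, fwidth=1920, fheight=1080, nsize=4, nns=4, qual=1, etype=0, pscrn=2, threads=1, csresize=true, mpeg2=true, threads_rs=0, logicalCores_rs=true, MaxPhysCore_rs=true, SetAffinity_rs=false)

#Prefetch: physical cores (28), not logical threads (56)
m_clip=Prefetch(m_clip, 28)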

Last edited by Atak_Snajpera; 24th January 2021 at 19:10.
Old 28th January 2021, 11:17   #10  |  Link
FranceBB
Broadcast Encoder
 
 
Join Date: Nov 2013
Location: Royal Borough of Kensington & Chelsea, UK
Posts: 2,904
Ok, I tried with Prefetch and I gotta say, I'm not impressed at all...
If anything, I'm surprised 'cause it's even slower than without it...
I tried limiting NNEDI to 1 thread and also removing it completely from the filter chain, but nothing, in all my tests, I dropped from 0.3-0.5fps without Prefetch to 0.1fps with Prefetch at 28...

EDIT: Lowering Prefetch down to 8 or 6 allows me to get the very same speed I usually get without Prefetch, so 0.3fps... It's not really worth it... I'm not gonna be using Prefetch! (Keep in mind that it's a 28c/56th Xeon, so I expected much better from it...)

Last edited by FranceBB; 28th January 2021 at 12:00.
Old 28th January 2021, 12:10   #11  |  Link
pinterf
Registered User
 
Join Date: Jan 2014
Posts: 2,314
Quote:
Originally Posted by FranceBB
Ok, I tried with Prefetch and I gotta say, I'm not impressed at all...
If anything, I'm surprised 'cause it's even slower than without it...
I tried limiting NNEDI to 1 thread and also removing it completely from the filter chain, but nothing, in all my tests, I dropped from 0.3-0.5fps without Prefetch to 0.1fps with Prefetch at 28...

EDIT: Lowering Prefetch down to 8 or 6 allows me to get the very same speed I usually get without Prefetch, so 0.3fps... It's not really worth it... I'm not gonna be using Prefetch! (Keep in mind that it's a 28c/56th Xeon, so I expected much better from it...)
Have you adjusted SetMemoryMax? A large thread count needs more memory. Low memory kills the caches and the speed. Set it to a huge value, then check the actual memory consumption with AVSMeter.
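
A minimal sketch of where the two calls would sit, assuming AviSynth+. The 32768 figure is an arbitrary large value in MB chosen for illustration, to be tuned down once the real consumption is known.

Code:
#At the very top of the script: raise the frame cache limit (value in MB)
SetMemoryMax(32768)

# ... source and filter chain as before ...

#At the very end of the script
m_clip=Prefetch(m_clip, 28)
Return m_clip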
Old 28th January 2021, 13:53   #12  |  Link
FranceBB
Broadcast Encoder
 
 
Join Date: Nov 2013
Location: Royal Borough of Kensington & Chelsea, UK
Posts: 2,904
With SetMemoryMax(128000), so 128 GB, which is the maximum available RAM on the other server, and Prefetch at 28, memory usage goes all the way up to 21 GB, then drops to 14 GB, then goes back up to 21 GB, then drops to 14 GB again, in a loop.
The speed however is the same: 0.1fps.
With Prefetch 2 the RAM is steady and way lower and the speed is 0.3fps, so about the same as I get without Prefetch.
This is definitely weird...
Old 28th January 2021, 15:53   #13  |  Link
pinterf
Registered User
 
Join Date: Jan 2014
Posts: 2,314
The bottleneck is TemporalMedian in Spotless.
TemporalMedian works internally with histograms, so bit depth heavily affects the speed. Checking only 256 levels is much quicker than doing it with a histogram array size of 65536.

First I modded the plugin to use SSE2 for 16 bit videos.
Presently only 8 bit videos have SSE2 in TemporalMedian; 10+ bit depths use plain C. (Untested, I did not put it in live code.)
It got quicker, but not by much.

Then I tried feeding MedianBlur with only a 10 bit clip. I recommend you try this option.

EDIT:
Explicitly specify threads=1 for dfttest when using Prefetch. Its default value is 0, which means that it uses num_processors internal threads. When the thread count is not 1, this filter has MT_SERIALIZED behaviour instead of MT_MULTI_INSTANCE.
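
Both suggestions could be sketched against the script from the first post as follows. It assumes the same variable names, and whether 10 bit precision is enough for the later steps is a judgment call left to testing.

Code:
#10 bit instead of 16 bit ahead of SpotLess: histogram size 1024 instead of 65536
HBD=ConvertBits(last, bits=10)

# ... ConvertToYUV422, SpotLess and MDegrain2 as before ...

#One internal dfttest thread, so that Prefetch can run several instances in parallel
denoise=dfttest(degrain, sigma=64, tbsize=1, lsb_in=false, lsb=false, Y=true, U=true, V=true, dither=0, threads=1)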

Last edited by pinterf; 28th January 2021 at 17:50. Reason: dfttest
Old 30th January 2021, 14:19   #14  |  Link
FranceBB
Broadcast Encoder
 
 
Join Date: Nov 2013
Location: Royal Borough of Kensington & Chelsea, UK
Posts: 2,904
Quote:
Originally Posted by pinterf
Explicitly specify threads=1 for dfttest when using Prefetch. Its default value is 0, which means that it uses num_processors internal threads. When the thread count is not 1, this filter has MT_SERIALIZED behaviour instead of MT_MULTI_INSTANCE.
Ok, I'll try with threads=1 on dfttest as well, but one question: I just noticed that it can't handle more than 16 threads if I use it normally without Prefetch.

Code:
	if (threads < 0 || threads > 16)
		env->ThrowError("dfttest:  threads must be between 0 and 16 (inclusive)!");
That's lines 1345-1346 of dfttest.cpp. Why is that?
Old 30th January 2021, 14:42   #15  |  Link
pinterf
Registered User
 
Join Date: Jan 2014
Posts: 2,314
I don't know.
Back to Spotless: the way TemporalMedian is used (radius=0, temporal radius=1) is far from optimal in the present plugin. I'm considering optimizing this special case.
You could also try z_ConvertFormat instead of Matrix. It can combine the colorspace conversion, the bit depth conversion and the dithering.
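
A minimal sketch of what that replacement could look like, assuming the avsresize plugin is loaded and "resized" is the 16 bit 4:2:2 clip from the script in the first post. The colorspace_op string (matrix:transfer:primaries:range on each side of =>) is my reading of the avsresize syntax, and "470bg" as the BT.601 matrix for this PAL material is an assumption. Transfer and primaries are kept identical on both sides so that only the matrix coefficients change, mirroring Matrix(from=601, to=709).

Code:
#BT.601 -> BT.709 matrix, 16 bit -> 8 bit, error diffusion dithering, all in one call
#This would stand in for ConvertToDoubleWidth, Matrix, ConvertFromDoubleWidth and the dithering ConvertBits
dithered=z_ConvertFormat(resized, pixel_type="YV16", colorspace_op="470bg:709:709:l=>709:709:709:l", dither_type="error_diffusion")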
Old 30th January 2021, 14:58   #16  |  Link
FranceBB
Broadcast Encoder
 
 
Join Date: Nov 2013
Location: Royal Borough of Kensington & Chelsea, UK
Posts: 2,904
Quote:
Originally Posted by pinterf
I don't know.
Back to Spotless: the way TemporalMedian is used (radius=0, temporal radius=1) is far from optimal in the present plugin. I'm considering optimizing this special case.
You could also try z_ConvertFormat instead of Matrix. It can combine the colorspace conversion, the bit depth conversion and the dithering.
Gotcha.
I'll try to replace it with z_ConvertFormat so that I don't have to go to 16bit interleaved and come back. That should speed things up even further.
Old 30th January 2021, 16:45   #17  |  Link
pinterf
Registered User
 
Join Date: Jan 2014
Posts: 2,314
I've just tried the above-mentioned special use case (radius=0, temporal radius=1) with an optimized TemporalMedian version.
Breaking the script after Spotless:
With the original DLL version the script ran at 0.37fps.
Then I added AVX2 code to TemporalMedian (still the generic approach) and it reached 0.57fps. Good.
But separating out this special case resulted in a huge speed gain: now I'm getting 3.08fps. A significant change.
I'm doing some more checks, then I'll release it in a few days.
Old 30th January 2021, 21:25   #18  |  Link
pinterf
Registered User
 
Join Date: Jan 2014
Posts: 2,314
Please test with this one: MedianBlur2 new version.
https://github.com/pinterf/MedianBlur2/releases/tag/1.1
Code:
- 1.1 (20210130) - pinterf
  - Speed: SSE2 and AVX2 for 10+ bits (generic case, MedianBlur)
  - Speed: SSE2 and AVX2 for TemporalMedianBlur
  - Speed: Much-much quicker: TemporalMedianBlur special case: temporal radius=1 or 2, spatial radius=0 (C, SSE4.1, AVX2)
  - Pass frame properties when Avisynth interface>=8
  - Debug helper parameter 'opt': integer default -1
    <0: autodetect CPU
    0: C only (disable SSE2 and AVX2)
    1: SSE2 (disable SSE4.1 and AVX2)
    2: SSE4 (disable AVX2)
    3: AVX2
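
For a quick A/B check of the new code paths, something like the following could be timed in AVSMeter. It assumes that MedianBlurTemporal is the script function the changelog calls TemporalMedianBlur and that it accepts the opt parameter by name. "c" is a hypothetical input clip.

Code:
#Special case from the changelog: spatial radius 0, temporal radius 1
#opt=3 forces AVX2; rerun with opt=0 (plain C) to compare speeds
test=MedianBlurTemporal(c, radiusy=0, radiusu=0, radiusv=0, temporalradius=1, opt=3)
Return test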
Old 30th January 2021, 21:26   #19  |  Link
FranceBB
Broadcast Encoder
 
 
Join Date: Nov 2013
Location: Royal Borough of Kensington & Chelsea, UK
Posts: 2,904
Quote:
Originally Posted by pinterf
I've just tried the above-mentioned special use case (radius=0, temporal radius=1) with an optimized TemporalMedian version.
Breaking the script after Spotless:
With the original DLL version the script ran at 0.37fps.
Then I added AVX2 code to TemporalMedian (still the generic approach) and it reached 0.57fps. Good.
But separating out this special case resulted in a huge speed gain: now I'm getting 3.08fps. A significant change.
I'm doing some more checks, then I'll release it in a few days.
Wow! 3FPS? That would be a dream!! *_*
It would speed things up a lot, considering that this filterchain is here to stay on our server for the foreseeable future! Thanks!!
I'm really looking forward to trying it and putting it in production!
Old 30th January 2021, 23:15   #20  |  Link
Frank62
Registered User
 
Join Date: Mar 2017
Location: Germany
Posts: 234
Thanks from me, too! Will save a lot of time in the future!