Doom9's Forum > Capturing and Editing Video > Avisynth Usage

5th March 2023, 11:41   #21
TDS
Formally known as .......
Join Date: Sep 2021
Location: Down Under.
Posts: 1,029
Quote:
Originally Posted by DTL
Though the i9-7980XE is about the same as the currently used TR 1950: both have 4 channels of DDR4 and a comparable number of cores. So the i9-7980XE may be a bit faster in some applications (where AVX-512 is used reasonably well), see https://static.techspot.com/articles...1HandBrake.png, and slower in others.

The real way to halve processing time is simply to add a second TR 1950 system and split the task into 2 parts. As time goes on, the TR 1950 will get cheaper and cheaper, because home users typically like to have one somewhat faster system and sell the old one second-hand for a low price after upgrading.
That's the one massive advantage of using RipBot: with Distributed Encoding you can have up to 16 other PCs encoding the one job. Depending on how much filtering you need to do, I can get a full 4K movie encoded in less time than it would take to watch the movie, and that's using all 4 Ryzens, the 13900KF, and sometimes a dual E5-2697 v2 server/workstation.
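A minimal AviSynth sketch of the "split the task into 2 parts" idea, assuming both machines can open the same source file. This is not how RipBot264's Distributed Encoding works internally; the source filter (LWLibavVideoSource from L-SMASH-Works), the file name and the `part` switch are illustrative assumptions. Each machine renders and encodes its own half, and the two encodes are joined afterwards.

Code:
part = 0 # hypothetical switch: 0 on the first machine, 1 on the second
src = LWLibavVideoSource("movie.mkv") # hypothetical path; any source filter will do
half = src.FrameCount / 2 # integer division; for an odd count the second part gets the extra frame
first = src.Trim(0, -half) # a negative end value means "this many frames", i.e. frames 0..half-1
second = src.Trim(half, 0) # an end value of 0 means "up to the last frame"
(part == 0) ? first : second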
__________________
Long term RipBot264 user.

RipBot264 modded builds..
*new* x264 & x265 addon packs..
6th March 2023, 02:47   #22
TDS
Quote:
Originally Posted by TDS
OK, DTL, I have given up for today, so I will have a fiddle with it tomorrow.
Cheers.
Hi DTL,

Well, talk about fiddle...

I ended up using this script (showing only the 1080p one):

Code:
#Custom
SetCacheMode(1)
SetMemoryMax(16384)
LoadPlugin("%AVISYNTHPLUGINS%\PD_TOOLS\AddGrainC\AddGrainC.dll")
LoadPlugin("%AVISYNTHPLUGINS%\PD_TOOLS\vsTTempSmooth\vsTTempSmooth.dll")
ConvertToYV12(video)
AddGrain(10)
video=BlankClip(video, width=1920, height=1080, color=color_gray)
video=vsTTempSmooth(video,maxr=2)
I updated both AddGrainC & vsTTempSmooth, and I also had to change the test files from 10-bit to 8-bit.

I would just like to add that it would NOT be possible to run this using "standard" RipBot264!!!!

Anyway, here are the screenshots of the results, interesting to say the least.

Green borders are the 7950X, red are the 13900KF...



Now, I'd like to see your test results....
6th March 2023, 08:08   #23
DTL
Registered User
 
Join Date: Jul 2018
Posts: 1,132
Your script is not completely correct: you need to add the noise to the input of vsTTempSmooth. A clean frame from BlankClip does not load the gathering engine the way real noisy frames do, so it runs mostly in a 'broadcast' mode where a single index is gathered. That makes it run somewhat faster than a real-life noisy load, where random indexes are gathered for the weight table.
Also, using Trim+Loop after adding the noise lets the AVS cache act as an uncompressed frameserver for vsTTempSmooth, without recalculating AddGrain for every frame, so it gives a better test of raw vsTTempSmooth performance.

Code:
SetCacheMode(1)
SetMemoryMax(16384)
LoadPlugin("vsTTempSmooth.dll")
LoadPlugin("AddGrainC.dll")

video=BlankClip(width=1920, height=1080, color=color_gray, pixel_type="YV12")
video=AddGrain(video,10)
video=Trim(video, 0, 15)
video=Loop(video, 100000000) # serve the cached grained frames repeatedly so AddGrain's processing time is excluded from the test
video=vsTTempSmooth(video,maxr=2)
My current results with maxr=2 on a Xeon Gold 6134 (Prefetch with 16 threads, i.e. full hyperthreading):
1080p
AVX2 (opt=2) - 1010 fps
AVX512 (default) - 1200 fps

2160p
AVX2 (opt=2) - 255 fps
AVX512 (default) - 300 fps

On the 8-core Xeon, enabling hyperthreading adds only about 5 to 15% (the difference between Prefetch(16) and Prefetch(8)), depending on maxr and other parameters.
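For anyone repeating the hyperthreading and AVX2/AVX-512 comparison, a sketch of the test script above with the thread count pulled out into a variable. The `threads` variable is just an illustrative name; opt=2 selecting AVX2 is taken from the results above, and no other opt values are assumed here.

Code:
SetCacheMode(1)
SetMemoryMax(16384)
LoadPlugin("vsTTempSmooth.dll")
LoadPlugin("AddGrainC.dll")

threads = 16 # 16 = all HT threads on the 8-core Xeon; rerun with 8 (one per core) and compare avg fps
video = BlankClip(width=1920, height=1080, color=color_gray, pixel_type="YV12")
video = AddGrain(video, 10)
video = Loop(Trim(video, 0, 15), 100000000) # serve 16 cached noisy frames so AddGrain is not measured
video = vsTTempSmooth(video, maxr=2) # default path (AVX-512 on the Xeon above); add opt=2 to force AVX2
Prefetch(video, threads)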

On an i5-9600K (AVX2 only, 6 threads):
1080p - 480 fps
2160p - 118 fps

AVX-512 adds only about 20% over AVX2 in full gathering mode. If only frame-number-based weighting is used, the benefit may be about 50% (no random memory gathering, so an easier task for the memory subsystem).
To enable the frame-number-only weighting mode I use this parameter set
Code:
ythresh=10, ymdiff=20, uthresh=10, vthresh=10, umdiff=20, vmdiff=20
for vsTTempSmooth().
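Put together, a frame-number-only weighting test could look like the following sketch, reusing the cached-noise source from the script above. The Prefetch value of 16 matches the Xeon runs and is only an example.

Code:
LoadPlugin("vsTTempSmooth.dll")
LoadPlugin("AddGrainC.dll")

video = BlankClip(width=1920, height=1080, color=color_gray, pixel_type="YV12")
video = Loop(Trim(AddGrain(video, 10), 0, 15), 100000000) # cached noisy frames, AddGrain cost excluded
# mdiff values above thresh: frame-number-only weighting, per the parameter set above
video = vsTTempSmooth(video, maxr=2, ythresh=10, uthresh=10, vthresh=10, ymdiff=20, umdiff=20, vmdiff=20)
Prefetch(video, 16)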

Also, I don't understand why such low CPU usage is reported. Maybe you didn't add Prefetch(number of cores or HT threads) to the script?

Last edited by DTL; 6th March 2023 at 08:29.
6th March 2023, 10:33   #24
TDS
Quote:
Originally Posted by DTL
Your script is not completely correct....

Also, I don't understand why such low CPU usage is reported. Maybe you didn't add Prefetch(number of cores or HT threads) to the script?
Geez, I dunno, you do a guy a favour, and he picks it to pieces.

Like I said, RipBot can't use certain scripts, at least not the way they are written, AFAIK.

I don't know what you use to run your scripts, but I only "know" RipBot. It's not widely used, but it certainly does what I need it to do (well, the PD builds do, that is).

I had to fiddle around with it until AVSMeter (within RipBot) actually ran the script.

Also, Prefetch is set elsewhere in the script, not here, and it's set to 12, which seems to be the best for 16-core CPUs running Avisynth in RipBot.

It's not something I would use, so that's it.
6th March 2023, 10:37   #25
DTL
Well, thank you for the results. Maybe AVSMeter reports incorrect CPU usage when it is not run from the command line. The fps numbers look like many threads are being used, so CPU usage by the script should be about 100%. The AMD 7xxx still shows somewhat better performance than Intel's 13th-gen series.
6th March 2023, 11:41   #26
TDS
Quote:
Originally Posted by DTL
Well, thank you for the results. Maybe AVSMeter reports incorrect CPU usage when it is not run from the command line. The fps numbers look like many threads are being used, so CPU usage by the script should be about 100%. The AMD 7xxx still shows somewhat better performance than Intel's 13th-gen series.
You're welcome, sorry it didn't quite go to plan.

Most of the more experienced guys on Doom9 know how to test & benchmark all these complex scripts & commands, etc., etc.

All I want to learn is how to run filters & scripts that I can actually use.

I agree that the 7950X does have a slight advantage when actually encoding videos, but they are so close that, for anyone who wants a good encoding system (without going overboard), an Intel i9-13900 can be a cheaper option than the 7950X.

Last edited by TDS; 6th March 2023 at 12:19.
7th March 2023, 06:11   #27
MysteryX
Soul Architect
Join Date: Apr 2014
Posts: 2,559
Just something to keep in mind: Avisynth is not always very effective at using many cores. For heavy scripts and HD content, you can saturate 16GB with just 8 instances. Running 32 concurrent threads... can be quite a lot. It depends on whether your filters need to be recreated for each thread, the type of video you process, and how much RAM you have available.
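A sketch of the two knobs that usually matter here in AviSynth+, with illustrative values only: a cap on the frame cache and a modest thread count. SetFilterMTMode with mode 2 (MT_MULTI_INSTANCE) is shown because that is the case where each thread gets its own copy of a filter, which is what makes memory grow with the thread count; the 4K BlankClip just stands in for a real source.

Code:
SetMemoryMax(8192) # cap the frame cache (value in MB)
SetFilterMTMode("DEFAULT_MT_MODE", 2) # 2 = MT_MULTI_INSTANCE: filters without an explicit mode get one instance per thread
BlankClip(length=1000, width=3840, height=2160, pixel_type="YV12")
# heavy filtering would go here; per-instance buffers are multiplied by the thread count
Prefetch(8) # start low and raise only while RAM usage stays acceptable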
10th March 2023, 14:41   #28
FranceBB
Broadcast Encoder
Join Date: Nov 2013
Location: Royal Borough of Kensington & Chelsea, UK
Posts: 2,935
Quote:
Originally Posted by DTL
run the script with AVSMeter (x64 build) for about 10 seconds to get a stable 'avg' fps (the last number in the fps line).
Test 1:
Type: Personal PC (Desktop)
CPU: i7 5930K 6c/12th AVX2 3.50GHz
RAM: 32GB DDR4 (4x8 GB)
GPU: NVIDIA GTX 980Ti
OS: Windows 10 Enterprise x64

AVS Script:

Code:
SetCacheMode(1)
SetMemoryMax(32000)
BlankClip(15, 1920, 1080, color=color_gray)
ConvertToYV12()
AddGrain(10)
Loop(10000000)
vsTTempSmooth(maxr=2)
Prefetch(12)

Code:
FPS (min | max | average):          97.29 | 211864 | 514.6
Process memory usage (max):         515 MiB
Thread count:                       25
CPU usage (average):                93.2%

Test 2:
Type: Server
CPU: Intel Xeon Gold 6238R x2 56c/112th AVX512 2.20GHz
RAM: 128GB DDR4 (4x32 GB)
GPU: NVIDIA Quadro M5000
OS: Windows Server 2019 x64


AVS Script:

Code:
SetCacheMode(1)
SetMemoryMax(128000)
BlankClip(15, 1920, 1080, color=color_gray)
ConvertToYV12()
AddGrain(10)
Loop(10000000)
vsTTempSmooth(maxr=2)
Prefetch(112)

Code:
FPS (min | max | average):          140.1 | 6557 | 2948
Process memory usage (max):         2853 MiB
Thread count:                       169
CPU usage (average):                82.5%
Unfortunately, Avisynth isn't NUMA-node aware, so it was using only one CPU while the other sat there doing absolutely nothing.




This means that having so many threads in Prefetch probably hurt performance, so the test was repeated with a lower thread count, as if it were a single-socket configuration (i.e. Prefetch(56)):

Code:
FPS (min | max | average):          174.8 | 22862 | 3174
Process memory usage (max):         1652 MiB
Thread count:                       113
CPU usage (average):                90.8%
And indeed, it helped and achieved higher speeds.
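For reference, the rerun presumably used the Test 2 script with only the last line changed; a sketch assuming nothing else was altered:

Code:
SetCacheMode(1)
SetMemoryMax(128000)
BlankClip(15, 1920, 1080, color=color_gray)
ConvertToYV12()
AddGrain(10)
Loop(10000000)
vsTTempSmooth(maxr=2)
Prefetch(56) # one socket's worth of threads (28 cores / 56 threads per 6238R) instead of 112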


Test 3:
Type: Workstation
CPU: Intel Xeon E5-2640 v4 x2 20c/40th AVX2 2.40 GHz
RAM: 64GB DDR3 (8x8 GB)
GPU: NVIDIA Quadro P4000
OS: Windows 10 Enterprise x64


Same as before: since this is a dual-socket configuration, Prefetch had to be set to 20 instead of 40, as Avisynth isn't NUMA-node aware:

AVS Script:

Code:
SetCacheMode(1)
SetMemoryMax(64000)
BlankClip(15, 1920, 1080, color=color_gray)
ConvertToYV12()
AddGrain(10)
Loop(10000000)
vsTTempSmooth(maxr=2)
Prefetch(20)

Code:
FPS (min | max | average):          132.1 | 119904 | 974.1
Process memory usage (max):         758 MiB
Thread count:                       61
CPU usage (average):                49.1%

This last one is gonna be a very interesting test, 'cause I happen to have an old server with the very same CPU running in a single-socket configuration (although with less RAM):


Test 4:
Type: Server
CPU: Intel Xeon E5-2640 v4 10c/20th AVX2 2.40 GHz
RAM: 32GB DDR3 (4x8 GB)
GPU: NVIDIA Quadro P4000
OS: Windows Server 2016 x64

AVS Script:

Code:
SetCacheMode(1)
SetMemoryMax(32000)
BlankClip(15, 1920, 1080, color=color_gray)
ConvertToYV12()
AddGrain(10)
Loop(10000000)
vsTTempSmooth(maxr=2)
Prefetch(20)

Code:
FPS (min | max | average):          74.33 | 105215 | 799.6
Process memory usage (max):         765 MiB
Thread count:                       41
CPU usage (average):                92.6%



So, in a nutshell:

514.6fps - (i7 5930K 6c/12th AVX2 3.50GHz)
799.6fps - (Intel Xeon E5-2640 v4 10c/20th AVX2 2.40 GHz)**
974.1fps - (Intel Xeon E5-2640 v4 x2 20c/40th AVX2 2.40 GHz)**
3174fps - (Intel Xeon Gold 6238R x2 56c/112th AVX512 2.20GHz)

**The speed bump above comes from the fact that the dual-socket server has twice the RAM of the single-socket one, NOT from the fact that it's dual socket, as Avisynth isn't NUMA-node aware and will only use 1 socket.**

Note: AVSMeter is "confused" by dual-socket configs on both server and non-server versions of Windows. On Server editions of Windows, it reports the percentage of a single socket, so if one socket is at 100% and the other is at 1%, it will say 100% when it should say 51%. On non-server versions of Windows like Windows 10 Enterprise, instead, it *sees* the percentage as the combination of the individual sockets (CPU 0 and CPU 1), so if one socket is at 100% and the other is at 1%, it says 51%.
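The arithmetic behind the two readings, assuming one socket at 100% and the other at 1% as in the example (U is the utilization figure):

Code:
\text{Server editions: } U_{\text{reported}} = U_{\text{busy socket}} = 100\%
\text{Windows 10: } U_{\text{reported}} = \frac{U_{\text{busy socket}} + U_{\text{idle socket}}}{2} = \frac{100\% + 1\%}{2} = 50.5\% \approx 51\%

By the same reading, the 49.1% reported on Windows 10 in Test 3 would correspond to roughly 98% on one socket if the other were completely idle.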

Last edited by FranceBB; 10th March 2023 at 14:48.
10th March 2023, 16:36   #29
DTL
"3174fps - (Intel Xeon Gold 6238R x2 56c/112th "

It looks like this chip is heavily limited by RAM performance that is very low relative to its number of cores. The math in vsTTempSmooth is very simple, and most of the time it seems to be just moving data from the process working set in system RAM into the CPU caches. Massively multicore chips are better suited to tasks with many small work units, around the size of the AVX-512 register file (2 KB) or perhaps the L1D cache (about 32 KB or a bit more), with long computation inside each dataset. So the performance of massively multicore chips is limited by the number of RAM channels and the DDR generation, and the Xeon Gold 6238R, with 6 channels of DDR4, easily wins here. Unfortunately, Windows Task Manager shows a core as busy/loaded even when it is really stalled, doing nothing while it waits for data from the memory subsystem.
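The sizes mentioned above, worked out (AVX-512 has 32 zmm registers of 64 bytes each; the 32 KiB L1D figure is the one quoted above; YV12 stores 1.5 bytes per pixel):

Code:
\text{AVX-512 register file: } 32 \times 64\ \text{B} = 2048\ \text{B} = 2\ \text{KiB}
\text{one 1080p YV12 frame: } 1920 \times 1080 \times 1.5\ \text{B} = 3\,110\,400\ \text{B} \approx 95 \times 32\ \text{KiB}

So a single source frame is almost 100 times larger than the L1D cache, and with maxr=2 each output frame draws on a 5-frame window (2 before, the current one, 2 after).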
Though I thought a 56-core Xeon would have more than 6 memory channels, like 8. Or is Gold too cheap, and only Platinum has 8 or more memory channels? Like the Xeon Platinum 9221 (https://en.wikichip.org/wiki/intel/xeon_platinum/9221), which supports up to twelve channels of DDR4-2933 memory, for only 32 cores.

The really fast data compute accelerators have already moved away from this very old and slow DDR RAM and use HBM instead. A typical NVIDIA data compute accelerator like the A100 has about 2 TB/s of memory bandwidth on the 80 GB card, while 12-channel DDR4 on an expensive Xeon gives only about 250 GB/s (0.25 TB/s). That is poor even for a sub-$10,000 price. Intel CPUs are still about doing lots of computation on relatively small data work units. HBM is expected in 'general purpose' CPU setups sometime in the 2020s. So it may be too early to upgrade to yet another expensive Xeon with still very slow RAM.
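Peak DRAM bandwidth is roughly channels × 8 bytes per transfer × transfer rate. A worked version, assuming DDR4-2933 (the rated speed for the Xeon Gold 6238R and the Platinum 9221); these are theoretical peaks, and sustained figures are lower:

Code:
\text{Xeon Gold 6238R, one socket: } 6 \times 8\ \text{B} \times 2933\ \text{MT/s} \approx 140.8\ \text{GB/s}
\text{12 channels (e.g. Platinum 9221): } 12 \times 8\ \text{B} \times 2933\ \text{MT/s} \approx 281.6\ \text{GB/s}
\text{A100 80 GB vs. 12-channel DDR4: } 2\ \text{TB/s} \,/\, 0.28\ \text{TB/s} \approx 7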

For a more realistic compute load on the cores, a 'gold pack' of typical linear AVS processing operations could be used as an 'AVS performance test script':
1. denoising with MVTools,
2. spatial correction with GeneralConvolution using a not-so-small 5x5 matrix,
3. downsizing (as when ripping for a release).
That way it won't only be a test of memory subsystem speed.
Something like:
Code:
LoadPlugin("mvtools2.dll")
LoadPlugin("AddGrainC.dll")

BlankClip(30, 1920, 1080, color=color_gray) # a larger number of source frames in the AVS cache may favour CPUs with large L2/L3 caches?
#BlankClip(30, 3840, 2160, color=color_gray)

ConvertToYV12()
ConvertBits(16) # or in 8-bit? 8-bit is not very relevant in the 2020s, with HDR and 10-bit end-user H.265
AddGrain(30) # or 50 or more ?
Loop(10000000)
# source preparation ends here

tr=7
super=MSuper(pel=2) #medium quality
multi_vec=MAnalyse(super, multi=true, delta=tr, search=3, overlap=4, trymany=true) #exhaustive search, full 4x overlap, all predictors refining, lots of hard-to-SIMD algorithmic load
MDegrainN(super, multi_vec, tr, thSAD=500)

GeneralConvolution(0, "
-1 -2 -2 -2 -1
-2 -2 -3 -2 -2
-2 -3 80 -3 -2
-2 -2 -3 -2 -2
-1 -2 -2 -2 -1", auto=true, luma=true, chroma=false)

LanczosResize(1280, 720, taps=2) # needs a resizer with support of 2 or more to match the load of SinPow/UD2 resizers

ConvertBits(10, dither=1)

Last edited by DTL; 11th March 2023 at 22:19.

Tags
1950x, 7950x, amd, ryzen, threadripper

All times are GMT +1.