Welcome to Doom9's Forum, THE in-place to be for everyone interested in DVD conversion. Before you start posting please read the forum rules. By posting to this forum you agree to abide by the rules. |
#21 | Link | |
Formally known as .......
Join Date: Sep 2021
Location: On a need to know basis.
Posts: 806
|
__________________
This can be Very "TeDiouS".. Long term RipBot264 user. Ryzen 9 7950X Intel i9-13900KF Ryzen 9 5950X Ryzen 9 5900X Ryzen 9 3950X Link to RB v1.27.0 |
#22 | Link | |
Formally known as .......
Join Date: Sep 2021
Location: On a need to know basis.
Posts: 806
|
Quote:
Well, talk about a fiddle... I ended up using this script (showing only the 1080p one):
Code:
    #Custom
    SetCacheMode(1)
    SetMemoryMax(16384)
    LoadPlugin("%AVISYNTHPLUGINS%\PD_TOOLS\AddGrainC\AddGrainC.dll")
    LoadPlugin("%AVISYNTHPLUGINS%\PD_TOOLS\vsTTempSmooth\vsTTempSmooth.dll")
    ConvertToYV12(video)
    AddGrain(10)
    video=BlankClip(video, width=1920, height=1080, color=color_gray)
    video=vsTTempSmooth(video,maxr=2)
I would just like to add that it would NOT be possible to run this using "standard" RipBot264!
Anyway, here are the screenshots of the results, interesting to say the least. Green borders are the 7950X, red borders are the 13900KF...
[Screenshots of the results]
Now, I'd like to see your test results...
#23 | Link |
Registered User
Join Date: Jul 2018
Posts: 980
|
Your script is not completely correct: you need to add noise to the input of vsTTempSmooth. A clean frame from BlankClip does not load the gathering engine the way real noisy frames do, and it runs mostly in 'broadcast' mode where a single index is gathered. That runs somewhat faster than a real-life noisy load with random-index gathering from the weight table.
Also, using Trim+Loop after adding the noise lets the AVS cache act as an uncompressed frameserver for vsTTempSmooth, without computing AddGrain for each frame. That allows a better test of raw vsTTempSmooth performance.
Code:
    SetCacheMode(1)
    SetMemoryMax(16384)
    LoadPlugin("vsTTempSmooth.dll")
    LoadPlugin("AddGrainC.dll")
    video=BlankClip(width=1920, height=1080, color=color_gray, pixel_type="YV12")
    video=AddGrain(video,10)
    video=Trim(video, 0, 15)
    video=Loop(video, 100000000) # cache the frames from AddGrain so AddGrain time is excluded from the test
    video=vsTTempSmooth(video,maxr=2)
1080p
AVX2 (opt=2) - 1010 fps
AVX512 (default) - 1200 fps
2160p
AVX2 (opt=2) - 255 fps
AVX512 (default) - 300 fps
Enabling hyperthreading on the 8-core Xeon adds only about 5..15% (the difference between Prefetch(16) and Prefetch(8)), depending on the maxr param and others.
From an i5-9600K (AVX2 only, 6 threads):
1080p - 480 fps
2160p - 118 fps
Using AVX512 adds only about 20% over AVX2 in full gathering mode. If only frame-number-based weighting is used, the benefit may be about 50% (no random memory gathering, which is an easier task for the memory subsystem). To enable frame-number-only weighting mode I use this params set:
Code:
    ythresh=10, ymdiff=20, uthresh=10, vthresh=10, umdiff=20, vmdiff=20
Also, I do not understand why such low CPU usage is reported. Did you perhaps not add Prefetch(number of cores or HT threads) to the script?
Last edited by DTL; 6th March 2023 at 08:29.
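For anyone who wants to check the "about 20%" AVX512-over-AVX2 figure against the posted numbers, here is a minimal Python sketch of the arithmetic. It only uses the fps values quoted in this post; the helper name `speedup_percent` is made up for illustration and is not part of AVSMeter or any plugin.

```python
def speedup_percent(baseline_fps, faster_fps):
    """Relative gain of faster_fps over baseline_fps, in percent."""
    return (faster_fps / baseline_fps - 1.0) * 100.0


# (AVX2 fps, AVX512 fps) as quoted in the post for the Xeon test machine
results = {
    "1080p": (1010, 1200),
    "2160p": (255, 300),
}

for resolution, (avx2_fps, avx512_fps) in results.items():
    gain = speedup_percent(avx2_fps, avx512_fps)
    print(f"{resolution}: AVX512 is {gain:.1f}% faster than AVX2")
```

Both resolutions land a little under 20%, which agrees with the rounded figure given in the post.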
#24 | Link | |
Formally known as .......
Join Date: Sep 2021
Location: On a need to know basis.
Posts: 806
|
Quote:
Like I said, RipBot can't use certain scripts the way they are written, AFAIK. I don't know what you use to run your scripts, but I only "know" RipBot. It's not widely used, but it certainly does what I need it to do (well, the PD builds do, that is). I had to fiddle around with it until AVSMeter (within RipBot) actually ran the script. Also, Prefetch is set elsewhere in the script, not here, and it's set to 12, which seems to be the best for 16-core CPUs and Avisynth in RipBot. It's not something I would use, so that's it.
#25 | Link |
Registered User
Join Date: Jul 2018
Posts: 980
|
Well, thank you for the results. Maybe AVSMeter reports incorrect CPU usage when it is not run from the command line. The fps numbers look like many threads are being used, so CPU usage by the script should be about 100%. Somewhat better performance of the AMD 7xxx over the Intel 13 series is still visible.
#26 | Link | |
Formally known as .......
Join Date: Sep 2021
Location: On a need to know basis.
Posts: 806
|
Quote:
Most of the more experienced guys on Doom9 know how to test & benchmark all these complex scripts & commands, etc. All I want to learn is how to run filters & scripts that I can actually use. I agree that the 7950X does have a slight advantage when actually encoding videos, but they are so close, and for anyone who wants a good encoding system (without going overboard), an Intel i9-13900 can be a cheaper option than the 7950X.
Last edited by TDS; 6th March 2023 at 12:19.
#27 | Link |
Soul Architect
Join Date: Apr 2014
Posts: 2,559
|
Just something to keep in mind: Avisynth is not always super effective at using many cores. For heavy scripts and HD content, you can get 16GB saturated with just 8 instances. Running 32 concurrent threads... can be quite a lot. It depends on whether your filters need to be recreated for each thread, the type of video you process, and how much RAM you have available.
#28 | Link | |
Broadcast Encoder
Join Date: Nov 2013
Location: Royal Borough of Kensington & Chelsea, UK
Posts: 2,664
|
Quote:
Test 1:
Type: Personal PC (Desktop)
CPU: i7 5930K 6c/12th AVX2 3.50GHz
RAM: 32GB DDR4 (4x8 GB)
GPU: NVIDIA GTX 980Ti
OS: Windows 10 Enterprise x64
AVS Script:
Code:
    SetCacheMode(1)
    SetMemoryMax(32000)
    BlankClip(15, 1920, 1080, color=color_gray)
    ConvertToYV12()
    AddGrain(10)
    Loop(10000000)
    vsTTempSmooth(maxr=2)
    Prefetch(12)
Code:
    FPS (min | max | average): 97.29 | 211864 | 514.6
    Process memory usage (max): 515 MiB
    Thread count: 25
    CPU usage (average): 93.2%

Test 2:
Type: Server
CPU: Intel Xeon Gold 6238R x2 56c/112th AVX512 2.20GHz
RAM: 128GB DDR4 (4x32 GB)
GPU: NVIDIA Quadro M5000
OS: Windows Server 2019 x64
AVS Script:
Code:
    SetCacheMode(1)
    SetMemoryMax(128000)
    BlankClip(15, 1920, 1080, color=color_gray)
    ConvertToYV12()
    AddGrain(10)
    Loop(10000000)
    vsTTempSmooth(maxr=2)
    Prefetch(112)
Code:
    FPS (min | max | average): 140.1 | 6557 | 2948
    Process memory usage (max): 2853 MiB
    Thread count: 169
    CPU usage (average): 82.5%
This means that having so many threads in Prefetch probably harmed performance, so the test was repeated with a lower thread count, pretending to be a single-socket configuration (i.e. Prefetch(56)):
Code:
    FPS (min | max | average): 174.8 | 22862 | 3174
    Process memory usage (max): 1652 MiB
    Thread count: 113
    CPU usage (average): 90.8%

Test 3:
Type: Workstation
CPU: Intel Xeon E5-2640 v4 x2 20c/40th AVX2 2.40 GHz
RAM: 64GB DDR3 (8x8 GB)
GPU: NVIDIA Quadro P4000
OS: Windows 10 Enterprise x64
Same as before: being a dual-socket configuration, Prefetch had to be set to 20 instead of 40, as Avisynth isn't NUMA-aware.
AVS Script:
Code:
    SetCacheMode(1)
    SetMemoryMax(64000)
    BlankClip(15, 1920, 1080, color=color_gray)
    ConvertToYV12()
    AddGrain(10)
    Loop(10000000)
    vsTTempSmooth(maxr=2)
    Prefetch(20)
Code:
    FPS (min | max | average): 132.1 | 119904 | 974.1
    Process memory usage (max): 758 MiB
    Thread count: 61
    CPU usage (average): 49.1%
This last one is gonna be a very interesting test 'cause I happen to have an old server with the very same CPU running in a single-socket configuration (although with less RAM).

Test 4:
Type: Server
CPU: Intel Xeon E5-2640 v4 10c/20th AVX2 2.40 GHz
RAM: 32GB DDR3 (4x8 GB)
GPU: NVIDIA Quadro P4000
OS: Windows Server 2016 x64
AVS Script:
Code:
    SetCacheMode(1)
    SetMemoryMax(32000)
    BlankClip(15, 1920, 1080, color=color_gray)
    ConvertToYV12()
    AddGrain(10)
    Loop(10000000)
    vsTTempSmooth(maxr=2)
    Prefetch(20)
Code:
    FPS (min | max | average): 74.33 | 105215 | 799.6
    Process memory usage (max): 765 MiB
    Thread count: 41
    CPU usage (average): 92.6%

So, in a nutshell:
514.6fps - (i7 5930K 6c/12th AVX2 3.50GHz)
799.6fps - (Intel Xeon E5-2640 v4 10c/20th AVX2 2.40 GHz)**
974.1fps - (Intel Xeon E5-2640 v4 x2 20c/40th AVX2 2.40 GHz)**
3174fps - (Intel Xeon Gold 6238R x2 56c/112th AVX512 2.20GHz)
**the speed bump above comes from the fact that the dual socket server has twice the RAM of the single socket one, NOT from the fact that it's dual socket, as Avisynth isn't NUMA-aware and will only use 1 socket**

Note: AVSMeter is "confused" by dual-socket configs in both server and non-server versions of Windows. On Server editions of Windows it reports the % of a single socket, so if one socket is at 100% and the other is at 1% it will say 100% when it should be 51%. On non-server versions of Windows like Windows 10 Enterprise it instead *sees* the % across both sockets combined (cpu 0 and cpu 1), so if one socket is at 100% and the other is at 1% it says 51%.
Last edited by FranceBB; 10th March 2023 at 14:48.
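The two AVSMeter readings described in the note can be sketched in a few lines of Python. This is only an illustration of the arithmetic under the assumption that the per-socket loads are known; the function names are hypothetical and this is not how AVSMeter itself is implemented.

```python
def single_socket_usage(socket_loads, socket_index=0):
    """Load of one socket only (the behaviour described for Windows Server)."""
    return socket_loads[socket_index]


def whole_machine_usage(socket_loads):
    """Mean load over all sockets (the behaviour described for Windows 10)."""
    return sum(socket_loads) / len(socket_loads)


# Example from the note: one socket fully busy, the other almost idle
loads = [100.0, 1.0]

print(single_socket_usage(loads))   # 100.0
print(whole_machine_usage(loads))   # 50.5, which the note rounds to 51
```

Read this way, the 49.1% reported for the dual-socket Windows 10 workstation in Test 3 would be consistent with one socket being close to fully loaded while the other sits nearly idle.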
#29 | Link |
Registered User
Join Date: Jul 2018
Posts: 980
|
"3174fps - (Intel Xeon Gold 6238R x2 56c/112th "
It looks like this chip is heavily limited by very low RAM performance relative to its number of cores. The math in vsTTempSmooth is very simple, and most of the time it seems to only move data from the global process working set in system RAM to the CPU caches. For such massive multicore chips it is better to have a task with many small work units, about the AVX512 register file size of 2 kBytes or maybe the L1D cache size of about 32 kBytes or a bit more, and long computation inside this dataset. So the performance of massive multicore chips is limited by the number of RAM channels and the DDR generation, which is why the Xeon Gold 6238R with 6-channel DDR4 wins easily enough. Unfortunately, Windows Task Manager shows a core as busy/loaded even if it is really stalled, doing nothing and waiting for data from the memory subsystem.
Though I thought a 56-core Xeon should have more than 6 memory channels, like 8. Or is Gold too cheap, and only Platinum has 8 or more memory channels? For example the Xeon Platinum 9221 - https://en.wikichip.org/wiki/intel/xeon_platinum/9221 - supports up to twelve channels of DDR4-2933 memory. For 32 cores only.
The really fast data compute accelerators have already moved away from this very old and slow DDR RAM and use HBM RAM. A typical NVIDIA DCA like the Tesla A100 has about 2 TB/s of RAM speed on the 80 GB card, while 12-channel DDR4 on an expensive Xeon gives only about 250 GB/s (0.25 TB/s). So poor, even for a sub-$10000 price. Intel CPUs are still about doing a lot of computing on relatively small data work units. HBM RAM is expected in 'general purpose' CPU setups in the 202x years. So it may be too early to upgrade to one more expensive Xeon with still very slow RAM.
For a real, more compute-heavy load on the cores, some 'gold pack' of typical AVS linear processing operations may be used as an 'AVS performance test script':
1. denoise with mvtools,
2. spatial correction with GeneralConvolution and a not very small 5x5 matrix,
3. downsize (rip for release creation).
Then it will not be only a test of the speed of the memory subsystem. Something like:
Code:
    LoadPlugin("mvtools2.dll")
    LoadPlugin("AddGrainC.dll")
    BlankClip(30, 1920, 1080, color=color_gray) # a big number of frames in the AVS source cache may benefit CPUs with large L2/L3 caches ?
    #BlankClip(30, 3840, 2160, color=color_gray)
    ConvertToYV12()
    ConvertBits(16) #? or in 8bit ? not very nice for 202x with HDR and 10bit enduser h.265
    AddGrain(30) # or 50 or more ?
    Loop(10000000)
    # source preparation ends here
    tr=7
    super=MSuper(pel=2) # medium quality
    multi_vec=MAnalyse(super, multi=true, delta=tr, search=3, overlap=4, trymany=true) # exhaustive search, full 4x overlap, all predictors refining, lots of hard-to-SIMD algorithmic load
    MDegrainN(super, multi_vec, tr, thSAD=500)
    GeneralConvolution(0, " -1 -2 -2 -2 -1 -2 -2 -3 -2 -2 -2 -3 80 -3 -2 -2 -2 -3 -2 -2 -1 -2 -2 -2 -1", auto=true, luma=true, chroma=false)
    LanczosResize(1280, 720, taps=2) # need a resize with support=2 or more to have equal load to SinPow/UD2 resizers
    ConvertBits(10, dither=1)
Last edited by DTL; 11th March 2023 at 22:19.
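On the memory bandwidth comparison in this post, a rough Python sketch of the theoretical peak numbers may help. It assumes the usual 64-bit (8-byte) DDR4 channel and DDR4-2933 speed; the helper name and the 6-channel/12-channel cases are illustrative, and real sustained bandwidth is lower than these peaks.

```python
def peak_bandwidth_gb_s(channels, mega_transfers_per_s, bus_bytes=8):
    """Theoretical peak DRAM bandwidth in GB/s (1 GB = 1e9 bytes).

    Assumes each channel moves bus_bytes per transfer (8 bytes for a
    standard 64-bit DDR4 channel).
    """
    return channels * mega_transfers_per_s * 1e6 * bus_bytes / 1e9


# Assumed configurations: 6 and 12 channels of DDR4-2933
six_channel = peak_bandwidth_gb_s(6, 2933)
twelve_channel = peak_bandwidth_gb_s(12, 2933)

# HBM figure quoted in the post (about 2 TB/s)
hbm_gb_s = 2000.0

print(f"6 x DDR4-2933:  {six_channel:.1f} GB/s")
print(f"12 x DDR4-2933: {twelve_channel:.1f} GB/s")
print(f"HBM vs 12-channel DDR4: {hbm_gb_s / twelve_channel:.1f}x")
```

The 12-channel peak comes out somewhat above the "about 250 GB/s" round figure in the post, which is plausible for a sustained rather than theoretical number, and either way it is several times below the HBM figure.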
Tags |
1950x, 7950x, amd, ryzen, threadripper |