JPSDR Avisynth's plugins pack - Page 3

Atak_Snajpera · 23rd January 2018, 14:57

This multithreaded resizing plugin is really good! Much better than the plain Prefetch option in AviSynth+ MT.
CPU: E5-2690@2.9GHz (8C/16T)
Source: 3840x2160 YUV420P10

Crop(0,280,0,-280) + Spline36Resize(1920,800) + Prefetch(8)

Crop(0,280,0,-280) + Spline36Resize(1920,800) + Prefetch(16)

Crop(0,280,0,-280) + Spline36ResizeMT(1920,800)

The fastest, with lower memory consumption and CPU usage!

For comparison, the regular resizer:
Crop(0,280,0,-280) + Spline36Resize(1920,800)

jpsdr · 31st March 2018, 10:14

New version, see the first post. I've also added a part there about the multi-threading.

Atak_Snajpera · 1st April 2018, 17:12

SetAffinity=true in the latest version works terribly, even without Prefetch in the script. Now it is slower than the regular single-threaded resizer!

SetAffinity=false

BTW, I see that the newer version is faster than the old one (55 fps vs 52 fps).

jpsdr · 1st April 2018, 17:19

What's the full script?

Atak_Snajpera · 1st April 2018, 17:24

Code:

#MT

#VideoSource
LoadPlugin("C:\Users\Dave\Documents\Delphi_Projects\RipBot264\_Compiled\Tools\AviSynth plugins\ffms\ffms_latest\x64\ffms2.dll")
video=FFVideoSource("E:\_Video_Samples\mkv\Passengers_2016_4K.mkv",cachefile = "C:\Temp\RipBot264temp\job1\Passengers_2016_4K.mkv.ffindex")

#Deinterlace

#Decimate

#Crop
video=Crop(video,0,280,-0,-280)

#Resize
LoadPlugin("C:\Users\Dave\Documents\Delphi_Projects\RipBot264\_Compiled\Tools\AviSynth plugins\Plugins_JPSDR\Plugins_JPSDR.dll")
video=Spline36ResizeMT(video,1920,800,SetAffinity=true).Sharpen(0.2)

#Levels

#Colours

#Denoise

#Custom

#Prefetch

#Subtitles

#AudioSource
Import("C:\Temp\RipBot264temp\job1\job1_a1.avs")

#Triming

#AVSameLength

#ColorSpace

#Return

jpsdr · 1st April 2018, 18:02

What CPU do you have, or more exactly, how many logical cores do you have? I just want to understand the 49 threads, though that is totally expected if you have something like a CPU with 20 logical cores.

Atak_Snajpera · 1st April 2018, 18:19

https://forum.doom9.org/showthread.p...60#post1831560

jpsdr · 2nd April 2018, 10:26

Ok, there is something odd indeed, thanks for reporting. All the other filters seem to behave properly, but the resampler runs on only one core with SetAffinity set to true, which is totally unexpected. I can't investigate right now, but I will very shortly.
There is a bug somewhere...

aWarpSharp2 and nnedi3 give me 41 threads on my 20-core CPU, in both cases (true/false).
ResampleMT gives me 41 threads with true, 174 with false!!!
Yes, there is something really wrong...

jpsdr · 3rd April 2018, 11:43

Even more fishy!! I'm on my lunch break and ran some tests on my PC at work, and everything works fine, but I don't have the same CPU there as I have at home (it's a simple 4-core without HT). The only thing I can't check for now is Intel vs VS. Did you run your tests with the VS or the Intel version? If Intel, can you run a test with the VS version? I'll also try this when I'm back home, but that will not be for several hours.

Edit:
Sometimes I'm very stupid, of course I can test, I just have to download them from my github...

Results:
The VS AVX and Intel AVX2 versions work fine with standard AviSynth.

The VS AVX version works fine with avs+ (both x86 & x64).
The Intel AVX2 version is working... "fishy" with avs+ (both x86 & x64), but only for the resampler; the other filters work fine.

I'll update the release files on github, removing the Intel versions, keeping only the VS versions, and adding a VS AVX2 version. Wait at least 24h before checking/re-downloading the files.

jpsdr · 3rd April 2018, 13:16

... Before removing them completely, I'll check whether the issue is still there with /O2 instead of /O3 with the Intel compiler.

Groucho2004 · 3rd April 2018, 16:12

Quote:

Originally Posted by jpsdr

Ok, there is something odd indeed, thanks for reporting. All the other filters seem to behave properly, but the resampler runs on only one core with SetAffinity set to true, which is totally unexpected.

I have a rather basic question - What makes you think that messing with Windows' thread scheduler by manipulating thread affinity improves the speed? What if another program does the same? Have you measured the speed in different scenarios (various Windows versions, CPUs with Hyperthreading, software that messes with thread priority)?

jpsdr · 3rd April 2018, 16:47

Quote:

Originally Posted by Groucho2004

What makes you think that messing with Windows' thread scheduler by manipulating thread affinity improves the speed?

The image is split horizontally, so, for cache access, it may be better if contiguous zones are on the same physical core, no more, no less. If you don't have HT, this is less significant. That's what I think, and, yes, it's purely theoretical thinking; I didn't spend time running all kinds of tests. (I already wrote this in the part added to the first post.)
And the threadpool I used as an example to build mine was even more restrictive: no choice, it put each thread on one CPU only. I've expanded on that.
Nevertheless, this has nothing to do with the Intel compiler messing up the code...

But maybe it's also my fault; using /O3 may be too experimental.
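
To make the idea in this post concrete, here is a minimal Python sketch of splitting a frame into contiguous horizontal bands and giving neighbouring bands an affinity mask on the same physical core. It is an illustration only, not the plugin's code: the function names `split_rows` and `core_masks` are hypothetical, and it assumes that the logical cores of one physical core are numbered consecutively (0 and 1 on core 0, 2 and 3 on core 1, and so on), which may not hold on every system.

```python
def split_rows(height, n_threads):
    """Split `height` rows into n_threads contiguous (start, end) bands."""
    base, extra = divmod(height, n_threads)
    bands = []
    start = 0
    for i in range(n_threads):
        # The first `extra` bands take one additional row each.
        size = base + (1 if i < extra else 0)
        bands.append((start, start + size))
        start += size
    return bands


def core_masks(n_threads, n_logical, threads_per_core=2):
    """Return one affinity bitmask per worker thread.

    Assumption (hypothetical layout): sibling logical cores are numbered
    consecutively, so physical core p owns logical cores
    p * threads_per_core .. p * threads_per_core + threads_per_core - 1.
    Thread i is allowed on every logical core of its physical core.
    """
    n_physical = n_logical // threads_per_core
    sibling_bits = (1 << threads_per_core) - 1
    masks = []
    for i in range(n_threads):
        physical = (i // threads_per_core) % n_physical
        masks.append(sibling_bits << (physical * threads_per_core))
    return masks


if __name__ == "__main__":
    # 800 output rows over 16 threads on an 8C/16T CPU.
    for band, mask in zip(split_rows(800, 16), core_masks(16, 16)):
        print(band, bin(mask))
```

With this layout, bands 0 and 1 (rows 0-49 and 50-99) share physical core 0, bands 2 and 3 share core 1, and so on. Whether that actually helps cache behaviour is exactly what is being debated in this thread.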

TheFluff · 3rd April 2018, 19:59

Quote:

Originally Posted by jpsdr

The image is split horizontally, so, for cache access, it may be better if contiguous zones are on the same physical core, no more, no less.

u wot m8

it's random access memory, yeah?

jpsdr · 3rd April 2018, 20:19

LOL... Yes, the "Random" part means that you can directly access any data you want, because the memory chipset/component has an address bus that lets you choose whatever memory cell you want. This is as opposed to other kinds of memory, which have, for example, only serial access, meaning that you can't directly access the data you want without going through other data first.

So... What does this have to do with the fact that the memory zone you're working on may fit in the cache?

jpsdr · 3rd April 2018, 20:20

Intel version trashed, no difference between /O2 and /O3. File updated, redownload it.

jpsdr · 3rd April 2018, 20:25

Some bench tests :

Script :

Code:

Colorbars(width=1920*2,height=1080*2,pixel_type="yv12").killaudio().assumefps(25,1).trim(0,9999)
Spline36ResizeMT(1920,1080,SetAffinity=true)

Result :

Code:

[Runtime info]
Frames processed:               10000 (0 - 9999)
FPS (min | max | average):      1080 | 1315 | 1293
Memory usage (phys | virt):     47 | 44 MiB
Thread count:                   41
CPU usage (average):            81%
Efficiency index:               15.96

Time (elapsed):                 00:00:07.736

SetAffinity=false, result :

Code:

[Runtime info]
Frames processed:               10000 (0 - 9999)
FPS (min | max | average):      1050 | 1372 | 1173
Memory usage (phys | virt):     47 | 45 MiB
Thread count:                   41
CPU usage (average):            69%
Efficiency index:               17.00

Time (elapsed):                 00:00:08.527

Script :

Code:

Colorbars(width=1920*2,height=1080*2,pixel_type="yv12").killaudio().assumefps(25,1).trim(0,9999)
aWarpSharp2(SetAffinity=true)

Result :

Code:

[Runtime info]
Frames processed:               10000 (0 - 9999)
FPS (min | max | average):      165.6 | 270.2 | 239.8
Memory usage (phys | virt):     60 | 60 MiB
Thread count:                   41
CPU usage (average):            84%
Efficiency index:               2.854

Time (elapsed):                 00:00:41.707

SetAffinity=false, result :

Code:

[Runtime info]
Frames processed:               10000 (0 - 9999)
FPS (min | max | average):      170.4 | 271.0 | 207.2
Memory usage (phys | virt):     60 | 60 MiB
Thread count:                   41
CPU usage (average):            67%
Efficiency index:               3.092

Time (elapsed):                 00:00:48.270

Script :

Code:

Colorbars(width=1920,height=1080,pixel_type="yv12").killaudio().assumefps(25,1).trim(0,4)
nnedi3(dh = true, nsize = 3, nns = 4, qual = 2,pscrn=0,threads=0,SetAffinity=true)

Result :

Code:

[Runtime info]
Frames processed:               5 (0 - 4)
FPS (min | max | average):      0.431 | 0.433 | 0.432
Memory usage (phys | virt):     49 | 53 MiB
Thread count:                   41
CPU usage (average):            97%
Efficiency index:               0.00446

Time (elapsed):                 00:00:11.569

SetAffinity=false, result :

Code:

[Runtime info]
Frames processed:               5 (0 - 4)
FPS (min | max | average):      0.379 | 0.403 | 0.392
Memory usage (phys | virt):     49 | 52 MiB
Thread count:                   41
CPU usage (average):            87%
Efficiency index:               0.00451

Time (elapsed):                 00:00:12.741

Is this what can be called empirical evidence?

Nevertheless, that doesn't mean it will be like this for everybody. This is why everyone can tune it according to their own results.
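
For readers comparing the logs above, here is a small Python sketch that recomputes two figures from the posted numbers. The "Efficiency index" values look consistent with average FPS divided by average CPU usage in percent; that definition is an inference from these six logs, not something stated in the thread. The helper names are hypothetical.

```python
# (label, average FPS, average CPU usage in %, logged efficiency index),
# copied from the [Runtime info] blocks in the post above.
RUNS = [
    ("Spline36ResizeMT SetAffinity=true", 1293.0, 81, 15.96),
    ("Spline36ResizeMT SetAffinity=false", 1173.0, 69, 17.00),
    ("aWarpSharp2 SetAffinity=true", 239.8, 84, 2.854),
    ("aWarpSharp2 SetAffinity=false", 207.2, 67, 3.092),
    ("nnedi3 SetAffinity=true", 0.432, 97, 0.00446),
    ("nnedi3 SetAffinity=false", 0.392, 87, 0.00451),
]


def efficiency_index(avg_fps, cpu_percent):
    # Assumed definition: frames per second per percent of CPU used.
    return avg_fps / cpu_percent


def speedup(fps_true, fps_false):
    # Ratio of average FPS with SetAffinity=true to SetAffinity=false.
    return fps_true / fps_false


if __name__ == "__main__":
    for label, fps, cpu, logged in RUNS:
        print(f"{label}: computed {efficiency_index(fps, cpu):.4g}, logged {logged}")
    # Runs are stored in true/false pairs, so step through them two at a time.
    for i in range(0, len(RUNS), 2):
        ratio = speedup(RUNS[i][1], RUNS[i + 1][1])
        print(f"{RUNS[i][0]}: {ratio:.3f}x the FPS of SetAffinity=false")
```

On these numbers SetAffinity=true gives roughly 10% to 16% higher average FPS on this machine, while the efficiency index is slightly lower in all three cases because CPU usage rises by more than the frame rate does.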

Atak_Snajpera · 4th April 2018, 12:07

Something is still not right. I used the dll from the Release_W7 folder.
SetAffinity=true (it is even slower than before)

SetAffinity=false

TheFluff · 4th April 2018, 13:15

Quote:

Originally Posted by jpsdr

LOL... Yes, the "Random" part means that you can directly access any data you want, because the memory chipset/component has an address bus that lets you choose whatever memory cell you want. This is as opposed to other kinds of memory, which have, for example, only serial access, meaning that you can't directly access the data you want without going through other data first.

So... What does this have to do with the fact that the memory zone you're working on may fit in the cache?

Say that you read a megabyte of framebuffer data from RAM into CPU cache and do some work on it. You then want to read some other megabyte of framebuffer data to CPU cache and do some work on that. What, to you, implies that the second memory-to-cache transfer would be affected by the previous one?

I find "benchmarks are, like, just your opinion, maaaan" to be an exceptionally poor argument, by the way. Your results compared to Atak_Snajpera's ones seem to imply that your implementation doesn't actually work, or at least doesn't do what you think it does. Heavens above know what you're even benchmarking.

e: to just quickly restate the argument about cache locality in resizers: recall that most resizers are separable filters which work by moving a sampling window over the input image one dimension at a time. Where do you see the potential for great time savings in the form of cache hits in this, exactly?
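
As an illustration of what "separable" means here, the following Python sketch resizes a small image by running a 1D resampler over every row and then over every column. It uses plain two-tap linear interpolation, not the Spline36 kernel, and it is not how any AviSynth resizer is actually implemented; the function names are hypothetical and the kernel is not widened for downscaling as a real resizer would do.

```python
def resample_1d(samples, out_len):
    """Resample a 1D list to out_len values with linear interpolation."""
    in_len = len(samples)
    scale = in_len / out_len
    out = []
    for x in range(out_len):
        # Map the centre of output sample x back into input coordinates.
        pos = (x + 0.5) * scale - 0.5
        pos = min(max(pos, 0.0), in_len - 1)
        left = int(pos)
        right = min(left + 1, in_len - 1)
        frac = pos - left
        out.append(samples[left] * (1 - frac) + samples[right] * frac)
    return out


def resize_2d(image, out_w, out_h):
    """Separable resize: a horizontal pass over rows, then a vertical pass."""
    # Pass 1: each row is resampled independently along the x axis.
    rows = [resample_1d(row, out_w) for row in image]
    # Pass 2: each column of the intermediate image is resampled along y.
    cols = [resample_1d([row[x] for row in rows], out_h) for x in range(out_w)]
    # Transpose the column-major result back to row-major.
    return [[cols[x][y] for x in range(out_w)] for y in range(out_h)]


if __name__ == "__main__":
    print(resize_2d([[0, 10], [20, 30]], 1, 1))
```

Each output sample in a pass reads only a short window of neighbouring input samples along one axis, which is the access pattern the cache-locality argument above is about.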

jpsdr · 4th April 2018, 14:27

@Atak_Snajpera

Can you provide your results with both true/false for just the following script:

Code:

Colorbars(width=1920*2,height=1080*2,pixel_type="yv12").killaudio().assumefps(25,1).trim(0,9999)
Spline36ResizeMT(1920,1080,SetAffinity=true)

No need to bother with pictures, just paste the [Runtime info] from the log file, it should be easier and faster for you.

jpsdr · 4th April 2018, 14:39

Quote:

Originally Posted by TheFluff

Heavens above know what you're even benchmarking.

The scripts are provided, so if it's impossible to tell from looking at them what is being benchmarked, I really don't know what more I can do.

About cache: I'm just saying that if you have 8 physical CPUs with 8 threads, each working on 1/8 of a 1 MB frame and each on a different CPU, there is a better chance that each thread's working memory zone will fit and stay entirely within the cache during the whole process than if you have 8 threads each working on a full 1 MB frame.
No more, no less.
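
The arithmetic behind this argument can be sketched in a few lines of Python. The 256 KiB per-core cache size below is an assumed figure chosen only for illustration, and the sketch counts input bytes only, ignoring the output buffer and any intermediate buffers a real resizer needs; the function names are hypothetical.

```python
def yv12_frame_bytes(width, height):
    # 8-bit 4:2:0: one full-resolution luma plane plus two
    # quarter-resolution chroma planes, i.e. 1.5 bytes per pixel.
    return width * height * 3 // 2


def per_thread_bytes(frame_bytes, n_threads):
    # Ceiling division: size of the largest band when split evenly.
    return -(-frame_bytes // n_threads)


def fits_in_cache(working_set_bytes, cache_bytes):
    return working_set_bytes <= cache_bytes


if __name__ == "__main__":
    cache = 256 * 1024   # assumed per-core cache size, illustration only
    frame = 1 << 20      # the 1 MB frame from the post above
    band = per_thread_bytes(frame, 8)
    print(band, fits_in_cache(band, cache))    # 1/8 of the frame per thread
    print(frame, fits_in_cache(frame, cache))  # whole frame per thread
    # The benchmark source frame from the earlier scripts, for scale.
    print(yv12_frame_bytes(3840, 2160))
```

Under these assumptions a 128 KiB band fits in the assumed cache and the full 1 MB frame does not. Whether that translates into a measurable speed difference depends on the access pattern, which is the point TheFluff raises.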
