Old 23rd January 2018, 14:57   #41  |  Link
Atak_Snajpera
RipBot264 author
 
Join Date: May 2006
Location: Poland
Posts: 7,816
This multithreaded resizing plugin is really good! Much better than the plain Prefetch option in AviSynth+ MT.
CPU: E5-2690@2.9GHz (8C/16T)
Source: 3840x2160 YUV420P10

Crop(0,280,0,-280) + Spline36Resize(1920,800) + Prefetch(8)


Crop(0,280,0,-280) + Spline36Resize(1920,800) + Prefetch(16)


Crop(0,280,0,-280) + Spline36ResizeMT(1920,800)


The fastest, with lower memory consumption and CPU usage!

For comparison, the regular resizer:
Crop(0,280,0,-280) + Spline36Resize(1920,800)

Last edited by Atak_Snajpera; 23rd January 2018 at 15:11.
Old 31st March 2018, 10:14   #42  |  Link
jpsdr
Registered User
 
Join Date: Oct 2002
Location: France
Posts: 2,316
New version, see the first post. I've also added a section there about the multi-threading.
__________________
My github.
Old 1st April 2018, 17:12   #43  |  Link
Atak_Snajpera
RipBot264 author
 
Join Date: May 2006
Location: Poland
Posts: 7,816
SetAffinity=true in the latest version performs terribly, even without Prefetch in the script. It is now slower than the regular single-threaded resizer!


SetAffinity=false


BTW, I see that the newer version is faster than the old one (55 fps vs 52 fps).
Old 1st April 2018, 17:19   #44  |  Link
jpsdr
Registered User
 
Join Date: Oct 2002
Location: France
Posts: 2,316
What's the full script?
Old 1st April 2018, 17:24   #45  |  Link
Atak_Snajpera
RipBot264 author
 
Join Date: May 2006
Location: Poland
Posts: 7,816
Code:
#MT



#VideoSource
LoadPlugin("C:\Users\Dave\Documents\Delphi_Projects\RipBot264\_Compiled\Tools\AviSynth plugins\ffms\ffms_latest\x64\ffms2.dll")
video=FFVideoSource("E:\_Video_Samples\mkv\Passengers_2016_4K.mkv",cachefile = "C:\Temp\RipBot264temp\job1\Passengers_2016_4K.mkv.ffindex")
#Deinterlace



#Decimate



#Crop
video=Crop(video,0,280,-0,-280)



#Resize
LoadPlugin("C:\Users\Dave\Documents\Delphi_Projects\RipBot264\_Compiled\Tools\AviSynth plugins\Plugins_JPSDR\Plugins_JPSDR.dll")
video=Spline36ResizeMT(video,1920,800,SetAffinity=true).Sharpen(0.2)



#Levels



#Colours



#Denoise



#Custom



#Prefetch



#Subtitles



#AudioSource
Import("C:\Temp\RipBot264temp\job1\job1_a1.avs")


#Triming



#AVSameLength



#ColorSpace



#Return
Old 1st April 2018, 18:02   #46  |  Link
jpsdr
Registered User
 
Join Date: Oct 2002
Location: France
Posts: 2,316
What CPU do you have? More precisely, how many logical cores does it have? I just want to understand the 49 threads, but that is totally expected if you have something like a CPU with 20 logical cores.
Old 1st April 2018, 18:19   #47  |  Link
Atak_Snajpera
RipBot264 author
 
Join Date: May 2006
Location: Poland
Posts: 7,816
https://forum.doom9.org/showthread.p...60#post1831560
Old 2nd April 2018, 10:26   #48  |  Link
jpsdr
Registered User
 
Join Date: Oct 2002
Location: France
Posts: 2,316
OK, there is something odd indeed, thanks for reporting. All the other filters seem to behave properly, but the resampler runs on only one core with SetAffinity set to true, which is totally unexpected. I can't investigate right now, but I will very shortly.
There is a bug somewhere...
aWarpSharp2 and nnedi3 give me 41 threads on my 20-core CPU, with both true and false.
ResampleMT gives me 41 threads with true, 174 with false!!!
Yes, there is something really wrong...

Last edited by jpsdr; 2nd April 2018 at 10:39.
Old 3rd April 2018, 11:43   #49  |  Link
jpsdr
Registered User
 
Join Date: Oct 2002
Location: France
Posts: 2,316
Even fishier!! I'm on my lunch break and ran some tests on my PC at work, and everything works fine, but I don't have the same CPU there as I have at home (it's a simple 4-core without HT). The only thing I can't check for now is the Intel build vs the VS (Visual Studio) build. Did you run your tests with the VS or the Intel version? If Intel, can you run a test with the VS version? I'll also try this when I'm back home, but that won't be for several hours.

Edit:
Sometimes I'm very stupid; of course I can test, I just have to download them from my GitHub...

Results:
The VS AVX and Intel AVX2 versions work fine with standard AviSynth.

The VS AVX version works fine with avs+ (both x86 & x64).
The Intel AVX2 version behaves... "fishy" with avs+ (both x86 & x64), but only for the resampler; the other filters work fine.

I'll update the release files on GitHub, removing the Intel versions, keeping only the VS versions, and adding a VS AVX2 version. Wait at least 24h before checking/re-downloading the files.

Last edited by jpsdr; 3rd April 2018 at 12:09.
Old 3rd April 2018, 13:16   #50  |  Link
jpsdr
Registered User
 
Join Date: Oct 2002
Location: France
Posts: 2,316
... Before removing them entirely, I'll check whether the issue still occurs with /O2 instead of /O3 with the Intel compiler.
Old 3rd April 2018, 16:12   #51  |  Link
Groucho2004
 
Join Date: Mar 2006
Location: Barcelona
Posts: 5,034
Quote:
Originally Posted by jpsdr
OK, there is something odd indeed, thanks for reporting. All the other filters seem to behave properly, but the resampler runs on only one core with SetAffinity set to true, which is totally unexpected.
I have a rather basic question: what makes you think that messing with Windows' thread scheduler by manipulating thread affinity improves the speed? What if another program does the same? Have you measured the speed in different scenarios (various Windows versions, CPUs with Hyperthreading, software that messes with thread priority)?
__________________
Groucho's Avisynth Stuff
Old 3rd April 2018, 16:47   #52  |  Link
jpsdr
Registered User
 
Join Date: Oct 2002
Location: France
Posts: 2,316
Quote:
Originally Posted by Groucho2004
What makes you think that messing with Windows' thread scheduler by manipulating thread affinity improves the speed?
The image is split horizontally, so for cache access it may be better if contiguous zones are on the same physical core, no more, no less. Without HT, this matters less. That's what I think, and yes, it's purely theoretical thinking; I didn't spend time running all kinds of tests. (I already wrote this in the part added to the first post.)
Also, the thread pool I used as an example to build mine was even more restrictive: there was no choice, each thread was pinned to a single CPU. I've extended that.
Nevertheless, this has nothing to do with the Intel compiler messing up the code...
But maybe it's also my fault; using /O3 may be too experimental.
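
A small sketch of the idea described above, assuming "split horizontally" means contiguous bands of rows, one per worker thread. It is written in Python purely for illustration; the function names, the two-threads-per-core layout and the core numbering are hypothetical and are not taken from the plugin source.

```python
def split_rows(height, n_threads):
    """Split [0, height) into n_threads contiguous bands whose sizes differ by at most one row."""
    base, extra = divmod(height, n_threads)
    bands, start = [], 0
    for t in range(n_threads):
        size = base + (1 if t < extra else 0)
        bands.append((start, start + size))
        start += size
    return bands


def assign_bands(height, physical_cores, threads_per_core=2):
    """Pair each band with a (physical core, sibling) slot so neighbouring bands share a core.

    ASSUMPTION: worker t runs on physical core t // threads_per_core. This is an
    illustrative numbering, not how the plugin or Windows numbers logical CPUs.
    """
    bands = split_rows(height, physical_cores * threads_per_core)
    return [(t // threads_per_core, t % threads_per_core, band)
            for t, band in enumerate(bands)]


# Example: an 800-row output frame on a hypothetical 4-core CPU with HT.
for core, sibling, (top, bottom) in assign_bands(800, physical_cores=4):
    print(f"core {core} thread {sibling}: rows {top}-{bottom - 1}")
```

With this layout, the two hardware threads of a core always work on adjacent bands, which is the situation the cache argument relies on. Whether that translates into a measurable gain is a separate question.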

Last edited by jpsdr; 3rd April 2018 at 16:50.
Old 3rd April 2018, 19:59   #53  |  Link
TheFluff
Excessively jovial fellow
 
Join Date: Jun 2004
Location: rude
Posts: 1,100
Quote:
Originally Posted by jpsdr
The image is split horizontally, so for cache access it may be better if contiguous zones are on the same physical core, no more, no less.
u wot m8

it's random access memory, yeah?
Old 3rd April 2018, 20:19   #54  |  Link
jpsdr
Registered User
 
Join Date: Oct 2002
Location: France
Posts: 2,316
LOL... Yes, the "Random" part means that you can access any data directly if you want, because the memory chip/component has an address bus that lets you select whichever memory cell you want. This is as opposed to other kinds of memory, which for example have only serial access, meaning that you can't reach the data you want without going through other data first.

So... what does this have to do with the fact that the memory zone you're working on may fit in the cache?
Old 3rd April 2018, 20:20   #55  |  Link
jpsdr
Registered User
 
Join Date: Oct 2002
Location: France
Posts: 2,316
Intel version trashed; there was no difference between /O2 and /O3. File updated, please re-download it.
Old 3rd April 2018, 20:25   #56  |  Link
jpsdr
Registered User
 
Join Date: Oct 2002
Location: France
Posts: 2,316
Some benchmark tests:

Script:
Code:
Colorbars(width=1920*2,height=1080*2,pixel_type="yv12").killaudio().assumefps(25,1).trim(0,9999)
Spline36ResizeMT(1920,1080,SetAffinity=true)
Result:
Code:
[Runtime info]
Frames processed:               10000 (0 - 9999)
FPS (min | max | average):      1080 | 1315 | 1293
Memory usage (phys | virt):     47 | 44 MiB
Thread count:                   41
CPU usage (average):            81%
Efficiency index:               15.96

Time (elapsed):                 00:00:07.736
Result with SetAffinity=false:
Code:
[Runtime info]
Frames processed:               10000 (0 - 9999)
FPS (min | max | average):      1050 | 1372 | 1173
Memory usage (phys | virt):     47 | 45 MiB
Thread count:                   41
CPU usage (average):            69%
Efficiency index:               17.00

Time (elapsed):                 00:00:08.527
Script:
Code:
Colorbars(width=1920*2,height=1080*2,pixel_type="yv12").killaudio().assumefps(25,1).trim(0,9999)
aWarpSharp2(SetAffinity=true)
Result:
Code:
[Runtime info]
Frames processed:               10000 (0 - 9999)
FPS (min | max | average):      165.6 | 270.2 | 239.8
Memory usage (phys | virt):     60 | 60 MiB
Thread count:                   41
CPU usage (average):            84%
Efficiency index:               2.854

Time (elapsed):                 00:00:41.707
Result with SetAffinity=false:
Code:
[Runtime info]
Frames processed:               10000 (0 - 9999)
FPS (min | max | average):      170.4 | 271.0 | 207.2
Memory usage (phys | virt):     60 | 60 MiB
Thread count:                   41
CPU usage (average):            67%
Efficiency index:               3.092

Time (elapsed):                 00:00:48.270
Script:
Code:
Colorbars(width=1920,height=1080,pixel_type="yv12").killaudio().assumefps(25,1).trim(0,4)
nnedi3(dh = true, nsize = 3, nns = 4, qual = 2,pscrn=0,threads=0,SetAffinity=true)
Result:
Code:
[Runtime info]
Frames processed:               5 (0 - 4)
FPS (min | max | average):      0.431 | 0.433 | 0.432
Memory usage (phys | virt):     49 | 53 MiB
Thread count:                   41
CPU usage (average):            97%
Efficiency index:               0.00446

Time (elapsed):                 00:00:11.569
Result with SetAffinity=false:
Code:
[Runtime info]
Frames processed:               5 (0 - 4)
FPS (min | max | average):      0.379 | 0.403 | 0.392
Memory usage (phys | virt):     49 | 52 MiB
Thread count:                   41
CPU usage (average):            87%
Efficiency index:               0.00451

Time (elapsed):                 00:00:12.741
Is this what can be called empirical evidence?

Nevertheless, that doesn't mean it will be like this for everybody. This is why everyone can tune the settings according to their own results.
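
As a rough cross-check of the numbers above, the relative gain can be computed directly from the average FPS lines. The sketch below is plain Python (not part of the plugin); the figures are copied from the [Runtime info] blocks, and treating "Efficiency index" as average FPS divided by CPU usage is an assumption inferred from the posted values, not something stated in the thread.

```python
# Average FPS and average CPU usage (%) copied from the [Runtime info] blocks.
# Tuple layout: (fps_affinity_true, cpu_true, fps_affinity_false, cpu_false)
results = {
    "Spline36ResizeMT": (1293.0, 81, 1173.0, 69),
    "aWarpSharp2": (239.8, 84, 207.2, 67),
    "nnedi3": (0.432, 97, 0.392, 87),
}


def speedup_percent(fps_true, fps_false):
    """Relative FPS gain of SetAffinity=true over SetAffinity=false, in percent."""
    return (fps_true / fps_false - 1.0) * 100.0


def fps_per_cpu_percent(fps, cpu):
    """FPS divided by CPU usage. ASSUMPTION: this is what 'Efficiency index' reports."""
    return fps / cpu


for name, (fps_t, cpu_t, fps_f, cpu_f) in results.items():
    gain = speedup_percent(fps_t, fps_f)
    print(f"{name}: {gain:+.1f}% FPS, "
          f"index {fps_per_cpu_percent(fps_t, cpu_t):.4g} (true) vs "
          f"{fps_per_cpu_percent(fps_f, cpu_f):.4g} (false)")
```

On these figures SetAffinity=true is roughly 10 to 16% faster in FPS, while FPS per percent of CPU is slightly lower in all three cases, which is the same pattern the posted efficiency index values show.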

Last edited by jpsdr; 3rd April 2018 at 20:29.
Old 4th April 2018, 12:07   #57  |  Link
Atak_Snajpera
RipBot264 author
 
Join Date: May 2006
Location: Poland
Posts: 7,816
Something is still not right. I used the DLL from the Release_W7 folder.
SetAffinity=true (it is even slower than before)


SetAffinity=false
Old 4th April 2018, 13:15   #58  |  Link
TheFluff
Excessively jovial fellow
 
Join Date: Jun 2004
Location: rude
Posts: 1,100
Quote:
Originally Posted by jpsdr
LOL... Yes, the "Random" part means that you can access any data directly if you want, because the memory chip/component has an address bus that lets you select whichever memory cell you want. This is as opposed to other kinds of memory, which for example have only serial access, meaning that you can't reach the data you want without going through other data first.

So... what does this have to do with the fact that the memory zone you're working on may fit in the cache?
Say that you read a megabyte of framebuffer data from RAM into CPU cache and do some work on it. You then want to read some other megabyte of framebuffer data to CPU cache and do some work on that. What, to you, implies that the second memory-to-cache transfer would be affected by the previous one?

I find "benchmarks are, like, just your opinion, maaaan" to be an exceptionally poor argument, by the way. Your results compared to Atak_Snajpera's ones seem to imply that your implementation doesn't actually work, or at least doesn't do what you think it does. Heavens above know what you're even benchmarking.

e: to just quickly restate the argument about cache locality in resizers: recall that most resizers are separable filters which work by moving a sampling window over the input image one dimension at a time. Where do you see the potential for great time savings in the form of cache hits in this, exactly?
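
To make the access pattern under discussion concrete, here is a minimal sketch of a separable resize in plain Python. It is not the ResampleMT code: a triangle (linear) kernel stands in for Spline36, edges are handled by clamping, and the function names are made up for illustration.

```python
import math


def triangle(x):
    """Triangle (linear) kernel with support [-1, 1]; a stand-in for Spline36."""
    x = abs(x)
    return 1.0 - x if x < 1.0 else 0.0


def resample_1d(line, out_len):
    """Resample one row or column by sliding a weighted window along it."""
    in_len = len(line)
    scale = in_len / out_len
    support = max(1.0, scale)              # widen the window when downscaling
    out = []
    for i in range(out_len):
        center = (i + 0.5) * scale - 0.5   # output sample position in input coordinates
        first = math.floor(center - support)
        last = math.ceil(center + support)
        total = weight_sum = 0.0
        for j in range(first, last + 1):
            w = triangle((j - center) / support)
            src = min(max(j, 0), in_len - 1)   # clamp at the borders
            total += w * line[src]
            weight_sum += w
        out.append(total / weight_sum)
    return out


def resize(image, out_w, out_h):
    """Separable resize: horizontal pass over rows, then vertical pass over columns."""
    rows = [resample_1d(row, out_w) for row in image]
    cols = [resample_1d(list(col), out_h) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]


# Example: shrink an 8x4 gradient to 4x2.
small = resize([[float(x + y) for x in range(8)] for y in range(4)], 4, 2)
print(small)
```

Each pass reads every input sample only a few times, inside a window that slides steadily along the row or column. Whether pinning threads to cores changes how often those reads hit the cache is exactly what is disputed here, and the sketch does not settle it.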

Last edited by TheFluff; 4th April 2018 at 13:19.
Old 4th April 2018, 14:27   #59  |  Link
jpsdr
Registered User
 
Join Date: Oct 2002
Location: France
Posts: 2,316
@Atak_Snajpera

Can you provide your results with both true and false for just the following script:

Code:
Colorbars(width=1920*2,height=1080*2,pixel_type="yv12").killaudio().assumefps(25,1).trim(0,9999)
Spline36ResizeMT(1920,1080,SetAffinity=true)
No need to bother with pictures, just paste the [Runtime info] from the log file, it should be easier and faster for you.

Last edited by jpsdr; 4th April 2018 at 15:02.
Old 4th April 2018, 14:39   #60  |  Link
jpsdr
Registered User
 
Join Date: Oct 2002
Location: France
Posts: 2,316
Quote:
Originally Posted by TheFluff
Heavens above know what you're even benchmarking.
The scripts are provided, so if it's impossible to tell what is being benchmarked by looking at them, I really don't know what more I can do.

About the cache, I'm just saying that if you have 8 physical cores with 8 threads, each working on 1/8 of a 1 MB frame and each on a different core, there is a better chance that each thread's working memory zone will fit entirely in the cache and stay there during the whole process than if you have 8 threads each working on a full 1 MB frame.
No more, no less.
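
The arithmetic behind that argument, as a quick Python sketch. The 1 MB frame and the 8 threads come from the post; the 256 KiB cache size is an assumed, illustrative per-core figure, and the output buffer and filter coefficients are ignored.

```python
def slice_bytes(frame_bytes, n_slices):
    """Bytes each thread touches when the frame is split into equal horizontal slices."""
    return frame_bytes // n_slices


def fits_in_cache(working_set_bytes, cache_bytes):
    """True if the working set is no larger than the cache."""
    return working_set_bytes <= cache_bytes


FRAME_BYTES = 1024 * 1024   # the 1 MB frame from the post
THREADS = 8                 # one thread per physical core, as in the post
CACHE_BYTES = 256 * 1024    # ASSUMPTION: per-core cache size, for illustration only

per_thread = slice_bytes(FRAME_BYTES, THREADS)
print(per_thread, fits_in_cache(per_thread, CACHE_BYTES))    # one slice per thread
print(FRAME_BYTES, fits_in_cache(FRAME_BYTES, CACHE_BYTES))  # whole frame per thread
```

Under these assumptions a 128 KiB slice fits in the per-core cache while a full 1 MiB frame does not. With larger frames or a smaller cache the slice may not fit either, so the conclusion depends on the actual sizes involved.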