Old 1st April 2020, 11:15   #141  |  Link
FranceBB
Broadcast Encoder
Join Date: Nov 2013
Location: Royal Borough of Kensington & Chelsea, UK
Posts: 2,883
Quote:
Originally Posted by MeteorRain View Post
If you can compile it yourself, here's the compiling script: https://gist.github.com/msg7086/bb23...cc8c2c49f90c0e
Ok, sure.

Quote:
Originally Posted by pinterf View Post
It's probably not an AVX problem. For XP you have to build with the v141_xp platform toolset (Visual Studio 2017 - Windows XP (v141_xp)) _and_ manually add
/Zc:threadSafeInit-
to the additional options for the C/C++ compiler
Setting /Zc:threadSafeInit- in the compiler options did the trick!
It's working flawlessly now on XP!

Quote:
Originally Posted by MeteorRain View Post
Thanks, please try NeoFFT3D_r1v5-windows-xp.zip
Your new build works too!

Thank you both!
Old 2nd April 2020, 00:04   #142  |  Link
Groucho2004
 
Join Date: Mar 2006
Location: Barcelona
Posts: 5,034
@FranceBB
Time to update AVSMeter. 2.8.5 is 1.5 years old.
__________________
Groucho's Avisynth Stuff
Old 2nd April 2020, 00:57   #143  |  Link
FranceBB
Broadcast Encoder
Join Date: Nov 2013
Location: Royal Borough of Kensington & Chelsea, UK
Posts: 2,883
Quote:
Originally Posted by Groucho2004 View Post
@FranceBB
Time to update AVSMeter. 2.8.5 is 1.5 years old.
Oops.
How could I have missed that...?
Updated right now!
By the way, thank you for AVSMeter. I've found it very useful on several occasions, especially when helping colleagues whose installations were completely messed up, with duplicate or triplicate plugins, random names, etc.
It also helped me keep my installation and plugin folder perfectly clean, with nothing redundant.
Old 7th April 2020, 12:17   #144  |  Link
MeteorRain
結城有紀
 
Join Date: Dec 2003
Location: NJ; OR; Shanghai
Posts: 894
r2 -- marked as pre-release for now -- has been uploaded to GitHub.

I spent a few hours completely rewriting the dual interface wrappers, and they should be easier to work with now. I'm planning to move some of my filters onto the new platform if I have time, including f3kdb (before fixing some of the raised issues).

Let me know how it works -- and whether it is faster than it was. I experimentally used C++ Parallel STL in a few hot code paths, which should have improved speed a bit.
__________________
Projects
x265 - Yuuki-Asuna-mod Download / GitHub
TS - ADTS AAC Splitter | LATM AAC Splitter | BS4K-ASS
Neo AviSynth+ filters - F3KDB | FFT3D | DFTTest | MiniDeen | Temporal Median
Old 7th April 2020, 13:31   #145  |  Link
ChaosKing
Registered User
 
Join Date: Dec 2005
Location: Germany
Posts: 1,795
Quote:
Originally Posted by ChaosKing View Post
v4 is looking good
A quick speed test (x64) with an NTSC DVD, default values:
- neo fft clang ~71.8 fps
- neo fft msvc ~73.9 fps
- neo fft msvc-xp ~74.1 fps
- "old" fft ~67.4 fps

Run multiple times, 1500 frames, in vsedit.
With r2 -- marked as pre-release
Code:
clang 109.2 fps
msvc 113.3 fps
Everything is tested on a Ryzen 2600
Code:
std.BlankClip(format=vs.YUV420P8, width=1920, height=1080, length=500).neo_fft3d.FFT3D()

x64 r2-pre:
clang: Output 500 frames in 25.57 seconds (19.55 fps)
msvc: Output 500 frames in 24.22 seconds (20.64 fps)

x64 r1:
msvc: Output 500 frames in 39.87 seconds (12.54 fps)

x64 original fft3dfilter:
Output 500 frames in 42.38 seconds (11.80 fps)

Nearly x2 faster
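
For anyone who wants to check the arithmetic, here is a minimal Python sketch that recomputes the fps figures and the speedup relative to the original fft3dfilter from the timings quoted above. The helper name `fps` and the dictionary layout are just illustrative choices, not part of any tool used in this thread.

```python
def fps(frames: int, seconds: float) -> float:
    """Throughput in frames per second."""
    return frames / seconds


FRAMES = 500  # every run above rendered 500 frames

# Wall-clock seconds copied from the 1080p BlankClip results above.
timings = {
    "r2-pre clang": 25.57,
    "r2-pre msvc": 24.22,
    "r1 msvc": 39.87,
    "original fft3dfilter": 42.38,
}

results = {name: fps(FRAMES, secs) for name, secs in timings.items()}
baseline = results["original fft3dfilter"]
speedups = {name: value / baseline for name, value in results.items()}

for name in timings:
    print(f"{name}: {results[name]:.2f} fps, {speedups[name]:.2f}x vs original")
```

By these numbers the r2-pre msvc build comes out at roughly 1.75x the original filter and the clang build at roughly 1.66x, so "nearly x2" is a generous rounding.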

And a warning to go with it:
Code:
Core freed but 1866240000 bytes still allocated in framebuffers
In Avisynth, r1 complains about the pixel_type for BlankClip() with a correct error message, but r2 just throws an error: There is no function named neo_fft3d(). It works if I use a valid pixel type: BlankClip(pixel_type = "YUV420P8")

EDIT: No wait, I mixed up the DLLs. r2 always throws this error: There is no function named neo_fft3d(). I tried all x64 and x86 versions.
__________________
AVSRepoGUI // VSRepoGUI - Package Manager for AviSynth // VapourSynth
VapourSynth Portable FATPACK || VapourSynth Database

Last edited by ChaosKing; 7th April 2020 at 13:42.
Old 7th April 2020, 22:50   #146  |  Link
MeteorRain
結城有紀
 
Join Date: Dec 2003
Location: NJ; OR; Shanghai
Posts: 894
Yeah, VapourSynth seems to have some issues. Let me set up a test environment and see how to deal with them.

A leaking filter instance cost me an hour or two. This thing is a time killer.

And I put the wrong name on the function, so...

Last edited by MeteorRain; 8th April 2020 at 01:49.
Old 8th April 2020, 09:14   #147  |  Link
MeteorRain
結城有紀
 
Join Date: Dec 2003
Location: NJ; OR; Shanghai
Posts: 894
Uploaded r3. Should have solved all kinds of issues.
Old 8th April 2020, 10:00   #148  |  Link
ChaosKing
Registered User
 
Join Date: Dec 2005
Location: Germany
Posts: 1,795
The fps gap between clang and msvc is now almost gone.
Both ranged between 31-33 fps.

Output 500 frames in 15.71 seconds (31.82 fps)
Output 500 frames in 14.99 seconds (33.37 fps)
185 fps now on my NTSC DVD test source.

But this speedup again... What kind of sorcery is this?

EDIT:
The memory warning is gone.
AviSynth works. I get 28-29 fps in AviSynth x64 3.4.
So now we have a nearly x3 faster fft3d filter. Awesome job!

EDIT2:
Added plugin to vsrepo & avsrepo

Last edited by ChaosKing; 8th April 2020 at 10:28.
Old 8th April 2020, 10:49   #149  |  Link
tormento
Acid fr0g
Join Date: May 2002
Location: Italy
Posts: 2,542
Quote:
Originally Posted by ChaosKing View Post
The fps gap between clang and msvc is now almost gone.
Programming noob that I am, may I ask why some AVS contributors have started to provide both builds? I mean, what are the differences and advantages? Speed on different CPUs? I am really curious.
__________________
@turment on Telegram
Old 8th April 2020, 11:24   #150  |  Link
ChaosKing
Registered User
 
Join Date: Dec 2005
Location: Germany
Posts: 1,795
It is just something I noticed...

In general, different compilers implement different optimization strategies. This can lead to plugin A being faster than plugin B with compiler X, while plugin C is faster with compiler Y.

Then you could also inspect the machine code generated by compiler X and see why it is faster or slower than the code from compiler Y.

tl;dr: compiler, please understand what I want.
Old 8th April 2020, 12:24   #151  |  Link
pinterf
Registered User
 
Join Date: Jan 2014
Posts: 2,309
There are so many options and sub-filters that it is impossible to test all cases and choose one "best" compiler.
E.g. there are tons of possible branches by processor type: SSE2, SSE4, AVX, AVX2, plus 24 modes in RgTools, plus the repair option. Many hundreds of different scenarios.
There was a real case where msvc produced 10% faster AVX2 code for RemoveGrain mode X while being 10% slower in mode Y, compared to clang.
Let the user choose. A power user with infinite time to test will know why this or that version was chosen. All other users just choose by sympathy (MS rulez, clang inside yeah).
Old 8th April 2020, 12:27   #152  |  Link
tormento
Acid fr0g
Join Date: May 2002
Location: Italy
Posts: 2,542
And you forgot Intel Parallel Studio XE too xD

Colleagues at work have told me it does miracles.
Old 8th April 2020, 12:58   #153  |  Link
MeteorRain
結城有紀
 
Join Date: Dec 2003
Location: NJ; OR; Shanghai
Posts: 894
Quote:
Originally Posted by tormento View Post
Programming noob that I am, may I ask why some AVS contributors have started to provide both builds? I mean, what are the differences and advantages? Speed on different CPUs? I am really curious.
A few reasons.

1) clang does lots of optimizations even on SIMD intrinsics code, while MSVC tends to compile it as-is. For example, in f3kdb, where there's a selection operation (I can't remember the details), clang optimizes the whole operation into a completely different instruction sequence, which greatly improved the speed on SSE2, even making it almost as fast as or faster than the AVX variant (due to the lack of an equivalent instruction in AVX). MSVC, meanwhile, simply compiled it as-is, maybe with some unrolling?

2) clang checks for lots of extra warnings and errors. Having a clang-compiled version means the code will likely compile with clang on Linux or macOS as well.

I personally also compile the code with GCC to make sure it satisfies all common compilers. However, GCC uses a different C++ ABI than MSVC, so its builds are not usable with the MSVC-built C++ AVS+, which is why I didn't include one in the release.
Old 8th April 2020, 13:05   #154  |  Link
tormento
Acid fr0g
 
tormento's Avatar
 
Join Date: May 2002
Location: Italy
Posts: 2,542
Quote:
Originally Posted by pinterf View Post
There are so many options and sub-filters that it is impossible to test all cases and choose one "best" compiler.
I just tried a very simple filter chain that uses only KNLMeansCL, MaskTools and MVTools. The only one of them available in both clang and VS2019 builds is MaskTools.

Average over 10 runs of the same script:

MaskTools clang: 13.43 fps
MaskTools VS2019: 13.59 fps

The true bottleneck here is MVTools, which I keep begging you to release in both compiler versions to see if there is any difference.
Old 8th April 2020, 13:08   #155  |  Link
tormento
Acid fr0g
 
tormento's Avatar
 
Join Date: May 2002
Location: Italy
Posts: 2,542
Quote:
Originally Posted by MeteorRain View Post
A few reasons.
Thanks for the exhaustive reply. In the meantime I asked my old university friends what they use, and they confirmed that the Intel compiler is one of the best for producing fast code. In some particular cases, mostly scientific simulations, they see a 20x speedup over the VS one.
Old 8th April 2020, 13:08   #156  |  Link
MeteorRain
結城有紀
 
Join Date: Dec 2003
Location: NJ; OR; Shanghai
Posts: 894
Quote:
Originally Posted by ChaosKing View Post
But this speed up again... What kind of sorcery is this?
I have a pretty smart LRU caching system and a multi-threading counting system, to make sure I use the least possible memory for the given multi-threading model. So it should now run with 6 threads from the input.
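
To make the LRU idea concrete, here is a minimal Python sketch of a least-recently-used frame cache. It is not the filter's actual C++ implementation; the class name `LRUCache`, the `loader` callback and the three-frame temporal window in the usage part are all assumptions made for illustration, and the sketch is single-threaded.

```python
from collections import OrderedDict
from typing import Any, Callable


class LRUCache:
    """Keep at most `capacity` items, evicting the least recently used one."""

    def __init__(self, capacity: int, loader: Callable[[int], Any]) -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.loader = loader  # hypothetical callback that produces a frame
        self._items: "OrderedDict[int, Any]" = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, key: int) -> Any:
        if key in self._items:
            self._items.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self._items[key]
        self.misses += 1
        value = self.loader(key)
        self._items[key] = value
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # drop the oldest entry
        return value


# Usage: each output frame n needs frames n-1, n and n+1 (clamped to the clip).
LAST = 9
cache = LRUCache(capacity=3, loader=lambda n: f"frame {n}")
for n in range(LAST + 1):
    for needed in (max(0, n - 1), n, min(LAST, n + 1)):
        cache.get(needed)

print(cache.hits, cache.misses)
```

With sequential access, a capacity equal to the window size is enough for every source frame to be loaded only once, which is the memory-saving effect described above. A multi-threaded version would need a larger capacity and locking.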

Then each plane runs on its own thread, so that's 18 threads going into the engine.

From there, some hot paths, like copying window data into blocks, are parallelized using C++ PSTL, which is basically a trouble-free thread pool.

Core functions such as the Apply and Sharpen families get full SSE/AVX/AVX512 optimization. Because I have a centralized wrapper and put the variants into templates, it's pretty easy for me to transform loops into parallel computations. I tried putting every block on its own thread, and it turned out to run slower. So the current design is to run 4 parallel batches, each handling a quarter of the data blocks.

I definitely don't want it to occupy too much CPU, because it's usually paired with other filters. So I tweaked it to use about 3-4 full cores at most, to maintain high running efficiency.
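
The batching scheme can be sketched in a few lines of Python. This only illustrates the structure (6 input threads times 3 planes gives the 18 threads mentioned above, and each plane's blocks are then split into 4 batches rather than one task per block); the real filter does this in C++ with Parallel STL. The function names and the doubling "kernel" are hypothetical stand-ins.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_batches(items, batch_count):
    """Split items into batch_count contiguous slices of near-equal size."""
    base, extra = divmod(len(items), batch_count)
    batches, start = [], 0
    for i in range(batch_count):
        size = base + (1 if i < extra else 0)
        batches.append(items[start:start + size])
        start += size
    return batches


def process_block(block):
    # Hypothetical stand-in for the per-block filter kernel.
    return [2 * value for value in block]


def process_batch(batch):
    # One task handles a whole batch, so scheduling overhead is paid
    # 4 times per plane instead of once per block.
    return [process_block(block) for block in batch]


def process_all(blocks, batch_count=4):
    batches = split_into_batches(blocks, batch_count)
    with ThreadPoolExecutor(max_workers=batch_count) as pool:
        results = list(pool.map(process_batch, batches))
    # pool.map preserves input order, so flattening restores block order.
    return [block for batch in results for block in batch]


blocks = [[i, i + 1] for i in range(10)]
output = process_all(blocks)
print([len(b) for b in split_into_batches(blocks, 4)])
```

Fewer, larger tasks keep per-task overhead low, which matches the observation that one thread per block ran slower. Note that CPython threads will not speed up pure-Python arithmetic like this; the sketch is about the partitioning, not about performance.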
Old 8th April 2020, 13:15   #157  |  Link
MeteorRain
結城有紀
 
Join Date: Dec 2003
Location: NJ; OR; Shanghai
Posts: 894
Quote:
Originally Posted by tormento View Post
Thanks for the exhaustive reply. In the meantime I asked my old university friends what they use, and they confirmed that the Intel compiler is one of the best for producing fast code. In some particular cases, mostly scientific simulations, they see a 20x speedup over the VS one.
Yes and no. Lots of CPU-intensive code nowadays is written in assembly / intrinsics anyway, so the compiler has fewer and fewer things to optimize.

Like I said, a compiler has to be able to optimize assembly-level code to improve performance further. Sometimes you also have to take the target CPU into consideration, because a few instructions run faster or slower on different CPUs.

For example, if I write SIMD code A; B; C, but the compiler thinks my code is dumb because the same thing can be done with X; Y, which is faster, it should throw away my code and switch to the faster version right away.

Then there are times when ABC is faster on, for example, Haswell, but XY is faster on Skylake. Now the compiler would ask you which CPU you prefer to target. If you say you love Haswell, then it should leave ABC there.
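
As a toy model of that choice, the Python sketch below picks the cheaper of two equivalent instruction sequences from a per-target cost table. The cycle costs are invented purely for illustration and do not describe real Haswell or Skylake timings; `COSTS`, `CANDIDATES` and `pick_sequence` are hypothetical names. With GCC and clang, the target is what you select through options such as `-march` and `-mtune`.

```python
# Hypothetical per-instruction costs in cycles. These numbers are made up
# for the example and are NOT real measurements for either CPU.
COSTS = {
    "haswell": {"A": 1, "B": 1, "C": 1, "X": 3, "Y": 2},
    "skylake": {"A": 2, "B": 2, "C": 1, "X": 1, "Y": 1},
}

# Two instruction sequences assumed to compute the same result.
CANDIDATES = [("A", "B", "C"), ("X", "Y")]


def sequence_cost(sequence, target):
    """Sum the per-instruction costs of a sequence on the given target."""
    return sum(COSTS[target][op] for op in sequence)


def pick_sequence(target):
    """Return the cheapest equivalent sequence for the given target."""
    return min(CANDIDATES, key=lambda seq: sequence_cost(seq, target))


for target in COSTS:
    chosen = pick_sequence(target)
    print(target, "->", "; ".join(chosen), f"({sequence_cost(chosen, target)} cycles)")
```

Real compilers work from far more detailed machine models (latency, throughput, port usage), but the principle is the same: the "best" code depends on which CPU you tell the compiler to tune for.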

Optimization is a very interesting topic to study, and I only know a small portion of it. That's my $0.02.
Old 8th April 2020, 13:25   #158  |  Link
ChaosKing
Registered User
 
Join Date: Dec 2005
Location: Germany
Posts: 1,795
Quote:
Originally Posted by MeteorRain View Post
I have a pretty smart LRU caching system and a multi-threading counting system, to make sure I use the least possible memory for the given multi-threading model. So it should now run with 6 threads from the input.

Then each plane runs on its own thread, so that's 18 threads going into the engine.

From there, some hot paths, like copying window data into blocks, are parallelized using C++ PSTL, which is basically a trouble-free thread pool.

Core functions such as the Apply and Sharpen families get full SSE/AVX/AVX512 optimization. Because I have a centralized wrapper and put the variants into templates, it's pretty easy for me to transform loops into parallel computations. I tried putting every block on its own thread, and it turned out to run slower. So the current design is to run 4 parallel batches, each handling a quarter of the data blocks.

I definitely don't want it to occupy too much CPU, because it's usually paired with other filters. So I tweaked it to use about 3-4 full cores at most, to maintain high running efficiency.
Now do the same for mvtools
Haha just kidding.
Old 8th April 2020, 13:30   #159  |  Link
MeteorRain
結城有紀
 
Join Date: Dec 2003
Location: NJ; OR; Shanghai
Posts: 894
Never looked at that code. You tell me how hard it would be for me to get my hands on it. If it's not too hard I may actually give it a try.
Old 8th April 2020, 13:46   #160  |  Link
ChaosKing
Registered User
 
Join Date: Dec 2005
Location: Germany
Posts: 1,795
Quote:
Originally Posted by MeteorRain View Post
Never looked at that code. You tell me how hard it would be for me to get my hands on it. If it's not too hard I may actually give it a try.
As far as I read on doom9, the code is a mess...

https://github.com/pinterf/mvtools
https://github.com/dubhater/vapoursynth-mvtools

But maybe it would be better and easier to improve the C++ reimplementation of mvtools by feisty2
https://forum.doom9.org/showthread.php?t=172525
https://github.com/IFeelBloated/vapoursynth-mvtools-sf <-- I had some problematic scenes where this version was OK while the other mvtools showed some artifacts. Maybe the 32-bit precision had something to do with it, idk.

Last edited by ChaosKing; 8th April 2020 at 13:50.
All times are GMT +1.