2nd December 2010, 01:39 | #1 | Link |
easily bamboozled user
Join Date: Sep 2002
Location: Atlanta
Posts: 373
|
New 64-bit FluxSmooth with SSE2 and SSSE3
A straight 64-bit port of FluxSmooth 1.1a (and what I started from) can be found here: http://forum.doom9.org/showthread.ph...28#post1425528
I've made new 64-bit versions of FluxSmooth for YV12 using SSE2 and SSSE3. The SSE2 version is optimized specifically for the Athlon 64, and the SSSE3 version specifically for the 65nm Intel Core 2 (Conroe/Kentsfield). Annoyingly, none of AMD's current chips support SSSE3. I haven't done the YUY2 version yet; I'll be starting on that next. For now, these .dlls contain the original C++ and MMX versions for YUY2. Although FluxSmooth's documentation said it was SSE optimized, it was actually written in MMX, so these new versions involve a total rewrite. That's why it's taking so long.
I'm posting this now because I'd like some feedback and help testing. I've gotten all the obvious bugs, but there's bound to be something that needs more testing to uncover. Being new to writing Avisynth plugins, I don't have any good methods of testing filters. I'd also really like to see some speed numbers for the Athlon 64.
About the SSE2 version... There was a problem in moving FluxSmooth's method of doing the average to SSE. The MMX routine uses a lookup table that's a whopping 512KB. It's impossible to use this method with 128-bit registers: it would require a 64GB table. For the SSE2 version I had to use floating-point code to take the average. It's not that much of a speed loss though, because with the MMX version's 512KB table, virtually every access would be a cache miss (and a lot of them an L2 cache miss). The SSSE3 version is able to avoid this thanks to the new instructions, and it uses the same average calculation as the C++ and MMX versions.
There are enough changes in this release that I think it should constitute a new version of FluxSmooth, v1.2.
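For anyone wondering where the 512KB and 64GB figures for the lookup table could come from, here is one way the arithmetic works out. This is only a sketch under an assumption the post doesn't spell out: that the table index packs a 4-bit count for each 16-bit lane of the register, and that each entry is one full register wide. The names (`TableBytes`, `kIndexBitsPerLane`) are made up for illustration and are not from the plugin source.

```cpp
#include <cstdint>

// Assumption (not stated in the post): every 16-bit lane contributes a 4-bit
// count to the table index, and each table entry is one full register wide.
constexpr std::uint64_t kIndexBitsPerLane = 4;

// Number of 16-bit lanes in a register of the given width.
constexpr std::uint64_t Lanes(std::uint64_t register_bits) {
    return register_bits / 16;
}

// Number of table entries: one per possible combination of lane counts.
constexpr std::uint64_t Entries(std::uint64_t register_bits) {
    return std::uint64_t{1} << (Lanes(register_bits) * kIndexBitsPerLane);
}

// Total table size in bytes.
constexpr std::uint64_t TableBytes(std::uint64_t register_bits) {
    return Entries(register_bits) * (register_bits / 8);
}

constexpr std::uint64_t kKiB = 1024;
constexpr std::uint64_t kGiB = 1024ull * 1024ull * 1024ull;

// 64-bit MMX: 2^16 entries x 8 bytes.
static_assert(TableBytes(64) == 512 * kKiB, "MMX table should be 512KB");
// 128-bit SSE: 2^32 entries x 16 bytes.
static_assert(TableBytes(128) == 64 * kGiB, "128-bit table should be 64GB");
```

Under that assumption, doubling the register width squares the number of entries and doubles the entry size, which is why the table goes from merely cache-unfriendly to impossible.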
Now, on to the speed gains. I've only done some very brief speed testing with avs2avi64, and I'd really love to get some feedback on the speed from other users with other hardware. I've got a Core 2 Quad Q6600 (65nm), and I tested at the stock clock speed, 2.4GHz. I tested on an Xvid avi to get a more realistic scenario, so FluxSmooth doesn't have the CPU cache all to itself.
Xvid (no b-frames, no Qpel), 720 x 480, 46580 frames, 232MB
Empty:  1:15.3 (618.97 fps)   1:15.0 (621.03 fps)
C:      4:49.0 (161.17 fps)   4:51.0 (160.05 fps)
MMX:    4:23.5 (176.76 fps)   4:25.3 (175.59 fps)
SSE2:   3:23.5 (228.87 fps)   3:23.8 (228.58 fps)
SSSE3:  2:41.8 (287.96 fps)   2:44.3 (283.58 fps)
If you're wondering what contribution the FP code makes to the SSE2 version's time (I was), I also made an SSSE3 version that uses the same FP code for the average, but tuned for Core 2 instead of Athlon 64. Looks like the FP code makes up a substantial portion of the time:
SSSE3 w/ FP:  3:00.3 (258.40 fps)   3:00.8 (257.69 fps)
For testing I've made a version of the plugin named FluxSmoothTest.dll that includes all of the different filter versions for YV12, and an extra parameter to choose which to use. The parameter is an integer called "opt": 0 = C code, 1 = MMX, 2 = SSE2, 3 = SSSE3. If your CPU doesn't support the instruction set you chose, it defaults back to the C code. It does not fall back to the next-best optimized version, so you'll know for sure which version is being run.
I liked the way RemoveGrain did its dlls, so I did the same here. There's a dll that contains only the SSE2 code and another with only the SSSE3 code, so they're much smaller. They throw an error if your CPU doesn't support the required instructions. That won't be an issue with the SSE2 dll, because all 64-bit x86 chips have SSE2. If people really want it, I can make a dll with all optimized versions that chooses the best one your CPU supports. |
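The "opt" behaviour described above can be sketched roughly as follows. This is not the plugin's actual code: `Impl`, `CpuCaps`, `ChooseForTestDll` and `RequireOrThrow` are hypothetical names, and rejecting an out-of-range "opt" is my own assumption, since the post doesn't say what happens in that case.

```cpp
#include <stdexcept>

// Hypothetical names for illustration; not taken from the plugin source.
enum class Impl { C = 0, MMX = 1, SSE2 = 2, SSSE3 = 3 };

struct CpuCaps {
    bool mmx;
    bool sse2;
    bool ssse3;
};

// True if the CPU described by "cpu" can run the given code path.
inline bool Supports(const CpuCaps& cpu, Impl impl) {
    switch (impl) {
        case Impl::C:     return true;
        case Impl::MMX:   return cpu.mmx;
        case Impl::SSE2:  return cpu.sse2;
        case Impl::SSSE3: return cpu.ssse3;
    }
    return false;
}

// Test-dll behaviour: honour "opt" if the CPU can run it, otherwise drop
// straight to the C code rather than to the next-best optimized version.
inline Impl ChooseForTestDll(int opt, const CpuCaps& cpu) {
    if (opt < 0 || opt > 3) {
        // Assumption: the post does not say how out-of-range values are handled.
        throw std::invalid_argument("opt must be 0..3");
    }
    const Impl requested = static_cast<Impl>(opt);
    return Supports(cpu, requested) ? requested : Impl::C;
}

// Single-ISA dll behaviour: refuse to run instead of falling back.
inline void RequireOrThrow(const CpuCaps& cpu, Impl only) {
    if (!Supports(cpu, only)) {
        throw std::runtime_error("CPU lacks the required instruction set");
    }
}
```

For example, on a CPU with MMX and SSE2 but no SSSE3, `ChooseForTestDll(3, cpu)` gives the C path, so a benchmark labelled "SSSE3" on such a machine would really be timing the C code.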
2nd December 2010, 14:46 | #3 | Link | |
Excessively jovial fellow
Join Date: Jun 2004
Location: rude
Posts: 1,100
|
Quote:
|
|
2nd December 2010, 18:44 | #5 | Link | |
23sKiDdOo!
Join Date: May 2010
Location: Germany
Posts: 182
|
Quote:
and BTW: Thanx for your efforts... |
|
2nd December 2010, 22:59 | #7 | Link | |
easily bamboozled user
Join Date: Sep 2002
Location: Atlanta
Posts: 373
|
Quote:
Yeah, it's like Intel went out of its way to make it confusing. I made the error message for the detection code in the SSSE3-only dll say it "requires SSSE3 (not just SSE3)". SSE3 is almost all floating-point instructions. SSSE3 (Supplemental SSE3) is all integer instructions. |
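To make the SSE3 vs SSSE3 distinction concrete: the two are reported by different bits of CPUID leaf 1, so a check for one says nothing about the other. Below is a minimal sketch. The bit positions are the documented ones; the helper names are made up here, and the `QueryLeaf1` wrapper assumes GCC or Clang on x86 (MSVC would use `__cpuid` from `<intrin.h>` instead).

```cpp
#include <cstdint>
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <cpuid.h>
#endif

// CPUID leaf 1 feature bits.
constexpr std::uint32_t kEdxSse2  = 1u << 26;  // EDX bit 26: SSE2
constexpr std::uint32_t kEcxSse3  = 1u << 0;   // ECX bit 0:  SSE3
constexpr std::uint32_t kEcxSsse3 = 1u << 9;   // ECX bit 9:  SSSE3 (Supplemental SSE3)

inline bool HasSse2(std::uint32_t edx)  { return (edx & kEdxSse2) != 0; }
inline bool HasSse3(std::uint32_t ecx)  { return (ecx & kEcxSse3) != 0; }
inline bool HasSsse3(std::uint32_t ecx) { return (ecx & kEcxSsse3) != 0; }

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
// Reads ECX/EDX of CPUID leaf 1; returns false if the leaf is unavailable.
inline bool QueryLeaf1(std::uint32_t& ecx, std::uint32_t& edx) {
    unsigned int a = 0, b = 0, c = 0, d = 0;
    if (!__get_cpuid(1, &a, &b, &c, &d)) return false;
    ecx = c;
    edx = d;
    return true;
}
#endif
```

A chip that has SSE3 but not SSSE3 reports ECX bit 0 set and bit 9 clear, which is exactly the case the "not just SSE3" error message is meant for.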
|
2nd December 2010, 23:35 | #8 | Link |
23sKiDdOo!
Join Date: May 2010
Location: Germany
Posts: 182
|
Could one of the (s)mods finally approve the attachment? It can't be that one has to wait 2 days for approval, and this is not the first time it has happened. What's wrong? Looking at the online times of our mods, several of them were online during these two days...
Please unlock the attachment soon... |
2nd December 2010, 23:40 | #9 | Link |
͡҉҉ ̵̡̢̛̗̘̙̜̝̞̟̠͇̊̋̌̍̎̏̿̿
Join Date: Feb 2009
Location: No support in PM
Posts: 712
|
I often move my plugins directory between computers with different CPUs, or share it with other people so we have a common, up-to-date base that makes our scripts run correctly. Obviously, checking each dll version and substituting the files is a big PITA. Therefore I end up using the lowest common denominator, SSE2 or even plain C++.
A good solution would be autodetection with an optional parameter to override it, like the "opt" parameter in most of tritical's plug-ins.
__________________
dither 1.28.1 for AviSynth | avstp 1.0.4 for AviSynth development | fmtconv r30 for Vapoursynth & Avs+ | trimx264opt segmented encoding |
3rd December 2010, 04:30 | #11 | Link |
easily bamboozled user
Join Date: Sep 2002
Location: Atlanta
Posts: 373
|
The attachments have been approved now. I'll compile a dll that autodetects which optimization to use once we know there are no bugs in the YV12 code. For now there's no need; it needs testing first and foremost.
For a lot of testing I used a script like this: Code:
LoadPlugin("C:\Program Files (x86)\Avisynth 2.5\plugins64\FluxSmoothTest.dll")
s = AviSource("E:\metropolis\test3.avi")
# Same thresholds, different code paths: opt=0 is the C code, opt=2 is SSE2
fc = s.FluxSmoothST(12,10,0)
f = s.FluxSmoothST(12,10,2)
# Pull the chroma planes out so they can be compared as luma
cu = fc.UToY()
cv = fc.VToY()
fu = f.UToY()
fv = f.VToY()
# Amplified difference images for Y, U and V
y = Overlay(fc, f, mode="Difference", pc_range=true).ColorYUV(autogain=true,cont_u=1024,cont_v=1024)
u = Overlay(cu, fu, mode="Difference", pc_range=true).ColorYUV(autogain=true)
v = Overlay(cv, fv, mode="Difference", pc_range=true).ColorYUV(autogain=true)
StackHorizontal(y, StackVertical(u, v))
One other thing I forgot to mention: the YUY2 code is completely untouched. The "opt" parameter in the test dll doesn't affect it; it'll autodetect MMX just like always. |
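The script above shows differences visually. If a numeric check is also wanted, something along these lines could be run on two output planes from inside a test harness. This is only a sketch: `DiffStats` and `ComparePlanes` are hypothetical names, and it assumes 8-bit planes addressed with a pitch in the usual Avisynth way (the pitch may be larger than the width, and the padding bytes must be ignored).

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Summary of how two planes differ.
struct DiffStats {
    std::size_t differing_pixels;  // number of pixels whose values differ
    int max_abs_diff;              // largest absolute per-pixel difference
};

// Compares two 8-bit planes row by row. "pitch" is the distance in bytes
// between the starts of consecutive rows; only the first "width" bytes of
// each row are compared, so row padding is ignored.
inline DiffStats ComparePlanes(const std::uint8_t* a, int pitch_a,
                               const std::uint8_t* b, int pitch_b,
                               int width, int height) {
    DiffStats stats{0, 0};
    for (int y = 0; y < height; ++y) {
        const std::uint8_t* row_a = a + static_cast<std::ptrdiff_t>(y) * pitch_a;
        const std::uint8_t* row_b = b + static_cast<std::ptrdiff_t>(y) * pitch_b;
        for (int x = 0; x < width; ++x) {
            const int d = std::abs(static_cast<int>(row_a[x]) - static_cast<int>(row_b[x]));
            if (d != 0) {
                ++stats.differing_pixels;
                if (d > stats.max_abs_diff) stats.max_abs_diff = d;
            }
        }
    }
    return stats;
}
```

A result of zero differing pixels means the two code paths are bit-exact on that frame. A small nonzero `max_abs_diff` between the C and SSE2 paths may just be rounding in the floating-point average rather than a bug, so it is worth looking at where the differences occur before drawing conclusions.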
4th December 2010, 00:16 | #13 | Link |
Compiling Encoder
Join Date: Jan 2007
Posts: 1,348
|
Avisynth already offers a way to get the CPU capabilities without writing your own detection code: the GetCPUFlags() function on the IScriptEnvironment class. The flags it returns are also defined in avisynth.h. |
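Here is a rough sketch of how the autodetect-with-override idea suggested earlier in the thread could sit on top of that flags value. It is self-contained, so it does not include avisynth.h: the `kCpuf*` constants are stand-ins for the header's CPUF_* flags with values written from memory, so check them against the header you actually build with. `Path`, `BestSupported` and `Select` are hypothetical names, and using a negative "opt" to mean "auto" is my own convention. In a real plugin the `flags` argument would come from `env->GetCPUFlags()`.

```cpp
// Stand-ins for the CPUF_* flags in avisynth.h. The numeric values are from
// memory and must be verified against the real header before use.
constexpr long kCpufMmx   = 0x04;
constexpr long kCpufSse2  = 0x20;
constexpr long kCpufSsse3 = 0x200;

// Code paths ordered from least to most demanding.
enum class Path { C = 0, MMX = 1, SSE2 = 2, SSSE3 = 3 };

// Best code path the reported flags allow.
inline Path BestSupported(long flags) {
    if (flags & kCpufSsse3) return Path::SSSE3;
    if (flags & kCpufSse2)  return Path::SSE2;
    if (flags & kCpufMmx)   return Path::MMX;
    return Path::C;
}

// opt < 0 means autodetect (hypothetical convention). An explicit request is
// honoured only up to what the CPU supports; anything above is clamped down.
inline Path Select(long flags, int opt = -1) {
    const Path best = BestSupported(flags);
    if (opt < 0) return best;
    const Path requested = static_cast<Path>(opt > 3 ? 3 : opt);
    return static_cast<int>(requested) <= static_cast<int>(best) ? requested : best;
}
```

Note the design difference from the test dll described at the top of the thread: that one drops to the C code when the requested path is unsupported, which is better for benchmarking, while clamping to the best supported path is friendlier for a shared plugins directory.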
25th April 2012, 22:06 | #15 | Link |
Leader of Dual-Duality
Join Date: Aug 2010
Location: America
Posts: 134
|
BTW is there a reason why nobody has ported the updated version to 32bit yet? Are some of the new key factors limited to 64bit functionality? Or is this more along the lines that the original creator has disappeared and everyone who could actually make this happen is busy or working on something else?
__________________
I'm Mr.Fixit and I feel good, fixin all the sources in the neighborhood My New filter is in the works, and will be out soon |
16th December 2013, 23:07 | #17 | Link |
unsigned int
Join Date: Oct 2012
Location: 🇪🇺
Posts: 760
|
At a glance, the new code makes use of the extra general-purpose and xmm registers available only in 64 bit mode. Modifying it to use only the registers available in 32 bit mode is probably non-trivial, if it's possible at all.
__________________
Buy me a "coffee" and/or hire me to write code! |
26th February 2019, 13:27 | #18 | Link |
Registered User
Join Date: Nov 2017
Location: Russia, Nizhny Novgorod
Posts: 25
|
I modified the code a bit (for Avisynth+ 64bit).
Added Code:
static IScriptEnvironment * AVSenvironment;
const AVS_Linkage * AVS_linkage = nullptr;
extern "C" __declspec (dllexport) const char * __stdcall AvisynthPluginInit3 (IScriptEnvironment * env, const AVS_Linkage * const vectors)
{
    AVS_linkage = vectors;
    AVSenvironment = env;
If anyone is interested, I can post it on the forum (with source code and MS VS project). I've checked it and it works. |
Tags |
avisynth, fluxsmooth, x64 |