2nd December 2010, 01:39   #1
Prettz
easily bamboozled user
 
Join Date: Sep 2002
Location: Atlanta
Posts: 373
New 64-bit FluxSmooth with SSE2 and SSSE3

A straight 64-bit port of FluxSmooth 1.1a (and what I started from) can be found here: http://forum.doom9.org/showthread.ph...28#post1425528

I've made new 64-bit versions of FluxSmooth for YV12 using SSE2 and SSSE3. The SSE2 version is optimized specifically for Athlon 64 and the SSSE3 version specifically for the 65nm Intel Core 2 (Conroe/Kentsfield). None of AMD's current chips support SSSE3, annoyingly.

I haven't done the YUY2 version yet; I'll be starting on that next. For now, these DLLs contain the original C++ and MMX versions for YUY2. Although FluxSmooth's documentation said it was SSE-optimized, the code was actually MMX, so these new versions are a total rewrite, which is why it's taking so long.

I'm posting this now because I'd like some feedback and help testing. I've fixed all the obvious bugs, but there's bound to be something that needs more testing to uncover. Being new to writing Avisynth plugins, I don't have any good methods for testing filters. I'd also really like to see some speed numbers for the Athlon 64.

About the SSE2 version... There was a problem in moving FluxSmooth's method of doing the average to SSE. The MMX routine uses a lookup table that's a whopping 512KB. It's impossible to use this method in 128-bit -- it would require a 64GB table. For the SSE2 version I had to use floating-point code to take the average. It's not that much of a speed loss though, because with the MMX version's 512KB table, virtually every access would be a cache miss (and a lot of them an L2 cache miss). The SSSE3 version is able to avoid this thanks to the new instructions, and it uses the same average calculation as the C++ and MMX versions.
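The 512KB and 64GB figures can be checked with a little arithmetic. The sketch below assumes (this is not stated in the post) that the table is indexed by one 4-bit value per 16-bit lane and that each entry holds one full register of 16-bit multipliers; `table_bytes` is a hypothetical helper name used only for illustration.

```cpp
#include <cstdint>

// Back-of-the-envelope table sizing.
// ASSUMPTION (not from the post): 4 index bits per 16-bit lane, and each
// table entry is as wide as the register it is loaded into.
//   lanes       = register_bits / 16
//   entries     = 2^(lanes * 4)
//   entry bytes = register_bits / 8
constexpr std::uint64_t table_bytes(unsigned register_bits) {
    return (std::uint64_t{1} << ((register_bits / 16) * 4)) * (register_bits / 8);
}

constexpr std::uint64_t KiB = 1024;
constexpr std::uint64_t GiB = 1024 * 1024 * 1024;

// 64-bit MMX register: 2^16 entries of 8 bytes.
static_assert(table_bytes(64) == 512 * KiB, "MMX table should be 512KB");
// 128-bit SSE register: 2^32 entries of 16 bytes.
static_assert(table_bytes(128) == 64 * GiB, "SSE table should be 64GB");
```

Under that assumption, doubling the register width squares the number of entries and doubles the entry size, which is why the table approach stops being practical at 128 bits.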

There are enough changes in this release that I think it should constitute a new version, v1.2, of FluxSmooth:
  • FluxSmooth's documentation never mentioned that the MMX version actually skips over not just the edge pixels but the first 4 and last 4 pixels of every row. So, previously, FluxSmooth's only optimized version returned different results from the reference C++ code. The new SSE2 and SSSE3 versions only skip the edge pixels; they smooth all the same pixels that the C++ version does.
  • The temporal-only versions of the original FluxSmooth skipped over the same pixels that the ST versions did, although there was no reason to do this. I've modified the C++ code to process all pixels on each frame, and I made (highly-optimized) standalone versions of the SSE2 and SSSE3 for temporal-only that also process all pixels. The MMX code processes the top and bottom row but continues to skip the first and last 4 pixels of each row.
  • Because the SSE2 version uses floating-point for the average, its results are occasionally off by 1 from the pixel values the other versions give (due to rounding). This isn't that big of a deal, though, because FluxSmooth's regular average calculation is itself off by 1 from the true average every once in a while. However, I still felt that this was worth mentioning.
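To see how an integer average can land 1 away from the true rounded average, here is a minimal sketch of a multiply-and-shift average next to a floating-point one. The reciprocal table and rounding used here are illustrative assumptions, not copied from FluxSmooth's source, and all function names are hypothetical; the real code may differ in both the constants and the rounding mode.

```cpp
#include <cassert>

// ASSUMPTION: illustrative reciprocal, roughly 32768 / n rounded to nearest,
// so that ((2*sum + n) * recip(n)) >> 16 approximates round(sum / n).
inline int recip(int n) {
    assert(n >= 1 && n <= 15);
    return n == 1 ? 32767 : (32768 + n / 2) / n;
}

// Integer-only average: one multiply and one shift, no division.
inline int average_fixed(int sum, int n) {
    return ((2 * sum + n) * recip(n)) >> 16;
}

// Floating-point average, rounded half up.
inline int average_float(int sum, int n) {
    return static_cast<int>(static_cast<float>(sum) / static_cast<float>(n) + 0.5f);
}

// Exact rounded average via integer division, used as the reference.
inline int average_exact(int sum, int n) {
    return (2 * sum + n) / (2 * n);
}

// Largest disagreement between the fixed-point and exact averages over
// every possible sum of n 8-bit pixels, for n = 1..15.
inline int max_abs_diff_fixed_vs_exact() {
    int worst = 0;
    for (int n = 1; n <= 15; ++n) {
        for (int sum = 0; sum <= 255 * n; ++sum) {
            int d = average_fixed(sum, n) - average_exact(sum, n);
            if (d < 0) d = -d;
            if (d > worst) worst = d;
        }
    }
    return worst;
}
```

With these assumed constants, six pixels summing to 3 give 0 from `average_fixed` but 1 from both `average_exact` and `average_float`, because the rounded reciprocal for 6 is slightly too small. The disagreement never exceeds 1 in this sketch.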

Now, on to the speed gains. I've only done some very brief speed testing with avs2avi64. I'd really love to get some feedback on the speed from other users with other hardware. I've got a Core 2 Quad Q6600 (65nm), and I tested running at stock clock speed, 2.4GHz. I tested on an Xvid avi to get a more realistic scenario, so FluxSmooth doesn't have the CPU cache all to itself.

Xvid (no b-frames, no Qpel)
720 x 480
46580 frames
232MB

Empty:
1:15.3 (618.97 fps)
1:15.0 (621.03 fps)

C:
4:49.0 (161.17 fps)
4:51.0 (160.05 fps)

MMX:
4:23.5 (176.76 fps)
4:25.3 (175.59 fps)

SSE2:
3:23.5 (228.87 fps)
3:23.8 (228.58 fps)

SSSE3:
2:41.8 (287.96 fps)
2:44.3 (283.58 fps)

If you're wondering what contribution the FP code makes to the SSE2 version's time (I was), I also made an SSSE3 version that uses the same FP code for the average, but tuned for Core 2 instead of Athlon 64. Looks like the FP code makes up a substantial portion of the time:

SSSE3 w/ FP:
3:00.3 (258.40 fps)
3:00.8 (257.69 fps)
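A rough way to size the FP cost from these numbers is to subtract the "Empty" run from each total and compare what is left. This is only a sketch: it assumes total time is decode time plus filter time, which ignores cache sharing and any overlap between the two, and the helper names are hypothetical.

```cpp
#include <cassert>

// Wall-clock seconds from the first run of each configuration above.
constexpr double kEmpty   = 75.3;   // 1:15.3, no filter
constexpr double kSsse3   = 161.8;  // 2:41.8, SSSE3 with integer average
constexpr double kSsse3Fp = 180.3;  // 3:00.3, SSSE3 with FP average

// ASSUMPTION: total = decode + filter, so total - empty isolates the filter.
inline double filter_seconds(double total, double empty) {
    assert(total >= empty);
    return total - empty;
}

// How many times longer the slower filter runs than the faster one,
// after removing the shared decode time.
inline double filter_slowdown(double slower, double faster, double empty) {
    return filter_seconds(slower, empty) / filter_seconds(faster, empty);
}
```

Calling `filter_slowdown(kSsse3Fp, kSsse3, kEmpty)` compares 105.0 s against 86.5 s of filter-only time, a ratio of roughly 1.2, so under this assumption the FP average adds about a fifth to the filter's own run time.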

For testing I've made a version of the plugin named FluxSmoothTest.dll that includes all of the different filter versions for YV12, plus an extra parameter to choose which one to use. The parameter is an integer called "opt": 0 = C code, 1 = MMX, 2 = SSE2, 3 = SSSE3. If your CPU doesn't support the instruction set you chose, it defaults back to the C code (it does not fall back to the next-best optimized version, so you'll know for sure which version is being run).
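The selection rule described above can be sketched as follows. The type and function names are hypothetical, the CPU flags are passed in rather than detected, and treating out-of-range "opt" values as a request for the C code is an assumption of this sketch, not something the post specifies.

```cpp
#include <cassert>

// Values match the "opt" parameter: 0 = C, 1 = MMX, 2 = SSE2, 3 = SSSE3.
enum class Impl { C = 0, MMX = 1, SSE2 = 2, SSSE3 = 3 };

// Hypothetical capability flags; a real plugin would fill these from CPUID
// or from whatever CPU flags the host application reports.
struct CpuCaps {
    bool mmx;
    bool sse2;
    bool ssse3;
};

inline bool supported(Impl impl, const CpuCaps& cpu) {
    switch (impl) {
        case Impl::C:     return true;
        case Impl::MMX:   return cpu.mmx;
        case Impl::SSE2:  return cpu.sse2;
        case Impl::SSSE3: return cpu.ssse3;
    }
    return false;
}

// Returns the requested implementation if the CPU supports it, otherwise
// the C code. There is deliberately no "next-best" fallback.
// ASSUMPTION: out-of-range opt values also select the C code.
inline Impl select_impl(int opt, const CpuCaps& cpu) {
    if (opt < 0 || opt > 3) return Impl::C;
    const Impl requested = static_cast<Impl>(opt);
    return supported(requested, cpu) ? requested : Impl::C;
}
```

On a CPU with SSE2 but no SSSE3, `select_impl(3, cpu)` returns `Impl::C` rather than `Impl::SSE2`, so a timing taken with opt=3 is never silently an SSE2 timing.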

I liked the way RemoveGrain did its DLLs, so I did the same here. There's a DLL that contains only the SSE2 code and another with only the SSSE3 code, so they're much smaller. They throw an error if your CPU doesn't support the required instructions. That won't be an issue with the SSE2 DLL, because all 64-bit x86 chips support SSE2. If people really want it, I can make a DLL with all optimized versions that chooses the best one your CPU supports.
Attached Files
File Type: 7z FluxSmooth SSE DLLs.7z (53.1 KB, 1522 views)
File Type: 7z FluxSmooth_x64_code.7z (38.0 KB, 517 views)
Tags
avisynth, fluxsmooth, x64
