Doom9's Forum - View Single Post

NikosD · 6th November 2019, 11:15

Quote:

Originally Posted by SmilingWolf

No, once again it's you who got it all wrong: https://code.videolan.org/videolan/d.../0.2.2...0.3.0
A grand total of 4 commits between 0.2.2 and 0.3.0, with a stability fix, some docs updates, and no performance related commits whatsoever.

And if you had bothered to read the resources I linked to, you would have seen the numbers refer to 0.3.0 vs 0.2.1, as shown by the image on JBKempf's blog

I really like names like Jean-Baptiste or Jesus from Nazareth, but I prefer reading the official release notes to specific blogs: https://code.videolan.org/videolan/dav1d/-/releases

So, what do we have here?

Quote:

0.2.2 brings large improvements in speed on ARM64 and SSSE3 (more than 10% speed increase) and even manages to gain around 5% on the already fast AVX-2 implementation.

10% for SSSE3 and 5% for AVX2 using 0.2.2, compared to the previous version, i.e. 0.2.1.

Quote:

0.3.0 brings large improvements in speed on ARM64 (15% speedup) and SSSE3 (more than 12% fps increase) and even manages to gain around 5% on the already fast AVX-2 implementation.

Another 12% for SSSE3 and 5% for AVX2 using 0.3.0, compared to the previous version, i.e. 0.2.2.

Quote:

0.5.0 brings large improvements in speed on SSSE3 CPU (up to 40% speedup), new speed improvements on AVX-2 (for 4-7%) and ARM64 (up to 10%) and ARM32. It introduces some VSX, SSE2 and SSE4 optimizations.

Another 40% for SSSE3 and 4-7% for AVX2 using 0.5.0, compared to the previous version, i.e. 0.3.0.

Once again, please do the math.

Compounding the per-release gains (1.10 x 1.12 x 1.40 for SSSE3), it's ~72% from 0.2.1 to 0.5.1 regarding SSSE3 optimizations and ~16% for AVX2, according to the official, publicly released notes.
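For anyone who wants to check the math, here is a minimal Python sketch of the compounding. It assumes the figures in the release notes can be treated as exact per-release gains, although the notes actually say "more than" and "up to", so the result is only a rough estimate. The function name `compound_gain` is just something I made up for illustration.

```python
from math import prod

def compound_gain(step_gains_percent):
    """Compound successive per-release speedups (in percent) into one overall percent gain."""
    # Each gain g% multiplies throughput by (1 + g/100); the product is the total factor.
    return (prod(1 + g / 100 for g in step_gains_percent) - 1) * 100

# Figures taken from the quoted release notes (0.2.2, 0.3.0, 0.5.0),
# treated as exact values, which is an assumption.
ssse3 = compound_gain([10, 12, 40])
avx2_low = compound_gain([5, 5, 4])   # low end of the 4-7% range for 0.5.0
avx2_high = compound_gain([5, 5, 7])  # high end of the 4-7% range for 0.5.0

print(f"SSSE3: {ssse3:.1f}%")
print(f"AVX2:  {avx2_low:.1f}% to {avx2_high:.1f}%")
```

The gains multiply rather than add, which is why SSSE3 comes out above the plain sum of 10 + 12 + 40 = 62.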

Quote:

Originally Posted by SmilingWolf

I have already shown that, with SSSE3, I can get a 29% improvement in "FFmpeg multithread" mode on Dua Lipa, and 38,8% if playing some more extensively with the thread settings.

Admittedly close to the low end of the promised speedups, but definitely within the given range.

I have updated my previous post regarding benchmarks and I get 22% and 29% for Chimera and Dua Lipa using my Core2Duo, which is still too far away from 72%.

Quote:

Originally Posted by SmilingWolf

Moreover, you keep yelling at a whole bunch of clouds: it has been shown that a bunch of different projects have undergone a bunch of changes that make both your and my DXVA Checker measurements completely unreliable to find out about dav1d improvements or lack thereof, yet you insist.

Meanwhile, all explanations (but your own), offers of help and alternative, more reliable solutions have been met with utter hostility. At this point, all resources are exhausted. You're right. dav1d is crap, the developers are incompetent, and you can live in your happy world where you can be mad at something.

SmilingWolf with a Big Mouth, I could easily add.

Quote:

Originally Posted by clsid

You can find the exact benchmark results from Ewout in the individual MRs. There is a link to a spreadsheet with all test results and system spec. Example:
https://code.videolan.org/videolan/d...e_requests/792

Interesting specs: 2 x Xeon with AVX2, DDR4 etc., i.e. 2 x 14 cores = 28 cores with hyperthreading, used for testing SSSE3.

He has an average gain of ~23%, which is in the range of my 22% to 29% gain, but I don't understand how the build versions he used are connected to the final versions (0.2.1, 0.2.2 etc.).

But his results made me struggle to understand what is really going on with SSSE3, and made me propose something different.

My first Haswell processor was a Pentium with artificially disabled AVX/AVX2 instructions.

So, late last night I remembered, and confirmed with my 2013 (!) benchmark results, that my 128-bit SIMD (SSEx) benchmarks ran a lot faster on the Haswell Pentium than on my desktop Core2Duo E7300 at the same clock: unusually faster, more than the architecture differences would justify.
It was like running 128-bit instructions on 256-bit registers, and I say that because of the huge difference.

My suggestion:
@Beelzebubu, dav1d team, x265/x264 fans, @doom9 and everyone else running benchmarks on different SIMD optimizations.

If you want to benchmark specific 128-bit SIMD optimizations, and your target group is not only Pentiums/Celerons with disabled AVX/AVX2 sets but also legacy hardware with SSEx-only SIMD, then I suggest running the tests on REAL SSEx-only (128-bit only) hardware, e.g. Core2Duo, Core2Quad or first-generation Core iX. Do not rely on an emulation, i.e. running 128-bit SSEx code with the 256-bit SIMD optimizations artificially disabled, on a CPU with much faster DDR4 and 256-bit registers like 2 x Xeon (!)

I think you are going to be surprised by the results, and these results could explain some of the performance difference.