dav1d accelerated AV1 decoder - Page 2

Beelzebubu · 3rd November 2019, 01:04

Quote:

Originally Posted by NikosD

@Beelzebubu
@nevcairiel

Guys, I posted a huge benchmark report regarding dAV1d decoder progress between 0.2.1 vs 0.5.1 versions, meaning for the last seven months and I see no replies or reactions from you since.

Can you confirm or reject my findings with yours, showing different things ?

I have seen a lot of huge numbers regarding dAV1d progress from the dAV1d team in the official release notes - which I couldn't confirm - but in here you are very quiet.

Waiting for your feedback!

Just to add to SmilingWolf's comments, I agree you and I have diverging results and I've been discussing with various people as for what could be the cause. I don't immediately have a solution or explanation, but I haven't forgotten about it either.

To be clear, we don't just do command-line interface tests. We test this in end-user applications such as VLC and Chrome/Firefox also, and we see the same performance improvements there that we also see in "dav1d" the commandline tool.

NikosD · 3rd November 2019, 02:59

Quote:

Originally Posted by Beelzebubu

... I don't immediately have a solution or explanation, but I haven't forgotten about it either.

To be clear, we don't just do command-line interface tests. We test this in end-user applications such as VLC and Chrome/Firefox also, and we see the same performance improvements there that we also see in "dav1d" the commandline tool.

Ok, but SmilingWolf and you, have tested different things than me.
Firstly, he posted single threaded performance difference and I posted multi-threaded performance difference, besides the obvious difference of the implementation.
VLC is a popular media player - no doubt about it - but here we mostly prefer other players (MPC-HC / MPC-BE / MPV.NET etc)
I don't think there is any other way to find out what is going on than to reproduce the tests yourself.
Is it possible to test the two versions of LAV's implementation I posted above ?
Also, the huge gains of performance posted in various release notes of dAV1d are for single-threaded or multi-threaded performance ?
Thanks!

NikosD · 3rd November 2019, 08:18

Quote:

Originally Posted by SmilingWolf

Conveniently forgetting about my two posts dedicated to multi threaded performance aren't we?

Conveniently forgetting about my word "firstly": you initially posted single-threaded performance only, while I was asking you to confirm or reject my multi-threaded results, which I posted first, regarding this issue.
After my comment you posted multi-threaded results, not using the same tools and with different threading status.
Anyway, the point here is to understand what's going on and not, once again, to play with words or intentions.
You could try to delete the config file of DXVA Checker and uninstall and reinstall everything.
I'm still waiting for an answer on whether the publicly reported gains between versions of dAV1d refer to single-threaded or multi-threaded performance.
BTW, how do you benchmark dAV1d with the two executables you posted here ?
There is no internal command in these.

NikosD · 3rd November 2019, 18:38

Quote:

Originally Posted by SmilingWolf

There isn't a DXVA Checker report yet, afternoon spent trying to make it work notwithstanding, but as I said, CPU utilization goes between 70% and 90% with the two sequences used.

The main issue of dAV1d progress between 0.2.1 and 0.5.1 is not CPU Utilization.
The drop of CPU utilization using Skylake was only 2% although using Core2Duo the drop was huge.
The main issue with dAV1d is the loss of any single-thread gain in real-world multi-thread decoding, for whatever internal reason.
In the end, the end user doesn't know and doesn't care why the Dua Lipa video has exactly the same decoding speed with dAV1d 0.2.1 and 0.5.1 on two different CPU architectures and instruction sets (Skylake using AVX2 / Core2Duo using SSSE3)
It is we who are still searching for why this is happening and under what circumstances.

SmilingWolf · 3rd November 2019, 19:27

Oh but you seemed so worried about how much dav1d was using all my cores just one day ago.

But here, have a Chimera run:

Code:

LAVFilters 0.74.1-29:
CPU: Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz
GPU: NVIDIA GeForce GTX 1080
Decoder: LAV Video Decoder
Decoder Device: -
Frames: 8929
FPS: 170,234 [103-349]
CPU Usage: -
GPU Usage: 0 [0-1] %
GPU Video Engine Usage: 0 [0-0] %

LAVFilters 0.74.1:
CPU: Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz
GPU: NVIDIA GeForce GTX 1080
Decoder: LAV Video Decoder
Decoder Device: -
Frames: 8929
FPS: 139,201 [77-306]
CPU Usage: -
GPU Usage: 0 [0-1] %
GPU Video Engine Usage: 0 [0-0] %

And Dua Lipa:

Code:

LAVFilters 0.74.1-29:
CPU: Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz
GPU: NVIDIA GeForce GTX 1080
Decoder: LAV Video Decoder
Decoder Device: -
Frames: 5615
FPS: 260,815 [183-335]
CPU Usage: -
GPU Usage: 0 [0-1] %
GPU Video Engine Usage: 0 [0-0] %

LAVFilters 0.74.1:
CPU: Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz
GPU: NVIDIA GeForce GTX 1080
Decoder: LAV Video Decoder
Decoder Device: -
Frames: 5615
FPS: 248,936 [137-328]
CPU Usage: -
GPU Usage: 0 [0-0] %
GPU Video Engine Usage: 0 [0-0] %

NikosD · 4th November 2019, 08:55

Quote:

Originally Posted by SmilingWolf

Oh but you seemed so worried about how much dav1d was using all my cores just one day ago.

My worries were and still are, the same.
The very low gain of real-world multi-thread performance between versions 0.2.1 vs 0.5.1 of dAV1d decoder, as measured by me using the above systems and tools, compared to the advertised and publicly reported by dAV1d team regarding SSSE3 and AVX2 optimizations.
All the other comments by me, express my agony to explain by any means that huge difference.
Your results confirm mine in an absolute way regarding Dua Lipa video, but there is a small light at the end of the tunnel regarding Chimera (regardless of the name of the video)
I think @nevcairiel could explain better and test LAV filter's dAV1d implementation.

NikosD · 4th November 2019, 13:01

@nevcairiel
@Beelzebubu
@SmilingWolf

A few more interesting notes regarding LAV filters.

LAV filters v0.74.1 allows you to set Thread = 1 but it actually uses 50% something CPU utilization, which means 2 cores = 2 threads for AV1 (using dAV1d)

But for all the other codecs, it uses only 1 thread as it should, based on the selection.

LAV filters v0.74.1-29 doesn't even allow you to set Thread = 1 because if you set it to 1, it doesn't enumerate in DXVA Checker when trying to decode AV1 files, while it can be used for all the other codecs using only 1 thread.

So, there is definitely something different regarding dAV1d integration in LAV filters, compared to all the other codecs.

In LAV filters 0.74.1-29, when setting Thread = 4, it has exactly the same performance as Auto for my Core i5 6500.

sneaker_ger · 4th November 2019, 13:43

Quote:

Originally Posted by NikosD

LAV filters v0.74.1 allows you to set Thread = 1 but it actually uses 50% something CPU utilization, which means 2 cores = 2 threads for AV1 (using dAV1d)

I think in LAV dav1d tilethreads are hard-coded to 2.
https://github.com/Nevcairiel/LAVFil...codec.cpp#L370

dav1d has 2 thread number settings but LAV only exposes 1 to the user so that's just how it is. I guess nev thinks this is good enough for playback.

SmilingWolf · 4th November 2019, 21:00

Quote:

Originally Posted by NikosD

All the other comments by me, express my agony to explain by any means that huge difference.

And explaining by any means would be nice IF you actually bothered to follow up with a systematic approach to prove your hypothesis.
This would imply removing all the fluff, going down to the most basic level and doing tests going up from there:
- use the dav1d util, single threaded, on IVF files to reduce to the minimum the amount of non-concerned code that is executed, like container parsing
-- Does it not show gains? Then you're right, there haven't been improvements
-- Does it show gains? Then you're wrong, look elsewhere
- use the dav1d util, with multiple threads.
-- Does it eat the gains? Then the problem is multithreading overhead.
-- Does it show the same gains, like I have measured? Then the problem is not single vs multithreaded performance
- use ffmpeg+dav1d, always on IVF files, singlethreaded.
-- Does it eat the gains? Then your problem is in the dav1d+ffmpeg integration
-- Does it show gains? Then look elsewhere
- use ffmpeg+dav1d, always on IVF files, multithreaded.
-- Does it eat the gains? Then your problem is in the dav1d+ffmpeg integration and the way multithreading interacts in either or both tools. I had this happen with ffmpeg+libvmaf, where one of the two would simply hang waiting for data that would never come
-- Does it show gains? Then look elsewhere
[I'm not going to write the whole thing again for AV1 files inside MKV containers but, well, if that's all you have left to look at, why not]
-- Did ffmpeg+dav1d integrate well? Then look at LAVFilters
Etc. etc. etc.
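
The checklist above amounts to walking a ladder of test setups and stopping at the first one where the gain disappears. A minimal sketch of that idea follows. The stage names, the `first_stage_losing_gain` helper and the 5% threshold are all hypothetical and chosen only for illustration; the timings would have to come from your own measurements.

```python
# Hypothetical stage names; the order mirrors the checklist above,
# from the smallest amount of software to the largest.
STAGES = [
    "dav1d CLI, single thread, IVF",
    "dav1d CLI, multi thread, IVF",
    "ffmpeg+dav1d, single thread, IVF",
    "ffmpeg+dav1d, multi thread, IVF",
    "ffmpeg+dav1d, MKV",
    "LAVFilters",
]


def first_stage_losing_gain(times_old, times_new, min_gain=0.05):
    """Return the first stage whose speedup is below min_gain, else None.

    times_old / times_new map a stage name to the decode time in seconds
    for the older and the newer dav1d build. Stages that were not
    measured are skipped. min_gain=0.05 (5%) is an arbitrary assumption.
    """
    for stage in STAGES:
        if stage not in times_old or stage not in times_new:
            continue
        speedup = times_old[stage] / times_new[stage] - 1.0
        if speedup < min_gain:
            return stage
    return None
```

If the function returns `None`, every measured stage kept its gain and the next, larger setup is the one to test.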

Quote:

Originally Posted by NikosD

compared to the advertised and publicly reported by dAV1d team regarding SSSE3 and AVX2 optimizations.

You're forgetting a whole host of volunteers who followed and helped during development: https://code.videolan.org/videolan/dav1d/issues/15

Quote:

Originally Posted by NikosD

Your results confirm mine in an absolute way regarding Dua Lipa video, but there is a small light at the end of the tunnel regarding Chimera (regardless of the name of the video)

The only thing my Dua Lipa results confirm is that most routines that would be weighing down decode performance for this particular encode had already been optimized in AVX2 by the time 0.2.1 was released. AVX2 optimization, I would like to remind you, was considered almost complete by the time 0.2.0 was released: https://code.videolan.org/videolan/d...60f09/NEWS#L96
Hell, if I have time I might even build every single tag leading to 0.2.1 to pinpoint the exact release that brought us to today's performance.

Quote:

Originally Posted by NikosD

LAV filters v0.74.1-29 doesn't even allow you to set Thread = 1 because if you set it to 1, it doesn't enumerate in DXVA Checker when trying to decode AV1 files, while it can be used for all the other codecs using only 1 thread.

So, there is definitely something different regarding dAV1d integration in LAV filters, compared to all the other codecs.

Congrats, this might be your first correct conjecture in this whole ordeal.

Based on the line of code highlighted by sneaker_ger I'd say having the number of tile threads implicitly set to 2 makes dav1d bail on this line: https://code.videolan.org/videolan/d.../src/lib.c#L84
The problem, however, is not in LAVFilters, but in FFmpeg's formula for frame distribution between the two modes, starting here: https://github.com/FFmpeg/FFmpeg/blo...ibdav1d.c#L137
Just in case, the formula is: frame_threads = threads / tile_threads, with all numbers involved being integers. For 2 tile_threads, this solves to 0 frame_threads, as shown here: https://godbolt.org/z/cz65UY, which is below the minimum of 1 frame thread required by dav1d.
This is indeed a bug. The easiest fix would be to cast threads to float before doing the division, as shown here: https://godbolt.org/z/wBLYoo, to avoid having dav1d bail. Threads distribution will still be higher than selected, but at least it'll work.
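
The integer-division problem described above can be reproduced in a few lines. This is only a sketch: the function names are made up for illustration, it is not FFmpeg's actual code, and rounding the floating-point result up to a whole thread count is an assumption about how the "cast to float" fix would be used.

```python
import math

MIN_FRAME_THREADS = 1  # minimum required by dav1d, per the post above


def frame_threads_integer(threads, tile_threads):
    """frame_threads = threads / tile_threads with integer operands."""
    return threads // tile_threads


def frame_threads_rounded_up(threads, tile_threads):
    """Same formula with a floating-point division.

    Assumption: the fractional result is rounded up to a whole count.
    """
    return math.ceil(threads / tile_threads)


# One requested thread with tile threads fixed at 2:
print(frame_threads_integer(1, 2))     # 0, below MIN_FRAME_THREADS
print(frame_threads_rounded_up(1, 2))  # 1, accepted
```

With the rounded-up variant the total is 1 frame thread times 2 tile threads, which is more than the single thread selected, matching the remark that the distribution will still be higher than requested.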

NikosD · 4th November 2019, 22:23

Quote:

Originally Posted by SmilingWolf

And explaining by any means would be nice IF you actually bothered to follow up with a systematic approach to prove your hypothesis...Then look at LAVFilters
Etc. etc. etc.

I really like your analytical thinking, but we would need two lifetimes to check all of this.
And if this whole procedure was so clear to you, why didn't you do it?
We have to suggest things that are feasible in the real world, not just crazily detailed procedures.

Quote:

Originally Posted by SmilingWolf

The only thing my Dua Lipa results confirm is that most routines that would be weighing down decode performance for this particular encode had already been optimized in AVX2 by the time 0.2.1 was released. AVX2 optimization, I would like to remind you, was considered almost complete by the time 0.2.0 was released: https://code.videolan.org/videolan/d...60f09/NEWS#L96
Hell, if I have time I might even build every single tag leading to 0.2.1 to pinpoint the exact release that brought us to today's performance.

Unfortunately there are two issues here.
Firstly, I have already said that, according to my tests, not a lot has been added to the AVX2 optimizations in 7 months, although if we followed the release notes from 0.2.1 up to 0.5.1 we should see a lot more AVX2 gain than 5%.
The second, more important issue is that, according to my tests using LAV filters with a Core2Duo in multi-thread mode, there is no difference from the SSSE3 optimizations between 0.2.1 and 0.5.1, which is really bad given the release notes.

Quote:

Originally Posted by SmilingWolf

Congrats, this might be your first correct conjecture in this whole ordeal.
Based on the line of code highlighted by sneaker_ger I'd say having the number of tile threads implicitly set to 2 makes dav1d bail on this line: https://code.videolan.org/videolan/d.../src/lib.c#L84...This is indeed a bug. The easiest fix would be to cast threads to float before doing the division, as shown here: https://godbolt.org/z/wBLYoo, to avoid having dav1d bail. Threads distribution will still be higher than selected, but at least it'll work.

I'm here to point to bugs, to discover bugs or even make developers think that something is going wrong that could be a bug, so I'm happy that I discovered one.
But certainly I'm not here to fix it, as I'm not a developer.
Still, the way I understand the bug and the fix presented by you, I'm not sure if it's going to recover the multi-thread "loss" or whatever other reason exists that 0.2.1 is so close to 0.5.1 using LAV for both AVX2 and SSSE3 according to my tests.
So, are we still looking for answers or case closed after fixing the bug ?

SmilingWolf · 4th November 2019, 22:43

Well, that procedure is the only certain way to find the source of the slowdown. It shouldn't take more than one afternoon to run those tests, especially with some scripting and logging thrown into the mix.

And the reason I didn't follow my own procedure is that I can't reproduce your results, and have nothing to diagnose. I'm seeing between 4% (Dua Lipa) and 18% (Chimera) improvements in AVX2, and above 30% in SSSE3.
That's far above anything you're seeing on your computers, and more or less in line with what was announced:
- 0.3.0: http://www.jbkempf.com/blog/post/201...even-faster%21 - "a gain of 15%-25% on SSSE3 processors; and even a 5% gain on AVX-2 processors"
- 0.5.0: http://www.jbkempf.com/blog/post/201...elease-fastest - "a gain of 22%-40% on SSSE3 processors; and another gain of 4-7% on AVX-2 processors"
- 0.5.1: http://www.jbkempf.com/blog/post/2019/dav1d-0.5.1 - posted for completeness sake only, there's no mention of SSSE3 or AVX2 speedups

Is there a specific figure you were expecting?

Quote:

Originally Posted by NikosD

Still, the way I understand the bug and the fix presented by you, I'm not sure if it's going to recover the multi-thread "loss" or whatever other reason exists that 0.2.1 is so close to 0.5.1 using LAV for both AVX2 and SSSE3 according to my tests.
So, are we still looking for answers or case closed after fixing the bug ?

That's correct, no case closed yet.

From where I'm standing, the problem is that you are the only one with access to those troublesome systems.
If you want me to help by compiling different versions dav1d or ffmpeg for Windows, I'm game, but that's as far as I can go from here.

nevcairiel · 4th November 2019, 22:46

Quote:

Originally Posted by SmilingWolf

Based on the line of code highlighted by sneaker_ger I'd say having the number of tile threads implicitly set to 2 makes dav1d bail on this line: https://code.videolan.org/videolan/d.../src/lib.c#L84
The problem, however, is not in LAVFilters, but in FFmpeg's formula for frame distribution between the two modes, starting here: https://github.com/FFmpeg/FFmpeg/blo...ibdav1d.c#L137

LAV Filters was actually meant to avoid the calculation logic in FFmpeg entirely, but since I last looked at it, it was changed again (previously it directly took framethreads = threads). So I've adjusted how LAV configures ffmpeg-dav1d, and it should never use their calculations - and it'll now also disable all threading if you set it to 1.

NikosD · 5th November 2019, 09:17

Quote:

Originally Posted by SmilingWolf

And the reason I didn't follow my own procedure is that I can't reproduce your results, and have nothing to diagnose. I'm seeing between 4% (Dua Lipa) and 18% (Chimera) improvements in AVX2, and above 30% in SSSE3.

Using what tools to achieve those figures and in what mode, single-thread or multi-thread ?

Quote:

Originally Posted by SmilingWolf

That's far above anything you're seeing on your computers, and more or less in line with what was announced:
- 0.3.0: http://www.jbkempf.com/blog/post/201...even-faster%21 - "a gain of 15%-25% on SSSE3 processors; and even a 5% gain on AVX-2 processors"
- 0.5.0: http://www.jbkempf.com/blog/post/201...elease-fastest - "a gain of 22%-40% on SSSE3 processors; and another gain of 4-7% on AVX-2 processors"
- 0.5.1: http://www.jbkempf.com/blog/post/2019/dav1d-0.5.1 - posted for completeness sake only, there's no mention of SSSE3 or AVX2 speedups

Is there a specific figure you were expecting?

You got it all wrong here.
To be more scientifically accurate, allow me to correct you according to the publicly available release notes:

- 0.2.2 :
SSSE3 +10% of 0.2.1
AVX2 +5% of 0.2.1

- 0.3.0 :
SSSE3 +12% of 0.2.2
AVX2 +5% of 0.2.2

- 0.5.0 :
SSSE3 +40% of 0.3.0
AVX2 +(4-7%), for my calculations I take 5% on average of 0.3.0

So, if you do the math correctly we are expecting a gain between 0.2.1 and 0.5.1 versions as follows:

SSSE3 ~72%
AVX2 ~16%
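
The ~72% and ~16% figures follow from chaining the per-release percentages multiplicatively rather than adding them. A short sketch of that arithmetic, using only the per-release values listed above (the `compound_gain` helper is a hypothetical name):

```python
def compound_gain(step_gains_percent):
    """Chain successive percentage gains into one overall percentage."""
    total = 1.0
    for gain in step_gains_percent:
        total *= 1.0 + gain / 100.0
    return (total - 1.0) * 100.0


ssse3 = compound_gain([10, 12, 40])  # 0.2.2, 0.3.0, 0.5.0
avx2 = compound_gain([5, 5, 5])      # 5% taken as the average of 4-7%

print(f"SSSE3 {ssse3:.1f}%  AVX2 {avx2:.1f}%")
```

This only checks the arithmetic. Whether each release's figure is really measured against the previous release is a separate question about how the release notes are read.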

Even your troublesome calculations, as you mixed single-thread mode with multi-thread mode and dAV1d executables with lower than expected number of threads and LAV filters without managing to run DXVA Checker properly, couldn't reach those figures.

Quote:

Originally Posted by SmilingWolf

From where I'm standing, the problem is that you are the only one with access to those troublesome systems.

From where I'm standing I'm the only one with four and not two video samples measured (for both 1080p and 4K), with proper measurements using LAV filters in multi-thread mode and correct DXVA Checker results.
TBH, I'm the only one who even noticed the issue of false reporting the gains between versions, at least using LAV filters in multi-thread mode and as I proved just above, you also confirmed my claims even using dAV1d executables and without wanting to.

I'm not sure what your connection with the dAV1d team is, but you are certainly not doing a good job as their unofficial "lawyer"

I'm still waiting for an answer from you or any other member of dAV1d team regarding that 16% gain of AVX2 and 72% gain of SSSE3 between 0.2.1 and 0.5.1 reported in the release notes, is it for single-thread or multi-thread mode ?

NikosD · 5th November 2019, 14:19

Quote:

Originally Posted by nevcairiel

Comparisons between LAV 0.74.1 and later nightly versions are flawed since the threading strategy changed in FFmpeg, which resulted in 0.74.1 using more frame threads than the later nightlies, making 0.74.1 artificially faster. As such, all your results are invalidated.
This is why you should use as little software as possible to do benchmarking (ie. go as close to the core as possible), as you never know what changes might interfere with your conclusions.

So...It seems that the inconsistency of LAV filters between the threading management of 0.74.1 (0.2.1 dAV1d) and 0.74.1-29 (0.5.1 dAV1d) caused a lot of trouble for benchmarking.

Also, your decision to reject single-thread decoding for 0.74.1 and 0.74.1-29, didn't allow me and still doesn't allow me to test this kind of performance gain (single-thread)

But, as I said before, the end user using a Media Player couldn't care less for single-thread performance/gain.

It's the real-world multi-thread decoding that does matter.

Quote:

Originally Posted by nevcairiel

I've also once again changed the thread distribution in 0.74.1-30 from last night, and while it's going to use more threads again now, similar to the old logic, it's not going to be identical to 0.74.1 in all cases (because I added more tile threads on high core-count CPUs)

OK, let's move on to new benchmarks using multi-thread performance of 0.74.1-30.

1080p

Chimera ~6.6Mbps

Core i5 6500 95/144/285 CPU 92% -0.5.1 (LAV 0.74.1-30)
Core i5 6500 86/134/290 CPU 87% -0.5.1 (LAV 0.74.1-29)
Core i5 6500 77/127/273 CPU 91% -0.2.1 (LAV 0.74.1)

Core2Duo T7600 12/22/103 CPU 87% -0.5.1 (LAV 0.74.1-30)
Core2Duo T7600 10/19/94 CPU 72% -0.5.1 (LAV 0.74.1-29)
Core2Duo T7600 8/17/100 CPU 87% -0.2.1 (LAV 0.74.1)

Dua Lipa ~2.2Mbps

Core i5 6500 135/194/255 CPU 91% -0.5.1 (LAV 0.74.1-30)
Core i5 6500 120/186/251 CPU 87% -0.5.1 (LAV 0.74.1-29)
Core i5 6500 112/186/255 CPU 91% -0.2.1 (LAV 0.74.1)

Core2Duo T7600 11/22/62 CPU 84% -0.5.1 (LAV 0.74.1-30)
Core2Duo T7600 7/18/70 CPU 65% -0.5.1 (LAV 0.74.1-29)
Core2Duo T7600 7/18/69 CPU 84% -0.2.1 (LAV 0.74.1)

4K

Holi Festival ~14Mbps

Core i5 6500 34/43/62 CPU 94% -0.5.1 (LAV 0.74.1-30)
Core i5 6500 34/43/61 CPU 94% -0.5.1 (LAV 0.74.1-29)
Core i5 6500 30/40/60 CPU 95% -0.2.1 (LAV 0.74.1)

Summer Nature ~23Mbps

Core i5 6500 31/42/55 CPU 92% -0.5.1 (LAV 0.74.1-30)
Core i5 6500 32/43/57 CPU 93% -0.5.1 (LAV 0.74.1-29)
Core i5 6500 26/37/50 CPU 91% -0.2.1 (LAV 0.74.1)

Comments:

1) Unfortunately not a lot changed regarding AVX2 optimizations in general.

For 4K clips the decoding performance didn't change at all and there is also a slight regression for Summer Nature

But for 1080p we have a gain of 13% for Chimera and 4% for Dua Lipa comparing 0.2.1 vs 0.5.1, still far away from 16% of expected gain according to release notes.

2) I'm now 100% sure that dAV1d team should be a lot more cautious regarding publicly reported gains of their versions in release notes.

IMO, they should always include real-world multi-thread gains on multiple content and resolutions (at least 1080p and 4K)

3) SSSE3 optimizations give 22% and 29% gain for 0.5.1 vs 0.2.1 on Core2Duo CPU, which of course is far from the optimal 72% but a lot better than with the previous badly configured LAV filters 0.74.1-29.

4) LAV filters 0.74.1-30 and 0.74.1 have the same CPU utilization, so the bug of LAV 0.74.1-29 has been fixed and we can finally compare apples to apples.
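
The percentages in the comments can be recomputed from the table. The sketch below assumes the middle number of each FPS triple is the average FPS (the post does not label the columns), and compares the LAV 0.74.1 rows (dav1d 0.2.1) with the LAV 0.74.1-30 rows (dav1d 0.5.1). The dictionary layout and helper name are illustrative only.

```python
# (clip, CPU) -> (average FPS with 0.2.1, average FPS with 0.5.1)
# Values are the middle figures from the rows above; treating them as
# averages is an assumption.
AVG_FPS = {
    ("Chimera 1080p", "Core i5 6500"): (127, 144),
    ("Chimera 1080p", "Core2Duo T7600"): (17, 22),
    ("Dua Lipa 1080p", "Core i5 6500"): (186, 194),
    ("Dua Lipa 1080p", "Core2Duo T7600"): (18, 22),
    ("Holi Festival 4K", "Core i5 6500"): (40, 43),
    ("Summer Nature 4K", "Core i5 6500"): (37, 42),
}


def gain_percent(old_fps, new_fps):
    """Relative FPS gain of the newer build over the older one."""
    return (new_fps / old_fps - 1.0) * 100.0


GAINS = {key: gain_percent(old, new) for key, (old, new) in AVG_FPS.items()}

for (clip, cpu), gain in GAINS.items():
    print(f"{clip:18} {cpu:15} {gain:5.1f}%")
```

On these figures the Core2Duo gain is roughly 29% for Chimera and 22% for Dua Lipa, and the Core i5 gains for the 1080p clips are roughly 13% and 4%.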

clsid · 5th November 2019, 15:48

You can find the exact benchmark results from Ewout in the individual MRs. There is a link to a spreadsheet with all test results and system spec. Example:
https://code.videolan.org/videolan/d...e_requests/792

SmilingWolf · 5th November 2019, 18:34

Quote:

Originally Posted by NikosD

Using what tools to achieve those figures and in what mode, single-thread or multi-thread ?
You got it all wrong here.
To be more scientifically accurate, allow me to correct you according to the publicly available release notes:

- 0.2.2 :
SSSE3 +10% of 0.2.1
AVX2 +5% of 0.2.1

- 0.3.0 :
SSSE3 +12% of 0.2.2
AVX2 +5% of 0.2.2

- 0.5.0 :
SSSE3 +40% of 0.3.0
AVX2 +(4-7%), for my calculations I take 5% on average of 0.3.0

So, if you do the math correctly we are expecting a gain between 0.2.1 and 0.5.1 versions as follows:

SSSE3 ~72%
AVX2 ~16%

No, once again it's you who got it all wrong: https://code.videolan.org/videolan/d.../0.2.2...0.3.0
A grand total of 4 commits between 0.2.2 and 0.3.0, with a stability fix, some docs updates, and no performance related commits whatsoever.

And if you had bothered to read the resources I linked to, you would have seen the numbers refer to 0.3.0 vs 0.2.1, as shown by the image on JBKempf's blog:

So if YOU do the math correctly, you get:
- 0.3.0: http://www.jbkempf.com/blog/post/201...even-faster%21 - "a gain of 15%-25% on SSSE3 processors; and even a 5% gain on AVX-2 processors"
- 0.5.0: http://www.jbkempf.com/blog/post/201...elease-fastest - "a gain of 22%-40% on SSSE3 processors; and another gain of 4-7% on AVX-2 processors"
So, for SSSE3, max: 75%, min: 40% if you consider the numbers in the TLDR, or 37% if you consider the lowest range given within the 0.3.0 blogpost.
And for AVX2: max: 12,4%, min: 9,2%
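
For reference, the bounds can be reproduced by chaining the low ends and the high ends of the two quoted ranges. This is a sketch of the arithmetic only; the helper names are hypothetical, and the 12% figure is the "more than 12%" lower value mentioned for 0.3.0.

```python
def combine(first_percent, second_percent):
    """Chain two percentage gains multiplicatively."""
    factor = (1 + first_percent / 100) * (1 + second_percent / 100)
    return (factor - 1) * 100


def combined_range(steps):
    """steps is [(low, high), (low, high)]; returns (low, high) overall."""
    (low1, high1), (low2, high2) = steps
    return combine(low1, low2), combine(high1, high2)


# (low, high) gains quoted from the 0.3.0 and 0.5.0 blog posts.
SSSE3 = [(15, 25), (22, 40)]
AVX2 = [(5, 5), (4, 7)]

print(combined_range(SSSE3))  # about (40.3, 75.0)
print(combined_range(AVX2))   # about (9.2, 12.35)
print(combine(12, 22))        # about 36.6, using the 12% lower figure
```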

I have already shown that, with SSSE3, I can get a 29% improvement in "FFmpeg multithread" mode on Dua Lipa, and 38,8% if playing some more extensively with the thread settings.
29% figure: http://forum.doom9.org/showthread.ph...74#post1889274, 0.2.1 SSSE3 = 46,972s, 0.5.1 SSSE3 = 33,041s
38,8% figure: http://forum.doom9.org/showthread.ph...89#post1889289, 0.5.1 SSSE3 with "nonstandard" thread settings: 28,737s
Admittedly close to the low end of the promised speedups, but definitely within the given range.
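
Note that the 29% and 38,8% figures are reductions in decode time, which is not the same number as the increase in decoding speed. A small sketch using the timings quoted above makes the difference visible (the function names are illustrative):

```python
def time_saved_percent(old_seconds, new_seconds):
    """How much shorter the new run is, relative to the old run."""
    return (1.0 - new_seconds / old_seconds) * 100.0


def throughput_gain_percent(old_seconds, new_seconds):
    """How much more work per second the new run does."""
    return (old_seconds / new_seconds - 1.0) * 100.0


OLD = 46.972          # 0.2.1 SSSE3, seconds
NEW_DEFAULT = 33.041  # 0.5.1 SSSE3, seconds
NEW_TUNED = 28.737    # 0.5.1 SSSE3, "nonstandard" thread settings

for label, new in (("default", NEW_DEFAULT), ("tuned", NEW_TUNED)):
    print(label,
          f"time saved {time_saved_percent(OLD, new):.1f}%",
          f"throughput gain {throughput_gain_percent(OLD, new):.1f}%")
```

Expressed as a speed increase, the same timings correspond to roughly 42% and 63%, so which convention a quoted percentage uses matters when comparing it with release-note figures.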

The 4% and 18% figures come from this post: http://forum.doom9.org/showthread.ph...42#post1889442
Your beloved DXVA checker, LAVFilters 0.74.1 vs 0.74.1-29, AVX2, default multithreading, basically same conditions as you:
Chimera average FPS: 139,201 -> 170,234 = 18,2% slowdown when going from the most recent to the older, or 22% speedup when doing the opposite
Dua Lipa average FPS: 248,936 -> 260,815 = 4.6% slowdown when going from the most recent to the older, or 4.8% speedup when doing the opposite
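
The two percentages given for each clip come from the same pair of FPS values, divided by a different baseline. A minimal sketch with the averages quoted above (helper names are hypothetical):

```python
def speedup_percent(old_fps, new_fps):
    """Gain going from the older to the newer build (baseline: old)."""
    return (new_fps / old_fps - 1.0) * 100.0


def slowdown_percent(old_fps, new_fps):
    """Loss going from the newer to the older build (baseline: new)."""
    return (1.0 - old_fps / new_fps) * 100.0


CLIPS = {
    "Chimera": (139.201, 170.234),   # LAV 0.74.1, LAV 0.74.1-29
    "Dua Lipa": (248.936, 260.815),
}

for name, (old_fps, new_fps) in CLIPS.items():
    print(name,
          f"speedup {speedup_percent(old_fps, new_fps):.1f}%",
          f"slowdown {slowdown_percent(old_fps, new_fps):.1f}%")
```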

Moreover, you keep yelling at a whole bunch of clouds: it has been shown that a bunch of different projects have undergone a bunch of changes that make both your and my DXVA Checker measurements completely unreliable to find out about dav1d improvements or lack thereof, yet you insist.

Meanwhile, all explanations (but your own), offers of help and alternative, more reliable solutions have been met with utter hostility. At this point, all resources are exhausted. You're right. dav1d is crap, the developers are incompetent, and you can live in your happy world where you can be mad at something.

OR you could start doing as suggested, and MAYBE we'll find out exactly where the problem lies, and possibly fix it.

NikosD · 6th November 2019, 11:15

Quote:

Originally Posted by SmilingWolf

No, once again it's you who got it all wrong: https://code.videolan.org/videolan/d.../0.2.2...0.3.0
A grand total of 4 commits between 0.2.2 and 0.3.0, with a stability fix, some docs updates, and no performance related commits whatsoever.

And if you had bothered to read the resources I linked to, you would have seen the numbers refer to 0.3.0 vs 0.2.1, as shown by the image on JBKempf's blog

I really like names like Jean-Baptiste or Jesus from Nazareth, but I like more to read the official release notes than specific blogs: https://code.videolan.org/videolan/dav1d/-/releases

So, what do we have here ?

Quote:

0.2.2 brings large improvements in speed on ARM64 and SSSE3 (more than 10% speed increase) and even manages to gain around 5% on the already fast AVX-2 implementation.

10% for SSSE3 and 5% for AVX2 using 0.2.2 compared to previous version aka 0.2.1

Quote:

0.3.0 brings large improvements in speed on ARM64 (15% speedup) and SSSE3 (more than 12% fps increase) and even manages to gain around 5% on the already fast AVX-2 implementation.

Another 12% for SSSE3 and 5% for AVX2 using 0.3.0 compared to previous version aka 0.2.2

Quote:

0.5.0 brings large improvements in speed on SSSE3 CPU (up to 40% speedup), new speed improvements on AVX-2 (for 4-7%) and ARM64 (up to 10%) and ARM32. It introduces some VSX, SSE2 and SSE4 optimizations.

Another 40% for SSSE3 and 4-7% for AVX2 using 0.5.0 compared to previous version aka 0.3.0.

Once again, please do the math.

It's ~72% from 0.2.1 to 0.5.1 regarding SSSE3 optimizations and ~16% for AVX2, according to the official, publicly released notes.

Quote:

Originally Posted by SmilingWolf

I have already shown that, with SSSE3, I can get a 29% improvement in "FFmpeg multithread" mode on Dua Lipa, and 38,8% if playing some more extensively with the thread settings.

Admittedly close to the low end of the promised speedups, but definitely within the given range.

I have updated my previous post regarding benchmarks and I get 22% and 29% for Chimera and Dua Lipa using my Core2Duo, still too far away from 72%

Quote:

Originally Posted by SmilingWolf

Moreover, you keep yelling at a whole bunch of clouds: it has been shown that a bunch of different projects have undergone a bunch of changes that make both your and my DXVA Checker measurements completely unreliable to find out about dav1d improvements or lack thereof, yet you insist.

Meanwhile, all explanations (but your own), offers of help and alternative, more reliable solutions have been met with utter hostility. At this point, all resources are exhausted. You're right. dav1d is crap, the developers are incompetent, and you can live in your happy world where you can be mad at something.

SmilingWolf with a Big Mouth, I could easily add.

Quote:

Originally Posted by clsid

You can find the exact benchmark results from Ewout in the individual MRs. There is a link to a spreadsheet with all test results and system spec. Example:
https://code.videolan.org/videolan/d...e_requests/792

Interesting specs...2 x Xeon with AVX2, DDR4 etc= 2x14cores = 28 cores with hyperthreading for testing SSSE3.

He has an average gain of ~23% which is in the range of my 22% to 29% gain, but I don't understand how the build versions used by him are connected to final versions (0.2.1, 0.2.2 etc)

But his results made me struggle to understand what is really going on with SSSE3 and propose something different.

My first Haswell processor was a Pentium with artificially disabled AVX/AVX2 instructions.

So, I remembered late yesterday night and confirmed with my 2013 (!) benchmark results that my 128bit SIMD (SSEx) benchmarks running on Pentium Haswell, were a lot faster at the same clock than my desktop Core2Duo E7300, unusually faster and not justified by the architecture differences.
It was like running 128bit instructions on 256bit registers and I say that because of the huge difference.

My suggestion:
@Beelzebubu, dAV1d team, x265/x264 fans, @doom9 and everyone else running benchmarks on different SIMD optimizations.

If you want to benchmark specific 128bit SIMD optimizations and your target group is not only Pentiums/ Celerons with disabled AVX/AVX2 sets, but legacy hardware with SSEx only SIMD, then I suggest to run the tests on REAL SSEx-only (128bit only) capable hardware (e.g Core2Duo, Core2Quad or Core iX first generation) and not an emulation like running 128bit SSEx code with artificially disabled 256bit SIMD optimizations, but on a lot faster DDR4 and 256bit register capable CPU like 2 x Xeon (!)

I think you are going to be surprised by the results and these results could explain some performance difference.

Beelzebubu · 6th November 2019, 15:30

Quote:

Originally Posted by NikosD

My suggestion:
@Beelzebubu, dAV1d team [..]

If you want to benchmark specific 128bit SIMD optimizations and your target group is not only Pentiums/ Celerons with disabled AVX/AVX2 sets, but legacy hardware with SSEx only SIMD, then I suggest to run the tests on REAL SSEx-only (128bit only) capable hardware (e.g Core2Duo, Core2Quad or Core iX first generation) and not an emulation like running 128bit SSEx code with artificially disabled 256bit SIMD optimizations, but on a lot faster DDR4 and 256bit register capable CPU like 2 x Xeon (!)

That's a fair request, we can look into doing that.

marcomsousa · 8th November 2019, 13:14

Quote:

Originally Posted by Mr_Khyron

AOMedia Research Symposium 2019 Videos
https://www.youtube.com/playlist?lis...wewtWKpxXky8iI

soresu · 20th November 2019, 17:24

Quote:

Originally Posted by huhn

stadia "works" with pretty much every device. you don't need a chromecast a phone can do it so can a web browser on the PC. there is missing support for iOS and such but what ever.

It would drain battery tout suite, but dav1d 0.5.1 is more than fast enough to decode 1080p60 on any iPhone or iPad from the last 2-3 years, perhaps even 4K (though 4K60 seems doubtful).

Even the less impressive Cortex big cores on Snapdragon could handle 1080p60.

Obviously lacking ASIC decoder support is not ideal, but at least it is some support rather than nothing.

Here's hoping future improvements to the GPU code in dav1d will make decoding even more efficient for pre-ASIC devices than the initial GSoC 2019 efforts.
