3rd November 2019, 01:04 | #21 | Link | |
Registered User
Join Date: Feb 2003
Location: New York, NY (USA)
Posts: 109
|
Quote:
To be clear, we don't just do command-line interface tests. We test this in end-user applications such as VLC and Chrome/Firefox also, and we see the same performance improvements there that we also see in "dav1d" the commandline tool. |
|
3rd November 2019, 02:59 | #22 | Link | |
Registered User
Join Date: Aug 2010
Location: Athens, Greece
Posts: 2,901
|
Quote:
Firstly, he posted a single-threaded performance difference and I posted a multi-threaded one, besides the obvious difference in implementation. VLC is a popular media player, no doubt about it, but here we mostly prefer other players (MPC-HC / MPC-BE / MPV.NET etc.). I don't think there is any other way to find out what is going on than to reproduce the tests yourself. Is it possible to test the two versions of LAV's implementation I posted above? Also, are the huge performance gains posted in the various dAV1d release notes for single-threaded or multi-threaded performance? Thanks!
__________________
Win 10 x64 (19042.572) - Core i5-2400 - Radeon RX 470 (20.10.1) HEVC decoding benchmarks H.264 DXVA Benchmarks for all |
|
3rd November 2019, 08:18 | #23 | Link | |
Registered User
Join Date: Aug 2010
Location: Athens, Greece
Posts: 2,901
|
Quote:
After my comment you posted multi-threaded results, but not with the same tools and with different thread settings. Anyway, the point here is to understand what's going on, not to play with words or intentions once again. You could try deleting the DXVA Checker config file, then uninstalling and reinstalling everything. I'm still waiting for an answer on whether the publicly reported gains between versions of dAV1d refer to single-threaded or multi-threaded performance. BTW, how do you benchmark dAV1d with the two executables you posted here? There is no internal command in them.
|
3rd November 2019, 18:38 | #24 | Link | |
Registered User
Join Date: Aug 2010
Location: Athens, Greece
Posts: 2,901
|
Quote:
The drop in CPU utilization on Skylake was only 2%, although on Core2Duo the drop was huge. The main issue with dAV1d is the loss of any single-thread gain in real-world multi-thread decoding, for whatever internal reason. In the end, the end user doesn't know and doesn't care why the Dua Lipa video has exactly the same decoding speed with dAV1d 0.2.1 and 0.5.1 on two different CPU architectures and instruction sets (Skylake using AVX2 / Core2Duo using SSSE3). It is we who are still searching for why this is happening and under what circumstances.
|
3rd November 2019, 19:27 | #25 | Link |
I am maddo saientisto!
Join Date: Aug 2018
Posts: 95
|
Oh but you seemed so worried about how much dav1d was using all my cores just one day ago.
But here, have a Chimera run: Code:
LAVFilters 0.74.1-29:
CPU: Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz
GPU: NVIDIA GeForce GTX 1080
Decoder: LAV Video Decoder
Decoder Device: -
Frames: 8929
FPS: 170,234 [103-349]
CPU Usage: -
GPU Usage: 0 [0-1] %
GPU Video Engine Usage: 0 [0-0] %

LAVFilters 0.74.1:
CPU: Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz
GPU: NVIDIA GeForce GTX 1080
Decoder: LAV Video Decoder
Decoder Device: -
Frames: 8929
FPS: 139,201 [77-306]
CPU Usage: -
GPU Usage: 0 [0-1] %
GPU Video Engine Usage: 0 [0-0] %

Code:
LAVFilters 0.74.1-29:
CPU: Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz
GPU: NVIDIA GeForce GTX 1080
Decoder: LAV Video Decoder
Decoder Device: -
Frames: 5615
FPS: 260,815 [183-335]
CPU Usage: -
GPU Usage: 0 [0-1] %
GPU Video Engine Usage: 0 [0-0] %

LAVFilters 0.74.1:
CPU: Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz
GPU: NVIDIA GeForce GTX 1080
Decoder: LAV Video Decoder
Decoder Device: -
Frames: 5615
FPS: 248,936 [137-328]
CPU Usage: -
GPU Usage: 0 [0-0] %
GPU Video Engine Usage: 0 [0-0] %

Last edited by SmilingWolf; 3rd November 2019 at 20:41. |
4th November 2019, 08:55 | #26 | Link | |
Registered User
Join Date: Aug 2010
Location: Athens, Greece
Posts: 2,901
|
Quote:
The very low real-world multi-thread performance gain between versions 0.2.1 and 0.5.1 of the dAV1d decoder, as measured by me using the above systems and tools, compared to the gains advertised and publicly reported by the dAV1d team for the SSSE3 and AVX2 optimizations. All my other comments express my struggle to explain that huge difference by any means. Your results fully confirm mine for the Dua Lipa video, but there is a small light at the end of the tunnel for Chimera (regardless of the name of the video). I think @nevcairiel could explain this better and test LAV Filters' dAV1d implementation.
|
4th November 2019, 13:01 | #27 | Link |
Registered User
Join Date: Aug 2010
Location: Athens, Greece
Posts: 2,901
|
@nevcairiel
@Beelzebubu @SmilingWolf A few more interesting notes regarding LAV Filters. LAV Filters v0.74.1 allows you to set Thread = 1, but it actually shows roughly 50% CPU utilization, which means 2 cores = 2 threads for AV1 (using dAV1d). For all the other codecs it uses only 1 thread, as it should based on the selection. LAV Filters v0.74.1-29 doesn't even allow you to set Thread = 1: if you set it to 1, the decoder doesn't enumerate in DXVA Checker when trying to decode AV1 files, while it can still be used with only 1 thread for all the other codecs. So there is definitely something different about the dAV1d integration in LAV Filters compared to all the other codecs. In LAV Filters 0.74.1-29, setting Thread = 4 gives exactly the same performance as Auto on my Core i5 6500.
4th November 2019, 13:43 | #28 | Link | |
Registered User
Join Date: Dec 2002
Posts: 5,565
|
Quote:
https://github.com/Nevcairiel/LAVFil...codec.cpp#L370 dav1d has 2 thread number settings but LAV only exposes 1 to the user so that's just how it is. I guess nev thinks this is good enough for playback. |
|
4th November 2019, 21:00 | #29 | Link | ||||
I am maddo saientisto!
Join Date: Aug 2018
Posts: 95
|
Quote:
This would imply removing all the fluff, going down to the most basic level and doing tests going up from there:
- use the dav1d util, single threaded, on IVF files to reduce to the minimum the amount of non-concerned code that is executed, like container parsing
-- Does it not show gains? Then you're right, there haven't been improvements
-- Does it show gains? Then you're wrong, look elsewhere
- use the dav1d util, with multiple threads.
-- Does it eat the gains? Then the problem is multithreading overhead.
-- Does it show the same gains, like I have measured? Then the problem is not single vs multithreaded performance
- use ffmpeg+dav1d, always on IVF files, singlethreaded.
-- Does it eat the gains? Then your problem is in the dav1d+ffmpeg integration
-- Does it show gains? Then look elsewhere
- use ffmpeg+dav1d, always on IVF files, multithreaded.
-- Does it eat the gains? Then your problem is in the dav1d+ffmpeg integration and the way multithreading interacts in either or both tools. I had this happen with ffmpeg+libvmaf, where one of the two would simply hang waiting for data that would never come
-- Does it show gains? Then look elsewhere
[I'm not going to write the whole thing again for AV1 files inside MKV containers but, well, if that's all you have left to look at, why not]
-- Did ffmpeg+dav1d integrate well? Then look at LAVFilters
Etc. etc. etc.
Quote:
Quote:
Hell, if I have time I might even build every single tag leading to 0.2.1 to pinpoint the exact release that brought us to today's performance. Quote:
Based on the line of code highlighted by sneaker_ger I'd say having the number of tile threads implicitly set to 2 makes dav1d bail on this line: https://code.videolan.org/videolan/d.../src/lib.c#L84 The problem, however, is not in LAVFilters, but in FFmpeg's formula for frame distribution between the two modes, starting here: https://github.com/FFmpeg/FFmpeg/blo...ibdav1d.c#L137 Just in case, the formula is: frame_threads = threads / tile_threads, with all numbers involved being integers. For 2 tile_threads, this solves to 0 frame_threads, as shown here: https://godbolt.org/z/cz65UY, which is below the minimum of 1 frame thread required by dav1d. This is indeed a bug. The easiest fix would be to cast threads to float before doing the division, as shown here: https://godbolt.org/z/wBLYoo, to avoid having dav1d bail. Threads distribution will still be higher than selected, but at least it'll work. Last edited by SmilingWolf; 4th November 2019 at 21:06. |
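To make the arithmetic in the post above concrete, here is a minimal sketch in Python rather than FFmpeg's actual C code. It assumes the user picked threads = 1 while tile_threads is implicitly 2, which is how the thread reads; the function names are made up for illustration, and the clamped variant is just one possible guard, not the float-cast fix suggested in the post.

```python
def frame_threads_int(threads: int, tile_threads: int) -> int:
    """Integer division, mirroring frame_threads = threads / tile_threads
    with all operands being integers (for non-negative values, Python's //
    gives the same result as C's truncating integer division)."""
    return threads // tile_threads


def frame_threads_clamped(threads: int, tile_threads: int) -> int:
    """Hypothetical guard: never hand the decoder fewer than 1 frame thread."""
    return max(1, threads // tile_threads)


if __name__ == "__main__":
    tile_threads = 2  # assumption: the implicit tile thread count discussed above
    for threads in (1, 2, 4, 8):
        print(
            f"threads={threads} "
            f"int_div={frame_threads_int(threads, tile_threads)} "
            f"clamped={frame_threads_clamped(threads, tile_threads)}"
        )
```

With threads = 1 the plain integer division yields 0, which is the value described above as falling below the minimum of 1 frame thread.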
||||
4th November 2019, 22:23 | #30 | Link | |||
Registered User
Join Date: Aug 2010
Location: Athens, Greece
Posts: 2,901
|
Quote:
And if this whole procedure was so clear to you, why didn't you do it? We have to suggest things that are feasible in the real world, not just crazy detailed procedures. Quote:
Firstly, I have already said that, according to my tests, not a lot has been added to the AVX2 optimizations in 7 months, although if we follow the release notes from 0.2.1 up to 0.5.1 we should see a lot more AVX2 gain than 5%. The second, more important issue is that, according to my tests using LAV Filters on a Core2Duo in multi-thread mode, the SSSE3 optimizations make no difference between 0.2.1 and 0.5.1, which is really bad given the release notes. Quote:
But I'm certainly not here to fix it, as I'm not a developer. Still, the way I understand the bug and the fix you presented, I'm not sure it's going to recover the multi-thread "loss", or whatever other reason makes 0.2.1 so close to 0.5.1 using LAV for both AVX2 and SSSE3 according to my tests. So, are we still looking for answers, or is the case closed after fixing the bug?
|||
4th November 2019, 22:43 | #31 | Link | |
I am maddo saientisto!
Join Date: Aug 2018
Posts: 95
|
Well that procedure is the only certain way to find the source of the slowdown. It shouldn't take more than one afternoon to run those tests, especially with some scripting and logging thrown in the mix.
And the reason I didn't follow my own procedure is that I can't reproduce your results, and have nothing to diagnose. I'm seeing between 4% (Dua Lipa) and 18% (Chimera) improvements in AVX2, and above 30% in SSSE3. That's far above anything you're seeing on your computers, and more or less in line with what was announced:
- 0.3.0: http://www.jbkempf.com/blog/post/201...even-faster%21 - "a gain of 15%-25% on SSSE3 processors; and even a 5% gain on AVX-2 processors"
- 0.5.0: http://www.jbkempf.com/blog/post/201...elease-fastest - "a gain of 22%-40% on SSSE3 processors; and another gain of 4-7% on AVX-2 processors"
- 0.5.1: http://www.jbkempf.com/blog/post/2019/dav1d-0.5.1 - posted for completeness' sake only, there's no mention of SSSE3 or AVX2 speedups
Is there a specific figure you were expecting? Quote:
From where I'm standing, the problem is that you are the only one with access to those troublesome systems. If you want me to help by compiling different versions dav1d or ffmpeg for Windows, I'm game, but that's as far as I can go from here. Last edited by SmilingWolf; 4th November 2019 at 23:12. |
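The "scripting and logging" mentioned in this post could look roughly like the sketch below. It is only a generic timing harness under stated assumptions: the command to benchmark is left as a placeholder (the Python interpreter doing nothing), because the actual dav1d or ffmpeg invocations and their flags are not given in the thread and are deliberately not guessed here. The helper names time_command and summarize are made up for illustration.

```python
import statistics
import subprocess
import sys
import time


def time_command(cmd, runs=3):
    """Run cmd `runs` times and return the wall-clock durations in seconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        # Output is discarded so console printing does not skew the timing.
        subprocess.run(
            cmd,
            check=True,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        durations.append(time.perf_counter() - start)
    return durations


def summarize(label, durations):
    """Format one log line using the median, which is less noisy than the mean."""
    median = statistics.median(durations)
    return f"{label}: median {median:.3f}s over {len(durations)} runs"


if __name__ == "__main__":
    # Placeholder command: replace with the real decoder invocation for each
    # step of the procedure (tool, thread count, input file).
    placeholder = [sys.executable, "-c", "pass"]
    print(summarize("placeholder", time_command(placeholder)))
```

Running the same harness once per decoder build and thread setting, and keeping the printed lines in a log file, gives directly comparable numbers for each rung of the test ladder.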
|
4th November 2019, 22:46 | #32 | Link | |
Registered Developer
Join Date: Mar 2010
Location: Hamburg/Germany
Posts: 10,348
|
Quote:
__________________
LAV Filters - open source ffmpeg based media splitter and decoders |
|
5th November 2019, 09:17 | #33 | Link | |||
Registered User
Join Date: Aug 2010
Location: Athens, Greece
Posts: 2,901
|
Quote:
Quote:
To be more scientifically accurate, allow me to correct you according to the publicly available release notes:
- 0.2.2 : SSSE3 +10% over 0.2.1, AVX2 +5% over 0.2.1
- 0.3.0 : SSSE3 +12% over 0.2.2, AVX2 +5% over 0.2.2
- 0.5.0 : SSSE3 +40% over 0.3.0, AVX2 +(4-7%) over 0.3.0; for my calculations I take 5% as the average
So, if you do the math correctly, we are expecting the following gains between versions 0.2.1 and 0.5.1:
SSSE3 ~72%
AVX2 ~16%
Even your troublesome calculations, in which you mixed single-thread mode with multi-thread mode, dAV1d executables with a lower than expected number of threads, and LAV Filters without managing to run DXVA Checker properly, couldn't reach those figures. Quote:
TBH, I'm the only one who even noticed the issue of falsely reported gains between versions, at least using LAV Filters in multi-thread mode, and as I proved just above, you also confirmed my claims, even using dAV1d executables and without wanting to. I'm not sure what your connection with the dAV1d team is, but you are certainly not doing a good job as their unofficial "lawyer". I'm still waiting for an answer from you or any other member of the dAV1d team: is that 16% AVX2 gain and 72% SSSE3 gain between 0.2.1 and 0.5.1, as reported in the release notes, for single-thread or multi-thread mode?
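For readers checking the ~72% and ~16% figures: they follow from multiplying the per-release gains rather than adding them. A minimal sketch, taking the per-release percentages exactly as listed in this post (whether the release notes really stack this way is a separate question); compound_gain is a made-up helper name.

```python
from math import prod


def compound_gain(step_gains_percent):
    """Combine successive percentage gains multiplicatively.

    Each step is relative to the previous release, so the factors multiply:
    total = (1 + g1) * (1 + g2) * ... - 1.
    """
    factor = prod(1 + g / 100 for g in step_gains_percent)
    return (factor - 1) * 100


if __name__ == "__main__":
    ssse3_steps = [10, 12, 40]  # per-release SSSE3 gains as listed in the post
    avx2_steps = [5, 5, 5]      # per-release AVX2 gains, using 5% for the 4-7% range
    print(f"SSSE3 compounded: {compound_gain(ssse3_steps):.1f}%")
    print(f"AVX2 compounded:  {compound_gain(avx2_steps):.1f}%")
    print(f"SSSE3 naive sum:  {sum(ssse3_steps)}%")
```

The compounded SSSE3 total is noticeably larger than the naive sum of 62%, which is why the two ways of reading the release notes give different expectations.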
|||
5th November 2019, 14:19 | #34 | Link | ||
Registered User
Join Date: Aug 2010
Location: Athens, Greece
Posts: 2,901
|
Quote:
Also, your decision to reject single-thread decoding for 0.74.1 and 0.74.1-29 didn't allow me, and still doesn't allow me, to test this kind of performance gain (single-thread). But, as I said before, the end user using a media player couldn't care less about single-thread performance or gain. It's the real-world multi-thread decoding that matters. Quote:

1080p

Chimera ~6.6Mbps
Core i5 6500 95/144/285 CPU 92% -0.5.1 (LAV 0.74.1-30)
Core i5 6500 86/134/290 CPU 87% -0.5.1 (LAV 0.74.1-29)
Core i5 6500 77/127/273 CPU 91% -0.2.1 (LAV 0.74.1)
Core2Duo T7600 12/22/103 CPU 87% -0.5.1 (LAV 0.74.1-30)
Core2Duo T7600 10/19/94 CPU 72% -0.5.1 (LAV 0.74.1-29)
Core2Duo T7600 8/17/100 CPU 87% -0.2.1 (LAV 0.74.1)

Dua Lipa ~2.2Mbps
Core i5 6500 135/194/255 CPU 91% -0.5.1 (LAV 0.74.1-30)
Core i5 6500 120/186/251 CPU 87% -0.5.1 (LAV 0.74.1-29)
Core i5 6500 112/186/255 CPU 91% -0.2.1 (LAV 0.74.1)
Core2Duo T7600 11/22/62 CPU 84% -0.5.1 (LAV 0.74.1-30)
Core2Duo T7600 7/18/70 CPU 65% -0.5.1 (LAV 0.74.1-29)
Core2Duo T7600 7/18/69 CPU 84% -0.2.1 (LAV 0.74.1)

4K

Holi Festival ~14Mbps
Core i5 6500 34/43/62 CPU 94% -0.5.1 (LAV 0.74.1-30)
Core i5 6500 34/43/61 CPU 94% -0.5.1 (LAV 0.74.1-29)
Core i5 6500 30/40/60 CPU 95% -0.2.1 (LAV 0.74.1)

Summer Nature ~23Mbps
Core i5 6500 31/42/55 CPU 92% -0.5.1 (LAV 0.74.1-30)
Core i5 6500 32/43/57 CPU 93% -0.5.1 (LAV 0.74.1-29)
Core i5 6500 26/37/50 CPU 91% -0.2.1 (LAV 0.74.1)

Comments:
1) Unfortunately not a lot changed regarding AVX2 optimizations in general. For 4K clips the decoding performance didn't change at all, and there is also a slight regression for Summer Nature. But for 1080p we have a gain of 13% for Chimera and 4% for Dua Lipa comparing 0.2.1 vs 0.5.1, still far from the 16% gain expected according to the release notes.
2) I'm now 100% sure that the dAV1d team should be a lot more cautious about the gains they publicly report in release notes. IMO, they should always include real-world multi-thread gains on multiple content types and resolutions (at least 1080p and 4K).
3) SSSE3 optimizations give 22% and 29% gain for 0.5.1 vs 0.2.1 on the Core2Duo CPU, which of course is far from the optimal 72% but a lot better than with the previous, badly configured LAV Filters 0.74.1-29.
4) LAV filters 0.74.1-30 and 0.74.1 have the same CPU utilization, so the bug of LAV 0.74.1-29 has been fixed and we can finally compare apples to apples.
Last edited by NikosD; 5th November 2019 at 20:44. |
||
5th November 2019, 15:48 | #35 | Link |
*****
Join Date: Feb 2005
Posts: 5,647
|
You can find the exact benchmark results from Ewout in the individual MRs. There is a link to a spreadsheet with all test results and system spec. Example:
https://code.videolan.org/videolan/d...e_requests/792 |
5th November 2019, 18:34 | #36 | Link | |
I am maddo saientisto!
Join Date: Aug 2018
Posts: 95
|
Quote:
A grand total of 4 commits between 0.2.2 and 0.3.0, with a stability fix, some docs updates, and no performance related commits whatsoever. And if you had bothered to read the resources I linked to, you would have seen the numbers refer to 0.3.0 vs 0.2.1, as shown by the image on JBKempf's blog.
So if YOU do the math correctly, you get:
- 0.3.0: http://www.jbkempf.com/blog/post/201...even-faster%21 - "a gain of 15%-25% on SSSE3 processors; and even a 5% gain on AVX-2 processors"
- 0.5.0: http://www.jbkempf.com/blog/post/201...elease-fastest - "a gain of 22%-40% on SSSE3 processors; and another gain of 4-7% on AVX-2 processors"
So, for SSSE3, max: 75%, min: 40% if you consider the numbers in the TLDR, or 37% if you consider the lowest range given within the 0.3.0 blogpost. And for AVX2: max: 12,4%, min: 9,2%.
I have already shown that, with SSSE3, I can get a 29% improvement in "FFmpeg multithread" mode on Dua Lipa, and 38,8% if playing some more extensively with the thread settings.
29% figure: http://forum.doom9.org/showthread.ph...74#post1889274, 0.2.1 SSSE3 = 46,972s, 0.5.1 SSSE3 = 33,041s
38,8% figure: http://forum.doom9.org/showthread.ph...89#post1889289, 0.5.1 SSSE3 with "nonstandard" thread settings: 28,737s
Admittedly close to the low end of the promised speedups, but definitely within the given range.
The 4% and 18% figures come from this post: http://forum.doom9.org/showthread.ph...42#post1889442
Your beloved DXVA checker, LAVFilters 0.74.1 vs 0.74.1-29, AVX2, default multithreading, basically same conditions as you:
Chimera average FPS: 139,201 -> 170,234 = 18,2% slowdown when going from the most recent to the older, or 22% speedup when doing the opposite
Dua Lipa average FPS: 248,936 -> 260,815 = 4.6% slowdown when going from the most recent to the older, or 4.8% speedup when doing the opposite

Moreover, you keep yelling at a whole bunch of clouds: it has been shown that a bunch of different projects have undergone a bunch of changes that make both your and my DXVA Checker measurements completely unreliable to find out about dav1d improvements or lack thereof, yet you insist. Meanwhile, all explanations (but your own), offers of help and alternative, more reliable solutions have been met with utter hostility.

At this point, all resources are exhausted. You're right. dav1d is crap, the developers are incompetent, and you can live in your happy world where you can be mad at something. OR you could start doing as suggested, and MAYBE we'll find out exactly where the problem lies, and possibly fix it.
Last edited by SmilingWolf; 5th November 2019 at 19:03. |
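Since the same pair of measurements is quoted both as a "slowdown" and as a "speedup", a short sketch of the arithmetic may help. It uses the average FPS values and decode times quoted in this thread (written with a decimal point instead of the decimal comma); the function names are made up for illustration.

```python
def speedup_percent(old_fps, new_fps):
    """Gain of the newer build, relative to the older one."""
    return (new_fps / old_fps - 1) * 100


def slowdown_percent(old_fps, new_fps):
    """Loss when going back to the older build, relative to the newer one."""
    return (1 - old_fps / new_fps) * 100


def time_reduction_percent(old_seconds, new_seconds):
    """Reduction in decode time, relative to the older (longer) time."""
    return (1 - new_seconds / old_seconds) * 100


if __name__ == "__main__":
    # Average FPS, LAVFilters 0.74.1 -> 0.74.1-29, as quoted in the thread.
    for name, old, new in (
        ("Chimera", 139.201, 170.234),
        ("Dua Lipa", 248.936, 260.815),
    ):
        print(
            f"{name}: speedup {speedup_percent(old, new):.1f}%, "
            f"slowdown {slowdown_percent(old, new):.1f}%"
        )
    # SSSE3 decode times in seconds for 0.2.1 and 0.5.1, as quoted in the thread.
    print(f"time reduction: {time_reduction_percent(46.972, 33.041):.1f}%")
```

The two percentages differ only in which build is taken as the baseline, so a comparison is consistent as long as both sides state the same baseline.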
|
6th November 2019, 11:15 | #37 | Link | |||||||
Registered User
Join Date: Aug 2010
Location: Athens, Greece
Posts: 2,901
|
Quote:
So, what do we have here ? Quote:
Quote:
Quote:
Once again, please do the math. It's ~72% from 0.2.1 to 0.5.1 regarding SSSE3 optimizations and ~16% for AVX2, according to the official, publicly released notes. Quote:
Quote:
Quote:
He has an average gain of ~23%, which is in the range of my 22% to 29% gain, but I don't understand how the build versions he used map to the final versions (0.2.1, 0.2.2 etc.). Still, his results made me try harder to understand what is really going on with SSSE3 and propose something different.

My first Haswell processor was a Pentium with artificially disabled AVX/AVX2 instructions. Late last night I remembered, and confirmed with my 2013 (!) benchmark results, that my 128-bit SIMD (SSEx) benchmarks running on the Haswell Pentium were a lot faster at the same clock than on my desktop Core2Duo E7300: unusually faster, and not justified by the architecture differences. It was like running 128-bit instructions on 256-bit registers, and I say that because of the huge difference.

My suggestion: @Beelzebubu, dAV1d team, x265/x264 fans, @doom9 and everyone else running benchmarks on different SIMD optimizations. If you want to benchmark specific 128-bit SIMD optimizations, and your target group is not only Pentiums/Celerons with disabled AVX/AVX2 sets but also legacy hardware with SSEx-only SIMD, then I suggest running the tests on REAL SSEx-only (128-bit only) hardware (e.g. Core2Duo, Core2Quad or first-generation Core iX) and not on an emulation, i.e. running 128-bit SSEx code with artificially disabled 256-bit SIMD optimizations on a CPU with much faster DDR4 and 256-bit registers, like 2 x Xeon (!). I think you are going to be surprised by the results, and these results could explain some of the performance difference.
|||||||
6th November 2019, 15:30 | #38 | Link | |
Registered User
Join Date: Feb 2003
Location: New York, NY (USA)
Posts: 109
|
Quote:
|
|
20th November 2019, 17:24 | #40 | Link | |
Registered User
Join Date: May 2005
Location: Swansea, Wales, UK
Posts: 196
|
Quote:
Even the less impressive Cortex big cores on Snapdragon could handle 1080p60. Obviously lacking ASIC decoder support is not ideal, but at least it is some support rather than nothing. Here's hoping future improvements to the GPU code in dav1d will make decoding even more efficient for pre-ASIC devices than the initial GSoC 2019 efforts. |
|
|
|