Old 13th December 2018, 14:28   #1301  |  Link
utack
Registered User
 
Join Date: Apr 2018
Posts: 63
Quote:
Originally Posted by mzso View Post
Why shouldn't we like tiled encoding?
They make compression efficiency worse. The current implementation splits the frame into equal parts, and most of the time you get a split right in the center of the picture, where most of the action takes place.
dav1d demonstrates pretty well that frame-parallel decoding works, and other encoders have managed to get proper frame-parallel encoding done, so tiling just seems like a lazy solution until libaom gets row_mt running well.
Old 13th December 2018, 16:34   #1302  |  Link
LigH
German doom9/Gleitz SuMo
 
Join Date: Oct 2001
Location: Germany, rural Altmark
Posts: 6,746
New uploads: (MSYS2; MinGW32: GCC 7.4.0 / MinGW64: GCC 8.2.1)

AOM v1.0.0-1030-g7ac3eb1bb
New parameters:
Code:
            --enable-dual-filter=<arg>      Enable dual filter (0: false, 1: true (default))
            --enable-order-hint=<arg>       Enable order hint (0: false, 1: true (default))
            --enable-dist-wtd-comp=<arg>    Enable distance-weighted compound (0: false, 1: true (default))
            --enable-masked-comp=<arg>      Enable masked (wedge/diff-wtd) compound (0: false, 1: true (default))
            --enable-interintra-comp=<arg>  Enable interintra compound (0: false, 1: true (default))
            --enable-diff-wtd-comp=<arg>    Enable difference-weighted compound (0: false, 1: true (default))
            --enable-interinter-wedge=<arg> Enable interinter wedge compound (0: false, 1: true (default))
            --enable-interintra-wedge=<arg> Enable interintra wedge compound (0: false, 1: true (default))
            --enable-global-motion=<arg>    Enable global motion (0: false, 1: true (default))
            --enable-warped-motion=<arg>    Enable local warped motion (0: false, 1: true (default))
            --enable-obmc=<arg>             Enable OBMC (0: false, 1: true (default))
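
For illustration only, a hypothetical aomenc call that switches a couple of these new tools off while leaving the rest at their defaults could look like this (the rate-control and speed settings here are assumed, not part of the release notes):
Code:
# constant-quality encode with global and warped motion disabled
aomenc --cpu-used=4 --end-usage=q --cq-level=30 --enable-global-motion=0 --enable-warped-motion=0 -o test.ivf input.y4m
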
rav1e 0.1.0 (64b9f50 / 2018-12-13)

dav1d 0.1.0 (e5bca59 / 2018-12-13)
__________________

New German Gleitz board
MediaFire: x264 | x265 | VPx | AOM | Xvid
Old 13th December 2018, 16:52   #1303  |  Link
SmilingWolf
I am maddo saientisto!
 
Join Date: Aug 2018
Posts: 95
Quote:
Originally Posted by utack View Post
They make compression efficiency worse.
In x265, WPP hurts efficiency too. Should we stop using it?

The clip used is the F.Y.C one I described some pages ago
Code:
Cmdlines:
x265 --preset veryslow --tune ssim --crf 20 -F 1 --no-wpp -o test.x265.crf20.1F.00WPP.hevc orig.i420.y4m
x265 --preset veryslow --tune ssim --crf 20 -F 1 -o test.x265.crf20.1F.12WPP.hevc orig.i420.y4m

Sizes:
test.x265.crf20.1F.00WPP.hevc: 5566953
test.x265.crf20.1F.12WPP.hevc: 5612446 (+0.81%)

PSNR-HVS-M:
test.x265.crf20.1F.00WPP.hevc: 42.9368
test.x265.crf20.1F.12WPP.hevc: 42.9299 (-0.02%)

MS-SSIM:
test.x265.crf20.1F.00WPP.hevc: 26.3172
test.x265.crf20.1F.12WPP.hevc: 26.3112 (-0.02%)
With libaom the compression efficiency loss is very, very low with an acceptable number of tiles (in this case, 4 on a 720p clip).
I have already measured it: http://forum.doom9.org/showthread.ph...39#post1856939.
That's -0.75% space efficiency with 0.0X% loss in quality. It's even comparable to x265's WPP!

On the other hand, libaom's --frame-parallel=1 exhibits a 6% overhead. Just so that we're clear, libaom's --frame-parallel has got nothing to do with libdav1d's decoding option with a similar name, which doesn't depend on any optional characteristic of the bitstream.
You can already have row-mt WITH tiles, which should work decently. Maybe combine it with chunked encoding for better overall performance.
So again, there's no excuse not to use tiles.
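
For reference, the kind of tiled libaom command line I mean would look roughly like this (settings assumed for illustration, not the exact ones from the linked test):
Code:
# 4 tile columns (log2 = 2) plus row-mt, 8 worker threads, constant-quality mode
aomenc --cpu-used=4 --end-usage=q --cq-level=30 --tile-columns=2 --tile-rows=0 --row-mt=1 --threads=8 -o test.av1.4tiles.ivf orig.i420.y4m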

Last edited by SmilingWolf; 13th December 2018 at 17:10.
Old 13th December 2018, 16:57   #1304  |  Link
marcomsousa
Registered User
 
Join Date: Jul 2018
Posts: 80
Quote:
Originally Posted by Nintendo Maniac 64 View Post
I realize I sound like a broken record at this point, but the newest Pentiums and Celerons still do not support AVX, and this even applies to the models that use the full-fat Sky/Kaby/Coffeelake cores (though with smaller cache size) such as the ever-popular 2c/4t Pentium G4560 and its direct successor the G5400.
dav1d is already optimized for AVX2 (~50% market share).
Now they will begin optimizing for SSSE3 and SSE4.1, which all CPUs have.
They don't know yet whether it will work fine with just these two extensions...


1) Keep in mind that this codec will not be mainstream until there are some HW encoders (6 months to 1 more year before the big companies have a custom HW encoder).

2) If a Celeron can't decode 1080p with dav1d, the player has two options: serve another codec, or serve the same codec at a lower resolution.

If you are big enough, like YouTube, you can serve H.264 to that hardware and save bandwidth with the majority.

Or it's just fine to serve 720p AV1 videos to Celerons, and higher resolutions to everyone else (they shouldn't be too picky).
For sure, 8K video on YouTube will only be served as AV1, just as VP9 is today for >1080p.
__________________
AV1 win64 VS2019 builds
Last build here

Last edited by marcomsousa; 13th December 2018 at 17:08.
Old 13th December 2018, 18:18   #1305  |  Link
Beelzebubu
Registered User
 
Join Date: Feb 2003
Location: New York, NY (USA)
Posts: 109
Quote:
Originally Posted by SmilingWolf View Post
Ronald Bultje commented on frame parallelism being a bad thing for VP9, so not much of a surprise it was turned off by default in AV1/libaom
No, that's a misinterpretation. Frame parallelism is great.

For encoding, the speedup is slightly better than for tile parallelism, and the quality loss per added thread is less than for tile threading. For example, in my experiments, frame multithreading in Eve/VP9 costs 0.0% BDRATE loss for a 1.8x speedup going from 1 to 2 threads, whereas tile threading only gives a 1.7x speedup and has a BDRATE quality loss of around 0.5%. This pattern holds for more threads, and tends to be true across multiple codecs and encoders. Now, obviously, libaom/vpx have no frame-threaded encoding, so not much to be said there. But in x264, my experience from many years ago is that they switched from slice to frame multithreaded encoding for the same reason: better scaling *and* less BDRATE quality loss. So far, so good.

OK, next, decoding. This is trickier. For ffh264, for example, we classically found that frame-multithreaded decoding gives a higher speedup per added thread than slice-multithreaded decoding. Given this pattern of frame multithreading scaling better *and* having less quality loss than within-frame alternatives in a variety of codecs, you'd expect everything to be good, right? Well, not exactly. It holds true, but only to some extent.

The problem in decoding of vp9/av1 is that frames depend on the entropy output of the previous frame. For h264/5, CABAC state resets in each frame, but this is not true for vp9/av1. So, for frame multithreading, you need to split decoding into 2 passes, and pass 1 of the next dependent frame can only start once the previous frame has finished its pass 1 and started its pass 2. So, vp9/av1 *decoding* scales less well than h264 *decoding* when using frame multithreading. Fortunately, the system load doesn't go up either, so really what it means is that you need more threads to fully saturate a system. It's even better if you combine frame and tile threading, like what dav1d does.
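
As a rough sketch of what that combination looks like in practice (thread counts picked arbitrarily here, and the option names are the ones I remember from the dav1d 0.1.0 command-line tool, so double-check against --help):
Code:
# decode an AV1 stream with 4 frame threads and 4 tile threads combined
dav1d --framethreads 4 --tilethreads 4 -i input.ivf -o decoded.y4m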

Wait, you're asking now, what about that statement that frame parallelism is bad in libvpx? Well, it's not what you think it is. --frame-parallel in vpxenc has nothing to do with frame multithreading in the encoder. It's a header bit that removes the entropy dependency I just talked about. The stream then scales better under frame multithreading, which is why this bit is called the "frame parallelism" bit, but it also costs you all backward entropy adaptation, incurring >1% BDRATE quality loss. However, there is no reason to do this. Hardware is not allowed to support higher resolutions with this feature than without it, and there is no software decoder that implements frame multithreading only when entropy dependencies are disabled. And if entropy dependencies are present, you can saturate system load anyway by simply using more threads. So the whole thing is kind of silly. Why give up quality for no gain whatsoever?

Quote:
Originally Posted by mandarinka View Post
I can't tell how correct it is, but this was an interesting read: https://codecs.multimedia.cx/2018/12...cal-about-av1/
Author is a former libav/ffmpeg developer if you don't remember his name.
Kostya Shishkov.
Old 13th December 2018, 18:29   #1306  |  Link
Beelzebubu
Registered User
 
Join Date: Feb 2003
Location: New York, NY (USA)
Posts: 109
Quote:
Originally Posted by benwaggoner View Post
Do we have numbers for the installed base of AVX2 capable PCs? They've been in all new mainstream systems for several years now. I'd guess it's >50% already.
It's around 50%, depending on which statistics you look at. I think some people have already tried to address the dav1d performance metrics, so to summarize:
  • single-threaded, playing 8-bits/component content on >=Haswell (i.e. AVX2=1) will give a 40-80% FPS increase when using dav1d compared to libaom;
  • multi-threaded, when using the right combination of frame and tile threading (or just really large numbers), you can get several times higher FPS using dav1d compared to libaom when playing back 8-bit content on Haswell or newer (i.e. AVX2=1);
  • pre-Haswell (e.g. SSSE3), non-x86 (e.g. NEON), 32-bit (the AVX2 assembly is 64-bit only) and 10-bits/component are not yet done. They will not be faster ATM, and are possibly significantly slower. We're working on it;
  • Firefox has a problem integrating nasm, so their version of dav1d has all assembly disabled ATM.

Last edited by Beelzebubu; 13th December 2018 at 18:31.
Old 13th December 2018, 18:48   #1307  |  Link
benwaggoner
Moderator
 
Join Date: Jan 2006
Location: Portland, OR
Posts: 4,738
Quote:
Originally Posted by Beelzebubu View Post
No, that's a misinterpretation. Frame parallelism is great.

For encoding, the speedup is slightly better than for tile parallelism, and the quality loss per added thread is less than for tile threading. For example, in my experiments, frame multithreading in Eve/VP9 costs 0.0% BDRATE loss for a 1.8x speedup going from 1 to 2 threads, whereas tile threading only gives a 1.7x speedup and has a BDRATE quality loss of around 0.5%. This pattern holds for more threads, and tends to be true across multiple codecs and encoders. Now, obviously, libaom/vpx have no frame-threaded encoding, so not much to be said there. But in x264, my experience from many years ago is that they switched from slice to frame multithreaded encoding for the same reason: better scaling *and* less BDRATE quality loss. So far, so good.
Also, content may not be encoded with slices/tiles, but it almost certainly will be encoded with hierarchically structured reference frames (like an I P B b structure) where the majority of frames aren't reference frames (e.g. all non-ref b-frames can be decoded in parallel as long as their reference frames are already decoded). So a performant decoder needs frame-level parallelism, even if it also has slice/tile-level parallelism.

Quote:
The problem in decoding of vp9/av1 is that frames depend on the entropy output of the previous frame. For h264/5, CABAC state resets in each frame, but this is not true for vp9/av1. So, for frame multithreading, you need to split decoding into 2 passes, and pass 1 of the next dependent frame can only start once the previous frame has finished its pass 1 and started its pass 2. So, vp9/av1 *decoding* scales less well than h264 *decoding* when using frame multithreading. Fortunately, the system load doesn't go up either, so really what it means is that you need more threads to fully saturate a system. It's even better if you combine frame and tile threading, like what dav1d does.
So, decoding will be limited by the serial nature of entropy decoding? Do non-reference frames still update, and thus serialize, the entropy state? If decoding the "bbbb" in an "IbbbbBbbbbP" sequence is serialized, that'll really impact decoder parallelization. But if all the non-ref b-frames inherit the CABAC state of the most recently decoded reference frame, then it'll be a lot easier.
__________________
Ben Waggoner
Principal Video Specialist, Amazon Prime Video

My Compression Book
Old 13th December 2018, 19:50   #1308  |  Link
Beelzebubu
Registered User
 
Join Date: Feb 2003
Location: New York, NY (USA)
Posts: 109
Quote:
Originally Posted by benwaggoner View Post
So, decoding will be limited by the serial nature of entropy decoding? Do non-reference frames still update, and thus serialize, the entropy state? If decoding the "bbbb" in an "IbbbbBbbbbP" sequence is serialized, that'll really impact decoder parallelization. But if all the non-ref b-frames inherit the CABAC state of the most recently decoded reference frame, then it'll be a lot easier.
Frames with a "similar entropy" reference each other, so a high-level P might use the previous P (which is coded 16 frames back) as its entropy reference, and a non-reference inner B frame (which might not be a reference picture at all for pixel purposes) may actually use the previous inner B-frame (which may well be the one directly before this, or usually 2 and sometimes 3 frames back) as its reference. So this certainly influences how well frame-multithreading scales, not in the worst possible way but not ideal either.

And that's why you see weird things where using 256 instead of 128 threads (I think this is 32/16 frame threads x 8 tile threads) on a 32-core machine leads to pretty significant speedups (like this).
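
Spelled out as a command line (same hedge as before: option names as I remember them from the dav1d 0.1.0 CLI), that larger configuration would be roughly:
Code:
# 32 frame threads x 8 tile threads = 256 worker threads on a 32-core machine
dav1d --framethreads 32 --tilethreads 8 -i clip.ivf -o decoded.y4m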
Old 13th December 2018, 20:00   #1309  |  Link
benwaggoner
Moderator
 
Join Date: Jan 2006
Location: Portland, OR
Posts: 4,738
Quote:
Originally Posted by Beelzebubu View Post
Frames with a "similar entropy" reference each other, so a high-level P might use the previous P (which is coded 16 frames back) as its entropy reference, and a non-reference inner B frame (which might not be a reference picture at all for pixel purposes) may actually use the previous inner B-frame (which may well be the one directly before this, or usually 2 and sometimes 3 frames back) as its reference. So this certainly influences how well frame-multithreading scales, not in the worst possible way but not ideal either.

And that's why you see weird things where using 256 instead of 128 threads (I think this is 32/16 frame threads x 8 tile threads) on a 32 core leads to pretty significant speedups (like this).
Great analysis, thanks!

And huh, I can just imagine the tears of people trying to implement low-cost HW decoders for this. I can see how interframe entropy could provide a percent or two of compression efficiency, though.

I would rather have per-frame entropy and no slice requirement if I had a choice.
__________________
Ben Waggoner
Principal Video Specialist, Amazon Prime Video

My Compression Book
Old 13th December 2018, 20:05   #1310  |  Link
Beelzebubu
Registered User
 
Join Date: Feb 2003
Location: New York, NY (USA)
Posts: 109
Quote:
Originally Posted by benwaggoner View Post
Great analysis, thanks!

And huh, I can just imagine the tears of people trying to implement low-cost HW decoders for this. I can see how interframe entropy could provide a percent or two of compression efficiency, though.

I would rather have per-frame entropy and no slice requirement if I had a choice.
TBH, from what I understand from people in the relevant committees, this was proposed for HEVC also. The reason they didn't do it had nothing to do with HW, though, but was simply to keep the VoD and RTC use cases technically more similar. (Entropy dependencies are obviously disabled for RTC use cases.)
Old 13th December 2018, 21:16   #1311  |  Link
benwaggoner
Moderator
 
Join Date: Jan 2006
Location: Portland, OR
Posts: 4,738
Quote:
Originally Posted by Beelzebubu View Post
TBH, from what I understand from people in the relevant committees, this was proposed for HEVC also. The reason they didn't do it had nothing to do with HW, though, but was simply to keep the VoD and RTC use cases technically more similar. (Entropy dependencies are obviously disabled for RTC use cases.)
For RTC you could have backwards entropy states just fine, I think. So IPPPPPP could have each P reference the entropy state of the previous P. Error correction for lost packets would require trickiness. AV1 RTC would have the same issues.

Limiting entropy state references to reference frames/tiles would be a lot more robust, but of reduced value. A bunch of non-ref b-frames referencing the same frames probably have a lot more in common with each other than any of them do with the ref-B/P/I frames they reference...

Random access would also be slowed by interframe entropy coding; it's essentially adding another layer of reference dependencies. Entropy state is cheaper to decode than full frames, but getting to an arbitrary frame in a long GOP could require decoding the entropy state of a lot more frames than it would with a traditional IbBbP structure with per-frame entropy. With 8 b-frames, getting to an arbitrary frame of H.264/HEVC requires decoding only about 1/8th of the frames between the IDR and the target frame. Seems like it could be a lot worse in AV1, if I am understanding correctly.
__________________
Ben Waggoner
Principal Video Specialist, Amazon Prime Video

My Compression Book
Old 13th December 2018, 21:20   #1312  |  Link
Beelzebubu
Registered User
 
Join Date: Feb 2003
Location: New York, NY (USA)
Posts: 109
Quote:
Originally Posted by benwaggoner View Post
Random access would also be slowed by interframe entropy coding; it's essentially adding another layer of reference dependencies. Entropy state is cheaper to decode than full frames, but getting to an arbitrary frame in a long GOP could require decoding the entropy state of a lot more frames than it would with a traditional IbBbP structure with per-frame entropy. With 8 b-frames, getting to an arbitrary frame of H.264/HEVC requires decoding only about 1/8th of the frames between the IDR and the target frame. Seems like it could be a lot worse in AV1, if I am understanding correctly.
Yes, you're correct, random access (seeking) is going to be slower because of this.
Old 14th December 2018, 00:31   #1313  |  Link
mandarinka
Registered User
 
Join Date: Jan 2007
Posts: 729
Quote:
Originally Posted by SmilingWolf View Post
In x265, WPP hurts efficiency too. Should we stop using it?
Why do you think the bestest encoders haven't? Enlightened ones have stopped using frame threading.

Quote:
Originally Posted by benwaggoner View Post
Do we have numbers for the installed base of AVX2 capable PCs? They've been in all new mainstream systems for several years now. I'd guess it's >50% already.
Steam is probably one of the largest datasets available, but it is likely quite skewed. It covers a disproportionate number of gaming computers, but likely almost no HTPCs or office PCs. And all of those are going to watch AV1 video in browsers, even if it is just video ads. So real AVX2 penetration is likely worse than Steam shows, because of the Pentiums/Celerons and the like.
For illustration, look for example at the difference in Windows 10 versus Windows 7 usage shown by general browsing-based statistics sources and by Steam: the former show ~45% for W10, while Steam gives it over 60%.

Last edited by mandarinka; 14th December 2018 at 00:35.
Old 14th December 2018, 08:35   #1314  |  Link
SmilingWolf
I am maddo saientisto!
 
Join Date: Aug 2018
Posts: 95
Quote:
Originally Posted by mandarinka View Post
Why do you think the bestest encoders haven't? Enlightened ones have stopped using frame threading.
I am unsure of the meaning of this.
My point was that there is no reason not to use frame threading, WPP (for x265), or tiling (for libaom) when the overhead is not only so low, but also very similar between the two.
Yet I have never seen WPP get the same amount of flak that tiling gets, especially considering tile threading in libdav1d can contribute up to +108% of the decoding performance on its own: https://docs.google.com/spreadsheets...gid=1238661928
Old 14th December 2018, 12:34   #1315  |  Link
Kurosu
Registered User
 
Join Date: Sep 2002
Location: France
Posts: 432
Tiles will cause a coding efficiency loss, even if it is negligible in the big picture. But tiling is not such a boon either, except for encoders with particular limits, or for software decoders. The same goes for WPP, which is really more of a software decoder thing. Unlike dav1d, your regular HEVC software decoder does not exploit the combined "threadability" of frames and tiles/WPP.

Last edited by Kurosu; 14th December 2018 at 12:39.
Old 14th December 2018, 12:39   #1316  |  Link
nevcairiel
Registered Developer
 
Join Date: Mar 2010
Location: Hamburg/Germany
Posts: 10,336
In the long run, features that allow faster software decoding are really just wasted coding efficiency. When a codec goes mainstream, you'll have a full stack of hardware decoders, which usually don't care that much about these things.
On top of that, if you look at frame threading numbers, the advantage from tile threading shrinks extremely rapidly. Comparing its speed advantage without frame threading gives only a very limited picture.
__________________
LAV Filters - open source ffmpeg based media splitter and decoders

Last edited by nevcairiel; 14th December 2018 at 12:43.
Old 14th December 2018, 13:17   #1317  |  Link
SmilingWolf
I am maddo saientisto!
 
Join Date: Aug 2018
Posts: 95
Quote:
Originally Posted by nevcairiel View Post
In the long run, features that allow faster software decoding are really just wasted coding efficiency. When a codec goes mainstream, you'll have a full stack of hardware decoders, which usually don't care that much about these things.
On top of that, if you look at frame threading numbers, the advantage from tile threading shrinks extremely rapidly. Comparing its speed advantage without frame threading gives only a very limited picture.
True, and true. I don't even have a retort to that.

I still think we can worry about removing tiling from a libaom encoding workflow once hardware decoding goes mainstream and makes 4K decodable even on budget CPUs like v0lt's Pentium G5600, which should be 2-3 years out (?), but I'm OK with the above. Hopefully by then rav1e will have proper psy-RD and frame-parallel encoding too, so we won't have to care about it anyway.

My main gripe in the whole tiling debate is that "inappropriate" encoding settings exclude a lot of low-to-mid tier systems from early adoption (i.e. right about now). In my early tests libdav1d scaled much better on my processor when combined with tiling than when simply increasing the frame threads above a certain threshold. It's hard to justify a 4 MB difference in 1 GB of video when said video can't be decoded in real time at all.
Still, the spreadsheet I quoted makes me think I should run the numbers again for dav1d. It has been a couple of months, after all.

Last edited by SmilingWolf; 14th December 2018 at 13:35.
Old 14th December 2018, 18:37   #1318  |  Link
Mierastor
Registered User
 
Join Date: Nov 2010
Posts: 15
"Intel: AV1 support not yet in Gen11 Graphics, but coming soon after"
https://www.reddit.com/r/AV1/comment..._graphics_but/

Meaning late 2020, if Intel, as usual, introduces its new CPU generations late in the year?

And since these introductions have often been little more than paper launches, will large-scale availability only arrive in 2021?
Old 14th December 2018, 19:45   #1319  |  Link
nevcairiel
Registered Developer
 
Join Date: Mar 2010
Location: Hamburg/Germany
Posts: 10,336
That's about the time frame most here would expect for hardware support. Maybe in 2020, or thereabouts.
__________________
LAV Filters - open source ffmpeg based media splitter and decoders
Old 14th December 2018, 21:36   #1320  |  Link
Nintendo Maniac 64
Registered User
 
Join Date: Nov 2009
Location: Northeast Ohio
Posts: 447
But let's be honest here - with AMD finally being a viable alternative again, who is really buying Intel for its graphics capabilities?
__________________
____HTPC____  | __Desktop PC__
2.93GHz Xeon x3470 (4c/8t Nehalem) | 4.5GHz 1.24v dual-core Haswell G3258
Radeon HD5870  | Intel iGPU      
2x2GB+2x1GB DDR3-1333 | 4x4GB DDR3-1600       