Alliance for Open Media codecs - Page 66

utack · 13th December 2018, 14:28

Quote:

Originally Posted by mzso

Why shouldn't we like tiled encoding?

They make compression efficiency worse. The current implementation splits the frame into equal parts, and most of the time you get a split right in the center of the picture, where most of the action takes place.
dav1d demonstrates pretty well that frame-parallel decoding works fairly well, and other encoders have managed to get perfect frame-parallel encoding done, so tiles just seem like a lazy solution until libaom gets row_mt running well.

LigH · 13th December 2018, 16:34

New uploads: (MSYS2; MinGW32: GCC 7.4.0 / MinGW64: GCC 8.2.1)

AOM v1.0.0-1030-g7ac3eb1bb
New parameters:

Code:

            --enable-dual-filter=<arg> 	Enable dual filter (0: false, 1: true (default))
            --enable-order-hint=<arg>  	Enable order hint (0: false, 1: true (default))
            --enable-dist-wtd-comp=<arg	Enable distance-weighted compound (0: false, 1: true (default))
            --enable-masked-comp=<arg> 	Enable masked (wedge/diff-wtd) compound (0: false, 1: true (default))
            --enable-interintra-comp=<a	Enable interintra compound (0: false, 1: true (default))
            --enable-diff-wtd-comp=<arg	Enable difference-weighted compound (0: false, 1: true (default))
            --enable-interinter-wedge=<	Enable interinter wedge compound (0: false, 1: true (default))
            --enable-interintra-wedge=<	Enable interintra wedge compound (0: false, 1: true (default))
            --enable-global-motion=<arg	Enable global motion (0: false, 1: true (default))
            --enable-warped-motion=<arg	Enable local warped motion (0: false, 1: true (default))
            --enable-obmc=<arg>        	Enable OBMC (0: false, 1: true (default))

rav1e 0.1.0 (64b9f50 / 2018-12-13)

dav1d 0.1.0 (e5bca59 / 2018-12-13)

SmilingWolf · 13th December 2018, 16:52

Quote:

Originally Posted by utack

They make compression efficiency worse.

In x265, WPP hurts efficiency too. Should we stop using it?

The clip used is the F.Y.C one I described some pages ago

Code:

Cmdlines:
x265 --preset veryslow --tune ssim --crf 20 -F 1 --no-wpp -o test.x265.crf20.1F.00WPP.hevc orig.i420.y4m
x265 --preset veryslow --tune ssim --crf 20 -F 1 -o test.x265.crf20.1F.12WPP.hevc orig.i420.y4m

Sizes:
test.x265.crf20.1F.00WPP.hevc: 5566953
test.x265.crf20.1F.12WPP.hevc: 5612446 (+0.81%)

PSNR-HVS-M:
test.x265.crf20.1F.00WPP.hevc: 42.9368
test.x265.crf20.1F.12WPP.hevc: 42.9299 (-0.02%)

MS-SSIM:
test.x265.crf20.1F.00WPP.hevc: 26.3172
test.x265.crf20.1F.12WPP.hevc: 26.3112 (-0.02%)

With libaom the compression efficiency loss is very very low with an acceptable amount of tiles (in this case, 4 on a 720p clip).
I have already measured it: http://forum.doom9.org/showthread.ph...39#post1856939.
That's -0.75% space efficiency with 0.0X% loss in quality. It's even comparable to x265's WPP!
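
For anyone who wants to redo the arithmetic on the x265 figures above, here is a minimal Python sketch. The helper name `percent_change` is made up for illustration; the inputs are just the sizes and scores listed in the code block. Note that the size overhead comes out at roughly +0.82% when rounded to two decimals, so the +0.81% quoted above looks truncated rather than rounded.

```python
def percent_change(baseline, value):
    """Relative change of `value` versus `baseline`, in percent."""
    return (value - baseline) / baseline * 100.0

# Figures copied from the x265 comparison above (no WPP = baseline).
size_overhead = percent_change(5566953, 5612446)      # bytes
psnr_hvs_m_delta = percent_change(42.9368, 42.9299)   # score
ms_ssim_delta = percent_change(26.3172, 26.3112)      # score

print(f"size:       {size_overhead:+.2f}%")
print(f"PSNR-HVS-M: {psnr_hvs_m_delta:+.2f}%")
print(f"MS-SSIM:    {ms_ssim_delta:+.2f}%")
```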

On the other hand, libaom's --frame-parallel=1 exhibits a 6% overhead. Just so that we're clear, libaom's --frame-parallel has got nothing to do with libdav1d's decoding option with the similar name, which doesn't depend on any optional characteristic of the bitstream.
You can already have row-mt WITH tiles which should work decently. Maybe combine it with chunked encoding for better overall performance.
So again, no excuses to not use tiles.

marcomsousa · 13th December 2018, 16:57

Quote:

Originally Posted by Nintendo Maniac 64

I realize I sound like a broken record at this point, but the newest Pentiums and Celerons still do not support AVX, and this even applies to the models that use the full-fat Sky/Kaby/Coffeelake cores (though with smaller cache size) such as the ever-popular 2c/4t Pentium G4560 and its direct successor the G5400.

dav1d is already optimized for AVX2 (~50% market share).
Now they will begin optimizing for SSSE3 and SSE4.1, which all CPUs have.
They don't know yet whether it will work fine with just these two extensions...

1) Keep in mind that this codec will not be mainstream until there are some HW encoders (6 months to 1 more year for the big companies to have a custom HW encoder).

2) If a Celeron can't decode 1080p with dav1d, the player has two options: serve another codec, or serve the same codec at a lower resolution.

If you are big enough, like YouTube, you can serve H.264 to that hardware and still save bandwidth with the majority.

Or it's just fine to serve 720p AV1 video to Celerons, and higher resolutions to the others (they shouldn't be too picky).
For sure, 8K video will only be served with AV1 on YouTube, just as VP9 is used for >1080p today.

Beelzebubu · 13th December 2018, 18:18

Quote:

Originally Posted by SmilingWolf

Ronald Bultje commented on frame parallelism being a bad thing for VP9, so not much of a surprise it was turned off by default in AV1/libaom

No, that's a misinterpretation. Frame parallelism is great.

For encoding, the speedup is slightly better than for tile parallelism, and the quality loss per added thread is less than for tile threading. For example, in my experiments, frame-multithreading in Eve/VP9 costs 0.0% BDRATE loss for a 1.8x speedup going from 1 to 2 threads, but tile threading only gives a 1.7x speedup and has a BDRATE quality loss of around 0.5%. This pattern holds for more threads, and tends to be true across multiple codecs and encoders. Now, obviously, libaom/vpx have no frame-threaded encoding, so not much to be said there. But in x264, my experience from many years ago is that they switched from slice to frame multi-threaded encoding for the same reason: better scaling *and* less BDRATE quality loss. So far, so good.

OK, next, decoding. This is trickier. For ffh264, for example, we classically found that frame-multithreaded decoding gives a higher speedup than slice-multi-threaded decoding per added thread. Given this pattern of frame multithreading scaling better *and* having less quality loss than within-frame alternatives in a variety of codecs, you'd expect everything to be good, right? Well, not exactly. It holds true, but only to some extent.

The problem in decoding of vp9/av1 is that frames depend on the entropy output of the previous frame. For h264/5, cabac state resets in each frame, but this is not true for vp9/av1. So, for frame-multithreading, you need to split decoding into 2 passes, and pass 1 of the next dependent frame can only start when the previous frame has finished its pass 1 and started its pass 2. So, vp9/av1 *decoding* scales less well than h264 *decoding* when using frame multithreading. Fortunately, the system load doesn't go up either, so really what it means is that you need more threads to fully saturate a system. It's even better if you combine frame and tile threading, like what dav1d does.
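
As a rough illustration of the two-pass constraint described above, here is a toy scheduling model in Python. It is only a sketch under stated assumptions: every frame takes its entropy state from the immediately preceding frame (the worst case), pass 1 and pass 2 have fixed, made-up costs, pixel reference dependencies are ignored, and `frame_threaded_time` is a hypothetical helper, not anything from dav1d or libvpx.

```python
def frame_threaded_time(num_frames, pass1, pass2, threads):
    """Toy model of frame-threaded decoding with a serial pass 1.

    Pass 1 (entropy decoding) of frame n+1 may only start once pass 1 of
    frame n has finished; pass 2 (reconstruction) then runs independently.
    At most `threads` frames are in flight at once.
    Returns the time at which the last frame is fully decoded.
    """
    finish = []            # completion time of each frame
    prev_pass1_done = 0.0  # when the previous frame finished pass 1
    for n in range(num_frames):
        # A worker slot frees up when the frame `threads` positions back is done.
        slot_free = finish[n - threads] if n >= threads else 0.0
        start = max(prev_pass1_done, slot_free)
        prev_pass1_done = start + pass1
        finish.append(prev_pass1_done + pass2)
    return finish[-1]

# Hypothetical costs: pass 2 is three times as expensive as pass 1.
serial = frame_threaded_time(100, 1.0, 3.0, threads=1)
for t in (1, 2, 4, 8):
    total = frame_threaded_time(100, 1.0, 3.0, threads=t)
    print(f"{t} frame threads: time {total:.0f}, speedup {serial / total:.2f}x")
```

With these made-up costs the speedup in this model levels off just under (pass1 + pass2) / pass1 = 4x, no matter how many frame threads are added, because pass 1 forms a serial chain. Real streams do not always use the immediately preceding frame as the entropy reference, so actual scaling can be better than this worst case.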

Wait, you're asking now, what about that statement that frame parallelism is bad in libvpx? Well, it's not what you think it is. --frame-parallel in vpxenc has nothing to do with frame multi-threading in the encoder. It's a header bit that removes the entropy dependency I just talked about. So now, it scales better when using frame multi-threading, which is why this bit is called the "frame parallelism" bit, but it also costs you all backwards entropy, incurring >1% BDRATE quality loss. However, there is no reason to do this. Hardware is not allowed to support higher resolutions with vs. without this feature, and there is no software decoder that implements frame multithreading with but not without entropy dependencies disabled. And if entropy dependencies are present, you can saturate system load anyway by simply using more threads. So the whole thing is kind of silly. Why give up quality for no gain whatsoever?

Quote:

Originally Posted by mandarinka

I can't tell how correct it is, but this was an interesting read: https://codecs.multimedia.cx/2018/12...cal-about-av1/
Author is a former libav/ffmpeg developer if you don't remember his name.

Kostya Shishkov.

Beelzebubu · 13th December 2018, 18:29

Quote:

Originally Posted by benwaggoner

Do we have numbers for the installed base of AVX2 capable PCs? They've been in all new mainstream systems for several years now. I'd guess it's >50% already.

It's around 50%, depending on what statistics you look at. So, I think some people have already tried to address the dav1d performance metrics, so to summarize:

single-threaded, playing 8-bits/component content on >=Haswell (i.e. AVX2=1) will give a 40-80% FPS increase when using dav1d compared to libaom;
multi-threaded, when using the right combinations of frame and tile threading (or just really large numbers) you can get several times higher FPS using dav1d compared to libaom when playing back 8-bit content on Haswell or newer (i.e. AVX2=1);
pre-Haswell (e.g. SSSE3), non-x86 (e.g. Neon), 32bit (the AVX2 assembly is 64-bit only) and 10-bits/component are not yet done. They will not be faster ATM, and possibly significantly slower. We're working on it;
Firefox has a problem integrating nasm so their version of dav1d has all assembly disabled ATM.

benwaggoner · 13th December 2018, 18:48

Quote:

Originally Posted by Beelzebubu

No, that's a misinterpretation. Frame parallelism is great.

For encoding, the speedup is slightly better than for tile parallelism, and the quality loss per added thread is less than for tile threading. For example, in my experiments, frame-multithreading in Eve/VP9 costs 0.0% BDRATE loss for a 1.8x speedup going from 1 to 2 threads, but tile threading only gives a 1.7x speedup and has a BDRATE quality loss of around 0.5%. This pattern holds for more threads, and tends to be true across multiple codecs and encoders. Now, obviously, libaom/vpx have no frame-threaded encoding, so not much to be said there. But in x264, my experience from many years ago is that they switched from slice to frame multi-threaded encoding for the same reason: better scaling *and* less BDRATE quality loss. So far, so good.

Also, content may not be encoded with slices/tiles, but it almost certainly will be encoded with hierarchically structured reference frames (like an I P B b structure) where the majority of frames aren't reference frames (e.g. all non-ref b-frames can be decoded in parallel as long as their reference frames are already decoded). So a performant decoder needs to have frame-level parallelism, even if it also has slice/tile-level parallelism.
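
To make the "decode in parallel once the references are done" point concrete, here is a small Python sketch. The mini-GOP, its frame names and its reference lists are all hypothetical, and `decode_waves` is a made-up helper. It only looks at pixel references and ignores any entropy dependencies between frames.

```python
def decode_waves(refs):
    """Group frames into waves; all frames in one wave can decode concurrently.

    `refs` maps each frame name to the list of frames it references.
    A frame's wave is one past the latest wave among its references.
    """
    level = {}

    def depth(frame):
        if frame not in level:
            level[frame] = 1 + max((depth(r) for r in refs[frame]), default=0)
        return level[frame]

    waves = {}
    for frame in refs:
        waves.setdefault(depth(frame), []).append(frame)
    return [waves[d] for d in sorted(waves)]

# Hypothetical mini-GOP: lowercase b = non-reference frame.
gop = {
    "I0": [],
    "P4": ["I0"],
    "B2": ["I0", "P4"],
    "b1": ["I0", "B2"],
    "b3": ["B2", "P4"],
}
waves = decode_waves(gop)
for i, wave in enumerate(waves, start=1):
    print(f"wave {i}: {wave}")
```

In this example the two non-reference b-frames end up in the same wave, so a frame-threaded decoder could work on both at once.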

Quote:

The problem in decoding of vp9/av1 is that frames depend on the entropy output of the previous frame. For h264/5, cabac state resets in each frame, but this is not true for vp9/av1. So, for frame-multithreading, you need to split decoding into 2 passes, and pass 1 of the next dependent frame can only start when the previous frame has finished its pass 1 and started its pass 2. So, vp9/av1 *decoding* scales less well than h264 *decoding* when using frame multithreading. Fortunately, the system load doesn't go up either, so really what it means is that you need more threads to fully saturate a system. It's even better if you combine frame and tile threading, like what dav1d does.

So, decoding will be limited by serial entropy decoding? Do non-reference frames still update and thus serialize the entropy state? If decoding the "bbbb" in an "IbbbbBbbbbP" sequence is serialized, that'll really impact decoder parallelization. But if all the non-ref b-frames inherit the CABAC state of the most recently decoded reference frame, then it'll be a lot easier.

Beelzebubu · 13th December 2018, 19:50

Quote:

Originally Posted by benwaggoner

So, decoding will be limited by serial entropy decoding? Do non-reference frames still update and thus serialize the entropy state? If decoding the "bbbb" in an "IbbbbBbbbbP" sequence is serialized, that'll really impact decoder parallelization. But if all the non-ref b-frames inherit the CABAC state of the most recently decoded reference frame, then it'll be a lot easier.

Frames with a "similar entropy" reference each other, so a high-level P might use the previous P (which is coded 16 frames back) as its entropy reference, and a non-reference inner B frame (which might not be a reference picture at all for pixel purposes) may actually use the previous inner B-frame (which may well be the one directly before this, or usually 2 and sometimes 3 frames back) as its reference. So this certainly influences how well frame-multithreading scales, not in the worst possible way but not ideal either.

And that's why you see weird things where using 256 instead of 128 threads (I think this is 32/16 frame threads x 8 tile threads) on a 32 core leads to pretty significant speedups (like this).

benwaggoner · 13th December 2018, 20:00

Quote:

Originally Posted by Beelzebubu

Frames with a "similar entropy" reference each other, so a high-level P might use the previous P (which is coded 16 frames back) as its entropy reference, and a non-reference inner B frame (which might not be a reference picture at all for pixel purposes) may actually use the previous inner B-frame (which may well be the one directly before this, or usually 2 and sometimes 3 frames back) as its reference. So this certainly influences how well frame-multithreading scales, not in the worst possible way but not ideal either.

And that's why you see weird things where using 256 instead of 128 threads (I think this is 32/16 frame threads x 8 tile threads) on a 32 core leads to pretty significant speedups (like this).

Great analysis, thanks!

And huh, I can just imagine the tears of people trying to implement low-cost HW decoders for this. I can see how interframe entropy could provide a percent or two of compression efficiency, though.

I would rather have per-frame entropy and no slice requirement if I had a choice.

Beelzebubu · 13th December 2018, 20:05

Quote:

Originally Posted by benwaggoner

Great analysis, thanks!

And huh, I can just imagine the tears of people trying to implement low-cost HW decoders for this. I can see how interframe entropy could provide a percent or two of compression efficiency, though.

I would rather have per-frame entropy and no slice requirement if I had a choice.

TBH, from what I understand from people in the relevant committees, this was proposed for HEVC also. The reason they didn't do it had nothing to do with HW, though, but was simply to keep the VoD and RTC use cases technically more similar. (Entropy dependencies are obviously disabled for RTC use cases.)

benwaggoner · 13th December 2018, 21:16

Quote:

Originally Posted by Beelzebubu

TBH, from what I understand from people in the relevant committees, this was proposed for HEVC also. The reason they didn't do it had nothing to do with HW, though, but was simply to keep the VoD and RTC use cases technically more similar. (Entropy dependencies are obviously disabled for RTC use cases.)

For RTC you could have backwards entropy states just fine, I think. So IPPPPPP could have each P reference the entropy state of the previous P. Error correction for lost packets would require trickiness. AV1 RTC would have the same issues.

Limiting entropy state reference to reference frames/tiles would be a lot more robust, but of reduced value. A bunch of non-ref b-frames referencing the same frames probably have a lot more in common with each other than any of them do with the ref-B/P/I frames they reference...

Random access would also be slowed by interframe entropy coding; it's essentially adding another layer of reference dependencies. Entropy is easier to decode, but getting to an arbitrary frame in a long GOP could require decoding the entropy state of a lot more frames than it would with a traditional IbBbP with inter-frame entropy only. With 8 b-frames, getting to an arbitrary frame of H.264/HEVC requires decoding about 1/8th of frames between the IDR and the target frame. Seems like it could be a lot worse in AV1, if I am understanding correctly.

Beelzebubu · 13th December 2018, 21:20

Quote:

Originally Posted by benwaggoner

Random access would also be slowed by interframe entropy coding; it's essentially adding another layer of reference dependencies. Entropy is easier to decode, but getting to an arbitrary frame in a long GOP could require decoding the entropy state of a lot more frames than it would with a traditional IbBbP with inter-frame entropy only. With 8 b-frames, getting to an arbitrary frame of H.264/HEVC requires decoding about 1/8th of frames between the IDR and the target frame. Seems like it could be a lot worse in AV1, if I am understanding correctly.

Yes, you're correct, random access (seeking) is going to be slower because of this.

mandarinka · 14th December 2018, 00:31

Quote:

Originally Posted by SmilingWolf

In x265, WPP hurts efficiency too. Should we stop using it?

Why do you think the bestest encoders haven't?

Enlightened ones have stopped using frame threading.

Quote:

Originally Posted by benwaggoner

Do we have numbers for the installed base of AVX2 capable PCs? They've been in all new mainstream systems for several years now. I'd guess it's >50% already.

Steam is probably one of the largest datasets available, but it is probably quite skewed. It covers a disproportionate number of gaming computers, but likely almost no HTPCs or office PCs. And all of those are going to watch AV1 video in browsers, even if it is just video ads. So real AVX2 penetration is likely worse than Steam shows, because of the Pentiums/Celerons and the like.
For illustration, look for example at the difference in Windows 10 versus Windows 7 usage shown by general browsing-based statistics sources and by Steam. The former show ~45% for W10 while Steam gives it over 60%.

SmilingWolf · 14th December 2018, 08:35

Quote:

Originally Posted by mandarinka

Why do you think the bestest encoders haven't?

Enlightened ones have stopped using frame threading.

I am unsure of the meaning of this.
My point was that there is no point in not using either frame threading, WPP (for x265) or tiling (for libaom) when the overhead is not only so low, but even very similar between the two.
Yet I have never seen WPP get the same amount of flak tiling gets, especially considering tile-threading in libdav1d can contribute up to +108% of the decoding performance on its own: https://docs.google.com/spreadsheets...gid=1238661928

Kurosu · 14th December 2018, 12:34

Tiles will cause a coding efficiency loss, even if negligible in the big picture. But it is not such a boon either, except for encoders with particular limits, or software decoders. Same for WPP, which really is more a software decoder thing. Contrary to dav1d, your regular HEVC software decoder does not exploit the combined "threadability" of frames and tiles/WPP.

nevcairiel · 14th December 2018, 12:39

In the long run, features that allow faster software decoding are really just wasted coding efficiency. When a codec goes mainstream, you'll have a full stack of hardware decoders, which usually don't care that much about these things.
On top of that, if you look at frame threading numbers, the advantage from tile threading shrinks extremely rapidly. Comparing its speed advantage without frame threading is really only a very limited picture.

SmilingWolf · 14th December 2018, 13:17

Quote:

Originally Posted by nevcairiel

In the long run, features that allow faster software decoding are really just wasted coding efficiency. When a codec goes mainstream, you'll have a full stack of hardware decoders, which usually don't care that much about these things.
On top of that, if you look at frame threading numbers, the advantage from tile threading shrinks extremely rapidly. Comparing its speed advantage without frame threading is really only a very limited picture.

True, and true. I don't even have a retort to that.

I still think that we can care about removing tiling from a libaom encoding workflow whenever the hardware goes mainstream and makes 4K decodable even on budget CPUs like v0lt's Pentium G5600, which should be 2-3 years (?), but I'm ok with the above. Hopefully in the same time rav1e will get proper psy-RD and frame-parallel encoding, too, so we won't have to care about it anyway.

My main heat for the whole tiling debate comes from "inappropriate" encoding settings excluding a lot of low-to-medium tier systems from early adoption (i.e. right about now). In my early tests libdav1d could scale much better on my processor if combined with tiling rather than by simply increasing the frame threads above a certain threshold. Hard to justify a 4MB difference in 1GB of video when said video can't be decoded in real time at all.
Still, the spreadsheet I quoted makes me think I should run the numbers again for dav1d. It has been a couple of months after all.

Mierastor · 14th December 2018, 18:37

"Intel: AV1 support not yet in Gen11 Graphics, but coming soon after"
https://www.reddit.com/r/AV1/comment..._graphics_but/

Meaning late 2020, if Intel as usual introduces new CPU generations late in the year?

Since these introductions have often only been paper launches, large-scale availability will only occur in 2021?

nevcairiel · 14th December 2018, 19:45

That's about the time frame most here would expect hardware support. Maybe in 2020, or thereabouts.

Nintendo Maniac 64 · 14th December 2018, 21:36

But let's be honest here - with AMD finally being a viable alternative again, who is really buying Intel for their graphics capabilities?
