Doom9's Forum - View Single Post

blurred · 27th June 2018, 20:41

Response by the author of the benchmark, from https://encode.ru/threads/1890-Bench...ll=1#post57093

Quote:

Firstly, the AV1 range coder only uses 1 multiplication per CDF entry, the 16 is the "worst case" (keep in mind that they can be done in parallel, e.g. with SIMD, so it's actually better to use more than less as the multiply is the cheapest part in software).

For SSE2 decoding in AV1 you need 4 SIMD multiplications (_mm_mullo_epi32) + 4 comparisons (_mm_cmpgt_epi32) + combining the results of the 4 SSE2 registers (after _mm_movemask_ps).
It is unlikely that this will be faster than scalar decoding.
For AV1 hardware implementations, you need 16 32x32 multipliers; otherwise parallel multiplications are not possible.
In addition, 16 comparisons plus other operations are required.
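
To make the cost argument above concrete, here is a minimal Python sketch of the symbol search in a CDF-based multi-symbol range decoder: one multiplication and one comparison per CDF entry, followed by a combine step. It is a simplification and not the libaom code. The 15-bit scaling, the CDF layout (leading 0, trailing 1 << 15) and the names decode_symbol_parallel, decode_symbol_scalar, rng and dif are assumptions made for illustration. The list comprehensions only stand in for SIMD lanes or parallel hardware multipliers. As a side note, _mm_mullo_epi32 is an SSE4.1 intrinsic, so a pure SSE2 build would need a different multiply.

```python
PROB_BITS = 15  # assumed CDF precision: the last CDF entry is 1 << 15


def decode_symbol_parallel(cdf, rng, dif):
    """Data-parallel style search: every CDF entry is processed independently.

    cdf : [0, c1, ..., 1 << PROB_BITS], strictly increasing (hypothetical layout)
    rng : current coder range
    dif : current code value, 0 <= dif < rng
    """
    # One multiplication per interior CDF entry (the "N multipliers" cost).
    bounds = [(rng * c) >> PROB_BITS for c in cdf[1:-1]]
    # One comparison per entry.
    hits = [b <= dif for b in bounds]
    # Combine step: the symbol is the number of boundaries at or below dif.
    return sum(hits)


def decode_symbol_scalar(cdf, rng, dif):
    """Scalar search with early exit: stops at the first boundary above dif."""
    s = 0
    while s < len(cdf) - 2 and ((rng * cdf[s + 1]) >> PROB_BITS) <= dif:
        s += 1
    return s
```

For a 4-symbol uniform CDF [0, 8192, 16384, 24576, 32768] with rng = 40000, the scaled boundaries are 10000, 20000 and 30000, so dif = 25000 selects symbol 2 in both versions. The scalar version does less work for likely (early) symbols, while the parallel version always pays for every entry plus the combine step.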

Quote:

Secondly, the difference is nowhere near 7x when we benchmarked the two - rANS was faster, but by a factor of about 2.

For this benchmark and current implementations, rANS decoding is SEVEN times faster than AV1.
On ARM the scalar version is 5 times faster.
The AV1 nibble entropy coder is even slower than a bitwise range coder.

Quote:

However, the requirement to buffer and reverse the symbols was unfortunately insurmountable.

This is only required in encoding, which is usually done in software.

This irrelevant argument is always used in their discussions.
The benchmark shows that TurboANXN, even with reverse encoding, is more than 4 times faster than the current AOMedia AV1 encoder.
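
The buffering requirement discussed above can be sketched in a few lines of Python. This is a rough illustration of adaptive rANS, not the TurboANXN code: the encoder runs the adaptive model forward, buffers one (start, freq) pair per symbol, then encodes those pairs in reverse, so the decoder can read the stream strictly forward with no buffering. The constants (12-bit probabilities, byte-wise renormalization, adaptation rate 4) and all function names are assumptions chosen for the example.

```python
SCALE_BITS = 12            # assumed probability precision
TOTAL = 1 << SCALE_BITS
RANS_L = 1 << 23           # lower bound of the normalized state interval
RATE = 4                   # assumed adaptation speed


def new_cdf(n):
    # n symbols, n + 1 entries: symbol s owns the slots cdf[s] .. cdf[s + 1] - 1
    return [i * TOTAL // n for i in range(n + 1)]


def adapt(cdf, s):
    # Shift-based update; the targets keep every frequency at 1 or more.
    n = len(cdf) - 1
    for i in range(1, n):
        target = TOTAL - (n - i) if i > s else i
        cdf[i] += (target - cdf[i]) >> RATE


def encode(symbols, n):
    # Forward pass: adapt the model and buffer (start, freq) for each symbol.
    cdf, buffered = new_cdf(n), []
    for s in symbols:
        buffered.append((cdf[s], cdf[s + 1] - cdf[s]))
        adapt(cdf, s)
    # Backward pass: rANS is last-in first-out, so encode in reverse order.
    x, out = RANS_L, bytearray()
    for start, freq in reversed(buffered):
        x_max = ((RANS_L >> SCALE_BITS) << 8) * freq
        while x >= x_max:          # renormalize: emit low bytes
            out.append(x & 0xFF)
            x >>= 8
        x = ((x // freq) << SCALE_BITS) + (x % freq) + start
    out += x.to_bytes(4, "little")  # flush the final state
    out.reverse()                   # decoder reads this strictly forward
    return bytes(out)


def decode(data, count, n):
    # Single forward pass: no buffering or reversal on the decoder side.
    cdf, out = new_cdf(n), []
    x, pos = int.from_bytes(data[:4], "big"), 4
    for _ in range(count):
        slot = x & (TOTAL - 1)
        s = 0
        while cdf[s + 1] <= slot:   # linear search for the symbol's slot range
            s += 1
        start, freq = cdf[s], cdf[s + 1] - cdf[s]
        x = freq * (x >> SCALE_BITS) + slot - start
        while x < RANS_L:           # renormalize: pull in the next byte
            x = (x << 8) | data[pos]
            pos += 1
        adapt(cdf, s)
        out.append(s)
    return out
```

A round trip such as decode(encode(msg, 4), len(msg), 4) should return msg unchanged. The extra work (the buffered list and the reversed loop) sits entirely in encode.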

Quote:

Also keep in mind that AV1 adjusts the probabilities on a per-symbol basis.
The entropy coder CDFs are designed to make adapting the probabilities very fast (with only adds and shifts).
This puts some constraints on the design that don't exist in the linked benchmark (which uses fixed probabilities as far as I can tell).

The benchmark is using adaptive probabilities.
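
For readers unfamiliar with per-symbol adaptation, the following Python sketch shows the kind of shift-based CDF update the quote refers to. It is a generic illustration, not the AV1 or TurboANXN code: the 15-bit precision, the rate of 5, the CDF layout (leading 0, trailing 1 << 15) and the name adapt_cdf are assumptions. After each coded symbol, every interior CDF entry moves a fraction 1/2^rate toward a target, using only additions, subtractions and shifts.

```python
PROB_BITS = 15            # assumed CDF precision
TOTAL = 1 << PROB_BITS


def adapt_cdf(cdf, symbol, rate=5):
    """Update cdf in place after coding `symbol`.

    cdf[i] is the cumulative frequency of all symbols below i, so entries
    above `symbol` move up and entries at or below it move down. The targets
    are offset by the entry index so that no frequency can reach zero.
    """
    n = len(cdf) - 1
    for i in range(1, n):
        target = TOTAL - (n - i) if i > symbol else i
        cdf[i] += (target - cdf[i]) >> rate   # arithmetic shift, rounds down
    return cdf
```

Starting from the uniform CDF [0, 8192, 16384, 24576, 32768] and coding symbol 1 once gives [0, 7936, 16895, 24831, 32768], so the frequency of symbol 1 grows from 8192 to 8959 while the others shrink.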

Quote:

There is so much clever that gets done, even in decoders.
And there are so many different kinds of parallelization, SIMD, ASIC, etcetera available.
And surprising numbers of decoders don't implement basic stuff like skipping non-reference frames when doing seeking, due to the system layer and the decoder layers not being tightly coupled enough.

This is independent of entropy coding. Here we are comparing the AV1 entropy coder against rANS, and the two are interchangeable.

Quote:

Also rANS is quite recent and highly optimised, plus it uses 32/64-bit arithmetic and SIMD instructions, while the Daala range coder uses only 16-bit arithmetic.

And you can fit between 2 and 4 single-clock 16-bit multipliers in the same number of gates as one single-clock 32-bit multiplier.

According to the AV1 source code, 32-bit operations are used. rANS is 32-bit only.

I think the decision against rANS is politically motivated (not-invented-here syndrome).
Otherwise, why not simply leave the (now removed) rANS version in the repository for comparison?
Hardware comparisons (complexity, energy consumption, cost, ...) are only possible after implementing optimized versions of both.

Note that we are only considering adaptive rANS here. Do not confuse this with block-based ANS as used in zstd, lzfse, lzturbo...