Doom9's Forum - View Single Post - Performance analysis of Short-term Memory in x264

Dark Shikari · 15th October 2011, 18:42

When posting code, can you post a git diff instead of (or in addition to) a tarball? It's vastly easier to read.

Some corrections and/or suggestions:

Page 33:
1. x264 only supports up to 10-bit, not 16-bit.
2. Newer versions of x264 support 4:2:2 and 4:4:4.
3. It might be useful to mention the differences between sliced threads and frame threads (and why both exist), and why frame threading is generally better.

Page 36:
1. There is one instance of x264_t per thread, not per encode. It holds basically everything, including analysis data for the current macroblock and so forth. You are correct that it's small enough to be largely irrelevant in terms of memory management.

Page 38:
1. Maybe you want to mention the hpel (half-pel) data in fdec frames as a large portion of memory use? This takes more space than the original pixel data (for those frames).
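
To put rough numbers on that, here is a back-of-the-envelope sketch. It assumes 8-bit 4:2:0 input and three extra full-size luma planes for the half-pel data (H, V and HV interpolations), and it ignores padding and alignment. The function names are made up for illustration and are not x264 identifiers.

```c
#include <stddef.h>

/* Bytes of original 8-bit 4:2:0 pixel data: one full-size luma plane plus
 * two quarter-size chroma planes. Padding is ignored. */
static size_t yuv420_bytes(size_t width, size_t height)
{
    return width * height + 2 * ((width / 2) * (height / 2));
}

/* Bytes of half-pel data, ASSUMING three extra full-size luma planes
 * (H, V and HV interpolations). Padding is ignored. */
static size_t hpel_bytes(size_t width, size_t height)
{
    return 3 * width * height;
}
```

Under those assumptions, a 1920x1080 frame has 3110400 bytes of pixel data and 6220800 bytes of half-pel data, i.e. twice as much.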

Page 41:
1. --sync-lookahead is the number of frames used in the sync buffer between lookahead and encoding. --rc-lookahead is the number of lookahead frames.
2. Some of your parameters are missing hyphens.
3. It might be useful to mention some of the meanings of the parameters in your chart, e.g. that b-adapt 1 is a fast heuristic algorithm whereas b-adapt 2 is a Viterbi decision algorithm. This might be relevant because the latter requires dozens of frames to form a path, whereas the former can work on just a couple.
4. You omitted MB-tree.
5. "veryslow", not "very-slow", in the footnote.

Page 50:
1. Floating point operations are not expensive when done on a per-frame basis. Ratecontrol does thousands of them.

Page 55:
1. Foreman is often used for PSNR testing by low-quality papers, but it's not good for performance testing, especially with x264, because of its very small frame size. x264 can only use one thread per couple of macroblock rows in a frame.
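
A rough sketch of why frame height caps threading. The rows-per-thread value is a hypothetical knob for illustration, not an x264 parameter, and this is not x264's actual thread-limit logic.

```c
/* Number of 16x16 macroblock rows in a frame of the given pixel height. */
static int mb_rows(int height)
{
    return (height + 15) / 16;
}

/* Upper bound on useful threads, ASSUMING each thread needs at least
 * rows_per_thread macroblock rows to work on. rows_per_thread is a
 * hypothetical parameter, not something x264 exposes. */
static int max_useful_threads(int height, int rows_per_thread)
{
    int threads = mb_rows(height) / rows_per_thread;
    return threads > 0 ? threads : 1;
}
```

Assuming the usual CIF (352x288) version of Foreman, that is 18 macroblock rows, so at two rows per thread the ceiling is 9 threads. A 1080p frame has 68 rows and a ceiling of 34 under the same assumption.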

General:
1. Have you considered taking advantage of this scheme to only allocate data where necessary? This is obviously impossible in x264's typical allocate-once scheme, but might be possible here. For example, in a P-frame, you don't need to allocate h->mb.mv[1].
2. Does your scheme have better or worse cache behavior? Have you tried making measurements of cache misses? Is there any difference?
3. This is one of the best papers on x264 I've ever seen. While that's not saying that much considering their typical quality, congrats.
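
To illustrate the first general point, here is a minimal sketch of allocating motion vector storage only for the reference lists a frame type actually uses. Everything in it (the enum, the struct, the function names) is hypothetical and only loosely modeled on the h->mb.mv[1] example. It is not x264 code, and the assumption that I-frames need no list at all is mine.

```c
#include <stdint.h>
#include <stdlib.h>

enum slice_type { SLICE_I, SLICE_P, SLICE_B };  /* hypothetical names */

/* Hypothetical per-frame motion vector storage: one (x, y) pair per block,
 * for reference list 0 and list 1. */
struct mv_store {
    int16_t (*mv[2])[2];
};

/* Lists needed per slice type: none for I, list 0 for P, both for B. */
static int lists_needed(enum slice_type type)
{
    return type == SLICE_B ? 2 : type == SLICE_P ? 1 : 0;
}

/* Allocate only the lists this slice type uses; unused lists stay NULL.
 * Returns 0 on success, -1 on allocation failure. */
static int mv_store_alloc(struct mv_store *s, enum slice_type type, size_t blocks)
{
    s->mv[0] = s->mv[1] = NULL;
    for (int i = 0; i < lists_needed(type); i++) {
        s->mv[i] = calloc(blocks, sizeof *s->mv[i]);
        if (!s->mv[i]) {
            free(s->mv[0]);  /* releases list 0 if list 1 failed; no-op otherwise */
            s->mv[0] = s->mv[1] = NULL;
            return -1;
        }
    }
    return 0;
}

/* Release whatever was allocated; safe to call on partially filled stores. */
static void mv_store_free(struct mv_store *s)
{
    free(s->mv[0]);
    free(s->mv[1]);
    s->mv[0] = s->mv[1] = NULL;
}
```

With this layout a P-frame pays for one list instead of two, which is the kind of saving an allocate-once scheme can't get.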

Last edited by Dark Shikari; 15th October 2011 at 19:11.