10th August 2017, 17:57   #5521
Sagittaire
Testeur de codecs
Join Date: May 2003
Location: France
Posts: 2,484
Quote:
Originally Posted by mastrboy
So AMD is still way behind Intel in FPU performance per core, or will there be future x265 optimizations for Ryzen/Threadripper that could close the gap?
It's intensive multi-session encoding with x265: in this case memory bandwidth limits speed on Ryzen/Threadripper. In a simple 4K encode the 1950X is faster than the 7900X (~20%).
__________________
Le Sagittaire ... ;-)

1- Ateme AVC or x264
2- VP7 or RV10 only for anime
3- XviD, DivX or WMV9
10th August 2017, 18:34   #5522
Atak_Snajpera
RipBot264 author
Join Date: May 2006
Location: Poland
Posts: 7,806
Quote:
Originally Posted by Sagittaire
It's intensive multi-session encoding with x265: in this case memory bandwidth limits speed on Ryzen/Threadripper. In a simple 4K encode the 1950X is faster than the 7900X (~20%).
Nah. It is just 2xFMAC256 magic vs 2xFMAC128 in Zen. Threadripper's memory bandwidth is pretty good.

Source -> https://youtu.be/G9JR_v-4BaQ?t=2m2s
10th August 2017, 19:37   #5523
Barough
Registered User
Join Date: Feb 2007
Location: Sweden
Posts: 480
x265 v2.5+9-fdf39a97ecb8 (GCC 7.1.0, 32 & 64-bit 8/10/12bit Multilib Windows Binaries)

x265 [info]: HEVC encoder version x265 v2.5+9-fdf39a97ecb8
x265 [info]: build info [Windows][GCC 7.1.0][32/64 bit] 8bit+10bit+12bit
x265 [info]: using cpu capabilities: MMX2 SSE2Fast LZCNT SSSE3 SSE4.2 AVX FMA3 BMI2 AVX2


Code:
https://bitbucket.org/multicoreware/x265/commits/branch/default
10th August 2017, 20:34   #5524
Sagittaire
Testeur de codecs
Join Date: May 2003
Location: France
Posts: 2,484
Quote:
Originally Posted by Atak_Snajpera
Nah. It is just 2xFMAC256 magic vs 2xFMAC128 in Zen. Threadripper's memory bandwidth is pretty good.

Source -> https://youtu.be/G9JR_v-4BaQ?t=2m2s
Ryzen uses a slow data fabric for L3 cache communication between the CCX modules. Compared with Intel, latency to DDR4 and L3 is much higher too. Running multiple instances is not a good idea on AMD.

If you do a 4K encode with x265, you don't saturate the memory controller, and in that case the 1950X@stock will produce better results than the 7900X@stock.

You have the same problem with the R7 1800X and i7-6900K. In all 1080p tests, the R7 1800X@stock and i7-6900K@stock are on par for x265 encoding, but not in the x265 FHD benchmark.

If possible, reduce the number of instances: 2 or perhaps 3 1080p instances will be enough, and you will see that the relative speed is much higher for AMD.
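
A minimal sketch of how such a comparison could be set up, assuming a fixed thread budget that is split evenly across N x265 instances through the `--pools` option. The helper name `plan_instances`, the file names and the 32-thread budget are illustrative assumptions, not the setup of the benchmark under discussion; the sketch only builds command lines and does not run them.

```python
def plan_instances(total_threads, instances, source="input_1080p.y4m"):
    """Build one x265 command line per instance, splitting the thread budget evenly.

    Hypothetical helper: the file names and the even split are assumptions.
    """
    if instances < 1 or instances > total_threads:
        raise ValueError("instances must be between 1 and total_threads")
    base, extra = divmod(total_threads, instances)
    commands = []
    for i in range(instances):
        # Hand any remainder to the first instances so the budget is fully used.
        threads = base + (1 if i < extra else 0)
        commands.append([
            "x265",
            "--input", source,
            "--pools", str(threads),  # thread pool size (per NUMA node on multi-node systems)
            "--output", f"out_{i}.hevc",
        ])
    return commands


# Example: a 16C/32T CPU running 1, 2 and 3 parallel 1080p encodes.
for n in (1, 2, 3):
    pools = [cmd[cmd.index("--pools") + 1] for cmd in plan_instances(32, n)]
    print(n, "instance(s), threads per instance:", pools)
```

Timing each plan on the same source would show whether fewer, larger instances really help, which is the claim being made here.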

Last edited by Sagittaire; 10th August 2017 at 20:49.
10th August 2017, 20:38   #5525
Atak_Snajpera
RipBot264 author
Join Date: May 2006
Location: Poland
Posts: 7,806
Quote:
Originally Posted by Sagittaire
Ryzen uses a slow data fabric for L3 cache communication between the CCX modules. Compared with Intel, latency to DDR4 and L3 is much higher too. Running multiple instances is not a good idea on AMD.

If you do a 4K encode with x265, you don't saturate the memory controller, and in that case the 1950X will produce much better results than the 7900X.

You have the same problem with the R7 1800X and i7-6900K. In all 1080p tests, the R7 1800X and i7-6900K are on par for x265 encoding, but not in the x265 FHD benchmark.
Similar results:
http://pclab.pl/art75073-14.html
http://www.benchmark.pl/testy_i_rece...ona/28404.html

You are expecting too much from 2xFMAC128 vs 2xFMAC256.
Threadripper was designed to run as many processes as possible.
See the benchmarks on YouTube: gaming + streaming to YouTube + Twitch + encoding something in Adobe Premiere.
REMEMBER! You have to run multiple x265 encoders to fully saturate all cores at the very common 1080p resolution, so in practice there is no escape from that.

Last edited by Atak_Snajpera; 10th August 2017 at 20:51.
10th August 2017, 21:12   #5526
Sagittaire
Testeur de codecs
Join Date: May 2003
Location: France
Posts: 2,484
Quote:
Originally Posted by Atak_Snajpera
Similar results:
http://pclab.pl/art75073-14.html
http://www.benchmark.pl/testy_i_rece...ona/28404.html

You are expecting too much from 2xFMAC128 vs 2xFMAC256.
Threadripper was designed to run as many processes as possible.
See the benchmarks on YouTube: gaming + streaming to YouTube + Twitch + encoding something in Adobe Premiere.
REMEMBER! You have to run multiple x265 encoders to fully saturate all cores at the very common 1080p resolution, so in practice there is no escape from that.
Not really. In the pclab test the 1950X is 7% faster than the 7900X, while in your test the 7900X is 2% faster.

Moreover, I don't like the HandBrake test because this GUI uses heavy filters (AviSynth?) and doesn't feed the stream directly to the encoder. I prefer a direct benchmark with a high-speed ffmpeg frameserver (less than 5% CPU load for stream decoding).

Try your benchmark with fewer instances (just enough to keep CPU load at 100%) and you will see that speed is higher. Perhaps higher for the Intel CPU too.

Last edited by Sagittaire; 10th August 2017 at 21:22.
10th August 2017, 21:34   #5527
Atak_Snajpera
RipBot264 author
Join Date: May 2006
Location: Poland
Posts: 7,806
Quote:
Originally Posted by Sagittaire
Not really. In the pclab test the 1950X is 7% faster than the 7900X, while in your test the 7900X is 2% faster.

Moreover, I don't like the HandBrake test because this GUI uses heavy filters (AviSynth?) and doesn't feed the stream directly to the encoder. I prefer a direct benchmark with a high-speed ffmpeg frameserver (less than 5% CPU load for stream decoding).

Try your benchmark with fewer instances (just enough to keep CPU load at 100%) and you will see that speed is higher. Perhaps higher for the Intel CPU too.
I have already done that on my E5-2690 in distributed encoding mode: 5 x265/x264 encoders vs 1 x265/x264 encoder. The difference in encoding time was within the margin of error.
10th August 2017, 23:06   #5528
Sagittaire
Testeur de codecs
Join Date: May 2003
Location: France
Posts: 2,484
Quote:
Originally Posted by Atak_Snajpera
I have already done that on my E5-2690 in distributed encoding mode: 5 x265/x264 encoders vs 1 x265/x264 encoder. The difference in encoding time was within the margin of error.
1) Well, I read earlier that your E5-2690 (8C/16T) only reaches 70-75% CPU load in 1080p x265 encoding.

2) In that case, why use 5 encoding instances if 1 is enough?
11th August 2017, 01:26   #5529
adsun701
Registered User
Join Date: Dec 2016
Posts: 6
Hi here. I just created a patch that enables support for SMPTE ST 428,
SMPTE RP 431, and SMPTE EG 432 primaries. It also enables support for
SMPTE ST 2085, ICtCp, and both chroma-derived non-constant and
constant luminance matrices. They are all included in the latest spec.

Here's the link.
https://gist.github.com/Adsun701/472...0c90599ec4bb9a
11th August 2017, 13:35   #5530
Atak_Snajpera
RipBot264 author
Join Date: May 2006
Location: Poland
Posts: 7,806
It looks like the 1950X in its default Creator Mode (2 dies active) sits between 3.3 and 3.4 GHz, while the i9-7900X runs at a constant 4 GHz.
Source -> https://youtu.be/Fr1ZlUu8v_Q?t=9m8s

Scaling in my benchmark is good.
Ryzen 7 1700 @ 3.7GHz (OC) = 25.5 fps
Threadripper 1950X @ 3.4GHz = 43.6 fps
Threadripper 1950X @ 3.7GHz = 47.4 fps (estimated)

Scaling factor = ~1.9x
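
The estimate above can be reproduced with a few lines of arithmetic. This is only a sketch, assuming fps scales linearly with clock speed (the same assumption behind the estimated 3.7 GHz figure); the function name `scale_fps` is made up for illustration.

```python
def scale_fps(fps, clock_from_ghz, clock_to_ghz):
    """Rescale a measured fps to another clock, assuming fps is proportional to clock."""
    return fps * clock_to_ghz / clock_from_ghz


ryzen_1700_fps = 25.5  # 8C/16T @ 3.7 GHz, figure from the post
tr_1950x_fps = 43.6    # 16C/32T @ 3.4 GHz, figure from the post

tr_at_37 = scale_fps(tr_1950x_fps, 3.4, 3.7)
scaling = tr_at_37 / ryzen_1700_fps

print(f"1950X @ 3.7 GHz (estimated): {tr_at_37:.1f} fps")
print(f"scaling vs Ryzen 7 1700: {scaling:.2f}x (2.00x would be perfect for twice the cores)")
```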

Last edited by Atak_Snajpera; 11th August 2017 at 13:43.
11th August 2017, 21:02   #5531
microchip8
ffx264/ffhevc author
Join Date: May 2007
Location: /dev/video0
Posts: 1,843
Quote:
Originally Posted by adsun701
Hi here. I just created a patch that enables support for SMPTE ST 428,
SMPTE RP 431, and SMPTE EG 432 primaries. It also enables support for
SMPTE ST 2085, ICtCp, and both chroma-derived non-constant and
constant luminance matrices. They are all included in the latest spec.

Here's the link.
https://gist.github.com/Adsun701/472...0c90599ec4bb9a
You'd better post those patches to the x265 mailing list, not here.
__________________
ffx264 || ffhevc || ffxvid || microenc
12th August 2017, 18:45   #5532
x265_Project
Guest
Posts: n/a
Quote:
Originally Posted by adsun701
Hi here. I just created a patch that enables support for SMPTE ST 428,
SMPTE RP 431, and SMPTE EG 432 primaries. It also enables support for
SMPTE ST 2085, ICtCp, and both chroma-derived non-constant and
constant luminance matrices. They are all included in the latest spec.

Here's the link.
https://gist.github.com/Adsun701/472...0c90599ec4bb9a
Thanks! We received your email (sent to x265contributions at multicorewareinc dot com), along with your signed Contributor License Agreement. We'll review your patch ASAP.

Tom
14th August 2017, 15:05   #5533
NikosD
Registered User
Join Date: Aug 2010
Location: Athens, Greece
Posts: 2,901
Quote:
Originally Posted by mastrboy
So AMD is still way behind Intel in FPU performance per core, or will there be future x265 optimizations for Ryzen/Threadripper that could close the gap?

Quote:
Originally Posted by Atak_Snajpera
Nah. It is just 2xFMAC256 magic vs 2xFMAC128 in Zen.
Quote:
Originally Posted by Atak_Snajpera
You are expecting too much from 2xFMAC128 vs 2xFMAC256.
x265 has nothing to do with the FPU or the FMACs or floating point performance in general.

It's a pure integer app using AVX2 integer instructions, not FMA3, FADD, FMUL, or any floating point in general.

If someone gets a Monsterripper or Killerofskyalakex 16C/32T 1950X, try both modes,

UMA and NUMA, using x265.

UMA should be faster, but who knows.

Also make sure you saturate all 32 threads.
__________________
Win 10 x64 (19042.572) - Core i5-2400 - Radeon RX 470 (20.10.1)
HEVC decoding benchmarks
H.264 DXVA Benchmarks for all
14th August 2017, 18:01   #5534
Atak_Snajpera
RipBot264 author
Join Date: May 2006
Location: Poland
Posts: 7,806
Quote:
It's a pure integer app using AVX2 integer instructions, not FMA3, FADD, FMUL, or any floating point in general.
Are you 100% sure that FMACs are not being used in integer calculations as well? Haswell is noticeably faster in x265 than Sandy/Ivy Bridge (clock for clock).
Looking at the architecture, I don't see anything special except the new FMACs:
http://www.anandtech.com/show/6355/i...architecture/8

Quote:
If someone gets a Monsterripper or Killerofskyalakex 16C/32T 1950X, try both modes,
Chipzilla's 16C/32T part will destroy the Threadripper 1950X in x265 by a factor of 1.6x.

Last edited by Atak_Snajpera; 14th August 2017 at 18:06.
14th August 2017, 20:20   #5535
NikosD
Registered User
Join Date: Aug 2010
Location: Athens, Greece
Posts: 2,901
Quote:
Originally Posted by Atak_Snajpera
Are you 100% sure that FMACs are not being used in integer calculations as well?

Haswell is noticeably faster in x265 than Sandy/Ivy Bridge (clock for clock).

Looking at the architecture, I don't see anything special except the new FMACs:
http://www.anandtech.com/show/6355/i...architecture/8
You seem to be confusing the vector SIMD integer instruction set with the vector SIMD floating-point instruction set.

Haswell and later have the AVX2 instruction set, which enables the 256-bit vector SIMD integer instructions leveraged by x265.

Sandy and Ivy Bridge have only AVX, which is (mainly) for floating point.
So there is no such speedup on those processors.

Of course, Haswell also brought FMA3 alongside AVX2, which doubles the floating-point throughput compared to AVX, but that's a different story, irrelevant to x265.
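
To make the integer-versus-floating-point distinction concrete, here is a small sketch of a sum of absolute differences (SAD), a typical block-matching cost in video encoders. It is plain Python for illustration only; real encoder kernels are hand-written SIMD code, and the pixel values below are made up. The point is that every operation is on 8-bit integers, and that a 256-bit integer register holds twice as many such pixels as a 128-bit one.

```python
LANE_BITS = 8  # one 8-bit pixel per SIMD lane


def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized blocks of 8-bit pixels."""
    if len(block_a) != len(block_b):
        raise ValueError("blocks must have the same size")
    return sum(abs(a - b) for a, b in zip(block_a, block_b))


def pixels_per_register(register_bits):
    """How many 8-bit pixels fit into one SIMD register of the given width."""
    return register_bits // LANE_BITS


# 32 made-up pixel values per block, i.e. exactly one 256-bit register's worth.
cur = bytes([10, 20, 30, 40] * 8)
ref = bytes([12, 18, 30, 45] * 8)

print("SAD:", sad(cur, ref))
print("pixels per 128-bit register:", pixels_per_register(128))
print("pixels per 256-bit register:", pixels_per_register(256))
```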

Quote:
Originally Posted by Atak_Snajpera
Chipzilla's 16C/32T part will destroy the Threadripper 1950X in x265 by a factor of 1.6x.
If that becomes a reality, 60% faster than the 1950X, prepare yourself to use liquid nitrogen to freeze that CPU coming directly from hell, especially if Intel is still using that mustard between the CPU and heat spreader.

And you will need around 500W for that performance.
15th August 2017, 12:12   #5536
Atak_Snajpera
RipBot264 author
Join Date: May 2006
Location: Poland
Posts: 7,806
Quote:
Haswell and later have the AVX2 instruction set, which enables the 256-bit vector SIMD integer instructions leveraged by x265.
What specific unit in the CPU is responsible for executing AVX2 instructions? My common sense tells me that the FMAC does that. After all, old SSE2 can also work on integers:
https://en.wikipedia.org/wiki/SSE2

Zen has 2xFMAC128, while Intel has had 2xFMAC256 since Haswell. x265 benchmarks clearly show AMD 16C/32T = Intel 10C/20T. I see a clear correlation here.

Last edited by Atak_Snajpera; 15th August 2017 at 12:22.
15th August 2017, 18:27   #5537
NikosD
Registered User
Join Date: Aug 2010
Location: Athens, Greece
Posts: 2,901
Quote:
Originally Posted by Atak_Snajpera
What specific unit in the CPU is responsible for executing AVX2 instructions? My common sense tells me that the FMAC does that. After all, old SSE2 can also work on integers:
https://en.wikipedia.org/wiki/SSE2

Zen has 2xFMAC128, while Intel has had 2xFMAC256 since Haswell. x265 benchmarks clearly show AMD 16C/32T = Intel 10C/20T. I see a clear correlation here.
OMG! You really are a stubborn b@st@rd!

The execution units leveraged by the AVX2 instruction set are 256-bit SIMD integer units for ADD, MUL, and SHIFT.

Integer DIV remains 128-bit.

This is the last time I'm telling you: FMACs and floating-point numbers have nothing to do with integers or the x265 application.

Read here about all the execution units of Haswell vs Sandy Bridge:

http://www.realworldtech.com/haswell-cpu/4/
15th August 2017, 19:11   #5538
Atak_Snajpera
RipBot264 author
Join Date: May 2006
Location: Poland
Posts: 7,806
OK, smart ass, so explain to us why the Zen architecture sucks so much in x265...
http://www.linleygroup.com/mpr/article.php?id=11666

Last edited by Atak_Snajpera; 15th August 2017 at 19:21.
15th August 2017, 19:39   #5539
NikosD
Registered User
Join Date: Aug 2010
Location: Athens, Greece
Posts: 2,901
Quote:
Originally Posted by Atak_Snajpera
OK, smart ass, so explain to us why the Zen architecture sucks so much in x265...
http://www.linleygroup.com/mpr/article.php?id=11666
In case you didn't see it, I put a smiley in my first sentence just to be polite about your tremendous ignorance regarding CPU architectures and your ego (those two usually come together).

But now, after your reply, I can't be polite anymore.

Your comments regarding FMACs and x265 made me laugh like there's no tomorrow, so keep on posting your thoughts after reading CPU architecture articles you don't understand.

It's so funny!

Thank you!
16th August 2017, 01:58   #5540
Asmodian
Registered User
Join Date: Feb 2002
Location: San Jose, California
Posts: 4,406
Quote:
Originally Posted by Atak_Snajpera
OK, smart ass, so explain to us why the Zen architecture sucks so much in x265...
Maybe it is due to its much lower cache bandwidth or much higher cache/memory latency? There are major differences there. Teasing out the differences in the ALUs that might impact x265 is beyond me, so if anyone can help I would appreciate it.

Skylake-X:
Data-Cache Accesses: 2x 32B read + 2x 32B write
L2 Read Bandwidth: 64B

Zen:
Data-Cache Accesses: 2x 16B read + 1x 16B write
L2 Read Bandwidth: 32B

Does the massive L2 of Skylake-X help x265 at all?
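
As a rough sketch, the figures listed above can be turned into ratios. The numbers are taken as quoted in this post, and reading them as bytes per cycle per core is an assumption; the dictionary keys are made-up labels.

```python
# Cache figures exactly as listed in the post (assumed to be bytes per cycle per core).
cache = {
    "Skylake-X": {"l1_read": 2 * 32, "l1_write": 2 * 32, "l2_read": 64},
    "Zen": {"l1_read": 2 * 16, "l1_write": 1 * 16, "l2_read": 32},
}

# Ratio of the Skylake-X figure to the Zen figure for each path.
ratios = {
    path: cache["Skylake-X"][path] / cache["Zen"][path]
    for path in cache["Zen"]
}

for path, ratio in ratios.items():
    print(f"{path}: Skylake-X / Zen = {ratio:.1f}x")
```

On these numbers the gap is 2x for reads and L2 and larger for writes, which is at least consistent with cache bandwidth being a candidate explanation; it does not by itself prove that x265 is limited by it.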
__________________
madVR options explained