Doom9's Forum > Video Encoding > High Efficiency Video Coding (HEVC)

Old 10th August 2017, 17:28   #5521  |  Link
mastrboy
Registered User
 
Join Date: Sep 2008
Posts: 286
Quote:
Originally Posted by Atak_Snajpera View Post
So AMD is still way behind Intel in FPU performance per core, or will there be future x265 optimizations for Ryzen/Threadripper that could close the gap?
__________________
(i have a tendency to drunk post)
Old 10th August 2017, 17:57   #5522  |  Link
Sagittaire
Testeur de codecs
 
Join Date: May 2003
Location: France
Posts: 2,405
Quote:
Originally Posted by mastrboy View Post
So AMD is still way behind Intel in FPU performance per core, or will there be future x265 optimizations for Ryzen/Threadripper that could close the gap?
That benchmark is intensive multi-session encoding with x265; in that case memory bandwidth limits the speed of Ryzen/Threadripper. In a simple 4K encode, the 1950X is faster than the 7900X (~20%).
__________________
Le Sagittaire ... ;-)

1- Ateme AVC or x264
2- VP7 or RV10 only for anime
3- XviD, DivX or WMV9
Old 10th August 2017, 18:34   #5523  |  Link
Atak_Snajpera
RipBot264 author
 
Join Date: May 2006
Location: Poland
Posts: 6,108
Quote:
Originally Posted by Sagittaire View Post
That benchmark is intensive multi-session encoding with x265; in that case memory bandwidth limits the speed of Ryzen/Threadripper. In a simple 4K encode, the 1950X is faster than the 7900X (~20%).
Nah. It is just 2xFMAC256 magic vs 2xFMAC128 in Zen. ThreadRipper's Memory Bandwidth is pretty good.

Source -> https://youtu.be/G9JR_v-4BaQ?t=2m2s
Old 10th August 2017, 19:37   #5524  |  Link
Barough
Registered User
 
Join Date: Feb 2007
Location: Sweden
Posts: 218
x265 v2.5+9-fdf39a97ecb8 (GCC 7.1.0, 32 & 64-bit 8/10/12bit Multilib Windows Binaries)

x265 [info]: HEVC encoder version x265 v2.5+9-fdf39a97ecb8
x265 [info]: build info [Windows][GCC 7.1.0][32/64 bit] 8bit+10bit+12bit
x265 [info]: using cpu capabilities: MMX2 SSE2Fast LZCNT SSSE3 SSE4.2 AVX FMA3 BMI2 AVX2


Code:
https://bitbucket.org/multicoreware/x265/commits/branch/default
Old 10th August 2017, 20:34   #5525  |  Link
Sagittaire
Testeur de codecs
 
Join Date: May 2003
Location: France
Posts: 2,405
Quote:
Originally Posted by Atak_Snajpera View Post
Nah. It is just 2xFMAC256 magic vs 2xFMAC128 in Zen. ThreadRipper's Memory Bandwidth is pretty good.

Source -> https://youtu.be/G9JR_v-4BaQ?t=2m2s
Ryzen uses a slow data fabric for L3 cache communication between CCX modules. Its DDR4 and L3 latency is also considerably higher than Intel's. Running multiple instances is not a good idea on AMD.

If you do a 4K encode with x265, you don't saturate the memory controller, and in that case the 1950X@stock will produce a better result than the 7900X@stock.

You have the same problem with the R7 1800X and i7-6900K. In all 1080p tests, the R7 1800X@stock and i7-6900K@stock are on par for x265 encoding, but not in the x265 FHD benchmark.

If possible, reduce the number of instances (2x or perhaps 3x 1080p instances will be enough), and you will see that the relative speed is much higher for AMD.

Last edited by Sagittaire; 10th August 2017 at 20:49.
Old 10th August 2017, 20:38   #5526  |  Link
Atak_Snajpera
RipBot264 author
 
Join Date: May 2006
Location: Poland
Posts: 6,108
Quote:
Originally Posted by Sagittaire View Post
Ryzen uses a slow data fabric for L3 cache communication between CCX modules. Its DDR4 and L3 latency is also considerably higher than Intel's. Running multiple instances is not a good idea on AMD.

If you do a 4K encode with x265, you don't saturate the memory controller, and in that case the 1950X will produce a much better result than the 7900X.

You have the same problem with the R7 1800X and i7-6900K. In all 1080p tests, the R7 1800X and i7-6900K are on par for x265 encoding, but not in the x265 FHD benchmark.
Similar results
http://pclab.pl/art75073-14.html
http://www.benchmark.pl/testy_i_rece...ona/28404.html

You are expecting too much from 2xFMAC128 vs 2xFMAC256.
ThreadRipper was designed to use as many processes as possible.
See benchmarks on youtube. Gaming + streaming to youtube + twitch + encoding something in Adobe Premiere.
REMEMBER! You have to use multiple x265 encoders to fully saturate all cores with very common 1080p resolution. So in practice there is no escape from that.

Last edited by Atak_Snajpera; 10th August 2017 at 20:51.
Old 10th August 2017, 21:12   #5527  |  Link
Sagittaire
Testeur de codecs
 
Join Date: May 2003
Location: France
Posts: 2,405
Quote:
Originally Posted by Atak_Snajpera View Post
Similar results
http://pclab.pl/art75073-14.html
http://www.benchmark.pl/testy_i_rece...ona/28404.html

You are expecting too much from 2xFMAC128 vs 2xFMAC256.
ThreadRipper was designed to use as many processes as possible.
See benchmarks on youtube. Gaming + streaming to youtube + twitch + encoding something in Adobe Premiere.
REMEMBER! You have to use multiple x265 encoders to fully saturate all cores with very common 1080p resolution. So in practice there is no escape from that.
Not really. In the pclab test the 1950X produces a 7% better result than the 7900X, and in your test the 7900X is 2% better.

Moreover, I don't like the HandBrake test because that GUI uses heavy filtering (AviSynth?) and doesn't feed the stream directly to the encoder. I prefer a direct benchmark with a high-speed ffmpeg frameserver (less than 5% CPU load for stream decoding).

Try your benchmark with fewer instances (just enough to keep CPU load at 100%) and you will see that the speed is higher. Perhaps higher for the Intel CPU too.

Last edited by Sagittaire; 10th August 2017 at 21:22.
Old 10th August 2017, 21:34   #5528  |  Link
Atak_Snajpera
RipBot264 author
 
Join Date: May 2006
Location: Poland
Posts: 6,108
Quote:
Originally Posted by Sagittaire View Post
Not really. In the pclab test the 1950X produces a 7% better result than the 7900X, and in your test the 7900X is 2% better.

Moreover, I don't like the HandBrake test because that GUI uses heavy filtering (AviSynth?) and doesn't feed the stream directly to the encoder. I prefer a direct benchmark with a high-speed ffmpeg frameserver (less than 5% CPU load for stream decoding).

Try your benchmark with fewer instances (just enough to keep CPU load at 100%) and you will see that the speed is higher. Perhaps higher for the Intel CPU too.
I have already done that on my E5-2690 in distributed encoding mode: 5 x265/x264 encoders vs 1 x265/x264. The difference in encoding time was within the margin of error.
Old 10th August 2017, 23:06   #5529  |  Link
Sagittaire
Testeur de codecs
 
Join Date: May 2003
Location: France
Posts: 2,405
Quote:
Originally Posted by Atak_Snajpera View Post
I have already done that on my E5-2690 in distributed encoding mode: 5 x265/x264 encoders vs 1 x265/x264. The difference in encoding time was within the margin of error.
1) Well, I read earlier that your E5-2690 8C/16T only reaches 70-75% CPU load in 1080p x265 encoding.

2) In that case, why use 5 encoder instances if 1 is enough?
Old 11th August 2017, 01:26   #5530  |  Link
adsun701
Registered User
 
Join Date: Dec 2016
Posts: 6
Hi here. I just created a patch that enables support for SMPTE ST 428,
SMPTE RP 431, and SMPTE EG 432 primaries. It also enables support for
SMPTE ST 2085, ICtCp, and both chroma-derived non-constant and
constant luminance matrices. They are all included in the latest spec.

Here's the link.
https://gist.github.com/Adsun701/472...0c90599ec4bb9a
Old 11th August 2017, 13:35   #5531  |  Link
Atak_Snajpera
RipBot264 author
 
Join Date: May 2006
Location: Poland
Posts: 6,108
It looks like the 1950X in its default Creator Mode (2 dies active) sits between 3.3 and 3.4 GHz, while the i9 7900X runs at a constant 4 GHz.
Source -> https://youtu.be/Fr1ZlUu8v_Q?t=9m8s

Scaling in my benchmark is good.
Ryzen 7 1700 @ 3.7GHz (OC) = 25.5 fps
Threadripper 1950x @ 3.4GHz = 43.6 fps
Threadripper 1950x @ 3.7GHz = 47.4 fps (estimated)

Scaling factor = ~1.9x
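
For anyone who wants to check the estimate, here is a small sketch of the arithmetic behind those numbers. It assumes fps scales linearly with core clock, which is an assumption made for the estimate and not something the benchmark measured; the variable names are just illustrative.

```python
# Figures quoted in the post above.
ryzen_1700_fps = 25.5      # Ryzen 7 1700 (8C/16T) at 3.7 GHz
threadripper_fps = 43.6    # Threadripper 1950X (16C/32T) at 3.4 GHz

# Assumption: fps scales linearly with core clock.
threadripper_fps_at_3_7 = threadripper_fps * 3.7 / 3.4

# Speed-up from doubling the core count at the same clock.
scaling_factor = threadripper_fps_at_3_7 / ryzen_1700_fps

print(f"estimated 1950X @ 3.7 GHz: {threadripper_fps_at_3_7:.1f} fps")  # 47.4 fps
print(f"scaling factor: {scaling_factor:.2f}x")                         # 1.86x
```

So the "~1.9x" is 1.86x rounded up, against an ideal 2.0x for twice the cores.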

Last edited by Atak_Snajpera; 11th August 2017 at 13:43.
Old 11th August 2017, 21:02   #5532  |  Link
froggy1
ffx264/ffhevc author
 
Join Date: May 2007
Location: Belgium
Posts: 1,363
Quote:
Originally Posted by adsun701 View Post
Hi here. I just created a patch that enables support for SMPTE ST 428,
SMPTE RP 431, and SMPTE EG 432 primaries. It also enables support for
SMPTE ST 2085, ICtCp, and both chroma-derived non-constant and
constant luminance matrices. They are all included in the latest spec.

Here's the link.
https://gist.github.com/Adsun701/472...0c90599ec4bb9a
You'd better post those patches to the x265 mailing list, not here.
Old 12th August 2017, 18:45   #5533  |  Link
x265_Project
Registered User
 
Join Date: Jul 2013
Posts: 542
Quote:
Originally Posted by adsun701 View Post
Hi here. I just created a patch that enables support for SMPTE ST 428,
SMPTE RP 431, and SMPTE EG 432 primaries. It also enables support for
SMPTE ST 2085, ICtCp, and both chroma-derived non-constant and
constant luminance matrices. They are all included in the latest spec.

Here's the link.
https://gist.github.com/Adsun701/472...0c90599ec4bb9a
Thanks! We received your email (sent to x265contributions at multicorewareinc dot com), along with your signed Contributor License Agreement. We'll review your patch ASAP.

Tom
__________________
x265 HEVC (H.265) Video Encoder ____________ Follow x265 on Facebook.
Old 14th August 2017, 15:05   #5534  |  Link
NikosD
Registered User
 
Join Date: Aug 2010
Location: Athens, Greece
Posts: 2,287
Quote:
Originally Posted by mastrboy View Post
So AMD is still way behind Intel in FPU performance per core, or will there be future x265 optimizations for Ryzen/Threadripper that could close the gap?

Quote:
Originally Posted by Atak_Snajpera View Post
Nah. It is just 2xFMAC256 magic vs 2xFMAC128 in Zen.
Quote:
Originally Posted by Atak_Snajpera View Post
You are expecting too much from 2xFMAC128 vs 2xFMAC256.
x265 has nothing to do with the FPU or the FMACs or floating point performance in general.

It's a pure integer app using AVX2 integers not FMA3 or FADD or FMUL or any floating point in general.

If someone gets a Monsterripper or Killerofskyalakex 16C/32T 1950X try both modes.

UMA and NUMA using x265.

UMA should be faster, but who knows.

Also make sure you saturate all 32 threads.
__________________
Win 10 x64 (15063.540) - Core i3-4170/ iGPU HD 4400 (v.4624) - Core i5-2400 / dGPU RX 470 (v17.7.2)
HEVC decoding benchmarks
H.264 DXVA Benchmarks for all
Old 14th August 2017, 18:01   #5535  |  Link
Atak_Snajpera
RipBot264 author
 
Join Date: May 2006
Location: Poland
Posts: 6,108
Quote:
It's a pure integer app using AVX2 integers not FMA3 or FADD or FMUL or any floating point in general.
Are you 100% sure that FMACs are not being used in integer calculations as well? Haswell is noticeably faster in x265 than Sandy/Ivy Bridge (clock for clock).
Looking at the architecture, I don't see anything special except the new FMACs
http://www.anandtech.com/show/6355/i...architecture/8

Quote:
If someone gets a Monsterripper or Killerofskyalakex 16C/32T 1950X try both modes.
Chipzilla 16C/32T will destroy ThreadRipper 1950x in x265 by 1.6x factor.

Last edited by Atak_Snajpera; 14th August 2017 at 18:06.
Old 14th August 2017, 20:20   #5536  |  Link
NikosD
Registered User
 
Join Date: Aug 2010
Location: Athens, Greece
Posts: 2,287
Quote:
Originally Posted by Atak_Snajpera View Post
Are you 100% sure that FMACs are not being used in integer calculations as well?

Haswell is noticeably faster in x265 than Sandy/Ivy Bridge (clock for clock).

Looking at the architecture, I don't see anything special except the new FMACs
http://www.anandtech.com/show/6355/i...architecture/8
You seem to be confusing the vector SIMD integer instruction set with the vector SIMD floating-point instruction set.

Haswell and above have the AVX2 instruction set, which enables the 256-bit vector SIMD integer instructions leveraged by x265.

Sandy & Ivy have only AVX, which is (mainly) for floating point.
So, no speedup for those processors.

Of course, Haswell also brought FMA3 alongside AVX2, which doubles the floating-point throughput compared to AVX, but that's a different story, irrelevant to x265.
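
To put a number on the 128-bit vs 256-bit integer point, here is a rough sketch that only counts how many integer elements fit in one register. The helper name `lanes` is made up for illustration, and lane count is an upper bound on per-instruction work, not a prediction of encoder speed.

```python
def lanes(register_bits: int, element_bits: int) -> int:
    """Number of integer elements that fit in one SIMD register."""
    return register_bits // element_bits

SSE2_BITS = 128  # XMM registers
AVX2_BITS = 256  # YMM registers

for element_bits in (8, 16, 32):
    print(f"{element_bits}-bit ints: "
          f"SSE2 {lanes(SSE2_BITS, element_bits)} lanes, "
          f"AVX2 {lanes(AVX2_BITS, element_bits)} lanes")
```

Each AVX2 integer instruction touches twice as many elements as its 128-bit counterpart, which is why the gain shows up in integer code and has nothing to do with FMA.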

Quote:
Originally Posted by Atak_Snajpera View Post
Chipzilla 16C/32T will destroy ThreadRipper 1950x in x265 by 1.6x factor.
If that becomes a reality - 60% faster than 1950X - prepare yourself to use liquid nitrogen to freeze that CPU coming directly from hell, especially if Intel is still using that mustard between the CPU and heat spreader.

And you will need around 500W for that performance.
Old 15th August 2017, 12:12   #5537  |  Link
Atak_Snajpera
RipBot264 author
 
Join Date: May 2006
Location: Poland
Posts: 6,108
Quote:
Haswell and above have the AVX2 instruction set, which enables the 256-bit vector SIMD integer instructions leveraged by x265.
Which specific unit in the CPU is responsible for executing AVX2 instructions? My common sense tells me that the FMAC does that. After all, old SSE2 can also work on integers
https://en.wikipedia.org/wiki/SSE2

Zen has 2xFMAC128, while Intel has had 2xFMAC256 since Haswell. x265 benchmarks clearly show AMD 16C/32T = Intel 10C/20T. I see a clear correlation here.
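
The "16C = 10C" observation can be turned into a per-core ratio with a couple of lines. This is only a back-of-the-envelope sketch: it assumes throughput scales linearly with core count, and it lumps any clock-speed difference between the two chips into the per-core figure.

```python
# Observation from the post above: a 16-core Zen part roughly matches
# a 10-core Intel part in x265.
amd_cores = 16
intel_cores = 10

# Assumption: throughput scales linearly with core count.
intel_per_core_advantage = amd_cores / intel_cores

# Projected speed of a hypothetical 16-core Intel part relative to the 1950X.
hypothetical_intel_cores = 16
projected_ratio = intel_per_core_advantage * hypothetical_intel_cores / amd_cores

print(f"per-core advantage: {intel_per_core_advantage:.1f}x")      # 1.6x
print(f"projected 16C Intel vs 1950X: {projected_ratio:.1f}x")     # 1.6x
```

That is where a 1.6x projection for a 16C/32T Intel part comes from; it says nothing about which execution unit causes the gap.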

Last edited by Atak_Snajpera; 15th August 2017 at 12:22.
Old 18th August 2017, 15:15   #5538  |  Link
LigH
German doom9/Gleitz SuMo
 
Join Date: Oct 2001
Location: Germany, rural Altmark
Posts: 4,766
x265 2.5+11-d58761d8db4a

supports some new SMPTE-ST/RP/EG colorimetry options and a new split RD skip command* (documented only in full help):

Code:
   --[no-]splitrd-skip           Enable skipping split RD analysis when sum of split CU rdCost larger than none split CU rdCost for Intra CU. Default disabled

   --colorprim <string>          Specify color primaries from undef, bt709, bt470m, bt470bg, smpte170m,
                                 smpte240m, film, bt2020, smpte-st-428, smpte-rp-431, smpte-eg-432. Default undef

   --colormatrix <string>        Specify color matrix setting from undef, bt709, fcc, bt470bg, smpte170m,
                                 smpte240m, GBR, YCgCo, bt2020nc, bt2020c, smpte-st-2085, chroma-nc, chroma-c, ictcp. Default undef
* If I understood the patch comment in the mailing list correctly, it should speed up intra split cost calculation a little while possibly preserving identical output.
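
A minimal sketch of how a front end might assemble a command line with the new options. The accepted values are copied from the help text above; the function name `build_x265_command` and the file names are placeholders made up for illustration, and the command is only printed, not executed.

```python
import shlex

# Values copied from the --colorprim / --colormatrix help text above.
COLORPRIM = {"undef", "bt709", "bt470m", "bt470bg", "smpte170m", "smpte240m",
             "film", "bt2020", "smpte-st-428", "smpte-rp-431", "smpte-eg-432"}
COLORMATRIX = {"undef", "bt709", "fcc", "bt470bg", "smpte170m", "smpte240m",
               "GBR", "YCgCo", "bt2020nc", "bt2020c", "smpte-st-2085",
               "chroma-nc", "chroma-c", "ictcp"}

def build_x265_command(source, output, colorprim="undef",
                       colormatrix="undef", splitrd_skip=False):
    """Return an x265 argument list; raise ValueError on unknown colorimetry names."""
    if colorprim not in COLORPRIM:
        raise ValueError(f"unknown --colorprim value: {colorprim}")
    if colormatrix not in COLORMATRIX:
        raise ValueError(f"unknown --colormatrix value: {colormatrix}")
    cmd = ["x265", "--input", source, "--output", output,
           "--colorprim", colorprim, "--colormatrix", colormatrix]
    if splitrd_skip:
        cmd.append("--splitrd-skip")  # disabled by default per the help text
    return cmd

# Placeholder file names; nothing is executed here.
cmd = build_x265_command("input.y4m", "output.hevc",
                         colorprim="smpte-eg-432", colormatrix="ictcp",
                         splitrd_skip=True)
print(shlex.join(cmd))
```

Validating against the listed names up front catches a typo before the encoder is even started.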
__________________

German doom9 / Gleitz video board
CQME – change the Matrix!
BeSweet 1.5b31 All In One | HeadAC3he 0.24a13

Rémoulade is spoiled
Old 19th August 2017, 03:55   #5539  |  Link
burfadel
Registered User
 
Join Date: Aug 2006
Posts: 2,051
Yes, splitrd-skip looks interesting; I wouldn't be surprised if it were enabled by default in the future. I guess that comes down to user reports, or maybe they're waiting on the possibility of it being extended to inter CUs?
Old Yesterday, 06:00   #5540  |  Link
littlepox
Registered User
 
Join Date: Nov 2012
Posts: 205
Quote:
Originally Posted by LigH View Post
* If I understood the patch comment in the mailing list correctly, it should speed up intra split cost calculation a little while possibly preserving identical output.
Should that be the case, we are probably going to see this option removed while the skip is integrated in the code very soon.