Old 23rd November 2019, 16:25   #1  |  Link
TEB
Registered User
 
Join Date: Feb 2003
Location: Palmcoast of Norway
Posts: 363
X265 slow encoding

Hi, I've been doing a few encodes lately on my lab encoder and I'm getting abysmal encoding speeds, apparently because ffmpeg+x265 isn't using all of the CPU capacity:

HW: Dual Intel Silver 4118 (20cores + 20HT cores), RHEL 7.7
sourcefile is a 422hq Prores file
ffmpeg is the latest gitpull from yesterday

Result: encoding at 1fps speed 0.04x
Machine has a load of 2 (40cores..??), almost all cpu cores are idle

./ffmpeg -loglevel verbose -i file_p25.mov -strict -1 -vf format=yuv420p10 -codec:v libx265 -x265-params keyint=100:min-keyint=100:no-open-gop=1 -level 4.1 -preset veryslow -crf 16 -profile:v main10 -y xtemp3_P6slow_nolimit_max.ts

Anyone know what's going on?
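
For anyone reading along, the "speed" figure is roughly the encode frame rate divided by the source frame rate. A minimal sketch, assuming a 25 fps source (guessed from the "p25" in the file name, not confirmed in the post); `speed_factor` is just an illustrative helper name:

```shell
#!/bin/sh
# speed_factor ENCODE_FPS SOURCE_FPS
# Prints the realtime factor as encode fps / source fps.
speed_factor() {
    awk -v enc="$1" -v src="$2" 'BEGIN { printf "%.2fx\n", enc / src }'
}

# 1 fps on an (assumed) 25 fps source
speed_factor 1 25
```

The same arithmetic puts 7 fps at about 0.28x and 44 fps at about 1.76x for a 25 fps source.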
Old 23rd November 2019, 16:32   #2  |  Link
Atak_Snajpera
RipBot264 author
 
Join Date: May 2006
Location: Poland
Posts: 7,806
Add --ctu 16, or just use RipBot264 in distributed encoding mode.
Old 23rd November 2019, 17:13   #3  |  Link
TEB
Registered User
 
Join Date: Feb 2003
Location: Palmcoast of Norway
Posts: 363
thx! It seems that by moving from veryslow to normal, i get a much better utilization of the cores..
Old 24th November 2019, 00:50   #4  |  Link
Greenhorn
Registered User
 
Join Date: Apr 2018
Posts: 61
If you want to encode at the higher presets, you can also try enabling --pmode for a large boost to utilization. (It'll actually decrease performance at the lower presets.)
Old 19th December 2019, 08:40   #5  |  Link
TEB
Registered User
 
Join Date: Feb 2003
Location: Palmcoast of Norway
Posts: 363
Quote:
Originally Posted by Greenhorn View Post
If you want to encode at the higher presets, you can also try enabling --pmode for a large boost to utilization. (It'll actually decrease performance at the lower presets.)
I tried the following:

Code:
./ffmpeg -loglevel verbose -i ARCHIVE.mov -strict -1 -vf format=yuv420p10 -codec:v libx265  -x265-params keyint=100:min-keyint=100:no-open-gop=1:pmode=1 -level 4.1 -preset veryslow -crf 16  -profile:v main10 -y  test-veryslow.ts
Source = Prores HQ422 10bit movie trailer

With and without pmode (assuming the syntax is correct) I'm getting 0.2x (ca. 14% system usage).

From the log:

x265 [info]: HEVC encoder version 3.2+2-82a66ce12955
x265 [info]: build info [Linux][GCC 6.3.0][64 bit] 10bit
x265 [info]: using cpu capabilities: MMX2 SSE2Fast LZCNT SSSE3 SSE4.2 AVX FMA3 BMI2 AVX2
x265 [info]: Main 10 profile, Level-4 (Main tier)
x265 [info]: Thread pool created using 64 threads
x265 [info]: Thread pool created using 64 threads
x265 [info]: Slices : 1
x265 [info]: frame threads / pool features : 5 / wpp(17 rows)+pmode
x265 [info]: Coding QT: max CU size, min CU size : 64 / 8
x265 [info]: Residual QT: max TU size, max depth : 32 / 3 inter / 3 intra
x265 [info]: ME / range / subpel / merge : star / 57 / 4 / 5
x265 [info]: Keyframe min / max / scenecut / bias: 100 / 100 / 40 / 5.00
x265 [info]: Lookahead / bframes / badapt : 40 / 8 / 2
x265 [info]: b-pyramid / weightp / weightb : 1 / 1 / 1
x265 [info]: References / ref-limit cu / depth : 5 / off / off
x265 [info]: AQ: mode / str / qg-size / cu-tree : 2 / 1.0 / 32 / 1
x265 [info]: Rate Control / qCompress : CRF-16.0 / 0.60
x265 [info]: tools: rect amp rd=6 psy-rd=2.00 rdoq=2 psy-rdoq=1.00 rskip
x265 [info]: tools: signhide tmvp b-intra strong-intra-smoothing deblock sao

Last edited by TEB; 19th December 2019 at 09:04.
Old 19th December 2019, 13:51   #6  |  Link
kuchikirukia
Registered User
 
Join Date: Oct 2014
Posts: 476
Have you tried not using ffmpeg? ffmpeg is useful because it can do anything, but it doesn't do anything well.

Using MeGUI on Windows (calling the x265 binary) I have no problem hitting 100% load on my 8-threaded i7 4790 at --preset veryslow on a 1080p Blu-ray.

Though interestingly I barely exceed 1FPS in 10 bit. Have processors come so far that my 3.8GHz 4/8 Haswell can nearly be equaled by two cores of a 3GHz Xeon?
Passmark shows that Xeon as being reasonably behind my i7 single-threaded. (70% of the speed)

Last edited by kuchikirukia; 19th December 2019 at 14:11.
Old 21st December 2019, 10:43   #7  |  Link
foxyshadis
ангел смерти
 
Join Date: Nov 2004
Location: Lost
Posts: 9,558
Quote:
Originally Posted by kuchikirukia View Post
Have you tried not using ffmpeg? ffmpeg is useful because it can do anything, but it doesn't do anything well.

Using MeGUI on Windows (calling the x265 binary) I have no problem hitting 100% load my 8-threaded i7 4790 at --preset veryslow on a 1080p Blu-ray.

Though interestingly I barely exceed 1FPS in 10 bit. Have processors come so far that my 3.8GHz 4/8 Haswell can nearly be equaled by two cores of a 3GHz Xeon?
Passmark shows that Xeon as being reasonably behind my i7 single-threaded. (70% of the speed)
It might be memory access speed, plus the Xeon has absolutely massive internal caches. x265 is extremely sensitive to access speed, even more so than x264 was back when it was still considered slow.
Old 21st December 2019, 20:53   #8  |  Link
excellentswordfight
Lost my old account :(
 
Join Date: Jul 2017
Posts: 322
Quote:
Originally Posted by kuchikirukia View Post
Have you tried not using ffmpeg? ffmpeg is useful because it can do anything, but it doesn't do anything well.

Using MeGUI on Windows (calling the x265 binary) I have no problem hitting 100% load my 8-threaded i7 4790 at --preset veryslow on a 1080p Blu-ray.

Though interestingly I barely exceed 1FPS in 10 bit. Have processors come so far that my 3.8GHz 4/8 Haswell can nearly be equaled by two cores of a 3GHz Xeon?
Passmark shows that Xeon as being reasonably behind my i7 single-threaded. (70% of the speed)
Wow, that's a somewhat different setup from TS... You're talking about saturating 8 threads and TS about 40; that's quite a big difference. And you don't get much more than 8 threads saturated with veryslow at such a big CTU size for 1080p, and since his Xeon will run below 3 GHz it's not that weird that you get similar speeds, since he can't use the thread advantage!

And tbh veryslow is literally very slow, to the point where it's almost unusable (especially after the latest preset changes). I would say that 'slower' is the slowest "usable" preset atm.

For reference, this is what I get with a Xeon Gold 6126 (12C/24T):
--preset veryslow: 0.8 fps (25-40% utilization)

--preset slower --ctu 32 --merange 26: 3 fps (100% utilization)

Almost a 4x speed increase.

Edit: also keep in mind that the source has a large effect on speed; you can't do a direct comparison without using the same files.

Last edited by excellentswordfight; 21st December 2019 at 21:06.
Old 22nd December 2019, 08:17   #9  |  Link
kuchikirukia
Registered User
 
Join Date: Oct 2014
Posts: 476
Quote:
Originally Posted by excellentswordfight View Post
Wow, thats somewhat of an different setup from TS... You are talking about saturating 8threads and TS 40, its quite a big difference.
I'm talking about saturating 8 threads vs his 2. You showed ~8 threaded too.
While it doesn't look like he's going to see anything close to a 4x speedup if he fixes his threadedness issue, if it's a reasonable gain it may turn out a difference between running one to two encodes on each CPU vs five on each.
Old 22nd December 2019, 09:43   #10  |  Link
TEB
Registered User
 
Join Date: Feb 2003
Location: Palmcoast of Norway
Posts: 363
Quote:
Originally Posted by excellentswordfight View Post
Wow, thats somewhat of an different setup from TS... You are talking about saturating 8threads and TS 40, its quite a big difference. And you dont get much more then 8t saturation with veryslow with such a big ctu size for 1080p, and since his xeon will run at sub 3Ghz its not that weird that you get similar speeds since he cant use the thread advantage!

And tbh veryslow is literally very slow, to the point were its almost unusable (especially after the latest preset changes). I would say that 'slower' is the lowest "usable" preset atm.

For reference, this is what i get with an Xeon GOLD 6126 (12C/24T)
--veryslow 0,8fps (25-40% utilization)

--slower --ctu 32 --merange 26 3fps (100% utilization)

almost a 4x speed increase.

edit. Also keep in mind that the source have a large effect on speed, you cannot do a direct comparison without using the same files.

Mind explaining what --ctu 32 and --merange 26 mean?
Old 22nd December 2019, 09:44   #11  |  Link
TEB
Registered User
 
Join Date: Feb 2003
Location: Palmcoast of Norway
Posts: 363
Quote:
Originally Posted by kuchikirukia View Post
I'm talking about saturating 8 threads vs his 2. You showed ~8 threaded too.
While it doesn't look like he's going to see anything close to a 4x speedup if he fixes his threadedness issue, if it's a reasonable gain it may turn out a difference between running one to two encodes on each CPU vs five on each.
Not sure where the 2 came from, as I have a 128-core CPU? Or am I missing something?
Old 22nd December 2019, 17:29   #12  |  Link
excellentswordfight
Lost my old account :(
 
Join Date: Jul 2017
Posts: 322
Quote:
Originally Posted by TEB View Post
Mind explaining what --ctu 32 and --merange 26 means?
--ctu specifies the maximum CU size. The default value (64) is rather large and is mostly beneficial for high-res (UHD) material, and it has a large effect on parallelism at lower resolutions. It can be reduced for greater parallelism without any big effect on compression. I usually leave it at 64 for 1080p and go for 32 at 720p and below, but if you want to use more threads with a single encode, this is one of the key parameters.
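
A rough way to see the parallelism effect: as far as I understand it, WPP works on CTU rows, so the row count limits how many rows can be in flight within one frame. A minimal sketch, assuming a 1080-pixel-high source (which matches the "wpp(17 rows)" and "wpp(34 rows)" lines in the logs in this thread); `wpp_rows` is an illustrative helper, not an x265 tool:

```shell
#!/bin/sh
# wpp_rows HEIGHT CTU
# Number of CTU rows in a frame: ceil(HEIGHT / CTU), using integer arithmetic.
wpp_rows() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

# Frame height of 1080 is an assumption about the source.
for ctu in 64 32 16; do
    echo "ctu=$ctu -> $(wpp_rows 1080 "$ctu") rows"
done
```

Halving the CTU size doubles the number of rows, which is why a smaller --ctu gives the thread pool more to do at 1080p.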

--merange sets the motion search range, the default value (57) is calculated based on the default CTU value of 64. The doc explains it rather well:
Quote:
The default is derived from the default CTU size (64) minus the luma interpolation half-length (4) minus maximum subpel distance (2) minus one extra pixel just in case the hex search method is used. If the search range were any larger than this, another CTU row of latency would be required for reference frames.
All presets slower than medium use star search, so the extra pixel reserved for hex search isn't needed; the same logic with a CTU size of 32 gives 32 - 4 - 2 = 26.
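
The same calculation as a small sketch, following only the rule in the quoted doc text (`merange_for_ctu` is an illustrative helper name, and treating "hex" as the only method that needs the extra pixel is my reading of that paragraph):

```shell
#!/bin/sh
# merange_for_ctu CTU METHOD
# CTU size minus luma interpolation half-length (4) minus max subpel distance (2),
# minus one extra pixel when METHOD is "hex" (per the quoted x265 doc text).
merange_for_ctu() {
    range=$(( $1 - 4 - 2 ))
    if [ "$2" = "hex" ]; then
        range=$(( range - 1 ))
    fi
    echo "$range"
}

merange_for_ctu 64 hex    # the documented default: 57
merange_for_ctu 32 star   # the value suggested above: 26
```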

Last edited by excellentswordfight; 22nd December 2019 at 17:44.
Old 23rd December 2019, 05:30   #13  |  Link
kuchikirukia
Registered User
 
Join Date: Oct 2014
Posts: 476
Quote:
Originally Posted by TEB View Post
Not sure where 2 came in as i have a 128 core cpu ? Or am i missing something?
Quote:
Originally Posted by TEB View Post
Machine has a load of 2 (40cores..??), almost all cpu cores are idle
A load of 2 means you'd be at 100% load on a dual-core, when the x265 Windows binary will max my 8 threads. (load of 8)

While the veryslow preset doesn't scale out to the 40 threads of your system, it should be able to do 8, and my guess as to why you can't hit that would be an issue with your ffmpeg build.
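
To put a load figure in perspective, it can be compared against the number of logical CPUs. A rough sketch only (Linux load average also counts tasks waiting on I/O, so it is not a pure CPU measure); `load_pct` is an illustrative helper name:

```shell
#!/bin/sh
# load_pct LOAD NCPU
# Load average expressed as a rough percentage of the logical CPU count.
load_pct() {
    awk -v load="$1" -v ncpu="$2" 'BEGIN { printf "%.0f%%\n", 100 * load / ncpu }'
}

# The figures from the first post: load 2 on 40 logical cores.
load_pct 2 40

# On a Linux box the live values could be read like this:
#   load_pct "$(cut -d' ' -f1 /proc/loadavg)" "$(nproc)"
```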
Old 23rd December 2019, 13:59   #14  |  Link
TEB
Registered User
 
Join Date: Feb 2003
Location: Palmcoast of Norway
Posts: 363
Quote:
Originally Posted by excellentswordfight View Post
--ctu specify the maxiumum CU size, the default value is rather large and is mostly beneficial for high res (UHD) material and it has a large effect on parallelism at lower res. It can be reduced for greater parallelism without any big effect on compression. I usually leave it at 64 for 1080p, and go for 32 at 720p and bellow, but if you are looking at using more threads and still use single encoding, this is one of the key parameters.

--merange sets the motion search range, the default value (57) is calculated based on the default CTU value of 64. The doc explains it rather well:

All presets below medium use star search, so using the same logic with a cu size of 32 would get you 26.
Thx for the insight !!

Code:
 ./ffmpeg -loglevel verbose -i ARCHIVE.mov -strict -1 -vf format=yuv420p10 -codec:v libx265  -x265-params keyint=100:min-keyint=100:no-open-gop=1:pmode=1:ctu=32:merange:26 -level 4.1 -preset veryslow -crf 16  -profile:v main10 -y  test-veryslow.ts
Doesnt seem to trigger the change tho:

Code:
x265 [info]: HEVC encoder version 3.2+2-82a66ce12955
x265 [info]: build info [Linux][GCC 6.3.0][64 bit] 10bit
x265 [info]: using cpu capabilities: MMX2 SSE2Fast LZCNT SSSE3 SSE4.2 AVX FMA3 BMI2 AVX2
x265 [info]: Main 10 profile, Level-4 (Main tier)
x265 [info]: Thread pool created using 64 threads
x265 [info]: Thread pool created using 64 threads
x265 [info]: Slices                              : 1
x265 [info]: frame threads / pool features       : 5 / wpp(17 rows)
x265 [info]: Coding QT: max CU size, min CU size : 64 / 8
x265 [info]: Residual QT: max TU size, max depth : 32 / 3 inter / 3 intra
x265 [info]: ME / range / subpel / merge         : star / 57 / 4 / 5
x265 [info]: Keyframe min / max / scenecut / bias: 23 / 250 / 40 / 5.00
x265 [info]: Lookahead / bframes / badapt        : 40 / 8 / 2
x265 [info]: b-pyramid / weightp / weightb       : 1 / 1 / 1
x265 [info]: References / ref-limit  cu / depth  : 5 / off / off
x265 [info]: AQ: mode / str / qg-size / cu-tree  : 2 / 1.0 / 32 / 1
x265 [info]: Rate Control / qCompress            : CRF-16.0 / 0.60
x265 [info]: tools: rect amp rd=6 psy-rd=2.00 rdoq=2 psy-rdoq=1.00 rskip
x265 [info]: tools: signhide tmvp b-intra strong-intra-smoothing deblock sao
[mpegts @ 0x793f4c0] service 1 using PCR in pid=256, pcr_period=83ms
[mpegts @ 0x793f4c0] muxrate VBR, sdt every 500 ms, pat/pmt every 100 ms
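
A quick way to check whether the parameters took effect is to pull the relevant values out of the x265 [info] lines instead of eyeballing the whole log. A minimal sketch, assuming the log format shown in this thread (x265 3.2); `x265_log_summary` is an illustrative helper, and the patterns may need adjusting for other versions:

```shell
#!/bin/sh
# x265_log_summary: reads an ffmpeg/x265 log on stdin and prints the
# effective CTU size, motion search range and keyframe interval.
x265_log_summary() {
    sed -n \
        -e 's|^x265 \[info\]: Coding QT: max CU size, min CU size *: *\([0-9][0-9]*\) .*|ctu=\1|p' \
        -e 's|^x265 \[info\]: ME / range / subpel / merge *: *[a-z]* / *\([0-9][0-9]*\) .*|merange=\1|p' \
        -e 's|^x265 \[info\]: Keyframe min / max / scenecut / bias *: *\([0-9][0-9]*\) / *\([0-9][0-9]*\) .*|min-keyint=\1 keyint=\2|p'
}

# Three lines copied from the log above.
printf '%s\n' \
    'x265 [info]: Coding QT: max CU size, min CU size : 64 / 8' \
    'x265 [info]: ME / range / subpel / merge         : star / 57 / 4 / 5' \
    'x265 [info]: Keyframe min / max / scenecut / bias: 23 / 250 / 40 / 5.00' |
    x265_log_summary
```

For the log above this reports ctu=64, merange=57 and a 23/250 keyframe interval, i.e. none of the requested values (not even keyint=100) were applied. With a real encode the log comes from ffmpeg's stderr, so something like `./ffmpeg ... 2>&1 | x265_log_summary` would be the usage.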
Old 23rd December 2019, 23:13   #15  |  Link
foxyshadis
ангел смерти
 
Join Date: Nov 2004
Location: Lost
Posts: 9,558
Quote:
Originally Posted by TEB View Post
Thx for the insight !!

Code:
 ./ffmpeg -loglevel verbose -i ARCHIVE.mov -strict -1 -vf format=yuv420p10 -codec:v libx265  -x265-params keyint=100:min-keyint=100:no-open-gop=1:pmode=1:ctu=32:merange:26 -level 4.1 -preset veryslow -crf 16  -profile:v main10 -y  test-veryslow.ts
You probably meant to put merange=26, not merange:26.
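
This kind of slip is easy to catch before starting a long encode. A minimal sketch that flags any ':'-separated token that isn't written as key=value (assuming every option is spelled key=value, as in the commands in this thread; `check_x265_params` is an illustrative helper, not part of ffmpeg or x265):

```shell
#!/bin/sh
# check_x265_params STRING
# Prints every ':'-separated token that is not of the form key=value
# and returns non-zero if any were found. Runs in a subshell so the
# IFS and globbing changes do not leak into the caller.
check_x265_params() (
    status=0
    IFS=:
    set -f                      # no glob expansion while splitting
    for token in $1; do
        case $token in
            ?*=?*) ;;           # looks like key=value
            *) echo "suspicious token: $token"; status=1 ;;
        esac
    done
    exit "$status"
)

# The parameter string from the quoted command.
check_x265_params "keyint=100:min-keyint=100:no-open-gop=1:pmode=1:ctu=32:merange:26" || true
```

On the quoted string it reports "merange" and "26" as suspicious, which points straight at the merange:26 typo.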
Old 24th December 2019, 09:51   #16  |  Link
excellentswordfight
Lost my old account :(
 
Join Date: Jul 2017
Posts: 322
Quote:
Originally Posted by kuchikirukia View Post
I'm talking about saturating 8 threads vs his 2. You showed ~8 threaded too.
While it doesn't look like he's going to see anything close to a 4x speedup if he fixes his threadedness issue, if it's a reasonable gain it may turn out a difference between running one to two encodes on each CPU vs five on each.
Quote:
Originally Posted by kuchikirukia View Post
A load of 2 means you'd be at 100% load on a dual-core, when the x265 Windows binary will max my 8 threads. (load of 8)

While the veryslow preset doesn't scale out to the 40 threads of your system, it should be able to do 8, and my guess as to why you can't hit that would be an issue with your ffmpeg build.
Well, that's either a typo or not really the case, because the speed he's seeing is in line with using more like 8-12 threads (which is also normal for 1080p at default settings); it's in line with both your Haswell CPU and my Xeon Gold. So there's no real reason to dwell on that.

Last edited by excellentswordfight; 24th December 2019 at 09:53.
Old 24th December 2019, 14:18   #17  |  Link
TEB
Registered User
 
Join Date: Feb 2003
Location: Palmcoast of Norway
Posts: 363
Quote:
Originally Posted by foxyshadis View Post
You probably meant to put merange=26, not merange:26.
Yep, corrected now!
Old 24th December 2019, 14:23   #18  |  Link
TEB
Registered User
 
Join Date: Feb 2003
Location: Palmcoast of Norway
Posts: 363
UPDATE:



Code:
x265 [info]: HEVC encoder version 3.2+2-82a66ce12955
x265 [info]: build info [Linux][GCC 6.3.0][64 bit] 10bit
x265 [info]: using cpu capabilities: MMX2 SSE2Fast LZCNT SSSE3 SSE4.2 AVX FMA3 BMI2 AVX2
x265 [info]: Main 10 profile, Level-4 (Main tier)
x265 [info]: Thread pool created using 64 threads
x265 [info]: Thread pool created using 64 threads
x265 [info]: Slices                              : 1
x265 [info]: frame threads / pool features       : 5 / wpp(34 rows)+pmode
x265 [info]: Coding QT: max CU size, min CU size : 32 / 8
x265 [info]: Residual QT: max TU size, max depth : 32 / 3 inter / 3 intra
x265 [info]: ME / range / subpel / merge         : star / 26 / 4 / 5
x265 [info]: Keyframe min / max / scenecut / bias: 100 / 100 / 40 / 5.00
x265 [info]: Lookahead / bframes / badapt        : 40 / 8 / 2
x265 [info]: b-pyramid / weightp / weightb       : 1 / 1 / 1
x265 [info]: References / ref-limit  cu / depth  : 5 / off / off
x265 [info]: AQ: mode / str / qg-size / cu-tree  : 2 / 1.0 / 32 / 1
x265 [info]: Rate Control / qCompress            : CRF-16.0 / 0.60
x265 [info]: tools: rect amp rd=6 psy-rd=2.00 rdoq=2 psy-rdoq=1.00 rskip
x265 [info]: tools: signhide tmvp b-intra strong-intra-smoothing deblock sao
Code:
./ffmpeg -loglevel verbose -i ARCHIVE.mov -strict -1 -vf format=yuv420p10 -codec:v libx265  -x265-params keyint=100:min-keyint=100:no-open-gop=1:pmode=1:ctu=32:merange=26 -level 4.1 -preset veryslow -crf 16  -profile:v main10 -y  test-veryslow.ts
System Load: 5min avg 21
FPS encoding: 7fps

A load of 21 on a 128-core CPU is a tad low. Any more tips to improve it without moving to lower-quality presets?

TEST1:
I tested the medium preset for the fun of it, but I still had a load of around 25 and ca. 44 fps.
So in other words: higher frame rate, but the load still isn't all that great.

TEST2:
I spawned 4 encoding instances like the one above in veryslow mode and got a load of ca. 97.
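
For reference, running several instances at once can be scripted roughly like this. A minimal sketch, not the exact procedure used for the test: the file names are placeholders, `ENCODE_CMD` is a hypothetical override for dry runs, and the ffmpeg options are copied from the command above with `-nostdin` added so background jobs don't fight over the terminal.

```shell
#!/bin/sh
# encode_one SRC: one veryslow encode with the settings from the command above.
# The output name is derived from the source name (placeholder convention).
encode_one() {
    ffmpeg -nostdin -i "$1" -strict -1 -vf format=yuv420p10 -codec:v libx265 \
        -x265-params keyint=100:min-keyint=100:no-open-gop=1:pmode=1:ctu=32:merange=26 \
        -level 4.1 -preset veryslow -crf 16 -profile:v main10 -y "${1%.*}-veryslow.ts"
}

# encode_all SRC...: start one encode per source in the background, then wait
# for all of them. ENCODE_CMD is a hypothetical hook for dry runs (e.g. echo).
encode_all() {
    for src in "$@"; do
        "${ENCODE_CMD:-encode_one}" "$src" &
    done
    wait
}

# Dry run: prints the (placeholder) file names instead of encoding.
ENCODE_CMD=echo
encode_all clip1.mov clip2.mov clip3.mov clip4.mov
```

Unsetting ENCODE_CMD makes encode_all call encode_one, i.e. the real ffmpeg command, once per file.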

Last edited by TEB; 24th December 2019 at 14:36.
Old 24th December 2019, 14:40   #19  |  Link
microchip8
ffx264/ffhevc author
 
Join Date: May 2007
Location: /dev/video0
Posts: 1,843
Try with pme=1 (parallel motion estimation), but I doubt it'll suddenly saturate all threads. I think you're pushing the threading of x265 itself
__________________
ffx264 || ffhevc || ffxvid || microenc
Old 25th December 2019, 00:32   #20  |  Link
TEB
Registered User
 
Join Date: Feb 2003
Location: Palmcoast of Norway
Posts: 363
Quote:
Originally Posted by froggy1 View Post
Try with pme=1 (parallel motion estimation), but I doubt it'll suddenly saturate all threads. I think you're pushing the threading of x265 itself
Tried, same result
I also tried from another source (HEVC), same result..