Old 8th June 2025, 19:48   #1  |  Link
FranceBB
Broadcast Encoder
 
 
Join Date: Nov 2013
Location: Royal Borough of Kensington & Chelsea, UK
Posts: 3,236
x264 - x86_64 vs ARM 64 The ultimate encoding battle

Hi there,
up until a few years ago, if someone had asked me about encoding with x264 on an ARM CPU, I would have given them a weird look. I always thought ARM CPUs belonged in mobile devices like smartphones, since their main purpose was to be extremely power efficient and last a long time on battery. In other words, I didn't see them ever becoming a thing on desktop computers, let alone in servers running in a datacenter. Yet ARM-powered laptops have become a thing, and more and more people use ARM CPUs as their daily drivers, be it via Qualcomm CPUs on Windows and Linux or Apple's M-series CPUs on macOS. Software support outside of the mobile space got better too, and this recently included frameservers like Avisynth and VapourSynth, decoders like libav, encoders like x264 and of course FFmpeg, so I thought: it's time for a comparison.

In particular, when it comes to x264, there is hand-written assembly for both x86_64 and ARM 64: we have SSE, SSE2, SSSE3, SSE4.1, SSE4.2, AVX, AVX2, AVX512 and FMA on the x86 side and NEON (plus the DotProd, I8MM, SVE and SVE2 flags listed in the table) on the ARM side.

Code:
const x264_cpu_name_t x264_cpu_names[] =
{
#if ARCH_X86 || ARCH_X86_64
//  {"MMX",         X264_CPU_MMX},  // we don't support asm on mmx1 cpus anymore
#define MMX2 X264_CPU_MMX|X264_CPU_MMX2
    {"MMX2",        MMX2},
    {"MMXEXT",      MMX2},
    {"SSE",         MMX2|X264_CPU_SSE},
#define SSE2 MMX2|X264_CPU_SSE|X264_CPU_SSE2
    {"SSE2Slow",    SSE2|X264_CPU_SSE2_IS_SLOW},
    {"SSE2",        SSE2},
    {"SSE2Fast",    SSE2|X264_CPU_SSE2_IS_FAST},
    {"LZCNT",       SSE2|X264_CPU_LZCNT},
    {"SSE3",        SSE2|X264_CPU_SSE3},
    {"SSSE3",       SSE2|X264_CPU_SSE3|X264_CPU_SSSE3},
    {"SSE4.1",      SSE2|X264_CPU_SSE3|X264_CPU_SSSE3|X264_CPU_SSE4},
    {"SSE4",        SSE2|X264_CPU_SSE3|X264_CPU_SSSE3|X264_CPU_SSE4},
    {"SSE4.2",      SSE2|X264_CPU_SSE3|X264_CPU_SSSE3|X264_CPU_SSE4|X264_CPU_SSE42},
#define AVX SSE2|X264_CPU_SSE3|X264_CPU_SSSE3|X264_CPU_SSE4|X264_CPU_SSE42|X264_CPU_AVX
    {"AVX",         AVX},
    {"XOP",         AVX|X264_CPU_XOP},
    {"FMA4",        AVX|X264_CPU_FMA4},
    {"FMA3",        AVX|X264_CPU_FMA3},
    {"BMI1",        AVX|X264_CPU_LZCNT|X264_CPU_BMI1},
    {"BMI2",        AVX|X264_CPU_LZCNT|X264_CPU_BMI1|X264_CPU_BMI2},
#define AVX2 AVX|X264_CPU_FMA3|X264_CPU_LZCNT|X264_CPU_BMI1|X264_CPU_BMI2|X264_CPU_AVX2
    {"AVX2",        AVX2},
    {"AVX512",      AVX2|X264_CPU_AVX512},
#undef AVX2
#undef AVX
#undef SSE2
#undef MMX2
    {"Cache32",         X264_CPU_CACHELINE_32},
    {"Cache64",         X264_CPU_CACHELINE_64},
    {"SlowAtom",        X264_CPU_SLOW_ATOM},
    {"SlowPshufb",      X264_CPU_SLOW_PSHUFB},
    {"SlowPalignr",     X264_CPU_SLOW_PALIGNR},
    {"SlowShuffle",     X264_CPU_SLOW_SHUFFLE},
    {"UnalignedStack",  X264_CPU_STACK_MOD4},
#elif ARCH_PPC
    {"Altivec",         X264_CPU_ALTIVEC},
#elif ARCH_ARM
    {"ARMv6",           X264_CPU_ARMV6},
    {"NEON",            X264_CPU_NEON},
    {"FastNeonMRC",     X264_CPU_FAST_NEON_MRC},
#elif ARCH_AARCH64
    {"ARMv8",           X264_CPU_ARMV8},
    {"NEON",            X264_CPU_NEON},
    {"DotProd",         X264_CPU_DOTPROD},
    {"I8MM",            X264_CPU_I8MM},
    {"SVE",             X264_CPU_SVE},
    {"SVE2",            X264_CPU_SVE2},
#elif ARCH_MIPS
    {"MSA",             X264_CPU_MSA},
#elif ARCH_LOONGARCH
    {"LSX",             X264_CPU_LSX},
    {"LASX",            X264_CPU_LASX},
#endif
    {"", 0},
};
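
As a side note, one way to see which of these flags a given Linux machine could use is to look at what the kernel reports in /proc/cpuinfo. The following is a minimal sketch and not part of the benchmark: the function names are made up for illustration, and the mapping from x264's AArch64 names to kernel feature names (for example asimd for NEON and asimddp for DotProd) is an assumption about how the two line up.

```python
from pathlib import Path

# Assumed mapping from x264's AArch64 flag names to Linux /proc/cpuinfo names.
X264_TO_KERNEL_AARCH64 = {
    "NEON": "asimd",
    "DotProd": "asimddp",
    "I8MM": "i8mm",
    "SVE": "sve",
    "SVE2": "sve2",
}


def parse_cpu_features(cpuinfo_text):
    """Return the features from the first 'flags' (x86) or 'Features' (ARM) line."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() in ("flags", "Features"):
            return set(value.split())
    return set()


def x264_aarch64_flags(features):
    """List the x264 AArch64 flag names whose assumed kernel feature is present."""
    return [name for name, kernel in X264_TO_KERNEL_AARCH64.items() if kernel in features]


if __name__ == "__main__":
    cpuinfo = Path("/proc/cpuinfo")
    if cpuinfo.exists():  # Linux only
        feats = parse_cpu_features(cpuinfo.read_text())
        print(sorted(feats))
        print(x264_aarch64_flags(feats))
```

x264 itself prints the capabilities it detected at the start of an encode, so that log line remains the authoritative answer; the sketch only shows what the kernel exposes.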

To make this comparison fair and avoid a "David vs Goliath" benchmark, I picked two EC2 instances which are identical in terms of cores/threads and RAM, in particular:

x86_64
c6i.2xlarge 8c/8th 16GB RAM

ARM 64
c6g.2xlarge 8c/8th 16GB RAM

In other words, we have two Virtual Machines where the x86 one is powered by an Intel Xeon Platinum 8375C (Ice Lake) host, while the ARM 64 one is powered by a Graviton 2 which uses the ARMv8 Neoverse-N1 cores.

For the test, Linux was used, in particular Ubuntu 24.04 running FFmpeg 6.1.1 Stable. Each EC2 instance had 2TB of attached storage to perform the calculations on, so the benchmark essentially consisted of:

1) Spinning up the EC2
2) Transferring a mezzanine file from an S3 bucket to the 2TB attached storage of the EC2
3) Triggering the encode to create the final output files
4) Delivering those files back to S3
5) Shutting down the EC2

The power up / power down times and the file transfer times were then subtracted from the total job time, leaving only the actual computation time.

A total of 7 sources were used and in all cases the input file was a standard XDCAM-50 file with DolbyE Italian, DolbyE Original, PCM Stereo Italian, PCM Stereo Original. In particular:

Video:
FULL HD 1920x1080 MPEG-2 High 4:2:2 Profile, Level High 50 Mbit/s yv16 25i TFF BT709 SDR

Audio:
Track1 DolbyE 5.1 44800Hz 20bit Italian
Track2 DolbyE 5.1 44800Hz 20bit Original
Track3 PCM 2.0 48000Hz 24bit Italian
Track4 PCM 2.0 48000Hz 24bit Original

The 44800Hz in the DolbyE tracks refers to the internal sampling rate for that stream at 25fps (1792 samples * 25 frames per second = 44800 Hertz), which is always resampled to 48000Hz when played back on a hardware decoder.


The encoding job consisted of 6 steps:

Step 1: Encoding the video
FULL HD H.264 Profile High Level 4.1 Ref 4 CRF 25 4:2:0 Limited TV Range 8bit planar BT709 SDR

Step 2: Encoding the audio in AAC
Track1 AAC 5.1 550 kbit/s 48000Hz Italian
Track2 AAC 5.1 550 kbit/s 48000Hz Original
Track3 AAC 2.0 384 kbit/s 48000Hz Italian
Track4 AAC 2.0 384 kbit/s 48000Hz Original

Step 3: Encoding the audio in Opus as a proxy
Track1 Proxy: Opus Mono 64 kbit/s Italian
Track2 Proxy: Opus Mono 64 kbit/s Original

Step 4: Encoding the video in H.264 as a proxy with watermark + muxing the already encoded audio
SD H.264 Profile High Level 4.1 Ref 4 CRF 25 4:2:0 Limited TV Range 8bit planar BT709 SDR

Step 5: Muxing the FULL HD video and the 5.1 AAC audio in MP4

Step 6: Extracting a low resolution thumbnail from the middle of the video and encoding it in JPEG

The command line used is reported as follows:

Quote:
#BT709
#Video only
ffmpeg -i $inputSpec:myInput -map 0:v -c:v libx264 -profile:v high -level:v 4.1 -refs 4 -crf 25 -ignore_chapters 1 -ignore_unknown -write_tmcd 0 -movflags faststart -vf "sidedata=delete,metadata=delete,bwdif=mode=0:parity=0:deint=0,scale=w=1920:h=1080:flags=lanczos:sws_dither=ed,format=yuv420p,setfield=prog,setsar=1:1,fps=25" -x264opts "opencl:keyint=25:force_cfr=1:deblock=-1,-1:aud=1:overscan=show:colorprim=bt709:fullrange=off:transfer=bt709:colormatrix=bt709" -color_primaries bt709 -color_trc bt709 -colorspace bt709 -color_range tv -field_order progressive -brand mp42 -max_muxing_queue_size 700 -map_metadata -1 -metadata creation_time=now -an -f mp4 -y $jobOutputFolder:Video_Only.mp4

#CH.1-2 DolbyE 5.1 - CH.3-4 DolbyE 5.1 - CH.5-6 stereo - CH.7-8 stereo audio track
#Extract DolbyE track 1 and 2
ffmpeg -i $inputSpec:myInput -map 0:1 -acodec copy -f u8 -y $jobOutputFolder:stream1.u8
ffmpeg -i $inputSpec:myInput -map 0:2 -acodec copy -f u8 -y $jobOutputFolder:stream2.u8
#Encoding stereo track 3 and 4
ffmpeg -i $inputSpec:myInput -map 0:3 -c:a aac -b:a 384k -ar 48000 -y $jobOutputFolder:myOutputCh56.m4a
ffmpeg -i $inputSpec:myInput -map 0:4 -c:a aac -b:a 384k -ar 48000 -y $jobOutputFolder:myOutputCh78.m4a
#Extract each channel of DolbyE 5.1 ITA and DolbyE 5.1 ORI
ffmpeg -i $jobOutputFolder:stream1.u8 -acodec pcm_s24le -ar 48000 -ac 1 -map_channel 0.0.0:0.0.0 -y $jobOutputFolder:ITA_FL.wav
ffmpeg -i $jobOutputFolder:stream1.u8 -acodec pcm_s24le -ar 48000 -ac 1 -map_channel 0.0.1:0.0.0 -y $jobOutputFolder:ITA_FR.wav
ffmpeg -i $jobOutputFolder:stream1.u8 -acodec pcm_s24le -ar 48000 -ac 1 -map_channel 0.0.2:0.0.0 -y $jobOutputFolder:ITA_CC.wav
ffmpeg -i $jobOutputFolder:stream1.u8 -acodec pcm_s24le -ar 48000 -ac 1 -map_channel 0.0.3:0.0.0 -y $jobOutputFolder:ITA_LFE.wav
ffmpeg -i $jobOutputFolder:stream1.u8 -acodec pcm_s24le -ar 48000 -ac 1 -map_channel 0.0.4:0.0.0 -y $jobOutputFolder:ITA_SL.wav
ffmpeg -i $jobOutputFolder:stream1.u8 -acodec pcm_s24le -ar 48000 -ac 1 -map_channel 0.0.5:0.0.0 -y $jobOutputFolder:ITA_SR.wav
ffmpeg -i $jobOutputFolder:stream2.u8 -acodec pcm_s24le -ar 48000 -ac 1 -map_channel 0.0.0:0.0.0 -y $jobOutputFolder:ORI_FL.wav
ffmpeg -i $jobOutputFolder:stream2.u8 -acodec pcm_s24le -ar 48000 -ac 1 -map_channel 0.0.1:0.0.0 -y $jobOutputFolder:ORI_FR.wav
ffmpeg -i $jobOutputFolder:stream2.u8 -acodec pcm_s24le -ar 48000 -ac 1 -map_channel 0.0.2:0.0.0 -y $jobOutputFolder:ORI_CC.wav
ffmpeg -i $jobOutputFolder:stream2.u8 -acodec pcm_s24le -ar 48000 -ac 1 -map_channel 0.0.3:0.0.0 -y $jobOutputFolder:ORI_LFE.wav
ffmpeg -i $jobOutputFolder:stream2.u8 -acodec pcm_s24le -ar 48000 -ac 1 -map_channel 0.0.4:0.0.0 -y $jobOutputFolder:ORI_SL.wav
ffmpeg -i $jobOutputFolder:stream2.u8 -acodec pcm_s24le -ar 48000 -ac 1 -map_channel 0.0.5:0.0.0 -y $jobOutputFolder:ORI_SR.wav
#Audio only
ffmpeg -i $jobOutputFolder:ITA_FL.wav -i $jobOutputFolder:ITA_FR.wav -i $jobOutputFolder:ITA_CC.wav -i $jobOutputFolder:ITA_LFE.wav -i $jobOutputFolder:ITA_SL.wav -i $jobOutputFolder:ITA_SR.wav -filter_complex "[0:a][1:a][2:a][3:a][4:a][5:a]join=inputs=6:channel_layout=5.1:map=0.0-FL|1.0-FR|2.0-FC|3.0-LFE|4.0-BL|5.0-BR[a]" -map "[a]" -c:a aac -b:a 550k -ar 48000 -y $jobOutputFolder:myOutputCh12.m4a
ffmpeg -i $jobOutputFolder:ORI_FL.wav -i $jobOutputFolder:ORI_FR.wav -i $jobOutputFolder:ORI_CC.wav -i $jobOutputFolder:ORI_LFE.wav -i $jobOutputFolder:ORI_SL.wav -i $jobOutputFolder:ORI_SR.wav -filter_complex "[0:a][1:a][2:a][3:a][4:a][5:a]join=inputs=6:channel_layout=5.1:map=0.0-FL|1.0-FR|2.0-FC|3.0-LFE|4.0-BL|5.0-BR[a]" -map "[a]" -c:a aac -b:a 550k -ar 48000 -y $jobOutputFolder:myOutputCh34.m4a

#Audio for proxy
ffmpeg -i $jobOutputFolder:myOutputCh12.m4a -ac 1 -c:a libopus -b:a 64k -y $jobOutputFolder:myOutputCh12_Transcribe.ogg
ffmpeg -i $jobOutputFolder:myOutputCh34.m4a -ac 1 -c:a libopus -b:a 64k -y $jobOutputFolder:myOutputCh34_Transcribe.ogg

#Muxed mono audio with TC & Watermark
ffmpeg -i $jobOutputFolder:Video_Only.mp4 -i $jobOutputFolder:myOutputCh12.m4a -i $jobOutputFolder:myOutputCh34.m4a -map 0:0 -map 1:0 -map 2:0 -vf "fps=25,scale=w=1024:h=576:flags=lanczos:sws_dither=ed,setfield=prog,setsar=1:1","drawtext=\timecode='10\:00\:00\:00':timecode_rate=25:x=(w-tw)/2:y=h-(1*lh):fontcolor=white@1:fontsize=25:box=1:boxcolor=black@0.6","drawtext=\text='Internal Use Only':x=(w-text_w)/2:y=(h-text_h)/2:fontcolor=white@0.1:fontsize=125:line_spacing=100" -c:v libx264 -profile:v high -level:v 4.1 -refs 4 -pix_fmt yuv420p -crf 25 -x264opts "opencl:keyint=25:force_cfr=1:deblock=-1,-1:aud=1:overscan=show:colorprim=bt709:fullrange=off:transfer=bt709:colormatrix=bt709" -color_primaries bt709 -color_trc bt709 -colorspace bt709 -color_range tv -field_order progressive -brand mp42 -max_muxing_queue_size 700 -map_metadata -1 -metadata creation_time=now -ignore_chapters 1 -ignore_unknown -write_tmcd 0 -movflags faststart -c:a copy -f mp4 -y $jobOutputFolder:Subtitling_Proxy.mp4

#Muxed audio and video
ffmpeg -i $jobOutputFolder:Video_Only.mp4 -i $jobOutputFolder:myOutputCh12.m4a -i $jobOutputFolder:myOutputCh34.m4a -map 0:v -map 1:a -map 2:a -c:v copy -c:a copy -f mp4 -y $jobOutputFolder:my_Muxed_Output.mp4


#Thumbnail
ffmpeg -ss 01:02:36.280 -i $jobOutputFolder:Video_Only.mp4 -vf "thumbnail=300,scale=w=240:h=136,setsar=1:1" -sws_flags lanczos -frames:v 1 -y $jobOutputFolder:thumb.jpg

One last note to keep in mind is that using an ARM CPU is 20% cheaper than using an x86 one, which means that, in theory, if the ARM CPU was as fast as the x86 one, then it would potentially save 20% of the cost.
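
To put a number on that: if the bill scales linearly with run time (an assumption, since the billed cost may also include other items such as storage), a 20% cheaper instance breaks even when it takes 25% longer, because 0.8 * 1.25 = 1. A minimal sketch, with function names invented for illustration:

```python
def relative_cost(time_ratio, discount=0.20):
    """Cost of the cheaper instance relative to the baseline (1.0 means same cost).

    time_ratio is (run time on the cheaper instance) / (run time on the baseline).
    """
    return (1.0 - discount) * time_ratio


def break_even_time_ratio(discount=0.20):
    """Time ratio at which the cheaper instance costs exactly as much as the baseline."""
    return 1.0 / (1.0 - discount)


print(break_even_time_ratio())  # how much longer the cheaper instance may take
print(relative_cost(1.0))       # equally fast: the full discount is kept
```

Any time ratio above the break-even value makes the nominally cheaper instance the more expensive one overall.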

Spoiler alert: this didn't happen.


Benchmark results:

Movie 1:
Title: Nope
Duration: 02:05:12:16
c6i.2xlarge x86 Encoding Duration: 2h 24m 6s
c6g.2xlarge ARM Encoding Duration: 3h 25m 43s
x86 cost: $7.50
ARM cost: $8.55

Result: ARM was 42.77% slower and 14.07% more expensive


Movie 2:
Title: Novocaine
Duration: 01:45:19:02
c6i.2xlarge x86 Encoding Duration: 1h 51m 47s
c6g.2xlarge ARM Encoding Duration: 2h 48m 10s
x86 cost: $6.30
ARM cost: $7.58

Result: ARM was 50.44% slower - 20.38% more expensive


Movie 3:
Title: Absolutely anything
Duration: 01:22:16:21
c6i.2xlarge x86 Encoding Duration: 1h 30m 19s
c6g.2xlarge ARM Encoding Duration: 2h 11m 10s
x86 cost: $4.94
ARM cost: $5.74

Result: ARM was 45.23% slower - 16.26% more expensive



Movie 4:
Title: Catch me if you can
Duration: 02:15:11:07
c6i.2xlarge x86 Encoding Duration: 3h 4m 15s
c6g.2xlarge ARM Encoding Duration: 4h 15m 25s
x86 cost: $8.11
ARM cost: $8.99

Result: ARM was 38.62% slower - 10.89% more expensive



Movie 5:
Title: Me before you
Duration: 01:45:57:00
c6i.2xlarge x86 Encoding Duration: 1h 51m 27s
c6g.2xlarge ARM Encoding Duration: 2h 43m 48s
x86 cost: $6.69
ARM cost: $7.87

Result: ARM was 47.08% slower - 17.72% more expensive



Movie 6:
Title: The lucky one
Duration: 01:36:54:00
c6i.2xlarge x86 Encoding Duration: 1h 56m 43s
c6g.2xlarge ARM Encoding Duration: 2h 46m 20s
x86 cost: $7.00
ARM cost: $7.97

Result: ARM was 42.51% slower - 13.98% more expensive




Movie 7:
Title: Shattered
Duration: 01:30:59:24
c6i.2xlarge x86 Encoding Duration: 1h 49m 22s
c6g.2xlarge ARM Encoding Duration: 2h 40m 6s
x86 cost: $6.56
ARM cost: $7.68

Result: ARM was 46.39% slower - 17.22% more expensive



In other words, on average, encoding on the ARM CPU took 44.72% longer than on the equivalent x86 CPU and, once we factor in the cost, despite the ARM instance being 20% cheaper to run, the much longer encoding time makes it 15.78% more expensive in real terms.
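
The slowdown percentages can be recomputed from the encoding durations listed above. A minimal sketch: the helper names are illustrative, and "slower" is read here as "percent more wall-clock time", which is an interpretation of the figures rather than something stated explicitly.

```python
def to_seconds(hours, minutes, seconds):
    return hours * 3600 + minutes * 60 + seconds


def extra_time_pct(x86, arm):
    """Percent of additional wall-clock time ARM needed compared to x86."""
    return (to_seconds(*arm) / to_seconds(*x86) - 1.0) * 100.0


# (x86 duration, ARM duration) as (h, m, s), copied from the results above.
runs = {
    "Nope": ((2, 24, 6), (3, 25, 43)),
    "Novocaine": ((1, 51, 47), (2, 48, 10)),
    "Absolutely anything": ((1, 30, 19), (2, 11, 10)),
    "Catch me if you can": ((3, 4, 15), (4, 15, 25)),
    "Me before you": ((1, 51, 27), (2, 43, 48)),
    "The lucky one": ((1, 56, 43), (2, 46, 20)),
    "Shattered": ((1, 49, 22), (2, 40, 6)),
}

percentages = {title: extra_time_pct(x86, arm) for title, (x86, arm) in runs.items()}
for title, pct in percentages.items():
    print(f"{title}: ARM took {pct:.2f}% longer")
print(f"Mean: {sum(percentages.values()) / len(percentages):.2f}%")
```

Small differences from the posted per-title figures are possible, since the durations are given to the second and the costs to the cent.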
Old 9th June 2025, 09:18   #2  |  Link
excellentswordfight
Lost my old account :(
 
Join Date: Jul 2017
Posts: 374
Very cool test, thank you for sharing that. But Graviton2 has cores derived from Cortex-A76, which is very far from cutting-edge ARM performance. Although it's very interesting to see that they have worse performance per dollar, it would also be interesting to see, from a pure performance standpoint, how Graviton4 instances perform.

Does anyone know if there is any significant difference between x264 and x265 when it comes to ARM and NEON optimization? I've seen quite a bit for x265, but I don't follow x264 development that much anymore.

Last edited by excellentswordfight; 9th June 2025 at 09:26.
Old 9th June 2025, 10:01   #3  |  Link
Z2697
Registered User
 
Join Date: Aug 2024
Posts: 576
A test on newer generation of Graviton would be great!
And there's even Mac?
Old 9th June 2025, 10:57   #4  |  Link
rwill
Registered User
 
Join Date: Dec 2013
Location: Berlin, Germany
Posts: 463
Looks like a Decoder and I/O benchmark to me. Not really x264 specific.
__________________
My github...
Old 9th June 2025, 16:45   #5  |  Link
Z2697
Registered User
 
Join Date: Aug 2024
Posts: 576
Quote:
Originally Posted by rwill View Post
Looks like a Decoder and I/O benchmark to me. Not really x264 specific.
I'd assume a 50Mbps MPEG2 source won't be a bottleneck?
Old 10th June 2025, 05:46   #6  |  Link
rwill
Registered User
 
Join Date: Dec 2013
Location: Berlin, Germany
Posts: 463
Quote:
Originally Posted by Z2697 View Post
I'd assume a 50Mbps MPEG2 source won't be a bottleneck?
No, but isn't the default preset of x264 'medium'? I do not know really what that script did the whole time for a couple of hours but sure enough it was not H.264 encoding. Also the numbers are all over the place and do not check out.
__________________
My github...
Old 10th June 2025, 06:19   #7  |  Link
Z2697
Registered User
 
Join Date: Aug 2024
Posts: 576
Quote:
Originally Posted by rwill View Post
No, but isn't the default preset of x264 'medium'? I do not know really what that script did the whole time for a couple of hours but sure enough it was not H.264 encoding. Also the numbers are all over the place and do not check out.
He does the encoding twice for each title, on a cloud instance with low-frequency virtual cores.
Yes, maybe it still doesn't add up perfectly, but it makes some sense I guess.
Old 10th June 2025, 08:22   #8  |  Link
rwill
Registered User
 
Join Date: Dec 2013
Location: Berlin, Germany
Posts: 463
Quote:
Originally Posted by Z2697 View Post
He does the encoding twice for each title, on a cloud instance with low-frequency virtual cores.
Yes, maybe it still doesn't add up perfectly, but it makes some sense I guess.
Yes, I have seen that he encodes at FHD and at something close to qHD. I can read ffmpeg scripts. Thank you.

An 8 core Graviton 2 instance should clock at around 2.5 GHz and should be well faster than my 16 GB Raspberry Pi at x264 if the majority of the time is spent on x264 encoding. But according to FranceBB's numbers it hardly is.
__________________
My github...
Old 10th June 2025, 13:12   #9  |  Link
Z2697
Registered User
 
Join Date: Aug 2024
Posts: 576
Quote:
Originally Posted by rwill View Post
Yes, I have seen that he encodes at FHD and at something close to qHD. I can read ffmpeg scripts. Thank you.

An 8 core Graviton 2 instance should clock at around 2.5 GHz and should be well faster than my 16 GB Raspberry Pi at x264 if the majority of the time is spent on x264 encoding. But according to FranceBB's numbers it hardly is.
Don't forget that it will almost certainly have to share CPU resources with other instances running on the same host.
But yeah, I don't disagree that this test includes too much noise, more than the title suggests - an x264 encoding battle.

What's the x264 medium speed of your Raspberry Pi?
Old 10th June 2025, 15:02   #10  |  Link
rwill
Registered User
 
Join Date: Dec 2013
Location: Berlin, Germany
Posts: 463
Quote:
Originally Posted by Z2697 View Post
Don't forget that it will almost certainly have to share CPU resources with other instances running on the same host.
But yeah, I don't disagree that this test includes too much noise, more than the title suggests - an x264 encoding battle.

What's the x264 medium speed of your Raspberry Pi?
It's a Raspberry Pi 5, so hardly more advanced than Graviton 2.

For some Hollywood Movie intro pan over some detailed desert + person with some light grain:

Code:
x264 --input-res 1920x1080 --input-depth 8 --fps 24 --crf 24 -o trash.264 snip_1920x1080_8b.yuv
-->
encoded 120 frames, 21.73 fps, 2398.65 kb/s
and for the smaller resolution

Code:
x264 --input-res 1024x576 --input-depth 8 --fps 24 --crf 24 -o trash.264 snip_1024x576_8b.yuv
-->
encoded 120 frames, 69.77 fps, 862.13 kb/s
*edit*
x264 is:
Code:
x264 0.164.3095 baee400
built on Apr 12 2023, gcc: 12.2.0
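
For a rough sense of scale, those two fps figures can be extrapolated to a feature-length title. This is only a back-of-the-envelope sketch: it assumes the 120-frame clip is representative of a whole film, ignores decoding, deinterlacing, scaling, audio and the different CRF, and borrows the 02:05:12:16 duration at 25 fps from the first post purely as an example. The function names are illustrative.

```python
def timecode_to_frames(timecode, fps=25):
    """Convert an HH:MM:SS:FF timecode to a frame count at the given frame rate."""
    hh, mm, ss, ff = (int(part) for part in timecode.split(":"))
    return (hh * 3600 + mm * 60 + ss) * fps + ff


def encode_seconds(frames, encode_fps):
    """Time needed to encode `frames` frames at a constant encoding speed."""
    return frames / encode_fps


frames = timecode_to_frames("02:05:12:16")
full_hd = encode_seconds(frames, 21.73)  # 1920x1080 speed measured above
proxy = encode_seconds(frames, 69.77)    # 1024x576 speed measured above
total = full_hd + proxy
print(f"{frames} frames, roughly {total / 3600:.2f} hours for both encodes")
```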
__________________
My github...

Last edited by rwill; 10th June 2025 at 15:09.
Old 11th June 2025, 19:30   #11  |  Link
GeoffreyA
Registered User
 
Join Date: Jun 2024
Location: South Africa
Posts: 365
Thanks for this test, FranceBB. It would be interesting if we could get a fair test between x86 and Apple's CPUs, which, as far as I'm aware, are the best ARM implementation.
Old 12th June 2025, 18:36   #12  |  Link
j7n
Registered User
 
 
Join Date: Apr 2006
Posts: 167
Is the "cost" chosen arbitrarily by the provider of the virtual machine, or does it reflect the price of the computer plus electricity costs? Do you actually get charged by time used, $8 for three hours or so? That would get unsustainable real quick.
Old 12th June 2025, 22:33   #13  |  Link
FranceBB
Broadcast Encoder
 
 
Join Date: Nov 2013
Location: Royal Borough of Kensington & Chelsea, UK
Posts: 3,236
Quote:
Originally Posted by j7n View Post
Is the "cost" chosen arbitrarily by the provider of the virtual machine, or does it reflect the price of the computer plus electricity costs?
It's set by the provider of the virtual machines. In this case, the tests were run via SDVI on Amazon's infrastructure, so it includes the cost of the Elastic Block Store (EBS) storage.

Quote:
Originally Posted by j7n View Post
That would get unsustainable real quick.
Yep... But the "benefit" of the cloud running open source software is the scalability. This is the current situation in Prod (it's 10PM on a Thursday) for Avisynth for instance:



Basically, when you deploy something, a "golden" AMI is created. You can think of the AMI as an .ovf file containing the virtual machine and its configuration. Once those are deployed you can have an elastic farm, so that you start with 1 instance which is shut down. In the case of FFAStrans (FFMpeg Avisynth Transcoder), for instance, when a job comes in, this EC2 spins up, the file is transferred from the S3 bucket to the 2TB attached storage, the rest_service starts, a POST is triggered to import the workflow, another POST is made to trigger the workflow and a series of GETs is made afterwards to poll the status. Once the job is over and the file is encoded, it gets transferred from the 2TB attached storage to the S3 bucket and the EC2 is shut down. Clearly, if more jobs come in and more EC2 instances are required, those will be created automatically and - as you can see from the screenshot - I currently have 87 EC2 instances created, of which 7 are running and executing jobs at this very moment, and up to 620 can be created (I can eventually increase this limit). Each EC2 has a "grace period" of 1 week, so if it's not used and stays shut down for more than 1 week, it gets automatically deleted and will be recreated if needed.
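
The scaling rules described above boil down to two checks: whether a new instance may still be created, and whether a stopped one has outlived its grace period. A minimal sketch of that logic, not the actual SDVI or AWS implementation: the class and function names are hypothetical, and only the numbers (620 instances, 1 week) come from the description.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

MAX_INSTANCES = 620                # upper limit mentioned in the post
GRACE_PERIOD = timedelta(weeks=1)  # idle time after which a stopped instance is deleted


@dataclass
class Instance:  # hypothetical record, not an AWS type
    name: str
    running: bool
    last_stopped: datetime  # when the instance was last shut down


def can_create(existing_count, limit=MAX_INSTANCES):
    """True while the farm is still below its instance limit."""
    return existing_count < limit


def expired(instances, now, grace=GRACE_PERIOD):
    """Stopped instances that have been idle for longer than the grace period."""
    return [i for i in instances if not i.running and now - i.last_stopped > grace]


now = datetime(2025, 6, 12, 22, 0)
farm = [
    Instance("busy", True, now - timedelta(days=30)),
    Instance("stale", False, now - timedelta(days=8)),
    Instance("recent", False, now - timedelta(days=2)),
]
print([i.name for i in expired(farm, now)])
print(can_create(87))
```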

Obviously this is all fun and games 'cause we're only talking about machines with CPU and RAM, but if we were to include dedicated resources like a GPU then you could end up in a situation in which your machine won't be created for a long time 'cause there may not be any availability in the region so you have to wait until some other AWS customer finishes using it. With CPU only machines, however, this never happens.


The advantage of using open source software, in general, compared to closed source software you buy a license for is that for closed source software you can only deploy as many instances as the licenses you bought. Like, suppose you bought 10 licenses from company A, then you can deploy 10 provisioned instances which will always be there, they will power up and down, you don't have to wait for them to be created, but you won't be able to scale to, let's say, the 11th one, 'cause you don't have the license for it. Some providers offer you the option to use their cloud offering instead, so instead of buying the license, you never buy it and you use those as a sort of "pay-as-you-go", which allows you to scale, however every time you trigger a job you pay a bit more than you would have paid 'cause they're actually also charging the license.


This is more general, though, but as far as encoding is concerned, unless you need something incredibly specific like, I don't know, the Dolby Media Encoder to encode DAMF in AC4 etc or things like FAB to mux .stl subtitles in an .mxf container as a 436m track to carry Teletext Subtitles or some other peculiar use case not covered by open source software, you're better off with open source, which is why I'm proud not just to be using Avisynth but also of the fact that the company I work for is directly contributing to x265 as they're one of the Multicoreware partners (and have been for a very long time).

Last edited by FranceBB; 12th June 2025 at 22:39.
Old 13th June 2025, 09:34   #14  |  Link
excellentswordfight
Lost my old account :(
 
Join Date: Jul 2017
Posts: 374
Quote:
Originally Posted by GeoffreyA View Post
Thanks for this test, FranceBB. It would be interesting if we could get a fair test between x86 and Apple's CPUs, which, as far as I'm aware, are the best ARM implementation.
Phoronix has a decent test with the M4 (standard) and AMD Strix Point (the best and most efficient x86 models available that are in a somewhat similar power range). My guesstimate is that the M4 Pro has about the same x265 performance as the HX 370. So it looks like Zen 5 and M4 in general have about the same performance for x265. But as you can see, Strix Point has 2x the performance/W of its desktop parts, so it's very hard to extrapolate this to more performance-focused designs (M4 Max in Mac Studio or server CPUs).



Last edited by excellentswordfight; 13th June 2025 at 09:58.
Old Yesterday, 05:56   #15  |  Link
Ritsuka
Registered User
 
Join Date: Mar 2007
Posts: 104
x265 got so many Neon optimisations after 3.6, making that benchmark meaningless. The latest x265 master branch should be ~ 50% or more faster on Arm than 3.6.
Old Yesterday, 12:42   #16  |  Link
GeoffreyA
Registered User
 
Join Date: Jun 2024
Location: South Africa
Posts: 365
Quote:
Originally Posted by excellentswordfight View Post
Phoronix has a decent test with the M4 (standard) and AMD Strix Point (the best and most efficient x86 models available that are in a somewhat similar power range). My guesstimate is that the M4 Pro has about the same x265 performance as the HX 370. So it looks like Zen 5 and M4 in general have about the same performance for x265. But as you can see, Strix Point has 2x the performance/W of its desktop parts, so it's very hard to extrapolate this to more performance-focused designs (M4 Max in Mac Studio or server CPUs).


Thanks, excellentswordfight. I wonder if the x86 branch's having more SIMD implemented is leading to a weaker showing for the M4.