Welcome to Doom9's Forum, THE in-place to be for everyone interested in DVD conversion. Before you start posting please read the forum rules. By posting to this forum you agree to abide by the rules. |
#1 | Link | |
Broadcast Encoder
Join Date: Nov 2013
Location: Royal Borough of Kensington & Chelsea, UK
Posts: 3,236
|
x264 - x86_64 vs ARM 64 The ultimate encoding battle
Hi there,
up until a few years ago, if someone had asked me about encoding with x264 on an ARM CPU, I would have given them a funny look. I always thought ARM CPUs belonged in mobile devices like smartphones, since their main purpose was to be extremely power efficient and to last a long time on battery. In other words, I didn't see them ever becoming a thing on desktop computers, let alone in a server running in a datacenter.

Yet ARM-powered laptops have become a thing, and more and more people use ARM CPUs as their daily drivers, be it via the Qualcomm CPUs on Windows and Linux or the Apple M CPUs on macOS. Software support outside of the mobile space got better too, and this recently included frameservers like Avisynth and VapourSynth, decoders like libav, encoders like x264 and of course FFmpeg, so I thought: it's time for a comparison.

When it comes to x264 in particular, there are hand-written SIMD optimizations in assembly for both x86_64 and ARM 64: we have SSE, SSE2, SSSE3, SSE4.1, SSE4.2, AVX, AVX2, AVX512 and FMA on the x86 side and NEON on the ARM side. Code:
    const x264_cpu_name_t x264_cpu_names[] =
    {
    #if ARCH_X86 || ARCH_X86_64
    //  {"MMX", X264_CPU_MMX}, // we don't support asm on mmx1 cpus anymore
    #define MMX2 X264_CPU_MMX|X264_CPU_MMX2
        {"MMX2", MMX2},
        {"MMXEXT", MMX2},
        {"SSE", MMX2|X264_CPU_SSE},
    #define SSE2 MMX2|X264_CPU_SSE|X264_CPU_SSE2
        {"SSE2Slow", SSE2|X264_CPU_SSE2_IS_SLOW},
        {"SSE2", SSE2},
        {"SSE2Fast", SSE2|X264_CPU_SSE2_IS_FAST},
        {"LZCNT", SSE2|X264_CPU_LZCNT},
        {"SSE3", SSE2|X264_CPU_SSE3},
        {"SSSE3", SSE2|X264_CPU_SSE3|X264_CPU_SSSE3},
        {"SSE4.1", SSE2|X264_CPU_SSE3|X264_CPU_SSSE3|X264_CPU_SSE4},
        {"SSE4", SSE2|X264_CPU_SSE3|X264_CPU_SSSE3|X264_CPU_SSE4},
        {"SSE4.2", SSE2|X264_CPU_SSE3|X264_CPU_SSSE3|X264_CPU_SSE4|X264_CPU_SSE42},
    #define AVX SSE2|X264_CPU_SSE3|X264_CPU_SSSE3|X264_CPU_SSE4|X264_CPU_SSE42|X264_CPU_AVX
        {"AVX", AVX},
        {"XOP", AVX|X264_CPU_XOP},
        {"FMA4", AVX|X264_CPU_FMA4},
        {"FMA3", AVX|X264_CPU_FMA3},
        {"BMI1", AVX|X264_CPU_LZCNT|X264_CPU_BMI1},
        {"BMI2", AVX|X264_CPU_LZCNT|X264_CPU_BMI1|X264_CPU_BMI2},
    #define AVX2 AVX|X264_CPU_FMA3|X264_CPU_LZCNT|X264_CPU_BMI1|X264_CPU_BMI2|X264_CPU_AVX2
        {"AVX2", AVX2},
        {"AVX512", AVX2|X264_CPU_AVX512},
    #undef AVX2
    #undef AVX
    #undef SSE2
    #undef MMX2
        {"Cache32", X264_CPU_CACHELINE_32},
        {"Cache64", X264_CPU_CACHELINE_64},
        {"SlowAtom", X264_CPU_SLOW_ATOM},
        {"SlowPshufb", X264_CPU_SLOW_PSHUFB},
        {"SlowPalignr", X264_CPU_SLOW_PALIGNR},
        {"SlowShuffle", X264_CPU_SLOW_SHUFFLE},
        {"UnalignedStack", X264_CPU_STACK_MOD4},
    #elif ARCH_PPC
        {"Altivec", X264_CPU_ALTIVEC},
    #elif ARCH_ARM
        {"ARMv6", X264_CPU_ARMV6},
        {"NEON", X264_CPU_NEON},
        {"FastNeonMRC", X264_CPU_FAST_NEON_MRC},
    #elif ARCH_AARCH64
        {"ARMv8", X264_CPU_ARMV8},
        {"NEON", X264_CPU_NEON},
        {"DotProd", X264_CPU_DOTPROD},
        {"I8MM", X264_CPU_I8MM},
        {"SVE", X264_CPU_SVE},
        {"SVE2", X264_CPU_SVE2},
    #elif ARCH_MIPS
        {"MSA", X264_CPU_MSA},
    #elif ARCH_LOONGARCH
        {"LSX", X264_CPU_LSX},
        {"LASX", X264_CPU_LASX},
    #endif
        {"", 0},
    };

x86_64 c6i.2xlarge 8c/8th 16GB RAM
ARM 64 c6g.2xlarge 8c/8th 16GB RAM

In other words, we have two virtual machines: the x86 one is powered by an Intel Xeon Platinum 8375C (Ice Lake) host, while the ARM 64 one is powered by a Graviton 2, which uses ARMv8 Neoverse-N1 cores. For the test, Linux was used, in particular Ubuntu 24.04 running FFmpeg 6.1.1 Stable. Each EC2 had 2TB of attached storage to perform the calculations on, so the benchmark essentially consisted of:
1) Spinning up the EC2
2) Transferring a mezzanine file from an S3 bucket to the 2TB attached storage of the EC2
3) Triggering the encode to create the final output files
4) Delivering those files back to S3
5) Shutting down the EC2

The power up / power down times and the file transfer times were then subtracted from the total job time, in order to end up with only the actual computation time.

A total of 7 sources were used, and in all cases the input file was a standard XDCAM-50 file with DolbyE Italian, DolbyE Original, PCM Stereo Italian, PCM Stereo Original. In particular:

Video:
FULL HD 1920x1080 MPEG-2 High 4:2:2 Profile, Level High 50 Mbit/s yv16 25i TFF BT709 SDR

Audio:
Track1 DolbyE 5.1 44800Hz 20bit Italian
Track2 DolbyE 5.1 44800Hz 20bit Original
Track3 PCM 2.0 48000Hz 24bit Italian
Track4 PCM 2.0 48000Hz 24bit Original

The 44800Hz in the DolbyE tracks refers to the internal sampling rate of that stream at 25fps (1792 samples * 25 frames per second = 44800 Hertz), which is always resampled to 48000Hz when played back on a hardware decoder.
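The DolbyE figure is easy to sanity-check. A minimal Python sketch, using only the numbers quoted in the post (1792 samples per frame, 25 fps, 48000Hz playback rate); the variable names are just illustrative:

```python
from fractions import Fraction

SAMPLES_PER_FRAME = 1792   # DolbyE samples per video frame (figure from the post)
FRAME_RATE = 25            # frames per second of the XDCAM-50 source
PLAYBACK_RATE = 48000      # Hz, the rate the stream is resampled to on playback

# Internal sampling rate = samples per frame * frames per second.
internal_rate = SAMPLES_PER_FRAME * FRAME_RATE

# Ratio a resampler has to apply to go from the internal rate to 48000 Hz.
resample_ratio = Fraction(PLAYBACK_RATE, internal_rate)

print(internal_rate)    # 44800
print(resample_ratio)   # 15/14
```

So going from 44800Hz to 48000Hz is a clean 15/14 ratio.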
The encoding job consisted of 6 steps:

Step 1: Encoding the video
FULL HD H.264 Profile High Level 4.1 Ref 4 CRF 25 4:2:0 Limited TV Range 8bit planar BT709 SDR

Step 2: Encoding the audio in AAC
Track1 AAC 5.1 550 kbit/s 48000Hz Italian
Track2 AAC 5.1 550 kbit/s 48000Hz Original
Track3 AAC 2.0 384 kbit/s 48000Hz Italian
Track4 AAC 2.0 384 kbit/s 48000Hz Original

Step 3: Encoding the audio in Opus as a proxy
Track1 Proxy: Opus Mono 64 kbit/s Italian
Track2 Proxy: Opus Mono 64 kbit/s Original

Step 4: Encoding the video in H.264 as a proxy with watermark + muxing the already encoded audio
SD H.264 Profile High Level 4.1 Ref 4 CRF 25 4:2:0 Limited TV Range 8bit planar BT709 SDR

Step 5: Muxing the FULL HD video and the 5.1 AAC audio in MP4

Step 6: Extracting a low resolution thumbnail from the middle of the video and encoding it in JPEG

The command line used is reported as follows:
Quote:
One last note to keep in mind is that using an ARM CPU is 20% cheaper than using an x86 one, which means that, in theory, if the ARM CPU were as fast as the x86 one, it would potentially save 20% of the cost. Spoiler alert: this didn't happen.

Benchmark results:

Movie 1:
Title: Nope
Duration: 02:05:12:16
c6i.2xlarge x86 Encoding Duration: 2h 24m 6s
c6g.2xlarge ARM Encoding Duration: 3h 25m 43s
x86 cost: $7.50
ARM cost: $8.55
Result: ARM was 42.77% slower and 14.07% more expensive

Movie 2:
Title: Novocaine
Duration: 01:45:19:02
c6i.2xlarge x86 Encoding Duration: 1h 51m 47s
c6g.2xlarge ARM Encoding Duration: 2h 48m 10s
x86 cost: $6.30
ARM cost: $7.58
Result: ARM was 50.44% slower and 20.38% more expensive

Movie 3:
Title: Absolutely anything
Duration: 01:22:16:21
c6i.2xlarge x86 Encoding Duration: 1h 30m 19s
c6g.2xlarge ARM Encoding Duration: 2h 11m 10s
x86 cost: $4.94
ARM cost: $5.74
Result: ARM was 45.23% slower and 16.26% more expensive

Movie 4:
Title: Catch me if you can
Duration: 02:15:11:07
c6i.2xlarge x86 Encoding Duration: 3h 4m 15s
c6g.2xlarge ARM Encoding Duration: 4h 15m 25s
x86 cost: $8.11
ARM cost: $8.99
Result: ARM was 38.62% slower and 10.89% more expensive

Movie 5:
Title: Me before you
Duration: 01:45:57:00
c6i.2xlarge x86 Encoding Duration: 1h 51m 27s
c6g.2xlarge ARM Encoding Duration: 2h 43m 48s
x86 cost: $6.69
ARM cost: $7.87
Result: ARM was 47.08% slower and 17.72% more expensive

Movie 6:
Title: The lucky one
Duration: 01:36:54:00
c6i.2xlarge x86 Encoding Duration: 1h 56m 43s
c6g.2xlarge ARM Encoding Duration: 2h 46m 20s
x86 cost: $7.00
ARM cost: $7.97
Result: ARM was 42.51% slower and 13.98% more expensive

Movie 7:
Title: Shattered
Duration: 01:30:59:24
c6i.2xlarge x86 Encoding Duration: 1h 49m 22s
c6g.2xlarge ARM Encoding Duration: 2h 40m 6s
x86 cost: $6.56
ARM cost: $7.68
Result: ARM was 46.39% slower and 17.22% more expensive

In other words, on average, using an ARM CPU resulted in a 44.72% slowdown compared to the equivalent x86 CPU and, when we factor in the cost, despite it being 20% cheaper to run, the fact that it takes much longer to encode makes it actually 15.78% more expensive to run in real terms.
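For anyone who wants to re-check the arithmetic, here is a minimal Python sketch. It assumes that "X% slower" means "the ARM encode took X% longer than the x86 one" and that the ARM instance price is exactly 80% of the x86 price; both are my reading of the post rather than something it spells out, and the function names are made up for the example.

```python
import re

def to_seconds(duration: str) -> int:
    """Convert a duration such as '2h 24m 6s' to seconds."""
    match = re.fullmatch(r"(\d+)h (\d+)m (\d+)s", duration)
    hours, minutes, seconds = map(int, match.groups())
    return hours * 3600 + minutes * 60 + seconds

def extra_time_percent(x86: str, arm: str) -> float:
    """How much longer the ARM encode took, as a percentage of the x86 time."""
    return (to_seconds(arm) / to_seconds(x86) - 1) * 100

# Movie 1 ("Nope"), durations taken from the post.
movie1 = extra_time_percent("2h 24m 6s", "3h 25m 43s")
print(f"Movie 1: ARM took {movie1:.2f}% longer")

# Average of the seven per-movie slowdowns quoted in the post.
slowdowns = [42.77, 50.44, 45.23, 38.62, 47.08, 42.51, 46.39]
average = sum(slowdowns) / len(slowdowns)
print(round(average, 2))  # 44.72

# Assumption: ARM hourly price = 80% of the x86 hourly price.
ARM_PRICE_FACTOR = 0.80
relative_cost = ARM_PRICE_FACTOR * (1 + average / 100)
print(round((relative_cost - 1) * 100, 2))  # 15.78
```

With those assumptions the 15.78% figure falls straight out of 0.80 * 1.4472. The per-movie cost percentages in the post come from the billed dollar amounts instead, so they differ slightly from what this formula gives.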
|
#2 | Link |
Lost my old account :(
Join Date: Jul 2017
Posts: 374
|
Very cool test, thank you for sharing that. Graviton2's cores are derived from Cortex-A76, though, which is very far from cutting-edge ARM performance. It's very interesting to see that they have worse performance/dollar, but it would also be interesting to see how Graviton4 instances perform from a performance standpoint.
Does anyone know if there is any significant difference between x264 and x265 when it comes to ARM and NEON optimization? I've seen quite a bit for x265, but I don't follow x264 development that much anymore.
Last edited by excellentswordfight; 9th June 2025 at 09:26.
#6 | Link |
Registered User
Join Date: Dec 2013
Location: Berlin, Germany
Posts: 463
|
No, but isn't the default preset of x264 'medium'? I don't really know what that script did the whole time for a couple of hours, but sure enough it was not H.264 encoding. Also, the numbers are all over the place and do not check out.
__________________
My github... |
#7 | Link | |
Registered User
Join Date: Aug 2024
Posts: 576
|
Quote:
Yes, maybe it still doesn't add up perfectly, but it makes some sense, I guess.
|
#8 | Link | |
Registered User
Join Date: Dec 2013
Location: Berlin, Germany
Posts: 463
|
Quote:
An 8 core Graviton 2 instance should clock at around 2.5GHz and should be well faster than my 16GB Raspberry Pi at x264 if the majority of the time is spent on x264 encoding. But according to FranceBB's numbers it hardly is.
__________________
My github... |
|
#9 | Link | |
Registered User
Join Date: Aug 2024
Posts: 576
|
Quote:
But yeah, I won't disagree that this test includes too much noise, more than the title (an x264 encoding battle) suggests. What's the x264 medium speed of your Raspberry Pi?
|
#10 | Link | |
Registered User
Join Date: Dec 2013
Location: Berlin, Germany
Posts: 463
|
Quote:
For some Hollywood Movie intro pan over some detailed desert + person with some light grain:
Code:
x264 --input-res 1920x1080 --input-depth 8 --fps 24 --crf 24 -o trash.264 snip_1920x1080_8b.yuv
--> encoded 120 frames, 21.73 fps, 2398.65 kb/s
Code:
x264 --input-res 1024x576 --input-depth 8 --fps 24 --crf 24 -o trash.264 snip_1024x576_8b.yuv
--> encoded 120 frames, 69.77 fps, 862.13 kb/s
x264 is:
Code:
x264 0.164.3095 baee400 built on Apr 12 2023, gcc: 12.2.0
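To put these figures next to the ones from post #1, a rough Python sketch that expresses both as a multiple of real time. This is not a like-for-like comparison: the clip, CRF and scan type differ, and the Graviton 2 figure covers the whole job (audio, proxy, muxing, thumbnail), not just x264. The helper names are made up for the example.

```python
def realtime_factor(encode_fps: float, source_fps: float) -> float:
    """Encoding speed relative to playback speed (1.0 = real time)."""
    return encode_fps / source_fps

def timecode_to_seconds(timecode: str, fps: int) -> float:
    """Convert an HH:MM:SS:FF timecode to seconds."""
    hours, minutes, seconds, frames = map(int, timecode.split(":"))
    return hours * 3600 + minutes * 60 + seconds + frames / fps

# Raspberry Pi, 1080p clip at 24 fps (figures from this post).
pi_1080p = realtime_factor(21.73, 24)

# Graviton 2, Movie 1 from post #1: 02:05:12:16 at 25 fps, job took 3h 25m 43s.
movie_seconds = timecode_to_seconds("02:05:12:16", 25)
job_seconds = 3 * 3600 + 25 * 60 + 43
graviton_job = movie_seconds / job_seconds

print(f"Raspberry Pi, x264 only: {pi_1080p:.2f}x real time")
print(f"Graviton 2, whole job:   {graviton_job:.2f}x real time")
```

Read with those caveats, the gap suggests that a good part of the multi-hour job time was spent outside the main x264 encode.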
__________________
My github... Last edited by rwill; 10th June 2025 at 15:09. |
|
#12 | Link |
Registered User
Join Date: Apr 2006
Posts: 167
|
Is the "cost" chosen arbitrarily by the provider of the virtual machine, or does it reflect the price of the computer plus electricity costs? Do you actually get charged by time used, $8 for three hours or so? That would get unsustainable real quick.
|
#13 | Link | |
Broadcast Encoder
Join Date: Nov 2013
Location: Royal Borough of Kensington & Chelsea, UK
Posts: 3,236
|
Quote:
Yep... But the "benefit" of the cloud when running open source software is the scalability. This is the current situation in Prod (it's 10PM on a Thursday) for Avisynth, for instance:

[screenshot]

Basically, when you deploy something, a "golden" AMI is created. You can think of the AMI as a .ovf file containing the virtual machine and its configuration. When those are deployed you can have an elastic farm, so that you start with 1 instance which is shut down. In the case of FFAStrans (FFMpeg Avisynth Transcoder), for instance, when a job comes in, this EC2 spins up, the file is transferred from the S3 bucket to the 2TB attached storage, the rest_service starts, a POST is triggered so that it imports the workflow, then another POST is made to trigger the workflow, and a series of GETs is made afterwards to get the status. Once the job is over and the file is encoded, it gets transferred from the 2TB attached storage to the S3 bucket and the EC2 is shut down.

Clearly, if more jobs come in and more EC2s are required, those will be created automatically and, as you can see from the screenshot, I currently have 87 EC2s created, of which 7 are running and executing jobs at this very moment, and up to 620 can be created (I can eventually increase this limit). Each EC2 has a "grace period" of 1 week, so that if it's not used and stays shut down for more than 1 week, it gets automatically deleted, and it will be recreated if needed.

Obviously this is all fun and games 'cause we're only talking about machines with CPU and RAM; if we were to include dedicated resources like a GPU, you could end up in a situation in which your machine won't be created for a long time 'cause there may not be any availability in the region, so you have to wait until some other AWS customer finishes using it. With CPU-only machines, however, this never happens.
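The scaling rules described above can be sketched in a few lines of Python. This is only a toy model of the behaviour as I read it from the post (reuse stopped instances first, create new ones up to the limit, delete anything that has been stopped for more than a week); it does not call any AWS or FFAStrans API, and every name in it is hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional

MAX_INSTANCES = 620                 # upper limit mentioned in the post
GRACE_PERIOD = timedelta(weeks=1)   # stopped instances older than this are deleted

@dataclass
class Instance:
    running: bool
    stopped_since: Optional[datetime] = None   # None while running

def reconcile(pool: List[Instance], pending_jobs: int, now: datetime) -> List[Instance]:
    """Return the pool after expiring idle instances and assigning pending jobs."""
    # 1) Delete instances that have been shut down for longer than the grace period.
    pool = [i for i in pool
            if i.running or now - i.stopped_since <= GRACE_PERIOD]
    # 2) Assumption: stopped instances are woken up before new ones are created.
    for inst in pool:
        if pending_jobs == 0:
            break
        if not inst.running:
            inst.running, inst.stopped_since = True, None
            pending_jobs -= 1
    # 3) Create new instances for the remaining jobs, up to the limit.
    while pending_jobs > 0 and len(pool) < MAX_INSTANCES:
        pool.append(Instance(running=True))
        pending_jobs -= 1
    return pool

now = datetime(2025, 6, 12, 22, 0)
pool = [
    Instance(running=True),
    Instance(running=False, stopped_since=now - timedelta(days=2)),
    Instance(running=False, stopped_since=now - timedelta(days=9)),  # past the grace period
]
pool = reconcile(pool, pending_jobs=3, now=now)
print(len(pool), sum(i.running for i in pool))  # 4 4
```

In the example, the instance that has been stopped for 9 days is dropped, the one stopped for 2 days is reused, and two new ones are created for the remaining jobs.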
The advantage of using open source software, in general, compared to closed source software you buy a license for, is that with closed source software you can only deploy as many instances as the licenses you bought. Suppose you bought 10 licenses from company A: then you can deploy 10 provisioned instances which will always be there, they will power up and down, and you don't have to wait for them to be created, but you won't be able to scale to, let's say, an 11th one, 'cause you don't have the license for it. Some providers offer you the option to use their cloud offering instead, so instead of buying the license you use it as a sort of "pay-as-you-go", which allows you to scale; however, every time you trigger a job you pay a bit more than you otherwise would have, 'cause they're also charging for the license.

This is more general, though. As far as encoding is concerned, unless you need something incredibly specific like, I don't know, the Dolby Media Encoder to encode DAMF in AC4 etc, or things like FAB to mux .stl subtitles in an .mxf container as a 436m track to carry Teletext Subtitles, or some other peculiar use case not covered by open source software, you're better off with open source. That is why I'm proud not just to be using Avisynth but also of the fact that the company I work for is directly contributing to x265, as they're one of the MulticoreWare partners (and have been for a very long time).
Last edited by FranceBB; 12th June 2025 at 22:39.
|
#14 | Link | |
Lost my old account :(
Join Date: Jul 2017
Posts: 374
|
Quote:
Last edited by excellentswordfight; 13th June 2025 at 09:58. |
|
#16 | Link | |
Registered User
Join Date: Jun 2024
Location: South Africa
Posts: 365
|
Quote:
|
|