Welcome to Doom9's Forum, THE in-place to be for everyone interested in DVD conversion.


Go Back   Doom9's Forum > Video Encoding > MPEG-4 AVC / H.264

25th May 2015, 20:50   #1
TerryMasters
Registered User
 
Join Date: May 2015
Posts: 2
The x264 Dual CPU Conundrum

Hello everyone! I originally posted this in a few other places since there was a five-day waiting period to make threads here... it put a very important project on hold, but having not found a solution to this problem, I now get to ask those widely considered the experts. The slightest bit of help will be greatly appreciated: I know not everyone has this problem, but I also know I'm not the only one trying to fix it.

The post:

Quote:
I have a dual CPU setup and noticed it's only using one of them while encoding x264 video regardless of whether or not affinity is set or a threads flag is calling for it to use more cores. Is there something unique I have to do in order to get the x264 encoder to recognize the second CPU?

Edit- These guys seem to have had the same problem: http://www.servethehome.com/intel-xe...-review-power/

Edit 2- Some information and findings below.

-----------------------------------------------------

The chips in question are two Xeon v3 series.

The problem occurs with live streaming software such as XSplit or Open Broadcaster (aka OBS), as well as with the x264 CLI. It's my understanding that live streams are one-pass encodes and as such can use fewer cores, but I'd be lying if I said it wasn't strangely coincidental that the cores left idle just happen to be exactly those of processor two, in tests both on my system and on others (the servethehome benchmark).

Through asking around, gathering data and trial and error I'm starting to piece together that something might be wrong with the way x264 handles multiple threads.
  • The system in question has 48 threads: 24 physical cores across two processors (12 apiece). By x264 standards, a default preset with multithreading enabled should autodetect the correct "threads" count as (PhysicalCores)x1.5, which in my case is 36. On my system, x264 instead autodetects and sets threads=48, yet only utilizes 24.
  • On occasion, x264 will autodetect and set threads=72.
  • Disabling the only processor x264 is using allows the codec to run just fine on the other, previously unused CPU. Setting the affinity in Windows 8.1 Pro 64 to disable what would be just the HT cores on both processors (leaving 24 physical cores available instead of 48 logical) allows encodes to spread across both processors, albeit with strange results: one CPU topped out at only 25% usage across its 12 cores while the other CPU's 12 cores maxed out completely.
  • The idea has come up that x264 may only utilize up to 24 threads, which would explain why only half my computing power is being used. This is not true, as the servethehome benchmark posted above shows the same problem with two 6-core CPUs: 12 threads are working on the encode while the other 12 remain idle.
  • Another idea tossed around was that, since encoding does make an incredibly minimal impact on the second, seemingly unused CPU, x264 may actually be treating the entire secondary processor as one additional thread, which would explain the 1-2% usage increase across its cores during encoding.
  • Yet another question that was raised is whether or not the encoder has enough to process, meaning that it's only using 24 threads because it doesn't need any others. This appears to be untrue, as the program being used to live stream reports "High CPU Usage!" and starts dropping frames when I choose a slower preset, all the while still only utilizing one CPU. If the workload were simply too small, x264's threads would have spread to the unused cores once the slower preset increased it.
  • The idea, as Kichigai states, that ServeTheHome's "...comment was that the additional cores were under-utilized, which suggests that there could have been a bottleneck elsewhere in the system, such as SSD speed or decoding speed of the content to be transcoded" could be true, except for a few details: 1.) The live streaming application sends the encode over a network specifically designed to handle a plethora of different encoding styles and bandwidths. This, in combination with the local recordings, should theoretically rule out a disk problem. 2.) The decoding speed of the content would depend on the data received from the capture card, in this case an Avermedia 1080p 60fps USB 3.0 device, yet the same problem occurs on Decklink cards (namely 4K Extremes), which are PCIe-based devices designed to handle much more than 1080/60. Could it still be possible? Of course, but based on this data I don't believe it's likely.
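
The thread-count arithmetic in the list above can be checked with a short Python sketch. It assumes the 1.5x rule exactly as stated in the post; the names `auto_threads` and `physical_only_mask` are made up for illustration and are not part of x264. The mask helper further assumes that Hyper-Threading siblings are numbered as adjacent pairs (0/1, 2/3, ...), which should be verified on the actual machine before relying on it.

```python
def auto_threads(cores: int) -> int:
    """Thread count under the 1.5x rule as stated in the post (an assumption here)."""
    return cores * 3 // 2


def physical_only_mask(logical: int) -> int:
    """Affinity mask keeping one logical processor per physical core.

    Assumes HT siblings are adjacent pairs, so every even-numbered
    logical processor is kept and every odd-numbered one is dropped.
    """
    mask = 0
    for cpu in range(0, logical, 2):
        mask |= 1 << cpu
    return mask


PHYSICAL_CORES = 24  # 2 sockets x 12 cores, from the post
LOGICAL_CORES = 48   # with Hyper-Threading enabled

print(auto_threads(PHYSICAL_CORES))            # 36
print(auto_threads(LOGICAL_CORES))             # 72
print(hex(physical_only_mask(LOGICAL_CORES)))  # 0x555555555555
```

Note that 72, the value occasionally autodetected, is what the 1.5x rule gives when applied to the 48 logical processors rather than the 24 physical cores. That reading is an inference from the numbers, not something confirmed in the thread.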

This, along with new information I'm continuously weighing, is slowly adding up to evidence that something in x264's current implementation may need to be fixed or changed, if at all possible, to better support these systems. Unless there are other x264 DLLs specifically catering to dual-socket systems that I've just not been made aware of, this would imply that nearly half of the processing power in these machines is going to waste. If there were someone I could reach out to who still works on and maintains this codec, I would love to share these findings with them in an effort to improve x264 (if need be). I'm currently researching how the multithreading is parallelized, in case that is why these systems have problems, but the fact that "threads" doesn't override the behavior and that there isn't an option to un-limit the threads seems like something that was either overlooked or could be improved. Otherwise, this could become confirmation that x264 cannot properly utilize the second CPU in a dual-CPU system. I would truthfully hate to see that happen.


- ServeTheHome's 6-Core Dual Xeon Benchmark Utilization - http://www.servethehome.com/wp-conte...zation-HDQ.png

- My personal 24-Core Dual Xeon Benchmark Utilization - Test 1: http://imgur.com/rXGvwat / Test 2: http://imgur.com/sYy0l70 (the first picture shows the last two threads of the first CPU being used properly, above the second CPU's threads being barely used at all. Only setting: preset=medium)
25th May 2015, 21:28   #2
Groucho2004
 
Join Date: Mar 2006
Location: Barcelona
Posts: 5,034
Have a look at this thread, I think the problem there is similar to yours.
26th May 2015, 18:59   #3
TerryMasters
Registered User
 
Join Date: May 2015
Posts: 2
I appreciate the link, really I do; as I mentioned before, any help is truly greatly appreciated. I've been going through it and another like it for a while now, and maybe it's hard for me to understand what's going on because his English didn't appear to be that good... but that thread actually made me feel worse. From what I gather, he's networking multiple computers together in order to speed up an encoding job. That is great, but it also means we have the ability to network six machines together to help with an encode yet can't properly utilize the second CPU in the same machine.

I have some new information to share as well:
  • My primary focus is the live streaming software (XSplit and OBS). Though regular encodes will also be done on this machine, my main concern right now is getting the best possible quality/performance with the aforementioned tools; this means that things such as dual encodes or networked encodes aren't really options for me. I am aware of the great number of other things I can do, but my top priority is getting the highest quality video with the lowest possible bitrate through better compression via the CPU.
  • I'd also like to point out that After Effects, using the H.264 codec, seems to have no problem properly utilizing both processors and all threads.
  • I don't want to make light of the fact that using 24 threads is impressive. I'm very aware of what one of these processors can achieve and don't want to underplay that in any way. However, it would be wrong to overlook the fact that nearly an entire CPU is going to waste in the shadow of those feats. Compression performance is critical for this particular application, as it largely determines who can and cannot watch the streaming video, meaning every little bit helps in the greatest sense of the phrase.
  • That being said, x264 only utilizing one CPU puts me in a very strange and unfortunate situation:

    Quote:
    If I leave it as is, x264 is essentially only using one processor: 12 cores plus their 12 Hyper-Threading siblings, for a total of 24 "threads". This is opposed to the 24 cores plus 24 HT siblings it could be using for a total of 48 threads, even with a slight degradation in quality. The reason this is a problem is that going into the BIOS and disabling Hyper-Threading allows x264 to use 24 physical cores instead of 24 logical threads on one processor, meaning I'm getting better performance out of restarting my server and disabling HT than I would if I let my machine run as intended.

With as much research as I've been doing, there just doesn't seem to be any escaping the fact that a great amount of processing power is being thrown away.
26th May 2015, 22:00   #4
foxyshadis
Angel of Death
 
Join Date: Nov 2004
Location: Lost
Posts: 9,558
Stop by #x264dev on Freenode IRC; you can talk directly to the devs there.

Dual CPU is a more difficult problem than a single CPU with tons of cores. x264 does support it, but isn't particularly optimized for high-end situations where memory pools are split, whereas x265 includes explicit support for NUMA pools (and AE probably does too). So if you have NUMA pools, you may be getting hosed by memory transfer latency and bandwidth, starving the CPUs of anything to encode; even if not, memory bandwidth is still a strong contender for the bottleneck. With NUMA, typically you would start one x264 per CPU, set its affinity to the cores on just that CPU, and lock it to use only the memory pool for that CPU. Unfortunately, if you're encoding a single stream you can't use two x264s.
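
The one-x264-per-CPU setup described above could be sketched roughly like this in Python. It is only an illustration under stated assumptions: logical processors are assumed to be numbered contiguously per socket (0-23 on the first CPU, 24-47 on the second), the helper names and file names are hypothetical, and the command is built around the Windows `start /AFFINITY` switch, which takes a hexadecimal mask. It covers CPU affinity only, not locking the memory pool, and it still needs two separate inputs, since a single stream can't be split across two x264 processes.

```python
def socket_affinity_mask(socket: int, logical_per_socket: int) -> int:
    """Bitmask selecting every logical processor on one socket.

    Assumes contiguous numbering per socket (an assumption, not a given).
    """
    block = (1 << logical_per_socket) - 1
    return block << (socket * logical_per_socket)


def start_command(socket: int, logical_per_socket: int, x264_args: list) -> list:
    """Build (but do not run) a `start /AFFINITY` command line for one x264."""
    mask = socket_affinity_mask(socket, logical_per_socket)
    # `start /AFFINITY` expects the mask as hexadecimal digits without a prefix.
    return ["cmd", "/c", "start", "/AFFINITY", format(mask, "X"), "x264", *x264_args]


# Hypothetical job list: one independent encode per socket.
jobs = [
    (0, ["--preset", "medium", "-o", "out0.mkv", "in0.y4m"]),
    (1, ["--preset", "medium", "-o", "out1.mkv", "in1.y4m"]),
]

for socket, args in jobs:
    print(" ".join(start_command(socket, 24, args)))
```

The commands are only printed here; passing each list to `subprocess.run` would launch them, and the actual processor numbering should be checked first (for example in Task Manager's affinity dialog).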

So far you're kind of guessing and grasping at straws, and I don't think your analysis is correct that it's ignoring a whole CPU. I'd strongly recommend installing XPerf (basic instructions here) and checking out this post for how to set it up to analyze the recording; with that you should have a much better idea of the real distribution and utilization in the system. A good free low-level analyzer particularly suited to memory latency analysis is AMD CodeXL, which does work on Intel as long as you only use Time-Based sampling. Unfortunately, Intel's VTune costs almost a thousand dollars.

Unfortunately, there are no easy answers we can give you, so everything will be based on tedious tuning. Hopefully with some more data, we might be able to come up with some recommendations.

Tags
dual, streaming, twitch, x264, xeon
