Welcome to Doom9's Forum, THE in-place to be for everyone interested in DVD conversion.

Doom9's Forum > Video Encoding > MPEG-4 AVC / H.264

Old 4th August 2010, 22:04   #41  |  Link
Guest
Guest
 
Join Date: Jan 2002
Posts: 21,923
The amount of uninformed blustering here is breathtaking.

Have you guys ever written a GPU program?
Guest is offline   Reply With Quote
Old 4th August 2010, 22:20   #42  |  Link
TheImperial2004
C# Addict
 
TheImperial2004's Avatar
 
Join Date: Oct 2008
Location: Saudi Arabia
Posts: 114
Quote:
The amount of uninformed blustering here is breathtaking.

Have you guys ever written a GPU program?
Nope. Speaking for myself, I'm talking entirely in theory, so I can be right or wrong. I'm just wondering how we can magically speed things up while delivering the same quality as x264.

Dark Shikari has already stated that we would need to build the whole encoder from scratch. So I think it would be best to wait for H.265 and build x265 from the ground up to harness the GPU.

Corrections are welcome as always
__________________
AviDemux Windows Builds
TheImperial2004 is offline   Reply With Quote
Old 4th August 2010, 22:26   #43  |  Link
mariush
Registered User
 
Join Date: Dec 2008
Posts: 590
Quote:
Originally Posted by TheImperial2004 View Post
That seems like a good idea. But!

"just store the computations performed on the GPU somewhere"

I don't think there is anywhere to store them other than the HDD. And we all know what that might mean: yes, lag. If we store, let's say, a 512 MB segment every 10-30 seconds, I believe the HDD will be the bottleneck here.

Your idea is great, but I can't see it improving encoding speed "magically" in the near future, especially with the HDD involved.
4 GB of RAM is common nowadays, 8 GB is not that unusual, and almost all motherboards support it.

SSDs are also getting more common and cheaper: a 40-80 GB SSD is now $100-150 and can easily sustain 100-150 MB/s writes. Dumping 512 MB of data to RAM and then, within 10 seconds, to a hard drive should be doable (in theory it's doable even on a regular drive, mine do 60-70 MB/s easily, but probably not if you're reading from it at the same time).
And, of course, this is without RAID.

But remember, I was talking about uploading 512 MB of data to the video card and then dumping the results of the processing... that doesn't necessarily mean 512 MB of results; it could easily be just 40-50 MB of data.
Of course, if processing the data takes less time than uploading and downloading it from the card, it's not worth it.
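The trade-off described here can be put into a quick back-of-the-envelope check. All the figures below (the PCIe bandwidth, the compute times, and the `offload_pays_off` helper itself) are illustrative assumptions, not measurements:

```python
# Rough break-even check for GPU offload: the round-trip transfer
# plus GPU compute must still beat doing the work on the CPU.
# All numbers are assumptions for illustration only.

def offload_pays_off(upload_mb, download_mb, pcie_mb_s,
                     cpu_time_s, gpu_time_s):
    """True if CPU->GPU->CPU transfer plus GPU compute beats the CPU."""
    transfer_s = (upload_mb + download_mb) / pcie_mb_s
    return transfer_s + gpu_time_s < cpu_time_s

# 512 MB up, 50 MB of results back, ~3 GB/s effective PCIe bandwidth:
print(offload_pays_off(512, 50, 3000, cpu_time_s=2.0, gpu_time_s=0.5))  # True
print(offload_pays_off(512, 50, 3000, cpu_time_s=0.5, gpu_time_s=0.4))  # False
```

With numbers like these, the transfer alone costs roughly 0.2 seconds, so the GPU has to save at least that much CPU time per batch before offloading is worth anything.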


neuron2: I never claimed to be an expert. I'm not; I'm barely able to code websites and do occasional conversions...

I'm just writing down my thoughts so others can explain why something won't work or wouldn't be feasible, and I'll learn something from it, won't I?

After all, this is a forum, and that's the definition of a forum: a place where people can discuss things.
mariush is offline   Reply With Quote
Old 4th August 2010, 22:35   #44  |  Link
TheImperial2004
C# Addict
 
TheImperial2004's Avatar
 
Join Date: Oct 2008
Location: Saudi Arabia
Posts: 114
Quote:
But remember, I was talking about uploading 512 MB of data to the video card and then dumping the results of the processing... that doesn't necessarily mean 512 MB of results; it could easily be just 40-50 MB of data.
Of course, if processing the data takes less time than uploading and downloading it from the card, it's not worth it.
Now I see. The issue here is synchronization: coding a thread to do this, then waiting for it to finish and return a value to feed other threads, is a nightmare, at least for me. I tried to write some threaded apps in C# and tore my hair out, and that was a simple app. What about an encoder with thousands of lines? It would be a no-go unless we wrote a new encoder from scratch.
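As a hypothetical illustration of the pattern being described (in Python rather than C#), the usual way to tame "start a worker, wait for its value, feed the next stage" is a thread-safe queue rather than hand-rolled waiting:

```python
# Hand-rolled thread synchronization is error-prone; a queue makes the
# "start a worker, wait for its results, feed the next stage" pattern safe.
import threading
import queue

def worker(frames, results):
    for f in frames:
        results.put(f * 2)          # stand-in for real per-frame work
    results.put(None)               # sentinel: no more results coming

results = queue.Queue()
t = threading.Thread(target=worker, args=([1, 2, 3], results))
t.start()

out = []
while (item := results.get()) is not None:   # get() blocks until ready
    out.append(item)
t.join()
print(out)  # [2, 4, 6]
```

The blocking `get()` replaces all the manual wait/notify logic, which is where most of the hair-pulling comes from in any language.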

Quote:
I'm just writing down my thoughts so others can explain why something won't work or wouldn't be feasible, and I'll learn something from it, won't I?

After all, this is a forum, and that's the definition of a forum: a place where people can discuss things.
Sure, we're all here to learn. "Never too old to learn." Feel free to express your thoughts and correct others.

Corrections are welcome
__________________
AviDemux Windows Builds
TheImperial2004 is offline   Reply With Quote
Old 4th August 2010, 22:38   #45  |  Link
LoRd_MuldeR
Software Developer
 
LoRd_MuldeR's Avatar
 
Join Date: Jun 2005
Location: Last House on Slunk Street
Posts: 13,032
Quote:
Originally Posted by TheImperial2004 View Post
Nope. Speaking for myself, I'm talking entirely in theory, so I can be right or wrong. I'm just wondering how we can magically speed things up while delivering the same quality as x264.
You never get any speed-up for free. And GPUs certainly won't "magically" make your program faster!

Porting software to CUDA/OpenCL isn't simple at all. Getting non-trivial software running on the GPU is a tough task, not to mention all the work needed to optimize it for speed.

Also, there is absolutely no guarantee that your software will run any faster (or more efficiently) on the GPU than it does on the CPU. It may or may not work out.

If your problem isn't highly parallel, it won't fit on the GPU. And even if your problem is highly parallel in theory, you still have to come up with a smart parallel algorithm that works on the real hardware.

See also:
http://forum.doom9.org/showpost.php?...&postcount=192

This example also shows how complex it is to optimize something as simple as a "parallel reduction" in CUDA:
http://developer.download.nvidia.com.../reduction.pdf


Quote:
Originally Posted by TheImperial2004 View Post
Dark Shikari has already stated that we would need to build the whole encoder from scratch. So I think it would be best to wait for H.265 and build x265 from the ground up to harness the GPU.
Not necessarily the whole encoder, but a significant part.

You can't "move" a single DSP function to the GPU (even if it is a LOT faster there), because the delay of the CPU -> GPU -> CPU data transfer would nullify the speed-up.

Instead you must "move" (read: re-implement) complete algorithms on the GPU, so there are enough "calculations per data transfer" to justify the transfer delay.

(Furthermore, we don't have any indication that H.265 will be easier or harder to implement on a GPU.)
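The "calculations per data transfer" point can be made concrete with a simple threshold. The throughput numbers below are rough 2010-era assumptions, not measurements:

```python
# Operations per transferred byte at which GPU compute time equals the
# PCIe transfer time; below this threshold, the transfer dominates and
# the offload cannot pay off no matter how fast the GPU is.
def min_ops_per_byte(gpu_ops_per_s, pcie_bytes_per_s):
    return gpu_ops_per_s / pcie_bytes_per_s

# Assume ~1e12 ops/s on the GPU vs ~3 GB/s effective PCIe bandwidth:
print(min_ops_per_byte(1e12, 3e9))  # ~333 ops per byte
```

A single DSP function rarely performs hundreds of operations per transferred byte, which is exactly why whole algorithms, not individual functions, have to move to the GPU.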
__________________
There was of course no way of knowing whether you were being watched at any given moment.
How often, or on what system, the Thought Police plugged in on any individual wire was guesswork.



Last edited by LoRd_MuldeR; 4th August 2010 at 22:46.
LoRd_MuldeR is offline   Reply With Quote
Old 4th August 2010, 22:46   #46  |  Link
TheImperial2004
C# Addict
 
TheImperial2004's Avatar
 
Join Date: Oct 2008
Location: Saudi Arabia
Posts: 114
Hi Mulder, long time no see

Quote:
You never get any speed-up for free. And certainly GPU's won't "magically" make your program faster!
That's what the commercials advertise: "Stunning speed with crystal clarity". I was wondering how they do that. Magic? Of course not: a speedy encoder is speedy because it does fewer calculations --> worse output.

Quote:
Instead you must "move" (read: re-implement) complete algorithms on the GPU.
Add to that the nightmarish API they provide for developers

Quote:
(Furthermore we don't have any indication that H.265 will be any easier or harder to implement on a GPU)
The problem is the CUDA/Stream API, not the application/specification, as I see it.
__________________
AviDemux Windows Builds
TheImperial2004 is offline   Reply With Quote
Old 4th August 2010, 22:46   #47  |  Link
Guest
Guest
 
Join Date: Jan 2002
Posts: 21,923
Quote:
Originally Posted by mariush View Post
I'm just writing down my thoughts so others can explain why something won't work or wouldn't be feasible, and I'll learn something from it, won't I?

After all, this is a forum, and that's the definition of a forum: a place where people can discuss things.
That's fine until people start to say things like this from a position of almost total ignorance:

Quote:
Am I the only one who believes that all "minor" x264 development should be postponed and all efforts should be focused on developing a way to offload the ME "at least" to the GPU?
If people have the right to advertise their ignorance (in the sense of lack of relevant knowledge and not in any derogatory sense), then I suppose I have the right to note it.

Carry on!
Guest is offline   Reply With Quote
Old 4th August 2010, 22:51   #48  |  Link
TheImperial2004
C# Addict
 
TheImperial2004's Avatar
 
Join Date: Oct 2008
Location: Saudi Arabia
Posts: 114
Quote:
then I suppose I have the right to note it.
Of course you do. But I wasn't advertising, I was just thinking out loud
__________________
AviDemux Windows Builds
TheImperial2004 is offline   Reply With Quote
Old 4th August 2010, 22:54   #49  |  Link
LoRd_MuldeR
Software Developer
 
LoRd_MuldeR's Avatar
 
Join Date: Jun 2005
Location: Last House on Slunk Street
Posts: 13,032
Quote:
Originally Posted by TheImperial2004 View Post
That's what the commercials advertise: "Stunning speed with crystal clarity". I was wondering how they do that. Magic? Of course not: a speedy encoder is speedy because it does fewer calculations --> worse output.
That is marketing blabber, of course. You shouldn't take it seriously at all. Believe only what you see with your own eyes and/or measure on your own hardware.

With all those "big" companies working on GPU-accelerated H.264 encoders, and still not one of them able to compete with x264 in a proper "quality per speed" comparison, there are only two possible conclusions:

Either all those companies are completely incompetent, or GPUs aren't as suitable for video encoding as the GPU vendors try to make us believe. Decide for yourself
__________________
There was of course no way of knowing whether you were being watched at any given moment.
How often, or on what system, the Thought Police plugged in on any individual wire was guesswork.



Last edited by LoRd_MuldeR; 4th August 2010 at 22:56.
LoRd_MuldeR is offline   Reply With Quote
Old 4th August 2010, 23:01   #50  |  Link
TheImperial2004
C# Addict
 
TheImperial2004's Avatar
 
Join Date: Oct 2008
Location: Saudi Arabia
Posts: 114
Quote:
Either all those companies are completely incompetent -or- GPU's aren't the suitable for video encoding, as the GPU vendors try to make us believe. Decide yourself
I suppose the latter choice is not true, to a certain extent.
__________________
AviDemux Windows Builds

Last edited by TheImperial2004; 18th August 2010 at 04:40.
TheImperial2004 is offline   Reply With Quote
Old 4th August 2010, 23:02   #51  |  Link
Guest
Guest
 
Join Date: Jan 2002
Posts: 21,923
You are misquoting me. Is it intentional?
Guest is offline   Reply With Quote
Old 4th August 2010, 23:05   #52  |  Link
TheImperial2004
C# Addict
 
TheImperial2004's Avatar
 
Join Date: Oct 2008
Location: Saudi Arabia
Posts: 114
Quote:
You are misquoting me. Is it intentional?
No, not at all.

Sorry, I'll edit my post.
__________________
AviDemux Windows Builds

Last edited by TheImperial2004; 18th August 2010 at 04:41.
TheImperial2004 is offline   Reply With Quote
Old 5th August 2010, 17:28   #53  |  Link
ForceX
Registered User
 
Join Date: Oct 2006
Posts: 150
Quote:
Originally Posted by TheImperial2004 View Post
I'm just wondering: if we are to offload "everything" to the GPU, how can a 600-700 MHz GPU be faster than a 3.0+ GHz CPU? Isn't clock speed everything we're looking for?
The clock-speed war has been over since the Pentium 4 days. Now it is about adding more cores, performing more instructions per clock, and making programs scale better across many cores.

One thing you must understand is that GPUs and CPUs are completely different architectures. GPUs are only good at some tasks, and at those tasks they really excel, because they are BUILT that way. On ice, your 60 mph snowmobile is always going to outperform your 180 mph sports car, because snowmobiles are built to run on snow. The MHz figures have little relevance here, as GPUs use huge pipelines and thousands of shader processors (yes, more than a thousand) to run their tasks. Trying to compare that with a six-core CPU is foolishness.

Even if you completely ignore the architectural differences, merely adding a few instructions to accelerate certain tasks can have a huge impact.

The 19-fold increase in encryption performance of the six-core Intel chip over the older four-core model is mainly because the AES-accelerating instruction set (AES-NI) was implemented. So even within the same architecture, clock speed is not always the most decisive factor in performance.
ForceX is offline   Reply With Quote
Old 5th August 2010, 19:07   #54  |  Link
TheImperial2004
C# Addict
 
TheImperial2004's Avatar
 
Join Date: Oct 2008
Location: Saudi Arabia
Posts: 114
Quote:
One thing you must understand is that GPU and CPU are completely different architectures.
I realize this. But wouldn't the fact that they are completely different architectures make it harder for a coder to make them work in harmony? After all, we are talking about what a coder may run into when he decides to port CPU-only code to a CPU+GPU design.

Also, we know, to a certain extent, that the CUDA API is very difficult to code for, let alone to port existing code to; that's what we heard from the experts. How about OpenCL? Has anyone experimented with it?
__________________
AviDemux Windows Builds

Last edited by TheImperial2004; 5th August 2010 at 19:09.
TheImperial2004 is offline   Reply With Quote
Old 6th August 2010, 02:16   #55  |  Link
aegisofrime
Registered User
 
Join Date: Apr 2009
Posts: 455
Quote:
Originally Posted by TheImperial2004 View Post
I realize this. But wouldn't the fact that they are completely different architectures make it harder for a coder to make them work in harmony? After all, we are talking about what a coder may run into when he decides to port CPU-only code to a CPU+GPU design.

Also, we know, to a certain extent, that the CUDA API is very difficult to code for, let alone to port existing code to; that's what we heard from the experts. How about OpenCL? Has anyone experimented with it?
It's ridiculous how difficult CUDA is. I was flabbergasted at the complexity of a simple "Hello World" CUDA program.
aegisofrime is offline   Reply With Quote
Old 6th August 2010, 02:57   #56  |  Link
Guest
Guest
 
Join Date: Jan 2002
Posts: 21,923
It's not difficult. I wrote an NV12 to RGB24 conversion (with configurable 601/709 coefficients) plus host transfer in two days. And that was starting with very little knowledge of CUDA. The code is so simple that I'm embarrassed that it took me that long (although a lot of that time was working out the correct YUV->RGB equations and optimizing the implementation).

So speak for yourself!
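For reference, the per-pixel math behind a conversion like the one described above boils down to a few multiply-adds. This is a hypothetical Python sketch of full-range BT.601, not neuron2's actual code (his handled NV12 input and configurable 601/709 coefficients):

```python
# One-pixel full-range BT.601 YCbCr -> RGB conversion: the kind of
# purely per-pixel arithmetic that maps directly onto a GPU kernel.
def yuv_to_rgb_bt601(y, cb, cr):
    r = y + 1.402    * (cr - 128)
    g = y - 0.344136 * (cb - 128) - 0.714136 * (cr - 128)
    b = y + 1.772    * (cb - 128)
    clamp = lambda v: max(0, min(255, round(v)))
    return clamp(r), clamp(g), clamp(b)

print(yuv_to_rgb_bt601(128, 128, 128))  # (128, 128, 128): neutral grey
print(yuv_to_rgb_bt601(255, 128, 128))  # (255, 255, 255): white
```

Each output pixel depends only on the matching input pixel, so one GPU thread per pixel works with no inter-thread communication at all.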

Last edited by Guest; 6th August 2010 at 05:47.
Guest is offline   Reply With Quote
Old 6th August 2010, 04:59   #57  |  Link
Maccara
Registered User
 
Join Date: Dec 2001
Posts: 145
Quote:
Originally Posted by TheImperial2004 View Post
Also , we know -to a certain extent- that CUDA API is so difficult to code for , let alone porting existing code into it . thats what we heard from experts . How about Open-CL ? anyone experminted with it ?
Who are these experts claiming the API makes it difficult? Overcoming API peculiarities is the trivial part (if someone is already having difficulties with the API, I'm sorry to say, but that's no expert).

It's the algorithms that can be hard to make efficient (not the porting per se). "Anything" can be made to run on a GPU, but whether it reaps any benefit is a different story (even if it does run faster on the GPU).
Maccara is offline   Reply With Quote
Old 6th August 2010, 07:25   #58  |  Link
ForceX
Registered User
 
Join Date: Oct 2006
Posts: 150
Quote:
Originally Posted by TheImperial2004 View Post
I realized this . But isn't that because they are completely different architectures would make it harder for a coder to make them work in a harmoney ? After all , we are talking about what a coder may run into when he decides to port a CPU-only code into CPU-GPU one .
Harmony is not the issue here; GPUs and graphics APIs have been designed from the get-go to work in harmony with the CPU. When you run a game, it offloads all of the graphics calculations to the GPU and runs the AI on the CPU, and the two work in perfect harmony as long as they are on a similar tier of power. The problem is that GPUs were never made very "programmable", while CPUs hugely are. The new APIs like CUDA and OpenCL add that layer of programmability. The remaining problem is writing the code so it runs "efficiently" on the GPU: GPUs can crunch through a highly parallel workload like it's nothing, but when it comes to doing a lot of different operations on the same data, they are crippled.
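The observation that only part of a workload suits the GPU has a classic quantitative form, Amdahl's law. A small sketch with made-up numbers:

```python
# Amdahl's law: overall speedup is capped by the fraction of the work
# that stays serial, no matter how fast the offloaded part becomes.
def amdahl_speedup(parallel_fraction, parallel_speedup):
    serial = 1 - parallel_fraction
    return 1 / (serial + parallel_fraction / parallel_speedup)

# Offload 60% of an encoder to a GPU that runs that part 20x faster:
print(round(amdahl_speedup(0.6, 20), 2))    # 2.33
# Even with an infinitely fast GPU, 40% serial work caps the gain at 2.5x:
print(round(amdahl_speedup(0.6, 1e12), 2))  # 2.5
```

This is why offloading just motion estimation, or any single stage, can never deliver the "magical" speed-ups the marketing promises.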

The fact is that x264 is already highly optimized for CPUs, because programmers have been optimizing compilers and routines for CPUs for decades, whereas GPGPU computing is very new; frankly, there are very few people with any sort of expertise in optimizing code for GPU processing, and the lack of documentation and established prior work doesn't help.

The SIMD extensions in current CPUs are tailored to accelerate media processing, while GPUs don't offer such specific optimization capabilities yet. Using x264 on a CPU is like putting a nicely fitted dress on a not-too-gorgeous girl: the dress makes her look like a princess. Putting x264 on a GPU right now is like putting an overly large dress on a prettier girl: although this girl is more beautiful, she is still going to look laughable.

CUDA is not so complex, per se. It just involves doing a lot of other things before you can get a result. It's like wrapping your arm around your neck and back before putting the candy in your hand into your mouth, when you could just do it directly. But those added procedures are simply a fact of GPU computing right now.
ForceX is offline   Reply With Quote
Old 6th August 2010, 13:35   #59  |  Link
Guest
Guest
 
Join Date: Jan 2002
Posts: 21,923
Quote:
Originally Posted by ForceX View Post
CUDA is not so complex, per se. It just involves doing a lot of other things before you can get a result. It's like wrapping your arm around your neck and back before putting the candy in your hand into your mouth, when you could just do it directly. But those added procedures are simply a fact of GPU computing right now.
Nonsense, IMHO. And your silly analogies are laughable.
Guest is offline   Reply With Quote
Old 6th August 2010, 15:36   #60  |  Link
LoRd_MuldeR
Software Developer
 
LoRd_MuldeR's Avatar
 
Join Date: Jun 2005
Location: Last House on Slunk Street
Posts: 13,032
Quote:
Originally Posted by neuron2 View Post
It's not difficult. I wrote an NV12 to RGB24 conversion (with configurable 601/709 coefficients) plus host transfer in two days. And that was starting with very little knowledge of CUDA. The code is so simple that I'm embarrassed that it took me that long (although a lot of that time was working out the correct YUV->RGB equations and optimizing the implementation).

So speak for yourself!
That really is an extremely simplistic example. And the problem fits perfectly on CUDA, as the RGB value of a pixel depends only on the YUV value of the same pixel and on no other pixels. That's as "local" and "parallelizable" as it gets. I guess you didn't even use shared memory for it. Unfortunately, there aren't many problems like that in the real world.

As soon as you do something slightly more complex and try to do it in a way that runs "fast" on CUDA, things get much uglier, especially if you need to store intermediate data in shared memory but shared memory is too small. All the "memory access pattern" issues are also very complex: you need to take care of which threads (of a block) run in the same warp and which memory addresses (banks) they access.

Again I want to point to this example:
http://developer.download.nvidia.com.../reduction.pdf

(And remember, all they implement is a simple vector reduction! At the end they have a bunch of code that really isn't trivial to understand, while in plain C this would be ~3 lines of code ^^)
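To make the contrast concrete, here is the "plain C three-liner" in spirit, next to a sketch of the pairwise tree reduction that the CUDA version has to spell out explicitly (Python used for brevity; the real CUDA code additionally juggles thread blocks, shared memory, and warp-level details):

```python
# Sequential reduction: the trivial version a CPU programmer writes.
def reduce_sequential(xs):
    total = 0
    for x in xs:
        total += x
    return total

# Tree-shaped pairwise reduction: each pass halves the data, mimicking
# one round of parallel additions per CUDA block in the NVIDIA example.
def reduce_tree(xs):
    xs = list(xs)
    while len(xs) > 1:
        if len(xs) % 2:          # pad odd-length rounds with the identity
            xs.append(0)
        xs = [xs[i] + xs[i + 1] for i in range(0, len(xs), 2)]
    return xs[0]

data = list(range(10))
print(reduce_sequential(data), reduce_tree(data))  # 45 45
```

Both produce the same sum; the tree version only pays off when each round's additions actually run in parallel, which is exactly the part the CUDA slides spend dozens of lines optimizing.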
__________________
There was of course no way of knowing whether you were being watched at any given moment.
How often, or on what system, the Thought Police plugged in on any individual wire was guesswork.



Last edited by LoRd_MuldeR; 6th August 2010 at 15:42.
LoRd_MuldeR is offline   Reply With Quote