Old 6th August 2010, 16:31   #61  |  Link
Guest
Guest
 
Join Date: Jan 2002
Posts: 21,901
You're knocking down a straw man. I was responding to:

"It's ridiculous how difficult CUDA is. I was flabbergasted at the complexity of a simple "Hello World" CUDA program."

That is a problem much simpler than mine. I distinguish between the difficulty of parallelizing algorithms and the difficulty of programming using the CUDA API. Do not conflate them.

Now show me your parallel data reduction in 3 lines of C code, or admit that you too are blustering.

Last edited by Guest; 6th August 2010 at 16:35.
Old 6th August 2010, 16:53   #62  |  Link
LoRd_MuldeR
Software Developer
 
 
Join Date: Jun 2005
Location: Last House on Slunk Street
Posts: 13,248
Quote:
Originally Posted by neuron2 View Post
You're knocking down a straw man. I was responding to:

"It's ridiculous how difficult CUDA is. I was flabbergasted at the complexity of a simple "Hello World" CUDA program. That is a problem much simpler than mine."
I would disagree here. CUDA kernels cannot use printf() or similar, so the "Hello world!" of CUDA looks rather different from a classical "Hello world!". Probably the most popular introductory ("Hello world!") program for CUDA, the one shown as the very first example in all the "getting started" guides, is parallel matrix multiplication - already a bit more complex than your color conversion, as it isn't as "local".


Quote:
Originally Posted by neuron2 View Post
I distinguish between the difficulty of parallelizing algorithms and the difficulty of programming using the CUDA API. Do not conflate them.
Well, obviously you do not count all the "memory access pattern" mess (different rules for global and shared memory!) as part of the CUDA API. I agree that it is not really part of the API itself, but it is a very important aspect of the hardware, and if you ignore it, your program will never run fast on CUDA. So understanding the API well enough to get "something" running on CUDA is one thing; understanding all the hardware aspects that have to be taken into account to make your program fast (and speed is the whole reason why we use CUDA) is a completely different thing. If you look at the CUDA programming guide, roughly 10% of it is API documentation and roughly 90% is hardware aspects (the "performance guidelines"). Compared to that, programming for the CPU requires MUCH less knowledge of the hardware (up to a point, of course).
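To make the "memory access pattern" point concrete, here is a minimal generic sketch (demo kernels only, not from any real encoder): in copy_coalesced() consecutive threads of a warp read consecutive words, which the hardware merges into a few memory transactions; in copy_strided() neighbouring threads read words far apart, so the same amount of data costs many more transactions and runs much slower, even though both kernels are equally "correct".

Code:
// Hypothetical demo kernels -- same work, very different memory behaviour.
#define STRIDE 32

// Coalesced: thread i reads element i, so neighbouring threads touch
// neighbouring words and a warp's loads merge into few transactions.
__global__ void copy_coalesced(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];
}

// Strided: thread i reads element (i * STRIDE) % n, so neighbouring
// threads touch words far apart and the loads cannot be coalesced.
__global__ void copy_strided(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[(i * STRIDE) % n];
}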


Quote:
Originally Posted by neuron2 View Post
Now show me your parallel data reduction in 3 lines of C code, or admit that you too are blustering.
I didn't say the C code will be parallel too

For C code, thread-parallelization is a kind of "extra boost" that in the optimal case gives you a 2x or 4x speed-up, but it remains optional. For CUDA, thread-parallelization is essential: using only a single thread on CUDA would make the code something like 1000x slower. Anyway, with C + OpenMP I would simply put a #pragma in front of my "3 lines" loop and the parallelization is there...
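To illustrate the difference in effort (a generic sketch, not anybody's actual code, assuming a power-of-two block size of 256 threads): even a naive CUDA sum reduction already needs shared memory, explicit synchronization and a second pass over the per-block results, while the CPU version really is a plain loop with one OpenMP pragma in front of it.

Code:
// Hypothetical sketch: per-block sum reduction on CUDA.
// Assumes it is launched with exactly 256 threads per block; the host
// (or a second kernel launch) still has to add up the partial sums.
__global__ void block_sum(const float *in, float *partial, int n)
{
    __shared__ float cache[256];
    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + tid;

    cache[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    // Tree reduction within the thread block.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            cache[tid] += cache[tid + s];
        __syncthreads();
    }

    if (tid == 0)
        partial[blockIdx.x] = cache[0];

    // The CPU/OpenMP equivalent of all of the above:
    //   #pragma omp parallel for reduction(+:sum)
    //   for (int i = 0; i < n; i++)
    //       sum += in[i];
}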

Last edited by LoRd_MuldeR; 6th August 2010 at 16:59.
Old 6th August 2010, 17:18   #63  |  Link
Guest
Guest
 
Join Date: Jan 2002
Posts: 21,901
Quote:
Originally Posted by LoRd_MuldeR View Post
I didn't say the C code will be parallel too
That is my point, which you fail to see because you appear more interested in having a debate with me. I'm not interested in that, so I'll bow out.

Last edited by Guest; 6th August 2010 at 19:39.
Old 17th August 2010, 14:55   #64  |  Link
hust_xcl
Registered User
 
Join Date: Jul 2010
Posts: 11
Quote:
Originally Posted by TheImperial2004 View Post
Totally agree. But!

I believe that the major issue here is to sync the data between two different entities. What if the GPU is just too fast for the CPU to keep up with? Of course we will need the CPU to do some calculations. If the CPU is 9x slower than the GPU, then what's the point? In that case, the GPU will have to wait for the CPU to respond and complete its part of the job, and *only* then will the GPU continue doing its part. Lagging is the major issue here. Feel free to correct me though
I agree too. Frequently transferring data between the CPU and GPU costs too much time.
Instead, data should be transferred in large pieces and at low frequency.
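A small host-side sketch of what "large pieces, low frequency" means in practice (names and sizes are made up): copy the whole frame in one cudaMemcpy() instead of one tiny copy per macroblock, and consider pinned memory plus cudaMemcpyAsync() if the transfer should overlap with work on the previous frame.

Code:
// Hypothetical host code -- one big transfer per frame instead of
// thousands of small per-macroblock transfers.
#include <cuda_runtime.h>
#include <stddef.h>

void upload_frame(const unsigned char *host_frame,
                  unsigned char *dev_frame, size_t frame_bytes)
{
    // One large copy amortizes the per-call driver/PCIe overhead.
    cudaMemcpy(dev_frame, host_frame, frame_bytes, cudaMemcpyHostToDevice);

    // Anti-pattern: looping over every 16x16 macroblock and calling
    // cudaMemcpy() for 256 bytes at a time pays that overhead
    // thousands of times per frame.

    // With page-locked buffers (cudaHostAlloc), cudaMemcpyAsync() in a
    // stream additionally lets the transfer overlap with kernels that
    // are still processing the previous frame.
}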
Old 17th August 2010, 15:27   #65  |  Link
hust_xcl
Registered User
 
Join Date: Jul 2010
Posts: 11
Quote:
Originally Posted by royia View Post
Crossing my fingers for you.
Just for knowledge, are you aiming for Open CL or CUDA?

I wish I could help :-).
Royia, I have to suspend the work because the emulation mode does not support shared memory, and my old computer can only run in that mode. After I buy a new NV card, I will restart.
IMHO:
1. Interpolation can be optimized with CUDA.
2. Full search is more suitable than diamond search for offloading ME to the GPU.
My algorithm is:
(1) Transfer an original frame and a reference frame to the GPU.
(2) For the 7 partition modes (16x16, 16x8, 8x16, ...), perform a full search over an 8x8 search range around the original point (0,0) on the GPU.
(3) Transfer all the MVs back to the CPU.
(4) In the analysis process, the CPU makes use of the MVs (the predicted MV should be near 0) calculated by the GPU.
If an MV lands on the border of the 8x8 search range, a refined search should be employed to improve the result. (A rough kernel sketch of step (2) is below.)
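Here is a very rough kernel sketch of step (2) for the 16x16 partition only (all names are made up; SAD cost only, no sub-pel, and the "8x8 range" is interpreted as the 64 offsets -4..+3 around (0,0)). One CUDA thread block handles one macroblock, each of its 64 threads evaluates one candidate MV, and a small shared-memory reduction picks the best one.

Code:
// Hypothetical sketch -- launch as:
//   dim3 grid(width / 16, height / 16);
//   fullsearch_16x16<<<grid, 64>>>(cur, ref, width, height, stride, best_mv);
__global__ void fullsearch_16x16(const unsigned char *cur,
                                 const unsigned char *ref,
                                 int width, int height, int stride,
                                 short2 *best_mv)
{
    int mb_x = blockIdx.x * 16;              // macroblock position
    int mb_y = blockIdx.y * 16;
    int dx   = (int)(threadIdx.x % 8) - 4;   // candidate offset -4..+3
    int dy   = (int)(threadIdx.x / 8) - 4;

    // Clamp so the 16x16 reference block stays inside the frame.
    int rx = min(max(mb_x + dx, 0), width  - 16);
    int ry = min(max(mb_y + dy, 0), height - 16);

    unsigned int sad = 0;
    for (int y = 0; y < 16; y++)
        for (int x = 0; x < 16; x++)
            sad += abs((int)cur[(mb_y + y) * stride + mb_x + x] -
                       (int)ref[(ry   + y) * stride + rx   + x]);

    // Reduction: find the candidate with the smallest SAD in this block.
    __shared__ unsigned int s_sad[64];
    __shared__ short2       s_mv[64];
    s_sad[threadIdx.x] = sad;
    s_mv[threadIdx.x]  = make_short2((short)(rx - mb_x), (short)(ry - mb_y));
    __syncthreads();

    for (int s = 32; s > 0; s >>= 1) {
        if (threadIdx.x < s && s_sad[threadIdx.x + s] < s_sad[threadIdx.x]) {
            s_sad[threadIdx.x] = s_sad[threadIdx.x + s];
            s_mv[threadIdx.x]  = s_mv[threadIdx.x + s];
        }
        __syncthreads();
    }

    if (threadIdx.x == 0)
        best_mv[blockIdx.y * gridDim.x + blockIdx.x] = s_mv[0];
}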
Old 17th August 2010, 15:32   #66  |  Link
hust_xcl
Registered User
 
Join Date: Jul 2010
Posts: 11
Quote:
Originally Posted by Dark Shikari View Post
That doesn't magically mean that it can be made faster on a GPU.
Dark Shikari:
Is there anyone working on it? I hope I can give some help.
Old 18th August 2010, 04:39   #67  |  Link
TheImperial2004
C# Addict
 
 
Join Date: Oct 2008
Location: Saudi Arabia
Posts: 114
Quote:
Full search is more suitable than diamond search for offloading ME to the GPU.
I fail to see this... How could a full search be faster than a diamond one? Even on a GPU...? Unless you meant something else by "more suitable"...
Old 18th August 2010, 04:48   #68  |  Link
ajp_anton
Registered User
 
 
Join Date: Aug 2006
Location: Stockholm/Helsinki
Posts: 805
Offloading a full search makes more sense than offloading a diamond search.
Old 18th August 2010, 04:58   #69  |  Link
TheImperial2004
C# Addict
 
 
Join Date: Oct 2008
Location: Saudi Arabia
Posts: 114
Quote:
Offloading a full search makes more sense than offloading a diamond search.
So, in theory, a full search won't give a big performance hit compared to a diamond one (on a GPU)? Or did you mean that a full search is easier to offload than other search algorithms?
Old 18th August 2010, 17:48   #70  |  Link
Sharktooth
Mr. Sandman
 
 
Join Date: Sep 2003
Location: Haddonfield, IL
Posts: 11,768
He means diamond search is already pretty fast, so there is little or no need to offload it to the GPU... while full search is very time-consuming, and offloading it to the GPU would give a much, much bigger advantage.
Old 18th August 2010, 18:51   #71  |  Link
TheImperial2004
C# Addict
 
 
Join Date: Oct 2008
Location: Saudi Arabia
Posts: 114
Quote:
He means diamond search is already pretty fast, so there is little or no need to offload it to the GPU... while full search is very time-consuming, and offloading it to the GPU would give a much, much bigger advantage.
Hmm, that makes sense. I've never used diamond before (always UMH) and it is fast enough on my Q8200. I would like to see what a full search can do for quality.
Old 18th August 2010, 18:53   #72  |  Link
LoRd_MuldeR
Software Developer
 
 
Join Date: Jun 2005
Location: Last House on Slunk Street
Posts: 13,248
Quote:
Originally Posted by TheImperial2004 View Post
Hmm, that makes sense. I've never used diamond before (always UMH) and it is fast enough on my Q8200. I would like to see what a full search can do for quality.
Look at the built-in x264 presets. Nothing uses ESA or even TESA, except for the "placebo" preset. That should tell you what to expect from a "full" search
Old 18th August 2010, 21:01   #73  |  Link
TheImperial2004
C# Addict
 
 
Join Date: Oct 2008
Location: Saudi Arabia
Posts: 114
Quote:
Look at the built-in x264 presets. Nothing uses ESA or even TESA, except for the "placebo" preset. That should tell you what to expect from a "full" search
Seems to be "less useful" for everyday encodes, even for archiving.
Old 19th August 2010, 03:51   #74  |  Link
Sharktooth
Mr. Sandman
 
 
Join Date: Sep 2003
Location: Haddonfield, IL
Posts: 11,768
The ESA/TESA gain with respect to UMH is not significant, and both are very slow. However, offloading that workload to the GPU would mean you get the marginal quality gain for free, and you would also gain some speed since no ME is done by the CPU.
Old 19th August 2010, 03:58   #75  |  Link
Dark Shikari
x264 developer
 
 
Join Date: Sep 2005
Posts: 8,666
Quote:
Originally Posted by Sharktooth View Post
The ESA/TESA gain with respect to UMH is not significant, and both are very slow. However, offloading that workload to the GPU would mean you get the marginal quality gain for free, and you would also gain some speed since no ME is done by the CPU.
Only if the GPU is fast enough to keep up with the CPU.
Old 19th August 2010, 04:02   #76  |  Link
Sharktooth
Mr. Sandman
 
 
Join Date: Sep 2003
Location: Haddonfield, IL
Posts: 11,768
Sure, otherwise it will slow it down...
Old 19th August 2010, 07:40   #77  |  Link
hust_xcl
Registered User
 
Join Date: Jul 2010
Posts: 11
Quote:
Originally Posted by Sharktooth View Post
He means diamond search is already pretty fast, so there is little or no need to offload it to the GPU... while full search is very time-consuming, and offloading it to the GPU would give a much, much bigger advantage.
Consider parallel ME for all blocks on the GPU: diamond search may cost more time than full search. E.g. with 10 blocks, if one of them searches 80 points while the others only search 5 points each, the GPU still has to wait for all blocks to finish their ME. In that case, diamond search will cost more time than an 8x8 full search.
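A tiny generic sketch of that load-imbalance effect (made-up kernel, not encoder code): when a loop's trip count depends on the data, the threads that finish early simply idle until the slowest one in their warp is done, so the cost is set by the worst case rather than the average.

Code:
// Hypothetical illustration: a data-dependent loop, like an adaptive
// search, makes every warp wait for its slowest thread.
__global__ void uneven_work(const int *iters_needed, float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float acc = 0.0f;
    // If one thread needs 80 iterations and its neighbours need 5,
    // the whole warp effectively pays for 80.
    for (int k = 0; k < iters_needed[i]; k++)
        acc += sinf(acc + (float)k);
    out[i] = acc;
}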
Old 19th August 2010, 09:02   #78  |  Link
schweinsz
Registered User
 
Join Date: Nov 2005
Posts: 497
Quote:
Originally Posted by hust_xcl View Post
(4) In the analysis process, the CPU makes use of the MVs (the predicted MV should be near 0) calculated by the GPU.
If an MV lands on the border of the 8x8 search range, a refined search should be employed to improve the result.
I believe that the GPU matters for H.264 encoding only if it can do interpolation, integer-pel and sub-pel ME, mode decision, transform, and quantization.
If you only offload the integer-pel ME and interpolation, the gain is less significant.
Old 19th August 2010, 09:45   #79  |  Link
Dark Shikari
x264 developer
 
 
Join Date: Sep 2005
Posts: 8,666
Quote:
Originally Posted by hust_xcl View Post
Consider parallel ME for all blocks on the GPU: diamond search may cost more time than full search. E.g. with 10 blocks, if one of them searches 80 points while the others only search 5 points each, the GPU still has to wait for all blocks to finish their ME. In that case, diamond search will cost more time than an 8x8 full search.
That's not the reason that diamond is the problem. You can simply limit the number of iterations to avoid that.

The reason diamond is problematic is that to get even remotely decent performance, you have to have coalesced loads for the GPU threads.
Old 19th August 2010, 11:27   #80  |  Link
hust_xcl
Registered User
 
Join Date: Jul 2010
Posts: 11
Quote:
Originally Posted by Dark Shikari View Post
That's not the reason that diamond is the problem. You can simply limit the number of iterations to avoid that.

The reason diamond is problematic is that to get even remotely decent performance, you have to have coalesced loads for the GPU threads.
You are right!