Welcome to Doom9's Forum, THE in-place to be for everyone interested in DVD conversion. |
6th August 2010, 16:31 | #61 | Link |
Guest
Join Date: Jan 2002
Posts: 21,901
|
You're knocking down a straw man. I was responding to:
"It's ridiculous how difficult CUDA is. I was flabbergasted at the complexity of a simple "Hello World" CUDA program."
That is a problem much simpler than mine. I distinguish between the difficulty of parallelizing algorithms and the difficulty of programming using the CUDA API. Do not conflate them. Now show me your parallel data reduction in 3 lines of C code, or admit that you too are blustering.
Last edited by Guest; 6th August 2010 at 16:35. |
6th August 2010, 16:53 | #62 | Link | |||
Software Developer
Join Date: Jun 2005
Location: Last House on Slunk Street
Posts: 13,251
|
Quote:
Quote:
Quote:
For C code, thread parallelization is a kind of "extra boost" that in the best case gives you a 2x or 4x speed-up, but it is still optional. For CUDA, thread parallelization is essential: using only a single thread on CUDA would be like making the code 1000x slower. Anyway, with C + OpenMP I would simply put a #pragma in front of my "3 lines" loop and the parallelization is there...
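A minimal sketch of what that could look like, assuming the reduction is a plain sum over an array. The function name `sum_array` is made up for illustration; the only OpenMP-specific part is the single `#pragma` line, and without OpenMP enabled at compile time (e.g. no `-fopenmp` with GCC) the pragma is ignored and the loop just runs serially.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: sum all elements of an array.
 * The pragma asks OpenMP to split the loop iterations across threads and
 * to combine the per-thread partial sums into `sum` at the end. */
long sum_array(const int *data, size_t n)
{
    assert(data != NULL || n == 0);

    long sum = 0;
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < (long)n; i++)
        sum += data[i];
    return sum;
}
```

Calling `sum_array(values, count)` gives the same result with or without OpenMP; only the run time should differ.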
__________________
Go to https://standforukraine.com/ to find legitimate Ukrainian Charities 🇺🇦✊ Last edited by LoRd_MuldeR; 6th August 2010 at 16:59. |
|||
17th August 2010, 14:55 | #64 | Link | |
Registered User
Join Date: Jul 2010
Posts: 11
|
Quote:
Instead, data should be transferred in large chunks and infrequently. |
|
17th August 2010, 15:27 | #65 | Link | |
Registered User
Join Date: Jul 2010
Posts: 11
|
Quote:
IMHO:
1. Interpolation can be optimized with CUDA.
2. Full search is more suitable than diamond search for offloading ME to the GPU.
My algorithm is:
(1) Transfer an original frame and a reference frame to the GPU.
(2) For each of the 7 search modes (16x16, 16x8, 8x16…), run a full search over an 8x8 search range around the origin (0,0) on the GPU.
(3) Transfer all the MVs back to the CPU.
(4) In the analysis process, the CPU makes use of the MVs calculated by the GPU (the predicted MV should be near 0). If an MV lies on the border of the 8x8 search range, a refined search should be run to improve the search result. |
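Step (2) and the border check in step (4) could be sketched roughly like this for a single block. This is plain C as a CPU reference, not the actual GPU kernel; on the GPU each candidate (or each block) would presumably get its own thread. All names here (`MotionVector`, `block_sad`, `full_search`, `on_border`) are hypothetical, SAD is assumed as the cost metric, and `range = 4` is only one reading of "8x8 search range" (displacements -4..3 in each axis, 64 candidates).

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct { int dx, dy; unsigned sad; } MotionVector;

/* Sum of absolute differences between the bw x bh block of `cur` at (bx,by)
 * and the block of `ref` displaced by (dx,dy). */
static unsigned block_sad(const uint8_t *cur, const uint8_t *ref, int stride,
                          int bx, int by, int dx, int dy, int bw, int bh)
{
    unsigned sad = 0;
    for (int y = 0; y < bh; y++)
        for (int x = 0; x < bw; x++)
            sad += (unsigned)abs(cur[(by + y) * stride + bx + x] -
                                 ref[(by + dy + y) * stride + bx + dx + x]);
    return sad;
}

/* Exhaustive search over displacements [-range, range) in both axes.
 * The caller must make sure the whole search window lies inside `ref`. */
MotionVector full_search(const uint8_t *cur, const uint8_t *ref, int stride,
                         int bx, int by, int bw, int bh, int range)
{
    assert(range > 0);

    MotionVector best = { 0, 0, UINT_MAX };
    for (int dy = -range; dy < range; dy++) {
        for (int dx = -range; dx < range; dx++) {
            unsigned sad = block_sad(cur, ref, stride, bx, by, dx, dy, bw, bh);
            if (sad < best.sad) {   /* strict '<' keeps the first best match */
                best.dx = dx;
                best.dy = dy;
                best.sad = sad;
            }
        }
    }
    return best;
}

/* Nonzero if the MV sits on the edge of the search window, i.e. the case
 * where a refined search on the CPU would be worthwhile. */
int on_border(MotionVector mv, int range)
{
    return mv.dx == -range || mv.dx == range - 1 ||
           mv.dy == -range || mv.dy == range - 1;
}
```

The same `full_search` would be called once per partition size (16x16, 16x8, 8x16…) by changing `bw` and `bh`.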
|
18th August 2010, 04:39 | #67 | Link | |
C# Addict
Join Date: Oct 2008
Location: Saudi Arabia
Posts: 114
|
Quote:
__________________
AviDemux Windows Builds |
|
18th August 2010, 04:58 | #69 | Link | |
C# Addict
Join Date: Oct 2008
Location: Saudi Arabia
Posts: 114
|
Quote:
|
18th August 2010, 17:48 | #70 | Link |
Mr. Sandman
Join Date: Sep 2003
Location: Haddonfield, IL
Posts: 11,768
|
he means diamond search is already pretty fast and there is little or no need to offload it to the GPU... while full search is very time consuming and offloading it to the GPU would give a much, much bigger advantage.
__________________
MPEG-4 ASP Custom Matrices: EQM V1(old), EQM AutoGK Sharpmatrix (aka EQM V2), EQM V3HR (updated 01/10/2004), EQM V3LR, EQM V3ULR (updated 04/02/2005), EQM V3UHR (updated 17/12/2004) and EQM V3EHR (updated 05/10/2004) Info about my ASP matrices. MPEG-4 AVC Custom Matrices: EQM AVC-HR Info about my AVC matrices My x264 builds. Mooo!!! |
18th August 2010, 18:51 | #71 | Link | |
C# Addict
Join Date: Oct 2008
Location: Saudi Arabia
Posts: 114
|
Quote:
|
18th August 2010, 18:53 | #72 | Link |
Software Developer
Join Date: Jun 2005
Location: Last House on Slunk Street
Posts: 13,251
|
Look at the built-in x264 presets. Nothing uses ESA or even TESA, except for the "placebo" preset. That should tell you what to expect from a "full" search.
18th August 2010, 21:01 | #73 | Link | |
C# Addict
Join Date: Oct 2008
Location: Saudi Arabia
Posts: 114
|
Quote:
|
19th August 2010, 03:51 | #74 | Link |
Mr. Sandman
Join Date: Sep 2003
Location: Haddonfield, IL
Posts: 11,768
|
the gain of ESA/TESA over UMH is not significant and both are very slow. however, offloading the workload to the GPU would mean you get that marginal quality gain for free and you would also gain some speed since no ME is done by the CPU.
19th August 2010, 04:02 | #76 | Link |
Mr. Sandman
Join Date: Sep 2003
Location: Haddonfield, IL
Posts: 11,768
|
sure, otherwise it will slow it down...
19th August 2010, 07:40 | #77 | Link |
Registered User
Join Date: Jul 2010
Posts: 11
|
Consider running ME for all blocks in parallel on the GPU: diamond search may take more time than full search. E.g. with 10 blocks, if one of them searches 80 points, then even if the others only search 5 points each, the GPU has to wait for all the blocks to finish their ME. In that case, diamond search will take more time than an 8x8 full search.
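As a toy model of that argument (not a real GPU timing model): if all blocks are searched in lockstep, the batch is only done when the slowest block is done, so the cost is the maximum point count over the blocks rather than the average. The name `lockstep_cost` is made up, and 64 points per block for the full search assumes "8x8" means 8x8 = 64 candidates.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical cost model: number of search steps a lockstep batch needs,
 * i.e. the largest per-block point count. */
unsigned lockstep_cost(const unsigned *points, size_t n)
{
    assert(points != NULL || n == 0);

    unsigned worst = 0;
    for (size_t i = 0; i < n; i++)
        if (points[i] > worst)
            worst = points[i];
    return worst;
}
```

With the numbers from the example, a diamond batch of `{80, 5, 5, 5, 5, 5, 5, 5, 5, 5}` costs 80 steps under this model, while a full-search batch where every block checks 64 points costs 64.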
|
19th August 2010, 09:02 | #78 | Link | |
Registered User
Join Date: Nov 2005
Posts: 497
|
Quote:
If you only offload the integer-pel ME and interpolation, it is less significant.
__________________
The Next Generation Internet Video Codec project. |
|
19th August 2010, 09:45 | #79 | Link | |
x264 developer
Join Date: Sep 2005
Posts: 8,666
|
Quote:
The reason diamond is problematic is that to get even remotely decent performance, you have to have coalesced loads for the GPU threads. |
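A rough way to picture the coalescing issue, as a toy model in plain C rather than real CUDA: count how many aligned memory segments a group of threads touches when thread i reads the byte at `addr[i]`. The 128-byte segment size, the group of 32 threads, and the name `segments_touched` are all assumptions for illustration; the actual coalescing rules depend on the GPU generation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SEGMENT_BYTES 128u  /* assumed transaction size, illustration only */

/* Count the distinct aligned segments touched when thread i reads one byte
 * at addr[i]. Fewer segments means the loads are closer to coalesced. */
size_t segments_touched(const uint32_t *addr, size_t n)
{
    assert(addr != NULL || n == 0);

    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        uint32_t seg = addr[i] / SEGMENT_BYTES;
        int seen = 0;
        for (size_t j = 0; j < i; j++) {
            if (addr[j] / SEGMENT_BYTES == seg) {
                seen = 1;
                break;
            }
        }
        if (!seen)
            count++;
    }
    return count;
}
```

In a full search, neighbouring threads can be given neighbouring candidate positions, so 32 threads reading 32 consecutive bytes touch a single segment in this model. In a diamond search each block follows its own data-dependent path, so the 32 addresses can land in up to 32 different segments.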
|
Tags |
encoder, gpu, h.264 |