4th September 2011, 22:26 | #1 | Link |
QuickSync Decoder author
Join Date: Apr 2011
Location: Atlit, Israel
Posts: 916
|
Intel QuickSync Decoder - HW accelerated FFDShow decoder with video processing
Updated June 22nd 2013
Hi,

My name is Eric Gur, and as a side project at my Intel position I've taken it upon myself to make the hardware accelerated video decoding technology in Intel SandyBridge (or newer) processors freely accessible to everyone. The project name is Intel QuickSync Decoder. To do so, I decided to embed the Intel QuickSync technology introduced in SandyBridge into the widely popular FFDShow video decoder filter. Nowadays, the Intel QuickSync Decoder is officially integrated in FFDShow, LAV Video Decoder and PotPlayer.

Main features
* HW decode using Intel's high performance QuickSync engine.
* Decodes H264, MPEG2, VC-1 and WMV9. DVD playback is not supported.
* HW deinterlacing, auto or forced, with half or full (50/60p) output rate.
* HW denoise and detail filters.
* Soft 3:2 pulldown on marked streams.
* Supports variable frame rate streams.
* Supports headless iGPU (Intel GPU disconnected from display) on Windows 8 and newer.

If your system meets the requirements, I'd appreciate stability feedback on video content of assorted quality and sources. To report a bug or request a feature, please post in this thread. If something is broken, please read the known issues section below and then provide me with a detailed report including:
1. Hardware (CPU, GPUs)
2. Software (OS, driver version, player, splitter, etc.)
3. Access to the offending content. Share via your favorite file sharing site. Limit content to <100MB.

Requirements:
1. SandyBridge (2nd Generation Core i3/i5/i7/Celeron/Pentium) or newer. Older platforms will not work and there are no plans to support them.
2. Latest Intel graphics drivers. The Intel GPU must either be the primary GPU, be connected to an extended display, or be used via Lucid Virtu.
3. Windows 7 (32/64) or newer OS. It should work on Vista, but I can't test this.

Known Issues:
* Jumpy playback or heavy corruption on many clips is the result of drivers obtained from Windows Update. Download drivers from your OEM's website or directly from Intel's download center.
* Some versions of Lucid Virtu cause video playback in a 64 bit player to display frames out of order.
* Frame rate is wrong or aspect ratio is incorrect: Haali Media Splitter is sending corrupt time stamps or aspect ratio. LAV splitter is recommended.
* After a seek in a TS file, corruption is seen for a few frames. This is a known LAV splitter issue.
* Resolutions greater than 1080p aren't supported on SandyBridge.

Installation:
1. An ffdshow installer is supplied.
2. Open the FFDShow configuration dialog and select 'Intel Quicksync' on the codec page for the desired formats (H264/VC1/MPEG2).

Version 0.45 is out with the following changes:
* Bugfix - frames were sometimes treated as interlaced.
* Bugfix - time stamps are passed 'as is' when TS manipulation is off.
* Bugfix - time stamp handling was causing A/V delay.
* Changed: AnnexB type packets (AVC in TS files) are not pre-processed and are sent to the HW decoder directly. May break a broken clip or two but saves many others.
* Sync with MSDK 2014 files.
* FFDShow: r4531

Downloads
* For the latest cutting edge FFDShow builds, download my builds
* Intel QuickSync Decoder SourceForge home page
* FFDShow-tryout site
* LAV Splitter builds
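As an illustration of the soft 3:2 pulldown feature listed above, here is a minimal sketch of how four film frames are expanded into ten fields, which is what turns 23.976 fps film into 29.97 fps interlaced video. In MPEG-2 the repetition is signalled per picture by the repeat_first_field flag. The function name `pulldown_fields` and the frame labels are made up for this example, and nothing here is taken from the decoder's actual code.

```python
from fractions import Fraction


def pulldown_fields(frames, top_field_first=True):
    """Expand progressive frames into a 3:2 field sequence.

    Frames alternately contribute 3 fields and 2 fields, and field parity
    (top/bottom) keeps alternating across frame boundaries.
    """
    fields = []
    top = top_field_first
    for index, frame in enumerate(frames):
        count = 3 if index % 2 == 0 else 2
        for _ in range(count):
            fields.append((frame, "T" if top else "B"))
            top = not top
    return fields


fields = pulldown_fields(["A", "B", "C", "D"])

# Pair consecutive fields into interlaced output frames.
output_frames = [fields[i:i + 2] for i in range(0, len(fields), 2)]

# 4 film frames become 5 video frames, so the rate scales by 5/4.
film_rate = Fraction(24000, 1001)
video_rate = film_rate * len(output_frames) / 4

for frame in output_frames:
    print(frame)
print(video_rate)
```

The second and third output frames mix fields from two different film frames, which is why a wrong cadence shows up as judder on pans.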
__________________
Eric Gur, Processor Application Engineer for Overclocking and CPU technologies Intel QuickSync Decoder author Intel Corp. Last edited by egur; 28th June 2014 at 14:14. |
5th September 2011, 00:45 | #3 | Link |
Registered User
Join Date: Jan 2010
Posts: 75
|
egur, this is very good to know. I have some questions:
1) Does SB have specific problems with the DXVA interfaces, such that it needs specific QuickSync support? It's known to crash MPC-HC and ffdshow DXVA as well.
2) What about the Pentium Gxxx series, since they don't have QuickSync... |
5th September 2011, 07:38 | #4 | Link | |
QuickSync Decoder author
Join Date: Apr 2011
Location: Atlit, Israel
Posts: 916
|
Quote:
2) Regarding the Pentium brand, I don't know. If someone has it, please let me know. |
|
5th September 2011, 09:00 | #6 | Link | |
QuickSync Decoder author
Join Date: Apr 2011
Location: Atlit, Israel
Posts: 916
|
Quote:
SandyBridge has 2-4 cores, an integrated GPU, an integrated memory controller and an integrated PCIe controller. Pentium 4 doesn't have the HW needed and will definitely not work. My build of FFDShow might work on Core 2 Duo/Quad and i3/i5/i7 if and only if there's an Intel integrated GPU (found in many laptops and low end desktops). This wasn't tested though. It will not work on AMD processors either, as they do not have compatible HW. My build should work on future processors with Intel graphics such as IvyBridge and Haswell. |
|
5th September 2011, 10:28 | #7 | Link | |
Registered User
Join Date: Sep 2009
Location: Sydney, Australia
Posts: 1,073
|
Quote:
Tested on Windows 7 (64-bit) in MPC-HC (32-bit). Last edited by namaiki; 5th September 2011 at 10:42. |
|
5th September 2011, 11:59 | #8 | Link |
Registered User
Join Date: Apr 2002
Location: Germany
Posts: 4,926
|
Nice work Eric, though I guess it won't do any better than Intel's own decoder sample in IMSDK 3?

At least for MPEG-2 it seems questionable whether the hassle with different setups is worth it. From my measurements it saves around 1W on my Core i5-2400 compared to ffmpeg's decoder, and you still have all the hassle with MPEG-2 Studio 4:2:2 switching, as the Intel decoder, same as Nvidia's, is not capable of doing this with DXVA.

Of course it looks totally different for H.264 (that is where the biggest saving is compared to the world's most performant software decoders, but again, once we come to the 10 bit 4:2:2 and 4:4:4 or lossless area everything falls apart). As for VC-1, I'm not sure; at least WMV3 doesn't seem to perform much better on Quicksync than libavcodec's decoder on the CPU:
http://forum.doom9.org/showthread.ph...85#post1523685
A follow up on that, eliminating further overhead:
http://forum.doom9.org/showthread.ph...92#post1523692

Though it's cool that you (Intel) now also want to optimize based on samples, like Nvidia did in the early days. The first thing you should look at is this sample:
http://forum.doom9.org/showthread.ph...93#post1523293
I tried a lot but I can't get it stable with EVR and Intel's decoder. It doesn't matter which splitter is used, the tree pan doesn't get smooth when hardware decoded, and it's also a no go with Microsoft's DTV-Decoder. The only solution for this sample is the LAV based framework on EVR; then it gets perfectly smooth and perfectly telecined.

Then there is the issue with my sample.ts (also telecined, though H.264) on EVR Custom, but I'm not so sure this is Intel's fault. Software decoding works fine, but hardware decoding fails with EVR Custom. See a video of this issue:
http://mirror05.x264.nl/CruNcher/mpc-hc/
(Btw made with Quicksync) <- Fixed with FFdshow for Quicksync

Intel driver is 8.15.10.2476 (Windows 7 64 bit).

I'm trying your decoder now with all of this.

PS: You should mention in your post that it's 32 bit.

Superb news: my sample.ts (H.264, EVR Custom) issue is history with this, perfectly telecined 23.976.

Perfect, awesome: it doesn't allow an MPEG-2 Studio Profile connection and so falls back like it should.

This is the most awesome decoder for Quicksync currently (except that the overhead of not being DXVA2 Native is huge, depending on the stream; see here after the bugs: http://forum.doom9.org/showthread.ph...06#post1523906).

Though the correct telecine to 24.30 (evil_tree MPEG-2 1080i 29.970 sample) is also problematic with it on EVR; it seems to do 0.30 fps too much (interlace flags off). Really tricky. This is what it should look like in the end (works only on EVR normal), else you won't get the tree pan smooth.

Default telecine works perfectly even on EVR Custom.

It also likes to crash with several *.ts files in combination with LAV Splitter (the crashy ones work fine with the internal MPC-HC TS splitter):
http://forum.doom9.org/showthread.php?t=156191

Yep, it crashes a lot with LAV Splitter.

No Vsync, no Exclusive mode, nothing, just Aero and Quicksync (again you can nicely see the jitter that the stats and graph rendering cause, the current EVR Custom OSD overhead).
__________________
all my compares are riddles so please try to decipher them yourselves :) It is about Time Join the Revolution NOW before it is to Late ! http://forum.doom9.org/showthread.php?t=168004 Last edited by CruNcher; 5th September 2011 at 18:48. |
5th September 2011, 17:11 | #9 | Link | |
Registered User
Join Date: Apr 2002
Location: Germany
Posts: 4,926
|
Major issues with VC-1 in *.ts: either sync problems or rendering issues (different VC-1 interlace encoding mixed modes).

Sync Issues:
Rendering Issues: (Nvidia fixed this problem ages ago)

It also crashes for both with LAV Splitter; I had to switch to the MPC-HC internal splitter.

Incorrect telecine again:
Lav-Splitter->Lav-Audio->FFdshow quicksync : (Incorrect)
MPC-HC Internal->Lav-Audio->FFdshow quicksync : (Correct)

Though I slowly wonder whether this is DXVA2 hardware playback at all, because MPC-HC doesn't show any DXVA2 information (or is it more something like Nvidia's NVCUVID API, an Intel API of its own? But even for that the overhead would be heavy, just for playback purposes??), as I get much, much lower CPU utilization with Microsoft's DTV-Decoder (DXVA2) on H.264 streams ???? (lets see 4 girls is coming)

Yeah, the overhead on this small HD2000 is really heavy compared to Microsoft's DXVA2.
ffdshow-quicksync overhead:

Native DXVA2 is still the way to go (imho we just need a better optimized playback framework for Quicksync, and not only for it). Though it will be really interesting to compare against the NVCUVID overhead.
Quote:
Last edited by CruNcher; 5th September 2011 at 19:11. |
|
5th September 2011, 21:21 | #10 | Link |
QuickSync Decoder author
Join Date: Apr 2011
Location: Atlit, Israel
Posts: 916
|
CruNcher:
First, thanks a lot for your analysis. That's the best way to get my little SW running properly...

I'd like to explain what I did in FFDShow. I used the Intel Media SDK v3 beta 3 DirectShow filters sample code. I stripped most of it, fixed several bugs, cleaned it up, did some refactoring, added some inline documentation and created a DLL that exports an interface. My code doesn't use any secret APIs or secret driver GUIDs and doesn't contain any algorithms. It's quite simple and not very big.

Intel's Media SDK uses DXVA1/2 to communicate with the driver/HW (that's what I've heard anyway). What it does is somewhat abstract the horrible DXVA API, making this task easier (but not easy!) and requiring less code.

The (relatively) high CPU usage is caused by one thing: memory copying from the GPU to system memory. I'll try to reduce this by doing VPP (DXVA/MSDK video post processing) to a system memory buffer. Hopefully the driver will do the copy faster than memcpy().

My idea with FFDShow is to have a one stop decoder that's low on power and high on quality. I want to abstract the HW acceleration and hopefully not lose too much because of the above frame copying.

I used a profiler to check where the CPU spends its time, and most of the time goes to copying the frame to system memory. A large chunk (25-50%) goes into the renderer's code somewhere. No clue as to why.

Just using DXVA to decode isn't trivial, as different splitters behave differently and give different data, and maybe the HW decoders aren't following the various specs to the letter. Microsoft's documentation isn't clear enough on how to write things properly. Theoretically they could have created a DXVA decoder themselves, but they didn't. Same goes for Intel/AMD/Nvidia.

My own CPU usage analysis shows that at low/medium bitrates libavcodec uses less CPU than my implementation, but when bitrates are high (I have only one 26Mbps clip) the CPU usage stays about the same in my decoder and rises in libavcodec.

BTW, if someone knows how to copy a frame from the GPU quickly, I'd like to know. Since there's no PCIe traffic going on, a solution is bound to be found. |
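To put the frame copy discussed above into numbers, here is a rough sketch of how many bytes per second have to move from GPU to system memory. It assumes the decoded frames are 8-bit NV12 (4:2:0), which the post does not state, and the helper names are invented for this example. It only computes the payload size; it says nothing about how slow each byte is to read from GPU memory, which is the actual complaint.

```python
def nv12_frame_bytes(width, height):
    """Size of one 8-bit NV12 frame (assumed format, not stated in the post).

    NV12 has a full resolution luma plane followed by one interleaved
    chroma plane (U and V) subsampled by 2 in both directions.
    """
    luma = width * height
    chroma = 2 * (width // 2) * (height // 2)
    return luma + chroma


def copy_bandwidth_mb_per_s(width, height, fps):
    """Bytes copied per second, in units of 10**6 bytes."""
    return nv12_frame_bytes(width, height) * fps / 1e6


frame = nv12_frame_bytes(1920, 1080)
print(frame)                                   # bytes per 1080p frame
print(copy_bandwidth_mb_per_s(1920, 1080, 30))  # half rate output
print(copy_bandwidth_mb_per_s(1920, 1080, 60))  # full rate deinterlaced output
```

Under these assumptions a 1080p frame is about 3.1 MB, so full rate (60p) output needs roughly twice the copy traffic of half rate output.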
5th September 2011, 21:27 | #11 | Link | |
Registered Developer
Join Date: Mar 2010
Location: Hamburg/Germany
Posts: 10,366
|
Quote:
They also have one for VC-1, the WMVideo Decoder DMO, but for some reason this one only uses DXVA in WMP, it must be locked down somehow. Of course their decoders are "pure" DXVA, which means they don't copy stuff back from the GPU, it remains in there until it is displayed - avoiding the memcpy problem.
__________________
LAV Filters - open source ffmpeg based media splitter and decoders |
|
6th September 2011, 17:11 | #12 | Link |
Registered User
Join Date: Mar 2006
Posts: 1,049
|
AFAIR from the old PCI times (it seems PCIe is only an extension of PCI), reading from a PCI device into memory was much slower than having the PCI device write to memory. If there is a chance to make the PCIe device the transaction initiator, so that the device itself writes to system memory, that should IMHO be faster than reading from the device.
|
5th September 2011, 20:50 | #13 | Link | |
Software Developer
Join Date: Oct 2001
Location: Israel
Posts: 1,008
|
The major issue here is the overhead the driver adds for memory copies.
John Carmack (ID Software) wrote about it in this interview. Quote:
But... you can't get direct access to that memory. The way the driver provides access to this memory is 1000's of percent slower than if the driver were able to point to the real memory address and let you just copy the image directly. Last edited by Blight; 5th September 2011 at 20:52. |
|
5th September 2011, 20:57 | #14 | Link |
Registered Developer
Join Date: Mar 2010
Location: Hamburg/Germany
Posts: 10,366
|
The main problem here is actually copying stuff back from GPU memory to CPU/system memory, which only NVIDIA seems to have really managed to optimize properly, for CUDA. It's not a task a game needs, which is why AMD never really cared to invest in it (and is therefore really slow at it). Intel doesn't seem to get that much performance on GPU -> CPU copies either.
It's probably true that drivers are holding back the true potential of the current and next gen hardware.
Last edited by nevcairiel; 5th September 2011 at 21:02. |
5th September 2011, 22:20 | #15 | Link | |
QuickSync Decoder author
Join Date: Apr 2011
Location: Atlit, Israel
Posts: 916
|
Quote:
The reason for the slowness, as far as I've heard (aside from the PCIe latency and BW), is that the GPU stores surfaces differently than the CPU. A GPU in many cases needs to work on blocks or tiles (e.g. 8x8, 16x16, etc.), and if those pixels are sequential in physical memory they are read/written much faster and give higher cache hit rates as well as efficient cache prefetching. So when a CPU tries to read several bytes at a time (the inner loop of memcpy), there are a lot of address translations and the memory controller needs to set up the DDR again and again for different pages. |
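The tiling argument above can be made concrete with a small sketch. It compares byte offsets in an ordinary linear (row by row) surface with offsets in a made-up tiled layout where each 16x16 tile is stored contiguously. The tile size, the tile ordering and the function names are assumptions for illustration only; the real layouts used by GPU hardware are not described in this thread and differ from this.

```python
def linear_offset(x, y, width):
    """Byte offset of pixel (x, y) in a row-major surface."""
    return y * width + x


def tiled_offset(x, y, width, tile_w=16, tile_h=16):
    """Byte offset of pixel (x, y) when each tile is stored contiguously.

    Hypothetical layout: tiles are ordered left to right, top to bottom,
    and pixels inside a tile are row-major. Width must be a multiple of tile_w.
    """
    tiles_per_row = width // tile_w
    tile_index = (y // tile_h) * tiles_per_row + (x // tile_w)
    within_tile = (y % tile_h) * tile_w + (x % tile_w)
    return tile_index * tile_w * tile_h + within_tile


width = 64

# Walk one scanline the way a naive CPU copy would, and count how often
# the next byte is NOT the neighbouring address.
row = [tiled_offset(x, 0, width) for x in range(width)]
jumps = sum(1 for a, b in zip(row, row[1:]) if b != a + 1)

print(row[14:18])  # crossing the first tile boundary
print(jumps)       # number of address discontinuities in one 64 pixel row
```

In this toy layout a block-wise reader sees each 16x16 tile as one contiguous run, while a scanline reader jumps to a different region every 16 pixels; that is the access pattern mismatch the post describes.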
|
6th September 2011, 00:00 | #16 | Link | ||||
Registered User
Join Date: Apr 2002
Location: Germany
Posts: 4,926
|
Quote:
We already had a similar discussion on Beyond3D, and nobody really wants to go back to coding the GPU directly, assembler style, anymore, so yeah, it's up to Microsoft and the vendors to improve this. Quote:
Quote:
Quote:
Also it makes it much easier to adapt to a new renderer that doesn't support DXVA and to use its full capabilities without being limited.
Last edited by CruNcher; 6th September 2011 at 00:31. |
||||
6th September 2011, 16:30 | #17 | Link |
QuickSync Decoder author
Join Date: Apr 2011
Location: Atlit, Israel
Posts: 916
|
CruNcher:
Regarding the "evil trees" clip: I get very strange results from different splitters. The LAV splitter reports 59.94 fps while Haali and the Gabest MPEG splitter report 29.97. All splitters produce a cadence of P B T P B .... (progressive, bottom first, top first) and all of them start past the zero time stamp (something like 4 missing frames). I'll dig into this to make sure I behave properly on all of them.

I need a VC1 clip that crashes like you reported; currently I don't have any crashing content. Also, what source filters are used for VC1 (.wmv)? The WM ASF Reader freezes too much (regardless of decoder).

I've fixed the seeking issue and now seeks are instantaneous without artifacts. I also fixed MPEG2 sequence header initialization, which should fix the seek corruption. I'll release a new build in a day or two. |
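To show why a splitter reporting 59.94 fps instead of 29.97 fps matters for time stamps, here is a small sketch that converts a frame rate into per-frame durations and time stamps in 100 ns units, the unit DirectShow uses for REFERENCE_TIME. The function names are invented for this example, and this is not how any of the splitters or the decoder mentioned above actually compute their time stamps.

```python
from fractions import Fraction

UNITS_PER_SECOND = 10_000_000  # 100 ns ticks, as in DirectShow REFERENCE_TIME


def frame_duration(fps):
    """Duration of one frame in 100 ns units, rounded to the nearest tick."""
    return round(UNITS_PER_SECOND / fps)


def timestamps(frame_count, fps, start=0):
    """Start times for consecutive frames.

    Each value is derived from the frame index rather than by repeatedly
    adding a rounded duration, so rounding errors do not accumulate.
    """
    return [start + round(i * UNITS_PER_SECOND / fps) for i in range(frame_count)]


ntsc_frame_rate = Fraction(30000, 1001)  # 29.97 fps
ntsc_field_rate = Fraction(60000, 1001)  # 59.94 fps

print(frame_duration(ntsc_frame_rate))
print(frame_duration(ntsc_field_rate))
print(timestamps(4, ntsc_frame_rate))
```

If the same pictures are stamped with the 59.94 duration while they really are 29.97 fps frames, every frame appears to last half as long as it should, which is one way wrong frame rate reports can turn into playback problems.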
6th September 2011, 16:47 | #18 | Link |
Registered User
Join Date: Feb 2010
Posts: 364
|
Not trying to get you down or anything, but why integrate it into ffdshow while LAV Video & LAV CUVID Decoder are the new rising stars around the neighborhood?
Nev (the developer) said he's planning to integrate the two one day (which makes sense; like CoreAVC), and I think it would be wonderful if he'll have a patch available adding SB acceleration as well. It will make it the best video decoder hands down. What I'm trying to say: think ahead. forward. ffdshow is slowly fading w/ each step LAV Filters take. I believe the day where codec packs use LAV Filters (instead of Haali & ffdshow) is not that far away. Or maybe I'm the only one who has noticed it? Last edited by Superb; 6th September 2011 at 16:50. |
6th September 2011, 21:26 | #19 | Link | |
QuickSync Decoder author
Join Date: Apr 2011
Location: Atlit, Israel
Posts: 916
|
Quote:
|
6th September 2011, 21:38 | #20 | Link |
Registered User
Join Date: Feb 2010
Posts: 364
|
That's great news. Btw, you might wanna look at VLC's git repository... They use DXVA2 acceleration and copy the frames back too. (under modules\codec\avcodec\dxva2.c)
Last edited by Superb; 6th September 2011 at 21:51. |
Tags |
ffdshow, h264, intel, mpeg2, quicksync, vc1, zoom player |