28th May 2016, 17:34   #1
przemoc
MP4 delay - start_pts vs media time from edit list table entry in moov.trak.edts.elst

Hi!

I'd like to understand some MP4- and AAC-related details and how ffmpeg handles them.

I'm transcoding 14.5 seconds of footage (50 fps; 696000 audio samples at 48 kHz) from huffyuv+pcm_s16le in MKV to h264+aac in MP4, using the latest stable ffmpeg 3.0.1 (Zeranoe's Win64 static build).

Code:
ffmpeg -i 30-notes-huffyuv.mkv ^
-pix_fmt:v yuv420p ^
-c:v libx264 -profile:v high -preset:v fast ^
-sc_threshold:v 0 -g:v 25 -bf:v 2 -crf:v 18 ^
-c:a aac -profile:a aac_low -b:a 384k -cutoff:a 22000 ^
30-notes.mp4
When we look at ffprobe's -show_streams output:
Code:
[STREAM]
index=0
codec_name=h264
codec_long_name=H.264 / AVC / MPEG-4 AVC / MPEG-4 part 10
profile=High
codec_type=video
codec_time_base=1/100
codec_tag_string=avc1
codec_tag=0x31637661
width=1920
height=1080
coded_width=1920
coded_height=1088
has_b_frames=2
sample_aspect_ratio=1:1
display_aspect_ratio=16:9
pix_fmt=yuv420p
level=42
color_range=N/A
color_space=unknown
color_transfer=unknown
color_primaries=unknown
chroma_location=left
timecode=N/A
refs=4
is_avc=true
nal_length_size=4
id=N/A
r_frame_rate=50/1
avg_frame_rate=50/1
time_base=1/12800
start_pts=0
start_time=0.000000
duration_ts=185600
duration=14.500000
bit_rate=15634211
max_bit_rate=N/A
bits_per_raw_sample=8
nb_frames=725
nb_read_frames=N/A
nb_read_packets=N/A
DISPOSITION:default=1
DISPOSITION:dub=0
DISPOSITION:original=0
DISPOSITION:comment=0
DISPOSITION:lyrics=0
DISPOSITION:karaoke=0
DISPOSITION:forced=0
DISPOSITION:hearing_impaired=0
DISPOSITION:visual_impaired=0
DISPOSITION:clean_effects=0
DISPOSITION:attached_pic=0
TAG:language=und
TAG:handler_name=VideoHandler
[/STREAM]
[STREAM]
index=1
codec_name=aac
codec_long_name=AAC (Advanced Audio Coding)
profile=LC
codec_type=audio
codec_time_base=1/48000
codec_tag_string=mp4a
codec_tag=0x6134706d
sample_fmt=fltp
sample_rate=48000
channels=2
channel_layout=stereo
bits_per_sample=0
id=N/A
r_frame_rate=0/0
avg_frame_rate=0/0
time_base=1/48000
start_pts=-1024
start_time=-0.021333
duration_ts=697024
duration=14.521333
bit_rate=235170
max_bit_rate=384000
bits_per_raw_sample=N/A
nb_frames=681
nb_read_frames=N/A
nb_read_packets=N/A
DISPOSITION:default=1
DISPOSITION:dub=0
DISPOSITION:original=0
DISPOSITION:comment=0
DISPOSITION:lyrics=0
DISPOSITION:karaoke=0
DISPOSITION:forced=0
DISPOSITION:hearing_impaired=0
DISPOSITION:visual_impaired=0
DISPOSITION:clean_effects=0
DISPOSITION:attached_pic=0
TAG:language=und
TAG:handler_name=SoundHandler
[/STREAM]
we can see that:
  • video:
    start_pts=0
    start_time=0.000000
  • audio:
    start_pts=-1024 (accommodating AAC priming - ffmpeg's native aac encoder delay is 1024 samples long)
    start_time=-0.021333 (1024/48000)
which is all fine.
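
As a sanity check, here is a minimal Python sketch of the conversion from start_pts to start_time. The numbers are copied from the ffprobe output above; the helper name pts_to_seconds is only illustrative and is not part of ffmpeg or ffprobe.

```python
from fractions import Fraction

def pts_to_seconds(pts: int, time_base: Fraction) -> float:
    """Convert a timestamp expressed in time_base ticks to seconds."""
    return float(pts * time_base)

# Values copied from the ffprobe -show_streams output.
video_start = pts_to_seconds(0, Fraction(1, 12800))
audio_start = pts_to_seconds(-1024, Fraction(1, 48000))
audio_duration = pts_to_seconds(697024, Fraction(1, 48000))

print(f"video start_time = {video_start:.6f}")
print(f"audio start_time = {audio_start:.6f}")
print(f"audio duration   = {audio_duration:.6f}")
```

The printed values match the start_time and duration fields that ffprobe reports for both streams.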
But if we look into the internals of this MP4 with Elecard Video Format Analyzer or a dump from mp4box (I used the latest stable 0.6.1 Win64 build):
Code:
mp4box -std -diso 30-notes.mp4 | egrep -v "\<(CompositionOffsetEntry|SyncSampleEntry|SampleToChunkEntry|SampleSizeEntry|ChunkEntry)\>" 
<?xml version="1.0" encoding="UTF-8"?>
<!--MP4Box dump trace-->
<IsoMediaFile Name="30-notes.mp4">
<FileTypeBox MajorBrand="isom" MinorVersion="512">
<BoxInfo Size="32" Type="ftyp"/>
<BrandEntry AlternateBrand="isom"/>
<BrandEntry AlternateBrand="iso2"/>
<BrandEntry AlternateBrand="avc1"/>
<BrandEntry AlternateBrand="mp41"/>
</FileTypeBox>
<FreeSpaceBox size="0">
<BoxInfo Size="8" Type="free"/>
</FreeSpaceBox>
<MediaDataBox dataSize="28763883">
<BoxInfo Size="28763891" Type="mdat"/>
</MediaDataBox>
<MovieBox>
<BoxInfo Size="18880" Type="moov"/>
<MovieHeaderBox CreationTime="0" ModificationTime="0" TimeScale="1000" Duration="14522" NextTrackID="3">
<BoxInfo Size="108" Type="mvhd"/>
<FullBoxInfo Version="0" Flags="0x0"/>
</MovieHeaderBox>
<TrackBox>
<BoxInfo Size="12723" Type="trak"/>
<TrackHeaderBox CreationTime="0" ModificationTime="0" TrackID="1" Duration="14500" Width="1920.00" Height="1080.00">
<Matrix m11="0x00010000" m12="0x00000000" m13="0x00000000" m21="0x00000000" m22="0x00010000" m23="0x00000000" m31="0x00000000" m32="0x00000000" m33="0x40000000"/>
<BoxInfo Size="92" Type="tkhd"/>
<FullBoxInfo Version="0" Flags="0x3"/>
</TrackHeaderBox>
<EditBox>
<BoxInfo Size="36" Type="edts"/>
<EditListBox EntryCount="1">
<BoxInfo Size="28" Type="elst"/>
<FullBoxInfo Version="0" Flags="0x0"/>
<EditListEntry Duration="14500" MediaTime="512" MediaRate="1"/>
</EditListBox>
</EditBox>
<MediaBox>
<BoxInfo Size="12587" Type="mdia"/>
<MediaHeaderBox CreationTime="0" ModificationTime="0" TimeScale="12800" Duration="185600" LanguageCode="und">
<BoxInfo Size="32" Type="mdhd"/>
<FullBoxInfo Version="0" Flags="0x0"/>
</MediaHeaderBox>
<HandlerBox Type="vide" Name="VideoHandler" reserved1="0" reserved2="data:application/octet-string,000000000000000000000000">
<BoxInfo Size="45" Type="hdlr"/>
<FullBoxInfo Version="0" Flags="0x0"/>
</HandlerBox>
<MediaInformationBox>
<BoxInfo Size="12502" Type="minf"/>
<VideoMediaHeaderBox>
<BoxInfo Size="20" Type="vmhd"/>
<FullBoxInfo Version="0" Flags="0x1"/>
</VideoMediaHeaderBox>
<DataInformationBox><BoxInfo Size="36" Type="dinf"/>
<DataReferenceBox>
<BoxInfo Size="28" Type="dref"/>
<FullBoxInfo Version="0" Flags="0x0"/>
<URLDataEntryBox>
<!--Data is contained in the movie file-->
<BoxInfo Size="12" Type="url "/>
<FullBoxInfo Version="0" Flags="0x1"/>
</URLDataEntryBox>
</DataReferenceBox>
</DataInformationBox>
<SampleTableBox>
<BoxInfo Size="12438" Type="stbl"/>
<SampleDescriptionBox>
<BoxInfo Size="154" Type="stsd"/>
<FullBoxInfo Version="0" Flags="0x0"/>
<AVCSampleEntryBox DataReferenceIndex="1" Width="1920" Height="1080" XDPI="4718592" YDPI="4718592" BitDepth="24">
<BoxInfo Size="138" Type="avc1"/>
<AVCConfigurationBox>
<AVCDecoderConfigurationRecord configurationVersion="1" AVCProfileIndication="100" profile_compatibility="0" AVCLevelIndication="42" nal_unit_size="4" chroma_format="0" luma_bit_depth="0" chroma_bit_depth="0">
<SequenceParameterSet size="27" content="data:application/octet-string,6764002AACD940780227E5C044000003000400000301903C60C658"/>
<PictureParameterSet size="6" content="data:application/octet-string,68EAE08CB22C"/>
</AVCDecoderConfigurationRecord>
<BoxInfo Size="52" Type="avcC"/>
</AVCConfigurationBox>
</AVCSampleEntryBox>
</SampleDescriptionBox>
<TimeToSampleBox EntryCount="1">
<BoxInfo Size="24" Type="stts"/>
<FullBoxInfo Version="0" Flags="0x0"/>
<TimeToSampleEntry SampleDelta="256" SampleCount="725"/>
<!-- counted 725 samples in STTS entries -->
</TimeToSampleBox>
<CompositionOffsetBox EntryCount="665">
<BoxInfo Size="5336" Type="ctts"/>
<FullBoxInfo Version="0" Flags="0x0"/>
<!-- counted 725 samples in CTTS entries -->
</CompositionOffsetBox>
<SyncSampleBox EntryCount="29">
<BoxInfo Size="132" Type="stss"/>
<FullBoxInfo Version="0" Flags="0x0"/>
</SyncSampleBox>
<SampleToChunkBox EntryCount="93">
<BoxInfo Size="1132" Type="stsc"/>
<FullBoxInfo Version="0" Flags="0x0"/>
<!-- counted 724 samples in STSC entries (could be less than sample count) -->
</SampleToChunkBox>
<SampleSizeBox SampleCount="725">
<BoxInfo Size="2920" Type="stsz"/>
<FullBoxInfo Version="0" Flags="0x0"/>
</SampleSizeBox>
<ChunkOffsetBox EntryCount="679">
<BoxInfo Size="2732" Type="stco"/>
<FullBoxInfo Version="0" Flags="0x0"/>
</ChunkOffsetBox>
</SampleTableBox>
</MediaInformationBox>
</MediaBox>
</TrackBox>
<TrackBox>
<BoxInfo Size="5943" Type="trak"/>
<TrackHeaderBox CreationTime="0" ModificationTime="0" TrackID="2" Duration="14522" AlternateGroupID="1" Volume="1.00">
<BoxInfo Size="92" Type="tkhd"/>
<FullBoxInfo Version="0" Flags="0x3"/>
</TrackHeaderBox>
<EditBox>
<BoxInfo Size="36" Type="edts"/>
<EditListBox EntryCount="1">
<BoxInfo Size="28" Type="elst"/>
<FullBoxInfo Version="0" Flags="0x0"/>
<EditListEntry Duration="14500" MediaTime="1024" MediaRate="1"/>
</EditListBox>
</EditBox>
<MediaBox>
<BoxInfo Size="5807" Type="mdia"/>
<MediaHeaderBox CreationTime="0" ModificationTime="0" TimeScale="48000" Duration="697024" LanguageCode="und">
<BoxInfo Size="32" Type="mdhd"/>
<FullBoxInfo Version="0" Flags="0x0"/>
</MediaHeaderBox>
<HandlerBox Type="soun" Name="SoundHandler" reserved1="0" reserved2="data:application/octet-string,000000000000000000000000">
<BoxInfo Size="45" Type="hdlr"/>
<FullBoxInfo Version="0" Flags="0x0"/>
</HandlerBox>
<MediaInformationBox>
<BoxInfo Size="5722" Type="minf"/>
<SoundMediaHeaderBox>
<BoxInfo Size="16" Type="smhd"/>
<FullBoxInfo Version="0" Flags="0x0"/>
</SoundMediaHeaderBox>
<DataInformationBox><BoxInfo Size="36" Type="dinf"/>
<DataReferenceBox>
<BoxInfo Size="28" Type="dref"/>
<FullBoxInfo Version="0" Flags="0x0"/>
<URLDataEntryBox>
<!--Data is contained in the movie file-->
<BoxInfo Size="12" Type="url "/>
<FullBoxInfo Version="0" Flags="0x1"/>
</URLDataEntryBox>
</DataReferenceBox>
</DataInformationBox>
<SampleTableBox>
<BoxInfo Size="5662" Type="stbl"/>
<SampleDescriptionBox>
<BoxInfo Size="106" Type="stsd"/>
<FullBoxInfo Version="0" Flags="0x0"/>
<MPEGAudioSampleDescriptionBox DataReferenceIndex="1" SampleRate="48000" Channels="2" BitsPerSample="16">
<BoxInfo Size="90" Type="mp4a"/>
<MPEG4ESDescriptorBox>
<BoxInfo Size="54" Type="esds"/>
<FullBoxInfo Version="0" Flags="0x0"/>
 <ES_Descriptor ES_ID="es2" binaryID="2" >
  <decConfigDescr>
   <DecoderConfigDescriptor objectTypeIndication="64" streamType="5" maxBitrate="384000" avgBitrate="235170" >
    <decSpecificInfo>
     <DecoderSpecificInfo type="auto" src="data:application/octet-string,%11%90%56%E5%00" />
    </decSpecificInfo>
   </DecoderConfigDescriptor>
  </decConfigDescr>
  <slConfigDescr>
   <SLConfigDescriptor >
    <predefined value="2" />
    <custom >
    </custom>
   </SLConfigDescriptor>
  </slConfigDescr>
 </ES_Descriptor>
</MPEG4ESDescriptorBox>
</MPEGAudioSampleDescriptionBox>
</SampleDescriptionBox>
<TimeToSampleBox EntryCount="2">
<BoxInfo Size="32" Type="stts"/>
<FullBoxInfo Version="0" Flags="0x0"/>
<TimeToSampleEntry SampleDelta="1024" SampleCount="680"/>
<TimeToSampleEntry SampleDelta="704" SampleCount="1"/>
<!-- counted 681 samples in STTS entries -->
</TimeToSampleBox>
<SampleToChunkBox EntryCount="2">
<BoxInfo Size="40" Type="stsc"/>
<FullBoxInfo Version="0" Flags="0x0"/>
<!-- counted 681 samples in STSC entries (could be less than sample count) -->
</SampleToChunkBox>
<SampleSizeBox SampleCount="681">
<BoxInfo Size="2744" Type="stsz"/>
<FullBoxInfo Version="0" Flags="0x0"/>
</SampleSizeBox>
<ChunkOffsetBox EntryCount="679">
<BoxInfo Size="2732" Type="stco"/>
<FullBoxInfo Version="0" Flags="0x0"/>
</ChunkOffsetBox>
</SampleTableBox>
</MediaInformationBox>
</MediaBox>
</TrackBox>
<UserDataBox>
<BoxInfo Size="98" Type="udta"/>
<MetaBox>
<BoxInfo Size="90" Type="meta"/>
<FullBoxInfo Version="0" Flags="0x0"/>
<HandlerBox Type="mdir" Name="" reserved1="0" reserved2="data:application/octet-string,6170706C0000000000000000">
<BoxInfo Size="33" Type="hdlr"/>
<FullBoxInfo Version="0" Flags="0x0"/>
</HandlerBox>
<ItemListBox>
<BoxInfo Size="45" Type="ilst"/>
<ToolBox value="Lavf57.25.100" >
<FullBoxInfo Version="0" Flags="0x1"/>
<BoxInfo Size="37" Type=".too"/>
</ToolBox>
</ItemListBox>
</MetaBox>
</UserDataBox>
</MovieBox>
</IsoMediaFile>
then we can see that:
  • video:
    moov.trak.edts.elst[0].media_time = 512
    moov.trak.mdia.mdhd.timescale=12800
    moov.trak.mdia.mdhd.duration=185600 (i.e. 14.5 secs)
  • audio:
    moov.trak.edts.elst[0].media_time = 1024
    moov.trak.mdia.mdhd.timescale=48000
    moov.trak.mdia.mdhd.duration=697024 (i.e. 14.5 secs + 1024 samples)
This confuses me, because it suggests that video playback starts at media time 512, i.e. 512/12800 = 0.04 s (40 ms) into the video stream, yet the encoded video stream is not longer by that amount and ffprobe clearly shows start_pts = 0.
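
The arithmetic behind those edit list values can be sketched in Python as follows. The numbers come from the elst, mdhd and stts entries in the dump; the helper name media_time_to_seconds is hypothetical. Expressing each offset in units of the track's stts SampleDelta is just an extra way of looking at the numbers, not an explanation of why ffmpeg writes them.

```python
from fractions import Fraction

def media_time_to_seconds(media_time: int, timescale: int) -> Fraction:
    """elst media_time is expressed in the track's mdhd timescale."""
    return Fraction(media_time, timescale)

# Video track: elst MediaTime=512, mdhd TimeScale=12800, stts SampleDelta=256
video_offset = media_time_to_seconds(512, 12800)
video_frames = Fraction(512, 256)

# Audio track: elst MediaTime=1024, mdhd TimeScale=48000, stts SampleDelta=1024
audio_offset = media_time_to_seconds(1024, 48000)
audio_frames = Fraction(1024, 1024)

print(f"video: {float(video_offset):.6f} s = {video_frames} frame durations")
print(f"audio: {float(audio_offset):.6f} s = {audio_frames} AAC frame durations")
```

So the video offset equals two frame durations and the audio offset equals one AAC frame, the same 1024 samples that ffprobe reports as a negative start_pts.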

1. What am I missing here? Am I interpreting media time incorrectly, i.e. does it have some other meaning than I think it has?

I also have some bonus questions:

2. Isn't an AAC encoder required to produce full access units (typically 1024 samples each)? 697024/1024 = 680.6875 is not an integer.

3. I know that padding info (for start and end) can be stored in the ITUNSMPB tag, but ffmpeg is not using that, adhering (I hope) to ISO only, so where is this tail padding stored? Or is moov.trak.mdia.mdhd.duration allowed to be lower than the real media duration (which would be divisible by 1024)?

4. If ffmpeg is using the ISO way of delaying AAC audio (instead of the iTunes way), then shouldn't it also add a sample group description (sgpd) with roll distance set to -1, as the edit list (elst) alone is not enough for signaling encoder delay?
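
To make the numbers behind questions 2 and 3 concrete, here is a small Python sketch that sums the audio stts entries from the dump. It assumes every AAC-LC access unit decodes to 1024 samples, which is my assumption here and not something the dump itself states.

```python
FRAME_SIZE = 1024  # assumed samples per AAC-LC access unit

# (SampleCount, SampleDelta) pairs from the audio track's stts box
stts_entries = [(680, 1024), (1, 704)]

packets = sum(count for count, _ in stts_entries)
signalled = sum(count * delta for count, delta in stts_entries)
decoded = packets * FRAME_SIZE   # length if every packet is a full frame
shortfall = decoded - signalled  # samples not covered by the last delta

print(f"packets            = {packets}")
print(f"signalled duration = {signalled}")
print(f"decoded length     = {decoded}")
print(f"shortfall          = {shortfall}")

# 696000 source samples plus 1024 priming samples gives the mdhd duration.
assert signalled == 696000 + 1024
```

Under that assumption, the 681 packets would decode to 697344 samples, while stts and mdhd.duration signal only 697024; the 320-sample difference is carried by the shortened SampleDelta of 704 on the last entry.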

Ok, that's all for my first post on doom9.