Old 25th September 2011, 23:54   #1  |  Link
Tappen
Registered User
 
Join Date: Dec 2006
Posts: 196
SubExtractor - New Sub Ocr App

I've released an app that extracts subtitles from DVDs (non-encrypted, stored on a hard drive) and converts them to Advanced SubStation Alpha (ASS) or SRT format. It can also convert sup (PGS) and sub/idx files to the same formats. I wrote this because I hate the blocky, too-high-on-the-screen look of regular DVD subtitles and wanted to re-encode my DVD collection as H.264/AAC/ASS in an MKV container.

http://subextractor.codeplex.com/

It's a wizard-style app that lets you pick program chains, angles, audio and subtitle tracks from a DVD folder and creates mpg, d2v and bin (my own data format, similar to sub/idx combined) files for each. DGIndex is used to help line up the subs with the video, since DVD programs often have discontinuities that mess up sync. The mpg and d2v files it creates are well suited to further re-encoding of DVDs to H.264 with a tool like MeGUI.

The OCR is pretty basic, just exact pattern matching of the characters. The starting OCR database is good though so most DVDs should require manual matching of just a few characters. Some characters like i, l, I, '.', and o must be manually matched for every DVD since they have a lot of false positives. Some Bluray sup files can be tedious to OCR since the Bluray authors used scaled-up fonts, which means there ends up being 5 or more bit pattern matches for each character. Persistence pays off though if you get one of those files, just keep matching.
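To make the "exact pattern matching" idea concrete, here is a minimal sketch. It is not SubExtractor's actual code: the bitmap representation and the names `glyph_key` and `OcrDatabase` are assumptions made up for illustration. Each glyph bitmap becomes a dictionary key, and only a pixel-identical bitmap finds a match.

```python
from typing import Dict, List, Optional, Tuple

Bitmap = List[str]  # one string per pixel row, '#' = ink, '.' = background


def glyph_key(bitmap: Bitmap) -> Tuple[str, ...]:
    """Turn a glyph bitmap into a hashable key; only identical bitmaps share a key."""
    return tuple(bitmap)


class OcrDatabase:
    """Hypothetical exact-match store: bit pattern -> character."""

    def __init__(self) -> None:
        self._patterns: Dict[Tuple[str, ...], str] = {}

    def learn(self, bitmap: Bitmap, text: str) -> None:
        # Called when the user matches a glyph manually.
        self._patterns[glyph_key(bitmap)] = text

    def lookup(self, bitmap: Bitmap) -> Optional[str]:
        # None means "no exact match": the user has to identify the glyph.
        return self._patterns.get(glyph_key(bitmap))


db = OcrDatabase()
letter_l = ["#", "#", "#"]
db.learn(letter_l, "l")
print(db.lookup(letter_l))         # known pattern
print(db.lookup(["#", "#", "."]))  # unseen pattern, needs a manual match
```

A scheme like this also suggests why scaled-up fonts are tedious: every slightly different rendering of the same character is a new key that has to be learned separately.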

The line and word layout functions are pretty sophisticated and should give good results unless the characters are very unusual (vertical or upside-down text is bad).

Last edited by Tappen; 28th December 2012 at 21:18.
Old 27th September 2011, 00:45   #2  |  Link
nibus
Telewhining
 
Join Date: Mar 2010
Posts: 272
Very nice, I'll give this a shot. Can it export to .srt?
Old 27th September 2011, 01:59   #3  |  Link
Tappen
Registered User
 
Join Date: Dec 2006
Posts: 196
Yes it can also export to srt, though of course that's a much more limited format (no colors, positioning, etc).

Also, the first 3 steps of the wizard are kind of like an easier-to-use version of IfoEdit: they produce an mpg (MPEG-2 program stream) file of just the angles and tracks from the DVD that you want to re-encode.
Old 27th September 2011, 02:02   #4  |  Link
nautilus7
Registered User
 
Join Date: Jan 2006
Location: Athens, Greece
Posts: 1,518
Do italic letters work correctly with .srt output? I tested one .sup file but I didn't have any success. I'll test more tomorrow.

Last edited by nautilus7; 27th September 2011 at 11:39.
Old 27th September 2011, 08:41   #5  |  Link
nibus
Telewhining
 
Join Date: Mar 2010
Posts: 272
I ran Ice Age 3 through it and I must say, it was painless. Worked extremely quick and I can't find any OCR errors. This is definitely my favorite subtitle OCR utility! Well done!

A few ideas -

1) Being able to type the text instead of clicking it would be nice, but not a huge deal as the recognition is excellent.

2) My default "save" directory was in the "My Videos" folder. It would probably be easier if it defaulted to the current working directory.

3) The other issue is on some subtitles the alignment is a little off. Not a huge deal - but it would be nice if there was a feature that allowed you to "align" text blocks to the same left-side position.

Here's an example:



edit: also the ability to set the OCR bin file to the program directory for portable use.

Last edited by nibus; 27th September 2011 at 08:53.
Old 27th September 2011, 12:01   #6  |  Link
nautilus7
Registered User
 
Join Date: Jan 2006
Location: Athens, Greece
Posts: 1,518
Quote:
Originally Posted by nautilus7 View Post
Do italic letters work correctly with .srt output? I tested one .sup file but I didn't have any success. I'll test more tomorrow.
Tested one more file. Both .srt files created with your application don't contain italic formatting. The <i> and </i> tags are omitted.
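For reference, this is roughly what an italic cue is expected to look like in .srt output. The sketch below only illustrates the common SubRip layout with `<i>` tags (SRT has no formal specification, but most players honour them); the function names are hypothetical and have nothing to do with SubExtractor's code.

```python
def srt_timestamp(ms: int) -> str:
    """Format milliseconds as the HH:MM:SS,mmm timestamp used in SRT files."""
    hours, rest = divmod(ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    seconds, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"


def srt_cue(index: int, start_ms: int, end_ms: int, text: str, italic: bool = False) -> str:
    """Build one SRT cue; italic text is wrapped in <i>...</i>."""
    if italic:
        text = f"<i>{text}</i>"
    return f"{index}\n{srt_timestamp(start_ms)} --> {srt_timestamp(end_ms)}\n{text}\n"


print(srt_cue(1, 0, 1500, "Hello", italic=True))
```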

Also, neither .ass file can be loaded in Aegisub. I get "error processing line: style: blah blah blah".

Samples: http://www.mediafire.com/?eh78xxcdoc9siw0

Finally: what about subtitles in languages other than English? How do I insert foreign letters?
Old 27th September 2011, 13:06   #7  |  Link
Tappen
Registered User
 
Join Date: Dec 2006
Posts: 196
nautilus7: I'll look into your issues, thanks for the source files. SRT output is a feature I didn't work on much, so I'm not surprised I missed some things. They should be easy fixes though.

nibus: good suggestions.

1. I've thought about adding an "enter matching text manually" textbox myself. Hopefully I can do it without messing up the flow of the OCR.

2. I worry that the files will be installed in a directory where the user doesn't have write access without Windows bringing up a UAC dialog so I went with My Videos. Maybe I should check if the current directory is writable and make that the default if so.

3. I have an option to "Exactly Position every Line" when creating ass files which will turn off the processing that allows text which is centered and in the lower 3rd of the screen to use the default position of ass renderers. But that doesn't solve the left alignment problem. I could add a "left-align" checkbox but then all text would be left aligned and probably (since the source and dest will have different widths) make things look bad in a different way.
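The positioning behaviour described in point 3 could be sketched as follows. This is only an illustration under stated assumptions: the function name, the `tolerance` value and the choice of `\an2` (bottom-centre anchor) are made up here and are not taken from SubExtractor. The `\pos` and `\an` override tags themselves are standard ASS.

```python
def ass_position_tag(left: int, top: int, width: int, height: int,
                     frame_w: int, frame_h: int,
                     exact: bool = False, tolerance: int = 8) -> str:
    """Return an ASS override tag, or '' to let the renderer use its default placement."""
    center_x = left + width // 2
    bottom = top + height
    centered = abs(center_x - frame_w // 2) <= tolerance
    lower_third = top >= frame_h * 2 // 3
    if not exact and centered and lower_third:
        # Centered text in the lower third: rely on the style's default position.
        return ""
    # \an2 makes \pos refer to the bottom-centre of the text block.
    return f"{{\\an2\\pos({center_x},{bottom})}}"


print(repr(ass_position_tag(260, 400, 200, 40, 720, 480)))              # default placement
print(repr(ass_position_tag(260, 400, 200, 40, 720, 480, exact=True)))  # forced position
```

Because the tag is anchored on the block's centre, it does nothing for the left-alignment case raised above; that would need a left anchor (for example `\an1`) and a shared x coordinate across lines.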
Old 27th September 2011, 14:52   #8  |  Link
Tappen
Registered User
 
Join Date: Dec 2006
Posts: 196
nautilus7: your 2 issues should be fixed in the 1007 release.
Old 27th September 2011, 16:16   #9  |  Link
nautilus7
Registered User
 
Join Date: Jan 2006
Location: Athens, Greece
Posts: 1,518
Working like a charm. Thanks!

What about non-English languages?
Old 27th September 2011, 16:44   #10  |  Link
Tappen
Registered User
 
Join Date: Dec 2006
Posts: 196
nibus:

I added the ability to enter the OCR character match manually in 1008. See how you like the UI.

I couldn't figure out how to deal with UAC in Windows Vista and 7 reliably to change the output and OcrMap directories to the current app directory when it's sensible to do so. If you install (copy) into a "Program Files" sub-directory for those operating systems Windows secretly moves files created by the program elsewhere - very hard for the user to find. So I haven't changed the default directories yet.
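One common way to decide "is the app directory usable as a default?" is to probe it by actually creating a file. The sketch below shows that idea in Python purely for illustration (the app itself is .NET); `is_writable` and `default_output_dir` are hypothetical names. Note the limitation: if Windows silently redirects the write, as described above, the probe appears to succeed, so this alone would not solve that problem.

```python
import tempfile
from pathlib import Path


def is_writable(directory: Path) -> bool:
    """Probe by creating a temporary file; it is deleted again on close."""
    try:
        with tempfile.TemporaryFile(dir=directory):
            return True
    except OSError:
        # Covers missing directories and permission errors alike.
        return False


def default_output_dir(app_dir: Path, fallback: Path) -> Path:
    """Prefer the app directory when it is writable, else use the fallback."""
    return app_dir if is_writable(app_dir) else fallback


print(default_output_dir(Path.cwd(), Path.home()))
```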
Old 27th September 2011, 16:46   #11  |  Link
Tappen
Registered User
 
Join Date: Dec 2006
Posts: 196
nautilus7: certainly localization is a feature I want to add. I know how to do it, not too hard in .Net, but just haven't as yet. We'll see what the response looks like in a month or so.
Old 27th September 2011, 16:58   #12  |  Link
nautilus7
Registered User
 
Join Date: Jan 2006
Location: Athens, Greece
Posts: 1,518
OK, I see. But let me say this: your program is currently the only OCR program in active development that can read Blu-ray subs (the other is SupRip, but it is dead), and as of the 2nd public release it can output perfect English subs, at least, which is something SupRip still can't do... So I think the response will be high! :P
Old 28th September 2011, 16:13   #13  |  Link
mastrboy
Registered User
 
Join Date: Sep 2008
Posts: 365
Quote:
Originally Posted by nautilus7 View Post
OK, I see. But let me say this: your program is currently the only OCR program in active development that can read Blu-ray subs (the other is SupRip, but it is dead), and as of the 2nd public release it can output perfect English subs, at least, which is something SupRip still can't do... So I think the response will be high! :P
Subtitle Edit (http://www.nikse.dk/SubtitleEdit) can also read SUP files, and is very much alive and still being developed...
Old 28th September 2011, 16:51   #14  |  Link
nautilus7
Registered User
 
Join Date: Jan 2006
Location: Athens, Greece
Posts: 1,518
Nice! Wasn't aware of this. I'll have a look.

@Tappen a few suggestions:

Almost every time SubExtractor finds a ." or ," letter combination in italic writing, it puts a space between the two characters. Maybe some optimization can be done there so the user doesn't have to fix the space with the "advanced word spacing" feature.
Also, sometimes an unwanted space is placed after 1 (also in italic).
Old 28th September 2011, 18:39   #15  |  Link
Tappen
Registered User
 
Join Date: Dec 2006
Posts: 196
The accuracy of the space detection depends on the font kerning, and so is different for every font the disc subtitle authors use. I haven't found . , or 1 characters in italics to have a lot of problems with the samples I've OCR'd, but it's fair to say that it's very rare for there to be a space in front of . or , and very common to have a space after the same, so maybe I'll tilt the base adjustments by 1 pixel in that direction.
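The "tilt by 1 pixel" idea can be illustrated with a small gap-threshold sketch. All names and numbers here (`base_gap`, the adjustment tables, the glyph tuples) are assumptions for illustration; the real spacing logic is certainly more involved and font-dependent.

```python
from typing import Dict, List, Tuple

Glyph = Tuple[str, int, int]  # (character, left edge, right edge) in pixels

# Hypothetical adjustments: a positive value raises the gap needed to emit a space.
BEFORE_ADJUST: Dict[str, int] = {".": 1, ",": 1}   # a space before . or , is rare
AFTER_ADJUST: Dict[str, int] = {".": -1, ",": -1}  # a space after them is common


def join_glyphs(glyphs: List[Glyph], base_gap: int = 4) -> str:
    """Insert a space wherever the pixel gap reaches the adjusted threshold."""
    out = []
    for i, (char, left, _right) in enumerate(glyphs):
        if i > 0:
            prev_char, _, prev_right = glyphs[i - 1]
            threshold = (base_gap
                         + AFTER_ADJUST.get(prev_char, 0)
                         + BEFORE_ADJUST.get(char, 0))
            if left - prev_right >= threshold:
                out.append(" ")
        out.append(char)
    return "".join(out)


glyphs = [("o", 0, 6), ("k", 8, 14), (".", 18, 20), ("N", 23, 30)]
print(join_glyphs(glyphs))
```

With these sample gaps, an unadjusted threshold of 4 would put the space before the period instead of after it; the 1-pixel tilt moves it to the right place.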

How much are you having to "advanced word spacing" the left and right adjustments around those characters to fix the problem?
Old 28th September 2011, 19:11   #16  |  Link
nautilus7
Registered User
 
nautilus7's Avatar
 
Join Date: Jan 2006
Location: Athens, Greece
Posts: 1,518
Hi, the samples I sent the other day demonstrate this problem. They both use the Arial font. The problems were fixed by moving 2 units (pixels?), IIRC.
Old 29th September 2011, 00:38   #17  |  Link
Tappen
Registered User
 
Join Date: Dec 2006
Posts: 196
nautilus7 I don't see the problems when I run your .sup files. I don't see any extra spaces in front of periods or commas, or after 1s. Did you accidentally un-check "1080p Adjustments" on the Create Subtitles page? I notice that your *.ass files have the DVD (480p) default margins and font sizes instead of Bluray (doubled) values.
Old 29th September 2011, 00:55   #18  |  Link
nautilus7
Registered User
 
Join Date: Jan 2006
Location: Athens, Greece
Posts: 1,518
In the eng.sup file I sent you, you can see the following:




Space between . and " in italic writing.
Space after 1 in italic writing.

In the watchmen.dc.eng.sup file I sent you, you can see:

Space between . and " in italic writing.
Space between , and " in italic writing.

Last edited by nautilus7; 29th September 2011 at 00:58.
Old 29th September 2011, 01:16   #19  |  Link
Tappen
Registered User
 
Join Date: Dec 2006
Posts: 196
I see. I think the problem is with " rather than . or , for the first issue. Not much I can do as many subtitle fonts (whatever the Bluray or DVD authors used to generate the bitmaps that I'm OCRing, not the font we're using in the output files) have tighter spacing around " and 1 italic characters than we're seeing here. Fixing your problem would probably break a bunch of other sup files. I'm just going to have to admit that I can't do perfect word spacing. Personally I usually run the subs I produce through the Aegisub spell checker to catch and fix any repeated errors. It would be great to be perfect but I don't think it's going to happen.

One thing I'm considering is that I've seen quite a few errors with numbers. I might auto-adjust the spacing rules so that 2 numbers next to each other can't have a space in between. It's a really visually jarring error that may be worth some extra work to avoid.
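As a rough illustration of the digit rule, here is a text-level sketch that removes spaces between adjacent digits. This is an assumption about how such a rule might look, not the planned implementation: a real version would more likely act on the glyph spacing before text is produced, and this simple form would also merge numbers that really are separate (for example "12 34").

```python
import re

# One or more spaces with a digit directly on both sides.
_DIGIT_GAP = re.compile(r"(?<=\d) +(?=\d)")


def close_digit_gaps(text: str) -> str:
    """Remove spaces that sit between two digits."""
    return _DIGIT_GAP.sub("", text)


print(close_digit_gaps("In 19 84 we met at 7 pm"))
```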

I've also considered a rule where I automatically add a space before the 1st, 3rd, etc. double-quotes, and remove any space after them, and do the reverse for the 2nd, 4th etc. double-quotes. But sometimes quotes don't work exactly like that - they're continued from the previous subtitle and the order is reversed. I'd hate to deliberately mess up those cases.
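The alternating-quote rule, including the "continued from the previous subtitle" case, might be sketched like this. The function name and the `starts_inside_quote` flag are hypothetical; detecting that flag reliably is exactly the hard part described above, and this sketch just takes it as an input.

```python
def respace_quotes(text: str, starts_inside_quote: bool = False) -> str:
    """Put a space before opening quotes and after closing quotes, none inside."""
    segments = text.split('"')
    result = segments[0]
    opening = not starts_inside_quote
    for segment in segments[1:]:
        if opening:
            # Opening quote: space before (unless at line start), none after.
            result = result.rstrip(" ")
            if result and not result.endswith("\n"):
                result += " "
            result += '"' + segment.lstrip(" ")
        else:
            # Closing quote: no space before, space after if a word follows.
            result = result.rstrip(" ") + '"'
            rest = segment.lstrip(" ")
            if rest[:1].isalnum():
                rest = " " + rest
            result += rest
        opening = not opening
    return result


print(respace_quotes('He said" Hello. "and left'))
print(respace_quotes('and then he left. "', starts_inside_quote=True))
```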

Last edited by Tappen; 29th September 2011 at 02:13.
Old 2nd October 2011, 11:26   #20  |  Link
Thunderbolt8
Registered User
 
Join Date: Sep 2006
Posts: 2,197
Would it be possible to change the order in which characters are presented for OCR so that it follows the horizontal lines?
e.g. when a subtitle consists of two or more lines, then all characters from the words of the first line are asked to be recognized first and only then characters from the next line.
atm, the program keeps going on a vertical axis and this is quite irritating.
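The requested order (line by line, left to right within each line) could look something like the sketch below. It is only an illustration of the request, with made-up names and a simple `line_gap` threshold as assumptions; how the program actually orders glyphs is not described in this thread.

```python
from typing import List, Tuple

Box = Tuple[str, int, int]  # (label, x of left edge, y of top edge)


def reading_order(boxes: List[Box], line_gap: int) -> List[Box]:
    """Group boxes into lines by their top edge, then sort each line left to right."""
    lines: List[List[Box]] = []
    for box in sorted(boxes, key=lambda b: b[2]):
        # Start a new line when the box sits well below the current line's first box.
        if lines and box[2] - lines[-1][0][2] < line_gap:
            lines[-1].append(box)
        else:
            lines.append([box])
    return [box for line in lines for box in sorted(line, key=lambda b: b[1])]


boxes = [("b", 20, 102), ("c", 0, 140), ("a", 0, 100), ("d", 20, 143)]
print("".join(label for label, _, _ in reading_order(boxes, line_gap=20)))
```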

Also, the program seems to halt when I choose a character from the Windows character map that is not among the few characters your program presents on screen. It does not crash, but I cannot seem to proceed unless I undo that choice and pick one of the characters suggested in your list. In this particular case it's the em dash.

Last edited by Thunderbolt8; 2nd October 2011 at 11:31.