Old 27th October 2011, 03:23   #21  |  Link
MajorX
Registered User
 
Join Date: Mar 2010
Posts: 52
Thanks Nikse555.
I have a timing problem with some subtitles.
When I first extract the subtitles (*.VOBSUB) from the video and then run OCR on them, some start and end times come out wrong. For example, if the original sub has
Start Time --> End Time 00:00:13,097 --> 00:00:19,185
OCR shows 00:00:13,097 --> 00:00:17,185
But if I use the subtitles directly from the video, OCR shows the correct start and end times.
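To pin down which cues drift and by how much, the two timings can be compared numerically. A minimal Python sketch, assuming SRT-style `HH:MM:SS,mmm` time codes; `srt_to_ms` and `parse_cue` are illustrative helper names, not part of Subtitle Edit:

```python
import re

# SRT time code: hours:minutes:seconds,milliseconds
SRT_TIME = re.compile(r"(\d{2}):(\d{2}):(\d{2}),(\d{3})")


def srt_to_ms(stamp: str) -> int:
    """Convert one SRT time code to milliseconds."""
    match = SRT_TIME.fullmatch(stamp.strip())
    if match is None:
        raise ValueError(f"not an SRT time code: {stamp!r}")
    hours, minutes, seconds, millis = (int(g) for g in match.groups())
    return ((hours * 60 + minutes) * 60 + seconds) * 1000 + millis


def parse_cue(line: str) -> tuple[int, int]:
    """Split a 'start --> end' line into (start_ms, end_ms)."""
    start, end = line.split("-->")
    return srt_to_ms(start), srt_to_ms(end)


original = parse_cue("00:00:13,097 --> 00:00:19,185")
ocr = parse_cue("00:00:13,097 --> 00:00:17,185")

start_drift = ocr[0] - original[0]  # difference in start time, in ms
end_drift = ocr[1] - original[1]    # difference in end time, in ms
print(start_drift, end_drift)
```

For the two lines quoted in this post the start times agree and only the end time differs, by two seconds.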
Old 28th October 2011, 22:33   #22  |  Link
xekon
Registered User
 
Join Date: Jul 2011
Posts: 224
I have another feature request, or maybe you know of a configuration file I can edit so that a replacement is always performed.

I would like to replace ’ with ' because ’ shows up very weirdly (the last word is supposed to be "didn't").
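Until there is a setting for it, the replacement can also be done on the saved text outside Subtitle Edit. A minimal Python sketch; `normalize_apostrophes` is a made-up helper name, and also mapping the left single quote (‘) is an assumption beyond what is asked here:

```python
# Map typographic single quotes to the plain ASCII apostrophe.
# U+2019 is the character from this post; U+2018 is added as an assumption.
QUOTE_MAP = str.maketrans({
    "\u2019": "'",  # right single quotation mark
    "\u2018": "'",  # left single quotation mark
})


def normalize_apostrophes(text: str) -> str:
    """Return text with curly single quotes replaced by '."""
    return text.translate(QUOTE_MAP)


print(normalize_apostrophes("I didn\u2019t"))
```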

Old 29th October 2011, 05:44   #23  |  Link
Nikse555
Registered User
 
Join Date: Feb 2004
Location: Mars
Posts: 428
@MajorX: It's hard to say why without the actual sub... The OCR window has a check box for whether to use time codes from the .idx file or from the .sub file.

@xekon: Works here in latest version: http://www.nikse.dk/SubtitleEdit.zip
Old 29th October 2011, 08:01   #24  |  Link
xekon
Registered User
 
Join Date: Jul 2011
Posts: 224
WOW! You weren't kidding about it going faster! I just did a couple more episodes and it's zooming through the lines much faster!

edit: odd new bug:

I'm sorry!

was detected as:
Code:
I.m
s
0
r
rY
!

Last edited by xekon; 29th October 2011 at 08:19.
Old 29th October 2011, 15:33   #25  |  Link
Anakunda
Registered User
 
Join Date: Jan 2010
Posts: 330
Hello there!
I'm having trouble with OCR. Recognizing from SUP format, I tried both methods and both have significant inaccuracies:
In the pattern comparison mode, the engine totally ignores differences between the letters 'i' and 'l', and between 'c', 'o' and 'e'. All the letters in such a group are assigned the character that was given to the first occurrence of one of the letters from the "same" group. For example, the 1st subtitle contains the word "more"; the wizard stops at o and I assign it o. When it reaches e, it does not ask again for a letter, even though that's the 1st "e" in the subtitles, and assigns it 'o' automatically. That's very bad. I don't know if that's a result of some auto corrections made by SE, but it seems to get assigned wrongly even if I turn off all the auto corrections on the right side.
That's the character comparison method. Tesseract seems to work better but has considerable flaws too:
Some characters are automatically uppercased even if they are lowercase in the source matrix; this especially concerns 's', 'z', 'c' and 'a'. All occurrences of these letters seem to be uppercased regardless of their case in the original matrix if they stand alone or are the 1st letter of a word. All s, z, c and a's are kept lowercase in the middle of a word.
Please give me some suggestions to make at least one of the methods functional, so that most words are recognized properly and don't need correcting by the spell checker. The uppercase problem does not even seem repairable by spell checker processing!
Thank you!
Old 29th October 2011, 17:19   #26  |  Link
xekon
Registered User
 
Join Date: Jul 2011
Posts: 224
I have another feature request: could we have a checkbox to omit all <i> </i> tags? They are being used for only half a line when the whole line is italic, and they are also being used when there are no italic lines at all.

Right now after I rip a sub I am going through and doing find/replace to delete them all, but it would be great to have that as a feature in Subtitle Edit.
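The find/replace step can be scripted in the meantime. A small Python sketch, assuming the tags appear literally as `<i>` and `</i>` in the saved subtitle text; `strip_italics` is an illustrative name, not a Subtitle Edit function:

```python
import re

# Matches an opening or closing italic tag, in either letter case.
ITALIC_TAG = re.compile(r"</?i>", re.IGNORECASE)


def strip_italics(text: str) -> str:
    """Remove every <i> and </i> tag, leaving the text between them."""
    return ITALIC_TAG.sub("", text)


print(strip_italics("<i>I'm sorry!</i>"))
```

Running this over the whole file removes all italic markup in one pass, so it only suits subs where none of the italics are wanted.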

very often !! gets detected as ll

Is this something that can be fixed? or is there something I can do to help with the detection of exclamation points? or do I have to wait till tesseract is updated?

EDIT: on a side note, whatever you did for MS MODI OCR seems to have worked. and it definitely does help!

here is an example of the ll instead of " or !!



Last edited by xekon; 30th October 2011 at 09:15.
Old 30th October 2011, 09:27   #27  |  Link
xekon
Registered User
 
Join Date: Jul 2011
Posts: 224
OMG OMG OMG! The programmer in me has just thought of a VERY COOL feature you could add!

call it a visual tool for super fast comparison. (OCR can only get so good, and if you want to verify perfect subs, this is a good way to do it.)

The goal should always be perfect OCR on the first sweep, but visually checking the subs afterwards is just to verify, and the quicker you can do that the better.

Let me know what you think of this idea, I am sure it would actually be something that would be pretty fun to program.

Please let me know what you think because i think it would be AWESOME!

I am drawing an illustration in Photoshop now.

EDIT: ok to illustrate my idea... OCR a .SUP file. then use the arrow key to go down line by line, reading the text, and then looking at the image to compare and see that they are the same.

Now, that is not exactly quick, the brain has to think more, it has to remember more, and your eyes have to move and focus on more than one area, below is my idea:

Basically, use an OpenGL or DirectX library that can overlay text, or any library that looks like it will work to overlay text with transparency, and size the text to roughly overlay the SUB image with around 50-60% transparency. The letters don't have to line up perfectly; anywhere close will let you tell at a glance whether the sub and text match visually. (Basically you read the sub line ONLY once, and your brain looks for discrepancies as you do it, versus reading two or three times, moving your eyes between locations, and hoping you remember correctly.)
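The overlay itself is just alpha blending. A rough Python sketch of the idea on tiny grayscale bitmaps, with no graphics library; `blend_pixel`, `blend_images` and the `opacity` parameter are names invented for this illustration, and a real tool would render the OCR text with a proper font first:

```python
def blend_pixel(base: int, overlay: int, opacity: float) -> int:
    """Alpha-blend one grayscale value (0-255) over another."""
    return round(overlay * opacity + base * (1.0 - opacity))


def blend_images(base, overlay, opacity=0.5):
    """Blend two equally sized grayscale bitmaps (lists of rows)."""
    if len(base) != len(overlay) or any(
        len(b) != len(o) for b, o in zip(base, overlay)
    ):
        raise ValueError("images must have the same dimensions")
    return [
        [blend_pixel(b, o, opacity) for b, o in zip(base_row, over_row)]
        for base_row, over_row in zip(base, overlay)
    ]


# 255 = letter pixel, 0 = background (toy 3x2 bitmaps).
sub_image = [[0, 255, 0], [0, 255, 0]]
rendered_text = [[0, 255, 0], [0, 0, 255]]  # second row differs

preview = blend_images(sub_image, rendered_text, opacity=0.5)
for row in preview:
    print(row)
```

Where the sub image and the rendered text agree, pixels stay fully dark or fully bright; where they disagree, they land at an in-between gray, which is the discrepancy the eye would pick up.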

I think for somebody who visually checks the OCR for their subs, this would probably speed up the process by 200%+.

see how easy it is to see that they match:


here is one that passed the OCR, but is incorrect:


here is another one that passed the OCR, but is incorrect (depending on the library used you could even apply a border/stroke to the outside of the letters)


here is another, there is probably one that passes through the ocr, green light and all, in every episode, you just have to look carefully (you might even be able to adjust the thickness of the characters, so that they usually fall within the bounds of the SUB image character outlines):

Last edited by xekon; 30th October 2011 at 10:51.
Old 31st October 2011, 07:56   #28  |  Link
Chetwood
Registered User
 
 
Join Date: Nov 2001
Posts: 1,104
Looks impressive but I think it's overkill. Why not simply have a small window showing the item and an editable text window below that shows the OCRed text? In case they don't match, simply alter the text and move on to the next item.
__________________

MultiMakeMKV: MakeMKV batch processing (Win)
MultiShrink
: DVD Shrink batch processing
Offizieller Übersetzer von DVD Shrink deutsch
Old 31st October 2011, 19:20   #29  |  Link
Nikse555
Registered User
 
Join Date: Feb 2004
Location: Mars
Posts: 428
Quote:
Originally Posted by MajorX View Post
Thanks Nikse555.
I have a timing problem with some subtitles.
When I first extract the subtitles (*.VOBSUB) from the video and then run OCR on them, some start and end times come out wrong. For example, if the original sub has
Start Time --> End Time 00:00:13,097 --> 00:00:19,185
OCR shows 00:00:13,097 --> 00:00:17,185
But if I use the subtitles directly from the video, OCR shows the correct start and end times.
My guess would be that the application you ripped the vobsub with did not use time codes from the mkv container, but rather used the time codes in the sub file itself (the time codes in idx and sub file are exactly alike).
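To see what a ripper actually wrote, the start times in the .idx file can be read directly and compared with the container's. A rough Python sketch, assuming the usual `timestamp: HH:MM:SS:mmm, filepos: <hex>` lines of a VobSub index; `parse_idx` is an illustrative name, and it reads only start times, not durations:

```python
import re

# One index entry: start time plus the hex byte offset into the .sub file.
IDX_LINE = re.compile(
    r"^timestamp:\s*(\d{2}):(\d{2}):(\d{2}):(\d{3}),\s*filepos:\s*([0-9a-fA-F]+)"
)


def parse_idx(text: str) -> list[tuple[int, int]]:
    """Return (start_ms, file_offset) for each timestamp line."""
    entries = []
    for line in text.splitlines():
        match = IDX_LINE.match(line.strip())
        if match is None:
            continue  # comments, palette, language lines, etc.
        hours, minutes, seconds, millis = (int(g) for g in match.groups()[:4])
        start_ms = ((hours * 60 + minutes) * 60 + seconds) * 1000 + millis
        entries.append((start_ms, int(match.group(5), 16)))
    return entries


sample = "# VobSub index file\ntimestamp: 00:00:13:097, filepos: 000001000\n"
print(parse_idx(sample))
```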
Old 31st October 2011, 19:43   #30  |  Link
xekon
Registered User
 
Join Date: Jul 2011
Posts: 224
Nikse555, please let me know what you think of my idea. If it's not something you're interested in, then I will try adding it myself. I just noticed Subtitle Edit is open source.

Could I please have a copy of the source code that is as current as: http://www.nikse.dk/SubtitleEdit.zip

the one on code.google.com is October 14.

Last edited by xekon; 31st October 2011 at 20:03.
Old 31st October 2011, 19:47   #31  |  Link
xekon
Registered User
 
Join Date: Jul 2011
Posts: 224
Quote:
Originally Posted by Chetwood View Post
Looks impressive but I think it's overkill. Why not simply have a small window showing the item and an editable text window below that shows the OCRed text? In case they don't match, simply alter the text and move on to the next item.
Subtitle Edit has very accurate OCR results. There are usually only 1-3 wrong subs out of 300 lines, which is quite impressive. So generally you won't need to do much editing, only verifying. The method I posted is the quickest way I can think of to scan through entire sub files after the OCR and visually verify them.
Old 31st October 2011, 20:49   #32  |  Link
Nikse555
Registered User
 
Join Date: Feb 2004
Location: Mars
Posts: 428
Hi Anakunda!

Quote:
Originally Posted by Anakunda View Post
...
In the pattern comparison mode, the engine totally ignores differences between the letters 'i' and 'l', and between 'c', 'o' and 'e'. All the letters in such a group are assigned the character that was given to the first occurrence of one of the letters from the "same" group. For example, the 1st subtitle contains the word "more"; the wizard stops at o and I assign it o. When it reaches e, it does not ask again for a letter, even though that's the 1st "e" in the subtitles, and assigns it 'o' automatically. That's very bad.
Yes, this is true. I've tried to improve it a bit here: http://www.nikse.dk/SubtitleEdit.zip
A work-around is to right-click on the offending line in the list view, and choose "Inspect compare matches for current image" - here you can choose "Add better match" to correct mistakes.
(my image compare code is a bit slow for blu-ray images...)
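Why a loose image compare confuses 'o' and 'e' can be shown with a toy example. This is only a sketch of the general idea in Python, not Subtitle Edit's actual compare code; `count_differences`, `best_match`, the 5x5 glyphs and the `max_diff` tolerance are all made up for illustration:

```python
def count_differences(a, b):
    """Count pixels that differ between two equally sized bitmaps."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("bitmaps must have the same dimensions")
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


def best_match(glyph, known, max_diff):
    """Return the known character closest to glyph, or None if none is
    within max_diff differing pixels (the user would then be asked)."""
    best_char, best_diff = None, max_diff + 1
    for char, bitmap in known.items():
        diff = count_differences(glyph, bitmap)
        if diff < best_diff:
            best_char, best_diff = char, diff
    return best_char


GLYPH_O = (".###.", "#...#", "#...#", "#...#", ".###.")
GLYPH_E = (".###.", "#...#", "#####", "#....", ".###.")

known = {"o": GLYPH_O}  # only 'o' has been taught so far
print(best_match(GLYPH_E, known, max_diff=4))  # tolerance too loose
print(best_match(GLYPH_E, known, max_diff=2))  # stricter tolerance
```

With the loose tolerance the unseen 'e' is silently accepted as 'o'; with the stricter one nothing matches, so the user gets asked. Adding a better match, as in the work-around above, amounts to putting the real 'e' bitmap into `known` so it wins the comparison.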


Quote:
Originally Posted by Anakunda View Post
Tesseract seems to work better but has considerable flaws too:
Some characters are automatically uppercased even if they are lowercase in the source matrix; this especially concerns 's', 'z', 'c' and 'a'. All occurrences of these letters seem to be uppercased regardless of their case in the original matrix if they stand alone or are the 1st letter of a word. All s, z, c and a's are kept lowercase in the middle of a word.
Is this still the case in latest version?
If yes, could you provide a test file + a few line numbers?
Old 1st November 2011, 08:53   #33  |  Link
Nikse555
Registered User
 
Join Date: Feb 2004
Location: Mars
Posts: 428
Quote:
Originally Posted by xekon View Post
...
call it a visual tool for super fast comparison. (OCR can only get so good, and if you want to verify perfect subs, this is a good way to do it.)
...
Another way to proofread would be to right-click in the list view and choose "Save all images with html index...". This displays a web page with all images + OCR'ed text if available. In the latest version, this also shows the text with a background color.
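The same kind of page is easy to build by hand from exported images. A minimal Python sketch of the idea, not the code Subtitle Edit uses; `build_index` and the file names are hypothetical:

```python
import html


def build_index(entries):
    """Build an HTML page showing each image with its OCR'ed text below.

    entries: iterable of (image_file_name, ocr_text) pairs.
    """
    blocks = []
    for image_name, text in entries:
        blocks.append(
            '<div><img src="{}" alt=""><br>{}</div>'.format(
                html.escape(image_name, quote=True),
                html.escape(text),  # keep <i> tags etc. visible as text
            )
        )
    return (
        "<!DOCTYPE html>\n<html><body>\n"
        + "\n".join(blocks)
        + "\n</body></html>\n"
    )


page = build_index([("0001.png", "<i>I'm sorry!</i>"), ("0002.png", "Hello")])
print(page)
```

Writing `page` to an .html file next to the images gives a single scrollable view for checking each line against its picture.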
Old 2nd November 2011, 03:03   #34  |  Link
MajorX
Registered User
 
Join Date: Mar 2010
Posts: 52
Hi Nikse555
Could you please check this *.SUP file? I get only strange symbols with OCR.

http://www.mediafire.com/?aoy66c5ue9mbah9
Old 2nd November 2011, 03:08   #35  |  Link
xekon
Registered User
 
Join Date: Jul 2011
Posts: 224
MajorX I tried your file with Nikse555's latest version here: http://www.nikse.dk/SubtitleEdit.zip

I also got lots of symbols if I had "Try MS MODI OCR for unknown words" unchecked.

but if you use the MS MODI OCR it detects all of them just fine

give it a shot.

PS: I wonder if that subtitle file has ever had its resolution resized.... the letters are really bad quality.

Last edited by xekon; 2nd November 2011 at 03:33.
Old 2nd November 2011, 07:36   #36  |  Link
MajorX
Registered User
 
Join Date: Mar 2010
Posts: 52
I tried with this version but I can't enable MS MODI OCR... Can you tell me how to do this?
Old 2nd November 2011, 10:50   #37  |  Link
kypec
User of free A/V tools
 
 
Join Date: Jul 2006
Location: SK
Posts: 826
Quote:
Originally Posted by MajorX View Post
I tried with this version but I can't enable MS MODI OCR... Can you tell me how to do this?
You must have some Microsoft Office libraries installed for this to work IIRC...
Old 2nd November 2011, 22:30   #38  |  Link
Nikse555
Registered User
 
Join Date: Feb 2004
Location: Mars
Posts: 428
Quote:
Originally Posted by MajorX View Post
Hi Nikse555
Could you please check this *.SUP file? I get only strange symbols with OCR.
Thx for the file
This font doesn't look Blu-ray-like but seems clear enough. Resizing did not help, but changing the font color to white seems to help, so this is included in the latest version, which should handle your sup better: http://www.nikse.dk/SubtitleEdit.zip
Old 3rd November 2011, 06:06   #39  |  Link
MajorX
Registered User
 
Join Date: Mar 2010
Posts: 52
Quote:
Originally Posted by Nikse555 View Post
Thx for the file
This font doesn't look Blu-ray-like but seems clear enough. Resizing did not help, but changing the font color to white seems to help, so this is included in the latest version, which should handle your sup better: http://www.nikse.dk/SubtitleEdit.zip
Thanks Nikse555....working perfectly.
Old 13th January 2012, 15:17   #40  |  Link
Nikse555
Registered User
 
Join Date: Feb 2004
Location: Mars
Posts: 428
Subtitle Edit 3.2.3 is now finally out with lots of minor improvements and fixes!

Change log
New: Added Brazilian Portuguese - thx XXXXXXXXXX
New: Added Italian language file - thx Maff
New: Added Portuguese (Portugal) language file - thx Ricardo Perdigão
New: Added Japanese language file - thx Nardog
New: Added Spanish language file - thx m2s
New: Support for subtitle format AvidCaption - thx Laszlo
New: Support for F4 subtitle formats - thx Fred
New: Export to Blu-ray sup format
Improved: Updated Tesseract to 3.01. Now includes (some) italic detection + adds support for Arabic, Hebrew, Hindi and Thai
Improved: Undo improved so it also works for textbox + redo (Ctrl+Y)
Improved: Many new configurable shortcuts (e.g. for fullscreen video player)
Improved: OCR tweaked a bit + BluRay sup files are processed faster
Improved: TextBox with current subtitle now shows cursor position - thx Leszek
Improved: Subtitle format PAC much improved - thx Peter
Improved: Subtitle format FCP Xml improved - thx Ulrik
Improved: Subtitle format D-Cinema improved - thx Karam
Improved: Splitting of lines - Thx Trottel
Improved: Auto break lines - thx Majid
Improved: Some fixes for Fix common errors/Remove text for HI - thx Majid
Improved: Optimized Fix Common Errors
Improved: DirectShow can now also play audio-only files
Fixed: Crash when setting Options - thx karmazyn
Fixed: Crash in set color (or set font) - thx LEO33
Fixed: Crash/freeze when loading large subtitle files - thx Ulrik
Fixed: Bug when clicking in list view while running ocr - thx sialivi
Fixed: De-selecting text in textbox via single click - thx XhmikosR
Fixed: Possible crash in spell check + German dictionary should work
Fixed: Missing save/load of a fix common errors setting - thx menes
Fixed: Removed Microsoft translate as it's useless with new quotas
Fixed: Milliseconds in timed text - thx Calle
Fixed: Names with spaces now work in spell check - thx Dr. jackson
Fixed: Do not use frame rate if it's zero (audio files) - thx dixie.fever
Fixed: Possible crash when saving xml files - thx Peter

http://code.google.com/p/subtitleedit/downloads/list

Last edited by Nikse555; 13th January 2012 at 15:18. Reason: forgot link