Welcome to Doom9's Forum, THE in-place to be for everyone interested in DVD conversion.

Old 3rd May 2020, 21:20   #921  |  Link
Nikse555
Registered User
 
Join Date: Feb 2004
Location: Mars
Posts: 428
@GCRaistlin:
>1. After loading the OCR results to the main window SE behaves as if an unchanged SRT file is open
I cannot reproduce this. Which file format is it, and how exactly do you open the file?

>3. Open a BD SUP...
Yeah, that's a feature. Do any real work and SE will prompt.

>3. The ability to load DVD SUP files from UI...
That works here in main window via File - Open... or drag-n-drop. How/where does it not work?
Old 3rd May 2020, 21:40   #922  |  Link
GCRaistlin
Registered User
 
Join Date: Jun 2006
Posts: 350
Janusz
Thanks again. I saw it but didn't think it was what I was looking for. Just wondering why it has a different name here...
__________________
Windows 8.1 x64

Magically yours
Raistlin
Old 3rd May 2020, 22:48   #923  |  Link
Janusz
Registered User
 
Join Date: Apr 2020
Location: Poland
Posts: 143
@GCRaistlin

Quote:
Bugs:
1. After loading the OCR results to the main window SE behaves as if an unchanged SRT file is open: no asterisk in the window title, no prompt to save file on closing. If we won't save the file manually we'll lose our work.
1. I don't know how it is on your side.
For me, after loading the OCR subtitles, the title of the main window changes to: * D:\full path\filename.html\index.srt - SubtitleEdit...
If I then try to close the program, I get a prompt to save the new file. I don't know why it doesn't work for you.
After saving, the window title shows where index.srt was saved, without the *. If I change anything in the text, the title starts with * again.
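
For illustration, the title-bar behavior described here (an asterisk while there are unsaved changes, none right after saving) can be sketched as below. This is a minimal Python sketch, not Subtitle Edit's actual code: the class `SubtitleDocument`, its methods, and the hash-based comparison are all assumptions made for the example.

```python
import hashlib
from typing import Optional


class SubtitleDocument:
    """Hypothetical model of a document that knows whether it needs saving."""

    def __init__(self, text: str, file_name: str, loaded_from_text_file: bool):
        self.text = text
        self.file_name = file_name
        # Assumption: OCR output has never been written to disk as text, so it
        # has no saved baseline and counts as changed from the start.
        self._saved_hash: Optional[str] = (
            self._hash(text) if loaded_from_text_file else None
        )

    @staticmethod
    def _hash(text: str) -> str:
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    @property
    def has_unsaved_changes(self) -> bool:
        # Compare the current text with the last saved (or loaded) state.
        return self._saved_hash != self._hash(self.text)

    def mark_saved(self) -> None:
        self._saved_hash = self._hash(self.text)

    def window_title(self) -> str:
        prefix = "* " if self.has_unsaved_changes else ""
        return f"{prefix}{self.file_name} - SubtitleEdit"
```

With this model, the way the file was opened (UI or command line) does not matter as long as every path creates the document with `loaded_from_text_file=False` for OCR results.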
__________________
Sorry for my mistakes - I'm using a translator.

Last edited by Janusz; 3rd May 2020 at 23:09.
Old 3rd May 2020, 23:26   #924  |  Link
GCRaistlin
Registered User
 
Join Date: Jun 2006
Posts: 350
Janusz
I got it: it happens when the SUP file is opened by passing it as a command-line argument, not from the UI.
Old 4th May 2020, 12:37   #925  |  Link
Nikse555
Registered User
 
Join Date: Feb 2004
Location: Mars
Posts: 428
@GCRaistlin: Beta updated: https://github.com/SubtitleEdit/subt...leEditBeta.zip
Fixed DVD SUP from cmd line + ASS paste + change detection with file+OCR from cmd line + compare issue + image save as number + remember column paste options - hopefully some of that also works for you?
Old 4th May 2020, 17:33   #926  |  Link
Janusz
Registered User
 
Join Date: Apr 2020
Location: Poland
Posts: 143
@Nikse555

I don't know how understandable this machine-translated text will be, but I'll try.

Regarding the problem of OCR engines correctly distinguishing the lowercase "L" from the uppercase "I",
please briefly explain the principles Subtitle Edit follows for automatic correction during OCR.

Why do I ask?
  1. I start the OCR process with a new, completely empty character database.
    My settings: [OCR auto correction Dictionary] set to none, other options unchecked.
    No dictionaries in the Dictionaries folder. Options/Tools [Fix common OCR errors ...] unchecked.
    As a result, I get empty text with spaces recognized according to <No of pixels is space>.
    This is correct and as expected.
  2. I add the lowercase letter "L" to the character database, with options set as in (1).
    As a result, the lowercase "L" characters are recognized (I don't know whether all of them, but definitely some), and so is "I",
    which I don't have in the character database. Spaces between words are as in (1).
    I can accept that. At this stage I accept "I", because a standalone "I" does occur in Polish, so it is OK.
    But why does the program change the lowercase "L" to "I" at the beginning of words
    when it does not know these words yet? It looks like this: [I I**** *********].
    The first "I" is OK. The second may well be a lowercase "L".
    I consider this a serious mistake.
    Such a conversion should take place only after the entire word has been recognized and after checking
    in the selected dictionary that the conversion would not cause an error.
    From what I can see, the dictionary check is missing or not working as it should.
    I set aside the fact that the dictionary is currently off, because then every word would count as a mistake.
  3. Another OCR attempt.
    a. I remove the only character, the lowercase "L", from the character database. I check: the character database is empty. Start OCR: result as in (1), OK.
    b. Another OCR attempt. The character database is the same. However, the lowercase "L" removed earlier is still there. I do not know why.
    Did the program not save the changes permanently? The OCR result is as in (2), with all the errors ("I").
    I tried in various ways to get rid of the stubborn letter,
    but it seems a character cannot be deleted if it is the only character in the database.
    The change only applies in the current program session.
    After closing and restarting the program, the last letter returns to its previous state.
    I think this is a bug. Or maybe it is supposed to work like that? I do not know.
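
The check asked for in (2), converting between lowercase "L" and "I" only once the whole word is known and a dictionary confirms the result, could look roughly like this. It is a hedged Python sketch, not how Subtitle Edit works internally; `candidates`, `is_known`, `resolve_l_and_i` and the word set passed in are hypothetical.

```python
from itertools import product
from typing import Iterator, Optional, Set


def candidates(word: str) -> Iterator[str]:
    """Yield every spelling obtained by swapping 'l' and 'I' at any position."""
    options = ["lI" if ch in "lI" else ch for ch in word]
    for combo in product(*options):
        yield "".join(combo)


def is_known(word: str, dictionary: Set[str]) -> bool:
    """Accept the word as listed, or with a capitalized first letter lowered."""
    if word in dictionary:
        return True
    return word[:1].isupper() and word[:1].lower() + word[1:] in dictionary


def resolve_l_and_i(word: str, dictionary: Set[str]) -> Optional[str]:
    """Return the corrected word only if exactly one spelling is known.

    None means "unknown or ambiguous": the OCR result should be left alone
    instead of being changed on a guess.
    """
    matches = {c for c in candidates(word) if is_known(c, dictionary)}
    return matches.pop() if len(matches) == 1 else None
```

The number of candidates doubles with every "l" or "I" in the word, which is fine for ordinary subtitle words but would need a cap for very long tokens.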

Finally, a suggestion to consider for the near or distant future.
When automating the OCR process, allow a second dictionary to be set. Subtitles are usually a translation from one language into another.
However, proper names, first names, last names, etc. that have no equivalents in the target language are often left untranslated.
Using a second language alongside the main one would generate fewer errors and make the process more intelligent.
We don't always create or correct subtitles in just one language.
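
As a small illustration of the two-dictionary idea, the Python sketch below flags a word as unknown only when none of the active dictionaries contains it. The function name `unknown_words` and the word sets used with it are made up for the example; nothing here reflects an existing Subtitle Edit option.

```python
import re
from typing import List, Sequence, Set

WORD = re.compile(r"[^\W\d_]+")  # runs of letters, including non-ASCII ones


def unknown_words(line: str, dictionaries: Sequence[Set[str]]) -> List[str]:
    """Return the words of `line` that none of the dictionaries contains."""
    return [
        word
        for word in WORD.findall(line)
        if not any(word.lower() in d for d in dictionaries)
    ]
```

Called with only a Polish word set, a line such as "To jest John Smith" would report "John" and "Smith" as unknown; adding a second set that contains those names reports nothing.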

If the translation is not understandable enough, I am sorry.
I wanted to help you and myself.

Last edited by Janusz; 4th May 2020 at 19:19.
Old 4th May 2020, 20:52   #927  |  Link
tormento
Acid fr0g
 
Join Date: May 2002
Location: Italy
Posts: 2,542
Quote:
Originally Posted by Nikse555 View Post
@GCRaistlin: Beta updated
I really can't find a pixel spacing value that lets binary OCR handle the italic part of this sup. No problems at all with SupRip.

Can you please tell me whether you can, and which parameters you are using?
__________________
@turment on Telegram
Old 5th May 2020, 00:28   #928  |  Link
GCRaistlin
Registered User
 
Join Date: Jun 2006
Posts: 350
Quote:
Originally Posted by Nikse555 View Post
Fixed dvd sup from cmd line
Not sure what you mean. The other fixes are confirmed, thank you.

Quote:
Originally Posted by Nikse555 View Post
>3. Open a BD SUP...
Yeah, that's a feature. Do any real work and SE will prompt.
I did do real work, actually: I opened a BD SUP file, started OCR from the beginning, then aborted it at some point. There is recognized data, but if I press Cancel, SE discards it silently. However, if I do the same steps but start OCR from # 1000, SE asks me about discarding. Why is the behavior different?
My logic is pretty simple: if there is any recognized data (even just one char), SE should ask about discarding when the user presses Cancel or tries to close the window.

Quote:
Originally Posted by Nikse555 View Post
>3. The ability to load DVD SUP files from UI...
That works here in main window via File - Open... or drag-n-drop. How/where does it not work?
Yeah, this works. I should say though that it is completely unintuitive: we should use 'File - Import/OCR' for BD SUP files but 'File - Open' for DVD SUP files.

Bug: if we open a DVD SUP from the command line and then press Cancel in 'Import/OCR' dialog SE remains open.
Old 5th May 2020, 11:47   #929  |  Link
Nikse555
Registered User
 
Join Date: Feb 2004
Location: Mars
Posts: 428
@GCRaistlin:
"Fixed dvd sup from cmd line" is about the "*" in the title bar.

OK, the OCR window should now prompt to save changes if anything has been added.

'File - Import/OCR' for BD SUP will now also allow DVD SUP. I just always use File -> Open...

>Bug: if we open a DVD SUP from the command line and then press Cancel in 'Import/OCR' dialog SE remains open.
Thx, should now hopefully be fixed.

Latest beta updated: https://github.com/SubtitleEdit/subt...leEditBeta.zip



Quote:
Originally Posted by tormento View Post
I really can't find a pixel space to binary OCR the italic part of this sup. No problems at all with SupRip.

Can you please tell me if you can and what parameters are you using?
Sorry, SE cannot handle that file via "Binary image compare".
Old 5th May 2020, 12:05   #930  |  Link
tormento
Acid fr0g
 
Join Date: May 2002
Location: Italy
Posts: 2,542
Quote:
Originally Posted by Nikse555 View Post
Sorry, SE cannot handle that file via "Binary image compare".
SupRip tilts the OCR frames according to the characters' vertical lines.

Isn't it possible to implement something like that in SE?
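
To make the idea concrete: one way to "tilt" a frame is to shear the bitmap so that slanted strokes become upright, picking the slant that gives the sharpest column profile. The Python sketch below shows that generic technique only. It is not SupRip's code (its algorithm is not described in this thread), and `shear`, `column_sharpness` and `estimate_slant` are hypothetical names.

```python
from typing import List, Sequence

Bitmap = List[List[int]]  # rows of 0/1 pixels, row 0 is the top


def shear(bitmap: Bitmap, slant: float) -> Bitmap:
    """Shift each row so strokes leaning right by `slant` become upright.

    `slant` is the horizontal offset in pixels per pixel of height.
    """
    height, width = len(bitmap), len(bitmap[0])
    # Rows further from the bottom are moved further to the left.
    shifts = [-round(slant * (height - 1 - y)) for y in range(height)]
    base = min(shifts)
    new_width = width + max(shifts) - base
    result = [[0] * new_width for _ in range(height)]
    for y, row in enumerate(bitmap):
        offset = shifts[y] - base
        for x, pixel in enumerate(row):
            if pixel:
                result[y][x + offset] = 1
    return result


def column_sharpness(bitmap: Bitmap) -> int:
    """Sum of squared column ink counts; upright strokes score highest."""
    return sum(sum(column) ** 2 for column in zip(*bitmap))


def estimate_slant(bitmap: Bitmap, slants: Sequence[float]) -> float:
    """Pick the candidate slant whose sheared image has the sharpest columns."""
    return max(slants, key=lambda s: column_sharpness(shear(bitmap, s)))
```

After shearing with the estimated slant, the usual column-gap splitting could run on the upright image, which is where italic letters that overlap horizontally would otherwise be hard to separate.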
Old 5th May 2020, 13:47   #931  |  Link
tormento
Acid fr0g
 
Join Date: May 2002
Location: Italy
Posts: 2,542
@Nikse

This one too has problems with italics. Perhaps there was a regression at some point, because almost all the titles I run OCR on are "cursed".

Can you fix the binary compare engine with regard to italics? You say it's easily fixed with OCR error rules, but I can't find any that are universally suitable.
Old 5th May 2020, 15:50   #932  |  Link
GCRaistlin
Registered User
 
Join Date: Jun 2006
Posts: 350
Quote:
Originally Posted by Nikse555 View Post
File - Import/OCR' for BD SUP will now also allow dvd sup. I just always use File -> Open...
Maybe it's better, then, to remove the 'Import/OCR VobSub' and 'Import/OCR BluRay (.sup)' menu items, since such files can be opened via File - Open? If you think that's too much, then please at least rename 'Import/OCR BluRay (.sup)' to simply 'Import/OCR .sup subtitle file'.

Feature requests:
  1. Ability to set font properties for Spell Check dialog window.
  2. Ability to run spell checking from within the 'Import/OCR' window after performing OCR, with the possibility to select the subpic containing the currently unknown word. Why don't I perform spell checking during the OCR? Because I prefer adding a better match to using 'Fix common OCR errors' or spell check (it's better to prevent errors than to fix them). I consider spell check a tool that helps find errors that were missed first during the OCR and then during the visual check.

A lot of mistakes in word boundary detection could be avoided if the vertical lines of a character, if any, were taken into account in the first place. Therefore we need a new setting, 'No of pixels from/to vertical line is space', which takes precedence over the simple 'No of pixels is space'. Examples:
  1. "of July" in this subpic:

    is currently recognized as "ofJ uly", two errors at once (with a 'No of pixels is space' setting of 9, which seems to be the most reliable choice for all subpics in the sup file). The modified algorithm could process it as follows:
    For "f", 27 consecutive dots out of 47 total "vertical" dots share the same X position on the right boundary; hence, there is a vertical line on the right side.
    For "J", 46 consecutive dots out of 56 total "vertical" dots share the same X position on the left boundary; hence, there is a vertical line on the left side.
    There are 26 pixels between these two vertical lines, while there are only 6 pixels between the rightmost dot of "f" and the leftmost dot of "J".
    There are 10 pixels between "J" and "u".
  2. "If you" in this subpic:

    is currently recognized as "Ifyou" (with a 'No of pixels is space' setting of 9). The modified algorithm could process it as follows:
    The "f" is the same case as above.
    "y" has no vertical line on the left side, as there are not enough consecutive dots with the same X position on the left boundary.
    Hence, we count the pixels between the right vertical line of "f" and the leftmost dot of "y": 19 pixels.
With a 'No of pixels from/to vertical line is space' setting of 19 and a 'No of pixels is space' setting of 10, both problematic substrings above would be recognized properly.
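
The proposal can be sketched in Python as follows. This is only an illustration under stated assumptions, not Subtitle Edit code: the helper names, the text-art glyph format and the 0.5 run ratio are hypothetical. The post also does not say whether the comparisons are inclusive, so the sketch treats a vertical-line gap as a space when it is at least the threshold and an edge gap only when it is strictly greater, which is the reading under which the settings 19 and 10 give the results described.

```python
from typing import List, Optional, Tuple

Glyph = List[str]  # rows of '#' (ink) and '.' (background)


def boundary_line(glyph: Glyph, side: str, min_ratio: float = 0.5) -> Optional[int]:
    """Return the x of a vertical line on the 'left' or 'right' side, or None.

    A side has a vertical line when the longest run of consecutive rows that
    share the same boundary x covers at least min_ratio of the inked rows.
    """
    best_x, best_run = None, 0
    run_x, run_len = None, 0
    inked_rows = 0
    for row in glyph:
        if "#" not in row:
            run_x, run_len = None, 0
            continue
        inked_rows += 1
        x = row.index("#") if side == "left" else row.rindex("#")
        run_len = run_len + 1 if x == run_x else 1
        run_x = x
        if run_len > best_run:
            best_x, best_run = x, run_len
    if inked_rows and best_run / inked_rows >= min_ratio:
        return best_x
    return None


def measure_gap(left: Glyph, left_x0: int, right: Glyph, right_x0: int) -> Tuple[int, bool]:
    """Return (pixels between the reference points, whether a line was used).

    left_x0 and right_x0 are the x offsets of each glyph in the subpicture.
    """
    left_line = boundary_line(left, "right")
    right_line = boundary_line(right, "left")
    left_edge = max(row.rindex("#") for row in left if "#" in row)
    right_edge = min(row.index("#") for row in right if "#" in row)
    left_ref = left_x0 + (left_line if left_line is not None else left_edge)
    right_ref = right_x0 + (right_line if right_line is not None else right_edge)
    used_line = left_line is not None or right_line is not None
    return right_ref - left_ref - 1, used_line


def is_space(gap: int, used_line: bool, edge_threshold: int = 10,
             line_threshold: int = 19) -> bool:
    # Assumed semantics: inclusive for line gaps, strict for edge gaps.
    return gap >= line_threshold if used_line else gap > edge_threshold


# Tiny made-up glyphs: F has a stem on its right boundary, J on its left,
# Y has no vertical line at all.
F = ["..###", ".#...", ".#...", ".#...", ".#...", ".#...", ".#..."]
J = ["..#", "..#", "..#", "..#", "..#", "#.#", ".#."]
Y = ["#...#", ".#.#.", "..#..", ".#...", "#...."]
```

Feeding the numbers from the post into `is_space` gives a space for the 26-pixel line gap between "f" and "J", a space for the 19-pixel gap between "f" and "y", and no space for the 10-pixel edge gap between "J" and "u".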

Last edited by GCRaistlin; 5th May 2020 at 15:53.
Old 5th May 2020, 22:24   #933  |  Link
GCRaistlin
Registered User
 
Join Date: Jun 2006
Posts: 350
Feature request: use different fonts for list view and text boxes. For text boxes, Courier New is sometimes a better choice than Tahoma as it clearly shows the difference between a double quote and doubled apostrophe (OCR error). But list view looks ugly with Courier.
Old 6th May 2020, 01:22   #934  |  Link
GCRaistlin
Registered User
 
Join Date: Jun 2006
Posts: 350
There's some incompatibility with Ditto, the clipboard manager:
  1. Download and install Ditto.
  2. Import the .reg file:
    Code:
    Windows Registry Editor Version 5.00

    [HKEY_CURRENT_USER\Software\Ditto]
    "DittoHotKey"=dword:00000857
    It sets the Win+W keyboard shortcut to call Ditto.
  3. Run Ditto and copy some text to the clipboard.
  4. Open SE and place the cursor in the text box.
  5. Press Win+W, select the text in the list, and press Enter.
The text is pasted, but now you cannot edit the text in the text box (though you can still paste there).
Old 6th May 2020, 11:10   #935  |  Link
tormento
Acid fr0g
 
Join Date: May 2002
Location: Italy
Posts: 2,542
If possible, do not automatically select 'remove line break' when the next line starts with a capital letter.
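
A minimal Python sketch of the requested rule, assuming a two-line subtitle and a hypothetical helper name (`should_preselect_unbreak`); how Subtitle Edit currently decides this is not stated in the thread.

```python
def should_preselect_unbreak(text: str) -> bool:
    """Suggest removing the line break only for two-line subtitles whose
    second line does not start with a capital letter."""
    lines = text.splitlines()
    if len(lines) != 2:
        return False
    # Ignore leading spaces, dialogue dashes and quotes before the first letter.
    second = lines[1].lstrip(" -\"'")
    return bool(second) and not second[0].isupper()
```

A capital at the start of the second line often marks a new sentence, a name or a second speaker, which is presumably why pre-selecting the fix there is unwanted.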
Old 6th May 2020, 16:53   #936  |  Link
varekai
Registered User
 
Join Date: Jul 2006
Posts: 528
Just wanted to say thanks for Subtitle Edit, couldn't do without it!
I'm not much of a bug tester (there seem to be others...), as SE works perfectly for me!
So... I feel there's no need to change the GUI.
Old 6th May 2020, 16:56   #937  |  Link
varekai
Registered User
 
Join Date: Jul 2006
Posts: 528
@GCRaistlin
Please, consider another image hosting...
Old 6th May 2020, 18:30   #938  |  Link
GCRaistlin
Registered User
 
Join Date: Jun 2006
Posts: 350
varekai
You are welcome to offer another one if it is as handy as the current one is.
Old 6th May 2020, 19:03   #939  |  Link
varekai
Registered User
 
Join Date: Jul 2006
Posts: 528
This is how I would do it...
Old 6th May 2020, 19:57   #940  |  Link
Nikse555
Registered User
 
Join Date: Feb 2004
Location: Mars
Posts: 428
@GCRaistlin: Regarding Ditto, I have seen the error about 1 time in 100 pastes, and I'm afraid I have no idea what's wrong or how to fix it. Ideas/fixes are welcome.

@tormento: Writing a new image-to-letter-splitter and integrating it into SE is a lot of work - could easily take 14+ days full time.
Is SupRip open source?
Edit: Thx for the sample files with italic! Surely uses more tilt than I've seen before.
All times are GMT +1.