
 

18th May 2020, 03:04   #11
Janusz
Quote:
Originally Posted by Nikse555
@Janusz: I've fixed an issue related to your last post, but it's really hard to test without your exact setup/sup... could you make a .zip archive with all relevant files, if latest beta still has issues?

I use Windows 10, 64-bit. For this test I used Subtitle Edit 3.5.15 NEXT, beta 106, with nOCR.
For the purposes of the test I am not using pol_OCRFixReplaceList_User.xml.
Settings.xml, pol_OCRFixReplaceList, test.sup, test.db (incomplete) and test.nocr are in janusz.test.zip for download.
The images in this file come from various subtitles, hence the many duplicate characters, but this is not a problem.
The assumption is that every character, duplicate or not, should be read correctly.

I know that the sup file for this test was generated from images that contain errors, so the number of errors here
(5 in 4 lines) says nothing about the number of errors in a text of a thousand or more lines.



To begin with, an analysis of the text produced without a dictionary, that is, of how the program itself deals with OCR.

1. In lines 5, 6 and 7 we see "I", which we do not have in the character database. It could not have been added while creating
the database for this example, because the text does not contain "I" at all. So where does it come from? Suspicion falls on the program.
Browsing this forum (not everyone looks here) we find that the program can replace l with I at the beginning,
in the middle and, surprisingly, at the end of words written in lower case.
It also does this at the beginning of a paragraph or item, for example in line 5, where the dot does not mean the end of a sentence.
Lines 6 and 7 in the original texts were continuations of a sentence and should not have been changed.
I will add that the words "lub" (or) and "lecz" (but) are used often in Polish, so a normal, full-length text will contain many such mistakes.

I believe that a program function that is always active must not generate errors in any selected language, and especially not when no language is selected.
What have we gained? 3 errors instead of 0 (zero). With longer texts, the number of good replacements will always be smaller than
the number of errors, for a simple reason: statistically, "I" is less common than "l" at the beginning of words, and it certainly does not occur
in the middle or at the end of words written in lower case. Therefore I would prefer to correct only the errors that arise in the OCR process.
Why would I need extra ones?
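
To make the argument in item 1 concrete, here is a minimal sketch in Python. It is not Subtitle Edit's actual logic. Both function names and both rules are assumptions made up for illustration: a blanket rule that turns a line-initial "l" into "I", and a more conservative rule that touches an "l" only when the word is a lone "l" or is otherwise written entirely in upper case.

```python
import re


def naive_fix(line: str) -> str:
    """Hypothetical blanket rule: a lower-case 'l' at the start of a line becomes 'I'."""
    return re.sub(r"^l", "I", line)


def conservative_fix(line: str) -> str:
    """Hypothetical conservative rule: change 'l' to 'I' only when the word is a
    lone 'l' or when every other letter in the word is upper case."""

    def fix_word(match: "re.Match[str]") -> str:
        word = match.group(0)
        if word == "l":
            return "I"
        rest = word.replace("l", "")
        if "l" in word and rest.isupper():
            return word.replace("l", "I")
        return word

    return re.sub(r"\w+", fix_word, line)


# Sample lines invented for this sketch (not taken from test.sup).
lines = ["lub coś innego", "lecz nie teraz", "lNNY ŚWlAT"]

for line in lines:
    print(repr(line), "->", repr(naive_fix(line)), "|", repr(conservative_fix(line)))
```

With these invented lines, the blanket rule damages "lub" and "lecz", which were correct, while the conservative rule leaves them alone and still repairs the all-caps words.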

2. Line 8. There are 2 cases of joined words here: "chybajuż" and "przynajmniejpod".
I can fix them by reducing [No of pixels is space] to 3. I then get "przynajmniej pod", which is OK, and so is the rest of the text above.
The phrase "chybajuż" splits into the two words "chyba już" only at 2. At that value, however, OCR finds additional apostrophes,
which I had combined into a single ["] when I first created the character database.
The effect: line 8 is OK, but the text above falls apart. This [No of pixels is space] value is too small for the rest of the text,
hence my request in one of my previous posts for a separate space setting for italics.
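
The trade-off in item 2 can be sketched as follows. This is only an illustration, not how nOCR is implemented. It assumes that [No of pixels is space] means "a gap of at least this many pixels counts as a space". The glyph tuples, the gap values and the `italic_space_px` parameter are all hypothetical.

```python
from typing import List, Optional, Tuple

# (character, gap to the previous glyph in pixels, is_italic); hypothetical layout
Glyph = Tuple[str, int, bool]


def make_glyphs(text: str, gaps: List[int], italic: bool) -> List[Glyph]:
    return [(ch, gap, italic) for ch, gap in zip(text, gaps)]


def assemble_line(glyphs: List[Glyph], space_px: int,
                  italic_space_px: Optional[int] = None) -> str:
    """Join glyphs, inserting a space wherever the gap reaches the threshold.
    If italic_space_px is given, italic glyphs use it instead of space_px."""
    parts = []
    for i, (text, gap, italic) in enumerate(glyphs):
        use_italic = italic and italic_space_px is not None
        threshold = italic_space_px if use_italic else space_px
        if i > 0 and gap >= threshold:
            parts.append(" ")
        parts.append(text)
    return "".join(parts)


# Invented measurements: the italic words sit 2 px apart, and so do the two
# upright apostrophes that together should be read as one double quote.
italic = make_glyphs("chybajuż", [0, 1, 1, 1, 1, 2, 1, 1], True)
upright = make_glyphs("''Tak''", [0, 2, 1, 1, 1, 1, 2], False)

print(assemble_line(italic, 4))                      # words stay joined
print(assemble_line(italic, 2))                      # words are split
print(assemble_line(upright, 2))                     # apostrophe pairs are split too
print(assemble_line(upright, 4, italic_space_px=2))  # upright text stays intact
print(assemble_line(italic, 4, italic_space_px=2))   # italic words are split
```

With a single threshold, no value satisfies both lines of this invented data. With a separate italic threshold, both come out as intended.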

Interesting fact: selecting [Inspect nocr matches for current image ...] on line 8 displays the text correctly, with appropriate spacing, for [No of pixels is space] = 2. Pressing OK does not save any changes to the text, however, because this window is only for characters in the database. It would be great if pressing OK saved these changes to the text, at least until the italics problem is solved globally.

OKAY. To deal with line 8 I go back to the setting [4] and switch the dictionary to Polish.
The pol_OCRFixReplaceList.xml file already contains a <WordPart from="j" to=" j" /> line in the <PartialWords> section.
This is OK for "chybajuż", but let's see what happened with "przynajmniejpod".
Going by the comment on this section: the program added a space before "j" and found neither "przynajmnie"
nor "jpod" in the dictionary, because no such words exist in Polish. In my view the repair should stop at this stage and change nothing.
Why did the program split the word at "j" and also replace "p" with "j"? I could use <WordPart from="j" to="j " /> for this and similar expressions,
but in Polish at least, such a conversion will split one correct word into two other, also correct words, e.g. "najjaśniejszy" (brightest)
into "naj" (most) and "jaśniejszy" (brighter), so I can't use it. In addition, I would not see such a replacement on the [All fixes] list or on [Guesses used], unlike the replacement of "p" with "j". That replacement is visible and can be quickly corrected manually.
Bottom line: I am left correcting the "corrected" text again, as in item 1.
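
The behaviour I would expect from a <WordPart> rule can be sketched like this. It is a minimal Python illustration, not Subtitle Edit's code. The function name, the tiny word set and the acceptance rule ("accept a replacement only if every resulting word is in the dictionary, otherwise change nothing") are assumptions.

```python
from typing import Set


def apply_word_part(word: str, frm: str, to: str, dictionary: Set[str]) -> str:
    """Try the frm -> to replacement at each occurrence of frm, one at a time.
    Accept the first candidate whose words are all in the dictionary;
    if none qualifies, return the word unchanged."""
    if word in dictionary:
        return word  # already a known word, nothing to repair
    start = 0
    while True:
        i = word.find(frm, start)
        if i == -1:
            return word  # no acceptable candidate: leave the text alone
        candidate = word[:i] + to + word[i + len(frm):]
        parts = candidate.split()
        if parts and all(p in dictionary for p in parts):
            return candidate
        start = i + 1


# Tiny hypothetical dictionary; a real spelling dictionary is far larger.
polish = {"chyba", "już", "przynajmniej", "pod", "naj", "jaśniejszy"}

print(apply_word_part("chybajuż", "j", " j", polish))
print(apply_word_part("przynajmniejpod", "j", " j", polish))
print(apply_word_part("przynajmniejpod", "j", "j ", polish))
print(apply_word_part("najjaśniejszy", "j", "j ", polish))
```

Under this rule, from="j" to=" j" repairs "chybajuż" and leaves "przynajmniejpod" untouched instead of guessing. The last call shows the risk of from="j" to="j ": if the full form "najjaśniejszy" is missing from the dictionary, as it is in this toy set, the correct word is split into "naj jaśniejszy".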

I saw in some files, e.g. dan_OCRFixReplaceList.xml, in the part concerning splitting into two words, entries such as
<WordPart from="o" to="e" />. What is this supposed to do, if it is not just a simple conversion of "o" to "e"?

3. Now lines 1 to 4. They look flawless, and they are. Please press [New] to create a new character database
under any name other than "test", then press [Edit], [Import], select our database "test.nocr", and press [OK].
Return to OCR, select the first line and press [Start].
Result: during import we lost all characters created by combining 2 or 3 adjacent characters, i.e. [''] as ["], [o/o] as [%].
This is what it looks like: characters created by expanding over the adjacent character or characters
are, first, invisible in the database and, second, lost when importing into a new character database.

It's a lot, but I wanted to write more than just "not working".

Thank you for the dark background in [Set un-italic factor].
I had wanted to ask for this for a long time.
__________________
Sorry for my mistakes - I'm using a translator.

Last edited by Janusz; 18th May 2020 at 03:13.


All times are GMT +1.

