Welcome to Doom9's Forum, THE in-place to be for everyone interested in DVD conversion. Before you start posting please read the forum rules. By posting to this forum you agree to abide by the rules. |
18th May 2020, 03:04 | #981 | Link | |
Registered User
Join Date: Apr 2020
Location: Poland
Posts: 143
|
Quote:
For the purposes of this test I am not using pol_OCRFixReplaceList_User.xml. Settings.xml, pol_OCRFixReplaceList, test.sup, test.db (incomplete) and test.nocr are in janusz.test.zip for download. The images in this file come from various subtitles, hence the many duplicate characters, but this is not a problem. The assumption is that these or any other characters should be read correctly. I know that the sup file for this test was generated from images containing some errors, and that the number of errors here (5 in 4 lines) says nothing about the number of errors in a text of a thousand or more lines.
To begin with, an analysis of the text created without a dictionary, that is, of how the program itself deals with OCR.
1. In lines 5, 6 and 7 we see "I", which we do not have in the character database. It could not have been added to the database for this example, because the text does not contain "I" at all. So where does it come from? Suspicion falls on the program. Browsing this forum (not everyone looks here), we find out that the program can replace "l" with "I" at the beginning, in the middle and, surprisingly, at the end of words written in lower case, and also at the beginning of a paragraph or list item: see line 5, where the dot does not mean the end of a sentence. Lines 6 and 7 in the original texts were continuations of a sentence and should not be changed. I will add that the words "lub" (or) and "lecz" (but) are used often in Polish, so a normal, full text will contain many such mistakes. I believe that a program function which always runs must not generate errors for any selected language, and especially not when none is selected. What have we gained? 3 errors instead of 0 (zero). With longer texts, the number of good replacements will always be smaller than the number of errors, for a simple reason: statistically, "I" is less common than "l" at the beginning of words, and it certainly does not occur in the middle or at the end of words written in lowercase.
Therefore, I would prefer to correct only the errors that arise in the OCR process. Why would I need extra ones?
2. Line 8. There are two cases of joined words here: "chybajuż" and "przynajmniejpod". I can fix them by reducing [No of pixels is space] to 3. I then get "przynajmniej pod", which is OK, and so is the rest of the text above. The phrase "chybajuż" splits into the two words "chyba już" only at 2. However, OCR now finds additional apostrophes, which I had combined into one ["] when first creating the character base. The effect: line 8 is OK, but the text above falls apart. At that point the [No of pixels is space] parameter is too small, hence my request in one of the previous posts for a separate space setting for italics.
Interesting fact: selecting [Inspect nocr matches for current image ...] on line 8 displays the text correctly, with appropriate spacing, for [No of pixels is space] = 2. Pressing OK does not save any changes to the text, however, because this window is only for characters in the database. It would be great if pressing OK saved these changes to the text, at least until the italics problem is solved globally.
OK. To deal with line 8 I return to the setting [4] and switch the dictionary to Polish. The pol_OCRFixReplaceList.xml file already contains a <WordPart from = "j" to = " j" /> line in the <PartialWords> section. This is fine for "chybajuż", but let's see what happened with "przynajmniejpod". Judging by the comment on that section, the program added a space before "j" and found neither "przynajmnie" nor "jpod" in the dictionary; such words do not exist in Polish. In my view, the repair should stop at this stage and change nothing. Why did the program split the word at "j" and also replace "p" with "j"? I could use <WordPart from = "j" to = "j "> for this and similar expressions, but, at least in Polish, such a conversion would split one correct word into two other words that are also correct, e.g. "najjaśniejszy" (brightest) into "naj" (most) and "jaśniejszy" (brighter), so I can't use it. In addition, I would not see such a replacement in the [All fixes] list or in [Guesses used], unlike the replacement of "p" with "j", which is visible and can quickly be corrected manually. Bottom line: as in item 1, what the program "improved" has to be improved again.
In some files, e.g. dan_OCRFixReplaceList.xml, in the part concerning division into two words, I saw entries such as <WordPart from = "o" to = "e" />. What is this meant to do there, if it is not just a simple conversion of "o" to "e"?
3. Now lines 1 to 4. They look flawless, and they are. Please press [New] to create a new character base with any name other than "test", then press [Edit], [Import], select our base "test.nocr", and press [OK]. Return to OCR, go to the first line and press [Start]. Result: during the import we lost all characters that result from combining 2 or 3 adjacent characters, e.g. [''] read as ["], or [o/o] read as [%]. This is what it looks like. Characters created by extending a character with the adjacent character or characters are, firstly, invisible in the database and, secondly, lost when importing into a new character base.
It's a lot, but I wanted to write more than just "not working". Thank you for the dark background in [Set un-italic factor]. I had wanted to ask for this for a long time.
__________________
Sorry for my mistakes - I'm using a translator. Last edited by Janusz; 18th May 2020 at 03:13. |
|
18th May 2020, 12:52 | #982 | Link | |
Suspended for forum rule violations
Join Date: Jul 2006
Posts: 528
|
Quote:
Are you really that stupid? If you wanna post a link to your neverending images use imgur.com and point directly to the jpg https://i.imgur.com/7jpu1si.jpg or use imgur.html https://imgur.com/a/dOBSfAA Please stop using fastpic*ru it's awful!! Grr... |
|
18th May 2020, 14:50 | #983 | Link |
Registered User
Join Date: Feb 2004
Location: Mars
Posts: 428
|
@Janusz: Sorry, I've not really done any work with "nOcr (line ocr)"... I've mostly done stuff to improve "Binary image compare"
With latest beta ( https://github.com/SubtitleEdit/subt...leEditBeta.zip ) I get this result with your sup file: |
18th May 2020, 15:20 | #984 | Link |
Registered User
Join Date: Apr 2020
Location: Poland
Posts: 143
|
Thank you, Nikse555.
I also thought that nothing was happening with nOCR. Please look again at line 8 on your side and at my point 2 above. Why this split, and why is "p" converted to "j"? The nOCR method gave me the same result with the latest beta 119.
__________________
Sorry for my mistakes - I'm using a translator. Last edited by Janusz; 18th May 2020 at 15:36. |
18th May 2020, 16:35 | #986 | Link | |
Registered User
Join Date: Feb 2004
Location: Mars
Posts: 428
|
@Janusz: Beta updated: https://github.com/SubtitleEdit/subt...leEditBeta.zip
(also tried to fix the nOcr issues) Works best with pixels-is-space = 3 for me... Quote:
|
|
18th May 2020, 18:50 | #988 | Link |
Registered User
Join Date: Feb 2004
Location: Mars
Posts: 428
|
@Janusz: And now really fixed the italic-space-stuff in nOcr: https://github.com/SubtitleEdit/subt...leEditBeta.zip
|
18th May 2020, 21:07 | #989 | Link | |
Registered User
Join Date: Apr 2020
Location: Poland
Posts: 143
|
Quote:
The rules <WordPart from = "ą" to = "ą " /> and <WordPart from = "j" to = " j" /> split expressions consisting of two or even three joined words into single words. Together with the earlier fix for "l" and "I", a text of 1189 lines, almost half of it written in italics, is now read almost 100% correctly. Two mistakes remain to be corrected. If you add them to your replacements, the effectiveness will be 100%. Really good work, @Nikse555. Thank you again.
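The behaviour asked for earlier in the thread (split "chybajuż", but leave "przynajmniejpod" alone when neither half is a real word) can be sketched as a dictionary-guarded replacement. This is only an illustration in Python, not Subtitle Edit's actual code: the function name `apply_word_part`, the tiny `DEMO_WORDS` set and the "accept only if every piece is in the dictionary" policy are assumptions made for the example.

```python
# Hypothetical sketch: apply a WordPart-style rule only when the result
# consists entirely of dictionary words. Not Subtitle Edit's real logic.

# Tiny stand-in for a Polish spell-check dictionary (assumption).
DEMO_WORDS = {
    "chyba", "już", "przynajmniej", "pod",
    "najjaśniejszy", "naj", "jaśniejszy",
}


def apply_word_part(word, rule_from, rule_to, dictionary):
    """Try the replacement at each occurrence of rule_from, left to right.

    The first candidate whose pieces are all known words is returned.
    If the word is already valid, or no candidate qualifies, the word
    is returned unchanged.
    """
    if word.lower() in dictionary:
        return word  # already a correct word: never split it
    start = 0
    while True:
        idx = word.find(rule_from, start)
        if idx == -1:
            return word  # no acceptable split found: change nothing
        candidate = word[:idx] + rule_to + word[idx + len(rule_from):]
        parts = candidate.split()
        if parts and all(p.lower() in dictionary for p in parts):
            return candidate
        start = idx + 1


print(apply_word_part("chybajuż", "j", " j", DEMO_WORDS))
print(apply_word_part("przynajmniejpod", "j", " j", DEMO_WORDS))
print(apply_word_part("przynajmniejpod", "j", "j ", DEMO_WORDS))
print(apply_word_part("najjaśniejszy", "j", "j ", DEMO_WORDS))
```

With a guard like this, a rule such as from "j" to "j " would no longer break "najjaśniejszy", because a word that is already in the dictionary is never touched. Whether that trade-off suits other languages is an open question.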
__________________
Sorry for my mistakes - I'm using a translator. |
|
18th May 2020, 22:08 | #990 | Link |
Registered User
Join Date: Jun 2006
Posts: 353
|
Nikse555
The latest beta still allows adding an empty better multi match. Could you please allow selecting a character by right-clicking in the 'Inspect items' area of the 'Inspect compare matches for current image' window? I mean in addition to showing the context menu.
__________________
Windows 8.1 x64 Magically yours Raistlin |
19th May 2020, 07:05 | #991 | Link |
Suspended for forum rule violations
Join Date: Jul 2006
Posts: 528
|
Code:
This is the message that was sent from you:
***************
Who are you to speak to me this way? These pics aren't for you. Don't open them and relax if you never have heard about ad blockers.
***************
We post images to show and clarify the issues we have, to report bugs, and to get input/help from the forum. You make a clickable link, and the whole idea of that is to make someone click on it, right? Therefore you should spare the forum from ugly sites like fastpic*ru. Of course I have many layers of protection for my computer, including AV, firewall, ad and script blockers and whatnot. Not everyone in here has that protection. How hard can it be for you to understand that? Really? Do yourself and the forum a favour and use another image host; imgur*com is very good and easy to use and... it's ad-free (almost)! No nasty pics close to pron, no ads, no popup windows, etc. If you don't understand the difference, this is it:
Link: https://imgur.com/a/Lqb3rjn
Last edited by varekai; 19th May 2020 at 07:31. Reason: . |
19th May 2020, 11:33 | #992 | Link |
Suspended for forum rule violations
Join Date: Jul 2006
Posts: 528
|
@GCRaistlin
Code:
This is the message that was sent from you: *************** If you don't understand that other visitors aren't interested in this discussion it's your problem. Nobody else seems to care about Fastpic so don't try protecting those who don't need your protection. And don't bother to address me again on the forum, you won't get any answer. *************** |
19th May 2020, 12:39 | #994 | Link |
Registered User
Join Date: Jan 2014
Location: Poland
Posts: 64
|
https://i.imgur.com/oePIeyR.png
I did it in 10 minutes. http://www.mediafire.com/file/7daaf4...3_eng.srt/file |
19th May 2020, 16:45 | #995 | Link |
Acid fr0g
Join Date: May 2002
Location: Italy
Posts: 2,577
|
And you did it wrong.
"of" is italic, while in your OCR it is in normal style. I am finding issues, not establishing OCR time records. Would you please explain to me how line 1097 can contain the {\an8} marker? I never noticed that Subtitle Edit was capable of that.
__________________
@turment on Telegram Last edited by tormento; 19th May 2020 at 16:50. |
19th May 2020, 20:38 | #996 | Link | ||
Registered User
Join Date: Apr 2020
Location: Poland
Posts: 143
|
Quote:
If you change [No of pixels is space] to e.g. 8, you will get two words, "of" in italics and "Apollo" not in italics, and "</i>" will be inserted after "of"; but with such a small space value the remaining text will split up. As you can see, this functionality still needs to be refined. Quote:
Perhaps at this time other permanently embedded subtitles will appear at the bottom of the screen.
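The trade-off around [No of pixels is space] can be illustrated with a small Python sketch: a space is inserted whenever the horizontal gap between two neighbouring glyph groups reaches the threshold. Everything here is assumed for illustration (the function name `join_glyphs`, the `>=` comparison and the pixel coordinates are invented, not taken from Subtitle Edit or from the actual .sup file).

```python
# Hypothetical sketch of gap-based word splitting; coordinates are made up.

def join_glyphs(glyphs, pixels_is_space):
    """Join glyph groups into text.

    glyphs: list of (text, x_left, width) tuples, sorted by x_left.
    A space is inserted when the gap to the previous group is at least
    pixels_is_space pixels (the >= comparison is an assumption).
    """
    out = []
    prev_right = None
    for text, x_left, width in glyphs:
        if prev_right is not None and x_left - prev_right >= pixels_is_space:
            out.append(" ")
        out.append(text)
        prev_right = x_left + width
    return "".join(out)


# Invented layout: the italic "of" sits only 8 px before "Apollo",
# while the two halves of "mission" are 9 px apart.
LINE = [("of", 0, 30), ("Apollo", 38, 90), ("mis", 200, 40), ("sion", 249, 50)]

print(join_glyphs(LINE, 13))  # ofApollo mission
print(join_glyphs(LINE, 8))   # of Apollo mis sion
```

With a single threshold, no value separates "of" from "Apollo" without also breaking "mission" in this made-up layout, which is the reason a separate value for italic text was requested earlier in the thread.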
__________________
Sorry for my mistakes - I'm using a translator. Last edited by Janusz; 19th May 2020 at 21:06. |
||
20th May 2020, 12:20 | #998 | Link |
Registered User
Join Date: Feb 2004
Location: Mars
Posts: 428
|
@Melan: thx, I think SE works really well too. It's still nice with feedback and ideas as it might help with making SE even better.
@tormento: Ah, did you set the proper "italic factor"? Right-click in the list view and choose "Set un-italic factor" (I think it's called). [No of pixels is space] = 13 worked fine for me, I think. SE can detect top alignment from Blu-ray .sup files; this can be toggled via right-click on the image. I've also added an on-video preview for each image: press Ctrl+P to see the subtitle at actual screen size.
@GCRaistlin:
>The latest beta still allows to add an empty better multi match.
I think that "empty string" could be a valid text... perhaps a warning?
>Could you please allow selecting a character by a right click in 'Inspect items' area of 'Inspect compare matches for current image' window?
I don't follow... ?
Latest beta: https://github.com/SubtitleEdit/subt...leEditBeta.zip |
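For readers wondering how a {\an8} marker can end up in OCR output: a .sup image carries its position on screen, so a top-aligned subtitle can be recognised from that position. The Python sketch below shows one possible heuristic; the function name, its parameters and the "centre of the image lies in the top half of the screen" rule are assumptions for illustration, not a description of how Subtitle Edit decides.

```python
# Hypothetical heuristic: tag a subtitle as top-aligned when its image
# sits in the upper half of the video frame. Not Subtitle Edit's code.

TOP_ALIGN_TAG = "{\\an8}"  # ASS/SSA override tag for top-centre alignment


def add_top_align_tag(text, image_top, image_height, screen_height):
    """Prefix text with {\\an8} if the image centre is in the top half."""
    centre = image_top + image_height / 2
    if centre < screen_height / 2 and not text.startswith(TOP_ALIGN_TAG):
        return TOP_ALIGN_TAG + text
    return text


print(add_top_align_tag("Houston, we have a problem.", 40, 60, 1080))
print(add_top_align_tag("Houston, we have a problem.", 950, 60, 1080))
```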
20th May 2020, 14:02 | #999 | Link | |
Registered User
Join Date: Apr 2020
Location: Poland
Posts: 143
|
@Nikse555
Is there any point in continuing to report nOCR bugs in this forum, since no one here uses this method? As you wrote above, you recommend "Binary image compare", and nothing has happened with the nOCR project for a long time. I would just ask you to fix the crash that occurs right at the start of the nOCR process when "no dictionary" is selected. I get this error (latest beta 123 and several earlier ones) regardless of the program configuration. In the stable versions 3.5.14 and 3.5.15 this error does not occur. If you need any files, you can use those from 18/05/2020. Changing any option in the nOCR window other than [Dictionary = none] does not affect the error. Excerpt from error_log.txt: Quote:
__________________
Sorry for my mistakes - I'm using a translator. Last edited by Janusz; 20th May 2020 at 14:08. |
|
20th May 2020, 14:27 | #1000 | Link |
Registered User
Join Date: Feb 2004
Location: Mars
Posts: 428
|
@Janusz: Is the crash fixed in this beta?
https://github.com/SubtitleEdit/subt...leEditBeta.zip |