Old 18th May 2020, 03:04   #981  |  Link
Janusz
Registered User
 
Join Date: Apr 2020
Location: Poland
Posts: 143
Quote:
Originally Posted by Nikse555 View Post
@Janusz: I've fixed an issue related to your last post, but it's really hard to test without your exact setup/sup... could you make a .zip archive with all relevant files, if latest beta still has issues?
I use Windows 10, 64-bit. For this test I used Subtitle Edit 3.5.15 NEXT, beta 106, with nOCR.
For the purposes of the test I am not using pol_OCRFixReplaceList_User.xml.
Settings.xml, pol_OCRFixReplaceList, test.sup, test.db (incomplete) and test.nocr are in janusz.test.zip for download.
The images in this file come from various subtitles, hence the many duplicate characters, but this is not a problem.
The assumption is that these characters, like any others, should be read correctly.

I know that the sup file for this test was generated from images that contain some errors, so the number of errors
(5 in 4 lines) says nothing about the number of errors in a text of a thousand or more lines.



To begin with, an analysis of the text created without a dictionary, that is, how the program itself deals with OCR.

1. In lines 5, 6 and 7 we see "I", which we do not have in the character database. It could not have been added to the database for this example,
because the text does not contain "I" at all. So where does it come from? Suspicion falls on the program.
Browsing this forum (not everyone looks here), we find that the program can replace l with I: at the beginning,
in the middle and, surprisingly, at the end of words written in lower case.
It also does so at the beginning of a paragraph or list item, as in line 5, where the dot does not mark the end of a sentence.
Lines 6 and 7 were continuations of a sentence in the original text and should not have been changed.
I will add that the words "lub" (or) and "lecz" (but) are common in Polish, so a normal, full text will contain many such mistakes.

I believe that a program function that always runs must not generate errors for any selected language, and especially not when no language is selected.
What have we gained? 3 errors instead of 0 (zero). With longer texts, the number of good replacements will always be smaller than
the number of errors, for a simple reason: statistically, "I" is less common than "l" at the beginning of words, and certainly
in the middle or at the end of words written in lowercase. Therefore, I would prefer that only errors arising in the OCR process be corrected.
Why do I need the extra ones?
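
To illustrate the point above, here is a minimal Python sketch of a context-free rule of the kind described. It is not Subtitle Edit's actual code: the function name `naive_fix_l_to_i` and the rule itself are assumptions made purely for illustration. It shows how replacing a lowercase "l" at the start of a line damages lines that begin with common Polish words such as "lub" and "lecz".

```python
import re


def naive_fix_l_to_i(line: str) -> str:
    """Hypothetical rule: treat a lowercase 'l' that starts a line as a misread 'I'."""
    return re.sub(r"^l", "I", line)


# Lines that continue a sentence from the previous subtitle.
# They were read correctly by OCR and should stay as they are.
continuation_lines = ["lub jutro.", "lecz nie dziś."]

for original in continuation_lines:
    fixed = naive_fix_l_to_i(original)
    status = "unchanged" if fixed == original else "BROKEN"
    print(f"{original!r} -> {fixed!r} ({status})")
```

A rule like this has no way of knowing whether the line starts a new sentence, which is why it would need a language-aware check before it can be applied safely.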

2. Line 8. There are 2 cases of merged words here: "chybajuż" and "przynajmniejpod".
I can fix them by reducing [No of pixels is space] to 3. I then get "przynajmniej pod", which is OK, and the rest of the text above is unaffected.
The phrase "chybajuż" splits into the two words "chyba już" only at 2. However, OCR then finds additional apostrophes,
which I had combined into one character ["] when I first created the character database.
The effect: line 8 is OK, but the text above falls apart. The [No of pixels is space] value is then too small,
hence my request in one of the previous posts for a separate space setting for italics.

Interesting fact: selecting [Inspect nocr matches for current image ...] on line 8 displays the text correctly, with appropriate spacing, for [No of pixels is space] = 2. Pressing OK does not save any changes to the text, however, because this window is only for characters in the database. It would be great if pressing OK saved these changes to the text, at least until the italics problem is solved globally.

OK. To deal with line 8 I return to the setting [4] and switch the dictionary to Polish.
The pol_OCRFixReplaceList.xml file already contains a <WordPart from = "j" to = " j" /> line in the <PartialWords> section.
This is OK for "chybajuż", but let's see what happened with "przynajmniejpod".
Going by the comment on this section: the program added a space before "j" and found neither "przynajmnie"
nor "jpod" in the dictionary, since such words do not exist in Polish. In my view, the repair should stop at this stage and change nothing.
Why did the program split the word at "j" and also replace "p" with "j"? I could use <WordPart from = "j" to = "j " /> for this and similar expressions,
but, at least in Polish, such a conversion will split one correct word into two other correct words, e.g. "najjaśniejszy" (brightest)
into "naj" (most) and "jaśniejszy" (brighter), so I can't use it. In addition, I would not see such a replacement on the [All fixes] list or under [Guesses used], unlike the replacement of "p" with "j". That replacement is visible and can be quickly corrected manually.
Bottom line: once again I have to fix what was "improved", as in item 1.
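
The behaviour asked for above (apply a split only when every resulting piece is a real word, otherwise change nothing) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Subtitle Edit's implementation: `split_with_rule` and the toy `DICTIONARY` are hypothetical, and the rule arguments simply mirror the `from`/`to` attributes of a `<WordPart>` entry.

```python
# Toy dictionary, assumed for this example only.
DICTIONARY = {"chyba", "już", "przynajmniej", "pod",
              "najjaśniejszy", "naj", "jaśniejszy"}


def split_with_rule(word, part, replacement, dictionary):
    """Apply one WordPart-style rule at a single position.

    The result is kept only if every resulting piece is in the dictionary.
    A word that is already in the dictionary is never touched.
    """
    if word in dictionary:
        return word
    start = 0
    while True:
        idx = word.find(part, start)
        if idx == -1:
            return word  # no valid split found: change nothing
        candidate = word[:idx] + replacement + word[idx + len(part):]
        pieces = candidate.split()
        if pieces and all(piece in dictionary for piece in pieces):
            return candidate
        start = idx + 1


# Rule from = "j", to = " j"
print(split_with_rule("chybajuż", "j", " j", DICTIONARY))
print(split_with_rule("przynajmniejpod", "j", " j", DICTIONARY))
# Rule from = "j", to = "j "
print(split_with_rule("przynajmniejpod", "j", "j ", DICTIONARY))
print(split_with_rule("najjaśniejszy", "j", "j ", DICTIONARY))
```

In this sketch the check `word in dictionary` is what protects "najjaśniejszy" from being split by the `"j "` rule. Whether the real fix list performs such a check is not stated in this thread.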

In some files, e.g. dan_OCRFixReplaceList.xml, in the part concerning splitting into two words, I saw entries such as
<WordPart from = "o" to = "e" />. What is this for, if it is not just a simple conversion of "o" to "e"?

3. Now lines 1 to 4. They look flawless, and they are. Please press [New] to create a new character database
with any name other than "test", then press [Edit], [Import], select our database "test.nocr", and press [OK].
We return to OCR, select the first line and press [Start].
Result: during import we lost all characters formed by combining 2 or 3 adjacent characters, i.e. [''] as ["] and [o/o] as [%].
This is what it looks like. Characters created by extending the selection to the adjacent character or characters
are, first, invisible in the database and, second, lost when importing into a new character database.

It's a lot, but I wanted to write more than just "not working".

Thank you for the dark background in [Set un-italic factor].
I had wanted to ask for this for a long time.
__________________
Sorry for my mistakes - I'm using a translator.

Last edited by Janusz; 18th May 2020 at 03:13.
Old 18th May 2020, 12:52   #982  |  Link
varekai
Registered User
 
Join Date: Jul 2006
Posts: 528
Quote:
Originally Posted by GCRaistlin View Post
Something went wrong. I performed all the actions above, then reran OCR from this line. SE didn't ask me anything, but the percent sign is missing in the recognized line:

UPD: It seems that I didn't enter "%" in the field. It would be worth checking that the field isn't empty...
WTF!! You are extremely obnoxious!
Are you really that stupid?
If you wanna post a link to your neverending images use imgur.com and point directly to the jpg
https://i.imgur.com/7jpu1si.jpg
or use imgur.html
https://imgur.com/a/dOBSfAA
Please stop using fastpic*ru it's awful!!
Grr...
Old 18th May 2020, 14:50   #983  |  Link
Nikse555
Registered User
 
Join Date: Feb 2004
Location: Mars
Posts: 428
@Janusz: Sorry, I've not really done any work with "nOcr (line ocr)"... I've mostly done stuff to improve "Binary image compare"
With latest beta ( https://github.com/SubtitleEdit/subt...leEditBeta.zip ) I get this result with your sup file:

Old 18th May 2020, 15:20   #984  |  Link
Janusz
Registered User
 
Join Date: Apr 2020
Location: Poland
Posts: 143
Thank you, Nikse555.
I also thought that nothing was happening with nOCR.
Please look again at line 8 on your side and at my point 2 above. Why is the word split, and why is "p" converted to "j"?

The nOCR method gave me the same result with the latest beta, 119.
__________________
Sorry for my mistakes - I'm using a translator.

Last edited by Janusz; 18th May 2020 at 15:36.
Old 18th May 2020, 15:48   #985  |  Link
Nikse555
Registered User
 
Join Date: Feb 2004
Location: Mars
Posts: 428
@Janusz: yes, thx
line 8 seems to be a bug - I'll look into it.
Old 18th May 2020, 16:35   #986  |  Link
Nikse555
Registered User
 
Join Date: Feb 2004
Location: Mars
Posts: 428
@Janusz: Beta updated: https://github.com/SubtitleEdit/subt...leEditBeta.zip
(also tried to fix the nOcr issues)
Works best with pixels-is-space = 3 for me...

Quote:
Originally Posted by GCRaistlin View Post
Bug(s):
  1. Follow the steps above but add a wrong match, for example "@".
  2. Start OCR from the same line, then interrupt it.
  3. Call 'Inspect compare matches' window.
  4. Delete the wrong match, add the right match, press OK.
  5. Start OCR from the same line again.
    You'll get 'VobSub - Manual image to text' window for the char you have just added a match for. And by the way the window title is incorrect - it's not the VobSub being recognized. But let's go further.
  6. Press Abort, try to add multi match again. You'll get 'Image already in db' error.
Thx, should also be fixed in above beta.
Old 18th May 2020, 16:36   #987  |  Link
jlw_4049
Registered User
 
Join Date: Sep 2018
Posts: 391
I will try next beta out

Old 18th May 2020, 18:50   #988  |  Link
Nikse555
Registered User
 
Join Date: Feb 2004
Location: Mars
Posts: 428
@Janusz: And now really fixed the italic-space-stuff in nOcr: https://github.com/SubtitleEdit/subt...leEditBeta.zip
Old 18th May 2020, 21:07   #989  |  Link
Janusz
Registered User
 
Join Date: Apr 2020
Location: Poland
Posts: 143
Quote:
Originally Posted by Nikse555 View Post
@Janusz: And now really fixed the italic-space-stuff in nOcr: https://github.com/SubtitleEdit/subt...leEditBeta.zip
It's perfect now. Two lines in pol_OCRFixReplaceList.xml

<WordPart from = "ą" to = "ą " />
<WordPart from = "j" to = " j" />

split expressions consisting of two or even three merged words into single words. With the earlier fix regarding "l" and "I", a text of 1189 lines, almost half of it in italics, is read almost 100% correctly. There are two mistakes left to fix. If you add them to your replacements, the accuracy will be 100%.
Really good work, @Nikse555. Thank you again.
__________________
Sorry for my mistakes - I'm using a translator.
Old 18th May 2020, 22:08   #990  |  Link
GCRaistlin
Registered User
 
Join Date: Jun 2006
Posts: 350
Nikse555
The latest beta still allows adding an empty better multi match.

Could you please allow selecting a character with a right click in the 'Inspect items' area of the 'Inspect compare matches for current image' window? I mean in addition to showing the context menu.
__________________
Windows 8.1 x64

Magically yours
Raistlin
Old 19th May 2020, 07:05   #991  |  Link
varekai
Registered User
 
Join Date: Jul 2006
Posts: 528
Code:
This is the message that was sent from you:
***************
Who are you to speak to me this way? 
These pics aren't for you. 
Don't open them and relax if you never have heard about ad blockers.
***************
This is an open forum, when you post something, everyone can read/see your post.
We post images to show and clarify the issues we have, to report bugs, and to get input/help from the forum.
You make a clickable link and the whole idea of that is to make someone click on it, right?
Therefore you should spare the forum from ugly sites like fastpic*ru.
Of course I have many layers of protection for my computer, including AV, Firewall, AD- and Script-blockers and what not.
Not all in here have that protection.
How hard can it be for you to understand that? Really?
Do yourself and the forum a favour and use another imagehost, imgur*com is very good and easy to use and... it's adfree (almost)!
No nasty pics close to pron, no ads, no popup windows etc etc...
If you don't understand the difference...
This is it:
Link:
https://imgur.com/a/Lqb3rjn
Image:

Last edited by varekai; 19th May 2020 at 07:31. Reason: .
Old 19th May 2020, 11:33   #992  |  Link
varekai
Registered User
 
Join Date: Jul 2006
Posts: 528
@GCRaistlin
Code:
This is the message that was sent from you:
***************
If you don't understand that other visitors aren't interested in this discussion it's your problem. 
Nobody else seems to care about Fastpic so don't try protecting those who don't need your protection. 
And don't bother to address me again on the forum, you won't get any answer.
***************
Old 19th May 2020, 11:54   #993  |  Link
tormento
Acid fr0g
 
Join Date: May 2002
Location: Italy
Posts: 2,542
Quote:
Originally Posted by Nikse555 View Post
Latest beta has new (and hopefully improved) detection of space between italic letters:
Enjoy with line 13 of this.
__________________
@turment on Telegram
Old 19th May 2020, 12:39   #994  |  Link
Melan
Registered User
 
Join Date: Jan 2014
Location: Poland
Posts: 64
https://i.imgur.com/oePIeyR.png

I did it in 10 minutes.
http://www.mediafire.com/file/7daaf4...3_eng.srt/file
Old 19th May 2020, 16:45   #995  |  Link
tormento
Acid fr0g
 
Join Date: May 2002
Location: Italy
Posts: 2,542
Quote:
Originally Posted by Melan View Post
I did it in 10 minutes.
And you did it wrong.

"of" is italic while in your OCR it is in normal style.

I am finding issues, not establishing OCR time records.

Would you please explain to me how line 1097 can contain the {\an8} marker?

I never noticed Subtitle Edit was capable of it.
__________________
@turment on Telegram

Last edited by tormento; 19th May 2020 at 16:50.
Old 19th May 2020, 20:38   #996  |  Link
Janusz
Registered User
 
Join Date: Apr 2020
Location: Poland
Posts: 143
Quote:
"of" is italic while in your OCR it is in normal style.
To make the text look good I set [No of pixels is space] = 12, which means that "of Apollo" is read as the single word "ofApollo", and as such the algorithm probably classified it as not italic. That is my guess; I don't know the algorithm. I do not know at what point it is split into two words, or on what basis. It probably happens after selecting the "English" dictionary and enabling [Fix OCR errors] and [Try to guess unknown words].
If you change [No of pixels is space] to e.g. 8, you will get 2 words, "of" in italics and "Apollo" not in italics, and "</i>" will be inserted after "of", but with such a small space value the remaining text will break apart.
As you can see, this functionality still needs to be refined.
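
The trade-off described above can be shown with a small Python sketch. It assumes, for illustration only, that word breaks are decided by comparing the pixel gap between neighbouring glyphs with a single threshold; the function `group_into_words` and all gap values are hypothetical and are not taken from Subtitle Edit.

```python
def group_into_words(glyphs, gaps, min_space_px):
    """Join glyphs into words.

    A gap of at least `min_space_px` pixels between two glyphs starts a new
    word (assumed rule). `gaps[i]` is the gap between glyphs[i] and glyphs[i + 1].
    """
    if not glyphs:
        return ""
    if len(gaps) != len(glyphs) - 1:
        raise ValueError("need exactly one gap between each pair of glyphs")
    words = [glyphs[0]]
    for glyph, gap in zip(glyphs[1:], gaps):
        if gap >= min_space_px:
            words.append(glyph)   # wide gap: start a new word
        else:
            words[-1] += glyph    # narrow gap: same word
    return " ".join(words)


# Hypothetical gaps: the italic "of" leans towards "Apollo", so the real
# space between them measures only 9 px.
italic_line = (list("ofApollo"), [2, 9, 3, 2, 2, 2, 2])
# Hypothetical upright text with one loosely spaced letter pair (8 px).
upright_line = (list("TV"), [8])

for threshold in (12, 8):
    print(threshold,
          group_into_words(*italic_line, threshold),
          "|",
          group_into_words(*upright_line, threshold))
```

With one global threshold, the value that separates the italic words also splits the loosely spaced upright text, which is the reason for wanting a separate space value for italics.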

Quote:
Would you please explain to me how line 1097 can contain the {\an8} marker?
I never noticed Subtitle Edit was capable of it.
For some time now, Subtitle Edit has added this tag for me when importing subtitles from ts files, for text placed at the top of the screen. I don't remember since which version.
Perhaps at those moments other, permanently embedded subtitles appear at the bottom of the screen.
__________________
Sorry for my mistakes - I'm using a translator.

Last edited by Janusz; 19th May 2020 at 21:06.
Old 19th May 2020, 20:51   #997  |  Link
Melan
Registered User
 
Join Date: Jan 2014
Location: Poland
Posts: 64
@tormento
Don't be a child. If you think that more than 2,000 lines will not contain errors, you are wrong.
SE works really well.
Old 20th May 2020, 12:20   #998  |  Link
Nikse555
Registered User
 
Join Date: Feb 2004
Location: Mars
Posts: 428
@Melan: thx, I think SE works really well too. It's still nice to get feedback and ideas, as they might help make SE even better.


@tormento: Ah, did you set the proper "italic factor"? Right click in the list view and choose "Set un-italic factor" (I think it's called). [No of pixels is space] = 13 worked fine for me, I think.
SE can detect top alignment from Blu-ray .sup files; it can be toggled via right click on the image... I've also added an on-video preview for each image: press Ctrl+P to see the subtitle at actual screen size.


@GCRaistlin
>The latest beta still allows to add an empty better multi match.
I think that "empty string" could be a valid text... perhaps a warning?

>Could you please allow selecting a character by a right click in 'Inspect items' area of 'Inspect compare matches for current image' window?
I don't follow... ?


Latest beta: https://github.com/SubtitleEdit/subt...leEditBeta.zip
Old 20th May 2020, 14:02   #999  |  Link
Janusz
Registered User
 
Join Date: Apr 2020
Location: Poland
Posts: 143
@Nikse555

Does it make sense to keep reporting nOCR bugs in this forum, since no one here uses this method?
As you wrote above, you recommend "Binary image compare", and nothing has happened with the nOCR project for a long time.
I would just ask you to fix the crash that occurs as soon as the nOCR process starts when "no dictionary" is selected.

I get this error (latest beta 123 and several earlier ones) regardless of the program's configuration.
In the stable versions 3.5.14 and 3.5.15 this error does not occur. If you need any files, you can use those from 18/05/2020.
Changing other options in the nOCR window does not affect the error; only [Dictionary = none] triggers it.



Excerpt from error_log.txt

Quote:
-----------------------------------------------------------------------------
Date: 05/19/2020 22:38:29
Message: Unable to load '' (also check libc.so.6 + libdl.so.2)
-----------------------------------------------------------------------------
Date: 05/19/2020 22:38:29
Message: Not all required methods was found in libvlc
-----------------------------------------------------------------------------
Date: 05/19/2020 22:52:21
Message: Unable to load '' (also check libc.so.6 + libdl.so.2)
-----------------------------------------------------------------------------
__________________
Sorry for my mistakes - I'm using a translator.

Last edited by Janusz; 20th May 2020 at 14:08.
Old 20th May 2020, 14:27   #1000  |  Link
Nikse555
Registered User
 
Join Date: Feb 2004
Location: Mars
Posts: 428
@Janusz: Is the crash fixed in this beta?
https://github.com/SubtitleEdit/subt...leEditBeta.zip