Subtitle Edit 4.0.4 - Page 50

Janusz · 18th May 2020, 03:04

Quote:

Originally Posted by Nikse555

@Janusz: I've fixed an issue related to your last post, but it's really hard to test without your exact setup/sup... could you make a .zip archive with all relevant files, if latest beta still has issues?

I use Windows 10, 64-bit. For this test I used Subtitle Edit 3.5.15 NEXT (beta 106) and nOCR.
For the purposes of the test I am not using pol_OCRFixReplaceList_User.xml.
Settings.xml, pol_OCRFixReplaceList, test.sup, test.db (incomplete) and test.nocr are in janusz.test.zip for download.
The images in this file come from various subtitles, hence the many duplicate characters, but this is not a problem.
Whichever characters occur, they are supposed to be read correctly. That is the assumption.

I know that the sup file for this test was generated from images that already contain some errors, and that the number of errors here
(5 in 4 lines) says nothing about the number of errors in a text of a thousand or more lines.

To begin with, an analysis of the text produced without a dictionary - that is, of how the program itself handles OCR.

1. In lines 5, 6 and 7 we see "I", which we do not have in the character database. It could not have been added to the database for this example
because the text does not contain "I" at all. So where does it come from? Suspicion falls on the program.
Browsing this forum (not everyone looks here), we find out that the program can replace "l" with "I": at the beginning,
in the middle and, surprisingly, at the end of words written in lower case.
It also does so at the beginning of a paragraph or sentence - for example line 5, where the dot in this case does not mean the end of a sentence.
Lines 6 and 7 in the original texts were continuations of a sentence and should not have been changed.
I will add that the words "lub" (or) and "lecz" (but) are used often in Polish, so a normal, full text will contain many such mistakes.

I believe that a program function that always runs must not generate errors for any selected language, and especially not when no language is selected.
What have we gained? 3 errors instead of 0 (zero). With longer texts, the number of good replacements will always be smaller than
the number of errors, for a simple reason: statistically, "I" is less common than "l" at the beginning of words, and certainly
in the middle or at the end of words written in lowercase. Therefore, I would prefer to correct only the errors that arise in the OCR process.
Why would I need extra ones?
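
To illustrate the point about correcting only genuine OCR errors, here is a minimal Python sketch of one possible conservative rule. It is not Subtitle Edit's actual code; the function name and the rule itself are assumptions made for illustration. It only turns a capital "I" into "l" when the previous character is a lowercase letter, and it never rewrites "l" as "I", so words such as "lub" and "lecz" are left alone.

```python
def fix_capital_i_after_lowercase(text: str) -> str:
    """Replace 'I' with 'l' only when the previous character is a lowercase letter.

    Hypothetical rule for illustration; it never changes 'l' into 'I'.
    """
    out = []
    for idx, ch in enumerate(text):
        prev = text[idx - 1] if idx > 0 else ""
        if ch == "I" and prev.islower():
            out.append("l")  # e.g. "maIy" -> "maly"
        else:
            out.append(ch)   # everything else is kept untouched
    return "".join(out)


print(fix_capital_i_after_lowercase("lub lecz maIy"))
```

Such a rule still has blind spots (a name like "McIntosh" would be damaged), so it is only meant to show the one-directional idea, not a finished fix.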

2. Line 8. There are 2 cases of joined words here: "chybajuż" and "przynajmniejpod".
I can fix them by reducing [No of pixels is space] to 3. I then get "przynajmniej pod" - that's OK, and so is the rest of the text above.
The phrase "chybajuż" splits into the two words "chyba już" only at 2. However, OCR now finds additional apostrophes,
which I had combined into a single ["] when I first created the character database.
The effect: line 8 is OK, but the text above falls apart. The [No of pixels is space] value is then too small,
hence my request in one of the previous posts for a separate space setting for italics.
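
The trade-off described here can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the glyph tuples, the pixel values and the "gap >= threshold" comparison are invented, and the real nOCR code may measure gaps differently.

```python
from typing import List, Tuple

# (character, left x, right x) in pixels; a hypothetical, simplified glyph layout
Glyph = Tuple[str, int, int]


def join_glyphs(glyphs: List[Glyph], pixels_is_space: int) -> str:
    """Insert a space wherever the gap between neighbouring glyphs reaches the threshold."""
    text = ""
    prev_right = None
    for char, left, right in glyphs:
        if prev_right is not None and left - prev_right >= pixels_is_space:
            text += " "
        text += char
        prev_right = right
    return text


# Invented example: a 2-pixel gap inside a word, a 3-pixel gap between words.
line = [("a", 0, 5), ("b", 7, 12), ("c", 15, 20)]
print(join_glyphs(line, 4))  # abc    (words stay joined)
print(join_glyphs(line, 3))  # ab c   (word gap detected)
print(join_glyphs(line, 2))  # a b c  (the word itself falls apart)
```

With a single global threshold, any value small enough to separate tightly set italic words also splits normally spaced text, which is why a separate value for italics would help.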

Interesting fact: selecting [Inspect nocr matches for current image ...] on line 8 displays the text correctly, with appropriate spacing, for [No of pixels is space] = 2. Pressing OK does not save any changes to the text, however, because this window is only for characters in the database. It would be great if pressing OK saved these changes to the text, at least until the italics problem is solved globally.

OK. To deal with line 8 I return to the setting [4] and switch the dictionary to Polish.
The pol_OCRFixReplaceList.xml file already contains a <WordPart from = "j" to = " j" /> line in the <PartialWords> section
- this is OK for "chybajuż" - but let's see what happened with "przynajmniejpod".
Going by the comment in that section: the program added a space before "j" and found neither "przynajmnie"
nor "jpod" in the dictionary - such words do not exist in Polish. In my view, the fix routine should stop at this stage and change nothing.
Why did the program split the word at "j" and also replace "p" with "j"? I could use <WordPart from = "j" to = "j " /> for this and similar expressions,
but such a conversion, at least in Polish, would split one correct word into two other, also correct, words, e.g. "najjaśniejszy" (brightest)
into "naj" (most) and "jaśniejszy" (brighter), so I can't use it. In addition, I would not see such a replacement on the [All fixes] list or on [Guesses used], as opposed to substituting "p" for "j". That replacement is visible and can be quickly corrected manually.
Bottom line: once again I have to fix what was "fixed", as in item 1.
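
The behaviour asked for here (split only when every resulting piece is a real word, otherwise change nothing) can be sketched as follows. This is not Subtitle Edit's algorithm; the function name and the tiny hand-made dictionary are assumptions for illustration only.

```python
from typing import Set


def split_with_word_part(word: str, part_from: str, part_to: str, dictionary: Set[str]) -> str:
    """Try part_from -> part_to at each occurrence; accept only if all pieces are known words."""
    if word in dictionary:
        return word  # already valid, e.g. "najjaśniejszy": never split it
    start = 0
    while True:
        pos = word.find(part_from, start)
        if pos == -1:
            return word  # no acceptable split found: leave the word unchanged
        candidate = word[:pos] + part_to + word[pos + len(part_from):]
        pieces = candidate.split()
        if len(pieces) > 1 and all(piece in dictionary for piece in pieces):
            return candidate
        start = pos + 1


# Tiny hand-made word list, for illustration only.
words = {"chyba", "już", "przynajmniej", "pod", "najjaśniejszy", "naj", "jaśniejszy"}
print(split_with_word_part("chybajuż", "j", " j", words))         # chyba już
print(split_with_word_part("przynajmniejpod", "j", " j", words))  # przynajmniejpod (unchanged)
print(split_with_word_part("przynajmniejpod", "j", "j ", words))  # przynajmniej pod
print(split_with_word_part("najjaśniejszy", "j", "j ", words))    # najjaśniejszy (unchanged)
```

Checking the whole word against the dictionary first is what protects "najjaśniejszy" from being split into "naj" and "jaśniejszy".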

In some files, e.g. dan_OCRFixReplaceList.xml, in the part concerning splitting into two words, I saw entries such as
<WordPart from = "o" to = "e" />. What is this supposed to do, if not simply convert "o" to "e"?

3. Now lines 1 to 4. They look flawless, and they are. Please click [New] to create a new character database
with any name other than "test", then press [Edit], [Import], select our database "test.nocr", and press [OK].
Return to OCR, go to the first line and press [Start].
Result: during import we lost all characters made by combining 2 or 3 adjacent glyphs, i.e. [''] is ["], [o/o] is [%].
This is what it looks like. Characters added by expanding to the adjacent character or characters
are, first, invisible in the database and, second, lost when importing into a new character database.

It's a lot, but I wanted to write more than just "not working".

Thank you for the dark background in [Set un-italic factor].
I wanted to ask for this for a long time.

varekai · 18th May 2020, 12:52

Quote:

Originally Posted by GCRaistlin

Something went wrong. I performed all the actions above, then rerun OCR from this line - SE didn't ask me anything but the percent sign is missing in the recognized line:

UPD: It seems that I didn't enter "%" in the field. It would be worth checking that it isn't empty...

WTF!! You are extremely obnoxious!
Are you really that stupid?
If you wanna post a link to your neverending images use imgur.com and point directly to the jpg
https://i.imgur.com/7jpu1si.jpg
or use imgur.html
https://imgur.com/a/dOBSfAA
Please stop using fastpic*ru it's awful!!
Grr...

Nikse555 · 18th May 2020, 14:50

@Janusz: Sorry, I've not really done any work with "nOcr (line ocr)"... I've mostly done stuff to improve "Binary image compare"
With latest beta ( https://github.com/SubtitleEdit/subt...leEditBeta.zip ) I get this result with your sup file:

Janusz · 18th May 2020, 15:20

Thank you, Nikse555.
I also thought that nothing was happening with nOCR.
Please look again at line 8 on your side and at my remark 2 above. Why this split, and why is "p" converted to "j"?

The nOCR method gave me the same result with latest beta 119

Nikse555 · 18th May 2020, 15:48

@Janusz: yes, thx

line 8 seems to be a bug - I'll look into it.

Nikse555 · 18th May 2020, 16:35

@Janusz: Beta updated: https://github.com/SubtitleEdit/subt...leEditBeta.zip
(also tried to fix the nOcr issues)
Works best with pixels-is-space = 3 for me...

Quote:

Originally Posted by GCRaistlin

Bug(s):

Follow the steps above but add a wrong match, for example "@".
Start OCR from the same line, then interrupt it.
Call 'Inspect compare matches' window.
Delete the wrong match, add the right match, press OK.
Start OCR from the same line again.
You'll get 'VobSub - Manual image to text' window for the char you have just added a match for. And by the way the window title is incorrect - it's not the VobSub being recognized. But let's go further.
Press Abort, try to add multi match again. You'll get 'Image already in db' error.

Thx, should also be fixed in above beta.

jlw_4049 · 18th May 2020, 16:36

I will try next beta out

Sent from my Pixel 3a using Tapatalk

Nikse555 · 18th May 2020, 18:50

@Janusz: And now really fixed the italic-space-stuff in nOcr: https://github.com/SubtitleEdit/subt...leEditBeta.zip

Janusz · 18th May 2020, 21:07

Quote:

Originally Posted by Nikse555

@Janusz: And now really fixed the italic-space-stuff in nOcr: https://github.com/SubtitleEdit/subt...leEditBeta.zip

It's perfect now. Two lines in pol_OCRFixReplaceList.xml

<WordPart from = "ą" to = "ą " />
<WordPart from = "j" to = " j" />

divide expressions consisting of two or even three combined words into single words. With the earlier fix regarding "l" and "I", a text consisting of 1189 lines, of which almost half is written in italics, is read almost 100% correctly. There are two mistakes left to correct. If you add them to your replacements, the effectiveness will be 100%.
Really good work @ Nikse555. Thank you again.
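
For readers who want to inspect such rules outside Subtitle Edit, the two WordPart lines quoted above can be read with Python's standard XML parser. This is a small sketch: only the `<PartialWords>` wrapper mentioned earlier in the thread is assumed, the rest of the real file's structure is not shown here, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

# Only the two lines from the post, wrapped in the <PartialWords> section they belong to.
SNIPPET = """<PartialWords>
  <WordPart from = "ą" to = "ą " />
  <WordPart from = "j" to = " j" />
</PartialWords>"""


def read_word_parts(xml_text: str):
    """Return the (from, to) pairs of every WordPart element, in document order."""
    root = ET.fromstring(xml_text)
    return [(el.get("from"), el.get("to")) for el in root.iter("WordPart")]


for part_from, part_to in read_word_parts(SNIPPET):
    print(repr(part_from), "->", repr(part_to))
```

Printing with `repr` makes the leading or trailing space in the `to` attribute visible, which is the whole point of these two rules.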

GCRaistlin · 18th May 2020, 22:08

Nikse555
The latest beta still allows adding an empty better multi match.

Could you please allow selecting a character by a right click in 'Inspect items' area of 'Inspect compare matches for current image' window? I mean along with showing the context menu.

varekai · 19th May 2020, 07:05

Code:

This is the message that was sent from you:
***************
Who are you to speak to me this way? 
These pics aren't for you. 
Don't open them and relax if you never have heard about ad blockers.
***************

This is an open forum, when you post something, everyone can read/see your post.
We post images to show and clarify the issues we have, to report bugs, and to get input/help from the forum.
You make a clickable link and the whole idea of that is to make someone click on it, right?
Therefore you should spare the forum from ugly sites like fastpic*ru.
Of course I have many layers of protection for my computer, including AV, Firewall, AD- and Script-blockers and what not.
Not all in here have that protection.
How hard can it be for you to understand that? Really?
Do yourself and the forum a favour and use another imagehost, imgur*com is very good and easy to use and... it's adfree (almost)!
No nasty pics close to pron, no ads, no popup windows etc etc...
If you don't understand the difference...
This is it:
Link:
https://imgur.com/a/Lqb3rjn
Image:

varekai · 19th May 2020, 11:33

@GCRaistlin

Code:

This is the message that was sent from you:
***************
If you don't understand that other visitors aren't interested in this discussion it's your problem. 
Nobody else seems to care about Fastpic so don't try protecting those who don't need your protection. 
And don't bother to address me again on the forum, you won't get any answer.
***************

tormento · 19th May 2020, 11:54

Quote:

Originally Posted by Nikse555

Latest beta has new (and hopefully improved) detection of space between italic letters:

Enjoy with line 13 of this.

Melan · 19th May 2020, 12:39

https://i.imgur.com/oePIeyR.png

I did it in 10 minutes.
http://www.mediafire.com/file/7daaf4...3_eng.srt/file

tormento · 19th May 2020, 16:45

Quote:

Originally Posted by Melan

I did it in 10 minutes.

And you did it wrong.

"of" is italic while in your OCR it is in normal style.

I am finding issues, not establishing OCR time records.

Would you please explain to me how line 1097 can contain the {\an8} marker?

I never noticed Subtitle Edit was capable of it.

Janusz · 19th May 2020, 20:38

Quote:

"of" is italic while in your OCR it is in normal style.

To make the text look good, [No of pixels is space] = 12, and this means that "of Apollo" becomes the single word "ofApollo", which the algorithm probably classified as not italic. That is my guess - I don't know the algorithm. I do not know at what point it is split into two words, or under what conditions. Probably this happens after selecting the "English" dictionary and enabling [Fix OCR errors] and [Try to guess unknown words].
If you change [No of pixels is space] to e.g. 8, you will get 2 words, "of" in italics and "Apollo" not in italics, and "</i>" will be inserted after "of", but with such a small space value the remaining text will fall apart.
As you can see, this functionality still needs to be refined.

Quote:

Would you please explain to me how line 1097 can contain the {\an8} marker?
I never noticed Subtitle Edit was capable of it.

For some time now, Subtitle Edit has been adding this tag for me when importing subtitles from ts files, for text placed at the top of the screen. I don't remember since which version.
Perhaps at that moment other, permanently embedded subtitles appear at the bottom of the screen.
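
For context, {\an8} is an ASS-style alignment override in which the digit follows the numeric-keypad layout, so 8 means top centre. A minimal Python sketch for detecting and stripping such a tag from a subtitle line is shown below; the helper name is hypothetical and this is not how Subtitle Edit handles it internally.

```python
import re

# Matches alignment overrides {\an1} .. {\an9}
AN_TAG = re.compile(r"\{\\an([1-9])\}")


def split_alignment(line: str):
    """Return (alignment number or None, line text with the tag removed)."""
    match = AN_TAG.search(line)
    if match is None:
        return None, line
    return int(match.group(1)), AN_TAG.sub("", line)


print(split_alignment(r"{\an8}Text shown at the top"))
print(split_alignment("Text shown at the default position"))
```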

Melan · 19th May 2020, 20:51

@tormento
Don't be a child. If you think that more than 2,000 lines will not contain errors, you are wrong.
SE works really well.

Nikse555 · 20th May 2020, 12:20

@Melan: thx, I think SE works really well too. It's still nice with feedback and ideas as it might help with making SE even better.

@tormento: Ah, did you set the proper "italic factor"? Right click in the list view, and choose "Set un-italic" factor (I think it's called). [No of pixels is space] = 13 worked fine for me I think.
SE can detect top alignment from Blu-ray .sup files - it can be toggled via right click on the image... I've also added an on-video preview for each image - press Ctrl+P to see the subtitle at actual screen size.

@GCRaistlin:
>The latest beta still allows to add an empty better multi match.
I think that "empty string" could be a valid text... perhaps a warning?

>Could you please allow selecting a character by a right click in 'Inspect items' area of 'Inspect compare matches for current image' window?
I don't follow... ?

Latest beta: https://github.com/SubtitleEdit/subt...leEditBeta.zip

Janusz · 20th May 2020, 14:02

@Nikse555

Does it make sense to keep reporting nOCR bugs on this forum, since no one here uses this method?
As you wrote above, you recommend "Binary image compare", and nothing has happened with the nOCR project for a long time.
I would only ask you to fix the crash that occurs as soon as the nOCR process starts when "no dictionary" is selected.

I get this error (latest beta 123 and several earlier ones) regardless of the program's configuration.
The stable versions 3.5.14 and 3.5.15 do not have this error. If you need any files, you can use those from 18/05/2020.
Changing the other options in the nOCR window, apart from [Dictionary = none], does not affect the error.

Excerpt from error_log.txt

Quote:

-----------------------------------------------------------------------------
Date: 05/19/2020 22:38:29
Message: Unable to load '' (also check libc.so.6 + libdl.so.2)
-----------------------------------------------------------------------------
Date: 05/19/2020 22:38:29
Message: Not all required methods was found in libvlc
-----------------------------------------------------------------------------
Date: 05/19/2020 22:52:21
Message: Unable to load '' (also check libc.so.6 + libdl.so.2)
-----------------------------------------------------------------------------

Nikse555 · 20th May 2020, 14:27

@Janusz: Is the crash fixed in this beta?
https://github.com/SubtitleEdit/subt...leEditBeta.zip
