Subtitle Edit 4.0.4 - Page 51

tormento · 20th May 2020, 14:35

Quote:

Originally Posted by Nikse555

@tormento: Ah, did you set the proper "italic factor"? Right click in the list view, and choose "Set un-italic" factor (I think it's called).

What value did you put in the un-italic? Tried many and no luck.

Janusz · 20th May 2020, 14:35

Quote:

Originally Posted by Nikse555

@Janusz: Is the crash fixed in this beta?
https://github.com/SubtitleEdit/subt...leEditBeta.zip

Yes too.
But now I don't have the "error_log.txt" file, the warning window is the same as before.

tormento · 20th May 2020, 14:36

Quote:

Originally Posted by Melan

Don't be a child.

Oh.

My.

God.

41 posts and you sermonize.

Nikse555 · 20th May 2020, 17:46

@Janusz: OK, think I got the crash now: https://github.com/SubtitleEdit/subt...leEditBeta.zip

@tormento: Right click on image for line 3 in Apollo13 and choose "Set align angle" (previously "Set un-italic factor"). Looks like "0,21" is a good value. Does that work for you? (#pixels is space=15)

Janusz · 20th May 2020, 18:02

Quote:

Originally Posted by Nikse555

@Janusz: OK, think I got the crash now: https://github.com/SubtitleEdit/subt...leEditBeta.zip

Yes, it works, thank you.
I have a few more comments for this wonderful program that do not depend on the OCR method chosen, but first I need to prepare the appropriate files.

tormento · 20th May 2020, 18:46

Quote:

Originally Posted by Nikse555

Right click on image for line 3 in Apollo13 and choose "Set align angle" (previously "Set un-italic factor"). Looks like "0,21" is a good value. Does that work for you? (#pixels is space=15)

Unfortunately, even with fewer space pixels between letters, SE recognizes "of" outside italic and attaches it to "Apollo".

Nikse555 · 20th May 2020, 20:18

@tormento: thx

This should now work (did not work because it was half italic / half regular): https://github.com/SubtitleEdit/subt...leEditBeta.zip
But it was also working before for me... and for you too, if you had been using the dictionaries included with SE, like "eng_OCRFixReplaceList.xml". Why would you not use them?

Janusz · 21st May 2020, 14:17

Quote:

Originally Posted by Nikse555

@tormento: thx

This should now work (did not work because it was half italic / half regular)

Beta 129.
For this example I created a sup file from the text "the tragedy ofAlabama" where "the tragedy of" I marked italic.

I only installed the following dictionaries: French, German, Italian, English without additional OCRFixReplaceList.xml files.
For: French, German, Italian, English - the patch works ok. The text after OCR looks like this: "the tragedy of Alabama",
for: Polish and "none" like this: "the tragedy ofAlabama".
I did not check others, but I think the fix should work in all languages, because the word "ofAlabama" is not correct in any language, and a split like this can always occur at an italic/regular or regular/italic boundary, regardless of the language chosen or how many words the selected dictionary contains. An example from Polish: "fotografAdam" (photographer Adam).
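A minimal sketch of the idea in this post, assuming the OCR output marks italics with `<i>...</i>` tags: add a space wherever an italic run starts or ends directly between two word characters, without consulting any dictionary. The function name `split_at_italic_boundary` is hypothetical and this is not Subtitle Edit's actual code.

```python
import re

# Hypothetical helper (not Subtitle Edit's implementation).
# A closing or opening italic tag squeezed between two word characters
# is treated as a missing space.
_CLOSE_BETWEEN_WORDS = re.compile(r"(?<=\w)</i>(?=\w)")
_OPEN_BETWEEN_WORDS = re.compile(r"(?<=\w)<i>(?=\w)")


def split_at_italic_boundary(line: str) -> str:
    """Insert a space at italic/regular and regular/italic boundaries."""
    line = _CLOSE_BETWEEN_WORDS.sub("</i> ", line)
    line = _OPEN_BETWEEN_WORDS.sub(" <i>", line)
    return line


if __name__ == "__main__":
    print(split_at_italic_boundary("<i>the tragedy of</i>Alabama"))
    print(split_at_italic_boundary("<i>fotograf</i>Adam"))
```

A rule like this is language independent, but it would also split a single word that is deliberately half italic, so it would probably need to be optional.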

Nikse555 · 21st May 2020, 16:50

Quote:

Originally Posted by Janusz

Beta 129.
For this example I created a sup file from the text "the tragedy ofAlabama" where "the tragedy of" I marked italic.

I only installed the following dictionaries: French, German, Italian, English without additional OCRFixReplaceList.xml files.
For: French, German, Italian, English - the patch works ok. The text after OCR looks like this: "the tragedy of Alabama",
for: Polish and "none" like this: "the tragedy ofAlabama".
I did not check others, but I think the fix should work in all languages, because the word "ofAlabama" is not correct in any language, and a split like this can always occur at an italic/regular or regular/italic boundary, regardless of the language chosen or how many words the selected dictionary contains. An example from Polish: "fotografAdam" (photographer Adam).

Yes, the OCR process benefits from a good OCR fix replace list.
I've added a Polish one based on your input here: https://github.com/SubtitleEdit/subt...eplaceList.xml
Feel free to add to it

Janusz · 21st May 2020, 21:57

Quote:

Originally Posted by Nikse555

Yes, the OCR process benefits from a good OCR fix replace list.

Quote:

I had this line in <OCRFixReplaceList> so there had to be something else here.
In the first version of the file, the line contained only one phrase "photographerAdam".
"Adam" is only 4 letters; I thought maybe that was the reason?

I have created a new file. I added a few lines and longer words starting with "A".

OCR worked, but as you can see above, not quite.
The split happened, but not everywhere it should. Only line 2 is good.
I disabled the split after "f" in "pol_OCRFixReplaceList.xml". The effect of this is at the bottom.
The split is correct and is where it should be, including on line 1.

Conclusion: because such word combinations are rare, we have to decide for ourselves
whether and when to enable this option in our "OCRFixReplaceList.xml",
because it can do more harm than good.

If you have some spare time, you could look into the source: switching repeatedly between the installed dictionaries
and running OCR again with each new dictionary makes what now looks so nice at the bottom look like the top again.
Only starting OCR again restores order.
I know that nobody mixes dictionaries under normal use, but the problem exists.

tormento · 23rd May 2020, 10:16

Quote:

Originally Posted by Nikse555

But it was also working before for me... and for you too, if you had been using the dictionaries included with SE, like "eng_OCRFixReplaceList.xml". Why would you not use them?

It does work with OCR fix, not without.

I tend not to use it because I have trained the OCR so well that I can postprocess I/l after OCR and get the job done faster.

Perhaps you could implement an OCR fix that handles OCR errors only and is not word-dictionary aware.

Janusz · 23rd May 2020, 12:56

@Nikse555
I'd like to report a bug.
It has occurred since beta 119; beta 112 works fine.

The previously reported bug in beta 123 and later concerned a missing dictionary.
Because I rarely use "Prompt for unknown words", the option was not enabled and so was not tested.

In my previous post I used "Binary image compare", so I didn't notice this error.
Today I returned to nOCR. My settings: a dictionary must be selected for the function to work, so one is selected.
"Draw missing texts" - disabled so that the program does not call for every new unknown letter.
(Even for this function of the program it is worth using nOCR.)
"Prompt for unknown words" - enabled.
"Fix OCR errors" - disabled - OCR does not use user files.
"Try to guess unknown words" - does not matter with "Fix ..." = disabled. It doesn't work though it's turned on.

"Start OCR" processes the text until it encounters the first unknown word.
With a well-built character base, this will be a word that is not in the dictionary; otherwise, an unrecognized character in a word.
The process opens the "Spell check" window. Choosing "Skip one", "Skip all", or "Abort" brings up an error window:

Depending on what we choose, we either return to Windows (the program crashes) or return to the program.
I checked on various sup files, including those available from this forum.

@Tormento

Quote:

I don't usually use it because I trained OCR so well that I can postprocess Il after OCR and have a faster job.
Perhaps you could only implement the OCR patch with OCR errors and not recognize the word dictionary.

I do not use "Fix common OCR errors - also use hard-coded rules" because this option does more than its name suggests, especially when it comes to Il and iL. For this reason I do not use "I" in the character database. I definitely have fewer mistakes to correct, and at least I know which ones.
My assumption is that changes made for English should not affect other languages available in the program.

Nikse555 · 23rd May 2020, 20:33

@Janusz: thx for the crash info

Should hopefully be fixed here: https://github.com/SubtitleEdit/subt...leEditBeta.zip

Also, Ctrl+T in the OCR window will start some auto-training... probably not too useful, but it's a little fun to play with.

Janusz · 23rd May 2020, 21:55

Patch works, thank you.

Quote:

Originally Posted by Nikse555

Also, Ctrl+T in the OCR window will start some auto-training... probably not too useful, but it's a little fun to play with.

This function was and is also available under the right mouse button.
However, using it did not bring up any additional windows as it does now.
Something was happening in the background, the effects of this work could not be seen.
I noticed this window yesterday, but I didn't have time to check exactly what it was.
I am curious myself how this file will look.

Nikse555 · 24th May 2020, 08:58

Forgot... for training you need a .srt file with spaces around characters, like:

Code:

1
00:00:00,490 --> 00:00:02,350
a b c d e f g h i j k l m n o p q r s t u

2
00:00:02,530 --> 00:00:04,150
v w x y z

3
00:00:04,240 --> 00:00:06,240
0 1 2 3 4 5 6 7 8 9 , . ( ) [ ] ' " $ % ♫ ♪  &

4
00:00:06,510 --> 00:00:08,200
A B C D E F G H I J K L M N O P Q R S T U

5
00:00:08,320 --> 00:00:10,570
V W X Y Z

6
00:00:11,510 --> 00:00:13,510
: ; - ! ?

7
00:00:13,540 --> 00:00:15,540
é É Č Ę Ė Į Š Ū Ž č ę ė į š ų ž

8
00:00:15,560 --> 00:00:17,560
ß ü Ü æ ø å ä ö Æ Ø Å Ä Ö

9
00:00:17,584 --> 00:00:19,584
ff ft fi fj fl rf rt rv rw ry rt ryt tt TV tw yt yw
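A training file in this shape can also be generated rather than typed by hand. The sketch below is an assumption about a convenient way to do it, not part of Subtitle Edit: `build_training_srt`, the 2-second duration, and the 100 ms gap are all made up for illustration. Each group is either a string (split into single characters) or a list of multi-character tokens such as `["ff", "fi"]`.

```python
def format_timestamp(total_ms: int) -> str:
    """Format milliseconds as an SRT timestamp, HH:MM:SS,mmm."""
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    seconds, millis = divmod(rem, 1000)
    return f"{hours:02}:{minutes:02}:{seconds:02},{millis:03}"


def build_training_srt(groups, duration_ms=2000, gap_ms=100):
    """Build SRT text with one cue per group, tokens separated by spaces.

    Timing values are arbitrary placeholders (an assumption); the post only
    requires spaces around the characters.
    """
    blocks = []
    start = 0
    for index, tokens in enumerate(groups, start=1):
        end = start + duration_ms
        text = " ".join(tokens)  # a str yields single characters
        blocks.append(
            f"{index}\n{format_timestamp(start)} --> {format_timestamp(end)}\n{text}\n"
        )
        start = end + gap_ms
    return "\n".join(blocks)


if __name__ == "__main__":
    groups = ["abcdefghijklmnopqrstu", "vwxyz", ["ff", "ft", "fi", "fj", "fl"]]
    print(build_training_srt(groups))
```

Passing ligature-like pairs as a list keeps them together as one token, which matches the last cue of the example file.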

Also, the unattended OCR alarm (taskbar blink/beep) is now customizable via these settings in Settings.xml (in the latest beta: https://github.com/SubtitleEdit/subt...leEditBeta.zip ):
<UnfocusedAttentionBlinkCount>50</UnfocusedAttentionBlinkCount>
<UnfocusedAttentionPlaySoundCount>2</UnfocusedAttentionPlaySoundCount>
<UnfocusedAttentionPlaySoundEvery>2</UnfocusedAttentionPlaySoundEvery>
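For anyone who prefers scripting the change over hand-editing, here is a rough sketch using Python's standard `xml.etree.ElementTree`. Only the three element names come from the post; the helper name `set_settings` and the surrounding `<Settings><Tools>` structure in the demo are assumptions, since the actual layout of Settings.xml is not shown here. The lookup searches by tag name anywhere in the tree, so it does not depend on that layout.

```python
import xml.etree.ElementTree as ET


def set_settings(xml_text: str, values: dict) -> str:
    """Return xml_text with the first element of each given tag set to a new value."""
    root = ET.fromstring(xml_text)
    for tag, value in values.items():
        element = next(root.iter(tag), None)  # search the whole tree by tag name
        if element is None:
            raise KeyError(f"setting not found: {tag}")
        element.text = str(value)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # The wrapper elements below are placeholders, not the real file layout.
    sample = (
        "<Settings><Tools>"
        "<UnfocusedAttentionBlinkCount>50</UnfocusedAttentionBlinkCount>"
        "<UnfocusedAttentionPlaySoundCount>2</UnfocusedAttentionPlaySoundCount>"
        "</Tools></Settings>"
    )
    print(set_settings(sample, {"UnfocusedAttentionBlinkCount": 10}))
```

It is probably safest to keep a backup of Settings.xml and to edit it while Subtitle Edit is closed.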

Janusz · 25th May 2020, 11:57

And the game is over.
The text with 6052 lines (31736 words, 189721 characters) was read without the need to add even a single character. I used nOCR. I'm really shocked how well it worked for the "Arial Black" font.
The one thing I changed beforehand was adding a few triple and a dozen double characters to your train.srt file.
A great tool.

tormento · 25th May 2020, 12:54

Quote:

Originally Posted by Janusz

I used nOCR.

What is nOCR?

Janusz · 25th May 2020, 14:01

Quote:

Originally Posted by tormento

What is nOCR?

Close Subtitle Edit, find "<ShowBetaStuff>" in Settings.xml and replace "False" with "True".
Launch the program. In [OCR Method] you will have a new method: "OCR via nOCR".
From the parameter name you can see that not everything can work as it should.
And that's how it is now. I didn't take notes of what I was doing and I can't reproduce what I wrote above. Fortunately, I have saved the character base and it can be repeated with it, but I can't generate the same database a second time.

---
For sure @Nikse555 will read it so I will add that:
the original character base, entered by hand and sufficient to read the entire file error-free, contains 367 elements; the new one, created by nOCR training, contains 481 characters, so it may contain characters that were already recognized. I don't have the tools to check it.
---
It turned out that my admiration was premature. My mistake: I left my character base in the working directory, so the newly generated characters were added to my base, hence the sensational result. A pity. It seems that this project is no longer being developed. In its current state it can only serve as a curiosity.
--------------------------------------------------------------------------------------------

@Nikse555

There was a problem with beta 161.

- nOCR has stopped recognizing: . , - (three characters) and calls for each character encountered as a new one - unknown.

- 'o' is recognized as '0' or 'c', but this does not happen for every 'o' in the text.

Example:
beta 145: Nie. Na United Fusion Corporation. To co innego. (I worked on this version until then).
beta 161: Nie* Na United Fusion Corporation* T0 co inneg0*
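When comparing the output of two betas like this, a small script can tally which characters changed. The sketch below is only an illustration; the function name `substitutions` is made up, and it assumes both lines have the same length, which holds for the example above because each wrong character replaces exactly one correct one.

```python
from collections import Counter


def substitutions(expected: str, actual: str) -> Counter:
    """Count (expected_char, actual_char) pairs that differ, position by position."""
    if len(expected) != len(actual):
        raise ValueError("lines differ in length; a real diff tool is needed")
    return Counter((e, a) for e, a in zip(expected, actual) if e != a)


if __name__ == "__main__":
    beta_145 = "Nie. Na United Fusion Corporation. To co innego."
    beta_161 = "Nie* Na United Fusion Corporation* T0 co inneg0*"
    for (good, bad), count in substitutions(beta_145, beta_161).most_common():
        print(f"{good!r} -> {bad!r}: {count}")
```

For the two lines above this reports '.' read as '*' three times and 'o' read as '0' twice, while the other 'o' characters are untouched.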

tormento · 27th May 2020, 17:06

@Nikse555

Out of curiosity, would you please do a x64 compile? I am curious to see if it gets faster on binary OCR.

GCRaistlin · 27th May 2020, 17:50

Quote:

Originally Posted by Nikse555

>The latest beta still allows to add an empty better multi match.
I think that "empty string" could be a valid text... perhaps a warning?

Sure, a warning would be nice.

Quote:

Originally Posted by Nikse555

>Could you please allow selecting a character by a right click in 'Inspect items' area of 'Inspect compare matches for current image' window?
I don't follow... ?

I call 'Inspect compare matches for current image' window. By default, the 1st item is selected on the left. I want to add better multi match for, say, the 5th item. I do a right-click on the 5th item and select 'Add better multi match' - but I don't get what I expected because right click doesn't select anything so this way I add better multi match for the 1st item, not the 5th one.
