Doom9's Forum - View Single Post

Janusz · 1st June 2020, 11:48

@Nikse

File to download:

Because there was a problem with accented capital letters (e.g. Ś, Ć, Ó, Ř, Í, Š, É), I prepared a test text made of sentences containing all the letters used in English, Polish, German, Spanish and Czech. If we look at the line marked in blue on the right side, we can see lowercase letters hiding among the uppercase ones, but not everywhere: some capitals, such as "Ó", "Ń" and several others, have not been replaced with their lowercase forms. Other capital letters also show substitutions of this type. The exception is English, for the obvious reason that it has no accented letters.

During OCR I made no corrections and added no characters manually. In both cases (beta 145 and beta 187), the text as shown was read entirely with the character base created by nOCR Training beta 187. Comparing the pages line by line shows how much progress has been made since beta 145.
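The line-by-line comparison described above can also be done mechanically. The following is only a minimal Python sketch, assuming the expected text and the OCR output are both available as plain text with matching line and character positions; the function names `case_only_mismatches` and `compare_pages` are made up for this illustration and are not part of Subtitle Edit.

```python
def case_only_mismatches(expected, recognized):
    """Return (position, expected_char, recognized_char) for every position
    where the two strings differ only by letter case, e.g. 'Ó' read as 'ó'.

    Assumption: both strings are aligned character by character."""
    mismatches = []
    for i, (e, r) in enumerate(zip(expected, recognized)):
        if e != r and e.lower() == r.lower():
            mismatches.append((i, e, r))
    return mismatches


def compare_pages(expected_lines, recognized_lines):
    """Compare two pages line by line.

    Returns a dict mapping the 1-based line number to the case-only
    mismatches found on that line; lines without mismatches are omitted."""
    report = {}
    for n, (e, r) in enumerate(zip(expected_lines, recognized_lines), start=1):
        found = case_only_mismatches(e, r)
        if found:
            report[n] = found
    return report
```

Running `compare_pages` once on the beta 145 output and once on the beta 187 output would give two reports whose sizes can be compared directly. Other kinds of OCR errors (wrong letters, missing characters) are deliberately ignored here.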
-----
02.06
Why are some characters, e.g. Czech Ř, Ď, Á ..., stored as a single character, while others, e.g. Polish Ó, Ż, Ź, are stored as the letters O, Z plus a separate accent?
WARNING! Characters stored in the character database as "." or "´" can, I think, be edited, but they must not be deleted: deleting them crashes the character base.
Removal would be possible, but then all associated characters would also have to be removed from the database.
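On the question of why Ř is stored as one character while Ó is stored as O plus an accent: one possible explanation, and this is only an assumption because nOCR's internals are not described here, is that an accent drawn detached from its base letter forms its own separate group of pixels. The toy Python sketch below counts connected pixel groups in two tiny hand-made bitmaps. The bitmaps and the function name `count_components` are invented for illustration and are not taken from Subtitle Edit.

```python
from collections import deque


def count_components(bitmap):
    """Count groups of '#' pixels that touch each other
    (horizontally, vertically or diagonally) using a breadth-first fill."""
    rows = len(bitmap)
    cols = len(bitmap[0]) if rows else 0
    seen = set()
    count = 0
    for r in range(rows):
        for c in range(cols):
            if bitmap[r][c] != "#" or (r, c) in seen:
                continue
            count += 1  # found a pixel belonging to a new group
            seen.add((r, c))
            queue = deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and bitmap[ny][nx] == "#"
                                and (ny, nx) not in seen):
                            seen.add((ny, nx))
                            queue.append((ny, nx))
    return count


# Hypothetical glyph: the accent is separated from the letter by a blank row.
DETACHED = [
    "..#..",
    ".....",
    ".###.",
    ".#.#.",
    ".###.",
]

# Hypothetical glyph: the accent touches the letter.
TOUCHING = [
    "..#..",
    ".###.",
    ".#.#.",
    ".###.",
]

print(count_components(DETACHED))  # 2: letter and accent are separate groups
print(count_components(TOUCHING))  # 1: letter and accent form one group
```

If something like this is what happens, a separately stored "." or "´" would be shared by every letter that uses it, which would fit the observation that deleting it affects all associated characters.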
-----
I will return to the _index.html file for "Batman Begins", with the character base attached.
If we start OCR with the [Draw missing text] option enabled, the program asks for "." or "," in the middle or at the end of a sentence. We can add these and everything will be fine. But when, in line 74, we are asked to add not "Ż" but the "." located above "Z", and we add it, our character base crashes.
From then on, every "-" at the beginning of a dialogue line is replaced with ".". If we rerun OCR from the first line without recognizing any new characters, every "-" at the beginning of a line turns out to have been changed to ".". As an added bonus, ś is swapped with Ś and vice versa, and z with Z. The list goes on.
I don't know how this behaves in other languages with uppercase letters in indexes, since I don't have the right files, but for single letters it works the same way.
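To check how many lines are affected by the "-" to "." replacement after such a crash, the OCR output can be compared against a known-good copy of the text. This is a minimal Python sketch under that assumption; the function name `leading_dash_replaced` is hypothetical and nothing here comes from Subtitle Edit itself.

```python
def leading_dash_replaced(expected_lines, recognized_lines):
    """Return the 1-based numbers of lines that begin with '-' in the
    expected text but with '.' in the OCR output.

    Leading whitespace is ignored; lines are assumed to be in the same order."""
    affected = []
    for n, (e, r) in enumerate(zip(expected_lines, recognized_lines), start=1):
        if e.lstrip().startswith("-") and r.lstrip().startswith("."):
            affected.append(n)
    return affected
```

For example, `leading_dash_replaced(["- Hi.", "Bye.", "-Go"], [". Hi.", "Bye.", "-Go"])` returns `[1]`.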
-----
I can already see good changes in beta 193. Keep it up. Good job. Thank you.

1st June 2020, 11:48	#1037
Janusz Registered User Join Date: Apr 2020 Location: Poland Posts: 143
__________________
Sorry for my mistakes - I'm using a translator.
Last edited by Janusz; 2nd June 2020 at 12:27.