Old 27th May 2020, 17:52   #1021  |  Link
Nikse555
Registered User
 
Join Date: Feb 2004
Location: Mars
Posts: 428
@Janusz: nOCR training + ,.- should be improved here: https://github.com/SubtitleEdit/subt...leEditBeta.zip

@tormento: SE already runs 64-bit if you have a 64-bit OS. Normally 64-bit programs run a little slower...

Latest beta now does fallback to "Latin.nocr" (nOCR) from "Binary image compare" db "Latin.db".
Included large (auto-trained) "Latin.nocr" db in beta.
Old 28th May 2020, 01:14   #1022  |  Link
Janusz
Registered User
 
Join Date: Apr 2020
Location: Poland
Posts: 143
Quote:
Originally Posted by Nikse555 View Post
@Janusz: nOCR training + ,.- should be improved here: https://github.com/SubtitleEdit/subt...leEditBeta.zip
Unfortunately, version 168 does not work well. I sent files for testing and comparison to your email address.

nOCR Training works much better.
In the sentence "The quick brown fox jumps over the Iazy do*." there are only 2 errors: "l" was read as "I" and the "g" is missing.
It is poor with punctuation marks.
In testing the same files that I sent, I am unable to determine the source of the character swap compared to beta 145.
In any case, beta 168 compared to beta 145 recognizes characters created in the new nOCR training much better.
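To make the "only 2 errors" count above reproducible, here is a minimal Python sketch that compares the expected sentence with the OCR result position by position. It assumes both strings have the same length (the "*" stands in for the unrecognized character); the function name is just for illustration and is not part of Subtitle Edit.

```python
def ocr_mismatches(expected, recognized):
    """Return (index, expected_char, recognized_char) for every differing position.

    Only meaningful when both strings have the same length, as in the
    example sentence where "*" marks the character that was not recognized.
    """
    if len(expected) != len(recognized):
        raise ValueError("strings must have the same length")
    return [
        (i, e, r)
        for i, (e, r) in enumerate(zip(expected, recognized))
        if e != r
    ]


expected = "The quick brown fox jumps over the lazy dog."
recognized = "The quick brown fox jumps over the Iazy do*."
errors = ocr_mismatches(expected, recognized)
print(len(errors), errors)
```

For OCR output with inserted or dropped characters, a real comparison would need an alignment step (for example an edit distance) instead of a plain position-by-position check.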
__________________
Sorry for my mistakes - I'm using a translator.

Last edited by Janusz; 28th May 2020 at 09:50.
Old 28th May 2020, 09:33   #1023  |  Link
tormento
Acid fr0g
 
Join Date: May 2002
Location: Italy
Posts: 2,542
Quote:
Originally Posted by Nikse555 View Post
@tormento: SE already runs 64-bit if you have a 64-bit OS.
The only x64 part I can see is the Hunspell spell checker. Main exe is x86, tesseract (both) are x86. What part of your app runs in x64?
Quote:
Originally Posted by Nikse555 View Post
Normally 64-bit programs run a little slower...
That is a very questionable statement. All the x64 programs I use are definitely faster.
__________________
@turment on Telegram
Old 28th May 2020, 11:28   #1024  |  Link
ACKR
Registered User
 
Join Date: Apr 2020
Posts: 42
Hello, I want to change the frame rate of a large number of subs from 24 to 29 fps. How do I do this?
Old 28th May 2020, 14:23   #1025  |  Link
varekai
Registered User
 
Join Date: Jul 2006
Posts: 528
I'm guessing you're talking about .srt subtitles?
Subtitle Edit
Code:
Tool -> Batch convert -> Change framerate
Or you can try this:
https://www.videohelp.com/software/S...merate-changer
It's no longer developed so I have no idea if it works for you.
Tried a few subtitles and it seems to work...

Best regards
varekai
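For anyone who wants to see what a frame rate change does to the timings, here is a rough Python sketch for .srt time codes. It assumes the usual interpretation: the same frames are played at a different rate, so every time code is multiplied by from_fps / to_fps. The function names are made up for this example, and this is not how Subtitle Edit or the tool linked above is implemented.

```python
import re

# SubRip time codes look like 00:01:02,345
SRT_TIME = re.compile(r"(\d{2}):(\d{2}):(\d{2}),(\d{3})")


def to_ms(h, m, s, ms):
    """Convert the four captured groups to total milliseconds."""
    return ((int(h) * 60 + int(m)) * 60 + int(s)) * 1000 + int(ms)


def from_ms(total):
    """Format milliseconds as an SRT time code, clamped at zero."""
    total = max(0, round(total))
    h, rem = divmod(total, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def change_frame_rate(text, from_fps, to_fps):
    """Rescale every SRT time code found in text by from_fps / to_fps."""
    return SRT_TIME.sub(
        lambda m: from_ms(to_ms(*m.groups()) * from_fps / to_fps), text
    )


print(change_frame_rate("00:00:29,000 --> 00:00:58,000", 24, 29))
```

To process a large number of files, the same function could be applied to the text of each file in a folder; file encoding handling is left out here.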
Old 28th May 2020, 15:52   #1026  |  Link
ACKR
Registered User
 
Join Date: Apr 2020
Posts: 42
Quote:
Originally Posted by varekai View Post
I'm guessing you're talking about .srt subtitles?
Subtitle Edit
Code:
Tool -> Batch convert -> Change framerate
Or you can try this:
https://www.videohelp.com/software/S...merate-changer
It's no longer developed so I have no idea if it works for you.
Tried a few subtitles and it seems to work...

Best regards
varekai
.ass subtitles

Also can Batch convert be used to add delays?

Last edited by ACKR; 29th May 2020 at 10:17.
Old 29th May 2020, 17:57   #1027  |  Link
sneaker_ger
Registered User
 
Join Date: Dec 2002
Posts: 5,565
Quote:
Originally Posted by ACKR View Post
Also can Batch convert be used to add delays?
SubtitleEdit's batch converter calls it "Offset time codes".
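As an illustration of what such an offset does to .ass files, here is a small Python sketch that shifts the Start and End fields of a "Dialogue:" line by a number of milliseconds. It assumes the common field order (Layer, Start, End, ...) and the H:MM:SS.cc time format; the function names are hypothetical and this is not Subtitle Edit's code.

```python
import re

# ASS time codes look like 0:01:02.34 (hundredths of a second)
ASS_TIME = re.compile(r"(\d+):(\d{2}):(\d{2})\.(\d{2})")


def shift_time(timestamp, offset_ms):
    """Shift one ASS time code by offset_ms, clamped at zero."""
    h, m, s, cs = (int(g) for g in ASS_TIME.fullmatch(timestamp.strip()).groups())
    total_cs = ((h * 60 + m) * 60 + s) * 100 + cs + round(offset_ms / 10)
    total_cs = max(0, total_cs)
    h, rem = divmod(total_cs, 360_000)
    m, rem = divmod(rem, 6_000)
    s, cs = divmod(rem, 100)
    return f"{h}:{m:02d}:{s:02d}.{cs:02d}"


def shift_ass_line(line, offset_ms):
    """Shift Start and End of a Dialogue line; other lines are returned as-is."""
    if not line.startswith("Dialogue:"):
        return line
    # Assumed layout: "Dialogue: Layer", Start, End, rest of the line
    fields = line.split(",", 3)
    fields[1] = shift_time(fields[1], offset_ms)
    fields[2] = shift_time(fields[2], offset_ms)
    return ",".join(fields)


line = "Dialogue: 0,0:00:01.00,0:00:03.50,Default,,0,0,0,,Hello"
print(shift_ass_line(line, 1500))
```

A negative offset moves the subtitles earlier; times that would go below zero are clamped to 0:00:00.00 in this sketch.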
Old 30th May 2020, 12:28   #1028  |  Link
18fps
Registered User
 
Join Date: Oct 2008
Posts: 55
When correcting the capitalization of all-caps subtitles, it would be a great help if the program could read the names of the characters from the IMDb page of the film (the user would supply the HTTP address).
Old 30th May 2020, 14:53   #1029  |  Link
Nikse555
Registered User
 
Join Date: Feb 2004
Location: Mars
Posts: 428
@tormento: SubtitleEdit.exe (all the C# code) runs 64-bit on 64-bit OS (check task manager, there should be no "(32-bit)" after the name). I've not actually tested SE 32-bit vs SE 64-bit performance... at least you have more memory with 64-bit programs.
Tesseract exe runs 32-bit (the 32-bit tesseract is faster than the 64-bit tesseract - well tested)

@18fps:
>When correcting capitalization of all caps subtitles, it would be of great help if the program could read the names of characters from the imdb page of the film (the user would give the http address)
Actually not a bad idea... but there seem to be a lot of "Cute Girl" and "Prison Guard" entries, which I guess would make this hard. Ideas?

@Janusz: The nOCR (line ocr) works best with larger subtitles, like from bluray sup files, so go for "Binary image compare" with small subtitles or even Tesseract 5.
nOCR still needs a lot of work - the latest beta has added "fallback to nOCR" from "Binary image compare", which I think will work nicely. Also added a "Max bad pixels" setting for nOCR (based on non-matching pixels from lines).
The nOCR auto-training now calculates the correct top margin (I hope), which was also missing in the normal OCR run (so all existing nOCR dbs will work less well).
Also fixed in training: quote + percentage sign. Still missing in training: combined letters like "ff" and "rt" are not working.
Many changes have been made regarding OCR (mostly nOCR): https://github.com/SubtitleEdit/subt...leEditBeta.zip
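To give an idea of what a "Max bad pixels" threshold based on non-matching pixels from lines could mean, here is a simplified Python sketch. It assumes a character template made of foreground lines (pixels expected to be set) and background lines (pixels expected to be empty) checked against a binary image. All names and the data layout are assumptions for illustration only; the actual nOCR code in Subtitle Edit may work quite differently.

```python
def points_on_line(start, end):
    """Integer pixel coordinates along a straight line from start to end."""
    (x0, y0), (x1, y1) = start, end
    steps = max(abs(x1 - x0), abs(y1 - y0))
    if steps == 0:
        return [(x0, y0)]
    return [
        (round(x0 + (x1 - x0) * i / steps), round(y0 + (y1 - y0) * i / steps))
        for i in range(steps + 1)
    ]


def count_bad_pixels(image, fg_lines, bg_lines):
    """Count pixels along the template lines that do not have the expected value."""
    bad = 0
    for lines, expected in ((fg_lines, 1), (bg_lines, 0)):
        for start, end in lines:
            for x, y in points_on_line(start, end):
                if image[y][x] != expected:
                    bad += 1
    return bad


def is_match(image, fg_lines, bg_lines, max_bad_pixels):
    """Accept the template if the number of bad pixels is within the limit."""
    return count_bad_pixels(image, fg_lines, bg_lines) <= max_bad_pixels


# Hypothetical 3x3 template for a vertical bar: middle column set, sides empty
fg = [((1, 0), (1, 2))]
bg = [((0, 0), (0, 2)), ((2, 0), (2, 2))]
clean = [[0, 1, 0], [0, 1, 0], [0, 1, 0]]
noisy = [[0, 1, 0], [0, 0, 0], [0, 1, 1]]
print(count_bad_pixels(clean, fg, bg), count_bad_pixels(noisy, fg, bg))
```

With a limit of 0 only the clean image matches; raising the limit to 2 also accepts the noisy one, which is the trade-off such a setting controls.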

Last edited by Nikse555; 30th May 2020 at 19:43.
Old 30th May 2020, 20:57   #1030  |  Link
Janusz
Registered User
 
Join Date: Apr 2020
Location: Poland
Posts: 143
Quote:
Originally Posted by Nikse555 View Post
@Janusz: The nOCR (line ocr) works best with larger subtitles, like from bluray sup files, so go for "Binary image compare" with small subtitles or even Tesseract 5.
nOCR still needs a lot of work - the latest beta has added "fallback to nOCR" from "Binary image compare", which I think will work nicely. Also added a "Max bad pixels" setting for nOCR (based on non-matching pixels from lines).
The nOCR auto-training now calculates the correct top margin (I hope), which was also missing in the normal OCR run (so all existing nOCR dbs will work less well).
Also fixed in training: quote + percentage sign. Still missing in training: combined letters like "ff" and "rt" are not working.
Many changes have been made regarding OCR (mostly nOCR): https://github.com/SubtitleEdit/subt...leEditBeta.zip
The problem is not about OCR itself. The program changes the character assignment in the character database. Change from ż to Ż, ć to Ć, from z to Z.

Attempt on _index.html file with "Batman Begins".
The character base created only for the first two lines for each new unrecognized character consists of 17 characters: ! , ? a c e h k l o p P R s t z ż
OCR was stopped at "n" on the third line.
As you can see, a small "z" instead of "Z" appeared in the third line.
We can repeatedly start OCR from the first line, each time OCR will stop at "n". Character Database content does not change.
However, if, for example, we call "Inspect nocr matches for ..." on the first line, the "nOCR inspekt" window opens. Click in the "Inspect items" box, then select OK or Cancel to close the window without making any changes.
Reopening this window will show us that "ż" was assigned to "Ż" although we did not do that. These changes are now saved permanently. Another OCR run will show us that "ż" on lines 1 and 2 has been replaced with "Ż", and "z" with "Z" on line 3. The repeated OCR again asks for "ż".
If you need pictures I can attach.
-----
Beta 145 doesn't have this problem.
It started with beta 161. I could check this version. I wrote about beta 168 earlier, but at that time I didn't know where to look for the cause.
__________________
Sorry for my mistakes - I'm using a translator.

Last edited by Janusz; 30th May 2020 at 23:50.
Old 30th May 2020, 21:23   #1031  |  Link
jlw_4049
Registered User
 
Join Date: Sep 2018
Posts: 391
Quote:
Originally Posted by Janusz View Post
The problem is not about OCR itself. The program changes the character assignment in the character database. Change from ż to Ż, ć to Ć, from z to Z.

Attempt on _index.html file with "Batman Begins".
The character base created only for the first two lines for each new unrecognized character consists of 17 characters: ! ,? a c e h k l o p P R s t z ż
OCR was stopped at "n" on the third line.
As you can see, a small "z" instead of "Z" appeared in the third line.
We can repeatedly start OCR from the first line, each time OCR will stop at "n". Character Database content does not change.
However, if, for example, we call "Inspect nocr matches for ..." on the first line, the "nOCR inspekt" window opens. Click in the "Inspect items" box, then select OK or Cancel to close the window without making any changes.
Reopening this window will show us that "ż" was assigned to "Ż" although we did not do that. These changes are now saved permanently. Another OCR run will show us that "ż" on lines 1 and 2 has been replaced with "Ż", and "z" with "Z" on line 3. The repeated OCR again asks for "ż".
You can also make changes to the characters in the settings yourself

Old 30th May 2020, 21:49   #1032  |  Link
Janusz
Registered User
 
Join Date: Apr 2020
Location: Poland
Posts: 143
Quote:
Originally Posted by jlw_4049 View Post
You can also make changes to the characters in the settings yourself
Yes. Where?
The program should not change the assignment of an image to a character by itself. If you have once determined that "a" is "a", then why is "a" suddenly "A"?
-----


Please explain to me what "in the settings" setting causes such a change. The left side, although the subtitles look strange, is correct.
Sup and nocr files to download.
The first OCR run will be OK. The second and subsequent OCR runs on the same file will swap the characters.
-----
Everything indicates that the problem concerns only the Polish language, so it went unnoticed by other users.
__________________
Sorry for my mistakes - I'm using a translator.

Last edited by Janusz; 31st May 2020 at 09:37.
Old 31st May 2020, 10:31   #1033  |  Link
18fps
Registered User
 
Join Date: Oct 2008
Posts: 55
Quote:
Originally Posted by Nikse555 View Post
@18fps:
>When correcting capitalization of all caps subtitles, it would be of great help if the program could read the names of characters from the imdb page of the film (the user would give the http address)
Actually not a bad idea... but there seem to be a lot of "Cute Girl" and "Prison Guard" entries, which I guess would make this hard. Ideas?
Well, many of these will not actually be in the text of the subtitles ("man in the counter"), so maybe, just the way the program shows the text for the hearing impaired that it is going to remove, it could show the list of identified names it found in the actual subtitles, for approval.
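A rough Python sketch of that idea: take a list of character names (assumed here to be already available as plain strings; fetching them from a web page is not shown), keep only the name parts that really occur in the all-caps subtitle text, and apply the ones the user approves. The function names and the sample cast list are invented for this example.

```python
import re


def candidate_names(cast_names, subtitle_text):
    """Return name parts from the cast list that occur as words in the subtitles."""
    words_in_subs = set(re.findall(r"[^\W\d_]+", subtitle_text.upper()))
    found = []
    for name in cast_names:
        for part in name.split():
            if part.upper() in words_in_subs and part not in found:
                found.append(part)
    return found


def apply_names(text, approved_names):
    """Restore the casing of approved names, matching whole words only."""
    for name in approved_names:
        pattern = rf"\b{re.escape(name)}\b"
        text = re.sub(pattern, lambda m, n=name: n, text, flags=re.IGNORECASE)
    return text


cast = ["Bruce Wayne", "Alfred", "Prison Guard"]  # hypothetical cast list
subs = "MASTER WAYNE, ALFRED IS WAITING."
names = candidate_names(cast, subs)
print(names)
print(apply_names("Master wayne, alfred is waiting.", names))
```

Entries such as "Prison Guard" drop out automatically when the words do not occur in the subtitles, but a word like "guard" could still show up as a false candidate, which is why the approval list matters.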
Old 31st May 2020, 20:47   #1034  |  Link
Nikse555
Registered User
 
Join Date: Feb 2004
Location: Mars
Posts: 428
@Janusz: Yes, nOCR is not finished yet... in SE 3.5.15 it was probably about 60% done, and now it's about 85% done.
I was close to giving up on it, but after a few fixes in auto-training it's actually working very well (besides words that are stuck together like "rw", "ff" etc... - it's on my todo list + also italic font might be a problem).
I've fixed the OCR inspect in latest beta and also added some code for correcting the casing of Polish letters... do let me know how that works: https://github.com/SubtitleEdit/subt...leEditBeta.zip
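Just to illustrate the kind of casing correction meant here (this is a guess at the general idea, not the code that went into the beta): a simple heuristic in Python is to lowercase any capital letter that shows up inside a word that is not written entirely in capitals. The function names are hypothetical.

```python
import re


def fix_word_casing(word):
    """Lowercase everything after the first letter unless the word is all caps."""
    if len(word) < 2 or word.isupper():
        return word
    return word[0] + word[1:].lower()


def fix_text_casing(text):
    """Apply the word heuristic to every run of letters in the text."""
    return re.sub(r"[^\W\d_]+", lambda m: fix_word_casing(m.group(0)), text)


print(fix_text_casing("To moŻe byĆ Żart."))
```

The heuristic is deliberately naive: names with an inner capital (like "McDonald") would be damaged, so a real implementation would need exceptions or a dictionary check.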

@18fps: OK, I might give it a try.
Old 1st June 2020, 09:28   #1035  |  Link
Janusz
Registered User
 
Join Date: Apr 2020
Location: Poland
Posts: 143
@Nikse555:
It is certainly good for characters without accents: a b c ... A B C ...
It is not bad for lowercase letters with an accent: é č ę ė š ž ü å ä ö ą ś ż ś ...
Unfortunately, capital letters with an accent (Polish: Ś Ó Ż Ź Ć; also Spanish, Portuguese, Czech) are recognized as two separate signs: the accent and the capital letter. This cannot be improved by "Add better match ...".
This does not apply to German. Here Ü Ä Ö are each recognized as one character.
Until the problem with single characters is solved, you can set aside "besides words that are stuck together like "rw", "ff" etc..."
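To show that the two separately recognized signs (accent and capital letter) correspond to a single character in Unicode, here is a small Python sketch using the standard unicodedata module. The mapping from a detached accent to a combining mark is my own assumption for the example; it does not describe how Subtitle Edit splits or joins characters.

```python
import unicodedata

# Assumed mapping from a detached accent sign to a Unicode combining mark
ACCENT_TO_COMBINING = {
    "´": "\u0301",  # acute accent -> combining acute accent
    ".": "\u0307",  # dot -> combining dot above
    "ˇ": "\u030c",  # caron -> combining caron
}


def compose(base, accent):
    """Join a base letter and a detached accent into one precomposed character."""
    combining = ACCENT_TO_COMBINING[accent]
    return unicodedata.normalize("NFC", base + combining)


for base, accent in [("Z", "."), ("Z", "´"), ("S", "´"), ("O", "´"), ("R", "ˇ")]:
    print(base, accent, "->", compose(base, accent))
```

NFC normalization only joins the pair when a precomposed character exists, so the result should be checked to be a single character before it is used.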
__________________
Sorry for my mistakes - I'm using a translator.

Last edited by Janusz; 1st June 2020 at 10:56.
Old 1st June 2020, 11:09   #1036  |  Link
Nikse555
Registered User
 
Join Date: Feb 2004
Location: Mars
Posts: 428
Quote:
Originally Posted by Janusz View Post
@Nikse555:
Unfortunately, capital letters with accent: Polish: Ś Ó Ż Ź Ć, Spanish, Portuguese, Czech are recognized as two separate signs
Latest beta should work a little better with accents... do you have a sample file with problematic accents?

Quote:
Originally Posted by Janusz View Post
@Nikse555:
Until the problem with single characters is solved, you can set aside "besides words that are stuck together like "rw","ff"etc .."
Latest beta can now train letters that are stuck together

https://github.com/SubtitleEdit/subt...leEditBeta.zip
Old 1st June 2020, 11:48   #1037  |  Link
Janusz
Registered User
 
Join Date: Apr 2020
Location: Poland
Posts: 143
@Nikse



File to download:

Because there was a problem with capital letters with accents, e.g. Ś, Ć, Ó, Ř, Í, Š, É etc., I prepared a text consisting of sentences containing all the letters used in the following languages: English, Polish, German, Spanish and Czech. If we compare the line marked in blue on the right side, we will notice that lowercase letters hide among the uppercase letters, but not everywhere. There is "Ó", which has not been replaced with "ó", or "Ń", and several others. Other capital letters also contain substitutions of this type. The exception is English, for obvious reasons - there are no letters with accents.

During the OCR I did not make any corrections, I did not add any characters manually. In both cases: beta 145 and beta 187, the text in the form we see has been fully read by the character base created by nOCR Training beta 187. Comparing pages line by line, you can see how much progress has been made since beta 145.
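For anyone who wants to build a similar test text, here is a short Python sketch that checks whether a text really contains every letter of an alphabet, in lowercase or uppercase. The Polish alphabet and the sample sentence (a commonly quoted Polish pangram) are my own additions and are not the contents of the attached test file.

```python
# The 32 letters of the Polish alphabet
POLISH_LOWER = "aąbcćdeęfghijklłmnńoóprsśtuwyzźż"


def missing_letters(text, alphabet):
    """Return the letters of alphabet that do not occur in text (case-sensitive)."""
    present = set(text)
    return [ch for ch in alphabet if ch not in present]


pangram = "Pchnąć w tę łódź jeża lub ośm skrzyń fig"

# Lowercase coverage, then uppercase coverage of the same sentence
print(missing_letters(pangram.lower(), POLISH_LOWER))
print(missing_letters(pangram.upper(), POLISH_LOWER.upper()))
```

Checking the uppercase form separately matters here, because the reported problem concerns accented capitals in particular.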
-----
02.06
Why are some characters, e.g. Czech Ř, Ď, Á ..., remembered as one character, while others, e.g. Polish Ó, Ż, Ź, are remembered as the letters O, Z with an accent?
WARNING! Characters memorized in the character database as "." or "´" can, I think, be edited, but they cannot be deleted in any way, because deletion causes our character base to crash.
Removal would be possible, but then all associated characters would also have to be removed from the database.
-----
I will return to _index.html file with "Batman Begins" with the character base attached.
If we start OCR with the [Draw missing text] option enabled, the program will ask for "." or "," in the middle or at the end of a sentence. We can add them and it will be fine. But when, in line 74, we are asked to add not "Ż" but the "." located above "Z", and we add it, our character base will crash.
From now on, all "-" in the dialogs at the beginning of the line will be replaced with ".". If we run OCR again from the first line without recognizing new characters, it will turn out that all "-" at the beginning of the line have been changed to ".". An additional gift will be the exchange of ś into Ś and vice versa, and of z into Z. Long to exchange.
I don't know what it looks like in other languages with uppercase letters in indexes - I don't have the right files, but for single letters it works the same way.
-----
I can already see good changes in beta 193. Keep it up. Good job. Thank you.
__________________
Sorry for my mistakes - I'm using a translator.

Last edited by Janusz; 2nd June 2020 at 12:27.
Old 1st June 2020, 18:12   #1038  |  Link
Melan
Registered User
 
Join Date: Jan 2014
Location: Poland
Posts: 64
I received the message as below. My .nocr file is created from scratch.

https://i.imgur.com/4TJc4kQ.png
Old 2nd June 2020, 15:07   #1039  |  Link
Nikse555
Registered User
 
Join Date: Feb 2004
Location: Mars
Posts: 428
@Melan: Should hopefully be fixed in latest beta: https://github.com/SubtitleEdit/subt...leEditBeta.zip

@Janusz: thx for the test file - good idea
I see the problem with batman begins and "Ż" - but that's about line splitting (also happens in "Binary image compare")

>deletion causes our character base to crash.
Does this still happen?
Old 2nd June 2020, 16:35   #1040  |  Link
Melan
Registered User
 
Join Date: Jan 2014
Location: Poland
Posts: 64
Still the same.

https://i.imgur.com/tysac5k.png