Old 24th March 2017, 04:10   #1  |  Link
MysteryX
Soul Architect
 
 
Join Date: Apr 2014
Posts: 2,559
Unicode File Paths

Here's something strange I noticed. I know script files must be in ANSI format and don't support Unicode characters. What about the script file name itself? It works with characters from a variety of languages without problem.

However, if I open a file name with Chinese or Thai characters, it crashes saying "Import: couldn't open ..." followed by the file name, with the Chinese characters replaced by ???

Why is this not working? Neither MPC-HC nor VirtualDub opens it.

How can I either open the files, or validate file names to make sure they will work?

I'm using Avisynth+
Old 24th March 2017, 09:08   #2  |  Link
stax76
Registered User
 
 
Join Date: Jun 2002
Location: On thin ice
Posts: 6,837
The ANSI limitation applies to script file paths as well. Most languages are covered by their ANSI code page, however, so there are hardly any requests for unicode support, only from time to time from tool makers like us.

Since you raised this topic: unicode and console/batch don't work together in Win 7 either, because Win 7 has unicode bugs. Batch files are needed for x265 piping, for instance, which in practice means x265 doesn't support unicode either, at least not for Win 7 users, of whom there are still a lot.

Last but not least, .NET and Windows are getting long file path support (more than MAX_PATH/260 characters). It can already be used, but only with a group policy change and a manifest entry. I already use it in personal scripts.
Old 24th March 2017, 09:45   #3  |  Link
Groucho2004
 
Join Date: Mar 2006
Location: Barcelona
Posts: 5,034
Quote:
Originally Posted by MysteryX View Post
Here's something strange I noticed. I know script files must be in ANSI format and don't support Unicode characters. What about the script file name itself? It works with characters from a variety of languages without problem.

However, if I open a file name with Chinese or Thai characters, it crashes saying "Import: couldn't open ..." followed by the file name, with the Chinese characters replaced by ???

Why is this not working? Neither MPC-HC nor VirtualDub opens it.

How can I either open the files, or validate file names to make sure they will work?

I'm using Avisynth+
You have to set your system locale to the correct language. MPC-HC will play the file happily:
https://s9.postimg.org/tvbb9b2pb/Image1.png

Same for a console app (AVSMeter):
https://s9.postimg.org/6fte3yiy7/Image2.png

Strangely, VirtualDub (1.10.4) will not open the file.

In WinNT using NTFS, file names are conveniently stored in Unicode internally; it's up to the programmer to interpret them correctly.
__________________
Groucho's Avisynth Stuff

Last edited by Groucho2004; 24th March 2017 at 20:16.
Old 24th March 2017, 17:43   #4  |  Link
MysteryX
Soul Architect
 
 
Join Date: Apr 2014
Posts: 2,559
Detecting the character language and shifting the system locale for every file is not an option.

What I'm trying to do is create an AVS file with the same file name as the video, replacing only the extension. It works 96% of the time, but a few videos are crashing. I need some way of determining which ones won't work so I can decide to use another file name.

If only ANSI characters were supported, then video names with Arabic characters would also fail, but they work.
Old 24th March 2017, 18:03   #5  |  Link
Groucho2004
 
Join Date: Mar 2006
Location: Barcelona
Posts: 5,034
Quote:
Originally Posted by MysteryX View Post
Detecting the character language and shifting the system locale for every file is not an option.
You asked a question, I answered it.

Quote:
Originally Posted by MysteryX View Post
What I'm trying to do is create an AVS file with the same file name as the video, replacing only the extension. It works 96% of the time, but a few videos are crashing.
The 96% indicates that you tried at least 25 different names. What languages did you try? What do you mean by "crashing"?

Quote:
Originally Posted by MysteryX View Post
If only ANSI characters were supported, then video names with Arabic characters would also fail, but they work.
Arabic (CP1256) is not a multi byte character set which might explain that. Chinese, Korean and Japanese are MBCS. Do you have trouble with other single byte character sets?
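To make the single-byte versus multi-byte distinction concrete, here is a small Python sketch. The helper name `bytes_per_char` is made up for illustration, and Python's codecs only stand in for the Windows code pages; this is not how Avisynth itself handles strings.

```python
def bytes_per_char(text, codepage):
    """Return the encoded size of each character of `text` in `codepage`."""
    return [len(ch.encode(codepage)) for ch in text]


if __name__ == "__main__":
    # Arabic in CP1256: every character fits in one byte.
    print(bytes_per_char("مرحبا", "cp1256"))
    # Simplified Chinese in CP936 and Japanese in CP932: two bytes each here.
    print(bytes_per_char("夏日主题", "cp936"))
    print(bytes_per_char("こんにちは", "cp932"))
```

Note that this only shows how the code pages differ in size per character. It does not by itself explain which names fail on a given system locale.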

Last edited by Groucho2004; 24th March 2017 at 18:20.
Old 24th March 2017, 18:31   #6  |  Link
MysteryX
Soul Architect
 
 
Join Date: Apr 2014
Posts: 2,559
Quote:
Originally Posted by Groucho2004 View Post
The 96% indicates that you tried at least 25 different names. What languages did you try? What do you mean by "crashing"?
I already mentioned the error message.

I have a database of 500+ videos, so yes I tried at least 25 names. Only 2 or 3 file names with Chinese or Thai characters failed to load.

Quote:
Originally Posted by Groucho2004 View Post
Arabic (CP1256) is not a multi byte character set which might explain that. Chinese, Korean and Japanese are MBCS. Do you have trouble with other single byte character sets?
This would explain why only Chinese and Thai fail while Arabic and other languages work.

However, detecting those isn't so simple:
Quote:
It's the encoding (characterset), which decides whether a specific character is encoded as a single or multiple bytes. For example, if you use ISO-8859-1 as encoding, the character Ø is encoded as a single byte, but if you use UTF-8 as encoding, it's encoded as 2 bytes. So to know how many bytes a character will be encoded with, you need to know which characterset, you're going to transport the text in.
Old 24th March 2017, 19:17   #7  |  Link
real.finder
Registered User
 
Join Date: Jan 2012
Location: Mesopotamia
Posts: 2,587
Some time ago I was thinking of suggesting UTF-8 support for avs+ scripts. For compatibility, I was thinking of suggesting that old ANSI scripts be auto-converted to UTF-8 internally when the encoding script is UTF-8. The conversion would be done only for the scripts used by the encoding script, not for all scripts in the autoload folder.

I don't know if this can be done or not.
__________________
See My Avisynth Stuff
Old 24th March 2017, 20:02   #8  |  Link
Groucho2004
 
Join Date: Mar 2006
Location: Barcelona
Posts: 5,034
Quote:
Originally Posted by MysteryX View Post
I already mentioned the error message.
We appear to have different interpretations of the word "crashing".
Quote:
Originally Posted by MysteryX View Post
Only 2 or 3 file names with Chinese or Thai characters failed to load.
Can you post the names that fail?
Old 24th March 2017, 20:26   #9  |  Link
TheFluff
Excessively jovial fellow
 
Join Date: Jun 2004
Location: rude
Posts: 1,100
Adding Unicode support to Avs+ is probably pretty trivial. You can pass UTF-8 around in AVSValues no problem, so all you need to do is wrap the file I/O functions that exist in a few places (like in import) with a trivial function that does MultiByteToWideChar and then calls the W-version of the I/O function.
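The conversion step described here can be sketched in a few lines of Python. This only models what MultiByteToWideChar with CP_UTF8 produces (a NUL-terminated UTF-16LE buffer); the function name `utf8_to_wide` is hypothetical, and the real patch would be C++ inside Avisynth+, which is not shown.

```python
def utf8_to_wide(raw):
    """Convert UTF-8 bytes to the NUL-terminated UTF-16LE bytes a W function expects.

    Hypothetical stand-in for MultiByteToWideChar(CP_UTF8, ...).
    Raises UnicodeDecodeError if `raw` is not valid UTF-8.
    """
    text = raw.decode("utf-8")
    return text.encode("utf-16-le") + b"\x00\x00"


if __name__ == "__main__":
    name = "SNH48 - 夏日主题泳装.avs".encode("utf-8")
    wide = utf8_to_wide(name)
    # Each UTF-16 code unit is two bytes, plus the two-byte terminator.
    print(len(name), len(wide))
```

The wrapper around each I/O call would then pass the converted buffer to the W variant and leave everything downstream of the file handle unchanged.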

You'll be incompatible with old scripts that use some local code page, but re-saving as UTF-8 should hardly be a huge problem for anyone. You should not support local code pages, there is absolutely nothing to be gained from that.

Oh, and then you get to fix the VFW interface. Have fun with that.

Quote:
Originally Posted by stax76 View Post
Win 7 has unicode bugs
what

no seriously, what

Quote:
Originally Posted by stax76 View Post
Last but not least, .NET and Windows are getting long file path support (more than MAX_PATH/260 characters). It can already be used, but only with a group policy change and a manifest entry. I already use it in personal scripts.
I know this whole "unicode" thing is painfully new to you guys, but seriously now. UNC paths have been around since Windows 2000. I know in Win10 they removed the MAX_PATH restriction for regular paths but that doesn't solve the problem because all the old garbage from the 90's still does "TCHAR filename[MAX_PATH];" somewhere and the user still has to opt in to it. So UNC or bust.

Last edited by TheFluff; 24th March 2017 at 20:41.
Old 24th March 2017, 20:38   #10  |  Link
Groucho2004
 
Join Date: Mar 2006
Location: Barcelona
Posts: 5,034
The real problem here is that one can't expect every program to handle file names with Thai, Chinese, etc characters. If I recall correctly, Win32 CreateFileW() can handle these files. However, most tools use standard C library or STL functions to open/save files which will fail in some cases.

Unicode aware programs like MS Word or EMEditor handle them without problems independent of the system locale.

Also, we're just talking about file names, not the content of these files. Using these file names within scripts opens another can of worms.

Last edited by Groucho2004; 24th March 2017 at 20:41.
Old 24th March 2017, 20:46   #11  |  Link
TheFluff
Excessively jovial fellow
 
Join Date: Jun 2004
Location: rude
Posts: 1,100
Quote:
Originally Posted by Groucho2004 View Post
The real problem here is that one can't expect every program to handle file names with Thai, Chinese, etc characters. If I recall correctly, Win32 CreateFileW() can handle these files. However, most tools use standard C library or STL functions to open/save files which will fail in some cases.

Unicode aware programs like MS Word or EMEditor handle them without problems independent of the system locale.

Also, we're just talking about file names, not the content of these files. Using these file names within scripts opens another can of worms.
That's what I said, though? Literally all the VFW interface does is call env->Import() with the filename it gets from VFW itself, so if you've fixed import you only need to switch the VFW API functions to the W variant (which I am quite sure exist, but can't be arsed to look up on MSDN).

Now, things that interact with the Avisynth API directly instead of going through VFW will of course have to be made aware of the fact that the new hot thing to do is to pass UTF8. Oh. Wait, this is Avisynth and you will never break API backwards compatibility ever. Never mind.

Pretty sure the FFMS2 Avisynth plugin supports UTF8 filenames but breaks on local code page, by the way.

e:
Quote:
Originally Posted by Groucho2004 View Post
The real problem here is that one can't expect every program to handle file names with Thai, Chinese, etc characters.
I'm pretty sure that in 2017, doom9 is one of very few places on the internet where you will not only hear someone say this, but also expect it to be seen as a reasonable standpoint to have.

Last edited by TheFluff; 24th March 2017 at 20:53.
Old 24th March 2017, 21:10   #12  |  Link
TheFluff
Excessively jovial fellow
 
Join Date: Jun 2004
Location: rude
Posts: 1,100
Quote:
Originally Posted by Groucho2004 View Post
Using these file names within scripts opens another can of worms.
It does not. As I mentioned before, I'm prrrretty sure FFMS2 supports this right now and it definitely did so in the past (because I wrote the code that did it, but it has since been replaced with a simpler solution). The only thing you need to do is save the script as UTF8 without BOM. UTF8 can safely be treated as any other array of char. To Avisynth, it's just a regular string, which gets passed to FFMS2, which is char* everywhere in the API, so it goes straight to the libavformat I/O, which does the actual file opening and actually does support converting from UTF8 to the Windows style wchar_t API's.
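As an illustration of "save the script as UTF8 without BOM", here is a minimal Python sketch. The helper names are invented for the example, and whether a particular FFMS2 build accepts the resulting file name is the claim made in the post above, not something this code verifies.

```python
import codecs
import os
import tempfile


def write_script_utf8_no_bom(path, script_text):
    """Write `script_text` as UTF-8 without a byte order mark.

    encoding="utf-8" never emits a BOM (unlike "utf-8-sig");
    newline="" writes line endings exactly as given.
    """
    with open(path, "w", encoding="utf-8", newline="") as f:
        f.write(script_text)


def starts_with_bom(path):
    """Return True if the file begins with the UTF-8 BOM (EF BB BF)."""
    with open(path, "rb") as f:
        return f.read(3) == codecs.BOM_UTF8


if __name__ == "__main__":
    # The script file name is kept ASCII; only its content is UTF-8.
    script_path = os.path.join(tempfile.mkdtemp(), "job.avs")
    write_script_utf8_no_bom(script_path, 'FFVideoSource("SNH48 - 夏日主题泳装.mkv")\r\n')
    print(starts_with_bom(script_path))
```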
Old 24th March 2017, 22:01   #13  |  Link
Groucho2004
 
Join Date: Mar 2006
Location: Barcelona
Posts: 5,034
Quote:
Originally Posted by TheFluff View Post
It does not. As I mentioned before, I'm prrrretty sure FFMS2 supports this right now and it definitely did so in the past (because I wrote the code that did it, but it has since been replaced with a simpler solution). The only thing you need to do is save the script as UTF8 without BOM. UTF8 can safely be treated as any other array of char. To Avisynth, it's just a regular string, which gets passed to FFMS2, which is char* everywhere in the API, so it goes straight to the libavformat I/O, which does the actual file opening and actually does support converting from UTF8 to the Windows style wchar_t API's.
You're singling out ffms2, what about other filters or Avisynth internal functions that take file names as arguments?
Old 24th March 2017, 22:14   #14  |  Link
TheFluff
Excessively jovial fellow
 
Join Date: Jun 2004
Location: rude
Posts: 1,100
Quote:
Originally Posted by Groucho2004 View Post
You're singling out ffms2, what about other filters or Avisynth internal functions that take file names as arguments?
AviSource and DirectShowSource are both trivial, they each have like one or two CreateFile or similar function calls that need to be wrapped, everything else can remain unchanged.

Everything else, no idea but it's almost definitely gonna be similar - you replace/wrap the call to fopen/CreateFile and that's it, everything else works by passing the handle around and doesn't need to be changed. That's the entire point of UTF-8; everything that uses 1-byte char encodings keeps working as normal.

It's not like the current situation is any good either - only accepting filenames in your local codepage simply isn't an acceptable solution today. I mean, 7-bit ASCII still works everywhere, but come on, this is 2017. The unicode consortium is so bored it's busy adding entire codepages full of emojis.

Last edited by TheFluff; 24th March 2017 at 22:17.
Old 24th March 2017, 22:41   #15  |  Link
Groucho2004
 
Join Date: Mar 2006
Location: Barcelona
Posts: 5,034
Quote:
Originally Posted by TheFluff View Post
That's the entire point of UTF-8; everything that uses 1-byte char encodings keeps working as normal.
Hm, Russian encoded in CP1251 is a single byte character set. Converted to UTF-8, all (or most, not sure) characters use 2 bytes.
If you're referring to the ASCII character subset (0 - 127) you're right.
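Both halves of that observation can be checked quickly in Python. This is only an illustration using Python's own codecs; the helper name `same_bytes_in_utf8` is hypothetical.

```python
def same_bytes_in_utf8(text, codepage):
    """Return True if `text` has identical bytes in `codepage` and in UTF-8."""
    return text.encode(codepage) == text.encode("utf-8")


if __name__ == "__main__":
    # Pure ASCII: byte-for-byte identical in CP1251 and UTF-8.
    print(same_bytes_in_utf8("clip_01.avs", "cp1251"))
    # Cyrillic: one byte per letter in CP1251, two bytes per letter in UTF-8.
    word = "Привет"
    print(len(word.encode("cp1251")), len(word.encode("utf-8")))
```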
Old 24th March 2017, 23:00   #16  |  Link
TheFluff
Excessively jovial fellow
 
Join Date: Jun 2004
Location: rude
Posts: 1,100
You're missing the point. UTF-8 is just a multibyte encoding, and how many bytes you use for encoding a single character isn't at all interesting to anyone, really (unless you're the kind of person who expects strlen to return the number of natural language characters, but in that case you're beyond hope). If your codepage is set to, say, 932 (Japanese, ShiftJIS) almost every character an actual Japanese person is interested in will take more than one byte. Avisynth handles that just fine - you can put Japanese characters in your script all you want as long as the script uses the local codepage. Functions that parse directories from a path string still work because UTF-8 leaves the 7-bit ASCII range alone (it doesn't get used in the extra bytes so you can't mistake the second or third byte of some many-byte character for a regular 7-bit ASCII character), and most legacy multibyte encodings mostly get away with it too (ShiftJIS is not entirely clean here: a few of its trail bytes fall in the ASCII range, 0x5C being the notorious one). The problems arise when you encode your script in one charset and the win32 API non-W functions expect another, which is what MysteryX seems to have done above.

So, you need one charset that can represent all characters. Unicode is that, but for historical reasons Windows uses the UTF-16 encoding where one wchar_t is two bytes and you have nulls everywhere so none of the old functions work and there are ABI breaks and so on and so forth. Nobody wants that. That's why you use UTF-8, which is a regular multibyte encoding just like all the other local multibyte encodings so everything that expects a single null byte to terminate a string still works, strlen still works, parsing URL's and filepaths still work etc etc. But the Windows API functions don't support that so before passing strings to them you have to encode UTF-8 to UTF-16. After doing that though you're good.
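The byte-level properties described above are easy to check. The following Python sketch uses a made-up path and hypothetical function names; it only inspects encoded bytes and does not touch the file system.

```python
def has_nul_byte(data):
    """Return True if the byte string contains a 0x00 byte."""
    return 0 in data


def split_path_bytes(utf8_path):
    """Split a UTF-8 encoded Windows path on the backslash byte (0x5C)."""
    return [part.decode("utf-8") for part in utf8_path.split(b"\\")]


if __name__ == "__main__":
    path = "C:\\videos\\夏日\\clip.avs"  # hypothetical example path
    as_utf8 = path.encode("utf-8")
    as_utf16 = path.encode("utf-16-le")

    # UTF-8 has no embedded NULs, so NUL-terminated char* code keeps working.
    print(has_nul_byte(as_utf8))
    # UTF-16 puts a 0x00 next to every ASCII character.
    print(has_nul_byte(as_utf16))
    # Every byte of a non-ASCII character is >= 0x80 in UTF-8, so a byte-wise
    # split on the separator can never cut a character in half.
    print(split_path_bytes(as_utf8))
```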

That make things any clearer?

(e: this is what Windows has done internally for you all the time by the way when you called the old non-W API's, because both FAT32 and NTFS have used Unicode filenames on the filesystem level since the 1990's)

Last edited by TheFluff; 24th March 2017 at 23:15.
Old 25th March 2017, 00:08   #17  |  Link
Wilbert
Moderator
 
Join Date: Nov 2001
Location: Netherlands
Posts: 6,364
https://forum.doom9.org/showthread.p...39#post1420439
Old 25th March 2017, 01:28   #18  |  Link
MysteryX
Soul Architect
 
 
Join Date: Apr 2014
Posts: 2,559
Wow this thread has gone into all sorts of directions.

Quote:
Originally Posted by Groucho2004 View Post
We appear to have different interpretations of the word "crashing".


Quote:
Originally Posted by Groucho2004 View Post
Can you post the names that fail?
SNH48 - 夏日主题泳装.mkv

My question is extremely simple: how to handle this to either make it work with those paths, or detect unsupported characters to remove them. I don't need anything else.

First question is: why does it crash to begin with? Which part is responsible for the crash? If I comment out everything in the file and open an empty script file, I still get the same error, so we can rule out plugins as the cause. This looks more like a bug in Avisynth+ for Pinterf to fix.

But meanwhile, I must also find another work-around.

Last edited by MysteryX; 25th March 2017 at 01:38.
Old 25th March 2017, 09:51   #19  |  Link
Groucho2004
 
Join Date: Mar 2006
Location: Barcelona
Posts: 5,034
Quote:
Originally Posted by MysteryX View Post
SNH48 - 夏日主题泳装.mkv
That file name doesn't give me any trouble. I can open it in MPC-HC and VirtualDub even with the system locale set to my usual CP1252 (which I suppose you use too).

Quote:
Originally Posted by MysteryX View Post
First question is: why does it crash to begin with? Which part is responsible for the crash?
The screen shot shows a .avs, not .mkv. I suppose you generate that file name in your software? If so, check your code for proper handling of such names.

Again, it's not a crash if the application displays an error message and can be terminated the usual way.

This is a crash:

Last edited by Groucho2004; 25th March 2017 at 10:20.
Old 25th March 2017, 13:15   #20  |  Link
TheFluff
Excessively jovial fellow
 
Join Date: Jun 2004
Location: rude
Posts: 1,100
Quote:
Originally Posted by Groucho2004 View Post
That file name doesn't give me any trouble. I can open it in MPC-HC and VirtualDub even with the system locale set to my usual CP1252 (which I suppose you use too).
That's just because VDub and MPC-HC and everything else uses the unicode API's. Reminder that the year is 2017.

Quote:
Originally Posted by Groucho2004 View Post
The screen shot shows a .avs, not .mkv. I suppose you generate that file name in your software? If so, check your code for proper handling of such names.
That won't help. The name isn't representable in cp1252 so when Windows attempts to translate it from 1252 (which is what you've told it that you're using) to the internal Unicode codepage used in the filesystem, it won't get the right filename and you'll get the "can't open file" message.
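The lossy conversion can be reproduced in Python as a rough model. This assumes cp1252 as the system code page, as suggested earlier in the thread, and uses Python's codec as a stand-in for the Windows conversion, whose best-fit behaviour may differ in detail. Both function names are hypothetical.

```python
def to_ansi_lossy(name, codepage="cp1252"):
    """Encode `name` to the code page, replacing unrepresentable characters with '?'."""
    return name.encode(codepage, errors="replace").decode(codepage)


def representable(name, codepage="cp1252"):
    """Return True if every character of `name` exists in `codepage`."""
    try:
        name.encode(codepage)
        return True
    except UnicodeEncodeError:
        return False


if __name__ == "__main__":
    script = "SNH48 - 夏日主题泳装.avs"
    # The six Chinese characters come out as question marks,
    # much like the name shown in the "couldn't open" message.
    print(to_ansi_lossy(script))
    print(representable(script))
```

A check like `representable` only answers the question for one assumed code page. It says nothing about what another machine with a different system locale will do.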

Quote:
Originally Posted by MysteryX View Post
My question is extremely simple: how to handle this to either make it work with those paths, or detect unsupported characters to remove them. I don't need anything else.
The only way to really detect it is to try it and see if it fails. You can't "detect" unsupported characters, since the problem isn't really that the characters are unsupported, it's that you haven't told Windows what charset to translate from. A byte sequence that's perfectly valid 1252 and also perfectly valid ShiftJIS may open or not open depending on what the actual Unicode filename of the target file is.
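The ambiguity is easy to show with a two-byte example. In this Python sketch the same bytes decode without error under both cp1252 and ShiftJIS but yield different names; the variable and function names are illustrative only.

```python
def decode_both(raw):
    """Decode the same bytes as cp1252 and as ShiftJIS and return both results."""
    return raw.decode("cp1252"), raw.decode("shift_jis")


if __name__ == "__main__":
    # Take the ShiftJIS bytes of one katakana character...
    raw = "テ".encode("shift_jis")
    as_1252, as_sjis = decode_both(raw)
    # ...and see that they are equally valid as two cp1252 characters.
    print(repr(as_1252), repr(as_sjis))
```

Nothing in the bytes themselves says which reading is the intended one, which is why inspecting the string alone cannot settle whether the file will open.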

Quote:
Originally Posted by MysteryX View Post
First question is: why does it crash to begin with?
It doesn't crash. That's just the standard way the VFW interface does error reporting. It tries to env->import the .avs file but can't find it, and the nicest way to report an error like that in VFW is to print the error message on the video stream, so that's what it does.

Quote:
Originally Posted by MysteryX View Post
But meanwhile, I must also find another work-around.
Generate a long random string with only 7-bit ASCII contents and use that. It's either that or patch Avisynth.
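A minimal sketch of that workaround in Python, assuming the calling application keeps track of which generated script belongs to which video. The function name and the 16-byte default are arbitrary choices for the example.

```python
import secrets


def random_script_name(extension=".avs", nbytes=16):
    """Return a random file name made only of 7-bit ASCII hex digits."""
    return secrets.token_hex(nbytes) + extension


if __name__ == "__main__":
    # The video keeps its original Unicode name; only the script gets a safe one.
    print(random_script_name())
```

The video path inside the script would still need to be one the source filter can open, so this only removes the script file name from the problem.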