24th March 2017, 04:10 | #1 | Link |
Soul Architect
Join Date: Apr 2014
Posts: 2,559
|
Unicode File Paths
Here's something strange I noticed. I know script files must be in ANSI format and don't support Unicode characters. What about the script file name itself? It works with characters from a variety of languages without problem.
However, if I open a file whose name contains Chinese or Thai characters, it crashes saying "Import: couldn't open ..." followed by the file name with the Chinese characters replaced with ???. Why is this not working? Neither MPC-HC nor VirtualDub opens it. How can I either open the files, or validate file names to make sure they will work? I'm using Avisynth+ |
24th March 2017, 09:08 | #2 | Link |
Registered User
Join Date: Jun 2002
Location: On thin ice
Posts: 6,837
|
The ANSI limitation applies to script file paths as well. Most languages are covered by their ANSI code page, however, and there are hardly any requests for Unicode support, only from time to time from tool makers like us.
Since you raised this topic: Unicode and console/batch don't work together in Win 7 either, because Win 7 has Unicode bugs. Batch files are needed for x265 piping, for instance, so in practice x265 doesn't support Unicode either, at least not for Win 7 users, of whom there are still a lot. Last but not least, .NET and Windows are getting long file path support (more than MAX_PATH/260 characters). It can already be used, but only with a group policy change and a manifest entry. I already use it in personal scripts.
__________________
https://github.com/stax76/software-list https://www.youtube.com/@stax76/playlists |
24th March 2017, 09:45 | #3 | Link | |
Join Date: Mar 2006
Location: Barcelona
Posts: 5,034
|
Quote:
https://s9.postimg.org/tvbb9b2pb/Image1.png
Same for a console app (AVSMeter): https://s9.postimg.org/6fte3yiy7/Image2.png
Strangely, VirtualDub (1.10.4) will not open the file. In WinNT using NTFS, file names are conveniently stored in Unicode internally; it's up to the programmer to interpret them correctly.
__________________
Groucho's Avisynth Stuff Last edited by Groucho2004; 24th March 2017 at 20:16. |
24th March 2017, 17:43 | #4 | Link |
Soul Architect
Join Date: Apr 2014
Posts: 2,559
|
Detecting the character language and shifting the system locale for every file is not an option.
What I'm trying to do is create an AVS file with the same file name as the video, replacing only the extension. It works 96% of the time, but a few videos are crashing. I need some way of determining which ones won't work so I can decide to use another file name. If only ANSI characters were supported, then video names with Arabic characters would also fail, but they work. |
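One way to flag problem names in advance is to test whether every character can be encoded in the ANSI code page in use. A minimal Python sketch, assuming the code page is known: cp1252 is only an assumed default here, and fits_code_page is a hypothetical helper, not part of Avisynth. The check is strict, so it may reject names that a particular system would still accept.

```python
def fits_code_page(name: str, code_page: str = "cp1252") -> bool:
    """Return True if every character of name can be encoded in code_page."""
    try:
        name.encode(code_page)  # strict mode raises on unmappable characters
    except UnicodeEncodeError:
        return False
    return True


# Report which candidate script names survive the assumed code page.
for candidate in ("holiday.avs", "café.avs", "SNH48 - 夏日主题泳装.avs"):
    print(candidate, fits_code_page(candidate))
```

Names that fail the check could then be given a substitute ASCII file name.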
24th March 2017, 18:03 | #5 | Link | ||
Join Date: Mar 2006
Location: Barcelona
Posts: 5,034
|
Quote:
Quote:
Arabic (CP1256) is not a multi byte character set which might explain that. Chinese, Korean and Japanese are MBCS. Do you have trouble with other single byte character sets?
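The single-byte versus multi-byte distinction can be seen by encoding one character in each code page and counting the bytes. A small Python sketch: the sample characters are arbitrary picks for illustration, and encoded_length is a hypothetical helper name.

```python
# One sample character per code page (arbitrary choices for illustration).
samples = {
    "cp1256": "م",  # Arabic
    "cp936": "夏",   # Simplified Chinese
    "cp932": "日",   # Japanese
    "cp949": "한",   # Korean
}


def encoded_length(ch: str, code_page: str) -> int:
    """Number of bytes a single character occupies in the given code page."""
    return len(ch.encode(code_page))


for code_page, ch in samples.items():
    print(code_page, encoded_length(ch, code_page))
```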
__________________
Groucho's Avisynth Stuff Last edited by Groucho2004; 24th March 2017 at 18:20. |
24th March 2017, 18:31 | #6 | Link | |||
Soul Architect
Join Date: Apr 2014
Posts: 2,559
|
Quote:
I have a database of 500+ videos, so yes I tried at least 25 names. Only 2 or 3 file names with Chinese or Thai characters failed to load. Quote:
However, detecting those isn't so simple Quote:
|
24th March 2017, 19:17 | #7 | Link |
Registered User
Join Date: Jan 2012
Location: Mesopotamia
Posts: 2,587
|
Some time ago I was thinking of suggesting UTF-8 support for avs+ scripts. For compatibility, I was going to suggest converting old ANSI scripts to UTF-8 internally whenever the encoding script is UTF-8. The conversion would apply only to the scripts used by the encoding script, not to every script in the autoload folder.
I don't know whether this can be done.
__________________
See My Avisynth Stuff |
24th March 2017, 20:02 | #8 | Link |
Join Date: Mar 2006
Location: Barcelona
Posts: 5,034
|
We appear to have different interpretations of the word "crashing".
Can you post the names that fail?
__________________
Groucho's Avisynth Stuff |
24th March 2017, 20:26 | #9 | Link |
Excessively jovial fellow
Join Date: Jun 2004
Location: rude
Posts: 1,100
|
Adding Unicode support to Avs+ is probably pretty trivial. You can pass UTF-8 around in AVSValues no problem, so all you need to do is wrap the file I/O functions that exist in a few places (like in import) with a trivial function that does MultiByteToWideChar and then calls the W-version of the I/O function.
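The wrapping idea can be sketched outside of Avisynth. In the Python analogy below, open_utf8_path is a hypothetical helper: it receives the path as raw UTF-8 bytes (the way a char* string would arrive), decodes it to text, and only then opens the file. As far as I know, Python's open goes through the wide-character file APIs on Windows, so the decode step plays the role of MultiByteToWideChar. This is not Avisynth+ code, just an illustration of the shape of the wrapper.

```python
import os
import tempfile


def open_utf8_path(raw_path: bytes, mode: str = "r"):
    """Decode a UTF-8 byte path to text, then open it."""
    text_path = raw_path.decode("utf-8")  # stands in for MultiByteToWideChar
    return open(text_path, mode, encoding="utf-8")


# Demonstration with a throwaway file whose name contains Chinese characters.
with tempfile.TemporaryDirectory() as folder:
    name = os.path.join(folder, "夏日.avs")
    with open(name, "w", encoding="utf-8") as handle:
        handle.write("BlankClip()\n")
    with open_utf8_path(name.encode("utf-8")) as handle:
        print(handle.read())
```

Everything after the open call works on the returned handle, so callers would not need to change.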
You'll be incompatible with old scripts that use some local code page, but re-saving as UTF-8 should hardly be a huge problem for anyone. You should not support local code pages, there is absolutely nothing to be gained from that. Oh, and then you get to fix the VFW interface. Have fun with that.
what no seriously, what
I know this whole "unicode" thing is painfully new to you guys, but seriously now. UNC paths have been around since Windows 2000. I know in Win10 they removed the MAX_PATH restriction for regular paths but that doesn't solve the problem because all the old garbage from the 90's still does "TCHAR filename[MAX_PATH];" somewhere and the user still has to opt in to it. So UNC or bust. Last edited by TheFluff; 24th March 2017 at 20:41. |
24th March 2017, 20:38 | #10 | Link |
Join Date: Mar 2006
Location: Barcelona
Posts: 5,034
|
The real problem here is that one can't expect every program to handle file names with Thai, Chinese, etc. characters. If I recall correctly, Win32 CreateFileW() can handle these files. However, most tools use standard C library or STL functions to open/save files, which will fail in some cases.
Unicode aware programs like MS Word or EMEditor handle them without problems independent of the system locale. Also, we're just talking about file names, not the content of these files. Using these file names within scripts opens another can of worms.
__________________
Groucho's Avisynth Stuff Last edited by Groucho2004; 24th March 2017 at 20:41. |
24th March 2017, 20:46 | #11 | Link | |
Excessively jovial fellow
Join Date: Jun 2004
Location: rude
Posts: 1,100
|
Quote:
Now, things that interact with the Avisynth API directly instead of going through VFW will of course have to be made aware of the fact that the new hot thing to do is to pass UTF8. Oh. Wait, this is Avisynth and you will never break API backwards compatibility ever. Never mind. Pretty sure the FFMS2 Avisynth plugin supports UTF8 filenames but breaks on local code page, by the way. e: I'm pretty sure that in 2017, doom9 is one of very few places on the internet where you will not only hear someone say this, but also expect it to be seen as a reasonable standpoint to have. Last edited by TheFluff; 24th March 2017 at 20:53. |
24th March 2017, 21:10 | #12 | Link |
Excessively jovial fellow
Join Date: Jun 2004
Location: rude
Posts: 1,100
|
It does not. As I mentioned before, I'm prrrretty sure FFMS2 supports this right now and it definitely did so in the past (because I wrote the code that did it, but it has since been replaced with a simpler solution). The only thing you need to do is save the script as UTF8 without BOM. UTF8 can safely be treated as any other array of char. To Avisynth, it's just a regular string, which gets passed to FFMS2, which is char* everywhere in the API, so it goes straight to the libavformat I/O, which does the actual file opening and actually does support converting from UTF8 to the Windows style wchar_t API's.
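For anyone generating scripts from a tool, "UTF8 without BOM" can be sketched like this in Python. Both function names are hypothetical helpers, and the FFVideoSource line is only sample script content. The plain "utf-8" codec writes no byte order mark, whereas "utf-8-sig" would add one.

```python
import codecs
import os
import tempfile


def write_script_without_bom(path: str, text: str) -> None:
    """Write text as UTF-8; the plain "utf-8" codec emits no BOM."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(text)


def starts_with_bom(path: str) -> bool:
    """Report whether the file begins with the UTF-8 byte order mark."""
    with open(path, "rb") as handle:
        return handle.read(3) == codecs.BOM_UTF8


with tempfile.TemporaryDirectory() as folder:
    script = os.path.join(folder, "test.avs")
    write_script_without_bom(script, 'FFVideoSource("夏日主题泳装.mkv")\n')
    print(starts_with_bom(script))
```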
|
24th March 2017, 22:01 | #13 | Link | |
Join Date: Mar 2006
Location: Barcelona
Posts: 5,034
|
Quote:
__________________
Groucho's Avisynth Stuff |
24th March 2017, 22:14 | #14 | Link | |
Excessively jovial fellow
Join Date: Jun 2004
Location: rude
Posts: 1,100
|
Quote:
Everything else, no idea but it's almost definitely gonna be similar - you replace/wrap the call to fopen/CreateFile and that's it, everything else works by passing the handle around and doesn't need to be changed. That's the entire point of UTF-8; everything that uses 1-byte char encodings keeps working as normal. It's not like the current situation is any good either - only accepting filenames in your local codepage simply isn't an acceptable solution today. I mean, 7-bit ASCII still works everywhere, but come on, this is 2017. The unicode consortium is so bored it's busy adding entire codepages full of emojis. Last edited by TheFluff; 24th March 2017 at 22:17. |
24th March 2017, 22:41 | #15 | Link | |
Join Date: Mar 2006
Location: Barcelona
Posts: 5,034
|
Quote:
If you're referring to the ASCII character subset (0 - 127) you're right.
__________________
Groucho's Avisynth Stuff |
24th March 2017, 23:00 | #16 | Link |
Excessively jovial fellow
Join Date: Jun 2004
Location: rude
Posts: 1,100
|
You're missing the point. UTF-8 is just a multibyte encoding, and how many bytes you use for encoding a single character isn't at all interesting to anyone, really (unless you're the kind of person who expects strlen to return the number of natural language characters, but in that case you're beyond hope). If your codepage is set to, say, 932 (Japanese, ShiftJIS) almost every character an actual Japanese person is interested in will take more than one byte. Avisynth handles that just fine - you can put Japanese characters in your script all you want as long as the script uses the local codepage. Functions that parse directories from a path string still work because most multibyte encodings (including ShiftJIS and UTF-8) leave the 7-bit ASCII range alone (it doesn't get used in the extra bytes so you can't mistake the second or third byte of some many-byte character for a regular 7-bit ASCII character). The problems arise when you encode your script in one charset and the win32 API non-W functions expect another, which is what MysteryX seems to have done above.
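The point about path parsing can be checked for UTF-8 specifically. The Python sketch below splits a path at the last backslash while working on raw bytes only, and confirms that every byte of the non-ASCII characters is 0x80 or above, so none of them can be mistaken for a separator. split_dir_and_file is a hypothetical helper and the path is made up. Only UTF-8 is checked here; other multibyte encodings have their own byte ranges and would need to be verified separately.

```python
def split_dir_and_file(raw_path: bytes):
    """Split at the last backslash, working on bytes only."""
    head, _, tail = raw_path.rpartition(b"\\")
    return head, tail


path = "C:\\video\\夏日主题泳装.avs".encode("utf-8")
folder, filename = split_dir_and_file(path)

# Every byte of a non-ASCII character in UTF-8 has its high bit set.
non_ascii = "夏日主题泳装".encode("utf-8")
high_bit_everywhere = all(byte >= 0x80 for byte in non_ascii)

print(high_bit_everywhere)
print(folder.decode("utf-8"), filename.decode("utf-8"))
```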
So, you need one charset that can represent all characters. Unicode is that, but for historical reasons Windows uses the UTF-16 encoding where one wchar_t is two bytes and you have nulls everywhere so none of the old functions work and there are ABI breaks and so on and so forth. Nobody wants that. That's why you use UTF-8, which is a regular multibyte encoding just like all the other local multibyte encodings so everything that expects a single null byte to terminate a string still works, strlen still works, parsing URL's and filepaths still work etc etc. But the Windows API functions don't support that so before passing strings to them you have to encode UTF-8 to UTF-16. After doing that though you're good. That make things any clearer? (e: this is what Windows has done internally for you all the time by the way when you called the old non-W API's, because both FAT32 and NTFS have used Unicode filenames on the filesystem level since the 1990's) Last edited by TheFluff; 24th March 2017 at 23:15. |
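The null-byte difference is easy to see by encoding the same name both ways. A short Python sketch with a made-up file name: the UTF-8 form contains no zero bytes, so null-terminated string functions keep working, while the UTF-16 form has a zero byte in every ASCII character.

```python
name = "夏日.avs"

utf8_bytes = name.encode("utf-8")
utf16_bytes = name.encode("utf-16-le")  # little-endian UTF-16, as wchar_t strings are on Windows

print(0 in utf8_bytes)   # any zero bytes in the UTF-8 form?
print(0 in utf16_bytes)  # any zero bytes in the UTF-16 form?

# Converting UTF-8 to UTF-16 and back loses nothing.
round_trip = utf8_bytes.decode("utf-8").encode("utf-16-le").decode("utf-16-le")
print(round_trip == name)
```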
25th March 2017, 01:28 | #18 | Link | |
Soul Architect
Join Date: Apr 2014
Posts: 2,559
|
Wow this thread has gone into all sorts of directions.
Quote:
SNH48 - 夏日主题泳装.mkv
My question is extremely simple: how do I handle this, either by making it work with those paths or by detecting unsupported characters so I can remove them? I don't need anything else. First question is: why does it crash to begin with? Which part is responsible for the crash? If I comment out everything in the file and open an empty script file, I still get the same error, so we can rule out plugins as the cause. This looks more like a bug in Avisynth+ for Pinterf to fix. But meanwhile, I must also find another work-around.
__________________
FrameRateConverter | AvisynthShader | AvsFilterNet | Natural Grounding Player with Yin Media Encoder, 432hz Player, Powerliminals Player and Audio Video Muxer Last edited by MysteryX; 25th March 2017 at 01:38. |
25th March 2017, 09:51 | #19 | Link | |
Join Date: Mar 2006
Location: Barcelona
Posts: 5,034
|
That file name doesn't give me any trouble. I can open it in MPC-HC and VirtualDub even with the system locale set to my usual CP1252 (which I suppose you use too).
Quote:
Again, it's not a crash if the application displays an error message and can be terminated the usual way. This is a crash:
__________________
Groucho's Avisynth Stuff Last edited by Groucho2004; 25th March 2017 at 10:20. |
25th March 2017, 13:15 | #20 | Link | |||
Excessively jovial fellow
Join Date: Jun 2004
Location: rude
Posts: 1,100
|
Quote:
Quote:
Quote:
It doesn't crash. That's just the standard way the VFW interface does error reporting. It tries to env->import the .avs file but can't find it, and the nicest way to report an error like that in VFW is to print the error message on the video stream, so that's what it does. Generate a long random string with only 7-bit ASCII contents and use that. It's either that or patch Avisynth. |
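The random-name workaround could look something like this in Python. random_script_name is a hypothetical helper, and the 32-character length and .avs extension are just assumed defaults. Hexadecimal digits are used because they are guaranteed to be 7-bit ASCII.

```python
import secrets


def random_script_name(length: int = 32, extension: str = ".avs") -> str:
    """Build a file name from random hexadecimal digits (7-bit ASCII only)."""
    stem = secrets.token_hex(length // 2)  # token_hex(n) yields 2*n hex digits
    return stem + extension


name = random_script_name()
print(name, name.isascii())
```

The tool would need to remember which random name belongs to which video, for example in the same database that holds the video list.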