Old 13th December 2011, 01:27   #544  |  Link
LoRd_MuldeR
Software Developer
 
 
Join Date: Jun 2005
Location: Last House on Slunk Street
Posts: 13,248
Quote:
Originally Posted by Reimar View Post
Yes, I was slightly off, it is not invalid it just is "not recommended".
Well, at least on Windows, having a UTF-8 BOM is quite common. For example, Windows Notepad adds a BOM to UTF-8 files, and Winamp adds one to .m3u8 playlists as well.

Last but not least, Notepad++ supports both normal "UTF-8" (that is, with BOM) and "UTF-8 without BOM" (aka "ANSI as UTF-8").
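
To illustrate how cheap the BOM check is, here is a minimal sketch in Python (used purely for illustration; the helper name strip_utf8_bom is made up for this example). The UTF-8 BOM is the three bytes EF BB BF at the very start of the data:

```python
import codecs
from typing import Tuple


def strip_utf8_bom(data: bytes) -> Tuple[bytes, bool]:
    """Return (payload, had_bom). Hypothetical helper, for illustration only.

    codecs.BOM_UTF8 is the byte sequence EF BB BF.
    """
    if data.startswith(codecs.BOM_UTF8):
        # Drop the three BOM bytes and report that a BOM was present.
        return data[len(codecs.BOM_UTF8):], True
    # No BOM: the data may still be UTF-8, we just cannot tell from this check.
    return data, False
```

A file without the BOM is of course not necessarily "not UTF-8"; the check only gives a positive hint when the marker is there.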

Quote:
Originally Posted by Reimar View Post
Yes, that is a sensible solution. However the chances that a string containing a character of value > 127 parses as UTF-8 but is not UTF-8 is almost as negligible as the chances that something starting with UTF-8 BOM code is not UTF-8, since UTF-8 is a rather inefficient and wasteful encoding. (the first byte of value > 127 - which must at least have a value of 247 IIRC - indicates how many following bytes start with the bit pattern 10 - you'll have a hard time finding a word in any language in any encoding - except UTF-8 of course - that happens to conform to this).
For purely random data you should reach the same confidence as a UTF-16 BOM at about 8 bytes > 127 and the same as UTF-8 BOM at about 12.
But enough side-tracking, I'll shut up now and let you all get back to the topic.
Deciding whether something will decode as valid UTF-8 or not isn't that trivial, though, especially as I'm not planning to reinvent the wheel and implement my own UTF-8 decoder. Instead I just select the desired QTextCodec class and let it do its job. Also, as far as I understand, UTF-8 was designed to allow easy resynchronization after "invalid" or "missing" bytes, so decoding errors at one point don't necessarily mean that the rest can't be valid UTF-8...
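
For what it's worth, both points can be shown with an existing decoder instead of a hand-written one. The following is only a sketch in Python, not the QTextCodec API; the function names is_valid_utf8 and decode_lenient are made up for this example. A strict decode acts as the "does it parse as UTF-8?" test, while a lenient decode shows the resynchronization: the bad byte turns into U+FFFD and everything after it still decodes.

```python
def is_valid_utf8(data: bytes) -> bool:
    """True if the whole buffer decodes as UTF-8 (strict mode).

    Hypothetical helper: it leans on Python's built-in decoder
    rather than implementing UTF-8 validation by hand.
    """
    try:
        data.decode("utf-8")  # strict error handling is the default
    except UnicodeDecodeError:
        return False
    return True


def decode_lenient(data: bytes) -> str:
    """Decode as UTF-8, replacing each undecodable sequence with U+FFFD."""
    return data.decode("utf-8", errors="replace")


# "Größe" encoded as Latin-1: the bytes F6 and DF do not form valid UTF-8.
latin1_bytes = "Gr\u00f6\u00dfe".encode("latin-1")

# One stray Latin-1 byte (FC) followed by a valid UTF-8 sequence (C3 9F = ß).
mixed_bytes = b"gr\xfc\xc3\x9fe"
```

With these inputs, is_valid_utf8(latin1_bytes) is False, and decode_lenient(mixed_bytes) gives "gr", one replacement character, then "ße". So a single error does not spoil the rest of the stream, which is exactly why one decoding error alone is a weak signal.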
__________________
Go to https://standforukraine.com/ to find legitimate Ukrainian Charities 🇺🇦✊

Last edited by LoRd_MuldeR; 13th December 2011 at 02:37.