12th December 2011, 22:52   #543
Reimar
Registered User
 
Join Date: Jun 2005
Posts: 278
Quote:
Originally Posted by LoRd_MuldeR
@Reimar:
Yes, the BOM symbol in UTF-8 does not serve to indicate the Byte Order (there is no Byte Order in single-byte sequences, yes) and it's optional. Still it's a valid character and often present.
Yes, I was slightly off: it is not invalid, it is just "not recommended".

Quote:
Originally Posted by LoRd_MuldeR
In the other case, when there is no UTF-8 BOM present, the text may still be valid UTF-8. But it may be "some" local 8-Bit codepage just as well. It may not even be encoded in the Windows ANSI Codepage that happens to be configured on the individual computer. For all these reasons, if no UTF-8 BOM is found, LameXP will now pop up a small dialog, allowing the user to select the desired Codepage.
Yes, that is a sensible solution. However, the chance that a string containing a byte of value > 127 parses as UTF-8 but is not UTF-8 is almost as negligible as the chance that something starting with the UTF-8 BOM is not UTF-8, since UTF-8 is a rather inefficient and wasteful encoding, and that redundancy makes accidental matches unlikely. (The first byte of value > 127, which must be at least 194 (0xC2) in valid UTF-8, indicates how many following bytes must start with the bit pattern 10. You'll have a hard time finding a word in any language in any encoding, except UTF-8 of course, that happens to conform to this.)
For purely random data, you should reach the same confidence as with a UTF-16 BOM at about 8 bytes > 127, and the same as with a UTF-8 BOM at about 12.
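To make the idea concrete, here is a rough Python sketch of such a check. The function name guess_text_encoding and its return values are made up for illustration; this is not LameXP's actual code, just one way the "BOM first, then try strict UTF-8, else ask the user" logic could look.

```python
import codecs


def guess_text_encoding(data: bytes) -> str:
    """Hypothetical helper: classify raw text bytes.

    Returns "utf-8-sig" if a UTF-8 BOM is present, "ascii" if no byte
    is > 127, "utf-8" if the data parses as strict UTF-8, and "unknown"
    otherwise (the case where a codepage dialog would be needed).
    """
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if all(b < 0x80 for b in data):
        # Pure 7-bit data looks the same in UTF-8 and in most 8-bit codepages.
        return "ascii"
    try:
        # Strict decoding rejects bad lead bytes, missing or stray
        # continuation bytes (bit pattern 10xxxxxx) and overlong forms.
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "unknown"
    return "utf-8"
```

For example, "Müller" encoded as Latin-1 contains the lone byte 0xFC, which is not a valid UTF-8 lead byte, so the function returns "unknown"; the same word encoded as UTF-8 returns "utf-8".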
But enough side-tracking, I'll shut up now and let you all get back to the topic.