This is hard. The printable ASCII characters cover the range 32-126, and a whole lot of
other printable characters are scattered above 127. I once used a program that was convinced my file was binary because
it started off as
Copyright © 1997 XYZ corporation
and the program thought that the ©, which is > 127, made the file binary. It does not.
And what if it contains text like a name with an accented letter such as é? The é is > 127,
but is certainly a valid text character. An algorithm that says anything > 127 is by
definition binary is almost always doomed to ignominious failure.
In ISO-Latin-1 (ISO-8859-1), there are some characters > 127 which have no matching
glyphs, but then you are making an assumption that you are using ISO-Latin-1, which may
not be true in other countries.
ISO-Latin-1 has printable characters in positions 161-255, and 160 is the "non-breaking
space". Windows code page 1252, which is often loosely called Latin-1, also assigns
printable characters to most of 128-159, including 128 (the Euro symbol now). So there is
very little that is not legal. Countries whose languages are not supported by ISO-Latin-1
need further characters, such as those in Latin-Extended-A. Latin-Extended-B supports
Croatian and Romanian. After that it becomes even more complex.
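To make the ranges concrete, here is a minimal sketch of a per-byte test, assuming the file really is strict ISO-8859-1 (which, as noted, you cannot know in general). The name isLatin1TextByte is made up for this illustration.

```cpp
#include <cstdint>

// Hypothetical helper: true if the byte is ordinary text in strict ISO-8859-1.
// Assumption: 128-159 are treated as non-text, which is NOT true for
// Windows code page 1252.
bool isLatin1TextByte(std::uint8_t b) {
    if (b == 0x09 || b == 0x0A || b == 0x0C || b == 0x0D) {
        return true;                          // TAB, LF, FF, CR
    }
    if (b >= 0x20 && b <= 0x7E) {
        return true;                          // printable ASCII, 32-126
    }
    return b >= 0xA0;                         // 160 (non-breaking space) through 255
}
```

With this test the © (0xA9) and é (0xE9) both count as text, while 0x80 does not, so a Windows-1252 file containing a Euro symbol would be misjudged.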
Generally, I find looking for bytes < 32 is safer. After you eliminate the various
important control characters (TAB, LF, CR, FF), any other byte < 32 gives you a pretty
good chance of determining that the file is binary. Note that if it has a lot of 00 bytes
and starts with the Unicode Byte Order Mark (the bytes FE FF or FF FE, depending on the
endianness of the machine that wrote it) then it is probably a Unicode (UTF-16) file.
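The heuristic just described might be sketched roughly as follows. This is only an illustration: the names FileGuess and guessFileType are invented here, the caller is assumed to have already read the first part of the file into a buffer, and the sketch checks only for the Byte Order Mark rather than also counting 00 bytes.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical result type for this sketch.
enum class FileGuess { Text, Utf16Text, Binary };

// True for the control characters that routinely appear in text files.
static bool isTextControl(std::uint8_t b) {
    return b == 0x09 || b == 0x0A || b == 0x0D || b == 0x0C;  // TAB, LF, CR, FF
}

// Guess the type of a file from its leading bytes.
// Bytes >= 128 are deliberately NOT treated as evidence of binary data.
FileGuess guessFileType(const std::vector<std::uint8_t>& head) {
    // Byte Order Mark: FE FF (big-endian) or FF FE (little-endian).
    if (head.size() >= 2) {
        const bool bigEndianBom    = head[0] == 0xFE && head[1] == 0xFF;
        const bool littleEndianBom = head[0] == 0xFF && head[1] == 0xFE;
        if (bigEndianBom || littleEndianBom) {
            return FileGuess::Utf16Text;
        }
    }
    // Any control byte other than TAB, LF, CR, FF suggests binary data.
    for (std::uint8_t b : head) {
        if (b < 32 && !isTextControl(b)) {
            return FileGuess::Binary;
        }
    }
    return FileGuess::Text;
}
```

The BOM test comes first because UTF-16 text is full of 00 bytes, which the control-byte loop would otherwise report as binary.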
Note that it is always possible to have a binary file that fools your heuristic. So you
should probably present, in the file open dialog, options that say "Open as text",
"Open as binary" and "Automatically determine" so the user can force the type of open.
This can be done by subclassing CFileDialog and adding your own extension dialog to it to
get new controls. I'd also do the determination here and "suggest" the appropriate open
mode; for example, have only two radio buttons, text and binary, and select one based
on the file contents. You just respond to the file-select call and at that point apply
your heuristic. The user can then override it explicitly.
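The override rule itself is simple enough to sketch without any dialog code. The following assumes the heuristic's verdict is already available as a bool; UserChoice, OpenMode and resolveOpenMode are hypothetical names for this illustration and are not part of MFC or CFileDialog.

```cpp
// Hypothetical names for this sketch; none of them come from MFC.
enum class OpenMode { Text, Binary };
enum class UserChoice { OpenAsText, OpenAsBinary, AutoDetect };

// The user's explicit choice always wins; the heuristic decides only
// when the user asked for automatic determination.
OpenMode resolveOpenMode(UserChoice choice, bool heuristicSaysBinary) {
    switch (choice) {
        case UserChoice::OpenAsText:   return OpenMode::Text;
        case UserChoice::OpenAsBinary: return OpenMode::Binary;
        case UserChoice::AutoDetect:   break;
    }
    return heuristicSaysBinary ? OpenMode::Binary : OpenMode::Text;
}
```

In the two-radio-button variant, the same function would be called with AutoDetect to preselect a button when a file is selected, and with the user's explicit choice once they click one.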
>I am writing an application that will need to distinguish between
>binary and ASCII file to be loaded from hard disk so that different
>processing is applied to different type of file.
>Is there any simple way to identify the type of file during the file
>opening process? Thank you.
Joseph M. Newcomer [MVP]
MVP Tips: http://www.yqcomputer.com/