How to distinguish between binary and ASCII file on file opening?

Post by cffun » Wed, 11 Aug 2004 12:26:14


I am writing an application that needs to distinguish between binary
and ASCII files loaded from the hard disk, so that different processing
can be applied to each type of file.

Is there any simple way to identify the type of file during the file
opening process? Thank you.

CFF
 
 
 

Post by Bernhar » Wed, 11 Aug 2004 16:52:21

You should analyse the ending of the file; in most cases this will give
you a hint.

Bernhard

 
 
 

Post by Doug Harri » Mon, 16 Aug 2004 04:50:44


ASCII is defined on character values 0-127. I think you must mean text file,
in which case, you could search a reasonably long prefix of the file (say,
the first 4-16 KB) for the char value zero, which should not appear in any
text file based on a single byte character set such as ASCII or the Windows
"ANSI".
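
A minimal sketch of that zero-byte check, assuming a 16 KB prefix is enough. The names containsNul and looksLikeTextFile are made up for illustration, and treating an unopenable file as "not text" is only a placeholder choice:

```cpp
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Returns true if any of the first `size` bytes of `data` is zero.
bool containsNul(const char* data, std::size_t size) {
    for (std::size_t i = 0; i < size; ++i) {
        if (data[i] == '\0') return true;
    }
    return false;
}

// Hypothetical helper: reads at most `prefixSize` bytes from the start of
// the file and reports "text" when that prefix holds no zero byte.
// A file that cannot be opened is reported as "not text" here; a real
// application would probably want to report that error separately.
bool looksLikeTextFile(const std::string& path,
                       std::size_t prefixSize = 16 * 1024) {
    std::ifstream in(path, std::ios::binary);  // binary: no CR/LF translation
    if (!in) return false;
    std::vector<char> buf(prefixSize);
    in.read(buf.data(), static_cast<std::streamsize>(buf.size()));
    const std::size_t got = static_cast<std::size_t>(in.gcount());
    return !containsNul(buf.data(), got);
}
```

An empty file passes as text with this check, and UTF-16 text fails it, since most of its characters carry a zero byte.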

--
Doug Harrison
Microsoft MVP - Visual C++
 
 
 

Post by Doug Harri » Mon, 16 Aug 2004 04:50:45


That might work on CP/M, but a physical Ctrl+Z marker at the end of text
files is not commonly found in Windows or even DOS text files.

--
Doug Harrison
Microsoft MVP - Visual C++
 
 
 

Post by Bill Thomp » Mon, 16 Aug 2004 15:05:08


You could analyze character frequency. If the predominant characters are
A-Za-z you probably have ASCII; if the distribution is more uniform you
probably have binary. The more characters you check, the stronger the
'probably' becomes.
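
A rough sketch of the frequency idea, assuming the sample is already in memory. The function names and the 0.5 threshold are arbitrary choices for illustration, not tuned values:

```cpp
#include <cstddef>
#include <string>

// Fraction of bytes in `sample` that are ASCII letters A-Z or a-z.
// Returns 0.0 for an empty sample.
double letterFraction(const std::string& sample) {
    if (sample.empty()) return 0.0;
    std::size_t letters = 0;
    for (unsigned char c : sample) {
        if ((c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')) ++letters;
    }
    return static_cast<double>(letters) /
           static_cast<double>(sample.size());
}

// Guesses "text" when letters make up at least `threshold` of the sample.
// The default of 0.5 is an assumption, not a measured cut-off.
bool probablyText(const std::string& sample, double threshold = 0.5) {
    return letterFraction(sample) >= threshold;
}
```

For comparison, bytes spread uniformly over all 256 values give a letter fraction of 52/256, about 0.2. Text that is mostly digits or punctuation, such as a numeric data file, falls below the threshold too, so this works better combined with other checks than on its own.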

Any clues about the contents of the file can be useful as well; e.g., you
can recognize a CSV file by the number of unquoted commas per line.
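
One way the comma count could be sketched, assuming double quotes are the only quoting character and that every record fits on one line. Both function names are hypothetical:

```cpp
#include <cstddef>
#include <sstream>
#include <string>

// Counts commas that are not inside a double-quoted field.
// A doubled quote ("") inside a field toggles the state twice,
// so it leaves the in-quotes state unchanged.
std::size_t countUnquotedCommas(const std::string& line) {
    bool inQuotes = false;
    std::size_t commas = 0;
    for (char c : line) {
        if (c == '"') {
            inQuotes = !inQuotes;
        } else if (c == ',' && !inQuotes) {
            ++commas;
        }
    }
    return commas;
}

// Guesses "CSV" when every non-empty line of the sample has the same,
// non-zero number of unquoted commas.
bool looksLikeCsv(const std::string& sample) {
    std::istringstream in(sample);
    std::string line;
    bool haveExpected = false;
    std::size_t expected = 0;
    while (std::getline(in, line)) {
        if (!line.empty() && line.back() == '\r') line.pop_back();  // CRLF
        if (line.empty()) continue;
        const std::size_t n = countUnquotedCommas(line);
        if (n == 0) return false;
        if (!haveExpected) {
            expected = n;
            haveExpected = true;
        } else if (n != expected) {
            return false;
        }
    }
    return haveExpected;
}
```

If the sample is a fixed-size prefix of the file, its last line may be cut short, so it is probably worth dropping that line before calling looksLikeCsv.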
 
 
 

Post by Joseph M. » Tue, 17 Aug 2004 00:50:50

This is hard. The printable ASCII characters cover the range 32-126, and there are a whole
lot of scattered printable characters above that. I once used a program that was convinced
my file was binary because it started off as

Copyright © 1997 XYZ corporation

and the program thought that ©, which is > 127, made the file binary. It does not.

And what if it contains text like the name Jerôme? The ô is > 127, but is certainly a valid
text character. The algorithm that anything > 127 is by definition binary is almost always
doomed to ignominious failure.

In ISO-Latin-1 (ISO-8859-1), there are some characters > 127 which have no matching
glyphs, but then you are making an assumption that you are using ISO-Latin-1, which may
not be true in other countries.

ISO-Latin-1 has printable characters in positions 161-255, and 160 is the "non-breaking
space". On Windows, code page 1252 also puts printable characters in most of 128-159 (128
is the Euro symbol now). So there is very little that is not legal. Other character sets,
such as Latin-Extended-A, cover countries which are not supported by ISO-Latin-1.
Latin-Extended-B supports Croatian and Romanian. After that it becomes even more complex.

Generally, I find looking for bytes < 32 is safer. After you eliminate the various
important control characters (TAB, LF, CR, FF), finding any of the others gives you a
pretty good chance of determining that the file is binary. Note that if it has a lot of 00
characters and starts with the Unicode Byte Order Mark (either FEFF or FFFE, depending on
the endianness of the machine that wrote it) then it is probably a Unicode file.
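
A sketch of this control-character heuristic, assuming the caller has already read a prefix of the file into memory. The names FileGuess, isAllowedControl and guessKind are invented for this example, and only the two-byte UTF-16 Byte Order Mark is checked:

```cpp
#include <cstddef>
#include <string>

enum class FileGuess { Text, Utf16Text, Binary };

// Control characters that routinely appear in text files.
bool isAllowedControl(unsigned char c) {
    return c == '\t' || c == '\n' || c == '\r' || c == '\f';
}

// Guesses the kind of file from a prefix of its bytes.
// Bytes > 127 are deliberately NOT treated as evidence of binary data.
FileGuess guessKind(const std::string& prefix) {
    // A leading FE FF or FF FE is taken as a UTF-16 Byte Order Mark.
    if (prefix.size() >= 2) {
        const unsigned char b0 = static_cast<unsigned char>(prefix[0]);
        const unsigned char b1 = static_cast<unsigned char>(prefix[1]);
        if ((b0 == 0xFE && b1 == 0xFF) || (b0 == 0xFF && b1 == 0xFE)) {
            return FileGuess::Utf16Text;
        }
    }
    // Any other control character below 32 marks the file as binary.
    for (unsigned char c : prefix) {
        if (c < 32 && !isAllowedControl(c)) return FileGuess::Binary;
    }
    return FileGuess::Text;
}
```

With this sketch a line containing a byte such as 0xA9 still counts as text, while an executable header containing 00 bytes is reported as binary. UTF-16 text without a Byte Order Mark would be misreported as binary, which is one more reason to let the user override the guess.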

Note that it is always possible to have a binary file that fools your heuristic. So you
should probably present, in the file open dialog, options that say "Open as Text",
"Open as binary" and "Automatically determine" so the user can force the type of open.
This can be done by subclassing CFileDialog and adding your own extension dialog to it to
get new controls. I'd also do the determination here and "suggest" the appropriate open
mode; for example, having only two radio buttons, text and binary, and selecting one based
on the file contents. You just respond to the file-select call and at that point apply
your heuristic. The user can then override it explicitly.
joe

>
>I am writing an application that will need to distinguish between
>binary and ASCII file to be loaded from hard disk so that different
>processing is applied to different type of file.
>Is there any simple way to identify the type of file during the file
>opening process? Thank you.
>CFF

Joseph M. Newcomer [MVP]
email: XXXX@XXXXX.COM
Web: http://www.yqcomputer.com/
MVP Tips: http://www.yqcomputer.com/