ascii or binary

Pohihihi

Hello NG,

I am making a small tool that reads files on the hard disk and saves various
information about them in a database. While reading a file, I want to figure
out whether it is ASCII (text) or binary. I can look at the extension, but
there are millions of them and they are not very reliable. Also, a file could
contain embedded binary data (e.g. a PDF), which might make it a mixed type.

I want to know if there is a way, maybe some kind of signature, to find out
what kind of file it is in C# (or any other language that can be used with C#).
 
Morten Wennevik

Hi Pohihihi,

As far as I know there is no good way to tell text and binary files apart.
You could check every byte, and if none are above 127 chances are it's
an ASCII text file, but even if there are bytes above 127 it could be
an extended ASCII text file.
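
The byte scan described above could be sketched as follows. This is only a minimal illustration: the class name AsciiCheck and the method name IsSevenBitAscii are made up for this example, and a false result only means that the file is not pure 7-bit ASCII, not that it is binary.

```csharp
using System.IO;

static class AsciiCheck
{
    // Hypothetical helper: returns true when every byte in the file is in
    // the 7-bit ASCII range (0..127). An empty file also returns true.
    public static bool IsSevenBitAscii(string path)
    {
        using (FileStream stream = File.OpenRead(path))
        {
            int b;
            // ReadByte returns -1 at the end of the stream.
            while ((b = stream.ReadByte()) != -1)
            {
                if (b > 127)
                {
                    return false; // could still be extended ASCII or UTF-8 text
                }
            }
        }
        return true;
    }
}
```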

The best bet would probably be to keep a list of known text file
extensions like .bat .inf .reg .txt, but even then you can't be sure,
since some program may decide to write a binary .txt file.
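
A list of known extensions might look something like this. The class and method names are invented for the sketch, and the four extensions are just the ones mentioned above; a real list would need to be much longer.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

static class TextExtensions
{
    // Illustrative list only; compared case-insensitively so ".TXT" matches too.
    private static readonly HashSet<string> Known =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        {
            ".bat", ".inf", ".reg", ".txt"
        };

    // Hypothetical helper: true when the file name ends in a listed extension.
    // This says nothing about the actual content of the file.
    public static bool HasKnownTextExtension(string path)
    {
        // Path.GetExtension returns the extension including the leading dot,
        // or an empty string when there is none.
        return Known.Contains(Path.GetExtension(path));
    }
}
```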

Maybe you could tap into the registry and get a descriptive line about
a registered file extension instead.
 
Guest

A fairly reliable method I have used in the past is to read it for binary
access, and if any of the bytes are zero, then it ISN'T a text file. If it
contains NO zeros in the whole file, it's probable that it's a text file.
Exceptions to this are *very* rare.

So just have a method IsTextFile, and start to read it using a FileStream in
a using block, and if any of the bytes are zero, return false immediately,
otherwise return true if it gets to the end of the loop.
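
A minimal sketch of such an IsTextFile method, assuming a 4 KB read buffer (the buffer size and the class name ZeroByteCheck are arbitrary choices for this example):

```csharp
using System.IO;

static class ZeroByteCheck
{
    // Returns false as soon as a zero byte is seen, true if the end of the
    // file is reached without finding one. An empty file returns true.
    public static bool IsTextFile(string path)
    {
        byte[] buffer = new byte[4096]; // assumed chunk size
        using (FileStream stream = new FileStream(path, FileMode.Open, FileAccess.Read))
        {
            int read;
            // Read returns 0 at the end of the stream.
            while ((read = stream.Read(buffer, 0, buffer.Length)) > 0)
            {
                for (int i = 0; i < read; i++)
                {
                    if (buffer[i] == 0)
                    {
                        return false;
                    }
                }
            }
        }
        return true;
    }
}
```

Reading in chunks rather than one byte at a time keeps the number of calls down on large files, and the early return means most binary files are rejected within the first buffer.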
 
Hans Kesting

Bonj said:
A fairly reliable method I have used in the past is to read it for
binary access, and if any of the bytes are zero, then it ISN'T a text
file. If it contains NO zeros in the whole file, it's probable that
it's a text file. Exceptions to this are *very* rare.

So just have a method IsTextFile, and start to read it using a
FileStream in a using block, and if any of the bytes are zero, return
false immediately, otherwise return true if it gets to the end of the
loop.

But this will identify only ASCII files as text; Unicode files will probably
be detected as binary! When characters are stored as two bytes (as in
UTF-16), it is very likely that one of them is zero.
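
One way to reduce that problem is to look for a Unicode byte order mark (BOM) at the start of the file before applying the zero-byte test. The sketch below uses the standard BOM byte sequences; the names BomCheck and DetectBom are made up for this example. Unicode files written without a BOM are not recognized, and a UTF-16 LE file whose first character is U+0000 would be reported as UTF-32 LE.

```csharp
static class BomCheck
{
    // Hypothetical helper: returns the name of the encoding indicated by a
    // byte order mark at the start of 'head', or null if no BOM is present.
    // 'head' is expected to hold the first few bytes of the file.
    public static string DetectBom(byte[] head)
    {
        // The UTF-32 LE BOM starts with the UTF-16 LE BOM, so test it first.
        if (StartsWith(head, 0xFF, 0xFE, 0x00, 0x00)) return "UTF-32 LE";
        if (StartsWith(head, 0x00, 0x00, 0xFE, 0xFF)) return "UTF-32 BE";
        if (StartsWith(head, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        if (StartsWith(head, 0xFF, 0xFE)) return "UTF-16 LE";
        if (StartsWith(head, 0xFE, 0xFF)) return "UTF-16 BE";
        return null;
    }

    private static bool StartsWith(byte[] data, params byte[] prefix)
    {
        if (data == null || data.Length < prefix.Length) return false;
        for (int i = 0; i < prefix.Length; i++)
        {
            if (data[i] != prefix[i]) return false;
        }
        return true;
    }
}
```

If DetectBom returns a name, the file can be treated as text in that encoding; only files without a BOM would then fall through to the zero-byte check.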

Hans Kesting
 
