How to detect binary file?

Marty McDonald

Our web app lets users upload tab-delimited files to the site, where
the content is parsed & loaded to a database. Aside from looking for
tab characters, control/linefeed characters, what would be a good way
to detect if the user uploaded a binary file?
 
Peter Duniho

Marty said:
Our web app lets users upload tab-delimited files to the site, where
the content is parsed & loaded to a database. Aside from looking for
tab characters, control/linefeed characters, what would be a good way
to detect if the user uploaded a binary file?

You can't do so reliably. You should just attempt to interpret
the file as those formats you expect to be able to parse, and if that
fails, assume the file is binary.
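
A minimal sketch of the "just try to parse it" approach, written in Python for illustration. The function name `parse_tab_delimited`, the `expected_fields` parameter, and the choice of UTF-8 are assumptions, not anything from the original app:

```python
import csv
import io


def parse_tab_delimited(data: bytes, expected_fields: int) -> list[list[str]]:
    """Return the parsed rows, or raise ValueError if the data does not parse.

    Hypothetical helper: assumes UTF-8 text and a fixed number of fields.
    """
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError as exc:
        raise ValueError("not valid UTF-8 text") from exc

    try:
        rows = list(csv.reader(io.StringIO(text, newline=""), delimiter="\t"))
    except csv.Error as exc:
        raise ValueError(f"not parseable as tab-delimited: {exc}") from exc

    if not rows:
        raise ValueError("no records found")
    for number, row in enumerate(rows, start=1):
        if len(row) != expected_fields:
            raise ValueError(
                f"record {number} has {len(row)} fields, expected {expected_fields}"
            )
    return rows
```

Any `ValueError` here can be treated as "this is not the text format we expected", which in practice covers most binary uploads.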

Note that you may want to provide the user with a way to specify the
format explicitly, since they may have a tab-delimited file that they
don't want your server to actually treat as tab-delimited, but instead
just want it stored verbatim as binary.

Pete
 
Arne Vajhøj

Marty said:
Our web app lets users upload tab-delimited files to the site, where
the content is parsed & loaded to a database. Aside from looking for
tab characters, control/linefeed characters, what would be a good way
to detect if the user uploaded a binary file?

You cannot reliably test for text file versus binary file
without knowing specifics about the content.

If your text files contain primarily digits and the English
alphabet, then you can make a guess by counting the number
of bytes n that have values >=128 in the first N bytes.
Small n/N indicates text file, large n/N indicates binary
file. Maybe do the cutoff at 25%.
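
The n/N heuristic could be sketched like this in Python. The function names and the default sample size of 4096 bytes are assumptions chosen for illustration; only the 25% cutoff comes from the suggestion above:

```python
def high_byte_ratio(data: bytes, sample_size: int = 4096) -> float:
    """Fraction n/N of bytes with value >= 128 in the first N bytes."""
    sample = data[:sample_size]
    if not sample:
        return 0.0  # an empty file has no high bytes to count
    high = sum(1 for value in sample if value >= 128)
    return high / len(sample)


def looks_binary(data: bytes, sample_size: int = 4096, cutoff: float = 0.25) -> bool:
    """Guess 'binary' when the high-byte ratio exceeds the cutoff."""
    return high_byte_ratio(data, sample_size) > cutoff
```

This remains a guess: text with many accented or non-Latin characters will score high, and a binary file that happens to be mostly low bytes will score low.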

Arne
 
Andy B.

Check the ASP:FileUpload control. It has a property for determining what
file type it is exactly. If it is anything other than plain text, throw it
out and tell the user it was the wrong filetype. If it is a text file, try
to parse it and handle additional errors there. I had to do this using mp3
files. The website could only allow mp3 uploads.
 
Peter Duniho

Andy said:
Check the ASP:FileUpload control. It has a property for determining what
file type it is exactly.

I don't even have to look at the docs to know that there's no property
on the object that can literally do that. It's not possible given some
arbitrary stream of bytes to know for sure what the format is.

More likely, the control supports interpretation of headers, such as
MIME type, and perhaps even looks at the file extension. And those are
fine techniques (in fact, MIME is a very nice, reliable way to mark up
data for portability and format recognition). But that's not what the
original question was asking about.

Pete
 
kndg

Andy said:
Check the ASP:FileUpload control. It has a property for determining what
file type it is exactly. If it is anything other than plain text, throw it
out and tell the user it was the wrong filetype. If it is a text file, try
to parse it and handle additional errors there. I had to do this using mp3
files. The website could only allow mp3 uploads.

Hi Andy,

I'm not sure whether that would help.
Are you referring to the FileUpload.PostedFile.ContentType property?
If so, that property does not examine the content of the file.
It just returns the type reported by the client, which is normally based
on the file extension (e.g. "aa.txt" -> text/plain, "aa.mp3" -> audio/mpeg).
Someone could just take a binary file (probably a malicious one!), rename
it to something that looks like a text file (e.g. iwonthurtyou.txt), and
upload it to your server!
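
To illustrate why a name-based type says nothing about the content, here is a small Python sketch using the standard library's `mimetypes` module. This is not the ASP.NET property discussed above, only an analogous extension lookup; the helper name `type_from_name` is made up for the example:

```python
import mimetypes
from typing import Optional


def type_from_name(filename: str) -> Optional[str]:
    """Guess a MIME type from the file name alone; the bytes are never read."""
    guessed, _encoding = mimetypes.guess_type(filename)
    return guessed


# A renamed binary gets exactly the same answer as a genuine text file,
# because only the ".txt" suffix is consulted.
print(type_from_name("aa.txt"))
print(type_from_name("iwonthurtyou.txt"))
```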

Regards.
 
Andy B.

True. This is why I wouldn't rely totally on the ContentType property. I
have seen instances where people weren't thinking or paying attention
and just uploaded what they thought might be good. Throwing out the file
based on ContentType works for that sort of mistake. If you are actually
trying to put all the bytes in the stream together to find out what it might
be, I don't think you can really do that. The best I can suggest is to test
your best guesses and "brace for impact" if it isn't what you wanted.

1. Test contenttype for the right stream type. If it is OK, go to step 2. If
not, throw the file out and start over.
2. Some websites/applications will refuse to take text files unless they
have a particular extension (csv, xml, txt). Test the file extension if any.
If it matches what you expect, go to step 3. If not, throw it out and start
over.
3. OK. We have what we assume to be a valid text file of some kind. Now test
for the correct formatting specifications (csv, xml, tab-delimited) by
trying to read the file as you would need to. Just remember to be careful
about what the data has in it.
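
The three steps above can be sketched as one function. This is Python for illustration only; the allowed content types, the allowed extensions, the UTF-8 assumption, and the name `check_upload` are all hypothetical choices, not part of any framework:

```python
import os

# Hypothetical allow-lists; adjust to whatever the site really accepts.
ALLOWED_CONTENT_TYPES = {"text/plain", "text/tab-separated-values"}
ALLOWED_EXTENSIONS = {".txt", ".tsv", ".tab"}


def check_upload(filename: str, content_type: str, data: bytes,
                 expected_fields: int) -> str:
    """Return "accepted" or a "rejected: ..." reason for an uploaded file."""
    # Step 1: the content type claimed by the client.
    base_type = content_type.split(";")[0].strip().lower()
    if base_type not in ALLOWED_CONTENT_TYPES:
        return "rejected: unexpected content type"

    # Step 2: the file extension.
    extension = os.path.splitext(filename)[1].lower()
    if extension not in ALLOWED_EXTENSIONS:
        return "rejected: unexpected extension"

    # Step 3: actually try to read the data as tab-delimited text.
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        return "rejected: not text"
    lines = text.splitlines()
    if not lines:
        return "rejected: empty file"
    for line in lines:
        if len(line.split("\t")) != expected_fields:
            return "rejected: wrong field count"
    return "accepted"
```

Steps 1 and 2 only catch honest mistakes cheaply; step 3 is the check that looks at the bytes.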

Anybody have any other ideas?
 
Andy B.

Do you have any better ideas then?
Peter Duniho said:
I don't even have to look at the docs to know that there's no property on
the object that can literally do that. It's not possible given some
arbitrary stream of bytes to know for sure what the format is.

More likely, the control supports interpretation of headers, such as MIME
type, and perhaps even looks at the file extension. And those are fine
techniques (in fact, MIME is a very nice, reliable way to mark up data for
portability and format recognition). But that's not what the original
question was asking about.

Pete
 
Hector Santos

Marty said:
Our web app lets users upload tab-delimited files to the site, where
the content is parsed & loaded to a database. Aside from looking for
tab characters, control/linefeed characters, what would be a good way
to detect if the user uploaded a binary file?


There is no hard rule for what a binary file is other than the definition
of your file spec. However, there are some traditional guidelines:

If the file contains no characters below ASCII code 32 (apart from
whitespace such as tab, CR, and LF) and none over 127, then it's
generally considered a text file. Otherwise it's binary or "garbage."

In other words, your field type defines what is valid. If you don't
expect something, then it's garbage.

In binary file transfer protocols (e.g. zmodem), you generally escape
characters if they are going to interfere with the protocol. In the ASCII
transfer protocol, you have to escape "terminal" control characters,
especially FF, ESC, XON, XOFF. For example, a text file uploaded via
the ASCII protocol may contain form feeds, which are control characters,
yet it is still considered a "text file."

But you are doing a web upload, which is a binary file transfer with
no escaping, and you will have a content-type header too if you
set the encoding correctly on the form tag.

So the first thing is to make sure the content type is not one that is
considered binary, like an image, zip, exe, etc. Ideally it says
text/plain.

After that, your only option really is to check for control characters
(ASCII code less than 32) and possibly ASCII codes > 127 as you parse
it, and I think you have to do this anyway because you should not
trust the content-type saying it's text/plain. If the content type is an
image type, then you don't need to bother with the parser; just reject it.
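
A sketch of that byte scan in Python. Treating tab, LF, and CR as acceptable is an assumption (a tab-delimited file needs them), and the names `find_unexpected_bytes` and `allow_high` are made up for the example:

```python
# Control characters a tab-delimited text file is expected to contain
# (assumption): tab, line feed, carriage return.
ALLOWED_CONTROL = {0x09, 0x0A, 0x0D}


def find_unexpected_bytes(data: bytes, allow_high: bool = False) -> list[tuple[int, int]]:
    """Return (offset, byte value) pairs for bytes a text file should not have.

    Flags control characters below 32 other than tab/LF/CR, and, unless
    allow_high is set, any byte above 127.
    """
    unexpected = []
    for offset, value in enumerate(data):
        if value < 32 and value not in ALLOWED_CONTROL:
            unexpected.append((offset, value))
        elif value > 127 and not allow_high:
            unexpected.append((offset, value))
    return unexpected
```

An empty result means the data passed the check; anything else gives the offsets to report back to the user. Set `allow_high=True` if the files may legitimately contain non-ASCII text.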
 
Marty McDonald

Thanks everyone for your comments. I should have provided more info in
the original post. Here it is:
Instead of using our web app's online pages to enter information,
users can upload tab-delimited text files which are eventually written
to db tables. So users can work in Excel, then "save as" tab-
delimited text & upload that file. But a user accidentally uploaded
the Excel file itself (xlsx extension). Our process handled it as a
file with one record containing zero fields. Our edit routines
therefore encountered no invalid fields, because there were none!
Nothing was saved in the db, but no error message was displayed
either.
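
Two guards that might have caught this case, sketched in Python. The first relies on the fact that an .xlsx file is a ZIP container, and ZIP archives normally begin with the bytes "PK\x03\x04"; the second simply refuses a parse result with no usable fields. Both function names are hypothetical:

```python
ZIP_SIGNATURE = b"PK\x03\x04"  # local file header at the start of most ZIP archives


def looks_like_zip_container(data: bytes) -> bool:
    """True if the upload starts like a ZIP archive (.xlsx, .docx, .zip, ...)."""
    return data.startswith(ZIP_SIGNATURE)


def require_usable_records(records: list[list[str]]) -> None:
    """Raise ValueError when parsing produced no records or only empty ones."""
    if not records or all(len(record) == 0 for record in records):
        raise ValueError("upload contained no usable records")
```

With the second check in place, "one record containing zero fields" becomes an error the user sees instead of a silent no-op.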

I will likely use Andy B's 3 step approach, along with Hector's ASCII
code test method.

Thanks everyone!
--Marty
 
