Detecting tsv/csv files basing on file content

Y

Yexiong Feng

Hi All,

I wonder how to detect whether a file is a comma separated file or a tab
separated file (given that the input file is either of them), without looking
at the file extension (i.e. the extension could be only .txt but the content
is structured in a tsv/csv manner).

Thanks!
Feng
 
Y

Yexiong Feng

Thanks Mark for the reply. However when I encounter a more complicated file
(say, an actual csv file that has tabs within fields), this method won't work.

I am wondering whether or not there exists a more sosphicated and developed
algorithm to do the detection?

Thanks,
Feng
 
A

Arne Vajhøj

Yexiong said:
I wonder how to detect whether a file is a comma separated file or a tab
separated file (given that the input file is either of them), without looking
at the file extension (i.e. the extension could be only .txt but the content
is structured in a tsv/csv manner).

Not with 100% certainty.

But if you read a few lines and count commas and tabs, then
chances are good that you can guess correctly.

3 commas, 0 tabs
3 commas, 0 tabs
3 commas, 0 tabs

means CSV.

0 commas 3 tabs
1 commas 3 tabs
0 commas 3 tabs

means TSV.

xxx<TAB>xxx,xxx
xxx<TAB>xxx,xxx
xxx<TAB>xxx,xxx

could be any. But that type of data would be extremely rare.

Arne
 
A

Arne Vajhøj

Mark said:
[top-posting corrected]
Thanks Mark for the reply. However when I encounter a more complicated
file
(say, an actual csv file that has tabs within fields), this method
won't work.

Indeed. You could say the same for a TSV file which has a lot of commas
within fields - incidentally, commas in fields is one of the most common
reasons for using the TSV format...

But quotes can also solve the problem for CSV.

Arne
 
A

Arne Vajhøj

Yexiong said:
Thanks Mark for the reply. However when I encounter a more complicated file
(say, an actual csv file that has tabs within fields), this method won't work.

I am wondering whether or not there exists a more sosphicated and developed
algorithm to do the detection?

If you count comma and tab in multiple lines, then the delimiter will
have the same number of occurrences in each line.

That is most likely not the case for the non-delimiter.

Arne
 
H

Hans Liss

Hi All,

I wonder how to detect whether a file is a comma separated file or a tab
separated file (given that the input file is either of them), without looking
at the file extension (i.e. the extension could be only .txt but the content
is structured in a tsv/csv manner).

The only safe way would be to scan the entire file and look for (and count)
any delimiters you want to support, split the lines and verify the actual
data. Then you may or may not need to rescan the file from the start if
your default assumption does not hold. This way, you could even support
various different text quoting methods.

Depending on the context, if I needed to make a really flexible import
function, I would probably try simultaneously parsing as much as possible
using different delimiters until all but one methods failed.

If memory isn't a big problem, just import into internal structures, one
for each delimiter. Once you are sure what kind of file it is, you can get
rid of all the in-memory data and dump the correct version to your database
or whatever you want to do with it.

If there's a memory constraint, you may need to parse the file in several
passes, one for each delimiter.

Oh, and remember that CSV files aren't always comma separated in Windows.
Depending on the system locale, commas may be used as decimal delimiters
in numbers, and CSV files will then be semicolon delimited.

Regards,

Hans
 
T

Todd Carnes

Yexiong said:
Hi All,

I wonder how to detect whether a file is a comma separated file or a tab
separated file (given that the input file is either of them), without looking
at the file extension (i.e. the extension could be only .txt but the content
is structured in a tsv/csv manner).

Thanks!
Feng

Try using FileHelpers to read in your files. It's free and works good.
It's at http://filehelpers.sourceforge.net/
 
J

Jeff Johnson

Try using FileHelpers to read in your files. It's free and works good.
It's at http://filehelpers.sourceforge.net/

Damn, I got all excited there for a second. "It can read from EXCEL??
Woo-hoo!!" Then I downloaded the source and found out it did so via Interop,
meaning Excel had to be installed on the machine. Which of course is exactly
what I'm trying to avoid. Oh well, it's still a really good project; it just
didn't do the one thing I needed it to.
 
D

Dude

If you need a managed way (i.e. without Interop) to manipulate Excel files
and without an installed copy of Excel, you need this:http://www.aspose.com/categories/file-format-components/aspose.cells-...

Do you have any control over the creation of the source file?
If so, a header block would eliminate confusion.

First line would be something along the lines

Format = \t
or
Format = ,
or you could possibly handle this
Format = "",
or my personal favorite
Format = |
 
T

Todd Carnes

Jeff said:
Damn, I got all excited there for a second. "It can read from EXCEL??
Woo-hoo!!" Then I downloaded the source and found out it did so via Interop,
meaning Excel had to be installed on the machine. Which of course is exactly
what I'm trying to avoid. Oh well, it's still a really good project; it just
didn't do the one thing I needed it to.

If you're programming in Java you can use this http://poi.apache.org/,
but I'm not sure what else you could use for c#.

Todd
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Top