Detecting tsv/csv files basing on file content

Yexiong Feng · Apr 14, 2009

Hi All,

I wonder how to detect whether a file is a comma separated file or a tab
separated file (given that the input file is either of them), without looking
at the file extension (i.e. the extension could be only .txt but the content
is structured in a tsv/csv manner).

Thanks!
Feng

Yexiong Feng · Apr 14, 2009

Thanks Mark for the reply. However when I encounter a more complicated file
(say, an actual csv file that has tabs within fields), this method won't work.

I am wondering whether or not there exists a more sosphicated and developed
algorithm to do the detection?

Thanks,
Feng

Arne VajhÃ¸j · Apr 14, 2009

Yexiong said:
I wonder how to detect whether a file is a comma separated file or a tab
separated file (given that the input file is either of them), without looking
at the file extension (i.e. the extension could be only .txt but the content
is structured in a tsv/csv manner).

Not with 100% certainty.

But if you read a few lines and count commas and tabs, then
chances are good that you can guess correctly.

3 commas, 0 tabs
3 commas, 0 tabs
3 commas, 0 tabs

means CSV.

0 commas 3 tabs
1 commas 3 tabs
0 commas 3 tabs

means TSV.

xxx<TAB>xxx,xxx
xxx<TAB>xxx,xxx
xxx<TAB>xxx,xxx

could be any. But that type of data would be extremely rare.

Arne

Arne VajhÃ¸j · Apr 14, 2009

Mark said:
[top-posting corrected]

Thanks Mark for the reply. However when I encounter a more complicated
file
(say, an actual csv file that has tabs within fields), this method
won't work.

Click to expand...

Indeed. You could say the same for a TSV file which has a lot of commas
within fields - incidentally, commas in fields is one of the most common
reasons for using the TSV format...

But quotes can also solve the problem for CSV.

Arne

Arne VajhÃ¸j · Apr 14, 2009

Yexiong said:
Thanks Mark for the reply. However when I encounter a more complicated file
(say, an actual csv file that has tabs within fields), this method won't work.

I am wondering whether or not there exists a more sosphicated and developed
algorithm to do the detection?

If you count comma and tab in multiple lines, then the delimiter will
have the same number of occurrences in each line.

That is most likely not the case for the non-delimiter.

Arne

Hans Liss · Apr 14, 2009

Hi All,

I wonder how to detect whether a file is a comma separated file or a tab
separated file (given that the input file is either of them), without looking
at the file extension (i.e. the extension could be only .txt but the content
is structured in a tsv/csv manner).

The only safe way would be to scan the entire file and look for (and count)
any delimiters you want to support, split the lines and verify the actual
data. Then you may or may not need to rescan the file from the start if
your default assumption does not hold. This way, you could even support
various different text quoting methods.

Depending on the context, if I needed to make a really flexible import
function, I would probably try simultaneously parsing as much as possible
using different delimiters until all but one methods failed.

If memory isn't a big problem, just import into internal structures, one
for each delimiter. Once you are sure what kind of file it is, you can get
rid of all the in-memory data and dump the correct version to your database
or whatever you want to do with it.

If there's a memory constraint, you may need to parse the file in several
passes, one for each delimiter.

Oh, and remember that CSV files aren't always comma separated in Windows.
Depending on the system locale, commas may be used as decimal delimiters
in numbers, and CSV files will then be semicolon delimited.

Regards,

Hans

Todd Carnes · Apr 15, 2009

Yexiong said:
Hi All,

I wonder how to detect whether a file is a comma separated file or a tab
separated file (given that the input file is either of them), without looking
at the file extension (i.e. the extension could be only .txt but the content
is structured in a tsv/csv manner).

Thanks!
Feng

Try using FileHelpers to read in your files. It's free and works good.
It's at http://filehelpers.sourceforge.net/

Jeff Johnson · Apr 17, 2009

Try using FileHelpers to read in your files. It's free and works good.
It's at http://filehelpers.sourceforge.net/

Damn, I got all excited there for a second. "It can read from EXCEL??
Woo-hoo!!" Then I downloaded the source and found out it did so via Interop,
meaning Excel had to be installed on the machine. Which of course is exactly
what I'm trying to avoid. Oh well, it's still a really good project; it just
didn't do the one thing I needed it to.

Dude · Apr 17, 2009

If you need a managed way (i.e. without Interop) to manipulate Excel files
and without an installed copy of Excel, you need this:http://www.aspose.com/categories/file-format-components/aspose.cells-...

Do you have any control over the creation of the source file?
If so, a header block would eliminate confusion.

First line would be something along the lines

Format = \t
or
Format = ,
or you could possibly handle this
Format = "",
or my personal favorite
Format = |

Todd Carnes · Apr 17, 2009

Jeff said:
Damn, I got all excited there for a second. "It can read from EXCEL??
Woo-hoo!!" Then I downloaded the source and found out it did so via Interop,
meaning Excel had to be installed on the machine. Which of course is exactly
what I'm trying to avoid. Oh well, it's still a really good project; it just
didn't do the one thing I needed it to.

If you're programming in Java you can use this http://poi.apache.org/,
but I'm not sure what else you could use for c#.

Todd

Jeff Johnson · Apr 17, 2009

If you need a managed way (i.e. without Interop) to manipulate Excel files
and without an installed copy of Excel, you need this:
http://www.aspose.com/categories/file-format-components/aspose.cells-for-.net-and-java/default.aspx

Yeah, but I'm cheap. I was looking for a free library.

HELP- Need TSV extension for Excel	6	Jul 13, 2009
Excel (Office 2007) Not Opening .tsv files	0	May 2, 2009
TSV Data import	1	Apr 4, 2007
Reading TSv files in Excel	1	Oct 12, 2006
How to Keep Excel Sheet Formatting On a CSV Sheet with vba	1	Aug 14, 2022
.tsv file	1	Sep 18, 2003
Using vbTab to create tsv files	4	Jun 8, 2006
How to disable General formatting when opening CSV	1	Feb 24, 2009

Detecting tsv/csv files basing on file content

Yexiong Feng

Yexiong Feng

Arne VajhÃ¸j

Arne VajhÃ¸j

Arne VajhÃ¸j

Hans Liss

Todd Carnes

Jeff Johnson

Dude

Todd Carnes

Jeff Johnson

Ask a Question

Similar Threads