Detecting TSV/CSV files based on file content

 
 
Yexiong Feng
Guest
Posts: n/a
 
      14th Apr 2009
Hi All,

I wonder how to detect whether a file is comma-separated or tab-separated
(given that the input file is one or the other) without looking at the file
extension (i.e. the extension could be just .txt even though the content is
structured as TSV/CSV).

Thanks!
Feng
 
 
 
 
 
Yexiong Feng
Guest
Posts: n/a
 
      14th Apr 2009
Thanks, Mark, for the reply. However, when I encounter a more complicated file
(say, an actual CSV file that has tabs within fields), this method won't work.

Is there a more sophisticated, better-developed algorithm for doing the
detection?

Thanks,
Feng

"Mark Rae [MVP]" wrote:

> "Yexiong Feng" <(E-Mail Removed)> wrote in message
> news:1A814C6C-2917-4C54-A903-(E-Mail Removed)...
>
> > I wonder how to detect whether a file is a comma separated file or a tab
> > separated file (given that the input file is either of them), without
> > looking
> > at the file extension (i.e. the extension could be only .txt but the
> > content
> > is structured in a tsv/csv manner).

>
> I suppose, you could open the file, read in the first line and check whether
> it contains a tab (\t) character. If it does, assume it's tsv. If not, check
> whether it contains a comma...
>
>
> --
> Mark Rae
> ASP.NET MVP
> http://www.markrae.net
>
>

 
 
Arne Vajhøj
Guest
Posts: n/a
 
      14th Apr 2009
Yexiong Feng wrote:
> I wonder how to detect whether a file is a comma separated file or a tab
> separated file (given that the input file is either of them), without looking
> at the file extension (i.e. the extension could be only .txt but the content
> is structured in a tsv/csv manner).


Not with 100% certainty.

But if you read a few lines and count commas and tabs, then
chances are good that you can guess correctly.

3 commas, 0 tabs
3 commas, 0 tabs
3 commas, 0 tabs

means CSV.

0 commas, 3 tabs
1 comma, 3 tabs
0 commas, 3 tabs

means TSV.

xxx<TAB>xxx,xxx
xxx<TAB>xxx,xxx
xxx<TAB>xxx,xxx

could be either. But that type of data would be extremely rare.

Arne
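
A minimal sketch of the counting idea above, in Python. The function name and the per-line voting rule are illustrative assumptions, not something specified in the post: each line votes for whichever of comma or tab occurs more often on it, and a tie is reported as ambiguous.

```python
def guess_by_counts(lines):
    """Guess 'csv' or 'tsv' from a few sample lines, or None if ambiguous."""
    votes = {",": 0, "\t": 0}
    for line in lines:
        commas, tabs = line.count(","), line.count("\t")
        if commas > tabs:
            votes[","] += 1
        elif tabs > commas:
            votes["\t"] += 1
        # A tie (including 0 and 0) casts no vote.
    if votes[","] == votes["\t"]:
        return None  # e.g. xxx<TAB>xxx,xxx on every line
    return "csv" if votes[","] > votes["\t"] else "tsv"


sample_csv = ["a,b,c,d", "1,2,3,4", "5,6,7,8"]
sample_tsv = ["a\tb\tc\td", "1\t2,5\t3\t4", "5\t6\t7\t8"]
sample_mixed = ["xxx\txxx,xxx"] * 3

print(guess_by_counts(sample_csv))
print(guess_by_counts(sample_tsv))
print(guess_by_counts(sample_mixed))
```

As the post says, this is a guess rather than a proof; the third sample is the rare case where the counts alone cannot decide.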



 
 
Arne Vajhøj
Guest
Posts: n/a
 
      14th Apr 2009
Mark Rae [MVP] wrote:
> "Yexiong Feng" <(E-Mail Removed)> wrote in message
> news:FC070E15-2526-4682-A957-(E-Mail Removed)...
>
> [top-posting corrected]
>
>>>> I wonder how to detect whether a file is a comma separated file or a
>>>> tab
>>>> separated file (given that the input file is either of them), without
>>>> looking at the file extension (i.e. the extension could be only .txt
>>>> but the
>>>> content is structured in a tsv/csv manner).
>>>
>>> I suppose, you could open the file, read in the first line and check
>>> whether
>>> it contains a tab (\t) character. If it does, assume it's tsv. If
>>> not, check
>>> whether it contains a comma...

>>
>> Thanks Mark for the reply. However when I encounter a more complicated
>> file
>> (say, an actual csv file that has tabs within fields), this method
>> won't work.

>
> Indeed. You could say the same for a TSV file which has a lot of commas
> within fields - incidentally, commas in fields is one of the most common
> reasons for using the TSV format...


But quotes can also solve the problem for CSV.

Arne
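
To illustrate the point about quotes, here is a small Python sketch using the standard csv module. The sample data and the helper name parse_csv are made up for illustration. A quote-aware parser keeps a comma inside a quoted field as data, while a naive split on commas breaks the field apart.

```python
import csv
import io


def parse_csv(text):
    """Parse CSV text with the standard library's quote-aware reader."""
    return list(csv.reader(io.StringIO(text)))


text = 'name,city\n"Smith, John",Boston\n'

rows = parse_csv(text)
naive = text.splitlines()[1].split(",")

print(rows[1])   # two fields: the quoted comma is kept as data
print(naive)     # three fields: the naive split cuts the name in half
```

The same caveat applies to detection: raw comma counts are only reliable when delimiters inside quoted fields are ignored.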
 
 
Arne Vajhøj
Guest
Posts: n/a
 
      14th Apr 2009
Yexiong Feng wrote:
> "Mark Rae [MVP]" wrote:
>> "Yexiong Feng" <(E-Mail Removed)> wrote in message
>> news:1A814C6C-2917-4C54-A903-(E-Mail Removed)...
>>> I wonder how to detect whether a file is a comma separated file or a tab
>>> separated file (given that the input file is either of them), without
>>> looking
>>> at the file extension (i.e. the extension could be only .txt but the
>>> content
>>> is structured in a tsv/csv manner).

>> I suppose, you could open the file, read in the first line and check whether
>> it contains a tab (\t) character. If it does, assume it's tsv. If not, check
>> whether it contains a comma...

> Thanks Mark for the reply. However when I encounter a more complicated file
> (say, an actual csv file that has tabs within fields), this method won't work.
>
> I am wondering whether or not there exists a more sophisticated and developed
> algorithm to do the detection?


If you count the commas and tabs on multiple lines, the real delimiter will
occur the same number of times on every line.

That is most likely not the case for the character that is not the delimiter.

Arne
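
The consistency check described above could be sketched like this in Python. The function name and the candidate list are assumptions for illustration. A candidate is kept only if it occurs the same nonzero number of times on every sampled line.

```python
def consistent_delimiters(lines, candidates=(",", "\t")):
    """Return the candidates whose per-line count is identical and nonzero."""
    result = []
    for delim in candidates:
        counts = {line.count(delim) for line in lines}
        # Exactly one distinct count, and it is not zero.
        if len(counts) == 1 and 0 not in counts:
            result.append(delim)
    return result


# A TSV sample where some fields happen to contain commas.
lines = ["a\tb,c\td", "e\tf\tg", "h,i,j\tk\tl"]
print(consistent_delimiters(lines))
```

Here every line has two tabs, while the comma count varies (1, 0, 2), so only the tab survives. Note that this raw count does not account for delimiters inside quoted fields, so a quoted CSV file can still defeat it.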



 
 
Hans Liss
Guest
Posts: n/a
 
      14th Apr 2009
In article <1A814C6C-2917-4C54-A903-(E-Mail Removed)>,
Yexiong Feng <(E-Mail Removed)> wrote:
>Hi All,
>
>I wonder how to detect whether a file is a comma separated file or a tab
>separated file (given that the input file is either of them), without looking
>at the file extension (i.e. the extension could be only .txt but the content
>is structured in a tsv/csv manner).


The only safe way would be to scan the entire file and look for (and count)
any delimiters you want to support, split the lines and verify the actual
data. Then you may or may not need to rescan the file from the start if
your default assumption does not hold. This way, you could even support
various different text quoting methods.

Depending on the context, if I needed to make a really flexible import
function, I would probably try parsing as much as possible with each candidate
delimiter simultaneously until all but one method had failed.

If memory isn't a big problem, just import into internal structures, one
for each delimiter. Once you are sure what kind of file it is, you can get
rid of all the in-memory data and dump the correct version to your database
or whatever you want to do with it.

If there's a memory constraint, you may need to parse the file in several
passes, one for each delimiter.

Oh, and remember that CSV files aren't always comma-separated on Windows.
Depending on the system locale, the comma may be the decimal separator in
numbers, and CSV files will then be semicolon-delimited.

Regards,

Hans
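
A rough Python sketch of the "parse with every delimiter and see which one survives" approach, under some assumptions of my own: the whole file fits in memory as a string, a parse "fails" when the rows do not all have the same number of fields (or have only one field), and the candidate list includes the semicolon because of the locale issue mentioned above. The function name is hypothetical.

```python
import csv
import io

CANDIDATES = [",", "\t", ";"]  # assumed candidate set; extend as needed


def surviving_delimiters(text, candidates=CANDIDATES):
    """Parse the text once per candidate and keep those giving a regular table."""
    survivors = []
    for delim in candidates:
        reader = csv.reader(io.StringIO(text), delimiter=delim)
        rows = [row for row in reader if row]  # skip completely empty lines
        widths = {len(row) for row in rows}
        # Plausible only if every row has the same field count, greater than one.
        if len(widths) == 1 and widths.pop() > 1:
            survivors.append(delim)
    return survivors


# Semicolon-delimited export with decimal commas and one quoted semicolon.
text = 'id;price;note\n1;3,50;"a;b"\n2;4,25;ok\n'
print(surviving_delimiters(text))
```

Because csv.reader honours quotes, the quoted "a;b" does not disturb the field count. If more than one candidate survives, the file is genuinely ambiguous and some further check on the actual data (as suggested above) is still needed.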

 
 
Todd Carnes
Guest
Posts: n/a
 
      15th Apr 2009
Yexiong Feng wrote:
> Hi All,
>
> I wonder how to detect whether a file is a comma separated file or a tab
> separated file (given that the input file is either of them), without looking
> at the file extension (i.e. the extension could be only .txt but the content
> is structured in a tsv/csv manner).
>
> Thanks!
> Feng


Try using FileHelpers to read in your files. It's free and works well.
It's at http://filehelpers.sourceforge.net/
 
 
Jeff Johnson
Guest
Posts: n/a
 
      17th Apr 2009
"Todd Carnes" <(E-Mail Removed)> wrote in message
news:(E-Mail Removed)...

> Try using FileHelpers to read in your files. It's free and works good.
> It's at http://filehelpers.sourceforge.net/


Damn, I got all excited there for a second. "It can read from EXCEL??
Woo-hoo!!" Then I downloaded the source and found out it did so via Interop,
meaning Excel had to be installed on the machine. Which of course is exactly
what I'm trying to avoid. Oh well, it's still a really good project; it just
didn't do the one thing I needed it to.


 
 
Dude
Guest
Posts: n/a
 
      17th Apr 2009
On Apr 17, 1:52 pm, "Mark Rae [MVP]" <m...@markNOSPAMrae.net> wrote:
> "Jeff Johnson" <i....@enough.spam> wrote in message
>
> news:%(E-Mail Removed)...
>
> > Damn, I got all excited there for a second. "It can read from EXCEL??
> > Woo-hoo!!" Then I downloaded the source and found out it did so via
> > Interop, meaning Excel had to be installed on the machine. Which of course
> > is exactly what I'm trying to avoid.

>
> If you need a managed way (i.e. without Interop) to manipulate Excel files
> and without an installed copy of Excel, you need this: http://www.aspose.com/categories/fil.../aspose.cells-...
>
> --
> Mark Rae
> ASP.NET MVP
> http://www.markrae.net


Do you have any control over the creation of the source file?
If so, a header block would eliminate confusion.

The first line would be something along the lines of:

Format = \t
or
Format = ,
or you could possibly handle this
Format = "",
or my personal favorite
Format = |
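
Reading such a header could look roughly like this in Python. The "Format = <delimiter>" syntax is the one proposed above; the function name and the decision to accept only single-character delimiters (so the quoted "", variant is not handled) are simplifying assumptions for this sketch.

```python
def read_declared_delimiter(first_line):
    """Return the delimiter declared in a 'Format = x' header line, or None."""
    key, sep, value = first_line.rstrip("\r\n").partition("=")
    if not sep or key.strip().lower() != "format":
        return None  # no header block; fall back to guessing
    value = value.strip(" ")  # strip spaces only, so a literal tab survives
    if value == r"\t":
        return "\t"  # the two-character escape written out as text
    return value if len(value) == 1 else None


print(repr(read_declared_delimiter("Format = \\t\n")))
print(repr(read_declared_delimiter("Format = ,\n")))
print(repr(read_declared_delimiter("Format = |\n")))
print(repr(read_declared_delimiter("a,b,c\n")))
```

A file without the header returns None, at which point one of the guessing approaches from earlier in the thread would have to take over.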
 
 
Todd Carnes
Guest
Posts: n/a
 
      17th Apr 2009
Jeff Johnson wrote:
> "Todd Carnes" <(E-Mail Removed)> wrote in message
> news:(E-Mail Removed)...
>
>> Try using FileHelpers to read in your files. It's free and works good.
>> It's at http://filehelpers.sourceforge.net/

>
> Damn, I got all excited there for a second. "It can read from EXCEL??
> Woo-hoo!!" Then I downloaded the source and found out it did so via Interop,
> meaning Excel had to be installed on the machine. Which of course is exactly
> what I'm trying to avoid. Oh well, it's still a really good project; it just
> didn't do the one thing I needed it to.
>
>


If you're programming in Java, you can use Apache POI (http://poi.apache.org/),
but I'm not sure what else you could use for C#.

Todd
 