Parser for CSV files

  • Thread starter: Ali-R
No, I'm getting a CSV file from a client and I need to open it up and do some
checks on it before handing it over to another program. I'm looking for
something similar to get some ideas on how to parse a CSV file in the most
efficient way.

Thanks for your quick help.
 
Nothing built-in for this. The two solutions most commonly given to this
question are to use String.Split() (too simplistic for general use; it doesn't
handle field values that are quoted and have a comma within them, for
instance) ... or to use ADO.NET to read/write a CSV file via, say, a generic
text ODBC driver or the Excel driver. The ADO.NET solution has the downside
of configuration issues to iron out, and I believe a 254-field limit in some
cases. Particularly if deployment will be on many arbitrary machines, you
have to worry about whether your chosen driver is installed, what version it
is, etc. It does make life more complicated and finicky.
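
To illustrate the quoting problem, here's a minimal sketch of a quote-aware
splitter (the method name and details are mine, purely for illustration;
multi-line quoted fields are not handled):

using System;
using System.Collections;
using System.Text;

class CsvSplit {
    // Splits one single-line CSV record, honoring double quotes so that
    // embedded commas and doubled quotes survive intact.
    static string[] QuoteAwareSplit(string line) {
        ArrayList fields = new ArrayList();
        StringBuilder field = new StringBuilder();
        bool inQuotes = false;

        for (int i = 0; i < line.Length; i++) {
            char c = line[i];
            if (c == '"') {
                if (inQuotes && i + 1 < line.Length && line[i + 1] == '"') {
                    field.Append('"'); // doubled quote inside a quoted field
                    i++;
                } else {
                    inQuotes = !inQuotes; // opening or closing quote
                }
            } else if (c == ',' && !inQuotes) {
                fields.Add(field.ToString());
                field.Length = 0;
            } else {
                field.Append(c);
            }
        }
        fields.Add(field.ToString());
        return (string[])fields.ToArray(typeof(string));
    }

    static void Main() {
        // String.Split(',') would break this record into four pieces;
        // the quote-aware pass correctly yields three.
        string line = "42,\"Smith, John\",\"He said \"\"hi\"\"\"";
        foreach (string f in QuoteAwareSplit(line))
            Console.WriteLine(f);
    }
}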

Since I read and write a variety of CSV and CSV-like formats all the time
(pipe-delimited, tab-delimited, space-delimited, and various bizarre formats
such as field label....value, one field to a line, and even scraping of text
reports), I've found it much more efficient and trouble-free to write my own
classes for this purpose. For example, I have a DelimitedTextFile() class
whose constructor takes a file name (or a FileInfo instance); it has a
ReadRecord() method that returns an ArrayList of record values, and a
WriteRecord() method that takes an ArrayList of values and writes them out.
There's also a FieldInfo property, consisting of an ArrayList of FieldInfo
structs, that defines the field name/data type of each field. All of this
works very fast and has proven to be highly adaptable to anything that gets
thrown at it.
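
I can't post my production class, but a rough skeleton of the shape described
above might look like this (illustrative only; quote handling, WriteRecord(),
and the FieldInfo property are left out for brevity):

using System;
using System.Collections;
using System.IO;

class DelimitedTextFile : IDisposable {
    StreamReader reader;
    char delimiter;

    public DelimitedTextFile(string fileName) : this(fileName, ',') { }

    public DelimitedTextFile(string fileName, char delimiter) {
        reader = new StreamReader(fileName);
        this.delimiter = delimiter;
    }

    // Returns the next record as an ArrayList of string field values,
    // or null at end of file.
    public ArrayList ReadRecord() {
        string line = reader.ReadLine();
        if (line == null) return null;
        return new ArrayList(line.Split(delimiter));
    }

    public void Dispose() {
        reader.Close();
    }
}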

--Bob
 
I ended up writing a CSV parser based upon the Microsoft.Jet.OLEDB.4.0
driver, which I use all the time! It's not hard, and you can then build
the validator(s) as a separate component.
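
For reference, reading a CSV through Jet typically looks something like this
(the paths and connection-string details are placeholders and vary by setup):

using System;
using System.Data;
using System.Data.OleDb;

class JetCsvDemo {
    static void Main() {
        // Jet's text ISAM treats the folder as the database and each
        // file in it as a table; C:\Incoming and clients.csv are
        // placeholder names.
        string connStr = @"Provider=Microsoft.Jet.OLEDB.4.0;" +
                         @"Data Source=C:\Incoming;" +
                         "Extended Properties=\"text;HDR=Yes;FMT=Delimited\"";

        using (OleDbConnection conn = new OleDbConnection(connStr)) {
            OleDbDataAdapter da = new OleDbDataAdapter(
                "SELECT * FROM [clients.csv]", conn);
            DataTable table = new DataTable();
            da.Fill(table); // pulls the whole file into a DataTable

            foreach (DataRow row in table.Rows)
                Console.WriteLine(row[0]);
        }
    }
}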

Kirsty
 
Excellent, very nice idea. Here are some questions regarding your solution:

1) You said:
use ADO.NET to read/write a CSV file via, say, a generic text ODBC driver
or the Excel driver
What do you mean by a generic text ODBC driver or the Excel driver?

2) I didn't understand the usage of:
And a FieldInfo property consisting of an arraylist of FieldInfo structs
that defines the field name/data type of each field.

3) I need to validate each record value against some complicated rules; what
have you done about that?

4) Can you send me that example by any chance?
my email is (e-mail address removed) (remove Fake from the beginning of my
userid :-)

Thanks very much for your nice help.

Ali
 
Are you using the Visual Basic sample showing how to use the CsvDataAdapter
to read and write a table of data to a CSV text file?

Thanks
 
1) You said:
What do you mean by a generic text ODBC driver or the Excel driver?

Microsoft operating systems used to come with "desktop ODBC drivers" amongst
which was one called (if I recall correctly) Generic Text or just the "text
driver", for reading CSVs and fixed field length text files as if they were
database tables. There is also an ODBC driver that reads Excel files
including CSVs. I think it's part of the JET driver. I've not personally
used either one but they would work like any other ADO.NET data source, via
the ODBC provider.
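
If you go the ODBC route, a sketch would look something like this (the driver
name must match whatever is actually installed on the machine; the paths are
placeholders):

using System;
using System.Data.Odbc;

class OdbcTextDemo {
    static void Main() {
        // Dbq points at the folder containing the CSV; the file itself
        // is queried as a table.
        string connStr = @"Driver={Microsoft Text Driver (*.txt; *.csv)};" +
                         @"Dbq=C:\Incoming;";

        using (OdbcConnection conn = new OdbcConnection(connStr)) {
            conn.Open();
            OdbcCommand cmd = new OdbcCommand(
                "SELECT * FROM [clients.csv]", conn);
            using (OdbcDataReader rdr = cmd.ExecuteReader()) {
                while (rdr.Read())
                    Console.WriteLine(rdr.GetValue(0));
            }
        }
    }
}
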
2) I didn't understand the usage of:

Unless you want all of your fields to be strings and don't care to handle /
validate field name header records that are part of some CSV files, you need
some mechanism to define field names, field order, field data types, and
perhaps width and basic formatting instructions. You just create a struct
or class that defines these items.
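
For example, one possible shape for such a struct (the field set here is just
illustrative):

using System;

// Schema info for one field. Note the name collides with
// System.Reflection.FieldInfo if both namespaces are in scope.
struct FieldInfo {
    public string Name;     // e.g. "CustomerId"
    public Type   DataType; // e.g. typeof(int); string by default
    public int    Width;    // 0 = unrestricted
    public bool   Required; // disallow empty values

    public FieldInfo(string name, Type dataType, int width, bool required) {
        Name = name; DataType = dataType; Width = width; Required = required;
    }
}

The reading/writing class then consults an ArrayList of these when converting
and validating each record.
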
3) I need to validate each record value against some complicated rules; what
have you done about that?

Whatever is necessary. That is up to the client program. A basic loop
looks something like:


using (DelimitedTextFile f = new DelimitedTextFile("somefile.csv")) {
    // code here to set up the FieldInfo arraylist
    ArrayList fieldData;

    while ((fieldData = f.ReadRecord()) != null) {
        // Do whatever you need to with the current record,
        // which is in fieldData.
    }
}
4) Can you send me that example by any chance?
my email is (e-mail address removed) (remove Fake from the beginning of my
userid :-)

Sorry, the code was written under nondisclosure and I am not permitted to
release the source code. But it isn't rocket science to put something like
this together and it's well worth the effort.

--Bob
 
Bob, thanks very much for your nice replies.
Unless you want all of your fields to be strings and don't care to handle
/ validate field name header records that are part of some CSV files, you
need some mechanism to define field names, field order, field data types,
and perhaps width and basic formatting instructions. You just create a
struct or class that defines these items.


I don't think I need to worry about field names, but the data type of the
field does matter (e.g., sometimes it should be only a number, with no
letters allowed), and field order is really important to me, so I have to use
FieldInfo.
I'm going to implement it this weekend; if I have small questions, can I
email you at the address you have provided?

Thanks again
 
Field names can become important when you have a header record in the CSV
and need to check that the header record does not change from what's
expected, since change in field order is significant, and change in field
names *may* be significant. It has been my experience that data from
outside sources can change at any time without notice, no matter whether the
provider of the data guarantees they will or will not. So it pays to check
that you're getting what you expect before you go and put the wrong data
into the wrong fields or something.
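
A header check in that spirit might look something like this (the expected
field names are hypothetical):

using System;
using System.IO;

class HeaderCheck {
    static readonly string[] Expected = { "CustomerId", "Name", "Balance" };

    // Returns true only if the first line of the file matches the
    // expected header exactly, in order.
    static bool HeaderMatches(string path) {
        using (StreamReader r = new StreamReader(path)) {
            string header = r.ReadLine();
            if (header == null) return false;
            string[] names = header.Split(',');
            if (names.Length != Expected.Length) return false;
            for (int i = 0; i < names.Length; i++)
                if (names[i].Trim() != Expected[i]) return false;
            return true;
        }
    }
}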

In any case I like to name all the fields just for debugging purposes, even
when there is no header record. Field names can also be useful for
populating default data grid headers and the like, in the UI, though I've
never actually needed to do it. Finally you could create indexers or
methods that would return field values based on field name rather than just
field index. Again, something I haven't actually needed to do, but it could
be handy at times.

My FieldInfo concept is not magical, your parsing / reading / writing class
has to use that info to do the appropriate transformations and enforce rules
or raise errors; it's just a container for some limited schema information.
There are probably lots of ways you could structure it; this is just the way
I chose.

One thing I have found is that although I built in support for converting
fields to and from specific .NET data types as part of reading and writing
CSVs, in practice I haven't used that nearly as much as I thought I would,
and tend to just define all fields as strings. For the work I do anyway, it
seems better to let client applications of these text file I/O classes be
responsible for whatever scrubbing and conversion they need to do. If you
wanted to be brutally simplistic you could just design a class that treats
all fields as strings and is just responsible for correctly separating out
each field string in order, and doesn't care about names, data types, or
anything else but perhaps how many fields to expect. The decision here is a
matter of where you want to place the responsibility for data integrity
and how you want to handle data that doesn't conform to expectations.

You can always drop me a note. I can't guarantee I'll have time to respond,
at least right away, but I will try.

Best,

--Bob
 
Thanks for your suggestions.

 
Bob,

I forgot to ask: what's your usual strategy for transforming CSV files into a
database?
Here is what I'm going to do:
=======================================================================================
I'm receiving some CSV files from our clients, and they are supposed to be
written into the database, considering the following issues:
1) The CSV files must be heavily validated, each field against a lot of
business rules
2) Everything must be logged:
        I) Invalid field values
        II) Errors and warnings that happen while pumping data from CSV into
the DB
        III) All other errors and exceptions
I chose to have a Windows Service that monitors a folder for CSV files and
picks them up as soon as they are created (see the sketch below); after
validation (parsing + validating), it executes a DTS package (whatever
happens in DTS, the Windows Service gets notified). There is also a log
component in the Windows Service that logs everything.
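
A minimal sketch of the folder-watching piece, assuming .NET's
FileSystemWatcher (the path is a placeholder; in a real Windows Service this
would go in OnStart rather than Main):

using System;
using System.IO;

class CsvFolderMonitor {
    static void Main() {
        FileSystemWatcher watcher = new FileSystemWatcher(@"C:\Incoming", "*.csv");
        watcher.Created += new FileSystemEventHandler(OnCsvCreated);
        watcher.EnableRaisingEvents = true;

        Console.WriteLine("Watching... press Enter to stop.");
        Console.ReadLine();
    }

    static void OnCsvCreated(object sender, FileSystemEventArgs e) {
        // The file may still be open for writing when Created fires;
        // production code should retry until it can be opened exclusively.
        Console.WriteLine("Picked up: " + e.FullPath);
        // ...parse, validate, log, then kick off the DTS package here.
    }
}
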
I just wanted to know if anybody has experience with this. I'm looking for
the best strategy, actually.
Thanks
Ali-R
 
In the case where I'm getting a CSV file from another source, I
normally read in the CSV file, validate it as I go, and write the good
records back out to another CSV or delimited file. That way I limit the
number of errors thrown by my DTS package, which should really only
worry about getting data into the database as fast as possible, not
about parsing errors, etc. How big are these files you're
processing?
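
In rough outline, that pre-filtering pass might look like this (the file
names and the IsValid() rule are stand-ins for the real thing):

using System;
using System.IO;

class PreFilter {
    static void Main() {
        // Copy good records to a clean file and log the rejects.
        using (StreamReader input = new StreamReader("incoming.csv"))
        using (StreamWriter good = new StreamWriter("clean.csv"))
        using (StreamWriter bad = new StreamWriter("rejects.log")) {
            string line;
            int lineNo = 0;
            while ((line = input.ReadLine()) != null) {
                lineNo++;
                if (IsValid(line))
                    good.WriteLine(line);
                else
                    bad.WriteLine("Line " + lineNo + ": " + line);
            }
        }
    }

    // Stand-in for the real business rules.
    static bool IsValid(string line) {
        return line.Trim().Length > 0;
    }
}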
 
There are lots of records in each CSV, with 33 fields in each record. My
problems are as follows:

1) How to design the validator for the best performance

2) What kind of application to use to host the validator and other components
(Windows Service, Console Application, ...)

3) Another problem is that they zip the CSV files and put them in our
database, so I probably need another DTS package to unzip the data and write
it out somewhere as a CSV file

I have some questions about yours:

1) Is your application multi-threaded or not?

2) Do you process CSV files in memory, or do you put them on the file system
and just pass the path to the CSV to your DTS package?

3) Some errors (like a record already existing in the database) are the kind
of errors that can't be validated until you actually insert the data; how
would you deal with them? (I have to log those errors too.)

Thanks for your help
 