Parser for CSV files

  • Thread starter: Ali-R
1) How to design the validator for the best performance?

For best performance, don't design your own; find a product that's already
been made and tuned. It will save you hours of work and hours of support, and
it will probably handle more issues than you can even think of.
2) What kind of application should I use to host the validator and other
components (Windows Service, Console Application, ...)?

You said you're just planning on monitoring a directory, right? If so,
I'd say use a Windows Service.
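
For illustration only, here's a bare-bones sketch of that kind of drop-folder
watcher; the folder path and the processing hook are made up, and the same
body could go into a service's OnStart:

using System;
using System.IO;

class DropFolderWatcher
{
    static void Main()
    {
        // Watch a (hypothetical) drop folder for new CSV files.
        FileSystemWatcher watcher = new FileSystemWatcher(@"C:\Drop", "*.csv");
        watcher.Created += new FileSystemEventHandler(OnCreated);
        watcher.EnableRaisingEvents = true;

        Console.WriteLine("Watching... press Enter to stop.");
        Console.ReadLine();
    }

    static void OnCreated(object sender, FileSystemEventArgs e)
    {
        // Hand the new file off to your validation / load routine here.
        Console.WriteLine("New file: " + e.FullPath);
    }
}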
3) Another problem is that they zip the CSV files and put them in our
database. I probably need another DTS package to unzip the data from the
database and write it somewhere as a CSV file.

I'm not sure why you'd use DTS to unzip files. I'd just use C# products
to do that.
http://www.icsharpcode.net/OpenSource/SharpZipLib/Default.aspx
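
A rough, untested sketch of pulling the CSVs out of a zip with that library
(paths are placeholders; check the library docs for the exact API):

using System.IO;
using ICSharpCode.SharpZipLib.Zip;

class UnzipToFolder
{
    static void Extract(string zipPath, string outDir)
    {
        using (ZipInputStream zis = new ZipInputStream(File.OpenRead(zipPath)))
        {
            byte[] buffer = new byte[4096];
            ZipEntry entry;
            while ((entry = zis.GetNextEntry()) != null)
            {
                if (!entry.IsFile) continue;   // skip directory entries

                string outPath = Path.Combine(outDir, Path.GetFileName(entry.Name));
                using (FileStream outStream = File.Create(outPath))
                {
                    int count;
                    while ((count = zis.Read(buffer, 0, buffer.Length)) > 0)
                        outStream.Write(buffer, 0, count);
                }
            }
        }
    }
}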
1) Is your application multi-threaded or not?

My applications that have done this type of work have actually been web
page file uploads, so they've of course been multithreaded because of
the way web pages run.
2) Do you process CSV files in memory, or do you put them on the file system
and just pass the path to the CSV to your DTS package?

I would not recommend loading everything into memory. Again, I wouldn't
be using DTS to validate; I'd be using a CSV parser product to read
over the file line by line. It's scalable, efficient, and fast.
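
The line-by-line idea is roughly this (the naive Split below only works if
fields never contain embedded commas or quotes; a real CSV parser handles
those cases):

using System;
using System.IO;

class LineByLineCheck
{
    static void Main(string[] args)
    {
        using (StreamReader reader = new StreamReader(args[0]))
        {
            string line;
            int lineNo = 0;
            // Read one line at a time; the whole file is never in memory.
            while ((line = reader.ReadLine()) != null)
            {
                lineNo++;
                string[] fields = line.Split(',');
                if (fields.Length != 33)   // the 33-field records mentioned in this thread
                    Console.WriteLine("Line {0}: expected 33 fields, got {1}",
                                      lineNo, fields.Length);
                // field-level validation would go here
            }
        }
    }
}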
3) Some errors, like "record already exists in the database", are the kind of
errors which can't be validated until you actually insert the records into the
database. How would you deal with them? (I have to log those errors too.)

Most people I know that have these types of uploads do not load records
directly into their final table. They load the data into a working table
first. Then you can run queries against the working table and the final
table, for instance to log duplicate records. When they're all done, they
move the records from the working table into the live table.
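
The promotion step can then be a single set-based statement; here's an
illustrative sketch with invented table and column names:

using System;
using System.Data.SqlClient;

class PromoteRecords
{
    static void Promote(string connectionString)
    {
        // Copy only rows whose key is not already in the live table;
        // rows in the working table whose key already exists in the
        // live table are the duplicates to log.
        string sql =
            "INSERT INTO LiveTable (Key1, Col2, Col3) " +
            "SELECT w.Key1, w.Col2, w.Col3 FROM WorkTable w " +
            "WHERE NOT EXISTS (SELECT 1 FROM LiveTable t WHERE t.Key1 = w.Key1)";

        using (SqlConnection conn = new SqlConnection(connectionString))
        using (SqlCommand cmd = new SqlCommand(sql, conn))
        {
            conn.Open();
            int moved = cmd.ExecuteNonQuery();
            Console.WriteLine(moved + " rows promoted.");
        }
    }
}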
 
There are lots of records in each CSV, with 33 fields in each record. My
problems are as follows:

1) How to design the validator for the best performance?

That is a very broad question and a completely separate issue from parsing
the CSV. There is no one "correct" answer. However, you will be amazed at
how much detailed parsing and checking a finished C# application can crank
out in the blink of an eye. Parsing and rewriting CSVs with thousands of
records takes just seconds, including program load time. I don't even bother
to do a release build; it's not worth the effort to maintain two builds (of
course, all of my stuff is for internal release to one machine).
2) What kind of application should I use to host the validator and other
components (Windows Service, Console Application, ...)?

Depends on your needs. Generally a console app is my starting point for
anything like this. My apps tend to stay as console applications, some of
which run via a scheduler. That works fine for me since my apps all run on
dedicated worker machines. With very little extra work you can turn a
debugged console app into a Windows Service, or call the mainline processing
from a Windows Forms application. (If you think you will be talking to a
WinForms app to show detailed progress messages, you may want to feed all
your console output through a central routine, so that you can modify it to
properly update the UI on the UI thread; otherwise, any Console.WriteLine()
calls are simply ignored in a normal Windows app, since no console is present
at runtime.)
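
One possible shape for that central output routine (just an illustration,
nothing specific to this thread; the WinForms part is shown in comments):

using System;

public delegate void StatusHandler(string message);

public static class Status
{
    // Default sink: plain console output. A WinForms host can swap this out.
    public static StatusHandler Sink = delegate(string msg) { Console.WriteLine(msg); };

    public static void Report(string message)
    {
        Sink(message);
    }
}

// In a WinForms host, the sink could marshal onto the UI thread, e.g.:
//   Status.Sink = delegate(string msg)
//   {
//       statusLabel.Invoke((MethodInvoker)delegate { statusLabel.Text = msg; });
//   };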
3) Another problem is that they zip the CSV files and put them in our
database. I probably need another DTS package to unzip the data from the
database and write it somewhere as a CSV file.

If the data is zipped and put into a database field, you can just grab the
zip file image as a byte[] from your C# app and then feed it to a zip
library. There are several free or inexpensive zip libraries for .NET.
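
For example, assuming the zip sits in a binary column (table, column, and
parameter names are invented):

using System.Data.SqlClient;
using System.IO;
using ICSharpCode.SharpZipLib.Zip;

class ZipFromDatabase
{
    static void Extract(string connectionString, int uploadId)
    {
        byte[] zipBytes;
        using (SqlConnection conn = new SqlConnection(connectionString))
        using (SqlCommand cmd = new SqlCommand(
            "SELECT ZipData FROM Uploads WHERE UploadId = @id", conn))
        {
            cmd.Parameters.AddWithValue("@id", uploadId);
            conn.Open();
            zipBytes = (byte[])cmd.ExecuteScalar();
        }

        // Feed the in-memory image straight to the zip library; from here the
        // entries are read the same way as with a zip file on disk.
        using (ZipInputStream zis = new ZipInputStream(new MemoryStream(zipBytes)))
        {
            ZipEntry entry;
            while ((entry = zis.GetNextEntry()) != null)
            {
                // write the entry out, or parse it directly
            }
        }
    }
}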
I have some questions about yours:

1) Is your application multi-threaded or not?

No, it scans a drop folder and sequentially processes whatever files it
finds there. A scheduler runs it every 5 minutes. This may or may not be
suitable to your needs; it happens to work for my particular scenario.

Even in .NET, where threads are much simpler, they greatly complicate
testing and debugging, and I use multithreading only when the application
really demands it, which is rarely. After all, running a console app on a
dedicated box, or a background Windows Service, amounts to a dedicated
thread anyway.
2) Do you process CSV files in memory, or do you put them on the file system
and just pass the path to the CSV to your DTS package?

Most of the time I process one record at a time -- one in, one out. My
objective is to convert a variety of source file formats (including CSV)
into a standard CSV format with fields in a particular order. From there it
goes into our database (temp table) for scrubbing and parsing and error
logging, and a certain amount of human intervention as well (our source data
is very poorly validated and some human judgment calls are required). Only
then does it go into the live database.
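
The one-in, one-out conversion boils down to something like this (the field
mapping is invented purely for illustration, and the Split assumes no embedded
commas or quotes):

using System.IO;

class ReorderFields
{
    static void Convert(string inPath, string outPath)
    {
        // Positions of the standard output columns within this particular
        // source format (made-up numbers).
        int[] map = new int[] { 2, 0, 5, 1 };

        using (StreamReader reader = new StreamReader(inPath))
        using (StreamWriter writer = new StreamWriter(outPath))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                string[] src = line.Split(',');
                string[] dest = new string[map.Length];
                for (int i = 0; i < map.Length; i++)
                    dest[i] = src[map[i]];
                writer.WriteLine(string.Join(",", dest));
            }
        }
    }
}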

I have one particular special case where I load the entire file into memory
because of a complex record merge that is a necessary part of the parsing /
conversion process; it's analogous to loading an entire XML document and
selecting data from different parts of the document and rearranging it.
This uses string interning to keep memory consumption manageable, but that
slows things way down. It works, however, and is still acceptably fast. But
this is a really unusual special case and I doubt very much that I would
encounter a need like this once in a decade of work.
3) Some errors, like "record already exists in the database", are the kind of
errors which can't be validated until you actually insert the records into the
database. How would you deal with them? (I have to log those errors too.)

I think I pretty much covered that question in my last answer above.

Let me just say, I get the sense that you are trying to bite off too much of
this problem at once. Don't look at it monolithically. Break it down into
smaller problems:

1) Parse the source CSV
2) Put the parsed data into a temp table in the DB
3) Scrub / validate / log
4) Commit fully validated records to the DB

Write the low-level routines (such as parsing the CSV) for the most part
without regard to what kind of app they will go into or what thread they
will execute in. Then assemble the pieces as appropriate. You may well end
up with two or more cooperating applications. For example, I have one
console app for converting source data into a *standard* CSV format; a
scheduler to run that; another WinForms app that takes the standard CSV, does
the scrubbing / validating, and finally exports the scrubbed data as a CSV,
which is then bulk loaded into the DB by another console app. The
database then has an ASP.NET front end for general management and querying,
and the queries feed into a report server that queues up reporting jobs and
distributes them (Windows Service).

The right tools for each task -- of course "right" is relative, and the
above decisions were the result of unique requirements of a particular
situation.

--Bob
 
Hi Bob,
Thanks for monitoring this thread.
Yes, you're right, I'm kind of complicating it ;-)
You said:

1) Parse the source CSV
2) Put the parsed data into a temp table in the DB
3) Scrub / validate / log
4) Commit fully validated records to the DB

I can't do step 2, because they want it to be validated through the code.
That's why I am looking for a way to validate each line and, as some people
suggest, produce another CSV file which doesn't have those problematic
records in it; then, through a DTS package, that CSV is transformed into the
table. I searched the internet, but there isn't even one example (for
instance, a class) which validates a line in a CSV file. Most people have
written only the parser, which is kind of useless without a validator (in my
case). If you know of somewhere a validator class has been presented, so I
can get a general idea of how to write my validator class and validate each
field value against some sort of rules, I would appreciate it if you pointed
me to it. My lines and rules are like this:

"C1","C2","C3","C4"

C1: could be 1, 2, 3, or 4 (1, 2, 3, 4 are actually defined in a lookup table
in my database)
C2: is a date and its format should be MM/DD/YYYY
C3: is an integer which should be in the range of an integer
C4: shouldn't be null

I have different options, like using regex or an XML schema, but the problem
is the lookup tables (like country codes, marital status, ...) whose values
are used to validate some fields.

Thanks again for your complete and nice answers.
Ali-R

 
shriop,
For best performance, don't design your own; find a product that's already
been made and tuned. It will save you hours of work and hours of support, and
it will probably handle more issues than you can even think of.

Do you know any kind of validator from which I can get some ideas?

Thanks for your help
Ali-R
 
I can't really recommend a validator for you. I'm normally more
interested in how to parse these files using the fastest parser than in
how to validate them. When I have had to validate them, I've either
used regexes or simple hardcoded string parsing. Regexes are the
most flexible, but they're more meant to validate large amounts of
data at once, not simple small string bits that have already been
parsed. I doubt you're going to find any samples out there where
someone is doing all of this well. I would suggest just breaking the
task up into separate pieces, and finding the best approach for each piece.
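
As a sketch of the hand-rolled route, using the C1..C4 rules from earlier in
the thread (this assumes the fields have already been parsed out of the line
and unquoted; the lookup codes are hardcoded here, but would really be loaded
from the lookup table):

using System;
using System.Collections.Generic;
using System.Globalization;

class FieldValidator
{
    // C1 codes would be read from the database lookup table at startup.
    static readonly HashSet<string> C1Codes = new HashSet<string> { "1", "2", "3", "4" };

    public static List<string> Validate(string[] f)
    {
        List<string> errors = new List<string>();

        if (!C1Codes.Contains(f[0]))
            errors.Add("C1: '" + f[0] + "' is not in the lookup table");

        DateTime date;
        if (!DateTime.TryParseExact(f[1], "MM/dd/yyyy",
                CultureInfo.InvariantCulture, DateTimeStyles.None, out date))
            errors.Add("C2: '" + f[1] + "' is not a valid MM/DD/YYYY date");

        int number;
        if (!int.TryParse(f[2], out number))
            errors.Add("C3: '" + f[2] + "' is not an integer in range");

        if (f[3] == null || f[3].Length == 0)
            errors.Add("C4: value is required");

        return errors;
    }
}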
 