There are lots of records in each CSV, with 33 fields in each record. My
problems are as follows:
1) How to design the validator for the best performance?
That is a very broad question and a completely separate issue from parsing
the CSV. There is no one "correct" answer. However, you will be amazed
at how much detailed parsing and checking a finished C# application can
crank out in the blink of an eye. Parsing and rewriting CSVs with
thousands of records takes just seconds including program load time. I
don't even bother to do a release build; it's not worth the effort to
maintain two builds (of course, all of my stuff is for internal release to
one machine).
2) What kind of application to use to host the validator and other
components (Windows Service, Console Application, ...)?
Depends on your needs. Generally a console app is my starting point for
anything like this. My apps tend to stay as console applications, some of
which run via a scheduler. That works fine for me since my apps all run
on dedicated worker machines. With very little extra work you can turn a
debugged console app into a Windows Service, or call the mainline
processing from a WindowsForms application. If you think you will be
talking to a WinForms app to show detailed progress messages, you may want
to feed all your console output through a central routine, which you can
then modify to update the UI properly on the UI thread; otherwise, any
Console.WriteLine() calls are simply ignored in a normal Windows app, since
no console is present at runtime.
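What I mean by a central routine is simply one choke point that all
messages pass through; a rough sketch, with made-up names:

    // Sketch: send all progress text through one delegate so the host decides
    // where it goes -- Console for a console app, a control for a WinForms app.
    using System;

    static class Progress
    {
        // Default sink writes to the console; a WinForms host swaps this out.
        public static Action<string> Sink = Console.WriteLine;

        public static void Report(string message)
        {
            Sink(message);
        }
    }

    // In a WinForms host you might hook it up something like:
    //   Progress.Sink = msg => statusLabel.Invoke((Action)(() => statusLabel.Text = msg));

The low-level code only ever calls Progress.Report(), so it neither knows
nor cares what kind of application it is running in.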
3) Another problem is that they zip the CSV files and put them in our
database. I probably need another DTS package to unzip the data from DTS
and write it somewhere as a CSV file.
If the data is zipped and put into a database field, you can just grab the
zip file image as a byte[] from your C# app and then feed it to a zip
library. There are several free or inexpensive zip libraries for .NET.
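As a rough sketch of that round trip, here is what it might look like with
the framework's System.IO.Compression classes standing in for whatever zip
library you choose; the connection string, table, and column names are
purely hypothetical:

    // Sketch: read the zipped CSV out of a varbinary column and extract it to disk.
    using System;
    using System.Data.SqlClient;
    using System.IO;
    using System.IO.Compression;

    class UnzipSketch
    {
        static void Main()
        {
            byte[] zipBytes;
            using (SqlConnection conn = new SqlConnection("...connection string..."))
            using (SqlCommand cmd = new SqlCommand(
                "SELECT ZipData FROM IncomingFiles WHERE FileId = @id", conn))
            {
                cmd.Parameters.AddWithValue("@id", 1);
                conn.Open();
                zipBytes = (byte[])cmd.ExecuteScalar();
            }

            using (MemoryStream ms = new MemoryStream(zipBytes))
            using (ZipArchive archive = new ZipArchive(ms, ZipArchiveMode.Read))
            {
                foreach (ZipArchiveEntry entry in archive.Entries)
                    entry.ExtractToFile(Path.Combine(@"C:\drop", entry.Name), true);
            }
        }
    }

Once the entries are sitting on disk as plain CSV files, the rest of the
pipeline doesn't need to know they ever were zipped.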
I have some questions about yours:
1) Is your application multi-threaded or not?
No, it scans a drop folder and sequentially processes whatever files it
finds there. A scheduler runs it every 5 minutes. This may or may not be
suitable to your needs; it happens to work for my particular scenario.
Even in .NET, where threads are much simpler, they greatly complicate
testing and debugging, and I use multithreading only when the application
really demands it, which is rarely. After all, running a console app on a
dedicated box, or a background Windows Service, amounts to a dedicated
thread anyway.
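The drop-folder loop itself is about as simple as it sounds; roughly (the
paths and the processing call are placeholders):

    // Sketch: pick up whatever CSVs are sitting in the drop folder, process
    // them one at a time, then move them out of the way.
    using System;
    using System.IO;

    class DropFolderSketch
    {
        static void Main()
        {
            foreach (string path in Directory.GetFiles(@"C:\drop", "*.csv"))
            {
                Console.WriteLine("Processing {0}", path);
                ProcessFile(path);   // conversion / validation goes here
                File.Move(path, Path.Combine(@"C:\drop\done", Path.GetFileName(path)));
            }
        }

        static void ProcessFile(string path)
        {
            // ... parse, validate, convert ...
        }
    }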
2) Do you process CSV files in memory, or do you put them on the file
system and just pass the path to the CSV to your DTS package?
Most of the time I process one record at a time -- one in, one out. My
objective is to convert a variety of source file formats (including CSV)
into a standard CSV format with fields in a particular order. From there
it goes into our database (temp table) for scrubbing and parsing and error
logging, and a certain amount of human intervention as well (our source
data is very poorly validated and some human judgment calls are required).
Only then does it go into the live database.
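In sketch form, that one-in, one-out conversion is just a streaming loop;
the source-to-standard field mapping below is invented, and real CSV
parsing has to cope with quoted fields:

    // Sketch: stream the source CSV through and write a standard CSV with the
    // fields in the required order; assumes no quoted commas inside fields.
    using System.IO;

    class ConvertSketch
    {
        static void Main()
        {
            using (StreamReader reader = new StreamReader(@"C:\drop\source.csv"))
            using (StreamWriter writer = new StreamWriter(@"C:\drop\standard.csv"))
            {
                string line;
                while ((line = reader.ReadLine()) != null)
                {
                    string[] f = line.Split(',');
                    // Hypothetical mapping: standard order is ID, Name, Date,
                    // pulled from source columns 2, 0 and 5.
                    writer.WriteLine("{0},{1},{2}", f[2], f[0], f[5]);
                }
            }
        }
    }

Because only one record is in memory at a time, file size is essentially
irrelevant.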
I have one particular special case where I load the entire file into
memory because of a complex record merge that is a necessary part of the
parsing / conversion process; it's analogous to loading an entire XML
document and selecting data from different parts of the document and
rearranging it. This uses string interning to keep memory consumption
manageable, but that slows things way down. It works, however, and is still
acceptably fast. But this is a really unusual special case and I doubt
very much that I would encounter a need like this once in a decade of
work.
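The interning itself is nothing exotic; each field value goes through
string.Intern() as the record is loaded, so repeated values share a single
instance. For example:

    // Sketch: intern field values on load so a file held entirely in memory
    // doesn't carry thousands of duplicate copies of the same strings.
    using System;

    class InternSketch
    {
        static string[] LoadRecord(string line)
        {
            string[] fields = line.Split(',');
            for (int i = 0; i < fields.Length; i++)
                fields[i] = string.Intern(fields[i]);
            return fields;
        }

        static void Main()
        {
            string[] r = LoadRecord("A,B,A,A");
            Console.WriteLine(ReferenceEquals(r[0], r[2]));  // True: one shared instance
        }
    }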
3) Some errors, like "record already exists in database", are the kind of
errors which can't be validated till you actually insert them into the
database. How would you deal with them? (I have to log those errors too.)
I think I pretty much covered that question in my last answer above.
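If you do insert directly and let the database catch the duplicate, it is
just a matter of trapping the failure and logging it instead of treating it
as fatal. A sketch, with made-up table and column names and an assumed
already-open connection:

    // Sketch: attempt the insert, and log a duplicate-key failure rather than
    // letting it kill the run. Details here are illustrative only.
    using System;
    using System.Data.SqlClient;

    class InsertSketch
    {
        static void InsertRecord(SqlConnection conn, string id, string name)
        {
            using (SqlCommand cmd = new SqlCommand(
                "INSERT INTO Customers (Id, Name) VALUES (@id, @name)", conn))
            {
                cmd.Parameters.AddWithValue("@id", id);
                cmd.Parameters.AddWithValue("@name", name);
                try
                {
                    cmd.ExecuteNonQuery();
                }
                catch (SqlException ex)
                {
                    if (ex.Number == 2627)   // SQL Server: primary key / unique constraint violation
                        Console.WriteLine("Duplicate record logged and skipped: {0}", id);
                    else
                        throw;
                }
            }
        }
    }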
Let me just say, I get the sense that you are trying to bite off too much
of this problem at once. Don't look at it monolithically. Break it down
into smaller problems (sketched in skeleton form after the list):
1) Parse the source CSV
2) Put the parsed data into a temp table in the DB
3) Scrub / validate / log
4) Commit fully validated records to the DB
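In skeleton form, that breakdown is four routines that any host -- console
app, Windows Service, or WinForms front end -- can call in sequence;
everything here is a placeholder:

    // Sketch: each stage is its own routine with no knowledge of the hosting
    // app, so the same code runs from a console app, a service, or a GUI.
    using System.Collections.Generic;
    using System.IO;

    class ImportPipeline
    {
        public void Run(string csvPath)
        {
            List<string[]> records = ParseSourceCsv(csvPath);  // 1) parse
            LoadIntoTempTable(records);                        // 2) temp table
            ScrubAndLog();                                     // 3) scrub / validate / log
            CommitValidRecords();                              // 4) commit validated records
        }

        List<string[]> ParseSourceCsv(string path)
        {
            List<string[]> records = new List<string[]>();
            foreach (string line in File.ReadAllLines(path))
                records.Add(line.Split(','));
            return records;
        }

        void LoadIntoTempTable(List<string[]> records) { /* write to a temp table */ }
        void ScrubAndLog()                             { /* validation + error logging */ }
        void CommitValidRecords()                      { /* move good rows to live tables */ }
    }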
Write the low level routines (such as parsing the CSV) for the most part
without regard to what kind of app they will go into or what thread they
will execute in. Then assemble the pieces as appropriate. You may well
end up with two or more cooperating applications. For example, I have one
console app for converting source data into a *standard* CSV format; a
scheduler to run that; a WinForms app that takes the standard CSV and
does the scrubbing / validating and then finally exports the scrubbed data
as a CSV, which is then bulk loaded into the DB by another console app.
The database then has an ASP.NET front end for general management and
querying, and the queries feed into a report server that queues up
reporting jobs and distributes them (Windows Service).
The right tools for each task -- of course "right" is relative, and the
above decisions were the result of unique requirements of a particular
situation.
--Bob