Large text file import: MVP question

Hi

I need to read a large set of text files (9 GB+ each) into a database
table, splitting each record on fixed-width field lengths.

There are several ways to complete this, but I am wondering if anyone has
insight into the FASTEST way to read in the data and chop it up ahead of
posting it into the DB.

So far, things work, but they are far slower than expected.

Do I need to consider a VC++ app to get a large improvement here?

thx
Avi
 
An MVP question... what is MVP about it?

As you do not provide enough information, it is impossible to give you the
answer you probably expect.

Nice things to know are:

how is the data delivered? (are conversions necessary?)
what is the target database?
what do you currently use?
etc.

To give you an idea of what is possible:

A year ago I had the task of importing MySQL dump files into MS SQL Server 2000.
These files were roughly 7 GB each and contained the MySQL dialect, so using DTS
was not an option: each file had a few million data rows and then a block of DDL
SQL in the MySQL dialect, and so on.

So I wrote a simple tool in VB.NET 2005 containing a file reader, a StringBuilder
and a SQL command object. I read the file line by line and first collected all
the DDL SQL, converted that dialect to MS T-SQL and executed it on the command
object (so now I had all the required tables). Then I read the file again and
pulled in all the data blocks (INSERT statements, translated on the fly),
executing them in batches of 1000 rows at a time. The whole import took
3 minutes; my colleague set up a MySQL server, connected with ODBC, and the
same operation took 4 hours :-)
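For illustration, a minimal sketch of that line-by-line, batched approach (the
connection string, file path and the INSERT-only filter here are made up, not
the ones from the actual tool):

Imports System.IO
Imports System.Text
Imports System.Data.SqlClient

Module BatchedImport

    ' Read the dump file line by line and execute the INSERT statements
    ' in batches, instead of one round trip per row.
    Sub Main()
        Const BatchSize As Integer = 1000

        Using cn As New SqlConnection("Server=.;Database=ImportDb;Integrated Security=SSPI")
            cn.Open()

            Using reader As New StreamReader("C:\data\dump.sql")
                Dim batch As New StringBuilder()
                Dim rowsInBatch As Integer = 0
                Dim line As String = reader.ReadLine()

                While line IsNot Nothing
                    If line.StartsWith("INSERT", StringComparison.OrdinalIgnoreCase) Then
                        ' The real tool translated the MySQL dialect here first.
                        batch.AppendLine(line)
                        rowsInBatch += 1
                        If rowsInBatch = BatchSize Then
                            ExecuteBatch(cn, batch)
                            rowsInBatch = 0
                        End If
                    End If
                    line = reader.ReadLine()
                End While

                ' Flush whatever is left over.
                If batch.Length > 0 Then
                    ExecuteBatch(cn, batch)
                End If
            End Using
        End Using
    End Sub

    Private Sub ExecuteBatch(ByVal cn As SqlConnection, ByVal batch As StringBuilder)
        Using cmd As SqlCommand = cn.CreateCommand()
            cmd.CommandText = batch.ToString()
            cmd.ExecuteNonQuery()
        End Using
        batch.Length = 0   ' Reset the StringBuilder for the next batch.
    End Sub

End Module

The batching is what makes the difference: one ExecuteNonQuery per 1000 rows
instead of one per row.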


So, I hope this gives you some ideas.

regards

Michel Posseth [MCP]
 
There are several ways to complete this, but I am wondering if anyone
has insight into the FASTEST way to read in the data and chop it up
ahead of posting it into the DB.

You can do this via SQL BCP, Bulk Inserts or DTS.

We have a project in which we build a dynamic DTS package via VB.NET to do
imports. The advantage of this is that the DTS graph can be dynamically
built to handle a variety of file layouts. DTS can also handle special
characters during import, something Bulk Insert doesn't do. With BCP
you'll need to create a format file, which is a bit of a hassle if your file
import specifications change a lot.
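For a fixed-width layout the bulk route still needs a format file describing
the columns, but the load itself can be driven from VB.NET with a single
command. A rough sketch (table name, paths and format file are hypothetical;
note that BULK INSERT resolves the file paths on the SQL Server machine, not
on the client):

Imports System.Data.SqlClient

Module BulkLoad
    Sub Main()
        ' Hypothetical staging table, data file and non-XML format file
        ' describing the fixed-width columns.
        Dim sql As String = _
            "BULK INSERT dbo.ImportStage " & _
            "FROM 'C:\data\bigfile.txt' " & _
            "WITH (FORMATFILE = 'C:\data\bigfile.fmt', BATCHSIZE = 50000, TABLOCK)"

        Using cn As New SqlConnection("Server=.;Database=ImportDb;Integrated Security=SSPI")
            cn.Open()
            Using cmd As New SqlCommand(sql, cn)
                cmd.CommandTimeout = 0   ' A multi-gigabyte load will blow past the default 30 seconds.
                cmd.ExecuteNonQuery()
            End Using
        End Using
    End Sub
End Module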
 
Avi,

For this kind of question you could just as well ask a plumber: water cannot
go through a pipe any faster than it goes through the narrowest part of the
pipe.

The same goes for data. So in my opinion you should always concentrate on
that narrowest part; any other optimisation is a waste of time.

That the workstation software is too slow is, in my experience, seldom the
case, unless you have old workstations (P3 < 600 MHz).

That the network traffic is slow is a fair possibility; it depends entirely
on the topology of the network and its throughput, and with big chunks you
usually run into trouble sooner.

The server is usually the place that is in trouble. That is where I would
look first, and I would try to move as much work as possible to the client.

As for the last part, what Michel wrote is exactly how I think about it.

This is of course assuming that it is not Citrix or another terminal server
setup.

Just my thought,

Cor
 
With all due respect to each of the respondents who have assumed that I am
asking a simple question, let me be more explicit:

I am not interested in a simple answer; been there, done that: SQL BCP/DTS,
file/stream readers and writers, splits, substrings, Mids, the whole nine
yards.

I am testing this on a quad Xeon with 4 GB RAM and a dual fibre channel SAN
array, not some small home PC. Network issues have been accounted for (and
are not part of the baseline improvement that I am trying to find).

The issue is simply this: what is the fastest way to read in a fixed-width
text file and split it apart into its constituent fields? Don't worry about
the return trip to the DB; that's irrelevant for now. So far the benchmark
has been OK, but it's not brilliant: the disk I/O is well short of the
hardware's capabilities, and CPU and memory are barely in use. That means
either our code is crap or the .NET capability is crap.

We have simplified our code to a simple, basic test, so now I'm focusing on
the underlying .NET solutions.
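For reference, the kind of stripped-down baseline meant here looks roughly
like the sketch below (file name and block size are arbitrary): it measures
pure sequential read throughput with no parsing at all, which separates the
raw I/O ceiling from the cost of splitting fields.

Imports System.IO
Imports System.Diagnostics

Module RawReadBaseline

    ' Measure pure sequential read speed, no parsing at all. If this
    ' number is already far below what the SAN can deliver, the field
    ' splitting is not the first problem to chase.
    Sub Main()
        Const BlockSize As Integer = 4 * 1024 * 1024   ' 4 MB blocks (arbitrary).
        Dim buffer(BlockSize - 1) As Byte
        Dim totalBytes As Long = 0
        Dim sw As Stopwatch = Stopwatch.StartNew()

        Using fs As New FileStream("C:\data\bigfile.txt", FileMode.Open, _
                                   FileAccess.Read, FileShare.Read, BlockSize, _
                                   FileOptions.SequentialScan)
            Dim bytesRead As Integer = fs.Read(buffer, 0, buffer.Length)
            While bytesRead > 0
                totalBytes += bytesRead
                bytesRead = fs.Read(buffer, 0, buffer.Length)
            End While
        End Using

        sw.Stop()
        Console.WriteLine("{0:N0} MB in {1:N1} s = {2:N1} MB/s", _
                          totalBytes \ (1024 * 1024), _
                          sw.Elapsed.TotalSeconds, _
                          (totalBytes / 1024.0 / 1024.0) / sw.Elapsed.TotalSeconds)
    End Sub

End Module

If the raw number is close to the hardware limit but the full program is not,
the time is going into the parsing and string handling, not the disk.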

I'm looking for real performance answers. If you don't have them, fine, I'll
look elsewhere. I just thought a high-end question would require a high-end
programmer with high-end experience.

Hope this clears it up. If not, perhaps you can recommend a plumber who
might know something....... :)

Avi
 
Read my response to Cor. This is very MVP.


 
Avi,

And still, the answer should be simple. Simple solutions give fast
applications.

One thing you have to keep in mind: even something that looks like a single
IL instruction can mean thousands of loop iterations underneath. The 8086
processor family and its descendants are not so advanced that they can do
these things in one cycle, so behind your code there is a lot of looping
going on. Methods from the Microsoft.VisualBasic namespace therefore
generally cost more time than their counterparts in the core .NET
namespaces. Normally that is no problem; humans cannot see milliseconds.

The second thing you have to keep in mind, since you talk about splitting,
is that the normal string is not mutable. This means that every change,
with Mid, Split, ToLower or whatever, results in a new string. Overdoing
this can of course cost you a lot of time.
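To make that concrete, a typical fixed-width split looks like the sketch
below (the field widths are invented). Every Substring call produces a new
string object, so a multi-gigabyte file with several fields per line easily
produces hundreds of millions of short-lived strings:

Imports System.IO

Module FixedWidthSplit

    ' Invented layout: three fixed-width fields of 10, 25 and 8 characters.
    Private ReadOnly Widths() As Integer = New Integer() {10, 25, 8}

    Sub Main()
        Using reader As New StreamReader("C:\data\bigfile.txt")
            Dim line As String = reader.ReadLine()
            While line IsNot Nothing
                Dim fields(Widths.Length - 1) As String
                Dim pos As Integer = 0

                For i As Integer = 0 To Widths.Length - 1
                    ' Each Substring allocates a brand-new string.
                    fields(i) = line.Substring(pos, Widths(i))
                    pos += Widths(i)
                Next

                ' ... hand the fields off to the database layer here ...
                line = reader.ReadLine()
            End While
        End Using
    End Sub

End Module

Each individual call is cheap; it is the sheer number of allocations, and the
garbage collection behind them, that adds up.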

But giving you any reasonable answer without seeing even a piece of your
current code is in my opinion impossible. If it were possible, there would
not be so many possibilities to choose from.

By the way, you have hurt your chances of an answer considerably by writing
that this is an MVP question. Somebody who is dedicated to your problem but
is not an MVP has probably already said... two words, mostly said with the
middle finger in the air.

I hope this helps,

Cor
 
Avi said:
Read my response to Cor. This is very MVP.

No it isn't; I have never seen an MVP signature on a post from Francesco
Balena, for instance :-)

Having read Cor's answer, I can only fully concur with him.

By the way, I believe I have earned my stripes as a professional programmer
(I have made programs for thousands of users throughout Europe). In my
previous job I worked at an automotive data provider (the market leader in
their business), where I was not only a lead developer but also the SQL
Server (2000, 2005) database administrator.

One of my responsibilities was optimizing data conversions (imports) of
externally delivered data (MySQL, flat text files, etc.).

I worked only with PowerEdges with at least 8 GB of memory in RAID 10 (and
no, I did not have just one of them; we had, I believe, 25 of those beasts
in our server room, besides the other servers the company had).

So, having finished beating my chest, I can tell you that nobody, MVP, MCP,
MCAD, MCSD, Steve or even Bill himself, can answer your question correctly
unless you provide some detailed info, and you did not do that... so...

regards

Michel Posseth
just a Software developer



 
Thanks for the incoherent response... I guess if I look beyond the defensive
language, I'll find something useful... Oh. I can't.

In many other posts to other newsgroups, if I have not posted enough info, I
simply get asked for it; I thought that was the idea of newsgroups in the
first place. I have never before been so accosted for asking a question the
wrong way.

Time to get off your high perch. I'm not perfect. And I'll bet you aren't
either - MVP or not.
 
¤ Hi
¤
¤ I need to read in a large set of text files (9GB+ each) into a database
¤ table based on fixed width lengths.
¤
¤ There are several ways to complete this, but I am wondering if anyone has
¤ insight into the FASTEST way to read in the data and chop it up ahead of
¤ posting it into the DB.
¤
¤ So far, things work, but they are far slower than expected.
¤
¤ Do I need to consider a VC++ app to get a large improvement here?

I don't know what type of database you are working with, but the fastest methods are typically the
native database bulk operations, as Spam Catcher mentioned. If you process a 9 GB file line by line,
it's going to take a while regardless of what programmatic method you use.


Paul
~~~~
Microsoft MVP (Visual Basic)
 