Dealing with large text files

Guest

Hello all:

I have a situation where I need to read a text file containing several
million rows (insurance eligibility files). In addition to sequential
operations, I also need to support a 'seek' on the file. The file itself is
not in a fixed-field format, and each line can be a different length. I
obviously don't want to simply start at the top of the file and read lines
until I hit the requested index.

What other options do I have?
 
You can open a file stream that is seekable, but you haven't specified how
you want to "seek" in the file. How do you know what you're looking for?

--
HTH,

Kevin Spencer
Chicken Salad Surgeon
Microsoft MVP
 
Oops ... sorry.

This is for a library that will have a few overloads of the seek method. A
typical seek might involve the calling context asking for the line at index
position 567778. I've done similar things in the past, but I've had the
luxury of fixed-size file formats where I can determine the length of each
line and position the file pointer directly at (LENGTH_PER_LINE * INDEX).
Of course, that approach works with fixed-length lines but becomes a problem
when the file's fields are delimited and the lines are therefore variable
length.
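
In code that looked roughly like this (sketch only; lengthPerLine and
lineIndex stand in for whatever the fixed format actually defined):

    using System.IO;
    using System.Text;

    static string ReadFixedLengthLine(string path, int lengthPerLine, int lineIndex)
    {
        using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read))
        {
            // With every record the same size, line N starts at byte N * lengthPerLine.
            fs.Seek((long)lineIndex * lengthPerLine, SeekOrigin.Begin);

            var buffer = new byte[lengthPerLine];
            int read = fs.Read(buffer, 0, buffer.Length);
            return Encoding.ASCII.GetString(buffer, 0, read).TrimEnd('\r', '\n');
        }
    }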

I am aware of seekable file streams but determining how to position the file
pointer is the biggest issue imo. There is a little more to it than I've
described but this is the main issue.

Prior to performing the seek operation I do have some info about the file.
Namely, I do a pre-parse that determines the count of lines (although not the
length of each line of course :)).

I'm thinking a possible solution might be to create a collection during the
pre-parse that stores each new line's byte position in a generic dictionary,
so when the caller seeks I can just look up the requested index, pull that
line's starting position from the dictionary, and read until I hit the next
dictionary item's starting position.
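
Roughly what I have in mind (untested sketch; ReadLineAt, the dictionary,
and lineCount are just illustrative names, with the dictionary filled in
during the pre-parse):

    using System.Collections.Generic;
    using System.IO;
    using System.Text;

    // offsets[i] = byte position where line i starts; lineCount comes from the pre-parse.
    static string ReadLineAt(string path, Dictionary<int, long> offsets, int lineCount, int index)
    {
        using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read))
        {
            long start = offsets[index];
            // The line runs to the start of the next line, or to end-of-file for the last line.
            long end = (index + 1 < lineCount) ? offsets[index + 1] : fs.Length;

            var buffer = new byte[end - start];
            fs.Seek(start, SeekOrigin.Begin);

            int read = 0;
            while (read < buffer.Length)
            {
                int n = fs.Read(buffer, read, buffer.Length - read);
                if (n == 0) break;
                read += n;
            }
            return Encoding.UTF8.GetString(buffer, 0, read).TrimEnd('\r', '\n');
        }
    }

Since the keys would just be 0 through lineCount - 1, a List<long> would do
the same job as the dictionary with a bit less overhead.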

Any thoughts?
 
DCW said:
[...]
I'm thinking a possible solution might be to create a collection during the
pre-parse that stores each new line's byte position in a generic dictionary,
so when the caller seeks I can just look up the requested index, pull that
line's starting position from the dictionary, and read until I hit the next
dictionary item's starting position.

Basically you need to create your own index into the file: for each line in
the file, record its (index, offset) pair.
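
Something like this for the build pass (just a sketch; it counts bytes
itself instead of trusting StreamReader, whose buffering means
BaseStream.Position won't match the lines you've read):

    using System.Collections.Generic;
    using System.IO;

    // One pass over the file: record the byte offset at which each line starts.
    static List<long> BuildLineIndex(string path)
    {
        var offsets = new List<long>();
        using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.Read, 1 << 16))
        {
            long pos = 0;
            bool atLineStart = true;
            int b;
            while ((b = fs.ReadByte()) != -1)   // FileStream buffers internally, so ReadByte is tolerable here
            {
                if (atLineStart)
                {
                    offsets.Add(pos);
                    atLineStart = false;
                }
                if (b == '\n')                  // LF ends a line; CRLF is covered since LF is its last byte
                    atLineStart = true;
                pos++;
            }
        }
        return offsets;
    }

offsets.Count then gives you the line count from the same pass, and
offsets[i] is the seek target for line i.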
 
DCW wrote:
[...]
I'm thinking a possible solution might be to create a collection during the
pre-parse that stores each new line's byte position in a generic dictionary,
so when the caller seeks I can just look up the requested index, pull that
line's starting position from the dictionary, and read until I hit the next
dictionary item's starting position.

Any thoughts?

The solution you suggest here strikes me as being the simplest and
probably the most efficient way of solving your problem.

Chris.
 
Thanks guys. I was really looking for confirmation, but if someone had a
novel way I'd never thought of, that would have been welcome too. Either
way, I do appreciate the responses.

D


Chris Shepherd said:
The solution you suggest here strikes me as being the simplest and
probably the most efficient way of solving your problem.

Chris.
 
I would import the thing into SQL Server (SQL Compact, SQL Express,
whatever...) and do your operations on that. When you're done, just drop the
database.

You could import the data through code, or via an SSIS package.

It's (obviously) going to depend on your use cases. Importing the file will
take a while, so it comes down to how many seek operations you're going to
be doing versus the time it takes to import the file into SQL.
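
Rough idea of the code-based import (a sketch only; the EligibilityLines
table, its columns, and the connection string are made up, and with several
million rows you'd flush in batches rather than build one giant DataTable):

    using System.Data;
    using System.Data.SqlClient;
    using System.IO;

    // Stage the file in a table keyed by line number; a "seek" is then just
    // SELECT line_text FROM EligibilityLines WHERE line_number = @n
    static void ImportFile(string path, string connectionString)
    {
        var table = new DataTable();
        table.Columns.Add("line_number", typeof(int));
        table.Columns.Add("line_text", typeof(string));

        int i = 0;
        foreach (string line in File.ReadLines(path))
            table.Rows.Add(i++, line);

        using (var bulk = new SqlBulkCopy(connectionString))
        {
            bulk.DestinationTableName = "EligibilityLines";
            bulk.WriteToServer(table);
        }
    }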
 