
Dealing with large text files

 
 
DCW
 
      14th Nov 2007
Hello all:

I have a situation where I need to read a text file containing several
million rows (insurance eligibility files). In addition to sequential
operations, I also need to support a 'seek' on the file. The file itself is
not in a fixed-field format, and each line can be a different length. I
obviously don't want to simply start at the top of the file and read lines
until I hit the requested index.

What other options do I have?
 
 
 
 
 
Kevin Spencer
 
      14th Nov 2007
You can open a file stream that is seekable, but you haven't specified how
you want to "seek" in the file. How do you know what you're looking for?

--
HTH,

Kevin Spencer
Chicken Salad Surgeon
Microsoft MVP



 
 
DCW
 
      14th Nov 2007
Oops ... sorry.

This is for a library that will have a few overloads of the seek method. A
typical seek might involve the calling context asking for the line at index
position 567778. I've done similar things in the past, but I've had
the luxury of fixed-size file formats where I know the length of
each line and can position the file pointer directly at (LENGTH_PER_LINE *
INDEX). Of course, this works with fixed-length lines but breaks down
when the file's fields are 'delimited' and thus variable length.
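
For reference, the fixed-length case reduces to one multiplication. Below is a minimal sketch in Python (the thread targets .NET, but the arithmetic is the same), assuming every record occupies exactly `record_len` bytes including its line terminator. `read_fixed_record` and its parameters are hypothetical names, not part of any library.

```python
import io

def read_fixed_record(stream, index, record_len):
    """Return record number `index` (0-based) from a stream of fixed-size records.

    Assumes every record, including its line terminator, is exactly
    `record_len` bytes long.
    """
    stream.seek(index * record_len)      # byte offset = index * record length
    return stream.read(record_len)

# Three 6-byte records (5 data bytes + "\n").
data = io.BytesIO(b"AAAAA\nBBBBB\nCCCCC\n")
record = read_fixed_record(data, 2, 6)
```

With variable-length lines there is no such closed-form offset, which is why some kind of lookup table becomes necessary.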

I am aware of seekable file streams but determining how to position the file
pointer is the biggest issue imo. There is a little more to it than I've
described but this is the main issue.

Prior to performing the seek operation I do have some info about the file.
Namely, I do a pre-parse that determines the count of lines (although not the
length of each line, of course).

I'm thinking a possible solution might be to create a collection during
pre-parse that stores each line's starting byte position in a generic
dictionary keyed by line index, so when the caller seeks I can look up the
requested index, pull the starting position from the dictionary, seek to it,
and read until I hit the next dictionary item's starting position.
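
The dictionary idea described above could look roughly like the following. This is a sketch in Python rather than the .NET code the library would use, and `build_line_index` and `read_line_at` are hypothetical names. It assumes the file is opened in binary mode so that offsets are true byte positions and lines end in `\n`.

```python
import io

def build_line_index(stream):
    """Scan the stream once and map line number (0-based) -> starting byte offset."""
    index = {}
    stream.seek(0)
    offset = 0
    for line_no, line in enumerate(stream):   # binary mode: splits on b"\n"
        index[line_no] = offset
        offset += len(line)                   # len() counts the terminator too
    return index

def read_line_at(stream, index, line_no):
    """Seek straight to the requested line and read it."""
    stream.seek(index[line_no])
    return stream.readline()

data = io.BytesIO(b"short\na much longer line\nmid\n")
index = build_line_index(data)
line = read_line_at(data, index, 1)
```

Because `readline` stops at the terminator, this version never needs the next entry's offset. Consulting the next entry matters only if the line should be fetched with a single sized read.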

Any thoughts?


 
 
Bill Butler
 
      14th Nov 2007


Basically, you need to create your own index into the file:
for each line in the file, record the (index, offset) pair.
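
Since the line numbers are consecutive integers, the (index, offset) pairs can be held in a flat array where the position in the array is the line index. The Python sketch below takes that route as one possible design, not necessarily what the poster had in mind. `build_offsets` and `read_line` are hypothetical names, and the stream is assumed to be opened in binary mode.

```python
import io
from array import array

def build_offsets(stream):
    """Return an array where offsets[i] is the starting byte of line i.

    One extra entry at the end holds the total size, so the length of
    line i is offsets[i + 1] - offsets[i].
    """
    offsets = array("Q", [0])            # unsigned 64-bit integers
    stream.seek(0)
    for line in stream:
        offsets.append(offsets[-1] + len(line))
    return offsets

def read_line(stream, offsets, i):
    """Fetch line i with one seek and one sized read."""
    stream.seek(offsets[i])
    return stream.read(offsets[i + 1] - offsets[i])

data = io.BytesIO(b"id,name\n1,Ann\n22,Robert\n")
offsets = build_offsets(data)
row = read_line(data, offsets, 2)
```

A packed array costs roughly 8 bytes per line on typical platforms, so a few million lines stay in the tens of megabytes. The line count falls out as `len(offsets) - 1`.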


 
 
Chris Shepherd
 
      14th Nov 2007
DCW wrote:
[...]
> I'm thinking a possible solution might be to create a collection during
> pre-parse that stores each line's starting byte position in a generic
> dictionary keyed by line index, so when the caller seeks I can look up the
> requested index, pull the starting position from the dictionary, seek to it,
> and read until I hit the next dictionary item's starting position.
>
> Any thoughts?


The solution you suggest here strikes me as being the simplest and
probably the most efficient way of solving your problem.

Chris.
 
 
DCW
 
      14th Nov 2007
Thanks guys, I was really looking for confirmation, but if someone had a novel
way I'd never thought of, that would have been welcome too. Either way, I do
appreciate the responses.

D

 
 
Chris Mullins [MVP - C#]
 
      14th Nov 2007
I would import the thing into SQL Server (SQL Compact, SQL Express,
whatever) and do your operations on that. When you're done, just drop the
database.

You could import the data through code, or via an SSIS package.

It's going to (obviously) depend on your use cases. Importing the file will
take a bit, so it depends on how many search operations you're going
to do versus the time it takes to import the file into SQL.
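
As a rough illustration of the import-then-query approach, the sketch below uses SQLite from Python's standard library as a stand-in for SQL Server, purely so the example is self-contained. The table name `lines`, its columns, and both function names are hypothetical.

```python
import io
import sqlite3

def import_lines(conn, stream):
    """Load every line into a table keyed by its 0-based line number."""
    conn.execute("CREATE TABLE lines (line_no INTEGER PRIMARY KEY, content TEXT)")
    rows = ((i, line.rstrip("\n")) for i, line in enumerate(stream))
    conn.executemany("INSERT INTO lines VALUES (?, ?)", rows)
    conn.commit()

def fetch_line(conn, line_no):
    """Return the text of one line, or None if the line number is out of range."""
    row = conn.execute(
        "SELECT content FROM lines WHERE line_no = ?", (line_no,)
    ).fetchone()
    return row[0] if row else None

conn = sqlite3.connect(":memory:")       # temporary database, discarded on close
import_lines(conn, io.StringIO("alpha|1\nbeta|22\ngamma|333\n"))
result = fetch_line(conn, 1)
conn.close()
```

The import is a full pass over the file, the same cost as building an offset index, so the database route mainly pays off when the lookups go beyond "line N", for example searching on a field value.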

--
Chris Mullins



 