Large text file - in memory (>60 MB)


Jurjen de Groot

My program needs to search a large text file (>60 MB).

At the moment I'm using a StreamReader to read the file into a string
variable (objString = sr.ReadToEnd). Before reading the file, the process
running my program uses about 10 MB; after reading the text file into the
string, it uses over 200 MB. I would expect the program to use between 70
and 100 MB.

Is there a more efficient way of storing this data in memory and still be
able to search through it?


TIA,
Jurjen.
 

Carl Daniel [VC++ MVP]

Jurjen de Groot said:
My program needs to search a large text file (>60 MB).

At the moment I'm using a StreamReader to read the file into a string
variable (objString = sr.ReadToEnd). Before reading the file, the process
running my program uses about 10 MB; after reading the text file into the
string, it uses over 200 MB. I would expect the program to use between 70
and 100 MB.

Is there a more efficient way of storing this data in memory and still be
able to search through it?

Keep in mind that .NET strings are UTF-16, so reading an ANSI text file will
typically double the size in bytes.

If you read the file into a byte array, you'll use less memory, but you
won't be able to use the .NET string searching facilities (e.g.
System.String member functions, regular expressions, etc.).

Depending on your searching requirements, you might be able to use a simpler
search facility, such as the one in this article:

http://www.codeproject.com/cs/algorithms/BoyerMooreSearch.asp

(but you'd have to modify the code to search a byte array instead of a char
array). If your file is MBCS or UTF-8, you're likely best off just sticking
with the .NET string classes.
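For illustration, here's a minimal sketch of such a byte-array search (a
Boyer-Moore-Horspool variant, a simplified relative of the article's code;
the class name, file name, and search term are mine, not the article's):

using System;
using System.IO;
using System.Text;

static class ByteSearch
{
    // Boyer-Moore-Horspool over a byte array: returns the index of the
    // first occurrence of 'pattern' in 'data', or -1 if not found.
    public static int IndexOf(byte[] data, byte[] pattern)
    {
        if (pattern.Length == 0 || pattern.Length > data.Length)
            return -1;

        // Bad-character table: how far we may shift the pattern when the
        // byte aligned with its last position doesn't yield a match.
        int[] shift = new int[256];
        for (int i = 0; i < 256; i++)
            shift[i] = pattern.Length;
        for (int i = 0; i < pattern.Length - 1; i++)
            shift[pattern[i]] = pattern.Length - 1 - i;

        int pos = 0;
        while (pos <= data.Length - pattern.Length)
        {
            int j = pattern.Length - 1;
            while (j >= 0 && data[pos + j] == pattern[j])
                j--;
            if (j < 0)
                return pos; // full match
            pos += shift[data[pos + pattern.Length - 1]];
        }
        return -1;
    }
}

// Usage (file name and search term hypothetical):
//   byte[] data = File.ReadAllBytes("records.txt");
//   int hit = ByteSearch.IndexOf(data, Encoding.ASCII.GetBytes("2742281"));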

-cd
 

Greg Young

As CD has said, this is expected, since strings are UTF-16. My question
would be how you are searching this file.

Are you just doing keyword searches? Depending on the type of search you
might be much better off doing something like building an index of the file
and loading the index into memory.

Cheers,

Greg
 

Jurjen de Groot

Greg,

The text file consists of records of 80 characters separated by a newline.
These records all have a record type, 1 through 9; a set of records starts
with record type 1 and ends with record type 9, at which point the next set
starts with record type 1 again.

I search the contents of the file for the criteria entered by the user,
e.g. 2742281. When I find this sequence, I have to make sure it's found at
exactly the right position within the record, so that I know I've compared
it against the right field. Then I have to show the record found (which
should be record type 1) and show all records until I find record type 9
(or EOF).
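That positional check could look something like this (the key field's
column is hypothetical):

// True when the user's criteria matches the key field at its fixed
// position within an 80-character record (fieldOffset is made up here;
// use the field's real column).
static bool MatchesKeyField(string record, string criteria)
{
    const int fieldOffset = 10;
    return record.Length >= fieldOffset + criteria.Length
        && string.CompareOrdinal(record, fieldOffset, criteria, 0, criteria.Length) == 0;
}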

I could create an index, but that would complicate the app. I also thought
of creating a DataTable to ease the search, but I'm pretty sure memory
consumption would be even worse...

I was just wondering why the current app is consuming so much memory, which
is now clear to me. I guess my client will have to make the decision: a
cheap app that uses a lot of memory, or a somewhat more expensive app that
uses less.


Regards,
Jurjen.
 

Morten Wennevik

Hi Jurjen,

Sounds to me like you could just use ReadLine() and do a search per
record. You should use the encoding used in the file; if you don't
specify an encoding, UTF-8 is used. You would need some logic added to
keep track of an entire record set, which can be a string[] of length 9:

using System;
using System.IO;
using System.Text;

// GetRecordNumber and SearchRecordSet are placeholders for your own
// record-type parsing and field matching.
static bool SearchFile(string path)
{
    using (StreamReader sr = new StreamReader(path, Encoding.Default))
    {
        string s;
        string[] recordset = new string[9];
        int index = 0;

        while ((s = sr.ReadLine()) != null)
        {
            int i = GetRecordNumber(s);

            if (i != index + 1)
            {
                // missing or out-of-order record - handle as needed
            }

            recordset[index] = s;
            index++;

            if (i == 9) // record set complete (types 1 through 9 collected)
            {
                if (SearchRecordSet(recordset))
                    return true;

                Array.Clear(recordset, 0, 9);
                index = 0;
            }
        }
    }

    return false;
}


PS! Your system clock is a bit too fast
 

Jurjen de Groot

Morten,

Thanks for your reply. I understand what you're doing in the code, but
isn't reading line by line slow?
The file is over 64 MB in size, so reading it line by line for every search
seems like a lot of overhead, especially when the user does many searches
while running the app; it would mean reading and searching the >64 MB file
many times. That's why I opted to keep the file in memory, which might not
be the best idea.
I'm currently trying to get some more time from my client to optimize by
creating an index of the file (which doesn't change that often), searching
through that, and retrieving the part of the text file corresponding to
the index...
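Something like the sketch below, perhaps (the record-type column and
key-field position are hypothetical, and the offset arithmetic assumes a
single-byte encoding with a uniform newline):

using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

class RecordIndex
{
    // Maps the key field of each type-1 record to the byte offset at
    // which its record set starts in the file.
    private readonly Dictionary<string, long> _index = new Dictionary<string, long>();
    private readonly string _path;

    public RecordIndex(string path)
    {
        _path = path;
        using (StreamReader sr = new StreamReader(path, Encoding.Default))
        {
            long offset = 0;
            string line;
            while ((line = sr.ReadLine()) != null)
            {
                // Hypothetical layout: record type in column 0, key field
                // in columns 10-16.
                if (line.Length >= 17 && line[0] == '1')
                    _index[line.Substring(10, 7)] = offset;
                offset += line.Length + Environment.NewLine.Length;
            }
        }
    }

    // Seek straight to the indexed set and return its records, up to
    // and including the type-9 record (or EOF).
    public List<string> Find(string key)
    {
        long offset;
        if (!_index.TryGetValue(key, out offset))
            return null;

        List<string> records = new List<string>();
        using (FileStream fs = new FileStream(_path, FileMode.Open, FileAccess.Read))
        {
            fs.Seek(offset, SeekOrigin.Begin);
            StreamReader sr = new StreamReader(fs, Encoding.Default);
            string line;
            while ((line = sr.ReadLine()) != null)
            {
                records.Add(line);
                if (line.Length > 0 && line[0] == '9') // end of the set
                    break;
            }
        }
        return records;
    }
}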

Jurjen.



 

Morten Wennevik

I haven't done speed tests, but you may well find that the time taken to
locate a record set by reading the file line by line is acceptable,
considering the processing power of today's computers.
 

Morten Wennevik

I created a ~80 MB text file consisting of 80 characters per line, with a
known word on the second-to-last line.

Opening and searching line by line took ~2.4 seconds each time.
Opening and reading the entire file before a search took ~6.4 seconds,
with ~0.4 seconds for each subsequent search.

Added complexity in the processing code will have relatively less impact
on the first method, since there you already have a complete record set,
whereas with the second method you would need additional searches.
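For reference, a rough sketch of how such a comparison might be timed (the
file name and search term are made up):

using System;
using System.Diagnostics;
using System.IO;

class Benchmark
{
    static void Main()
    {
        const string path = "records.txt"; // hypothetical ~80 MB test file

        // Method 1: read line by line, searching as we go.
        Stopwatch sw = Stopwatch.StartNew();
        using (StreamReader sr = new StreamReader(path))
        {
            string line;
            while ((line = sr.ReadLine()) != null)
                if (line.Contains("2742281"))
                    break;
        }
        Console.WriteLine("Line by line: {0} ms", sw.ElapsedMilliseconds);

        // Method 2: read everything once, then search the string.
        sw = Stopwatch.StartNew();
        string all = File.ReadAllText(path);
        Console.WriteLine("ReadToEnd:    {0} ms (load)", sw.ElapsedMilliseconds);

        sw = Stopwatch.StartNew();
        int hit = all.IndexOf("2742281");
        Console.WriteLine("IndexOf:      {0} ms (search, hit at {1})",
                          sw.ElapsedMilliseconds, hit);
    }
}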
 

Morten Wennevik

Given time, the fastest option is probably using a FileStream and a search
algorithm like Boyer-Moore, but the complexity of the code would also
increase accordingly.
 

Carl Daniel [VC++ MVP]

Morten Wennevik said:
Given time, the fastest option is probably using a FileStream and a search
algorithm like Boyer-Moore, but the complexity of the code would also
increase accordingly.

See my earlier reply to the original post. That's exactly what I do in a
program that reads similar fixed-format text files: Read the entire file
into a byte array via a single call to FileStream.Read and then use a
Boyer-Moore search on that. It's about 5x faster than reading the entire
file into a string and using string.IndexOf. There's a link to the
Boyer-Moore implementation in my earlier post - the version in the article
works on strings, but it's straightforward to convert it to work on byte
arrays.

I found the byte-array search using BM to be about 2x faster than the
"Find" function in Visual Studio (which is quite fast), but about 2x slower
than the "Find" function in Notepad, which is also Boyer-Moore (well,
QuickSearch, actually) but written in C instead of C#. The array bounds
checking on accesses to the supplemental arrays used by BM really hurts the
performance, but it's still quite speedy.
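A minimal sketch of that single-read-then-search approach (the path is
hypothetical; ByteSearch.IndexOf refers to the Boyer-Moore sketch earlier
in the thread, and the read loop is defensive even though a single
FileStream.Read call usually suffices for a local file):

using System;
using System.IO;
using System.Text;

class Program
{
    static void Main()
    {
        // Read the whole file into one byte array (assumes it fits in
        // memory and is under 2 GB).
        byte[] data;
        using (FileStream fs = new FileStream("records.txt", FileMode.Open, FileAccess.Read))
        {
            data = new byte[(int)fs.Length];
            int offset = 0;
            while (offset < data.Length)
            {
                int n = fs.Read(data, offset, data.Length - offset);
                if (n == 0)
                    throw new IOException("Unexpected end of file");
                offset += n;
            }
        }

        // Search for the user's criteria as raw bytes.
        byte[] pattern = Encoding.ASCII.GetBytes("2742281");
        int pos = ByteSearch.IndexOf(data, pattern);
        Console.WriteLine(pos >= 0 ? "Found at byte " + pos : "Not found");
    }
}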

-cd
 

Nick Malik [Microsoft]

Hi Jurjen,

My napkin calculations show that you have a 750,000-record table that
doesn't change often and that you need to search many times per user.

Dude... is SQL Server really so bad an option? C'mon! Why write this
capability into your app when it is available to you for free?


--
--- Nick Malik [Microsoft]
MCSD, CFPS, Certified Scrummaster
http://blogs.msdn.com/nickmalik

Disclaimer: Opinions expressed in this forum are my own, and not
representative of my employer.
I do not answer questions on behalf of my employer. I'm just a
programmer helping programmers.
--
 

Jurjen de Groot

Nick,

SQL isn't an option, as this is supposed to be a very simple (+/- 2-hour)
solution to a problem. The program will most probably only be used for a
week or so. SQL would be overkill in this specific situation.

I would like to thank everyone for their replies.

Jurjen.

 
