Optimizing file I/O

Michael Powe

Hello,

I have written a small app to parse web log files and extract certain
lines to another file. There is also functionality to count all the
items that are being filtered out.

I wrote this in C# instead of Perl because the log files are 3-4 GB
and I want faster processing than Perl would typically provide. And
I'm learning C#.

There are two issues I would like to address: improving the speed of
the file I/O and controlling the processing. Right now, this app takes
about 20 minutes to process a 3 GB file on a laptop with a 2 GHz
processor and 2 GB of RAM. The processing is done by a single method
that both filters and counts. Also, it pegs my CPU while it's running.

Below are the filtering and filtering/counting methods.

Thanks.

mp

using System;
using System.Collections;
using System.IO;
using System.Text.RegularExpressions;

public class parseLines
{
    // requests for static content (images, scripts, stylesheets)
    static string fileIdentifiers =
        @"\.gif\s|\.js\s|\.png\s|\.css\s|\.jpg\s";
    static Regex reAll = new Regex(fileIdentifiers);
    string fileName;

    public parseLines(string fileName)
    {
        this.fileName = fileName;
    }

    public void getLines()
    {
        // print nonmatching lines to stdout (not shown)
    }

    public Hashtable countMatches()
    {
        // count individual matches (not shown)
        return new Hashtable();
    }

    public void filterLines()
    {
        string newFileName = fileName + ".modified.log";

        StreamReader sr = new StreamReader(fileName);
        StreamWriter wr = new StreamWriter(newFileName);
        string nextLine = sr.ReadLine();
        while (nextLine != null)
        {
            // keep only lines that don't request static content
            Match myMatch = reAll.Match(nextLine);
            if (!myMatch.Success)
            {
                wr.WriteLine(nextLine);
            }
            nextLine = sr.ReadLine();
        }
        sr.Close();
        wr.Close();
    }

    public Hashtable filterAndCountLines()
    {
        string newFileName = fileName + ".modified.log";
        Hashtable ht = new Hashtable();
        char[] sep = {'|'};
        string[] newTypeArray = fileIdentifiers.Split(sep);
        Regex[] newMatchArray = new Regex[newTypeArray.Length];

        // one regex per file type, in the same order as newTypeArray
        for (int i = 0; i < newTypeArray.Length; i++)
        {
            Regex item = new Regex(newTypeArray[i]);
            newMatchArray[i] = item;
        }
        foreach (string item in newTypeArray)
        {
            ht.Add(item, 0);
        }
        ht.Add("total Match", 0);
        ht.Add("total No Match", 0);

        StreamReader sr = new StreamReader(fileName);
        StreamWriter wr = new StreamWriter(newFileName);

        string nextLine = sr.ReadLine();
        while (nextLine != null)
        {
            Match myMatch = reAll.Match(nextLine);
            if (!myMatch.Success)
            {
                wr.WriteLine(nextLine);
                ht["total No Match"] = (int)ht["total No Match"] + 1;
            }
            else
            {
                // find which individual file type matched and count it
                foreach (Regex itemRegex in newMatchArray)
                {
                    Match arrMatch = itemRegex.Match(nextLine);
                    if (arrMatch.Success)
                    {
                        ht[itemRegex.ToString()] =
                            (int)ht[itemRegex.ToString()] + 1;
                        break;
                    }
                }
                ht["total Match"] = (int)ht["total Match"] + 1;
            }
            nextLine = sr.ReadLine();
        }
        sr.Close();
        wr.Close();
        return ht;
    }
}

class MainClass
{
    public static void Main(string[] args)
    {
        Hashtable count;
        IDictionaryEnumerator countEnumerator;

        parseLines pl = new parseLines(args[0]);
        count = pl.filterAndCountLines();

        // dump each counter to the console
        countEnumerator = count.GetEnumerator();
        while (countEnumerator.MoveNext())
        {
            Console.WriteLine(countEnumerator.Key.ToString() + " : " +
                countEnumerator.Value.ToString());
        }
        Console.WriteLine("finished");
    }
}
 
Jon Skeet [C# MVP]

Michael Powe said:
I have written a small app to parse web log files and extract certain
lines to another file. There is also functionality to count all the
items that are being filtered out.

I wrote this in C# instead of Perl because the log files are 3-4 GB
and I want faster processing than Perl would typically provide. And
I'm learning C#.

There are two issues I would like to address: improving the speed of
the file I/O and controlling the processing. Right now, this app takes
about 20 minutes to process a 3 GB file on a laptop with a 2 GHz
processor and 2 GB of RAM. The processing is done by a single method
that both filters and counts. Also, it pegs my CPU while it's running.

Below are the filtering and filtering/counting methods.

If the CPU is pegged (which I can understand, given the code), then the
I/O speed isn't the problem.

Some suggestions:

1) Don't create the regular expressions freshly each time. I don't know
whether you've got a lot of small files or just a few big ones, but it
would make more sense to create them once, as you don't need to change
them.

2) Use the option to compile the regular expressions when you create
them. This could improve things enormously.
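
For example (untested, but it shows both points 1 and 2): since the
patterns never change, you can create the expressions once, as compiled
statics:

    static readonly Regex reAll = new Regex(
        @"\.gif\s|\.js\s|\.png\s|\.css\s|\.jpg\s",
        RegexOptions.Compiled);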

3) Rather than using a hashtable, consider having an array of ints
along with your array of regular expressions. You could then iterate
through the regular expression array by index rather than by value, and
just increment the relevant int - no hashtable lookup, no unboxing and
then reboxing.
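
A rough sketch of what I mean (variable names are just illustrative):

    string[] patterns = fileIdentifiers.Split('|');
    Regex[] filters = new Regex[patterns.Length];
    int[] counts = new int[patterns.Length];
    for (int i = 0; i < patterns.Length; i++)
    {
        filters[i] = new Regex(patterns[i], RegexOptions.Compiled);
    }

    // then, for each line:
    for (int i = 0; i < filters.Length; i++)
    {
        if (filters[i].IsMatch(nextLine))
        {
            counts[i]++;   // plain int increment, no boxing
            break;
        }
    }

At the end you can still build a Hashtable for display by pairing
patterns[i] with counts[i].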

4) If most lines in the file will match one of the filters, try getting
rid of the "all" regular expression, working out the result just by
running all the others. It may not help, but it's worth a try.
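
Continuing the sketch above, the inner loop can work out both answers
at once (matchTotal and noMatchTotal here are just hypothetical local
ints):

    bool matched = false;
    for (int i = 0; i < filters.Length; i++)
    {
        if (filters[i].IsMatch(nextLine))
        {
            counts[i]++;
            matched = true;
            break;
        }
    }
    if (matched)
    {
        matchTotal++;            // line hit one of the filters
    }
    else
    {
        wr.WriteLine(nextLine);  // keep the line and count it
        noMatchTotal++;
    }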

Finally, use using statements for your stream readers and writers -
that way, if an exception is thrown, you'll still close the file
immediately.
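
i.e. something like:

    using (StreamReader sr = new StreamReader(fileName))
    using (StreamWriter wr = new StreamWriter(newFileName))
    {
        string nextLine;
        while ((nextLine = sr.ReadLine()) != null)
        {
            // filter and count as before
        }
    }
    // both files are closed here, even if an exception was thrown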
 
Michael Powe

"Jon" == Jon Skeet [C# MVP] <Jon> writes:


Jon> If the CPU is pegged (which I can understand, given the
Jon> code), then the I/O speed isn't the problem.

Jon> Some suggestions:

Jon> 1) Don't create the regular expressions freshly each time. I
Jon> don't know whether you've got a lot of small files or just a
Jon> few big ones, but it would make more sense to create them
Jon> once, as you don't need to change them.

Jon> 2) Use the option to compile the regular expressions when you
Jon> create them. This could improve things enormously.

Jon> 3) Rather than using a hashtable, consider having an array of
Jon> ints along with your array of regular expressions. You could
Jon> then iterate through the regular expression array by index
Jon> rather than by value, and just increment the relevant int -
Jon> no hashtable lookup, no unboxing and then reboxing.

Jon> 4) If most lines in the file will match one of the filters,
Jon> try getting rid of the "all" regular expression, working out
Jon> the result just by running all the others. It may not help,
Jon> but it's worth a try.

Jon> Finally, use using statements for your stream readers and
Jon> writers - that way, if an exception is thrown, you'll still
Jon> close the file immediately.

Jon> -- Jon Skeet - <[email protected]> http://www.pobox.com/~skeet
Jon> If replying to the group, please do not mail me too

Thanks very much for the clues; I will follow up. As I mentioned, the
files are large -- 3 to 4 GB -- which is why I'm trying C# instead of
Perl.

Your help is much appreciated.

mp
 
