Slow String operations...


Mugunth

I'm writing a search engine crawler for indexing local files in C#.
My dataset is about 38000 XML files, and so far I've successfully
parsed the files and tokenized them.
But it's surprising to find that the string operations gradually
become slower...
The system crunches 8200 files in the first 10 seconds, but manages
only 5000 in the next 10, then 3500 in the next 10, and it keeps
slowing down gradually...
It takes about 75 seconds in total for 38000 files, whereas if the
system had kept the speed at which it started, it should have taken
under 50 seconds...
Why do the string operations become progressively slower?

This is my output...
Total files processed so far: 8201
Time taken so far (sec):10.001
Total files processed so far: 13106
Time taken so far (sec):20.002
Total files processed so far: 17661
Time taken so far (sec):30.001
Total files processed so far: 21926
Time taken so far (sec):40.002
Total files processed so far: 26489
Time taken so far (sec):50.018
Total files processed so far: 30703
Time taken so far (sec):60.002
Total files processed so far: 35479
Time taken so far (sec):70.017
Done - 37526 files found!
Time taken so far (sec):74.883


Any help appreciated...
Mugunth
 

Jon Skeet [C# MVP]

Mugunth said:
I'm writing a search engine crawler for indexing local files in C#.
My dataset is about 38000 XML files, and so far I've successfully
parsed the files and tokenized them.
But it's surprising to find that the string operations gradually
become slower...
The system crunches 8200 files in the first 10 seconds, but manages
only 5000 in the next 10, then 3500 in the next 10, and it keeps
slowing down gradually...
It takes about 75 seconds in total for 38000 files, whereas if the
system had kept the speed at which it started, it should have taken
under 50 seconds...
Why do the string operations become progressively slower?

Without posting any code, we'd have to be psychic to know. What does
your code actually do?

Could you post a short but complete program which demonstrates the
problem?

See http://www.pobox.com/~skeet/csharp/complete.html for details of
what I mean by that.
 

Christopher Ireland

Mugunth,
But it's surprising to find that the string operations gradually
become slower...

If you're concatenating strings, e.g.

string s = "S";
s += "T";

etc.

then try using the StringBuilder class which is substantially faster for
this type of operation.
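
For example (an illustration only, not Mugunth's code), building a large
string with += copies the whole string on every append, whereas a
StringBuilder appends into an internal buffer:

    // Quadratic: each += allocates a new string and copies everything so far
    string slow = "";
    for (int i = 0; i < 10000; i++)
    {
        slow += "token ";
    }

    // Much cheaper: StringBuilder appends in place (requires using System.Text;)
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < 10000; i++)
    {
        sb.Append("token ");
    }
    string fast = sb.ToString();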

--
Thank you,

Christopher Ireland
http://shunyata-kharg.doesntexist.com

"Do not seek to follow in the footsteps of the wise. Seek what they
sought."
Matsuo Basho
 

Mugunth

foreach (string directory in directoryList)
{
    DirectoryInfo cdi = new DirectoryInfo(directory);
    DateTime fileStart, end;
    foreach (FileInfo fi in cdi.GetFiles(searchPattern))
    {
        try
        {
            fileStart = DateTime.Now;
            sbFileContents.Remove(0, sbFileContents.Length);
            StreamReader rd = File.OpenText(fi.FullName);
            string fileContent = rd.ReadToEnd();
            fileContent = fileContent.Replace("&amp;", "&");
            fileContent = fileContent.Replace("&", "&amp;");
            XmlDocument xdoc = new XmlDocument();
            xdoc.LoadXml(fileContent);
            String docNo = xdoc.SelectSingleNode("//DOCNO").InnerText;
            String docType = xdoc.SelectSingleNode("//DOCTYPE").InnerText;
            String txtType = xdoc.SelectSingleNode("//TXTTYPE").InnerText;
            String text = xdoc.SelectSingleNode("//TEXT").InnerText;
            sbFileContents.Append(text);

            s.StripPunctuation(ref sbFileContents);

            // I need this function call to execute...
            // will be processing this string array later...
            string[] tokenizedArray = s.Tokenize(ref sbFileContents, false);

            //s.ConvertToWordID(ref tokenizedArray);

            //s.Vectorize(docNo, docType, txtType, tokenizedArray); something like this? Mugunth

            count++;
            end = DateTime.Now;

            double fulldiff = (end - start).TotalMilliseconds;
            double diff = (end - fileStart).TotalMilliseconds;
            //Console.WriteLine(count.ToString() + " - " + docNo);
            //Console.WriteLine("Time taken for this file (ms):" + diff.ToString());
            if (fulldiff / 1000 > prev)
            {
                Console.WriteLine((((int)(fulldiff / 1000)).ToString()) + "\t" + count.ToString());
                prev += INCRTIME;
                //System.GC.Collect();
                // test code
            }
            rd.Close();
        }
        catch (Exception ex)
        {
            Console.WriteLine(fi.FullName + " - " + ex.Message);
            return;
        }
    }
}


This is a sort of pseudo code...
I've posted the main loop here...
My concern is that the string operations get progressively slower...
I'm happy with the initial speed (processing about 8500 files in the
first 10 seconds)...
but in the next 10 seconds it processes only about 5000...
The files are of nearly equal size...

Mugunth
 

Jon Skeet [C# MVP]

This is a sort of pseudo code...

Well, it's certainly not a short but complete program, unfortunately. I
won't be able to actually *run* it. If you can provide a full program
(which doesn't need to be able to do anything other than show the
problem) it'll be much easier to fix.
I've posted the main loop here...
My concern is that the string operations get progressively slower...

*Which* string operations though? Unless you've profiled it, you don't
really know what's slowing down. Is the call to Tokenize getting
slower?
I'm happy with the initial speed (processing about 8500 files in the
first 10 seconds)...
but in the next 10 seconds it processes only about 5000...
The files are of nearly equal size...

A few points:

1) Why are you reusing the StringBuilder each time?
Just create a new one for each iteration.
2) Why are you passing the StringBuilder (and arrays) by reference?
They're already reference types, so unless you're actually
reassigning the references, it's just pointlessly complicating
things to pass them *by* reference.
3) You should use "using" statements for things like StreamReader -
currently if an exception is thrown, the file will stay open
until it gets finalized.
4) You don't actually need to use a StreamReader at all - just use
File.ReadAllText to read it in a single statement.
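
To illustrate points 1-4, here is a minimal sketch of how the inner loop
could look with those changes (hypothetical; it reuses the variable names
from the posted code, and StripPunctuation/Tokenize would drop the ref
from their signatures as well):

    foreach (FileInfo fi in cdi.GetFiles(searchPattern))
    {
        // 4) One call reads the whole file; 3) with no StreamReader left
        //    open, there is nothing to leak if parsing throws
        string fileContent = File.ReadAllText(fi.FullName);

        XmlDocument xdoc = new XmlDocument();
        xdoc.LoadXml(fileContent);
        string text = xdoc.SelectSingleNode("//TEXT").InnerText;

        // 1) A fresh StringBuilder per file instead of reusing one
        StringBuilder sbFileContents = new StringBuilder(text);

        // 2) No "ref" - StringBuilder and string[] are already reference types
        s.StripPunctuation(sbFileContents);
        string[] tokenizedArray = s.Tokenize(sbFileContents, false);

        count++;
    }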
 

Mugunth

Thank you for your answers...
The calls to Tokenize and StripPunctuation are the string operations.
In the first 10 seconds they tokenize about 8200 files.
In the second 10 seconds they tokenize only 5000 files...
and in the third 10 seconds, even fewer...
Nearly all the files are of the same size... but the algorithm gets
progressively slower over time...


This is my strip punctuation code:

char[] punctuations = { '#', '!', '*', '-', '"', ',' };
int len = sbFileContents.Length;
for (int i = 0; i < len; i++)
{
    if (sbFileContents[i].CompareTo(punctuations[0]) == 0 ||
        sbFileContents[i].CompareTo(punctuations[1]) == 0 ||
        sbFileContents[i].CompareTo(punctuations[2]) == 0 ||
        sbFileContents[i].CompareTo(punctuations[3]) == 0 ||
        sbFileContents[i].CompareTo(punctuations[4]) == 0 ||
        sbFileContents[i].CompareTo(punctuations[5]) == 0)
    {
        sbFileContents[i] = ' ';
    }
}

This is my tokenize code:

string[] returnArray;
string[] delimiters = { " ", "?", ". " };
int count = 0;
string[] strArray = fileContents.ToString().Split(delimiters,
    StringSplitOptions.RemoveEmptyEntries);

returnArray = new string[strArray.Length];

PorterStemmer ps = new PorterStemmer();
foreach (String str in strArray)
{
    string word;
    if (bStem)
    {
        word = ps.stemTerm(str);
    }
    else
    {
        word = str;
    }

    if (!IsStopWord(word))
        returnArray[count++] = word;
}

return returnArray;


Is it that, as time progresses, the number of garbage collections
grows higher, and that overhead hampers my performance over time?
Is there any way to set the size of the heap at program start?
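
One way to check the garbage-collection theory directly is to print the
collector's counters alongside the existing progress output. A small
diagnostic sketch (not part of the posted program):

    // Inside the existing "if (fulldiff / 1000 > prev)" block:
    // GC.CollectionCount reports how many collections of each generation
    // have run so far; a steadily climbing gen-2 count would point at GC pressure.
    Console.WriteLine("GC gen0={0} gen1={1} gen2={2} heap={3:N0} bytes",
        GC.CollectionCount(0),
        GC.CollectionCount(1),
        GC.CollectionCount(2),
        GC.GetTotalMemory(false));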

Regards,
Mugunth
 

Kalpesh Shah

Looking at the code, the following lines look weird.
This is just based on a glance at the code, not an actual analysis.

string fileContent = rd.ReadToEnd();
fileContent = fileContent.Replace("&amp;","&");
fileContent = fileContent.Replace("&", "&amp;");

1) Use of StringBuilder could help.
2) What is the use of lines 2 and 3? I could not understand replacing
"&amp;" with "&" and then undoing it in line 3.
3) Also, I suggest using a "using" block to open the file.
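
For point 3, a minimal sketch of the same read wrapped in a using block
(so the reader is closed even if an exception is thrown later):

    string fileContent;
    using (StreamReader rd = File.OpenText(fi.FullName))
    {
        fileContent = rd.ReadToEnd();
    }   // rd is disposed (and the file closed) here, even on an exception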

Kalpesh
 

Bill Butler

Mugunth said:
I'm writing a search engine crawler for indexing local files in C#.
My dataset is about 38000 XML files, and so far I've successfully
parsed the files and tokenized them.
But it's surprising to find that the string operations gradually
become slower...
The system crunches 8200 files in the first 10 seconds, but manages
only 5000 in the next 10, then 3500 in the next 10, and it keeps
slowing down gradually...

I suggest that you recheck your math.
From your output I get the following:

time    total    diff
  10     8201    8201
  20    13106    4905
  30    17661    4555
  40    21926    4265
  50    26489    4563
  60    30703    4214
  70    35479    4776

Besides the first data point, it looks quite linear.
If you calculate the number of files processed in each 10-second
interval, it ranges from ~4200 to ~4900 with no noticeable drop-off.

I am not sure why the first interval was so much faster, but this is
not slowing to a crawl.
 

Jon Skeet [C# MVP]

Is it that, as time progresses, the number of garbage collections
grows higher, and that overhead hampers my performance over time?

Possible, but I wouldn't expect that to be the problem.

Again though, if you could produce a *complete* program it would make
life a lot easier.
It doesn't need to look at different files - just going through the
same file thousands of times should demonstrate the issue given what
you've been saying.

Jon
 

Family Tree Mike

Bill Butler's answer makes a good point.

Do the file sizes vary wildly, or are they approximately the same over the
sample size? This could account for differences. Also, look at your
process. If some files have more replacements than others, then the work
being done in each 10-second interval is not properly counted by the file count.
 

Mugunth

Bill Butler's answer makes a good point.

Do the file sizes vary wildly, or are they approximately the same over the
sample size? This could account for differences. Also, look at your
process. If some files have more replacements than others, then the work
being done in each 10-second interval is not properly counted by the file count.


I understand, but the sizes in bytes of these files are uniformly
distributed...
It's not like files from 20000-30000 are larger or something like
that...

Is it that, as time progresses, the number of garbage collections
grows higher, and that overhead hampers my performance over time?

Possible, but I wouldn't expect that to be the problem.


Why wouldn't this be a problem?



This is the complete source code...
But the dataset is huge, so I cannot upload it...

using System;
using System.Collections.Generic;
using System.Text;
using System.IO;
using SearchEngine.PorterStemmerAlgorithm;
using System.Collections;
using System.Xml;

namespace SearchEngine
{
    class Program
    {
        static private void GetDirectoryList(DirectoryInfo di, string searchPattern, ref ArrayList directoryList)
        {
            foreach (DirectoryInfo d in di.GetDirectories())
            {
                directoryList.Add(d.FullName);
                GetDirectoryList(d, searchPattern, ref directoryList);
            }
        }

        static void Main(string[] args)
        {
            const int INCRTIME = 1;
            Console.WriteLine("Starting Crawler... (removed rep)");
            Searcher s = new Searcher();
            s.ReadStopWords("stopwords.txt");

            if (args.Length != 1)
            {
                Console.WriteLine("Type a dir name...!");
                return;
            }

            StringBuilder sbFileContents = new StringBuilder();

            DateTime start = DateTime.Now;
            string searchPattern = "*.txt";
            string InitDirectory = args[0];
            ArrayList directoryList = new ArrayList();

            DirectoryInfo di = new DirectoryInfo(InitDirectory);
            GetDirectoryList(di, searchPattern, ref directoryList);

            Console.WriteLine("Total Folders: " + directoryList.Count);
            Console.WriteLine("(Time\tFiles Processed)");
            int count = 0;
            int prev = INCRTIME;
            foreach (string directory in directoryList)
            {
                DirectoryInfo cdi = new DirectoryInfo(directory);
                DateTime fileStart, end;
                foreach (FileInfo fi in cdi.GetFiles(searchPattern))
                {
                    try
                    {
                        fileStart = DateTime.Now;
                        sbFileContents.Remove(0, sbFileContents.Length);
                        StreamReader rd = File.OpenText(fi.FullName);
                        string fileContent = rd.ReadToEnd();
                        fileContent = fileContent.Replace("&amp;", "&");
                        fileContent = fileContent.Replace("&", "&amp;");
                        XmlDocument xdoc = new XmlDocument();
                        xdoc.LoadXml(fileContent);
                        String docNo = xdoc.SelectSingleNode("//DOCNO").InnerText;
                        String docType = xdoc.SelectSingleNode("//DOCTYPE").InnerText;
                        String txtType = xdoc.SelectSingleNode("//TXTTYPE").InnerText;
                        String text = xdoc.SelectSingleNode("//TEXT").InnerText;
                        sbFileContents.Append(text);

                        s.StripPunctuation(ref sbFileContents);

                        string[] tokenizedArray = s.Tokenize(ref sbFileContents, true);

                        //s.ConvertToWordID(ref tokenizedArray);

                        //s.Vectorize(docNo, docType, txtType, tokenizedArray); something like this? Mugunth

                        count++;
                        end = DateTime.Now;

                        double fulldiff = (end - start).TotalMilliseconds;
                        double diff = (end - fileStart).TotalMilliseconds;
                        Console.WriteLine("Time taken for this file (ms):" + diff.ToString());
                        if (fulldiff / 1000 > prev)
                        {
                            Console.WriteLine((((int)(fulldiff / 1000)).ToString()) + "\t" + count.ToString());
                            prev += INCRTIME;
                            //System.GC.Collect();
                            // test code
                        }
                        rd.Close();
                    }
                    catch (Exception ex)
                    {
                        Console.WriteLine(fi.FullName + " - " + ex.Message);
                        return;
                    }
                }
            }
            DateTime end1 = DateTime.Now;
            double finaldiff = (end1 - start).TotalMilliseconds;
            Console.WriteLine("Done - " + count + " files found!");
            Console.WriteLine("Time taken so far (sec):" + ((int)(finaldiff / 1000)).ToString());
            Console.ReadKey();
        }
    }

    public class Searcher
    {
        private List<string> stopwords = new List<string>();

        public void ReadStopWords(string stopWordsFile)
        {
            TextReader tr;
            try
            {
                tr = new StreamReader(stopWordsFile);
            }
            catch (System.IO.IOException /*ioe*/)
            {
                return;
            }

            while (true)
            {
                string str;
                str = tr.ReadLine();
                if (str == null)
                    break;
                stopwords.Add(str);
            }
        }

        public string[] Tokenize(ref StringBuilder fileContents, bool bStem)
        {
            string[] returnArray;
            string[] delimiters = { " ", "?", ". " };
            int count = 0;
            string[] strArray = fileContents.ToString().Split(delimiters,
                StringSplitOptions.RemoveEmptyEntries);

            returnArray = new string[strArray.Length];

            PorterStemmer ps = new PorterStemmer();
            foreach (String str in strArray)
            {
                string word;
                if (bStem)
                {
                    word = ps.stemTerm(str);
                }
                else
                {
                    word = str;
                }

                if (!IsStopWord(word))
                    returnArray[count++] = word;
            }

            return returnArray;
        }

        private bool IsStopWord(string word)
        {
            foreach (string str in stopwords)
            {
                if (str.Equals(word, StringComparison.OrdinalIgnoreCase))
                    return true;
            }
            return false; // not a stop word
        }

        public void StripPunctuation(ref StringBuilder sbFileContents)
        {
            char[] punctuations = { '#', '!', '*', '-', '"', ',' };
            int len = sbFileContents.Length;
            for (int i = 0; i < len; i++)
            {
                if (sbFileContents[i].CompareTo(punctuations[0]) == 0 ||
                    sbFileContents[i].CompareTo(punctuations[1]) == 0 ||
                    sbFileContents[i].CompareTo(punctuations[2]) == 0 ||
                    sbFileContents[i].CompareTo(punctuations[3]) == 0 ||
                    sbFileContents[i].CompareTo(punctuations[4]) == 0 ||
                    sbFileContents[i].CompareTo(punctuations[5]) == 0)
                {
                    sbFileContents[i] = ' ';
                }
            }
            /*
            char[] punctuations = { '#', '!', '*', '-', '"', ',' };
            foreach (char ch in punctuations)
            {
                sbFileContents = sbFileContents.Replace(ch, ' ');
            }*/
        }

        public void ConvertToWordID(ref string[] tokenizedArray)
        {
            // just displays the words now... do some processing here...
            // Mugunth_Dummy_Code
            foreach (String str in tokenizedArray)
            {
                Console.WriteLine(str);
            }
        }
    }
}

Mugunth
 

Mugunth

For now you can safely ignore the call to
SearchEngine.PorterStemmerAlgorithm...
 

Mugunth

I commented out the search-engine code and ran just the XML parsing
again, which is this part...

sbFileContents.Remove(0, sbFileContents.Length);
StreamReader rd = File.OpenText(fi.FullName);
string fileContent = rd.ReadToEnd();
fileContent = fileContent.Replace("&amp;", "&");
fileContent = fileContent.Replace("&", "&amp;");
XmlDocument xdoc = new XmlDocument();
xdoc.LoadXml(fileContent);
String docNo = xdoc.SelectSingleNode("//DOCNO").InnerText;
String docType = xdoc.SelectSingleNode("//DOCTYPE").InnerText;
String txtType = xdoc.SelectSingleNode("//TXTTYPE").InnerText;
String text = xdoc.SelectSingleNode("//TEXT").InnerText;
sbFileContents.Append(text);

(Time Files Processed)
1 3191
2 6933
3 8843
4 9621
5 10421
6 11298
7 12102
8 13062
9 13870
10 14656
11 15727
12 16585
13 17562
14 18273
15 19516
16 20417
17 21397
18 22305
19 23207
20 24141
21 25180
22 26362
23 27049
24 28073
25 29071
26 29769
27 30816
28 31940
29 32965
30 33711
31 34611
32 35621
33 36413
34 37271
Done - 37526 files found!
Time taken so far (sec):34

Again, in the first second it can parse 3200 files...
in the last 5 seconds (30-34) it could parse only 3500 files...
My data set is not that disparate...

It is split into 100 folders, each containing about 300-400 files.
Each file is about 1 KB in size...

It's a homogeneous collection of files...

I hope I'm clear... :(
 

Mugunth

<DOC>
<DOCNO> ABC19981002.1830.0000 </DOCNO>
<DOCTYPE> MISCELLANEOUS </DOCTYPE>
<TXTTYPE> CAPTION </TXTTYPE>
<TEXT>
The troubling connections between the global economic crisis and
American
jobs. monica Lewinsky and Linda Tripp, private conversations made
public. Gene autry had died, the most famous singing cowboy of them
all. And the artist who sold only one painting in his lifetime and
is an icon today.
</TEXT>
</DOC>

this is one single file...
 

Mugunth

<DOC>
<DOCNO> CNN19981002.1600.0719 </DOCNO>
<DOCTYPE> MISCELLANEOUS </DOCTYPE>
<TXTTYPE> CAPTION </TXTTYPE>
<TEXT>
the top stories are two minutes away, followed by the closing numbers
from wall street in "dollars & sense." and ahead, "sabrina the
teenage
witch" casts a spell over rome in a new made-for-tv movie. oh, peter
lynch. i'm suzie your stress technician. trainee-- not to worry. here
we go. wow. under control. this reminds me of the stock market. it
can't just shoot up without warning. sure can. and then drop. 25%
or more, four times since 1970. so stressful! you know fidelity can
help you better prepare for the market's ups and downs. for the free
guide, "investing in volatile markets" call 1-800-fidelity or visit
fidelity.com. well, you're stress-free. Man: Hi ya, Diane, it's dad.
hi, dad! Just wondering how you're doing in your new home. great!
I wanted to remind you to get a spare key made, and let me know if
there's anything that needs fixin', all right? All right, sweetheart.
bye. Hi, dad again. I forgot to tell you something. Diane, I... I
am so proud of you. for a free guide that can help you on the path
to homeownership, call the fannie mae foundation. we're showing
america
a new way home. checking the top stories --
</TEXT>
</DOC>

This is a file of more average size...
the sizes vary between 450 bytes and 4 KB...

As I said, there are about 100 folders, each about 800 KB in size
(around 400 files in each folder).
 

Jon Skeet [C# MVP]

Again, in the first second it can parse 3200 files...
in the last 5 seconds (30-34) it could parse only 3500 files...
My data set is not that disparate...

I would strongly suggest that you modify your code to load a single
file thousands of times. That way you *know* whether the performance
is actually degrading or whether it's just different data.
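
A sketch of that experiment, reusing the Searcher class from the posted
program (the file name is a placeholder):

    // Parse one representative file over and over; if the rate still drops,
    // the slowdown is in the code, not in the data set.
    Searcher s = new Searcher();
    s.ReadStopWords("stopwords.txt");

    string fileContent = File.ReadAllText("sample.xml");   // any one of the 38000 files
    fileContent = fileContent.Replace("&amp;", "&");        // same entity fix-up as the main loop
    fileContent = fileContent.Replace("&", "&amp;");

    DateTime start = DateTime.Now;
    for (int i = 1; i <= 50000; i++)
    {
        XmlDocument xdoc = new XmlDocument();
        xdoc.LoadXml(fileContent);
        StringBuilder sb = new StringBuilder(xdoc.SelectSingleNode("//TEXT").InnerText);
        s.StripPunctuation(ref sb);
        s.Tokenize(ref sb, true);

        if (i % 5000 == 0)
        {
            double secs = (DateTime.Now - start).TotalSeconds;
            Console.WriteLine("{0} iterations, {1:F0} files/sec", i, i / secs);
        }
    }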

Jon
 

Family Tree Mike

And, again, after the first one or two seconds the processing rate is fairly
constant, as Bill pointed out. The average (after second 2) is 917 files per
second, with a standard deviation of 140.
 

Creativ

Just comment out the different parts one by one to see which costs the
most time. That way you might find the culprit.
 
