Fastest way to search through txt file?

Vjay77

Hi,
I haven't posted any problem in quite a while now, but I came to the
point that I really need to ask for help.

I need to create an application which will search through a .txt log
file and find all lines where an email from hotmail occurred.

All these emails need to be printed to list box on the form.

The problem with the code you'll see below is that it takes a long time
to search through. On just a 10 MB file it takes almost 2 minutes, and I
need to process 1-2 GB files. Because this is only a middle step and the
program has much more functionality, I can't afford to wait this
long.

Right now the way I do the search is that I load each line into a
hidden RichTextBox and use its Find command to look for any
occurrence of hotmail.com, and if there is one I display the line in the listbox.

This process is extremely slow.

How to extract the email address from the line is another
story; does someone know of a good email parser for VB.NET?

Can someone look at the code and tell me what I am doing wrong? Why is
it so slow?
Thanks a lot.

vjay


Dim oRead As System.IO.StreamReader
Dim linein As String
Dim Result As Integer
Dim count As Integer

' OpenText is a Shared method, so it is called on the File type itself;
' the path has to be a quoted string.
oRead = System.IO.File.OpenText("log.txt")

While oRead.Peek <> -1
    count = count + 1
    linein = oRead.ReadLine()
    ' Each line is copied into the hidden RichTextBox before searching.
    RichTextBox1.Text = linein
    'StatusBar1.Text = count.ToString

    Result = RichTextBox1.Find("hotmail.com", _
        RichTextBoxFinds.MatchCase)
    If Result <> -1 Then
        ListBox2.Items.Add(linein)
    End If
End While
oRead.Close()
 
Cor

Hi Vjay,

About half a year ago, I think, I ran a test in this newsgroup to find
the fastest method for a find in a text file (with code supplied by different
people).

This code was supplied by someone whose first name was Jon.

(It counts how many times a word occurs in a text. You need the positions; that is
of course iResult on each pass in this routine, and iStart is one past it.)

\\\
Public Function Test2(ByVal strInput As String, ByVal strDelimiter _
        As String) As Int32 'Jon (string)
    Dim iStart As Int32, iCount As Int32, iResult As Int32
    iStart = 1
    iCount = 0
    Do
        iResult = InStr(iStart, strInput, strDelimiter)
        If iResult = 0 Then Exit Do
        iCount += 1
        iStart = iResult + 1
    Loop
    Return iCount
End Function
///
This was absolutely the fastest when the delimiter is a string (not a Char).

I hope that you can use it. If you still need help implementing this in
your solution, just say so.

Cor
 
Vjay77

Cor,
thank you, but I don't need to count; I need to find a word and display
the line in which it occurred in my listbox.
Please let me know if you know how I could amend it so your fast
script does what I need.
Thanks a lot for your help.
 
CJ Taylor

You should try loading the file in parts (if you're concerned about memory) and
processing it using RegularExpressions.

Using a FileStream object may be better for memory, I'm not real sure...

I think you will find RegEx is exactly what you're looking for, and it's VERY
fast.
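
A minimal sketch of the regular-expression approach, reading line by line so that the whole file never has to sit in memory. The module name, the function name, and the pattern itself are assumptions for illustration; the pattern is deliberately simple and is not a complete email parser. It also pulls out just the address, which covers the "extract the email" part of the question.

```vbnet
Imports System.Collections.Generic
Imports System.IO
Imports System.Text.RegularExpressions

Module HotmailRegexSketch
    ' Hypothetical pattern: a run of common local-part characters
    ' followed by @hotmail.com. Not a full address grammar.
    Private ReadOnly HotmailPattern As New Regex(
        "[A-Za-z0-9._%+-]+@hotmail\.com",
        RegexOptions.Compiled Or RegexOptions.IgnoreCase)

    ' Returns every hotmail address found in the lines supplied by the reader.
    Public Function FindHotmailAddresses(reader As TextReader) As List(Of String)
        Dim found As New List(Of String)
        Dim line As String = reader.ReadLine()
        Do Until line Is Nothing
            For Each m As Match In HotmailPattern.Matches(line)
                found.Add(m.Value)
            Next
            line = reader.ReadLine()
        Loop
        Return found
    End Function
End Module
```

On a form this could be called with a StreamReader opened on log.txt, adding the returned list to ListBox2 in one go with Items.AddRange rather than one item at a time.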


 
Jay B. Harlow [MVP - Outlook]

Vjay77,
I don't have a specific .NET example handy.

Have you considered using the Indexing Service instead of coding the search
yourself?

http://msdn.microsoft.com/library/default.asp?url=/library/en-us/indexsrv/html/ixintro_297o.asp

You may need to use ADODB instead of ADO.NET, however you should be able to
set up a search of the Indexing Service that returns the list of files with
log entries...

If you have a single log file, instead of multiple log files, then you may
need to continue using the loop you have, however consider moving the search
itself to a second thread...

Hope this helps
Jay
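
A rough sketch of the second-thread idea, assuming a single log file. The module and function names are made up for illustration, and the line test uses String.IndexOf only as a placeholder for whichever search ends up fastest. The callback runs on the worker thread, so a form would have to marshal back to the UI thread (for example through Control.Invoke) before touching ListBox2.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.IO
Imports System.Threading

Module BackgroundSearchSketch
    ' Scans the reader and returns every line that contains searchText.
    Public Function CollectMatches(reader As TextReader,
                                   searchText As String) As List(Of String)
        Dim matches As New List(Of String)
        Dim line As String = reader.ReadLine()
        Do Until line Is Nothing
            If line.IndexOf(searchText, StringComparison.Ordinal) >= 0 Then
                matches.Add(line)
            End If
            line = reader.ReadLine()
        Loop
        Return matches
    End Function

    ' Starts the scan on a background thread and passes the result to onDone.
    ' onDone is invoked on the worker thread, not on the UI thread.
    Public Function StartSearch(path As String, searchText As String,
                                onDone As Action(Of List(Of String))) As Thread
        Dim worker As New Thread(
            Sub()
                Using sr As New StreamReader(path)
                    onDone(CollectMatches(sr, searchText))
                End Using
            End Sub)
        worker.IsBackground = True
        worker.Start()
        Return worker
    End Function
End Module
```

Collecting the matches first and filling the list box once at the end also keeps the form responsive while the file is being read.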

 
Cor

Hi CJ,

CJ Taylor said:
I think you will find RegEx is exactly what you're looking for, and it's VERY
fast.

Yes, by your standards, not by mine.

:)

As far as I remember, it was surely 100 times slower than the routine Jon made.

Cor
 
Cor

Hi Vjay77,

I thought this was what you were looking for. Test it, because I never use
InStr myself.

But it should be the fastest.

I hope it works?

Cor
\\\
Dim sr As New IO.StreamReader("log.txt")
Dim linein As String
Dim Result As Integer
linein = sr.ReadLine
Do Until linein Is Nothing
    ' InStr is 1-based and returns 0 (not -1) when the text is not found.
    Result = InStr(linein, "hotmail.com")
    If Result > 0 Then
        ListBox2.Items.Add(linein)
    End If
    linein = sr.ReadLine()
Loop
sr.Close()
///
 
CJ Taylor

Cor said:
Yes for your standards not for mine.

Yes, but I tried to forget about Microsoft.VisualBasic when much better and
commonly supported methods are in place...

Maybe you should look at String.IndexOf if you want to use something other
than regex...

Just had to start something this morning didn't ya? =)

But for log analysis? come on...
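
A small sketch of the String.IndexOf variant, for comparison. The module and function names are illustrative only. One detail worth checking when timing it against InStr: String.IndexOf(String) with no comparison argument does a culture-sensitive search, while passing an ordinal StringComparison skips that work. Which call wins on a given machine and .NET version is something to measure rather than assume.

```vbnet
Imports System

Module IndexOfSketch
    ' True when the line contains searchText, compared ordinally and
    ' ignoring case (so HOTMAIL.COM also matches).
    Public Function LineContains(line As String, searchText As String) As Boolean
        ' IndexOf is 0-based and returns -1 when nothing is found,
        ' unlike InStr, which is 1-based and returns 0.
        Return line.IndexOf(searchText, StringComparison.OrdinalIgnoreCase) >= 0
    End Function
End Module
```

Inside the read loop the test would then be `If LineContains(linein, "hotmail.com") Then ListBox2.Items.Add(linein)`.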
 
Cor

Hi CJ,

See the subject. "fastest".

If you look at the code I wrote for him, you can see that I never use it myself.
(You said you do internet work, so in my opinion you should also think only in
IndexOf.)

But InStr really is the fastest. I never use it because I always forget that
its index is 1-based (0 + 1).

IndexOf is 2 times slower.
(These are picoseconds or less.)

:)

Cor
 
CJ Taylor

Alright, just ran a test, and yes, InStr is faster than anything else.

My apologies. =)

Cor said:
But that instr is real the fastest, I never use it, I forget always that the
index is 0 + 1

I forget that every time from my VB6 days... With InStr you can check for 0 (and
use it as a false flag too!) but not with IndexOf... that always bothered me. I
just got used to it in Java, so I think that's why I like to use it in .NET.
 
Vjay77

Thanks a lot everyone.
Instr worked just fine. Very very fast.
Once again, thanks a lot.
vjay
 
jerry

What about a Boyer-Moore search? My tests show it is faster
than any built-in string search in .Net.
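
For reference, a compact sketch of the idea. This is the simplified Boyer-Moore-Horspool variant (bad-character shifts only), not the full Boyer-Moore algorithm and not the implementation behind the tests mentioned above. The names are made up for illustration, and no speed claim is attached to this version.

```vbnet
Imports System
Imports System.Collections.Generic

Module HorspoolSketch
    ' Returns the 0-based index of the first occurrence of needle in
    ' haystack, or -1 when it does not occur (same convention as IndexOf).
    Public Function IndexOfHorspool(haystack As String, needle As String) As Integer
        Dim m As Integer = needle.Length
        Dim n As Integer = haystack.Length
        If m = 0 Then Return 0
        If m > n Then Return -1

        ' Shift table: for each character of the needle except the last,
        ' the distance from its rightmost occurrence to the needle's end.
        Dim shift As New Dictionary(Of Char, Integer)
        For i As Integer = 0 To m - 2
            shift(needle(i)) = m - 1 - i
        Next

        Dim pos As Integer = 0
        While pos <= n - m
            ' Compare the window against the needle from right to left.
            Dim j As Integer = m - 1
            While j >= 0 AndAlso haystack(pos + j) = needle(j)
                j -= 1
            End While
            If j < 0 Then Return pos

            ' Skip ahead based on the character under the window's last
            ' position; characters not in the table allow a full-length skip.
            Dim skip As Integer
            If Not shift.TryGetValue(haystack(pos + m - 1), skip) Then skip = m
            pos += skip
        End While
        Return -1
    End Function
End Module
```

A line matches when `IndexOfHorspool(linein, "hotmail.com") >= 0`. In a real scan the shift table would be built once and reused for every line instead of being rebuilt per call.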
 
