Designing a log search application

Bruce · Mar 15, 2008

Hello
I am designing an app to do log search efficiently. I have gigabytes
of server logs that contain all kind of information - typically I
query about a user name in a certain time span to find out what the
user actually did during that time, what errors he got etc.
I previously just used findstr across these files to do it but I am
finding it slow and inaccurate.
So, I am planning to write a platform that parses all the logs
realtime, stores the words in a database.
For example a line in the log reading "User connected to server"
would result in 4 rows in the database for each of the words, with
information about the file, time, relative location in the log among
other things.
This way, if I query for "bruce connected', I would be able to convert
it into a database query and fetch the results fairly quickly.

I have a couple of questions:
1. I am not using any standard search engine since I don't think they
index and provide the level of detail I would need. So, does my design
of using a database in the above manner sound good?
2. On top of this platform, I plan to build layers that do intelligent
search - say using business logic, it queries and finds out all users
who got errors and displays them in a UI.

I am curious to know whether there is a better approach to this.

Thanks
Bruce

Jon Skeet [C# MVP] · Mar 15, 2008

Bruce said:
I am designing an app to do log search efficiently. I have gigabytes
of server logs that contain all kind of information - typically I
query about a user name in a certain time span to find out what the
user actually did during that time, what errors he got etc.
I previously just used findstr across these files to do it but I am
finding it slow and inaccurate.
So, I am planning to write a platform that parses all the logs
realtime, stores the words in a database.
For example a line in the log reading "User connected to server"
would result in 4 rows in the database for each of the words, with
information about the file, time, relative location in the log among
other things.

This way, if I query for "bruce connected', I would be able to convert
it into a database query and fetch the results fairly quickly.

I have a couple of questions:
1. I am not using any standard search engine since I don't think they
index and provide the level of detail I would need. So, does my design
of using a database in the above manner sound good?

It sounds like you're ignoring the fact that the database itself is
probably good at doing full text indexing. Why not have one row per log
entry, and let the database handle the complicated stuff?

Alternatively, if you're really just building an index, consider using
a product built specifically for indexing, such as Lucene.

2. On top of this platform, I plan to build layers that do intelligent
search - say using business logic, it queries and finds out all users
who got errors and displays them in a UI.

I am curious to know whether there is a better approach to this.

That part sounds perfectly reasonable - although again, it may well be
easier if you have one entry per log entry, with separate columns for
the nature of the entry (error, info etc), the affected user, etc.

Bruce · Mar 15, 2008

It sounds like you're ignoring the fact that the database itself is
probably good at doing full text indexing. Why not have one row per log
entry, and let the database handle the complicated stuff?

Alternatively, if you're really just building an index, consider using
a product built specifically for indexing, such as Lucene.

That part sounds perfectly reasonable - although again, it may well be
easier if you have one entry per log entry, with separate columns for
the nature of the entry (error, info etc), the affected user, etc.

Thanks for the reply Jon.

Why not have one row per log entry, and let the database handle the complicated stuff?

Won't the search get less efficient if I do this?
If the row reads "Bruce sent a message" and I search for Bruce, my
query would be like "Select xx from xxx where logcolumn like "% Bruce
%"". Isn't doing the "like" inefficient since the database would have
to do a full table scan?
(whereas if I just had the word, the index can directly get me the
row. Multiple words would require joins though)
Also, I would not know from a log entry what/where the user name is .

Thanks
Bruce

Bruce · Mar 15, 2008

It sounds like you're ignoring the fact that the database itself is
probably good at doing full text indexing. Why not have one row per log
entry, and let the database handle the complicated stuff?

Alternatively, if you're really just building an index, consider using
a product built specifically for indexing, such as Lucene.

That part sounds perfectly reasonable - although again, it may well be
easier if you have one entry per log entry, with separate columns for
the nature of the entry (error, info etc), the affected user, etc.

Thanks for the reply Jon.

Why not have one row per log entry, and let the database handle the complicated stuff?

Won't the search get less efficient if I do this?
If the row reads "Bruce sent a message" and I search for Bruce, my
query would be like "Select xx from xxx where log like "% Bruce %"".
Isn't doing the "like" inefficient since the database would have to do
a full table scan?
(whereas if I just had the word, the index can directly get me the
row. Multiple words would require joins though)
Also, I would not know from a log entry what/where the user name is to
have a separate column for it.

Thanks
Bruce

Jon Skeet [C# MVP] · Mar 15, 2008

Won't the search get less efficient if I do this?
If the row reads "Bruce sent a message" and I search for Bruce, my
query would be like "Select xx from xxx where logcolumn like "% Bruce
%"". Isn't doing the "like" inefficient since the database would have
to do a full table scan?
(whereas if I just had the word, the index can directly get me the
row. Multiple words would require joins though)

Look into full text indexing. A quick look at the docs for SQL Server
2005 suggest that you'd use a "CONTAINS" clause - but I haven't
actually *used* full text indexing myself, so it's new to me too

I
just know that various databases support it, and it avoids you having
to reinvent the wheel. (I believe it also supports cunning stuff like
word stemming, and possibly even coping with typos etc.)

Also, I would not know from a log entry what/where the user name is .

Is there some way of scanning for known entry patterns when you're
putting the entries into the database, and fetching really useful
information like that once, rather than searching multiple times?

Bruce · Mar 15, 2008

Look into full text indexing. A quick look at the docs for SQL Server
2005 suggest that you'd use a "CONTAINS" clause - but I haven't
actually *used* full text indexing myself, so it's new to me too I
just know that various databases support it, and it avoids you having
to reinvent the wheel. (I believe it also supports cunning stuff like
word stemming, and possibly even coping with typos etc.)

Is there some way of scanning for known entry patterns when you're
putting the entries into the database, and fetching really useful
information like that once, rather than searching multiple times?

Full text indexing sounds pretty interesting, I took a look and it
seems like it does not happen automatically and I need to manually
build the index population. But I will investigate it further.

Regarding fetching multiple entries at once, I think it depends on
what kind of information users are searching for. I will keep the idea
in mind. Thanks for the input Jon.

Bruce

Designing a log search application

Bruce

Jon Skeet [C# MVP]

Bruce

Bruce

Jon Skeet [C# MVP]

Bruce