How to program a really fast product search catalogue

DC · Oct 18, 2004

Hi,

I am programming a search catalogue with 200000 items (and growing). I
am currently using the SQL Server 2000 fulltext engine for this task
but it does not fit the requirements anymore.

The products typically do have a verbose name, "canadian superapples:
red tasty juicy macintosh apple from toronto" and the like. If a
customer is looking for "canadian apple" this product needs to match,
but also if he is looking for "juicy mac".

In SQL Server I am using the clause CONTAINSTABLE(products, name,
'"juicy*" ~ "mac*", 50000) to archieve this but first of all this is
not really fast (I guess the fulltext index service is more into
really long text - or just too slow) and I am running into the
limitation that people search for "green apple" and there are 80.000
products in the database which contain the word "green" in their name,
therefore sql server will come back with an empty result table (since
the fulltext service only digs up to about 20000 items and breaks
thereafter).

I would also love to get some search tolerance, if people enter
"africa banona" it would be great if the algorithm could ignore the
typo.

I am wondering if it is time to implement my own search routine for
that, and that is why I post this message to the csharp and the
sqlserver newsgroup. I am free to use any .net and/or sql server
solution.

I thought it would be a good idea to split the names into single
expressions and use a 1:n relation between the expressions and the
matching products. So I can use a clustered index search on the table
with the expressions. The search for "juicy mac" would then get a lock
on "juicy" which matches 20000 products and macintosh which matches
5000 products and the intersection of the two expressions is the
result list of 3000 products. It turned out, that this solution is
slower than using the fulltext service.

As you can see I am not particulary strong about search algorithms
(database spoiled), so any comments on how to get started would help
me a lot!

Regards
Derek

Sriram Krishnan · Oct 18, 2004

You might want to look into Lucene.NET if you're planning to move away from
database land

Nicholas Paldino [.NET/C# MVP] · Oct 18, 2004

DC,

As far as ignoring the typos, have you considered the SOUNDEX function
(and the DIFFERENCE function as well)? It will allow you to determine what
words are similar (check the SOUNDEX documentation for more details).

Also, I suspect that CONTAINSTABLE is probably causing the slowdown.
However, I can't see why the query expression is slower. Can you post the
query? It should be a simple join (between the item, and the words that
describe it), with a filter on the specific words (an operation that should
be very fast).

Hope this helps.

Fernando Rodríguez · Oct 18, 2004

You might want to look into Lucene.NET if you're planning to move away from
database land

I though Lucene was java only... :-P

After googling for a while, I couldn't
find any reference to a .net version.

Bonj · Oct 18, 2004

you could insert the programming before the data gets to the database, i.e.
instead of searching through the verbose descriptions, parse the verbose
descriptions out into words that are stored in a separate table with a
primary key and a one to many relationship with the products (i.e. one
product has many words associated with it). Then when the user types their
search into the box, you again parse that up into words and use those words
to match products by doing a join on the two tables and a where clause on
the word table (making a big "word='juicy' or word='mac' or word='apple' "
string is faster than using "word in ('juicy', 'apple', 'mac') ")

Sriram Krishnan · Oct 18, 2004

http://sourceforge.net/projects/dotlucene/

DC · Oct 21, 2004

Hi,

thanks for the suggestions! Since the opinion seems to be, that this
should be doable with SQL Server alone, I have tried it again.

I tried different approaches, and the fastest I could find is this:

I take appart all product names. E.g. I have a funtion that returns
the table

ye
yel
yell
yello
yellow
ap
app
appl
apple

when I enter the term "yellow apple". Splitting the tokens saves some
work later, since I don not have to use the "like" operator when
searching (my application gernerally requires right truncation). I
store the result of this function in a table "tbl_products_tokens",
which is 1:n related to an intersection table which is n:1 related to
the product table. The tbl_products_tokens table does have a grouped
index on the tokens. I also store the number of occurences of each
token in tbl_products_tokens.

tbl_products_tokens:
token id incidents
ye 1 200
yel 2 198
yell 3 197
....
apple 100 25000

tbl_products_tokensis:
id_token id_product
1 2
1 3
....
100 1000

tbl_products:
id name
2 yellow apple
3 yellow submarine
....
1000 red apple

Now, if the search phrase is "yel apple", I am first looking up the
matchin tokens, sorted by incidents (asc):
token incidents
yel 197
apple 25000

Now I fill a table variable with the product id's from table
intersections which match "yel". Then I join this with a second table
variable which contains the product id's matching "apple". To
demonstrate this process I append some t-sql.

The overall performance is now about as mediocre as the fulltext index
was, but it does no longer have the limitation of getting corrupted if
a single token (e.g. "apple") matches more than 20000 products.

I do have quiet some search maintenance to do on my own with this,
tbl_products_tokens contains about 800,000 rows, tbl_products_tokensis
10,000,000 rows.

What I cannot archieve with this approach is search tolerance. I
implemented SOUNDEX in an earlier approach but that does not help much
with the (typically non-english) product names.

I also wonder how product like Google Desktop and Lookout (the Outlook
plugin) manage to search huge amounts of fulltext data as lightening
fast as they do. I would carry out the product search in CSharp
outside SQL Server if I could expect to archieve a performance like
that.

Kind regards

Derek

declare @m1 table ([id] int primary key)
declare @m2 table ([id] int primary key)
declare @tokens table (id int identity(1,1), token varchar(63),
incidents int)

insert into @tokens (token, incidents)
select ts.token, sum(incidents)
from tbl_products_tokens t
right join dbo.func_tokenize_select(@searchfor) ts on t.token like
ts.token
group by ts.token
order by sum(incidents) asc

set @mtoks = @@rowcount
set @mtok_id = 1

select * from @tokens

if @mtoks > 0
begin
select top 1 @token = token from @tokens where id = 1
insert into @m1 (id)
select id from dbo.func_search_token(@token)
set @mtok_id = 2
select @mcount = count(id) from @m1
end

while @mtoks >= @mtok_id
begin
select top 1 @token = token from @tokens where id = @mtok_id
if @mtok_id % 2 = 0
begin
insert into @m2
select m1.id from @m1 m1
join dbo.func_search_token(@token) m2 on m2.id = m1.id
delete @m1
select @mcount = count(id) from @m2
end
else
begin
insert into @m1
select m2.id from @m2 m2
join dbo.func_search_token(@token) m1 on m1.id = m2.id
delete @m2
select @mcount = count(id) from @m1
end

set @mtok_id = @mtok_id + 1

if @mcount = 0
break
end

create function func_search_token (@token varchar(63))
returns @matches table ([id] int)

as
begin
insert into @matches
select its.id_product
from tbl_products_tokensis its
join tbl_products_tokens it on it.id = its.id_token
where it.token = @token
return
end

search design feedback	2	Oct 25, 2008
Does FullText warning mean any update to text field will not be reflected?	2	Apr 18, 2008
Excel as a products catalogue?	7	Nov 8, 2007
How do I create a product catalogue using Sharepoint?	1	Aug 17, 2005
Chinese + Russian miitary to replace Microsoft products.	2	Jul 24, 2019
W Search 4.0 Index Not in Catalogue	0	Oct 22, 2008
why is it not possible to find the database	6	Mar 25, 2011
search assistant, indexing, directories & catalogues	1	Aug 18, 2005

How to program a really fast product search catalogue

DC

Sriram Krishnan

Nicholas Paldino [.NET/C# MVP]

Fernando Rodríguez

Bonj

Sriram Krishnan

DC

Ask a Question

Similar Threads