Fuzzy logic searching

G

Guest

I have a Database of 56000 Company Names, when a new record is to be added I
have to determine if the Company exists inthe current Dataset and then
manually decide if it is a new site for that company or updated information
or a genuine new entry.

So much for the problem, now the form of the data is an excel sheet and the
Company Name may be similar but not exact eg DS Smith or D S Smith
Corrugators Ltd.
I would like to step through the existing data and apply human logic to
decide an action.
If the name is similar but has a mismatch of x letters then it may be the
same company and I would then see a list of similar entries to decide the
appropriate action.
From my distant college days (30 years ago) I seen to recall fuzzy logic may
be an answer but old age has caught up and I cannot remember even the basics
of what I was taught.

Thanks in advance
 
M

Marshall Barton

Don said:
I have a Database of 56000 Company Names, when a new record is to be added I
have to determine if the Company exists inthe current Dataset and then
manually decide if it is a new site for that company or updated information
or a genuine new entry.

So much for the problem, now the form of the data is an excel sheet and the
Company Name may be similar but not exact eg DS Smith or D S Smith
Corrugators Ltd.
I would like to step through the existing data and apply human logic to
decide an action.
If the name is similar but has a mismatch of x letters then it may be the
same company and I would then see a list of similar entries to decide the
appropriate action.
From my distant college days (30 years ago) I seen to recall fuzzy logic may
be an answer but old age has caught up and I cannot remember even the basics
of what I was taught.


This is a HUGE problem that will only permit partial
solutions. I too am too old to remembr the details, but I
don't see how fuzzy logic can be applied to this without
reducing the names to numbers. The only way I can think of
doing that is to use some kind of soundx procedure (search
the web for more info) and even then it's an iffy approach.

An approach that I have used is to create a table of
alternate spellings (including spaces and punctuation) for
the names along with the official name. Then join that
table to the raw data to replace the inappropriate spellings
with the official name.

A different query can be used to generate a list of the
names that are not covered in the table so a human can add
them for the future. Before too long that list will cover a
very high percentage of the imported names.

Initially populating the table is still a big job, but you
can create a relatively simple seach form to assist the
human decision process.
 
G

Guest

Thanks guys,

Marshalls approach would certainly work but I suspect it would be a full
time job to get all of the permutations as the records increase by around 250
a month. I tried the Soundex method a while back with very mixed results, it
turned up possible sound alikes that were nowhere near the original request.

John, thanks for the link, I will read it and see where it takes me.
Thanks again for your advice
 
J

John Nurick

I've never had to do this, but I'd probably explore along these lines:

1) Create a procedure to "standardise" a company name by removing all
punctuation and spaces, and chopping any "Incorporated", "LLP", "Corp"
and so on from the end (turning "D. S. Smith (Corrugators) Ltd." into
"DSSmithCorrugators"

2) A second procedure that "standardises" further, by trying to strip
out descriptors (e.g. the "Corrugators" in D S Smith Corrugators,
leaving "DSSmith")

3) Store these two standardised forms for the existing records.

4) When a new record is added, run procedures (1) and (2) on the name
and then use a Levenshtein distance algorithm to find near-matches in
the corresponding fields.

There's lots of scope for refinement. For instance, you could store a
Soundex-like coding too, and then only calculate Levenshtein distances
for names that match on this: this should speed up (4). (Since Soundex
was designed for "last names", it probably would benefit from
modification to tune it for modern company names.)

It may also be worth storing the standardised forms in a related table,
rather than fields in the main table. That way, all N records in the
main table whose names map to "DSSmith" would be linked to a single
record in the related table, and hence would be covered by one
Levenshtein distance calculation when entering a new reocord, rather
than by N.
 
G

Guest

John,

The article on Levenshtein was new to me and it has a lot of possibilities.

Your further ideas could work but still need a fair amount of regular
maintenance which time constraints would cause a problem. For the moment I
will reduce the current names as you suggest and go with the Levenshtein
method.

Much appreciate all of your help
 
G

Guest

I can't find SoundEx. SoundEx is VBA function of function of SQL ?

Thanks
Tran Hong Quang
 
J

John Nurick

Soundex isn't built into VBA or Jet SQL, though I think it is included in
some dialects of SQL. But it is covered in the article by Doug Steele to
which I referred earlier.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Top