Extract Unique Values, Then Extract Again to Remove Suffixes


Harlan Grove

Karl Burrows wrote...
....
It boils down to taking a column of builder names and stripping off the data
that is not common to all values for that builder, whether that means
removing 60', - 60, - Greenbrier, or any other designation they may come up
with. I think that is where the hang up is. Unless you can compare all the
builder values and then say well these are pretty much alike other than the
"60" at the end or "townhome" or other, so let's remove that portion. I
haven't had a chance to try the coding and maybe that is what it does.
....

The problem is that some of the qualifiers added to some of the builder
names could be legitimate parts of a person's or company's name. I'm
not saying that's in fact the case, just that it could be. For example,
Home and House can be surnames.

If the only added qualifiers you have to deal with involve anything
beginning with a decimal numeral or a hyphen, you could use regular
expressions to remove them. But you also have normal words appended
with no more than a space separating them from the builder name. If
*YOU* could compile an exhaustive list of such words that would always
be deleted and never erroneously truncate any builder's name, then you
could use a list of these words as tokens to remove from your records.
Then feed what's left through a dictionary object to eliminate
duplicates.
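
As a rough illustration of the steps just described, here is a minimal
Python sketch (the thread itself is about Excel, so treat this as a
language-neutral outline rather than the code under discussion). The
sample records are made up from the examples mentioned in this thread,
and SUFFIX_WORDS is a hypothetical token list that someone who knows
the data would have to supply. A plain dict plays the role of the
dictionary object.

```python
import re

# Hypothetical token list: words that are ALWAYS safe to delete.
# "Home" is deliberately left out so that "KB Home" survives intact.
SUFFIX_WORDS = ["Townhomes", "Townhome"]

# A trailing qualifier that begins with a hyphen or a decimal numeral,
# separated from the name by whitespace, running to the end of the text.
TRAILING_QUALIFIER = re.compile(r"\s+(?:-|\d).*$")

# One of the listed suffix words at the very end of the text.
TRAILING_WORD = re.compile(
    r"\s+(?:" + "|".join(map(re.escape, SUFFIX_WORDS)) + r")$",
    re.IGNORECASE,
)

def clean_builder(raw):
    """Strip numeral/hyphen qualifiers first, then a listed suffix word."""
    name = raw.strip()
    name = TRAILING_QUALIFIER.sub("", name)
    name = TRAILING_WORD.sub("", name)
    return name

def unique_builders(records):
    """Return cleaned names without duplicates, in first-seen order."""
    seen = {}  # dict keys stand in for the dictionary object
    for raw in records:
        seen.setdefault(clean_builder(raw), None)
    return list(seen)

# Made-up sample records based on the qualifiers named in this thread.
sample = ["Ryan Townhomes", "Ryan - 60", "Ryan 60'", "KB Home", "KB Home - Greenbrier"]
print(unique_builders(sample))
```

Note that the numeral rule would also truncate a builder whose real
name contains a number after a space, which is the same risk as with
the word list.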
 

Karl Burrows

Thus, the problem. Because each builder can have a unique name with some of
those qualifiers in there, it is about impossible to identify their true
name. I think maybe developing a naming convention to add the hyphen or
something else that we can tell it to strip everything right of that
character is going to be the only way to go. As it is now, it is truly
"fuzzy logic!"
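
If such a convention were adopted, the stripping itself becomes
trivial. A minimal Python sketch, assuming the agreed delimiter is a
hyphen with a space on each side (the DELIMITER value and the function
name are only illustrative):

```python
# Assumed convention: builder name, then " - ", then any qualifier.
DELIMITER = " - "

def builder_from_convention(raw):
    """Keep everything left of the first delimiter; drop the rest."""
    name, _sep, _qualifier = raw.partition(DELIMITER)
    return name.strip()

print(builder_from_convention("Ryan - Townhomes"))
print(builder_from_convention("KB Home - 60"))
```

Requiring the spaces around the hyphen keeps hyphenated builder names
in one piece.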

Thanks for all the help!

 

Alan Beban

Harlan said:
Karl Burrows wrote...
...


...

The problem is that some of the qualifiers added to some of the builder
names could be legitimate parts of a person's or company's name. I'm
not saying that's in fact the case, just that it could be. . . .

Unless I misunderstand, we already know from the OP's 3rd posting in
this thread that it is the case. He made it clear that in Ryan
Townhomes, the builder's name is Ryan and Townhomes is a suffix; and
that in KB Home the builder's name is KB Home and Home is not a suffix.

Alan Beban
 

Harlan Grove

Alan Beban said:
Unless I misunderstand, we already know from the OP's 3rd posting
in this thread that it is the case. He made it clear that in Ryan
Townhomes, the builder's name is Ryan and Townhomes is a suffix;
and that in KB Home the builder's name is KB Home and Home is not
a suffix.

Then it's the same situation as parsing surnames from a list of people of
many different national origins but with inappropriate English capitalization
rules applied. E.g.,

Charles Der
Ruud Van Der Aalter
Nguyen Van Tieu

You come up with a rule to handle all these correctly, and you may have a
prayer of handling general company names, which follow even fewer rules. [This
is rhetorical. It's theoretically possible if there's a sufficiently
complete rule base, but it'd be expedient to use an approach that works
80-90% of the time and correct the rest manually.]
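
One way such a mostly-right approach could look, sketched in Python
under loud assumptions: the heuristic (take the longest whole-word
prefix a record shares with at least one other record) is only a guess
at something workable, not a method anyone in this thread proposed in
code, and the sample records, including "Acme Builders", are made up.
Records that share nothing with any other record are set aside for
manual correction.

```python
from collections import Counter

def word_prefixes(name):
    """All whole-word prefixes of a name, shortest first."""
    words = name.split()
    return [" ".join(words[:i]) for i in range(1, len(words) + 1)]

def propose_builders(records):
    """Return (proposals, review): a guessed builder per record, plus
    the records that need a human to look at them."""
    # Count in how many distinct records each whole-word prefix occurs.
    counts = Counter(p for rec in set(records) for p in word_prefixes(rec))
    proposals, review = {}, []
    for rec in records:
        shared = [p for p in word_prefixes(rec) if counts[p] > 1]
        if shared:
            # Longest prefix shared with another record, minus any
            # dangling hyphen left over from a "- qualifier" suffix.
            proposals[rec] = shared[-1].rstrip(" -")
        else:
            review.append(rec)
    return proposals, review

# Made-up sample records; "Acme Builders" is a hypothetical name.
sample = ["Ryan Townhomes", "Ryan - 60", "KB Home",
          "KB Home - Greenbrier", "Acme Builders"]
proposals, review = propose_builders(sample)
print(proposals)
print(review)
```

It will guess wrong whenever two different builders start with the
same word, so the proposals still want a quick look before use.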
 
