Extract Unique Values, Then Extract Again to Remove Suffixes

Harlan Grove · Jun 24, 2005

Karl Burrows wrote...
....

It boils down to taking a column of builder names and stripping off the data
that is not common to all values for that builder, whether that means
removing 60', - 60, - Greenbrier, or any other designation they may come up
with. I think that is where the hang-up is. Unless you can compare all the
builder values and then say, well, these are pretty much alike other than the
"60" at the end or "townhome" or other, so let's remove that portion. I
haven't had a chance to try the coding, and maybe that is what it does.

....

The problem is that some of the qualifiers added to some of the builder
names could be legitimate parts of a person's or company's name. I'm
not saying that's in fact the case, just that it could be. For example,
Home and House can be surnames.

If the only added qualifiers you have to deal with involve anything
beginning with a decimal numeral or a hyphen, you could use regular
expressions to remove them. But you also have normal words appended
with no more than a space separating them from the builder name. If
*YOU* could compile an exhaustive list of such words that could always
be deleted without ever erroneously truncating any builder's name, then
you could use that list as tokens to remove from your records.
Then feed what's left through a dictionary object to eliminate
duplicates.
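
A minimal sketch of the steps above, written in Python rather than Excel VBA so it can be run anywhere. The token list, the function names and the sample entries are assumptions for illustration only; the real list of removable words would have to come from whoever knows the data. A plain Python dict stands in for the dictionary object.

```python
import re

# Hypothetical list of trailing words that are ALWAYS qualifiers.
# "Home" is deliberately left out because it can be part of a real name.
SUFFIX_TOKENS = {"townhome", "townhomes"}

# A qualifier that follows whitespace and starts with a hyphen or a digit,
# running to the end of the entry (e.g. " - 60", " 60'", " - Greenbrier").
QUALIFIER_RE = re.compile(r"\s+(?:-|\d).*$")


def strip_qualifiers(name):
    """Remove hyphen/number qualifiers, then any listed trailing tokens."""
    base = QUALIFIER_RE.sub("", name.strip())
    words = base.split()
    # Never strip the last remaining word, so a name cannot become empty.
    while len(words) > 1 and words[-1].lower() in SUFFIX_TOKENS:
        words.pop()
    return " ".join(words)


def unique_builders(names):
    """Deduplicate cleaned names, keeping first-seen order and spelling."""
    seen = {}
    for name in names:
        base = strip_qualifiers(name)
        seen.setdefault(base.lower(), base)
    return list(seen.values())


sample = ["Ryan Townhomes", "Ryan - 60", "Ryan 60'",
          "KB Home", "KB Home - Greenbrier"]
print(unique_builders(sample))
```

Note that the pattern requires whitespace before the hyphen or digit, so a name such as Smith-Jones is left alone, but a name with a number in the middle would still be truncated.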

Karl Burrows · Jun 24, 2005

Thus, the problem. Because each builder can have a unique name with some of
those qualifiers in it, it is nearly impossible to identify their true
name. I think developing a naming convention, such as adding a hyphen or
some other character so that we can strip everything to the right of it,
is going to be the only way to go. As it is now, it is truly
"fuzzy logic!"

Thanks for all the help!
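
Once such a convention is in place, the stripping step becomes trivial. A minimal Python sketch, assuming a hyphen as the agreed delimiter; the function name and the sample entries are hypothetical.

```python
def builder_from_convention(entry, delimiter="-"):
    """Keep only the text to the left of the first delimiter."""
    base, _sep, _qualifier = entry.partition(delimiter)
    return base.strip()


print(builder_from_convention("Ryan - Townhomes"))
print(builder_from_convention("KB Home - 60'"))
print(builder_from_convention("KB Home"))
```

A builder whose real name contains the delimiter would still be cut short, so a character that never appears in names (for instance "|") may be a safer choice than a hyphen.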

Alan Beban · Jun 24, 2005

Harlan said:
Karl Burrows wrote...
...

...

The problem is that some of the qualifiers added to some of the builder
names could be legitimate parts of a person's or company's name. I'm
not saying that's in fact the case, just that it could be. . . .

Unless I misunderstand, we already know from the OP's 3rd posting in
this thread that it is the case. He made it clear that in Ryan
Townhomes, the builder's name is Ryan and Townhomes is a suffix; and
that in KB Home the builder's name is KB Home and Home is not a suffix.

Alan Beban

Harlan Grove · Jun 25, 2005

Alan Beban said:
Unless I misunderstand, we already know from the OP's 3rd posting
in this thread that it is the case. He made it clear that in Ryan
Townhomes, the builder's name is Ryan and Townhomes is a suffix;
and that in KB Home the builder's name is KB Home and Home is not
a suffix.

Then it's the same situation as parsing surnames from a list of people of
many different nationalities but with inappropriate English capitalization
rules applied. E.g.,

Charles Der
Ruud Van Der Aalter
Nguyen Van Tieu

You come up with a rule to handle all these correctly, and you may have a
prayer of handling general company names, which follow even fewer rules. [This
is rhetorical. It's theoretically possible if there's a sufficiently
complete rules base, but it'd be expedient to use an approach that works
80-90% of the time and correct the rest manually.]
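
One way such a mostly-automatic approach might look, as a Python sketch: apply a simple rule, flag the entries it cannot decide, and let a hand-maintained table override the rule. The word list, the corrections table and the function name are all assumptions for illustration, not something taken from the OP's data.

```python
import re

# Qualifier that follows whitespace and starts with a hyphen or a digit.
QUALIFIER_RE = re.compile(r"\s+(?:-|\d).*$")

# Hypothetical trailing words that may be a suffix OR part of a real name.
AMBIGUOUS_LAST_WORDS = {"home", "homes", "house", "townhome", "townhomes"}

# Hypothetical manual corrections, keyed by the raw entry.
MANUAL_FIXES = {"Ryan Townhomes": "Ryan"}


def resolve(entry):
    """Return (builder_name, needs_review) for one raw entry."""
    if entry in MANUAL_FIXES:
        return MANUAL_FIXES[entry], False
    guess = QUALIFIER_RE.sub("", entry.strip())
    words = guess.split()
    # The rule cannot tell a suffix from a surname, so flag these for a human.
    needs_review = len(words) > 1 and words[-1].lower() in AMBIGUOUS_LAST_WORDS
    return guess, needs_review


for raw in ["Ryan Townhomes", "Ryan - 60", "KB Home - Greenbrier"]:
    print(raw, "->", resolve(raw))
```

Entries that come back flagged are reviewed by hand and, where the guess was wrong, added to the corrections table so they are handled automatically the next time.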
