David McRitchie wrote...
....
Separating City when not separated from street address
http://www.mvps.org/dmcritchie/excel/city.htm
With regard to the #steps section, presumably state and zip code are
unambiguous, so wouldn't it make sense to drop a commas into the
address between state and zip code first, then use Text-to-Columns to
parse out these fields, leaving only the street address and city to
process?
As for stringing together city names, if it were practical, better to
use a recursive procedure to string together city names right to left.
However, it's not practical. While your macro handles the common
leading words 'Los' and 'San', it omits the equally common 'Las' and
'Santa', which indicates you've never lived in the southwest US.
There are such California cities as Angels Camp, Costa Del Mar, Corte
Madera, Diamond Bar, El Segundo, Elk Grove, Grand Terrace (which could
just as easily be a street name), Inidan Wells, La Mesa, Lake
Arrowhead, Lake Forrest, Loma Linda, Lone Pine, Mission Viejo, Monte
Sereno, Moss Landing, Mountain View, Palm Desert, Palm Springs, Palo
Alto, Paradise Ridge, Paso Robles, Point Arena, Rancho Cucamongo,
Rancho Palos Verdes, Red Bluff, Rio Vista, Thousand Oaks, Twentynine
Palms, and more, all of which your macro would skip. Heaven help it if
fed Alaskan and Hawaiian addresses, or some parts of northern New
England bordering Quebec.
It includes ' Place' and ' Run', which are also used as street tokens,
e.g.,
123 Stags Run Place Mill Valley, CA 99999
which would become
123 Stags_Run_Place_Mill_Valley, Ca 99999
when it's pretty clear this should be 'Stags Run Place' in 'Mill
Valley'.
Parsing street tokens is easier because there are fewer words used for
them than there are words used in city/towm/village names. But even
that isn't guaranteed to work, especially when there are some
cities/towns that use directional qualifiers after the street token,
e.g.,
987 Main Street East Podunk, IL 55555
Should that be 'Main Street East, Podunk' or 'Main Street, East
Podunk'?
Sample data processed was, the only additional special
change to the macro was for "St. Louis," which had to
distinguish between abbreviation for Street and Saint.
Hopefully your data is for local or at least for a single state.
The 'St. St.' example is actually a usable pattern. It's nearly certain
duplicated adjacent substrings are either typos or not part of the same
feld. If typos, they'll require manual correction anyway. If legit, the
odds are such that they should trigger the break between street name
and city name.
This *IS* a statistical undertaking, meaning some patterns are more
reliable than others, and none will do a perfect job. Pasting together
city names is unlikely to provide as much benefit as pasting together
street names.