automatically extract the City, State, and zip

J

jajoseph

Does anyone have a formula to automatically extract the City, State,
and zip code from the following text string.


4076 St. Andrews Ct. Canfield, OH 44406
 
J

JulieD

please stay with the original thread otherwise you'll get lots of use "text
to columns" answers.
 
J

JE McGimpsey

Are all your cities going to be one-word? If not, how will they be
distinguished from the Street name/type, e.g.,

4076 St. Andrews St. St. Louis, OH 44406
 
J

JE McGimpsey

The last two are easy:

Zip:

=RIGHT(TRIM(A1),5)

State:

=LEFT(RIGHT(TRIM(A1),8),2)


See my other reply regarding City.
 
D

David McRitchie

Did you first of all try to look for a solution in newsgroup archives
before posting. It is not a good idea to post to more than once
as you will cetainly dilute the answers you get.
http://www.mvps.org/dmcritchie/excel/posting.htm

How did you end up with such a mess, you certainly won't
solve it with formulas, nor with single invocation of a macro
unless you have very limited data.

If you just have a few cities that aren't single word cities or can't
be simulated as one word by first replacing
"New " with "New." with a period, and same for
the likes of Los, San and Alta, before doing Data, Text-to-Columns. Then
change the periods back to a space. If you just have Ohio
you could replace " OH " with ",OH,"
before you use text-to-columns with the comma to separate
state and zipcode; otherwise, use text to data once more to
separate State code from zip code this time using a space
as the delimiter. Format the zipcode as text
within text-to-columns.

Myself I would use a macro see SepLastTerm in
Reformat or Rearrange Data in Columns.
http://www.mvps.org/dmcritchie/excel/join.htm#seplastterm
once to move zipcode to the right and then again to
move state to the right, both into empty columns. Format
the column to receive the zip code as Text first.

There is a fixup for 5 digit zip codes in
http://www.mvps.org/dmcritchie/excel/join.htm#fixUSzip5


HTH,
David McRitchie, Microsoft MVP - Excel [site changed Nov. 2001]
My Excel Pages: http://www.mvps.org/dmcritchie/excel/excel.htm
Search Page: http://www.mvps.org/dmcritchie/excel/search.htm
 
H

Harlan Grove

JE McGimpsey wrote...
Are all your cities going to be one-word? If not, how will they be
distinguished from the Street name/type, e.g.,

4076 St. Andrews St. St. Louis, OH 44406
....

Not foolproof, but you could use an array of likely street tokens,
e.g., Street, Avenue, Boulevard, Road, Highway, Court, Close, Place,
Crescent, Circle, etc. along with all abreviations with and without
periods. Find the last one in the string and check if it's immediately
preceded by another. If so, assume the preceding one is the street
token and the following one part of the city name. Royal PITA to
implement in worksheet functions though.
 
J

JE McGimpsey

Harlan Grove said:
JE McGimpsey wrote...
...

Not foolproof, but you could use an array of likely street tokens,
e.g., Street, Avenue, Boulevard, Road, Highway, Court, Close, Place,
Crescent, Circle, etc. along with all abreviations with and without
periods. Find the last one in the string and check if it's immediately
preceded by another. If so, assume the preceding one is the street
token and the following one part of the city name. Royal PITA to
implement in worksheet functions though.

Yup - and I suspect made worse when the real-world data probably has
Street abbreviated both as "St." and "ST" or even "Str." Now, if you
throw in the odd "1234 John Street" and "1234 John Street Road" you can
go nuts.

I'm supporting a contract with a company in Brazil right now. Trying to
bring *those* addresses into the current system is a real hoot.
 
H

Harlan Grove

JE McGimpsey wrote...
....
I'm supporting a contract with a company in Brazil right now. Trying to
bring *those* addresses into the current system is a real hoot.

Nothing beats proper field separators, eh?
 
D

David McRitchie

I have put the entire process onto a web page. Though I can't
help but wonder if you might actually have a double space or
something other than a space before the town name.

The same question comes up at least once a year, and your
question is much harder to answer because you asked in
at least 3 places, which causes confusion. Especially when
you try to put limitations on the solution -- which I did not pay
attention to because it doesn't help get a quick working solution,
and you will have to tailor the macro to your actual data, or modify
your resulting data. Especially if you get towns with triple words
like "Sault Ste. Marie" in Ontario, or like "Salt Lake City" in Utah.

Separating City when not separated from street address
http://www.mvps.org/dmcritchie/excel/city.htm

Sample data processed was, the only additional special
change to the macro was for "St. Louis," which had to
distinguish between abbreviation for Street and Saint.
Hopefully your data is for local or at least for a single state.

12 Boulevard Westfield, NJ 07090
123 High St Crooked Creek, WY 77777
444 University Dr. Chapel Hill, NC 12345
4076 St. Andrews Ct. Canfield, OH 44406
4076 St. Andrews St. St. Louis, OH 44406

The idea is to make the towns with two words into one word
before separation with an underscore and then replace the
space afterwards. You will always have exceptions, and
I am still waiting to hear how/why you got the data in such a mess,
as it would probably be easier and more reliable to tell them what
was wrong with what was supplied.
 
H

Harlan Grove

David McRitchie wrote...
....
Separating City when not separated from street address
http://www.mvps.org/dmcritchie/excel/city.htm

With regard to the #steps section, presumably state and zip code are
unambiguous, so wouldn't it make sense to drop a commas into the
address between state and zip code first, then use Text-to-Columns to
parse out these fields, leaving only the street address and city to
process?

As for stringing together city names, if it were practical, better to
use a recursive procedure to string together city names right to left.
However, it's not practical. While your macro handles the common
leading words 'Los' and 'San', it omits the equally common 'Las' and
'Santa', which indicates you've never lived in the southwest US.

There are such California cities as Angels Camp, Costa Del Mar, Corte
Madera, Diamond Bar, El Segundo, Elk Grove, Grand Terrace (which could
just as easily be a street name), Inidan Wells, La Mesa, Lake
Arrowhead, Lake Forrest, Loma Linda, Lone Pine, Mission Viejo, Monte
Sereno, Moss Landing, Mountain View, Palm Desert, Palm Springs, Palo
Alto, Paradise Ridge, Paso Robles, Point Arena, Rancho Cucamongo,
Rancho Palos Verdes, Red Bluff, Rio Vista, Thousand Oaks, Twentynine
Palms, and more, all of which your macro would skip. Heaven help it if
fed Alaskan and Hawaiian addresses, or some parts of northern New
England bordering Quebec.

It includes ' Place' and ' Run', which are also used as street tokens,
e.g.,

123 Stags Run Place Mill Valley, CA 99999

which would become

123 Stags_Run_Place_Mill_Valley, Ca 99999

when it's pretty clear this should be 'Stags Run Place' in 'Mill
Valley'.

Parsing street tokens is easier because there are fewer words used for
them than there are words used in city/towm/village names. But even
that isn't guaranteed to work, especially when there are some
cities/towns that use directional qualifiers after the street token,
e.g.,

987 Main Street East Podunk, IL 55555

Should that be 'Main Street East, Podunk' or 'Main Street, East
Podunk'?
Sample data processed was, the only additional special
change to the macro was for "St. Louis," which had to
distinguish between abbreviation for Street and Saint.
Hopefully your data is for local or at least for a single state.

The 'St. St.' example is actually a usable pattern. It's nearly certain
duplicated adjacent substrings are either typos or not part of the same
feld. If typos, they'll require manual correction anyway. If legit, the
odds are such that they should trigger the break between street name
and city name.

This *IS* a statistical undertaking, meaning some patterns are more
reliable than others, and none will do a perfect job. Pasting together
city names is unlikely to provide as much benefit as pasting together
street names.
 
D

David McRitchie

It depends on what area is covered, if you are talking about
a few towns or even a few counties, it would be much better
to work on the town because they usually aren't abbreviated
except for Mt., St., Ste which come to mind.

An apartment designation is often at the end of a street name.

There are so many different abbreviations for Road, Street,
Circle, Crescent that I find them hard to work with. Don't expect
to see much of the official US Postal abbreviations, because the
abbreviation for Street had a period after it. And not every
address has such a suffix after the street either.
http://www.usps.com/ncsc/lookups/abbreviations.html#suffix

If it is the entire country it gets much worse.

The post office official abbreviations kind of make my head spin,
No periods, usually no vowels after first letter, but the data shown did not
fit that category.

No correct answer for bad data. I like your example.
 
D

Dr. Vladimir Mindin

May be the most general approach is to query WEB from Excel.
Google returned answer with the map(!) in 0.08SEC
Map of 4076 St Andrews Ct, Canfield, OH 44406
Google Maps - Yahoo! Maps - MapQuest


Best regards,
Dr. Vladimir Mindin
Chair, Computer Sci and Math Dept, Globe Institute of Technology
 
D

David McRitchie

Hardly automated and Google will try for a best match,
and if you use a non existent address (no such street)
Google may return a match anyway, or return no such match.
 
D

Dr. Vladimir Mindin

1. Major point of my approach is to search for particular address in the
database of addresses instead of creating complicated program to "extract
the City, State, and zip".
2. Google is easy accessible, that is why I searched Google (very large
database) to illustrate the idea.
3. "Hardly automated" - not an issue in case of databases + Excel.
4. Statement - "...may return a match anyway, or return no such match" is
just what makes difference between attempt to solve the problem by
algorithm/program and by searching in a database.
In my practical work with database of 45,000 addresses there was a perfectly
structured address that had all parts (city, state, zip, country) in
appropriate fields. The problem was that field country had record "United
States of Brooklyn."
Is it possible to create algorithm that will help with all this REAL things.
NO.
Is it easy to find all kind of strange things that exist in addresses in
REAL databases by comparing with standard database.
I hope you agree that answer is - YES.

Best regards,
Dr. Vladimir Mindin
Chair, Computer Sci and Math Dept, Globe Institute of Technology


David McRitchie said:
Hardly automated and Google will try for a best match,
and if you use a non existent address (no such street)
Google may return a match anyway, or return no such match.
---
HTH,
David McRitchie, Microsoft MVP - Excel [site changed Nov. 2001]

Dr. Vladimir Mindin said:
May be the most general approach is to query WEB from Excel.
Google returned answer with the map(!) in 0.08SEC
Map of 4076 St Andrews Ct, Canfield, OH 44406
Google Maps - Yahoo! Maps - MapQuest
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Top