Parsing Address, City ST ZIP

C

Carl S.

Someone brilliant have code to parse messy Address City State ZIP into 4
separate fields? If address has Apt number it should NOT be parsed separate
but included in address fields. Here are examples we get of data downloaded
daily...the main common theme is CA is consistent:

15629 FAIRFORD AVENUE, NORWALK, CA 90650
5947-5949 SOUTH SAN PEDRO, STREET, LOS ANGELES, CA 90003
Vacant Land, Long Beach, CA
4627-4627 1/2 STRANGE AVENUE, Los Angeles (Area), CA 90022
15522 SYLVAN STREET, (VAN NUYS AREA) LOS ANGELES, CA 91411
4325, 4327 & 4329 Stern Ave., North Hollywood, CA 91423
19200 LAHEY STREET, UNIT 4,, Los Angeles, CA 91326

Example 1 is typical easy parse, majority of data.
Example 2 the word STREET should not have been separated by comma.
Example 3 has no zip code or street numbers.
Example 4 has dash and / in street numbers and (AREA) should be removed.
Example 5 (VAN NUYS AREA) should be removed.
Example 6 has extra comma in address at front.
Example 7 should have UNIT 4 glued to address, and extra commas need removal.

If someone can write something to handle all of this, you are awsome! If
you can get everything except example 3 that is ok....I can strip those out
ahead of time...they either say VACANT LAND or RAW LAND...but I would like to
have them if possible. Thanks in advance!
 
J

Joel

Is this data text data before it was imported into excel. It appears this
data went through some program that created this mess. Do you have the data
before this mess was created. there is no good way of getting STREET out of
the 2nd example unless you search for certain key words being by themselves
lis Street, Ave. Avenue.

You included only a small portion of the lines so without seeing a lot of
data I can't predict what types of algorithms will or will not work. I would
be guessing.

If this data was hand typed and we are trying to correct human errors than
no algorithm will work all the time because humans are not predictable, only
machines are predictable. We can only try are best to fix as many problems
as possible with a macro when humans are the causes of the problems.
 
R

Ron Rosenfeld

Someone brilliant have code to parse messy Address City State ZIP into 4
separate fields? If address has Apt number it should NOT be parsed separate
but included in address fields. Here are examples we get of data downloaded
daily...the main common theme is CA is consistent:

15629 FAIRFORD AVENUE, NORWALK, CA 90650
5947-5949 SOUTH SAN PEDRO, STREET, LOS ANGELES, CA 90003
Vacant Land, Long Beach, CA
4627-4627 1/2 STRANGE AVENUE, Los Angeles (Area), CA 90022
15522 SYLVAN STREET, (VAN NUYS AREA) LOS ANGELES, CA 91411
4325, 4327 & 4329 Stern Ave., North Hollywood, CA 91423
19200 LAHEY STREET, UNIT 4,, Los Angeles, CA 91326

Example 1 is typical easy parse, majority of data.
Example 2 the word STREET should not have been separated by comma.
Example 3 has no zip code or street numbers.
Example 4 has dash and / in street numbers and (AREA) should be removed.
Example 5 (VAN NUYS AREA) should be removed.
Example 6 has extra comma in address at front.
Example 7 should have UNIT 4 glued to address, and extra commas need removal.

If someone can write something to handle all of this, you are awsome! If
you can get everything except example 3 that is ok....I can strip those out
ahead of time...they either say VACANT LAND or RAW LAND...but I would like to
have them if possible. Thanks in advance!

For the examples you show, this can be done using regular expressions. In the
sample code below, you can either select the range to parse and then run the
macro; or you can use the Functions individually to dynamically extract the
different portions. The latter would be appropriate if you will be editing the
basic list, and want to see changes dynamically.

Notes:

1. I used early binding, so you MUST SET A REFERENCE (see tools/references) to
Microsoft VBScript Regulare Expressions 5.5. If you are going to distribute
this, you might want to use late binding instead.

2. So far as the extraneous commas are concerned, the Address function will
only remove them if they are at the end of the string, or if they are followed
by the word Street or Ave. This may or may not be satisfactory for you.

3. Zip codes are tested for being "proper". They can be either 9 digits, 5
digits, or in the pattern of 00000-0000

4. If your patterns don't conform roughly to what you have posted, then the
regex may fail, and return blanks. If this is a problem, you'll need to post
more examples.

----------------------------------------------------
Option Explicit
Dim re As RegExp, mc As MatchCollection
Sub ParseAddr()
Dim rg As Range, c As Range
Dim s As String

'set up range to parse
'could be done differently
Set rg = Selection
Range(rg.Offset(0, 1), rg.Offset(0, 4)).Clear

For Each c In rg
s = c.Text
c.Offset(0, 1).Value = Addr(s)
c.Offset(0, 2).Value = City(s)
c.Offset(0, 3).Value = State(s)
'format as text so don't drop leading "0's"
c.Offset(0, 4).NumberFormat = "@"
c.Offset(0, 4).Value = Zip(s)
Next c
End Sub
'--------------------------------------
Function Addr(s As String) As String
Dim sTemp As String
Set re = New RegExp
re.IgnoreCase = True
re.Pattern = "^.*(?=,\s*[^,]+,[^,]+$)"
If re.Test(s) = True Then
Set mc = re.Execute(s)
sTemp = mc(0)
'strip out some of the commas
re.Pattern = ",(?=$|\s+(Street|Ave))"
Addr = re.Replace(sTemp, "")
End If
Set re = Nothing
End Function
'-----------------------------------------
Function City(s As String) As String
Dim sTemp As String
Set re = New RegExp
re.Pattern = ",\s*([^,]+),[^,]+$"
If re.Test(s) = True Then
Set mc = re.Execute(s)
sTemp = mc(0).SubMatches(0)
'strip out text enclosed by parentheses
re.Pattern = "\s*\([^)]+\)\s*"
City = re.Replace(sTemp, "")
End If
Set re = Nothing
End Function
'--------------------------------------
Function State(s As String) As String
Set re = New RegExp
re.Pattern = _
"([A-Z]{2})(?=(\s+(\d{9}|\d{5}-\d{4}|\d{5})\s*$)|\s*$)"
If re.Test(s) = True Then
Set mc = re.Execute(s)
State = mc(0)
End If
Set re = Nothing
End Function
'----------------------------
Function Zip(s As String) As String
Set re = New RegExp
re.Pattern = "\s+(\d{9}|\d{5}-\d{4}|\d{5})\s*$"
If re.Test(s) = True Then
Set mc = re.Execute(s)
Zip = mc(0)
End If
Set re = Nothing
End Function
==========================================
--ron
 
C

Carl S.

Hey Joel...if routine can handle the examples given that will handle 99.5% of
the data and I will throw out the rest. It does not have to be 100%.
Currently I'm getting 90%. I can write something, it will take me some
time....but I thought there were some experienced guys out there who could
slam dunk this no problem. I can do it easy in my other language, but am new
to Excel VBA.
 
R

ras

Parse it back-to-front
Take off the zip, if it exists plus any spaces and comma
then the state, if it exists, and any space/comma before it
then look at the next comma-separated section.
If it exists, take it and remove parenthetical text as the city.
What's left is the address...
This seems to meet your requirements.

You can also vet the state by word, abbrev, or start-of-word.
You can vet the city from a larger list, as well as lookup city by
zip, given the right list.
 
C

Carl S.

Thanks! I have no idea what early or late binding is, and I may need to
export the module and send it to a few other offices, and would like it to be
just a straight import to work...will binding affect this? There is a lot
more to Excel VBA than I have brushed with. Bottom line...I was hoping for
something like Cell A1 has address, after running macro cells A1, A2, A3, and
A4 will have parsed data. Your answer goes over my head...I think I will
write something using MID, LEFT, FIND, SUBSTITUTE, ISNUMBER, etc. I
understand it much easier. CA will be chunk I start with, ZIP will be chunk
to right, City will be chunk to left, and chunk before city will be address
and I will strip all commas out...don't need them. Appriciate your effort
though.

Ron Rosenfeld said:
Someone brilliant have code to parse messy Address City State ZIP into 4
separate fields? If address has Apt number it should NOT be parsed separate
but included in address fields. Here are examples we get of data downloaded
daily...the main common theme is CA is consistent:

15629 FAIRFORD AVENUE, NORWALK, CA 90650
5947-5949 SOUTH SAN PEDRO, STREET, LOS ANGELES, CA 90003
Vacant Land, Long Beach, CA
4627-4627 1/2 STRANGE AVENUE, Los Angeles (Area), CA 90022
15522 SYLVAN STREET, (VAN NUYS AREA) LOS ANGELES, CA 91411
4325, 4327 & 4329 Stern Ave., North Hollywood, CA 91423
19200 LAHEY STREET, UNIT 4,, Los Angeles, CA 91326

Example 1 is typical easy parse, majority of data.
Example 2 the word STREET should not have been separated by comma.
Example 3 has no zip code or street numbers.
Example 4 has dash and / in street numbers and (AREA) should be removed.
Example 5 (VAN NUYS AREA) should be removed.
Example 6 has extra comma in address at front.
Example 7 should have UNIT 4 glued to address, and extra commas need removal.

If someone can write something to handle all of this, you are awsome! If
you can get everything except example 3 that is ok....I can strip those out
ahead of time...they either say VACANT LAND or RAW LAND...but I would like to
have them if possible. Thanks in advance!

For the examples you show, this can be done using regular expressions. In the
sample code below, you can either select the range to parse and then run the
macro; or you can use the Functions individually to dynamically extract the
different portions. The latter would be appropriate if you will be editing the
basic list, and want to see changes dynamically.

Notes:

1. I used early binding, so you MUST SET A REFERENCE (see tools/references) to
Microsoft VBScript Regulare Expressions 5.5. If you are going to distribute
this, you might want to use late binding instead.

2. So far as the extraneous commas are concerned, the Address function will
only remove them if they are at the end of the string, or if they are followed
by the word Street or Ave. This may or may not be satisfactory for you.

3. Zip codes are tested for being "proper". They can be either 9 digits, 5
digits, or in the pattern of 00000-0000

4. If your patterns don't conform roughly to what you have posted, then the
regex may fail, and return blanks. If this is a problem, you'll need to post
more examples.

----------------------------------------------------
Option Explicit
Dim re As RegExp, mc As MatchCollection
Sub ParseAddr()
Dim rg As Range, c As Range
Dim s As String

'set up range to parse
'could be done differently
Set rg = Selection
Range(rg.Offset(0, 1), rg.Offset(0, 4)).Clear

For Each c In rg
s = c.Text
c.Offset(0, 1).Value = Addr(s)
c.Offset(0, 2).Value = City(s)
c.Offset(0, 3).Value = State(s)
'format as text so don't drop leading "0's"
c.Offset(0, 4).NumberFormat = "@"
c.Offset(0, 4).Value = Zip(s)
Next c
End Sub
'--------------------------------------
Function Addr(s As String) As String
Dim sTemp As String
Set re = New RegExp
re.IgnoreCase = True
re.Pattern = "^.*(?=,\s*[^,]+,[^,]+$)"
If re.Test(s) = True Then
Set mc = re.Execute(s)
sTemp = mc(0)
'strip out some of the commas
re.Pattern = ",(?=$|\s+(Street|Ave))"
Addr = re.Replace(sTemp, "")
End If
Set re = Nothing
End Function
'-----------------------------------------
Function City(s As String) As String
Dim sTemp As String
Set re = New RegExp
re.Pattern = ",\s*([^,]+),[^,]+$"
If re.Test(s) = True Then
Set mc = re.Execute(s)
sTemp = mc(0).SubMatches(0)
'strip out text enclosed by parentheses
re.Pattern = "\s*\([^)]+\)\s*"
City = re.Replace(sTemp, "")
End If
Set re = Nothing
End Function
'--------------------------------------
Function State(s As String) As String
Set re = New RegExp
re.Pattern = _
"([A-Z]{2})(?=(\s+(\d{9}|\d{5}-\d{4}|\d{5})\s*$)|\s*$)"
If re.Test(s) = True Then
Set mc = re.Execute(s)
State = mc(0)
End If
Set re = Nothing
End Function
'----------------------------
Function Zip(s As String) As String
Set re = New RegExp
re.Pattern = "\s+(\d{9}|\d{5}-\d{4}|\d{5})\s*$"
If re.Test(s) = True Then
Set mc = re.Execute(s)
Zip = mc(0)
End If
Set re = Nothing
End Function
==========================================
--ron
 
C

Carl S.

Thanks, I am thinking along those lines too. Problem is I don't have to the
technical know how to do it easily. Kind of like knowing how a jet engine
works but unable to write a manual on it in Japanese. Oh well. Guess I have
to tough it out, nothing in life is free but I have got some great snippits
here in the past. Problably take me two days to write the code. Was hoping
someone had a routine for this they could slam dunk to me.
 
R

Ron Rosenfeld

Thanks! I have no idea what early or late binding is, and I may need to
export the module and send it to a few other offices, and would like it to be
just a straight import to work...will binding affect this? There is a lot
more to Excel VBA than I have brushed with. Bottom line...I was hoping for
something like Cell A1 has address, after running macro cells A1, A2, A3, and
A4 will have parsed data. Your answer goes over my head...I think I will
write something using MID, LEFT, FIND, SUBSTITUTE, ISNUMBER, etc. I
understand it much easier. CA will be chunk I start with, ZIP will be chunk
to right, City will be chunk to left, and chunk before city will be address
and I will strip all commas out...don't need them. Appriciate your effort
though.

Well, below is the code re-written so you don't have to worry about the
"binding" and can merely export it as is to wherever.

It seems to work fine on the data you presented; removing both the stuff within
parentheses as well as the comma after UNIT 4 in #7 and the comma before Street
in #2.

It also handles #3 properly.

It's easy (actually easier) to remove all the commas in the address field, but
I wasn't sure that's what you wanted.

In any event, if you get tired of working with the text worksheet functions,
give this a try. It can be tweaked pretty easily once you figure out the
"rules".

As I wrote before, you can either Select the range of addresses, and then run
the macro to Parse out the segments; or you can use the individual UDF's.

Good luck!

=============================================
Option Explicit
Dim re As Object, mc As Object
Sub ParseAddr()
Dim rg As Range, c As Range
Dim s As String

'set up range to parse
'could be done differently
Set rg = Selection
Range(rg.Offset(0, 1), rg.Offset(0, 4)).Clear

For Each c In rg
s = c.Text
c.Offset(0, 1).Value = Addr(s)
c.Offset(0, 2).Value = City(s)
c.Offset(0, 3).Value = State(s)
'format as text so don't drop leading "0's"
c.Offset(0, 4).NumberFormat = "@"
c.Offset(0, 4).Value = Zip(s)
Next c
End Sub
'--------------------------------------
Function Addr(s As String) As String
Dim sTemp As String
Set re = CreateObject("vbscript.regexp")
re.IgnoreCase = True
re.Pattern = "^.*(?=,\s*[^,]+,[^,]+$)"
If re.Test(s) = True Then
Set mc = re.Execute(s)
sTemp = mc(0)
'strip out some of the commas
re.Pattern = ",(?=$|\s+(Street|Ave))"
Addr = re.Replace(sTemp, "")
End If
Set re = Nothing
End Function
'-----------------------------------------
Function City(s As String) As String
Dim sTemp As String
Set re = CreateObject("vbscript.regexp")
re.Pattern = ",\s*([^,]+),[^,]+$"
If re.Test(s) = True Then
Set mc = re.Execute(s)
sTemp = mc(0).SubMatches(0)
'strip out text enclosed by parentheses
re.Pattern = "\s*\([^)]+\)\s*"
City = re.Replace(sTemp, "")
End If
Set re = Nothing
End Function
'--------------------------------------
Function State(s As String) As String
Set re = CreateObject("vbscript.regexp")
re.Pattern = _
"([A-Z]{2})(?=(\s+(\d{9}|\d{5}-\d{4}|\d{5})\s*$)|\s*$)"
If re.Test(s) = True Then
Set mc = re.Execute(s)
State = mc(0)
End If
Set re = Nothing
End Function
'----------------------------
Function Zip(s As String) As String
Set re = CreateObject("vbscript.regexp")
re.Pattern = "\s+(\d{9}|\d{5}-\d{4}|\d{5})\s*$"
If re.Test(s) = True Then
Set mc = re.Execute(s)
Zip = mc(0)
End If
Set re = Nothing
End Function
====================================
--ron
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Top