Address regex help

mikewolfbaltimore

Hello all,

I have a regex question. I want to split a street address into discrete parts, so that

709 S Milton Ave

is split into

Number = 709
Direction = S
Name = Milton
Type = Ave

So far I have the following regex:

(?<number>^\d*(\s\w|\w|\-\w|\s\d/\d))\s(?<direction>(n\.|N\.|s\.|S\.|E\.|e\.|W\.|w\.|NE\.|ne\.|SE\.|se\.|NW\.|nw\.|SW\.|sw\.|n|N|s|S|E|e|W|w|NE|ne|SE|se|NW|nw|SW|sw|North|East|West|South|north|south|west|east)*)(?<street>(.*[^street|place|drive|st|pl|dr|ave|av])*)(?<type>.*)

This works for the following address:

709 S S Milton ave (as in 709 S South Milton ave)

because there the first S is part of the number.

But it does not work for

709 S Milton ave

because it treats the S as part of the number rather than as the
direction.

Any ideas?
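
To see where the S goes, here is a small sketch in Python. It runs only the number part of the pattern above plus the \s that follows it. Python's re module does not accept the (?<name>...) syntax used in the post, so the group is written (?P<number>...); the helper name leading_number is made up for illustration.

```python
import re

# Number sub-pattern copied from the post, followed by the \s that separates
# it from the direction. Only the named-group syntax is changed for Python.
NUMBER = re.compile(r"(?P<number>^\d*(\s\w|\w|\-\w|\s\d/\d))\s")


def leading_number(address):
    """Return what the 'number' group captures, or None if there is no match."""
    m = NUMBER.match(address)
    return m.group("number") if m else None


print(repr(leading_number("709 S Milton ave")))  # '709 S'
print(repr(leading_number("709 Milton ave")))    # '709'
```

The group after \d* is mandatory and tries \s\w first, so " S" is consumed before the direction group is ever reached. In the second call the engine has to backtrack and let \w take the last digit, which is why the plain number still comes out right.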
 
Ben Voigt

Without a database to tell you whether the city has a "South Milton
Avenue", it's ambiguous. Why isn't number "709 S" on "Milton Ave" as valid
as number "709" on "S Milton Ave"?

Moreover, your regex is going to go crazy over
P.O. Box 6000
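
One way to pick a reading is sketched below in Python. It assumes that a letter belongs to the number only when it is attached to it (709A, 709-A, 709 1/2) and that a standalone direction word right after the number is always the direction. The pattern, the short list of street types and the name parse_street_address are illustrative assumptions, not a complete solution.

```python
import re

# Heuristic pattern: number, optional direction, street name, street type.
ADDRESS = re.compile(
    r"""^\s*
    (?P<number>\d+(?:-?[A-Za-z]|\s\d/\d)?)   # 709, 709A, 709-A, 709 1/2
    \s+
    (?:(?P<direction>north|south|east|west|[ns][ew]|[nsew])\.?\s+)?
    (?P<street>.+?)                          # lazy, so the type is left over
    \s+
    (?P<type>street|place|drive|avenue|ave|av|st|pl|dr)\.?
    \s*$""",
    re.IGNORECASE | re.VERBOSE,
)


def parse_street_address(text):
    """Return a dict of the four parts, or None if the text does not fit."""
    m = ADDRESS.match(text)
    return m.groupdict() if m else None


print(parse_street_address("709 S Milton Ave"))
print(parse_street_address("709 S S Milton ave"))
print(parse_street_address("P.O. Box 6000"))
```

With this rule "709 S S Milton ave" comes out as number 709, direction S, street "S Milton", which is the opposite reading from the one in the original post. The regex cannot know which one is right; it only applies the rule consistently. "P.O. Box 6000" simply returns None instead of producing garbage.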
 
Kevin Spencer

The first thing you've got to do is figure out all of the possible
combinations of tokens that may make up an "address." You appear to have
noticed only one or two. In fact, an address can take many forms and
include many kinds of abbreviations. In addition, the elements (tokens) of
an address can appear in any number of orders, particularly if the
addresses come from different countries, and especially if they were
provided by human beings rather than machines.

IOW, you've opened up a huge can of worms for yourself. What you need is not
just a regular expression, but a bit of AI to solve this problem. I have
seen it done, but I'm not sure *how* it's done. MapPoint and Google Maps can
do it fairly well, but Microsoft and Google have a lot of money to throw at
this sort of problem.
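
As a rough illustration of thinking in tokens rather than in one big pattern, here is a minimal Python sketch that splits an address on whitespace and lists every role each token could play. The role names, the two small lookup tables and the function names are assumptions made up for this example; a real parser would need far larger tables and a way to choose between the candidates.

```python
# Tiny, deliberately incomplete lookup tables (assumed for illustration).
DIRECTIONS = {"n", "s", "e", "w", "ne", "nw", "se", "sw",
              "north", "south", "east", "west"}
STREET_TYPES = {"street", "st", "place", "pl", "drive", "dr",
                "avenue", "ave", "av"}


def candidate_roles(token):
    """Return the set of roles a single token could plausibly have."""
    t = token.lower().rstrip(".")
    roles = set()
    if t.isdigit():
        roles.add("number")
    if len(t) == 1 and t.isalpha():
        roles.add("number_suffix")      # as in "709 S" or "709 A"
    if t in DIRECTIONS:
        roles.add("direction")
    if t in STREET_TYPES:
        roles.add("type")
    if any(c.isalpha() for c in t):
        roles.add("name")               # any word may be part of a street name
    return roles


def tag(address):
    """Pair every whitespace-separated token with its candidate roles."""
    return [(tok, candidate_roles(tok)) for tok in address.split()]


for tok, roles in tag("709 S Milton Ave"):
    print(tok, sorted(roles))
```

The token "S" comes back with three candidate roles, which is the ambiguity in this thread stated as data. Choosing among the candidates is the hard part, and that is where a street database or something smarter than a regex comes in.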

--
HTH,

Kevin Spencer
Microsoft MVP
Professional Chicken Salad Alchemist

A lifetime is made up of
Lots of short moments.
 
mikewolfbaltimore

Thanks guys... a couple of responses:

1) 709 S | Milton Ave is not as valid as 709 | S | Milton ave because
they want the direction separate. 709 S is not the street number, 709
is; and S Milton is not the street, Milton is.

2) Kevin, yeah, that's what I was suspecting but didn't want to think about.
The alternative for the client is to have 4 separate fields on the UI,
[number] [direction] [street] [type], but I hate this because it's
not intuitive, or a web standard.

Thanks for your input, guys.

mike

Kevin Spencer

Keep in mind that addresses don't always follow that (or any similar)
format. Here are a few examples:

John Smith
Smith Enterprises
P.O. Box 12345
Anytown, Nebraska
00000

Jack and Jill Hill
RR 5 Box 909
Podunk, WI 12345-7890

MR S HOLMES
2978 W MAIN ST # 12
MINNEAPOLIS MN 23976-4542

May December
Bowers Holiday Village
Bldg 91 Apt. 2-A
12 31st Street
Baltimore, Maryland
79797
USA

Herrn
Günther Meyer
Goethestraße 25
20002 HAMBURG
Federal Republic of Germany

SGT NICK FURY
HEADQUARTERS COMPANY
7TH ARMY TRAINING CENTER
ATTN: AETT-AG
UNIT 28130
APO AE 09114

CUSTOMS ATTACHE
AMERICAN EMBASSY CARACAS
UNIT 4964
APO AA 34037

MS HELEN SAUNDERS
1010 CLEAR STREET
OTTAWA ON K1A 0B1
CANADA

MS JOYCE BROWNING
2045 ROYAL ROAD
06570 ST PAUL
FRANCE

MS JOYCE BROWNING
2045 ROYAL ROAD
LONDON WIP 6HQ
ENGLAND

RUFUS LANGDON
LAW DEPARTMENT
US POSTAL SERVICE
475 L'ENFANT PLZ SW RM 6627
WASHINGTON DC 20260-1120

I have found a few references for you. However, again, this is a huge task.
There is commercial software out there that you can buy to do this sort of
parsing. Just Google for it. Here are some links to references:

http://www.columbia.edu/kermit/postal.html
http://pe.usps.com/text/pub28/welcome.htm
http://www.grcdi.nl/whitepapers.htm
http://aurora.regenstrief.org/v3dt/PAS.html
http://www.cicc.or.jp/english/hyoujyunka/databook/219.htm

Good luck!

