Regular expression to match person's name

Johnny Williams

I'm struggling to create a regular expression for use with VB.NET which
matches a person's name in a string of words.

For example in "physicist Albert Einstein was born in Germany and"
I want to match "Albert Einstein"

In "scientist Sir Isaac Newton wrote the Philosophiae Naturalis Principia
Mathematica"
I want to match "Sir Isaac Newton"

In all cases the names are capitalised and the first word in the string
starts with a lower case character and the first word after the name starts
with a lower case character.

A regex which matches from the first uppercase character to the first
lowercase character preceded by a space would work, but all my attempts have
so far failed!

Thanks for any help.
 
Fabio

(?<name>[A-Z][a-z]+)

The real problem is: how do you distinguish Albert from Germany as a valid
name?
 
Guest

Johnny Williams said:
A regex which matches from the first uppercase character to the first
lowercase character preceded by a space would work, but all my
attempts have so far failed!

What about New York, New Hampshire, New Orleans, Abu Dhabi, Big Island,
etc? Is that a person's name? :)

Name matching is quite hard to do ... it might be easier to preload a list
of known names to match ... or use some sort of full-text search engine.
 
Johnny Williams

Fabio said:
(?<name>[A-Z][a-z]+)

The real problem is: how do you distinguish Albert from Germany as a valid
name?

Thanks for your reply Fabio. Your regex is the standard one for matching
capitalised words, and as such will match Albert, Einstein and Germany as
separate words, and as you say there is no way of determining which of these
is a valid name. I would like to avoid this problem by extracting the whole
name in a single match.
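
The behaviour described above can be sketched as follows. This is only an illustration, assuming a console application; the module name and the CapitalisedWords helper are hypothetical and not part of the thread.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module SeparateWordsDemo

    ' Hypothetical helper: collects every match of the single-word pattern.
    Function CapitalisedWords(input As String) As List(Of String)
        Dim words As New List(Of String)
        For Each m As Match In Regex.Matches(input, "(?<name>[A-Z][a-z]+)")
            words.Add(m.Groups("name").Value)
        Next
        Return words
    End Function

    Sub Main()
        Dim input As String = "physicist Albert Einstein was born in Germany and"
        ' Prints: Albert, Einstein, Germany
        Console.WriteLine(String.Join(", ", CapitalisedWords(input).ToArray()))
    End Sub

End Module
```

Each capitalised word comes back as its own match, so the pattern alone cannot say that "Albert" and "Einstein" belong together.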

The regex in the following VB code nearly does the job:

===== Start Module1.vb ======

' Visual Basic Console Application

Imports System.Text.RegularExpressions

Module Module1

    Sub Main()
        ' Group 1: lazily from the first uppercase letter.
        ' Group 2: the space and lowercase letter that end the name.
        Dim re As Regex = New Regex("([A-Z].*?)([ ][a-z])", RegexOptions.Compiled)
        Dim reMatch As Match
        Dim input, name As String

        input = "physicist Albert Einstein was born in Germany and"

        reMatch = re.Match(input)

        If reMatch.Success Then
            name = reMatch.Groups(1).Value
            Debug.WriteLine("|" + name + "|")
        End If
    End Sub

End Module

======= End Module1.vb ========

The above regex correctly matches "Albert Einstein" in the input string as a
single match, and also "Sir Isaac Newton" in my other test string.

However, it's not quite correct because I also want it to match when there
is nothing following the name, e.g. in "physicist Albert Einstein".

Any ideas? Thanks.
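
One possible way to cover the end-of-string case is to replace the trailing group with a lookahead that accepts either a space followed by a lowercase letter or the end of the input. The sketch below is a suggestion rather than anything posted in the thread; the module name and the ExtractName helper are hypothetical.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module LookaheadDemo

    ' Hypothetical helper: returns the name, or Nothing when the input
    ' contains no uppercase letter A-Z.
    Function ExtractName(input As String) As String
        ' [A-Z].*?         start at the first uppercase letter, expand lazily
        ' (?=[ ][a-z]|$)   stop before a space + lowercase letter, or at the end
        Dim m As Match = Regex.Match(input, "[A-Z].*?(?=[ ][a-z]|$)")
        Return If(m.Success, m.Value, Nothing)
    End Function

    Sub Main()
        ' Both lines print |Albert Einstein|
        Console.WriteLine("|" & ExtractName("physicist Albert Einstein was born in Germany and") & "|")
        Console.WriteLine("|" & ExtractName("physicist Albert Einstein") & "|")
    End Sub

End Module
```

Because the lookahead consumes nothing, the whole match is the name and no capture group is needed.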
 
Johnny Williams

Spam Catcher said:
What about New York, New Hampshire, New Orleans, Abu Dhabi, Big Island,
etc? Is that a person's name? :)

Name matching is quite hard to do ... it might be easier to preload a list
of known names to match ... or use some sort of full-text search engine.

Yes, in my case they are 'names'. I'm not trying to determine whether
they are real names or known names.

For my purposes, a 'name' within a string is a sequence of one or more
capitalised words. Put simply, all the characters from the first uppercase
letter in the string to the first lowercase letter preceded by a space is a
name.
 
Fabio

"Johnny Williams" <[email protected]> wrote in message:
The regex in the following VB code nearly does the job:
Dim re As Regex = New Regex("([A-Z].*?)([ ][a-z])",

I don't think this would work.
It also matches "A&&2373%% xyz", which is certainly not a valid name.

However, it's not quite correct because I also want it to match when there
is nothing following the name, e.g. in "physicist Albert Einstein".

Any ideas? Thanks.

Mine does this, and yours does too, but yours fails if there is something
after the name.
I don't understand why the one I suggested doesn't work for you.
 
Johnny Williams

Fabio said:
"Johnny Williams" <[email protected]> wrote in message:
The regex in the following VB code nearly does the job:
Dim re As Regex = New Regex("([A-Z].*?)([ ][a-z])",

I don't think this would work.
It also matches "A&&2373%% xyz", which is certainly not a valid name.

There won't be funny characters like that in the names, so that isn't an
issue.

Fabio said:
Mine does this, and yours does too, but yours fails if there is something
after the name.
I don't understand why the one I suggested doesn't work for you.

This is your regex:

(?<name>[A-Z][a-z]+)

As I said, that's the standard regex to match all capitalised words in a
string, and match them as separate strings. I'd like a regex which matches
the names as a single string.

To restate the problem with examples:

1. "xxx xxxxx Firstname Lastname xxx" must match "Firstname Lastname" as a
single string.
2. "xx xxx Firstname Middlename Lastname xx xxx" must match "Firstname
Middlename Lastname" as a single string.
3. "xxxx Firstname Lastname" (nothing after Lastname) must match "Firstname
Lastname" as a single string.

That is the scope of the problem; nothing more, nothing less.

Thanks for your help!
 
Jay B. Harlow [MVP - Outlook]

Johnny,
You should be able to take the expression to find one Word, and modify it to
find a Word followed by one or more Words separated (preceded really) by
whitespace...

Something like:

Dim pattern As String = "(\b\p{Lu}\p{Ll}+)(\s+\p{Lu}\p{Ll}+)*\b"
Static parser As New Regex(pattern, RegexOptions.Compiled)

Dim inputs() As String = { _
    "physicist Albert Einstein was born in Germany and", _
    "scientist Sir Isaac Newton wrote the Philosophiae Naturalis Principia Mathematica"}

For Each input As String In inputs
    For Each match As Match In parser.Matches(input)
        Debug.WriteLine(match.Value)
    Next
Next

Produces:
Albert Einstein
Germany
Sir Isaac Newton
Philosophiae Naturalis Principia Mathematica

FWIW: \p{Lu} matches any uppercase letter, not just A-Z, while \p{Ll}
matches any lowercase letter, not just a-z; for example, accented and
umlauted letters or letters in other alphabets...

http://msdn2.microsoft.com/en-us/library/system.globalization.unicodecategory.aspx

The \b ensures the phrases start on & end on a "word boundary" (Albert will
match, but Bert in alBert will not).
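
If only the first run of capitalised words is wanted (the name, without later hits such as "Germany"), one option is to call Match instead of Matches with the same pattern. The sketch below assumes that reading of the requirement; the module name and the FirstCapitalisedRun helper are hypothetical.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module FirstRunDemo

    ' Same pattern as above, compiled once at module level.
    Private ReadOnly Parser As New Regex("(\b\p{Lu}\p{Ll}+)(\s+\p{Lu}\p{Ll}+)*\b", RegexOptions.Compiled)

    ' Hypothetical helper: first run of capitalised words, or Nothing if none.
    Function FirstCapitalisedRun(input As String) As String
        Dim m As Match = Parser.Match(input)
        Return If(m.Success, m.Value, Nothing)
    End Function

    Sub Main()
        Console.WriteLine(FirstCapitalisedRun("physicist Albert Einstein was born in Germany and")) ' Albert Einstein
        Console.WriteLine(FirstCapitalisedRun("scientist Sir Isaac Newton wrote the Philosophiae Naturalis Principia Mathematica")) ' Sir Isaac Newton
        Console.WriteLine(FirstCapitalisedRun("physicist Albert Einstein")) ' Albert Einstein
    End Sub

End Module
```

Note that \p{Ll}+ requires at least one lowercase letter after the capital, so a bare initial such as "T" would not count as a word under this pattern.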

--
Hope this helps
Jay B. Harlow [MVP - Outlook]
.NET Application Architect, Enthusiast, & Evangelist
T.S. Bradley - http://www.tsbradley.net


 
Johnny Williams

Brilliant! That works nicely.

Thanks Jay.
 
Lars Graeve

Hi,

try this:

Dim pattern As String = "^(\S+\s)(([A-Z]+\S*\s)+)((\S*\s*)*)$"

Label1.Text = Regex.Replace(TextBox1.Text, pattern, "$2")
 
C-Services Holland b.v.

Have you considered names like Ferdinand von Zeppelin, or mine, Rinze van
Huizen? The complete name *includes* the von or van part, yet saying a
name always consists of sequential capitalised words is wrong in this case.
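
If such names did need to be supported, one possible adjustment is to allow a lowercase particle between capitalised words. The sketch below is only an illustration: the particle list (von, van) is an assumption taken from the two examples above and is nowhere near complete, and the module name and ExtractName helper are hypothetical.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module ParticleDemo

    ' A capitalised word, then any number of further capitalised words,
    ' each optionally preceded by a lowercase particle.
    ' The particle list (von|van) is an assumption for illustration only.
    Private Const Pattern As String = "\b\p{Lu}\p{Ll}+(?:\s+(?:(?:von|van)\s+)?\p{Lu}\p{Ll}+)*\b"

    ' Hypothetical helper: first name-like run, or Nothing if there is none.
    Function ExtractName(input As String) As String
        Dim m As Match = Regex.Match(input, Pattern)
        Return If(m.Success, m.Value, Nothing)
    End Function

    Sub Main()
        Console.WriteLine(ExtractName("engineer Ferdinand von Zeppelin built airships")) ' Ferdinand von Zeppelin
        Console.WriteLine(ExtractName("poster Rinze van Huizen"))                        ' Rinze van Huizen
        Console.WriteLine(ExtractName("physicist Albert Einstein was born"))             ' Albert Einstein
    End Sub

End Module
```

A particle is only accepted when a capitalised word follows it, so an ordinary lowercase word after the name still ends the match.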
 
Johnny Williams

C-Services Holland b.v. said:
Have you considered names like Ferdinand von Zeppelin, or mine, Rinze van
Huizen? The complete name *includes* the von or van part, yet saying a
name always consists of sequential capitalised words is wrong in this case.

Hi Rinze, no I hadn't considered names like yours. In my case the full name
always consists of 2 or 3 capitalised names so this issue won't arise.

Thanks for your contribution.
 
