Compiling Email Addresses from Text

R

Ralph

Hi All,

I frequently need to extract email addresses from huge amounts of text, like
40+ pages long, etc... for large web design clients of mine...

I was wondering if a macro could be developed somehow to leave me with a
stack of email addresses. I suppose the macro would have to test each piece
of text for an "@" and a ".com" and then stack only those terms in a column
somewhere. Any ideas on how to do this? I am not macro savvy AT ALL...

thanks!
 
G

Gord Dibben

Can you copy and paste a small sample of the data you currently have with email
addresses contained therein?

6-10 lines should suffice but do not attach a file.


Gord Dibben MS Excel MVP
 
R

Ron Rosenfeld

Hi All,

I frequently need to extract email addresses from huge amounts of text, like
40+ pages long, etc... for large web design clients of mine...

I was wondering if a macro could be developed somehow to leave me with a
stack of email addresses. I suppose the macro would have to test each piece
of text for an "@" and a ".com" and then stack only those terms in a column
somewhere. Any ideas on how to do this? I am not macro savvy AT ALL...

thanks!

How is the text organized?

Does your source data present itself in an Excel workbook?

It can be difficult to decide on whether a string is a valid email address,
without actually using it to see if it "goes through". But you can certainly
use a "regular expression" that would detect most of them. It would certainly
make it easier if they all ended in ".com", but there are many top level
domains that are valid.

Here's an example of a routine that steps through a range of cells, and
extracts everything that looks like a valid email address with specific
top-level domains.

In this case it "prints" the results to the immediate window, but obviously
both the source and destinations can be modified fairly readily, once we have
more knowledge of what we are dealing with.

For example, the range to search could be changed to a file; and the
destination to sequential cells in a column.

====================================
Option Explicit
Sub ExtractEmails()
Dim c As Range
Dim str As String
Dim re As Object, mc As Object, m As Object
Set re = CreateObject("vbscript.regexp")
re.IgnoreCase = True
re.Global = True
'This regex probably complies with RFC 2822 and matches all
'country code top level domains, and specific common top level domains.
re.Pattern = "[a-z0-9!#$%&'*+/=?^_`{|}~-]+" & _
"(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@" & _
"(?:[a-z0-9](?:[a-z0-9\-]*[a-z0-9])?\.)+" & _
"(?:[A-Z]{2}|com|org|net|gov|mil|biz|info|name" & _
"|aero|biz|info|mobi|jobs|museum)\b"
For Each c In Range("A1:A100")
str = c.Value
Set mc = re.Execute(str)
For Each m In mc
Debug.Print m.Value
Next m
Next c
End Sub
=======================================
--ron
 
R

Ralph

i cant to protect my clients privacy but i'll make up an example -

blah blah blah blah blah blah blah blah blah blah blah
blah blah blah blah blah blah blah blah (e-mail address removed) blah
blah blah blah blah blah blah blah blah blah blah blah
blah blah blah blah blah blah blah blah blah blah blah
blah blah blah blah blah (e-mail address removed) blah blah blah
blah blah blah blah blah blah blah blah blah blah blah
blah blah blah blah blah blah blah blah blah blah blah
blah blah blah blah (e-mail address removed) blah blah blah blah blah
blah blah blah blah blah blah blah blah blah blah blah
blah blah blah blah blah blah blah (e-mail address removed) blah blah
blah
 
R

Ralph

im not sure how to answer that or even how to use what you just posted, i
probabyl need an excel remedial group lol ... the text is unorganized...

Ron Rosenfeld said:
Hi All,

I frequently need to extract email addresses from huge amounts of text, like
40+ pages long, etc... for large web design clients of mine...

I was wondering if a macro could be developed somehow to leave me with a
stack of email addresses. I suppose the macro would have to test each piece
of text for an "@" and a ".com" and then stack only those terms in a column
somewhere. Any ideas on how to do this? I am not macro savvy AT ALL...

thanks!

How is the text organized?

Does your source data present itself in an Excel workbook?

It can be difficult to decide on whether a string is a valid email address,
without actually using it to see if it "goes through". But you can certainly
use a "regular expression" that would detect most of them. It would certainly
make it easier if they all ended in ".com", but there are many top level
domains that are valid.

Here's an example of a routine that steps through a range of cells, and
extracts everything that looks like a valid email address with specific
top-level domains.

In this case it "prints" the results to the immediate window, but obviously
both the source and destinations can be modified fairly readily, once we have
more knowledge of what we are dealing with.

For example, the range to search could be changed to a file; and the
destination to sequential cells in a column.

====================================
Option Explicit
Sub ExtractEmails()
Dim c As Range
Dim str As String
Dim re As Object, mc As Object, m As Object
Set re = CreateObject("vbscript.regexp")
re.IgnoreCase = True
re.Global = True
'This regex probably complies with RFC 2822 and matches all
'country code top level domains, and specific common top level domains.
re.Pattern = "[a-z0-9!#$%&'*+/=?^_`{|}~-]+" & _
"(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@" & _
"(?:[a-z0-9](?:[a-z0-9\-]*[a-z0-9])?\.)+" & _
"(?:[A-Z]{2}|com|org|net|gov|mil|biz|info|name" & _
"|aero|biz|info|mobi|jobs|museum)\b"
For Each c In Range("A1:A100")
str = c.Value
Set mc = re.Execute(str)
For Each m In mc
Debug.Print m.Value
Next m
Next c
End Sub
=======================================
--ron
 
R

Ron Rosenfeld

im not sure how to answer that or even how to use what you just posted, i
probabyl need an excel remedial group lol ... the text is unorganized...

It may be "unorganized", which is fine, but in what sort of format do you
receive it?

Excel file? If so, are you dealing with multiple worksheets, single worksheet?
Is everything in a single cell?

Word file?

Text file?

Small pieces of paper? <g>


--ron
 
I

iliace

If each blah is in a cell, then you can use this. It will insert a
new worksheet and put there all cell values from the active sheet that
contain both a "@" and a "." in that order.

Public Sub findEmail()
Dim rng As Excel.Range
Dim src As Excel.Worksheet
Dim wsh As Excel.Worksheet
Dim i As Long

Set src = Application.ActiveSheet
Set wsh = Application.ActiveWorkbook.Worksheets.Add

For Each rng In src.UsedRange
If (InStr(1, rng.Value, ".") - InStr(1, rng.Value, "@") > 1 And _
InStr(1, rng.Value, "@") > 0) Then
i = i + 1
wsh.Cells(i, 1).Value = rng.Value
End If
Next rng

If i = 0 Then
Application.DisplayAlerts = False
wsh.Delete
Application.DisplayAlerts = True
End If
End Sub


If it's all in one cell on a line, then the approach needs to
different. You could potentially use Text to Columns, then use the
any of the macros supplied.
 
R

Ron Rosenfeld

If it's all in one cell on a line, then the approach needs to
different. You could potentially use Text to Columns, then use the
any of the macros supplied.

The approach also has to be different if there is anything in the cell besides
the email address; and you can also return phrases that contain @ and "." but
are not email addresses.

For example, the above paragraph in a cell would be extracted in its entirety!

Something like the code below, should extract only email addresses, even if
there are multiple addresses in each cell, and write them sequentially in a
column someplace (rDest) which could be a separate sheet.

But how this should be done really depends on information which the OP has not
yet provided.

===========================================
Option Explicit
Sub ExtractEmails()
Dim c As Range
Dim rDest As Range
Dim str As String
Dim i As Long
Dim re As Object, mc As Object, m As Object

i = 1
Set rDest = [m1]
rDest.EntireColumn.ClearContents
Set re = CreateObject("vbscript.regexp")
re.IgnoreCase = True
re.Global = True
'This regex probably complies with RFC 2822 and matches all
'country code top level domains, and specific common top level domains.
re.Pattern = "[a-z0-9!#$%&'*+/=?^_`{|}~-]+" & _
"(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@" & _
"(?:[a-z0-9](?:[a-z0-9\-]*[a-z0-9])?\.)+" & _
"(?:[A-Z]{2}|com|org|net|gov|mil|biz|info|name" & _
"|aero|biz|info|mobi|jobs|museum)\b"
For Each c In Selection
str = c.Value
Set mc = re.Execute(str)
For Each m In mc
rDest(i, 1).Value = m.Value
i = i + 1
Next m
Next c
End Sub
=====================================
--ron
 
R

Ron Rosenfeld

If (InStr(1, rng.Value, ".") - InStr(1, rng.Value, "@") > 1 And _
InStr(1, rng.Value, "@") > 0) Then

By the way, although it doesn't correct any of the other issues I raised, this
part of your code can be simplified to:

If rng.Value Like "*@*.*" Then


--ron
 
R

Ralph

Ron and everyone, I'd love to try your macros but I don't know how to enter
them into excel, my apologies, can anyone take the time to explain to me how
to take this code and put it into a macro?

also Ron, the a1:e100 is a bit limited for my purposes would it be possible
to expand the range from like say a1 to L1000?
 
R

Ron Rosenfeld

Ron and everyone, I'd love to try your macros but I don't know how to enter
them into excel, my apologies, can anyone take the time to explain to me how
to take this code and put it into a macro?

also Ron, the a1:e100 is a bit limited for my purposes would it be possible
to expand the range from like say a1 to L1000?


To enter a macro:

<alt-F11> opens the VB Editor.
Ensure your project is highlighted in the Project Explorer window, then, from
the top menu bar, select Insert/Module and paste the code below into the window
that opens.

See the notes below about changing your source and destination.

===========================================
Option Explicit
Sub ExtractEmails()
Dim c As Range
Dim rDest As Range
Dim str As String
Dim i As Long
Dim re As Object, mc As Object, m As Object

i = 1
Set rDest = [m1]
rDest.EntireColumn.ClearContents
Set re = CreateObject("vbscript.regexp")
re.IgnoreCase = True
re.Global = True
'This regex probably complies with RFC 2822 and matches all
'country code top level domains, and specific common top level domains.
re.Pattern = "[a-z0-9!#$%&'*+/=?^_`{|}~-]+" & _
"(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@" & _
"(?:[a-z0-9](?:[a-z0-9\-]*[a-z0-9])?\.)+" & _
"(?:[A-Z]{2}|com|org|net|gov|mil|biz|info|name" & _
"|aero|biz|info|mobi|jobs|museum)\b"
For Each c In Selection
str = c.Value
Set mc = re.Execute(str)
For Each m In mc
rDest(i, 1).Value = m.Value
i = i + 1
Next m
Next c
End Sub
=====================================

The code above as been written to work on "Selection" instead of a hard coded
range. So just select the range you wish to process.

Change rDest to reflect where you want the results. The routine will clear the
entire column first, so either change this or don't have anything else in that
column. rDest can also be on a new or different sheet.

After making the appropriate changes, <alt-F8> opens the Macro Dialog Box.
Select the Macro and <RUN>.
--ron
 
R

Ralph

cowabunga!!!!!! THANKS RON, THIS WORKED GREAT!!!!

Ron Rosenfeld said:
Ron and everyone, I'd love to try your macros but I don't know how to enter
them into excel, my apologies, can anyone take the time to explain to me how
to take this code and put it into a macro?

also Ron, the a1:e100 is a bit limited for my purposes would it be possible
to expand the range from like say a1 to L1000?


To enter a macro:

<alt-F11> opens the VB Editor.
Ensure your project is highlighted in the Project Explorer window, then, from
the top menu bar, select Insert/Module and paste the code below into the window
that opens.

See the notes below about changing your source and destination.

===========================================
Option Explicit
Sub ExtractEmails()
Dim c As Range
Dim rDest As Range
Dim str As String
Dim i As Long
Dim re As Object, mc As Object, m As Object

i = 1
Set rDest = [m1]
rDest.EntireColumn.ClearContents
Set re = CreateObject("vbscript.regexp")
re.IgnoreCase = True
re.Global = True
'This regex probably complies with RFC 2822 and matches all
'country code top level domains, and specific common top level domains.
re.Pattern = "[a-z0-9!#$%&'*+/=?^_`{|}~-]+" & _
"(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@" & _
"(?:[a-z0-9](?:[a-z0-9\-]*[a-z0-9])?\.)+" & _
"(?:[A-Z]{2}|com|org|net|gov|mil|biz|info|name" & _
"|aero|biz|info|mobi|jobs|museum)\b"
For Each c In Selection
str = c.Value
Set mc = re.Execute(str)
For Each m In mc
rDest(i, 1).Value = m.Value
i = i + 1
Next m
Next c
End Sub
=====================================

The code above as been written to work on "Selection" instead of a hard coded
range. So just select the range you wish to process.

Change rDest to reflect where you want the results. The routine will clear the
entire column first, so either change this or don't have anything else in that
column. rDest can also be on a new or different sheet.

After making the appropriate changes, <alt-F8> opens the Macro Dialog Box.
Select the Macro and <RUN>.
--ron
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Top