Need to create email address from URL

Cooper Marcus · Jun 18, 2004

Here is what I have:
http://www.goldenhour.co.il/pearl.html
http://web.mit.edu/atic/www/po.html
http://thoughtstream.org/
http://www.classactionpl.com/download.jsp
http://students.med.wisc.edu/pdainfo.htm

Here is what I need:
(e-mail address removed)
(e-mail address removed)
(e-mail address removed)
(e-mail address removed)
(e-mail address removed)

Any ideas on what UDF or functions to use to accomplish this transformation?

Norman Harker · Jun 18, 2004

Hi Cooper!!

Rather than a very long formula add a number of helper columns adjacent to
these entries (assuming first is in A1)

First formula: [In B1]
=RIGHT(A1,LEN(A1)-7)
Copy down
Returns:
www.goldenhour.co.il/pearl.html
web.mit.edu/atic/www/po.html
thoughtstream.org/
www.classactionpl.com/download.jsp
students.med.wisc.edu/pdainfo.htm

Second Formula: [in C1]
=IF(OR(LEFT(B1,3)="www",LEFT(B1,3)="web"),RIGHT(B1,LEN(B1)-4),B1)
Copy down
Returns:
goldenhour.co.il/pearl.html
mit.edu/atic/www/po.html
thoughtstream.org/
classactionpl.com/download.jsp
students.med.wisc.edu/pdainfo.htm

Third Formula: [in D1]
=LEFT(C1,FIND("/",C1)-1)
Copy down
Returns:
goldenhour.co.il
mit.edu
thoughtstream.org
classactionpl.com
students.med.wisc.edu

Fourth Formula: [in E1]
=IF(LEN(D1)-LEN(SUBSTITUTE(D1,".",""))=3,MID(SUBSTITUTE(D1,".","*",2),FIND("
*",SUBSTITUTE(D1,".","*",2))+1,255),D1)
Copy down
Returns:
goldenhour.co.il
mit.edu
thoughtstream.org
classactionpl.com
wisc.edu

Fifth Formula: [In F1]
="info@"&E1
Copy down
Returns:
(e-mail address removed)
(e-mail address removed)
(e-mail address removed)
(e-mail address removed)
(e-mail address removed)

Which is what you wanted.

Having checked you have what you need:

Select the last column of formulas and use

Copy > Edit > Paste Special > Values > OK

You can then delete all the intermediate steps.
--
--
Regards
Norman Harker MVP (Excel)
Sydney, Australia
(e-mail address removed)
Excel and Word Function Lists (Classifications, Syntax and Arguments)
available free to good homes.

Cooper Marcus said:
Here is what I have:
http://www.goldenhour.co.il/pearl.html
http://web.mit.edu/atic/www/po.html
http://thoughtstream.org/
http://www.classactionpl.com/download.jsp
http://students.med.wisc.edu/pdainfo.htm

Here is what I need:
(e-mail address removed)
(e-mail address removed)
(e-mail address removed)
(e-mail address removed)
(e-mail address removed)

Any ideas on what UDF or functions to use to accomplish this

transformation?

Daniel.M · Jun 18, 2004

Hi,

What are the rules here?
If you're rules are more precise, advise.

If you're removing from the domain name "www" and "web" here's one solution:

Assuming your entry in A1, the following formula:
="info@"&subst(A1,"http://((www|web)\.)?([^/]*)/.*","$3")

You'll need Harlan's Grove Subst() function.

' Author : Harlan Grove
'Similar to Excel's SUBSTITUTE but using VBScript's perl-like regexps
'Required:
' orig_text - string to search through,
' match_pat - regexp to find,
' replace_pat - replacement pattern
'Optional:
' instance - which matched substring to replace or 0 for all (default)
'-----------------------------------------
Function Subst(orig_text As String, _
match_pat As String, _
replace_pat As String, _
Optional instance As Variant) As Variant
'-----------------------------------------
Dim regex As Object, matches As Object, m As Object

If IsMissing(instance) Then
instance = 0#

ElseIf TypeName(instance) <> "Double" Then
Subst = CVErr(xlErrValue) 'invalid instance type
instance = -1#

ElseIf CDbl(instance) <= 0.5 Then
Subst = CVErr(xlErrNum) 'invalid instance value
instance = -1#

Else
instance = Int(instance + 0.5)

End If

If instance = -1# Then Exit Function 'do nothing quickly

Set regex = CreateObject("vbscript.regexp")
regex.Pattern = match_pat
regex.Global = True

If instance = 0# Then
Subst = regex.Replace(orig_text, replace_pat)

Else
Set matches = regex.Execute(orig_text)

If instance > matches.Count Then
Subst = orig_text 'matchnum out of bounds - do nothing

Else
Set m = matches.Item(instance - 1)
Subst = Left(orig_text, m.FirstIndex) & _
regex.Replace(m.Value, replace_pat) & _
Right(orig_text, Len(orig_text) - m.FirstIndex - m.Length)

End If

End If

End Function

Regards,

Daniel M.

Harlan Grove · Jun 18, 2004

Here is what I have:
http://www.goldenhour.co.il/pearl.html
http://web.mit.edu/atic/www/po.html
http://thoughtstream.org/
http://www.classactionpl.com/download.jsp
http://students.med.wisc.edu/pdainfo.htm

Here is what I need:
(e-mail address removed)
(e-mail address removed)
(e-mail address removed)
(e-mail address removed)
(e-mail address removed)

Any ideas on what UDF or functions to use to accomplish this transformation?

If all urls begin with 'http://', and if the domain for the e-mail address is
always jsut the first token to the left of the high level domain qualifier
(.com, .org, .co, .ac, etc.), and if there's no port garbage dangling at the end
of the url's domain (e.g., http://www.foobar.com:8765/), then no udf is needed,
but two defined names are essential.

Seq referring to =ROW(INDIRECT("1:1024"))
HLDQ referring to ={".com";".org";".edu";".biz";".net";".co";".ac"}

[Note: HLDQ isn't exhaustive - up to you to ensure it's complete.]

Then given a url in A1 complying with the restrictions I've mentioned above, the
e-mail address you seek would be given by the array formula

="info@"&MID(LEFT(SUBSTITUTE(A1,"http://","",1),FIND("/",SUBSTITUTE(A1,
"http://","",1)&"/")-1),MAX(IF((Seq<MIN(FIND(HLDQ,LEFT(SUBSTITUTE(A1,
"http://","",1),FIND("/",SUBSTITUTE(A1,"http://","",1)&"/")-1)&HLDQ)))
*(MID(SUBSTITUTE(A1,"http://","",1),Seq,1)={".","/"}),Seq))+1,1024)

Harlan Grove · Jun 18, 2004

Rather than a very long formula add a number of helper columns adjacent to
these entries (assuming first is in A1)

First formula: [In B1]
=RIGHT(A1,LEN(A1)-7)
Copy down
Returns:
www.goldenhour.co.il/pearl.html
web.mit.edu/atic/www/po.html
thoughtstream.org/
www.classactionpl.com/download.jsp
students.med.wisc.edu/pdainfo.htm

Or just copy the original range, paste in another range, and replace 'http://'
with nothing.

Second Formula: [in C1]
=IF(OR(LEFT(B1,3)="www",LEFT(B1,3)="web"),RIGHT(B1,LEN(B1)-4),B1)
Copy down
Returns:
goldenhour.co.il/pearl.html
mit.edu/atic/www/po.html
thoughtstream.org/
classactionpl.com/download.jsp
students.med.wisc.edu/pdainfo.htm

Too limited in scope. Consider, for example,

http://groups.google.com/groups?selm=...

Haven't you been participating long enough to know that OPs don't provide
exhaustive examples? Haven't you learned that you need to anticipate
complications that could arise from overly simplistic 'solutions'?

Also, your OR expression is suboptimal. Better to have used

OR(LEFT(A1,4)={"www.","web."}))

not only because it involves just one LEFT call but also because OR is subject
to the limit on the number of function arguments. OR(LEFT(.,.)=ArrayRef) can
handle as many entries as an array can hold. Wouldn't that be millions or only
limited by memory?

Third Formula: [in D1]
=LEFT(C1,FIND("/",C1)-1)
Copy down
Returns:
goldenhour.co.il
mit.edu
thoughtstream.org
classactionpl.com
students.med.wisc.edu

Again, only works for the sample provided as-is. Note that the 3rd url could
equally validly have been given as

http://thoughtstream.org

in which case your wonderful formula would return #VALUE!. If you expect to
always find a substring and are only interested in determining where it occurs,
append the substring to string to ensure there's one to be found. So,

=LEFT(C1,FIND("/",C1&"/")-1)

Fourth Formula: [in E1]
=IF(LEN(D1)-LEN(SUBSTITUTE(D1,".",""))=3,MID(SUBSTITUTE(D1,".","*",2),
FIND("*",SUBSTITUTE(D1,".","*",2))+1,255),D1)
Copy down
Returns:
goldenhour.co.il
mit.edu
thoughtstream.org
classactionpl.com
wisc.edu

...

Again, not general. Consider, for example,

http://www.math.technion.ac.il/iic/ela/

which would become ac.il rather than the correct technion.ac.il, or

http://www.math.odu.edu/~bogacki/lat/

which would become math.odu.edu rather than the correct odu.edu, or

http://math.nist.gov/lapack++/

which would become math.nist.gov rather than the nist.gov, or

http://www1.fpl.fs.fed.us/linear_algebra.html

which would become www1.fpl.fs.fed.us rather than the fs.fed.us.

If your expertise isn't in text parsing and you're unclear on how to find domain
names in urls *generally*, perhaps you should reconsider responding to such
posts.

Harlan Grove · Jun 19, 2004

Daniel.M said:
What are the rules here?
If you're rules are more precise, advise.

The rules are specified in a number of internet RFCs.

If you're removing from the domain name "www" and "web" here's one
solution:

Assuming your entry in A1, the following formula:
="info@"&subst(A1,"http://((www|web)\.)?([^/]*)/.*","$3")

....

More general,

=subst(A7,".+://([^.:/]+\.)*([^.:/]+\.(co\.|com|ac\.|edu|net|org|gov|mil|biz
|fed\.)[^:/]*)([:/].*)?",
"info@$2")

Note that this is all one line with nothing between '\.' and '(co|'.
Note also that the high level domain qualifer list isn't exhaustive.

Daniel.M · Jun 19, 2004

Hi Harlan,

The rules are specified in a number of internet RFCs.

I've looked at http://www.cse.ohio-state.edu/cgi-bin/rfc/rfc1738.html but I
don't pretend to comply as the OP's problem concentrate more on the host part of
the URI.

<snip>

More general,

=subst(A7,".+://([^.:/]+\.)*([^.:/]+\.(co\.|com|ac\.|edu|net|org|gov|mil|biz
|fed\.)[^:/]*)([:/].*)?",
"info@$2")

I'd already changed my pattern and it was somewhat close to yours.
We still haven't heard anything about the OP and I suspect we could go on and on
about the proper way to handle this (not aware of any special cases he has).
Still, it's interesting :-)

That being said, I would choose the following as a pattern
^.+://([^.:/]+\.)*?([^.:/]+(\.(us|co|il|qc|ca|com|ac|edu|net|org|gov|
mil|uk|au|mx|info|biz|fed))+)(|[:/?].*)$

It's non-greedy on $1
It needs to be since it allows many consecutives HDLQ, (.com.mx, .qc.ca,
..com.au)

http://www.earthlink.com.au/ ' (e-mail address removed)
http://www.gouv.qc.ca/ '(e-mail address removed)
http://www1.ca.dell.com/content/default.aspx?c=ca&l=en&s=gen ' (e-mail address removed)

As for the last example, one could claim result should be (e-mail address removed) but I'll
put that one aside ;-)

As you stated, we can add/remove some HDLQ in the list. I also think all HLDQ
should be _explicitly_ listed in the OR list (us should be in there).

Regards,

Daniel M.

Harlan Grove · Jun 19, 2004

Daniel.M said:
That being said, I would choose the following as a pattern
^.+://([^.:/]+\.)*?([^.:/]+(\.(us|co|il|qc|ca|com|ac|edu|net|org|gov|
mil|uk|au|mx|info|biz|fed))+)(|[:/?].*)$

It's non-greedy on $1

Nongreedy on $1 is problematic. It needs to be greedy sometimes and
nongreeds other times when viewed just as a string parsing exercise.
However, if you take the semantics of domain names into consideration, $1
should always be greedy, eating up all tokens to the left of the domain
name. Unfortunately, there are difficulties, e.g.,

http://www.ci.stockton.ca.us/

which should transform into (e-mail address removed) , which means cities in
California at least require special handling, so one regexp probably won't
suffice (unless there are some complex look-ahead assertions - a
self-challenge). The point here is that $1 still needs to be greedy, but the
domain fits the pattern ci\.[^.:/]+\.[a-z]{2}\.us . There may be similar
quirks in other countries.

It needs to be since it allows many consecutives HDLQ, (.com.mx,
.qc.ca,.com.au)

http://www.earthlink.com.au/ ...

My original regexp returns (e-mail address removed) .

http://www.gouv.qc.ca/ ...

This is a similar challenge. It appears that [a-z]{2}.us and [a-z]{2}.ca
present a similar challenge.

http://www1.ca.dell.com/content/default.aspx?c=ca&l=en&s=gen ...

My original regexp returns (e-mail address removed) , but . . .

As for the last example, one could claim result should be
(e-mail address removed) but I'll put that one aside ;-)

....

Depends whether Dell US would relay the message to Dell Canada. Then there's
the picky question whether e-mail addressed to (e-mail address removed) would go through
or bounce.

Back to the governmental subdivisions, looks like the $2 regexp needs to be
((..)|(..)) where one of the alternatives looks for .com, .co, etc. while
the other looks for \.[a-z]{2}\.(ca|us), and at least the California ones
need to include leading ci\. in the domain name.

Daniel.M · Jun 20, 2004

Harlan,

...

That being said, I would choose the following as a pattern
^.+://([^.:/]+\.)*?([^.:/]+(\.(us|co|il|qc|ca|com|ac|edu|net|org|gov|
mil|uk|au|mx|info|biz|fed))+)(|[:/?].*)$

It's non-greedy on $1

Click to expand...

Nongreedy on $1 is problematic. It needs to be greedy sometimes and
nongreeds other times when viewed just as a string parsing exercise.

However, if you take the semantics of domain names into consideration, $1
should always be greedy, eating up all tokens to the left of the domain
name.

I disagree about the 'always' part: It depends on the definition of what you
want to retrieve (see my next remark about your 'stockon' example).

Unfortunately, there are difficulties, e.g.,
http://www.ci.stockton.ca.us/
which should transform into (e-mail address removed) ,

Initially, I saw: (e-mail address removed)

This is an good example of user rules I expected the OP to provide us with.
My rule was : As long as the ending parts of the host are HDLQs (present in the
'OR' list), keep them, and then, keep the previous entry to those.

If you want to include "ci." as well when it's just before the first entry kept,
maybe this pattern:
^.+://([^.:/]+\.)*?((ci.)?[^.:/]+(\.(us|co|il|qc|ca|com|ac|edu|net|org|
gov|mil|uk|au|mx|info|biz|fed))+)(|[:/?].*)$

... said:
My original regexp returns (e-mail address removed) .

My bad :-(
I provided new HDLQs (au) not included in your initial list.
http://www.langley.edu.net/ would have been a better example.

My original regexp returns (e-mail address removed) , but . . .

... <snip>

My 'Dell' humorous remark was just an example of how rules can bring about very
complex situations. For _me_, (e-mail address removed) is OK.

Back to the governmental subdivisions, looks like the $2 regexp needs to be
((..)|(..)) where one of the alternatives looks for .com, .co, etc. while
the other looks for \.[a-z]{2}\.(ca|us), and at least the California ones
need to include leading ci\. in the domain name.

OP? Confirmation?

BTW, the ".qc.ca" I was referring to is much closer to
home (mine) than California. :-)

Norman Harker · Jun 20, 2004

Hi!

Re: Haven't you been participating long enough to know that OPs don't
provide
exhaustive examples? Haven't you learned that you need to anticipate
complications that could arise from overly simplistic 'solutions'?

Direct response from OP to me was:

"Thanks so much for your prompt, clear, and completely accurate
response to my inquiry on transforming URLs into email addresses in
Excel! You are an angel!
Cooper : ) "

Maybe not optimal or anticipatory, but it did the job.

Harlan Grove · Jun 20, 2004

Daniel.M said:
I disagree about the 'always' part: It depends on the definition
of what you want to retrieve (see my next remark about your
'stockon' example).

Initially, I saw: (e-mail address removed)

Except that there's a mailto link on the web page above that includes 'ci.'.
Other cities in California show the same thing - the 'ci.' just before the
city name included in e-mail addresses. However, I'd interpret this as the
'ci.' must be part of the domain name, and everything to the left of 'ci.'
must be discarded, so $1 must be greedy.

This is an good example of user rules I expected the OP to provide
us with.

This has nothing to do with user rules. The only way the OP has any control
to exercise over the urls provided in the original message is if the OP runs
the firm hosting those organizations' web sites. If the OP is a host admin,
s/he already should have tools that can do what s/he wants far more easily
than s/he can do so in Excel.

If the OP doesn't host these urls, s/he has no control over them, and is
looking for rules to extract domain names from host names. That seems to be
a bit of work.

My rule was : As long as the ending parts of the host are HDLQs
(present in the 'OR' list), keep them, and then, keep the previous
entry to those.

And that seems to make sense except for the city of Stockton, CA url above.

If you want to include "ci." as well when it's just before the
first entry kept,
maybe this pattern:
^.+://([^.:/]+\.)*?((ci.)?[^.:/]+(\.(us|co|il|qc|ca|com|ac|edu|net|org|
gov|mil|uk|au|mx|info|biz|fed))+)(|[:/?].*)$

....

Except it may be more complicated than this. For all I know, there may be
some host names out there that have .ci. before the domain name that don't
end in \.[a-z]{2}\.(us|ca). In those cases, it's unlikely the 'ci.' should
be included in the domain name.

I provided new HDLQs (au) not included in your initial list.
http://www.langley.edu.net/ would have been a better example.

....

OK, this argues either for locating the HLDQ rightmost in the list and
including any immediately preceding tokens that match other HLDQs, or doing
so only when '.net' is rightmost.

BTW, the ".qc.ca" I was referring to is much closer to
home (mine) than California.

....

Figured that. The point I was trying to make is that the 'qc' part isn't in
and of itself an HLDQ. It's the ending '.ca' which presumably could be
preceded by '.bc', '.yk', '.ab', etc. Despite the singular nature of Québec
(and I'm impressed that its web page has both English and Spanish
alternatives, thus handling the two most common, er, foreign languages), if
you put qc in the or-list, you need the other provinces and territories, and
you'd also need all the US states plus DC. Gets a bit long.

Harlan Grove · Jun 20, 2004

Norman Harker said:
Maybe not optimal or anticipatory, but it did the job.

....

For now, perhaps.