Daniel.M said:
I disagree about the 'always' part: It depends on the definition
of what you want to retrieve (see my next remark about your
'stockon' example).
Initially, I saw: (e-mail address removed)
Except that there's a mailto link on the web page above that includes 'ci.'.
Other cities in California show the same thing - the 'ci.' just before the
city name included in e-mail addresses. However, I'd interpret this to mean
that 'ci.' must be part of the domain name, and everything to the left of
'ci.' must be discarded, so $1 must be greedy.
This is a good example of the user rules I expected the OP to provide
us with.
This has nothing to do with user rules. The only way the OP has any control
over the URLs provided in the original message is if the OP runs the firm
hosting those organizations' web sites. If the OP is a host admin, s/he
should already have tools that can do what s/he wants far more easily than
s/he can do so in Excel.
If the OP doesn't host these URLs, s/he has no control over them, and is
looking for rules to extract domain names from host names. That seems to be
a bit of work.
My rule was: as long as the ending parts of the host are HDLQs
(present in the 'OR' list), keep them, and then keep the entry
just before them.
And that seems to make sense except for the city of Stockton, CA URL above.
If you want to include "ci." as well when it's just before the
first entry kept, maybe this pattern:
^.+://([^.:/]+\.)*?((ci\.)?[^.:/]+(\.(us|co|il|qc|ca|com|ac|edu|net|org|
gov|mil|uk|au|mx|info|biz|fed))+)(|[:/?].*)$
....
Except it may be more complicated than this. For all I know, there may be
some host names out there that have .ci. before the domain name that don't
end in \.[a-z]{2}\.(us|ca). In those cases, it's unlikely the 'ci.' should
be included in the domain name.
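To make that caveat concrete, here is a rough Python sketch of the pattern quoted above, joined onto one line and with the dot after 'ci' escaped so it matches only a literal dot. The helper name extract_domain and both 'ci.' host names are made up for illustration; only the langley URL comes from this thread.

```python
import re

# The quoted pattern, joined onto one line, with the dot in 'ci\.' escaped.
PATTERN = re.compile(
    r"^.+://([^.:/]+\.)*?"
    r"((ci\.)?[^.:/]+"
    r"(\.(us|co|il|qc|ca|com|ac|edu|net|org|gov|mil|uk|au|mx|info|biz|fed))+)"
    r"(|[:/?].*)$"
)

def extract_domain(url):
    """Return capture group 2 (the presumed domain name), or None if no match."""
    m = PATTERN.match(url)
    return m.group(2) if m else None

# URL taken from this thread.
print(extract_domain("http://www.langley.edu.net/"))    # langley.edu.net
# Hypothetical host names, used only to show how 'ci.' is handled.
print(extract_domain("http://www.ci.stockton.ca.us/"))  # ci.stockton.ca.us
print(extract_domain("http://www.ci.example.com/"))     # ci.example.com
```

The last call shows the problem: the optional 'ci.' is kept for any host, not just the ones ending in a two-letter token followed by .us or .ca, so the pattern alone can't tell the two cases apart.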
I provided new HDLQs (au) not included in your initial list.
http://www.langley.edu.net/ would have been a better example.
....
OK, this argues either for locating the HLDQ rightmost in the list and
including any immediately preceding tokens that match other HLDQs, or doing
so only when '.net' is rightmost.
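Either reading can be sketched without a regex by splitting the host name on dots and walking in from the right. This is only a sketch under my own assumptions: the function name domain_from_host and the only_when_net flag are invented here, and the token set is simply the or-list from the pattern earlier in the thread.

```python
# Tokens taken from the or-list in the pattern quoted earlier in the thread.
HLDQS = {"us", "co", "il", "qc", "ca", "com", "ac", "edu", "net", "org",
         "gov", "mil", "uk", "au", "mx", "info", "biz", "fed"}

def domain_from_host(host, only_when_net=False):
    """Keep the rightmost qualifier(s) plus the one token just before them.

    With only_when_net=True, preceding qualifiers are absorbed only when
    the rightmost token is 'net' (the narrower reading).
    """
    tokens = host.lower().split(".")
    i = len(tokens) - 1
    if tokens[i] not in HLDQS:
        return None  # rightmost token is not a known qualifier
    if not only_when_net or tokens[i] == "net":
        # Absorb immediately preceding tokens that are also qualifiers.
        while i > 0 and tokens[i - 1] in HLDQS:
            i -= 1
    start = max(i - 1, 0)  # keep one more token to the left
    return ".".join(tokens[start:])

print(domain_from_host("www.langley.edu.net"))                      # langley.edu.net
print(domain_from_host("www.langley.edu.net", only_when_net=True))  # langley.edu.net
# Hypothetical host name: this rule drops 'ci.' entirely.
print(domain_from_host("www.ci.stockton.ca.us"))                    # stockton.ca.us
```

Note that neither variant does anything special for 'ci.', so that case would still need its own rule.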
BTW, the ".qc.ca" I was referring to is much closer to
home (mine) than California.
....
Figured that. The point I was trying to make is that the 'qc' part isn't in
and of itself an HLDQ. It's the ending '.ca' which presumably could be
preceded by '.bc', '.yk', '.ab', etc. Despite the singular nature of Québec
(and I'm impressed that its web page has both English and Spanish
alternatives, thus handling the two most common, er, foreign languages), if
you put qc in the or-list, you need the other provinces and territories, and
you'd also need all the US states plus DC. Gets a bit long.