Extracting specific data from inconsistant multiple word documents

M

Matt Bennette

Hi Guys,

I have about a 1000 word documents and need to extract all the email
addresses from them. Ideally the result would place the filename in column A
and the email address in column B. None of the documents are consistant in
format. I have got as far as listing the file names.

Many thanks.
 
R

Ron Rosenfeld

Hi Guys,

I have about a 1000 word documents and need to extract all the email
addresses from them. Ideally the result would place the filename in column A
and the email address in column B. None of the documents are consistant in
format. I have got as far as listing the file names.

Many thanks.

There are ways of recognizing email addresses within text strings. For
example, depending on your data, you could look for the "@" character. Or
perhaps you could look for a string in which the "@" is followed by one or more
characters; then a ".", then a few more characters.

--ron
 
M

Matt Bennette

Hi Ron

I am pretty new to VBA, but was hoping that I could engineer a script that
would produce a spreadsheet with the file name in column A and the Email
address in Column B

This is what I am currently using to get the filenames

Function GetFileList(FileSpec As String) As Variant
' Returns an array of filenames that match FileSpec
' If no matching files are found, it returns False

Dim FileArray() As Variant
Dim FileCount As Integer
Dim FileName As String

On Error GoTo NoFilesFound

FileCount = 0
FileName = Dir(FileSpec)
If FileName = "" Then GoTo NoFilesFound

' Loop until no more matching files are found
Do While FileName <> ""
FileCount = FileCount + 1
ReDim Preserve FileArray(1 To FileCount)
FileArray(FileCount) = FileName
FileName = Dir()
Loop
GetFileList = FileArray
Exit Function

' Error handler
NoFilesFound:
GetFileList = False
End Function


Sub test()
Dim p As String, x As Variant

p = "C:\Documents and Settings\Administrator\Desktop\Chefs\*.*"
x = GetFileList(p)

Select Case IsArray(x)
Case True 'files found
MsgBox UBound(x)
Sheets("Sheet1").Range("A:A").Clear
For i = LBound(x) To UBound(x)
Sheets("Sheet1").Cells(i, 1).Value = x(i)
Next i
Case False 'no files found
MsgBox "No matching files"
End Select
End Sub

Which I confess was donated by one of the fab subscribers here

The strings would be of differing lengths, so how would wild card work is
there a specific syntax i.e *?*@*?*.*?*. And what about VBA referencing Word
Commands to perform this

Sorry I'm a complete novice and would really appreciate some help on this

Many thanks
 
R

Ron Rosenfeld

The strings would be of differing lengths, so how would wild card work is
there a specific syntax i.e *?*@*?*.*?*. And what about VBA referencing Word
Commands to perform this

I don't know about "VBA referencing Word Commands"

The following uses Regular Expressions:

Is there only one email address per file? Or could there be multiple email
addresses in a file.

Here is a routine which, when applied against a text string of unspecified
length, will return the first string that looks like an email address. The
pattern does NOT contain all the rules for validating an email address, but
perhaps this method will get you started. Post back with more questions as
needed.

=====================================
Option Explicit
Public Const str As String = "now is the time for (e-mail address removed)"
'--------------------------------------

Sub ExtrEmail()
Dim re As Object, mc As Object
Set re = CreateObject("vbscript.regexp")
re.IgnoreCase = True
re.Pattern = "\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b"
If re.test(str) = True Then
Set mc = re.Execute(str)
Debug.Print mc(0).Value
End If
End Sub
==================================
--ron
 
M

Matt Bennette

In general, there is only one email address, however I dont understand how I
can apply this code to multiple word documents.

Ideally I need the result to list the file name and the extracted email
address.

Ron Rosenfeld said:
The strings would be of differing lengths, so how would wild card work is
there a specific syntax i.e *?*@*?*.*?*. And what about VBA referencing Word
Commands to perform this

I don't know about "VBA referencing Word Commands"

The following uses Regular Expressions:

Is there only one email address per file? Or could there be multiple email
addresses in a file.

Here is a routine which, when applied against a text string of unspecified
length, will return the first string that looks like an email address. The
pattern does NOT contain all the rules for validating an email address, but
perhaps this method will get you started. Post back with more questions as
needed.

=====================================
Option Explicit
Public Const str As String = "now is the time for (e-mail address removed)"
'--------------------------------------

Sub ExtrEmail()
Dim re As Object, mc As Object
Set re = CreateObject("vbscript.regexp")
re.IgnoreCase = True
re.Pattern = "\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b"
If re.test(str) = True Then
Set mc = re.Execute(str)
Debug.Print mc(0).Value
End If
End Sub
==================================
--ron
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Top