data extraction / cleansing a string column in a workbook

K

Keith

I have a data extraction / cleansing problem.

A computer help desk system has ticket information with textual detail
placed in a field called 'Comments'. This data has been extracted to
Excel spreadsheets for the last few months. I cannot know exactly
where the 'Comments' column may be in each spreadsheet, however I know
that the ending column location for each row is 'BA'.

We need to extract phone numbers and extensions out of this data.
Phone numbers can be 10 digit telephone numbers such as: 123-123-1234
or 1231231234.

The 10 digit telephone numbers can have 4 or 5 digit extensions (or
PBX stations) such as 12345 or 1234.

The telephone combination at times can be put together such as
1231231234/12345.

Phone numbers can also be 7 digit (local) numbers such as 1231234 or
123-1234.

This information is intermixed with text and I need to extract the
phone numbers out of the Comments.
Comments are like this:

"Joe Blow 1231231234/12345 needs to have his extension moved to new
desk. Call site manager Jack Black 1231232345/09876 for access."
* I need to pull out "1231231234", "12345", "1231232345" and "09876"
from the above and put the extracted data at the end of the
spreadsheet row. Given the end of the row is at BA, and extracted
numbers from the above, put "1231231234" in that row's BB column,
"12345" in BC column, "1231232345" in BD column and "09876" in BE
column.

"Add VM 1234, updated dictionary, added 2345 & 2346. Completed by Jay
Smith."

* I need to pull out "1234", "2345" and "2346" and put "1234" in BB
column, "2345" in BC column and "2346" in BD column.

"AS000AD00SK000DD000R0E0R0WQ0E0D0D0 Jane Smith 4441231237. Setup vmail
for ext 1237. Added to call pickup group as 4948."
* I need to pull out "4441231237", "1237" and "2346" and put
"4441231237" in BB column, "1237" in BC column and "2346" in BD
column.

"4441234567 thru 7890 have been added to VNET Thanks!"
* I need to pull out "4441234567" and "7890" and put "4441234567" in
BB column and "7890" in BC column.

I have found a VBA utility that can pull numbers out of a string from
http://www.google.com/groups?hl=en&lr=&ie=UTF-8&[email protected]
(or http://www.mvps.org/dmcritchie/excel/grove_digitsid.htm), but that
only returns one number.

I was thinking about adding a parent function which would get the
column, tokenize the contents of 'Comments' on space, check the length
of the tokenized sting content length > 3 and if it is pass the
tokenized sting content to the above routine. If it returns a good
number do some comparisons to be sure it passes the above business
rules and put the results in the end of that row.

Being a good reuse programmer, I was wondering if anyone has something
that does something kind-of similar to what I need before I start on
my coding journey.

Please reply to the group, so that other people can search on this and
(hopefully) find a good answer.

Thanks,
Keith
 
K

kkknie

Keith,

I was a bit bored with my lunch so I worked this routine to do the dat
extraction.


Code
-------------------
Sub test()

Dim r As Range
Dim s() As String
Dim i As Integer
Dim iCol As Integer

Set r = Range("A3")

Call ExtractIt(r, s())

iCol = 54
If s(0) <> "NONE" Then
For i = 0 To UBound(s)
'Ignore numbers less than 4 characters
If Len(s(i)) > 3 Then
Cells(iCol, r.Row).Value = s(i)
iCol = iCol + 1
End If
Next
End If

End Sub

Sub ExtractIt(rInput As Range, sOutput() As String)

Dim iMin As Long
Dim iMax As Long
Dim iFound As Long
Dim i As Long
Dim strChar As String
Dim strLine As String
Dim r As Range
Dim iLoop As Integer

Set r = rInput
strLine = r.Value
ReDim Preserve sOutput(0)
sOutput(0) = "NONE"

iLoop = 0

Do Until strLine = ""

iMin = 99999

For i = 0 To 9
iFound = InStr(1, strLine, i)
If iFound > 0 And iFound < iMin Then iMin = iFound
Next

If iMin = 99999 Then
Exit Do
End If

For i = iMin To Len(strLine)
strChar = Mid(strLine, i, 1)
If Not IsNumeric(strChar) And strChar <> "-" Then
iMax = i
Exit For
End If
Next

ReDim Preserve sOutput(iLoop)
sOutput(iLoop) = Mid(strLine, iMin, iMax - iMin)

strLine = Right(strLine, Len(strLine) - iMax + 1)

iLoop = iLoop + 1
Loop

End Su
-------------------

The code considers a hyphen as a part of a number, but nothing else.
It could easily be modified to recognize other characters as well.

The initial routine (test) calls the function and handles not usin
returned strings less than 4. As for finding the comments column
cannot answer with the amount of info posted. I tested it on some o
your examples with good results.

You will have to modify it to use the ranges you want. It should a
least be a start.
 
K

Keith

kkknie said:
Keith,

I was a bit bored with my lunch so I worked this routine to do the data
extraction.

<Snip>

kkknie,
Thanks for your bored time! I will try this out tomorrow, got
caught up in one of those fires now (don't you love those things!).

I really appreciate your effort!

Thanks,
Keith
 
K

Keith

Just in case someone needs it (including myself), here is the answer
that I came up with. It may be able to be done in a more efficent
way, but the important part is that it works :)

- K

Sub do_all_sheets()
Dim i As Integer
Dim numSheets As Integer

numSheets = Application.Sheets.Count

For i = 1 To numSheets
On Error Resume Next
If Worksheets(i).Visible = True Then
Worksheets(i).Select
On Error GoTo 0
do_sheet
End If
Next
End Sub
Sub do_sheet()

Dim r As Range
Dim r2 As Range
Dim i As Integer
Dim intLen As Integer
Dim iCol As Integer
Dim intCurCol As Integer
Dim lastRow As Long
Dim numRangeRows As Long
Dim loopRangeCounter As Long
Dim intCurRow As Integer
Dim s() As String
Dim lastCol As Long
Dim strCellValue As String
Dim strMaxCellValue As String
Dim strCommentRangeName As String

'
' see if we have a comments column
'
Rows("1:1").Select
Range("A1").Activate

On Error Resume Next
Selection.Find(What:="comments", After:=ActiveCell,
LookIn:=xlFormulas, _
LookAt:=xlPart, SearchOrder:=xlByRows,
SearchDirection:=xlNext, _
MatchCase:=False, SearchFormat:=False).Activate
On Error GoTo 0

'
' if we find the comments, good. If not, exit.
'
If UCase(Application.ActiveCell.Value) <> "COMMENTS" Then
Exit Sub
End If
'
' looks like we have a comments column, get the column name
'
'strCommentsColName = "AI"
strCommentsColName =
getCellColumnFromAddress(Application.ActiveCell.Address)

'
' go the last column / row on the sheet
'
Cells.Find(What:="*", After:=[A1],
SearchDirection:=xlPrevious).Select
strMaxCellValue = Application.ActiveCell.Address(False, False)
lastRow = getCellRowNumberFromAddress(strMaxCellValue)
'lastCol = getCellColumnFromAddress(strMaxCellValue)
'lastRow = getCellRowNumberFromAddress( _
' Application.ActiveSheet.Cells.SpecialCells(xlCellTypeLastCell).Row)
lastCol = getLastColumnFromSheet(lastRow)
'
' set range name of "Comments" row: 2 to the End (whatever that
is)
'
strCommentRangeName = strCommentsColName & "2:" & _
strCommentsColName & CStr(lastRow)
'Debug.Print "Comments range: " & strCommentRangeName

Set r = Range(strCommentRangeName)
'
' check to see if we actually have any data in the "comment"
column,
' if not we exit.
'
If IsEmpty(r.Cells.Value2) Then
Exit Sub
End If

numRangeRows = r.Count

'
' check the column, we normally put things in last col,, if there
' is anything in the worksheet in a further column, we don't want
to
' overwrite the contents
'
iCol = 55
If lastCol + 1 > iCol Then
iCol = lastCol + 1
End If

For loopRangeCounter = 1 To numRangeRows

ReDim s(0)
'
'
strCellValue = r.FormulaR1C1(loopRangeCounter, 1)
Call ExtractIt(strCellValue, s())
intCurRow = r.Row - 1 + loopRangeCounter
intCurCol = iCol

If s(0) <> "NONE" Then
For i = 0 To UBound(s)
Cells(intCurRow, intCurCol).NumberFormat = "@"
Cells(intCurRow, intCurCol).Font.Name = "Arial"
Cells(intCurRow, intCurCol).Font.Size = 9
Cells(intCurRow, intCurCol).Value = s(i)
intCurCol = intCurCol + 1
Next
End If
Next

End Sub

Sub ExtractIt(strCellValue As String, sOutput() As String)

Dim iMin As Long
Dim iMax As Long
Dim iFound As Long
Dim i As Long
Dim lngStrLen As Long
Dim strChar As String
Dim incomingLine As String
Dim tempStr As String
Dim tempStrLen As String
Dim r As Range
Dim iLoop As Integer
Dim b As Boolean
Dim bBadPrefix As Boolean
Dim bAtEndOfString As Boolean
Dim bLengthTooShort As Boolean
Dim bIsADate As Boolean
Dim bDashCheck As Boolean
Dim bSlashCheck As Boolean
Dim bAddCell As Boolean

ReDim Preserve sOutput(0)
incomingLine = strCellValue
sOutput(0) = "NONE"

iLoop = 0

' loop thru the entire line

Do Until incomingLine = ""

iMin = 99999

' looking for a number bit

For i = 0 To 9
iFound = InStr(1, incomingLine, i)
If iFound > 0 And iFound < iMin Then
iMin = iFound
End If
Next

If iMin = 99999 Then
Exit Do
End If

lngStrLen = Len(incomingLine)
bAtEndOfString = False
iMax = -1

' check number portion of the string
For i = iMin To Len(incomingLine)
strChar = Mid(incomingLine, i, 1)
Select Case strChar
Case 1 To 9, 0, "-", "/"
b = False
Case Else
b = True
iMax = i
Exit For
End Select

'
' check to see if we have a good number at the end of the
line
'
If lngStrLen = i And b = False Then
bAtEndOfString = True
Exit For
End If
Next

If iMax > -1 Or bAtEndOfString = True Then
ReDim Preserve sOutput(iLoop)

If bAtEndOfString = True Then
tempStr = Mid(incomingLine, iMin)
incomingLine = ""
iMax = Len(tempStr)
Else
tempStr = Mid(incomingLine, iMin, iMax - iMin)
incomingLine = Right(incomingLine, Len(incomingLine) -
iMax + 1)
End If

tempStrLen = Len(tempStr)

bAddCell = True
bLengthTooShort = False
bIsADate = False
bDashCheck = False
bHasSlashes = False

'
' check to see if there is more than one slash
'
bHasSlashes = checkMoreThanOneSlash(tempStr)

If bHasSlashes = False Then
'
' length < minimum
'
If tempStrLen < 4 Then
bLengthTooShort = True
Else
'
' check to see if there is a date x/xx or xx/xx
'
If tempStrLen < 6 Then
bIsADate = checkOneSlashDate(tempStr)
End If

If bIsADate = False Then
'
' check to see if there is a dash on a short
string
' "-xxx" or "-xxxx"
'
If tempStrLen < 6 Then
bDashCheck = checkOneDashCheck(tempStr)
End If
End If
End If
End If

If bHasSlashes = True _
Or bBadPrefix = True _
Or bIsADate = True _
Or bDashCheck = True _
Or bLengthTooShort = True Then
bAddCell = False
End If

'
' if all is ok, write it out
'
If bAddCell = True Then
sOutput(iLoop) = tempStr
iLoop = iLoop + 1
End If
Else
b = False
End If
Loop
End Sub

Function checkMoreThanOneSlash(incomingLine As String) As Boolean
'
' check to see if this string is a date-like field 11/11/11 or
11/11/1111
'
Dim varPosition As Variant
Dim firstPos As Integer

intPosition = -1
checkMoreThanOneSlash = False
varPosition = InStr(1, incomingLine, "/", vbTextCompare)

If IsNull(varPosition) = True Then
Else
If varPosition > 0 Then
firstPos = varPosition + 1
varPosition = InStr(firstPos, incomingLine, "/",
vbTextCompare)

If IsNull(varPosition) = True Then
Else
If varPosition > 0 Then
checkMoreThanOneSlash = True
End If
End If
End If
End If

End Function
Function getLastColumnFromSheet(lastRowNumber As Long) As Long
'
'
'
Dim strChar As String
Dim rowNumber As String
Dim maxColumn As Long

rowNumber = ""
maxColumn = -1
Application.ScreenUpdating = False

For i = lastRowNumber + 1 To 2 Step -1
Application.Cells(i + 1, 1).Select
Application.ActiveSheet.Cells.Find(What:="*",
After:=ActiveCell, SearchDirection:=xlPrevious).Select
If Application.ActiveCell.Column > maxColumn Then
maxColumn = Application.ActiveCell.Column
End If
Next
Application.ScreenUpdating = True

getLastColumnFromSheet = maxColumn
End Function
Function getCellRowNumberFromAddress(cellName As String) As Long

Dim strChar As String
Dim rowNumber As String

rowNumber = ""
For i = 1 To Len(cellName)
strChar = Mid(cellName, i, 1)
Select Case strChar
Case 1 To 9, 0
rowNumber = rowNumber & strChar
Case Else
End Select
Next

getCellRowNumberFromAddress = CLng(rowNumber)
End Function
Function getCellColumnFromAddress(cellAddress As String) As String
Dim strChar As String
Dim columnName As String

rowNumber = ""
For i = 1 To Len(cellAddress)
strChar = Mid(cellAddress, i, 1)
Select Case strChar
Case "A", "B", "C", "D", "E", "F", "G", "H", "I", "J", _
"K", "L", "M", "N", "O", "P", "Q", "R", "S", "T", _
"U", "V", "W", "X", "Y", "Z", _
"a", "b", "c", "d", "e", "f", "g", "h", "i", "j", _
"k", "l", "m", "n", "o", "p", "q", "r", "s", "t", _
"u", "v", "w", "x", "y", "z"
columnName = columnName & strChar
Case Else
End Select
Next

getCellColumnFromAddress = columnName

End Function

Function checkOneSlashDate(incomingLine As String) As Boolean
'
' check to see if this string is a date-like field 11/11 or 1/11
'
Dim varPosition As Variant

checkOneSlashDate = False
varPosition = InStr(1, incomingLine, "/", vbTextCompare)

If IsNull(varPosition) = True Then
Else
If varPosition > 0 Then
checkOneSlashDate = True
End If
End If

End Function
Function checkOneDashCheck(incomingLine As String) As Boolean
'
' check to see if this string is a short dash like "xxx-" or "-xxx"
'
Dim varPosition As Variant

checkOneDashCheck = False
varPosition = InStr(1, incomingLine, "-", vbTextCompare)

If IsNull(varPosition) = True Then
Else
If varPosition > 0 Then
checkOneDashCheck = True
End If
End If

End Function
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Top