Developing TEXT scrambler kind of FUNCTIONS in Excel

H

Hari Prasadh

Hi,

To put things in perspective, I analyse Market research data.

Let's say I have some string type data starting from Cell A2 to Last Cell in
column A Also let's say I have some string type data starting from Cell I2
to Last Cell in column I.

The data in a sample cell of column A (let's say cell A2) would be something
like " I use C++ , Visual Basic and Win2K Server at my workplace. At home I
dabble with C++ and Qualcomm". Basically column A would be complete
sentences and out of that sentence I would be interested only in some of the
words. Like if I'm tracking usage of software tools (and if am not
interested in Operating systems) then for me only C++ and VB would be my
point of interest. This is where Column I plays its part.

With full help from NewsGroup (Tim Williams - "Generating count of unique
words in a cell or cells" ) I have been able to get a nice piece of module
which enables me to get a count of unique words ( frequency of a word in
Column A) . After running the module , I scan the results and expunge those
words which are not point of interest in my study. Like based on the above
example - the words "I" , "use" , "Win2K server" and "Qualcomm" etc. would
be removed. I then take the remaining list of unique words and paste them in
column I (starting from row 2 ). Hence, in column I would have a list of
RELEVANT words only.

The part which I explained above, I naively refer to as Text Mining.

After this I developed a macro ( by copying snippets of syntax from variety
of sources and Recording feature). This macro basically compares the CELLS
in Column A to Column I
and display the Matching words in Column B through E. What I mean is cells
in columns B thru G display a list of words which appear in the
corresponding cell of Column A AND also appears within any cell in Column I.

Taking the above example cell B2 would say "C++" and Cell C2 would say "VB"
because Column I would not be having rest of the words which are there in
Cell A2. (cell D2 and E2 would be left blank. Please note if there were no
matches then B thru E will be left blank.)

Presently the problem is the text in column A would be having TYPOS. Like
somebody may say in cell A2 "I use Visula Basic" and another person may say
in cell A3 "I use Visul Basic". Now, I wont be getting any data matches in
Column B because column I would be having "Visual Basic" but not "Visula
Basic" or Visul Basic".

So, I want to develop a TEXT Scrambler function(S) which can :-

a) First function - SCRAMBLE a single letter of the word in Column I . That
is if Column I has "Visual basic" then any 2 adjacent non empty letters are
swapped. That is function should be capable of giving out results like
"Visula Basic" , "Visual Baisc" and similar permutations of adjacent letters
only. I hope that at a time only "one" transformation of adjacent letters
would be sufficient. (first letter might not be permuted as my understanding
is that people dont commit typing errors in their first letter.) I dont want
to swap the "space" between 2 words, that is in a particular transformation
I would just swap any 2 adjacent LETTERS of a particular WORD within the
STRING.

b) Second function - MISS or remove a single letter of the word in column I.
That is if a particular cell in column I has "Visual Basic" then it could
give me permutations like "Viual Basic" , "Visual Baic" etc.

c) Third function - SUBSTITUTE a single letter of the word in column I with
any of the other 25 letters of the English alphabet. That is if column I has
"Visual basic" then it would be able to give me "Vidual Basic" , "Visual
Nasic" and similar permutations.

d) Fourth function - Am being too ambitious but.... Would like to have a
function which can combine the effects of a), b) c) simultaneously though
each of them are individually transformed only once. (Would doing this be
disastrous from computing resources point of view ?)

I want all the above to be FUNCTIONS and not macros . I'm aware about the
difference between 2 only to the extent that in case of a function I can
write a statement like :-
If StringSubsetFromColumnA = ScrambledCellofColumnI(..,...) Then
CellinColumnB = UnscrambledcellofColumnI
End if

I hope I have been able to express my needs correctly. Im posting my present
unscrambled macro in the follow-up post to this as I didnt want to make my
post too big. (Not posting everything in one mail, is that a correct
practice in Newsgroups ?)


Thanks a lot,
Hari
India
 
H

Hari Prasadh

Hi,

This is the code which I have written for automatic mapping of data.

If some sample data (10 rows and 2 columns ) is required, please tell me and
I would be happy to paste that as well in the future post.

Option Explicit
Public basearray() As String
Public arrWords As Variant

Sub readingarrayofuniquewords()

Dim p As Integer
Dim BaseArrLength As Integer

Range("i65536").Select
Selection.End(xlUp).Select
p = Selection.Row - 2

ReDim basearray(p)
For BaseArrLength = 0 To p
basearray(BaseArrLength) = Cells(BaseArrLength + 2, "i")
Next BaseArrLength

End Sub

Sub Upcoding()

Dim R As String
Dim z As Integer
Dim g As Integer
Dim hu As Integer
Dim hg As Integer
Dim msgboxresult As String
Dim tempwithspace As String
Dim tempwithoutspace As String
Dim flag As Integer

Application.ScreenUpdating = False

msgboxresult = MsgBox("Columns B through F will be cleared" & vbLf & " Press
no if you want to exit out of the macro", vbYesNo, " Warning")
If msgboxresult = vbNo Then Exit Sub

Range("B2:G65536").Select
Selection.ClearContents


Range("A65536").Select
Selection.End(xlUp).Select
R = ActiveCell.Row

Call readingarrayofuniquewords

For z = 2 To R

flag = 0
Splitwords ActiveSheet.Range("A" & z).Value
Range("B" & z).Select
'here put the function for combining elements of array


For hu = UBound(arrWords) To LBound(arrWords) Step -1

For hg = hu To UBound(arrWords) Step 1

If ActiveCell.Column > 7 Then
flag = 1
Exit For
End If

tempwithspace = MergingElementsOfArrayWithSpace(arrWords,
hu, hg)
tempwithoutspace =
MergingElementsOfArrayWithoutSpace(arrWords, hu, hg)


For g = LBound(basearray) To UBound(basearray)


If UCase(tempwithspace) = UCase(basearray(g)) Then


ActiveCell.Value = basearray(g)
' u have to put the logic for more than 4 then exit
loop below
ActiveCell.Offset(0, 1).Range("A1").Select
Exit For

ElseIf UCase(tempwithoutspace) = UCase(basearray(g))
Then

ActiveCell.Value = basearray(g)
' u have to put the logic for more than 4 then exit
loop below
ActiveCell.Offset(0, 1).Range("A1").Select
Exit For

End If

Next g

Next hg

If flag = 1 Then
Exit For
End If

Next hu

Next z

Application.ScreenUpdating = True

End Sub

Sub Splitwords(sText As String)

Dim x As Integer
Dim arrReplace As Variant

arrReplace = Array(vbTab, ":", ";", ".", ",", "-", Chr(10), Chr(13))
For x = LBound(arrReplace) To UBound(arrReplace)
sText = Replace(sText, arrReplace(x), " ")
Next x

arrWords = Split(Application.WorksheetFunction.Trim(sText), " ")

End Sub

Function MergingElementsOfArrayWithoutSpace(concatarray As Variant, hi As
Integer, ti As Integer) As Variant
Dim tmp As String
Dim f As Integer

tmp = ""

'see whether the range ti - hi to ti is correct or if it is _
to be increased by 1.

For f = ti - hi To ti
tmp = tmp & concatarray(f)
Next f

MergingElementsOfArrayWithoutSpace = Application.WorksheetFunction.Trim(tmp)

End Function

Function MergingElementsOfArrayWithSpace(concatarray As Variant, hi As
Integer, ti As Integer) As Variant
Dim tmp As String
Dim f As Integer

tmp = ""

'see whether the range ti - hi to ti is correct or if it is _
to be increased by 1.

For f = ti - hi To ti
tmp = tmp & concatarray(f) & " "
Next f

MergingElementsOfArrayWithSpace = Application.WorksheetFunction.Trim(tmp)

End Function
 
T

Tom Ogilvy

It seems to me, if you build your list of interest from the text of the
entries, you will already have a list of misspellings. It would be far
easier to construct a cross walk table from that than what you are asking (I
would think).
 
H

Hari Prasadh

Hi Tom,

2 aspects to it

a) While generating a unique list, I get the frequency count of each word's
appearance also. Please note a single person / response in column A could
mention more than 1 tool. Now suppose in column A I have responses from 3000
people and if I see the frequency then for the software tools question I
might get "Visual basic" having let's say 100 appearances, which is
reasonable number for it to be added to the unique list. Now, since there
will be typos so I might get a count of "Visul basic" being 2 and count of
"Viswl Basic" being 1 and similarly .... lots and lots of such FALSE
instances of "Visual Basic" which have very low counts. Now this happens for
Each software tool. I cannot use these false instances as part of unique
list as it wouldnt serve my purpose. As, once am through with the mapping I
would assign a numeric code to "Visual Basic" and load the data in SPSS
(Stats software) and run some statistics on it. If I upload the false
instances of visual basic also then they would play havoc with my stats.

b) The point made in a) gets compounded because every time its a new market
research study. Like if am tracking software tools today, tomorrow I might
be on to tracking responses to a question like " How would you describe the
Denim area at this Superstore". So every time it will be preparation of a
new and unique list and accordingly the column A also changes. So, I cannot
invest my time in manipulating column I by having a "messy kind of list".
That might offset the whole point of automation.
 
P

Peter T

Hi Hari,

You may well eventually develop some amazing routine that appears to do what
you want. But I find it difficult to imagine how it will ever be foolproof,
leading to a false sense of confidence and false results.

Have you tried working with the Spell checker. Even manually I don't suppose
it would take too long to run through 1,000 rows, particularly once it has
been "trained" to your topic (can be automated to some extent). There is
also AutoCorrect, I notice it does nothing with pasted cells until you
F2/enter. Not sure if it can be automated - haven't tried.

Not the answer you are looking for - just a thought.

Regards,
Peter T
 
T

Tim Williams

Hari,

I'd second the opinions of the other posters: it seems a better approch to
maintain a list of common mis-spellings and use replace() on your original
data.

Tim
 
R

Robin Hammond

Hari,

Here's a thought on something you might have a look at, although I have only
seen it described in SQL's books online rather than trying to use it myself.
Not sure if it will work for you but it might be worth investigating.

There are two functions: SoundEx and Difference, defined in T-SQL that might
help. This is from BOL:
SOUNDEX converts an alpha string to a four-character code to find
similar-sounding words or names. The first character of the code is the
first character of character_expression and the second through fourth
characters of the code are numbers. Vowels in character_expression are
ignored unless they are the first letter of the string. String functions can
be nested.

e.g. in SQL
select soundex('visual basic'), soundex('viswl basic')
returns an identical result

select difference('visual basic','viswl basic')
returns 4, which apparently means they are as similar as possible.

There's a function on John Walkenbach's site written by Richard Yanco that
purports to convert text to a Soundex value that you might be able to test
for use in Excel.

And, strangely enough, if you do a google groups search on Soundex for the
*Excel* groups, there appear to be a few discussions of less complex, but
similar challenges to the one you are trying to crack.

HTH, and let us know if it works,

Robin Hammond
www.enhanceddatasystems.com
 
H

Hari Prasadh

Hi Robin,

Thnx a lot for that wonderful idea. I dont think I would have been able to
found on my own by doing Google searches as I was thinking only in terms of
scrambling "codes" and not in terms of
"phonetics" as given in the NIST site. (Tom - thnx for that link, i could
paste it in to my VB editor and it works nicely).

Im setting up some test data and would try this thing over the weekend and
post back on the results.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Top