Pattern match and grouped items

  • Thread starter Thread starter SauQ
  • Start date Start date
S

SauQ

Dear all

I have in :

--> Column A, a list of items say,


1 YORK A/COND
2 SANYO 1.0HP air-cond
3 CANON laser printer (model : L-P'TER*LBP-3050)
4 EPSON - Batching system printer
5 EPSON - Dot matrix printer
6 Nokia 1200 - handphone
7 HP printer
8 Panasonic K Printer
9 Panasonic X printer
10 Panasonic P-P1121 printer
11 Nokia N6070 handphone
12 Nokia 1200 handphone
13 Microphone
14 Panasonic printer
15 2nd hand hand phone

I am looking for a function / vba / anyway if excel can do a pattern
match
and :

-> highlight (with color) items 1,2
and return in column B the word "cond" (i.e. same word contains in
both the the items)

-> highlight (with different color) items 3,4,5,7,8,9,10,14
and return in column C the word "printer" (i.e. same word contains in
all these items)

-> highlight (with different color) items 6,11,12,13,15
and return in column D the word "phone" (i.e. same word contains in
all these items)

Any suggestion or help on this is greatly appreciated.

Thanks heaps ! Everyone.
SauQ
 
Dear all

I have in :

--> Column A, a list of items say,


1 YORK A/COND
2 SANYO 1.0HP air-cond
3 CANON laser printer (model : L-P'TER*LBP-3050)
4 EPSON - Batching system printer
5 EPSON - Dot matrix printer
6 Nokia 1200 - handphone
7 HP printer
8 Panasonic K Printer
9 Panasonic X printer
10 Panasonic P-P1121 printer
11 Nokia N6070 handphone
12 Nokia 1200 handphone
13 Microphone
14 Panasonic printer
15 2nd hand hand phone

I am looking for a function / vba / anyway if excel can do a pattern
match
and :

-> highlight (with color) items 1,2
and return in column B the word "cond" (i.e. same word contains in
both the the items)

-> highlight (with different color) items 3,4,5,7,8,9,10,14
and return in column C the word "printer" (i.e. same word contains in
all these items)

-> highlight (with different color) items 6,11,12,13,15
and return in column D the word "phone" (i.e. same word contains in
all these items)

Any suggestion or help on this is greatly appreciated.

Thanks heaps ! Everyone.
SauQ

How do you define which of the words or portions of a word you want to match?

Examples: EPSON is found in lines 4,5
printer is found in multiple lines and the range containing
"printer" overlaps the range containing "EPSON".

How to color ranges with overlapping duplicates?

Also, you write that you want to match the word "phone", yet that is
only a portion of the word "handphone" in a number of the lines you are
matching.

--ron
 
Dear Ron, you are absolutely correct, silly of me not to notice there
are other duplicate words (Nokia, Panasonic, Epson) as well.

In this case, i need excel to highlight rows with certain % of words
matched. Maybe an option to choose / select the % to be highlighted
(e.g: above 50% match found is to be highlighted).

Many thanks again,
SauQ
 
Dear Ron, you are absolutely correct, silly of me not to notice there
are other duplicate words (Nokia, Panasonic, Epson) as well.

In this case, i need excel to highlight rows with certain % of words
matched. Maybe an option to choose / select the % to be highlighted
(e.g: above 50% match found is to be highlighted).

Many thanks again,
SauQ

I think you need a better description of what might constitute a valid match;
and also rules for what should happen when the same line matches other lines,
but with different words (e.g. Epson and printer have overlapping matchings).

Perhaps it would be best to back up and look at the purpose of this process.
For example, if you are trying to find how many entries there are that refer
to, let us say, "phones", it would probably be a lot simpler to develop a list
of keywords that you would search for.

For example, if you decided that your list of key words was

phone
printer
cond

you could enter those words in a series of cells, and then just check the data
against those cells to see what's where. You'd still have the problem of
multiple matchings, but you could probably flag the entry somehow.
--ron
 
Hi Ron

The purpose of this process is to come up with a total cost for all
similar items (e.g the total cost of printers, regardless of brand -
Epson/Panasonic).

The problem is to come up with a list of keywords of items, as there
are a few thousand rows of different items with different brands (and
many are overlapping matches as you have pointed out). I do not know
entirely what are the items contain in the list.

I was hoping that if excel could highlight / identify these items
(say, keyword=printer) which are similar then I could use the
SUMPRODUCT function to total up all cost based on the keywords.

But then again, excel obviously would not know how to distinguish
between an item (handphone, printer, cond, etc) from brands (Epson,
panasonic, etc).

I reckon the best way is to get excel to list all duplicate words
(regardless if its item / brand), then I will manually pick out the
items from the list.

I wonder if there is a better way around this.
Many thanks again for your effort and time in looking at this.
SauQ
 
Hi Ron

The purpose of this process is to come up with a total cost for all
similar items (e.g the total cost of printers, regardless of brand -
Epson/Panasonic).

The problem is to come up with a list of keywords of items, as there
are a few thousand rows of different items with different brands (and
many are overlapping matches as you have pointed out). I do not know
entirely what are the items contain in the list.

I was hoping that if excel could highlight / identify these items
(say, keyword=printer) which are similar then I could use the
SUMPRODUCT function to total up all cost based on the keywords.

But then again, excel obviously would not know how to distinguish
between an item (handphone, printer, cond, etc) from brands (Epson,
panasonic, etc).

I reckon the best way is to get excel to list all duplicate words
(regardless if its item / brand), then I will manually pick out the
items from the list.

I wonder if there is a better way around this.
Many thanks again for your effort and time in looking at this.
SauQ


I think you will have to develop your keyword list manually. There are
routines that can generate a list of individual words, sorted either
alphabetically or by word count. Then you can manually decide which you want
to use as keywords (perhaps with wild cards) in a SUMPRODUCT or SUMIF formula.

Given your initial example:

YORK A/COND
SANYO 1.0HP air-cond
CANON laser printer (model : L-P'TER*LBP-3050)
EPSON - Batching system printer
EPSON - Dot matrix printer
Nokia 1200 - handphone
HP printer
Panasonic K Printer
Panasonic X printer
Panasonic P-P1121 printer
Nokia N6070 handphone
Nokia 1200 handphone
Microphone
Panasonic printer
2nd hand hand phone


And defining a "word" as having only letters, digits or hyphens; and also being
at least two characters in length, here is a list of individual words sorted by
frequency (descending):

printer 8
Panasonic 4
Nokia 3
handphone 3
COND 2
EPSON 2
1200 2
hand 2
YORK 1
SANYO 1
0HP 1
air-cond 1
CANON 1
laser 1
model 1
L-P 1
Batching 1
system 1
Dot 1
matrix 1
HP 1
P-P1121 1
N6070 1
Microphone 1
2nd 1
phone 1

And here is a list sorted alphabetically (ascending):

0HP 1
1200 2
2nd 1
air-cond 1
Batching 1
CANON 1
COND 2
Dot 1
EPSON 2
hand 2
handphone 3
HP 1
laser 1
L-P 1
matrix 1
Microphone 1
model 1
N6070 1
Nokia 3
Panasonic 4
phone 1
P-P1121 1
printer 8
SANYO 1
system 1
YORK 1

There are instructions in the macro posted below for how to change from numeric
to alpha sorting; and also how to change from ascending to descending sorts.

Also, in the macro, as of now, the data source (rSrc) is set to Selection; and
the output destination is set to start at C1 (rDest). These can be changed or
set up in different ways.

Maybe this will help.

To enter this Macro (Sub), <alt-F11> opens the Visual Basic Editor.
Ensure your project is highlighted in the Project Explorer window.
Then, from the top menu, select Insert/Module and
paste the code below into the window that opens.

To use this Macro (Sub):
Select your data range (unless you have modified the Sub to do it
automatically).
Then <alt-F8> opens the macro dialog box. Select the macro by name, and
<RUN>.

===================================
Option Explicit
Option Compare Text
Sub UniqueWordList()
Dim rSrc As Range, rDest As Range, c As Range
Dim cWordList As Collection
Dim res() As Variant
Dim w() As String
Dim i As Long

Set cWordList = New Collection
Set rSrc = Selection
Set rDest = Range("C1")
rDest.EntireColumn.NumberFormat = "@"
For Each c In rSrc
w = Split(c.Value)
For i = 0 To UBound(w)
w(i) = StripWord(w(i))
If Not w(i) = "" Then
On Error Resume Next
cWordList.Add Item:=w(i), Key:=w(i)
On Error GoTo 0
End If
Next i
Next c

'transfer words to results array
ReDim res(1 To cWordList.Count, 0 To 1)
For i = 1 To cWordList.Count
res(i, 0) = cWordList(i)
Next i

'get counts
For i = LBound(res) To UBound(res)
For Each c In rSrc
res(i, 1) = res(i, 1) + CountWord(c.Value, res(i, 0))
Next c
Next i

'sort alpha: d=0; sort numeric d=1
'there are various ways of sorting
BubbleSort res, 0

rDest.CurrentRegion.Clear
For i = LBound(res) To UBound(res)
rDest.Offset(i, 0).NumberFormat = "@"
rDest.Offset(i, 0).Value = res(i, 0)
'For just lowercase output, use:
'rDest.Offset(i, 0).Value = LCase(res(i, 0))
rDest.Offset(i, 1).Value = res(i, 1)
Next i
End Sub

Private Function StripWord(s As String) As String
Dim re As Object, mc As Object, m As Object
Set re = CreateObject("vbscript.regexp")
re.IgnoreCase = True
re.Global = True
re.Pattern = "[-\w]{2,}"
If re.test(s) = True Then
Set mc = re.Execute(s)
StripWord = mc(0).Value
End If
Set re = Nothing
End Function

Private Function CountWord(ByVal s As String, sPat) As Long
Dim re As Object, mc As Object
Set re = CreateObject("vbscript.regexp")
re.Global = True
re.IgnoreCase = True
re.Pattern = "\b" & sPat & "\b"

Set mc = re.Execute(s)
CountWord = mc.Count
End Function

Private Sub BubbleSort(TempArray As Variant, d As Long) 'd is 0 based dimension
Dim temp(0, 1) As Variant
Dim i As Integer
Dim NoExchanges As Integer

' Loop until no more "exchanges" are made.
Do
NoExchanges = True

' Loop through each element in the array.
For i = LBound(TempArray) To UBound(TempArray) - 1

' If the element is less than the element
' following it, exchange the two elements.
' change "<" to ">" to sort ascending
If TempArray(i, d) > TempArray(i + 1, d) Then
NoExchanges = False
temp(0, 0) = TempArray(i, 0)
temp(0, 1) = TempArray(i, 1)
TempArray(i, 0) = TempArray(i + 1, 0)
TempArray(i, 1) = TempArray(i + 1, 1)
TempArray(i + 1, 0) = temp(0, 0)
TempArray(i + 1, 1) = temp(0, 1)

End If
Next i
Loop While Not (NoExchanges)
End Sub
=====================================================
--ron
 
Hi Ron,
I have tried your sub on my other longer list and it works like charm.
The results spanned slightly over a 100 lines, compared to a few
thousand lines that I am stuck with earlier on,
I' ll say your solution is a major improvement in efficiency though I
still have to pick up the ones I want manually.
My zillion thanks to you. Heartfelt appreciation for your effort and
time.
SauQ

Ron said:
Hi Ron

The purpose of this process is to come up with a total cost for all
similar items (e.g the total cost of printers, regardless of brand -
Epson/Panasonic).

The problem is to come up with a list of keywords of items, as there
are a few thousand rows of different items with different brands (and
many are overlapping matches as you have pointed out). I do not know
entirely what are the items contain in the list.

I was hoping that if excel could highlight / identify these items
(say, keyword=printer) which are similar then I could use the
SUMPRODUCT function to total up all cost based on the keywords.

But then again, excel obviously would not know how to distinguish
between an item (handphone, printer, cond, etc) from brands (Epson,
panasonic, etc).

I reckon the best way is to get excel to list all duplicate words
(regardless if its item / brand), then I will manually pick out the
items from the list.

I wonder if there is a better way around this.
Many thanks again for your effort and time in looking at this.
SauQ


I think you will have to develop your keyword list manually. There are
routines that can generate a list of individual words, sorted either
alphabetically or by word count. Then you can manually decide which you want
to use as keywords (perhaps with wild cards) in a SUMPRODUCT or SUMIF formula.

Given your initial example:

YORK A/COND
SANYO 1.0HP air-cond
CANON laser printer (model : L-P'TER*LBP-3050)
EPSON - Batching system printer
EPSON - Dot matrix printer
Nokia 1200 - handphone
HP printer
Panasonic K Printer
Panasonic X printer
Panasonic P-P1121 printer
Nokia N6070 handphone
Nokia 1200 handphone
Microphone
Panasonic printer
2nd hand hand phone


And defining a "word" as having only letters, digits or hyphens; and also being
at least two characters in length, here is a list of individual words sorted by
frequency (descending):

printer 8
Panasonic 4
Nokia 3
handphone 3
COND 2
EPSON 2
1200 2
hand 2
YORK 1
SANYO 1
0HP 1
air-cond 1
CANON 1
laser 1
model 1
L-P 1
Batching 1
system 1
Dot 1
matrix 1
HP 1
P-P1121 1
N6070 1
Microphone 1
2nd 1
phone 1

And here is a list sorted alphabetically (ascending):

0HP 1
1200 2
2nd 1
air-cond 1
Batching 1
CANON 1
COND 2
Dot 1
EPSON 2
hand 2
handphone 3
HP 1
laser 1
L-P 1
matrix 1
Microphone 1
model 1
N6070 1
Nokia 3
Panasonic 4
phone 1
P-P1121 1
printer 8
SANYO 1
system 1
YORK 1

There are instructions in the macro posted below for how to change from numeric
to alpha sorting; and also how to change from ascending to descending sorts.

Also, in the macro, as of now, the data source (rSrc) is set to Selection; and
the output destination is set to start at C1 (rDest). These can be changed or
set up in different ways.

Maybe this will help.

To enter this Macro (Sub), <alt-F11> opens the Visual Basic Editor.
Ensure your project is highlighted in the Project Explorer window.
Then, from the top menu, select Insert/Module and
paste the code below into the window that opens.

To use this Macro (Sub):
Select your data range (unless you have modified the Sub to do it
automatically).
Then <alt-F8> opens the macro dialog box. Select the macro by name, and
<RUN>.

===================================
Option Explicit
Option Compare Text
Sub UniqueWordList()
Dim rSrc As Range, rDest As Range, c As Range
Dim cWordList As Collection
Dim res() As Variant
Dim w() As String
Dim i As Long

Set cWordList = New Collection
Set rSrc = Selection
Set rDest = Range("C1")
rDest.EntireColumn.NumberFormat = "@"
For Each c In rSrc
w = Split(c.Value)
For i = 0 To UBound(w)
w(i) = StripWord(w(i))
If Not w(i) = "" Then
On Error Resume Next
cWordList.Add Item:=w(i), Key:=w(i)
On Error GoTo 0
End If
Next i
Next c

'transfer words to results array
ReDim res(1 To cWordList.Count, 0 To 1)
For i = 1 To cWordList.Count
res(i, 0) = cWordList(i)
Next i

'get counts
For i = LBound(res) To UBound(res)
For Each c In rSrc
res(i, 1) = res(i, 1) + CountWord(c.Value, res(i, 0))
Next c
Next i

'sort alpha: d=0; sort numeric d=1
'there are various ways of sorting
BubbleSort res, 0

rDest.CurrentRegion.Clear
For i = LBound(res) To UBound(res)
rDest.Offset(i, 0).NumberFormat = "@"
rDest.Offset(i, 0).Value = res(i, 0)
'For just lowercase output, use:
'rDest.Offset(i, 0).Value = LCase(res(i, 0))
rDest.Offset(i, 1).Value = res(i, 1)
Next i
End Sub

Private Function StripWord(s As String) As String
Dim re As Object, mc As Object, m As Object
Set re = CreateObject("vbscript.regexp")
re.IgnoreCase = True
re.Global = True
re.Pattern = "[-\w]{2,}"
If re.test(s) = True Then
Set mc = re.Execute(s)
StripWord = mc(0).Value
End If
Set re = Nothing
End Function

Private Function CountWord(ByVal s As String, sPat) As Long
Dim re As Object, mc As Object
Set re = CreateObject("vbscript.regexp")
re.Global = True
re.IgnoreCase = True
re.Pattern = "\b" & sPat & "\b"

Set mc = re.Execute(s)
CountWord = mc.Count
End Function

Private Sub BubbleSort(TempArray As Variant, d As Long) 'd is 0 based dimension
Dim temp(0, 1) As Variant
Dim i As Integer
Dim NoExchanges As Integer

' Loop until no more "exchanges" are made.
Do
NoExchanges = True

' Loop through each element in the array.
For i = LBound(TempArray) To UBound(TempArray) - 1

' If the element is less than the element
' following it, exchange the two elements.
' change "<" to ">" to sort ascending
If TempArray(i, d) > TempArray(i + 1, d) Then
NoExchanges = False
temp(0, 0) = TempArray(i, 0)
temp(0, 1) = TempArray(i, 1)
TempArray(i, 0) = TempArray(i + 1, 0)
TempArray(i, 1) = TempArray(i + 1, 1)
TempArray(i + 1, 0) = temp(0, 0)
TempArray(i + 1, 1) = temp(0, 1)

End If
Next i
Loop While Not (NoExchanges)
End Sub
=====================================================
--ron
 
Hi Ron,
I have tried your sub on my other longer list and it works like charm.
The results spanned slightly over a 100 lines, compared to a few
thousand lines that I am stuck with earlier on,
I' ll say your solution is a major improvement in efficiency though I
still have to pick up the ones I want manually.
My zillion thanks to you. Heartfelt appreciation for your effort and
time.
SauQ

I'm glad it was useful. Thanks for the feedback. And yes, I don't know of a
good way to avoid manually picking up the keywords on which you want to focus.
But once you do, it should be relatively simple to derive appropriate
functions/formulas to analyze.
--ron
 
Back
Top