Delete matching cells

gary · Jan 15, 2012

From the responses and their results, I think it'd be best to re-state
my OP:

I need a list of the values in Col A that are NOT found in Col B.

My spreadsheet contains:

A B
0000000021957 0000000022002
0000000022002 0000000032002
0000000031957 0000000032003
0000000032002 0000000042002
0000000032003 0000000052002
0000000042002 0000000052003
0000000052002 0000000062002
0000000052003 0000000102002
0000000061967 0000000121996
0000000061968 0000000142002
0000000062002 0000000152002
0000000081963 0000000162002
0000000102002 0000000481994
0000000121996 0000000481995
0000000142002 0000000481996
0000000152002 0000000481997
0000000162002 0000000481998
0000000341991 0000000481999
0000000401961 0000000482000
0000000431978 0000000482002
0000000481994 0000000482008
0000000481995 0000000482009
0000000481996 0000000631995
0000000481997 0000000631996
0000000481998 0000000631997
0000000481999 0000000631998
0000000482000 0000000631999
0000000482002 0000000632000
0000000482008 0000000632001
0000000482009 0000000642000
0000000482010 0000000681994
0000000482011 0000000681995
0000000491959 0000000681996
0000000511958 0000000681997
0000000591982 0000000681998
0000000591983 0000000911997
0000000591984 0000000962001
0000000591990 0000000962003
0000000591991 0000001001997
0000000611962 0000001082006
0000000631993 0000001381994
0000000631995 0000001381995
0000000631996 0000002001994
0000000631997 0000002122007
0000000631998 0000002291995
0000000631999 0000002291996
0000000632000 0000002291997
0000000632001 0000002601999
0000000641984 0000002602000
0000000642000 0000002641998
0000000661957 0000002731994
0000000681994 0000003031994
0000000681995 0000003161994
0000000681996 0000003161995
0000000681997 0000003161996
0000000681998 0000003161997
0000000691959 0000003392009
0000000751990 0000003901998
0000000811961 0000004062006
0000000811991 0000004091994
0000000811992 0000004091995
0000000811993 0000004131998
0000000821959 0000004231998
0000000851958 0000004371995
0000000881990 0000004521995
0000000911997 0000004522000
0000000951959 0000004541997
0000000962001 0000004542000
0000000962003 0000004542001
0000001001997 0000005001998
0000001031957 0000005002002
0000001082006 0000005121997
0000001121970 0000005181994
0000001121973 0000005181998
0000001121974 0000005381995
0000001121975 0000005381996
0000001121976 0000005471994
0000001181960 0000005471995
0000001191952 0000005471996
0000001311961 0000005581996
0000001341959 0000005622000
0000001381994 0000005622001
0000001381995 0000005622002
0000001411981 0000005971994
0000001411982 0000006202004
0000001411983 0000006491995
0000001411991 0000006511994
0000001411992 0000006511996
0000001421956 0000006571994
0000001451982 0000006571995
0000001471982 0000006571996
0000001541990 0000007291994
0000001561957 0000007291996
0000001631971 0000007291998
0000001631972 0000007321994
0000001681959 0000007341997
0000001711991 0000007341998
0000001781953 0000007432003
0000001871955 0000007751995
0000001881952 0000007931996
0000001881955 0000008071994
0000001881959 0000008071995
0000001921986 0000008071997
0000001951958 0000008432000
0000001981958 0000008432001
0000002001954 0000008432002
0000002001955 0000008631995
0000002001994 0000008631996
0000002011957 0000008662000
0000002031958 0000008681998
0000002101957 0000008702003
0000002122007 0000008702004
0000002191957 0000008702005
0000002191963 0000008702006
0000002211979 0000008771994
0000002211980 0000008771995
0000002241960 0000008771996
0000002251958 0000008771997
0000002271976 0000008771998
0000002281981 0000008772000
0000002281982 0000008772003
0000002291982 0000008961994
0000002291983 0000008991996
0000002291984 0000008992005
0000002291985 0000008992006
0000002291986 0000009061997
0000002291987 0000009061998
0000002291993 0000009062006
0000002291995 0000009062007
0000002291996 0000009062009
0000002291997 0000009171995
0000002331955 0000009171997
0000002331961 0000009172003
0000002381990 0000009221994
0000002391960 0000009221996
0000002411958 0000009361995
0000002411967 0000009362003
0000002421958 0000009362004
0000002461981 0000009401994
0000002461982 0000009581998
0000002461983 0000009691994
0000002531986 0000009691995
0000002571990 0000009691996
0000002571991 0000009821996
0000002591977 0000009831996
0000002601963 0000009831997
0000002601999 0000009971994
0000002602000 0000009971995
0000002641998 0000009971996
0000002691990 0000009971997
0000002691991 0000009971998
0000002711961 0000009971999
0000002731990 0000009972000
0000002731992 0000009972001
0000002731993 0000009972002
0000002731994 0000009972003
0000002741956 0000009972004
0000002741990 0000009972005
0000002871977 0000009972006
0000002891956 0000010001998
0000002921961 0000010011994
0000002971956 0000010011995
0000002971991 0000010321994
0000002981959 0000010361997
0000003001989 0000010411996
0000003001990 0000010411997
0000003021990 0000010411998
0000003031994 0000010411999

GS · Jan 15, 2012

Ron Rosenfeld wrote on 1/15/2012 :

Gary,

When I interrupted it it had been running for 219 seconds. At that point in
time it had eliminated 617 entries from the column A list.

I then started up my "dictionary" routine. It ran for about 58.6 seconds and
eliminated the remaining 260,493 duplicated entries.

To set up the sample data, I enter a formula like:

A1 & B1: =text(randbetween(1,10^6),"0000000000000")
Fill down to row 500,000.
Then copy/Paste Values

For timing I use the HiRes timer.

I initially tried an approach like yours:
Examine each cell
If the data is invalid, delete the cell and rearrange the rest (delete
xlshiftup)

After some thought, I decided it should be faster to
Get all the good data into a sequential array.
Delete ALL the original data
Write back the good data array.

The approach I used, using the dictionary, works pretty fast. It's
disadvantage is that if duplicates in the original data should be retained,
it would have to be modified. (i.e. if there are multiple 0000000123456's in
column A, and none of that value in Column B, and the multiple values all
need to be retained in column A; and they need to be retained in their
original order). Fortunately, that is not the case.

And if I had Excel 2010, the Advanced Filter might work. I would
filter/copy; then delete the original and write back the copy. That would
work even with duplicates. But it won't work in Excel 2007 with this data
base (and seems to run slower even with smaller databases).

Thanks, Ron.
I suspected it would take a long time since it writes the worksheet in
the loop. My source code does everything in memory using the arrays and
sett matches to an empty string. Note that the source data is read n
from a data logger output file, NOT from a worksheet. I just dump the
result back into a blank sheet. I can't use the dictionary because I
need to preserve duplicates. The match criteria is an ambient
temperature value in a line of text at a specific position and so if
it's not '>=' then I reset the array element to an empty string, then
use the Filter() function to dump the resulting data into a worksheet.
Since there's only 1 array to loop once only the process is really
fast.

GS · Jan 16, 2012

Your code took 46 secs to run your sample data on my machine. That's
amazing! I'm curious, now, how it performs if we did similar using a
Collection so we can have duplicates in ColA!

Jim Cone · Jan 16, 2012

Ron,
Actually, I misspoke, the Dictionary is zero based, but the recommendation stands.
Also, the two lines near the end should read...
rColA.ClearContents
Set rColA = rColA.Resize(dColA.Count, 1)
'---
Jim Cone

"Jim Cone" <[email protected]>
wrote in message
news:[email protected]...

GS · Jan 16, 2012

GS expressed precisely :

Your code took 46 secs to run your sample data on my machine. That's amazing!
I'm curious, now, how it performs if we did similar using a Collection so we
can have duplicates in ColA!

I modified my sub to use your idea to use Dictionary, but NOT put colA
in a dictionary and it shaved 11 secs off the ET...

Sub StripDupes()
Dim i&, j&, lRows1&, lRows2& lMatchesFound& 'as long
Dim vRngA, vRngB, vRngOut() 'as variant
Dim dRngB As Dictionary

lRows1 = Cells(Rows.Count, "A").End(xlUp).Row
lRows2 = Cells(Rows.Count, "B").End(xlUp).Row
vRngA = Range("A1:A" & lRows1): vRngB = Range("B1:B" & lRows2)

Set dRngB = New Dictionary
Debug.Print Now()
For j = LBound(vRngB) To UBound(vRngB)
With dRngB
If Not .Exists(Key:=vRngB(j, 1)) Then _
.Add Key:=vRngB(j, 1), Item:=vRngB(j, 1)
End With
Next 'j
sTemp = Mid$(sTemp, 2)
' Debug.Print Now()

For i = LBound(vRngA) To UBound(vRngA)
If dRngB.Exists(Key:=vRngA(i, 1)) Then _
vRngA(i, 1) = "": lMatchesFound = lMatchesFound + 1
Next 'i
' Debug.Print Now()

j = 0: ReDim vRngOut(UBound(vRngA) - lMatchesFound, 1 To 1)
For i = LBound(vRngA) To UBound(vRngA)
If Not vRngA(i, 1) = "" Then
vRngOut(j, 1) = vRngA(i, 1): j = j + 1
End If
Next 'i
' Debug.Print Now()

Range("A1:A" & lRows1).ClearContents
Range("A1").Resize(UBound(vRngOut), 1) = vRngOut

Debug.Print Now()
End Sub

GS · Jan 16, 2012

ET on my machine was 35 secs as per timing method used as shown. I
didn't think this task deserved the trouble to setup and use
cHiResTimer class.

GS · Jan 16, 2012

Ron Rosenfeld wrote :

That should preserve the duplicates in col A also, to answer your previous
question.

Hhm.., that's quite true where non-matches occur. I suppose that might
be a better way to go when comparing 2 or more items. Won't help my
data logger file parser, though. It just loops 1 array, but it does
parse each element into a temp array for the test. It runs blazingly
fast on my machine (1.6Ghz Intel dual-core on a Dell Precision series
laptop w/2GB RAM).

Jim Cone · Jan 16, 2012

Garry,
More FWIW...
I ran Ron's code on my XP, i3 machine (xl2007) against 600,000 6 digit random numbers.
360,000 in col A and 240,000 in col B. It took about 7 1/2 seconds.
It returned ~87000 numbers not in col B.
'---
Jim Cone

GS · Jan 16, 2012

Ron Rosenfeld submitted this idea :

GS,
Technique question:
Why, on the "Next" line, do you comment out the counter variable that you
are looping on? I've not commented out, and have had the VBE help me out when
I might be using nested loops.

-- Ron

Ron,
Just something I picked up from the Classic VB crowd. It falls in the
same bucket as the dif using Mid() and Mid$(), and how VB handles this
at runtime. Sorry, but I can't give you technical details about these
without going back over a few years of posts. I include the comment for
notation purposes so I know which counter is repeating in nested or
long loops. Otherwise, I don't see any problem with leaving the
apostrophe out if desired. My choice to use it was formed a long time
ago because I didn't want to lose the notation. (You'll see other code
samples that use a similar technique for Select Case, If, While, and Do
constructs as well. I also do similar for end of procedures because
it's helpful when reading through modules in a text editor outside the
VBE.

HTH

GS · Jan 16, 2012

Jim Cone explained on 1/15/2012 :

Garry,
More FWIW...
I ran Ron's code on my XP, i3 machine (xl2007) against 600,000 6 digit random
numbers.
360,000 in col A and 240,000 in col B. It took about 7 1/2 seconds.
It returned ~87000 numbers not in col B.
'---
Jim Cone

Thanks, Jim. Can you run my final version on your sample data and
report back. I'd be curious to know the results. I'm running XP SP3 and
did the test in xl2007. Thanks in advance...

Jim Cone · Jan 16, 2012

Garry,
It may turn out to be one of those days, especially if the snow sticks.

I plugged in a new set of sample data into xl2010 this morning.
It appears that your code is returning mismatched items... items in col A that are not in Col B.
But it is not eliminating duplicates.
Column A has 360,000 random 6 digit numbers.
Column B has 240,000 random 6 digit numbers.

Ron's code returns 231,414 unique entries.
Your code returns 279,200 entries: 231,514 unique and 47,686 duplicates.
(i ran my own unique counter on your returned data)

It's too early in the day for me to try to figure out why. <g>
'---
Regards,
Jim Cone

wrote in message

Thanks, Jim. Can you run my final version on your sample data and report back. I'd be curious to
know the results. I'm running XP SP3 and did the test in xl2007. Thanks in advance...

GS · Jan 16, 2012

Garry,
It may turn out to be one of those days, especially if the snow sticks.

I plugged in a new set of sample data into xl2010 this morning.
It appears that your code is returning mismatched items... items in col A
that are not in Col B.
But it is not eliminating duplicates.
Column A has 360,000 random 6 digit numbers.
Column B has 240,000 random 6 digit numbers.

Ron's code returns 231,414 unique entries.
Your code returns 279,200 entries: 231,514 unique and 47,686 duplicates.
(i ran my own unique counter on your returned data)

It's too early in the day for me to try to figure out why. <g>
'---
Regards,
Jim Cone

wrote in message

Jim,
Ron explains the what/why of my version of the task fairly well. What
I'm more interested in is how long it took on your machine to process
the same amount of data as when you ran Ron's version.

Jim Cone · Jan 16, 2012

Garry,
Ran your code and Ron's twice each.
Ron: 14.0 seconds
Garry: 8.2 seconds
Times were identical for tests on each.
Xl2010 on WindowsXP - 360,000 nums in col A, 240,000 nums in col B.

Changes from yesterday: xl2010 vs. xl2007 and more data overlap between columns.
I'm thinking that the xl2010 vba Rnd function may be different.
'---
Jim Cone
Portland, Oregon USA
http://blog.contextures.com/archives/2011/07/18/find-last-row-with-excel-vba/
(workbook with "universal" Last Row function code - free)

GS · Jan 16, 2012

Jim Cone wrote :

Garry,
Ran your code and Ron's twice each.
Ron: 14.0 seconds
Garry: 8.2 seconds
Times were identical for tests on each.
Xl2010 on WindowsXP - 360,000 nums in col A, 240,000 nums in col B.

Changes from yesterday: xl2010 vs. xl2007 and more data overlap between
columns.
I'm thinking that the xl2010 vba Rnd function may be different.
'--- Jim Cone
Portland, Oregon USA
http://blog.contextures.com/archives/2011/07/18/find-last-row-with-excel-vba/
(workbook with "universal" Last Row function code - free)

Thanks, Jim. I guess I was expecting a slower time as compared to Ron's
(approx RonsTime * 0.75), but I'm very happy that you report it was
better by almost half.

Obviously, the Scripting Dictionary is the better way to compare
columns of data. What I find interesting is how slow doing VB
comparison using arrays is. I've learned something valuable here..! My
thanks to you and Ron for your efforts...

GS · Jan 16, 2012

Thanks, Ron. I really appreciate your efforts!

I was thinking to now create a function that returns a boolean on
success, and accepts "Optional AllowDupes As Boolean = True" as its arg
so the faster code runs unless the user wants to remove dupes in ColA.
Does that make sense?

GS · Jan 16, 2012

Something to play with...

Function StripDupes(Optional AllowDupes As Boolean = True) As Boolean
' Compares colA to colB and removes colA matches found in colB.
' Args In: AllowDupes: True by default
' Keeps duplicate values found in colA that are not found
in colB;
' If False, duplicate values in colA not found in colB are
removed.
'
' Returns: True if matches are found --AND-- no error occurs;
' False if matches are NOT found --OR-- error occurs.
'
' Sources: Ron Rosenfeld, Jim Cone, Garry Sansom

Dim i&, j&, lRows1&, lRows2&, lMatchesFound& 'as long
Dim vRngA, vRngB, vRngOut() 'as variant
Dim dRngB As New Dictionary

On Error GoTo ErrExit

lRows1 = Cells(Rows.Count, "A").End(xlUp).Row
lRows2 = Cells(Rows.Count, "B").End(xlUp).Row
vRngA = Range("A1:A" & lRows1): vRngB = Range("B1:B" & lRows2)

For j = LBound(vRngB) To UBound(vRngB)
With dRngB
If Not .Exists(Key:=vRngB(j, 1)) Then _
.Add Key:=vRngB(j, 1), Item:=vRngB(j, 1)
End With
Next 'j

If AllowDupes Then '//fastest
For i = LBound(vRngA) To UBound(vRngA)
If dRngB.Exists(Key:=vRngA(i, 1)) Then _
vRngA(i, 1) = "": lMatchesFound = lMatchesFound + 1
Next 'i

j = 0: ReDim vRngOut(UBound(vRngA) - lMatchesFound, 0)
For i = LBound(vRngA) To UBound(vRngA)
If Not vRngA(i, 1) = "" Then _
vRngOut(j, 0) = vRngA(i, 1): j = j + 1
Next 'i

Else '//slowest
Dim dRngA As New Dictionary
For i = LBound(vRngA) To UBound(vRngA)
If Not dRngB.Exists(vRngA(i, 1)) Then
With dRngA
If Not .Exists(Key:=vRngA(i, 1)) Then _
.Add Key:=vRngA(i, 1), Item:=vRngA(i, 1)
End With 'dRngA
End If 'Not dRngB.Exists(vRngA(i, 1))
Next 'i

Dim v As Variant
j = 0: ReDim vRngOut(dRngA.Count, 0)
For Each v In dRngA
vRngOut(j, 0) = dRngA(v): j = j + 1
Next 'v
End If 'AllowDupes

Range("A1:A" & lRows1).ClearContents
Range("A1").Resize(UBound(vRngOut), 1) = vRngOut

ErrExit:
StripDupes = (Err = 0)
End Function 'StripDupes()

GS · Jan 16, 2012

We can shave off even more time if we eliminate the checks when adding
items to the dictionary because the dictionary won't allow dupes...

Function StripDupes(Optional AllowDupes As Boolean = True) As Boolean
' Compares colA to colB and removes colA matches found in colB.
' Args In: AllowDupes: True by default. Keeps duplicate values
' found in colA that are not found in colB. If False,
' duplicate values in colA not found in colB are removed.
'
' Returns: True if matches found and no error occurs;
' False if matches not found --OR-- error occurs.
'
' Sources: Ron Rosenfeld, Jim Cone, Garry Sansom

Dim i&, j&, lRows1&, lRows2&, lMatchesFound& 'as long
Dim vRngA, vRngB, vRngOut() 'as variant
Dim dRngB As New Dictionary

On Error GoTo ErrExit

lRows1 = Cells(Rows.Count, "A").End(xlUp).Row
lRows2 = Cells(Rows.Count, "B").End(xlUp).Row
vRngA = Range("A1:A" & lRows1): vRngB = Range("B1:B" & lRows2)

On Error Resume Next
For j = LBound(vRngB) To UBound(vRngB)
dRngB.Add Key:=vRngB(j, 1), Item:=vRngB(j, 1)
Next 'j
On Error GoTo 0

If AllowDupes Then '//fastest
For i = LBound(vRngA) To UBound(vRngA)
If dRngB.Exists(Key:=vRngA(i, 1)) Then _
vRngA(i, 1) = "": lMatchesFound = lMatchesFound + 1
Next 'i

j = 0: ReDim vRngOut(UBound(vRngA) - lMatchesFound, 0)
For i = LBound(vRngA) To UBound(vRngA)
If Not vRngA(i, 1) = "" Then _
vRngOut(j, 0) = vRngA(i, 1): j = j + 1
Next 'i

Else '//slowest
Dim dRngA As New Dictionary
On Error Resume Next
For i = LBound(vRngA) To UBound(vRngA)
If Not dRngB.Exists(vRngA(i, 1)) Then _
dRngA.Add Key:=vRngA(i, 1), Item:=vRngA(i, 1)
Next 'i
On Error GoTo 0

Dim v As Variant
j = 0: ReDim vRngOut(dRngA.Count, 0)
For Each v In dRngA
vRngOut(j, 0) = dRngA(v): j = j + 1
Next 'v
End If 'AllowDupes

Range("A1:A" & lRows1).ClearContents
Range("A1").Resize(UBound(vRngOut), 1) = vRngOut

ErrExit:
StripDupes = (Err = 0)
End Function 'StripDupes()

Jim Cone · Jan 17, 2012

Garry,
I wonder if gary (the original poster) is still around?
He would have his moneys worth by now. <g>

Its time for further posts, if any, on this subject to go into a brand new post.
It has gotten a little unwieldy.
'---
Jim Cone

"GS" <[email protected]>
wrote in message
news:[email protected]...

GS · Jan 17, 2012

I was thinking the same thing! I'll report back in a new thread as Jim
suggested...

GS · Jan 17, 2012

I like how you were able to eliminate the final If...Then construct.<g>