hash function for large strings

Guest

Excel's VLOOKUP function returns #VALUE! when the lookup value exceeds 256
characters. I need a hash function to transform large strings into values
that stay within that limit. Are there any quick solutions to this problem?
 
Bernie Deitrick

It depends on your strings, and how similar they are. Anything from a simple =LEFT(A1,255) to a
more complicated scheme of sampling every third (or fourth or tenth) letter could work....
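
For instance, a minimal VBA sketch of the sampling idea might look like this (SampleKey is just an illustrative name, and two different long strings could of course still produce the same sample):

Function SampleKey(s As String, n As Long) As String
    ' Keep every n-th character of s, starting with the first
    Dim i As Long
    For i = 1 To Len(s) Step n
        SampleKey = SampleKey & Mid$(s, i, 1)
    Next i
End Function

Used as =SampleKey(A1,3) on both the log entries and the first column of the lookup table.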

HTH,
Bernie
MS Excel MVP
 
Niek Otten

It depends on which character values you can exclude from the string. If
each character can be any of the 256 possible characters, there are
limited hashing possibilities; you can treat upper- and lowercase letters
the same, since VLOOKUP will do that anyway, but that doesn't save you a
lot. If the characters are just the alphabet and 0-9, you can exclude a lot more.
Then you can use a number-base-conversion function to realize the gain in
storage requirement. It may, however, be difficult to find one that handles all
256 possibilities. I have seen one (from Ron Rosenfeld) that uses 0-9 and
upper- and lowercase letters.

I find it difficult to believe that there isn't any logic behind the
population of your keys. Still that is the "key" to hashing.

Can you tell a bit more about the nature of your keys?
 
Niek Otten

<Still that is the "key" to hashing>

That isn't true. But it is if you want to use it with VLOOKUP.
Maybe you should tell us more about the nature of the lookup problem too.
 
Guest

I am pulling log entries into Excel from Essbase, a multi-dimensional
database, and want to assign categories to each entry using a vlookup. The
entries could use pretty much any character. Below is a sample of a large
string entry:
'Calculating [
Accounts(AC_IS_Asset_Based_Fixed,AC_IS_Conditional_Sales_Contracts_Fixed,AC_IS_Construction_Fixed,AC_IS_Factoring_Fixed,AC_IS_Finance_Leases_Fixed,AC_IS_Floorplan_Fixed,AC_IS_Inventory_Fixed...]
with fixed members [Business_Type(BT_Adds_Non_Recourse_NB_Recv_Amt,
BT_Adds_Non_Recourse_NB_Equip_Cost_Amt, BT_Adds_Non_Recourse_NB_RE_Cost_Amt,
BT_Adds_Non_Recourse_EB_Recv_Amt, BT_Adds_Non_Recourse_EB_Equip_Cost_Amt, ]

Regards,
Jon
 
Pete_UK

Looking at your example string, you can see many repeating groups of
characters, eg:

AC_IS_ _Fixed BT_Adds_Non_Recourse _Cost_Amt _Recv_Amt

to pick just five. You could envisage, then, a user-defined function
which is applied to the original string and returns a reduced string by
repeated application of SUBSTITUTE( ) to change these character groups
to some other character which is not contained within any of the
strings.

If you apply the UDF to the lookup table as well, then you can use the
modified string as the key field.

Is this the kind of thing you were looking for?

Pete
 
Niek Otten

Also, it is almost only letters, commas, underscores and round and square
brackets. That is less than 30 possibilities and should fit in just 5 bits
instead of 8 - in other words, 8 positions can be stored in 5 characters.
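
A rough VBA sketch of that packing idea, purely to illustrate (Pack5Bit is a made-up name; it assumes every character of the key appears in the 31-symbol alphabet below, with anything else collapsed to one spare code, and the packed result can contain control characters, so it is not meant to be readable):

Function Pack5Bit(s As String) As String
    Const ALPHABET As String = "ABCDEFGHIJKLMNOPQRSTUVWXYZ_,()["   ' 31 symbols; code 31 = anything else
    Dim i As Long, code As Long, buffer As Long, bits As Long
    For i = 1 To Len(s)
        code = InStr(1, ALPHABET, Mid$(s, i, 1), vbTextCompare) - 1   ' 0..30, case-insensitive
        If code < 0 Then code = 31                                    ' unknown characters all share one code
        buffer = buffer * 32 + code
        bits = bits + 5
        If bits >= 8 Then                                             ' a full byte is ready: emit it
            bits = bits - 8
            Pack5Bit = Pack5Bit & Chr$(buffer \ 2 ^ bits)
            buffer = buffer Mod 2 ^ bits
        End If
    Next i
    If bits > 0 Then Pack5Bit = Pack5Bit & Chr$(buffer * 2 ^ (8 - bits))   ' flush the last few bits
End Function

Applied as =Pack5Bit(A1) to both the log entries and the lookup table's key column, it shrinks the keys to roughly 5/8 of their original length.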

But probably Pete's suggestion for repeating strings is easier to code.
 
Guest

Unfortunately, the repeating strings, while present in this example, don't
occur in most cases. This example is more of an anomaly in that regard.
 
Niek Otten

I can't see how you would use such a string as a key. As a search argument,
OK, but not as an identifier that uniquely identifies one record.
Can you tell us how you intend to use these items? How is your search key
assembled / where does it come from? What data are you trying to look up using
the key?
 
Harlan Grove

jheby wrote...
I am pulling log entries into Excel from Essbase, a multi-dimensional
database, and want to assign categories to each entry using a vlookup. The
entries could use pretty much any character. Below is a sample of a large
string entry:
....

There are tasks for which Excel is most definitely not the most
appropriate tool, and in this case, Excel may be one of the worst tools
for the task. Most scripting languages (Perl, Python, Ruby, gawk)
provide associative arrays that handle hashing arbitrarily long
strings. But if you want a simple spreadsheet approach, get a capable
spreadsheet like the one in OpenOffice. OpenOffice Calc doesn't choke
on string operations with string lengths > 256 characters. This is one
of the ugly little backwaters of Excel functionality that hasn't been
changed in more than a decade.
 
Guest

The log that I'm analyzing consists of perhaps 5K unique entries that get
recorded perhaps 300K times over a period of several days. I've created a
map that groups the 5K unique entries into about a hundred categories. I
want to go back and tag each of the 300K instances with its category so that
I can summarize & analyze the log entries by those category groupings -
that's where the need for the vlookup comes in. The majority of the unique
entries are well under 255 characters, but unfortunately, a significant
number of them are not.
 
Guest

Thanks - I hope that the Excel designers are planning to fix this along with
a number of other items in the new product (including the eight-level nesting
limit and the 65K row limit).
 
Pete_UK

Well, you could wait for Excel 12 later in the year, but if you want to
do this now, the following illustrates my point:

Your example text string is 428 characters long. I put it in A1 and
this formula in B1:

=SUBSTITUTE(A1,"BT_Adds_Non_Recourse",CHAR(161))

The length came down to 333 characters.

Changing the formula to this:

=SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(A1,"BT_Adds_Non_Recourse",CHAR(161)),"_Cost_Amt",CHAR(162)),"_Recv_Amt",CHAR(163)),"AC_IS_",CHAR(164)),"_Fixed",CHAR(165))

brought the length down to 233 characters - so you could use this as a
key in your VLOOKUP formula. The formula merely changes each of the character
groups identified earlier into a single character.

Hopefully, you can see how this method could be extended by using a UDF
to pick up other character groups.

Pete
 
Guest

Thanks - that approach will definitely work in this instance. It still
appears to me that I will have to create a VBA-based hashing function to
address the general case.
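
For the general case, a simple polynomial hash in VBA might be a starting point (HashKey is just an illustrative name; since different strings can hash to the same value, it would be prudent to store something like =HashKey(A1)&"|"&LEFT(A1,200) on both sides rather than the bare hash):

Function HashKey(s As String) As String
    ' Polynomial hash of the string, reduced modulo 2^31-1; returns at most 8 hex digits
    Dim i As Long
    Dim t As Double, h As Double
    Const M As Double = 2147483647#
    For i = 1 To Len(s)
        t = h * 131 + Asc(Mid$(s, i, 1))
        h = t - Int(t / M) * M      ' t Mod M, done in Doubles to avoid Long overflow
    Next i
    HashKey = Hex$(h)
End Function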

Regards,
Jon
 
Pete_UK

Hi Jon,

You won't be able to extend the above formula very much because of the
limits on nesting. Here's an example UDF to do the same thing:

Function reduce(test As String) As String
    reduce = test
    If Len(reduce) < 255 Then Exit Function
    reduce = Application.WorksheetFunction.Substitute(reduce, "BT_Adds_Non_Recourse", Chr(161))
    reduce = Application.WorksheetFunction.Substitute(reduce, "_Cost_Amt", Chr(162))
    reduce = Application.WorksheetFunction.Substitute(reduce, "_Recv_Amt", Chr(163))
    reduce = Application.WorksheetFunction.Substitute(reduce, "AC_IS_", Chr(164))
    reduce = Application.WorksheetFunction.Substitute(reduce, "_Fixed", Chr(165))
    reduce = Application.WorksheetFunction.Substitute(reduce, "_NB", Chr(166))
    reduce = Application.WorksheetFunction.Substitute(reduce, "_EB", Chr(167))
    reduce = Application.WorksheetFunction.Substitute(reduce, "Finance_Leases", Chr(168))
    reduce = Application.WorksheetFunction.Substitute(reduce, "Conditional_Sales_Contracts", Chr(169))
End Function

I've added a few extra substitutions at the end, which reduce the length of
the original string to 184 characters when used as:

=reduce(A1)

The function does not change any original text which is already less
than 255 characters in length. Obviously you can apply this to other
strings by copying down, but you can also extend the UDF to cope with
other patterns - just change the character number at the end of the
line.

Hope this helps.

Pete
 
Harlan Grove

Pete_UK wrote...
You won't be able to extend the above formula very much because of the
limits on nesting. Here's an example UDF to do the same thing:
....

And thinking holistically, how does this udf help? While it may reduce
the length of the 1st argument to VLOOKUP, it's not usable directly for
the first column of the 2nd argument to VLOOKUP. It's not even usable
directly in an INDEX/MATCH approach because it can't handle range or
array arguments and return values. At best, the OP would need to use
one additional cell per log file record with each of those cells
calling this UDF with the corresponding record value. Calling *any* udf
a few thousand times even on a superfast system will bring back fond
memories of the responsiveness of 110 baud dial-up mainframe
connections.

MATCH is afflicted with the same 256 character limit on comparisons as
VLOOKUP. However, the = operator isn't so afflicted. Use INDEX/MATCH.
Given A1:A6 containing

=REPT("0123456789ABCDEF",16)&" - "&CHAR(64+ROW())

and B1:B6 containing

=ROW()

Then instead of VLOOKUP(x,A1:B6,2,0) use array formulas like

=INDEX(B1:B6,MATCH(TRUE,(A1:A6=x),0))

A kludge, but this will be faster than any udf.
 
Niek Otten

Hi Harlan,

An ingenious solution! I like the out-of-the-box approach.

However, I'm not sure I agree with what you say about performance; in my
experience array formulas are rather slow.
I tested your solution (somewhat modified because of the limits of CHAR()).
Because it requires some effort to arrange for more records than there are
rows, I tested with 30,000 (instead of 300,000) of your formulas; that took
284 seconds on my machine.
I built a reducekey function similar to Pete's; 30,000 reductions took
15 seconds.
If the 5K table is or can be sorted, Pete's solution (which indeed requires
some more work, like adding the reduced key to the lookup table) might be
quite a lot faster than applying array formulas.

I know UDFs have a bad reputation for performance. Indeed the overhead of
switching between Worksheet and VBA is significant. But often the
performance problems seem to be caused by inefficient programming,
especially wrong data types.
My actuarial function system manages 10,000 calls from a worksheet per
second.

It would be a lot of work, but it would be nice to test Jon's actual case
with the different approaches suggested.
 
Harlan Grove

Niek Otten wrote...
....
However, I'm not sure I agree with what you say about performance; in my
experience array formulas are rather slow.
....

It's not the array formulas per se, it's the 30K LONG string
comparisons. If the OP had relatively static data (apparently the
case), a macro essentially performing repeated Edit > Replace
operations on copies of the log file records would be better than UDFs.
One VBA invocation beats 30K separate ones.
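
Purely as an illustration (the column reference and the substitution pairs below are assumptions, not taken from the OP's actual layout), such a macro could run each replacement over the whole copied column in a single call:

Sub ReduceLogColumn()
    ' One macro run replaces the hardcoded patterns in every cell of column B at once
    Dim rng As Range
    With ActiveSheet
        Set rng = .Range("B1", .Cells(.Rows.Count, "B").End(xlUp))
    End With
    rng.Replace What:="BT_Adds_Non_Recourse", Replacement:=Chr(161), LookAt:=xlPart, MatchCase:=True
    rng.Replace What:="_Cost_Amt", Replacement:=Chr(162), LookAt:=xlPart, MatchCase:=True
    rng.Replace What:="_Recv_Amt", Replacement:=Chr(163), LookAt:=xlPart, MatchCase:=True
    ' ...and so on for the remaining patterns
End Sub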

The greater difficulty is whether the chaff that Pete_UK has
*hardcoded* into his udf would apply to *all* the OP's records. Usually
it's a bad idea to generalize from one sample record. But there's also
the question of whether, given the *entire* population of Essbase fields,
categories, dimensions, etc., the substitutions could prove awkward. For
instance, the OP's single sample record contained

BT_Adds_Non_Recourse_NB_Recv_Amt,
BT_Adds_Non_Recourse_NB_Equip_Cost_Amt,
BT_Adds_Non_Recourse_NB_RE_Cost_Amt,
BT_Adds_Non_Recourse_EB_Recv_Amt,
BT_Adds_Non_Recourse_EB_Equip_Cost_Amt,

Pete_UK's udf would reduce these to

,
Equip,
RE,
,
Equip,

Do you begin to see a problem?

GIGO processing can be very runtime efficient. When you're after
correct results, however, it may require more calculation time and
perhaps a wee bit more time figuring out the full extent of the
problem.

Myself, I don't believe the OP can *RELIABLY* accomplish his/her task
by removing any text other than space characters on either side of
punctuation.
 
Pete_UK

Unfortunately Jon has only posted one example of his long strings - we
have no idea what the other strings look like, but I had assumed
that similar character groupings would appear in many of them. He says
a significant number of his strings are very long, but that could mean
a couple of thousand if he has a batch of 15k records.

Applying the reduced key formula to the lookup table would only have to
be done once - the values could be fixed and then copied over the
original keys.

Jon said he does 300k weekly, so performance is an issue, and like you,
Niek, I have found that array formulae can be very slow. He wanted a
way to reduce his long strings so that he could use his existing
vlookup formulae, and my approach allows him to do that without a lot
of effort either in setting it up or in using it - I also tested it on 30k
identical records following Harlan's posting, and it took less than 10
seconds.

Anyway, it's up to him - I think writing a generalised hashing
algorithm as he was contemplating would take considerable time.

Pete
 
Pete_UK

Harlan,

this is what the final part of the string gets converted into:

[Business_Type(¡¦£,¡¦_Equip¢, ¡¦_RE¢,¡§£, ¡§_Equip¢, ]

which represents the following (original on the left, converted on the right):

[Business_Type(                               [Business_Type(
BT_Adds_Non_Recourse_NB_Recv_Amt,             ¡¦£,
BT_Adds_Non_Recourse_NB_Equip_Cost_Amt,       ¡¦_Equip¢,
BT_Adds_Non_Recourse_NB_RE_Cost_Amt,          ¡¦_RE¢,
BT_Adds_Non_Recourse_EB_Recv_Amt,             ¡§£,
BT_Adds_Non_Recourse_EB_Equip_Cost_Amt, ]     ¡§_Equip¢, ]

I am substituting the phrases with individual characters in the range
161 onwards, not with blanks. The uniqueness is maintained.

Pete
 
