Which is better and by how much?

agarwaldvk · Aug 19, 2004

Hi All

I am developing an analytical model involving reasonably big datasets.
The source or the primary data is spread over say 10 to 12 worksheets
with each worksheet containing data in around 200000 to 250000 cells
spread over 50 to 60 columns and around 4000 rows. I have just almost
finished organising the source data and am about to start the
development of the processing bit.

A fair amount of processing involves searching on large ranges and
summing (possibly large ranges) of data conditionally upon meeting
various sets of conditions - a large proportion of these conditions are
more than likely going to be on multiple parameters. In essence, I am
looking at building a fairly calculation intensive model.

I am looking for advice as to which way should I be going to make the
processing as quick as possible realizing full well that no matter
which option I eventually take, I, reasonably, am looking at a
processing time in the vicinity of a couple of minutes.

WOULD I BE BETTER OFF USING :-

SUM(...)'s, SUM(IF(...)'s)'s and SUMPRODUCT(...)'s WITH INDEX() &
MATCH() functions (both normally entered or array entered as required)

OR

LOOKUP functions,

given MY PREFERNCE FOR AVOIDING USING HELPER COLUMNS TO STORE
TRANSITIONAL VALUES?

Any help/suggestions/tips would be greatly valued.

Best regards and thanks in advance!

Deepak Agarwal

hgrove · Aug 19, 2004

agarwaldvk wrote...

I am developing an analytical model involving reasonably big
datasets. The source or the primary data is spread over say 10
to 12 worksheets with each worksheet containing data in
around 200000 to 250000 cells spread over 50 to 60 columns
and around 4000 rows. I have just almost finished organising
the source data and am about to start the development of the
processing bit.

A fair amount of processing involves searching on large ranges
and summing (possibly large ranges) of data conditionally upon
meeting various sets of conditions - a large proportion of these
conditions are more than likely going to be on multiple
parameters. In essence, I am looking at building a fairly
calculation intensive model.

Not particularly calculation-intensive. Summing and counting are fas
and simple, and as you describe it, this would be better suited to
database than a spreadsheet.

I am looking for advice as to which way should I be going to
make the processing as quick as possible realizing full well that
no matter which option I eventually take, I, reasonably, am
looking at a processing time in the vicinity of a couple of minutes.

...

While you may believe otherwise, you've not given sufficient detail
for anyone to give you an intelligent response other than on th
general approach. It's always best to do as little as possible, s
SUMPRODUCT on entire ranges is a BAD idea. Also, while you may b
biased against using ancillary cells to hold intermediate calculations
that's the SUREST way to speed up the sort of model you're building. I
you're going to allow simplistic esthetic prejudices to warp you
design, there's not much scope for giving you good advice.

For example, if your first column were some sort of identifier that'
repeated through several rows with other columns containing differen
identifiers within the broad category set by the first column, the
your data should be sorted in ascending order on each column o
identifiers, and you should use additional cells to hold the index o
address of the first and last cells holding a given category's data. I
you enter a category in a cell named Cat, these formulas would loo
like

First:
=LOOKUP(2,1/(INDEX(Data,0,1)<Cat),ROW(INDEX(Data,0,1)))+1

Last:
=LOOKUP(2,1/(INDEX(Data,0,1)=Cat),ROW(INDEX(Data,0,1))

Indeed, drop the '+1' from the end of the first formula, and you hav
the row offset, in which case the range of interest becomes

OFFSET(Data,First,0,Last-First,COLUMNS(Data))

Using such OFFSET-generated references is the ideal way to make a mode
such as yours as efficient as possible AS LONG AS the First and Las
formulas are entered in *CELLS* and so be subject to recalculation onl
when necessary

agarwaldvk · Aug 20, 2004

Harlan

Your advice is greatly appreciated.

I am learning new things everyday - thanks to you people! I am happ
that I can tap in to the skill set of you all.

The only reason I tend to avoid using transitional values is because
was advised that "generally", the more transitional values you have i
a spreadsheet, the more time it will take to calculate and recalculat
- since Excel works on a cell by cell basis. Obviously, this advice
whilst may not be wrong in its entirety, but may well be true in som
specific instances.

I will certainly take your viewpoints in to consideration and alter th
design of my model to make it as efficient as possible.

I also agree that I may not have provided as much of information as
idealy should have that would facilitate other people giving soun
advice.

Let me rethink the assignment and hopefully in the next couple of day
I shall be able to compile a more detailed requirement that may assis
in providing better technical advice.

Nonetheless, your advice is greatly appreciated and highly valued
always!

Best regards

Deepak Agarwa

Ron Rosenfeld · Aug 20, 2004

WOULD I BE BETTER OFF USING :-

SUM(...)'s, SUM(IF(...)'s)'s and SUMPRODUCT(...)'s WITH INDEX() &
MATCH() functions (both normally entered or array entered as required)

OR

LOOKUP functions,

given MY PREFERNCE FOR AVOIDING USING HELPER COLUMNS TO STORE
TRANSITIONAL VALUES?

Although I have little experience with them, some have written (not on this NG
but elsewhere) that the Excel Database functions may be quicker. e.g. DSUM and
the related functions.

It is something I would look into with that sort of problem.

--ron

Harlan Grove · Aug 21, 2004

agarwaldvk > said:
The only reason I tend to avoid using transitional values is because I
was advised that "generally", the more transitional values you have in
a spreadsheet, the more time it will take to calculate and recalculate
- since Excel works on a cell by cell basis. Obviously, this advice,
whilst may not be wrong in its entirety, but may well be true in some
specific instances.

If you were comparing

A1: =1+2
A2: =3+4
A3: =5+6
A4: =A1+A2+A3

against

A4: =SUM(1,2,3,4,5,6)

Then the latter would be faster and smaller.

However, if your data were in a range named Data which was sorted in
ascending order on the first column, and you were comparing

A1: foo
A2: =MATCH(A1,INDEX(Data,0,1),0)-1
A3: =COUNTIF(INDEX(Data,0,1),A1)
A4: =SUM(OFFSET(Data,A2,3,A3,1))

against

A1: foo
A4: =SUMPRODUCT(--(INDEX(Data,0,1)=A1),INDEX(Data,0,4))

For small ranges Data, the latter may be faster, but for small subsets of
large ranges, the former will often be much faster.

agarwaldvk · Aug 21, 2004

Harlan

Excuse my ignorance but could you please explain what does the the bi

"1/(INDEX(Data,0,1)<Cat)"

do in your suggested formulae (reproduced below) :-

First
=LOOKUP(2,1/(INDEX(Data,0,1)<Cat),ROW(INDEX(Data,0,1)))+1

and

Last
=LOOKUP(2,1/(INDEX(Data,0,1)=Cat),ROW(INDEX(Data,0,1)))

because that is exactly the course that I am inclining towards.

I went through the online help but I am none the wiser!

I shall also give due consideration to your later suggestions. I a
sure I will need every bit of help that I can get on this one.

Best regards

Deepak Agarwa

Harlan Grove · Aug 22, 2004

agarwaldvk > said:
Excuse my ignorance but could you please explain what does the the bit

I will this time. You need to learn how to test portions of formula by
highlighting them in the formula bar and pressing [F9] or entering them as
entire formulas (usually as array formulas) in other ranges.

"1/(INDEX(Data,0,1)<Cat)"

INDEX(Data,0,1) returns the first column of the range Data. If, as I stated
I was assuming, the range Data were sorted in ascending order on its first
column, then INDEX(Data,0,1)<Cat returns TRUE for all entries in the first
column of Data presumably lexically less than the entry in Cat and False
otherwise, so the entries in the first column of Data less than Cat would
produce TRUE results and the entries greater than or equal to Cat would
produce FALSE results. Using that array as the divisor of 1 converts the
TRUE and FALSE values to numeric 1s and 0s, respectively, so the final
result is 1s for entries in the first column of Data less than Cat and
#DIV/0! errors for the other entries.

do in your suggested formulae (reproduced below) :-

First
=LOOKUP(2,1/(INDEX(Data,0,1)<Cat),ROW(INDEX(Data,0,1)))+1

When given a number as first argument that's larger than any number in its
array second argument, LOOKUP returns the entry from its array third
argument corresponding to the last numeric entry in its array second
argument. So the formula above returns the row index in Data of the first
row equal to or greater than the value of Cat.

and

Last
=LOOKUP(2,1/(INDEX(Data,0,1)=Cat),ROW(INDEX(Data,0,1)))

Similar thing, but this time it returns the row index of the last entry in
the first column of Data equal to the value of Cat.

I went through the online help but I am none the wiser!

Would reading an automobile's owner's manual tell you how to drive? No, it
tells you how to change tires and oil and when to bring it in for regular
maintenance. Software manuals also address equally tangential details.

agarwaldvk · Aug 22, 2004

Harlan

Once again, your help is greatly appreciated. I promise, henceforth
shall make attempts to test the portions of the formula by doing wha
you have suggested.

And whilst I am on the subject of being able to test portions of th
formulas, I saw in one of the posts earlier (I think it was to do wit
COVAR), apparently there is a way to evaluate formulas by clicking o
"evaluate" button? How do we do this?

Best regards

Deepak Agarwa

Tushar Mehta · Aug 22, 2004

In XL2003 help, search for 'evaluate' (w/o the quotes). The first
suggested topic is 'Evaluate a nested formula one step at a time'

The same capability is available in XL2002, though not well documented.
Check the various options in the Formula Auditing toolbar.

For versions prior to 2002, Harlan has already told you how. See his
post that starts with "I will this time. You need to learn how to test
portions of formula..."

--
Regards,

Tushar Mehta
www.tushar-mehta.com
Excel, PowerPoint, and VBA add-ins, tutorials
Custom MS Office productivity solutions

agarwaldvk · Aug 22, 2004

Hi Tushar

Thanks for your help.! However, at this point, we, at work, still us
Excel 97. And yeah, I will cetainly use Harlan's suggestion for testin
portions of the formulas for Excel 97.

Hi Harlan

One last question on this thread. In your explanation of the suggete
formula (which is very highly appreciated) you have said (quote) :-

"Using that array as the divisor of 1 converts the
TRUE and FALSE values to numeric 1s and 0s, respectively, so the fina
result is 1s for entries in the first column of Data less than Cat an
#DIV/0! errors for the other entries."

From what I have learnt on this forum the unary operator "--" does jus
that - i.e. converts the logical true/false to numeric 1's/0's. So
could you please advise then if

"(--(INDEX(Data,0,1)=Cat))

would be equivalent to

"1/(INDEX(Data,0,1)=Cat)"

Or am I on the wrong track totally?

The reason I ask this is that I was just getting used to using th
unary operator (the 2 minus signs) and then came this "1/" operato
that actually threw me a bit. If this is another operator that I ca
use, that's ok - more things for me to learn!

Best regards

Deepak Agarwa

Aladin Akyurek · Aug 24, 2004

....

"Using that array as the divisor of 1 converts the
TRUE and FALSE values to numeric 1s and 0s, respectively, so the final
result is 1s for entries in the first column of Data less than Cat and
#DIV/0! errors for the other entries."

From what I have learnt on this forum the unary operator "--" does just
that - i.e. converts the logical true/false to numeric 1's/0's. So,
could you please advise then if

"(--(INDEX(Data,0,1)=Cat))

would be equivalent to

"1/(INDEX(Data,0,1)=Cat)"

(--(INDEX(Data,0,1)=Cat))

will evaluate to a calculated array of 1's and 0's like in

[1] {1,0,0,1,0}

while

1/(INDEX(Data,0,1)=Cat)

will evaluate to a calculated array of 1's and #DIV/0!'s like in

[2] {1,#DIV/0!,#DIV/0!,1,#DIV/0!}

Harlan is interested in [2], that is, in hits (indicated by 1's), not in [1]
for:

=LOOKUP(2,{1,0,0,1,0},{"a","B","a","x","b"}) ==> b

The last item of the 1st array is a numeric value (i.e., 0) and its pos is
5, which makes LOOKUP retrieve the 5th item from the 2nd array, which is b.

while

=LOOKUP(2,{1,#DIV/0!,#DIV/0!,1,#DIV/0!},{"a","B","a","x","b"}) ==> x

The last item of the 1st array is a numeric value (i.e., 1) and its pos is
4, which makes LOOKUP retrieve the 4th item from the 2nd array, which is x.

The setup exploits (a) the fact that

LOOKUP(n,Ref)
VLOOKUP(n,Ref,1,1)
INDEX(Ref,MATCH(n,Ref,1)

will all return the last numerical value from Ref when n is a number that
cannot occur in Ref, an example use of which is:

=LOOKUP(9.99999999999999E+307,A:A);

And (b) the fact that the lookup functions ignore error values.

agarwaldvk · Aug 25, 2004

Thanks Aladin!

You peopel are all absolute geniuses!!! What more can I say?

Best regards

Deepak Agarwa

Which is better and by how much?

agarwaldvk

hgrove

agarwaldvk

Ron Rosenfeld

Harlan Grove

agarwaldvk

Harlan Grove

agarwaldvk

Tushar Mehta

agarwaldvk

Aladin Akyurek

agarwaldvk