Requesting Suggestions on Managing Large Data Set

Marston

I've been trying several different approaches to managing a large set
of data that I have, but I'm not sure that my approach is best.

Here's the situation:

I have a series of flat files that contain data over a 12-month
period. The data is so extensive that it spans several files; a
typical year requires 3-4 files because of the number of rows.

The general format of this information is as follows:

P1 P2 P3 P4 P5 P6 MTH D1 D2

where P1-P6 are various parameters, MTH is a column that contains a
month number, and D1 and D2 are data values.

I need to transform this data so that it is sorted by each of the
parameters and I end up with:

P1 P2 P3 P4 P5 P6 D1(1-12) D2(1-12)


where P1-P6 are sorted in ascending order by each item, and D1 and D2
are two data sets with 12 columns of data each.
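
For example (with made-up values), two input rows that share the same
six parameters:

A B C D E F 3 100 7
A B C D E F 5 250 9

would collapse into a single output row for that parameter set, with
100 in the March column and 250 in the May column of D1, 7 in March
and 9 in May for D2, and zeros everywhere else.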

I'd hoped to do most of this in arrays, but I bet that looping through
300,000 records is a bit much. I've tried to pre-process each of the
datasets by worksheet with Pivot Tables, but that isn't working either
because of the way I'm trying to sort the info. I also have some data
(roughly 10%-20% per file) that I don't need: D1 and D2 are financial
and personnel numbers, and some of the D1 values are Balance Sheet
related while others are Revenue/Expense related. I only need the latter.

Here's my approach so far:

1) Add a column to my data that flags whether a row is balance sheet
data. A row is balance sheet data if either of the following cases is
true: P1 = 0 or P2 < 20000. I set the flag to 0 if true and 1 if not,
then freeze it with a copy / paste special values. I sort the data,
filter it for 0s, delete the rows with a 0 value, then remove the filter.
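
Concretely, assuming P1 is in column A and P2 in column B with data
starting on row 2, the flag formula is just:

=IF(OR(A2=0, B2<20000), 0, 1)

filled down the helper column before the paste special step.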

2) Next I create a new sort value that is, in essence, a
concatenation of strings: P1&P2&P3...P6. Then I delete the columns
with P1-P6 in them. Each P has a fixed length in string form, for a
total of 18 characters.
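
Assuming P1-P6 sit in columns A-F, the key formula is along the lines of

=A2 & B2 & C2 & D2 & E2 & F2

or, if the raw values aren't already fixed width, padded with TEXT so
the key keeps its 18 characters and sorts cleanly, e.g.
=TEXT(A2,"000") & TEXT(B2,"00000") & ... (the format strings are
illustrative; they'd match each parameter's real width).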

3) Here is where I'd like to do a Pivot, but unfortunately I have too
many unique values. So what I did was a worksheet sort based on the
sort value. Then I add 24 columns (12 months for D1 and 12 months for
D2), each with a header that runs from Jan-Dec. I then set up a
formula that says something like the following:

If Value(MTH) = month(Header) then D1 else 0 (first 12), and
If Value(MTH) = month(Header) then D2 else 0 (second 12).

Once this is done, I do another copy / paste special values and delete
the MTH column from my data.
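
To make that concrete: assuming MTH is in column B, D1 in C, D2 in D,
and the 24 headers in row 1 are the month numbers 1-12 twice (one
possible layout, not necessarily the actual one), the first D1 cell
would be

=IF($B2=E$1, $C2, 0)

filled right across the 12 D1 columns, with the same formula pointed
at $D2 for the 12 D2 columns. With Jan-Dec text headers instead of
numbers, MONTH(E$1 & " 1") converts the header to a comparable month
number.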

4) Now I need to combine these by the unique sort value. I place a
value of 1 in a new column and then in each subsequent row add:

If Sort Value (prior row) = Sort Value (this row) then 0, else 1

This creates a Unique Sort Value ID. I then copy / paste special
values this column and sort on it in descending order.
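
With the sort value in column A starting on row 2 (row 2 gets the
hard-coded 1), the flag from row 3 down is simply:

=IF(A3=A2, 0, 1)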

Next I add 24 more columns, again with headers from Jan-Dec twice.
In each of these, I add the following logic:

If Unique Sort Value ID = 0 then 0, else
SUMIF(sort values, this sort value, Jan values)... through the Dec
values, twice.

The idea here is that I'll end up with 12 months of D1 and D2 values
for each unique sort value. The problem is that my dataset, while
lessened, is still over 20,000 rows, and the SUMIF is having a bit of
difficulty running through it, even when I restrict the formula to
rows where I know the Unique Sort Value ID = 1.
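
For reference, with the sort value in column A, the Unique Sort Value
ID in column B, and the January D1 values in column E (assumed
positions), the first of the 24 new columns looks like:

=IF($B2=0, 0, SUMIF($A:$A, $A2, E:E))

with each of the other 23 columns pointing its sum range at the
matching month column. Each of these SUMIFs rescans the entire key
column, which is why 24 of them across 20,000+ rows crawls.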

This would effectively give me the same results as a pivot table
(without all the subtotal baloney that is always in the way), and I
could read the results into an array of roughly 100,000 x 25 that is
already in the format I need, rather than reading in a 300,000 x 10
array and trying to figure out how to transform it into the
100,000 x 25 array.

I've also done a lot of turning worksheet calculation off and on,
particularly during the unfiltering, so that Excel doesn't try to
recalculate while it's unfiltering (which can really slow things down
and is unnecessary).
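
For what it's worth, the VBA version of that toggle is:

Application.Calculation = xlCalculationManual
Application.ScreenUpdating = False
' ... unfilter / sort / paste special work ...
Application.Calculation = xlCalculationAutomatic
Application.ScreenUpdating = True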

Suggestions?

Thanks in advance.
 
Tom Ogilvy

Instead of consolidating all your data and then trying to rearrange
it, do all the rearranging you can to each subset of the data as you
prepare to bring it into your common array, so that when it comes in,
at least for that subset, it already has just your index column and 24
columns of data, with the unnecessary data eliminated. If this only
makes sense as each of the files is opened, then do it at that time.
This should overcome the limitations you are bumping up against.
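
A rough sketch of that idea in VBA, assuming P1-P6, MTH, D1, D2 sit in
columns A:I of each file's sheet with headers in row 1 (the column
layout, file handling, and output step are all illustrative; the
balance sheet test and the concatenated key come from the original
post). A Scripting.Dictionary keyed on the parameters accumulates the
24 monthly values as each subset is read, so only the index column and
24 data columns ever reach the consolidated output:

Sub ConsolidateSheet(ws As Worksheet, dict As Object)
    ' Fold one file's rows into a shared dictionary:
    ' key = P1..P6 concatenated, item = 24 monthly totals (D1, then D2)
    Dim arr As Variant, v As Variant
    Dim r As Long, mth As Long, lastRow As Long
    Dim key As String
    Dim blank(0 To 23) As Double        ' all-zero starter for new keys

    lastRow = ws.Cells(ws.Rows.Count, 1).End(xlUp).Row
    arr = ws.Range("A2:I" & lastRow).Value      ' P1-P6, MTH, D1, D2

    For r = 1 To UBound(arr, 1)
        ' Drop balance sheet rows (P1 = 0 or P2 < 20000) up front
        If arr(r, 1) <> 0 And arr(r, 2) >= 20000 Then
            ' Delimited key; the "|" keeps fields from running together
            key = arr(r, 1) & "|" & arr(r, 2) & "|" & arr(r, 3) & "|" & _
                  arr(r, 4) & "|" & arr(r, 5) & "|" & arr(r, 6)
            If Not dict.Exists(key) Then dict.Add key, blank
            mth = arr(r, 7)
            v = dict(key)                         ' arrays come back by value,
            v(mth - 1) = v(mth - 1) + arr(r, 8)   ' so update the copy: D1 in slots 0-11
            v(mth + 11) = v(mth + 11) + arr(r, 9) ' D2 in slots 12-23
            dict(key) = v                         ' and write it back
        End If
    Next r
End Sub

Sub RunConsolidation()
    Dim dict As Object, k As Variant, v As Variant
    Dim out As Worksheet, r As Long, c As Long

    Set dict = CreateObject("Scripting.Dictionary")

    ' Call ConsolidateSheet once per source file as you open it, e.g.
    ' ConsolidateSheet Workbooks("File1.xls").Worksheets(1), dict
    ConsolidateSheet ActiveSheet, dict

    ' One row per unique parameter set: key plus 24 monthly columns
    Set out = Worksheets.Add
    r = 1
    For Each k In dict.Keys
        out.Cells(r, 1).Value = k
        v = dict(k)
        For c = 0 To 23
            out.Cells(r, c + 2).Value = v(c)
        Next c
        r = r + 1
    Next k
End Sub

Sorting the output sheet on column A then gives the ascending
parameter order. The cell-by-cell write is for clarity; for 100,000
rows you'd build a 2-D Variant array and assign it to a range in one
shot.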
 
