Adding Up Percents Across Date Ranges

jehugaleahsa · Sep 28, 2009

Our users are allowed to split up a bill and assign percents to other
customers.

Our users can specify the start and end dates (inclusive) that a
percentage split is active. On any given day, the percents must add up
to 100%.

Here is an example of a valid percentage split, assuming it's
February, 2009:

1st to the 2nd - 50%
1st to the 15th - 50%
3rd to the 15th - 50%
16th to the 30th - 100%

Optionally, our users can leave the end date blank, which means never
ending.

I tried to solve this problem by breaking it into date ranges
(windows): 1 - 2, 3 - 15, 16 - 30. However, I have not been having
much luck in finding a correct algorithm.

I feel like this should be an easy task, but it is turning into a
nightmare. Any help or suggestions would be appreciated.

Thanks,
Travis

Peter Duniho · Sep 28, 2009

Our users are allowed to split up a bill and assign percents to other
customers.

Our users can specify the start and end dates (inclusive) that a
percentage split is active. On any given day, the percents must add up
to 100%.

Here is an example of a valid percentage split, assuming it's
February, 2009:

1st to the 2nd - 50%
1st to the 15th - 50%
3rd to the 15th - 50%
16th to the 30th - 100%

Optionally, our users can leave the end date blank, which means never
ending.

I tried to solve this problem by breaking it into date ranges
(windows): 1 - 2, 3 - 15, 16 - 30. However, I have not been having
much luck in finding a correct algorithm.

If you only have to deal with a single month at a time, and you only ever
have a relatively small number of ranges to deal with, I think that
"rossum"'s suggestion may well be the best approach. Even though it's
essentially a brute-force algorithm, IMHO brute-force can in fact
sometimes be the most elegant approach, because of its simplicity.

An alternative would in fact be to analyze the date ranges themselves, as
you seem to have tried. You can enumerate the ranges, creating new ranges
as you go based on the existing ones and any non-identity overlap. As you
find overlap, maintain a list of all the percentages that apply to that
overlap range. As you do this maintenance, check the total for a given
range and see if it exceeds 100%.

In very vague pseudo-code:

for each rangeIn in the input
{
for each rangeCur in the output
{
if rangeIn overlaps rangeOut non-identically
{
replace rangeOut in the output with
up to three new ranges representing the
overlap and the parts of the originals not
included in the overlap, preserving the original
ranges percentage total for the non-overlap part(s)
and adding the two ranges percentage total for the
overlap part
}
else if rangeIn is identical to rangeOut
{
add the rangeIn percentage to the rangeOut's percentage,
storing it back into rangeOut
}
else
{
add rangeIn to the output
}
}
}

At each point where you sum the percentages, you can check the result to
make sure it's <= 100 and report an error immediately. Alternatively, you
could just go do all the processing, and then report all the ranges that
exceed 100 at the end. If you maintain the output range collection in
sorted order (which is not expensive, assuming the input is sorted too),
then you can minimize the cost of enumerating the output range collection,
by terminating your enumeration once the "rangeCur" start date is after
the "rangeIn" end date. If the collection of ranges could be especially
large (thousands or tens of thousands, for example), it could even be
beneficial to maintain it as an indexed list you can binary search.

This latter approach is, as you can see, much more complicated than just
doing a brute-force approach. I would think that the input data should be
small enough that the brute-force approach should work fine. The only
slightly complicating factor is dealing with the open-ended "blank date"
range, but even there I think that should be just a matter of procesing
all the non-open-ended ranges (including the values for any open-ended
ranges that would be included), and then checking any open-ended ranges
together at the end.

But if you are dealing with a very large number of ranges, the algorithm I
describe may allow you to find the same results with fewer computations.
Note that either algorithm is essentially O(m * n), but the actual "m" and
"n" can be very different depending on your input. I suspect that in your
case, the values of "m" and "n" are small enough that a complex
optimization isn't warranted, and doing the brute-force approach is fast
enough (in fact, given that there is additional maintenance overhead for
the theoretically more efficient approach, the brute-force should in fact
outperform the theoretically more efficient approach for input that's
small enough).

I feel like this should be an easy task, but it is turning into a
nightmare. Any help or suggestions would be appreciated.

I feel your pain. I had a similar range-based process I had to deal with,
and while conceptualizing the basic task was easy, there were a lot of
little details to deal with. For what it's worth, I did wind up with a
basic "range management" class to do the heavy lifting for me.
Unfortunately, my scenario wasn't quite the same as yours; I needed to
perform different operations on the ranges than you are, and my input data
was plain integers, not dates. But if you are having trouble and you
think that code example would help more than it would hurt (after all,
looking at a similar but still fundamentally different example could just
make things more confusing), I'm happy to dig it up and post it.

Pete

Ben Voigt [C++ MVP] · Sep 28, 2009

Our users are allowed to split up a bill and assign percents to other
customers.

Our users can specify the start and end dates (inclusive) that a
percentage split is active. On any given day, the percents must add up
to 100%.

Here is an example of a valid percentage split, assuming it's
February, 2009:

1st to the 2nd - 50%
1st to the 15th - 50%
3rd to the 15th - 50%
16th to the 30th - 100%

How's that valid? February NEVER has 30 days.

Optionally, our users can leave the end date blank, which means never
ending.

I tried to solve this problem by breaking it into date ranges
(windows): 1 - 2, 3 - 15, 16 - 30. However, I have not been having
much luck in finding a correct algorithm.

Turn each line item into two events: before date d1 customer x starts paying
y%, after date d2 customer x stops paying y%

Adjust "after d2" into "before d2+1"

Then your example becomes:

(1, C1, +50)
(3, C1, -50)
(1, C2, +50)
(16, C2, -50)
(3, C3, +50)
(16, C3, -50)
(16, C4, +100)
(31, C4, -100)

Now sort by d ascending:

(1, C1, +50)
(1, C2, +50)
(3, C1, -50)
(3, C3, +50)
(16, C2, -50)
(16, C3, -50)
(16, C4, +100)
(31, C4, -100)

Just iterate through finding the running total. Whenever d changes, the
total must be 100%.

You don't actually have to store the customer number, every customer will
return to zero and never go negative by construction. I just included it so
you can see where things rearranged to.

jehugaleahsa · Sep 29, 2009

How's that valid? February NEVER has 30 days.

Turn each line item into two events: before date d1 customer x starts paying
y%, after date d2 customer x stops paying y%

Adjust "after d2" into "before d2+1"

Then your example becomes:

(1, C1, +50)
(3, C1, -50)
(1, C2, +50)
(16, C2, -50)
(3, C3, +50)
(16, C3, -50)
(16, C4, +100)
(31, C4, -100)

Now sort by d ascending:

(1, C1, +50)
(1, C2, +50)
(3, C1, -50)
(3, C3, +50)
(16, C2, -50)
(16, C3, -50)
(16, C4, +100)
(31, C4, -100)

Just iterate through finding the running total. Whenever d changes, the
total must be 100%.

You don't actually have to store the customer number, every customer will
return to zero and never go negative by construction. I just included it so
you can see where things rearranged to.

Ah. Now that's simple and efficient. February can have 30 days in make-
believe land!

jehugaleahsa · Sep 29, 2009

How's that valid? February NEVER has 30 days.

Turn each line item into two events: before date d1 customer x starts paying
y%, after date d2 customer x stops paying y%

Adjust "after d2" into "before d2+1"

Then your example becomes:

(1, C1, +50)
(3, C1, -50)
(1, C2, +50)
(16, C2, -50)
(3, C3, +50)
(16, C3, -50)
(16, C4, +100)
(31, C4, -100)

Now sort by d ascending:

(1, C1, +50)
(1, C2, +50)
(3, C1, -50)
(3, C3, +50)
(16, C2, -50)
(16, C3, -50)
(16, C4, +100)
(31, C4, -100)

Just iterate through finding the running total. Whenever d changes, the
total must be 100%.

You don't actually have to store the customer number, every customer will
return to zero and never go negative by construction. I just included it so
you can see where things rearranged to.

I spent some time to playing around with this solution. It seems that
on the last day (the day when everything expires) you end up with 0%.
I could deal with that. Then I realized that I also had breaks in my
ranges. For instance, more percents could be created mid-March. For
the time between two periods, then, I suppose the only valid percent
is 0%.

So, if I make sure that my users never enter 0%, I can simply ensure
that the sum within any range is either 100% or 0%.

This is a really elegant solution.

jehugaleahsa · Sep 29, 2009

How's that valid? February NEVER has 30 days.

Turn each line item into two events: before date d1 customer x starts paying
y%, after date d2 customer x stops paying y%

Adjust "after d2" into "before d2+1"

Then your example becomes:

(1, C1, +50)
(3, C1, -50)
(1, C2, +50)
(16, C2, -50)
(3, C3, +50)
(16, C3, -50)
(16, C4, +100)
(31, C4, -100)

Now sort by d ascending:

(1, C1, +50)
(1, C2, +50)
(3, C1, -50)
(3, C3, +50)
(16, C2, -50)
(16, C3, -50)
(16, C4, +100)
(31, C4, -100)

Just iterate through finding the running total. Whenever d changes, the
total must be 100%.

You don't actually have to store the customer number, every customer will
return to zero and never go negative by construction. I just included it so
you can see where things rearranged to.

Before I posted this question, I created a solution that would find
overlapping ranges, breaking them into small ranges.

Your implementation essentially achieves the same thing, but without
the overhead. My approach required me to

1) Iterate through each range. If I found a range outside its bounds,
I broke the range into smaller pieces (at most 3).
2) Sorted by start and end date.
3) I then grouped by each range.
4) Then I summed up all the percents in each group.

As you can probably imagine, this required quite a lot more overhead
and was more difficult to implement.

Peter Duniho · Sep 29, 2009

I spent some time to playing around with this solution. It seems that
on the last day (the day when everything expires) you end up with 0%.
I could deal with that. Then I realized that I also had breaks in my
ranges. For instance, more percents could be created mid-March. For
the time between two periods, then, I suppose the only valid percent
is 0%.

So, if I make sure that my users never enter 0%, I can simply ensure
that the sum within any range is either 100% or 0%.

This is a really elegant solution.

I like it a lot too. I wish I'd suggested it.

(Though, in my defense, the fact that I got distracted by a similar, but
fundamentally different, problem that I'd solved before simply reinforces
my comment about how posting that solution could confuse the issue rather
than helping. I don't know whether I'd have come up with the same
solution as fast as Ben did, but it sure didn't help that my brain was
being tempted toward a different approach

).

Pete

Ben Voigt [C++ MVP] · Sep 29, 2009

Just iterate through finding the running total. Whenever d changes, the

Before I posted this question, I created a solution that would find
overlapping ranges, breaking them into small ranges.

Your implementation essentially achieves the same thing, but without
the overhead. My approach required me to

1) Iterate through each range. If I found a range outside its bounds,
I broke the range into smaller pieces (at most 3).
2) Sorted by start and end date.
3) I then grouped by each range.
4) Then I summed up all the percents in each group.

As you can probably imagine, this required quite a lot more overhead
and was more difficult to implement.

When you're doing integrals of functions specified piecewise (e.g.
convolution of a triangular pulse by a symmetric rectangular pulse, with
each having different duration/support), the amount of pain associated with
having more regions than necessary is a pretty strong incentive to figure
out a straightforward and optimal method.

Especially when you expect to have to do said convolution on an exam within
a strictly limited amount of time.

Adding Up Percents Across Date Ranges

jehugaleahsa

Peter Duniho

Ben Voigt [C++ MVP]

jehugaleahsa

jehugaleahsa

jehugaleahsa

Peter Duniho

Ben Voigt [C++ MVP]