Optimize an IEnumerable


timor.super

Hi group,

imagine I want to count how many times each of several words appears in a string ...

Look at my code:

List<string> thingsToFind = new List<string>(new string[] { "error", "values" });
MyClass myClass = new MyClass(thingsToFind, data);
foreach (KeyValuePair<int, string> aValue in myClass.GetValues())
{
    // do something
}

.....

class MyClass
{
    private readonly List<string> _list;
    private readonly string _data;

    public MyClass(List<string> listOfThingToSeach, string data)
    {
        _list = listOfThingToSeach;
        _data = data;
    }

    public IEnumerable<KeyValuePair<int, string>> GetValues()
    {
        foreach (string someThingToFind in _list)
        {
            string savData = _data;
            int cpt = 0;
            int pos = savData.IndexOf(someThingToFind,
                StringComparison.InvariantCultureIgnoreCase);
            while (savData.Length > 0 && pos >= 0)
            {
                cpt++;
                savData = savData.Substring(pos + someThingToFind.Length);
                pos = savData.IndexOf(someThingToFind,
                    StringComparison.InvariantCultureIgnoreCase);
            }
            yield return new KeyValuePair<int, string>(cpt, someThingToFind);
        }
    }
}


The GetValues function uses yield return to construct the IEnumerable.

If I do:

List<string> thingsToFind = new List<string>(new string[] { "error", "values" });
MyClass myClass = new MyClass(thingsToFind, data);
foreach (KeyValuePair<int, string> aValue in myClass.GetValues())
{
    // do something
}
foreach (KeyValuePair<int, string> aValue in myClass.GetValues())
{
    // 2 TIMES
}

then I will parse the string data two times, when I only need to parse it
once ...

so I would like to optimize my GetValues function by constructing a
List<> (for example, but maybe something else) during the parsing, and,
if the function is called a second time, returning the list instead of
doing the parsing again ...
Something like this:

public IEnumerable<KeyValuePair<int, string>> GetValues()
{
    if my list is nothing
    {
        instantiate my list
        do the parsing and create a new KeyValuePair
        add the KeyValuePair to my list
        yield return the KeyValuePair
    }
    else
    {
        return my list
    }
}

but ... of course, I can only yield return, not return. Is there a way
to do what I want?


Thanks for your answer

S.
 

Nicholas Paldino [.NET/C# MVP]

S.

Why not do the parsing in the constructor and then save the list? Then,
your enumerator can just issue a yield return for each element in the parsed
list.

If you want the lazy-load semantics, then your method can just check
whether a member is set; if it is, use it, and otherwise construct the list
of return values and assign it to the member variable. Then just iterate
through the list and yield return each item.
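
For example, a rough sketch of the lazy-load version (inside your MyClass;
_cached is just an illustrative field name, and CountOccurrences stands for
the counting loop already in your GetValues, factored into a helper):

private List<KeyValuePair<int, string>> _cached;    // filled on the first call

public IEnumerable<KeyValuePair<int, string>> GetValues()
{
    if (_cached == null)
    {
        // parse once and remember the results
        _cached = new List<KeyValuePair<int, string>>();
        foreach (string word in _list)
            _cached.Add(new KeyValuePair<int, string>(CountOccurrences(word), word));
    }

    // the rest of the first call, and every later call, just replays the cache
    foreach (KeyValuePair<int, string> pair in _cached)
        yield return pair;
}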
 

Christof Nordiek

timor.super said:
[...]
but ... of course, I can only yield return, not return. Is there a way
to do what I want?

You can do it like:

public IEnumerable<KeyValuePair<int, string>> GetValues()
{
    if my list is nothing
        initialize and fill the list

    foreach (item in the list)
        yield return item;
}


or

public IEnumerable<KeyValuePair<int, string>> GetValues()
{
    if my list is nothing
    {
        instantiate the list
        foreach (thing to search for)
        {
            get the item
            add the item to the list
            yield return item
        }
    }
    else
    {
        foreach (item in the list)
            yield return item;
    }
}

This doubles the yield return statement (in the code), but on the first run
it will start faster.
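
Written out in C#, the second variant could look roughly like this (just a
sketch: _cache is an assumed member field, and CountOccurrences stands for
the counting loop from the original GetValues, factored into a helper):

public IEnumerable<KeyValuePair<int, string>> GetValues()
{
    if (_cache == null)
    {
        _cache = new List<KeyValuePair<int, string>>();
        foreach (string word in _list)
        {
            KeyValuePair<int, string> item =
                new KeyValuePair<int, string>(CountOccurrences(word), word);
            _cache.Add(item);       // remember the result ...
            yield return item;      // ... while the caller already consumes it
        }
    }
    else
    {
        foreach (KeyValuePair<int, string> item in _cache)
            yield return item;
    }
}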

Christof
 

Nicholas Paldino [.NET/C# MVP]

It will only start slightly faster, as the only overhead you are
removing by returning each item as you generate it is the overhead of
iterating through the list a second time, and the lookup of each item
from the list, which, in most cases, is probably going to be negligible.

It might even be more costly, as I think the enumerator that is
generated by the compiler is going to be more complex in this case, but
tests would have to be run to be sure.
 

Peter Duniho

Nicholas said:
It will only start slightly faster, as the only overhead you are
removing by returning each item as you generate it is the overhead of
iterating through the list a second time, and the lookup of each item
from the list, which, in most cases, is probably going to be negligible.

It really depends on what you mean by "faster".

I agree that the total cost of the complete enumeration may be basically
the same either way.

However, the initial construction of the parsed data could be very
expensive. By caching it as it's being enumerated the first time, this
cost is spread over the entire enumeration. If the client code is doing
something relatively expensive with each iteration (there's no way to
know from the sample code here whether that's the case), then the client
code may find the parsing enumerator more responsive implemented that
way. Otherwise, while the subsequent calls to the enumerator might be
extremely quick, that first one could be VERY slow.

Why would the code care? Any place where responsiveness is an issue.
For example, if the iteration is somehow tied to the user-interface,
making the user sit and wait for that first one to complete without any
feedback isn't all that great of an idea. Or if the code is doing some
sort of i/o as part of the iteration of the enumerator, at the very
least intermingling the enumeration with the i/o allows for efficient
use of the CPU and i/o resources. And in some situations, there may be
a responsiveness requirement for the i/o itself (for example, network
i/o where the other end will drop the connection after some timeout).

IMHO, all of the proposals from you and Christof have value, including
the "build the list as you go" one from Christof. Which is most
appropriate depends on the specific situation.

Finally, one warning about the "build the list as you go". One design
disadvantage I see with this technique is that there's no guarantee that
the client will do the entire enumeration. So it would be important not
just to check for existence of the cached list, but to verify that it's
complete. This would probably require inclusion of some sort of "where
did the enumeration leave off last" state in the enumerator that can be
referenced the next time the enumeration is done.
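
One simple way to handle that (again a sketch, reusing the illustrative
_cache and CountOccurrences names from earlier in the thread, plus an
assumed _cacheComplete flag) is to treat the cache as usable only once the
generating loop has actually run to the end. A partially built cache is
simply discarded and the parse restarted, which is cruder than remembering
where the enumeration left off, but easy to get right:

public IEnumerable<KeyValuePair<int, string>> GetValues()
{
    if (_cacheComplete)
    {
        foreach (KeyValuePair<int, string> item in _cache)
            yield return item;
        yield break;
    }

    _cache = new List<KeyValuePair<int, string>>();
    foreach (string word in _list)
    {
        KeyValuePair<int, string> item =
            new KeyValuePair<int, string>(CountOccurrences(word), word);
        _cache.Add(item);
        yield return item;      // if the caller stops here, _cacheComplete stays false
    }

    _cacheComplete = true;      // only reached when the enumeration ran to the end
}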

Pete
 

Peter Duniho

You've gotten a couple of really useful replies, IMHO, regarding
optimizing specifically via a cache, as you've requested. Though,
personally it seems to me that unless you have a very specific need to
reuse the enumeration in completely different places in the code, it
might be better to just have the _client_ of the enumerator cache the
results as needed.

But even so, I think the basic question has been answered pretty well.
So here are a couple of comments that digress a bit, but which IMHO
should still be useful...

First, the specific code you posted:
[...]
public IEnumerable<KeyValuePair<int, string>> GetValues()
{
    foreach (string someThingToFind in _list)
    {
        string savData = _data;
        int cpt = 0;
        int pos = savData.IndexOf(someThingToFind,
            StringComparison.InvariantCultureIgnoreCase);
        while (savData.Length > 0 && pos >= 0)
        {
            cpt++;
            savData = savData.Substring(pos + someThingToFind.Length);
            pos = savData.IndexOf(someThingToFind,
                StringComparison.InvariantCultureIgnoreCase);
        }
        yield return new KeyValuePair<int, string>(cpt, someThingToFind);
    }
}

I may be missing something, but the above code appears to me to only
return an iteration instance once for each word in your search list.
This may be your intended semantics, but if so it's my opinion that
returning a KeyValuePair isn't appropriate.

Actually, I don't think returning a KeyValuePair is appropriate in any
case. The reason why is that semantically, a KeyValuePair implies a key
that is unique and a value that is associated with that unique key. In
this case, your "key" isn't unique at all as near as I can tell; it's
simply the number of times a specific word showed up in your string.

I understand the temptation to use the KeyValuePair since it _seems_ to
encapsulate the very information you're returning (a pair of related
values). But IMHO it would be much better for you to just define your
own generic struct that does basically the same thing.

Alternatively, at the very least it's my opinion that the key in the
KeyValuePair should be the string and the count be the value. It seems
much more likely to me that your string is the unique element, rather
than the count.

Why would you care? Well, IMHO the biggest risk is that a KeyValuePair
could be used to construct a Dictionary or similar object directly, and
this would fail as soon as the key -- that is, the integer count --
showed up more than once.

I think there are other semantic reasons to fix this part of the code,
but the fact that there's a genuine potential risk for future uses of
the code is probably the only reason one would need to justify fixing it.
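
For example, a tiny purpose-built type avoids the misleading key/value
semantics entirely (WordCount is a made-up name, and a non-generic struct
is enough for this particular case):

struct WordCount
{
    public readonly string Word;    // the search word
    public readonly int Count;      // how many times it was found

    public WordCount(string word, int count)
    {
        Word = word;
        Count = count;
    }
}

// GetValues() would then be declared as IEnumerable<WordCount> and do
// yield return new WordCount(someThingToFind, cpt);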


My second comment has to do with the "optimization" aspect itself.
IMHO, caching the results is a "sort of" optimization. That is, yes it
has the potential for improving performance, but it doesn't really
change the efficiency of the algorithm.

I note that this algorithm searches the string once for each word in
your list of words. For a very small number of words, this might not be
too bad, but as the number of words goes up, it can dramatically inflate
the cost of the algorithm.

The idea of caching the results can mitigate that problem somewhat, but
given that the algorithm is basically O(N^2), if you can change the
algorithm to something more efficient, you could wind up getting a HUGE
performance increase even without caching.

It depends somewhat on the number of words, the length of the string,
and the number of times you'd enumerate the results. But assuming that
the number of words is more than the number of times you'd enumerate the
results, an O(N) enumeration executed multiple times will still be
faster than an O(N^2) enumeration executed once and cached for future use.

The exact break-even point would depend a little on the specific
performance and how many times you need to do the enumeration for a
given list of words. But it seems to me that in any situation where you
are actually concerned about performance, that means the number of
words, length of string, and/or number of enumerations is relatively
high, and these are the situations where the disparity between an O(N)
algorithm and an O(N^2) algorithm can be very significant.

Of course, if you cache the results _and_ fix the algorithm to be better
than O(N^2), you stand to gain even more.

So, what's an O(N) algorithm to accomplish the same thing? A common one
used for string searches such as yours would be a state machine
implementation, where the nodes of the state graph are connected
according to the letters in the search words.

You traverse the graph by extracting a single character at a time from
the string being searched and using that character to determine which
node to move to; if the character has no matching link to a subsequent
node, you return to the root of the graph. The graph will have
"terminal" nodes representing the end of a search word; if and when you
reach one of those, you know you've found a word.

The exact implementation depends on your needs. The basic
implementation suffices in many cases, but if it is possible for your
search words to overlap, that can be addressed as well.
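
To make that concrete, here is a rough sketch of the graph idea (TrieNode,
MultiSearch and CountAll are made-up names). It is a plain trie walk that
re-enters the graph at every input position rather than a full Aho-Corasick
automaton with failure links, and, unlike the Substring-skipping loop in
your original code, it counts overlapping matches:

using System.Collections.Generic;

class TrieNode
{
    public readonly Dictionary<char, TrieNode> Next = new Dictionary<char, TrieNode>();
    public string Word;                 // non-null marks the end of a search word
}

static class MultiSearch
{
    public static Dictionary<string, int> CountAll(string data, IEnumerable<string> words)
    {
        // build the state graph from the search words (lower-cased to mimic
        // the case-insensitive IndexOf in the original code)
        TrieNode root = new TrieNode();
        Dictionary<string, int> counts = new Dictionary<string, int>();
        foreach (string w in words)
        {
            counts[w] = 0;
            TrieNode node = root;
            foreach (char c in w.ToLowerInvariant())
            {
                TrieNode child;
                if (!node.Next.TryGetValue(c, out child))
                {
                    child = new TrieNode();
                    node.Next[c] = child;
                }
                node = child;
            }
            node.Word = w;
        }

        // one scan of the input; from each position, follow the links until a
        // character has no matching edge
        string lower = data.ToLowerInvariant();
        for (int i = 0; i < lower.Length; i++)
        {
            TrieNode node = root;
            for (int j = i; j < lower.Length; j++)
            {
                if (!node.Next.TryGetValue(lower[j], out node))
                    break;
                if (node.Word != null)
                    counts[node.Word]++;
            }
        }
        return counts;
    }
}

This is roughly O(length of the input times length of the longest word);
adding the failure links turns it into the true single pass described above.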

Using a state graph requires a small amount of initialization to take
place, and for relatively small tasks this initialization could cost
more than the overhead of a less-efficient algorithm. But in those
cases, usually you don't care about performance anyway. For larger
tasks where performance is actually a concern, the initialization is
relatively minor compared to the benefits of being able to go through a
string only once while still recovering every single word you're looking
for.

So, that's a long way of saying, if you are really serious about
optimizing the code, what you need is a different algorithm. You can
mitigate the problems of the O(N^2) algorithm you've got somewhat, but
in the end you've still got an O(N^2) algorithm. It will never scale
well, no matter what you do to it.

For what it's worth, there are other forms of search algorithms that are
similarly efficient. I just mention the state graph because IMHO it's
relatively easy to understand _and_ implement. You might want to look
at other search algorithms; pretty much anything designed specifically
for matching search terms to an input string will be better than the
algorithm you've got now, and it's entirely possible you could find
something that would suit your specific needs better than the
general-purpose state graph solution.

Pete
 
