Confused by implementing an IEnumerable with iterator blocks

H

Harlan Messinger

I wanted to write a class to encapsulate an enumeration of all the files
in a directory. I could just use

Directory.GetFiles(directoryPath, "*.*",
SearchOptions.SearchOption.AllDirectories);

but when hundreds of files are involved and the directory is on a remote
server this causes a substantial delay, so I wanted to design an
enumeration that would browse through the root directory and its
subdirectories incrementally as iteration took place.

So I tried this, having read up only briefly on how "yield" works:

public class DirectoryFileList : IEnumerable<string>
{
public string Root { get; set; }
public DirectoryFileList(string root)
{
Root = root;
}
public IEnumerator<string> GetEnumerator()
{
DirectoryFileEnumerator enumerator = new
DirectoryFileEnumerator();
enumerator.Root = this.Root;
}
System.Collections.IEnumerator
System.Collections.IEnumerable.GetEnumerator()
{
throw new NotImplementedException(); // to implement
}
}

public class DirectoryFileEnumerator : IEnumerator<string>
{
public string Root { get; private set; }
public string Current
{
get
{
foreach (string file in Directory.GetFiles(Root))
{
yield return file;
}
foreach (string subdirectory in
Directory.GetDirectories(Root, "*.*", SearchOption.AllDirectories))
{
foreach (string file in
Directory.GetFiles(subdirectory))
{
yield return file;
}
}
yield break;
}
}
}

But this produces a compiler error, because "The body of
'ListAspScripts.DirectoryFileEnumerator.Current.get' cannot be an
iterator block because 'string' is not an iterator interface type". What
I'm finding out from elsewhere is that my iterating code has to be in a
method that returns an IEnumerable. From what I now read when I look
more closely, this code that returns one string after another has to be
in a method that returns IEnumerable<string>!

Well, that stinks. I can imagine wanting to retrieve all the files in a
directory in this manner in a variety of applications, and I would like
to encapsulate in a class, decoupled from any other class that might use
it! Now it looks as though I have to define a method in the class that
needs such an enumeration to return that enumeration. Am I missing
another approach?
 
A

Adam Clauss

Harlan said:
I wanted to write a class to encapsulate an enumeration of all the
files in a directory. I could just use

Directory.GetFiles(directoryPath, "*.*",
SearchOptions.SearchOption.AllDirectories);

but when hundreds of files are involved and the directory is on a
remote server this causes a substantial delay, so I wanted to design
an enumeration that would browse through the root directory and its
subdirectories incrementally as iteration took place.

So I tried this, having read up only briefly on how "yield" works:

public class DirectoryFileList : IEnumerable<string>
{
public string Root { get; set; }
public DirectoryFileList(string root)
{
Root = root;
}
public IEnumerator<string> GetEnumerator()
{
DirectoryFileEnumerator enumerator = new
DirectoryFileEnumerator();
enumerator.Root = this.Root;
}
System.Collections.IEnumerator
System.Collections.IEnumerable.GetEnumerator()
{
throw new NotImplementedException(); // to implement
}
}

public class DirectoryFileEnumerator : IEnumerator<string>
{
public string Root { get; private set; }
public string Current
{
get
{
foreach (string file in Directory.GetFiles(Root))
{
yield return file;
}
foreach (string subdirectory in
Directory.GetDirectories(Root, "*.*", SearchOption.AllDirectories))
{
foreach (string file in
Directory.GetFiles(subdirectory))
{
yield return file;
}
}
yield break;
}
}
}

But this produces a compiler error, because "The body of
'ListAspScripts.DirectoryFileEnumerator.Current.get' cannot be an
iterator block because 'string' is not an iterator interface type".
What I'm finding out from elsewhere is that my iterating code has to
be in a method that returns an IEnumerable. From what I now read when
I look more closely, this code that returns one string after another
has to be in a method that returns IEnumerable<string>!

Well, that stinks. I can imagine wanting to retrieve all the files in
a directory in this manner in a variety of applications, and I would
like to encapsulate in a class, decoupled from any other class that
might use it! Now it looks as though I have to define a method in the
class that needs such an enumeration to return that enumeration. Am I
missing another approach?
Basically move your implementation of
DirectoryFileEnumerator.Current.get{} into
DirectoryFileList.GetEnumerator(). Then your DirectoryFileList becomes
your decoupled object you want. The point of yield is you should not be
implementing a IEnumerator object.
So, then whenever you want the file list, you just do:
DirectoryFileList fileList = new DirectoryFileList(@"c:\temp");
foreach (string file in fileList)
{
// Do something with file
}
 
A

Adam Clauss

Harlan said:
I wanted to write a class to encapsulate an enumeration of all the
files in a directory. I could just use

Directory.GetFiles(directoryPath, "*.*",
SearchOptions.SearchOption.AllDirectories);

but when hundreds of files are involved and the directory is on a
remote server this causes a substantial delay, so I wanted to design
an enumeration that would browse through the root directory and its
subdirectories incrementally as iteration took place.

So I tried this, having read up only briefly on how "yield" works:

public class DirectoryFileList : IEnumerable<string>
{
public string Root { get; set; }
public DirectoryFileList(string root)
{
Root = root;
}
public IEnumerator<string> GetEnumerator()
{
DirectoryFileEnumerator enumerator = new
DirectoryFileEnumerator();
enumerator.Root = this.Root;
}
System.Collections.IEnumerator
System.Collections.IEnumerable.GetEnumerator()
{
throw new NotImplementedException(); // to implement
}
}

public class DirectoryFileEnumerator : IEnumerator<string>
{
public string Root { get; private set; }
public string Current
{
get
{
foreach (string file in Directory.GetFiles(Root))
{
yield return file;
}
foreach (string subdirectory in
Directory.GetDirectories(Root, "*.*", SearchOption.AllDirectories))
{
foreach (string file in
Directory.GetFiles(subdirectory))
{
yield return file;
}
}
yield break;
}
}
}

But this produces a compiler error, because "The body of
'ListAspScripts.DirectoryFileEnumerator.Current.get' cannot be an
iterator block because 'string' is not an iterator interface type".
What I'm finding out from elsewhere is that my iterating code has to
be in a method that returns an IEnumerable. From what I now read when
I look more closely, this code that returns one string after another
has to be in a method that returns IEnumerable<string>!

Well, that stinks. I can imagine wanting to retrieve all the files in
a directory in this manner in a variety of applications, and I would
like to encapsulate in a class, decoupled from any other class that
might use it! Now it looks as though I have to define a method in the
class that needs such an enumeration to return that enumeration. Am I
missing another approach?
Basically move your implementation of
DirectoryFileEnumerator.Current.get{} into
DirectoryFileList.GetEnumerator(). Then your DirectoryFileList becomes
your decoupled object you want. The point of yield is you should not be
implementing a IEnumerator object.
So, then whenever you want the file list, you just do:
DirectoryFileList fileList = new DirectoryFileList(@"c:\temp");
foreach (string file in fileList)
{
// Do something with file
}
 
H

Harlan Messinger

Harlan said:
I wanted to write a class to encapsulate an enumeration of all the files
in a directory. I could just use

Directory.GetFiles(directoryPath, "*.*",
SearchOptions.SearchOption.AllDirectories);

but when hundreds of files are involved and the directory is on a remote
server this causes a substantial delay, so I wanted to design an
enumeration that would browse through the root directory and its
subdirectories incrementally as iteration took place.
[snip]
Well, that stinks. I can imagine wanting to retrieve all the files in a
directory in this manner in a variety of applications, and I would like
to encapsulate in a class, decoupled from any other class that might use
it! Now it looks as though I have to define a method in the class that
needs such an enumeration to return that enumeration. Am I missing
another approach?

Never mind--I think. Despite what I'd read earlier, I guess the
IEnumerable's GetEnumerator method can contain the iterator block. My
code compiles with the explicit IEnumerator-implementing class removed
and this implementation of GetEnumerator:

public IEnumerator<string> GetEnumerator()
{
foreach (string file in Directory.GetFiles(Root))
{
yield return file;
}
foreach (string subdirectory in
Directory.GetDirectories(Root, "*.*", SearchOption.AllDirectories))
{
foreach (string file in Directory.GetFiles(subdirectory))
{
yield return file;
}
}
yield break;
}
 
P

Peter Duniho

Harlan said:
Harlan said:
I wanted to write a class to encapsulate an enumeration of all the
files in a directory. I could just use

Directory.GetFiles(directoryPath, "*.*",
SearchOptions.SearchOption.AllDirectories);

but when hundreds of files are involved and the directory is on a
remote server this causes a substantial delay, so I wanted to design
an enumeration that would browse through the root directory and its
subdirectories incrementally as iteration took place.
[...]
Never mind--I think. Despite what I'd read earlier, I guess the
IEnumerable's GetEnumerator method can contain the iterator block.

Not only can it, that's the whole point of an iterator block. It has to
appear in something that is returning an enumerator.

By the way, you can improve significantly on the "deferred processing"
aspect of your enumerator with some changes (right now, you still wind
up having .NET enumerate the entire depth and breadth of the directory
structure under the target one, to retrieve all the directory names).
The simplest fix would be to apply the same per-directory logic to the
retrieval of directory paths as for the files themselves, making your
enumerator recursive:

public IEnumerator<string> GetEnumerator()
{
return GetFilesForDirectory(Root);
}

private IEnumerator<string> GetFilesForDirectory(string strDirectory)
{
foreach (string strFile in Directory.GetFiles(strDirectory))
{
yield return strFile;
}

foreach (string strDirectory in Directory.GetDirectories(strDirectory))
{
foreach (string strFile in GetFilesForDirectory(strDirectory))
{
yield return strFile;
}
}
}

Some notes:

-- An iterator block doesn't have to be part of the implementation
for IEnumerable<T> per se. You can use one in _any_ situation you want
an IEnumerator<T>.

-- There's no need to write a "yield break" unless you specifically
need to stop enumeration in the middle of the method. Simply exiting
the method normally will implicitly terminate the iteration.

-- The basic idea of the above is a method that return all of the
files in the given directory, and then uses itself to return all of the
files in all of the subdirectories in the given directory. In this way,
at worst a given iteration of the files will involve a call to retrieve
all the directories in a specific directory, followed by a call to
retrieve all the files in the first directory in that specific directory.

It's _possible_ that you can achieve even lower latency per-iteration by
using p/invoke to access the Win32 file enumeration functions. In
particular, these functions only return one file at a time, and in some
cases they might be able to do so without processing an entire directory
at once.

Fortunately, someone else already asked this question in the past _and_
they were an MSDN subscriber, so Microsoft wrote the code for them. You
can find their reply here:
http://groups.google.com/group/microsoft.public.dotnet.framework/msg/ebac5afd85780bac

Note that this unmanaged code approach may actually have _lower_
throughput overall, because of the frequent transitions between managed
and unmanaged code, as well as whatever overhead might exist in the
unmanaged API by returning only one file at a time (which depends on how
it's implemented...I don't actually know whether this is a real issue or
not). Presumably, reducing the latency for a given iteration is more
important than total throughput.

And of course, if your processing of each file is relatively
time-consuming, you can have the best of both worlds by enumerating the
files in one thread while you process them in another, allowing the
processing thread to start work as soon as the very first filename has
been returned (e.g. use a queue of filenames with one thread producing
and the other consuming).

Hope that helps!
Pete
 
H

Harlan Messinger

Peter said:
Not only can it, that's the whole point of an iterator block. It has to
appear in something that is returning an enumerator.

Yes, and unfortunately, I was going by the examples I'd seen on some web
pages, all of which were methods that returned IEnumerable, without
having noticed a sentence on one of them that read, "yield statements
are only valid in a method/operator/property which returns one of
By the way, you can improve significantly on the "deferred processing"
aspect of your enumerator with some changes (right now, you still wind
up having .NET enumerate the entire depth and breadth of the directory
structure under the target one, to retrieve all the directory names).
The simplest fix would be to apply the same per-directory logic to the
retrieval of directory paths as for the files themselves, making your
enumerator recursive:

public IEnumerator<string> GetEnumerator()
{
return GetFilesForDirectory(Root);
}

private IEnumerator<string> GetFilesForDirectory(string strDirectory)
{
foreach (string strFile in Directory.GetFiles(strDirectory))
{
yield return strFile;
}

foreach (string strDirectory in Directory.GetDirectories(strDirectory))
{
foreach (string strFile in GetFilesForDirectory(strDirectory))
{
yield return strFile;
}
}
}

Some notes:
[snipping]

Hope that helps!

It does, very interesting, thank you for your time! Only problem is now
I've got the following code, which is similar to yours except that I've
added a Mask property for the file selection, and on the inner foreach
I'm getting the error "foreach statement cannot operate on variables of
type 'System.Collections.Generic.IEnumerator<string>' because
'System.Collections.Generic.IEnumerator<string>' does not contain a
public definition for 'GetEnumerator'", and I don't understand where
that comes from.

public class DirectoryFileList : IEnumerable<string>
{
/* ... */
public IEnumerator<string> GetEnumerator()
{
return GetFilesInDirectory(Root);
}

public IEnumerator<string> GetFilesInDirectory(string path)
{
foreach (string file in Directory.GetFiles(path, Mask))
{
yield return file;
}
foreach (string subdirectory in Directory.GetDirectories(path))
{
foreach (string file in GetFilesInDirectory(subdirectory))
{
yield return file;
}
}
}

System.Collections.IEnumerator
System.Collections.IEnumerable.GetEnumerator()
{
return this.GetEnumerator();
}
}
 
H

Harlan Messinger

Harlan said:
Peter said:
Not only can it, that's the whole point of an iterator block. It has
to appear in something that is returning an enumerator.

Yes, and unfortunately, I was going by the examples I'd seen on some web
pages, all of which were methods that returned IEnumerable, without
having noticed a sentence on one of them that read, "yield statements
are only valid in a method/operator/property which returns one of
By the way, you can improve significantly on the "deferred processing"
aspect of your enumerator with some changes (right now, you still wind
up having .NET enumerate the entire depth and breadth of the directory
structure under the target one, to retrieve all the directory names).
The simplest fix would be to apply the same per-directory logic to the
retrieval of directory paths as for the files themselves, making your
enumerator recursive:

public IEnumerator<string> GetEnumerator()
{
return GetFilesForDirectory(Root);
}

private IEnumerator<string> GetFilesForDirectory(string strDirectory)
{
foreach (string strFile in Directory.GetFiles(strDirectory))
{
yield return strFile;
}

foreach (string strDirectory in
Directory.GetDirectories(strDirectory))
{
foreach (string strFile in GetFilesForDirectory(strDirectory))
{
yield return strFile;
}
}
}

Some notes:
[snipping]

Hope that helps!

It does, very interesting, thank you for your time! Only problem is now
I've got the following code, which is similar to yours except that I've
added a Mask property for the file selection, and on the inner foreach
I'm getting the error "foreach statement cannot operate on variables of
type 'System.Collections.Generic.IEnumerator<string>' because
'System.Collections.Generic.IEnumerator<string>' does not contain a
public definition for 'GetEnumerator'", and I don't understand where
that comes from.

public class DirectoryFileList : IEnumerable<string>
{
/* ... */
public IEnumerator<string> GetEnumerator()
{
return GetFilesInDirectory(Root);
}

public IEnumerator<string> GetFilesInDirectory(string path)
{
foreach (string file in Directory.GetFiles(path, Mask))
{
yield return file;
}
foreach (string subdirectory in Directory.GetDirectories(path))
{
foreach (string file in GetFilesInDirectory(subdirectory))
{
yield return file;
}
}
}

The problem is that GetFilesInDirectory returns an IEnumerator, not an
IEnumerable, so of course the foreach doesn't work!

This code, where GetFilesInDirectory returns an IEnumerable and my
GetEnumerator method explicitly calls that IEnumerable's GetEnumerator
method, does compile:

public IEnumerator<string> GetEnumerator()
{
return GetFilesInDirectory(Root).GetEnumerator();
}

private IEnumerable<string> GetFilesInDirectory(string path)
{
foreach (string file in Directory.GetFiles(path, Mask))
{
yield return file;
}
foreach (string subdirectory in Directory.GetDirectories(path))
{
foreach (string file in GetFilesInDirectory(subdirectory))
{
yield return file;
}
}
}

Finally: what I'd really love is a method that, before doing all this,
could return up front the number of files at all levels under my initial
directory, so I can initialize a progress dialog, but that appears to be
too much to hope for, since as far as I can tell Directory and
DirectoryInfo don't offer a file count.
 
P

Peter Duniho

Harlan said:
[...]
The problem is that GetFilesInDirectory returns an IEnumerator, not an
IEnumerable, so of course the foreach doesn't work!

Yes, exactly. Sorry for the typo. I should have included my usual
"uncompiled, untested" disclaimer. :)
This code, where GetFilesInDirectory returns an IEnumerable and my
GetEnumerator method explicitly calls that IEnumerable's GetEnumerator
method, does compile:

public IEnumerator<string> GetEnumerator()
{
return GetFilesInDirectory(Root).GetEnumerator();
}

private IEnumerable<string> GetFilesInDirectory(string path)
{
foreach (string file in Directory.GetFiles(path, Mask))
{
yield return file;
}
foreach (string subdirectory in Directory.GetDirectories(path))
{
foreach (string file in GetFilesInDirectory(subdirectory))
{
yield return file;
}
}
}

Right. Noting, of course, that an alternative would be to not implement
IEnumerable in your class at all, and simply make the
GetFilesInDirectory() method a public static method in some helper class
(i.e. unless you have some specific need to have a class that stores the
instance of the "Root" string persistently, it would suffice to just
provide for a method to which you can pass whatever starting directory
you want to inspect).
Finally: what I'd really love is a method that, before doing all this,
could return up front the number of files at all levels under my initial
directory, so I can initialize a progress dialog, but that appears to be
too much to hope for, since as far as I can tell Directory and
DirectoryInfo don't offer a file count.

Well, to count the number of files, the OS needs to access the directory
information, which is mostly the same work you have to do just to get a
list of the files anyway. Any API that provided for counting files
without enumerating them would just wind up hiding duplicated effort,
possibly encouraging inefficient code in a non-transparent way.

As I mentioned in my previous post, if the cost of processing each file
is significant, you can set up a producer/consumer scenario, in which
the enumeration of the filenames happens in a separate thread from where
you actually process each file. Not only does that allow processing to
start immediately, rather than having intermittent pauses in processing
as the recursive iterator method has to retrieve more information (i.e.
directories in a directory, files in a directory), it opens the
possibility for more accurate progress indication.

In particular, you can accumulate the total files that will be processed
as you enumerate them (the producer), and add to the progress value as
you process each file (the consumer). Initially the progress indication
will be optimistic, as it won't have the complete count of files. But
the enumeration of the filenames to process will complete much earlier
than the actual processing of them, and once that happens, you'll have
accurate indication of progress in terms of total files to process
versus files processed so far.

I've done this sort of thing in my own code and it works well. In fact,
I've even implemented it in some cases by running the enumeration of the
files twice: once to count them, and another time to process them. This
alternative isn't quite as efficient in terms of disk/network i/o, but
it does avoid the overhead and complexity of the producer/consumer
queue. I don't know how well it'd work on a network, but when I've used
it for local file system access, there's enough caching built into the
OS that the inefficiency isn't really noticeable.

So, there's at least two approaches you can use in conjunction with this
kind of enumeration in order to provide reasonably accurate progress
indication (you'll still have whatever inaccuracies arise due to
differences in processing time for each file, but that's unavoidable).

Pete
 
H

Harlan Messinger

Peter said:
Harlan said:
[...]
The problem is that GetFilesInDirectory returns an IEnumerator, not an
IEnumerable, so of course the foreach doesn't work!

Yes, exactly. Sorry for the typo. I should have included my usual
"uncompiled, untested" disclaimer. :)
This code, where GetFilesInDirectory returns an IEnumerable and my
GetEnumerator method explicitly calls that IEnumerable's GetEnumerator
method, does compile:

public IEnumerator<string> GetEnumerator()
{
return GetFilesInDirectory(Root).GetEnumerator();
}

private IEnumerable<string> GetFilesInDirectory(string path)
{
foreach (string file in Directory.GetFiles(path, Mask))
{
yield return file;
}
foreach (string subdirectory in Directory.GetDirectories(path))
{
foreach (string file in GetFilesInDirectory(subdirectory))
{
yield return file;
}
}
}

Right. Noting, of course, that an alternative would be to not implement
IEnumerable in your class at all, and simply make the
GetFilesInDirectory() method a public static method in some helper class
(i.e. unless you have some specific need to have a class that stores the
instance of the "Root" string persistently, it would suffice to just
provide for a method to which you can pass whatever starting directory
you want to inspect).
Finally: what I'd really love is a method that, before doing all this,
could return up front the number of files at all levels under my
initial directory, so I can initialize a progress dialog, but that
appears to be too much to hope for, since as far as I can tell
Directory and DirectoryInfo don't offer a file count.

Well, to count the number of files, the OS needs to access the directory
information, which is mostly the same work you have to do just to get a
list of the files anyway. Any API that provided for counting files
without enumerating them would just wind up hiding duplicated effort,
possibly encouraging inefficient code in a non-transparent way.

Correct, assuming that OS directories don't keep a count of the number
of files they contain. I was just kind of hoping that they did.
As I mentioned in my previous post, if the cost of processing each file
is significant, you can set up a producer/consumer scenario, in which
the enumeration of the filenames happens in a separate thread from where
you actually process each file. Not only does that allow processing to
start immediately, rather than having intermittent pauses in processing
as the recursive iterator method has to retrieve more information (i.e.
directories in a directory, files in a directory), it opens the
possibility for more accurate progress indication.

In particular, you can accumulate the total files that will be processed
as you enumerate them (the producer), and add to the progress value as
you process each file (the consumer). Initially the progress indication
will be optimistic, as it won't have the complete count of files. But
the enumeration of the filenames to process will complete much earlier
than the actual processing of them, and once that happens, you'll have
accurate indication of progress in terms of total files to process
versus files processed so far.

That sounds like a useful approach, thanks.
I've done this sort of thing in my own code and it works well. In fact,
I've even implemented it in some cases by running the enumeration of the
files twice: once to count them, and another time to process them. This
alternative isn't quite as efficient in terms of disk/network i/o, but
it does avoid the overhead and complexity of the producer/consumer
queue. I don't know how well it'd work on a network, but when I've used
it for local file system access, there's enough caching built into the
OS that the inefficiency isn't really noticeable.

So, there's at least two approaches you can use in conjunction with this
kind of enumeration in order to provide reasonably accurate progress
indication (you'll still have whatever inaccuracies arise due to
differences in processing time for each file, but that's unavoidable).

Cool. Again, I appreciate the time you spend answering these questions!
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Top