Is LinQ an option for this scenario?

Rich P

Actually, I have two questions.

I wrote a program which displays images in a slideshow type manner. The
image files are .jpg and .bmp and .gif images. They are stored in a
variety of subfolders under a parent folder. There are about 14,000
image files in these subfolders. I retrieve each image file as follows:

List<string> myList = new List<string>();
foreach (string str1 in My.Computer.FileSystem.GetFiles("C:\\1A",
FileIO.SearchOption.SearchAllSubDirectories,
"*.jpg", "*.bmp", "*.gif"))
{
myList.Add(str1);
}

Currently, I read all the files into this list object and then display
each image for 1 second and then display the next image, ... in a loop
(it is basically a "search for something by seeing it" program). The
image files are named in the following manner:

aaa1.jpg
aaa2.jpg
aaa3.jpg
...
aaa100.jpg

abcbbb1.jpg
abcbbb2.jpg
...
abcbbb77.jpg

cccttttt1.jpg
cccttttt2.jpg
...
cccttttt142.jpg

ddd1.bmp
ddd2.jpg
ddd3.gif
ddd4.jpg
...
ddd95.jpg
...

and the subfolders are named fldA, fldB, fldC, ... fldZ where image
files that begin with "a" will be stored in fldA, image files that begin
with "b" will be stored in fldB, ...

Basically, I have groups of image files where a group has the same
beginning text in the filename (alpha chars) and then followed by a
numeric char (incremented as aaa1, aaa2, aaa3, ...aaa100). So group aaa
may have 100 image files that begin with "aaa" before encountering a
numeric char, group "abcbbb" may have 77 files, ...

What I want to do is this: when a group of images begins displaying -
group "aaa" for example - I want to display the count of files in that
group while that group of images is being displayed. I would have a
label reading "Count of 'aaa' is 100". Then when the next group of
images is displayed the label would change to "Count of 'abcbbb' is 77"
and so on.

I could either pick the max count for a given group or do a "Group
By" type query on the current group of images being displayed. Either
way, for each group I would have to search the filename for the point at
which the char becomes numeric and then find the max number value or do
the "Group By" thing based on the alpha portion of the filenames.

in pseudocode I would have something like this:

class myGroup
{
//alpha part of filename in the group
public string GroupName;
//count of files in this group
public int GroupCount;
}


string s1, s2;
int Lcount = 0;
//store just the group name in another list object
List<myGroup> myGroups = new List<myGroup>();
//myGroups = LinQ magic to get the groups - parsing out the number part of
//the group filenames from the 14,000 files in myList -- which may be only
//200 individual groups of image files

for (int i = 0; i < myGroups.Count; i++)
{
//now get the list of filenames for this group
//more LinQ magic to get just the "aaa's" then the "abcbbb's", then
//the "bbb's", ...

List<string> newFileList = new List<string>();
//get files from myList where the alpha portion of the filenames
//matches the current myGroups[i].GroupName

LabelCount.Text = myGroups[i].GroupCount.ToString() + " files in " +
myGroups[i].GroupName;

for (int j = 0; j < newFileList.Count; j++)
{
//display image
}
}


Question 1: Could LinQ do this? If yes - may I ask for an example how?

Question 2: would it be more efficient to read the subfolders
individually? Where I would just loop through each subfolder.

Like subfolder fldA may store 1000 image files, fldM may have 3000
files, fldQ may have only 50 image files. Right now I am just reading
everything into memory - all 14,000 filenames. would there be any
performance/efficiency difference between reading everything in one
chunk or reading the subfolders individually?



Rich
 
Peter Duniho

Rich said:
Actually, I have two questions.

I wrote a program which displays images in a slideshow type manner. The
image files are .jpg and .bmp and .gif images. They are stored in a
variety of subfolders under a parent folder. There are about 14,000
image files in these subfolders. I retrieve each image file as follows:

List<string> myList = new List<string>();
foreach (string str1 in My.Computer.FileSystem.GetFiles("C:\\1A",
FileIO.SearchOption.SearchAllSubDirectories,
"*.jpg", "*.bmp", "*.gif"))
{
myList.Add(str1);
}

You should rid your code of any VB references. There's really no need
for them in C#, and doing things "the VB way" in C# will only slow you
down in the long run.

Use the System.IO.Directory class, and its GetFiles() method in
particular, to obtain a list of files found at a path.

Also note that the List<T> class has an AddRange() method. It is much
more efficient, especially when adding a large number of items, to use
that method instead of adding items individually.
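As a rough sketch of the two suggestions above, the listing could look something like this. The class and method names (ImageFileLister, ListFiles) are made up for illustration. Directory.GetFiles() takes a single search pattern per call, so this sketch assumes one pass per extension is acceptable.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class ImageFileLister
{
    // Hypothetical helper: collects every file under 'root' (subfolders
    // included) that matches any of the given wildcard patterns.
    public static List<string> ListFiles(string root, params string[] patterns)
    {
        List<string> result = new List<string>();

        foreach (string pattern in patterns)
        {
            // Directory.GetFiles() accepts one pattern per call, so each
            // extension gets its own pass; AddRange() appends the whole
            // returned array in a single operation.
            result.AddRange(
                Directory.GetFiles(root, pattern, SearchOption.AllDirectories));
        }

        return result;
    }
}
```

With that in place, the original loop would reduce to something like: List<string> myList = ImageFileLister.ListFiles(@"C:\1A", "*.jpg", "*.bmp", "*.gif");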

For the moment, I'll take for granted that storing in memory the names
of 14,000 files all at once makes sense. But that seems potentially
inefficient as well. :)

So, as for the questions:
[...]
Question 1: Could LinQ do this? If yes - may I ask for an example how?

LINQ certainly can group data. One question is, is there a particular
order you need the groups to be presented in? And can you confirm that
you do in fact want to display a given group of pictures together? Or
is it simply that you want the count of pictures in a given group to be
displayed with a given picture from that group?

Assuming you have an enumeration of all the files, you can group them
like this…

char[] _rgchDigits = { '0', '1', '2', '3', '4', '5', '6', '7', '8',
'9' };

var grouped = from filename in myList
group filename by Path
.GetFilename(filename)
.Substring(0, filename.IndexOfAny(_rgchDigits));

Question 2: would it be more efficient to read the subfolders
individually? Where I would just loop through each subfolder.

The most efficient thing would be for each group of files to be in their
own folder. It's not feasible to try to retrieve individual groups from
a single folder; to enumerate the files individually by group, you'd
have to generate the group names and filter a file enumeration by that.
You might as well in that case just get all the files for a folder and
then group them.

That said, certainly working on one folder at a time rather than trying
to manage everything all at once could be more _memory_ efficient, if
not performance efficient. User perception of performance could be
better, simply because your program isn't trying to do so much all at
once (the big performance hit being all the i/o involved in retrieving
14,000 file names from the directory structure all at once).
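To make the one-folder-at-a-time idea concrete, here is a minimal sketch. The names (FolderWalker, FilesBySubfolder) are invented for the example, and it assumes all the files live in subfolders directly under the parent, as in the fldA ... fldZ layout described above; files sitting in the parent folder itself would be skipped.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class FolderWalker
{
    // Hypothetical helper: yields the matching files of one immediate
    // subfolder at a time, so only that folder's names are held at once.
    public static IEnumerable<string[]> FilesBySubfolder(string root, string pattern)
    {
        foreach (string dir in Directory.GetDirectories(root))
        {
            // Nested folders below 'dir' are included in its batch.
            yield return Directory.GetFiles(dir, pattern, SearchOption.AllDirectories);
        }
    }
}
```

Because the method is an iterator, the next folder is not read from disk until the caller asks for it, e.g. foreach (string[] batch in FolderWalker.FilesBySubfolder(@"C:\1A", "*.jpg")) { /* group and display this batch */ }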

Hope that helps.

Pete
 
Rich P

Thank you for your reply. And "Enumeration" was the word I believe I
was looking for to describe how I have these image files organized.

When I read the files - they are alphabetic. I read all the A's first,
then the B's, C's, ...Q's, W's, X's, Z's.

Confession (bless me almighty one for mixing VB with C# :) I have been
doing VB/VB.Net for several years and have been migrating to C# for the
last couple of years. So I don't have all the C# stuff down yet.
Question:

My.Computer.FileSystem.GetFiles("C:\\1A",
FileIO.SearchOption.SearchAllSubDirectories...

this will search all the subdirectories. How do I search all
subdirectories with System.IO.Directory class - GetFiles() ? Before
My.Computer... I used to have recursive routine that would read each
subfolder\subfolder... using Windows API's. It was pretty fast but way
more lines of code than My.Computer...


Question2:
char[] _rgchDigits = { '0', '1', '2', '3', '4', '5', '6', '7', '8',
'9' };

var grouped = from filename in myList
group filename by Path
.GetFilename(filename)
.Substring(0, filename.IndexOfAny(_rgchDigits));

lets say I set up a test scenario where I have a list of test text files
in C:\1A\A, C:\1A\B, C:\1A\C

in subfolder A I have the following test text files (no content)

testA1.txt
testA2.txt
testA3.txt
testAB1.txt
testAB2.txt
testAB3.txt
testAB4.txt

then in subfolder B I have

testB1.txt
testB2.txt
testB3.txt
testBC1.txt
testBC2.txt
testBC3.txt
testBC4.txt

and the same in subfolder C for the C's.

I want to read all of these text file names into a list and group them
by testA, testAB, testB, testBC, testC, testCD, and get a count of each
group where group testA is count = 3, testAB is count = 4, testB count =
3, testBC count = 4, ...

Using System.IO.Directory how can I read each subdirectory to populate
my list of the test text file names? and how can I use linQ to group
this list for something like the following?

foreach (myGroupTestTxt grp in Result of LinQ Magic)
console.WriteLine(groupName + " " + groupCount.ToString());



Rich
 
Peter Duniho

Rich said:
[...]
Question:

My.Computer.FileSystem.GetFiles("C:\\1A",
FileIO.SearchOption.SearchAllSubDirectories...

this will search all the subdirectories. How do I search all
subdirectories with System.IO.Directory class - GetFiles() ?

See:
http://msdn.microsoft.com/en-us/library/ms143316.aspx

[...]
Question2:

char[] _rgchDigits = { '0', '1', '2', '3', '4', '5', '6', '7', '8',
'9' };

var grouped = from filename in myList
group filename by Path
.GetFilename(filename)
.Substring(0, filename.IndexOfAny(_rgchDigits));

[...]
I want to read all of these text file names into a list and group them
by testA, testAB, testB, testBC, testC, testCD, and get a count of each
group where group testA is count = 3, testAB is count = 4, testB count =
3, testBC count = 4, ...

Using System.IO.Directory how can I read each subdirectory to populate
my list of the test text file names?

You can either enumerate files one directory at a time (see
Directory.GetDirectories() for getting a list of directories in a
directory), or see above for enumerating all files recursively under a
given path.

and how can I use linQ to group
this list for something like the following?

foreach (myGroupTestTxt grp in Result of LinQ Magic)
console.WriteLine(groupName + " " + groupCount.ToString());

Assuming the code I proposed:

foreach (var group in grouped)
{
Console.WriteLine(group.Key + " " + group.Count());
}

This stuff is all in the documentation. Given the code I proposed
earlier, you could have even used VS's Intellisense to see what the
query result was and figure out how to use it, but of course you could
also have started with the Enumerable.GroupBy() method to see what it
returns and followed the chain of class features from there.

Pete
 
Rich P

As always, thank you very much for your reply. I am now using
System.IO.Directory for drilling into subdirectories -- very nice! And
I am attempting to use the code sample you have proposed. But VS is
complaining. Here is what I have attempted thus far:

private void GroupFiles()
{
DirectoryInfo di = new DirectoryInfo(@"C:\1A\1AA\");
FileInfo[] files = di.GetFiles("*.txt", SearchOption.AllDirectories);
List<string> myList = new List<string>();
foreach (FileInfo file in files)
myList.Add(file.Name);

foreach (string str1 in myList)
Console.WriteLine(str1);

/*only searching 1 dir for now -- here are my test files
test1.txt
test2.txt
test3.txt
test4.txt
testA1.txt
testA2.txt
testA3.txt
*/

char[] _rgchDigits = { '0', '1', '2', '3', '4', '5', '6', '7', '8', '9'
};

var grouped = from filename in myList
group filename myPath
.GetFilename(filename) <<<--- VS complains here
.Substring(0, filename.IndexOfAny(_rgchDigits));

}


I apologize in advance for my ignorance on the subject of LinQ, but
when I add your proposed code to the routine above - VS complains as
noted. At this point in time I don't have enough experience/intuition
to see what is missing or where to go next with the Linq part of the
exercise. Any suggestions greatly appreciated on how I could list the
count of groups of my test files -- like
group "test" has count = 4, and group "testA" has count = 3. how do I
proceed with Linq to obtain this information?

Thanks again for all the help.

Rich
 
Peter Duniho

Rich said:
[...]
var grouped = from filename in myList
group filename myPath
.GetFilename(filename) <<<--- VS complains here
.Substring(0, filename.IndexOfAny(_rgchDigits));
}

I apologize in advance for my ignorance on the subject of LinQ, but
when I add your proposed code to the routine above - VS complains as
noted.

As a general rule: do not write "VS complains". Provide a complete,
specific description of the exact error message.

That said, the problem in this particular case is obvious: you have
failed to copy the code I posted correctly. It's not "myPath". It's
"Path". As in, System.IO.Path (but seeing as how you've used Directory and
other System.IO types without qualification, that means you have a
"using" directive, and "Path" by itself ought to be fine).

You've also left out the "by" keyword in the "group by" clause.

I do have a typo in my original post you'll need to fix though: the
method name is "GetFileName", and not "GetFilename".

At this point in time I don't have enough experience/intuition
to see what is missing or where to go next with the Linq part of the
exercise.

You really should practice more dealing with compilation errors. The
errors usually are very specific, though in some cases overly
descriptive. When interpreted correctly, they practically always tell
you exactly what you need to know in order to fix the code.

Finally, I note you haven't changed your code over to use the List<T>
more efficiently:

– First, since you're creating a new one from scratch, you can simply
pass the "files" array to the constructor (given a string[] such as
Directory.GetFiles() returns; a FileInfo[] won't convert to strings):

List<string> myList = new List<string>(files);

– Second, even if you didn't want to do that, you could use the
AddRange() method:

List<string> myList = new List<string>();

myList.AddRange(files);

– Third, there's nothing about your code that suggests you really
need to use a List<string> anyway, nor the "…Info" versions of the
System.IO classes. Why not write this:

private void GroupFiles()
{
string[] files = Directory.GetFiles(@"C:\1A\1AA\", "*.txt",
SearchOption.AllDirectories);

foreach (string str1 in files)
Console.WriteLine(str1);

char[] rgchDigits = { '0', '1', '2', '3', '4', '5', '6', '7', '8',
'9' };

var grouped = from filename in files
group filename by Path
.GetFileName(filename)
.Substring(0, filename.IndexOfAny(rgchDigits));
}

?

Pete
 
Rich P

Wouldn't you know it? The day job had me doing something :) (Linq would
have helped out for this quicky proj - slapped together something -
which grouped data for another app).

Anyway -- Rock n Roll !

I am kinda starting to get the idea. Note: when I passed in the
string[] files array to var grouped = ... -- this included the path with
the filenames -- which the subdirectory names included digits in the
chars. So I didn't get the correct grouping I was looking for (only
printed: tes 7 one time in the console window).

So I went back to using FileInfo since that parses out the path part of
the filename and loaded up myList with the FileInfo filenames. Then I
passed myList to var grouped =....

Then I got the correct grouping --

test count = 4
testA count = 3

Yay!

I'm sure I could parse out the path from string[] files, but I am
thinking it is either some of one or some of the other --
files.Substring... or FileInfo something.

Many thanks for all the help and this great lesson. I am starting to
see some glimmer of light in my Linq pursuits :).


Rich
 
Peter Duniho

Rich said:
Wouldn't you know it? The day job had me doing something :) (Linq would
have helped out for this quicky proj - slapped together something -
which grouped data for another app).

Anyway -- Rock n Roll !

I am kinda starting to get the idea. Note: when I passed in the
string[] files array to var grouped = ... -- this included the path with
the filenames -- which the subdirectory names included digits in the
chars. So I didn't get the correct grouping I was looking for (only
printed: tes 7 one time in the console window).

That doesn't make sense. The whole point of including the call to
Path.GetFileName() is to remove the path for the purpose of grouping.

So I went back to using FileInfo since that parses out the path part of
the filename and loaded up myList with the FileInfo filenames. Then I
passed myList to var grouped =....

Then I got the correct grouping --

test count = 4
testA count = 3

FileInfo doesn't do anything that you couldn't already do with
Path.GetFileName().

Yay!

I'm sure I could parse out the path from string[] files, but I am
thinking it is either some of one or some of the other --
files.Substring... or FileInfo something.


No. It's Path.GetFileName().

You can use FileInfo, but creating yet another object for a single
operation on the path when there's a static method to accomplish the
same thing is inefficient.

Pete
 
Rich P

I'm sure I could parse out the path from
string[] files, but I am thinking it is either some of
one or some of the other --
files.Substring... or FileInfo something.

No. It's Path.GetFileName().

how would I implement/integrate

Path.GetFileName()

with

string[] files = Directory.GetFiles(@"C:\1A\1AA\", "*.txt",
SearchOption.AllDirectories);

char[] rgchDigits = { '0', '1', '2', '3', '4', '5', '6', '7', '8', '9'
};

var grouped = from filename in files
group filename by Path
.GetFileName(filename)
.Substring(0, filename.IndexOfAny(rgchDigits));




Rich
 
Peter Duniho

Rich said:
how would I implement/integrate

Path.GetFileName()

with

string[] files = Directory.GetFiles(@"C:\1A\1AA\", "*.txt",
SearchOption.AllDirectories);

char[] rgchDigits = { '0', '1', '2', '3', '4', '5', '6', '7', '8', '9'
};

var grouped = from filename in files
group filename by Path
.GetFileName(filename)
.Substring(0, filename.IndexOfAny(rgchDigits));

I don't understand the question. Path.GetFileName() is _already_
included in the code you posted. There's nothing more to do.
 
Rich P

string[] files = Directory.GetFiles(@"C:\1A\1AA\", "*.txt",
SearchOption.AllDirectories);

char[] rgchDigits = { '0', '1', '2', '3', '4', '5', '6', '7', '8', '9'
};

var grouped = from filename in files
group filename by Path
.GetFileName(filename)
.Substring(0, filename.IndexOfAny(rgchDigits));

//ignoring casing -- just typing
foreach(var group in grouped)
console.writeline(group.filename + " " + group.count)


this iterates the for body only once and prints

tes 7

in the console window. If I use myList with FileInfo
this for loop iterates twice through the body and prints

test 4
testA 3

How do I use Path.GetFileName()

in the code above to print the same results using the file array instead
of myList?

Rich
 
Peter Duniho

Rich said:
string[] files = Directory.GetFiles(@"C:\1A\1AA\", "*.txt",
SearchOption.AllDirectories);

char[] rgchDigits = { '0', '1', '2', '3', '4', '5', '6', '7', '8', '9'
};

var grouped = from filename in files
group filename by Path
.GetFileName(filename)
.Substring(0, filename.IndexOfAny(rgchDigits));
[...]
How do I use Path.GetFileName()

in the code above to print the same results using the file array instead
of myList?

Sorry. My code example was flawed. Here's a correct version of the LINQ:

var grouped = from file in files
let name = Path.GetFileName(file)
group file by name.Substring(0, name.IndexOfAny(rgchDigits));

If you compare the above with the code I wrote earlier, you'll see that
in the code I wrote earlier, it was incorrectly calculating the index
based on the full file path, rather than just the file name, even though
I'd gotten just the file name for the purpose of calling the Substring()
method.

In fact, it was only by virtue of your path containing digits early in
the string that the code didn't just crash. A more normal path, without
digits in the directory names, would have resulted in an exception as
the second argument passed to Substring() would have been out of range!

Anyway, sorry for the confusion. Your use of FileInfo avoided the
problem because, of course, that class inherently is resistant to the
bug I put in the code. :) But if you want a more efficient solution
that still works (as opposed to the more efficient solution I provided
before that doesn't work :) ), then the above should do it.
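For reference, a self-contained sketch of the corrected grouping, extended to return the counts, might look like the following. The names (FileGrouper, GroupKey, CountGroups) are invented for the example. It also makes two assumptions that go beyond the query above: the key is taken from the name without its extension, so a digit in an extension such as ".mp3" is ignored, and a name with no digit at all becomes its own group, since IndexOfAny() returns -1 in that case and Substring(0, -1) would throw.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class FileGrouper
{
    private static readonly char[] Digits = "0123456789".ToCharArray();

    // Hypothetical helper: the part of the file name before its first digit.
    // The index is computed on the same string that Substring() is applied
    // to, which is exactly what the earlier, flawed query got wrong.
    public static string GroupKey(string path)
    {
        string name = Path.GetFileNameWithoutExtension(path);
        int firstDigit = name.IndexOfAny(Digits);

        // Assumption: a name without digits forms a group of its own.
        return firstDigit >= 0 ? name.Substring(0, firstDigit) : name;
    }

    // Returns (group name, file count) pairs ordered by group name.
    public static List<KeyValuePair<string, int>> CountGroups(IEnumerable<string> paths)
    {
        var query = from path in paths
                    group path by GroupKey(path) into g
                    orderby g.Key
                    select new KeyValuePair<string, int>(g.Key, g.Count());

        return query.ToList();
    }
}
```

Applied to the seven test files listed earlier (test1.txt through test4.txt and testA1.txt through testA3.txt), CountGroups() should yield "test" with 4 and "testA" with 3, whatever digits the directory names contain.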

Pete
 
Rich P

Again, thank you so much for helping me out with this! The exercise
really points out my lack of intuition (almost perturbs me - it perturbs
me :).


I tried

var grouped = from file in files
let name = Path.GetFileName()
group file by name.Substring(0, name.IndexOfAny(rgchDigits));

and VS complained (by which I mean Path.GetFileName() had a red
underline). Even though I sort of read the error message I did not
understand it. It did not occur to me to enter the file object in
Path.GetFileName(). I think -- for me -- the real lesson in this
exercise has been on learning to read and interpret the error messages
(along with learning some linQ).

Thank you very much for your help and patience. This sample alone will
serve me well because I have a few projects that will be doing this, and
I just did not have an efficient way to deal with it. I was going to
interface with a sql server DB and use temp tables to perform the same
operation that the LinQ does (with the grouping). But that added way
too many dependencies on an external DB. This is sooo much better !

Rich
 
Peter Duniho

Rich said:
[...]
Thank you very much for your help and patience. This sample alone will
serve me well because I have a few projects that will be doing this, and
I just did not have an efficient way to deal with it. I was going to
interface with a sql server DB and use temp tables to perform the same
operation that the LinQ does (with the grouping). But that added way
too many dependencies on an external DB. This is sooo much better !

Glad you got it to work.

For what it's worth, grouping manually is not really that difficult.
You would probably be well-served to study how it might be done
explicitly, without the help of LINQ or a database.

The basic idea is to maintain some kind of group of "bins" into which
items can be placed as they are being grouped. In .NET, one possible
approach (and I believe this is in fact how the LINQ
Enumerable.GroupBy() method is implemented) is to use a dictionary.

For example (disclaimer: as with every other code example I've posted to
this thread, I haven't bothered to compile or run this code…if it's not
correct, it should give you the basic idea even so :) ):

string strPath = /* initialized as appropriate */;
char[] rgchDigits =
{ '0', '1', '2', '3', '4', '5', '6', '7', '8', '9' };
string[] files = Directory.GetFiles(strPath);
Dictionary<string, List<string>> dictGroups = new Dictionary<string,
List<string>>();

foreach (string file in files)
{
string name = Path.GetFileName(file);
string key = name.Substring(0, name.IndexOfAny(rgchDigits));
List<string> group;

if (!dictGroups.TryGetValue(key, out group))
{
group = new List<string>();
dictGroups.Add(key, group);
}

group.Add(file);
}

foreach (KeyValuePair<string, List<string>> kvp in dictGroups)
{
Console.WriteLine(kvp.Key + " " + kvp.Value.Count);
}

The basic idea in the code above is to maintain a dictionary that can
map the group key to a list of file paths. For each file path, the key
is extracted from the filename, then the dictionary is checked to see if
the key's already been added. If it hasn't, a new List<string> is
created and added to the dictionary for that key.

In either case, the List<string> that goes with the key, either the one
newly created or the one that was already in the dictionary, has the
file path added to it.

The result is a dictionary for which each entry is in fact a single
group, with the key being the base of the filename for the group, and
the value associated with the key being the actual list of file paths
that go with that filename base for the group.
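For comparison, the same dictionary of bins can be produced with the LINQ extension methods directly. This is only a sketch: the names (GroupBins, Build) are hypothetical, and treating a name without any digit as its own group is an assumption added so that Substring() is never handed a negative length.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class GroupBins
{
    private static readonly char[] Digits = "0123456789".ToCharArray();

    // Hypothetical helper: maps each group key to the list of file paths
    // in that group, like the hand-written dictionary loop does.
    public static Dictionary<string, List<string>> Build(IEnumerable<string> files)
    {
        return files
            .GroupBy(file =>
            {
                string name = Path.GetFileName(file);
                int firstDigit = name.IndexOfAny(Digits);

                // Assumption: no digit means the whole name is the key.
                return firstDigit >= 0 ? name.Substring(0, firstDigit) : name;
            })
            .ToDictionary(g => g.Key, g => g.ToList());
    }
}
```

The result can be walked with the same foreach over KeyValuePair<string, List<string>> used in the explicit version.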

And yes, learning to properly interpret the errors reported to you by
the compiler and run-time will go a long way to making your programming
activities easier and more efficient. :) It may take some practice,
but it's definitely worth it.

Pete
 
Rich P

string strPath = @"C:\1A\1AA\";
char[] rgchDigits = { '0', '1', '2', '3', '4', '5', '6', '7', '8', '9'
};
string[] files =
Directory.GetFiles(strPath,"*.txt",SearchOption.AllDirectories);
Dictionary<string, List<string>> dictGroups = new Dictionary<string,
List<string>>();

foreach (string file in files)
{
string name = Path.GetFileName(file);
string key = name.Substring(0, name.IndexOfAny(rgchDigits));
List<string> group;
if (!dictGroups.TryGetValue(key, out group))
{
group = new List<string>();
dictGroups.Add(key, group);
}

group.Add(file);
}

foreach (KeyValuePair<string, List<string>> kvp in dictGroups)
{
Console.WriteLine(kvp.Key + " " + kvp.Value.Count);
}


Thanks for this code sample - works great. This is a little bit easier
to follow - what is going on than the LinQ -- although the LinQ was
several lines shorter. Dictionary is still great though, for when I
can't figure out a LinQ method.


Rich
 
Peter Duniho

Rich said:
Thanks for this code sample - works great. This is a little bit easier
to follow - what is going on than the LinQ -- although the LinQ was
several lines shorter. Dictionary is still great though, for when I
can't figure out a LinQ method.

Right. The point being, a heavyweight solution shouldn't be your first
fallback. It's better to figure out the LINQ, but barring that, it's
important to consider that whatever LINQ can do, it does it without the
benefit of a major component like a database.

As far as what's "easier to follow", as I mentioned, I think it's
important to understand what LINQ is doing behind the scenes, just as I
think it's important with any high-level programming language or tool
to still understand what the underlying implementation is really doing.

But one thing that LINQ really helps with is making the code show
_what_ you intend for it to do, rather than _how_ you intend it to
work. Of course, in doing so, you lose much of the "how" aspect of the
code. But that's what higher-level programming techniques are all
about; to express more clearly _what_ you want the code to do, and leave
the _how_ to the programming platform.

So the LINQ version is not only more concise, it clearly expresses that
the "what" is "group the data". The explicit version I posted doesn't
really express that at all. It's left to the programmer to figure that
out, after examining the "how" closely.

IMHO, in this particular case, I find the LINQ much better. And in
general, as long as the LINQ is easy to write and simple to read, it's
better. I've seen LINQ abused, but as long as it's used only where it
really does express the "what" much better than a more explicit
implementation would, I think it's an improvement.

Pete
 
