Unique List (List + Dictionary?)

Sin Jeong-hun · May 17, 2010

For example, I want to extract unique words from some text. Using
List<string> to store the extracted words in the following way will be
inefficient
List<String> words;
foreach(String word in text)
{
if(words.Contains(word)==false) words.Add(word);
}
because, according to the documentation, the Contains is an O(n)
operation.
I was looking for something like a UniqueList but I couldn't. Up until
now, I was using something like
Dictionary<String, object> words;
object dummy = null;
foreach(String word in text)
{
words[word]=dummy;
}
This works but using unnecessary "dummy" object makes the code look
dirty. What would be the best way to achieve what I'm trying to do?
Thanks.

Peter Duniho · May 17, 2010

Sin said:
For example, I want to extract unique words from some text. Using
List<string> to store the extracted words in the following way will be
inefficient
[...] I was using something like
Dictionary<String, object> words;
object dummy = null;
foreach(String word in text)
{
words[word]=dummy;
}
This works but using unnecessary "dummy" object makes the code look
dirty. What would be the best way to achieve what I'm trying to do?

Sounds like you want the HashSet<T> class.

Göran Andersson · May 17, 2010

For example, I want to extract unique words from some text. Using
List<string> to store the extracted words in the following way will be
inefficient
List<String> words;
foreach(String word in text)
{
if(words.Contains(word)==false) words.Add(word);
}
because, according to the documentation, the Contains is an O(n)
operation.
I was looking for something like a UniqueList but I couldn't. Up until
now, I was using something like
Dictionary<String, object> words;
object dummy = null;
foreach(String word in text)
{
words[word]=dummy;
}
This works but using unnecessary "dummy" object makes the code look
dirty. What would be the best way to achieve what I'm trying to do?
Thanks.

If you are using framework 3.5 or later, you can use a HashSet<string>.
The hash set automatically ignores duplicates, so you can just add all
the strings to it and at the end it will only contain one of each:

HashSet<string> words = new HashSet<string>();
foreach (string word in text) {
words.Add(word);
}

You can also use the Distinct extension method to create a distinct
result. This gives simpler code, but has slightly more overhead:

List<string> words = text.Distinct().ToList();

For earlier versions of the framework the dictionary is the best option.
However, you can use a smaller data type than an Object reference, and
one that is not a reference type (as the garbage collector has to scan
all references each run), like a Dictionary<string, byte>.

Peter Duniho · May 17, 2010

Göran Andersson said:
If you are using framework 3.5 or later, you can use a HashSet<string>.
The hash set automatically ignores duplicates, so you can just add all
the strings to it and at the end it will only contain one of each:

HashSet<string> words = new HashSet<string>();
foreach (string word in text) {
words.Add(word);
}

The above can be replaced with:

HashSet<string> words = new HashSet<string>(text);

…where in both examples, of course, whatever "text" is, it's declared to

be IEnumerable said:
You can also use the Distinct extension method to create a distinct
result. This gives simpler code, but has slightly more overhead:

List<string> words = text.Distinct().ToList();

The extension method only works in a situation when you could pass the
object directly to the HashSet constructor, and just using the HashSet
constructor is simpler than the above (and results in an object that
continues to have the "set" semantics desired).

For earlier versions of the framework the dictionary is the best option.
However, you can use a smaller data type than an Object reference, and
one that is not a reference type (as the garbage collector has to scan
all references each run), like a Dictionary<string, byte>.

Seems like a premature optimization to me. The generational GC in .NET
will minimize whatever nominal overhead there is in managing the
dictionary (since it still has to scan the key, which has data locality
with the value, the bulk of the overhead involved is in the memory i/o
cost which is incurred regardless). As far as the data type being
smaller, I doubt there will be any difference between using "object" and
"byte", because the value is stored inside a managed struct with other
data elements, the size of which is likely aligned to at least 32-bit size.

Pete

csharper · May 17, 2010

For example, I want to extract unique words from some text. Using
List<string> to store the extracted words in the following way will be
inefficient
List<String> words;
foreach(String word in text)
{
if(words.Contains(word)==false) words.Add(word);
}
because, according to the documentation, the Contains is an O(n)
operation.
I was looking for something like a UniqueList but I couldn't. Up until
now, I was using something like
Dictionary<String, object> words;
object dummy = null;
foreach(String word in text)
{
words[word]=dummy;
}
This works but using unnecessary "dummy" object makes the code look
dirty. What would be the best way to achieve what I'm trying to do?
Thanks.

Click to expand...

If you are using framework 3.5 or later, you can use a HashSet<string>.
The hash set automatically ignores duplicates, so you can just add all
the strings to it and at the end it will only contain one of each:

HashSet<string> words = new HashSet<string>();
foreach (string word in text) {
words.Add(word);
}

You can also use the Distinct extension method to create a distinct
result. This gives simpler code, but has slightly more overhead:

List<string> words = text.Distinct().ToList();

For earlier versions of the framework the dictionary is the best option.
However, you can use a smaller data type than an Object reference, and
one that is not a reference type (as the garbage collector has to scan
all references each run), like a Dictionary<string, byte>.

When I read the OP's question, I was also thinking that HashSet is
what s/he was looking for. But on 2nd thought, I am wondering how
different is HashSet from a List or Dictionary in terms of
performance. HashSet does save us from one or two lines comparing if
the next item is already in the collection, but how is HashSet
implemented? In its implementation, doesn't it still need to compare
if the next item is already in the set? Other than saving us a couple
of lines of code, does HashSet really improve the performance?

Harlan Messinger · May 17, 2010

csharper said:
When I read the OP's question, I was also thinking that HashSet is
what s/he was looking for. But on 2nd thought, I am wondering how
different is HashSet from a List or Dictionary in terms of
performance. HashSet does save us from one or two lines comparing if
the next item is already in the collection, but how is HashSet
implemented? In its implementation, doesn't it still need to compare
if the next item is already in the set? Other than saving us a couple
of lines of code, does HashSet really improve the performance?

Yes, because the way a HashSet stores its items speeds up searches
enormously in comparison with the way a List operates. With a List, a
search has to examine the existing items one by one. With a HashSet, the
search uses arithmetic to compute where the item will be located if it's
already a member, and checks that location directly. See
http://www.vcskicks.com/hashset.php for a sample performance comparison.

Arne Vajhøj · May 18, 2010

When I read the OP's question, I was also thinking that HashSet is
what s/he was looking for. But on 2nd thought, I am wondering how
different is HashSet from a List or Dictionary in terms of
performance. HashSet does save us from one or two lines comparing if
the next item is already in the collection, but how is HashSet
implemented? In its implementation, doesn't it still need to compare
if the next item is already in the set? Other than saving us a couple
of lines of code, does HashSet really improve the performance?

The name HashSet somewhat implies that it uses a hash table.

Meaning that lookup is a O(1) operation.

So performance of HashSet would be similar to Dictionary
and much better than List which is O(n) for lookup.

List is better for indexing.

Arne

Arne Vajhøj · May 18, 2010

The above can be replaced with:

HashSet<string> words = new HashSet<string>(text);

And if .Split(' ') is added to both, then they will both
compile.

Arne

Peter Duniho · May 18, 2010

Arne said:
And if .Split(' ') is added to both, then they will both
compile.

They compile fine as long as "text" already implements IEnumerable<string>.

There was nothing in the original post to suggest it didn't, as you seem
to be assuming (in fact, the original post strongly suggests it
_does_…the OP didn't show any declaration of "text", but it is already
being used as an enumerable type in the OP).

Pete

Göran Andersson · May 19, 2010

The above can be replaced with:

HashSet<string> words = new HashSet<string>(text);

Good point.

…where in both examples, of course, whatever "text" is, it's declared to

The extension method only works in a situation when you could pass the
object directly to the HashSet constructor, and just using the HashSet
constructor is simpler than the above (and results in an object that
continues to have the "set" semantics desired).

Yes, supposing that the set is the desired result. If you just want a
list as the result, there is no specific advantage of going the way over
a set.

Seems like a premature optimization to me. The generational GC in .NET
will minimize whatever nominal overhead there is in managing the
dictionary (since it still has to scan the key, which has data locality
with the value, the bulk of the overhead involved is in the memory i/o
cost which is incurred regardless). As far as the data type being
smaller, I doubt there will be any difference between using "object" and
"byte", because the value is stored inside a managed struct with other
data elements, the size of which is likely aligned to at least 32-bit size.

The implementation seems to have changed in the latest framework... The
keys and values used to be stored in separate arrays, which would
definitely make the value array smaller with a smaller data type.

Peter Duniho · May 19, 2010

Göran Andersson said:
[...]

The extension method only works in a situation when you could pass the
object directly to the HashSet constructor, and just using the HashSet
constructor is simpler than the above (and results in an object that
continues to have the "set" semantics desired).

Click to expand...

Yes, supposing that the set is the desired result. If you just want a
list as the result, there is no specific advantage of going the way over
a set.

Yes, but under what scenario would the list be the desired result? It
seems contrary to the original problem requirement. In general, the
HashSet<T> will offer all the same operations that a List<T> would,
including enumeration.

I could see maybe wanting the data sorted, in which case having it in a
list would be useful. But in that case, one could easily check for
uniqueness _and_ sort at the same time using, for example,

[QUOTE="SortedList said:
[...] As far as the data type being
smaller, I doubt there will be any difference between using "object" and
"byte", because the value is stored inside a managed struct with other
data elements, the size of which is likely aligned to at least 32-bit
size.

The implementation seems to have changed in the latest framework... The
keys and values used to be stored in separate arrays, which would
definitely make the value array smaller with a smaller data type.[/QUOTE]

The 2.0 CLR has the same implementation as the current one. If there
was ever an implementation that stored the values in standalone array,
it would be the 1.1 CLR which hasn't been in common use for many years.

Pete

Help with syntax	2	Feb 2, 2009
Stuck on crazy Dictionary/List problem...	6	Jan 21, 2008
Translation Service. Please, help me improve it ... Thank You	9	Sep 28, 2012
Fill Dictionary in a Fluent Way	4	Sep 4, 2012
Wire a Configuration to a Service	4	Nov 15, 2012
XML to Dictionary using LINQ	2	Jun 17, 2013
serialize sorteddictionary-does this seem ok	13	Jan 19, 2011
Create Permutations from Dictionary<string, List<string>>	2	Nov 30, 2007

Unique List (List + Dictionary?)

Sin Jeong-hun

Peter Duniho

Göran Andersson

Peter Duniho

csharper

Harlan Messinger

Arne Vajhøj

Arne Vajhøj

Peter Duniho

Göran Andersson

Peter Duniho

Ask a Question

Similar Threads