Need Char Comparer Help

J

Jonathan Wood

Greetings,

I'm writing some text-parsing code. I need to compare individual System.Char
variables but need the option of ignoring case in the comparison. I need
help in two areas.

First, does anyone know a good way to compare characters ignoring case? I
found a case insensitive comparer at
http://stackoverflow.com/questions/...ect-way-to-compare-char-ignoring-case/1394909
but I'm not sure that's the best approach. Basically, he converts both
characters to upper case (with every comparison). But I don't know if this
is reliable and it was even suggested he should try comparing with lower
case if the upper case doesn't match just to be safe.

Second, how can I use one of two comparers without hard coding it? I need to
be able to use one of two comparers each time through the loop depending on
if the comparison should be case sensitive. I thought about creating a
Comparer<char> variable and setting it to the appropriate type. But since
you can't create an instance of a comparer, this won't work.

Note: For some of this String.IndexOf() would work but, oddly, the case
insensitive versions of this method only work with strings and not
characters.

Thanks for any tips.

Jonathan
 
F

Family Tree Mike

Greetings,

I'm writing some text-parsing code. I need to compare individual
System.Char variables but need the option of ignoring case in the
comparison. I need help in two areas.

First, does anyone know a good way to compare characters ignoring case?
I found a case insensitive comparer at
http://stackoverflow.com/questions/...ect-way-to-compare-char-ignoring-case/1394909
but I'm not sure that's the best approach. Basically, he converts both
characters to upper case (with every comparison). But I don't know if
this is reliable and it was even suggested he should try comparing with
lower case if the upper case doesn't match just to be safe.

Second, how can I use one of two comparers without hard coding it? I
need to be able to use one of two comparers each time through the loop
depending on if the comparison should be case sensitive. I thought about
creating a Comparer<char> variable and setting it to the appropriate
type. But since you can't create an instance of a comparer, this won't
work.

Note: For some of this String.IndexOf() would work but, oddly, the case
insensitive versions of this method only work with strings and not
characters.

Thanks for any tips.

Jonathan

What about converting the chars to strings, and using the overload that
ignores case? In other words, can 'a' compared to 'A' ever be different
than "a" compared to "A"?

For the second question, have a property in your Comparer class that
controls which mode you are working. Use that property in the compare
function.
 
J

Jonathan Wood

What about converting the chars to strings, and using the overload that
ignores case? In other words, can 'a' compared to 'A' ever be different
than "a" compared to "A"?

I think that would probably work. It seems like it would be horribly
inefficient though because I'd have to create at least one string on every
comparison.
For the second question, have a property in your Comparer class that
controls which mode you are working. Use that property in the compare
function.

But Comparers aren't values, they're types. How do I use a Comparer type as
a property?

Thanks.

Jonathan
 
P

Peter Duniho

Jonathan said:
Greetings,

I'm writing some text-parsing code. I need to compare individual
System.Char variables but need the option of ignoring case in the
comparison. I need help in two areas.

First, does anyone know a good way to compare characters ignoring case?
I found a case insensitive comparer at
http://stackoverflow.com/questions/...ect-way-to-compare-char-ignoring-case/1394909
but I'm not sure that's the best approach. Basically, he converts both
characters to upper case (with every comparison). But I don't know if
this is reliable and it was even suggested he should try comparing with
lower case if the upper case doesn't match just to be safe.

IMHO, the best approach not considering possible performance concerns is
to simply convert each char to a string, and then use the normal string
comparisons.

In the replies seen on that particular question, as usual Jon Skeet
offers good insight. And I'm not just saying that because one of his
answers is what I said. :)

Note that one issue you will run into is the question of surrogate pair
characters. Not every Unicode character can be represented in a single
UTF-16 character (which is the format for System.Char). So depending on
the text you're looking at, you may have characters that simply aren't
comparable to other characters, because a single System.Char instance is
only part of an actual character.

I believe this is one of the reasons that the System.String class has
well-defined, case-insensitive comparison support, while System.Char
does not.

Ultimately, you will need to define the problem better than you've
presented it here. For one, explaining why you are doing
character-by-character comparisons rather than full strings.

If you can limit your range of input to specific locales or other
constraints, then you can use simpler techniques such as converting to a
given case and comparing that. Otherwise, you need to be prepared to
move the comparison into the string domain and do it there. And even in
that case, you need to decide what to do, if anything, for character
values that individually don't represent an actual single character.
Second, how can I use one of two comparers without hard coding it? I
need to be able to use one of two comparers each time through the loop
depending on if the comparison should be case sensitive. I thought about
creating a Comparer<char> variable and setting it to the appropriate
type. But since you can't create an instance of a comparer, this won't
work.

Comparer<T> is abstract. But you can inherit Comparer<char> and create
your own implementations to perform the comparison. That said, I would
just not bother with Comparer<T> and implement IComparer<T> instead.
IMHO, the main benefit for Comparer<T> is to get at default comparisons
for a type; for custom comparisons, implementing IComparer<T>
specifically makes more sense to me.

Either way, you can easily create a variable that can reference one of
multiple implementations of a comparer, whether the base class is
Comparer said:
Note: For some of this String.IndexOf() would work but, oddly, the case
insensitive versions of this method only work with strings and not
characters.

Well, the options type for controlling comparisons is a
"StringComparison" enumeration. It makes sense that you would have to
use a System.String method when using that enumeration. :)

More generally, see my comment above regarding the impossibility of an
actual general-purpose case-insensitive comparison for characters. It's
one thing to simply order characters according to their numeric value.
But culture-specific comparisons often require more context than is
available from a single character. In situations where a
culture-specific comparison _can_ be done without that context, there's
no general rules for coming up with those comparisons, so each
application needs to define its own comparison.

Pete
 
J

Jonathan Wood

Peter said:
Note that one issue you will run into is the question of surrogate pair
characters. Not every Unicode character can be represented in a single
UTF-16 character (which is the format for System.Char). So depending on
the text you're looking at, you may have characters that simply aren't
comparable to other characters, because a single System.Char instance is
only part of an actual character.

I believe this is one of the reasons that the System.String class has
well-defined, case-insensitive comparison support, while System.Char does
not.

Okay, that's interesting. I wasn't able to imagine why there was no
case-insensitive searching support for System.Char.
Ultimately, you will need to define the problem better than you've
presented it here. For one, explaining why you are doing
character-by-character comparisons rather than full strings.

Not sure what you mean by "full strings." As just one of many examples, I
may want to find the next location that contains a space, a double-quote, or
a new line. In other cases, I may want to find the next location that is not
one of these.

I've been developing in C and C++ for many years now. While code bloat seems
to be the norm these days, I still try and consider efficiency. Normally,
characters are a lot more efficient that strings and, besides, it is
characters that I need to find.

But, if it's that much of an issue, I guess I could search for
single-character strings instead.
Either way, you can easily create a variable that can reference one of
multiple implementations of a comparer, whether the base class is
Comparer<char> or they simply both (all?) implement IComparer<char>.

I believe you. But I was unable to create a variable reference to a
Comparer<char>. Again, I get errors that it can't be instantiated.

But I guess this problem goes away if I search for strings, since
case-insensitive string searching is already supported.

I'll have to think this through and see if strings will work for everything
I need to do. I'll convert to strings if I need to, but I won't like it.
Like I said, I would prefer to be efficient and, frankly, the whole range of
Unicode issues just seems like a big mess to me.

Thanks.

Jonathan
 
J

Jonathan Wood

Peter said:
Ultimately, you will need to define the problem better than you've
presented it here. For one, explaining why you are doing
character-by-character comparisons rather than full strings.

On further thought, it's hard to see how strings can be used in any sort of
efficient way.

How do you find the first location that contains one of several strings? How
do you find the first location that does not contain any of several strings?

Thanks.

Jonathan
 
P

Peter Duniho

Jonathan said:
[...]
Not sure what you mean by "full strings." As just one of many examples,
I may want to find the next location that contains a space, a
double-quote, or a new line. In other cases, I may want to find the next
location that is not one of these.

That doesn't explain your original question. A space, double-quote, or
new-line doesn't have upper or lower case. Why would you need
case-insensitive comparisons if you are trying to find those characters,
or the lack of those characters?
[...]
Either way, you can easily create a variable that can reference one of
multiple implementations of a comparer, whether the base class is
Comparer<char> or they simply both (all?) implement IComparer<char>.

I believe you. But I was unable to create a variable reference to a
Comparer<char>. Again, I get errors that it can't be instantiated.

Show the code. My guess is that you are trying to "new
Comparer<char>()", which of course cannot work. Nothing I wrote should
have been interpreted as suggesting that you do something like that.

If you aren't specific about the code you're trying to write and what
error(s) happen, there's no way to suggest a fix.
[...] Like I said, I would prefer to be efficient and, frankly,
the whole range of Unicode issues just seems like a big mess to me.

The thing that is a big mess is human language, and in particular the
written forms for those languages. Unicode was designed to try to
address the variety of human languages that exist and while it does so
in a reasonably efficient, predictable way you are still dealing with
human languages.

If you don't care about your code being correct for any but a specific
human language (e.g. English), then you may be able to not worry about
these issues. However, most of the time that's the wrong choice to
make. Code often lives longer or is used more broadly than one expects,
and it's worthwhile trying to make sure it's right the first time.

Pete
 
J

Jonathan Wood

Peter said:
That doesn't explain your original question. A space, double-quote, or
new-line doesn't have upper or lower case. Why would you need
case-insensitive comparisons if you are trying to find those characters,
or the lack of those characters?

It was just one of many examples. Of course, I want the ability to find
letters as well.
[...] Like I said, I would prefer to be efficient and, frankly, the whole
range of Unicode issues just seems like a big mess to me.

The thing that is a big mess is human language, and in particular the
written forms for those languages. Unicode was designed to try to address
the variety of human languages that exist and while it does so in a
reasonably efficient, predictable way you are still dealing with human
languages.

If you don't care about your code being correct for any but a specific
human language (e.g. English), then you may be able to not worry about
these issues. However, most of the time that's the wrong choice to make.
Code often lives longer or is used more broadly than one expects, and it's
worthwhile trying to make sure it's right the first time.

The original promise of Unicode is that most of these problems would go
away. Obviously, it hasn't been entirely successful.

Jonathan
 
P

Peter Duniho

Jonathan said:
It was just one of many examples. Of course, I want the ability to find
letters as well.

"Of course"? Why "of course"? Why is it useful for you to search for a
specific character in a case-insensitive way? How is a specific
alphabetic character useful to identify out of context? What makes the
'a' in the middle of the word "aardvark" more significant than the ones
at the beginning? Or alternatively, why is finding all three instances
of 'a' in the word "aardvark" important? And given either answer, why
is it important for the search to be case-insensitive?

Just as important, what is the source of these characters? How will
your code choose a specific character to be found? Can you eliminate
the possibility that your code will try to find "problematic"
characters, such as surrogate pair characters, or even characters that
should not compare as equal when converted to a different case (such as
the classic example of the Turkish "i" character)?

As I said before, if you can restrict your problem space to some
specific, more tractable subset of the Unicode character space, then you
can successfully make simplifications and assumptions in your code that
may accomplish your goals without errors in the implementation.

But unless you are more specific about your exact requirements, there's
no way for anyone else to suggest what simplifications or assumptions
might be permissible.
[...] Like I said, I would prefer to be efficient and, frankly, the
whole range of Unicode issues just seems like a big mess to me.

The thing that is a big mess is human language, and in particular the
written forms for those languages. Unicode was designed to try to
address the variety of human languages that exist and while it does so
in a reasonably efficient, predictable way you are still dealing with
human languages. [...]

The original promise of Unicode is that most of these problems would go
away.

I have never heard any such promise made on Unicode's behalf. If you
have some specific reference documenting such a promise, I'd love to
know about it.

The only goal I'm aware of with respect to Unicode is to consolidate all
the various character encodings used worldwide. To think that it could
possibly consolidate the way human beings communicate with each other
would IMHO be folly, and I doubt that anyone involved with the design of
Unicode made any such promise.

Pete
 
J

Jonathan Wood

Peter said:
"Of course"? Why "of course"?

Well, I did say originally that that it was just one of many examples.
Limiting it to non-letters seems somewhat arbitrary.
Why is it useful for you to search for a specific character in a
case-insensitive way? How is a specific alphabetic character useful to
identify out of context? What makes the 'a' in the middle of the word
"aardvark" more significant than the ones at the beginning? Or
alternatively, why is finding all three instances of 'a' in the word
"aardvark" important? And given either answer, why is it important for
the search to be case-insensitive?

I don't see the relevance of all this but I'm trying to write a generic
class to parse text (to be used by a less-generic class that I'm writing).
Many parsing needs rely on a range of characters that include letters, and
many parsers need to ignore case.
Just as important, what is the source of these characters? How will your
code choose a specific character to be found? Can you eliminate the
possibility that your code will try to find "problematic" characters, such
as surrogate pair characters, or even characters that should not compare
as equal when converted to a different case (such as the classic example
of the Turkish "i" character)?

I have no way to eliminate such possibilities. I want to make this
particular class as generic as possible.
As I said before, if you can restrict your problem space to some specific,
more tractable subset of the Unicode character space, then you can
successfully make simplifications and assumptions in your code that may
accomplish your goals without errors in the implementation.

Yes, I read that comment.
I have never heard any such promise made on Unicode's behalf. If you have
some specific reference documenting such a promise, I'd love to know about
it.

It was specifically stated many times that it would eliminate the
"multi-byte" case where more than one character variables are sometimes
needed to represent one character. We heard lots about that when Microsoft
started making VB and Windows Unicode.

Jonathan
 
P

Peter Duniho

Jonathan said:
Well, I did say originally that that it was just one of many examples.
Limiting it to non-letters seems somewhat arbitrary.

For us to understand the question, we need to understand the precise
specification of your design. Until you can and will provide that, a
correct answer cannot be given.
I don't see the relevance of all this but I'm trying to write a generic
class to parse text (to be used by a less-generic class that I'm
writing). Many parsing needs rely on a range of characters that include
letters, and many parsers need to ignore case.

The relevance is that you have presented the classic "I've decided on a
specific implementation for a more general problem, and I want someone
to tell me how to write the implementation, even though what might
actually be more helpful to me is information on how to choose a
different implementation that doesn't have the same problems the
implementation I've chosen has" kind of question. If you would be
willing to step back and explain in more detail the broader picture, you
could very well get advice that would be much better than what you're
receiving now.

As far as your specific reply goes: parsing generally does not concern
itself with individual characters on a character-by-character basis,
except those used as delimiters. Rather, a string is tokenized, and
then complete tokens are compared. At that point, one is dealing with
whole strings and a case-insensitive comparison makes sense.

If your parser specifically needs to compare something
character-by-character, and this comparison needs to be
case-insensitive, you should explain why that is. For example, are you
trying to resolve tokens using some kind of state graph implementation,
instead of using a hashtable?

(Though, the latter – a hashtable implementation – still being
problematic with respect to case-insensitive comparisons; frankly,
case-insensitive parsing can be complicated unless you are willing to
restrict the input so that simple techniques like simply shifting the
case of the input can work reliably).

Simply saying that you are parsing is not in and of itself explanation
enough to understand the problem, because normally your stated goal
isn't something that shows up in parsing.
I have no way to eliminate such possibilities. I want to make this
particular class as generic as possible.

If that's really true, then you are stuck converting System.Char to
single-character System.String and using the case-insensitive
comparisons in the latter class. Or, you have to write your own custom
character comparison that takes into account every single
cultural-specific rule that could come up in Unicode.
[...]
I have never heard any such promise made on Unicode's behalf. If you
have some specific reference documenting such a promise, I'd love to
know about it.

It was specifically stated many times that it would eliminate the
"multi-byte" case where more than one character variables are sometimes
needed to represent one character. We heard lots about that when
Microsoft started making VB and Windows Unicode.

First, that has nothing at all to do with the broader issues here. Yes,
surrogate pairs come up, but in spite of what you wrote, I suspect they
can be ignored for the purposes of your code. And more generally, even
single UTF-16 characters can have problems when you attempt to compare
in a case-insensitive way, because of cultural issues.

Second, while someone at Microsoft might have made that claim, that
doesn't mean it was an actual design goal for Unicode. The main issue
is that while a 16-bit character encoding was desirable, that's not
actually enough bits to represent all the world's typography. I don't
believe that the actual designers of Unicode were unaware of this.

In any case, you are stuck with the way things are. Whatever you
believe was the original promise of Unicode, we have only the Unicode
that exists today, and it promises only those things that the Unicode
that exists today can deliver. As programmers, we simply have to work
within those constraints.

Pete
 
J

Jonathan Wood

Peter said:
As far as your specific reply goes: parsing generally does not concern
itself with individual characters on a character-by-character basis,
except those used as delimiters. Rather, a string is tokenized, and then
complete tokens are compared. At that point, one is dealing with whole
strings and a case-insensitive comparison makes sense.

If you could provide an example of tokenizing without inspecting
character-by-character, I would be very interested in that.

Jonathan
 
P

Peter Duniho

Jonathan said:
If you could provide an example of tokenizing without inspecting
character-by-character, I would be very interested in that.

That's actually kind of a hard question to answer, simply because many
parsing algorithms inherently just _assume_ you've tokenized the input.
That is, the _parsing_ is the actual interpretation of the tokens, not
the initial scanning/token-generation of the input.

For example, here's the Wikipedia page for the general class of parsers
known as "recursive descent":
http://en.wikipedia.org/wiki/Recursive_descent_parser

Note that in the C example, they've simply omitted the "getsym()"
function, because it's assumed to be a simple token-iterating function.

For a more general discussion of the process of tokenization, what you
really want to look at are lexical analyzers, or "lexers". You can find
a starting discussion here:
http://en.wikipedia.org/wiki/Tokenize

Note on that page a link to this page:
http://en.wikipedia.org/wiki/List_of_C_Sharp_lexer_generators

This is a short list of lexer generators written in/for C#. Depending
on what your specific grammar that you're trying to parse looks like,
you may be able to avoid implementing the lexer/parser altogether, and
instead use these tools.

Again, until you can and will state in a clear, precise way exactly what
you're trying to parse, it's not really possible to know exactly what
techniques might be most applicable. There is a whole realm of study in
lexical analysis and parsing, with a wide variety of different solutions
that are appropriate for different specific needs. The simplest are for
case-sensitive languages like C, while the most complicated involve
natural language processing. Where your specific task falls within that
very broad range will inform with respect to what implementation is most
appropriate, and/or most likely to be efficient and easily implemented.

Pete
 
J

Jonathan Wood

Peter said:
That's actually kind of a hard question to answer, simply because many
parsing algorithms inherently just _assume_ you've tokenized the input.
That is, the _parsing_ is the actual interpretation of the tokens, not the
initial scanning/token-generation of the input.

To my way of thinking, parsing means constructing meaning from the raw text.

http://en.wikipedia.org/wiki/Parsing
"In computer science and linguistics, parsing, or, more formally, syntactic
analysis, is the process of analyzing a text, made of a sequence of tokens
(for example, words), to determine its grammatical structure with respect to
a given (more or less) formal grammar."

FWIW, I've written a language interpreter before. For me, there was never
any question about if I'd need to parse the raw text and I can't imagine
that ever being the question for such a task. The text needs parsing, I'm a
programming, so that's what should be done. Is this why we had so much
trouble communicating about what I'm trying to do?
For a more general discussion of the process of tokenization, what you
really want to look at are lexical analyzers, or "lexers". You can find a
starting discussion here:
http://en.wikipedia.org/wiki/Tokenize

Note on that page a link to this page:
http://en.wikipedia.org/wiki/List_of_C_Sharp_lexer_generators

I looked at this code for a little while. The code is hard to follow because
it's largely table-driven and not just direct processing of the input. But I
suppose the result is basically that it *is* case sensitive. So maybe my
best approach is to just be case sensitive and this whole issue will go
away. If code using the class wants to be case-insensitive, then it can
simply specify both the upper and lower case versions of any characters.
Again, until you can and will state in a clear, precise way exactly what
you're trying to parse, it's not really possible to know exactly what
techniques might be most applicable. There is a whole realm of study in
lexical analysis and parsing, with a wide variety of different solutions
that are appropriate for different specific needs. The simplest are for
case-sensitive languages like C, while the most complicated involve
natural language processing. Where your specific task falls within that
very broad range will inform with respect to what implementation is most
appropriate, and/or most likely to be efficient and easily implemented.

Dude?!? Still with the "state what you're trying to do?" Between the two
threads, I've done that repeatedly now. If you have a question about what
I'm doing, you can ask. I have no idea what part of my repeated descriptions
are unclear to you.

Jonathan
 
P

Peter Duniho

Jonathan said:
To my way of thinking, parsing means constructing meaning from the raw
text.

http://en.wikipedia.org/wiki/Parsing
"In computer science and linguistics, parsing, or, more formally,
syntactic analysis, is the process of analyzing a text, made of a
sequence of tokens (for example, words), to determine its grammatical
structure with respect to a given (more or less) formal grammar."

The above is NOT consistent with your "way of thinking". If you think
it is, then you have misunderstood something in the Wikipedia
definition. The Wikipedia definition clearly describes parsing as an
interpretive step taken _after_ the "raw text" has already been
processed into a more abstract form (i.e. tokenized).

In that definition, "parsing" is synonymous with "syntactic analysis",
which is quite distinct from the "lexical analysis" that takes place at
the "raw text" stage.
FWIW, I've written a language interpreter before. For me, there was
never any question about if I'd need to parse the raw text and I can't
imagine that ever being the question for such a task. The text needs
parsing, I'm a programming, so that's what should be done.

I'm sorry. I don't understand the above paragraph, or what you're
trying to express.
Is this why
we had so much trouble communicating about what I'm trying to do?

You never said up front what you were trying to do. _That_ is the
reason we had trouble communicating about it. Your only question
initially was specifically about a very specific implementation detail,
without any reference to the larger picture.
I looked at this code for a little while. The code is hard to follow
because it's largely table-driven and not just direct processing of the
input.

Which code? In the tokenize page? Or did you follow a link from the
lexer generator page and look at code there?

The C code on the tokenize page is not table-driven at all. The grammar
is embedded in the code itself. (In fact, some might consider that a
fundamental design limitation, as it makes any extension of the grammar
a code maintenance problem instead of a data maintenance problem).
But I suppose the result is basically that it *is* case
sensitive. So maybe my best approach is to just be case sensitive and
this whole issue will go away. If code using the class wants to be
case-insensitive, then it can simply specify both the upper and lower
case versions of any characters.

That is certainly a valid approach IMHO. Alternatively, you can require
the client code to provide a Comparer<char> for your own code to use.
That way, it has complete control over every aspect of the per-character
comparison as needed and as possible.

Yet another alternative would be to require the client code to provide a
Comparer<string> implementation, and handle the comparison _after_ the
lexical analysis has been done (i.e. after the input has been tokenized).

Personally, I'd go for that last approach. I can't think of a single
programming language where tokens would be delimited in a way that
character case would matter. At worst, there's a transition between
alphanumeric and non-alphanumeric characters, and often there will be
whitespace, commas, semi-colons, etc. between tokens. Only the
and of course in said:
Dude?!? Still with the "state what you're trying to do?" Between the two
threads, I've done that repeatedly now.

That's not true at all. The first – and ONLY – time you mentioned the
fact that you're trying to write an HTML parser was in the _other_
thread, and of course when I wrote the above, I hadn't even seen that
post in the other thread.

If you think that your statements about trying to write a "generic
parser" added anything informative with respect to my question about
"exactly what you're trying to parse", unfortunately you would be
mistaken about that. As I alluded to in my previous reply, the entire
idea of a globally "generic parser" is very abstract, and more a topic
for a graduate student's research, or a doctoral thesis. Such a
statement provides no practical insight into the actual design
requirements of what you're trying to accomplish.
If you have a question about what I'm doing, you can ask.

I have asked. And if you think it's fine for me to ask, why are you
giving me grief about asking? I would not ask a question that I already
know the answer to, so if I ask a question you _think_ you've answered,
then it should be clear to you that somehow, the answer was not
conveyed. I either misunderstood your reply, or your reply did not
actually answer the question you thought it did.

Either way, reiterating the question is the only way to make forward
progress.

I have no idea what part of my repeated
descriptions are unclear to you.

Again, there have been no "repeated descriptions" of the specific
application. You have mentioned the HTML parser application only once,
and that was in a post I had not had a chance to read when I wrote the
text (quoted above) that you are apparently so upset about.

Pete
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Top