string.Replace((char)146, (char)39) does not work?

mohaaron

Can anyone tell me why this code doesn't work?

// Key = Char to find, Value = Char to replace.
Hashtable specialChars = new Hashtable();
specialChars.Add((char)146, (char)39);

foreach (DictionaryEntry entry in specialChars)
{
  this.label.Text = this.label.Text.Replace(Convert.ToChar(entry.Key),
    Convert.ToChar(entry.Value));
}
 
Peter Duniho

mohaaron said:
Can anyone tell me why this code doesn't work?

Because C# strings are UTF-16, not some random extended ASCII encoding
you've chosen to use.

Hint: if you are casting numeric literals to System.Char, your code is
probably broken.

Both the character you are looking for and the character you want to use
as a replacement should be specified as actual characters. If you don't
want your .cs file to be a UTF-8 encoded file, you can use the \u escape
to specify a character literal using the _Unicode_ UTF-16 code for that
character.

Note that whatever your source extended ASCII encoding is, the character
found at decimal 146 in that encoding is going to have a different
character value in UTF-16. Without knowing exactly what extended ASCII
encoding you are intending to use, I can't say for sure what the actual
UTF-16 value you want is. But based on the code you posted, it appears
you are trying to convert a ’ character (UTF-16 0x2019) to a ' character
(UTF-16 0x27, the same as in ASCII). If so, then your code should look
like this:

   Dictionary<char, char> specialChars = new Dictionary<char, char>();

   specialChars.Add('’', '\'');

   string str = label.Text;

   foreach (KeyValuePair<char, char> kvp in specialChars)
   {
     str = str.Replace(kvp.Key, kvp.Value);
   }

   label.Text = str;

Note some changes I've made:

– Use actual characters in the call to Add()

– Use the strongly-typed generic Dictionary class

– Defer reassignment of the Text property until all
the processing has been done; assigning the Text
property is much more expensive than just the basic
string operations going on, so avoiding doing it over
and over is a good habit to get into :)

On the first point, rather than the literals '’' and '\'', if you want
to you could use the \u escape with the actual UTF-16 values: '\u2019'
and '\u0027'.
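
As a minimal sketch of the difference (the class and method names below are hypothetical, not from the original code): casting 146 to char yields U+0092, a control character, so Replace finds nothing to change in a string that contains U+2019. The \u escape names the intended character directly.

```csharp
using System;

public static class QuoteDemo
{
    // (char)146 is U+0092 (a C1 control character), not the right single
    // quote, so this leaves a string containing U+2019 unchanged.
    public static string ReplaceByCast(string s)
    {
        return s.Replace((char)146, (char)39);
    }

    // '\u2019' is the right single quotation mark and '\u0027' is the
    // ASCII apostrophe, so this performs the intended replacement.
    public static string ReplaceByEscape(string s)
    {
        return s.Replace('\u2019', '\u0027');
    }
}
```

Calling both methods on a string such as "It\u2019s" should show the first returning the input untouched and the second returning the text with a plain apostrophe.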

Note also that if you have a large number of characters that _might_ be
replaced, your proposed solution is very inefficient. You will do much
better scanning the string once, checking each character you find:

   Dictionary<char, char> specialChars = new Dictionary<char, char>();

   specialChars.Add('’', '\'');

   string str = label.Text;
   StringBuilder sb = new StringBuilder(str.Length);

   foreach (char ch in str)
   {
     char chAppend;

     if (specialChars.TryGetValue(ch, out chAppend))
     {
       sb.Append(chAppend);
     }
     else
     {
       sb.Append(ch);
     }
   }

   label.Text = sb.ToString();

For any input where the string and the list of special characters are
both long, the above should perform better (and for other input, the
performance difference isn't likely to matter).
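
If the single-pass scan is needed in more than one place, it could be wrapped in a small helper. This is only a sketch: the class name CharReplacer and the method name ReplaceChars are hypothetical, and any map entries beyond the ’ character are illustrative assumptions rather than part of the original problem.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class CharReplacer
{
    // Scans the input once; each character is looked up in the map and
    // either its replacement or the original character is appended.
    public static string ReplaceChars(string input, IDictionary<char, char> map)
    {
        StringBuilder sb = new StringBuilder(input.Length);

        foreach (char ch in input)
        {
            char replacement;
            sb.Append(map.TryGetValue(ch, out replacement) ? replacement : ch);
        }

        return sb.ToString();
    }
}
```

With this in place, the label update would be a single assignment, for example label.Text = CharReplacer.ReplaceChars(label.Text, specialChars), so the Text property is still written only once.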


If you really want to use some other character encoding, such as some
specific extended ASCII encoding, you need to:

– Be specific about which extended ASCII encoding you're using
– Create a byte[] that includes the characters you want to convert to
System.Char
– Use the Encoding class to convert the character data stored in the
byte[] from whatever specific extended ASCII encoding you're using to a
System.String, containing the UTF-16 character instances you want

As you can imagine, doing that is a lot more complicated than just doing
things in UTF-16 in the first place. I'd recommend the simpler approach.
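
For illustration only, the three steps above might look like the following. This sketch assumes the source encoding is Windows-1252 (the original post does not say which encoding is meant) and assumes a .NET Core 3.0 or later runtime, where code page encodings have to be registered before use. On .NET Framework the RegisterProvider line should not be needed. The class and method names are hypothetical.

```csharp
using System;
using System.Text;

public static class LegacyDecode
{
    // Decodes bytes that are assumed to be Windows-1252 text into a
    // UTF-16 System.String.
    public static string DecodeWindows1252(byte[] bytes)
    {
        // Assumption: running on .NET Core 3.0+ / .NET 5+, where code page
        // 1252 is only available after registering this provider.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        Encoding cp1252 = Encoding.GetEncoding(1252);
        return cp1252.GetString(bytes);
    }
}
```

Under that assumption, decoding the single byte 146 should produce the character U+2019, which is the value that belongs in the replacement table instead of (char)146.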

Pete
 
mohaaron

Hello Pete,

Thank you so much for your answer. I really appreciate the
thoroughness of it. I have some questions which are inline below.

Peter Duniho said:
Because C# strings are UTF-16, not some random extended ASCII encoding
you've chosen to use.

Where can I find this information about C# strings being UTF-16, is
this part of the C# specification?

Peter Duniho said:
Hint: if you are casting numeric literals to System.Char, your code is
probably broken.

Will you please explain this?

Peter Duniho said:
Both the character you are looking for and the character you want to use
as a replacement should be specified as actual characters.
[...] it appears
you are trying to convert a ’ character (UTF-16 0x2019) to a ' character
(UTF-16 0x27, the same as in ASCII).
[...]
On the first point, rather than the literals '’' and '\'', if you want
to you could use the \u escape with the actual UTF-16 values: '\u2019'
and '\u0027'.

To verify, I am trying to convert a "right single quote" to a "single
quote".

I am wondering why you say to use the literal value for the
replacement char and the original char in the specialChars dictionary.
As far as I can tell there is no way to type the literal value for the
"right single quote" on the keyboard that I have which is a standard
US keyboard. How would you go about typing the literal then?

Using the actual UTF-16 values here seems easier and more efficient in
that I can easily use any char in the char set without the keyboard
involved.

Peter Duniho said:
Note also that if you have a large number of characters that _might_ be
replaced, your proposed solution is very inefficient.  You will do much
better scanning the string once, checking each character you find:
[...]

This looks like a good efficient solution to char replacement. Very
well done.

Thank you again for the helpful response. I really like understanding
the why in how this works.

Regards,

Aaron
 
Peter Duniho

mohaaron said:
[...]
Where can I find this information about C# strings being UTF-16, is
this part of the C# specification?

Yes, there and also in MSDN. See the docs for System.String and
System.Char, for example:
http://msdn.microsoft.com/en-us/library/system.string.aspx
http://msdn.microsoft.com/en-us/library/system.char.aspx

mohaaron said:
Will you please explain this?

Okay. Note I used the word "probably". The point is that a programmer
who is well-versed in character-encoding issues is very likely to use
string and character literals rather than casting numeric values.

It is possible to cast an actual UTF-16 value to a char type and get the
correct result. But most programmers familiar with the
character-encoding issues would write the literal with the \u escape
instead. It's the programmers who think they can just cast a character
value from some arbitrary encoding to char and get the right result who
are most likely to write code that does that.

You can successfully cast values, and I'm pretty sure the C# compiler
can even handle that conversion for you (i.e. you don't wind up with an
actual cast in the compiled output). So it's _possible_ to find
non-broken code that uses that technique.

It's just not statistically very common. :)
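
A small sketch of the non-broken case, with hypothetical names: when the number being cast is the real UTF-16 value, the cast and the \u escape give the same char. Declaring the cast result as const only compiles because the compiler evaluates the conversion as a constant expression.

```csharp
public static class CastDemo
{
    // A cast of a constant is itself a constant expression, so this is
    // resolved at compile time rather than at run time.
    public const char FromCast = (char)0x2019;

    // The same character written with the \u escape.
    public const char FromEscape = '\u2019';

    // True: both constants hold the same UTF-16 code unit.
    public static bool SameCharacter()
    {
        return FromCast == FromEscape;
    }
}
```

By contrast, (char)146 is a different code unit (U+0092), which is why the cast in the original question did not match anything.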

Peter Duniho said:
[...] it appears
you are trying to convert a ’ character (UTF-16 0x2019) to a ' character
(UTF-16 0x27, the same as in ASCII).
[...]
On the first point, rather than the literals '’' and '\'', if you want
to you could use the \u escape with the actual UTF-16 values: '\u2019'
and '\u0027'.

mohaaron said:
To verify, I am trying to convert a "right single quote" to a "single
quote".

Okay, just as I expected.

mohaaron said:
I am wondering why you say to use the literal value for the
replacement char and the original char in the specialChars dictionary.
As far as I can tell there is no way to type the literal value for the
"right single quote" on the keyboard that I have which is a standard
US keyboard. How would you go about typing the literal then?

It depends. First, I don't assume you have a specific keyboard. There
may be keyboards, either configured as US but with modifications, or
simply non-US, which have that key. Second, I don't even assume a
specific OS, though of course most people writing .NET code are using
Windows.

On the Mac I'm using to post this message, I can enter the ’ character
directly, just by using the right key combination (Shift-Option-]).

That said, even on Windows with an unmodified US keyboard, you can
always use Alt-key combinations to enter specific Unicode characters.
Just hold the Alt key down while you enter the appropriate value on the
keypad, then release the Alt key and you get the character you wanted.

mohaaron said:
Using the actual UTF-16 values here seems easier and more efficient in
that I can easily use any char in the char set without the keyboard
involved.

You should use the technique that seems to work best for you. I prefer
character literals, because anyone looking at the code can tell exactly
what character you're talking about. But if you find hard-coded Unicode
constants using the \u escape easier to maintain (not just to enter;
code maintenance should be a consideration), then use that instead.

Pete
 
