string.Replace((char)146, (char)39) does not work?

mohaaron · Jan 30, 2010

Can anyone tell me why this code doesn't work?

// Key = Char to find, Value = Char to replace.
Hashtable specialChars = new Hashtable();
specialChars.Add((char)146, (char)39);

foreach (DictionaryEntry entry in specialChars)
{
    this.label.Text = this.label.Text.Replace(
        Convert.ToChar(entry.Key),
        Convert.ToChar(entry.Value));
}

Peter Duniho · Jan 30, 2010

mohaaron said:
Can anyone tell me why this code doesn't work?

Because C# strings are UTF-16, not some random extended ASCII encoding
you've chosen to use.

Hint: if you are casting numeric literals to System.Char, your code is
probably broken.

Both the character you are looking for and the character you want to use
as a replacement should be specified as actual characters. If you don't
want your .cs file to be a UTF-8 encoded file, you can use the \u escape
to specify a character literal using the _Unicode_ UTF-16 code for that
character.

Note that whatever your source extended ASCII encoding is, the character
found at decimal 146 in that encoding is going to have a different
character value in UTF-16. Without knowing exactly what extended ASCII
encoding you are intending to use, I can't say for sure what the actual
UTF-16 value you want is. But based on the code you posted, it appears
you are trying to convert a ’ character (UTF-16 0x2019) to a ' character
(UTF-16 0x27, the same as in ASCII). If so, then your code should look
like this:

Dictionary<char, char> specialChars = new Dictionary<char, char>();

specialChars.Add('’', '\'');

string str = label.Text;

foreach (KeyValuePair<char, char> kvp in specialChars)
{
    str = str.Replace(kvp.Key, kvp.Value);
}

label.Text = str;

Note some changes I've made:

– Use actual characters in the call to Add()

– Use the strongly-typed generic Dictionary class

– Defer reassignment of the Text property until all
the processing has been done; assigning the Text
property is much more expensive than just the basic
string operations going on, so avoiding doing it over
and over is a good habit to get into

On the first point, rather than the literals '’' and '\'', if you want
to you could use the \u escape with the actual UTF-16 values: '\u2019'
and '\u0027'.
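
As a minimal sketch of that \u-escape variant (the QuoteFixer class and Normalize method are hypothetical names used only for illustration, not something from the original code):

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, shown only to illustrate the \u-escape form.
static class QuoteFixer
{
    // Same mapping as in the post, written with \u escapes
    // instead of literal characters.
    static readonly Dictionary<char, char> SpecialChars =
        new Dictionary<char, char>
        {
            { '\u2019', '\u0027' } // RIGHT SINGLE QUOTATION MARK -> APOSTROPHE
        };

    public static string Normalize(string input)
    {
        string result = input;
        foreach (KeyValuePair<char, char> kvp in SpecialChars)
        {
            result = result.Replace(kvp.Key, kvp.Value);
        }
        return result;
    }

    static void Main()
    {
        Console.WriteLine(Normalize("it\u2019s")); // prints: it's
    }
}
```

The escapes compile to exactly the same char values as the literals, so the choice only affects how the source file reads.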

Note also that if you have a large number of characters that _might_ be
replaced, your proposed solution is very inefficient. You will do much
better scanning the string once, checking each character you find:

Dictionary<char, char> specialChars = new Dictionary<char, char>();

specialChars.Add('’', '\'');

string str = label.Text;
StringBuilder sb = new StringBuilder(str.Length);

foreach (char ch in str)
{
    char chAppend;

    if (specialChars.TryGetValue(ch, out chAppend))
    {
        sb.Append(chAppend);
    }
    else
    {
        sb.Append(ch);
    }
}

label.Text = sb.ToString();

For any input where the string and the list of special characters are
both long, the above should perform better (and for other input, the
performance difference isn't likely to matter).

If you really want to use some other character encoding, such as some
specific extended ASCII encoding, you need to:

– Be specific about which extended ASCII encoding you're using
– Create a byte[] that includes the characters you want to convert to
System.Char
– Use the Encoding class to convert the character data stored in the
byte[] from whatever specific extended ASCII encoding you're using to a
System.String, containing the UTF-16 character instances you want
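
Those three steps might look roughly like the sketch below. It assumes the source encoding is Windows-1252, which the thread never actually confirms, and it assumes the .NET Framework, where code page 1252 is available by default. LegacyDecode and FromWindows1252 are made-up names.

```csharp
using System;
using System.Text;

// Hypothetical helper; the Windows-1252 choice is an assumption.
static class LegacyDecode
{
    public static string FromWindows1252(byte[] bytes)
    {
        // Step 1: name the encoding explicitly (code page 1252 here).
        // On .NET Core / .NET 5+ this lookup needs
        // Encoding.RegisterProvider(CodePagesEncodingProvider.Instance)
        // to have been called first.
        Encoding legacy = Encoding.GetEncoding(1252);

        // Step 3: let the Encoding class produce UTF-16 chars.
        return legacy.GetString(bytes);
    }

    static void Main()
    {
        // Step 2: the raw byte value from the legacy encoding.
        byte[] raw = new byte[] { 146 };

        string s = FromWindows1252(raw);
        Console.WriteLine(s[0] == '\u2019');            // True
        Console.WriteLine(((int)s[0]).ToString("X4"));  // 2019
    }
}
```

Under that assumption, byte 146 decodes to U+2019, which is why the replacement has to search for '\u2019' rather than (char)146.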

As you can imagine, doing that is a lot more complicated than just doing
things in UTF-16 in the first place. I'd recommend the simpler approach.

Pete

mohaaron · Feb 1, 2010

Hello Pete,

Thank you so much for your answer. I really appreciate the
thoroughness of it. I have some questions which are inline below.

Peter Duniho said:
Because C# strings are UTF-16, not some random extended ASCII encoding
you've chosen to use.

Where can I find this information about C# strings being UTF-16, is
this part of the C# specification?

Peter Duniho said:
Hint: if you are casting numeric literals to System.Char, your code is
probably broken.

Will you please explain this?

Peter Duniho said:
[...] it appears
you are trying to convert a ’ character (UTF-16 0x2019) to a ' character
(UTF-16 0x27, the same as in ASCII).
[...]
On the first point, rather than the literals '’' and '\'', if you want
to you could use the \u escape with the actual UTF-16 values: '\u2019'
and '\u0027'.

To verify, I am trying to convert a "right single quote" to a "single
quote".

I am wondering why you say to use the literal value for the
replacement char and the original char in the specialChars dictionary.
As far as I can tell there is no way to type the literal value for the
"right single quote" on the keyboard that I have which is a standard
US keyboard. How would you go about typing the literal then?

Using the actual UTF-16 values here seems easier and more efficient in
that I can easily use any char in the char set without the keyboard
involved.

Peter Duniho said:
Note also that if you have a large number of characters that _might_ be
replaced, your proposed solution is very inefficient. You will do much
better scanning the string once, checking each character you find:
[...]

This looks like a good efficient solution to char replacement. Very
well done.

Thank you again for the helpful response. I really like understanding
the why in how this works.

Regards,

Aaron

Peter Duniho · Feb 1, 2010

mohaaron said:
[...]
Where can I find this information about C# strings being UTF-16, is
this part of the C# specification?

Yes, there and also in MSDN. See the docs for System.String and
System.Char, for example:
http://msdn.microsoft.com/en-us/library/system.string.aspx
http://msdn.microsoft.com/en-us/library/system.char.aspx
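
A quick way to see the UTF-16 representation for yourself is the small sketch below. Utf16Demo is a made-up name, and the second test character is just an arbitrary example from outside the Basic Multilingual Plane.

```csharp
using System;

// Hypothetical demo class, not part of any library.
static class Utf16Demo
{
    static void Main()
    {
        // A System.Char is one 16-bit UTF-16 code unit.
        Console.WriteLine(sizeof(char));                // 2

        // U+2019 fits in a single code unit.
        Console.WriteLine("\u2019".Length);             // 1

        // U+1F600 lies outside the BMP, so the string stores it
        // as a surrogate pair: two chars for one character.
        string s = "\U0001F600";
        Console.WriteLine(s.Length);                    // 2
        Console.WriteLine(char.IsSurrogatePair(s, 0));  // True
    }
}
```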

mohaaron said:
Will you please explain this?

Okay. Note I used the word "probably". The point being that if the
programmer is well-versed in the character-encoding issues, they are
very likely to use string and character literals, rather than casting
values.

It is possible to cast an actual UTF-16 value to a char type and get the
correct result. But most programmers familiar with the
character-encoding issues would write the literal with the \u escape
instead. It's the programmers who think they can just cast a
character value from some arbitrary encoding to char and get the right
result who are far more likely to write code with such casts.

You can successfully cast values, and I'm pretty sure the C# compiler
can even handle that conversion for you (i.e. you don't wind up with an
actual cast in the compiled output). So it's _possible_ to find
non-broken code that uses that technique.

It's just not statistically very common.
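
To make the distinction concrete, here is a small sketch (CastDemo and its constant names are invented for illustration). Because the casts appear in const declarations, the compiler has to evaluate them at compile time.

```csharp
using System;

// Hypothetical demo class, not from the original code.
static class CastDemo
{
    // Casting an actual UTF-16 value gives the intended character.
    public const char FromCast = (char)0x2019;
    public const char FromEscape = '\u2019';

    // Casting 146 gives U+0092, a C1 control character in Unicode,
    // not the right single quotation mark.
    public const char Broken = (char)146;

    static void Main()
    {
        Console.WriteLine(FromCast == FromEscape);  // True
        Console.WriteLine(Broken == FromEscape);    // False
        Console.WriteLine(Broken == '\u0092');      // True
    }
}
```

So Replace((char)146, (char)39) searches for U+0092, which is almost certainly not in the label text, and the string comes back unchanged.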

Peter Duniho said:
[...] it appears
you are trying to convert a ’ character (UTF-16 0x2019) to a ' character
(UTF-16 0x27, the same as in ASCII).
[...]
On the first point, rather than the literals '’' and '\'', if you want
to you could use the \u escape with the actual UTF-16 values: '\u2019'
and '\u0027'.

mohaaron said:
To verify, I am trying to convert a "right single quote" to a "single
quote".

Okay, just as I expected.

mohaaron said:
I am wondering why you say to use the literal value for the
replacement char and the original char in the specialChars dictionary.
As far as I can tell there is no way to type the literal value for the
"right single quote" on the keyboard that I have which is a standard
US keyboard. How would you go about typing the literal then?

It depends. First, I don't assume you have a specific keyboard. There
may be keyboards, either configured as US but with modifications, or
simply non-US, which have that key. Second, I don't even assume a
specific OS, though of course most people writing .NET code are using
Windows.

On the Mac I'm using to post this message, I can enter the ’ character
directly, just by using the right key combination (Shift-Option-]).

That said, even on Windows with an unmodified US keyboard, you can
always use Alt-key combinations to enter specific Unicode characters.
Just hold the Alt key down, while you enter the appropriate value on the
keypad, then release the Alt key and you get the character you wanted.

mohaaron said:
Using the actual UTF-16 values here seems easier and more efficient in
that I can easily use any char in the char set without the keyboard
involved.

You should use the technique that seems to work best for you. I prefer
character literals, because anyone looking at the code can tell exactly
what character you're talking about. But if you find hard-coded Unicode
constants using the \u escape easier to maintain (not just to enter; code
maintenance should be a consideration), then use that instead.

Pete
