String.Replace Anomaly

Levidikus

Normally, I never have any problems with String.Replace(). However, I
found that I need to replace multiple instances of the character
"ª" (\xAA) with a # symbol. The input file is a simple one line
file. I read the file into a string called strLine. Then when I
do a simple replace ... here is what I have tried:

strLine = strLine.Replace("ª", "#"); // Doesn't replace ...

strLine = strLine.Replace(@"ª", "#"); // Doesn't replace ...

strLine = strLine.Replace(@"\xAA", "#"); // Doesn't replace ...

if (strLine.Contains(@"\xAA")) MessageBox.Show("found one"); // No message box ...

if (strLine.Contains("ª")) MessageBox.Show("found one"); // No message box ...

if (strLine.Contains(@"ª")) MessageBox.Show("found one"); // No message box ...

Any ideas about either what I'm doing wrong, or a better way to
replace this persistent character that just won't go away?
 
Göran Andersson

Levidikus said:
Normally, I never have any problems with String.Replace(). However, I
found that I need to replace multiple instances of the character
"ª" (\xAA) with a # symbol. The input file is a simple one line
file. I read the file into a string called strLine. Then when I
do a simple replace ... here is what I have tried:

strLine = strLine.Replace("ª", "#"); // Doesn't replace ...

strLine = strLine.Replace(@"ª", "#"); // Doesn't replace ...

strLine = strLine.Replace(@"\xAA", "#"); // Doesn't replace ...

if (strLine.Contains(@"\xAA")) MessageBox.Show("found one"); // No message box ...

if (strLine.Contains("ª")) MessageBox.Show("found one"); // No message box ...

if (strLine.Contains(@"ª")) MessageBox.Show("found one"); // No message box ...

Any ideas about either what I'm doing wrong, or a better way to
replace this persistent character that just won't go away?

I think that you have tried every possible combination except the one
that works... Try this:

strLine = strLine.Replace("\xAA", "#");
 
Roman Wagner

Looks like there is a problem with your strLine.

The following works as expected:

string test = "\xAA ª";
MessageBox.Show(
String.Concat(test,Environment.NewLine,
test.Replace("ª", "#"),
Environment.NewLine,
test.Replace("\xAA", "#")));
 
UL-Tomten

found that I need to replace multiple instances of the character
"ª" (\xAA) with a # symbol. The input file is a simple one line

The following works for me:

string feminineIndicatorChar = Char.ConvertFromUtf32(0xaa);
string b = feminineIndicatorChar.Replace(feminineIndicatorChar,
"#"); // b == "#"

Are you sure your strLine really contains the feminine ordinal
indicator? Can you check in a debugger?
 
Jon Skeet [C# MVP]

Normally, I never have any problems with String.Replace(). However, I
found that I need to replace multiple instances of the character
"ª" (\xAA) with a # symbol. The input file is a simple one line
file. I read the file into a string called strLine. Then when I
do a simple replace ... here is what I have tried:

strLine = strLine.Replace(@"\xAA", "#"); // Doesn't replace ...

Your mistake is using a verbatim string literal. That is looking for a
substring of backslash, x, A, A. You want it to look for the string
represented by Unicode U+00AA, i.e. '\xAA'. In other words, you *want*
the character escaping which verbatim string literals remove. Just get
rid of the @ and it will be fine.

I would warn against using \x though - because the number of
characters used varies. For instance:

\xAAOkay - does what you want
\xAABad - doesn't do what you want (it'll be U+AABA and then 'd')

Use \u00aa instead - then there's no ambiguity.

Jon
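
To make the literal forms discussed above concrete, here is a minimal sketch. The class and member names (EscapeDemo, ReplaceIndicator) are made up for illustration and are not from the original post; it only shows how the C# compiler treats each literal.

```csharp
using System;

public static class EscapeDemo
{
    // Regular literal: the escape is processed into the single character U+00AA.
    public const string Escaped = "\u00AA";

    // Verbatim literal: no escape processing, so this is four characters: \ x A A.
    public const string Verbatim = @"\xAA";

    // \x consumes up to four hex digits, so this is U+AABA followed by 'd'.
    public const string Greedy = "\xAABad";

    // Hypothetical helper: replaces every feminine ordinal indicator with '#'.
    public static string ReplaceIndicator(string input)
    {
        return input.Replace("\u00AA", "#");
    }

    public static void Main()
    {
        Console.WriteLine(Escaped.Length);   // 1
        Console.WriteLine(Verbatim.Length);  // 4
        Console.WriteLine(Greedy.Length);    // 2
        Console.WriteLine(ReplaceIndicator("a\u00AAb\u00AA")); // a#b#
    }
}
```

The replace only succeeds if strLine really contains U+00AA, which depends on how the file was decoded.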
 
Doug Semler

Normally, I never have any problems with String.Replace(). However, I
found that I need to replace multiple instances of the character
"ª" (\xAA) with a # symbol. The input file is a simple one line
file. I read the file into a string called strLine. Then when I
do a simple replace ... here is what I have tried:

strLine = strLine.Replace("ª", "#"); // Doesn't replace ...

strLine = strLine.Replace(@"ª", "#"); // Doesn't replace ...

strLine = strLine.Replace(@"\xAA", "#"); // Doesn't replace ...

if (strLine.Contains(@"\xAA")) MessageBox.Show("found one"); // No message box ...

if (strLine.Contains("ª")) MessageBox.Show("found one"); // No message box ...

if (strLine.Contains(@"ª")) MessageBox.Show("found one"); // No message box ...

Any ideas about either what I'm doing wrong, or a better way to
replace this persistent character that just won't go away?

*HOW* are you reading your text file? The encoding you read with
has to match the encoding of the file. In this case, you'll probably need to
read the file with UTF7 encoding unless there is an encoding
specifier at the beginning of the file.

I tried a file (and used File.ReadAllText()) with only 0xAA characters,
and the replace did nothing unless I read the file as UTF7...
 
Jon Skeet [C# MVP]

*HOW* are you reading your text file? The encoding you read with
has to match the encoding of the file. In this case, you'll probably need to
read the file with UTF7 encoding unless there is an encoding
specifier at the beginning of the file.

I tried a file (and used File.ReadAllText()) with only 0xAA characters,
and the replace did nothing unless I read the file as UTF7...

UTF-7 is *very* rarely used - basically it's used in mail and that's
virtually it, as far as I'm aware. How did you save your file?

That isn't the problem in this case, however.

Jon
 
Doug Semler

UTF-7 is *very* rarely used - basically it's used in mail and that's
virtually it, as far as I'm aware. How did you save your file?

That isn't the problem in this case, however.

Jon

Then why couldn't I get the string replace to work if I didn't specify
the encoding as UTF7?
 
Levidikus

Thank you very much for all the valuable information!

I am reading the file in using a standard StreamReader, without any
special flags.
 
Doug Semler

UTF-7 is *very* rarely used - basically it's used in mail and that's
virtually it, as far as I'm aware. How did you save your file?

That isn't the problem in this case, however.

Jon

Sorry...Clicked send accidentally:

Start Notepad. type 1234567890\n\r(10 bytes of 0xAA)
Save file (ANSI)
Verified that there was no encoding indicator on the file (file is 22
bytes)
Windows Vista (if it matters).

The following lines were used, with the indicated Encoding:
string foo = File.ReadAllText(@"C:\users\doug\documents\test.txt", encoding);
foo = foo.Replace("\u00AA", "#");
Console.WriteLine(encoding);
Console.WriteLine(foo);

ASCII
1234567890
??????????
UTF7
1234567890
##########
UTF8
1234567890
??????????
UTF32
?????
Unicode
???????????
 
Doug Semler

Sorry...Clicked send accidentally:

Start Notepad. type 1234567890\n\r(10 bytes of 0xAA)
Save file (ANSI)
Verified that there was no encoding indicator on the file (file is 22
bytes)
Windows Vista (if it matters).

The following lines were used, with the indicated Encoding:
string foo = File.ReadAllText(@"C:\users\doug\documents\test.txt", encoding);
foo = foo.Replace("\u00AA", "#");
Console.WriteLine(encoding);
Console.WriteLine(foo);

ASCII
1234567890
??????????
UTF7
1234567890
##########
UTF8
1234567890
??????????
UTF32
?????
Unicode
???????????

P.S. Not specifying an encoding gives me the ASCII result.
Specifying Encoding.Default (which resolves to SBCSCodePageEncoding)
gives me the correct behavior.
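
The experiment above can be reproduced without a file by decoding the raw bytes directly. This is only a sketch: the names DecodeDemo, SampleBytes and DecodeAndReplace are invented here, and ISO-8859-1 stands in for the machine's "ANSI" code page, which is an assumption (on a Western European Windows install that code page is usually Windows-1252, which also maps byte 0xAA to U+00AA).

```csharp
using System;
using System.Text;

public static class DecodeDemo
{
    // Ten bytes of 0xAA, like the second line of the test file described above.
    public static byte[] SampleBytes()
    {
        byte[] bytes = new byte[10];
        for (int i = 0; i < bytes.Length; i++) bytes[i] = 0xAA;
        return bytes;
    }

    // Decodes the sample bytes with the given encoding, then runs the replace.
    public static string DecodeAndReplace(Encoding encoding)
    {
        string text = encoding.GetString(SampleBytes());
        return text.Replace("\u00AA", "#");
    }

    public static void Main()
    {
        // ASCII cannot represent 0xAA, so every byte decodes to '?'.
        Console.WriteLine(DecodeAndReplace(Encoding.ASCII));

        // 0xAA alone is not valid UTF-8, so no U+00AA is produced and nothing is replaced.
        Console.WriteLine(DecodeAndReplace(Encoding.UTF8).Contains("#"));

        // ISO-8859-1 maps byte 0xAA to U+00AA, so the replace succeeds.
        Console.WriteLine(DecodeAndReplace(Encoding.GetEncoding("iso-8859-1")));
    }
}
```

In every failing case the string never contained U+00AA to begin with, so String.Replace had nothing to match.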
 
Jon Skeet [C# MVP]

P.S. Not specifying an encoding gives me the ASCII result.
Specifying Encoding.Default (which resolves to SBCSCodePageEncoding)
gives me the correct behavior.

And that's because Encoding.Default uses the same as what "ANSI" means
in Notepad. UTF-7 just *happened* to work - and I suspect it shouldn't
really have done.

When you don't specify an encoding, almost everything in .NET assumes
UTF-8.

Jon
 
Doug Semler

And that's because Encoding.Default uses the same as what "ANSI" means
in Notepad. UTF-7 just *happened* to work - and I suspect it shouldn't
really have done.

When you don't specify an encoding, almost everything in .NET assumes
UTF-8.

Right. But my entire point is that the OP needs to specify the
correct encoding when opening the file. If he doesn't do that, NONE
of the (correct) solutions pointed out earlier will work. In this
case Encoding.Default (if you say UTF7 is wrong) needs to be passed to
the StreamReader constructor.
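
As a rough sketch of passing an encoding to the StreamReader constructor: the class and method names below (ReaderDemo, ReadFirstLine, Demo) are hypothetical, and ISO-8859-1 is used in place of Encoding.Default as an assumption, because Encoding.Default differs between machines and runtimes.

```csharp
using System;
using System.IO;
using System.Text;

public static class ReaderDemo
{
    // Reads the first line of a file, decoding it with the given encoding.
    public static string ReadFirstLine(string path, Encoding encoding)
    {
        using (var reader = new StreamReader(path, encoding))
        {
            return reader.ReadLine() ?? string.Empty;
        }
    }

    // Writes a small single-byte file, reads it back, and runs the replace.
    public static string Demo()
    {
        string path = Path.GetTempFileName();
        try
        {
            // 'A', 0xAA, 'B', 0xAA with no byte order mark.
            File.WriteAllBytes(path, new byte[] { 0x41, 0xAA, 0x42, 0xAA });
            Encoding latin1 = Encoding.GetEncoding("iso-8859-1");
            return ReadFirstLine(path, latin1).Replace("\u00AA", "#");
        }
        finally
        {
            File.Delete(path);
        }
    }

    public static void Main()
    {
        Console.WriteLine(Demo()); // A#B#
    }
}
```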
 
UL-Tomten

P.S. Not specifying an encoding gives me the ASCII result.
Specifying Encoding.Default (which resolves to SBCSCodePageEncoding)
gives me the correct behavior.

The default single-byte character set code page encoding (==
SBCSCodePageEncoding) is there to provide an encoding-less encoding,
as far as I can tell. It is to encodings what InvariantCulture is to
cultures: you can use it if you don't care about the encoding and
nobody but you will read what you've written using it. (In other
words; if and only if you wrote the file using Encoding.Default on the
same OS installation, it's safe to use Encoding.Default to read it
back.)
 
Jon Skeet [C# MVP]

UL-Tomten said:
The default single-byte character set code page encoding (==
SBCSCodePageEncoding) is there to provide an encoding-less encoding,
as far as I can tell. It is to encodings what InvariantCulture is to
cultures: you can use it if you don't care about the encoding and
nobody but you will read what you've written using it. (In other
words; if and only if you wrote the file using Encoding.Default on the
same OS installation, it's safe to use Encoding.Default to read it
back.)

The bit in brackets is right - but it's *not* the same as saying it's
an "encoding-less encoding".

An encoding is basically a mapping between byte sequences and character
sequences. 8859-1 is as close to an "encoding-less encoding" as you'll
get, as it maps bytes 0-255 to Unicode 0-255; Encoding.Default doesn't
necessarily do that (and indeed doesn't in most environments).

For instance, on my box byte 128 converts to U+20AC (the Euro symbol).

Use of Encoding.Default should be regarded as "legacy" really - few
things should just use the default encoding for the OS.
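
The claim that 8859-1 maps bytes 0-255 straight onto Unicode 0-255 can be checked with a short sketch. Latin1Demo and IsIdentityMapping are names made up for this example.

```csharp
using System;
using System.Text;

public static class Latin1Demo
{
    // Returns true if every byte value b decodes to the character with code point b.
    public static bool IsIdentityMapping()
    {
        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");

        byte[] all = new byte[256];
        for (int i = 0; i < 256; i++) all[i] = (byte)i;

        string decoded = latin1.GetString(all);
        if (decoded.Length != 256) return false;

        for (int i = 0; i < 256; i++)
        {
            if (decoded[i] != (char)i) return false;
        }
        return true;
    }

    public static void Main()
    {
        Console.WriteLine(IsIdentityMapping());
    }
}
```

The same loop run against an OS default code page would generally fail for some bytes, as with the byte 128 example above.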
 
UL-Tomten

The bit in brackets is right - but it's *not* the same as saying it's
an "encoding-less encoding".

Well, since an encoding by definition specifies encoding rules, I
thought that much was obvious... =]

Maybe there should have been an Encoding.InvariantEncoding instead of
an Encoding.Default, to communicate that the resulting bits are
unknown at compile-time, and perhaps avoid the temptation of using it
for text others might read back.
 
UL-Tomten

8859-1 is as close to an "encoding-less encoding" as you'll
get, as it maps bytes 0-255 to Unicode 0-255;

I've always thought of that more as a curse than a blessing.
 
Jon Skeet [C# MVP]

The bit in brackets is right - but it's *not* the same as saying it's
an "encoding-less encoding".

Well, since an encoding by definition specifies encoding rules, I
thought that much was obvious... =]

But there's such a thing as a "trivial" encoding, which pretty much
sums up ISO-8859-1.

Maybe there should have been an Encoding.InvariantEncoding instead of
an Encoding.Default, to communicate that the resulting bits are
unknown at compile-time, and perhaps avoid the temptation of using it
for text others might read back.

InvariantEncoding sounds like it would do the same on all boxes,
regardless of environment though. Encoding.Default *isn't* invariant -
it varies by environment. I'd have preferred
Encoding.OperatingSystemDefault or something similar. Certainly it
gets confusing that Encoding.Default isn't the encoding which is used
by default by most .NET classes :)

Jon
 
Levidikus

[snip]

Thank you again for all of the outstanding responses.

The file that I am working with originated on a Solaris 8 Unix
system. How would I go about identifying the correct "encoding"?
Also, with File.ReadAllText(), would I even need a StreamReader
for that?

Once again, thanks for all the feedback!

James
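
One possible way to approach both questions, offered as a sketch rather than a definitive answer: there is no reliable way to detect an encoding from the bytes alone, but you can at least test whether the file is valid UTF-8 and fall back to a single-byte encoding otherwise. The fallback to ISO-8859-1 is an assumption about the Solaris locale that produced the file, and the names EncodingProbe, IsValidUtf8 and ReadWithFallback are invented here. File.ReadAllText has an overload that takes an Encoding, so no StreamReader is needed.

```csharp
using System;
using System.IO;
using System.Text;

public static class EncodingProbe
{
    // True if the bytes decode as UTF-8 without any invalid sequences.
    public static bool IsValidUtf8(byte[] bytes)
    {
        // Second argument: throw on invalid bytes instead of substituting U+FFFD.
        var strict = new UTF8Encoding(false, true);
        try
        {
            strict.GetString(bytes);
            return true;
        }
        catch (DecoderFallbackException)
        {
            return false;
        }
    }

    // Reads the file as UTF-8 when it is valid UTF-8, otherwise as ISO-8859-1 (assumed).
    public static string ReadWithFallback(string path)
    {
        byte[] bytes = File.ReadAllBytes(path);
        Encoding encoding = IsValidUtf8(bytes)
            ? (Encoding)new UTF8Encoding(false)
            : Encoding.GetEncoding("iso-8859-1");
        return File.ReadAllText(path, encoding);
    }
}
```

A file consisting of lone 0xAA bytes fails the UTF-8 check, which is consistent with the replace not matching when no encoding is specified. Looking at the raw bytes in a hex viewer, or asking whoever produced the file which locale was in use, remains the more dependable route.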
 
