Reading Unicode escape sequences from File

John Ztwin · Jun 20, 2008

Hello,

I have a file that contains ordinary text and some special characters in
Unicode escape sequences (\uxxxx).

When I read the file using e.g. StreamReader, the Unicode escape sequences are
not converted to their character representation. They are shown exactly the
same way as in the file. Literals assigned to variables in C# code are shown correctly.

Can anyone tell me how to read Unicode escape sequences from a file so that they
are presented like literals?

Thanks,

Jon Skeet [C# MVP] · Jun 20, 2008

John Ztwin said:
I have a file that contains ordinary text and some special characters in
Unicode escape sequences (\uxxxx).

When I read the file using e.g. StreamReader, the Unicode escape sequences are
not converted to their character representation.

No, I wouldn't expect them to be. That's done by the C# compiler - it
would be a big mistake for it to be done by StreamReader.

They are shown exactly the
same way as in the file. Literals assigned to variables in C# code are shown correctly.

Can anyone tell me how to read Unicode escape sequences from a file so that they
are presented like literals?

You basically need to parse the text you've read, just like the C#
compiler does. You can search for \u fairly easily, then take the next
four digits, complain if they're not all hex, convert the hex to a
char, then replace the whole section with the character value.
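
The steps described above could be sketched roughly as follows. This is only an illustration, not a complete parser: the class and method names (EscapeParser, Unescape) are made up for the example, and it assumes that every \u in the text starts an escape, so an escaped backslash such as \\u is not treated specially.

```csharp
using System;
using System.Globalization;
using System.Text;

public static class EscapeParser
{
    // Hypothetical helper: replaces each \uXXXX with the UTF-16 code unit it names.
    // Assumption: every "\u" in the input is meant to start an escape.
    public static string Unescape(string text)
    {
        var result = new StringBuilder(text.Length);
        int pos = 0;
        while (pos < text.Length)
        {
            // Search for the next \u.
            int start = text.IndexOf("\\u", pos, StringComparison.Ordinal);
            if (start < 0)
            {
                // No more escapes: copy the rest unchanged.
                result.Append(text, pos, text.Length - pos);
                break;
            }
            // Copy the ordinary text that precedes the escape.
            result.Append(text, pos, start - pos);

            // Take the next four characters and complain if they are not all hex.
            if (start + 6 > text.Length)
                throw new FormatException("Truncated escape at index " + start);
            string hex = text.Substring(start + 2, 4);
            foreach (char c in hex)
            {
                if (!Uri.IsHexDigit(c))
                    throw new FormatException("Invalid escape at index " + start);
            }

            // Convert the hex digits to a char and emit it in place of the escape.
            result.Append((char)int.Parse(hex, NumberStyles.AllowHexSpecifier,
                                          CultureInfo.InvariantCulture));
            pos = start + 6;
        }
        return result.ToString();
    }
}
```

With this sketch, EscapeParser.Unescape applied to the nine characters caf\u00E9 read from a file returns the four-character string café, and an input such as \u12G4 raises a FormatException.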

John Ztwin · Jun 20, 2008

A little bit more work than in Java, if I remember right.
Thanks for the reply!

Jon Skeet [C# MVP] · Jun 20, 2008

John Ztwin said:
A little bit more work than in Java, if I remember right.

Well, not if you use the normal BufferedReader and InputStreamReader in
Java.

Java's Properties class will do the unescaping for properties files,
but it isn't general purpose.

Arne Vajhøj · Jun 21, 2008

John said:
I have a file that contains ordinary text and some special characters in
Unicode escape sequences (\uxxxx).

When I read the file using e.g. StreamReader, the Unicode escape sequences are
not converted to their character representation. They are shown exactly the
same way as in the file. Literals assigned to variables in C# code are shown correctly.

Can anyone tell me how to read Unicode escape sequences from a file so that they
are presented like literals?

You will need to make a text replace.

Example code:

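// Requires: using System.Globalization; using System.Text.RegularExpressions;
public static string U2U(string s)
{
    string res = s;
    // Find every \uXXXX escape (four hex digits, upper or lower case).
    MatchCollection reg = Regex.Matches(res, @"\\u([0-9A-Fa-f]{4})");
    for (int i = 0; i < reg.Count; i++) {
        // Groups[0] is the whole escape, Groups[1] holds the four hex digits.
        res = res.Replace(reg[i].Groups[0].Value, "" +
            (char)int.Parse(reg[i].Groups[1].Value, NumberStyles.HexNumber));
    }
    return res;
}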

Arne
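
A possible variant of the text-replace idea, sketched here with Regex.Replace and a match evaluator so that each escape is substituted in a single pass over the string. The class name EscapedFileReader, the method names, and the choice of UTF-8 for reading the file are assumptions made for the example, not something taken from the thread.

```csharp
using System.Globalization;
using System.IO;
using System.Text;
using System.Text.RegularExpressions;

public static class EscapedFileReader
{
    // Matches a backslash, a lowercase u, and exactly four hex digits.
    private static readonly Regex Escape = new Regex(@"\\u([0-9A-Fa-f]{4})");

    // Replaces every \uXXXX escape with the character it encodes.
    public static string Unescape(string s)
    {
        return Escape.Replace(s, m =>
            ((char)int.Parse(m.Groups[1].Value, NumberStyles.HexNumber,
                             CultureInfo.InvariantCulture)).ToString());
    }

    // Reads the whole file and unescapes it.
    // Assumption: the file is UTF-8 (plain ASCAPE-free ASCII files are also valid UTF-8).
    public static string ReadUnescaped(string path)
    {
        return Unescape(File.ReadAllText(path, Encoding.UTF8));
    }
}
```

Because the substitution happens in one pass, text produced by a replacement is not scanned again for further escapes.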

Michael Justin · Jun 22, 2008

John said:
I have a file that contains ordinary text and some special characters in
Unicode escape sequences (\uxxxx).

If the file always uses \u then there is no risk. However, some
standards (like the C# spec) allow \U (uppercase) escape sequences:

unicode-escape-sequence:
    \u hex-digit hex-digit hex-digit hex-digit
    \U hex-digit hex-digit hex-digit hex-digit hex-digit hex-digit hex-digit hex-digit

http://msdn.microsoft.com/en-us/library/aa664812.aspx

Best regards
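
If the file may also contain the eight-digit \U form, one way to handle both forms is sketched below. The class name LongEscapes is invented for the example. It relies on char.ConvertFromUtf32, which returns a surrogate pair for code points above U+FFFF and throws ArgumentOutOfRangeException for values that are not valid code points.

```csharp
using System.Globalization;
using System.Text.RegularExpressions;

public static class LongEscapes
{
    // Group 1: eight hex digits after \U. Group 2: four hex digits after \u.
    // Regex matching is case-sensitive by default, so \U and \u are distinct.
    private static readonly Regex Escape =
        new Regex(@"\\U([0-9A-Fa-f]{8})|\\u([0-9A-Fa-f]{4})");

    public static string Unescape(string s)
    {
        return Escape.Replace(s, m =>
        {
            if (m.Groups[1].Success)
            {
                // \UXXXXXXXX names a full code point, which may need two chars.
                int codePoint = int.Parse(m.Groups[1].Value, NumberStyles.HexNumber,
                                          CultureInfo.InvariantCulture);
                return char.ConvertFromUtf32(codePoint);
            }
            // \uXXXX names a single UTF-16 code unit.
            return ((char)int.Parse(m.Groups[2].Value, NumberStyles.HexNumber,
                                    CultureInfo.InvariantCulture)).ToString();
        });
    }
}
```

For example, the text \U0001F600 becomes a string of two chars (a surrogate pair), while \u0041 still becomes the single char A.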
