Unicode Parsing

headware · Apr 24, 2009

I'm reading in a file that contains unicode code point values spelled
out in plain text. An example would look like this:

"this is some \U+2205 text from the file"

What I need to do is be able to take any one of these unicode
codepoint strings and convert them to the actual unicode string value.
A single case would be something like this:

s = s.Replace("\\U+2205", "\u2205");

That results in the correct string value but, of course, it only
works for one code point. I need a solution that will work for any
code point. I can extract the value after the "\U+" but I don't know
what to do with it to get it into a regular C# string. Any ideas?

Thanks,
Dave

headware · Apr 25, 2009

Thanks to everyone for the help. This is what I ended up doing:

MatchEvaluator eval = new MatchEvaluator(EvaluateUnicodeMatch);
s = Regex.Replace(s, "(\\\\[Uu]\\+[0-9a-fA-F]{4})", eval);

private string EvaluateUnicodeMatch(Match m)
{
    Group g = m.Groups[1];
    // Skip the leading "\U+" to get the hex code point value.
    string strNum = g.Value.Substring(3);
    // Parse the hex digits and turn them into a one-character string.
    string ucValue = ((char)Int32.Parse(strNum,
        System.Globalization.NumberStyles.HexNumber)).ToString();
    return ucValue;
}

and it seems to be working. It doesn't handle the case of surrogate
pairs, but I'll have to think about whether or not that's necessary.

Thanks again,
Dave

Tim Roberts · Apr 25, 2009

headware said:
It doesn't handle the case of surrogate
pairs, but I'll have to think about whether or not that's necessary.

Why do you think that? A surrogate pair should be represented by two
consecutive \U+xxxx strings, which your code will correctly translate into
two Unicode characters. The fact that they are a surrogate pair should
only be an issue when you render the string.
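
The point about consecutive escapes can be checked with a small sketch. It assumes the same four-digit \U+xxxx format used in the thread; the class name SurrogatePairDemo and the sample input are made up for illustration. Each escape is decoded to one UTF-16 code unit, and two adjacent halves end up forming a valid surrogate pair in the resulting string.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

public static class SurrogatePairDemo
{
    // Same idea as the code in the thread: every \U+xxxx escape
    // becomes exactly one UTF-16 code unit (one C# char).
    public static string Decode(string input)
    {
        return Regex.Replace(
            input,
            @"\\[Uu]\+([0-9a-fA-F]{4})",
            m => ((char)int.Parse(m.Groups[1].Value, NumberStyles.HexNumber)).ToString());
    }

    public static void Main()
    {
        // U+1D11E (musical symbol G clef) spelled as its two surrogate halves.
        string decoded = Decode(@"\U+D834\U+DD1E");

        Console.WriteLine(decoded.Length);                                // 2
        Console.WriteLine(char.IsSurrogatePair(decoded, 0));              // True
        Console.WriteLine(char.ConvertToUtf32(decoded, 0).ToString("X")); // 1D11E
    }
}
```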

Pavel Minaev · Apr 25, 2009

Tim Roberts said:
Why do you think that? A surrogate pair should be represented by two
consecutive \U+xxxx strings, which your code will correctly translate into
two Unicode characters.

I think he rather meant the case of a single \U+xxxxxxxx escape, which
should (if supported) be translated to a surrogate pair for C# UTF-16
strings.
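
If such an escape does need to be supported, one possible approach is char.ConvertFromUtf32, which returns a surrogate pair for code points above U+FFFF. The sketch below is only an illustration under assumptions that the thread does not confirm: the escape is taken to have either exactly eight or exactly four hex digits, values above U+10FFFF are left untouched, and a lone surrogate half is kept as a single char so that two consecutive four-digit escapes still combine. The class name CodePointEscapes is hypothetical.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

public static class CodePointEscapes
{
    // Assumed format: \U+ followed by exactly 8 or exactly 4 hex digits.
    // The 8-digit alternative is tried first.
    private static readonly Regex Escape =
        new Regex(@"\\[Uu]\+([0-9a-fA-F]{8}|[0-9a-fA-F]{4})");

    public static string Decode(string input)
    {
        return Escape.Replace(input, m =>
        {
            // uint avoids negative results for values such as FFFFFFFF.
            uint value = uint.Parse(m.Groups[1].Value, NumberStyles.HexNumber);

            // Not a Unicode code point: leave the escape text as it was.
            if (value > 0x10FFFF)
                return m.Value;

            // Lone surrogate half: ConvertFromUtf32 would throw, so keep it
            // as a single UTF-16 code unit, as the original code does.
            if (value >= 0xD800 && value <= 0xDFFF)
                return ((char)value).ToString();

            // Everything else: one char, or a surrogate pair above U+FFFF.
            return char.ConvertFromUtf32((int)value);
        });
    }

    public static void Main()
    {
        string s = Decode(@"empty set \U+2205, clef \U+0001D11E");
        Console.WriteLine(s);
    }
}
```

Note that a four-digit escape immediately followed by four more hex digits of ordinary text would be read as one eight-digit escape; the file format would have to rule that out or use a delimiter.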
