Unicode Parsing

headware

I'm reading in a file that contains unicode code point values spelled
out in plain text. An example would look like this:

"this is some \U+2205 text from the file"

What I need to do is take any one of these Unicode code point
strings and convert it to the actual Unicode character it names.
A single case would be something like this:

s = s.Replace("\\U+2205", "\u2205");

That results in the correct string value but, of course, it only
works for one code point. I need a solution that will work for any
code point. I can extract the value after the "\U+" but I don't know
what to do with it to get it into a regular C# string. Any ideas?

Thanks,
Dave
 
headware

Thanks to everyone for the help. This is what I ended up doing:

MatchEvaluator eval = new MatchEvaluator(EvaluateUnicodeMatch);
s = Regex.Replace(s, "(\\\\[Uu]\\+[0-9a-fA-F]{4})", eval);

private string EvaluateUnicodeMatch(Match m)
{
    Group g = m.Groups[1];
    // Skip the leading "\U+" to get the four hex digits of the code point.
    string strNum = g.Value.Substring(3);
    // Parse the hex digits and cast the result to a single UTF-16 char.
    string ucValue = ((char)Int32.Parse(strNum,
        System.Globalization.NumberStyles.HexNumber)).ToString();
    return ucValue;
}

and it seems to be working. It doesn't handle the case of surrogate
pairs, but I'll have to think about whether or not that's necessary.

Thanks again,
Dave
 
Tim Roberts

headware said:
and it seems to be working. It doesn't handle the case of surrogate
pairs, but I'll have to think about whether or not that's necessary.

Why do you think that? A surrogate pair should be represented by two
consecutive \U+xxxx strings, which your code will correctly translate into
two Unicode characters. The fact that they are a surrogate pair should
only be an issue when you render the string.
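A small check of that claim, as a sketch rather than a definitive test: it decodes each 4-digit escape to one UTF-16 code unit, the same way headware's code does, and then asks the runtime whether two consecutive escapes form a surrogate pair. The class name SurrogateEscapeDemo and the sample input are made up for illustration.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

public static class SurrogateEscapeDemo
{
    // Each \U+xxxx (exactly four hex digits) becomes one UTF-16 code unit.
    public static string Decode(string input)
    {
        return Regex.Replace(
            input,
            @"\\[Uu]\+([0-9a-fA-F]{4})",
            m => ((char)int.Parse(m.Groups[1].Value, NumberStyles.HexNumber)).ToString());
    }

    public static void Main()
    {
        // U+1F600 written as its two surrogate halves, D83D followed by DE00.
        string decoded = Decode(@"\U+D83D\U+DE00");

        Console.WriteLine(decoded.Length);                                // 2
        Console.WriteLine(char.IsSurrogatePair(decoded, 0));              // True
        Console.WriteLine(char.ConvertToUtf32(decoded, 0).ToString("X")); // 1F600
    }
}
```

So a file that already spells out both halves decodes to a valid pair. This says nothing about a file that writes the code point as a single longer escape.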
 
Pavel Minaev

Tim Roberts said:
Why do you think that? A surrogate pair should be represented by two
consecutive \U+xxxx strings, which your code will correctly translate into
two Unicode characters.

I think he rather meant the case of a single \U+xxxxxxxx escape, which
should (if supported) be translated to a surrogate pair for C# UTF-16
strings.
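If the file can contain such longer escapes, one possible approach is to parse the full code point and let char.ConvertFromUtf32 build the string, since it returns a surrogate pair for anything above U+FFFF. The sketch below assumes an escape is \U+ followed by 4 to 8 hex digits; the thread only shows the 4-digit form, so the longer form and the CodePointEscapes class name are assumptions. Note that a greedy 4-to-8 digit match will also swallow hex digits that happen to follow a 4-digit escape directly, so this only works if the file format rules that out.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

public static class CodePointEscapes
{
    // Assumed format: \U+ or \u+ followed by 4 to 8 hex digits.
    private static readonly Regex Escape =
        new Regex(@"\\[Uu]\+([0-9a-fA-F]{4,8})");

    public static string Decode(string input)
    {
        return Escape.Replace(input, m =>
        {
            uint codePoint = uint.Parse(m.Groups[1].Value, NumberStyles.HexNumber);

            // Beyond the Unicode range: leave the escape text untouched.
            if (codePoint > 0x10FFFF)
                return m.Value;

            // A lone surrogate half: emit it as a single UTF-16 code unit,
            // so two consecutive halves still combine into a pair.
            if (codePoint >= 0xD800 && codePoint <= 0xDFFF)
                return ((char)codePoint).ToString();

            // One char for the BMP, a surrogate pair above U+FFFF.
            return char.ConvertFromUtf32((int)codePoint);
        });
    }

    public static void Main()
    {
        string s = Decode(@"empty set \U+2205, emoji \U+0001F600");

        Console.WriteLine(s.Contains("\u2205"));                 // True
        Console.WriteLine(char.IsSurrogatePair(s, s.Length - 2)); // True
    }
}
```

The surrogate-range branch is there because char.ConvertFromUtf32 throws for values between U+D800 and U+DFFF, and a file that spells out the two halves separately should still decode.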
 
