Valid Characters

preport

I'm trying to ensure that all the characters in my XML document are
characters specified in this document:
http://www.w3.org/TR/2000/REC-xml-20001006#charsets

Would a function like this work:

private static string formatXMLString(string n)
{
    if (string.IsNullOrEmpty(n)) return n;
    System.Text.StringBuilder sb = new System.Text.StringBuilder();
    char[] chrs = n.ToCharArray();
    char c;
    int x, j = chrs.Length;
    for (x = 0; x < j; x++)
    {
        c = chrs[x];
        if (c == 0x9 || c == 0xA || c == 0xD ||
            (c > 0x20 && c < 0xd7ff) ||
            (c > 0xe000 && c < 0xffd) ||
            (c > 0x10000 && c < 0x10ffff))
        {
            sb.Append(c);
        }
    }
    return sb.ToString();
}

I've never compared characters like this before (against 0x9, 0xffd, etc.).
I'm not trying to be lazy and avoid testing it myself; I just don't know
whether this type of character comparison is the correct logic for the
results I'm looking for.

Any input?
 
David Browne

preport said:
Any input?

Sure. Don't be lazy.

And a char is a 2-byte type, so your literals should all be 2-byte literals,
and should be cast to char for comparison.

eg
char space = (char)0x0020;

David
 
Luke Zhang [MSFT]

Hello,

The "char" data type in C# holds a 16-bit Unicode character, and its range
is from U+0000 to U+ffff. Therefore, the following line may not be
necessary in your code:

(c > 0x10000 && c < 0x10ffff))

That is beyond the range of a C# char, so we won't get such a value in a
C# application.

When you load a string or file into an XmlDocument, the characters are
validated and an exception is thrown if any of them is invalid.
Your function checks the string before that happens. I think this is a good
approach, since you control the validation. However, is it possible that
some data will be lost if you just remove the invalid characters? How about
throwing an exception instead?

Sincerely,

Luke Zhang
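
The "reject instead of strip" idea above could be sketched as follows. This assumes a runtime that has XmlConvert.VerifyXmlChars (available in .NET Framework 4.0 and later, so newer than the code in this thread); the class and method names XmlTextGuard and IsValidXmlText are hypothetical and only for illustration.

```csharp
using System;
using System.Xml;

public static class XmlTextGuard
{
    // Returns true when every character in text is allowed in XML 1.0 content.
    public static bool IsValidXmlText(string text)
    {
        if (string.IsNullOrEmpty(text)) return true;
        try
        {
            // VerifyXmlChars returns its argument unchanged, or throws
            // XmlException at the first character XML 1.0 does not allow.
            XmlConvert.VerifyXmlChars(text);
            return true;
        }
        catch (XmlException)
        {
            return false;
        }
    }
}
```

A caller that prefers Luke's suggestion can call XmlConvert.VerifyXmlChars directly and let the XmlException propagate, which reports bad data at the point where the object is populated rather than on the client.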

 
Mihai N.

Luke Zhang [MSFT] said:
The "char" data type in C# holds a 16-bit Unicode character, and its range
is from U+0000 to U+ffff. Therefore, the following line may not be
necessary in your code:

(c > 0x10000 && c < 0x10ffff))

That is beyond the range of a C# char, so we won't get such a value in a
C# application.

C# strings are UTF-16, so they can cover the whole Unicode range (up to
U+10FFFF) by using surrogate pairs.

See http://www.unicode.org/faq/utf_bom.html#UTF16
and http://mailman.ic.ac.uk/pipermail/xml-dev/1999-September/014933.html
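
A small sketch of what "using surrogates" means in practice: a code point above U+FFFF is stored as two C# chars. The class and method names (SurrogateDemo, CodeUnitCount, IsSinglePair) are hypothetical; the char.* calls are standard .NET methods.

```csharp
using System;

public static class SurrogateDemo
{
    // Number of UTF-16 code units (C# chars) needed for one Unicode code point.
    public static int CodeUnitCount(int codePoint)
    {
        return char.ConvertFromUtf32(codePoint).Length;
    }

    // True when s is exactly one high surrogate followed by one low surrogate.
    public static bool IsSinglePair(string s)
    {
        return s.Length == 2 && char.IsSurrogatePair(s[0], s[1]);
    }
}
```

For example, CodeUnitCount(0x41) is 1, while CodeUnitCount(0x10000) is 2, so a single char can never compare equal to a value above 0xFFFF.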
 
PrePort

My problem is that I have a simple object that I return from a webservice.
Sometimes this object gets populated with bad data (from a database), but it
gets serialized just fine (no exception) and sent down to the clients.

The clients error out because of the bad data though. So I thought as I
populate these objects I can strip out all the "illegal characters". I'm
not worried about losing data, because it shouldn't be there in the first
place.

So, except for the last condition (0x10000..), do you think this is a valid
option? Again, I don't know enough about these character ranges and I don't
want to accidentally strip out "random" valid characters.

Thanks for your input so far.
 
preport

I've tested this function in our system and it did exactly what I was
afraid of. It solved the "clients blowing up" problem, but it is stripping
out all the spaces.

Why is this? I know this sounds stupid...because it is....but what the hell
am I stripping out? What are these character ranges? I'm familiar with the
ASCII tables and am used to checking against normal integers, but what the
heck are these character ranges? Can someone point me to a reference where
I can educate myself a little?
 
Jon Skeet [C# MVP]

You're stripping out spaces because of this condition:

(c > 0x20 && c < 0xd7ff)

That (and all the other ranges) should use inclusive comparisons, not
exclusive:

(c >= 0x20 && c <= 0xd7ff)

Note that the condition (c > 0xe000 && c < 0xffd) should use 0xfffd as
the upper bound, not 0xffd.

As for what you're stripping out:
1) the "control" characters U+0000 to U+001F, except tab, line feed and
carriage return
2) U+FFFE (a noncharacter: the byte-swapped form of the byte order mark,
U+FEFF)
3) U+FFFF (also a noncharacter)
4) the surrogate block (U+D800 to U+DFFF, used in pairs to represent
characters U+10000-U+10FFFF)

However, if you want to be able to represent characters >= U+10000,
you'll need to only strip "rogue" characters from the surrogate block,
leaving "valid" surrogate pairs alone. This unfortunately makes the
code significantly messier - if you don't care about representing those
characters, you could leave the code stripping anything from the
surrogate block: do document this omission though!
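
Putting the points above together, a corrected filter might look like the sketch below. It uses inclusive bounds, 0xFFFD as the upper limit, and keeps well-formed surrogate pairs while dropping "rogue" surrogates. The names XmlSanitizer and StripInvalidXmlChars are hypothetical, and this is one possible reading of the XML 1.0 Char production rather than a definitive implementation.

```csharp
using System.Text;

public static class XmlSanitizer
{
    // Removes every UTF-16 code unit that cannot appear in XML 1.0 text.
    public static string StripInvalidXmlChars(string text)
    {
        if (string.IsNullOrEmpty(text)) return text;

        StringBuilder sb = new StringBuilder(text.Length);
        for (int i = 0; i < text.Length; i++)
        {
            char c = text[i];
            if (c == 0x9 || c == 0xA || c == 0xD ||
                (c >= 0x20 && c <= 0xD7FF) ||
                (c >= 0xE000 && c <= 0xFFFD))
            {
                sb.Append(c);
            }
            else if (char.IsHighSurrogate(c) && i + 1 < text.Length &&
                     char.IsLowSurrogate(text[i + 1]))
            {
                // A well-formed pair encodes one code point in U+10000..U+10FFFF.
                sb.Append(c).Append(text[i + 1]);
                i++; // skip the low surrogate that was just copied
            }
            // Anything else (control characters, lone surrogates,
            // U+FFFE, U+FFFF) is dropped.
        }
        return sb.ToString();
    }
}
```

With this version, "a b" keeps its space, "a\u0000b" becomes "ab", and a string containing a valid surrogate pair passes through unchanged.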
 
Jon Skeet [C# MVP]

David Browne said:
Sure. Don't be lazy.

And a char is a 2-byte type, so your literals should all be 2-byte literals,
and should be cast to char for comparison.

eg
char space = (char)0x0020;

While I agree that the "high order" comparisons are invalid, where do
you see the benefit in converting to char for comparison?
 
preport

OK, let me redefine my question...
I get the 0x9, 0xa, 0xd, and (> 0x20)...but what are:

0xd7ff and
(0xe000 <= c <= 0xffd)

I don't understand what those characters are.
 
preport

Thank you.....I was aware of the tab, carriage return, and line feed....and
I did end up catching my "inclusive comparison" problem, but I did not know
what U+FFFE and U+10000-U+10FFFF were.

Thank you.
 
David Browne

Jon Skeet said:
While I agree that the "high order" comparisons are invalid, where do
you see the benefit in converting to char for comparison?

Just to remove implicit conversions from the code. The same reason I like
to see parentheses control order of operations instead of relying on
operator precedence. I think it makes the code more readable and less
fragile to have the operations explicit.

David
 
Jon Skeet [C# MVP]

David Browne said:
Just to remove implicit conversions from the code. The same reason I like
to see parentheses control order of operations instead of relying on
operator precedence. I think it makes the code more readable and less
fragile to have the operations explicit.

In some cases I'd agree, but in this case I think it would make it
harder to read overall. Personal preference though...
 
Mihai N.

Yes, but a surrogate will occupy two chars. A char is not a Unicode
character; it's an unsigned 16-bit integer, and the range of a char is not
U+0000 to U+ffff, it's 0x0000 to 0xffff.

Yes.
Which means that the condition
(c > 0x10000 && c < 0x10ffff))
has to be rewritten in terms of surrogates.
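
One way to express that condition in terms of surrogates is sketched below: instead of comparing a single char against 0x10000..0x10FFFF, test whether a high surrogate is followed by a low surrogate. The names SupplementaryCheck, IsSupplementaryAt and DecodePair are hypothetical; the arithmetic is the standard UTF-16 decoding formula.

```csharp
public static class SupplementaryCheck
{
    // True when a well-formed surrogate pair starts at index i, that is,
    // when the code point at that position lies in U+10000..U+10FFFF.
    public static bool IsSupplementaryAt(string s, int i)
    {
        return i >= 0 && i + 1 < s.Length &&
               s[i] >= 0xD800 && s[i] <= 0xDBFF &&        // high (lead) surrogate
               s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF;  // low (trail) surrogate
    }

    // Combines a surrogate pair into the code point it encodes.
    public static int DecodePair(char high, char low)
    {
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
    }
}
```

Because every well-formed pair decodes to a value between 0x10000 and 0x10FFFF, checking the pair's shape is enough; the decoded value never needs a separate range test.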
 
