XmlDocument Escaping - and I Don't Want It To

jehugaleahsa · Apr 3, 2009

Hello:

I am creating an HTML document at runtime and my particular situation
calls for me to replace tabs with four spaces. However, browsers skip
all but the first space. So, I need to replace the tab with &nbsp;
four times.

However, when I add any escaped characters to the document,
XmlDocument automatically escapes my escaped characters. So I am
getting &amp;nbsp; four times instead. Is there a way to shut this
off?

Thanks,
Travis

jehugaleahsa · Apr 3, 2009

I also need to be able to replace Environment.NewLine with <br/>. So,
I can't escape the < or > either.

jehugaleahsa · Apr 3, 2009

Ha! From your subject, I thought you meant that your XmlDocument was
getting away from you.

Remember...if you truly love something, set it free. If it comes back,
it's yours forever, and if not, it was never yours to start with.

But, I digress...

[...]
However, when I add any escaped characters to the document,
XmlDocument automatically escapes my escaped characters. So I am
getting &amp;nbsp; four times instead. Is there a way to shut this
off?

How are you replacing the characters in the first place?

The answer is to use the actual character that the &nbsp; entity stands
for, rather than the literal string "&nbsp;". But the specific approach
may depend on how you're generating the replacement content.

Pete

How doth one create an "entity"? I am willing to adjust my approach at
this early point in development.

jehugaleahsa · Apr 4, 2009

Well, as I said, that depends on how you're dealing with the data. But, a
"character entity" is just a specific character with a specific escape
sequence that will generate it. You can always specify the character as
an explicit character. The &nbsp; entity corresponds to the Unicode
character value of 0x00A0, so you could use the C# character literal
'\u00A0' to specify the no-break space character.

Assuming you're dealing with a string, and you're just doing a
search-and-replace, replacing " " with "&nbsp;", you could instead
replace " " with "\u00A0".

Pete
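To make Pete's point concrete, here is a minimal sketch showing that the named entity and the C# literal denote the same single character. The class and method names (NbspDemo, ExpandTabs, DecodesToNoBreakSpace) are made up for illustration and are not from the thread; WebUtility.HtmlDecode is used only to show the equivalence.

```csharp
using System.Net;

public static class NbspDemo
{
    // The no-break space written as a C# string literal (code point U+00A0).
    public const string NoBreakSpace = "\u00A0";

    // Hypothetical helper: expand each tab into four no-break spaces.
    public static string ExpandTabs(string text)
    {
        return text.Replace("\t", NoBreakSpace + NoBreakSpace + NoBreakSpace + NoBreakSpace);
    }

    // Decoding the named HTML entity yields that same one-character string.
    public static bool DecodesToNoBreakSpace()
    {
        return WebUtility.HtmlDecode("&nbsp;") == NoBreakSpace;
    }
}
```

Calling NbspDemo.ExpandTabs("a\tb") gives a six-character string: "a", four U+00A0 characters, and "b". No ampersand is involved, so there is nothing for XmlDocument to escape.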

I should have mentioned this before. We are creating a new reporting
framework. These applications are dynamic enough that we don't want to
bother with Crystal Reports, etc. We want to dynamically generate a
HTML document (with formatting, etc.) and display it with
System.Windows.Forms.WebBrowser. Why? Because we tried GDI in the past
and it wasn't very friendly.

So, since I am viewing the output in a browser, the output has to be a
valid HTML file with all the special characters recognized. The
browser will ignore spaces unless they are explicitly indicated using
&nbsp;.

So, I still need those darn special characters in there. It is kinda
frustrating, and I am thinking about generating the XML using
XmlWriter or something. It will mean more work on my part, but I'll
survive. I was just hoping for some simple flag on the XmlDocument
class, like, "Hey, don't escape my text!". It seems no such thing
exists.

I was reading on another site that part of the reason is because
&nbsp; isn't an XML recognized special character -- just an HTML
special character. Perhaps I can find another solution to my tab
problem.

Pavel Minaev · Apr 4, 2009

I should have mentioned this before. We are creating a new reporting
framework. These applications are dynamic enough that we don't want to
bother with Crystal Reports, etc. We want to dynamically generate a
HTML document (with formatting, etc.) and display it with
System.Windows.Forms.WebBrowser. Why? Because we tried GDI in the past
and it wasn't very friendly.

So, since I am viewing the output in a browser, the output has to be a
valid HTML file with all the special characters recognized. The
browser will ignore spaces unless they are explicitly indicated using
&nbsp;.

If you are writing that sort of thing, then you should familiarize
yourself with details of HTML, or at least pay closer attention to
what Peter told you.

&nbsp; is not a magic thing in HTML. It is just a _named_ character
entity that can be used in place of a specific Unicode symbol known as
"no-break space". As Peter had pointed out to you, this is a symbol
with Unicode codepoint 0x00A0. Thus, if the HTML you generate is in
any Unicode encoding (e.g. UTF-8 or UTF-16), you can just output
"\u00A0" to your HTML, and any browser will correctly interpret that.
If the HTML should be in some other encoding, then XmlDocument should
still take care of that when you write the output file, and
automatically escape the character using a _numeric_ character entity,
such as &#xA0;. This is also exactly equivalent to &nbsp;.
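The encoding behaviour described above can be checked with a short sketch. It assumes the document is saved through an XmlWriter created over a stream with an explicit encoding; the names EncodingDemo and Serialize are invented for this example.

```csharp
using System.IO;
using System.Text;
using System.Xml;

public static class EncodingDemo
{
    // Serializes <div>a[U+00A0]b</div> with the given encoding and
    // returns the bytes decoded back into a string for inspection.
    public static string Serialize(Encoding encoding)
    {
        var doc = new XmlDocument();
        XmlElement div = doc.CreateElement("div");
        div.AppendChild(doc.CreateTextNode("a\u00A0b"));
        doc.AppendChild(div);

        var settings = new XmlWriterSettings
        {
            Encoding = encoding,
            OmitXmlDeclaration = true
        };

        using (var stream = new MemoryStream())
        {
            using (XmlWriter writer = XmlWriter.Create(stream, settings))
            {
                doc.Save(writer);
            }
            return encoding.GetString(stream.ToArray());
        }
    }
}
```

With new UTF8Encoding(false) the no-break space should be written as the character itself. With Encoding.ASCII, which cannot represent it, the writer should fall back to a numeric character reference (&#xA0;). A browser treats both the same way.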

So, I still need those darn special characters in there. It is kinda
frustrating, and I am thinking about generating the XML using
XmlWriter or something.

You can try that; you might be surprised to find out that it will,
too, escape those characters in your text, unless you use specific
methods that are defined to work around that.
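As a rough illustration of that remark: XmlWriter.WriteString escapes markup characters, while XmlWriter.WriteRaw passes the text through untouched. The wrapper below (WriterDemo.Write) is a made-up helper, not something from the thread.

```csharp
using System.Text;
using System.Xml;

public static class WriterDemo
{
    // Writes <div>text</div>, either escaped (WriteString) or verbatim (WriteRaw).
    public static string Write(string text, bool raw)
    {
        var sb = new StringBuilder();
        var settings = new XmlWriterSettings { OmitXmlDeclaration = true };

        using (XmlWriter writer = XmlWriter.Create(sb, settings))
        {
            writer.WriteStartElement("div");
            if (raw)
            {
                // No escaping and no checking: the caller is responsible
                // for the result being well-formed.
                writer.WriteRaw(text);
            }
            else
            {
                // '&' and '<' are escaped so the text round-trips unchanged.
                writer.WriteString(text);
            }
            writer.WriteEndElement();
        }
        return sb.ToString();
    }
}
```

WriterDemo.Write("a&nbsp;b", false) yields <div>a&amp;nbsp;b</div>, while passing true yields <div>a&nbsp;b</div>. The raw form is only usable if whatever reads the output knows the nbsp entity, as an HTML browser does; a plain XML parser without a DTD would reject it.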

I was just hoping for some simple flag on the XmlDocument
class, like, "Hey, don't escape my text!". It seems no such thing
exists.

Sure. It's because XmlDocument has an obligation to ensure that
whatever is in it is valid XML. Furthermore, what it presents to you
is not just raw XML - it is the XML infoset
(http://www.w3.org/TR/xml-infoset/). So it can't let you add special
XML characters in it willy-nilly.

I was reading on another site that part of the reason is because
&nbsp; isn't an XML recognized special character -- just an HTML
special character. Perhaps I can find another solution to my tab
problem.

No, it's not because of that. It's simply because XmlDocument (and
virtually any other XML processing API) works on a higher level of
abstraction, so when you tell it that you want element <div> to have a
string "&nbsp;" inside it, it takes it to mean that you want anyone
else who reads XML according to the spec to be able to parse that
document, and, after all necessary processing (which includes handling
all the character entities!), to get the string "&nbsp;". To do so,
when writing the string, the ampersand has to be escaped, and so it
does that. In case you haven't noticed that yet, it actually does that
to any ampersand anywhere, and also to opening angle brackets.
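A small sketch of that round-trip guarantee, using hypothetical helper names (EscapingDemo, DivWithText, RoundTrip): the text handed to XmlDocument is what a conforming reader gets back, and escaping on output is how that is achieved.

```csharp
using System.Xml;

public static class EscapingDemo
{
    // Builds <div>text</div> and returns the serialized markup.
    public static string DivWithText(string text)
    {
        var doc = new XmlDocument();
        XmlElement div = doc.CreateElement("div");
        div.AppendChild(doc.CreateTextNode(text));
        doc.AppendChild(div);
        return doc.OuterXml;
    }

    // Serializes, parses again, and returns the text a reader would see.
    public static string RoundTrip(string text)
    {
        var reparsed = new XmlDocument();
        reparsed.LoadXml(DivWithText(text));
        return reparsed.DocumentElement.InnerText;
    }
}
```

DivWithText("a&nbsp;b") produces <div>a&amp;nbsp;b</div>, and RoundTrip returns the original six-letter sequence "&nbsp;" intact. DivWithText("a\u00A0b") contains the no-break space character itself, with no escaping needed.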

jehugaleahsa · Apr 4, 2009

I will try it.

jehugaleahsa · Apr 5, 2009

It worked! Sorry, I assumed since \u00A0 was essentially the same
thing as a space, the browser would treat it as such. I knew that
browsers ignore all but the first contiguous space, so I was worried
it wouldn't work. It's not that I don't know HTML, the tags that is,
it is that I don't know every smidgen of detail about how browsers
interpret Unicode. That's not really something I think I should be
ashamed of. It's just a waste of memory cells.

So, thanks for the help. It should be enough for me to move forward!
Thanks!

jehugaleahsa · Apr 5, 2009

Not really a waste of memory cells at all. In fact, if you expect to be
dealing with HTML, you owe it to yourself and others to commit this kind
of knowledge to your memory cells.

Do you really think knowing about how a browser interprets a character
entity, how it decides whether to print it or not, is really something
people need to know every time they touch HTML? I believe in the basic
principle that no one should need to know _everything_ about a tool in
order to use it. This forum really is proof that you _don't_ need to
know everything about C# in order to program in it. If you did need to
know everything, only people like yourself would even stand a chance
of using it. The only thing people like myself require is a place to
find answers to those questions on that rare occasion when things are
black and white. Hence documentation and forums like this one.

I think my understanding of HTML is thorough, but far from complete.
Most programmers with a few web applications under their belts should
at least know about HTML and probably CSS. But you have to also
realize that a person programming in ASP.NET for years may be capable
of getting along using Visual Studio's designer and never once worry
about HTML tags. I think that is a good thing, IMO. Sure, when that
rare occasion comes up, that person may be left in the dark. But why
know HTML until you need to? Who knows, in 20 years, knowing HTML
will be like knowing system calls in most programs today - most people
never care to look at code at that level.

I don't consider web development as my focus. I work mostly on Windows
Forms applications, database-driven server applications and systems
architecture. Sure, I have made some pretty large web applications,
recently, but ASP.NET hid most of the HTML from me. I leave user
interfaces up to other developers who are specialized in user
interface beautification. If every programmer on the team needs to
know how to do every task perfectly, you're not being efficient.
Unfortunately, my lead and I are the only programmers at work, since
my peer left a few months ago. He was the artist; I believe I am best
utilized elsewhere.

This isn't an issue about "how browsers interpret Unicode". It's an issue
about what putting "&nbsp;", or any character entity, into your HTML
_really_ does, as well as what is the definition of "white space" in the
context of HTML.

The named character entities really are just an expression interpreted as a
specific character. Thus the name, "character entities". When you use a
named character entity like that, you are in effect saying "put character
<foo> here", where <foo> is the code point for the entity (regardless
of character encoding). The reason &nbsp; doesn't disappear as white
space isn't because you wrote it "&nbsp;", it's because that specific
character, '\u00A0', isn't among the kinds of characters defined to be
"white space" in HTML. It doesn't matter how you get that character into
the HTML, it's not white space regardless.

And see, that is where I overlooked your response. It was my fault for
doing so. I should have tried it before discarding it. I'm sorry for
doing that. Unfortunately as early in development as I was, I didn't
want to do a lot of work to test your suggestion. It wouldn't have
been that much work to test that, so it was me being lazy at 4pm on a
Friday.

For more information, see: http://www.w3.org/TR/html4/sgml/ent...pedia.org/wiki/SGML_entity#Character_entities

And next time someone suggests a solution, please just try it before you
state it won't work. Especially if you can't be bothered to post any code
showing what you're actually doing, you owe it to those trying to help you
to test solutions they offer before you deem them unusable.

Sorry, it is hard to gut a multi-file code set and present it as a
short example. This framework I created involved 26 source files. The
heart of the example would have been something like this, though:

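public static string PrepareText(string text)
{
    // Replace every ordinary space with a no-break space (U+00A0),
    // which browsers do not collapse.
    text = text.Replace(' ', '\u00A0');
    // Expand each tab into four no-break spaces.
    text = text.Replace("\t", "\u00A0\u00A0\u00A0\u00A0");
    return text;
}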

Of course, this misses the code that places the text in an XmlText
object, and the code to add that XmlText to the parent tag. And that
code is part of another class, which appears in multiple places
throughout the code. I would have had to pretty much make another,
smaller, code set just to make a concise example.

In fact, I had to change my approach quite a bit to make the example
above that simple. Originally, I was trying to replace
Environment.NewLine with <br />. I ended up creating a HtmlNewLineItem
decorator that I wrapped around IHtmlReportItem, so that whenever I
created the tag for the HTML item, the decorator would follow it with
a br element. Creating a decorator for something like that seems quite
excessive in hindsight. I provide a method called AppendLine, which
handles the creation of new-lines -- I don't even worry about the user
entering in Environment.NewLine manually! Otherwise, I would have to
split the text and replace each new-line with a br element. The amount
of work it takes to make this all work -- it might have been easier to
not use .NET's classes at all!
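For what it is worth, the "split the text and replace each new-line with a br element" route mentioned above can be fairly short. This is only a sketch under the assumption that the text goes into an XmlElement as child nodes; LineBreakDemo and AppendWithLineBreaks are invented names, not part of the framework described in the post.

```csharp
using System;
using System.Xml;

public static class LineBreakDemo
{
    // Hypothetical helper: appends text to parent, emitting a real
    // <br /> element (not an escaped string) at every line break.
    public static void AppendWithLineBreaks(XmlElement parent, string text)
    {
        XmlDocument doc = parent.OwnerDocument;

        // Accept both Windows (\r\n) and Unix (\n) line endings.
        string[] lines = text.Split(new[] { "\r\n", "\n" }, StringSplitOptions.None);

        for (int i = 0; i < lines.Length; i++)
        {
            if (i > 0)
            {
                parent.AppendChild(doc.CreateElement("br"));
            }
            if (lines[i].Length > 0)
            {
                parent.AppendChild(doc.CreateTextNode(lines[i]));
            }
        }
    }
}
```

Because the br is created as an element node rather than typed into the text, its angle brackets are never escaped. Appending "one\r\ntwo" to an empty div serializes as <div>one<br />two</div>.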

If it means anything, I am sorry for not taking the time to create an
example. I am sorry that I didn't take the time to try your advice.
You have helped me a lot in the past. Thanks for all your help.
