XmlDocument Escaping - and I Don't Want It To

jehugaleahsa

Hello:

I am creating an HTML document at runtime and my particular situation
calls for me to replace tabs with four spaces. However, browsers skip
all but the first space. So, I need to replace the tab with &nbsp;
four times.

However, when I add any escaped characters to the document,
XmlDocument automatically escapes my escaped characters. So I am
getting &amp;nbsp; four times instead. Is there a way to shut this
off?

Thanks,
Travis
 
jehugaleahsa

I also need to be able to replace Environment.NewLine with <br/>. So,
I can't escape the < or > either.
 
jehugaleahsa

Ha!  From your subject, I thought you meant that your XmlDocument was  
getting away from you.

Remember...if you truly love something, set it free.  If it comes back, 
it's yours forever, and if not, it was never yours to start with.

But, I digress...

[...]
However, when I add any escaped characters to the document,
XmlDocument automatically escapes my escaped characters. So I am
getting &amp;nbsp; four times instead. Is there a way to shut this
off?

How are you replacing the characters in the first place?

The answer is to use the actual &nbsp; entity, rather than the string  
"&nbsp;".  But the specific approach may depend on how you're generating  
the replacement content.

Pete

How doth one create an "entity"? I am willing to adjust my approach at
this early point in development.
 
jehugaleahsa

Well, as I said, that depends on how you're dealing with the data.  But, a  
"character entity" is just a specific character with a specific escape  
sequence that will generate it.  You can always specify the character as  
an explicit character.  The &nbsp; entity corresponds to the Unicode  
character value of 0x00A0, so you could use the C# character literal  
'\u00A0' to specify the &nbsp; character.

Assuming you're dealing with a string, and you're just doing a  
search-and-replace, replacing " " with "&nbsp;", you could instead replace  
" " with "\u00A0".

Pete
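
A minimal sketch of the difference described above, assuming the text is added through XmlDocument.CreateTextNode (the thread does not show the actual code, so the class and helper names here are made up for illustration):

```csharp
using System;
using System.Xml;

static class NbspDemo
{
    // Hypothetical helper: wraps the given text in a <div> and returns the markup.
    public static string WrapInDiv(string text)
    {
        var doc = new XmlDocument();
        XmlElement div = doc.CreateElement("div");
        div.AppendChild(doc.CreateTextNode(text));
        doc.AppendChild(div);
        return doc.OuterXml;
    }

    public static void Main()
    {
        // The literal string "&nbsp;" is plain text, so its ampersand gets escaped.
        Console.WriteLine(WrapInDiv("a&nbsp;b"));   // <div>a&amp;nbsp;b</div>

        // The character U+00A0 is an ordinary character and needs no escaping.
        Console.WriteLine(WrapInDiv("a\u00A0b"));
    }
}
```

The second call emits the no-break space itself between "a" and "b", which is what the browser would have produced from &nbsp; anyway.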

I should have mentioned this before. We are creating a new reporting
framework. These applications are dynamic enough that we don't want to
bother with Crystal Reports, etc. We want to dynamically generate an
HTML document (with formatting, etc.) and display it with
System.Windows.Forms.WebBrowser. Why? Because we tried GDI in the past
and it wasn't very friendly.

So, since I am viewing the output in a browser, the output has to be a
valid HTML file with all the special characters recognized. The
browser will ignore spaces unless they are explicitly indicated using
&nbsp;.

So, I still need those darn special characters in there. It is kinda
frustrating, and I am thinking about generating the XML using
XmlWriter or something. It will mean more work on my part, but I'll
survive. I was just hoping for some simple flag on the XmlDocument
class, like, "Hey, don't escape my text!". It seems no such thing
exists.

I was reading on another site that part of the reason is because
&nbsp; isn't an XML recognized special character -- just an HTML
special character. Perhaps I can find another solution to my tab
problem.
 
Pavel Minaev

If you are writing that sort of thing, then you should familiarize
yourself with details of HTML, or at least pay closer attention to
what Peter told you.

&nbsp; is not a magic thing in HTML. It is just a _named_ character
entity that can be used in place of a specific Unicode symbol known as
"no-break space". As Peter had pointed out to you, this is a symbol
with Unicode codepoint 0x00A0. Thus, if the HTML you generate is in
any Unicode encoding (e.g. UTF-8 or UTF-16), you can just output
"\u00A0" to your HTML, and any browser will correctly interpret that.
If the HTML should be in some other encoding, then XmlDocument should
still take care of that when you write the output file, and
automatically escape the character using a _numeric_ character entity,
such as &#xA0;. This is also exactly equivalent to &nbsp;.

So, I still need those darn special characters in there. It is kinda
frustrating, and I am thinking about generating the XML using
XmlWriter or something.

You can try that; you might be surprised to find out that it will,
too, escape those characters in your text, unless you use specific
methods that are defined to work around that.

I was just hoping for some simple flag on the XmlDocument
class, like, "Hey, don't escape my text!". It seems no such thing
exists.

Sure. It's because XmlDocument has an obligation to ensure that
whatever is in it is valid XML. Furthermore, what it presents to you
is not just raw XML - it is the XML infoset
(http://www.w3.org/TR/xml-infoset/). So it can't let you add special
XML characters in it willy-nilly.

I was reading on another site that part of the reason is because
&nbsp; isn't an XML recognized special character -- just an HTML
special character. Perhaps I can find another solution to my tab
problem.

No, it's not because of that. It's simply because XmlDocument (and
virtually any other XML processing API) works on a higher level of
abstraction, so when you tell it that you want element <div> to have a
string "&nbsp;" inside it, it takes it to mean that you want anyone
else who reads XML according to the spec to be able to parse that
document, and, after all necessary processing (which includes handling
all the character entities!), to get the string "&nbsp;". To do so,
when writing the string, the ampersand has to be escaped, and so it
does that. In case you haven't noticed that yet, it actually does that
to any ampersand anywhere, and also to opening angle brackets.
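
Two of the points in this reply can be checked with a short sketch. It assumes a non-Unicode output encoding (ASCII is used here as a stand-in) and uses XmlWriter.WriteRaw as one example of the "specific methods" that bypass escaping; the class and method names are invented for the illustration.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

static class EscapingDemo
{
    // Serializes <div> containing U+00A0 with an ASCII-only encoding.
    public static string SaveAsAscii()
    {
        var doc = new XmlDocument();
        XmlElement div = doc.CreateElement("div");
        div.AppendChild(doc.CreateTextNode("a\u00A0b"));
        doc.AppendChild(div);

        var settings = new XmlWriterSettings
        {
            Encoding = Encoding.ASCII,
            OmitXmlDeclaration = true
        };
        using (var stream = new MemoryStream())
        {
            using (XmlWriter writer = XmlWriter.Create(stream, settings))
            {
                doc.Save(writer);
            }
            // ASCII cannot represent U+00A0, so the writer falls back to a
            // numeric character reference instead of the raw character.
            return Encoding.ASCII.GetString(stream.ToArray());
        }
    }

    // WriteString escapes markup characters; WriteRaw copies its argument verbatim.
    public static string StringVersusRaw()
    {
        var sb = new StringBuilder();
        var settings = new XmlWriterSettings { OmitXmlDeclaration = true };
        using (XmlWriter writer = XmlWriter.Create(sb, settings))
        {
            writer.WriteStartElement("div");
            writer.WriteString("1 < 2 & 3");
            writer.WriteRaw("<br />");
            writer.WriteEndElement();
        }
        return sb.ToString();
    }

    public static void Main()
    {
        Console.WriteLine(SaveAsAscii());
        Console.WriteLine(StringVersusRaw());   // <div>1 &lt; 2 &amp; 3<br /></div>
    }
}
```

WriteRaw does no checking at all, so anything passed to it has to be well-formed already; building the br element through the DOM avoids that risk.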
 
jehugaleahsa

I will try it.
 
jehugaleahsa

It worked! Sorry, I assumed since \u00A0 was essentially the same
thing as a space, the browser would treat it as such. I knew that
browsers ignore all but the first contiguous space, so I was worried
it wouldn't work. It's not that I don't know HTML, the tags that is,
it is that I don't know every smidgen of detail about how browsers
interpret Unicode. That's not really something I think I should be
ashamed of. It's just a waste of memory cells.

So, thanks for the help. It should be enough for me to move forward!
Thanks!
 
jehugaleahsa

Not really a waste of memory cells at all.  In fact, if you expect to be  
dealing with HTML, you owe it to yourself and others to commit this kind  
of knowledge to your memory cells.

Do you really think knowing about how a browser interprets a character
entity, how it decides whether to print it or not, is really something
people need to know every time they touch HTML? I believe in the basic
principle that no one should need to know _everything_ about a tool in
order to use it. This forum really is proof that you _don't_ need to
know everything about C# in order to program in it. If you did need to
know everything, only people like yourself would even stand a chance
of using it. The only thing people like myself require is a place to
find answers to those questions on that rare occasion when things are
black and white. Hence documentation and forums like this one.

I think my understanding of HTML is thorough, but far from complete.
Most programmers with a few web applications under their belts should
at least know about HTML and probably CSS. But you have to also
realize that a person programming in ASP.NET for years may be capable
of getting along using Visual Studio's designer and never once worry
about HTML tags. I think that is a good thing, IMO. Sure, when that
rare occasion comes up, that person may be left in the dark. But why
know HTML until you need to? Who knows, in 20 years, knowing HTML
will be like knowing system calls in most programs today - most people
never care to look at code at that level.

I don't consider web development as my focus. I work mostly on Windows
Forms applications, database-driven server applications and systems
architecture. Sure, I have made some pretty large web applications,
recently, but ASP.NET hid most of the HTML from me. I leave user
interfaces up to other developers who are specialized in user
interface beautification. If every programmer on the team needs to
know how to do every task perfectly, you're not being efficient.
Unfortunately, my lead and I are the only programmers at work, since
my peer left a few months ago. He was the artist; I believe I am best
utilized elsewhere.

This isn't an issue about "how browsers interpret Unicode".  It's an issue
about what putting "&nbsp;", or any character entity, into your HTML  
_really_ does, as well as what is the definition of "white space" in the  
context of HTML.

The named character entities really are just an expression interpreted as a
specific character.  Thus the name, "character entities".  When you use a
named character entity like that, you are in effect saying "put character
<foo> here", where <foo> is the code point for the entity (regardless
of character encoding).  The reason &nbsp; doesn't disappear as white  
space isn't because you wrote it "&nbsp;", it's because that specific  
character, '\u00A0', isn't among the kinds of characters defined to be  
"white space" in HTML.  It doesn't matter how you get that character into  
the HTML, it's not white space regardless.
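
One related catch on the C# side, sketched under the assumption that "white space" for HTML collapsing means the ASCII space, tab, line feed, form feed and carriage return (the helper name is hypothetical): .NET's own notion of white space is broader and does include U+00A0.

```csharp
using System;

static class WhiteSpaceDemo
{
    // Hypothetical helper: true for the ASCII characters assumed here to be
    // the ones a browser collapses as white space.
    public static bool IsHtmlCollapsibleSpace(char c)
    {
        return c == ' ' || c == '\t' || c == '\n' || c == '\f' || c == '\r';
    }

    public static void Main()
    {
        Console.WriteLine(IsHtmlCollapsibleSpace(' '));        // True
        Console.WriteLine(IsHtmlCollapsibleSpace('\u00A0'));   // False

        // .NET classifies U+00A0 as white space (Unicode space separator),
        // so Trim() removes it from the ends of a string.
        Console.WriteLine(char.IsWhiteSpace('\u00A0'));        // True
        Console.WriteLine("\u00A0x\u00A0".Trim().Length);      // 1
    }
}
```

In practice this means a Trim() call after the replacement step would silently strip leading or trailing no-break spaces again.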

And see, that is where I overlooked your response. It was my fault for
doing so. I should have tried it before discarding it. I'm sorry for
doing that. Unfortunately as early in development as I was, I didn't
want to do a lot of work to test your suggestion. It wouldn't have
been that much work to test that, so it was me being lazy at 4pm on a
Friday.

For more information, see: http://www.w3.org/TR/html4/sgml/ent...pedia.org/wiki/SGML_entity#Character_entities

And next time someone suggests a solution, please just try it before you  
state it won't work.  Especially if you can't be bothered to post any code  
showing what you're actually doing, you owe it to those trying to help you  
to test solutions they offer before you deem them unusable.

Sorry, it is hard to gut a multi-file code set and present it as a
short example. This framework I created involved 26 source files. The
heart of the example would have been something like this, though:

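public static string PrepareText(string text)
{
    // Replace every space with a no-break space (U+00A0) so the browser keeps it.
    text = text.Replace(' ', '\u00A0');
    // Expand each tab to four no-break spaces.
    text = text.Replace("\t", "\u00A0\u00A0\u00A0\u00A0");
    return text;
}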

Of course, this misses the code that places the text in an XmlText
object, and the code to add that XmlText to the parent tag. And that
code is part of another class, which appears in multiple places
throughout the code. I would have had to pretty much make another,
smaller, code set just to make a concise example.

In fact, I had to change my approach quite a bit to make the example
above that simple. Originally, I was trying to replace
Environment.NewLine with <br />. I ended up creating a HtmlNewLineItem
decorator that I wrapped around IHtmlReportItem, so that whenever I
created the tag for the HTML item, the decorator would follow it with
a br element. Creating a decorator for something like that seems quite
excessive in hindsight. I provide a method called AppendLine, which
handles the creation of new-lines -- I don't even worry about the user
entering in Environment.NewLine manually! Otherwise, I would have to
split the text and replace each new-line with a br element. The amount
of work it takes to make this all work -- it might have been easier to
not use .NET's classes at all!
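
For comparison, the split-and-replace alternative mentioned above could look roughly like this. It is only a sketch: AppendTextWithBreaks and Render are invented names, not part of the framework described in this post, and the p element is just a stand-in parent.

```csharp
using System;
using System.Xml;

static class LineBreakDemo
{
    // Hypothetical helper: appends text to parent, inserting a br element
    // wherever the text contains a line break.
    public static void AppendTextWithBreaks(XmlElement parent, string text)
    {
        XmlDocument doc = parent.OwnerDocument;
        // Normalize Windows line endings so a single split handles both styles.
        string[] lines = text.Replace("\r\n", "\n").Split('\n');
        for (int i = 0; i < lines.Length; i++)
        {
            if (i > 0)
            {
                parent.AppendChild(doc.CreateElement("br"));
            }
            if (lines[i].Length > 0)
            {
                parent.AppendChild(doc.CreateTextNode(lines[i]));
            }
        }
    }

    public static string Render(string text)
    {
        var doc = new XmlDocument();
        XmlElement p = doc.CreateElement("p");
        doc.AppendChild(p);
        AppendTextWithBreaks(p, text);
        return doc.OuterXml;
    }

    public static void Main()
    {
        Console.WriteLine(Render("one" + Environment.NewLine + "two < three"));
        // <p>one<br />two &lt; three</p>
    }
}
```

Because the br elements are real DOM nodes, nothing needs to bypass escaping, and ordinary text such as "<" is still escaped correctly.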

If it means anything, I am sorry for not taking the time to create an
example. I am sorry that I didn't take the time to try your advice.
You have helped me a lot in the past. Thanks for all your help.
 
