c# .net write html to word special characters not writing

rhitam · May 21, 2009

Hi all,

I am trying to write HTML to a Word file. Basically, I have a list of
HTML files and I need their content accumulated and put into one Word
file. That is not the problem. The problem is handling special
characters. My code is something like this:

string Htmltext = somehtml;
WriteToFile("somefile.doc", ref Htmltext);

public static void WriteToFile(string strPath, ref string strData,
    FileMode FM, FileAccess FA, FileShare FSHR)
{
    // Encode the text as UTF-8 and write the raw bytes to the file.
    FileStream FS = File.Open(strPath, FM, FA, FSHR);
    byte[] b = Encoding.UTF8.GetBytes(strData);
    FS.Write(b, 0, b.Length);
    FS.Close();
}

Now there are characters like the trademark character (&trade;) and
many others which are encoded with the correct escape sequence in the
original file. When they are written into the Word file, they show up
as junk characters. But if I just open the existing HTML file in
Word, then it shows correctly. I really need urgent help with this.

Thanks,

Rhitam

rhitam · May 21, 2009

I apologize, I should have been more descriptive.

Well, first of all, the WriteToFile parameters are taken care of
already:

public static void WriteToFile(string strPath, ref string strData)
{
    WriteToFile(strPath, ref strData, FileMode.OpenOrCreate,
        FileAccess.Write, FileShare.None);
}

public static void WriteToFile(string strPath, ref string strData,
    FileMode FM, FileAccess FA, FileShare FSHR)
{
    FileStream FS = File.Open(strPath, FM, FA, FSHR);
    byte[] b = Encoding.UTF8.GetBytes(strData);
    FS.Write(b, 0, b.Length);
    FS.Close();
}

...... .

Now, the set of HTML files that I am reading is read using mshtml, like
this:

using mshtml;

StreamReader TopLinkStream = new StreamReader(FilePath);
string TopLinkHtml = TopLinkStream.ReadToEnd();
IHTMLDocument2 doc = new HTMLDocumentClass();
doc.write(new object[] { TopLinkHtml });
HTMLDocumentClass domdoc = (HTMLDocumentClass)doc;
string TopLinkBodyElem = domdoc.body.innerHTML;
BodyElem += TopLinkBodyElem;
TopLinkStream.Close();
IHTMLElementCollection HeadColl =
    (IHTMLElementCollection)domdoc.all.tags("head");
foreach (IHTMLElement Elem in HeadColl)
{
    string TopLinkHeadElem = Elem.innerHTML;
    HeadElem += TopLinkHeadElem;
}

Basically, from a set of 10 .htm files, I am reading the head
elements separately, and then the body elements separately, then
finally creating one big HTML file out of them.

But you can save it a lot of time and trouble if you'd just use a proper
extension, such as ".html" or ".htm" in the first place.

I tried that too, but I get the same result. I tried it like
this:

WriteToFile("somefile.html", ref Htmltext);

Even in the generated HTML file, the same junk characters
appear.

For example, Codetta™ is shown as: Codettaâ„¢

Any ideas?
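
For readers following along: junk of this shape is consistent with UTF-8 bytes being read back under a single-byte ANSI code page. The sketch below only shows the bytes involved. It assumes the viewer falls back to something like Windows-1252, which is a guess and not something confirmed in this thread; `MojibakeDemo` and `Utf8Hex` are names made up for illustration.

```csharp
using System;
using System.Text;

public static class MojibakeDemo
{
    // Hypothetical helper: returns the UTF-8 encoding of a string as
    // dash-separated hex bytes, so the raw file content can be inspected.
    public static string Utf8Hex(string text)
    {
        return BitConverter.ToString(Encoding.UTF8.GetBytes(text));
    }
}
```

Calling `MojibakeDemo.Utf8Hex("\u2122")` gives `E2-84-A2`: the single trademark character becomes three bytes. Under Windows-1252 those bytes are three separate characters (0xE2 is â, 0x84 is „, 0xA2 is ¢), which matches the "â„¢" seen above.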

rhitam · May 21, 2009

Weird, in this editor the junk shows up correctly... I mean it appears as
â „ ¢ (without the spaces).

rhitam · May 21, 2009

But without knowing exactly what you're doing, it's impossible to say for
sure. If the "existing html file" is not in fact exactly the same
characters you wind up writing to the new ".doc" file, then any number of
differences in the way that Word ultimately winds up parsing the file data
could explain what you're seeing. For example, if you don't write a BOM
at the beginning of your UTF-8 file (which you don't), and the character
is not in fact written in the file as "&trade;" but rather as the actual
character, then Word may not realize the file is UTF-8 and instead
interpret the bytes as something else (e.g. some ANSI code page).

What's a BOM?

Ignacio Machin ( .NET/ C# MVP ) · May 21, 2009

Hi,
You are not writing it to a Word (a la MS Word) file; you are just
concatenating them into a text file.

With that in mind, you have a couple of problems. First, you
only want the text being displayed, without the markup part, so you will
have to parse the document and extract the innerText.
For the escaped chars, you will need to have a table and convert them;
I do not think there is another way.

rhitam · May 21, 2009

I need the markup part too, since I need to display the content
exactly the same way as in the HTML file, i.e. with all the styles etc.
So I need the head section too, which has a lot of links to external

You are not writing it to a Word (a la MS Word) file, you are just
concatenating them to a text file.

In this case, should I use a Word object or something rather than
just writing into a text file?

Rudy Velthuis · May 21, 2009

rhitam said:
what's a BOM ?

A Unicode Byte Order Mark:

http://en.wikipedia.org/wiki/Byte-order_mark
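
For UTF-8 the mark is the three bytes EF BB BF at the very start of the file. `Encoding.UTF8.GetBytes` returns only the encoded characters and never that preamble, so a file written that way has no BOM. A minimal sketch of writing one, assuming the rest of the output stays as it is; `BomWriter` and its two methods are hypothetical names, while `File.WriteAllText` and `UTF8Encoding` are the standard .NET APIs:

```csharp
using System.IO;
using System.Text;

public static class BomWriter
{
    // Writes text as UTF-8 preceded by the byte order mark (EF BB BF).
    public static void WriteUtf8WithBom(string path, string text)
    {
        // UTF8Encoding(true) requests the preamble; File.WriteAllText
        // emits it at the start of the file before the encoded text.
        File.WriteAllText(path, text, new UTF8Encoding(true));
    }

    // Returns true if the file starts with the UTF-8 byte order mark.
    public static bool StartsWithUtf8Bom(string path)
    {
        byte[] bytes = File.ReadAllBytes(path);
        return bytes.Length >= 3
            && bytes[0] == 0xEF && bytes[1] == 0xBB && bytes[2] == 0xBF;
    }
}
```

Whether a BOM alone is enough for Word to pick the right encoding is not confirmed in this thread, so treat this as something to try rather than a guaranteed fix.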

rhitam · May 22, 2009

Hi guys,

Figured out a workaround (because this doesn't exactly qualify as a
solution, but it works ;-) ):

Basically, I noticed that the two characters ™ and ’ were
the culprits.

So I took Ignacio's suggestion and replaced them with their escape
sequences:

Somehtml = Somehtml.Replace("\u2122", "&trade;");
Somehtml = Somehtml.Replace("\u2019", "&rsquo;");

Works like a charm.
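
The two Replace calls cover only the characters found so far. One way to generalize the same idea is to escape every non-ASCII character, so the written file is pure ASCII and the reader's encoding guess stops mattering. This is only a sketch: it swaps the named entities for numeric character references (for example `&#8482;` instead of `&trade;`) to avoid needing a lookup table, and `HtmlAsciiEscaper` and `EscapeNonAscii` are names invented here for illustration.

```csharp
using System.Globalization;
using System.Text;

public static class HtmlAsciiEscaper
{
    // Replaces each non-ASCII character with a decimal numeric character
    // reference. ASCII text, including the markup itself, is left untouched.
    public static string EscapeNonAscii(string html)
    {
        StringBuilder sb = new StringBuilder(html.Length);
        for (int i = 0; i < html.Length; i++)
        {
            char c = html[i];
            if (c < 128)
            {
                sb.Append(c);
                continue;
            }

            int codePoint = c;
            // Characters outside the BMP are stored as a surrogate pair;
            // combine the pair into one code point and skip the second half.
            if (char.IsSurrogatePair(html, i))
            {
                codePoint = char.ConvertToUtf32(html[i], html[i + 1]);
                i++;
            }
            sb.Append("&#");
            sb.Append(codePoint.ToString(CultureInfo.InvariantCulture));
            sb.Append(';');
        }
        return sb.ToString();
    }
}
```

With this, `HtmlAsciiEscaper.EscapeNonAscii("Codetta\u2122")` returns `Codetta&#8482;`, and the call would go just before WriteToFile in place of the individual Replace lines.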
