C# .NET: special characters garbled when writing HTML to a Word file

rhitam

Hi all,

I am trying to write HTML to a Word file. Basically I have a list of
HTML files and I need their content accumulated and put into one Word
file. That is not the problem. The problem is handling special
characters. My code is something like this:

string Htmltext = somehtml;
WriteToFile("somefile.doc", ref Htmltext);

public static void WriteToFile(string strPath, ref string strData,
    FileMode FM, FileAccess FA, FileShare FSHR)
{
    FileStream FS = File.Open(strPath, FM, FA, FSHR);
    byte[] b = Encoding.UTF8.GetBytes(strData);
    FS.Write(b, 0, b.Length);
    FS.Close();
}

Now there are characters like the trademark character (™) and
many others which are encoded with the correct escape sequences in the
original file. When they are written into the Word file, they show up
as junk characters. But if I just open the existing HTML file in
Word, it displays correctly. I really need urgent help with this.

Thanks,

Rhitam
 
rhitam

I apologize, I should have been more descriptive.

Well, first of all, the WriteToFile parameters are already taken care
of:

public static void WriteToFile(string strPath, ref string strData)
{
    WriteToFile(strPath, ref strData, FileMode.OpenOrCreate,
        FileAccess.Write, FileShare.None);
}

public static void WriteToFile(string strPath, ref string strData,
    FileMode FM, FileAccess FA, FileShare FSHR)
{
    FileStream FS = File.Open(strPath, FM, FA, FSHR);
    byte[] b = Encoding.UTF8.GetBytes(strData);
    FS.Write(b, 0, b.Length);
    FS.Close();
}

...... .

Now, I am reading this set of HTML files using mshtml, like
this:

using mshtml;

StreamReader TopLinkStream = new StreamReader(FilePath);
string TopLinkHtml = TopLinkStream.ReadToEnd();
IHTMLDocument2 doc = new HTMLDocumentClass();
doc.write(new object[] { TopLinkHtml });
HTMLDocumentClass domdoc = (HTMLDocumentClass)doc;
string TopLinkBodyElem = domdoc.body.innerHTML;
BodyElem += TopLinkBodyElem;
TopLinkStream.Close();
IHTMLElementCollection HeadColl =
    (IHTMLElementCollection)domdoc.all.tags("head");
foreach (IHTMLElement Elem in HeadColl)
{
    string TopLinkHeadElem = Elem.innerHTML;
    HeadElem += TopLinkHeadElem;
}

Basically, from a set of 10 .htm files, I am reading the head
elements separately and the body elements separately, then
finally creating one big HTML document out of them.
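
When head and body fragments are stitched together like this, the merged document can state its own encoding so that a viewer does not have to guess. A minimal sketch, assuming the accumulated HeadElem and BodyElem strings from the code above; the MergedHtml class and the BuildDocument name are hypothetical and not part of the original code:

```csharp
using System.Text;

// Hypothetical helper, not from the original post.
public static class MergedHtml
{
    // Wraps the accumulated head and body markup in one document and
    // declares UTF-8 before any other head content.
    public static string BuildDocument(string headHtml, string bodyHtml)
    {
        var sb = new StringBuilder();
        sb.AppendLine("<html>");
        sb.AppendLine("<head>");
        sb.AppendLine("<meta http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8\">");
        sb.AppendLine(headHtml);
        sb.AppendLine("</head>");
        sb.AppendLine("<body>");
        sb.AppendLine(bodyHtml);
        sb.AppendLine("</body>");
        sb.AppendLine("</html>");
        return sb.ToString();
    }
}
```

The declared charset has to match the bytes actually written, which is UTF-8 in the WriteToFile shown above. If the source heads carry their own charset declarations, those may conflict with this one, so they are worth checking.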

"But you can save it a lot of time and trouble if you'd just use a proper
extension, such as ".html" or ".htm" in the first place."

I tried that too, but I get the same result. I tried it like
this:

WriteToFile("somefile.html", ref Htmltext);

Even in the generated HTML file, the same junk characters
appear.

For example, Codetta™ is shown as: Codettaâ„¢

Any ideas?
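
The junk shown above is consistent with correct UTF-8 bytes being read back as a single-byte ANSI code page. A small sketch (the MojibakeDemo class is hypothetical) that prints the bytes Encoding.UTF8 produces for a piece of text:

```csharp
using System;
using System.Text;

// Hypothetical helper for inspecting encoded bytes.
public static class MojibakeDemo
{
    // Returns the UTF-8 bytes of the text as dash-separated hex pairs.
    public static string Utf8Hex(string text)
    {
        return BitConverter.ToString(Encoding.UTF8.GetBytes(text));
    }
}
```

MojibakeDemo.Utf8Hex("\u2122") gives E2-84-A2. In Windows-1252 those three bytes are the characters â, „ and ¢, which is the sequence in the example. That suggests the string and the written bytes are fine and the reader is simply not treating the file as UTF-8.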
 
rhitam

For example, Codetta™ is shown as: Codetta(tm)

Any ideas?

Weird, this editor shows it correctly... I mean it comes out as â „ ¢ (without
the spaces).
 
rhitam

But without knowing exactly what you're doing, it's impossible to say for
sure. If the "existing html file" is not in fact exactly the same
characters you wind up writing to the new ".doc" file, then any number of
differences in the way that Word ultimately winds up parsing the file data
could explain what you're seeing. For example, if you don't write a BOM
at the beginning of your UTF-8 file (which you don't), and the character
is not in fact written in the file as an entity such as "&trade;" but rather as the actual
character, then Word may not realize the file is UTF-8 and instead
interpret the bytes as something else (e.g. some ANSI code page).


What's a BOM?
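
A BOM (byte order mark) is the character U+FEFF written at the very start of a file. In UTF-8 it is the three bytes EF BB BF, and many Windows programs take it as a sign that the file is UTF-8. A minimal sketch of writing one, assuming the same path and string that WriteToFile receives (the BomWriter class is hypothetical):

```csharp
using System.IO;
using System.Text;

// Hypothetical replacement for the byte-writing part of WriteToFile.
public static class BomWriter
{
    public static void WriteUtf8WithBom(string path, string text)
    {
        // FileMode.Create truncates an existing file, so no stale bytes
        // from a longer previous version are left behind.
        using (FileStream fs = File.Open(path, FileMode.Create, FileAccess.Write, FileShare.None))
        {
            byte[] bom = Encoding.UTF8.GetPreamble();   // EF BB BF
            byte[] data = Encoding.UTF8.GetBytes(text);
            fs.Write(bom, 0, bom.Length);
            fs.Write(data, 0, data.Length);
        }
    }
}
```

File.WriteAllText(path, text, Encoding.UTF8) should have the same effect in a single call.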
 
Ignacio Machin ( .NET/ C# MVP )

Hi,
You are not writing it to a Word (a la MS Word) file, you are just
concatenating them to a text file.

With that in mind, you have a couple of problems. The first is that you
only want the text being displayed, without the markup part, so you will
have to parse the document and extract the innerText.
For the escaped chars you will need to have a table and convert them;
I do not think there is another way.
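
The conversion table does not necessarily have to be written by hand. A minimal sketch, assuming .NET 4.0 or later where System.Net.WebUtility is available (the EntityDecoding wrapper is hypothetical):

```csharp
using System.Net;

// Hypothetical wrapper around the framework's entity decoder.
public static class EntityDecoding
{
    // Converts HTML entities such as &trade; or &#8217; into the
    // characters they stand for.
    public static string Decode(string html)
    {
        return WebUtility.HtmlDecode(html);
    }
}
```

Note that decoding also turns &lt; and &amp; into literal characters, so this fits extracted text better than a full document whose markup has to stay intact.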
 
rhitam

I need the markup too, since I need to display the content
exactly the same way as in the HTML file, i.e. with all the styles etc.
So I need the head section too, which has a lot of links to external

"You are not writing it to a Word (a la MS Word) file, you are just
concatenating them to a text file."

In this case, should I use a Word object or something, rather than
just writing into a text file?
 
rhitam

Hi guys,

I figured out a workaround (because this doesn't exactly qualify as a
solution, but it works ;-) ):

Basically, I noticed that the two characters ™ and ’ were
the culprits.

So I took Ignacio's suggestion and replaced them with their escape
sequences:

Somehtml = Somehtml.Replace("\u2122", "&trade;");
Somehtml = Somehtml.Replace("\u2019", "&rsquo;");

Works like a charm :)
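
The two Replace calls cover only the characters noticed so far. A more general sketch along the same lines, assuming numeric character references are acceptable in place of named entities (the HtmlAscii class and the EscapeNonAscii name are hypothetical), escapes every non-ASCII character so the written file is pure ASCII and reads the same under any encoding guess:

```csharp
using System.Globalization;
using System.Text;

// Hypothetical generalization of the Replace-based workaround.
public static class HtmlAscii
{
    // Replaces each non-ASCII character with a decimal reference (&#N;).
    // ASCII characters, including all markup, pass through unchanged.
    public static string EscapeNonAscii(string html)
    {
        var sb = new StringBuilder(html.Length);
        for (int i = 0; i < html.Length; i++)
        {
            char c = html[i];
            if (c < 128)
            {
                sb.Append(c);
                continue;
            }

            int codePoint = c;
            // Combine a surrogate pair into one code point.
            if (char.IsHighSurrogate(c) && i + 1 < html.Length
                && char.IsLowSurrogate(html[i + 1]))
            {
                codePoint = char.ConvertToUtf32(c, html[i + 1]);
                i++;
            }

            sb.Append("&#");
            sb.Append(codePoint.ToString(CultureInfo.InvariantCulture));
            sb.Append(';');
        }
        return sb.ToString();
    }
}
```

With this, Somehtml = HtmlAscii.EscapeNonAscii(Somehtml); would stand in for the individual Replace calls, turning ™ into &#8482; and ’ into &#8217;.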
 