StreamWriter removes French characters

Rob

Hi,

I have a small VB.Net program that reads in an HTML file using a
FileStream (this file was created by MS Word "Save as HTML" feature),
uses regular expressions to remove all unwanted code and then re-writes
the file.

It works fine, but when I run it on a French web page, the
StreamWriter removes all of the French characters. Here's a piece of the
code:

Dim filename As String = txtFilename.Text.ToString
Dim sr As StreamReader
sr = File.OpenText(filename)
Dim textstream As String = sr.ReadToEnd()
sr.Close()

Dim newtext As String
newtext = CleanHTML(textstream)

Dim fs As New FileStream(OutputFilename, FileMode.Create, FileAccess.Write)
Dim sw As New StreamWriter(fs)

' I've also tried:
'Dim sw as New StreamWriter(fs, System.Text.Encoding.UTF8)

sw.WriteLine(newtext)
sw.Close()



I'm kinda new to .Net Development...does anyone see what's wrong here?



Thanks
 

Guest

newtext = CleanHTML(textstream)

What does this function do? Is it stripping out the French characters?

When you "clean" HTML... what are you doing? Have you tried
Server.HtmlEncode instead?
 
Rob

When I run the CleanHTML function...I remove a lot of garbage that comes
with saving a Word document as HTML. I remove all the styles, span tags
and a lot of other tags that are not needed.

When I clean up a French web page, the StreamReader or StreamWriter
doesn't seem to understand French characters and strips them out, but I
want to preserve them.

Another thing I've tried is, rather than using these lines:

Dim sr As StreamReader
sr = File.OpenText(filename)

I've replaced them with this:

Dim sr As New StreamReader(filename, System.Text.Encoding.Default)

but this returns strange results as well.
 
Rob

I got it.

I added System.Text.Encoding.Default to my StreamWriter as well as my
StreamReader and that fixed it.
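
For anyone finding this later, here is a minimal sketch of that fix: the same encoding is passed to both the reader and the writer. The module name, the CopyWithEncoding helper, and the file names sample.htm and output.htm are made up for illustration, and the call to CleanHTML is left as a comment because that function was never posted. Note that Encoding.Default is the system ANSI code page on .NET Framework, while on .NET Core and later it is UTF-8, so this only matches the situation in this thread on .NET Framework.

```vbnet
Imports System
Imports System.IO
Imports System.Text

Module EncodingFix
    ' Hypothetical helper: reads a file and writes it back,
    ' using the same explicit encoding for both steps.
    Sub CopyWithEncoding(inputPath As String, outputPath As String, enc As Encoding)
        Dim text As String
        Using sr As New StreamReader(inputPath, enc)
            text = sr.ReadToEnd()
        End Using

        ' The CleanHTML(text) call from the original post would go here.

        ' False = overwrite the output file rather than append to it.
        Using sw As New StreamWriter(outputPath, False, enc)
            sw.Write(text)
        End Using
    End Sub

    Sub Main()
        ' Create a small sample file containing an accented character
        ' (ChrW(233) is "é"), so the sketch runs without external input.
        Dim sample As String = "<p>caf" & ChrW(233) & "</p>"
        File.WriteAllText("sample.htm", sample, Encoding.Default)

        CopyWithEncoding("sample.htm", "output.htm", Encoding.Default)

        ' Read the result back with the same encoding and compare it
        ' with the text that was written.
        Dim roundTrip As String = File.ReadAllText("output.htm", Encoding.Default)
        Console.WriteLine(roundTrip = sample)
    End Sub
End Module
```

The sketch uses Write rather than WriteLine so that no extra line break is appended to the output file.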

Thanks
 

Herfried K. Wagner [MVP]

Rob said:
> I have a small VB.Net program that reads in an HTML file using a
> FileStream (this file was created by MS Word "Save as HTML" feature),
> uses regular expressions to remove all unwanted code and then re-writes
> the file.
>
> It works fine but when I execute this on a french web page...the
> StreamWriter removes all of the french characters. Here's a piece of the
> code:
>
> Dim filename As String = txtFilename.Text.ToString

Remove the '.ToString' as 'txtFilename.Text' is already of type 'String'.

> Dim sr As StreamReader
> sr = File.OpenText(filename)
> Dim textstream As String = sr.ReadToEnd()
Note that the 'StreamReader' returned by 'File.OpenText' decodes the file as
UTF-8 by default. If your file has been saved in a different encoding, such as
Windows' default ANSI codepage, then pass 'System.Text.Encoding.Default' to
the encoding parameter of the 'StreamReader' constructor.
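
To illustrate the mismatch described above, here is a small in-memory sketch, not taken from the original program. It encodes a string with a single-byte encoding and then decodes the same bytes once as UTF-8 and once with the original encoding. ISO-8859-1 stands in for the ANSI code page here as an assumption, because it is available on every .NET version and encodes "é" as the same byte (E9) as Windows-1252. The module name is made up for illustration.

```vbnet
Imports System
Imports System.Text

Module DecodeMismatch
    Sub Main()
        ' Stand-in for the ANSI code page the file was saved in (assumption).
        Dim latin1 As Encoding = Encoding.GetEncoding("iso-8859-1")

        ' "café", with ChrW(233) used for "é" to keep the source ASCII-only.
        Dim original As String = "caf" & ChrW(233)
        Dim bytes As Byte() = latin1.GetBytes(original)

        ' Decoding single-byte text as UTF-8: the lone byte E9 is not a
        ' valid UTF-8 sequence, so the decoder substitutes U+FFFD.
        Dim wrong As String = Encoding.UTF8.GetString(bytes)

        ' Decoding with the encoding the bytes were written in.
        Dim right As String = latin1.GetString(bytes)

        ' True if the UTF-8 decode contains the replacement character.
        Console.WriteLine(wrong.IndexOf(ChrW(&HFFFD)) >= 0)

        ' True if the matching decode returned the original text.
        Console.WriteLine(right = original)
    End Sub
End Module
```

This is why the accented characters appear to be removed: they are already lost when the file is read, before the 'StreamWriter' is ever involved.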