Reading and writing UTF-8 files

Peter Webb

I have to do some simple text editing to large-ish (2 Mbyte) html files
generated by Word. They are, I believe, in UTF-8.

It is the !%^&* problem where single apostrophes become sequences of funny
characters, some spaces are shown as unprintable characters, etc.

The following code just reads and writes a file, and shows the problem. It
applies whether I use a String or a StringBuilder, and whether or not I
explicitly force UTF-8 encoding.

Can somebody just tell me basically how to copy an html file by reading it
in and then writing it out, which is all the following methods are supposed
to do:

private StringBuilder getHtml(string fullfilename)
{
    StringBuilder concatenated = new StringBuilder();
    string line;
    // Read the file line by line and append each line to the buffer.
    using (System.IO.StreamReader htmlFile =
        new System.IO.StreamReader(fullfilename, System.Text.Encoding.UTF8))
    {
        while ((line = htmlFile.ReadLine()) != null)
            concatenated.Append(line + "\r\n");
    }
    return concatenated;
}

private void writehtmltofile(string outputfilenamewithpath,
    StringBuilder HTMLstring)
{
    // The using block closes and disposes the writer.
    using (System.IO.StreamWriter sw = new System.IO.StreamWriter(
        outputfilenamewithpath, false, System.Text.Encoding.UTF8))
    {
        // Write, not WriteLine, so no extra line break is appended.
        sw.Write(HTMLstring.ToString());
    }
}

Any assistance greatly appreciated.
 
Harlan Messinger

Peter said:
I have to do some simple text editing to large-ish (2 Mbyte) html files
generated by Word. They are, I believe, in UTF-8.

My very first guess is that the input file is actually not UTF-8, since
you said you only believed it to be.
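
To illustrate why a wrong guess matters, here is a minimal sketch. It assumes, purely for illustration, that the file is in a single-byte Windows code page such as Windows-1252, where the curly apostrophe is the single byte 0x92. The class and method names are made up for this example. Decoding that byte as UTF-8 fails, so .NET substitutes the replacement character U+FFFD, and writing it back out as UTF-8 produces three bytes where there was one.

```csharp
using System;
using System.Text;

public static class MojibakeDemo
{
    // Hypothetical helper: decodes the bytes as UTF-8 and encodes the
    // resulting string as UTF-8 again, which is what a StreamReader /
    // StreamWriter pair using Encoding.UTF8 does to the file contents.
    public static byte[] RoundTripAsUtf8(byte[] input)
    {
        string decoded = Encoding.UTF8.GetString(input);
        return Encoding.UTF8.GetBytes(decoded);
    }

    // Formats the bytes as hex so the before/after can be compared.
    public static string Hex(byte[] bytes)
    {
        return BitConverter.ToString(bytes);
    }
}
```

Calling `MojibakeDemo.Hex(MojibakeDemo.RoundTripAsUtf8(new byte[] { 0x69, 0x74, 0x92, 0x73 }))` returns `69-74-EF-BF-BD-73`: the lone 0x92 has become EF BF BD. A browser that then trusts a windows-1252 meta tag would show those three bytes as "ï¿½", which matches the kind of symptom described.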
 
Peter Duniho

Peter said:
I have to do some simple text editing to large-ish (2 Mbyte) html files
generated by Word. They are, I believe, in UTF-8.

Why do you believe that? As Harlan says, they very well might not be.

Peter said:
It is the !%^&* problem where single apostrophes become sequences of
funny characters, some spaces are shown as unprintable characters, etc.

The following code just reads and writes a file, and shows the problem.
It applies whether I use a String or a StringBuilder, and whether or not
I explicitly force UTF-8 encoding.

Can somebody just tell me basically how to copy an html file by reading
it in and then writing it out, which is all the following methods are
supposed to do:

I would say that if you just want to copy the file, the biggest problem
in your code is that you are interpreting the data at all. Why bother
with that if you just want to make a copy of the original data?

If you do have a need to interpret the data, then you'll have to _know_,
and not just believe, the encoding for the source data.

Pete
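
A minimal sketch of copying without interpreting the data, assuming the file fits comfortably in memory (2 Mbyte does). The class and method names are made up for this example. Because nothing is ever decoded into characters, the encoding of the source file does not matter.

```csharp
using System.IO;

public static class RawCopy
{
    // Hypothetical helper: copies the file as raw bytes.
    // No Encoding is involved, so every byte arrives unchanged.
    public static void CopyBytes(string sourcePath, string destinationPath)
    {
        byte[] data = File.ReadAllBytes(sourcePath);
        File.WriteAllBytes(destinationPath, data);
    }
}
```

This only helps for a straight copy. As soon as the text has to be searched or edited as a string, the bytes must be decoded with the correct encoding.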
 
Mihai N.

Peter said:
I have to do some simple text editing to large-ish (2 Mbyte) html files
generated by Word. They are, I believe, in UTF-8.

You should not believe, you should check.
The encoding of the generated html file is affected by a Word setting;
it can be anything.

But there is a meta tag in there, check it:
(something like <meta http-equiv="content-type" content="text/html;
charset=iso-8859-1"> or whatever)
 
Peter Webb

Mihai N. said:
You should not believe, you should check.
The encoding of the generated html file is affected by a Word setting,
it can be anything.

But there is a meta tag in there, check it:
(something like <meta http-equiv="content-type" content="text/html;
charset=iso-8859-1"> or whatever)

Thank you very much for that. I had no idea even how to find out the encoding
type of the html.

Turns out it is <meta http-equiv=Content-Type content="text/html;
charset=windows-1252">

That's Western European Windows. Unfortunately, StreamReader and
StreamWriter do not contain an option for this encoding.

The whole thing also strikes me as somehow wrong, because I have to read the
file using (say) ASCII, extract the encoding, then re-open the file with a
new StreamReader with the correct encoding. It surely can't be that
cumbersome?

I am at the point of reading it in as a binary file of bytes, and doing the
whole thing by "hand". However, the reason I am reading and writing the file
is to do some text substitutions, and so I want to use Strings internally. I
am worried that my conversion to/from strings and byte arrays may introduce
the same problem.

Any help would be appreciated. I really just need code to copy an html
file line by line, it sounds easy ...
 
Harlan Messinger

Peter said:
Thank you very much for that. I had no idea even how to find out the
encoding type of the html.

Turns out it is <meta http-equiv=Content-Type content="text/html;
charset=windows-1252">

That's Western European Windows. Unfortunately, StreamReader and
StreamWriter do not contain an option for this encoding.

It isn't one of the few that have been pre-named for you. You just need
to use System.Text.Encoding.GetEncoding("Windows-1252").

Peter said:
The whole thing also strikes me as somehow wrong, because I have to read
the file using (say) ASCII, extract the encoding

And that presupposes that the META tag exists, and, if it exists, that
it's correct.

Peter said:
, then re-open the file
with a new StreamReader with the correct encoding. It surely can't be
that cumbersome?
It isn't, if you're just trying to copy the file. Use
System.IO.File.Copy(sourcePath, destinationPath). If you do need to read
the file, as characters, *and you don't know what the encoding is in
advance*, then, yes, it's that complicated. Just like you can't hire
someone to translate arbitrary foreign-language documents that you may
receive into English unless either (a) you know in advance that they're
all in, say, French, or (b) you open each one to find out what language
it's in before you hire a translator who can translate it.

On the other hand, if you know that the files are *all* Windows 1252,
then there's no problem at all.

Peter said:
I am at the point of reading it in as a binary file of bytes, and doing
the whole thing by "hand". However, the reason I am reading and writing
the file is to do some text substitutions, and so I want to use Strings
internally. I am worried that my conversion to/from strings and byte
arrays may introduce the same problem.

Well, then you *do* need to decode it. Oh, well!
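
For the case where every file is known to be Windows-1252, a minimal sketch of decode, substitute, and re-encode might look like the following. The class and method names are made up for this example. It assumes .NET Core 3.0 or later, where code-page encodings have to be registered before use; on the classic .NET Framework the RegisterProvider line is not needed and should be removed.

```csharp
using System.IO;
using System.Text;

public static class Cp1252Edit
{
    // Returns the Windows-1252 encoding.
    // Assumption: running on .NET Core 3.0+ / .NET 5+, where code pages
    // such as 1252 are only available after registering this provider.
    public static Encoding GetWindows1252()
    {
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return Encoding.GetEncoding("windows-1252");
    }

    // Hypothetical helper: reads the file as Windows-1252, replaces
    // oldText with newText, and writes the result in the same encoding.
    public static void ReplaceInFile(string inputPath, string outputPath,
                                     string oldText, string newText)
    {
        Encoding enc = GetWindows1252();
        string html = File.ReadAllText(inputPath, enc);
        File.WriteAllText(outputPath, html.Replace(oldText, newText), enc);
    }
}
```

Using the same Encoding object for reading and writing is what keeps the round trip safe: every byte that was decoded into a character is encoded back to the same byte unless the substitution touched it.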
 
Mihai N.

Peter said:
Unfortunately, StreamReader and
StreamWriter do not contain an option for this encoding.

They do.
StreamWriter and StreamReader have some constructors
that take an Encoding:

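------------
// Needs: using System.IO; using System.Text;
Encoding enc = Encoding.GetEncoding( "windows-1252" );

// FileMode.CreateNew throws if the file already exists;
// use FileMode.Create instead to overwrite.
FileStream fs = new FileStream( fileName,
    FileMode.CreateNew, FileAccess.Write, FileShare.None );

// Everything written through sw is encoded as windows-1252.
StreamWriter sw = new System.IO.StreamWriter( fs, enc );
------------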
 
Peter Webb

Thank you both.

Using System.Text.Encoding.GetEncoding("Windows-1252") for reading and
writing works fine. It's a highly structured process for creating the html
from Word, buggered if I'm going to write and debug this for six other
encoding schemes which I have no way to properly test ...

Never had any problem reading and writing text files before. Came as a nasty
surprise.

Thanks again.
 
Mihai N.

Peter said:
Using System.Text.Encoding.GetEncoding("Windows-1252") for reading and
writing works fine. It's a highly structured process for creating the html
from Word, buggered if I'm going to write and debug this for six other
encoding schemes which I have no way to properly test ...

It's great that it works, but just don't hard-code "windows-1252" there.
Take and use the stuff in the meta. A bit more work, but not much,
and you know it's going to work for a lot of other things.

If someone prefers utf-8 your stuff will fail, and will give a pretty
bad impression. Personally, I do almost everything with utf-8, and
utf-8 is getting close to 50% on the web:
http://googleblog.blogspot.com/2010/01/unicode-nearing-50-of-web.html
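
A rough sketch of taking the encoding from the meta tag instead of hard-coding it. The class name, method name, 4096-byte window, and regular expression are all assumptions made for this example, not anything prescribed in the thread. It reads only the start of the file as ASCII, which is enough to find the charset name in any ASCII-compatible encoding (it would not work for UTF-16 files), and falls back to a caller-supplied encoding when the tag is missing or names an unknown encoding. The RegisterProvider call assumes .NET Core 3.0 or later; on the classic .NET Framework it should be removed.

```csharp
using System;
using System.IO;
using System.Text;
using System.Text.RegularExpressions;

public static class CharsetSniffer
{
    // Hypothetical helper: looks for "charset=NAME" near the top of the
    // file and returns the matching Encoding, or fallback if none is found.
    public static Encoding DetectFromMeta(string path, Encoding fallback)
    {
        // Assumption: .NET Core 3.0+, where code pages need registering.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        byte[] head = new byte[4096];
        int read;
        using (FileStream fs = File.OpenRead(path))
        {
            read = fs.Read(head, 0, head.Length);
        }

        // The tag itself is plain ASCII; non-ASCII bytes become '?'.
        string text = Encoding.ASCII.GetString(head, 0, read);
        Match m = Regex.Match(text,
            @"charset\s*=\s*[""']?([A-Za-z0-9_\-]+)",
            RegexOptions.IgnoreCase);
        if (!m.Success)
        {
            return fallback;
        }

        try
        {
            return Encoding.GetEncoding(m.Groups[1].Value);
        }
        catch (ArgumentException)
        {
            // The tag named an encoding this runtime does not know.
            return fallback;
        }
    }
}
```

The returned Encoding can then be passed to both the StreamReader and the StreamWriter. As noted earlier in the thread, this still trusts the meta tag to exist and to be correct.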
 
