Saving a Word DOC as HTML

Peter Rooney

I've discovered that saving a document as HTML preserves
attributes such as italic ( <I> ... </I> ) and bold, as well as certain
characters, such as &#268; (the Czech letter C with a caron). If the document
is a table, the content of the cells is saved inside HTML tag pairs like
<td> ... </td> . In fact, Save As HTML gives me a text file which I can
analyze and process with external programs.
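
As a rough illustration of that kind of external processing, here is a minimal Python sketch that pulls the text of each <td> cell out of a saved file, using only the standard library's html.parser. The names CellExtractor and extract_cells are made up for this example, and it assumes simple tables with no nesting.

```python
from html.parser import HTMLParser


class CellExtractor(HTMLParser):
    """Collect the text of each <td> cell, grouped by <tr> row."""

    def __init__(self):
        # convert_charrefs=True decodes references such as &#268; into characters
        super().__init__(convert_charrefs=True)
        self.rows = []
        self._cell = None  # list of text pieces while inside a <td>, else None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag == "td":
            self._cell = []

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            if not self.rows:  # tolerate a <td> with no enclosing <tr>
                self.rows.append([])
            # Join the pieces and collapse runs of whitespace
            text = " ".join("".join(self._cell).split())
            self.rows[-1].append(text)
            self._cell = None

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)


def extract_cells(html_text):
    """Return a list of rows, each a list of cell strings."""
    parser = CellExtractor()
    parser.feed(html_text)
    parser.close()
    return parser.rows
```

Tags inside a cell (italic, span, and so on) are simply ignored here, so only the cell text survives; a filter that needs the italic or bold markers would have to record them in handle_starttag.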

So far so good. The problem is that the Save As conversion also gives me a
lot of trash that is of no use to me, such as:
<p class=MsoPlainText style='margin-left:.25 in. etc etc
<span style='font-size:12 etc ... </span>
etc

I can search and replace some of these strings, and I have developed
filters that can take care of others. But it's a laborious process involving
several programs. It would be better not to get the "trash" in the first
place. The Save As XML option is even worse. Is there a way to make a simple
conversion using Word or third-party software?
 
Tony Jollans

One man's trash is another's treasure trove. The short answer to your
question is no because only you know which bits of formatting you want to
interrogate and which you want to ignore.

That said, I believe there is a way to remove some of the bloat that Word
adds to HTML documents. I can't remember it off the top of my head but I
think most of what it removes is relatively easy to identify yourself.
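
To give an idea of how that identification could go, here is a minimal Python sketch (standard library only) that drops the class, style and lang attributes, unwraps span-type tags, and discards comments and the style block. It is not the Microsoft filter, just an illustration; the names WordHtmlCleaner, clean_word_html, DROP_TAGS and DROP_ATTRS are invented for the example, and which tags and attributes count as "trash" is an assumption you would adjust.

```python
from html import escape
from html.parser import HTMLParser

DROP_TAGS = {"span", "font", "o:p"}      # wrappers removed, their text is kept
DROP_ATTRS = {"class", "style", "lang"}  # attributes removed from every tag
SKIP_CONTENT = {"style"}                 # elements removed together with content


class WordHtmlCleaner(HTMLParser):
    def __init__(self):
        # Keep character references (e.g. &#268;) exactly as written
        super().__init__(convert_charrefs=False)
        self.out = []
        self._skipping = False

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_CONTENT:
            self._skipping = True
            return
        if tag in DROP_TAGS:
            return
        parts = [tag]
        for name, value in attrs:
            if name in DROP_ATTRS:
                continue
            if value is None:
                parts.append(name)
            else:
                parts.append(f'{name}="{escape(value, quote=True)}"')
        self.out.append("<" + " ".join(parts) + ">")

    def handle_startendtag(self, tag, attrs):
        # Emit self-closing tags such as <br/> as a plain start tag
        self.handle_starttag(tag, attrs)

    def handle_endtag(self, tag):
        if tag in SKIP_CONTENT:
            self._skipping = False
        elif tag not in DROP_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self._skipping:
            self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    # Comments and declarations are not overridden, so they are dropped.


def clean_word_html(html_text):
    """Return html_text with Word-specific attributes and wrappers removed."""
    parser = WordHtmlCleaner()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)
```

Run over a line like the <p class=MsoPlainText ...> example, this leaves a bare <p> while keeping <i>, <td> and the character references untouched. Tag names come out in lower case because the parser normalizes them.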
 
Peter Rooney

Thanks. The "Office 2000 HTML Filter" - downloaded as MSOHTMF2.EXE - seems
to be just what I'm looking for (aside from the limitation that you need to
have Office 2000 on your system to install it - Office 2003 won't do. It's a
ridiculous limitation, but can be circumvented).


Stephen Glynn wrote:
> according to an article at
> http://techrepublic.com.com
> there are a couple of free utilities you can download from Microsoft
> that'll clear out the gubbins that Word introduces into HTML
 
Bob Buckland ?:-)

Hi Peter,

In Word 2002 and Word 2003, using
File=>Save As Web Page-Filtered will do
basically the same thing as the Office 2000
HTML filter (at least the part used inside of
Word, which the MSFilter.DOT add-in lists as
File=>Save as Compact HTML).

In Word 2002 and 2003 the 'Filtered'
content will take into consideration the settings
you have in Tools=>Options=>General=>[Web Options]


You can still use the standalone Office 2000 MSFilter.exe
tool to batch process already created Word HTML files and
it will remove the CSS style formatting from the filtered
HTML pages as well.

You can also use apps such as HTML Tidy to process the files.
Creating 'public use' web pages wasn't really the design goal
for Word's HTML output after Word 97 :). Rather, it was meant as a way
to create a 'browser viewable' version of a Word document while retaining
all of the parts of a .doc file that a browser doesn't support,
so you could turn it back into a .doc file when it was opened in Word
from a browser ('roundtripping').

For 'web pages', MS Office FrontPage was the targeted app.

--
Let us know if this helped you,

Bob Buckland ?:)
MS Office System Products MVP

*Courtesy is not expensive and can pay big dividends*

For Everyday MS Office tips to "use right away" -
http://microsoft.com/events/series/administrativetipsandtricks.mspx
 
