Word to HTML

whatwasthat

I need to write a program that will modify some formatting
in HTML documents that are saved from Word. I notice that the HTML documents
generated by Word have a particular "look". Is there any documentation around
on what the HTML document generated by Word will look like?

Or, if there is no formal documentation, a back-of-the-envelope description will do.
 
Bob Buckland ?:-)

I'm not sure what you mean by how the HTML document will 'look'. The appearance depends, in part, on which browser a person uses to
view the Word-generated web document.

Word uses a combination of HTML, CSS and XML to produce a web document, with the goal being (a) to make the document able to be reopened
in Word without loss of content and (b) to try to make it look similar to how the Word document would look on screen. Browsers don't
support all of Word's features, so there are some tradeoffs. The documentation is available in a number of forms.

- Basic overview
http://support.microsoft.com/kb/212270/en-us?FR=1

- A bit more involved.
http://msdn.microsoft.com/en-us/library/aa338201.aspx

--

Bob Buckland ?:)
MS Office System Products MVP

*Courtesy is not expensive and can pay big dividends*
 
whatwasthat

Hi,

Thanks for the response. Because I need to interrogate the HTML document
generated by Word, I thought I could save some effort if I could make
assumptions about the HTML generated by Word, such as when it generates a
new class and what name it uses, when it puts a style in the <style>
block, when it uses an inline style, when it uses the <em> tag, etc.

I guess I'm really after the algorithm/logic that Word uses to generate the
HTML document.

Failing that, the specific information I'm really after is: does Word always
put font-size and font-family information in the <style> block and, if not, when
does it make these inline styles?
 
Bob Buckland ?:-)

You're probably going to have to do some reading in the links previously provided and do some testing with documents created by Word
that represent your environment <g>. Part of the parsing of the file depends on what it is you're looking to extract, if you don't
want to use one of Word's two built-in 'web document save formats'. Parsing on a 'look for X and ignore the rest' basis
may be easier than trying to generate your own HTML and still have it look like the original document.

There isn't a simple 'always' answer for the HTML document, any more than there is for 'what's in a regular Word document' as far as
content, style and formatting go (the Word 2007 spec on this runs to 1,000+ pages <g>).

For example, Word creates CSS, but a CSS template can also be attached to a document and applied.

Assuming the HTML was generated using the 'Word Web Document' save format, rather than the 'Word Web Document-Filtered' file type
choice, there is usually a <div class...> for each section of the document, and you can add numerous sections in Word with
'section breaks' of various types. There are also <p class...> elements that Word will generate for a Style change that is listed in the
<style> section, among others.
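
One way to see which <div class...> and <p class...> combinations turn up in your own files is to count them. This is only a rough Python sketch using the standard library's html.parser; the names ClassInventory and class_inventory are made up for illustration, and SAMPLE is a hand-typed fragment like the ones further down in this reply, not a complete Word file.

```python
from collections import Counter
from html.parser import HTMLParser


class ClassInventory(HTMLParser):
    """Count (tag, class) pairs such as ('p', 'MsoNormal')."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # Word often writes unquoted attributes (class=MsoNormal);
        # html.parser accepts those as well as quoted ones.
        cls = dict(attrs).get("class")
        if cls:
            self.counts[(tag, cls)] += 1


def class_inventory(html):
    """Return a Counter of (tag, class) pairs found in the HTML text."""
    parser = ClassInventory()
    parser.feed(html)
    parser.close()
    return parser.counts


# Hand-typed illustration, not taken from a real saved document.
SAMPLE = """<body lang=EN-US style='tab-interval:.5in'>
<div class=Section1>
<p class=MsoNormal>This is text sample paragraph 1.</p>
<p class=MsoNormal>This is text sample paragraph 2.</p>
</div>
</body>"""

for (tag, cls), n in sorted(class_inventory(SAMPLE).items()):
    print(f"{tag}.{cls}: {n}")
```

Running this over a handful of representative documents should show fairly quickly which class names you actually need to handle.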

The Styles section in the web document reflects all the styles in use in the regular Word document, which can include the default
ones built into Word (and those vary by version) and any created by the user or by Word on the fly. It can also include direct formatting,
if the user paints the text in the document with formatting that isn't part of a given style.
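
To check where font-size and font-family end up in a given file, you could pull the rules out of the style block and look. The following is a minimal Python sketch under stated assumptions: it uses plain regular expressions rather than a real CSS parser, it assumes flat rules with no nested braces, and the function name style_block_fonts and the SAMPLE fragment are invented for illustration (SAMPLE is typed by hand, not copied from a Word file).

```python
import re


def style_block_fonts(html):
    """Return {selector: {property: value}} for font-size / font-family
    declarations found inside the document's <style> blocks."""
    fonts = {}
    for css in re.findall(r"<style[^>]*>(.*?)</style>", html, re.S | re.I):
        # Drop an HTML comment wrapper around the rules, then CSS comments.
        css = css.replace("<!--", "").replace("-->", "")
        css = re.sub(r"/\*.*?\*/", "", css, flags=re.S)
        # Assumes flat "selectors { declarations }" rules.
        for selectors, body in re.findall(r"([^{}]+)\{([^{}]*)\}", css):
            props = {}
            for decl in body.split(";"):
                name, sep, value = decl.partition(":")
                name = name.strip().lower()
                if sep and name in ("font-size", "font-family"):
                    props[name] = value.strip()
            if props:
                for selector in selectors.split(","):
                    fonts[selector.strip()] = dict(props)
    return fonts


# Hand-typed illustration of a style block, not a real Word export.
SAMPLE = """<html><head><style>
<!--
 /* Style Definitions */
 p.MsoNormal, li.MsoNormal, div.MsoNormal
    {mso-style-parent:"";
    margin:0in;
    font-size:12.0pt;
    font-family:"Times New Roman";}
-->
</style></head><body></body></html>"""

for selector, props in style_block_fonts(SAMPLE).items():
    print(selector, props)
```

Anything that does not show up here but still renders with a different font is a candidate for an inline style, which has to be looked for separately.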

For example, if the text typed into Word was

This is sample text paragraph 1.
This is sample text paragraph 2.

and the text was just entered with the default, out-of-the-box "Normal" style in a Word 2003 document (that style becomes
MsoNormal in the web document), then the Word-generated 'HTML' would be

===========typed in sample ==========
<body lang=EN-US style='tab-interval:.5in'>
<div class=Section1>
<p class=MsoNormal>This is text sample paragraph 1.</p>
<p class=MsoNormal>This is text sample paragraph 2.</p>
</div>
</body>
</html>
==============end typed in sample

but, if the same text is pasted into the document it could be

===========pasted in================
<body lang=EN-US style='tab-interval:.5in'>

<div class=Section1>

<p class=MsoNormal>This is sample text paragraph 1.<o:p></o:p></p>

<p class=MsoNormal>This is sample text paragraph 2.<o:p></o:p></p>

</div>

</body>
============end pasted in sample =========================
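
If you only care about the text, the empty <o:p></o:p> pairs can simply be ignored. Here is a small Python sketch of that idea, again with the standard html.parser; ParagraphText and paragraph_texts are hypothetical names, and TYPED and PASTED are hand-typed one-paragraph fragments in the spirit of the two samples above.

```python
from html.parser import HTMLParser


class ParagraphText(HTMLParser):
    """Collect the text of each <p>, ignoring any tags nested inside it."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._buf = None  # None while outside a <p>

    def handle_starttag(self, tag, attrs):
        # <o:p> arrives here with tag == "o:p", so it does not open a paragraph.
        if tag == "p":
            self._buf = []

    def handle_data(self, data):
        if self._buf is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self._buf is not None:
            # Collapse runs of whitespace, since Word wraps long lines.
            self.paragraphs.append(" ".join("".join(self._buf).split()))
            self._buf = None


def paragraph_texts(html):
    """Return the list of paragraph texts found in the HTML text."""
    parser = ParagraphText()
    parser.feed(html)
    parser.close()
    return parser.paragraphs


# Hand-typed illustrations of the two variants.
TYPED = "<p class=MsoNormal>This is sample text paragraph 1.</p>"
PASTED = "<p class=MsoNormal>This is sample text paragraph 1.<o:p></o:p></p>"

print(paragraph_texts(TYPED) == paragraph_texts(PASTED))
```

With this approach the typed and pasted variants compare equal, so the parser does not need to know how the text got into the document.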

If the word 'sample' was painted with italics in the first paragraph and with the yellow highlighter tool in the 2nd paragraph, then
the typed text becomes the following (Word generally adds <span...> tags for direct formatting over text with an applied style).

=============== italics and highlighter tool sample ============
<body lang=EN-US style='tab-interval:.5in'>

<div class=Section1>

<p class=MsoNormal>This is text <i style='mso-bidi-font-style:normal'>sample</i> paragraph 1.</p>

<p class=MsoNormal>This is text <span style='background:yellow;mso-highlight:yellow'>sample</span> paragraph 2.</p>

</div>
</body>
=============
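
Direct formatting like this can be found by looking for inline-level tags that carry a style attribute. A rough Python sketch, assuming the tag list in INLINE_TAGS covers what your documents use (that list is a guess, and InlineStyleFinder, parse_declarations and inline_styles are names invented for illustration):

```python
from html.parser import HTMLParser

# Assumed set of inline tags worth inspecting; adjust for your documents.
INLINE_TAGS = {"span", "i", "b", "u", "em", "strong"}


def parse_declarations(style):
    """Split an inline style string into a {property: value} dict."""
    decls = {}
    for part in style.split(";"):
        name, sep, value = part.partition(":")
        if sep and name.strip():
            decls[name.strip().lower()] = value.strip()
    return decls


class InlineStyleFinder(HTMLParser):
    """Record (tag, declarations, text) for inline elements with a style attribute."""

    def __init__(self):
        super().__init__()
        self.found = []
        self._open = []  # stack of [tag, declarations, text chunks]

    def handle_starttag(self, tag, attrs):
        if tag in INLINE_TAGS:
            style = dict(attrs).get("style") or ""
            self._open.append([tag, parse_declarations(style), []])

    def handle_data(self, data):
        # Text belongs to every inline element that is currently open.
        for entry in self._open:
            entry[2].append(data)

    def handle_endtag(self, tag):
        if self._open and self._open[-1][0] == tag:
            _, decls, chunks = self._open.pop()
            if decls:  # keep only elements that had inline declarations
                self.found.append((tag, decls, "".join(chunks)))


def inline_styles(html):
    """Return the styled inline elements found in the HTML text."""
    parser = InlineStyleFinder()
    parser.feed(html)
    parser.close()
    return parser.found


SAMPLE = """<p class=MsoNormal>This is text <i style='mso-bidi-font-style:normal'>sample</i> paragraph 1.</p>
<p class=MsoNormal>This is text <span style='background:yellow;mso-highlight:yellow'>sample</span> paragraph 2.</p>"""

for tag, decls, text in inline_styles(SAMPLE):
    print(tag, repr(text), decls)
```

Checking each returned dict for a font-size or font-family key is then one way to see whether a given document carries font information inline.
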
If someone applied the default Heading 2 style to the 2nd sentence, through either promote/demote in outline view or by style
selection, you'd find that for that particular style Word would not use the <style> listing, but would use the HTML <h2> element as
shown here, trying to use a W3C 'standard' (HTML) formatting as first choice.

========== Word built in HTML style ===========
<body lang=EN-US style='tab-interval:.5in'>

<div class=Section1>

<p class=MsoNormal>This is text sample paragraph 1.</p>

<h2>This is text sample paragraph 2.</h2>

</div>

</body>
=========
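
A parser therefore has to treat both the tag name and the class as possible carriers of the Word style. The Python sketch below shows one way to do that. Two things in it are assumptions rather than documented behaviour: that the other heading levels (h1, h3 to h6) follow the same pattern as the <h2> shown above, and that any other class name can be reported as-is. BlockStyles and block_styles are made-up names.

```python
import re
from html.parser import HTMLParser


class BlockStyles(HTMLParser):
    """Pair each <p> or heading with a guessed Word style name."""

    def __init__(self):
        super().__init__()
        self.blocks = []       # (style name, text)
        self._current = None   # [tag, style name, text chunks] while inside a block

    def handle_starttag(self, tag, attrs):
        heading = re.fullmatch(r"h([1-6])", tag)
        if heading:
            # Assumption: h1..h6 correspond to Heading 1..Heading 6.
            self._current = [tag, f"Heading {heading.group(1)}", []]
        elif tag == "p":
            cls = dict(attrs).get("class") or "(no class)"
            # MsoNormal is the web name of the Normal style.
            name = "Normal" if cls == "MsoNormal" else cls
            self._current = [tag, name, []]

    def handle_data(self, data):
        if self._current is not None:
            self._current[2].append(data)

    def handle_endtag(self, tag):
        if self._current is not None and self._current[0] == tag:
            _, name, chunks = self._current
            self.blocks.append((name, "".join(chunks).strip()))
            self._current = None


def block_styles(html):
    """Return (style name, text) for each paragraph or heading."""
    parser = BlockStyles()
    parser.feed(html)
    parser.close()
    return parser.blocks


SAMPLE = """<div class=Section1>
<p class=MsoNormal>This is text sample paragraph 1.</p>
<h2>This is text sample paragraph 2.</h2>
</div>"""

for name, text in block_styles(SAMPLE):
    print(f"{name}: {text}")
```

Test it against documents from your own environment before relying on the mapping, since user-defined styles will show up under whatever class names Word generates for them.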

To keep from starting from scratch <g>, you may find the Word2HTML.XSL style sheet tool helpful in parsing the Word [web]
documents. It's part of the WMLView.exe download linked from
http://blogs.msdn.com/brian_jones/archive/2005/09/30/475794.aspx
