ms-word creates html dash problem for front page

Guest

greetings from canada!!

i am using ms-word to transform/create a word-mode file into an html-mode
file (a manual of 350+ pages)..... (using the "save as .html"
function).

background to the application: the html file has been run through the html
filter to eliminate the ms-word superfluous coding that would normally enable
"full circle" editing using ms-word....

we do not need that ms-word coding as the html file will not be going back
to ms-word - and although there may be some minor editing by ms-frontpage, the
file will then be passed onward to a custom program that automatically
generates thousands of hyperlinks as well as a permuted cross-reference from
the content itself....

my problem: regardless of whether the html filter is used or not, the html
output from ms-word contains a small flaw: on some blank lines throughout
the html-mode file, there is a "_" (i.e. looks like an "underscore"
character) in column 1 --

my request for assistance: i would like to know how to completely prevent
this character from ever occurring so that we might alter our word processing
text entry rules to avoid its presence in the html-mode file....

any suggestions will be very much appreciated....

jack bonney
 
lostinspace

----- Original Message -----
From: "jack w. bonney vancouver" <>
Newsgroups: microsoft.public.word.docmanagement
Sent: Thursday, July 07, 2005 11:01 PM
Subject: ms-word creates html dash problem for front page

Hello Jack,
Ever hear of Patterson Park ;-)))

Your first three paragraphs are in terrible conflict.
It's impossible to create an html page from Word that does not include both
invalid and deprecated html. (A simple example is in bolding fonts:
<b></b>.)

You haven't specified which version of Word you're using, which may assist
the MVPs.

I'm assuming that you started from scratch and created your document in Word
with both font and page layout formatting? Perhaps even some other Word
goodies added in?

My suggestion is to take your entire document of approx 350 pages and save
it as a STRAIGHT-TEXT file with not a solitary piece of formatting or layout
included and create your html pages from that text. (It's the only way to
avoid such simple errors as you currently face.)

The older versions of FrontPage were no better when using fonts or
components than Word. FP also used deprecated html.
MS has said that 2003 FP (Standalone) would be better, however I've heard
from others that's not so.

In your second last paragraph you provide the following:

" there is a '_' (i.e. looks like an 'underscore' character) in column 1 --
"

There is conflict here as well.
Is it an underscore or a double hyphen, and is it surrounded by quotes, or
have you just used the quotes for emphasis?

I'm inclined to believe that these mystery characters are weaknesses in the
html cleaner. Nor do I believe you'll find a setting in Word which will
control the use of these characters (at least as related to the html
component).

If the quotes are there, a search and replace with any text editor would be
very easy to do.

If the underscore is the mystery character, then it's possibly related to
the image links that Word creates.

Before composing this reply, I used Word to create only my second-ever html
page. I have numerous settings in Word (as related to html) turned off, as I
dug for solutions for others.

I copied a complex CSS-HTML page into Word and then saved as html (with the
aforementioned settings.) I do not have the html cleaner for Word2000
installed as I never intend to use Word to create web pages.

The result of the copy and save as related to the viewing of the html
afterward was a far cry from being anywhere in the same continent as clean
html.
 
Bob Buckland ?:-)

Hi Jack,

To add to the reply from lostinspace:
without knowing the content of the Word .doc file,
seeing a snippet of the HTML that shows the 'mystery'
character, and knowing the version of Word you're using,
it's not easy to know where to focus attention.

If you're using the Office 2000 HTML filter, you'll
usually find that using the standalone MSFilter.exe
tool (Start=>Programs=>Microsoft Office Tools as well)
will give you more options for stripping out the information
than if you're using File=>Export to HTML from inside of
Word.

This article has additional information on what the filter
can remove.

http://office.microsoft.com/en-us/assistance/HA010549981033.aspx

--
Let us know if this helped you,

Bob Buckland ?:)
MS Office System Products MVP

*Courtesy is not expensive and can pay big dividends*

For Everyday MS Office tips to "use right away" -
http://microsoft.com/events/series/administrativetipsandtricks.mspx
 
Amedee Van Gasse

lostinspace shared this with us in microsoft.public.word.docmanagement:
> My suggestion is to take your entire document of approx 350 pages and
> save it as a STRAIGHT-TEXT file with not a solitary piece of
> formatting or layout included and create your html pages from that
> text. (It's the only way to avoid such simple errors as you currently
> face.)

This also works:
* Open your document in Word.
* Copy everything to the clipboard: CTRL+A CTRL+C
* Open Dreamweaver (I don't know Frontpage, never used it)
* Paste from clipboard: CTRL+V

This method preserves most of the formatting AND gives you better HTML.
Dreamweaver 2004 is even supposed to be HTML 4.01 and XHTML compliant.
 
Guest

amadee - thank you for your fast response to my question.... patterson park
is the trotting track in ladner, right??

unfortunately, at the moment, it is not a perfect world out there!! the
original document follows a typical pattern of evolution: it was created by
collaboration between 4 amateurs using word 3.1 (1995) with a little bit of
wordperfect thrown in for good measure.... in other words, it was a dog's
breakfast!!

everything was converted/consolidated under word 97 and subsequently
migrated up to its current platform of windows xp and ms-word 2000....

anyway, all the word processing must be preserved because there are several
hundreds of copies in the field, and ms-word is used to maintain the
paper-based version.... (and some management egos are involved, not to
mention budget)....

at the same time, we want to transform the wordprocessed version to
electronic display.....

the "mystery character" looks like the underscore character but is actually
a bit shorter, and when viewed with frontpage on the split-screen, each
character represents a "chunk" of html code, but there is no underscore
character in the code string..... therefore, i cannot "search and
replace"....

the object of the exercise is to be able to transform a legacy document over
to electronic display as automatically as possible, with only the bare
minimum of cosmetic "touch-up" with front page..... i.e. keeping the human
intervention to an absolute minimum.... i couldn't provide the output image
as it would not copy to this panel....

however, here is the html code for three lines with the mystery character
that displays in frontpage and ms-explorer: when i highlight the character as
it displays in the design panel, only the &nbsp; string is highlighted
accordingly in the code panel..... but in fact, nothing is supposed to
display because it is intended to be a blank line....

<p class=MsoNormal align=center style='text-align:center'><u><span
style='font-size:28.0pt;'>&nbsp;</span></u></p>

<p class=MsoNormal align=center style='text-align:center'><u><span
style='font-size:28.0pt;'>&nbsp;</span></u></p>

<p class=MsoNormal align=center style='text-align:center'><u><span
style='font-size:28.0pt;'>&nbsp;</span></u></p>


the result is an underscore displayed down the center of the page..... the
situation occurs intermittently throughout the document....
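
For anyone who wants to locate these paragraphs before deciding on a fix,
here is a minimal Python sketch (not something used in this thread) that
lists the source line of every paragraph containing a <u> element but no
visible text. It assumes the markup looks like the three lines above; the
class and function names are made up for illustration, and it treats both a
plain space and &nbsp; as invisible.

```python
from html.parser import HTMLParser


class EmptyUnderlineFinder(HTMLParser):
    """Record the line of each <p> that holds a <u> but no visible text."""

    def __init__(self):
        # convert_charrefs=True turns &nbsp; into the character U+00A0
        super().__init__(convert_charrefs=True)
        self.hits = []        # line numbers where offending <p> tags start
        self._in_p = False
        self._p_line = None
        self._has_u = False
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_p = True
            self._p_line = self.getpos()[0]
            self._has_u = False
            self._text = []
        elif tag == "u" and self._in_p:
            self._has_u = True

    def handle_data(self, data):
        if self._in_p:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self._in_p:
            # str.strip() also removes non-breaking spaces (U+00A0)
            visible = "".join(self._text).strip()
            if self._has_u and not visible:
                self.hits.append(self._p_line)
            self._in_p = False


def find_empty_underlined_paragraphs(html):
    """Return the line numbers of underlined paragraphs with no visible text."""
    finder = EmptyUnderlineFinder()
    finder.feed(html)
    finder.close()
    return finder.hits
```

The line numbers only point into the html file, but the text just before
each hit should help find the matching spot in the word document.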

thank you in advance for your continued interest in this issue and for any
further advice you may be able to offer....

jack.
 
Daiya Mitchell

I've been told this method also works for FrontPage. However, pasting 10
pages in DW caused a pause for me while it showed up--the OP might want to
do this in chunks.
 
lostinspace

----- Original Message -----
From: "jack w. bonney vancouver" <>
Newsgroups: microsoft.public.word.docmanagement
Sent: Friday, July 08, 2005 10:20 AM
Subject: Re: ms-word creates html dash problem for front page & ms-explorer

Jack,
Aye! Patterson Park, a trotting park. I'm not sure of the location then or
today (1960 reference); some seem to believe that Patterson is
today's Sandown. At one time I was attempting to trace a trotting circuit
which began in the spring in locations throughout northwest Canada and ended
in the fall in Washington, Oregon and California (northwest US). I didn't
have any success as the documentation for these things is rather skimpy.

Given how the doc was created, and with faces to save, your options are
going to be very limited.
I don't believe either Word or the html cleaner will help. Nor do I believe
you'll find an automated method of removing mystery characters.

I downloaded the Standalone html cleaner that Bob Buckland provided a link
to and ran it on the aforementioned page that I created. The result was that
the cleaner removed most everything, however some items (such as absolute
paragraph position) still remain. [Even though I did NOT have any absolute
paragraph positioning in the original web page.]

I copied and pasted the three html lines that you provided into the html
option of FrontPage and was left with a blank web page. No mystery
characters. (This leads me to believe that something exclusive to your end
[server or OS] is the cause, although you did previously add that the
mystery characters were not appearing consistently.)

Additionally, most everything contained in the three lines of html that you
provided is either invalid or deprecated html.

Those three lines should read:

<p></p>

<p></p>

<p></p>

And no more. Anything in excess is bloat caused by either Word, the html
cleaner or FrontPage.

It's still my opinion that the most effective and most efficient method is
for you to start from scratch with basic text and design your layout with
CSS/html.
The alternative for server side is PHP/MySQL.

Web pages created by Word only make your already complicated situation more
complicated.
In the long run, the lack of a solution today will cause you far more
headaches in the future.

As far as Dreamweaver goes, many folks say it's very useful software, while
others say that it compares in many ways to FP in creating invalid HTML.
However, using DW as an html cleaner (as was suggested) may be an option to
explore, although not worth the price of purchase.

Unless Bob Buckland or one of the others is able to provide additional
insight, I don't see many options for you that will provide what you
desire to accomplish.
 
Guest

to all who have contributed to addressing my problem regarding the mystery
underline character appearing as a result of the transform from .doc to .html
using ms-word....

well, my technical support team has a motto: "we can do anything".... and
while every now and again, i do have reservations on the accuracy of that
motto, they have never let me down.....

so, given all the information that you folks contributed, we have managed to
resolve the issue by simply programming the darn stuff out of the file.....

in case you are interested in having a piece of the problem for
analysis, i have copied it to this panel:

<p class=MsoNormal align=center style='text-align:center'><u><span
style='font-size:28.0pt;mso-bidi-font-size:10.0pt'><![if
!supportEmptyParas]>&nbsp;<![endif]><o:p></o:p></span></u></p>

the precise "bug" starts at &nbsp and extends to p>
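
A rough idea of what programming it out could look like, as a Python
sketch. This is an assumption about the approach, not the actual program
used here: the regular expression and the function name are hypothetical,
and it only targets paragraphs shaped like the snippet above (a <p> wrapping
<u><span> with nothing inside except whitespace, &nbsp;, Word's
<![if ...]> / <![endif]> markers and <o:p> tags).

```python
import re

# A whole <p>...</p> whose underlined span holds no real text.
# Note: \s in a str pattern also matches a literal non-breaking space (U+00A0).
EMPTY_UNDERLINED_P = re.compile(
    r"<p\b[^>]*>\s*<u>\s*<span\b[^>]*>"
    r"(?:\s|&nbsp;?|<!\[[^\]]*\]>|</?o:p>)*"
    r"</span>\s*</u>\s*</p>",
    re.IGNORECASE,
)


def strip_empty_underlined_paragraphs(html, replacement=""):
    """Return (cleaned_html, count) with empty underlined paragraphs replaced.

    Pass replacement="<p>&nbsp;</p>" to keep a plain blank line instead of
    deleting the paragraph outright.
    """
    return EMPTY_UNDERLINED_P.subn(replacement, html)
```

The returned count gives a quick sanity check before the cleaned text is
written back to disk; paragraphs whose underlined span contains real text
are left alone.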

although i am puzzled as to how it gets there, i can imagine it has to do
with the metamorphoses that the file has gone thru....

thanks again for your help - we appear to be back on the tracks....

best regards,

jack bonney.
 
lostinspace

"the precise "bug" starts at &nbsp and extends to p>"


&nbsp; is the HTML escape code (character entity) for a non-breaking space,
and it is inserted by FrontPage in ALL open space paragraphs.
In FP, if you hit return and insert a line break (to the next paragraph), FP
will insert an entire row of these characters into the html, the number of
which I've never counted when deleting.
 
Amedee Van Gasse

lostinspace shared this with us in microsoft.public.word.docmanagement:

I don't know? I'm from .be.
Ouch!

Ouch!
PHB-alert!!!

The underscore part is <u> </u>, which is BAD html: it is deprecated in
HTML 4.0.
The recommended alternative is a style (css):

text-decoration: underline;
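
If the underlining on real headings needs to survive, one possible way to
apply that advice in bulk is sketched below in Python. The function name is
hypothetical and the regex approach is an assumption that suits simple,
un-nested Word output; it swaps each <u> for a span carrying the CSS rule.

```python
import re


def underline_tags_to_css(html):
    """Replace <u>...</u> with a span styled via text-decoration."""
    # \b keeps <ul> and other tags that merely start with "u" untouched
    html = re.sub(
        r"<u\b[^>]*>",
        '<span style="text-decoration: underline">',
        html,
        flags=re.IGNORECASE,
    )
    return re.sub(r"</u\s*>", "</span>", html, flags=re.IGNORECASE)
```

A class in a shared stylesheet would be tidier than an inline style, but the
inline form keeps the sketch self-contained.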

There exists html cleanup software that works in batch. GIYF.

Put it on public webspace and provide a link.

3 empty centered underlined paragraphs with a large font.

Let me guess: near titles, or where titles used to be?
> Given your method of creation for the doc and with faces to save?
> Your options are going to be very limited.
> I don't believe either Word or the html cleaner will help. Nor do I
> believe you'll find an automated method of removing mystery
> characters.
>
> I downloaded the Standalone html cleaner that Bob Buckland provided a
> link to and ran the aforementioned page that I created. The result
> was that the cleaner removed most everything, however some items
> (such as absolute paragraph position) still remain. [Even though I
> had NOT any absolute paragraph positioning in the original web page.]

What html cleaner was that?
I use Absolute Html Compressor, and I'm investigating Linux/CygWin
based code cleaners.

> Additionally, most everything contained in the three lines of html
> that you provide are either invalid or deprecated html.
>
> Those three lines should read:
>
> <p></p>
>
> <p></p>
>
> <p></p>
>
> And no more. Anything in excess, is bloat caused by either Word, the
> html cleaner or FrontPage.

I totally agree.

> It's still my opinion that the most effective and most efficient
> method is for you to start from scratch with basic text and design
> your layout with CSS/html. The alternative for server side is
> PHP/MySQL.

I agree. See above for a CSS example. Read the html/css specs for more
info ;-)

> Web pages created by Word only make your already complicated
> situation, more complicated. In the long run, that lack of a
> solution today will provide you far more headaches in the future.

In other words: invest now to avoid hidden costs in the future.

> As far as Dreamweaver?
> Many folks provide that's a very useful software, while others
> provide that it compares in many ways to FP in creating invalid HTML.

That *seriously* depends on the version *and* some configuration
options. DW 2004 for example creates almost 100% compliant HTML out of
the box. But unfortunately a lot of people tinker with the
configuration to make it work like previous versions, or it's the
result of some really bad hand-hacking.

> However using DW as option (that was expressed) as an html cleaner
> may be an option to explore, although not worth the price of purchase.

If you already happen to have DreamWeaver (don't bother to buy it if
you don't have it) it is indeed a good html cleaner. You can tell it to
clean Word html, and it does a rather good job. In this case, it
totally deletes your 3 underlined centered empty paragraphs.

> Unless Bob Buckland or one of the others are able to provide
> additional insight? I don't see that many options for you that will
> provide what you desire to accomplish.

I have not tried Nvu yet, but it promises to be a DreamWeaver clone.
There is no harm trying it, because it is Free(libre) and therefore
free(gratis). www.nvu.com
 
Bob Buckland ?:-)

Hi Jack,

As you mentioned, you need to maintain the Word version of
the file. What your snippet shows is that there is a typed
space that has underlining applied to it and that space is
centered on the page - a guess would be that it was a visual
paragraph divider at one time, as the font size is set to 28pts,
about 3/8" in height.

The interaction between Word 2000 and Frontpage 2000 via the
clipboard could produce some unexpected results as FrontPage
2000 hadn't been fully updated to handle what Word 2000 was
putting on the clipboard as 'HTML' format.

Within a spare *copy* of the Word document itself, see what
happens if you use Edit=>Replace and there use the
[More] choice, then [Format]=>Font, and set it to
look for underlined text and 28 point font size.
In the 'Find what' box type a space then ^p,
and in 'Replace with', just to give you a visual check,
type XXXX, then do a Replace All.

If that 'finds' the text (i.e. replaces it with XXXX),
you can click the undo button on the Word toolbar, leave
the 'Replace with' box blank (replace with nothing), and
run Replace All again.

 
