MSWord -> Html

Guest

Hi,

I'm writing a Word-to-HTML converter, and I'm having some difficulty with the
algorithm that handles bold, italic, and underlined text.

Suppose a Word document contains a paragraph with some bold, underlined, and
italic formatting, for instance:

IM BOLD, IM UNDERLINED, IM ITALIC

I would like to convert it to:

<b>IM BOLD,</b> <u>IM UNDERLINED,</u><i>IM ITALIC</i>

It is easy for me to produce:
<b>IM</b><b>BOLD</b><u>IM</u><u>UNDERLINED</u>.....
but that way I end up with a lot of tags, and I would like to minimize
them.

If anybody knows how I could organize this algorithm (taking into account that
when a word is about to be parsed, I have to check which tags are already
open, etc.), I would be grateful.
 
Nicholas Paldino [.NET/C# MVP]

Josema,

You could reduce the bold tags, but you cannot reduce the underline tags in
the same way, because the space between the words would become underlined
too. This:

<u>I'm</u> <u>underlined</u>

is not the same as:

<u>I'm underlined</u>

That being said, you could look for words that are in bold, which are
separated by nothing but whitespace, and then wrap the bold tags around
that.
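
As a rough illustration of that idea, here is a minimal Python sketch. It assumes the converter has already split the paragraph into (text, is_bold) pairs; that representation and the name merge_bold_runs are made up for the example and are not part of the Word object model.

```python
def merge_bold_runs(runs):
    """Join bold runs that are separated by nothing but whitespace.

    runs is a list of (text, is_bold) tuples (a hypothetical format).
    """
    # Step 1: a whitespace-only gap sitting between two bold runs is
    # treated as bold itself, so it no longer splits them.
    bridged = []
    for i, (text, bold) in enumerate(runs):
        between_bold = (
            0 < i < len(runs) - 1 and runs[i - 1][1] and runs[i + 1][1]
        )
        if not bold and text.isspace() and between_bold:
            bold = True
        bridged.append((text, bold))

    # Step 2: merge neighbouring runs that share the same bold flag.
    merged = []
    for text, bold in bridged:
        if merged and merged[-1][1] == bold:
            merged[-1] = (merged[-1][0] + text, bold)
        else:
            merged.append((text, bold))
    return merged
```

Each merged bold run then needs only one pair of bold tags. The same bridging step should not be applied to underline, for the reason given above.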

I think since Office XP, you can save Word documents as HTML. Since you
are already accessing the Word object model to do this, why not just use
that facility instead?

Hope this helps.
 
Guest

Hi Nicholas, first of all, thanks for your fast and useful response.

I'm accessing the Office object model because when you use the Save as HTML
option in Office 2000, a lot of junk markup is generated around the content.

--
Thanks again.
Regards.
Josema


 
rossum



Crude fix: replace "</b> <b>" with " ".
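
A minimal sketch of the crude fix in Python, assuming the converter's output is available as a plain string. The helper name collapse_bold is made up for the example, and the pattern is slightly generalized to accept any amount of whitespace and either tag case.

```python
import re

# A closing bold tag followed only by whitespace and a new opening bold tag.
_ADJACENT_BOLD = re.compile(r"</b>(\s*)<b>", re.IGNORECASE)


def collapse_bold(html):
    """Drop the redundant </b>...<b> pair but keep the whitespace between."""
    return _ADJACENT_BOLD.sub(r"\1", html)
```

This only touches bold tags. Doing the same for underline would change the rendering, because the space between the words would become underlined.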

If you want to do it the proper way, then you are going to have a lot
of on/off switches. Some sort of bit array might be useful.
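
One possible shape for the switch-based approach, sketched in Python. It assumes the document has already been split into (text, style) runs, with spaces carrying their own style; the Style flags, the TAGS table, and runs_to_html are names invented for this sketch, not anything from the Word object model.

```python
import html
from enum import Flag, auto


class Style(Flag):
    NONE = 0
    BOLD = auto()
    ITALIC = auto()
    UNDERLINE = auto()


# One on/off switch per formatting tag.
TAGS = {Style.BOLD: "b", Style.ITALIC: "i", Style.UNDERLINE: "u"}


def runs_to_html(runs):
    """Emit HTML for (text, Style) runs, opening and closing a tag only
    when its switch actually changes between consecutive runs."""
    out = []
    open_stack = []  # flags of the currently open tags, outermost first
    for text, style in runs:
        # Find the outermost open tag that this run no longer wants.
        keep = len(open_stack)
        for idx, flag in enumerate(open_stack):
            if not (style & flag):
                keep = idx
                break
        # Close it and everything nested inside it, innermost first.
        while len(open_stack) > keep:
            out.append(f"</{TAGS[open_stack.pop()]}>")
        # Open whatever this run needs that is not open yet.
        for flag, tag in TAGS.items():
            if (style & flag) and flag not in open_stack:
                out.append(f"<{tag}>")
                open_stack.append(flag)
        out.append(html.escape(text))
    # Close anything still open at the end of the paragraph.
    while open_stack:
        out.append(f"</{TAGS[open_stack.pop()]}>")
    return "".join(out)
```

Keeping the open tags on a stack means the output is always properly nested, at the cost of occasionally closing and reopening a tag when styles overlap.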

rossum




The ultimate truth is that there is no ultimate truth
 
