File size and HTML

Rob van Albada

Hi,

I am using Word 2007 to edit a rather large bilingual dictionary.
When I strip all superfluous HTML tags, the size is around 6 MB.
The file produced by Word used to be around 1 MB larger, about 7 MB.
I use a DOS32 program to strip the file of its superfluous tags for
advanced processing.
However, lately, the file size has increased enormously.
Under Word 2007 (I previously used Word 2000) the file size has increased
from 6 MB to approx. 15.9 MB.
For instance, the header now contains a list of all available fonts
(several hundred, while I use only two: Times New Roman and Symbol).
Also, every two or three words the file contains totally superfluous
information about the font, language and font size.
How can I bring the file size back to something more normal?
Word slows down considerably with a file of this size.

Thanks for your help,

Rob in Amsterdam.
 
Bob Buckland ?:-)

Hi Rob,

Word 2007's new features (language-neutral architecture, quick style sets, font pairs in themes...) can put quite a bit of
information into a Word web document so that it can be restored to a .doc or .docx/.docm file type from a web page.

If you use Office Button=>Save As=>Other File Types=>Web Page-Filtered
you may see quite a bit of that removed.

What is the DOS utility you're using to filter the HTML output?

--

Bob Buckland ?:)
MS Office System Products MVP

*Courtesy is not expensive and can pay big dividends*
 
Rob van Albada

Hi Bob,

Thanks. I followed your advice and got a file which is 11.975.311
bytes in size, i.e. around 4 MB smaller than the one I had but not
nearly as small as the file made by Word-2000.

Here is a sample of the code I get:

Fragment of the header:


<!--
/* Font Definitions */
@font-face
{font-family:Helvetica;
panose-1:2 11 5 4 2 2 2 2 2 4;}
@font-face
{font-family:Courier;
panose-1:2 7 4 9 2 2 5 2 4 4;}
@font-face
{font-family:"Tms Rmn";
panose-1:2 2 6 3 4 5 5 2 3 4;}
@font-face
{font-family:Helv;
panose-1:2 11 6 4 2 2 2 3 2 4;}
@font-face
{font-family:"New York";
panose-1:2 4 5 3 6 5 6 2 3 4;}
@font-face
{font-family:System;
panose-1:0 0 0 0 0 0 0 0 0 0;}
@font-face
{font-family:Wingdings;
panose-1:5 0 0 0 0 0 0 0 0 0;}
@font-face
{font-family:"MS Mincho";
panose-1:2 2 6 9 4 2 5 8 3 4;}
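
A header like the fragment above could, in principle, be trimmed by dropping every `@font-face` rule whose family is not actually used. The following is only a minimal Python sketch under stated assumptions: the rules look like the ones shown (no nested braces), and the only fonts worth keeping are Times New Roman and Symbol. The function name `drop_unused_font_faces` and its `keep` parameter are hypothetical; they are not part of Word or of STRIPHTM.EXE.

```python
import re

# One "@font-face {...}" rule as it appears in the fragment above.
# Assumption: the rule body contains no nested braces.
FONT_FACE = re.compile(r"@font-face\s*\{[^}]*\}\s*", re.IGNORECASE)

# Captures the family name, with or without surrounding double quotes.
FAMILY = re.compile(r"font-family\s*:\s*\"?([^\";}]+)\"?", re.IGNORECASE)


def drop_unused_font_faces(css, keep=("Times New Roman", "Symbol")):
    """Return css with every @font-face rule removed unless its family is in keep."""
    keep_lower = {name.lower() for name in keep}

    def replace(match):
        family = FAMILY.search(match.group(0))
        if family and family.group(1).strip().lower() in keep_lower:
            return match.group(0)  # a font that is really used: leave the rule alone
        return ""                  # any other font: drop the whole rule

    return FONT_FACE.sub(replace, css)
```

Only the `@font-face` rules are touched; everything else in the style block is passed through unchanged.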


Fragment of the body of the file:


<p class=MsoNormal><b><span lang=PT-BR
style='font-size:9.0pt'>acak</span></b><span
lang=PT-BR style='font-size:9.0pt'>-<b>acakan</b> II bi: ongeordend,
verward,
wanordelijk, rommelig Tw {<i>Sapa wani kandha yèn aku nyambutgawé
acak-acakan?</i>
Tr253}·</span></p>

<p class=MsoNormal><b><span lang=PT-BR
style='font-size:9.0pt'>acak</span></b><span
lang=PT-BR style='font-size:9.0pt'>-<b>acak</b> III Gun: meevragen,
vragen om
mee te komen {<i>Lha mbah Nan ki yahéné wis acak-acak ki pité
jawané...</i>
Ros3}; zo <i>ajak</i>·</span></p>

<p class=MsoNormal><b><span lang=PT-BR
style='font-size:9.0pt'>acala</span></b><span
lang=PT-BR style='font-size:9.0pt'> bt: berg·</span></p>

<p class=MsoNormal><i><span lang=PT-BR
style='font-size:9.0pt'>ora</span></i><span
lang=PT-BR style='font-size:9.0pt'>, <i>durung</i> <b>acan</b> gw:
helemaal
(nog) niet·</span></p>

As you see, the font definitions take space, but most space is used by
tags in the body of the text. The whole text is 9pt Times, with a few
arrows which are from the Symbol font strewn in between (also 9 pt.)
The language setting also does not change anywhere in the file. (It is
only relevant to the key code setting, I suppose.) So there is no need
at all to repeat it every few words.
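
To illustrate the point about the body markup, here is a minimal Python sketch that unwraps `<span>` elements carrying nothing but a `lang` attribute and the default `font-size:9.0pt` style, as in the sample above. It assumes the whole document really is 9 pt in a single language, so those attributes carry no information. The function name `strip_default_spans` is hypothetical, and this is not how STRIPHTM.EXE works.

```python
import re

# Any opening or closing span tag; the attribute text is captured for opening tags.
SPAN_TAG = re.compile(r"<span\b([^>]*)>|</span\s*>", re.IGNORECASE)

# Attributes treated as redundant (assumption: one language, 9 pt throughout).
DEFAULT_ATTRS = re.compile(
    r"\s+lang=[\w-]+|\s+style='font-size:9\.0pt'", re.IGNORECASE
)


def strip_default_spans(html):
    """Unwrap spans whose only attributes are the default lang and font size."""
    out = []
    dropped = []  # one flag per open span: True if its opening tag was removed
    pos = 0
    for m in SPAN_TAG.finditer(html):
        out.append(html[pos:m.start()])  # copy the text between tags verbatim
        pos = m.end()
        if m.group(0).startswith("</"):
            # Emit the closing tag only if the matching opening tag was kept.
            if not (dropped.pop() if dropped else False):
                out.append(m.group(0))
        else:
            rest = DEFAULT_ATTRS.sub("", m.group(1))
            if rest.strip() == "":
                dropped.append(True)             # nothing useful left: unwrap
            else:
                dropped.append(False)
                out.append("<span" + rest + ">")  # keep the non-default attributes
    out.append(html[pos:])
    return "".join(out)
```

A stack of flags is used instead of a single regular expression so that a kept span (for example one that switches to the Symbol font) nested inside a dropped one still gets its own closing tag.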

The program I use to remove superfluous HTML code is STRIPHTM.EXE,
which I wrote in Stonybrook Modula-2.
If you wish I can mail you a copy.

Kind regards,

Rob, Amsterdam.
 
