Question re XPS format

Alan · May 8, 2009

I hope this is a good place to post this; I could not find a
better-suited group.

I need to be able to extract some text from a particular XPS file that
is updated periodically but has a standard format. I have read up on
the XPS file format and understand its structure.

Here's my problem: The UnicodeString in the file has a bunch of words
stuck together. For example, the string:

"1 Here is a string of words strung together. I need to separate them
to extract them...."

would be represented something like this:

<FixedPage . . .
<Glyphs Fill="#ff000000"
FontUri="/Documents/1/Resources/Fonts/C33C1892-4299-487A-9A63-97230919AAA4.odttf"
FontRenderingEmSize="10.5596" StyleSimulations="None" OriginX="38.08"
OriginY="229.12"
Indices="3,27;25,331;39;40;36;39,73;3,27;50,77;53,73;3,27;36;47;44;57;40,855;49;36,250;47,447;20;21,57;20;3;11;21;12,164;23;21;24,57;3,27;11;26;12,361;25;19,57;3,27;11;23;12,200;21;3;11,34;23;12,1080;22,57;28;3,27;11,34;28;12,380;22;22,319;19;3;11;20;12,258;41;3,27;22;16,34;23,377;16,341;20;24,87;11;26;12,214;23;18,27;20,382;28;18;20"
UnicodeString="1Hereisastringofwordsstrungtogether.Ineedtoseparatethemtoextractthem...." /. . .
</FixedPage>

In the above, the Indices shown do not go with the string, but I just
wanted to give you the general idea.

I know the second field of each Indices entry is the AdvanceWidth, but
I have not found an easy (or any) way to determine where to put spaces
in the output string.

Can anyone shed some light on this or point me to a good source of
information?

Thanks, Alan

Andrew McLaren · May 8, 2009

Hi Alan,

The answer will depend on your scenario. What are you using to read the XPS
file? In other words, are you manipulating it programmatically with .NET
code, such as C#? Or are you opening the unzipped XPS in a text editor (for
example) and trying to read the contents?

It is permissible for the UnicodeString property of the Glyphs object to
contain spaces at word breaks. However, this is not strictly required,
because UnicodeString is just the list of Unicode chars. The spacing,
kerning, etc. of those chars can be controlled by the Indices property of the
Glyphs object (not all word breaks in all docs will be formed by space chars).
I'd guess that whatever app produced the XPS output you're working with decided
not to bother inserting raw 0x0020 chars in the UnicodeString, preferring to
control spacing via the graphical representation encoded in the Indices
property.
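
To make the Indices layout a bit more concrete, here is a minimal Java sketch
that splits the attribute into per-glyph entries. It assumes each entry has
the form [(cluster map)]GlyphIndex[,AdvanceWidth[,uOffset[,vOffset]]], with
entries separated by semicolons. The class and method names are made up for
illustration, and the sketch ignores cluster maps and offsets.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: split an XPS Glyphs "Indices" attribute into per-glyph entries. */
final class XpsIndices {

    /** One Indices entry; a null field means "not specified in the markup". */
    static final class Entry {
        final Integer glyphIndex;   // index into the font's glyph table
        final Double advanceWidth;  // explicit advance, if one was given

        Entry(Integer glyphIndex, Double advanceWidth) {
            this.glyphIndex = glyphIndex;
            this.advanceWidth = advanceWidth;
        }
    }

    static List<Entry> parse(String indices) {
        List<Entry> out = new ArrayList<>();
        // A limit of -1 keeps empty entries, which stand for "use defaults".
        for (String raw : indices.split(";", -1)) {
            String item = raw.trim();
            // Drop an optional "(n:m)" cluster-map prefix (ignored in this sketch).
            if (item.startsWith("(")) {
                item = item.substring(item.indexOf(')') + 1);
            }
            String[] f = item.split(",", -1);
            String glyphField = f[0].trim();
            String advanceField = f.length > 1 ? f[1].trim() : "";
            Integer glyph = glyphField.isEmpty() ? null : Integer.valueOf(glyphField);
            Double advance = advanceField.isEmpty() ? null : Double.valueOf(advanceField);
            out.add(new Entry(glyph, advance));
        }
        return out;
    }

    public static void main(String[] args) {
        // The first few entries from the sample markup in the question.
        for (Entry e : parse("3,27;25,331;39;40")) {
            System.out.println(e.glyphIndex + " -> " + e.advanceWidth);
        }
    }
}
```

Entries with no second field come back with a null advance, meaning the
renderer falls back on the font's own metrics for that glyph.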

However, if you use the right combination of .NET APIs, you can convert the
Glyphs object back into text; see for example,
http://www.microsoft.com/whdc/xps/xps-read.mspx.

But, .NET APIs might not help if you're looking for, say, a scripting or
interactive solution. So like I said, we may need to know more about what
you have planned.

XPS is oriented more towards final-form presentation of documents than
active processing of text data. If you can intercept the data at an
earlier stage of production, such as an XML or plain text form, that might
be easier to deal with than final XPS. I think you could certainly find a
solution for reading the XPS data; but without some .NET programming (i.e.,
taking advantage of the smarts already built in to the .NET Framework) you'd
have some pretty complex work.

(By way of aside: the .NET world is just catching up with IBM mainframes,
which have made a clear distinction between FFT ("Final form text") and RFT
("Revisable form text") documents for several decades now. In the Microsoft
world pre-XPS, nearly all documents were in a revisable format, such as MS
Word *.DOC files.)

XPS generally falls into the realm of WPF (Windows Presentation Foundation)
programming, so maybe any WPF forums you can find would have ideas too.

Hope it helps a bit,
Andrew

Alan · May 8, 2009

Andrew McLaren said:
What are you using to read the XPS
file? In other words, are you manipulating it programmatically with .NET
code, such as C#?

I want to manipulate it programmatically using Java. Thanks, Alan

Andrew McLaren · May 9, 2009

Alan said:
I want to manipulate it programmatically using Java. Thanks, Alan

I suspect that's going to be quite difficult. Not impossible - it's only
software, right? - but in Java you would need to recreate a lot of complex
rendering logic, which already exists as ready-to-call APIs in .NET.

If you are lucky, even a simple .NET ToString() method on the Glyph object
will give you a readable text rendering.

I'd recommend sitting back and taking a second look at the whole problem:
are you using the right tools for the job? Is this the best way to extract
the data you want? etc. If you can intercept the data before it is committed
to XPS - say, as an XML document - then you can use any XML parser you like,
and Java might be a good choice. Once the data is encoded as an XPS
document, it's pretty much a creature of the .NET world.
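
That said, if XPS is all you can get, the raw strings are still reachable
from Java, since the package is a ZIP archive and the page parts are XML. A
minimal sketch, assuming the page markup lives in parts whose names end in
".fpage" and that the package is not written in interleaved pieces (the
class and method names are made up for illustration):

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Sketch: collect every UnicodeString value from the pages of an XPS package. */
final class XpsText {

    static List<String> unicodeStrings(InputStream xps) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        DocumentBuilder builder = factory.newDocumentBuilder();

        List<String> result = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(xps)) {
            for (ZipEntry e = zip.getNextEntry(); e != null; e = zip.getNextEntry()) {
                // Assumption: fixed-page markup is stored in "*.fpage" parts.
                if (!e.getName().toLowerCase(Locale.ROOT).endsWith(".fpage")) {
                    continue;
                }
                // Copy the part first so the XML parser cannot close the ZIP stream.
                byte[] xml = zip.readAllBytes();
                NodeList glyphs = builder.parse(new ByteArrayInputStream(xml))
                        .getElementsByTagNameNS("*", "Glyphs");
                for (int i = 0; i < glyphs.getLength(); i++) {
                    Element g = (Element) glyphs.item(i);
                    if (g.hasAttribute("UnicodeString")) {
                        result.add(g.getAttribute("UnicodeString"));
                    }
                }
            }
        }
        return result;
    }
}
```

Calling XpsText.unicodeStrings(new FileInputStream("some.xps")) returns the
strings exactly as the producer wrote them, so any missing spaces are still
missing. It needs Java 9 or later for readAllBytes().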

Good luck,
Andrew

Alan · May 9, 2009

Andrew,

I was probably unclear. All I want is to extract some text
strings, not render them. However, the particular application that
creates the XPS files (not under my control) does not put whitespace
in the UnicodeString. This is allowable, and the Indices attribute
provides an AdvanceWidth for each glyph that creates the gaps instead.

However, I have not found documentation on how to determine whether or
not an AdvanceWidth is large enough to constitute whitespace.

Thanks, Alan

Andrew McLaren · May 9, 2009

Alan said:
However, I have not found documentation on how to determine whether or
not an AdvanceWidth is large enough to constitute whitespace.

Ah, right ... gotcha. Okay.

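I'm not sure what the best approach is, there. As far as I know, an explicit
AdvanceWidth is measured in hundredths of the font em size, and when it is
omitted the default comes from the font's own horizontal metrics (the hmtx
table); but there could be many nuances.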
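
One way to approach it, as a rough sketch only: compare each explicit
AdvanceWidth with the advance the font would normally give that glyph, and
treat the glyph as ending a word when the difference passes some threshold.
Everything specific below is an assumption: the "nominal" map (normal
advances per glyph index, in hundredths of an em, which you would have to
derive from the font's metrics yourself), the threshold, the glyph indices
in the example, and the idea that each char in UnicodeString maps to exactly
one Indices entry with no cluster maps.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch: guess word breaks from unusually large explicit advance widths. */
final class GapSpacer {

    /**
     * @param text      the UnicodeString value, assumed one char per glyph
     * @param indices   the Indices value, assumed to contain no cluster maps
     * @param nominal   normal advance per glyph index (hypothetical input)
     * @param threshold extra advance that counts as a word gap
     */
    static String insertSpaces(String text, String indices,
                               Map<Integer, Double> nominal, double threshold) {
        String[] entries = indices.split(";", -1);
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            out.append(text.charAt(i));
            boolean lastChar = (i == text.length() - 1);
            if (lastChar || i >= entries.length) {
                continue; // nothing follows, or no entry for this char
            }
            String[] f = entries[i].trim().split(",", -1);
            if (f.length < 2 || f[0].trim().isEmpty() || f[1].trim().isEmpty()) {
                continue; // no explicit advance, so the font default applies
            }
            Double base = nominal.get(Integer.valueOf(f[0].trim()));
            if (base == null) {
                continue; // normal advance unknown for this glyph
            }
            double extra = Double.parseDouble(f[1].trim()) - base;
            if (extra >= threshold) {
                out.append(' ');
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Map<Integer, Double> nominal = new HashMap<>();
        nominal.put(72, 55.0); // made-up normal advance for a made-up glyph index
        // prints "Here is"
        System.out.println(insertSpaces("Hereis", "43;72;85;72,80;76;86", nominal, 20.0));
    }
}
```

The threshold would need tuning against real output from the producing
application, since kerning and justification also change advance widths.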

Possibly a better forum to try would be on MSDN:

http://social.msdn.microsoft.com/Forums/en-US/windowsxps/threads

(not sure if you need to sign-in, to post there; I think so. MSDN IDs are
free and instant).

I get that you're interested in the text data, not in rendering per se. But
XPS is pretty highly oriented towards the *display* of documents. It doesn't
really seem to preserve data independent from how the data will appear on a
device (screen, printer, whatever). I dunno if that's a good thing or not
....

Cheers
Andrew

Alan · May 9, 2009

Andrew McLaren said:
Possibly a better forum to try would be on MSDN:

http://social.msdn.microsoft.com/Forums/en-US/windowsxps/threads

Andrew,
Thanks for the reference to the other forum. Alan
