Text parser

Big D · Mar 2, 2004

Hi all,

I'm working on a little app that will go through a text file (right now a
"rich text" document), and parse it into a pseudo-HTML that our Flash
programmers can use in their presentation.

I'm having a lot of trouble, because the RTF format is quite complicated...
at first it seemed that there was no "nesting" of formatting, but
every once in a while it seems like there is. Also, depending on the
complexity of the original document, we may end up with lots of
undecipherable syntax. In other words it's not as simple as:

{\b this is bold text}{\b\i this is bold and italicised text}

because every once in a while you'll have something like:

{\b bold text}\d\adsfaa\adsagd\aeaqwewe\a\\\\asdf\\{\b\i this is bold and
italicised text}{{/das/dd /d More text} /d/as///jh/}

So there's no way to easily break content into just the {\format Text}
definitions.
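
A note on why the nesting shows up: in RTF, braces delimit groups, and
formatting set inside a group ends when that group closes, so a parser
generally needs a stack of saved states rather than a flat search for
{\format Text} pairs. Below is a minimal sketch of that idea, written in
Python only to keep it short and runnable (the same structure should port to
VB.NET). The function name, the SKIP list of destinations, and the choice of
<b>, <i>, <u> and <br> as the pseudo-HTML are assumptions for illustration,
and it ignores most of the spec (Unicode escapes, tables, fields and so on).

```python
import html
import re

TOKEN = re.compile(
    r"\\'[0-9a-fA-F]{2}"            # hex escape: consumed and dropped here
    r"|([{}])"                      # group open / close
    r"|\\([a-zA-Z]+)(-?\d+)? ?"     # control word, optional number, one space
    r"|\\(.)"                       # control symbol such as \* or \{
    r"|([^{}\\]+)",                 # run of plain text
    re.S,
)

# Control words for the three styles of interest -> pseudo-HTML tag (assumed).
STYLES = {"b": "b", "i": "i", "ul": "u"}
# Destinations whose contents are not document text (partial, assumed list).
SKIP = {"fonttbl", "colortbl", "stylesheet", "info"}


def rtf_to_pseudo_html(rtf):
    runs = []             # [style_set, text] pairs, merged when styles match
    state = frozenset()   # styles active in the current group
    stack = []            # saved (state, skipping) for each open group
    skipping = False      # True while inside an ignored destination

    def emit(s):
        if skipping:
            return
        if runs and runs[-1][0] == state:
            runs[-1][1] += s
        else:
            runs.append([state, s])

    for brace, word, param, symbol, text in TOKEN.findall(rtf):
        if brace == "{":
            stack.append((state, skipping))       # remember the outer state
        elif brace == "}":
            if stack:
                state, skipping = stack.pop()     # formatting ends with group
        elif word in SKIP:
            skipping = True
        elif word in STYLES:
            tag = STYLES[word]
            # "\b" turns the style on, "\b0" turns it off
            state = state - {tag} if param == "0" else state | {tag}
        elif word == "ulnone":
            state = state - {"u"}
        elif word == "plain":
            state = frozenset()                   # reset character formatting
        elif word in ("par", "line"):
            emit("<br>")
        elif symbol == "*":
            skipping = True                       # \* marks ignorable groups
        elif symbol and symbol in "{}\\":
            emit(html.escape(symbol, quote=False))  # escaped literal character
        elif text:
            cleaned = text.replace("\r", "").replace("\n", "")
            if cleaned:
                emit(html.escape(cleaned, quote=False))

    parts = []
    for styles, s in runs:
        tags = [t for t in ("b", "i", "u") if t in styles]
        parts.append("".join(f"<{t}>" for t in tags) + s
                     + "".join(f"</{t}>" for t in reversed(tags)))
    return "".join(parts)
```

Unknown control words are simply ignored, which is what lets the "junk"
between groups fall away; whether that is safe for every document Word
produces is not something this sketch establishes.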

It all means something, I'm sure, but rather than try and re-work the whole
spec for RTF -> my format, I was hoping that there was a simpler format
that the text could be saved as before parsing. The originals are Word
documents. The target pre-parsing format simply needs to include line
breaks, bolding, italicising, and underlining. All other formatting can go
out the window.

There are commercial components that handle RTF -> HTML, but that's not
really what I need, and I would have to re-parse it all anyway.

Is there a format that does this? Or does anyone have any good ideas?

Thanks for any input,

MCD

John Eikanger [MSFT] · Mar 3, 2004

You could probably do this with a custom clipboard format, but I expect
that would be as much work as you are already facing.

You will probably have better luck posting this in one of the SDK groups,
such as microsoft.public.win32.programmer.ui or maybe
microsoft.public.platformsdk.shell. The folks there may be more familiar
with this than straight dotnet programmers.

Thank you for choosing the MSDN Managed Newsgroups,

John Eikanger
Microsoft Developer Support

This posting is provided “AS IS” with no warranties, and confers no rights.
--------------------
| From: "Big D" <[email protected]>
| Subject: Text parser
| Date: Tue, 2 Mar 2004 14:44:24 -0700
| X-Tomcat-NG: microsoft.public.dotnet.languages.vb

Erik Frey · Mar 3, 2004

This is a kludge, but I thought I'd throw it out there if you are strapped
for ideas. The RichTextBox control has two members that may be of use to
you, once you load your RTF into the control:

RichTextBox.Select(int start, int length)

and the property

RichTextBox.SelectionFont

Between those two, you could programmatically figure out what has been made
bold, underlined, or italicised. You could also parse line breaks.

It may not be an efficient algorithm, but it will be easy to write.
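
To make the second half of that concrete: scanning one character at a time
(SelectionFont is documented to return null when a selection mixes fonts,
hence Select(i, 1)) gives one style reading per character, and those readings
then have to be collapsed into tagged runs. Here is a rough sketch of the
collapsing step, in Python so it runs outside WinForms. The input tuples are
a hypothetical stand-in for what you would read from SelectionFont.Bold,
.Italic and .Underline, and the function name and tag choices are
assumptions, not part of any library.

```python
from itertools import groupby

TAGS = ("b", "i", "u")   # pseudo-HTML tags for bold, italic, underline


def runs_to_pseudo_html(chars):
    """chars: iterable of (character, bold, italic, underline) tuples."""
    parts = []
    # Group consecutive characters that share the same three style flags.
    for flags, group in groupby(chars, key=lambda c: c[1:]):
        text = "".join(c[0] for c in group).replace("\n", "<br>")
        tags = [t for t, on in zip(TAGS, flags) if on]
        parts.append("".join(f"<{t}>" for t in tags) + text
                     + "".join(f"</{t}>" for t in reversed(tags)))
    return "".join(parts)


# Hypothetical readings, as if taken from the control one character at a time.
sample = ([(c, True, False, False) for c in "bold"]
          + [(c, False, False, False) for c in " plain\n"]
          + [(c, True, True, False) for c in "both"])
print(runs_to_pseudo_html(sample))  # <b>bold</b> plain<br><b><i>both</i></b>
```

The per-character Select calls are where the cost would be; the grouping
itself is a single pass.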

Erik
