Parsing an RTF file

  • Thread starter: james.dixon

james.dixon

Hi All

My current task is to go through an RTF file and extract text based on
its formatting, i.e. find and extract bold text as 'heading'; the text
between this and the next bold text is 'content'; repeat until the end of
the document.

Is there a way to do this? I have looked at the RTF source (very
complicated) and at RTFTextBox, and haven't worked out how to do it.

Does anybody have any hints on this one?

Thanks in advance.

James
 
Hi James,

Sounds like a job for the Regex class.
I have used it to parse RTF files into HTML, concatenate multiple RTF
files into a single RTF file, and translate multi-page RTF files for printing.
My approach was to split the RTF file into three parts:
. Header Info
. Font Table
. Text Body
Then parse the text body into an array of text and control words.
Because control words start with \, you can tell them apart from the text
and then re-emit the text in a different format (e.g. XHTML).
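
The "array of text and control words" step could look roughly like the sketch below. It is written in Python rather than C# or PCRE, and the names TOKEN_RE and tokenize are made up for illustration; it assumes a well-formed RTF body and ignores binary data (\bin) and Unicode escapes (\uN).

```python
import re

# One alternative per RTF token type. Order matters: the hex escape (\'hh)
# must be tried before the generic control symbol.
TOKEN_RE = re.compile(
    r"\\(?P<word>[A-Za-z]+)(?P<param>-?\d+)? ?"  # control word, optional number, optional delimiter space
    r"|\\'(?P<hex>[0-9A-Fa-f]{2})"               # \'hh: one byte in the document code page
    r"|\\(?P<symbol>[^A-Za-z])"                  # control symbol such as \~ or an escaped brace
    r"|(?P<brace>[{}])"                          # group open / close
    r"|(?P<text>[^\\{}]+)"                       # plain text run
)

def tokenize(rtf):
    """Split RTF source into (kind, value, param) tuples."""
    tokens = []
    for m in TOKEN_RE.finditer(rtf):
        if m.group("word") is not None:
            param = m.group("param")
            tokens.append(("control", m.group("word"),
                           int(param) if param is not None else None))
        elif m.group("hex") is not None:
            # Assumption: Windows-1252 code page, the common \ansi default.
            char = bytes.fromhex(m.group("hex")).decode("cp1252", "replace")
            tokens.append(("text", char, None))
        elif m.group("symbol") is not None:
            sym = m.group("symbol")
            # \\, \{ and \} are escaped literal characters, not formatting.
            kind = "text" if sym in "\\{}" else "symbol"
            tokens.append((kind, sym, None))
        elif m.group("brace") is not None:
            tokens.append(("group", m.group("brace"), None))
        else:
            # Raw line breaks in RTF source carry no meaning; drop them.
            text = m.group("text").replace("\r", "").replace("\n", "")
            if text:
                tokens.append(("text", text, None))
    return tokens

if __name__ == "__main__":
    for token in tokenize(r"{\rtf1\b Title\b0 body}"):
        print(token)
```

With the tokens in a flat list, re-emitting as another format is a matter of walking the list and mapping each control word (for example b / b0) to the target markup.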

I did this using pcre.dll (the Perl regex library) in another language (not C#).

Mark
 
Hi,

You need to read and understand the RTF format. IIRC it's a markup format;
if so, you will need to parse the file.
A regex may help you, but I think you may need something different,
like a parser (a la LEX).
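
To make the parser idea concrete, here is a small lex-style scanner, sketched in Python, that tracks the bold state across RTF groups and collects bold runs as headings and everything else as content. The names extract_sections and SKIPPED are hypothetical, and it assumes well-formed input, the Windows-1252 code page for \'hh escapes, and that only \b, \b0 and \plain change boldness (bold coming from a stylesheet is not handled).

```python
import re

WORD_RE = re.compile(r"([A-Za-z]+)(-?\d+)? ?")   # control word after the backslash
TEXT_RE = re.compile(r"[^\\{}]+")                # run of plain text
SKIPPED = {"fonttbl", "colortbl", "stylesheet", "info", "pict"}  # non-body groups

def extract_sections(rtf):
    """Return [(heading, content), ...] where headings are runs of bold text."""
    sections = []          # [heading, content] pairs in document order
    stack = []             # bold state saved at each "{"
    bold = False
    skip_depth = None      # depth of the group currently being skipped
    after_content = True   # True when the last emitted text was not bold

    def emit(text):
        nonlocal after_content
        if skip_depth is not None or not text:
            return
        if bold:
            if after_content:          # bold following non-bold: new heading
                sections.append(["", ""])
            sections[-1][0] += text
            after_content = False
        else:
            if not sections:           # text before the first heading
                sections.append(["", ""])
            sections[-1][1] += text
            after_content = True

    i, n = 0, len(rtf)
    while i < n:
        ch = rtf[i]
        if ch == "{":
            stack.append(bold)         # formatting is scoped to the group
            i += 1
        elif ch == "}":
            if stack:
                bold = stack.pop()
            if skip_depth is not None and len(stack) < skip_depth:
                skip_depth = None      # left the skipped group
            i += 1
        elif ch == "\\":
            i += 1
            if i >= n:
                break
            c = rtf[i]
            if c.isascii() and c.isalpha():
                m = WORD_RE.match(rtf, i)
                word, param = m.group(1), m.group(2)
                i = m.end()
                if word == "b":
                    bold = param != "0"
                elif word == "plain":  # resets character formatting
                    bold = False
                elif word in ("par", "line"):
                    emit("\n")
                elif word == "tab":
                    emit("\t")
                elif word in SKIPPED and skip_depth is None:
                    skip_depth = len(stack)
            elif c == "'":             # \'hh hex escape
                try:
                    emit(bytes.fromhex(rtf[i + 1:i + 3]).decode("cp1252", "replace"))
                except ValueError:
                    pass               # malformed escape: ignore it
                i += 3
            else:
                if c in "\\{}":        # escaped literal character
                    emit(c)
                elif c == "*" and skip_depth is None:
                    skip_depth = len(stack)   # \* marks an ignorable group
                i += 1
        else:
            m = TEXT_RE.match(rtf, i)
            emit(m.group().replace("\r", "").replace("\n", ""))
            i = m.end()
    return [(h.strip(), c.strip()) for h, c in sections]

if __name__ == "__main__":
    sample = (r"{\rtf1\ansi{\fonttbl{\f0 Arial;}}\b Intro\b0\par First body."
              r"\par{\b Details}\par Second body.}")
    for heading, content in extract_sections(sample):
        print(repr(heading), repr(content))
```

The stack is what a plain regex search for \b would miss: bold switched on inside a { } group ends when the group closes, even without a matching \b0.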

First, do a search on Google, as this is probably something that has been
asked before.
 