DOC -> TXT component

LtCommander

Hi all,

I am trying to extract text from an uploaded DOC file so that I can run
some regexes on the text in order to fill in some textboxes on the ASP
page. I know that Writely (www.writely.com) does this, but I think they
use C# or something special to preserve the formatting of the uploaded
DOC file.

My questions:

(a) Does anybody know of a way to extract text from DOC files? Any
component out there that's cheap?

(b) Is there any way to be able to preview an uploaded DOC file? Maybe
convert it into XML and use some default styles or something.

(c) If DOC->Text isn't possible, is DOC->RTF possible? Since RTF is
ASCII, something could still be done..

Thanks a lot for your time / any response.

Vince
 
Cowboy (Gregory A. Beamer)

LtCommander said:
Hi all,

I am trying to extract text from an uploaded DOC file so that I can run
some regexes on the text in order to fill in some textboxes on the ASP
page. I know that Writely (www.writely.com) does this, but I think they
use C# or something special to preserve the formatting of the uploaded
DOC file.

My questions:

(a) Does anybody know of a way to extract text from DOC files? Any
component out there that's cheap?

The cheapest option is the Office DLLs. It is also the heaviest on
performance, and there are potential licensing issues. There are libraries to
go from text to Word, but I am not certain about the other direction (I am
sure they are out there, but I do not know of any personally).

(b) Is there any way to be able to preview an uploaded DOC file? Maybe
convert it into XML and use some default styles or something.

I am sure there is a component (componentsource.com) or an open source
library (sourceforge.net is a good resource, as is codeplex.com).

(c) If DOC->Text isn't possible, is DOC->RTF possible? Since RTF is
ASCII, something could still be done..

It is pulling from DOC that is the issue. DOC > RTF is easy enough with the
Office libs, but you go back to the weight of having Office components on
your system.
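
To illustrate why RTF would be workable once you have it: RTF is plain ASCII markup, so a rough text extraction needs nothing more than a tokenizer. The following is a minimal sketch only, not a full RTF parser. The function name `rtf_to_text`, the list of skipped groups, and the cp1252 code page are assumptions made for illustration; `\u` Unicode escapes, tables, and fields are not handled.

```python
import re

# One token per match: hex escape, control word, control symbol, brace, or text.
TOKEN = re.compile(
    r"\\'([0-9a-fA-F]{2})"          # hex-escaped byte, e.g. \'e9
    r"|\\([a-zA-Z]+)(-?\d+)? ?"     # control word, optional number, optional space
    r"|\\(.)"                       # control symbol, e.g. \{ or \*
    r"|([{}])"                      # group delimiters
    r"|([^\\{}]+)",                 # plain text
    re.DOTALL,
)

# Groups whose content is metadata rather than document text (assumed list).
SKIP_DESTINATIONS = {"fonttbl", "colortbl", "stylesheet", "info", "pict"}


def rtf_to_text(rtf: str) -> str:
    out = []
    stack = []        # "skipping" state saved for each open group
    skipping = False
    for match in TOKEN.finditer(rtf):
        hexbyte, word, _arg, symbol, brace, text = match.groups()
        if brace == "{":
            stack.append(skipping)
        elif brace == "}":
            skipping = stack.pop() if stack else False
        elif skipping:
            continue
        elif hexbyte:
            out.append(bytes.fromhex(hexbyte).decode("cp1252", errors="replace"))
        elif word:
            if word in SKIP_DESTINATIONS:
                skipping = True
            elif word in ("par", "line"):
                out.append("\n")
            elif word == "tab":
                out.append("\t")
        elif symbol:
            if symbol == "*":          # \* marks an ignorable destination
                skipping = True
            elif symbol in "{}\\":     # escaped literal characters
                out.append(symbol)
        elif text:
            # Raw line breaks in RTF source are not part of the text.
            out.append(text.replace("\r", "").replace("\n", ""))
    return "".join(out)
```

The hard part remains getting from DOC to RTF in the first place; this only shows that the step after that is cheap.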

I would look at third party libraries. I know there are components like Word
Writer (not sure if it goes both ways, however). I would also look at the
open source community. You might still have to pay (depending on license),
but you should be able to work out a reasonable deal.

--
Gregory A. Beamer
MVP; MCP: +I, SE, SD, DBA
http://gregorybeamer.spaces.live.com

*************************************************
Think outside of the box!
*************************************************
 
LtCommander

Olaf said:
Hi,


Also, MS recommends against installing Office on a server. Worth reading in
this context:
http://support.microsoft.com/default.aspx?scid=kb;en-us;257757

Cheers,
Olaf

Thanks Greg and Olaf.
Having the Office libraries on the server is certainly the last resort.
I don't even think that our hosting provider will allow that. I will look
at componentsource.com to see if I can find something. If something else
suddenly strikes you, please let me know.
Thanks again for your help.

Vince
 
Olaf Rabbachin

Hi,
Having the Office libraries on the server is certainly the last resort.
I don't even think that our hosting provider will allow that. I will look
at componentsource.com to see if I can find something. If something else
suddenly strikes you, please let me know.

I know this might not really apply to your current problem, but just FYI:
You don't need Office installed on the server to actually *create*
Office files. There are a couple of ways to create files (at least starting
with Office 2003). For Word that would be e.g. via HTML (streamed as .doc to
the client), XML templates (a Word template saved as XML, with placeholders
that your code replaces), or XML/XSLT.
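
The XML-template approach can be sketched roughly as follows. This is only an illustration: the `{{name}}` placeholder syntax, the function name `fill_template`, and the template fragment are assumptions, not a Word convention. The one point that matters in practice is XML-escaping the values before inserting them.

```python
import re
from xml.sax.saxutils import escape

# Hypothetical placeholder syntax: {{field_name}}
PLACEHOLDER = re.compile(r"\{\{(\w+)\}\}")


def fill_template(template: str, values: dict) -> str:
    """Replace {{key}} placeholders with XML-escaped values.

    Placeholders without a matching key are left untouched.
    """
    def substitute(match):
        key = match.group(1)
        if key in values:
            return escape(str(values[key]))
        return match.group(0)

    return PLACEHOLDER.sub(substitute, template)


# Usage: a fragment of a Word template that was saved as XML.
fragment = "<w:t>{{name}}</w:t><w:t>{{company}}</w:t>"
print(fill_template(fragment, {"name": "Smith & Sons"}))
```

The filled-in XML can then be streamed to the client as a Word document without Office on the server.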

For the latter, check out this KB-article:
http://support.microsoft.com/Default.aspx?id=311461

However, if you actually need to *extract* text, then you'll probably be
forced to use some third-party component. Why is it that you get DOCs
uploaded (as opposed to plain text)? That is, could you change that to
"readable" text?

Cheers,
Olaf
 
LtCommander

Thanks Olaf. That's because we are doing resumes. The jobseekers would
normally upload DOC files as opposed to TXT or ASCII files. We want to
be able to fill in most of the online textboxes, like Name, Age and so
on, by extracting the text from the DOC file and running a regex on
it. So far, I have found some components that claim to do this but nothing
really concrete. I have to keep looking!
If you can think of something else, please let me know. I'll continue
my component research and see if I can find something useful. Many
charge over USD 900!!
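
For the regex step itself, here is a minimal sketch. It assumes the resume text has already been extracted to plain text and that the resume uses labelled lines such as "Name: ...". The labels, patterns, and the function name `extract_fields` are illustrative assumptions; real resumes vary a lot and would need more patterns.

```python
import re

# Hypothetical patterns for labelled lines; each captures the value in group 1.
FIELD_PATTERNS = {
    "name": re.compile(r"^\s*Name\s*[:\-]\s*(.+?)\s*$", re.I | re.M),
    "age": re.compile(r"^\s*Age\s*[:\-]\s*(\d{1,3})\b", re.I | re.M),
    "dob": re.compile(
        r"^\s*(?:DOB|Date of Birth)\s*[:\-]\s*"
        r"(\d{1,2}[/.\-]\d{1,2}[/.\-]\d{2,4})",
        re.I | re.M,
    ),
}


def extract_fields(text: str) -> dict:
    """Return the fields that could be found; missing ones are omitted."""
    fields = {}
    for key, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        if match:
            fields[key] = match.group(1).strip()
    return fields


# Usage: pre-fill only the textboxes for which a value was found.
resume_text = "Curriculum Vitae\nName: Jane Doe\nAge: 30\nDate of Birth: 12/03/1980\n"
print(extract_fields(resume_text))
```

Anything the patterns miss simply stays blank for the jobseeker to fill in by hand.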

Cheers!
Vince
 
Olaf Rabbachin

Hi,
If you can think of something else, please let me know.

Sorry, I guess I don't have much to offer on extracting ...

I'll continue my component research and see if I can find something
useful. Many charge over USD 900!!

I'd appreciate it if you posted here once you have found your solution.

Cheers,
Olaf
 
LtCommander

Sure. Thanks for your help and I'll post my findings here if they are
useful.

Cheers!
 
LtCommander

Hi Olaf,
just wanted to tell you that the component works like a charm. Now, the
mission is to find a cheaper one!!
 
Olaf Rabbachin

Hi,
just wanted to tell you that the component works like a charm. Now, the
mission is to find a cheaper one!!

The question is what you want to do with the component. If you plan on
integrating it into an application that you sell, you'd have to buy the
$1200 license. If all the tool does is extract the readable text portion
of a .doc (without e.g. tables or field names), then I guess you could
dig into the format and work that out yourself. I'm also wondering
what such a tool would do with the hidden stuff. For example, create a new
DOC, type in some text and save it. Then overwrite part or all of that text
and save again. When you open that document with a text editor, you'll find
that the original text (from the first save) is still there unless you
removed the "hidden data" (check out
www.microsoft.com/downloads/details.aspx?familyid=144E54ED-D43E-42CA-BC7B-5446D34E5360&displaylang=en).
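
To see that effect without a text editor, a crude scan for readable runs in the raw bytes is enough. This is a heuristic in the spirit of the Unix `strings` tool, not a parser for the Word binary format. The function name `readable_runs` and the minimum run length of 4 are assumptions. It will return overwritten "hidden" text just as readily as the current text, and it will also pick up font names and other metadata.

```python
import re

# Runs of at least 4 printable ASCII characters, stored as single bytes ...
ASCII_RUN = re.compile(rb"[\x20-\x7e]{4,}")
# ... or as UTF-16LE (each printable character followed by a zero byte).
UTF16_RUN = re.compile(rb"(?:[\x20-\x7e]\x00){4,}")


def readable_runs(data: bytes) -> list:
    """Return printable text runs found in raw file bytes.

    Single-byte runs come first, then UTF-16LE runs; the order does not
    reflect the order of the text in the document.
    """
    runs = [m.group().decode("ascii") for m in ASCII_RUN.finditer(data)]
    runs += [m.group().decode("utf-16-le") for m in UTF16_RUN.finditer(data)]
    return runs
```

For a usable extractor you would still have to work out which runs belong to the current document text, which is exactly the "dig into the format" part.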

Cheers,
Olaf
 
LtCommander

thanks olaf. you know, resumes are pretty standard stuff, maybe tables
here and there but certainly no hidden text or comments. what we want
to do is to:

a) get the job seeker to upload his resume and fill in most of the
usual boxes (name, age, sex, dob..) ourselves. so, the incentive to
upload the resume includes filling in an online profile for the seeker
automatically

i did not realize that we would have to buy the 1200 thing. that's
crazy! thanks! i am sure we won't be using the component now!

thanks for the article olaf. hidden text / comments aren't going to be
that big a problem compared to the regex stuff we would need to extract
various fields like name and so on from the text file. at 1200, i
think this feature of ours has to be aborted.
 
Olaf Rabbachin

Hi,
thanks olaf. you know, resumes are pretty standard stuff, maybe tables
here and there but certainly no hidden text or comments. what we want
to do is to:

a) get the job seeker to upload his resume and fill in most of the
usual boxes (name, age, sex, dob..) ourselves. so, the incentive to
upload the resume includes filling in an online profile for the seeker
automatically

If that means that the files being uploaded to your server will be DOCs
that were downloaded from the same server before, meaning that you created
those DOCs, then create them as XML! In that case, you receive an "XML'd"
DOC which you can parse without any other component.
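
Parsing such a file needs only a standard XML parser. The sketch below assumes the upload is a Word 2003 XML document (root element `w:wordDocument`, paragraphs in `w:p`, text in `w:t`); the function name `paragraphs_from_wordml` is made up for illustration. It does not apply to binary .doc files.

```python
import xml.etree.ElementTree as ET

# Namespace used by Word 2003 XML (WordprocessingML) documents.
W_NS = "http://schemas.microsoft.com/office/word/2003/wordml"


def paragraphs_from_wordml(xml_text: str) -> list:
    """Return the text of each non-empty paragraph in document order."""
    root = ET.fromstring(xml_text)
    paragraphs = []
    for para in root.iter("{%s}p" % W_NS):
        # A paragraph's text may be split across several runs (w:r/w:t).
        text = "".join(t.text or "" for t in para.iter("{%s}t" % W_NS))
        if text:
            paragraphs.append(text)
    return paragraphs


# Usage with a tiny hand-written document.
sample = (
    '<w:wordDocument xmlns:w="%s"><w:body>'
    "<w:p><w:r><w:t>Name: </w:t></w:r><w:r><w:t>Jane Doe</w:t></w:r></w:p>"
    "<w:p><w:r><w:t>Age: 30</w:t></w:r></w:p>"
    "</w:body></w:wordDocument>" % W_NS
)
print(paragraphs_from_wordml(sample))
```
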
i did not realize that we would have to buy the 1200 thing. that's
crazy! thanks! i am sure we won't be using the component now!

That license would only make sense if you need to deploy the component on
more than 4 servers, as the per-server license is $250.

Cheers,
Olaf
 
LtCommander

We would create our resume (the blanks that the user fills in on our
site) as XML but for a first time user, we don't want to bother them
with filling all the stuff in. So, an incentive for them to upload
their DOC would be that the most common textboxes would be filled in
automatically. So...
I have been informed of some 50 buck thing that's supposed to do the
same thing. Have to test that next!

cheers
 
