Scanned Doc to OCR to Indexed SQL DB

Ryan Ternier

Hello,

We're about to start a new web application in which users scan printed
documents and upload them to their website.
When these docs hit the website, they will be run through an online OCR
utility to get the raw text from them.
From that point, they will be stripped down and shoved (maybe even stomped)
into a SQL DB.

I've never attempted this before, and I'm looking for some advice on the
DB section.
Once a document is uploaded and stripped into its text form, what is the
best way to index it so users can search it? We could just put the
whole thing into a large text field, but indexing seems faster.

If anyone could shed light, or point me in a direction it'd be appreciated.

Thanks.

Ryan Ternier
Code Monkey
 
Mary Chipman

I'd suggest looking into SQL Server's full-text search capabilities.
There's quite decent documentation in SQL Books Online.

--Mary
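
To make the "large text field versus index" trade-off concrete, here is a rough sketch of the word-level (inverted) index idea that full-text search engines are built around. It is not SQL Server full-text search: it uses Python's built-in sqlite3 module as a stand-in so it runs anywhere, and the table names (documents, postings) and functions (build_index, search) are made up for illustration.

```python
import re
import sqlite3

# Hypothetical tokenizer: lowercase runs of letters and digits.
# Real OCR output would likely need more cleanup than this.
def tokenize(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def build_index(docs):
    """Store each document's OCR text plus one posting row per distinct word.

    docs maps a document id to its extracted text.
    Returns an in-memory SQLite connection (a stand-in for the real DB).
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE documents (id INTEGER PRIMARY KEY, body TEXT NOT NULL)"
    )
    # The (term, doc_id) primary key doubles as the lookup index.
    conn.execute(
        "CREATE TABLE postings ("
        " term TEXT NOT NULL,"
        " doc_id INTEGER NOT NULL REFERENCES documents(id),"
        " PRIMARY KEY (term, doc_id))"
    )
    for doc_id, body in docs.items():
        conn.execute(
            "INSERT INTO documents (id, body) VALUES (?, ?)", (doc_id, body)
        )
        conn.executemany(
            "INSERT INTO postings (term, doc_id) VALUES (?, ?)",
            [(term, doc_id) for term in tokenize(body)],
        )
    conn.commit()
    return conn

def search(conn, query):
    """Return ids of documents containing every word in the query."""
    terms = sorted(tokenize(query))
    if not terms:
        return []
    placeholders = ",".join("?" * len(terms))
    rows = conn.execute(
        f"SELECT doc_id FROM postings WHERE term IN ({placeholders})"
        " GROUP BY doc_id HAVING COUNT(*) = ? ORDER BY doc_id",
        (*terms, len(terms)),
    )
    return [row[0] for row in rows]
```

Calling search(conn, "invoice paper") looks up two index keys instead of scanning every document body, which is the reason indexing tends to beat a LIKE search over one big text column. SQL Server's full-text search maintains this kind of index for you and adds features such as word-form matching and ranking, so in practice it is probably the better route than rolling your own.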
 
Douglas Laudenschlager [MS]

Ryan, I thought you might be interested to know that Microsoft Office 2003
ships with a little-known component, Microsoft Office Document Imaging
(MODI), which exposes a programmable COM API for automation. This component
could be used to run OCR on all the TIFF images in a given folder, for
instance.

http://msdn.microsoft.com/library/d.../en-us/Mspauto/html/ditocOMMap.asp?frame=true

-Doug

--
Douglas Laudenschlager
Microsoft SQL Server documentation team
Redmond, Washington, USA

This posting is provided "AS IS" with no warranties, and confers no rights.
 
