Scanned Doc to OCR to Indexed SQL DB

Ryan Ternier · Jul 15, 2004

Hello,

We're about to start a new web application in which users scan printed
documents and upload them to the website.
When these docs hit the website, they will be run through an online OCR
utility to get the raw text from them.
From this point, they will be stripped down and shoved (maybe even stomped)
into a SQL DB.

I've never attempted this before, and was looking for some advice on the
DB section.
Once the document is uploaded and stripped into its text form, what is the
best way to index it so users can search it? We could just put the
whole thing into a large text field, but indexing seems faster.

If anyone could shed light, or point me in a direction it'd be appreciated.

Thanks.

Ryan Ternier
Code Monkey

Mary Chipman · Jul 15, 2004

I'd suggest looking into SQL Server's full-text search capabilities.
There's quite decent documentation in SQL Books Online.
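
For a rough idea of why an index beats scanning one large text field, here is
a minimal Python sketch of an inverted index, the general structure that
full-text search is built on. It is not how SQL Server implements full-text
search; the class name, the tokenizer, and the sample documents are all made
up for illustration.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


class InvertedIndex:
    """Maps each word to the set of document ids that contain it."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, doc_id, text):
        # Record every word of the OCR'd text against this document id.
        for word in tokenize(text):
            self.postings[word].add(doc_id)

    def search(self, query):
        """Return ids of documents containing every word in the query."""
        words = tokenize(query)
        if not words:
            return set()
        # Copy the first posting set, then intersect with the rest.
        result = set(self.postings.get(words[0], set()))
        for word in words[1:]:
            result &= self.postings.get(word, set())
        return result


# Hypothetical OCR output for three uploaded documents.
index = InvertedIndex()
index.add(1, "Invoice for scanned office equipment")
index.add(2, "Scanned lease agreement for the Redmond office")
index.add(3, "Meeting notes")

print(index.search("scanned office"))
```

A search only touches the posting sets for the query words, so it never has to
read the full text of every document the way a LIKE scan over a text column
would.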

--Mary

Douglas Laudenschlager [MS] · Jul 16, 2004

Ryan, I thought you might be interested to know that Microsoft Office 2003
ships with a little-known component, Microsoft Office Document Imaging
(MODI), which exposes a programmable COM API for automation. This component
could be used to run OCR on all the TIFF images in a given folder, for
instance.

http://msdn.microsoft.com/library/d.../en-us/Mspauto/html/ditocOMMap.asp?frame=true

-Doug

--
Douglas Laudenschlager
Microsoft SQL Server documentation team
Redmond, Washington, USA

This posting is provided "AS IS" with no warranties, and confers no rights.
