PC Review



Direct OCR of electronic document

 
 
Hi Ho Silver
Guest
Posts: n/a
 
      13th Apr 2007
I have some electronic documents in the form of un-editable pictures in PDF
files. I need to convert them to editable text, not necessarily in PDF
format. What I have been doing is printing these PDF files, scanning them
on my HP 3970 scanner as a document [using the option to "Scan for editable
text (OCR)"] and saving as a MS Word file. I am using the HP Photo &
Imaging Software Version 2.1 that came with the scanner. What I would like
to do is convert the original electronic documents directly from their PDF
picture file to the editable text without the printing step. My questions:

1. Is it possible for me to do this direct conversion with the HP software
I have?
2. If not, what are other ways I could do this; e.g. with other software?
3. Is there an internet online service that can do this without me buying
new software?

Thanks



 
 
 
 
 
Dances With Crows
Guest
Posts: n/a
 
      13th Apr 2007
Hi Ho Silver staggered into the Black Sun and said:
> I have some PDF files. I need to convert them to editable text,
> not necessarily in PDF format.


Huh? PDF is essentially a write-once format. Even with expensive crap
like Acrobrat (full) and Enfocus Pitstop, trying to change text in a PDF
is an exercise in pain and futility.

> printing these PDF files, scanning them on my HP 3970 scanner as a
> document (using [an OCR engine]) and saving as a MS Word file.
> I would like to convert the original electronic documents directly
> from PDF to editable text without the printing step. Is it possible
> for me to do this direct conversion with the HP software I have?


Bundled software is almost always broken and/or lacking useful features.
I doubt there's a way to do that.

> If not, what are other ways I could do this with other software?


This script requires ImageMagick, bash, and Xpdf. All of these should
be already installed on Real OSes, but they're also available for 'Doze.

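# Render each page as a 300 dpi monochrome PBM image
pdftoppm -r 300 -mono file.pdf prefix
for i in prefix*.pbm ; do
  # Same name with a .tif extension instead of .pbm
  j="${i%.pbm}.tif"
  # Write a Group 4 compressed TIFF tagged as 300 pixels per inch;
  # delete the PBM only if the conversion succeeded
  convert -compress Group4 -density 300 -units PixelsPerInch \
    "$i" "$j" && rm -f "$i"
done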

(Run the TIFFs through an OCR engine. You may have to do that manually,
since too few commercial OCR engines are scriptable in any sane way.)
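
If a scriptable engine is available, the OCR pass could be looped as well. The following is only a rough sketch: it assumes the open-source Tesseract engine is installed (nobody in this thread has tested it) and that it is invoked as "tesseract image outputbase", which writes outputbase.txt. The function name ocr_tiffs and the OCR_CMD override are made up for illustration.

```bash
# Run an OCR command over each TIFF given as an argument.
# OCR_CMD is a hypothetical override; it defaults to "tesseract",
# which is assumed to be installed and on the PATH.
ocr_tiffs() {
  local ocr_cmd tif
  ocr_cmd="${OCR_CMD:-tesseract}"
  for tif in "$@"; do
    # page-1.tif -> text written to page-1.txt
    "$ocr_cmd" "$tif" "${tif%.tif}" || return 1
  done
}
```

It would be called as "ocr_tiffs prefix*.tif" once the conversion loop has finished. Output quality depends heavily on the engine and the scans, so check a few pages by hand.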

NOTE: Depending on how these PDFs are set up, you may want to use -gray
and -compress LZW instead of -mono and -compress Group4. With -gray,
pdftoppm writes .pgm files rather than .pbm, so change the file pattern
in the loop to match. Try both on a short PDF and see what you get in
terms of image quality and OCRed text quality.
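
To make that comparison easier, the two variants could be wrapped in one function. This is a sketch under the same assumptions as the script (Xpdf's pdftoppm and ImageMagick's convert on the PATH); the function name pdf_to_tiffs and its "mono"/"gray" argument are invented for illustration.

```bash
# Convert a PDF to one TIFF per page in the chosen mode.
# $1 = PDF file, $2 = output prefix, $3 = "mono" or "gray"
pdf_to_tiffs() {
  local pdf prefix mode ext compress page
  pdf="$1"; prefix="$2"; mode="$3"
  case "$mode" in
    mono) ext=pbm; compress=Group4 ;;   # 1-bit pages, fax compression
    gray) ext=pgm; compress=LZW ;;      # greyscale pages, lossless LZW
    *) echo "mode must be mono or gray" >&2; return 2 ;;
  esac
  pdftoppm -r 300 "-$mode" "$pdf" "$prefix" || return 1
  for page in "$prefix"*."$ext"; do
    [ -e "$page" ] || continue          # no pages were produced
    convert -compress "$compress" -density 300 -units PixelsPerInch \
      "$page" "${page%.$ext}.tif" && rm -f "$page"
  done
}
```

For example, "pdf_to_tiffs short.pdf mono-test mono" followed by "pdf_to_tiffs short.pdf gray-test gray" leaves two sets of TIFFs to feed to the OCR engine and compare.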

> Is there an internet service that can do this without me buying new
> software?


Why would you need to buy software to do this? There are so many Free
tools out there that do so many things that there's very little need to
buy software in this modern age. (Unless you're not familiar with using
your computer to its full potential. Lots of people aren't, and they
pay for it with $, time, lost data, malware, and stupid problems.)

--
I will rule you all with my iron fist. YOU! Obey the fist!
--Invader Zim
Matt G|There is no Darkness in Eternity/But only Light too dim for us to see
 
 
Barry Watzman
Guest
Posts: n/a
 
      13th Apr 2007
If you get the full version of Acrobat (not just the Reader), the pages
can be directly exported as graphics pages (e.g. JPEG or TIFF).

Also, many OCR programs can directly accept PDF files as their input.

However, the software that comes with hardware (e.g. scanners) is
normally a low-end, stripped-down version. What you want is possible, but
you will probably have to buy some real software.


Hi Ho Silver wrote:
> I have some electronic documents in the form of un-editable pictures in PDF
> files. I need to convert them to editable text, not necessarily in PDF
> format. What I have been doing is printing these PDF files, scanning them
> on my HP 3970 scanner as a document [using the option to "Scan for editable
> text (OCR)"] and saving as a MS Word file. I am using the HP Photo &
> Imaging Software Version 2.1 that came with the scanner. What I would like
> to do is convert the original electronic documents directly from their PDF
> picture file to the editable text without the printing step. My questions:
>
> 1. Is it possible for me to do this direct conversion with the HP software
> I have?
> 2. If not, what are other ways I could do this; e.g. with other software?
> 3. Is there an internet online service that can do this without me buying
> new software?
>
> Thanks
>
>
>

 
 
MyVeryOwnSelf
Guest
Posts: n/a
 
      13th Apr 2007
> I have some electronic documents in the form of un-editable pictures
> in PDF files. I need to convert them to editable text, not
> necessarily in PDF format. ... What I would like to do is convert the
> original electronic documents directly from their PDF picture file to
> the editable text without the printing step. My questions:
>
> 1. Is it possible for me to do this direct conversion with the HP
> software I have?


Dunno.



> 2. If not, what are other ways I could do this; e.g. with other
> software?


The following worked for me in Windows XP with MS-Office 2003.

In the Acrobat viewer, print to "Microsoft Office Document Image Writer."
This is a virtual printer, like a pdf printer, but it uses a different file
format: mdi.

Mdi files open in a "Microsoft Office Document Imaging" application. There,
I used:
Tools -> Send text to Word
to activate the OCR software that's in MS-Office.



> 3. Is there an internet online service that can do this
> without me buying new software?


Dunno.
 
 
Toni Nikkanen
Guest
Posts: n/a
 
      13th Apr 2007
Dances With Crows <(E-Mail Removed)> writes:

> Huh? PDF is essentially a write-once format. Even with expensive crap
> like Acrobrat (full) and Enfocus Pitstop, trying to change text in a PDF
> is an exercise in pain and futility.


Actually that depends a lot on the PDF - there are many kinds of PDF.
Some of them are really just large, dumb clumps of pixels, while others
actually know about chapters, sections, headings, text.... The latter
are searchable, indexable and more modifiable than the former.
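
A quick way to tell the two kinds apart, as a sketch: pdftotext (part of Xpdf) prints whatever text a PDF contains, so a pixels-only file should yield nothing but whitespace and page breaks. The helper names below are made up, and a PDF with only a stray text fragment would still count as "has text".

```bash
# Succeeds if stdin contains at least one non-whitespace character.
has_visible_text() {
  grep -q '[^[:space:]]'
}

# Succeeds if the PDF named in $1 appears to contain extractable text.
# Assumes pdftotext is on the PATH; "-" sends the text to stdout.
pdf_has_text_layer() {
  pdftotext "$1" - 2>/dev/null | has_visible_text
}
```

If "pdf_has_text_layer file.pdf" succeeds, "pdftotext file.pdf file.txt" may be all that is needed and no OCR step is required.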

As a useful bit of information, Adobe Acrobat, as far as I know, has
the ability to OCR non-editable bitmap PDF files into searchable,
editable ones. Pretty nifty considering the original non-editable
probably came straight out of a scanner!
 
 
Barry Watzman
Guest
Posts: n/a
 
      14th Apr 2007
Wow, I didn't know that any of that stuff was even in Office. Very
interesting (although I'm not sure it's the best answer to the original
poster's question).


MyVeryOwnSelf wrote:
>> I have some electronic documents in the form of un-editable pictures
>> in PDF files. I need to convert them to editable text, not
>> necessarily in PDF format. ... What I would like to do is convert the
>> original electronic documents directly from their PDF picture file to
>> the editable text without the printing step. My questions:
>>
>> 1. Is it possible for me to do this direct conversion with the HP
>> software I have?

>
> Dunno.
>
>
>
>> 2. If not, what are other ways I could do this; e.g. with other
>> software?

>
> The following worked for me in Windows XP with MS-Office 2003.
>
> In the Acrobat viewer, print to "Microsoft Office Document Image Writer."
> This is a virtual printer, like a pdf printer, but it uses a different file
> format: mdi.
>
> Mdi files open in a "Microsoft Office Document Imaging" application. There,
> I used:
> Tools -> Send text to Word
> to activate the OCR software that's in MS-Office.
>
>
>
>> 3. Is there an internet online service that can do this
>> without me buying new software?

>
> Dunno.

 
 
Maris V. Lidaka Sr.
Guest
Posts: n/a
 
      14th Apr 2007
I don't know about the HP software, but in OmniPage one would use
"File-Import" rather than "File-Open". Try it.

Maris

Hi Ho Silver wrote:
> I have some electronic documents in the form of un-editable pictures
> in PDF files. I need to convert them to editable text, not
> necessarily in PDF format. What I have been doing is printing these
> PDF files, scanning them on my HP 3970 scanner as a document [using
> the option to "Scan for editable text (OCR)"] and saving as a MS Word
> file. I am using the HP Photo & Imaging Software Version 2.1 that
> came with the scanner. What I would like to do is convert the
> original electronic documents directly from their PDF picture file to
> the editable text without the printing step. My questions:
> 1. Is it possible for me to do this direct conversion with the HP
> software I have?
> 2. If not, what are other ways I could do this; e.g. with other
> software?
> 3. Is there an internet online service that can do this without me
> buying new software?



 
 
Hi Ho Silver
Guest
Posts: n/a
 
      14th Apr 2007

"MyVeryOwnSelf" <(E-Mail Removed)> wrote in message
news:Xns99119495462A7RCLQUSHB976@216.196.97.136...
>> I have some electronic documents in the form of un-editable pictures
>> in PDF files. I need to convert them to editable text, not
>> necessarily in PDF format. ... What I would like to do is convert the
>> original electronic documents directly from their PDF picture file to
>> the editable text without the printing step.


>> 2. ...what are other ways I could do this; e.g. with other
>> software?

>
> The following worked for me in Windows XP with MS-Office 2003.
>
> In the Acrobat viewer, print to "Microsoft Office Document Image Writer."
> This is a virtual printer, like a pdf printer, but it uses a different
> file format: mdi.
>
> Mdi files open in a "Microsoft Office Document Imaging" application.
> There, I used:
> Tools -> Send text to Word
> to activate the OCR software that's in MS-Office.

..

Thanks very much for this post; I learned a lot. I had not even known about
"Microsoft Office Document Image Writer"! But I have now installed it from
my Office XP disk. Here is a way I have found that works for me now:

1. Use Acrobat PDF selection tool to copy to clipboard.
2. Open “Microsoft Office Document Imaging” – choose Edit\Paste Pages
3. Do the OCR and/or convert to a Word document.
4. Looks pretty good.

Thanks again!


 
 
CSM1
Guest
Posts: n/a
 
      7th Aug 2009
You have probably not heard of OmniPage software either.
http://www.nuance.com/imaging/products/omnipage.asp

It may be one of the world's best OCR packages.

--
CSM1
http://www.carlmcmillan.com
--
"dayavincent" <(E-Mail Removed)> wrote in message
news:(E-Mail Removed)...
>
> Since Last I am using document management software from microdocsusa.com.
> It is scalable software, from the smallest application to the
> largest enterprise. Designed to conform to our needs and business
> rules. Share documents company-wide. Powerful search capabilities yet
> completely user friendly; very nice software. I have no idea about HP
> Photo & Imaging Software Version 2.1.
>
>



 