Using my HP C6100 OCR feature to copy pages from a book

W. eWatson

I have a book from 1967 with about 40 pages of an old FORTRAN program
that I would like to copy as text into a file. I have an HP C6100 with
OCR capabilities. It sometimes has trouble with = signs. Others are
(,), periods, and the four basic arithmetic operators. I think the code
uses .GE. for >=. The four margins around the code are yellowing,
particularly the area around the binding. I've had some success with
about 3-4 pages, but would like to do better. Are there any OCR tips
for dealing with math symbols?

I did copy a page of the book to a sheet of paper, but the margins still
look grayish. Perhaps a paper copy is better, since I can probably
darken the letters.

I hope Windows has a FORTRAN compiler. If not, I can probably find one
in Linux. It would even be good if there were some program that could
translate this to, say, Python or C. Since there are several math
routines like least squares, it would be easy to replace them.
 
default

> I have a book from 1967 with about 40 pages of an old FORTRAN program
> that I would like to copy as text into a file. I have an HP C6100 with
> OCR capabilities. It sometimes has trouble with = signs. Others are
> (,), periods, and the four basic arithmetic operators. I think the code
> uses .GE. for >=. The four margins around the code are yellowing,
> particularly the area around the binding. I've had some success with
> about 3-4 pages, but would like to do better. Are there any OCR tips
> for dealing with math symbols?

You may not want to restrict yourself to what came with the scanner. Once
you've scanned to a standard format, you can use any of quite a few OCR
programs, some free. I tried ABBYY, which did a good job but was over
budget. (You can get a free trial version -- 400 Meg download!)

FORTRAN doesn't use real math symbols, only a fairly small character set
(probably not even lower case, if from the 1960s). It would be nice if the
OCR software could be told to look only for a specific character set, as
it could then do a better job. I don't know whether any can. I have been
using Tesseract (free), which is making mistakes like the ones you
mentioned. I've read it can be "trained", but don't know how and don't do
enough OCR to make it worth the effort.
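
A rough sketch of the restricted-character-set idea, for anyone who wants to try it. Recent Tesseract releases have a configuration variable, tessedit_char_whitelist, that limits which characters the recognizer may output; whether it is honoured depends on the Tesseract version and engine mode, so treat this as untested against the version discussed here. The character list is an assumption (upper-case letters, digits, and the usual 1960s FORTRAN punctuation), and the helper name and file names are made up for illustration.

```python
import shlex

# Assumed character set for 1960s FORTRAN source: upper-case letters,
# digits, and the punctuation = + - * / ( ) , . $ (blank is left out
# because it is awkward to pass on a command line).
FORTRAN_CHARS = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789=+-*/(),.$"


def tesseract_command(image_path, output_base):
    """Build (but do not run) a tesseract command line for one scanned page.

    Hypothetical helper: it only assembles the argument list.
    """
    return [
        "tesseract", image_path, output_base,
        "--psm", "6",  # treat the page as one uniform block of text
        "-c", "tessedit_char_whitelist=" + FORTRAN_CHARS,
    ]


if __name__ == "__main__":
    # Print the command so it can be inspected or pasted into a shell.
    print(shlex.join(tesseract_command("page01.png", "page01")))
```

Older Tesseract versions spell the page-segmentation option differently, so check the help output of the installed version before relying on this.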

> I did copy a page of the book to a sheet of paper, but the margins still
> look grayish. Perhaps a paper copy is better, since I can probably
> darken the letters.

In theory it shouldn't be necessary to make a copy before doing OCR, as
it should work on the contrast differences, but maybe inferior software
would benefit.
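
To illustrate what "working on the contrast differences" can mean, here is a minimal sketch of global thresholding with Otsu's method, which picks the gray level that best separates dark ink from lighter paper and then forces every pixel to pure black or pure white. Nothing here is known to be what the HP software or any particular OCR program does. The page is assumed to be already loaded as rows of 0-255 gray values, and the function names are invented for the example.

```python
def otsu_threshold(pixels):
    """Return the gray level (0-255) that best splits pixels into dark and light."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_level, best_score = 0, -1.0
    count_dark = 0
    sum_dark = 0
    for level in range(256):
        # Pixels with value <= level form the "dark" class.
        count_dark += hist[level]
        sum_dark += level * hist[level]
        count_light = total - count_dark
        if count_dark == 0 or count_light == 0:
            continue
        mean_dark = sum_dark / count_dark
        mean_light = (sum_all - sum_dark) / count_light
        # Between-class variance (up to a constant factor).
        score = count_dark * count_light * (mean_dark - mean_light) ** 2
        if score > best_score:
            best_level, best_score = level, score
    return best_level


def binarize(rows):
    """Map each pixel to 0 (ink) or 255 (paper) using the Otsu threshold."""
    flat = [p for row in rows for p in row]
    threshold = otsu_threshold(flat)
    return [[0 if p <= threshold else 255 for p in row] for row in rows]


if __name__ == "__main__":
    # Tiny made-up page: ink = 30, clean paper = 200, grayish margin = 170.
    page = [
        [170, 200, 200],
        [170, 30, 200],
        [170, 30, 200],
    ]
    for row in binarize(page):
        print(row)
```

A single global threshold may not be enough where the yellowing varies a lot across the page, for example near the binding; in that case a locally adaptive threshold would be the next thing to try.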

> I hope Windows has a FORTRAN compiler. If not, I can probably find one
> in Linux. It would even be good if there were some program that could
> translate this to, say, Python or C. Since there are several math
> routines like least squares, it would be easy to replace them.

GCC is available for Windows and includes a Fortran compiler. You may have
to install something like Cygwin. But if you're happy with Linux, use it
(I do).
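
On the idea of replacing the math routines rather than retyping them: as an illustration only, a straight-line least-squares fit takes a few lines of plain Python. The book's routines may well do something more general (polynomial or weighted fits, say), so this is an assumption about the simplest case, and the function name is made up.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept to paired data."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    denom = n * sum_xx - sum_x ** 2
    if denom == 0:
        raise ValueError("all x values are identical; slope is undefined")
    slope = (n * sum_xy - sum_x * sum_y) / denom
    intercept = (sum_y - slope * sum_x) / n
    return slope, intercept


if __name__ == "__main__":
    # Points that lie exactly on y = 2x + 1.
    print(fit_line([0, 1, 2], [1, 3, 5]))
```

The FORTRAN original would still be needed as a reference to confirm that a replacement computes the same quantity.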
 
David Harper

<snip>

> You may not want to restrict yourself to what came with the scanner. Once
> you've scanned to a standard format, you can use any of quite a few OCR
> programs, some free. I tried ABBYY, which did a good job but was over
> budget. (You can get a free trial version -- 400 Meg download!)

I just tried using ABBYY FineReader 9.0 (not the latest version) on a page
of FORTRAN code from the 1974 book "Fortran for Humans". It is a 10" x 7.5"
paperback. It did not do a very good job. There were lots of errors with
the punctuation characters.

- David Harper
 
default

<snip>

> > You may not want to restrict yourself to what came with the scanner.
> > Once
>
> I just tried using ABBYY FineReader 9.0 (not the latest version) on a
> page of FORTRAN code from the 1974 book "Fortran for Humans". It is a
> 10" x 7.5" paperback. It did not do a very good job. There were lots of
> errors with the punctuation characters.

That's disappointing, but makes me glad I didn't put up the money for it.
Honestly, though, it did a pretty good job in my brief testing. I had
Courier font and told it to look for Courier (or so I thought). My
recollection is the punctuation came out OK, but I didn't do a really
scientific test. It was whatever version you could download from their
website last week.

I've been using Tesseract for three days now. It seems to have a bias for
English, and I suspect many programs do. (In fact, you are able to tell
it what language your material is in, but probably only natural
languages.) For example, ! in my C program tends to come out as 'I' (but
not always). Ells and ones are a problem, as are ohs and zeroes. There are
many errors, which I correct by hand (using a compiler to find them). It's
better than typing in the whole thing by hand. The documentation says you
can give it "training" material, but I don't have enough OCR to do to make
it worth the time to figure that out.
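
Some of the oh/zero and ell/one cleanup can be automated before the compiler sees the text. The sketch below is a heuristic, not a tested tool: it assumes upper-case-only FORTRAN, where a token that begins with a digit and otherwise contains only digits, capital O, or lower-case l is almost certainly a number. Capital I is deliberately left alone, because FORMAT items such as 2I5 are legitimate. The function name is invented for the example.

```python
import re

# A token that starts with a digit and contains only digits, capital O,
# or lower-case l is assumed to be a misread number.
_NUMERIC_TOKEN = re.compile(r"\b\d[\dOl]*\b")


def fix_numeric_tokens(line):
    """Replace O with 0 and l with 1 inside tokens that look like numbers."""
    def repair(match):
        return match.group(0).replace("O", "0").replace("l", "1")
    return _NUMERIC_TOKEN.sub(repair, line)


if __name__ == "__main__":
    for sample in ["DO 1O I = 1, N", "X = 1O.5 + 2l", "FORMAT(2I5)"]:
        print(fix_numeric_tokens(sample))
```

It will not catch a misread character at the start of a number (O5 could be a valid variable name), so the compiler pass is still needed for whatever this misses.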

Good luck with the project.
 
