Scanning 100 years of newspapers. Advice?

Tristan Miller

Greetings.

I've volunteered to help a publisher produce a digital archive of their
newspaper. The newspaper has been printed monthly since 1904 on A4 paper,
with about 20 pages per issue. Issues until about 1965 are
black-and-white, then spot colour until around 2003. My task will be to
scan the printed copies (up to about 1995; thereafter I have access to the
original electronic files) and produce OCR'd PDFs for distribution on
CD/DVD/Internet.

I thought I'd ask for some tips or recommendations on the following
aspects:

1) What sort of scanning DPI is typically used nowadays to archive
documents? I have two high-speed professional RICOH scanners which can do
up to 600 dpi.

2) The RICOH devices have a "Text OCR" setting with dropout colour, which I
presume is best for postprocessing the image with OCR software. (The
scanner does not do OCR itself.) The resulting image is a 1-bit TIFF.
There are also settings for grayscale and colour JPEGs.

Any suggestions on what scan settings I should use for the black and white
pages, and for the spot-colour pages?

I presume that for the spot colour pages, I should scan once with the "Text
OCR" setting, for the purpose of OCR, and then once again with the
full-colour JPEG setting for presentation purposes. That is, the JPEG
images will be stitched together to form a PDF, with the OCR text captured
from the TIFF image "underneath".

For the black and white pages, would it make any sense to take a similar
approach? That is, should I make a grayscale scan of the page, or will
the 1-bit TIFF look good enough in a PDF?

3) Any recommendations for OCR software? I am working on a GNU/Linux
machine and have gocr and ocrad installed, but don't have much experience
with them. I would prefer to use free/open-source software, but can
obtain an MS-Windows machine and commercial OCR software if necessary. As
mentioned above, I will need the software to be able to make PDFs with
text "underneath" a TIFF or JPEG image. This way the user will see the
original scanned page in his PDF viewer, but will also be able to select
the text with the mouse or search for it with the Find tool.

Because of the huge volume of newspapers I have to process, my primary
criterion for the OCR software is that it should be as close to "batch
mode" as possible -- I want it to run with minimum user interaction.

Regards,
Tristan
 
Eino Uikkanen

Tristan Miller said:
> Greetings.
>
> I've volunteered to help a publisher produce a digital archive of their
> newspaper. The newspaper has been printed monthly since 1904 on A4 paper,
> with about 20 pages per issue. Issues until about 1965 are
> black-and-white, then spot colour until around 2003. My task will be to
> scan the printed copies (up to about 1995; thereafter I have access to the
> original electronic files) and produce OCR'd PDFs for distribution on
> CD/DVD/Internet.

I would ask the people behind this project for their experiences and advice:

http://digi.lib.helsinki.fi/index.html?language=en

Eino Uikkanen
http://www.kolumbus.fi/eino.uikkanen/index.htm
 
Dances With Crows

["Followup-To:" header set to comp.periphs.scanners.]
> I've volunteered to help a publisher produce a digital archive of
> their newspaper.

Volunteered? This is going to be a metric arseload of work, so if I
were you, I'd ask for $.

> The newspaper has been printed monthly since 1904 on A4 paper, with
> about 20 pages per issue.

Monthly, small paper? That sounds more like a journal or magazine than
a newspaper. This makes things easier; fewer pages to scan, less
complicated layouts.

> What sort of scanning DPI is typically used nowadays to archive
> documents?

Depends on the documents. 300 DPI should be fine for this sort of work
unless you have Chinese/Japanese text or text < 8 point.

> Should I make a grayscale scan of the page, or will the 1-bit TIFF
> look good enough in a PDF?

300 DPI 1-bit looks Just Fine in a PDF. Hell, if the text is reasonably
clean, 150 DPI would be fine, though 300 is probably better for OCR
accuracy.
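
For a sense of what those resolutions mean in practice, here is a minimal sketch of the arithmetic. It assumes a standard 210 x 297 mm A4 sheet and ignores TIFF row padding and compression, so the byte counts are only rough upper bounds. The function name `scan_size` is made up for illustration.

```python
# A4 sheet size in millimetres; there are 25.4 mm to the inch.
A4_MM = (210, 297)
MM_PER_INCH = 25.4

def scan_size(dpi, bits_per_pixel):
    """Return (width_px, height_px, uncompressed_bytes) for one A4 page."""
    width = round(A4_MM[0] / MM_PER_INCH * dpi)
    height = round(A4_MM[1] / MM_PER_INCH * dpi)
    return width, height, width * height * bits_per_pixel // 8

# Compare the three resolutions mentioned in this thread for 1-bit scans.
for dpi in (150, 300, 600):
    w, h, size = scan_size(dpi, 1)
    print(f"{dpi} DPI 1-bit: {w} x {h} px, about {size / 1e6:.1f} MB uncompressed")
```

Doubling the DPI quadruples the pixel count, so a 600 DPI page is roughly four times the size of a 300 DPI one before compression.
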

> Any recommendations for OCR software? I am working on a GNU/Linux
> machine and have gocr and ocrad installed

Those suck. A 2000-era commercial engine produces better results on
clean images, and insanely better results on noisy images. There are
many areas where Free software is better than payware, but OCR is not
one of them. Not yet, anyway.

> I will need the software to be able to make PDFs with text
> "underneath" a TIFF or JPEG image.

The company I work for tried that. It didn't exactly do what we wanted.
Then again, we were writing custom code, and there may have been a bug
in it.

> original scanned page in his PDF viewer, but will also be able to
> select the text with the mouse or search for it with the Find tool.

If I were you, I'd save the original images and the OCRed text in
separate files. Have them associated; "19040101-0001.tif" and
"19040101-0001.txt" or something, then a quick grep finds all
occurrences of the word the user wants by page. No need to mess with
the hideously slow Acrobrat implementation of zgrep then.
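
The paired-file idea can be sketched in a few lines of Python. This is only an illustration: it assumes every page image "NAME.tif" has its OCR text in "NAME.txt" in one directory, and the function name `find_pages` is hypothetical.

```python
import sys
from pathlib import Path

def find_pages(text_dir, word):
    """Return the page image names whose OCR text file contains word."""
    needle = word.lower()
    hits = []
    for txt in sorted(Path(text_dir).glob("*.txt")):
        # OCR output may contain stray bytes, so decode leniently.
        text = txt.read_text(encoding="utf-8", errors="replace")
        if needle in text.lower():
            # Report the associated image, e.g. 19040101-0001.tif
            hits.append(txt.with_suffix(".tif").name)
    return hits

if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage: python find_pages.py <directory> <word>
    for name in find_pages(sys.argv[1], sys.argv[2]):
        print(name)
```

Because the file names start with the issue date, the sorted results come out in chronological order.
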

> Because of the huge volume of newspapers I have to process, my primary
> criterion for the OCR software is that it should be as close to "batch
> mode" as possible -- I want it to run with minimum user interaction.

You're going to have to edit images by hand for maximum quality. No
automated process will be able to cope with the wide range of page
conditions you're going to see. Don't forget deskewing, either--nothing
can find and fix all skewed images, and if an ADF is involved, skew
happens. The company I work for has done this sort of work for a long
time, and we've gotten fairly good at it. Holler at my Gmail (mind the
spam trap) if you're interested in a solution that costs money.
 
Tristan Miller

Greetings.

> Volunteered? This is going to be a metric arseload of work, so if I
> were you, I'd ask for $.

They're providing me with an apartment and living expenses, so don't worry,
I'm not going to starve. :) I need to have an index of this newspaper for
my doctoral dissertation anyway, so at least part of the work is going to
be done whether or not I'm paid for it.

> Monthly, small paper? That sounds more like a journal or magazine than
> a newspaper. This makes things easier; fewer pages to scan, less
> complicated layouts.

> Depends on the documents. 300 DPI should be fine for this sort of work
> unless you have Chinese/Japanese text or text < 8 point.

On Monday and Tuesday I scanned all issues from 1970 to 1997 at 1-bit 600
DPI. I ended up settling on 600 over 300 because there are lots of
halftone photographs that I want accurately reproduced. I made separate
scans of the colour pages as 600 dpi JPEGs. Since it's all spot colour, I
am now converting these images to 1- or 2-bit PNGs. This requires a bit
of work by hand, since most image software's colour reduction algorithms
aren't smart enough to correctly find the 1 or 2 spot colours plus the
paper colour. I usually have to manually specify the 2- or 3-colour
palette.
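
The manual-palette step amounts to mapping every pixel to the nearest of a few hand-picked colours. A minimal pure-Python sketch follows. The palette values and the tiny 2 x 2 "scan" are invented for illustration, and a real run would read the pixels from the JPEG with an imaging library.

```python
def nearest_index(pixel, palette):
    """Index of the palette colour closest to pixel (squared RGB distance)."""
    return min(
        range(len(palette)),
        key=lambda i: sum((p - q) ** 2 for p, q in zip(pixel, palette[i])),
    )

def reduce_to_palette(pixels, palette):
    """Map each RGB pixel in a list of rows to its nearest palette index."""
    return [[nearest_index(px, palette) for px in row] for row in pixels]

# Hypothetical hand-picked palette: paper colour, black ink, one red spot colour.
PALETTE = [(245, 240, 225), (20, 20, 20), (200, 30, 40)]

# A made-up 2 x 2 fragment of a scanned page with slightly noisy colours.
scan = [
    [(250, 244, 230), (35, 30, 28)],
    [(190, 45, 50), (240, 235, 220)],
]
print(reduce_to_palette(scan, PALETTE))  # [[0, 1], [2, 0]]
```

With three palette entries each pixel needs only 2 bits, which is why the indexed PNG ends up so much smaller than the 24-bit JPEG.
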

> The company I work for tried that. It didn't exactly do what we wanted.
> Then again, we were writing custom code, and there may have been a bug
> in it.

Well, I've seen PDFs of the sort I described, so I know it's possible
somehow.

> If I were you, I'd save the original images and the OCRed text in
> separate files. Have them associated; "19040101-0001.tif" and
> "19040101-0001.txt" or something, then a quick grep finds all
> occurrences of the word the user wants by page. No need to mess with
> the hideously slow Acrobrat implementation of zgrep then.

Unfortunately, we can't count on users being so technically savvy. The CDs
and DVDs we distribute will include a search engine that can use
separate index files, but the PDFs available for download from the
Internet should have the OCR text embedded so that users can search
within individual issues.

> You're going to have to edit images by hand for maximum quality. No
> automated process will be able to cope with the wide range of page
> conditions you're going to see. Don't forget deskewing, either--nothing
> can find and fix all skewed images, and if an ADF is involved, skew
> happens. The company I work for has done this sort of work for a long
> time, and we've gotten fairly good at it. Holler at my Gmail (mind the
> spam trap) if you're interested in a solution that costs money.

The scanners I use perform skew correction, and from the looks of the scans
already completed, skew isn't going to be a major problem. (Then again,
I'm just eyeballing things; I haven't actually tried running any OCR
software on the images.) As I mentioned before, I am having to manually
edit the colour images as it's wasteful to have a 24 megabyte JPEG of a
page with only two ink colours; it takes me only a few minutes to reduce
these to 800K 2-bit PNGs.

I can probably convince the publisher to farm out some of the scanning
and/or OCR work to a company, provided it's not prohibitively expensive.
(They're a non-profit political organization, not a commercial publisher.)
If you want to provide a rough estimate for OCRing and making PDFs from
approximately 21000 scanned A4 pages (90 years * 12 issues * about 20
pages per issue) then feel free to send me an e-mail.

Regards,
Tristan
 
