Distributed Scanners (non-profit project to benefit the public domain)

Jon Noring

[This is a proposed project. If you are interested in being involved
as a founder/leader, let me know. Whether or not I put in the effort
to get this launched depends on whether I can assemble a core group of
movers/leaders.]


Google's recent announcement that it will scan millions of books over
the next couple of decades (in association with a few major libraries)
has drawn both applause and concern. Applause because this is long
overdue -- let's get our printed heritage online. Concern because the
scanned page images from the public domain books will not be
completely and freely accessible.

Fortunately, the Internet Archive is planning a parallel project (also
in association with a few major libraries) to likewise scan a large
number of books, and is currently shaking down their system with a
test project in Canada -- this test project appears to be going
smoothly. The IA *will* make the public domain books (and those under
copyright where they secure the necessary rights) freely and
completely available to the world. Bravo!

However, both projects use a centralized approach (with paid/trained
staff to run the process), and both will likely use very expensive
($100,000) robotic page-turning scanners so as not to damage the book
bindings, and to get high throughput.

Thus, I'm considering an alternative approach (intended to run in
parallel with, and to augment, the Internet Archive project) in which
a volunteer network of both institutions and individuals is set up to
scan donated books using relatively inexpensive, high-throughput
sheet-feed scanners. I call this project "Distributed Scanners",
inspired by "Distributed Proofreaders", an innovative online interface
that assists volunteers in proofing older texts, and with which this
project, if launched, will associate.

We will rely on the fact that there are a lot of books out there whose
bindings are pretty much gone -- the books are falling apart -- so,
except for rare "collectors" books, there's no issue in removing the
broken bindings (maybe by chopping) and running the pages through
high-speed sheet feeding scanners. After each book is scanned, the
pages are then appropriately packaged/boxed, indexed and archived
(they will NOT be thrown away).

The books can be given as a charitable donation to the project (in the
U.S. through a 501(c)(3) organization -- IA has been asked to be the
umbrella for this project, and to archive the original books, but I
have not yet heard back from them), thus private donors might be able
to claim a tax deduction. Used book stores, private book collectors,
public libraries, etc., etc., probably all have, in aggregate, a lot
of such books and will be happy to donate them to the cause (some
people may go through the used books at the Salvation Army, Goodwill,
DI, etc., and for a dime find some great books to donate to the
cause.)

In parallel with Distributed Scanners, we could set up a "Distributed
Catalogers" project: a network of volunteer librarians, with an online
interface that lets them enter the cataloging metadata associated with
each scanned book. They could also take the
lead in copyright clearance (further discussed in the note at the end
of this message.)

Of course, the final page scans will be donated to the Internet
Archive, plus to any other online archive out there willing to host
the scans (including Google.)

Note, too, the Distributed Scanners, over time, could diversify into
scanning other types of older documents, such as newspapers,
historical records, etc. The sky's the limit as to how we might
mobilize volunteers to digitize our printed heritage.

*****

So, with that introduction, I'm making three requests:

1) I'm looking for individuals reading this who would like to be
involved, especially as co-founders and co-movers/leaders of this
project, who see the potential and want to be involved. If you are
interested, contact me by email.

2) I'm looking for someone in the Salt Lake City area of Utah, who
owns (or has free access to) a high-quality sheet feed scanner that
we can begin experimenting with, to understand the various
real-world issues associated with this project.

3) Of course, looking for a few older books (pre-1963) that can be
donated for demo'ing this project. (Don't send any now, just let me
know if you have such books which are literally falling apart.)


And of course, your thoughts, ideas and criticisms are welcome.

Jon Noring



[Note on copyright aspects: We will definitely need to go through a
process to determine the copyright status of each scanned book. We can
draw upon the expertise of Project Gutenberg to assist with this (they
are experts at it.) Nearly all books published in the U.S. before 1923
are public domain -- and interestingly, a good number of the books
published between 1923 and 1963 are also public domain since copyright
renewal was required (some say up to 90% of all the books published
between 1923 and 1963 are public domain -- Distributed Proofreaders
has been working on the Copyright Office renewal files to assist with
renewal searching.) Public Domain books will become freely available;
the books whose status is either copyrighted or indeterminate will go
into a separate, non-public part of the archive, and efforts can be
made in the future to find and approach the copyright holders for
permission to allow the scans of those books to be made freely
available under a Creative Commons license.]
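
The triage rule described in this note can be sketched in a few lines
of Python. This is only a rough illustration: the names `Status`,
`classify_us_book` and the `renewed` flag are hypothetical, the note
itself says "nearly all" (not all) pre-1923 U.S. books are public
domain, and a real process would rest on an actual renewal search
rather than a single flag.

```python
from enum import Enum
from typing import Optional


class Status(Enum):
    PUBLIC_DOMAIN = "public domain"
    RESTRICTED = "copyrighted or indeterminate"


def classify_us_book(pub_year: int, renewed: Optional[bool]) -> Status:
    """Apply the rule of thumb from the note to a U.S.-published book.

    renewed: True or False if a renewal search gave a clear answer,
             None if no search has been done or the result is unclear.
    """
    if pub_year < 1923:
        # Simplification: the note says "nearly all", not "all".
        return Status.PUBLIC_DOMAIN
    if 1923 <= pub_year <= 1963 and renewed is False:
        # Renewal was required in this period and none was found.
        return Status.PUBLIC_DOMAIN
    # Renewed, unknown, or published after 1963: keep the scans in the
    # separate, non-public part of the archive.
    return Status.RESTRICTED


for year, renewed in [(1910, None), (1940, False), (1940, None), (1970, False)]:
    print(year, renewed, "->", classify_us_book(year, renewed).value)
```

Anything that is not clearly public domain falls through to the
restricted bucket, which matches the cautious default described above.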
 
Olin Sibert

Jon said:
Thus, I'm considering an alternative approach (which is intended to run
in parallel with, and to augment, the Internet Archive project) where
a volunteer network will be set up (and volunteers include both
institutions and individuals) to scan donated books using relatively
inexpensive and high throughput sheet feed scanners. I call this
project "Distributed Scanners", inspired by "Distributed
Proofreaders", an innovative online interface to assist volunteers in
proofing older texts, and which this project, if launched, will
associate with.

I have lately been doing a project with similar requirements.
It's processing computer manuals and listings, so it's not quite
the same, but I've learned a lot about scanners and workflow in
the process.

I decided to use several low-end duplex auto-feed scanners
rather than high-throughput devices, and the results have borne
out that decision: for a scanning task handled by at most two or
three people (often just one), the physical handling of
documents chews up enough time that the actual scanning speed
isn't very important. Another advantage is that when things do
go wrong, it's slower and less damaging.

I've been pretty happy with the Fujitsu ScanSnap 5110fx ($400,
6-7 sheets/minute) and the Documate 252 ($800, 12-14
sheets/minute). There's now a Documate 262 that appears
significantly faster ($1100). Scan quality seems to be fine;
pages are sometimes misaligned by a few degrees, but that can be
corrected after the fact (although I don't have the software to
do that yet).

One needs to think carefully about the workflow: how to remove
the bindings, how to index the material, how it will be filed,
packed, and shipped for storage, etc. I think it's critical to
separate the image collection (as TIFFs) from the OCR
process--verifying TIFFs is relatively easy, but OCR is a whole
separate task. I've found, however, that it's very important to
spot-check OCR quality for any document that's not absolutely
pristine, in case it needs to be scanned gray-scale rather than
monochrome.

Lots of things can go wrong: misfeeds, page damage, software
hangs, etc. I find that a significant fraction of time is
occupied just by rebooting machines. I stopped trying to
diagnose or recover from mysterious problems, because it wasted
time and rarely worked--now, as soon as something hiccups, I
just reboot the machine and start over. It really helps to have
a dedicated computer per scanner, so nothing else is lost to
those failures. One of the reasons I like the ScanSnap so much,
even though it's slow, is that it's unflappable: I've never had
to reboot its computer. For relatively large documents, a
single person can easily keep two scanners busy, and three is
possible. It helps to have a lot of work space to stack and
organize things.

My archivist pals at the Computer History Museum tell me that
600 dpi is the recommended standard for this sort of work: 300
or 400 works OK for crisp machine-printed originals, but
anything that's a bit funky, or anything with pictures, 600 is
important. With storage so cheap (and speed not being a big
issue), 600 has been fine.

I've found it helpful to have a flatbed scanner (very low-end is
fine here) available for things like covers, photographs,
drawings (which may warrant 1200 dpi scans and color), and pages
that are too damaged to feed successfully. Of course, it's then
a nuisance to re-integrate the images--I've explored a variety
of software for doing that (ranging from tiffcp to high-cost
TIFF tools to Acrobat), none of which has proven ideal. For
now, I'm keeping a lot of images separately and planning to find
better software later.

I started out with grand designs of collecting metadata as I
went, but I've switched to taking care of that after the fact
(i.e., from the scanned images rather than from the paper
copies) because it made the workflow too complicated.

With new material (crisp black ink on clean white paper),
black-and-white scans work well and are quite manageable in size
at 600 dpi. Older material is problematic: stains, dirt,
speckles, etc., will all result in unattractive page images that
are also hard to OCR. I've done some scanning in grayscale, but
at 35 megabytes per page, they're very clumsy to manage, and
they don't compress well. Grayscale scanning is only about half
as fast, too. I suspect there's some form of automated
processing that could reduce them to more manageable sizes, but
that's not something I'm worrying about for now--fortunately,
most of my material has been sufficiently clean that OCR works
OK with monochrome scans.
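
The file sizes mentioned above follow directly from the pixel count.
Here is a small Python sketch of the arithmetic, assuming a letter-size
page (8.5 x 11 inches; the page size is an assumption, it is not stated
above) and uncompressed data at 8 bits per pixel for grayscale and 1
bit per pixel for monochrome. The function name is made up for
illustration.

```python
def uncompressed_bytes(width_in: float, height_in: float,
                       dpi: int, bits_per_pixel: int) -> int:
    """Raw image size in bytes for a page scanned at the given resolution."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bits_per_pixel // 8


# Letter-size page at 600 dpi (page size is an assumption).
gray = uncompressed_bytes(8.5, 11, 600, 8)
mono = uncompressed_bytes(8.5, 11, 600, 1)
print(f"grayscale: {gray / 1e6:.1f} MB, monochrome: {mono / 1e6:.1f} MB")
```

Under those assumptions a grayscale page comes to roughly 34 MB, in
line with the figure quoted above, while the monochrome page is one
eighth of that before any compression.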

I have some material that's perfect-bound and will need to be
cut apart. I expect to do that with a table saw and a clamping
jig, but haven't actually built one yet. I'm sure that will be
another adventure (fortunately, I have plenty of duplicates!).
Professionals do this with a bandsaw, I believe.

Good luck! Feel free to e-mail if you have questions (making
the obvious adjustment to the address :)
 
dickey45

I am about to get rid of a bunch of my old books and have used a
bandsaw to cut the bindings. The problem with it is all the dust it
creates, and I suspect some scanners (the Fujitsu 4110 ScanSnap seemed
prone to this) get gummed up and have lots of misfeeds. My buddy is
thinking of making a cutter for books. You can also go to Kinko's and
they might cut the bindings.

I am trying to decide between:
ScanPartner fi-4120C2
Canon DR-2080C 7862A002
Xerox DocuMate 252
Visioneer Strobe Xp450
HP ScanJet 8250

because they are all sheet fed duplex scanners under $1000 street.

Any suggestions or comments on the scanners or procedures for doing
this type of scanning would be greatly appreciated.

With the research I have done so far:

ScanPartner fi-4120C2 - tends to jam if you are not feeding regular
paper (book pages, for example). On the plus side, it is super fast,
has a great driver, super support, and is pretty well designed. I
borrowed a ScanSnap and it really had feed problems, but after thinking
about it I'm not certain it was the paper; it may be that you have to
define the page size exactly in order to get it to feed correctly.

Canon DR-2080C 7862A002 - can also tend to jam. I have a Canon
ink/fax/scanner with ADF and it has been pretty rock solid for me,
especially with books (unless it is the really old, thick paper). It
isn't as fast but it is duplex.

Xerox DocuMate 252 - uses a Visioneer driver, and those have problems
with other Visioneer devices, maybe with other scanners too? I heard
they are good and solid, however.

Visioneer Strobe Xp450 - not such great reviews, and not that many
reviews...

HP ScanJet 8250 - after 3 $400+ HP scanners I refuse to own another.
Their drivers are crap and it sounds like this model still has the same
issues - plus it doesn't work with Acrobat 6 unless you install 5 to
cohabitate with 6. That said, the HP scanners all have one thing in
common, their feed reliability is excellent and they put up with abuse
much better (staples, bad paper, etc). This one won't scan your covers
or business cards (I think) like the sheet fed.

Any
 
renethx

I am converting my personal library books (1000+) into electronic
books. I would like to share some of my experience. I hope this may be
helpful.

dickey45 said:
I am about to get rid of a bunch of my old books and have used a
bandsaw to cut the bindings. The problem with it is all the dust it
creates, and I suspect some scanners (the Fujitsu 4110 ScanSnap seemed
prone to this) get gummed up and have lots of misfeeds. My buddy is
thinking of making a cutter for books. You can also go to Kinko's and
they might cut the bindings.

The tools I use to cut the bindings are a utility knife and a
heavy-duty rotary paper trimmer. The rotary trimmer I have (Carl
DC-210) can trim 30 sheets of paper (or even more) quite easily.
(Never use a guillotine-style trimmer!!) First, using the utility
knife, cut the whole block of pages away from the front/back covers
and the spine in the case of a hardcover book, or just cut off the
front/back covers in the case of a softcover book. At this point, all
the pages are still firmly glued to a piece of cloth (hardcover) or to
the spine itself (softcover). With the utility knife, cut the pages
off the cloth/spine in groups of 30 sheets (= 60 pages). The 30 sheets
in each group are still tightly glued together. Insert each group into
the rotary trimmer and cut off the glued part, about 5 mm from the
edge. Now all the pages are cleanly separated and ready to scan. The
amount of dust created in this process is very small.

dickey45 said:
I am trying to decide between:
ScanPartner fi-4120C2
Canon DR-2080C 7862A002
Xerox DocuMate 252
Visioneer Strobe Xp450
HP ScanJet 8250

because they are all sheet fed duplex scanners under $1000 street.

Any suggestions or comments on the scanners or procedures for doing
this type of scanning would be greatly appreciated.

For archival purposes, the optimal resolution would be 600dpi in Black
and White mode, and 400dpi in Grayscale/Color mode. Moreover, DeScreen
is necessary for color/grayscale documents. If DeScreen is not
available from the scanner, the choice is either to use the well-known
method of manual DeScreen (scan at a higher resolution, 600dpi in this
case, blur, downsample to 400dpi, then sharpen) or to use a commercial
DeScreen program. So I may have to scan color/grayscale books at
600dpi for manual DeScreen. Thus I looked for an affordable (less than
$1000) sheet-fed duplex scanner that is reasonably fast at 400/600dpi
in all three modes. This was a very difficult task because
manufacturers usually publish scanning speeds only at lower
resolutions. So I bought a bunch of scanners, tested them, and
returned those that did not meet my criteria. Eventually I tested all
of the above models except the HP ScanJet 8250; instead I tested the
HP ScanJet 5590.
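
The manual DeScreen steps named above (blur, downsample from 600dpi to
400dpi, sharpen) can be sketched in plain Python on a grayscale image
held as a list of rows. This is a toy illustration under assumptions,
not the procedure of any particular program: the box blur, the
bilinear resampling, the unsharp-mask sharpening, and the default
radius and amount are all choices made for the example, and real work
would be done in an image editor or imaging library.

```python
def box_blur(img, radius=1):
    """Average each pixel with its neighbours within `radius`."""
    h, w = len(img), len(img[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            total = count = 0
            for yy in range(max(0, y - radius), min(h, y + radius + 1)):
                for xx in range(max(0, x - radius), min(w, x + radius + 1)):
                    total += img[yy][xx]
                    count += 1
            row.append(total / count)
        out.append(row)
    return out


def resize_bilinear(img, new_w, new_h):
    """Resample to new_w x new_h using bilinear interpolation."""
    h, w = len(img), len(img[0])
    out = []
    for j in range(new_h):
        fy = min(max((j + 0.5) * h / new_h - 0.5, 0.0), h - 1.0)
        y0 = int(fy)
        y1 = min(y0 + 1, h - 1)
        ty = fy - y0
        row = []
        for i in range(new_w):
            fx = min(max((i + 0.5) * w / new_w - 0.5, 0.0), w - 1.0)
            x0 = int(fx)
            x1 = min(x0 + 1, w - 1)
            tx = fx - x0
            top = img[y0][x0] * (1 - tx) + img[y0][x1] * tx
            bottom = img[y1][x0] * (1 - tx) + img[y1][x1] * tx
            row.append(top * (1 - ty) + bottom * ty)
        out.append(row)
    return out


def unsharp(img, amount=0.5, radius=1):
    """Sharpen by adding back the difference from a blurred copy."""
    blurred = box_blur(img, radius)
    return [[min(255.0, max(0.0, p + amount * (p - b)))
             for p, b in zip(row, brow)]
            for row, brow in zip(img, blurred)]


def manual_descreen(img, src_dpi=600, dst_dpi=400,
                    blur_radius=1, sharpen_amount=0.5):
    """Blur, downsample from src_dpi to dst_dpi, then sharpen."""
    h, w = len(img), len(img[0])
    new_w = max(1, round(w * dst_dpi / src_dpi))
    new_h = max(1, round(h * dst_dpi / src_dpi))
    small = resize_bilinear(box_blur(img, blur_radius), new_w, new_h)
    return [[int(round(p)) for p in row]
            for row in unsharp(small, sharpen_amount)]


# A tiny stand-in for a halftone screen: alternating black/white dots.
screen = [[255 if (x + y) % 2 else 0 for x in range(12)] for y in range(12)]
result = manual_descreen(screen)
print(len(result[0]), "x", len(result), "pixels")
```

On the checkerboard the dot pattern collapses to a nearly uniform mid
gray, which is the point of the blur step; the sharpen step then
restores some edge contrast that the blur and resampling took away.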

My conclusion is that the best scanner for all purposes is the XEROX
DocuMate 262. The DocuMate 252 is also good. The DM 262 and DM 252 are
almost identical except for their performance: the DM 262 is much
faster than the DM 252 in Grayscale/Color mode, but costs more. The
TWAIN user interface of these models is crap. Use OneTouch Buttons
instead. Before installing OneTouch Buttons, you MUST remove the
driver of any other VISIONEER scanner; otherwise the two drivers get
tangled up and the result is very difficult to rectify. There is no
caution about this in the User's Guide and no warning appears during
installation! DeScreen does not work at 300dpi or higher in either
model. Bright thin vertical lines sometimes appear randomly in color
images (like an atomic spectrum) in both models, but they are
noticeable only when a background color is present and only when the
image is enlarged. About 2mm at the top of the document is truncated,
and a blank white strip is added to the bottom of the image instead.
This is annoying for color documents with a background color. Well, I
have to compromise on these matters because the fi-4120C2 is nearly a
lemon in Color mode, so there is no other choice!

There is little reason to choose the fi-4120C2 over the DM 252/262.
The fi-4120C2 is good in Black and White mode, but the DM 252/262
performs much better. The fi-4120C2 is faster than the DM 262 in Color
mode, but the image quality is terrible and unsuitable for archiving.
Moreover, it costs more. In Color/Grayscale modes, each image has very
dark parts (as dark as 215 even for a perfectly white background) and
very bright parts (255) that cannot be corrected by Levels Adjustment
or any other tool I have tried. The scanner is designed to read thick
cards too, so the distance between a thin sheet of paper and the light
source fluctuates slightly along the paper path, which produces uneven
brightness in the image. FUJITSU admitted that this is a structural
flaw of the scanner. DeScreen works but is mediocre; manual DeScreen
gives much better results. The ADF is very reliable: I did not have
the slightest problem in scanning 6000+ sheets. The TWAIN interface is
the best among the scanners I tested. The footprint of the fi-4120C2
is much smaller than that of the DM 252/262. Despite these strong
points, I had to give up on the fi-4120C2 for scanning color/grayscale
documents.

The scanners to be avoided under any circumstances are CANON DR-2080C
(too slow at higher resolutions; no output tray; a very old model),
VISIONEER Strobe XP 450 (supports only simplex mode; too slow at
higher resolutions) and HP ScanJet 5590 (terrible ADF; extremely large
and heavy body; too slow at all resolutions; hard-to-use TWAIN user
interface).

Just for reference, the scanning speed (letter-size documents, seconds
per image) at the resolution 200dpi, 300dpi, 400dpi, 600dpi
respectively I measured for each model is the following (in the
descending order of performance), where `na' means either I omitted
measuring the speed for some reason, or the resolution is not
available in the scanner.

XEROX DocuMate 262: 1, 1, 3, 3 (Black&White); 1, 3, 4, 9 (Grayscale);
3, 6, 12, 25 (Color)

FUJITSU fi-4120C2: 3, 4, 9, 13 (Black&White); 3, 4, 9, 14 (Grayscale);
3, 4, 9, 14 (Color) (almost identical speed in all three modes!)

XEROX DocuMate 252: 1, 1, 3, 3 (Black&White); 3, 7, 12, 29
(Grayscale); 4, 8, 14, 31 (Color)

CANON DR-2080C: na, 7, 10, 18 (Black&White); 4, 7, 10, na (Grayscale);
10, 18, 27, na (Color)

VISIONEER Strobe XP 450: na, 5, na, 30 (Black&White); 4, 7, na, na
(Grayscale); 15, 32, na, na (Color)

HP ScanJet 5590: na, 15, 46, 46 (Black&White); 8, 16, 50, na
(Grayscale); 12, 23, 63, na (Color)
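
To put the seconds-per-image figures above in perspective, here is a
small Python sketch that turns the 600dpi Black&White column into
minutes of pure scanning time per book. The 300-page book length and
the function name are assumptions for illustration, one image is taken
to mean one page side, and paper handling time is not included.

```python
# 600dpi Black&White speeds from the list above, in seconds per image.
SECONDS_PER_IMAGE_600DPI_BW = {
    "XEROX DocuMate 262": 3,
    "FUJITSU fi-4120C2": 13,
    "XEROX DocuMate 252": 3,
    "CANON DR-2080C": 18,
    "VISIONEER Strobe XP 450": 30,
    "HP ScanJet 5590": 46,
}


def minutes_per_book(seconds_per_image: float, pages: int) -> float:
    """Pure scanning time for a book, one image per page side."""
    return seconds_per_image * pages / 60


PAGES = 300  # assumed book length, not a figure from the measurements
for name, secs in SECONDS_PER_IMAGE_600DPI_BW.items():
    print(f"{name}: {minutes_per_book(secs, PAGES):.0f} min")
```

With these assumptions a 300-page book takes 15 minutes on the
DocuMate 252/262 and 230 minutes on the HP ScanJet 5590.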


Finally the programs I found useful are:

ThumbsPlus Pro (Cerious Software; excellent image management program;
supports various tools in batch mode)

IrfanView (all-purpose image management program; supports various
tools in batch mode; free)

Adobe Photoshop or Photoshop Elements 3.0 (for batch processing of
levels adjustment and color correction)

DeScreenIt (JetSoft; for DeScreen in batch mode; this program works as
well as manual DeScreen in many cases)

I used to use ScanSoft PaperPort (also included with the DM 252/262)
but quit using it because of its proprietary format (.max). Scanned
images are archived as single-page TIFF Group 4 (Black and White) and
single-page TIFF LZW or PNG (Color/Grayscale) on at least two hard
disks. To put these documents to daily use, they need to be converted
into searchable PDF (text under image). For this purpose I use

Abbyy FineReader (the most accurate OCR program, in particular for
color documents)
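
Keeping the archived page images on at least two hard disks, as
described above, only helps if the copies really match. One possible
way to check that is sketched below in Python; the directory layout,
the file names and the `mismatched_copies` helper are hypothetical, and
SHA-256 is simply one reasonable choice of checksum.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path
from typing import List


def sha256_of(path: Path) -> str:
    """Checksum a file in 1 MiB chunks so large scans fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def mismatched_copies(primary: Path, mirror: Path) -> List[str]:
    """Relative paths that are missing from, or differ in, the mirror."""
    bad = []
    for src in sorted(p for p in primary.rglob("*") if p.is_file()):
        rel = src.relative_to(primary)
        dst = mirror / rel
        if not dst.is_file() or sha256_of(src) != sha256_of(dst):
            bad.append(rel.as_posix())
    return bad


# Demo with two temporary directories standing in for the two disks.
with tempfile.TemporaryDirectory() as tmp:
    disk_a, disk_b = Path(tmp, "disk_a"), Path(tmp, "disk_b")
    (disk_a / "book1").mkdir(parents=True)
    (disk_a / "book1" / "0001.tif").write_bytes(b"page one")
    (disk_a / "book1" / "0002.tif").write_bytes(b"page two")
    shutil.copytree(disk_a, disk_b)
    # Simulate a damaged copy on the second disk.
    (disk_b / "book1" / "0002.tif").write_bytes(b"damaged")
    demo_result = mismatched_copies(disk_a, disk_b)

print(demo_result)
```

The check only reads files, so it can be run against the archive at
any time without touching the originals.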

Needless to say, Adobe Acrobat 6.0 or 7.0 is an indispensable tool for
handling PDF documents. I usually scan the front cover of each book on
a flatbed scanner, convert it to PDF and attach it to the PDF file of
the book. By doing so, the colorful book cover is displayed in the
thumbnail view of ThumbsPlus/"My Bookshelf" of Acrobat 6.0/"Organizer"
of Acrobat 7.0. Creating a database of all the
books/journals/references is a good idea for managing a personal
library. I am using

BibDB (BibTeX database management program; free. BibTeX is a part of
the TeX typesetting system)

Creating an index of the PDF files is also a good idea. There are many
free desktop search engines for this purpose. My choice is "namazu".
 
R

renethx

I am converting my personal library books (1000+) into electronic books. I would
like to share some of my experience. I hope this may be helpful.
I am about to get rid of a bunch of my old books and have used a
bandsaw to cut the bindings. The problem with it is all the dust it
creates and I suspect some scanners (it seemed the fujitsu 4110
snapscan was prone to this) get gummed up and have lots of misfeeds.
My buddy is thinking of making a cutter for books. You can also go to
kinkos and they might cut the bindings.

The tools I use to cut the bindings are a utility knife and a heavy-duty rotary
paper trimmer. The rotary trimmer I have (Carl DC-210) can trim 30 sheets of
paper (or even more) quite easily. (Never use a Guillotine-style trimmer!!)
First cut the whole book pages from the front/back covers and the spine in case
of a hardcover book, or just cut off the front/back covers in case of a
softcover book, using the utility knife. At this point, all the pages are firmly
glued to a piece of cloth in case of a hardcover book, or to the spine itself in
case of a softcover book. Cut off pages from the cloth/spine with every 30
sheets (= 60 pages) in a group by the utility knife. The 30 sheets in each group
are tightly glued together. Insert each group of 30 sheets into the rotary
trimmer and cut off the glued part, about 5 mm from the edge. Now all the pages
are clean separated and ready for scan. The amount of dust created in this
process is very small.
I am trying to decide between:
ScanPartner fi-4120C2
Canon DR-2080C 7862A002
Xerox DocuMate 252
Visioneer Strobe Xp450
HP ScanJet 8250

For archival purpose, the optimal resolution would be 600dpi in Black and White
mode, and 400dpi in Grayscale/Color mode. Moreover DeScreen is necessary for
color/grayscale documents. If DeScreen is not available from the scanner, the
choices is either use the well-known method of manual DeScreen (scan at a high
resolution, 600dpi in this case, blur, downsample to 400dpi, then sharpen) or
use a commercial DeScreen program. So I may have to scan color/grayscale books
at 600dpi for manual DeScreen. Thus I looked for an affordable (less than $1000)
sheet-fed duplex scanner, reasonably fast at 400/600dpi in all three modes. This
was a very difficult task because manufactures usually provide only scanning
speed at lower resolutions. So I bought a bunch of scanners, tested them, and
returned those do not meet my criteria. Eventually I tested all of the above
models except HP ScanJet 8250; instead I tested HP ScanJet 5590.

My conclusion is that the best scanner for all purposes is XEROX DocuMate 262.
DocuMate 252 is also good. DM 262 and DM 252 are almost identical except for
their performance. DM 262 is much faster than DM 252 in Grayscale/Color mode,
but costs more. The TWAIN user interface of these models is crap. Use OneTouch
Buttons instead. Before installing OneTouch Buttons, you MUST remove the driver
of any other VISIONEER scanner; otherwise the two drivers get tangled up and it
becomes very difficult to rectify. There is no caution about this point in the
User's Guide and no warning appears during installation! DeScreen does not work
at 300dpi or higher in either model. Bright thin vertical lines sometimes appear
randomly in color images (like an atomic spectrum) in both models, but are
noticeable only when a background color is present and only when enlarged.
About 2mm of the top portion of the document is truncated and a blank white
space is added to the bottom of the image instead. This is annoying for color
documents with a background color. Well, I have to compromise on these matters
because fi-4120C2 is nearly a lemon in Color mode, so there is no other choice!

There is little reason to choose fi-4120C2 over DM 252/262. fi-4120C2 is good in
Black and White mode, but DM 252/262 shows much better performance. fi-4120C2 is
faster than DM 262 in Color mode, but the image quality is very poor and
unsuitable for archiving. Moreover it costs more. In Color/Grayscale modes,
each image has a very dark part (as dark as 215 even for a perfectly white
background) and a very bright part (255), which cannot be corrected by Levels
Adjustment or any other tool I tried. The scanner is designed to read thick
cards too, which causes small fluctuations in the distance between a thin sheet
of paper and the light source along the paper path, and thus uneven brightness
in the image. FUJITSU admitted that this is a structural flaw of the scanner.
DeScreen works but is mediocre. Manual DeScreen gives much better results. The
ADF is very reliable: I did not have the slightest problem in scanning 6000+
sheets. The TWAIN interface is the best among the scanners I tested. The
footprint of fi-4120C2 is much smaller than that of DM 252/262. Despite these
strong points, I gave up on using fi-4120C2 to scan color/grayscale documents.

The scanners to be avoided under any circumstance are CANON DR-2080C (too slow
at higher resolutions; no output tray; a very old model), VISIONEER Strobe XP
450 (supports only simplex mode; too slow at higher resolutions) and HP ScanJet
5590 (terrible ADF; extremely large and heavy body; too slow at all resolutions;
hard-to-use TWAIN user interface).

Just for reference, here are the scanning speeds I measured for each model
(letter-size documents, seconds per image) at 200dpi, 300dpi, 400dpi and 600dpi
respectively, in descending order of performance. Here `na' means either that I
omitted measuring the speed for some reason, or that the resolution is not
available in the scanner.

XEROX DocuMate 262: 1, 1, 3, 3 (Black&White); 1, 3, 4, 9 (Grayscale); 3, 6, 12,
25 (Color)

FUJITSU fi-4120C2: 3, 4, 9, 13 (Black&White); 3, 4, 9, 14 (Grayscale); 3, 4, 9,
14 (Color) (almost identical speed in all three modes!)

XEROX DocuMate 252: 1, 1, 3, 3 (Black&White); 3, 7, 12, 29 (Grayscale); 4, 8,
14, 31 (Color)

CANON DR-2080C: na, 7, 10, 18 (Black&White); 4, 7, 10, na (Grayscale); 10, 18,
27, na (Color)

VISIONEER Strobe XP 450: na, 5, na, 30 (Black and White); 4, 7, na, na
(Grayscale); 15, 32, na, na (Color)

HP ScanJet 5590: na, 15, 46, 46 (Black and White); 8, 16, 50, na (Grayscale); 12,
23, 63, na (Color)
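
To put the figures above into perspective, they can be turned into rough
scanning times per book. The sketch below copies the seconds-per-image values
of the three main contenders from the list above; the 300-image book is a
hypothetical example, and the estimate assumes the scanner runs without pause,
so time for loading paper, clearing jams or saving files is not counted.

```python
# Seconds per image at 200, 300, 400 and 600 dpi, copied from the list above.
RESOLUTIONS = (200, 300, 400, 600)

SPEEDS = {
    "XEROX DocuMate 262": {
        "bw": (1, 1, 3, 3), "gray": (1, 3, 4, 9), "color": (3, 6, 12, 25),
    },
    "FUJITSU fi-4120C2": {
        "bw": (3, 4, 9, 13), "gray": (3, 4, 9, 14), "color": (3, 4, 9, 14),
    },
    "XEROX DocuMate 252": {
        "bw": (1, 1, 3, 3), "gray": (3, 7, 12, 29), "color": (4, 8, 14, 31),
    },
}


def minutes_per_book(model, mode, dpi, images):
    """Pure scanning time in minutes for `images` page images."""
    seconds_per_image = SPEEDS[model][mode][RESOLUTIONS.index(dpi)]
    return seconds_per_image * images / 60


# Hypothetical book of 300 page images (150 sheets scanned in duplex).
for model in SPEEDS:
    bw = minutes_per_book(model, "bw", 600, 300)
    color = minutes_per_book(model, "color", 400, 300)
    print(f"{model}: {bw:.0f} min B&W at 600dpi, {color:.0f} min Color at 400dpi")
```

For example, DM 262 needs 3 seconds per image in Black and White at 600dpi, so
300 images take 900 seconds, that is 15 minutes.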

Finally the programs I found useful are:

ThumbsPlus Pro (Cerious Software; excellent image management program; supports
various tools in batch mode)

IrfanView (all-purpose image management program; supports various tools in batch
mode; free)

Adobe Photoshop or Photoshop Elements 3.0 (for batch processing of levels
adjustment and color correction)

DeScreenIt (JetSoft; for DeScreen in batch mode; this program works as well as
manual DeScreen in many cases)

I used to use ScanSoft PaperPort (also bundled with DM 252/262) but quit using it
because of its proprietary format (.max). Scanned images are archived as
single-page TIFF Group 4 (Black and White) and single-page TIFF LZW or PNG
(Color/Grayscale) on at least two hard disks. To put these documents to daily
use, they need to be converted into searchable PDF (text under image). For this
purpose I use

ABBYY FineReader (the most accurate OCR program, in particular for color
documents)
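
As a side note on keeping the archive on at least two hard disks: one possible
way to check that the second copy is complete and undamaged is to compare
checksums of every file. The sketch below is only an illustration using the
Python standard library; the function names and the two folder paths in the
usage comment are hypothetical.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """SHA-256 of a file, read in chunks so large TIFFs do not fill memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_archives(primary, mirror):
    """Return relative paths that are missing from, or differ in, the mirror."""
    primary, mirror = Path(primary), Path(mirror)
    problems = []
    for src in sorted(primary.rglob("*")):
        if not src.is_file():
            continue
        rel = src.relative_to(primary)
        dst = mirror / rel
        if not dst.is_file() or file_digest(src) != file_digest(dst):
            problems.append(rel.as_posix())
    return problems


# Usage with hypothetical folder names:
#   for name in compare_archives("D:/scans", "E:/scans"):
#       print("needs attention:", name)
```

This only checks in one direction (every file on the first disk against the
second); files that exist only on the second disk are not reported.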

Needless to say, Adobe Acrobat 6.0 or 7.0 is an indispensable tool for handling
PDF documents. I usually scan the front cover of each book on a flatbed scanner,
convert it to PDF and attach it to the PDF file of the book. By doing so, the
colorful book cover is displayed in the thumbnail view of ThumbsPlus/"My
Bookshelf" of Acrobat 6.0/"Organizer" of Acrobat 7.0. Creating a database of all
the books/journals/references is a good way to manage a personal library. I
am using

BibDB (BibTeX database management program; free. BibTeX is a part of the TeX
typesetting system)

Creating an index of the PDF files is also a good idea. There are many free desktop
search engines for this purpose. My choice is "namazu".
 

Winfried Truemper

renethx said:
The scanners to be avoided under any circumstance are [..] HP ScanJet 5590
(terrible ADF; extremely large and heavy body; too slow at all resolutions;
hard-to-use TWAIN user interface).

I can only second that. We bought an HP ScanJet 5590 in March and are
extremely disappointed, both with the product quality and with HP support.

The ADF does not work for A5 size paper in duplex mode, although the
specification claims it does. Unfortunately many of my documents are around
A5 size. I called HP support three times and provided them with the samples
they requested. The only response I have been getting for two months is "we
need more time to research this". I think I got the message.

The ADF is not reliable enough for unattended operation. Paper jam every
second scan, with no option to resume from inside HP Director.
Sometimes the scanner even clips the start of a page silently.
So we have to manually check each and every document for completeness.

Quite funny is the user interface of the TWAIN driver.
You have to press Cancel to save your document.


Regards
-Winfried
 
