A few scanning tips

www.scantips.com


Scanning Line art

The scanner makes a good copy machine when used with a laser or inkjet printer. You can easily crop the image in the preview to remove the black borders that real copy machines often cause. You can crop out the advertisement in the adjacent column. You can remove the black spots where there were staple holes. You can enlarge or reduce the image, lighten or darken it, etc. In particular, you can remove dark backgrounds that prevent readable copies. Some very pretty copies can be made. Or, you can just lay it down and scan and print, just like a copy machine.

For copying text to the printer, or for OCR, use 300 dpi and Line art mode. Line art mode is 1-bit 2-color (B&W) like ClipArt or fax. Since there is no Gray, the printer can use its full resolution and CAN print 300 dpi. If you have a 600 dpi printer, then you can scan at 600 dpi. Since Line art mode is only 1/24 the memory size of 24-bit color, we can handle the large image without much pain. A Line art 8.5 x 11 inch page at 300 dpi uses "only" one megabyte of memory, and 600 dpi would be 4 megabytes (however, 600 dpi color would be 100 megabytes!). Frankly, 300 dpi scans of most text documents look fine on a 600 dpi laser printer, and more scan resolution is not often very useful (but the highest quality requirements should print 600 dpi line art).
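The memory figures above can be checked with a little arithmetic. This is only a rough sketch: the function name is made up for illustration, it assumes uncompressed storage at 1 bit per pixel for Line art and 24 bits per pixel for color, and it ignores any row padding a real file format might add.

```python
def scan_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Uncompressed size in bytes of a scan of the given area (hypothetical helper)."""
    pixels_wide = round(width_in * dpi)
    pixels_high = round(height_in * dpi)
    total_bits = pixels_wide * pixels_high * bits_per_pixel
    return total_bits // 8  # 8 bits per byte; row padding is ignored

LINE_ART_BITS = 1   # 1-bit, 2-color
COLOR_BITS = 24     # assumed 24-bit RGB color

for dpi in (300, 600):
    for name, bits in (("Line art", LINE_ART_BITS), ("color", COLOR_BITS)):
        size = scan_bytes(8.5, 11, dpi, bits)
        print(f"{dpi} dpi {name}: {size:,} bytes")
```

For an 8.5 x 11 inch page this gives roughly 1 and 4 million bytes for Line art at 300 and 600 dpi, and roughly 101 million bytes for 600 dpi color, in line with the round numbers quoted above.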

So note that this is suddenly a new tune we're singing. For Color or Grayscale scans, you have little (if any) reason to ever approach 300 dpi. For Line art scans however, including OCR, you're probably always at 300 dpi. I've made the assumption that we are printing these images. For display on a video monitor, 300 dpi is probably too large to be useful.

OCR has the same requirements as any ordinary copy work. We can discuss them together because either way, the goal is to obtain a scan with sharp clear dark text. I'll mention OCR first, but we are really speaking about acquiring any Line art image in general, including for printing copies of documents.

When scanning clean laser-printed documents, it doesn't much matter what you do, it'll be great. Magazine type is not nearly so clean, and newspapers are much worse, and reducing Threshold into the 80-100 range can often help greatly. However, depending on your OCR software, you may or may not be able to access your TWAIN driver's Threshold settings.

My frank opinion is that the free OCR software that comes with most scanners is not very useful; it is too limited. You will enjoy better software if you use OCR very much.

Better OCR programs (full versions) can also capture embedded images and can maintain text columns. But any OCR program will make more errors than you would like. Good programs include good error proofing tools that show you the highlighted word in the original image for comparison, which is a great help to know that the "rn" is really an "m". The biggest common problems for OCR are colored backgrounds and smudgy print. Having control over the scanner line art Threshold setting will help both problems substantially.

Most OCR software will want to scan at 300 dpi in line art mode, and line art is faster too. This works best on better quality text documents, and the Threshold control is fundamental in obtaining good quality text scans. However, as an exception, the recent OmniPage Pro versions will actually scan many pages better in grayscale mode. Grayscale is meant to retain photo images of course, but its grayscale mode seems to be able to set an automatic threshold which does particularly well on degraded text quality and shaded backgrounds. I usually use OmniPage Pro grayscale mode now by default. Do experiment.

If your OCR software does not allow adequate control of Threshold (sometimes called Brightness), for any problem cases (like black text on a dark background) you can always use conventional image programs to scan the image separately as 300 dpi Line art, and use the TWAIN driver Threshold control to make the images perfect. Then you can use the Clipboard or TIF files to transfer the image to the OCR software for processing. All of the following material is very pertinent and useful if using the TWAIN driver, and at the least, will show the importance of the Threshold control.
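To show what a Threshold control does, here is a minimal sketch in plain Python. It rests on assumptions, not on any particular TWAIN driver: a 0-255 grayscale scale (0 = black, 255 = white), a default Threshold of 128, and the rule that a pixel turns black only if it is darker than the Threshold. Real drivers may number or scale the control differently. The function name and sample values are invented for illustration.

```python
def to_line_art(gray_rows, threshold=128):
    """Convert rows of 8-bit gray values to 1-bit rows (1 = black, 0 = white).

    Assumed rule: a pixel becomes black only if it is darker than the threshold.
    """
    return [[1 if value < threshold else 0 for value in row] for row in gray_rows]

# Hypothetical sample: dark text (30) on a gray newsprint-like background (110),
# with one clean white pixel (240).
sample = [
    [110, 30, 110, 240],
    [110, 30, 30, 110],
]

default_result = to_line_art(sample, threshold=128)  # background also turns black
reduced_result = to_line_art(sample, threshold=90)   # only the text stays black

for row in reduced_result:
    print("".join("#" if bit else "." for bit in row))
```

With the assumed default of 128, the gray background (110) falls below the Threshold and turns black along with the text, which is the "black text on a dark background" problem. Reducing the Threshold to 90 leaves only the truly dark pixels black.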

Generally 300 dpi will be best for scanning typical text sizes for OCR. The two programs above both want and use 300 dpi. Excessive resolution can be counterproductive for OCR unless the characters are tiny, because the OCR software is not expecting to match character bitmaps that large. A printer's Point is defined to be 1/72 inch (a 9 or 10 pt character font fits into a 12 pt line spacing, called leading, although fonts vary in actual size). The space for 12 point leading or line spacing should be 12/72 = 1/6 = 0.167 inches tall. At 300 dpi, that's 50 pixels for the height of the line.
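The point-to-pixel arithmetic is easy to repeat for other type sizes. A small sketch follows; the function name is hypothetical, and it uses the 1/72 inch Point defined above.

```python
POINTS_PER_INCH = 72  # a printer's Point is 1/72 inch

def line_height_pixels(leading_pt, dpi):
    """Height in pixels of one line of text with the given leading (line spacing)."""
    return leading_pt * dpi / POINTS_PER_INCH

for dpi in (300, 600):
    print(f"12 pt leading at {dpi} dpi: {line_height_pixels(12, dpi):.0f} pixels")
```

So 12 point leading is 50 pixels tall at 300 dpi, and doubling the resolution doubles the bitmap height.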

This 300 dpi sample from PC Magazine was printed at about 6 lines to the inch. That's not large type, but 300 dpi gives a big bitmap, it's more than enough detail to define a really good A or B, and more pixels than this are not going to help. The scanner has done its job, and it's up to the OCR software now to decode those pixels. If your scanner produces Line art images like this, and the OCR still doesn't work well, then look to your OCR software, not the scanner.

300 dpi is typically the size range that the OCR software is expecting to see; the software is designed for 300 dpi. If we doubled resolution to 600 dpi for OCR, we'd have a bitmap 100 pixels tall for 12 point type. That's huge, almost wallpaper <grin>, and the software may not recognize huge character bitmaps as easily as normal size characters. If you are scanning tiny characters, smaller than 8 points (say 10 or more lines per inch), then do use more resolution to make the characters be normal size, perhaps up to 400 dpi. Interpolated resolution is just fine for Line art, that is really its only purpose anyway. The idea is for size rather than for detail (your original document probably doesn't have that much detail anyway). But if scanning unusually large characters, then less, not more, resolution may be desirable.
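One way to read this advice is to pick the resolution that brings the line height back to about what 12 point leading gives at 300 dpi. The sketch below is only that reading, not a rule from any OCR package: the function name, the 50 pixel target, and the hard 400 dpi cap are assumptions chosen to match the numbers in the text.

```python
POINTS_PER_INCH = 72

def suggest_dpi(leading_pt, target_line_px=50, max_dpi=400):
    """Suggest a scan resolution so one text line is about target_line_px tall.

    target_line_px=50 is what 12 pt leading gives at 300 dpi (assumed target).
    max_dpi=400 follows the "perhaps up to 400 dpi" advice (assumed cap).
    """
    dpi = target_line_px * POINTS_PER_INCH / leading_pt
    return min(round(dpi), max_dpi)

print(suggest_dpi(12))   # ordinary text at 6 lines per inch
print(suggest_dpi(7.2))  # tiny text at 10 lines per inch, capped
print(suggest_dpi(24))   # unusually large text gets less resolution
```

Ordinary 12 point leading comes out at 300 dpi, tiny text hits the 400 dpi cap, and unusually large text is steered toward lower resolution. Treat the result as a starting point and do experiment.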


Copyright © 1997-2010 by Wayne Fulton - All rights are reserved.
