A few scanning tips

www.scantips.com

Got HTML?

ScanSoft TextBridge Pro OCR to HTML

TextBridge Pro can scan documents and create HTML pages, ready to go. If you use a Page Type with images and scan from the TextBridge user interface, then there are two scanner passes. The "Magazine, Color" mode uses 300 dpi for the text and 100 dpi for the color images, and the XIF file can store all of this (next section).
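To put rough numbers on those two resolutions, here is a small Python sketch. It is only arithmetic, and it rests on assumptions that TextBridge does not state: a US Letter page (8.5 x 11 inches), a 1-bit line art pass for the text, and a 24-bit pass for the color images. The function name scan_size is made up for this illustration.

```python
# Rough uncompressed sizes for a text pass and a color pass.
# Assumptions (not from TextBridge): US Letter page, 1-bit text, 24-bit color.

def scan_size(width_in, height_in, dpi, bits_per_pixel):
    """Return (width_px, height_px, uncompressed_bytes) for one scan."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    total_bits = width_px * height_px * bits_per_pixel
    return width_px, height_px, total_bits // 8

text_pass = scan_size(8.5, 11, 300, 1)     # 300 dpi, assumed 1-bit line art
color_pass = scan_size(8.5, 11, 100, 24)   # 100 dpi, assumed 24-bit color

print("text pass: ", text_pass)    # (2550, 3300, 1051875)
print("color pass:", color_pass)   # (850, 1100, 2805000)
```

Under these assumptions the 300 dpi pass has nine times as many pixels as the 100 dpi pass, yet the color data is still the larger of the two before compression, which is one reason a lower resolution for the color images is a sensible default.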

You can configure TextBridge Pro to use your scanner's native TWAIN driver so you can make any settings you prefer, including descreen for color or threshold for Line art. However, then only one scanner pass is made, and the color image is also used for OCR (which comes out quite well).

Instead of scanning and posting pages from real magazines (of which I am not the copyright owner), I used one of the Pagis Pro sample image files here. I get the same good results scanning real magazines in TextBridge Pro in "Magazine, Color" page type mode (although moire patterns can be the usual problem). The HP 6200/6250 scanners do automatic descreening and are ideal for such document modes, a good solution for the moire problem.

The file below, named sample5.xif, is stored as a 3 page XIF file. I merely used "drag and drop" to drop it on the TextBridge icon on the Pagis SendTo bar (immediately below it).

NOTE: You can add or remove any of your programs (that support Drag&Drop) to the Pagis Pro SendTo bar with the Registration Wizard at menu PAGIS.

NOTE: These screen shot images are now 32-color GIF files for clarity and small size on this web page, approx 35K each. Your own copy of Pagis Pro will have better colors.

Above is Pagis Pro 3.0. Below is TextBridge Pro, which opened after the drop.

I set Page Type "Magazine, Color", and nothing else. Below, in the upper right corner, the tabs Image and Text allow an immediate switch between two screens: Image is the original scan next below, and Text is the OCR results in the view following that.

I did not click "Find Zones"; zoning was automatic this time, and the default zones are shown below. The yellow zones will be recognized as text characters, the cyan zones will be retained as images, and tables are marked with a sort of gold-brown color. If you want a zone recognized another way, you can change it, and that is sometimes necessary. I did not need to here; the defaults were OK.

Below is the TextBridge Help on the zoning tools.


Zoning tools

Use the Zoning tools to view, zoom, rotate, and zone text, tables and pictures on a page before recognizing it. The Zoning tools appear after TextBridge gets a page in Manual mode.

 

Click on a tool to:

Find text, tables, and picture zones automatically.
Select a zone to modify or delete.
Mark text zones on the page.
Mark table zones on the page.
Mark picture zones on the page.
Erase zone marking.
Undo the last zoning action.
Put zones in the order you want in your final document, if you are not retaining page layout.
Rotate the page 90 degrees to the right.
Zoom in to enlarge the page for viewing.
Zoom out to reduce the page for viewing.

See also

Zoning a page
Using zoning tools
Marking a zone
Deleting a zone
Ordering zones for output
Zoning all your pages the same
Zoning odd shaped areas on your page

 

Zoning is a big factor in OCR success, but I didn't do anything here. It is not as if I had nothing to do with this effort, though: I did click the RECOGNIZE button (2) to cause the recognized text to appear (below).

Note the proofing tool below. There were no errors in this OCR text, but TextBridge did not find the word "front/18-inch" in its language dictionary. The confidence level on this word was therefore low, so we were asked for our opinion (shown). The line of text immediately above the car photo is an enlargement of the actual scanned image at the spot where a possible error is suspected, or rather, where TextBridge's confidence was not the greatest (you can set different confidence levels). This actual image lets a human easily recognize what the text shown by the pixels ought to be (for example, rn read as m is a common real error). These "suspect" words are not necessarily incorrect, and most are not, but you can review them and correct them easily if necessary.

The highlighted word just above the image line is the word as recognized. The arrow at the right of that field shows similar alternative words from the language dictionary, or you can type the correct word. You can change it or accept it as is. You can click the Book+ button to add the unknown word to your dictionary. Clicking Accept goes to the next suspect word.
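The general idea of a dictionary check with suggested alternatives can be sketched in a few lines of Python. This is only an illustration and not how TextBridge actually works: the word list is a tiny hypothetical stand-in for a language dictionary, review_word is a made-up name, and the similarity matching comes from Python's standard difflib module rather than from any OCR engine.

```python
import difflib

# Hypothetical word list standing in for a real language dictionary.
DICTIONARY = ["class", "front", "inch", "modern", "corner", "wheel"]

def review_word(word, dictionary=DICTIONARY, max_alternatives=3):
    """Return (is_suspect, alternatives) for one recognized word.

    A word is treated as suspect simply because it is not in the
    dictionary; alternatives are dictionary words that look similar.
    """
    lowered = word.lower()
    if lowered in dictionary:
        return False, []
    alternatives = difflib.get_close_matches(
        lowered, dictionary, n=max_alternatives, cutoff=0.6
    )
    return True, alternatives

# "comer" is what "corner" becomes if rn is misread as m.
for recognized in ["class", "front/18-inch", "comer"]:
    print(recognized, review_word(recognized))
```

In this sketch "front/18-inch" is flagged only because it is missing from the word list, which matches the situation above: a suspect word is a question for the human reviewer, not necessarily an error.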

TextBridge also shows the suspect word location highlighted in yellow in the text image (below the car, not visible here). And if you put the cursor on any word, the enlargement of the original scan appears there too, as shown for the word "class". Seeing the original scan in this way can give you a strong clue about whether you need to decrease or increase the "brightness" setting to get better OCR results.

I clicked RECOGNIZE and TextBridge Pro spun through the 3 pages (just a second or so on each) doing the OCR to create text from the image.

Then I clicked SAVE AS, and got the Save As box below. Retain Pictures and Retain Page Layout are the default, and I left it that way. Open File When Done means it starts your browser so you can view the HTML results.


Let's see, I think I have done 4 actions so far.

  1. Dragged and dropped the image on TextBridge Pro on the Pagis Pro SendTo bar.
  2. Set Magazine, Color mode.
  3. Clicked the Recognize menu button.
  4. Clicked the Save As menu button, and selected HTML.

Not so difficult. My browser just came up, and I am seeing the result now.

The generated HTML is rather sophisticated, with StyleSheets and, in the case of WYSIWYG, JavaScript too. It makes a subdirectory for the images and for the pages after the first one in the document, which you can transfer directly with ws_ftp. In some cases, the JPG files have upper case characters in the file name and in the HTML, so you must preserve the case exactly when you upload, because many web servers treat file names as case sensitive. I have to think some degree of HTML and web server experience will be necessary in the real world, but all I did was the above actions, and Voilà!
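If you want to check the letter case before uploading, a short Python script can compare the image names referenced in the HTML against the names actually on disk. This is a minimal sketch under stated assumptions: the function name find_case_mismatches is made up, it uses a simple regular expression rather than a full HTML parser, it only looks at img tags with relative paths, and it compares the file name only (not the folder name).

```python
import re
from pathlib import Path

# Matches src="..." or src='...' inside <img> tags (simple pattern, not a parser).
IMG_SRC = re.compile(r'<img[^>]*\bsrc\s*=\s*["\']([^"\']+)["\']', re.IGNORECASE)

def find_case_mismatches(html_path):
    """Return (reference, actual_name) pairs where only the letter case differs."""
    html_path = Path(html_path)
    text = html_path.read_text(encoding="utf-8", errors="replace")
    mismatches = []
    for ref in IMG_SRC.findall(text):
        target = html_path.parent / ref
        folder = target.parent
        if not folder.is_dir():
            continue  # skip absolute URLs or folders that are not present
        names = [p.name for p in folder.iterdir()]
        if target.name in names:
            continue  # exact match, fine on a case-sensitive server
        for name in names:
            if name.lower() == target.name.lower():
                mismatches.append((ref, name))
    return mismatches

if __name__ == "__main__":
    import sys
    for page in sys.argv[1:]:
        for ref, actual in find_case_mismatches(page):
            print(f"{page}: HTML says {ref!r} but the file is named {actual!r}")
```

Run it with the saved HTML file names as arguments. An empty result means every referenced image was found with exactly matching case, at least as far as this simple pattern can see.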

Now, I quote from the TextBridge Pro Help file:

TextBridge lets you save your recognized pages in either HTML or HTML WYSIWYG formats. The difference is that if you save to standard HTML, you can easily edit the document later. If you save to HTML WYSIWYG, a nearly perfect replica is produced but editing it can be more difficult.

HTML format (with Retain Page Layout off)

Output is saved in galley style. Text and table zones are output one after another in zone order. Pictures are output after the text and tables for each page. Page breaks, margins, columns, centering, and indentation are not retained.

This format works with most browsers.

HTML format (with Retain Page Layout on)

Output retains the approximate layout of the original page, including multi-column text. Pictures are output after the text and tables for each page. Columns and centering are retained. Page breaks, margins, and indentation are not retained. This format works with most browsers.


WYSIWYG

Retains the exact layout of the page, including multi-column text and the placement of pictures. Page breaks, margins, columns, centering, and indentation are retained. This format is intended to reproduce the original page for viewing. Editing the output is difficult. This format is designed for Windows Internet Explorer 4.0, Internet Explorer 5.0 or later, and Netscape Navigator 5.0 or later browsers. Other browsers may not preserve the original page layout. Output in this format retains the original page layout, even if you have not checked the Retain Layout option.


HTML

HTML is an output format of TextBridge. HTML format allows you to edit the document, but it may not produce a perfect reproduction of the original page. If you want an absolute reproduction of the original page, use WYSIWYG HTML. This produces an almost exact replica, but you cannot edit the file.


