OCR text vs page images

Southeast Asia Visions materials have been encoded in a simple SGML form (a 40-element DTD conforming to the TEI Guidelines). This data includes the document text produced by the OCR (Optical Character Recognition) process. Many users have asked whether they can have access to the plain, uncorrected OCR text. We believe that in most cases people will still want to look at the page images of the books, but we have decided to make the text available so that users can save it, cut and paste from it, and use the "find" feature of their Web browsers to locate a word on a page. We think that this will benefit our users. If you want to view the plain text, there are two ways to do so:

Page by page viewing:

Go to the desired page and choose "view as text" from the "view as" menu in the toolbar at the top. As you move forward or back in the work, you will continue paging through plain text until you choose another "view as" option (such as image or PDF).

Entire books:

You may choose to view an entire book in plain text by selecting the "view as text" option. The file can be saved by selecting the "save" option in the browser's File menu. By default, the file will be saved as HTML, which can be viewed with a Web browser (the text will not be broken up by line or page; it is one large block of text). You can also change the file extension to .txt to save the file as text for viewing with a text editor or word processor (this preserves line and page breaks).

Please be aware that some of these texts are as long as 1,000 pages and will take a long time to download, particularly over a modem. Such a large download may also crash your Web browser.

Viewing and navigating a text

When you begin to view a book, you will also see a separate navigation frame at the top of your browser that looks like this (without the number labels). This is what the various parts mean:

Previous page:
Click on this icon to go to the previous page of the text.

Page #:
Indicates the number of the page you are viewing and the total number of pages in the text.

Next page:
Click on this icon to go to the next page of the text.

View as:
Sets the size of the image you are viewing. If you have a smaller monitor, you might want to choose a low percentage. The percentages are listed in a pull-down menu. The size you choose will stay in effect until you change it or end your session. Other options on this menu include:

  • Text allows you to view the raw OCR text or (if available) the proofed and encoded text.

Go to page #:
Jumps to a desired page that you enter in the box. Especially handy for moving from a table of contents to a section of a book. "Go to page #" is a button and must be clicked on to jump to the desired page.

Go to:
Jumps to special-purpose pages such as title pages, tables of contents, and lists of illustrations. The special pages are listed in a pull-down menu. Not all texts will contain the same choices.

Printing a text

Go to File > Print in your browser's menu.

  • Texts in Southeast Asia Visions can only be printed page by page.
  • If you print directly from your browser, texts will print at the size of the image you are viewing (100%, 75%, etc.).
  • You will need to weigh clarity against fitting the page on a standard sheet of paper when you decide what size image to print. 25% may be unreadable. 100% may not fit on a standard printer's paper.

Note on diacritics: A Unicode version is planned for summer 2006. Until then, some entries may not indicate proper diacritics. For example, arrivée will appear as arrivee. We are aware that this seems like a misspelling and hope to fix this with the Unicode version.
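If you work with saved text on your own computer, the following is a minimal sketch of how accented words could be reduced to the unaccented form described above, so that your own search terms match the text. It is not part of the Southeast Asia Visions site; the function name fold_diacritics is hypothetical, and the approach (Unicode decomposition followed by removal of combining marks) is an assumption about what is useful, not a description of how the collection was processed.

```python
import unicodedata


def fold_diacritics(text: str) -> str:
    """Return text with combining accent marks removed."""
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks and keep every other character unchanged.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Hypothetical usage: fold a search term before looking for it in saved text.
print(fold_diacritics("arrivée"))  # prints "arrivee"
```

Letters that have no decomposed form, such as the Vietnamese đ, pass through this function unchanged and would need separate handling.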

