
Post-Processing FAQ

(Version 3.7; last updated July 22, 2004)

Getting Started

What is Post-Processing?
Tools
Selecting a Book
How Long Do Etexts Take?
Walkthrough

Software

The Basics

Gutcheck
Text Editors
Image Viewers
Spell Checkers
HTML Validators

Dedicated Proofreading Tools

Guiguts
The Gut* Foursome
Other Tools

Advanced Proofreading Questions

Special Formatting Issues

Footnotes
Illustrations
Poetry
Tables
Sidenotes
Headings and Subheadings
Indexes
Errata Pages

Symbols and Scripts

Non-ASCII Characters
Non-Latin Scripts and Unusual Symbols

Extra Checks

Paranoid Proofreading Checks (Stealth Scannos, etc.)

Technical Questions

Non-ASCII Formats
Missing or Problem Images
Returning a Project
Projects with Multiple Parts
Post-Processing Verification

Other Questions, and Suggestions for the FAQ


This FAQ is designed as a guide for post-processors, especially new ones. I hope that it makes the task of post-processing a little less frightening!



Getting Started


This section is designed especially for the first-time post-processor. It provides a walkthrough of all of the essential steps needed to proofread a relatively simple book. For more difficult books, please see the Advanced Proofreading Questions section for advice.


What is post-processing, and who can post-process?

The purpose of post-processing is to massage an etext into a readable form. In its journey through two proofreading rounds, the text will have been edited by perhaps hundreds of proofreaders. The post-processor must standardize the formatting of the book and adjust it to comply with Project Gutenberg's requirements. The post-processor must also correct as many errors as possible that have survived both proofreading rounds. The ultimate goal of post-processing is to create a plain-text etext with consistent formatting throughout, which contains as few errors as possible, and which accurately reflects the intentions of the author.

Post-processors require more experience than ordinary proofreaders. Because they are preparing the text for uploading to Project Gutenberg, they are the final editors of the text. Because of this, post-processing is only available for proofreaders who have completed at least 400 pages in the first and/or second proofreading rounds. Also, post-processors should be very familiar with the Proofreading Guidelines before attempting to post-process.


What software will I need to get started?

The requirements for software are minimal:

There are other useful programs available which are not essential, but which can be extremely useful and might save you a lot of time. Take a look and see if any of them would be of use to you.


All right. I have everything I need to get started except a book. Which one should I choose?

For your first project, it's a good idea to pick a fiction book with a relatively small number of pages. Here's why:

  • fiction usually has fewer words per page than non-fiction and a more streamlined format, which means that it scans clearer and is unlikely to be riddled with OCR errors and inconsistent formatting;
  • fiction generally lacks footnotes, tables, and other items which could be difficult for a new post-processor to deal with; and
  • a low page count makes the work go faster and is easier to handle.

Take a look at the Post-Processing section of your Personal Page to see all of the works available for post-processing. Projects which are marked "TRAINEE" have been set aside for new post-processors, and are ideal texts with which to learn the ropes of post-processing. Books labelled "BEGINNERS" or "MENTORS" were originally set aside for new proofreaders. These books are also relatively easy, and would make good first projects. Unlike "TRAINEE" texts, however, these books may be checked out by both new and experienced post-processors. Alternatively, pick a text with a title that is likely to be fiction. "The Boy Scout Camera Club" is a good bet; "Copyright Renewals" is not. These texts are often marked "EASY" and are available to all post-processors.

Download the text you have chosen by going to its Book Options dropdown listbox and selecting "Download Zipped Text". Do not select "Download Zipped TEI Text" (a different encoding) unless you know what TEI is and want to play around with it. The plain text is the version that you need.

Scroll through the whole text to see if there are any difficulties, like footnotes, poetry, foreign languages, dialects, and tables in it. This way, you will know what you will be dealing with before you commit to the project. If you see any of these items, you would be wise to choose a different project. But, if you think you can handle it, give it a try!

Check the Project Questions and Comments Forum for your book's title to see what proofreaders have been saying about it. Again, this might alert you to issues which could make the work more difficult than you had realized.

When you have found a work that you want to do, check it out by going to its Book Options dropdown listbox and selecting Check Out Book. The book is now yours! (Make absolutely sure that the book appears as checked out to you, or you might end up working for hours on your text only to find that someone else has checked it out and submitted it!)


How long will it take me to post-process my book?

It is impossible to answer this question in advance. The time that a book will take to complete depends essentially on three factors:

  • the difficulty and length of the work itself;
  • the tools being used; and
  • the amount of experience the post-processor has.

Some proofreaders can finish an easy book in only an hour or two (or even less, for the especially speedy). However, most proofreaders require longer than this to do a good job. Some especially difficult works can take weeks (or more!) to complete.

Try not to feel discouraged if you take more than a day to complete an "easy" book. Concentrate on learning the process of post-processing, familiarizing yourself with any tools you might be using, and doing a quality job, rather than on working quickly. You will speed up naturally with practice.


Great! Now what am I supposed to do?

There is no one way to post-process books. Every post-processor has a different technique and uses different tools to do the job. The technique described below is the one the author personally uses, but there are many ways of achieving the same end result. In time, you will probably develop your own technique.

This walkthrough describes a very hands-on way of post-processing, which is why it is recommended for first-time post-processors. It takes longer to complete the book this way than if macros, global find-and-replace, and other tools are used, but it gives the first-timer a feel for what they should look out for by making the proofreader scroll through the book several times. Also, the walkthrough below is suitable for use on all operating systems, regardless of what text editor is used or what tools are available for that operating system. Once you get the hang of what post-processing is all about, please experiment with the software available below, especially the dedicated post-processing tools. You do not have to use any of them, but many post-processors find them indispensable!

It is strongly advised that you read through the following walkthrough before starting to post-process. This will give you an idea of what you will be doing in advance. You might also like to look for helpful software first if you are so inclined.


The Walkthrough

  1. Choose a book and sign it out. For guidance, see Selecting a Book, above. Make sure that the book appears in the Post-Processing list as checked out to you. Otherwise, someone else might check it out later and duplicate all of your hard work!
  2. Look for comments and questions proofreaders had while proofreading your book in the Project Questions and Comments Forum.
  3. Download the book's images (optional) and text. Make sure to select "Download Zipped Text" and not "Download Zipped TEI Text" unless you know how to work with TEI text encoding. You may wish to make a local backup of the text file, for ease of reference later in the process.
  4. If you are using a Mac, change the character set to Mac. In BBEdit Lite, use MIDex to do this by clicking on the ISO->Mac button. You will need to convert the book back to ISO when it is complete.
  5. If the document is not displayed in a monospaced font (that is, a font where each letter, space, and punctuation mark take up exactly the same amount of space), make this adjustment now. "Monaco" or any font with "Mono" in its name will be monospaced. Please make sure that the project is always saved as text (.txt). This is a Project Gutenberg requirement, and is easily done in all word processing and text editing programs.
  6. Format the first page(s) to your liking. These pages contain the title of the book, the author's name, and occasionally information on a translator, etc. Retain the publication date and publisher name, but delete words or phrases such as "copyright," "all rights reserved," etc. If this section is in ALL CAPS, you can change it to Capitalized Text if you wish.
  7. Scan the entire book for problems and formatting issues. You will almost always run across things which obviously just aren't right, which can be corrected at this time. There are many other things that you can look for during this first pass. For instance, you may notice deviations from the proofreading guidelines, which should be corrected. Here are a few examples of the sorts of things to watch for:
    • poetry, tables, and other areas which must not be re-wrapped (i.e. where line breaks are important). These should be surrounded by /* */, but be wary, as some might not be!;
    • punctuation in odd places (usually from specks on the scanned image) and dubious spellings;
    • lines of asterisks: if you find one, it may be worth looking through the images to make sure that none were missed. Also, make sure that they are correctly written (7 spaces, then 5 asterisks separated by 7 spaces);
    • extra line breaks within paragraphs that clearly shouldn't be there, or lack of space between paragraphs;
    • bad spacing: double spaces, indented paragraphs, wordsthatruntogether;
    • words in ALL CAPS at the beginning of paragraphs;
    • words that are italicized everywhere, like ships' names (watch for non-italicized occurrences);
    • words containing an accent (watch for the loss of the accent);
    • hyphenation: em-dashes should be marked --, and hyphens -, with no spaces on either side. If it's clearly wrong, fix it;
    • missing pages: check the image numbers to make sure that there are none missing, and read the last few words on each page and the first few words on the next page to make sure that they go together; and
    • pages which look like they haven't been proofread at all; you'll have to look carefully for errors in these areas.

    Some pages will begin with a full sentence. Since you won't know if these sentences are the beginning of a new paragraph or a continuation of the previous one, either mark these with a * and check them later, or check them against the book's images as you come across them and format accordingly.

    If you wish, you may remove the page markers on the way through. Most people remove them at this stage, though some leave them in to make checking the text against the images easier. If you choose to remove them, make an unedited copy of the file to give you a reference with the page numbers still included. You can always unzip or download a fresh file if something happens to your backup. Guiguts will remove the page markers while still remembering the page's identity.

  8. Run a spell check. Mark any words that are even remotely questionable with a *, or check each questionable word against the book's images as you go and correct when necessary. Proofreaders often miss some misspellings, such as ltttle for little. Books of PG vintage, especially books of poetry, may also contain archaic and unusual spellings and words. DO NOT correct these, but only obvious typos or OCR errors. If you do not have a spell checker, you can pick one up here.
  9. Search for -. If you run across something bizarre, mark it with a * or check it against the images. The purpose of this step is essentially to make sure that there are no em-dashes in the middle of words, no single dashes that should be em-dashes, and that all end-of-line hyphenation is fixed.

    Some proofreaders like to replace all hyphens with unbreaking hyphens in order to minimize the work in browsing for hyphenated words split between lines at the end. In Microsoft Word, the unbreaking hyphen character can be inserted by doing a global find-and-replace of - into ^~. If you choose to do this, you must change all unbreaking hyphens back to regular hyphens at the end, as they are non-ASCII and do not display properly on all machines.

    There are many other "paranoid checks" that proofreaders perform at this stage. These checks are particularly useful when the book scanned badly or the font was unusual, since these conditions introduce a lot of errors that are difficult for proofreaders to spot. If this is your first book, you don't need to worry about these "bonus" checks, but once you learn the basics of proofreading, you will want to be sure that you do a thorough job and catch as many mistakes as you can, and these extra checks will help you produce high quality etexts.

    Guiguts makes performing paranoid checks a lot easier, as it incorporates Gutcheck into its interface. GutAxe makes these checks as well.

  10. Search for *. In addition to the asterisks that you may have inserted yourself to mark potential problems, proofreaders may have used them to bring problems to your attention. If you have left in the page markers, use them to guide you to the image containing the text that you are checking. Resolve all the problems and remove the asterisks before uploading for PPV. If you can't resolve a problem on your own, contact the project manager or ask your question in the post-processing forum.
  11. Format chapter headings and any subheadings that you may find in the text. It is recommended that there be three blank lines between chapters, two after the chapter title, and four between larger sections (e.g. Parts 1 and 2), but the exact numbers are not very important. As long as you are consistent within your work, use whatever spacing seems appropriate and looks good to you.

    Scan the book and make sure that all of the chapter headings and subheadings are there! Proofreaders sometimes remove these accidentally. If there is a table of contents, use it as a checklist; otherwise, refer to the images. If the chapters are simply numbered Chapter 1, 2, etc., just make sure that there are no numbers missing.

  12. Remove page markers, either manually or with a tool or search. Remove [Blank Page] tags as well.
  13. Search for space-hyphen and hyphen-space and replace each instance with hyphen. The exception here is hyphens replacing a word, like a person's name; in that case, leave spaces before and after the hyphens to indicate this. There may be other exceptions like this as well, but you'll be able to identify them from their context.
  14. Search for double space and replace with single space. DO NOT do a global find and replace if your text contained poetry, charts, or lines of asterisks, all of which legitimately contain multiple spaces. GutSweeper will do this for you automatically.
  15. Remove end-of-line spaces by searching for spaces followed by returns and replacing them with returns only. Again, don't do a global replace. RewrapIndent has a feature which will remove end-of-line spaces automatically.
  16. Find and replace all instances of <i> and </i> with _. Make sure that the same number of <i> were replaced as </i>. If the two counts differ, some of the tags may not have been correctly typed, and you'll have to track them down. Search for <b> and </b> and replace them with *, +, or any other character which is not used in your text (suggestions from other post-processors are ~, %, #, =, and converting words in bold to ALL CAPS). Try searching for >, <, and / to find mistyped tags of both kinds more quickly. You may wish to preface your text with a transcriber's note explaining which symbol you used for italics and which for bold, but this is not absolutely necessary.
  17. Time to rewrap. Did you see any poetry, tables, etc.? If not, rewrapping the lines should be easy. You will need to rewrap the lines to between 65 and 75 characters in length. Each program has a different way of doing this, and you will have to find the way that works best for you. In BBEdit Lite, select Hard Wrap from the Text menu. For MS Word, save as Text with Line Breaks. Many tools will rewrap for you, including Guiguts, GutHammer, and RewrapIndent. If worst comes to worst and you cannot find an easy way to rewrap the lines, find and replace all line breaks with spaces, count along any line to see approximately where 65-75 characters falls, and insert line breaks manually at this point. It's painful, but it works. Be grateful that you chose a book with a low page count! However, this extreme step should not be necessary.

    If you have areas that you must not wrap, you must be more careful. In BBEdit, it is possible to rewrap a section by highlighting it and selecting Hard Wrap. This allows you to rewrap the text in blocks between tables or poems. Other programs may have a similar feature. RewrapIndent, Guiguts, and Guthammer feature selective rewrapping, though the user needs to learn a small amount of special markup.

    Make sure to remove the /* */ markup around poetry and tables after the rewrap is complete!

    Rewrapping sometimes reveals spacing errors. Repeat steps 13 and 15 to catch any new problems that may have been introduced by the rewrap.

  18. Run Gutcheck. Check every potential problem that it brings to your attention. Not all Gutcheck "flags" are genuine errors (for example, it may report short lines where the text contains poetry or a table), but each must be looked into and corrected if necessary. Continue to run Gutcheck after each series of corrections until it doesn't flag any more "true" errors.
  19. Give the whole thing a quick eyeball to make sure that all is well. If you are not sure what the finished text should look like, download a text from Project Gutenberg and skim it to get a clearer idea. Some proofreaders believe that this is best done the next day, when you have a fresher eye and might be more likely to spot oversights. If you switched character sets (e.g. used MIDex on a Mac), switch back to ISO encoding now.
  20. Zip the finished product and upload it to the site. Make sure that the text version is saved as plain text (i.e., not as Word format, WordPerfect format, etc). The file extension should be .txt. If you are submitting an HTML version or another version in addition to the plain text version, please include this in the same zip file as the plain text version. Go to your Post-Processing page, and select Upload for Verification from the project's drop-down menu. Enter your name and email address in the submit form if you would like your real name to be listed in the credits of the final etext. If you do not wish for your name to appear, put in a note that you would like to remain anonymous. Your email address will not be displayed in the credits line, but is used by the PPVer to give you feedback. If you do not include an email address, this feedback will be sent via a personal message on the site.

    NOTE: Please ensure that your file's .zip extension is in lower case, NOT in upper case. Some people have not been able to upload their files when the .zip extension was written in upper case.

  21. Relax. You're done!
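
For the curious, the mechanical clean-up in steps 13 to 16 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how any of the tools below actually work: the function name tidy and the choice of = for bold are made up for the example, and it assumes that /* and */ each sit on their own line. Hyphens with spaces on both sides are left alone so that you can judge them yourself, as step 13 advises.

```python
import re

def is_asterisk_line(line):
    """True for a line made only of asterisks and spaces (a thought break)."""
    stripped = line.strip()
    return bool(stripped) and set(stripped) <= {"*", " "}

def tidy(text, bold_char="="):
    """Hypothetical helper: steps 13 to 16 of the walkthrough, roughly."""
    out, protected = [], False
    for line in text.split("\n"):
        line = line.rstrip()                      # step 15: end-of-line spaces
        if line.strip() == "/*":                  # assumed: marker on its own line
            protected = True
        if not protected and not is_asterisk_line(line):
            # step 13: "word -word" and "word- word" become "word-word";
            # hyphens with spaces on BOTH sides are left for a human to check
            line = re.sub(r"(?<=\S) +(-+)(?=[^\s-])", r"\1", line)
            line = re.sub(r"(?<=[^\s-])(-+) +(?=\S)", r"\1", line)
            # step 14: collapse runs of spaces inside the line
            line = re.sub(r"(?<=\S) {2,}", " ", line)
        if line.strip() == "*/":
            protected = False
        out.append(line)
    text = "\n".join(out)

    # step 16: italics become _ and bold becomes bold_char, after a count check
    if text.count("<i>") != text.count("</i>"):
        raise ValueError("unequal numbers of <i> and </i>; check the tags by hand")
    text = text.replace("<i>", "_").replace("</i>", "_")
    return text.replace("<b>", bold_char).replace("</b>", bold_char)
```

Calling tidy() on the whole text returns the cleaned text; lines inside /* */ and lines of asterisks only lose their end-of-line spaces. It is still worth checking the result against the images rather than trusting it blindly.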

Software


The Basics


Gutcheck


What is Gutcheck, and how do I use it?

Gutcheck is a nifty piece of software created by Jim Tinsley for people working on Project Gutenberg etexts. It checks for errors which are common, but not easy to spot, like mismatched quotes, short lines, etc.

Gutcheck is currently being produced for Windows and *nix systems, and can be found here. A quick-and-dirty Mac build can be downloaded here. You could also simply ask another post-processor to run Gutcheck for you if you have trouble with it, and they can send you a list of results.

Many people have trouble setting up Gutcheck for the first time. If you too have trouble, don't worry. Someone in the Post-Processing Forum will be only too happy to help you. Just post and ask for help!

If you are using Guiguts, you will not need Gutcheck, as it is a part of Guiguts.
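
To give a feel for the kind of thing Gutcheck looks for, here is a rough Python sketch of three such checks: long lines, short lines inside a paragraph, and paragraphs with an odd number of double quotes. It is emphatically not Gutcheck's own code or logic; the function name quick_checks and the thresholds of 75 and 55 characters are assumptions chosen for the example.

```python
def quick_checks(text, max_len=75, min_len=55):
    """Hypothetical sketch: return (line number, message) pairs worth a look."""
    warnings = []
    lines = text.split("\n")
    quote_count, para_start = 0, None
    for number, line in enumerate(lines, start=1):
        if line.strip():
            if para_start is None:
                para_start = number               # first line of a paragraph
            quote_count += line.count('"')
            if len(line) > max_len:
                warnings.append((number, "long line"))
            # a short line is only suspicious if the paragraph continues
            next_line = lines[number] if number < len(lines) else ""
            if len(line) < min_len and next_line.strip():
                warnings.append((number, "short line"))
        else:
            # blank line: the paragraph (if any) has ended
            if para_start is not None and quote_count % 2:
                warnings.append((para_start, "mismatched quotes"))
            quote_count, para_start = 0, None
    if para_start is not None and quote_count % 2:
        warnings.append((para_start, "mismatched quotes"))
    return warnings
```

As with the real thing, not every flag is a genuine error: poetry and tables will be reported as short lines, and a speech that continues over several paragraphs will be reported as having mismatched quotes.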


Text Editors

Any text editor can be used for post-processing, but some have tools which make them more suitable than others for the job. BBEdit Lite is an excellent choice for Mac users. It is no longer being supported, but remains on the site for download. Be sure to download the MIDex plug-in as well. Other useful plug-ins for BBEdit are available here. Many *nix users use emacs, which can be downloaded here, though it is probably on your machine already. It is also available for Windows. If you use Microsoft Word for any platform, you will be able to run a useful macro.


Image Viewers

You will need to be able to look at your book's scanned images at some point in the post-processing process. You can either download them and use a third-party program to view them, or view them online through a browser. Any program that will display images will do.

Some people have recommended utilities which allow you to see thumbnails of images without opening them, making it easier to find the one you're looking for. GraphicConverter is one such program for Macintosh machines. Irfanview is a quality Windows image manipulation program. Xnview runs on Windows, *nix, and a host of smaller operating systems.

Spell Checkers

Many text editors do not provide a spell checking feature. Instead, spell check programs are used to provide this essential function. A separate spell-checking program is also useful when post-processing texts which are not in English. Many post-processors do not have access to non-English spell checkers, and buying add-ons for programs like Microsoft Word can be very expensive. Independent spell checkers provide free dictionaries for a wide range of common languages (French, Spanish, Dutch) and not-so-common ones (Catalan, Estonian, or an English biomedical word list).

Excalibur is a Macintosh spellchecker. It has separate downloadable dictionaries for many different languages. The dictionaries provided are very complete. It works with LaTeX. The major drawback of this program is that its dictionaries are difficult to edit. Though adding words is easy, removing words (such as common Stealth Scannos) is not. Excalibur can be downloaded here.

Aspell is a spell checker which is available both as a Windows executable and as a perl script. It provides a good selection of different language dictionaries. It is available here.

Ispell has been ported to Windows, OS/2, and also runs on *nix and MacOSX. It does not run on DOS or older Mac OS machines. As with the other two spell checkers, a good selection of language dictionaries is available for download. Ispell can be found here.

HTML Validators


HTML Tidy is an excellent HTML validator which runs on a myriad of systems. Files to be validated can be uploaded to this website [w3.org] or this one and validated online.

CSS Validators


CSS can be validated at this website [w3.org]. Note that if you choose to upload a file for verification, only the section between the <style> tags should be in the file, since anything other than CSS will confuse it (and you) terribly. It is probably easiest to cut and paste the CSS into the section of the form called "Validate by direct input."
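
If you would rather not cut and paste by hand, pulling out the section between the <style> tags can be sketched in Python as follows. The function name extract_css is made up for this example, and the sketch assumes a straightforward page whose CSS sits in ordinary <style> elements.

```python
import re

def extract_css(html):
    """Hypothetical helper: return the text found between <style> tags."""
    blocks = re.findall(
        r"<style\b[^>]*>(.*?)</style>",           # any attributes, any case
        html,
        flags=re.IGNORECASE | re.DOTALL,
    )
    # join several <style> sections, if present, into one piece of CSS
    return "\n".join(block.strip() for block in blocks)
```

The returned text is what you would paste into the validator or save to a file for uploading. If the CSS is wrapped in HTML comment markers, you may need to remove those by hand first.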

Dedicated Proofreading Tools


There are presently two pieces of software which have been specifically written by post-processors for post-processors. Both provide an all-in-one kit, so you can use them for all of your post-processing needs, or just take advantage of some of their extra features. Each program has its supporters and detractors, so give them both a try!


Guiguts

Guiguts was written by thundergnat. The tool is almost a complete post-processing kit in itself. It began as a graphical interface for gutcheck, but has evolved to become much more. Among its special features are the ability to automatically remove page headers while keeping track of each page's identity, analysis of word frequencies (especially useful for catching misspelled proper names and other odd spellings), automated checks for markup errors, a footnote moving function, easy checking of "Stealth Scannos", and much more. You won't even need a text editor or Gutcheck, because they're built in! Guiguts is very well-documented.

Guiguts is available from here. Guiguts does not come with a spell checker, but integrates well with Aspell.
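
To see why word frequencies are so handy for catching odd spellings, here is a minimal Python sketch of the idea. It is not taken from Guiguts; the function name rare_words and the rule "list every word that occurs only once" are assumptions made for the example.

```python
import re
from collections import Counter

def rare_words(text, max_count=1):
    """Hypothetical helper: words seen at most max_count times, sorted."""
    # runs of letters, optionally joined by an apostrophe or a hyphen
    words = re.findall(r"[^\W\d_]+(?:['-][^\W\d_]+)*", text)
    counts = Counter(word.lower() for word in words)
    return sorted(word for word, n in counts.items() if n <= max_count)
```

In a text where "Holmes" appears hundreds of times and "Hohnes" appears once, the scanno will turn up in this list while the correct name will not. Most words in the list will be perfectly good, so it is a list of things to glance at, not a list of errors.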


The Gut* Foursome

BillFlis has created a set of four tools which run on the Windows platform only.

GutSweeper scans the text and makes a lot of automatic corrections, such as eliminating double spaces and end-of-line blanks, fixing hyphenation errors, and splitting oe ligatures. It is markup sensitive, so that it will not ruin the formatting of poetry, block quotes, and tables. This saves some time for the proofreader, as all of its changes are ones that would have to be made anyway.

GutAxe is an interactive tool, which allows the post-processor to make more complicated changes to the text. It works a bit like a spell checker, highlighting a "problem" area and suggesting possible solutions. It scans for common "Stealth Scannos" and punctuation errors, among other things. It also removes page markers.

GutWrench supplements Gutcheck with a lot of extra checks.

GutHammer rewraps the text in a similar way to Big_Bill's RewrapIndent tool (see Other Tools, below). It also replaces HTML markup with ASCII symbols.

The changes made by these programs are saved as new files, with no alteration being made to the original, so they are totally undoable if something goes wrong. The tools can be downloaded here. They have excellent documentation, and include a suggested post-processing walkthrough written especially for the Gut* tools.

Other Tools

Please note that macros may need an additional piece of software in order to run. For example, a lisp or perl script will need a lisp or perl interpreter to work, such as ActivePerl, available here, and Clisp, available here, both for free. If you need help getting them to work, or your platform isn't supported by these, ask in the Post-Processing Forum for help; almost certainly someone will be able to assist you!

There is a macro available for Microsoft Word, which rewraps text while preserving paragraph breaks. It may also replace double spaces with single ones and adjust the margins to less than 75 characters (I am less sure about these functions). Please note that this macro will not work in Word 97 or older versions.

Naomi Parkhurst has written an Applescript macro, which rewraps the text (including around poetry) and removes superfluous spaces. It works for BBEdit (but not BBEdit Lite).

Garweyne has written a lisp script to aid post-processors wishing to rearrange footnotes. A more elaborate description of the script's abilities is available at the link above. Many more useful lisp scripts are available here.

Bill Keir (big_bill) has written a lisp script called RewrapIndent which will both rewrap and indent text automatically. It honours the /* */ tags that proofreaders use to mark poetry, so poetry will not be rewrapped but will be automatically indented, and allows the addition of other tags to handle block quotes automatically if needed. It handles complicated cases like poetry nested inside block quotes, and multiple levels of block quotes within block quotes, but can also be used to quickly and tidily rewrap any simple book that needs no special indenting, to whatever line length you choose, too. For more details see the HTML documentation inside the zip file.
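
The basic idea of a rewrap that honours the /* */ tags can be sketched in Python as below. This is not RewrapIndent (which is a lisp script and does a great deal more); the function name rewrap and the default width of 72 are assumptions, and the sketch expects /* and */ to sit on their own lines. It drops the markers as it goes and does no indenting.

```python
import textwrap

def rewrap(text, width=72):
    """Hypothetical sketch: rewrap prose, copy /* */ blocks through untouched."""
    out, paragraph, protected = [], [], False

    def flush():
        # wrap the paragraph collected so far, never splitting inside a word
        if paragraph:
            out.extend(textwrap.wrap(" ".join(paragraph), width=width,
                                     break_long_words=False,
                                     break_on_hyphens=False))
            paragraph.clear()

    for line in text.split("\n"):
        stripped = line.strip()
        if stripped == "/*":                      # start of poetry or a table
            flush()
            protected = True
        elif stripped == "*/":                    # end of the protected block
            protected = False
        elif protected:
            out.append(line.rstrip())             # keep the line breaks as they are
        elif not stripped:
            flush()                               # blank line ends a paragraph
            out.append("")
        else:
            paragraph.append(stripped)
    flush()
    return "\n".join(out)
```

Because words are never split, a very long unbroken string (for example a chain of words joined by -- with no spaces) can still produce a line longer than the chosen width, so a check for long lines afterwards remains worthwhile.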




Advanced Proofreading Questions


This section contains information on the more complex aspects of post-processing. It is designed for advanced post-processors, rather than beginners.

The formatting issues treated below are also discussed in the Proofreading Guidelines, should you need information on proper markup.


Footnotes


Should I leave footnotes inline, or move them elsewhere?

This depends a great deal on the length of the footnotes, their frequency, and on the type of text that you are proofreading.

It is suggested that you should leave the footnotes where they occur in the text if they are short and relatively uncommon. In these cases, the footnotes won't disrupt the flow of the text very much.

If the footnotes in your text are long, numerous, or are primarily bibliographic references, you might be wise to move them to the end of the paragraph where they occur, on their own line, and just leave a marker (e.g. [2]) in the paragraph.

If the footnotes are so common that even this would be highly disruptive to the reader, consider collecting them all and moving them to the end of the chapter. This is a bit of a pain, but it makes such works infinitely more readable. Mark the endnote in the text as you would an ordinary footnote, fix all of the numbering, and list the endnotes at the end of the chapter with the new numbering.

If you should decide to move footnotes, Garweyne has a lisp tool which makes this MUCH easier.

Footnotes in poetry are a special case worth mentioning. Most post-processors agree that the footnotes should not be simply inserted into the text where they occur, as this interferes with the rhythm of the poetry. Mark the place referenced by the footnote, then move the footnote to its own line or to the end of the poem, whichever seems most appropriate.

Whatever format you choose, make sure that it is consistent throughout the text.
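
Collecting footnotes at the end of a chapter and fixing the numbering can be sketched in Python as follows. This is not Garweyne's tool, and it rests on assumptions about the markup: markers in the text look like [1], footnotes look like [Footnote 1: text], each footnote follows its marker, and no footnote contains a ] character. The function name footnotes_to_end is made up for the example.

```python
import re

# assumed markup: "[Footnote N: text]" blocks and "[N]" markers in the text
PATTERN = re.compile(r"\[Footnote (\d+): ([^\]]*)\]\n*|\[(\d+)\]")

def footnotes_to_end(chapter):
    """Hypothetical sketch: renumber markers and list the notes at the end."""
    notes = []
    markers = 0

    def handle(match):
        nonlocal markers
        if match.group(2) is not None:            # a footnote block
            notes.append(" ".join(match.group(2).split()))
            return ""                             # remove it from the text
        markers += 1                              # a marker: give it a new number
        return f"[{markers}]"

    body = PATTERN.sub(handle, chapter).rstrip()
    if markers != len(notes):
        raise ValueError("markers and footnotes do not pair up; check by hand")
    listing = "\n\n".join(f"[{i}] {note}" for i, note in enumerate(notes, start=1))
    return body + "\n\n" + listing if notes else body
```

Any other bracketed number in the text would be mistaken for a marker, so a sketch like this is only a starting point; the count check at the end is there to catch the most obvious mismatches.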


Illustrations


Should I leave in all of those [Illustration] tags? What about the ones with captions?

[Illustration] tags with no captions should not be removed. This is so that if someone decides to make an HTML version in the future, the tags will be there and it will be easier to correctly place the images. If you are making an HTML, XML, or similar version yourself, replace the tags with links to the images.

[Illustration] tags with captions should be left in place for the reader to enjoy.


My text will make no sense if the actual illustrations aren't included.

If the text was scanned with the intention of just converting it to ASCII, email the Project Manager and ask for advice. If you are willing and able and the scans are available, they may ask you to do an HTML version and provide good quality scans for the missing images. Alternatively, they may decide that the project doesn't really need the images and explain their opinion.

If the text really should be produced in a version that will allow future readers to view the images, and you are unable or unwilling to do the work to put it in such a version, return it and post your concerns in the Post-Processing Forum.


Poetry


My book contains a few verses of poetry. How should I format them?

The proofreaders should have surrounded any verses of poetry with the markers /* and */. Remove the markers, and check against the original image to make sure that the formatting is correct. A quick scan should reveal if the spacing in your text and the original match.

Make sure that the indentation is consistent, at least within each poem. It's entirely possible that one proofreader indented some lines four spaces, while the proofreader who got the next page indented five, when the image shows the same amount of indentation.

Indent all of the poetry by 2 spaces (or more if you prefer), preserving any further indenting that the author intended.

DO NOT REWRAP LINES. You will have to take special care when you rewrap the text not to rewrap your poetry. However, if a line is broken in two due to its length, but it was not intended to be (the second line of these is usually highly indented and not capitalized), join the two parts of the line together. If they still don't fit on one line, break them and indent the second half heavily.
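
If you rewrap with a script rather than by hand, one way to protect the poetry is to rewrap while the /* and */ markers are still in place. The following is only a minimal Python sketch of that idea, not a replacement for the dedicated tools: the function name rewrap_keeping_poetry, the default width of 72, and the assumption that each marker sits alone on its own line are all illustrative choices. It rewraps prose paragraphs, copies poetry lines untouched apart from a 2-space indent, and drops the markers.

```python
import textwrap

def rewrap_keeping_poetry(text, width=72, poetry_indent=2):
    """Rewrap prose paragraphs; copy /* ... */ blocks line by line."""
    out = []          # finished output lines
    paragraph = []    # prose lines collected for the current paragraph
    in_poetry = False

    def flush():
        # Rewrap the collected prose paragraph, if there is one.
        if paragraph:
            out.extend(textwrap.wrap(" ".join(paragraph), width=width))
            paragraph.clear()

    for line in text.splitlines():
        stripped = line.strip()
        if stripped == "/*":        # start of a poetry block (marker dropped)
            flush()
            in_poetry = True
        elif stripped == "*/":      # end of a poetry block (marker dropped)
            in_poetry = False
        elif in_poetry:
            # Poetry: never rewrap, only add the base indent.
            out.append(" " * poetry_indent + line.rstrip() if stripped else "")
        elif stripped:
            paragraph.append(stripped)
        else:
            flush()
            out.append("")
    flush()
    return "\n".join(out)
```

Check the result against the page images afterwards; a script cannot tell whether a long line of verse was broken by the printer or by the poet.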


My book is poetry. How do I format it?

Aside from the fact that poems won't be marked by poetry markers, the book should be formatted as for occasional verses (check the indentation, etc.). Also, the poem(s) need not be indented two spaces by default, as they are the main text, and don't need to stand out from it.

Books composed entirely of poetry do not need to have their lines rewrapped (watch the Gutcheck output for overly long lines, though). However, the book may contain an introduction or other prose section which will need rewrapping.

bconstan has graciously offered to aid anyone who needs extra help post-processing poetry books. If you have any questions, send her a message.


Tables


What do I do with tables?

Tables should have been marked with /* */ by the proofreaders, but a quick scan of the book should turn any up even if they are unmarked, as they are quite conspicuous.

You will have to move the text in the table around to make it as readable as possible. If the headings are broken between lines, put them on the same line if you can. Adjust the spacing of the columns so that they look good on the screen and aren't too close together. Make all of the column entries line up. If you are lucky, this formatting was already done for you, but not all proofreaders can format tables accurately (if their display font isn't monospace, for example). Watch for tables that span multiple pages, as they will be unlikely to have similar formatting. "Related" tables should be formatted consistently, if possible.

DO NOT REWRAP LINES! You don't want to destroy all of your hard work, now do you?
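
For a table with many rows, lining up the columns by hand gets tedious. As a rough illustration only, here is a small Python sketch that pads every cell to the width of the widest entry in its column. The function name align_table, the gap of 2 spaces, and the sample data in the usage note are assumptions made for the example; you would still have to split each row of your table into cells yourself.

```python
def align_table(rows, gap=2):
    """Pad each cell so every column lines up in a monospace font."""
    columns = max(len(row) for row in rows)
    # Pad short rows so that every row has the same number of cells.
    padded = [list(row) + [""] * (columns - len(row)) for row in rows]
    # The width of a column is the length of its longest cell.
    widths = [max(len(row[i]) for row in padded) for i in range(columns)]
    lines = []
    for row in padded:
        cells = [cell.ljust(width) for cell, width in zip(row, widths)]
        lines.append((" " * gap).join(cells).rstrip())
    return lines
```

Calling align_table([["Year", "Wheat"], ["1850", "12"]]) returns one string per row, ready to paste back into the text. This sketch left-aligns everything; columns of figures often read better right-aligned, which would need rjust instead of ljust.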


The table will not fit into the line widths allowed by PG.

You have a couple of options here:

  1. Try your best. You may have to split the chart into multiple rows. Or you may come up with your own way to format the information in the troublesome chart. Be inventive.
  2. Go over the PG limit. PG will accept books with a few lines longer than their standard if there is a good reason for them to be extra-long. However, try very hard to make it fit in the PG limit before bending the rules.
  3. Give up. Mark the chart up as an [Illustration], use the title as the caption, and write it off. It's not ideal, but sometimes it's the only way.
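
Whichever option you choose, it helps to know exactly which lines are over the limit. A minimal Python sketch follows. The function name long_lines and the default limit of 72 characters are assumptions for illustration only, so substitute the line width that PG actually requires.

```python
def long_lines(text, limit=72):
    """Return (line_number, length) for every line longer than limit."""
    return [(n, len(line))
            for n, line in enumerate(text.splitlines(), start=1)
            if len(line) > limit]
```

If the list that comes back contains only a handful of table rows, option 2 may be reasonable; if it contains most of the table, options 1 or 3 are probably safer.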

Sidenotes


These [Sidenote] tags seem redundant!

In most cases, sidenotes add a bit of summary or description to the text, but, in very rare cases, the sidenotes add nothing to the book and will be an annoyance rather than a help. If your book fits this mold, consider leaving out the sidenotes. BUT, think long and hard about this, as this is altering the text of the original, a DP no-no. Email the Project Manager and/or post in the Forums for a second opinion before taking this step.


Headings and Subheadings


How do I differentiate subheadings from headings?

Usually, the easiest way to differentiate between subheadings and headings is to change the line spacing (ex. leave three lines blank when a new heading begins, and only one for a new subheading). However, some texts may have more than one layer of subheading. In these cases, you will have to devise a markup which is appropriate to the text. You could indent subheadings a certain number of spaces depending on their "layer", for example.


Indexes


Is there anything special that I should know about formatting indexes?

Pay attention to the presence/absence of trailing commas and semicolons. You may either leave these in or remove them, as long as you are consistent.

DO NOT REWRAP LINES. Unless you have placed a blank line between each and every entry, rewrapping will destroy the format of the index. Be careful!

If you are creating an HTML version, why not make a hyperlinked index?


Errata Pages


My book has an errata page at the end. Should I correct the errors?

Yes. The list of errata at the end of the book reflects the author's intention, and one of the guiding principles of etext production is to preserve the author's original intent. First- and second-round proofreaders had access to only one page of the book at a time, so none of the errata errors will have been corrected by them. This job therefore falls to the post-processor, who has access to all of the pages of the book. Find and correct all of the errata, and delete the errata page from the book.
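
If the errata list is long, a small script can apply the corrections and tell you which ones still need a human eye. This is only a sketch under stated assumptions: the function name apply_errata and the (wrong, right) pair format are invented for the example, and each "wrong" phrase must include enough surrounding words to occur exactly once in the text. Anything that is missing or occurs more than once is left alone and reported.

```python
def apply_errata(text, errata):
    """Apply (wrong, right) corrections that match exactly once.

    Returns the corrected text and a list of (wrong, count) pairs for
    corrections that were skipped because the phrase was missing or
    occurred more than once.
    """
    unresolved = []
    for wrong, right in errata:
        count = text.count(wrong)
        if count == 1:
            text = text.replace(wrong, right)
        else:
            unresolved.append((wrong, count))
    return text, unresolved
```

Check every unresolved entry against the page images by hand; a count of 0 often means that a proofreader already "fixed" the error, or that the phrase is split across a line break.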


Non-ASCII Characters


My text has accents, pound signs, or other non-ASCII characters in it. Should I preserve them in the final version?

In general, yes. Keep all of the accented words (or symbols) as they are. An ISO-8859-1 (also called Latin-1 or 8-bit ASCII) file can be made which preserves them. If the text is in a language containing many accents that are not found in ASCII or ISO-8859-1, there are other forms of encoding out there in which they can be preserved.
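
To find out whether ISO-8859-1 is enough for your book, you can ask a script to list every character that the encoding cannot hold. The Python sketch below is a hedged illustration; the function name outside_latin1 is made up for the example, and it assumes that your working copy is already loaded as a Unicode string.

```python
def outside_latin1(text):
    """Return (line_number, character) for characters Latin-1 cannot hold."""
    found = []
    for n, line in enumerate(text.splitlines(), start=1):
        for ch in line:
            try:
                ch.encode("iso-8859-1")
            except UnicodeEncodeError:
                # This character needs another encoding (or a workaround).
                found.append((n, ch))
    return found
```

An empty result means that the whole text fits in ISO-8859-1. A few hits (the oe ligature is a common one) can be handled case by case; many hits suggest that the book needs one of the other encodings.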


Non-Latin Scripts and Unusual Symbols


How do I handle footnotes, etc. in Greek, Russian, or other texts with a non-Latin alphabet?

If possible, the text should be transcribed into the Latin alphabet. It's not a lossless process, but it's the only way to preserve these snippets in ASCII. Information on the transcription process is available in the Proofreading Guidelines.

Many languages, like Arabic and Hebrew, are difficult to transcribe without an intimate knowledge of the language. If your text contains snippets of such a language and you don't have the knowledge to transcribe it yourself, try posting in the Forums to find someone to team up with for transcription.

If there is a significant amount of text in a non-Latin script, it may be worth making a Unicode (HTML) version, which would allow the original script to be preserved.

If you cannot transcribe the language, and you can't find anyone else who is capable of doing it for you, mark its presence with [Arabic] (for Arabic), and delete any OCR garbage that may have been left in the text. It's too bad that the information will be lost, but you've done the best you could.


My text has weird symbols in it (Zodiac signs, medical abbreviations, etc.). How do I mark these up?

If you are lucky, the proofreader will have done some research and found the meaning of the symbol for you. However, often the proofreader will mark the symbol with a * and leave you with the legwork.

If you know what the symbol represents, write out its meaning, e.g. [Symbol: Jupiter]. Do not try to replicate the symbol itself in ASCII.

If you have never run across the symbol before, here are a few web pages provided by proofreaders to help you:
Apothecaries'/Medical
Alchemical
Astronomical
Latin Abbreviations
Graphic Search for Symbols

This section has room to grow. If you find any other good links, please PM me.

Note that some symbols may have more than one meaning. If this is the case, try to determine the best meaning from the context of the symbol.


Paranoid Proofreading Checks


I want to make sure that I do a really good job post-processing my book. Are there any common errors that often make it through the checking system?

Yes, there are a few kinds of errors which often make it through both rounds of proofreading and into the final etext. These errors fit into three categories: specks that introduce punctuation, errors introduced by the tags used in proofreading, and "scannos" that can make it through a spell check.

To check for random punctuation caused by specks on the image that was OCRed, search for the following things:

  • , or . followed immediately by a letter,
  • . followed by a space and a lowercase letter,
  • , followed by a space and an uppercase letter,
  • / and /', which often occur instead of ," and .",
  • .' for ." (reverse these if your book uses single quotes as double quotes),
  • { and } instead of [ and ],
  • standalone ' followed by a hard return, and
  • standalone symbols, like &, $, ^, =, \, /, «, », @, ~, `, #, %, +, and |, which can creep in.
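
Several of the searches above can be written as regular expressions and run in one pass. The Python sketch below covers only a subset of the list (punctuation against a letter, period before lowercase, comma before uppercase, curly brackets, and standalone symbols). The names CHECKS and find_speck_punctuation are invented for the example, and the patterns are my own reading of the list, not an official DP tool.

```python
import re

# Each entry: (description, compiled pattern). These are illustrative
# approximations of the manual searches, not an exhaustive set.
CHECKS = [
    ("comma or period directly before a letter", re.compile(r"[,.][A-Za-z]")),
    ("period, space, lowercase letter", re.compile(r"\. [a-z]")),
    ("comma, space, uppercase letter", re.compile(r", [A-Z]")),
    ("curly bracket", re.compile(r"[{}]")),
    # A symbol with whitespace (or the line edge) on both sides.
    ("standalone symbol",
     re.compile(r"(?<!\S)[&$^=\\/\u00ab\u00bb@~`#%+|](?!\S)")),
]

def find_speck_punctuation(text):
    """Return (line_number, description, matched_text) for each suspect."""
    hits = []
    for n, line in enumerate(text.splitlines(), start=1):
        for label, pattern in CHECKS:
            for match in pattern.finditer(line):
                hits.append((n, label, match.group()))
    return hits
```

Expect false positives: abbreviations such as "e.g." and proper nouns after a comma will be flagged too. Treat every hit as a place to look at the page image, not as an error to be replaced automatically.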

A few errors are introduced by the HTML and other proofreading markup that DP uses for proofreading. To eliminate these, do the following:

  • before deleting the HTML elements, search for <i> followed by a space, and </i> preceded by a space,
  • search for [ and ], to make sure that all [Footnote] and [Illustration] tags are properly formatted, and
  • after replacing the HTML elements, search for >, <, and / to make sure that all of the tags have been replaced.
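
The first and last of these searches are easy to script. The Python sketch below is a rough illustration with made-up function names (italic_spacing_problems and leftover_markup): run the first function before you remove the markup and the second one after you have replaced it.

```python
import re

ITALIC_SPACING = re.compile(r"<i> | </i>")   # space just inside the tags
LEFTOVER_MARKUP = re.compile(r"[<>/]")       # any stray <, > or /

def italic_spacing_problems(text):
    """Run BEFORE removing markup: <i> then a space, or a space then </i>."""
    return [(n, line) for n, line in enumerate(text.splitlines(), start=1)
            if ITALIC_SPACING.search(line)]

def leftover_markup(text):
    """Run AFTER replacing markup: lines that still contain <, > or /."""
    return [(n, line) for n, line in enumerate(text.splitlines(), start=1)
            if LEFTOVER_MARKUP.search(line)]
```

The second check will also report legitimate slashes (dates, "and/or", fractions), so read each reported line before changing anything.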

There are myriad errors which will make it through a spell checker. If you would like to avoid tedious find-and-replaces, you could remove these words from your program's spell checker. Only a few of the most common scannos will be listed here.

  • standalone 1 and 0, which sometimes replace I and O,
  • arid, for and,
  • arc, for are,
  • m, for in,
  • yon, for you,
  • modem, for modern,
  • loth, for 10th,
  • bad, for had,
  • lie, for he (and the),
  • hut, for but,
  • clay, for day,
  • wen, for well,
  • ail, for all,
  • fail, for fall,
  • tho, for the,
  • bo, for be,
  • ho, for he,
  • lime, for time,
  • coining, for coming,
  • tiling, for thing,
  • docs, for does,
  • riot, for not,
  • tum, for turn,
  • cur, for our,
  • ringer, for finger,
  • mined, for ruined,
  • carnage, for carriage,
  • carne, for came,
  • tip, for up,
  • tile, for the,
  • bat, for but,
  • comer, for corner,
  • 44 and 11, for ",
  • Borne, for Rome,
  • ease, for case,
  • Spam, for Spain,
  • tram, for train,
  • gram, for grain,
  • guru, for gun,
  • vas, for was,
  • bum, for burn,
  • Alien, for Allen,
  • j, for ;,
  • gaming, for gaining,
  • art, for act,
  • eve(s), for eye(s),
  • car, for ear, and
  • cat, for eat.
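
Because every one of these scannos is a real word, the only safe approach is to find each occurrence and read it in context. The Python sketch below shows one way to do that. The dictionary holds just a few entries from the list above, and the names SCANNOS and find_scannos are assumptions for the example; extend the dictionary as you see fit.

```python
import re

# A small sample of the list above: suspect word -> word it often replaces.
SCANNOS = {
    "arid": "and",
    "arc": "are",
    "modem": "modern",
    "tiling": "thing",
    "comer": "corner",
    "carnage": "carriage",
}

def find_scannos(text, scannos=SCANNOS):
    """Return (line_number, suspect_word, likely_intended_word) per hit."""
    # \b keeps "arc" from matching inside "search", for example.
    pattern = re.compile(
        r"\b(" + "|".join(map(re.escape, scannos)) + r")\b")
    hits = []
    for n, line in enumerate(text.splitlines(), start=1):
        for match in pattern.finditer(line):
            word = match.group(1)
            hits.append((n, word, scannos[word]))
    return hits
```

The match is case-sensitive, so "Arid" at the start of a sentence is not reported by this sketch. Do not replace hits blindly: a desert really can be arid, and a battle really can end in carnage.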

A more complete and constantly growing list is maintained by big_bill. If you have found a "stealth scanno" which isn't on his lists, send it to him. He also collects statistics on the appearance rates of those already on the lists, so if you care to keep count of sightings of old stealth scannos he'd be just as happy to accept those as reports of new ones.

The latest version (presently 1.22) of big_bill's lists can be found here:

Common English Scannos
Rare English Scannos
Theoretical (But as yet Unwitnessed) English Scannos
Common French Scannos
Rare French Scannos
Theoretical (But as yet Unwitnessed) French Scannos
Common German Scannos
Rare German Scannos
Theoretical (But as yet Unwitnessed) German Scannos

The lists are plain text, and could also be used by an adventurous programmer to check for common letter shifts (ex. h -> b) and such. The custom-built post-processing tools also make use of them.


Non-ASCII Formats


I want to do something special with this text. Can I make a version of the text in HTML/XML/etc.?

Yes! Feel free to make non-ASCII versions if you wish. As long as you also produce an ASCII version, PG will be glad to accept any other version that you may produce.

Check the project comments to find out whether the project manager has requested an HTML version. If they have not requested an HTML version, you may still create one if you wish, but it is not necessary. If you wish to work on a text which will need HTML treatment, you must either be willing to produce the extra version yourself or find a partner to do it for you.

There are several requirements for HTML. First, be sure to read the PG HTML FAQ and follow all the requirements. Before submitting your HTML for PPV, please do the following:

  • validate the HTML at http://validator.w3.org,
  • validate the CSS, if any, at http://jigsaw.w3.org/css-validator,
  • check all the links to make sure that they work and to double check that all your images are present, using the linkchecker at http://validator.w3.org/checklink or another linkchecker such as xenulink, and
  • run HTML Tidy to uncover any remaining problems in the HTML.

Any HTML questions not answered in the PG HTML FAQ or the DPWiki's Guide to HTML should be discussed in the Post-Processing forum.

There have been a few especially useful discussions in the Post-Processing Forum. One is a post which suggests a format for documents based on XHTML 1.0. Another is this DPWiki post which gives a suggested guide for HTML writers and provides an index of HTML-related topics.

Some useful HTML entity codes can be found here.


I checked out a math text, and it's full of strange markup. How should I treat it?

Most math texts which are being proofread use the LaTeX markup language to clarify the mathematical symbols and formulas in the book. This means that the post-processor of math books must understand this markup, and be willing to produce a LaTeX version for PG.


Missing or Problem Images


There's a page missing from the scans, or some words/pages are blurred/chopped off, etc.

Try emailing the Project Manager. If they still have the text, they may be able to clarify blurred or missing words, or give you a scan for a missing page.

If they don't have the text, join gutvol-d and post a message asking for help. Give the name and author of the book which you are working on, what you will require as help (usually looking at a paper book to clarify a few words), and how much work there will be (are only a few lines cut off, or are there whole pages missing?). These volunteers will reply to you if they have access to your book. You should then send them the text with comments so that they can find the damaged portions, correct them, and send them back to you. You could also do this process yourself if the book exists in your local library.

Please do not submit the project for Verification until the missing text has been found.


Returning a Project


This project is too hard, or I don't have time for it, or I just don't want to do it any more! How do I get rid of it?

To dispose of your project and return it to the pool for another post-processor, go to your Post-Processing page, find the title of the book which you are returning, and select Return to Available from its drop-down menu. This will erase all of your changes and send it back to the pool for another post-processor. If you have done a lot of work on it, you might do better to arrange for someone else to pick up where you left off by making a post in the Post-Processing Forum.


Projects with Multiple Parts


Why are some projects split into different parts?

In the proofreading rounds, books intended for beginners were generally split into smaller units. This was not only to ensure a constant supply of projects for beginners, but to get feedback from mentors to the proofreaders faster than if the books were kept in one piece. It is encouraged, though not absolutely essential, for the same post-processor to check out all of the pieces of these books at the same time so that the formatting will be consistent throughout. The pieces should be joined together into one file for submission.


Post-Processing Verification


What is Post-Processing Verification, and who can do it?

Post-processing verification is the "second round" of post-processing. The verifier looks over the post-processed etext for errors big and small, and submits the etext to Project Gutenberg. Verifiers will often provide feedback to post-processors as well, so that they can improve the quality of their work.

PPVs (as they are known) need to be experienced post-processors, familiar with common problems in etexts and able to provide feedback. Because of this, there is only a limited pool of people capable of PPVing. Once a person has submitted a number of consistently good etexts, the PPV will (at their discretion) give them permission to PPV projects themselves. If you have not been given this permission, you will not be able to check out PPV projects.


Other Questions


I have a question which isn't in this FAQ.

Post-processing involves common sense and personal judgement. The only solid rule is for the post-processor to preserve the author's intention to the best of their ability. There can be more than one way to handle a particular piece of formatting, and all of them can be right. You, as post-processor, have a great deal of freedom to decide how to handle particular formatting issues and make global format changes.

If your common sense and personal judgement aren't helping you solve some particular problem, post your question in the Post-Processing Forum. Other post-processors can then tell you how they would handle the situation. Their suggestions might give you a logical answer for your text, or inspire your own idea as to how to handle the issue.


If you think that something should be added to the FAQ, PM me. I don't mind answering questions, really I don't!

If you have any suggestions (ex. a nifty new spell checker or text editor that I haven't included, comments on grammar, anything!), please don't be afraid to tell me about them.


 
Copyright Distributed Proofreaders