Post-Processing FAQ
(Version 3.7; last updated July 22, 2004)
Getting Started
What is Post-Processing?
Selecting a Book
How Long Do Etexts Take?
Guiguts
The Gut* Foursome
Other Tools
Advanced Proofreading Questions
Special Formatting Issues
Footnotes
Headings and Subheadings
Symbols and Scripts
Non-ASCII Characters
Non-Latin Scripts and Unusual Symbols
Extra Checks
Paranoid Proofreading Checks (Stealth Scannos, etc.)
Technical Questions
Non-ASCII Formats
Missing or Problem Images
Returning a Project
Projects with Multiple Parts
Post-Processing Verification
Other Questions, and Suggestions for the FAQ
This FAQ is designed as a guide for post-processors, especially new ones. I hope that it makes the task of post-processing a little less frightening!
This section is designed especially for the first-time post-processor. It provides a walkthrough of all of the essential steps needed to proofread a relatively simple book. For more difficult books, please see the Advanced Proofreading Questions section for advice.
The purpose of post-processing is to massage an etext into a readable form. In its journey through two proofreading rounds, the text will have been edited by perhaps hundreds of proofreaders. The post-processor must standardize the formatting of the book and adjust it to comply with Project Gutenberg's requirements. The post-processor must also correct as many errors as possible that have survived both proofreading rounds. The ultimate goal of post-processing is to create a plain-text etext with consistent formatting throughout, which contains as few errors as possible, and which accurately reflects the intentions of the author.
Post-processors require more experience than ordinary proofreaders. Because they are preparing the text for uploading to Project Gutenberg, they are the final editors of the text. Because of this, post-processing is only available for proofreaders who have completed at least 400 pages in the first and/or second proofreading rounds. Also, post-processors should be very familiar with the Proofreading Guidelines before attempting to post-process.
The requirements for software are minimal:
There are other useful programs available which are not essential, but which can be extremely useful and might save you a lot of time. Take a look and see if any of them would be of use to you.
For your first project, it's a good idea to pick a fiction book with a relatively small number of pages. Here's why:
Take a look at the Post-Processing section of your Personal Page to see all of the works available for post-processing. Projects which are marked "TRAINEE" have been set aside for new post-processors, and are ideal texts with which to learn the ropes of post-processing. Books labelled "BEGINNERS" or "MENTORS" were originally set aside for new proofreaders. These books are also relatively easy, and would make good first projects. Unlike "TRAINEE" texts, however, these books may be checked out by both new and experienced post-processors. Alternatively, pick a text with a title that is likely to be fiction. "The Boy Scout Camera Club" is a good bet; "Copyright Renewals" is not. These texts are often marked "EASY" and are available to all post-processors.
Download the text you have chosen by going to its Book Options drop-down list and selecting "Download Zipped Text". Do not select "Download Zipped TEI Text" (a different encoding) unless you know what TEI is and want to play around with it. The plain text is the version that you need.
Scroll through the whole text to see if there are any difficulties, like footnotes, poetry, foreign languages, dialects, and tables in it. This way, you will know what you will be dealing with before you commit to the project. If you see any of these items, you would be wise to choose a different project. But, if you think you can handle it, give it a try!
Check the Project Questions and Comments Forum for your book's title to see what proofreaders have been saying about it. Again, this might alert you to issues which make the work more difficult than you had realized.
When you have found a work that you want to do, check it out by going to its Book Options dropdown listbox and selecting Check Out Book. The book is now yours! (Make absolutely sure that the book appears as checked out to you, or you might end up working for hours on your text only to find that someone else has checked it out and submitted it!)
It is impossible to answer this question in advance. The time that a book will take to complete depends essentially on three factors:
Some proofreaders can finish an easy book in only an hour or two (or even less, for the especially speedy). However, most proofreaders require longer than this to do a good job. Some especially difficult works can take weeks (or more!) to complete.
Try not to feel discouraged if you take more than a day to complete an "easy" book. Concentrate on learning the process of post-processing, familiarizing yourself with any tools you might be using, and doing a quality job, rather than on working quickly. You will speed up naturally with practice.
There is no one way to post-process books. Every post-processor has a different technique and uses different tools to do the job. The technique described below is the one the author personally uses, but there are many ways of achieving the same end result. In time, you will probably develop your own technique.
This walkthrough describes a very hands-on way of post-processing, which is why it is recommended for first-time post-processors. It takes longer to complete the book this way than if macros, global find-and-replace, and other tools are used, but it gives the first-timer a feel for what they should look out for by making the proofreader scroll through the book several times. Also, the walkthrough below is suitable for use on all operating systems, regardless of what text editor is used or what tools are available for that operating system. Once you get the hang of what post-processing is all about, please experiment with the software available below, especially the dedicated post-processing tools. You do not have to use any of them, but many post-processors find them indispensable!
It is strongly advised that you read through the following walkthrough before starting to post-process. This will give you an idea of what you will be doing in advance. You might also like to look for helpful software first if you are so inclined.
What is Gutcheck, and how do I use it?
Gutcheck is a nifty piece of software created by Jim Tinsley for people working on Project Gutenberg etexts. It checks for errors which are common, but not easy to spot, like mismatched quotes, short lines, etc.
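To give a feel for the kind of check involved, here is a minimal Python sketch that flags paragraphs containing an odd number of double quotes. It is not Gutcheck's own code or algorithm; the function name and the assumption that paragraphs are separated by blank lines are mine.

```python
def paragraphs_with_odd_quotes(text):
    """Return (paragraph_number, first_line) for paragraphs with an odd number of double quotes."""
    suspects = []
    # Assumption: paragraphs are separated by one or more blank lines.
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    for number, paragraph in enumerate(paragraphs, start=1):
        if paragraph.count('"') % 2 == 1:
            first_line = paragraph.strip().splitlines()[0]
            suspects.append((number, first_line))
    return suspects


sample = 'He said, "Hello.\n\n"Fine," she replied.\n'
for number, line in paragraphs_with_odd_quotes(sample):
    print(f"paragraph {number}: {line}")
```

An odd count is only a hint. Dialogue that continues into the next paragraph legitimately omits the closing quote, so each hit still needs a look at the page image.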
Gutcheck is currently being produced for Windows and *nix systems, and can be found here. A quick-and-dirty Mac build can be downloaded here. You could also simply ask another post-processor to run Gutcheck for you if you have trouble with it, and they can send you a list of results.
Many people have trouble setting up Gutcheck for the first time. If you too have trouble, don't worry. Someone in the Post-Processing Forum will be only too happy to help you. Just post and ask for help!
If you are using Guiguts, you will not need Gutcheck, as it is a part of Guiguts.
Any text editor can be used for post-processing, but some have tools which make them more suitable than others for the job. BBEdit Lite is an excellent choice for Mac users. It is no longer being supported, but remains on the site for download. Be sure to download the MIDex plug-in as well. Other useful plug-ins for BBEdit are available here. Many *nix users use emacs, which can be downloaded here, though it is probably on your machine already. It is also available for Windows. If you use Microsoft Word for any platform, you will be able to run a useful macro.
You will need to be able to look at your book's scanned images at some point in the post-processing process. You can either download them and use a third-party program to view them, or view them online through a browser. Any program that will display images will do.
Some people have recommended utilities which allow you to see thumbnails of images without opening them, making it easier to find the one you're looking for. GraphicConverter is one such program for Macintosh machines. Irfanview is a quality Windows image manipulation program. Xnview runs on Windows, *nix, and a host of smaller operating systems.
Many text editors do not provide a spell checking feature. Instead, spell check programs are used to provide this essential function. A separate spell-checking program is also useful when post-processing texts which are not in English. Many post-processors do not have access to non-English spell checkers, and buying add-ons for programs like Microsoft Windows can be very expensive. Independent spell checkers provide free dictionaries for a wide range of common languages (French, Spanish, Dutch) and not-so-common ones (Catalan, Estonian, or an English biomedical word list).
Excalibur is a Macintosh spellchecker. It has separate downloadable dictionaries for many different languages. The dictionaries provided are very complete. It works with LaTeX. The major drawback of this program is that its dictionaries are difficult to edit. Though adding words is easy, removing words (such as common Stealth Scannos) is not. Excalibur can be downloaded here.
Aspell is a spell checker which is available both as a Windows executable and as a perl script. It provides a good selection of different language dictionaries. It is available here.
Ispell has been ported to Windows, OS/2, and also runs on *nix and MacOSX. It does not run on DOS or older Mac OS machines. As with the other two spell checkers, a good selection of language dictionaries is available for download. Ispell can be found here.
CSS can be validated at this website [w3.org]. Note that if you choose to upload a file for verification, only the section between the <style> tags should be in the file, since anything other than CSS will confuse it (and you) terribly. It is probably easiest to cut and paste the CSS into the section of the form called "Validate by direct input."
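If cutting and pasting by hand gets tedious, the section between the <style> tags can be pulled out with a few lines of Python. This is only a sketch: the function name is made up, and it assumes the stylesheet is embedded in the HTML file rather than linked from a separate file.

```python
import re


def extract_css(html):
    """Return the contents of every <style> element, joined by blank lines."""
    blocks = re.findall(
        r"<style[^>]*>(.*?)</style>", html, flags=re.IGNORECASE | re.DOTALL
    )
    return "\n\n".join(block.strip() for block in blocks)


sample_html = """<html><head>
<style type="text/css">
p { margin-top: 0.75em; }
</style>
</head><body><p>Text</p></body></html>"""

print(extract_css(sample_html))
```

The printed text is what would go into the validator's direct-input box.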
There are presently two pieces of software which have been specifically written by post-processors for post-processors. Both provide an all-in-one kit, so you can use them for all of your post-processing needs, or just take advantage of some of their extra features. Each program has its supporters and detractors, so give them both a try!
Guiguts was written by thundergnat. The tool is almost a complete post-processing kit in itself. It began as a graphical interface for gutcheck, but has evolved to become much more. Among its special features are the ability to automatically remove page headers while keeping track of each page's identity, analysis of word frequencies (especially useful for catching misspelled proper names and other odd spellings), automated checks for markup errors, a footnote moving function, easy checking of "Stealth Scannos", and much more. You won't even need a text editor or Gutcheck, because they're built in! Guiguts is very well-documented.
Guiguts is available from here. Guiguts does not come with a spell checker, but integrates well with Aspell.
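The word-frequency idea is easy to try outside Guiguts as well. The Python sketch below lists words that appear only once, which is where misspelled proper names tend to hide. It is not how Guiguts itself is implemented; the function name and the ASCII-only word pattern are assumptions made for brevity.

```python
import re
from collections import Counter


def rare_words(text, max_count=1):
    """Return words appearing at most max_count times, sorted alphabetically."""
    # Assumption: words are runs of ASCII letters and apostrophes.
    words = re.findall(r"[A-Za-z']+", text)
    counts = Counter(words)
    return sorted(word for word, count in counts.items() if count <= max_count)


sample = "Mr. Darcy bowed. Darcy smiled. Mr. Darcv left."
print(rare_words(sample))
```

In the sample, "Darcv" shows up among the one-off words while the correctly spelled "Darcy" does not, which is the pattern to look for in a real book.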
BillFlis has created a set of four tools which run on the Windows platform only.
GutSweeper scans the text and makes a lot of automatic corrections, such as eliminating double spaces and end-of-line blanks, fixing hyphenation errors, and splitting oe ligatures. It is markup sensitive, so that it will not ruin the formatting of poetry, block quotes, and tables. This saves some time for the proofreader, as all of its changes are ones that would have to be made anyway.
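For those not on Windows, two of these corrections (double spaces and end-of-line blanks) can be approximated in Python. The sketch below is not GutSweeper's code. It assumes that the /* and */ markers each sit on a line of their own, and it leaves everything between them alone apart from trailing blanks.

```python
import re


def tidy_spaces(text):
    """Strip trailing blanks everywhere; collapse runs of spaces outside /* */ blocks."""
    cleaned = []
    in_preformatted = False
    for line in text.split("\n"):
        line = line.rstrip()
        if line.strip() == "/*":
            in_preformatted = True
        elif line.strip() == "*/":
            in_preformatted = False
        elif not in_preformatted:
            # Collapse internal runs of spaces but keep any leading indentation.
            indent = len(line) - len(line.lstrip(" "))
            line = line[:indent] + re.sub(r" {2,}", " ", line[indent:])
        cleaned.append(line)
    return "\n".join(cleaned)


sample = "A line  with   gaps.  \n/*\n    Indented   verse  \n*/\nDone. "
print(tidy_spaces(sample))
```

As with the Gut* tools, it is safest to write the result to a new file and keep the original untouched.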
GutAxe is an interactive tool, which allows the post-processor to make more complicated changes to the text. It works a bit like a spell checker, highlighting a "problem" area and suggesting possible solutions. It scans for common "Stealth Scannos" and punctuation errors, among other things. It also removes page markers.
GutWrench supplements Gutcheck with a lot of extra checks.
GutHammer rewraps the text in a similar way to Big_Bill's RewrapIndent tool (see Macros, below). It also replaces HTML markup with ASCII symbols.
The changes made by these programs are saved as new files, with no alteration being made to the original, so they are totally undoable if something goes wrong. The tools can be downloaded here. They have excellent documentation, and include a suggested post-processing walkthrough written especially for the Gut* tools.
Other Tools
Please note that macros may need an additional piece of software in order to run. For example, a lisp or perl script will need a lisp or perl interpreter to work, such as ActivePerl, available here, and Clisp, available here, both for free. If you need help getting them to work, or your platform isn't supported by these, ask in the Post-Processing Forum for help; almost certainly someone will be able to assist you!
There is a macro available for Microsoft Word, which rewraps text while preserving paragraph breaks. It may also replace double spaces with single ones and adjust the margins to less than 75 characters (I am less sure about these functions). Please note that this macro will not work in Word 97 or older versions.
Naomi Parkhurst has written an Applescript macro, which rewraps the text (including around poetry) and removes superfluous spaces. It works for BBEdit (but not BBEdit Lite).
Garweyne has written a lisp script to aid post-processors wishing to rearrange footnotes. A more elaborate description of the script's abilities is available at the link above. Many more useful lisp scripts are available here.
Bill Keir (big_bill) has written a lisp script called RewrapIndent which will both rewrap and indent text automatically. It honours the /* */ tags that proofreaders use to mark poetry, so poetry will not be rewrapped but will be automatically indented, and allows the addition of other tags to handle block quotes automatically if needed. It handles complicated cases like poetry nested inside block quotes, and multiple levels of block quotes within block quotes, but can also be used to quickly and tidily rewrap any simple book that needs no special indenting, to whatever line length you choose, too. For more details see the HTML documentation inside the zip file.
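If you would rather see the basic idea than install a lisp interpreter, here is a rough Python sketch of rewrapping that honours the /* */ tags. It is not RewrapIndent and does none of its indenting or nested block-quote handling. The default width of 72 and the assumption that each tag sits on its own line are mine.

```python
import textwrap


def rewrap(text, width=72):
    """Rewrap blank-line-separated paragraphs; copy /* */ blocks through unchanged."""
    output = []
    paragraph = []
    in_preformatted = False

    def flush():
        # Join the collected lines into one paragraph and wrap it.
        if paragraph:
            output.append(textwrap.fill(" ".join(paragraph), width=width))
            paragraph.clear()

    for line in text.split("\n"):
        stripped = line.strip()
        if stripped == "/*":
            flush()
            in_preformatted = True
            output.append(line)
        elif stripped == "*/":
            in_preformatted = False
            output.append(line)
        elif in_preformatted:
            output.append(line)
        elif stripped == "":
            flush()
            output.append("")
        else:
            paragraph.append(stripped)
    flush()
    return "\n".join(output)


sample = "one two three four five\nsix seven\n\n/*\n  short\n  verse\n*/\n\ntail"
print(rewrap(sample, width=20))
```

Note that textwrap.fill may also break lines at hyphens within words, so compare the output against the original before trusting it on a whole book.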
Advanced Proofreading Questions
This section contains information on the more complex aspects of post-processing. It is designed for advanced post-processors, rather than beginners.
The formatting issues treated below are also discussed in the Proofreading Guidelines, should you need information on proper markup.
Should I leave footnotes inline, or move them elsewhere?
This depends a great deal on the length of the footnotes, their frequency, and on the type of text that you are proofreading.
It is suggested that you should leave the footnotes where they occur in the text if they are short and relatively uncommon. In these cases, the footnotes won't disrupt the flow of the text very much.
If the footnotes in your text are long, numerous, or are primarily bibliographic references, you might be wise to move them to the end of the paragraph where they occur, on their own line, and just leave a marker (ex. ) in the paragraph.
If the footnotes are so common that even this would be highly disruptive to the reader, consider collecting them all and moving them to the end of the chapter. This is a bit of a pain, but it makes such works infinitely more readable. Mark the endnote in the text as you would an ordinary footnote, fix all of the numbering, and list the endnotes at the end of the chapter with the new numbering.
If you should decide to move footnotes, Garweyne has a lisp tool which makes this MUCH easier.
Footnotes in poetry are a special case worth mentioning. Most post-processors agree that the footnotes should not be simply inserted into the text where they occur, as this interferes with the rhythm of the poetry. Mark the place referenced by the footnote, then move the footnote to its own line or to the end of the poem, whichever seems most appropriate.
Whatever format you choose, make sure that it is consistent throughout the text.
Should I leave in all of those [Illustration] tags? What about the ones with captions?
[Illustration] tags with no captions should not be removed. This is so that if someone decides to make an HTML version in the future, the tags will be there and it will be easier to correctly place the images. If you are making an HTML, XML, or similar version yourself, replace the tags with links to the images.
[Illustration] tags with captions should be left in place for the reader to enjoy.
My text will make no sense if the actual illustrations aren't included.
If the text was scanned with the intention of just converting it to ASCII, email the Project Manager and ask for advice. If you are willing and able and the scans are available, they may ask you to do an HTML version and provide good quality scans for the missing images. Alternately, they may decide that the project doesn't really need the images and explain their opinion.
If the text really should be produced in a version that will allow future readers to view the images, and you are unable or unwilling to do the work to put it in such a version, return it and post your concerns in the Post-Processing Forum.
My book contains a few verses of poetry. How should I format them?
The proofreaders should have surrounded any verses of poetry with the markers /* and */. Remove the markers, and check against the original image to make sure that the formatting is correct. A quick scan should reveal if the spacing in your text and the original match.
Make sure that the indentation is consistent, at least within each poem. It's entirely possible that one proofreader indented some lines four spaces, while the proofreader who got the next page indented five, when the image shows the same amount of indentation.
Indent all of the poetry by 2 spaces (or more if you prefer), preserving any further indenting that the author intended.
DO NOT REWRAP LINES. You will have to take special care when you rewrap the text not to rewrap your poetry. However, if a line is broken in two due to its length, but it was not intended to be (the second line of these is usually highly indented and not capitalized), join the two parts of the line together. If they still don't fit on one line, break them and indent the second half heavily.
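The marker removal and two-space indent described above can be sketched in Python as follows. The function name is hypothetical, and the sketch assumes that each /* and */ marker sits on its own line. Since tables are marked with the same tags, run it only on blocks that you know are poetry.

```python
def indent_poetry(text, indent=2):
    """Drop /* and */ marker lines and indent the non-blank lines between them."""
    output = []
    in_poetry = False
    for line in text.split("\n"):
        stripped = line.strip()
        if stripped == "/*":
            in_poetry = True
        elif stripped == "*/":
            in_poetry = False
        elif in_poetry and stripped:
            # Existing indentation is kept and the extra indent is added in front.
            output.append(" " * indent + line.rstrip())
        else:
            output.append(line)
    return "\n".join(output)


sample = (
    "Prose before.\n\n/*\nFirst line of verse,\n"
    "  second line indented further.\n*/\n\nProse after."
)
print(indent_poetry(sample))
```

The sketch does not check the indentation against the page images or join lines that were broken for length, so those steps still have to be done by eye.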
My book is poetry. How do I format it?
Aside from the fact that poems won't be marked by poetry markers, the book should be formatted as for occasional verses (check the indentation, etc.). Also, the poem(s) need not be indented two spaces by default, as they are the main text, and don't need to stand out from it.
Books which are entirely comprised of poetry do not need to have their lines rewrapped (watch the Gutcheck output for overly long lines, though). However, the book may contain an introduction or other prose section which will need rewrapping.
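If you do not have Gutcheck output to hand, overly long lines are easy to list yourself. Here is a minimal Python sketch; the limit of 75 characters is an assumption borrowed from the Word macro description earlier in this FAQ, so adjust it to whatever width you are working to.

```python
def long_lines(text, max_width=75):
    """Return (line_number, length) for every line longer than max_width."""
    return [
        (number, len(line))
        for number, line in enumerate(text.splitlines(), start=1)
        if len(line) > max_width
    ]


sample = "short line\n" + "x" * 80 + "\nanother short line"
print(long_lines(sample))
```

Each reported line can then be broken by hand and its second half indented heavily, as described for poetry above.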
bconstan has graciously offered to aid anyone who needs extra help post-processing poetry books. If you have any questions, send her a message.
What do I do with tables?
Tables should have been marked with /* */ by the proofreaders, but a quick scan of the book should turn any up even if they are unmarked, as they are quite conspicuous.
You will have to move the text in the table around to make it as readable as possible. If the headings are broken between lines, put them on the same line if you can. Adjust the spacing of the columns so that they look good on the screen and aren't too close together. Make all of the column entries line up. If you are lucky, this formatting was already done for you, but not all proofreaders can format tables accurately (if their display font isn't monospace, for example). Watch for tables that span multiple pages, as they will be unlikely to have similar formatting. "Related" tables should be formatted consistently, if possible.
DO NOT REWRAP LINES! You don't want to destroy all of your hard work, now do you?
The table will not fit into the line widths allowed by PG.
You have a couple of options here:
These [Sidenote] tags seem redundant!
In most cases, sidenotes add a bit of summary or description to the text, but, in very rare cases, the sidenotes add nothing to the book and will be an annoyance rather than a help. If your book fits this mold, consider leaving out the sidenotes. BUT, think long and hard about this, as this is altering the text of the original, a DP no-no. Email the Project Manager and/or post in the Forums for a second opinion before taking this step.
How do I differentiate subheadings from headings?
Usually, the easiest way to differentiate between subheadings and headings is to change the line spacing (ex. leave three lines blank when a new heading begins, and only one for a new subheading). However, some texts may have more than one layer of subheading. In these cases, you will have to devise a markup which is appropriate to the text. You could indent subheadings a certain number of spaces depending on their "layer", for example.
Is there anything special that I should know about formatting indexes?
Pay attention to the presence/absence of trailing commas and semicolons. You may either leave these in or remove them, as long as you are consistent.
DO NOT REWRAP LINES. Unless you have placed a blank line between each and every entry, rewrapping will destroy the format of the index. Be careful!
If you are creating an HTML version, why not make a hyperlinked index?
My book has an errata page at the end. Should I correct the errors?
Yes. The list of errata at the end of the book reflects the author's intention, and one of the guiding principles of etext production is to preserve the author's original intent. First- and second-round proofreaders had access to only one page of the book at a time, so none of the errata errors will have been corrected by them. This job therefore falls to the post-processor, who has access to all of the pages of the book. Find and correct all of the errata, and delete the errata page from the book.
My text has accents, pound signs, or other non-ASCII characters in it. Should I preserve them in the final version?
In general, yes. Keep all of the accented words (or symbols) as they are. An ISO-8859-1 (also called Latin-1 or 8-bit ASCII) file can be made which preserves them. If the text is in a language containing many accents that are not found in ASCII or ISO-8859-1, there are other forms of encoding out there in which they can be preserved.
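To find out which non-ASCII characters a text actually contains, and whether an ISO-8859-1 file can hold them all, something like the following Python sketch may help. The function name is made up for illustration, and the sketch assumes the text has already been read into a Python string.

```python
from collections import Counter


def non_ascii_report(text):
    """Map each non-ASCII character to (count, fits_in_iso_8859_1)."""
    counts = Counter(ch for ch in text if ord(ch) > 127)
    report = {}
    for ch, count in counts.items():
        try:
            ch.encode("iso-8859-1")
            fits = True
        except UnicodeEncodeError:
            fits = False
        report[ch] = (count, fits)
    return report


sample = "café costs £5, said the œuvre café"
for ch, (count, fits) in non_ascii_report(sample).items():
    print(repr(ch), count, "Latin-1" if fits else "not in Latin-1")
```

In the sample, the accented e and the pound sign fit in ISO-8859-1, while the oe ligature does not. Any character reported as not fitting means the book needs either a substitution or one of the other encodings mentioned above.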
How do I handle footnotes, etc. in Greek, Russian, or other texts with a non-Latin alphabet?
If possible, the text should be transcribed into the Latin alphabet. It's not a lossless process, but it's the only way to preserve these snippets in ASCII. Information on the transcription process is available in the Proofreading Guidelines.
Many languages, like Arabic and Hebrew, are difficult to transcribe without an intimate knowledge of the language. If your text contains snippets of such a language and you don't have the knowledge to transcribe it yourself, try posting in the Forums to find someone to team up with for transcription.
If there is a significant amount of text in a non-Latin script, it may be worth making a Unicode (HTML) version, which would allow the original script to be preserved.
If you cannot transcribe the language, and you can't find anyone else who is capable of doing it for you, mark its presence with [Arabic] (for Arabic), and delete any OCR garbage that may have been left in the text. It's too bad that the information will be lost, but you've done the best you could.
My text has weird symbols in it (Zodiac signs, medical abbreviations, etc.). How do I mark these up?
If you are lucky, the proofreader will have done some research and found the meaning of the symbol for you. However, often the proofreader will mark the symbol with a * and leave you with the legwork.
If you know what the symbol represents, write out its meaning, e.g. [Symbol: Jupiter]. Do not try to replicate the symbol itself in ASCII.
If you have never run across the symbol before, here are a few web pages provided by proofreaders to help you:
This section has room to grow. If you find any other good links, please PM me.
Note that some symbols may have more than one meaning. If this is the case, try to determine the best meaning from the context of the symbol.
I want to make sure that I do a really good job post-processing my book. Are there any common errors that often make it through the checking system?
Yes, there are a few kinds of errors which often make it through both rounds of proofreading and into the final etext. These errors fit into three categories: specks that introduce punctuation, errors introduced by the tags used in proofreading, and "scannos" that can make it through a spell check.
To check for random punctuation caused by specks on the image that was OCRed, search for the following things:
A few errors are introduced by the HTML and other proofreading markup that DP uses for proofreading. To eliminate these, do the following:
There are myriad errors which will make it through a spell checker. If you would like to avoid tedious find-and-replaces, you could remove these words from your program's spell checker. Only a few of the most common scannos will be listed here.
A more complete, and constantly growing list is maintained by big_bill. If you have found a "stealth scanno" which isn't on his lists, send it to him. He also collects statistics on the appearance rates of those already on the lists, so if you care to keep count of sightings of old stealth scannos he'd be just as happy to accept those as reports of new ones.
The latest version (presently 1.22) of big_bill's lists can be found here:
Common English Scannos
Rare English Scannos
Theoretical (But as yet Unwitnessed) English Scannos
Common French Scannos
Rare French Scannos
Theoretical (But as yet Unwitnessed) French Scannos
Common German Scannos
Rare German Scannos
Theoretical (But as yet Unwitnessed) German Scannos
The lists are plain text, and could also be used by an adventurous programmer to check for common letter shifts (ex. h -> b) and such. The custom built post-processing tools make use of them, also.
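For the adventurous, here is a minimal Python sketch of checking a text against such a list. I am assuming, without having confirmed it against big_bill's files, that a list holds one word per line; the three words below are a made-up stand-in rather than entries copied from his lists.

```python
import re


def find_scannos(text, scanno_words):
    """Return (line_number, word) for each whole-word, case-insensitive hit."""
    wanted = {word.lower() for word in scanno_words}
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        for word in re.findall(r"[A-Za-z]+", line):
            if word.lower() in wanted:
                hits.append((number, word))
    return hits


# Hypothetical stand-in for a list file with one word per line.
scanno_list = "arid\ntbe\nmodem\n".split()
sample = "The land was dry arid barren.\nHe went to tbe door."
print(find_scannos(sample, scanno_list))
```

Every hit is only a candidate. A word such as "arid" is often perfectly correct, so check each one against the page image before changing anything.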
I want to do something special with this text. Can I make a version of the text in HTML/XML/etc.?
Yes! Feel free to make non-ASCII versions if you wish. As long as you also produce an ASCII version, PG will be glad to accept any other version that you may produce.
Check the project comments to find out whether the project manager has requested an HTML version. If they have not requested an HTML version, you may still create one if you wish, but it is not necessary. If you wish to work on a text which will need HTML treatment, you must either be willing to produce the extra version yourself or find a partner to do it for you.
There are several requirements for HTML.
First, be sure to read the PG HTML FAQ and follow all the requirements. Before submitting your HTML for PPV, please do the following:
There have been a few especially useful discussions in the Post-Processing Forum. One is a post which suggests a format for documents based on XHTML 1.0. Another is this DPWiki post which gives a suggested guide for HTML writers and provides an index of HTML-related topics.
Some useful HTML entity codes can be found here.
I checked out a math text, and it's full of strange markup. How should I treat it?
Most math texts which are being proofread use the LaTeX markup language to clarify the mathematical symbols and formulas in the book. This means that the post-processor of math books must understand this markup, and be willing to produce a LaTeX version for PG.
There's a page missing from the scans, or some words/pages are blurred/chopped off, etc.
Try emailing the Project Manager. If they still have the text, they may be able to clarify blurred or missing words, or give you a scan for a missing page.
If they don't have the text, join gutvol-d and post a message asking for help. Give the name and author of the book which you are working on, what you will require as help (usually looking at a paper book to clarify a few words), and how much work there will be (are only a few lines cut off, or are there whole pages missing?). These volunteers will reply to you if they have access to your book. You should then send them the text with comments so that they can find the damaged portions, correct them, and send them back to you. You could also do this process yourself if the book exists in your local library.
Please do not submit the project for Verification until the missing text has been found.
This project is too hard, or I don't have time for it, or I just don't want to do it any more! How do I get rid of it?
To dispose of your project and return it to the pool for another post-processor, go to your Post-Processing page, find the title of the book which you are returning, and select Return to Available from its drop-down menu. This will erase all of your changes and send it back to the pool for another post-processor. If you have done a lot of work on it, it might be better to arrange for someone else to pick up where you left off by making a post in the Post-Processing Forum.
Why are some projects split into different parts?
In the proofreading rounds, books intended for beginners were generally split into smaller units. This was not only to ensure a constant supply of projects for beginners, but to get feedback from mentors to the proofreaders faster than if the books were kept in one piece. It is encouraged, though not absolutely essential, for the same post-processor to check out all of the pieces of these books at the same time so that the formatting will be consistent throughout. The pieces should be joined together into one file for submission.
Post-processing verification is the "second round" of post-processing. The verifier looks over the post-processed etext for errors big and small, and submits it to Project Gutenberg. Verifiers will often provide feedback to post-processors as well, so that they can improve the quality of their work.
PPV's (as they are known) need to be experienced post-processors, familiar with common problems in etexts and able to provide feedback. Because of this, there is only a limited pool of people capable of PPVing. Once a person has submitted a number of consistently good etexts, the PPV will (at their discretion) give them permission to PPV projects themselves. If you have not been given this permission, you will not be able to check out PPV projects.
I have a question which isn't in this FAQ.
Post-processing involves common sense and personal judgement. The only solid rule is for the post-processor to preserve the author's intention to the best of their ability. There can be more than one way to handle a particular piece of formatting, and all of them can be right. You, as post-processor, have a great deal of freedom to decide how to handle particular formatting issues and make global format changes.
If your common sense and personal judgement aren't helping you solve some particular problem, post your question in the Post-Processing Forum. Other post-processors can then tell you how they would handle the situation. Their suggestions might give you a logical answer for your text, or inspire your own idea as to how to handle the issue.
If you think that something should be added to the FAQ, PM me. I don't mind answering questions, really I don't!
If you have any suggestions (ex. a nifty new spell checker or text editor that I haven't included, comments on grammar, anything!), please don't be afraid to tell me about them.