Foreign Language Press Survey


Editing and Text Encoding

The Newberry version of the Chicago Foreign Language Press Survey consists of nearly 50,000 articles transcribed in XML files that conform to a schema based on the guidelines of the Text Encoding Initiative (TEI). The Newberry worked with a digitization vendor, PanGeo Partners of Chicago, which created a base XML transcription for each article from digital images. Much effort went into capturing the structure of each article's metadata so that the information could later be extracted into a database. The transcriptions record the Internet Archive image identifier for each sheet and preserve page breaks and page numbering. Within the body of the text, only paragraphs and simple table structures are represented.
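
As a rough illustration of the structure described above, the following Python sketch groups paragraphs under the page break that precedes them. The fragment is invented for illustration: the use of a `facs` attribute to hold an image identifier, the identifier values themselves, and the absence of an XML namespace are all assumptions, not a description of the actual survey files.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment. The real files may differ in nesting,
# attribute names, and namespace usage.
SAMPLE = """
<text>
  <body>
    <pb n="1" facs="example_image_0001"/>
    <p>First paragraph on page one.</p>
    <p>Second paragraph on page one.</p>
    <pb n="2" facs="example_image_0002"/>
    <p>Paragraph on page two.</p>
  </body>
</text>
""".strip()


def paragraphs_by_page(xml_text):
    """Group paragraph text under the most recent page break."""
    root = ET.fromstring(xml_text)
    pages = []
    # iter() walks the tree in document order, so each <p> is
    # attached to the last <pb> seen before it.
    for el in root.iter():
        if el.tag == "pb":
            pages.append(
                {"n": el.get("n"), "image": el.get("facs"), "paragraphs": []}
            )
        elif el.tag == "p" and pages:
            pages[-1]["paragraphs"].append("".join(el.itertext()).strip())
    return pages


pages = paragraphs_by_page(SAMPLE)
for page in pages:
    print(page["n"], page["image"], len(page["paragraphs"]))
```

Paragraphs that appear before any page break are skipped in this sketch; a fuller version would need to decide how to handle them.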

After the initial transcription, an editing phase of the project checked the vendor's work and then examined the articles in bulk to evaluate the work of the original WPA project. Although the 1930s editors and proofreaders took care to maintain a high degree of quality, some inconsistencies and errors inevitably made it through their review. The Newberry project transformed the vendor XML files into new, expanded files, mapping key metadata fields into TEI header elements. A modified TEI schema suited to this phase of work allowed these fields to be constrained further, which made it possible to identify and correct typographical and other errors. In addition, this editing step put date values into a consistent format when possible, ensured that subject codes matched the project's list, and made newspaper and source names more consistent. To the extent possible, we have made such corrections in the TEI header. In the body, we have left errors and inconsistencies uncorrected where the transcription faithfully reproduces an error in the original, as opposed to a mistake introduced during digitization.
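
The date and subject-code checks described above could be sketched along the following lines in Python. This is only an illustration under stated assumptions: the input date formats, the placeholder subject codes, and the function names `normalize_date` and `unknown_codes` are all hypothetical, and the project itself applied its constraints through a modified TEI schema rather than a script like this one.

```python
from datetime import datetime

# Hypothetical input formats; the formats actually found in the
# survey's metadata are not listed in this document.
DATE_FORMATS = ("%Y-%m-%d", "%b. %d, %Y", "%B %d, %Y")

# Placeholder values standing in for the project's list of subject codes.
KNOWN_SUBJECT_CODES = {"I A 1 a", "II B 2 d", "III C"}


def normalize_date(raw):
    """Return an ISO 8601 date string, or None if no format matches."""
    cleaned = " ".join(raw.split())  # collapse stray whitespace
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).date().isoformat()
        except ValueError:
            continue
    # "When possible": unparseable values are flagged, not guessed at.
    return None


def unknown_codes(codes):
    """Return the codes that do not appear in the known list."""
    return [c for c in codes if " ".join(c.split()) not in KNOWN_SUBJECT_CODES]


print(normalize_date("Jan. 5, 1912"))
print(normalize_date("sometime in 1915"))
print(unknown_codes(["I A 1 a", "I A 1 z"]))
```

Returning `None` for a value that cannot be parsed keeps the original string available for manual review instead of silently replacing it.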

Where we have been unable as yet to transcribe part of a text, the gap element in the original XML identifies a missing span, which may range from a single letter to a full page. Tabular information is represented with elements for tables, rows, and cells.
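
A minimal Python sketch of how such gaps and tables might be read is shown below. The element names `gap`, `table`, `row`, and `cell` and the attributes `extent` and `unit` come from the general TEI guidelines; whether the survey files use exactly these attributes, and whether they declare a namespace, is an assumption here. The sample content is invented.

```python
import xml.etree.ElementTree as ET

# Invented fragment for illustration only; no namespace is assumed.
SAMPLE = """
<body>
  <p>The meeting was held at <gap extent="1" unit="word"/> Hall.</p>
  <table>
    <row><cell>Name</cell><cell>Amount</cell></row>
    <row><cell>J. Novak</cell><cell>$5.00</cell></row>
  </table>
</body>
""".strip()


def find_gaps(root):
    """List (extent, unit) for every untranscribed span."""
    return [(g.get("extent"), g.get("unit")) for g in root.iter("gap")]


def read_tables(root):
    """Return each table as a list of rows, each row a list of cell strings."""
    return [
        [
            ["".join(cell.itertext()).strip() for cell in row.iter("cell")]
            for row in table.iter("row")
        ]
        for table in root.iter("table")
    ]


root = ET.fromstring(SAMPLE)
print(find_gaps(root))
print(read_tables(root))
```

Because `iter` descends through the whole subtree, this sketch assumes tables are not nested inside one another, which is consistent with the "simple table structures" described in the text.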