Behind the Scenes: Digitization

Getting a newspaper digitized and online is no simple task. It takes the US Caribbean & Florida Digital Newspaper Project (USCFDNP) two years and dozens of people from around the world to digitize approximately 100,000 pages of historic newspapers.

Microfilm

The digitization process starts with microfilm. Most historic newspapers held in libraries today survive on microforms. The two most common types of microforms are microfilm and microfiche. Our project digitizes from reels of microfilm containing tiny images of each page of the newspaper. Each image is only about 4% of the size of the original sheet of paper.

The first reel of microfilm is called a “master negative”. All other copies of the film derive from this original. The first copy made is called a “silver duplicate negative”. This is a copy of the master negative that can be used to make all other copies.

Copying microfilm causes wear and damage to the film over time, so a duplicate must be made to protect the master negative and preserve the original image of the newspaper. The master negative is only copied to replace lost or destroyed silver duplicates.

Silver duplicate negatives are used to generate all the other copies of the microfilm including silver positive copies. If you use microfilm as a patron in a library, you are using silver positive film, often called a “service copy”. A positive image is the reverse image of the negative, so the ink appears black and the paper appears white as in the original physical newspaper. When the microfilm is digitized, the digital images are also positive copies.

Duplication

We first inspect positive service copies to assess the condition of the paper and take a closer look at the content to decide if it will be digitized through the grant.

Once we decide which reels we want to digitize, we order duplicate negatives. An external vendor copies the master negatives to create the duplicate negatives that we digitize from and later send to the Library of Congress for its archives.

Collation

After the duplicate reels arrive, the Project Coordinator begins the collation process. To collate the microfilm, the coordinator uses a microfilm machine and a computer program to view the film and collect important information about what’s on the film and in the newspaper.

Project Coordinator Sarah Tew viewing microfilm

The collator records key information about the images on the film, including how they’re filmed, how many pages are in each issue, the date of each issue, any visible damage to each page, whether any pages are missing, whether any pages were photographed more than once, and the different languages that appear on each page. The digitization vendor uses this data to determine which images belong together, which to leave out, and what metadata to associate with each page and issue.
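To make the collation data more concrete, here is a minimal sketch of how the information for one issue could be recorded. The class names, field names, and frame numbers are hypothetical and chosen only for illustration; they are not the format used by the project or its vendors.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class PageRecord:
    """One image on the reel (hypothetical structure)."""
    frame: int                          # position of the image on the reel
    languages: List[str]                # languages that appear on the page
    damage: str = ""                    # note on any visible damage
    duplicate_of: Optional[int] = None  # frame this image repeats, if any


@dataclass
class IssueRecord:
    """All images that belong to one dated issue (hypothetical structure)."""
    issue_date: date
    pages: List[PageRecord] = field(default_factory=list)
    missing_pages: List[int] = field(default_factory=list)

    def frames_to_keep(self) -> List[int]:
        """Frames to keep: everything except pages photographed twice."""
        return [p.frame for p in self.pages if p.duplicate_of is None]


# Illustrative values only.
issue = IssueRecord(
    issue_date=date(1906, 3, 10),
    pages=[
        PageRecord(frame=1, languages=["eng"]),
        PageRecord(frame=2, languages=["eng", "spa"], damage="torn corner"),
        PageRecord(frame=3, languages=["eng", "spa"], duplicate_of=2),
    ],
)
print(issue.frames_to_keep())
```

In this sketch, the third image is a second exposure of the second page, so it is the one left out.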

The microfilm machine (ScanPro3000) used to view the content on the reels.

Screenshot of the program used to view the newspapers on reels. Displayed is a partial view of an issue of the *Pensacola Journal*.

Scanning

Once the film has been collated, it gets carefully packed and shipped to a scanning vendor in the US. This team scans the reel of microfilm and creates TIFF files, the first digital copies of the film. The physical film stays with the scanning vendor, while the TIFFs are sent to a second vendor for the remaining digitization steps.

Optical Character Recognition (OCR)

One of the most important parts of digitization is OCR. With OCR, a computer “reads” the image of the paper itself and identifies letters based on the shapes it sees in the image. It is then able to generate a text file of the printed words and map the text to its location in the image. This is how the highlighted search feature in Chronicling America works.

Image of text search for “Everglades” in Chronicling America with the “Everglades” highlighted on the page. The Sun, Jacksonville, Fla., 10 March 1906.

OCR today is good but still imprecise. When searching digitized newspapers, remember that OCR may misread, or fail to recognize as text at all, anything handwritten and any text in special fonts, especially in advertisements, cartoons, and special section headings. You can access the OCR text for every page in Chronicling America, and it is also overlaid in the PDFs.
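The idea of mapping recognized text to its location in the image can be illustrated with a short sketch. It assumes each recognized word is stored with a bounding box; the `OcrWord` class, the `find_highlights` function, and the coordinates are all made up for illustration and do not describe how Chronicling America is implemented.

```python
from dataclasses import dataclass


@dataclass
class OcrWord:
    """A recognized word and where it sits in the page image (hypothetical)."""
    text: str
    x: int
    y: int
    width: int
    height: int


def find_highlights(words, query):
    """Return the bounding box of every OCR word that matches the query."""
    needle = query.casefold()
    return [
        (w.x, w.y, w.width, w.height)
        for w in words
        # Ignore case and surrounding punctuation when comparing.
        if w.text.strip(".,;:!?\"'()").casefold() == needle
    ]


# Invented coordinates for three words on one line of a page.
page = [
    OcrWord("Draining", 120, 340, 210, 48),
    OcrWord("the", 340, 340, 70, 48),
    OcrWord("Everglades.", 420, 340, 290, 48),
]
print(find_highlights(page, "everglades"))
```

A viewer could then draw a highlight over each returned box. If OCR misreads a word, no box matches, which is why an imperfect reading can hide a page from search results.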

File Types

In addition to the OCR file, the digitization vendor also automatically creates many other file types for each page including JP2s, JPGs, PDFs, and XMLs. The JP2s are the highest quality image files accessible in Chronicling America. The JPGs are used for image downloads using the snip tool. All PDFs are text searchable.

Metadata

Every page, every issue, and every newspaper title comes with robust metadata. Metadata is data about data. Newspaper metadata in the XML files includes publication date, language, page numbers, the owner of the film, and technical information about the microfilm and scanner used in digitization.
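As a rough illustration of metadata stored in XML, the sketch below reads a few fields from a small sample record. The element names and the sample values are invented for this example and are much simpler than the XML files a digitization vendor actually delivers; only Python's standard `xml.etree.ElementTree` module is used.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified record. This is not the project's real schema.
SAMPLE = """
<page>
  <title>The Sun</title>
  <issueDate>1906-03-10</issueDate>
  <language>eng</language>
  <pageNumber>1</pageNumber>
  <filmOwner>Example Library</filmOwner>
</page>
"""


def read_page_metadata(xml_text):
    """Return each child element of the record as a tag -> text entry."""
    root = ET.fromstring(xml_text.strip())
    return {child.tag: (child.text or "").strip() for child in root}


meta = read_page_metadata(SAMPLE)
print(meta["title"], meta["issueDate"], meta["language"])
```

Once the fields are in a structure like this, a search system can filter pages by date or language without reading the page image at all.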

Complete, correct metadata is vital for searching the papers with filters and categories. Besides the metadata encoded in the XMLs, more metadata comes to Chronicling America through cataloging. Each newspaper has a catalog record that includes some of the same information as the XMLs, such as the name of the paper, publication dates, and languages. The catalog record also includes many more fields, like where the paper was published, editors’ names, and associated titles, to better describe the paper as a whole and where to find it.

List page for La Gaceta with metadata fields on left, including subject headings at bottom.

Subject headings are one of the most important types of metadata for searching and finding papers. They describe the contents of a paper and are used in Chronicling America and library catalogs to help filter results by topic. In Chronicling America, they are used in the search fields for the ethnic press and for the labor press of different industries.

Sending Digital Files to the Library of Congress

Once a group of reels has been digitized, the project coordinator receives the digital files, double-checks them for accuracy, copies them to a physical hard drive, and mails them to the Library of Congress. Once at the Library, they are again checked for accuracy and then enter the ingest queue. After a few months they go live on Chronicling America for everyone to use.

Sending Microfilm to the Library of Congress

At the very end of the process, once all files have been accepted by the Library of Congress, the project coordinator calls all the duplicate negative reels back from the scanning vendor and ships the reels to the Library of Congress for physical storage.

100 duplicate microfilm reels packed in a large box

It’s important to keep physical and digital copies of the newspapers in multiple places in case they are lost or damaged, or in case future technological advances make reprocessing necessary.

Duplicate Digital Files

The digital newspapers are available through Chronicling America, but the University of Florida also hosts copies across different platforms, including:

University of Florida Digital Collections (UFDC)
Florida Digital Newspaper Library (FDNL)
Digital Library of the Caribbean (dLOC)

Papers are ingested into these platforms about a year after they are available in Chronicling America.

The newspaper digitization process is usually long, often tedious, and always worth it. Happy reading!

Further resources

US Caribbean & Florida Digital Newspaper Project, “Pre-Digitization”. Dec. 3, 2015. https://ufndnp.domains.uflib.ufl.edu/pre-digitization/

Association for Library Collections & Technical Services, “Microfilm Terminology”. https://www.ala.org/alcts/resources/collect/serials/microforms03

University of Arizona Libraries, “How to Use Microfilm and Microfiche”. April 17, 2017. https://www.youtube.com/watch?v=HxXhLhTHkD0