While developing Paperity, we encountered the problem of extracting full text from PDF documents. Vast majority of academic papers are published as PDFs, and we wanted to unlock their contents and make them searchable in Paperity.
Extracting text from a PDF document is one of the hardest practical problems that seem easy on the first sight. It should not be much different than using a word processor, right? Absolutely wrong. PDF format is designed for laying out pages and faithfully reproducing the same visual layout everywhere, be it a screen or a printer. Therefore, it does not consist of a continuous stream of letters, words, and sentences; but of pages and objects with specific sizes and coordinates relative to the page. This is a very low-level representation that must be thoroughly preprocessed before it can be analyzed as a complete text. Moreover, PDF authoring tools apply different typographical tricks while converting the text to PDF. For example, letters “f” and “i” are typically joined in a single “glyph”, to make them look better when printed, so that, for instance, the word “justification” in your word processor becomes “justiﬁcation” (note the single character that is a combination of “f” and “i”) when converted to PDF. This adds another level of complexity while extracting text. Split words at the end of lines pose another problem.
We evaluated several toolkits designed for text extraction from PDF documents. In this article, we will share our findings and the rationale behind our final choice of CERMINE – the extraction tool developed by ADA Lab and CeON in ICM UW. After reviewing many packages, we shortlisted the following three open source tools:
- PyPDF2: Written in Python, PyPDF2 is the successor of pyPDF and mainly focuses on document manipulation.
- Apache PDFBox: Written in Java, it allows creation of new PDF documents, manipulation of existing documents and the ability to extract contents from documents.
- CERMINE: Written in Java, designed specifically for analysis of scholarly articles; it is both a library and a web service for extracting metadata and content from scientific papers.
We will use a sample Open Access paper (Sulfolobus chromatin proteins modulate strand displacement by DNA polymerase B1, Nucleic Acids Research, 2013, Vol. 41, No. 17, accessible at http://paperity.org/p/34709961) to demonstrate some of the test cases we were concerned with. An image of the first page is included here for understanding the test results:
Content Selection Comparison
The first problem that arises during analysis of scholarly PDFs is how to correctly detect different blocks of text. A scholarly paper is not a continuous stream of words. Rather, it contains many different sections that can be arranged in very different ways on the page and inside entire document. Each block plays a different role, therefore it is important to correctly detect all of them and discover what roles they play in the document. Only CERMINE was able to do this job. Below we give a preview of outputs of all the three tools.
(Please note that the red arrows denote line wraps.)
At the beginning of the text extracted by PDFMiner and PDFBox, you can see meta data (title, journal name, authors) and article abstract, all combined into one stream of text. Moreover, paragraphs are most often broken into many separate lines in the output stream. CERMINE, on the other hand, can properly concatenate lines that belong to the same paragraph. It also detects the type of each block and starts extraction directly with the main body of the article. When necessary, CERMINE can also extract meta data fields as separate items in the output XML file.
All the three tools did a good job eliminating the download link written vertically on the right hand side of the pages.
Paragraph Structure and Formatting Comparison
Below is a more detailed example of how paragraphs are processed by the evaluated tools.
We can see that both PDFMiner and PDFBox produce line breaks wherever the original PDF has one, while CERMINE successfully joins lines into paragraphs, retaining the document structure. In other experiments, we have also found that CERMINE can actually represent the paragraph structure with high accuracy.
Moreover, only CERMINE was able to successfully merge split words: like “com-pacted” in the example above, merged to “compacted”. This is a very important feature, especially in documents with multi-column layout, where split words are very common. While PDFMiner was unable to parse typographical shorthands (like joined f and i), both PDFBox and CERMINE were able to interpret them correctly.
All in all, we found CERMINE to be a very sophisticated and reliable tool for analysis of scholarly PDFs. It does an excellent job in detecting different types of blocks, preserving the structure of paragraphs and decoding characters correctly. What is also important, CERMINE is open source, which enables unconstrained use and guarantees that the tool will be easy to customize when necessary and will have a growing community of users and contributors. That is why we decided to use it in Paperity.
We hope that CERMINE will be further developed in the future and one of possible directions that we might suggest is the recognition of national characters – a difficult task, given the peculiar encoding used in many PDFs, but very important for scholarly community and surely the one that can be successfully solved by CERMINE team.
This article was co-written with Marcin Wojnarski and previously published at ADA Lab blog. Paperity (paperity.org) is the first multi-disciplinary aggregator of peer-reviewed Open Access journals and papers, “gold” and “hybrid”. It was launched in the beginning of October 2014 to facilitate access to scholarly literature across all different fields and already now includes nearly 200,000 articles from over 2,000 journals.