Accessible Scientific PDFs for All

Description

PDF is the most widely used document format for providing and distributing information on the internet. It was developed by Adobe in 1993 and has been an open standard since 2008. A 2015 estimate put the number of PDF documents on the internet at more than 2.5 trillion, covering all aspects of life and research, and that number is growing every day. This is especially true for the scientific community, where PDF documents make up the majority of publications.

Despite more than 40 years of research on different aspects of document accessibility, a very large portion of PDF documents are still partially or completely inaccessible to people with visual impairments, who make up more than 10% of the world population. Given the ever-growing number of PDF documents, these people are becoming more and more excluded from digital information. Inaccessible PDFs also present a major barrier for people with visual impairments who wish to pursue studies or careers in the STEM (Science, Technology, Engineering, and Mathematics) fields, as they cannot easily read studies and publications from their field. Scientific journals and conferences have become more aware of this issue in recent years, and there is increasing demand for a solution.

Currently, the only way to make PDF documents fully accessible is to manually add special "tags" to the different structural elements while creating the document. This is very time-consuming, especially for scientific PDFs. Most PDF authors are either unaware of PDF accessibility in general, lack the knowledge needed to make their documents accessible, or are unable or unwilling to put in the additional effort. The main reason is that no reasonably automated method exists to make PDFs accessible, owing to the huge challenges involved, e.g. the automatic detection of the page structure and the corresponding reading order, and the accurate automatic recognition and interpretation of tables, formulas, and graphics.

Different approaches have been researched for each of these aspects, but their results are insufficient to allow full automation of the respective processing steps. In view of these open research issues, the goal of this project is not to fully automate the process of making PDFs accessible, but rather to minimize the burden on authors by automating as many steps in the process as possible.

In a first phase, new approaches will be researched that allow automatic detection of the structure and reading order of a (scientific) PDF page, building on deep-learning approaches developed at our institute in the area of newspaper article segmentation [1]. An accurate segmentation of the content elements is crucial for the success of the subsequent processing steps. In the second phase, our research will concentrate on how to make formulas accessible, as they are a major part of PDFs in the STEM fields.

The result will be a semi-automatic system that allows efficient conversion of any PDF into an accessible PDF by identifying, recognizing, and converting the different content elements of a document in such a way that a screen reader can read this information out in a meaningful manner. This will allow users with visual impairments to explore and understand the content and structure of an otherwise inaccessible PDF document.
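To make the reading-order problem mentioned above more concrete, the following is a minimal sketch of what linearizing already-segmented page elements for a screen reader could look like. It is not the project's method, which builds on deep-learning segmentation. It is a naive geometric heuristic, and the names `Block`, `reading_order`, and `linearize`, as well as the fixed equal-width column layout, are assumptions made purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    """A segmented page element (hypothetical structure, for illustration)."""
    label: str   # e.g. "heading", "paragraph", "formula"
    x: float     # left edge of the bounding box
    y: float     # top edge of the bounding box (0 = top of page)
    text: str    # recognized content


def reading_order(blocks, page_width, n_columns=2):
    """Sort blocks column by column, then top to bottom within a column.

    Assumes equal-width columns, which real scientific layouts
    (full-width titles, floating figures, footnotes) often violate.
    """
    column_width = page_width / n_columns

    def key(block):
        # Clamp so a block at the right page edge stays in the last column.
        column = min(int(block.x // column_width), n_columns - 1)
        return (column, block.y)

    return sorted(blocks, key=key)


def linearize(blocks, page_width, n_columns=2):
    """Return the labelled text sequence a screen reader could read out."""
    ordered = reading_order(blocks, page_width, n_columns)
    return [f"{b.label}: {b.text}" for b in ordered]


if __name__ == "__main__":
    page = [
        Block("paragraph", 320, 100, "Right column top"),
        Block("heading", 40, 50, "Introduction"),
        Block("formula", 40, 400, "E = m c^2"),
        Block("paragraph", 40, 120, "Left column text"),
    ]
    for line in linearize(page, page_width=600):
        print(line)
```

A heuristic like this breaks down as soon as a page deviates from a regular column grid, which is one reason the structure and reading-order detection is treated as a research problem in its own right.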

Key Data

Project lead

Prof. Dr. Alireza Darvishy

Co-project lead

Prof. Dr. Thilo Stadelmann

Project team

Prof. Dr. Hans-Peter Hutter, Felix Schmitt-Koopmann

Project status

ongoing, started 04/2021

Funding partner

BRIDGE Discovery / Project No. 194677

Publications

MathNet: a data-centric approach for printed mathematical expression recognition

2024 Schmitt-Koopmann, Felix; Huang, Elaine M.; Hutter, Hans-Peter; Stadelmann, Thilo; Darvishy, Alireza
