Optical Character Recognition of Historical Texts: End-User Focused Research for Slovenian Books and Newspapers from the 18th and 19th Century


Ines Jerele, Tomaž Erjavec, Daša Pokorn, Alenka Kavčič-Čolić




This paper presents research aimed at improving OCR quality in the large-scale digitisation of newspapers and books and at enabling full-text search of digitised old Slovenian printed texts, so that digital library end-users obtain better transcriptions of the digitised content. It describes ongoing work by the National and University Library of Slovenia and the Jožef Stefan Institute, within the EU research project IMPACT (Improving Access to Text), to develop high-quality datasets, in particular ground-truth transcriptions (a clean corpus) and a lexicon of historical Slovene.