Преглед НЦД 22 (2013), 67–74

 

 

Geneviève Cron

Bibliothèque Nationale de France, Paris, France

 

OCR RATE COMPUTATION IN MASS DIGITIZATION PROGRAMS

 

Abstract: When libraries began digitizing their collections about 20 years ago, the main concern was scanning quality, in order to obtain the best possible images for both conservation and dissemination. Since 2005, the Bibliothèque Nationale de France (French National Library) has been issuing tenders that include conversion from image to text. This conversion can be done by software (Optical Character Recognition, OCR), manually, or by a combination of both. Whichever process is used, the library expects a certain quality of the transcribed text. After describing the context, we present the academic method of computing the accuracy of an OCR output. We then set out the parameters that any content holder needs to take into account when defining how OCR accuracy is computed. This paper does not give a universal formula for this evaluation; rather, it raises the questions that must be answered in order to define one. It shows that estimating the quality of the text depends on many factors, including the intended use of the transcribed text. We also show that the reliability of this assessment is probably far from meeting libraries' expectations.

 

Keywords: digitization, OCR, Bibliothèque Nationale de France, IMPACT project