Outlier detection as a method for knowledge extraction from digital resources

Eugenia Stoimenova, Plamen Mateev, Milena Dobreva

Mass digitization leads to the gathering of large amounts of data and metadata in electronic form. Commonly, they are used for representation and data harvesting. In information retrieval, some records differ markedly from the bulk of the data. They appear more unusual than one would expect from the rest of the records and from the "knowledge" about the underlying process that generates the information items. Such records are usually called "outliers". Information about outliers can lead to substantial improvements in the model. It can also lead to discoveries that are valuable in themselves. The basic aim of this study is to demonstrate what knowledge can be extracted by studying the outliers in a collection of metadata on Bulgarian mediaeval manuscripts. The distribution of document size is investigated using statistical techniques. Several outliers were identified as misprints; others were identified as documents with a non-standard purpose. The distribution of the extent data showed a structure that might be explained by paper-folding preferences. An appropriate technique for analysing the distribution was applied, and the manuscripts were presented according to their chronological distribution.
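The abstract does not name the statistical techniques used, so the following is only a minimal sketch of one common way to flag unusual document sizes: Tukey's interquartile-range fences. The function name `tukey_outliers`, the multiplier `k = 1.5`, and the folio counts are all assumptions made for illustration; none of them come from the study or its data.

```python
from statistics import quantiles


def tukey_outliers(values, k=1.5):
    """Return the values lying outside the fences [Q1 - k*IQR, Q3 + k*IQR].

    This is a generic rule of thumb, not the method used in the study.
    """
    q1, _, q3 = quantiles(values, n=4)  # quartiles of the sample
    iqr = q3 - q1                       # interquartile range
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical folio counts, invented purely for illustration.
folio_counts = [96, 112, 120, 128, 136, 144, 152, 160, 168, 1760]
outliers = tukey_outliers(folio_counts)
print(outliers)
```

A record flagged in this way is only a candidate: whether it reflects a misprint in the metadata or a genuinely unusual manuscript still has to be decided by inspecting the record itself.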