Building an efficient OCR system for historical documents with little training data

Martínek, Jiří; Lenc, Ladislav; Král, Pavel

Full metadata record

DC Field	Value	Language
dc.contributor.author	Martínek, Jiří
dc.contributor.author	Lenc, Ladislav
dc.contributor.author	Král, Pavel
dc.date.accessioned	2021-03-08T11:00:21Z	-
dc.date.available	2021-03-08T11:00:21Z	-
dc.date.issued	2020
dc.identifier.citation	MARTÍNEK, J. LENC, L. KRÁL, P. Building an efficient OCR system for historical documents with little training data. Neural Computing and Applications, 2020, roč. 32, č. 23, s. 17209-17227. ISSN 1433-3058.	cs
dc.identifier.issn	1433-3058
dc.identifier.uri	2-s2.0-85084519412
dc.identifier.uri	http://hdl.handle.net/11025/42814
dc.description.abstract	S rychlým nárůstem počtu digitalizovaných historických dokumentů vzniká potřeba umožnit efektivní vyhledávání informací a extrakci znalostí, aby bylo možné tato data zpřístupnit. Tyto úlohy jsou závislé na optickém rozpoznání znaků (OCR), které umožní převod dokumentů do textové podoby. Článek představuje sadu metod, které umožňují provedení OCR na historických dokumentech s minimálními nároky na množství reálných, manuálně anotovaných, dat. Prezentovaný OCR systém zahrnuje analýzu rozložení stránky spolu s detekcí textových bloků a segmentací řádek textu a také samotný OCR modul. Segmentační metody jsou založeny na plně konvolučních neuronových sítích a OCR modul využívá rekurentní sítě. Je ukázáno, že jak segmentace tak i OCR jsou možné s malým množstvím anotovaných dat. Cílem experimentů bylo nalézt efektivní postup pro dosažení dobrých výsledků s použitím malého množství trénovacích dat. Výsledky ukazují, že je možné dosáhnout srovnatelných, nebo i lepších výsledků, než poskytují nejlepší současné OCR systémy.	cs
dc.format	19 s.	cs
dc.format.mimetype	application/pdf
dc.language.iso	en	en
dc.publisher	Springer	en
dc.relation.ispartofseries	Neural Computing and Applications	en
dc.rights	© Springer	en
dc.subject	CNN	cs
dc.subject	FCN	cs
dc.subject	historické dokumenty	cs
dc.subject	LSTM	cs
dc.subject	neuronová síť	cs
dc.subject	OCR	cs
dc.subject	Porta fontium	cs
dc.subject	syntetická data	cs
dc.title	Building an efficient OCR system for historical documents with little training data	en
dc.title.alternative	Vytvoření efektivního OCR systému pro historické dokumenty s malým množstvím trénovacích dat	cs
dc.type	článek	cs
dc.type	article	en
dc.rights.access	openAccess	en
dc.type.version	publishedVersion	en
dc.description.abstract-translated	As the number of digitized historical documents has increased rapidly, it is necessary to provide efficient methods of information retrieval and knowledge extraction to make the data accessible. Such methods depend on optical character recognition (OCR), which converts document images into textual representations. This paper introduces a set of methods that allow performing OCR on historical document images using only a small amount of real, manually annotated training data. The presented OCR system includes two main tasks: page layout analysis, including text block and line segmentation, and OCR. Our segmentation methods are based on fully convolutional networks, and the OCR approach utilizes recurrent neural networks. We show that both the segmentation and OCR tasks are feasible with only a few annotated real data samples. The experiments aim to determine the best way to achieve good performance with the given small set of data. We also demonstrate that the obtained scores are comparable to or even better than those of several state-of-the-art systems.	en
dc.subject.translated	CNN	en
dc.subject.translated	FCN	en
dc.subject.translated	Historical documents	en
dc.subject.translated	LSTM	en
dc.subject.translated	Neural network	en
dc.subject.translated	OCR	en
dc.subject.translated	Porta fontium	en
dc.subject.translated	Synthetic data	en
dc.identifier.doi	10.1007/s00521-020-04910-x
dc.type.status	Peer-reviewed	en
dc.identifier.document-number	531222300001
dc.identifier.obd	43929970
Appears in Collections:	Články / Articles (NTIS) Články / Articles (KIV) OBD

Files in This Item:

File	Size	Format
Martínek2020_Article_BuildingAnEfficientOCRSystemFo.pdf	4,63 MB	Adobe PDF

Please use this identifier to cite or link to this item: http://hdl.handle.net/11025/42814
