TY - JOUR
T1 - Extracting text from scanned Arabic books
T2 - a large-scale benchmark dataset and a fine-tuned Faster-R-CNN model
AU - Elanwar, Randa
AU - Qin, Wenda
AU - Betke, Margrit
AU - Wijaya, Derry
N1 - Funding Information:
The authors would like to thank the library staff at the Mugar Library at Boston University, the Rotch Library at MIT, and the Widener and Fine Arts libraries at Harvard University for facilitating the collection process of our dataset BE-Arabic-9K. The authors thank the National Science Foundation (1838193) and the Hariri Institute for Computing at Boston University for partial support of this work.
Publisher Copyright:
© 2021, The Author(s), under exclusive licence to Springer-Verlag GmbH Germany, part of Springer Nature.
PY - 2021/12
Y1 - 2021/12
N2 - Datasets of documents in Arabic are urgently needed to promote computer vision and natural language processing research that addresses the specifics of the language. Unfortunately, publicly available Arabic datasets are limited in size and restricted to certain document domains. This paper presents the release of BE-Arabic-9K, a dataset of more than 9000 high-quality scanned images from over 700 Arabic books. Among these, 1500 images have been manually segmented into regions and labeled by their functionality. BE-Arabic-9K includes book pages with a wide variety of complex layouts and page contents, making it suitable for various document layout analysis and text recognition research tasks. The paper also presents a page layout segmentation and text extraction baseline model based on fine-tuned Faster R-CNN structure (FFRA). This baseline model yields cross-validation results with an average accuracy of 99.4% and F1 score of 99.1% for text versus non-text block classification on 1500 annotated images of BE-Arabic-9K. These results are remarkably better than those of the state-of-the-art Arabic book page segmentation system ECDP. FFRA also outperforms three other prior systems when tested on a competition benchmark dataset, making it an outstanding baseline model to challenge.
AB - Datasets of documents in Arabic are urgently needed to promote computer vision and natural language processing research that addresses the specifics of the language. Unfortunately, publicly available Arabic datasets are limited in size and restricted to certain document domains. This paper presents the release of BE-Arabic-9K, a dataset of more than 9000 high-quality scanned images from over 700 Arabic books. Among these, 1500 images have been manually segmented into regions and labeled by their functionality. BE-Arabic-9K includes book pages with a wide variety of complex layouts and page contents, making it suitable for various document layout analysis and text recognition research tasks. The paper also presents a page layout segmentation and text extraction baseline model based on fine-tuned Faster R-CNN structure (FFRA). This baseline model yields cross-validation results with an average accuracy of 99.4% and F1 score of 99.1% for text versus non-text block classification on 1500 annotated images of BE-Arabic-9K. These results are remarkably better than those of the state-of-the-art Arabic book page segmentation system ECDP. FFRA also outperforms three other prior systems when tested on a competition benchmark dataset, making it an outstanding baseline model to challenge.
UR - http://www.scopus.com/inward/record.url?scp=85117632510&partnerID=8YFLogxK
U2 - 10.1007/s10032-021-00382-4
DO - 10.1007/s10032-021-00382-4
M3 - Article
AN - SCOPUS:85117632510
VL - 24
SP - 349
EP - 362
JO - International Journal on Document Analysis and Recognition
JF - International Journal on Document Analysis and Recognition
SN - 1433-2833
IS - 4
ER -