arXiv (CS.CV) 2026-06-16 12:00 DOI: arXiv:2606.15987

A Text Recognition Dataset from Sahidic Coptic Ancient Manuscripts

Abstract

In this work, we target Handwritten Text Recognition (HTR) in low-resource scenarios, which arise from underrepresented languages, rare scripts, and degraded visual conditions typical of historical documents. We introduce SCAM (Sahidic Coptic Ancient Manuscripts), a new line-level dataset built from digitized ancient manuscripts written in the extinct Sahidic Coptic dialect. The dataset reflects a realistic and challenging setting, as it combines heterogeneous acquisition conditions across libraries with typical manuscript degradations such as ink fading, bleed-through, and material deterioration. In addition to visual complexity, SCAM poses significant linguistic challenges due to the scarcity of resources for Sahidic Coptic, its uncommon alphabet, and dialect-specific diacritics. To support research in low-resource HTR, we benchmark several state-of-the-art approaches based on different paradigms, highlighting their limitations and strengths in this setting. Our results underline the gap between current HTR performance on well-resourced modern scripts and historically grounded, low-resource scenarios, thus providing a reference point for future developments.
