No data to crawl? Monolingual corpus creation from PDF files of truly low-resource languages in Peru

Bustamante, G.; Oncevay, A.; Zariquiey, R.

dc.contributor.affiliation	Pontificia Universidad Católica del Perú
dc.contributor.affiliation	Pontificia Universidad Católica del Perú. Departamento de Humanidades
dc.contributor.author	Bustamante, G.
dc.contributor.author	Oncevay, A.
dc.contributor.author	Zariquiey, R.
dc.date.accessioned	2026-03-13T17:01:05Z
dc.date.issued	2020
dc.description.abstract	We introduce new monolingual corpora for four indigenous and endangered languages from Peru: Shipibo-Konibo, Ashaninka, Yanesha and Yine. Given the total absence of these languages on the web, extracting and processing text from PDF files is relevant in a truly low-resource language scenario. Our procedure for monolingual corpus creation includes language-specific and language-agnostic steps, and focuses on educational PDF files with multilingual sentences, noisy pages and low-structured content. Through an evaluation based on language modelling and character-level perplexity on a subset of manually extracted sentences, we determine that our method allows the creation of clean corpora for the four languages, a key resource for natural language processing tasks nowadays.
dc.description.sponsorship	Funding: We are grateful to the computational linguistics team at PUCP: John Miller, Erasmo Gómez, Kervy Rivas, Gema Silva, Gildo Valero, Jaime Montoya and Gonzalo Acosta. Similarly, we thank the bilingual teachers from the UCSS who provided their own crafted material for the evaluation, and more specifically Juan Rubén Ruiz for his support. We also appreciate the comments of Fernando Alva-Manchego on a draft version and the feedback of our anonymous reviewers. Finally, we acknowledge the research grant of the “Consejo Nacional de Ciencia, Tecnología e Innovación Tecnológica” (CONCYTEC, Peru) under the contract 183-2018-FONDECYT-BM-IADT-MU, and the support of NVIDIA Corporation with the donation of a Titan Xp GPU used for the study.
dc.identifier.uri	http://hdl.handle.net/20.500.14657/206833
dc.language.iso	eng
dc.publisher	European Language Resources Association (ELRA)
dc.relation.conferencename	LREC 2020 - 12th International Conference on Language Resources and Evaluation, Conference Proceedings
dc.relation.uri	https://aclanthology.org/2020.lrec-1.356/
dc.rights	info:eu-repo/semantics/closedAccess
dc.subject	Yine
dc.subject	Ashaninka
dc.subject	Corpus creation
dc.subject	Endangered languages
dc.subject	Indigenous languages
dc.subject	Low-resource languages
dc.subject	Monolingual corpus
dc.subject	PDF processing
dc.subject	Shipibo-Konibo
dc.subject	Yanesha
dc.subject.ocde	https://purl.org/pe-repo/ocde/ford#6.02.02
dc.title	No data to crawl? Monolingual corpus creation from PDF files of truly low-resource languages in Peru
dc.type	http://purl.org/coar/resource_type/c_5794
dc.type.other	Conference paper
dc.type.version	https://vocabularies.coar-repositories.org/version_types/c_970fb48d4fbd8a85/

Collections

Artículos (DFI)
