No data to crawl? Monolingual corpus creation from PDF files of truly low-resource languages in Peru

dc.contributor.affiliation: Pontificia Universidad Católica del Perú
dc.contributor.affiliation: Pontificia Universidad Católica del Perú. Departamento de Humanidades
dc.contributor.author: Bustamante, G.
dc.contributor.author: Oncevay, A.
dc.contributor.author: Zariquiey, R.
dc.date.accessioned: 2026-03-13T17:01:05Z
dc.date.issued: 2020
dc.description.abstract: We introduce new monolingual corpora for four indigenous and endangered languages from Peru: Shipibo-Konibo, Ashaninka, Yanesha and Yine. Given the total absence of these languages on the web, extracting and processing texts from PDF files is relevant in a truly low-resource language scenario. Our procedure for monolingual corpus creation combines language-specific and language-agnostic steps, and focuses on educational PDF files with multilingual sentences, noisy pages and low-structured content. Through an evaluation based on language modelling and character-level perplexity on a subset of manually extracted sentences, we determine that our method allows the creation of clean corpora for the four languages, a key resource for natural language processing tasks today.
dc.description.sponsorship: Funding: We are grateful to the computational linguistics team at PUCP: John Miller, Erasmo Gómez, Kervy Rivas, Gema Silva, Gildo Valero, Jaime Montoya and Gonzalo Acosta. We also thank the bilingual teachers from the UCSS who provided their own crafted material for the evaluation, in particular Juan Rubén Ruiz for his support. We appreciate the comments of Fernando Alva-Manchego on a draft version and the feedback of our anonymous reviewers. Finally, we acknowledge the research grant of the “Consejo Nacional de Ciencia, Tecnología e Innovación Tecnológica” (CONCYTEC, Peru) under contract 183-2018-FONDECYT-BM-IADT-MU, and the support of NVIDIA Corporation with the donation of a Titan Xp GPU used for the study.
dc.identifier.uri: http://hdl.handle.net/20.500.14657/206833
dc.language.iso: eng
dc.publisher: European Language Resources Association (ELRA)
dc.relation.conferencename: LREC 2020 - 12th International Conference on Language Resources and Evaluation, Conference Proceedings
dc.relation.uri: https://aclanthology.org/2020.lrec-1.356/
dc.rights: info:eu-repo/semantics/closedAccess
dc.subject: Yine
dc.subject: Ashaninka
dc.subject: Corpus creation
dc.subject: Endangered languages
dc.subject: Indigenous languages
dc.subject: Low-resource languages
dc.subject: Monolingual corpus
dc.subject: PDF processing
dc.subject: Shipibo-Konibo
dc.subject: Yanesha
dc.subject.ocde: https://purl.org/pe-repo/ocde/ford#6.02.02
dc.title: No data to crawl? Monolingual corpus creation from PDF files of truly low-resource languages in Peru
dc.type: http://purl.org/coar/resource_type/c_5794
dc.type.other: Conference paper
dc.type.version: https://vocabularies.coar-repositories.org/version_types/c_970fb48d4fbd8a85/
