A Low-Resourced Peruvian Language Identification Model

Linares, A.E.; Oncevay, A.

dc.contributor.affiliation	Grupo de Inteligencia Artificial de la Pontificia Universidad Católica del Perú (IA-PUCP)
dc.contributor.affiliation	Pontificia Universidad Católica del Perú. Departamento de Ingeniería
dc.contributor.author	Linares, A.E.
dc.contributor.author	Oncevay, A.
dc.date.accessioned	2026-03-13T17:00:59Z
dc.date.issued	2017
dc.description.abstract	Due to the linguistic revitalization in Peru over the last few years, there is growing interest in reinforcing bilingual education in the country and in increasing research focused on its native languages. From the computer science perspective, one of the first steps to support the study of these languages is the implementation of an automatic language identification tool using machine learning methods. Therefore, this work focuses on two steps: (1) building a digital, annotated corpus for 16 Peruvian native languages extracted from documents in web repositories, and (2) fitting a supervised learning model for the language identification task using features identified from related studies in the state of the art, such as n-grams. The results obtained were promising (97% average precision), and the corpus and the model are expected to be used for more complex tasks in the future.
dc.description.sponsorship	Funding: The authors are thankful to J. Rubén Ruiz, bilingual education professor at NOPOKI, for providing access to some private books written in native languages (Universidad Católica Sedes Sapientiae, 2015; Díaz, 2012). Likewise, the collaboration of Dr. Roberto Zariquiey, linguistics professor at PUCP, is appreciated for allowing the use of his own corpus for the Panoan family (Zariquiey Biondi, 2011). Furthermore, the support of the “Consejo Nacional de Ciencia, Tecnología e Innovación Tecnológica” (CONCYTEC Perú) under contract 225-2015-FONDECYT is acknowledged.
dc.identifier.uri	http://hdl.handle.net/20.500.14657/206826
dc.language.iso	eng
dc.publisher	CEUR-WS
dc.relation.conferencename	CEUR Workshop Proceedings
dc.relation.ispartof	urn:issn:1613-0073
dc.rights	info:eu-repo/semantics/closedAccess
dc.subject	Learning systems
dc.subject	Big data
dc.subject	Education
dc.subject	Information management
dc.subject	Automatic language identification
dc.subject	Bilingual education
dc.subject	Complex task
dc.subject.ocde	https://purl.org/pe-repo/ocde/ford#5.03.00
dc.title	A Low-Resourced Peruvian Language Identification Model
dc.type	http://purl.org/coar/resource_type/c_5794
dc.type.other	Conference paper
dc.type.version	https://vocabularies.coar-repositories.org/version_types/c_970fb48d4fbd8a85/
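
The abstract mentions a supervised model for language identification built on n-gram features. The record does not state which classifier or n-gram settings the authors used, so the following is only a minimal sketch of the general idea, assuming character n-grams of length 1 to 3 and a multinomial Naive Bayes classifier with add-one smoothing. The class name `NgramNaiveBayes`, the helper `char_ngrams`, and the toy English and Spanish sentences are all hypothetical stand-ins; the paper's corpus of 16 Peruvian native languages is not included here.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n_min=1, n_max=3):
    """Return all character n-grams of length n_min..n_max from the padded text."""
    padded = f" {text.lower().strip()} "
    return [
        padded[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(padded) - n + 1)
    ]


class NgramNaiveBayes:
    """Hypothetical classifier: multinomial Naive Bayes over character n-gram counts."""

    def fit(self, texts, labels):
        self.counts = defaultdict(Counter)  # label -> n-gram counts
        self.docs = Counter(labels)         # label -> number of training texts
        for text, label in zip(texts, labels):
            self.counts[label].update(char_ngrams(text))
        # Vocabulary of every n-gram seen in training, used for smoothing.
        self.vocab = {g for c in self.counts.values() for g in c}
        return self

    def predict(self, text):
        total_docs = sum(self.docs.values())
        grams = char_ngrams(text)
        best_label, best_score = None, -math.inf
        for label, counts in self.counts.items():
            denom = sum(counts.values()) + len(self.vocab)
            # Log prior plus add-one smoothed log likelihood of each n-gram.
            score = math.log(self.docs[label] / total_docs)
            for gram in grams:
                score += math.log((counts[gram] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data (assumption: stands in for an annotated corpus).
train_texts = [
    "the children are learning to read at school",
    "this language is spoken in the north of the country",
    "los niños aprenden a leer en la escuela",
    "esta lengua se habla en el norte del país",
]
train_labels = ["eng", "eng", "spa", "spa"]

model = NgramNaiveBayes().fit(train_texts, train_labels)
print(model.predict("the teacher speaks two languages"))
print(model.predict("la profesora habla dos lenguas"))
```

Character n-grams are a common choice for this task because they need no tokenizer or dictionary, which matters when little annotated text is available for a language.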

Files

Original bundle

Name: paper3.pdf
Size: 635.08 KB
Format: Adobe Portable Document Format
Description: Full text

Collections

Artículos (DFI)