Language identification with scarce data: A case study from Peru

Espichán-Linares, A.; Oncevay, A.

doi:https://doi.org/10.1007/978-3-319-90596-9_7

dc.contributor.affiliation	Pontificia Universidad Católica del Perú
dc.contributor.author	Espichán-Linares, A.
dc.contributor.author	Oncevay, A.
dc.date.accessioned	2026-03-13T16:58:35Z
dc.date.issued	2018
dc.description.abstract	Language identification is a fundamental task in natural language processing, where corpus-based methods achieve state-of-the-art results in multilingual settings. However, this application needs to be extended to scenarios with scarce data and many classes, which requires analyzing which of the best-known methods fits best. Peru, as a multicultural and multilingual country, poses such a challenge. This study therefore proceeds in three steps: (1) building from scratch a digital annotated corpus for 49 Peruvian indigenous languages and dialects, (2) fitting both standard and deep learning approaches for language identification, and (3) statistically comparing the results obtained. As expected, the standard model outperforms the deep learning one, with 95.9% average precision, and both the corpus and the model will be useful inputs for more complex tasks in the future.
dc.description.sponsorship	Funding: The authors acknowledge the support of the “Consejo Nacional de Ciencia, Tecnología e Innovación Tecnológica” (CONCYTEC Perú) under contract 225-2015-FONDECYT.
dc.identifier.doi	https://doi.org/10.1007/978-3-319-90596-9_7
dc.identifier.uri	http://hdl.handle.net/20.500.14657/205966
dc.language.iso	eng
dc.publisher	Springer Verlag
dc.relation.conferencename	Communications in Computer and Information Science; Vol. 795 (2018)
dc.relation.ispartof	urn:isbn:978-3-319-90596-9
dc.rights	info:eu-repo/semantics/closedAccess
dc.subject	Computer science
dc.subject	Identification (biology)
dc.subject	Natural language processing
dc.subject	Indigenous
dc.subject	Task (project management)
dc.subject	Artificial intelligence
dc.subject	Scratch
dc.subject	Indigenous language
dc.subject	Face (sociological concept)
dc.subject	Linguistics
dc.subject	Programming language
dc.subject.ocde	https://purl.org/pe-repo/ocde/ford#1.02.01
dc.title	Language identification with scarce data: A case study from Peru
dc.type	http://purl.org/coar/resource_type/c_5794
dc.type.other	Conference paper
dc.type.version	https://vocabularies.coar-repositories.org/version_types/c_970fb48d4fbd8a85/

Collections

Artículos (DFI)
