Language identification with scarce data: A case study from Peru

dc.contributor.affiliationPontificia Universidad Católica del Perú
dc.contributor.authorEspichán-Linares, A.
dc.contributor.authorOncevay, A.
dc.date.accessioned2026-03-13T16:58:35Z
dc.date.issued2018
dc.description.abstractLanguage identification is a fundamental task in natural language processing, where corpus-based methods achieve state-of-the-art results in multilingual setups. However, these methods need to be extended to scenarios with scarce data and many classes, which requires analyzing which of the best-known methods fits best. Peru, as a multicultural and multilingual country, offers a demanding test case. This study therefore proceeds in three steps: (1) building from scratch a digital annotated corpus for 49 Peruvian indigenous languages and dialects, (2) fitting both standard and deep learning approaches for language identification, and (3) statistically comparing the results obtained. As expected, the standard model outperforms the deep learning one, with 95.9% average precision, and both the corpus and the model will be useful inputs for more complex tasks in the future.
dc.description.sponsorshipFunding: The authors acknowledge the support of the “Consejo Nacional de Ciencia, Tecnología e Innovación Tecnológica” (CONCYTEC Perú) under contract 225-2015-FONDECYT.
dc.identifier.doihttps://doi.org/10.1007/978-3-319-90596-9_7
dc.identifier.urihttp://hdl.handle.net/20.500.14657/205966
dc.language.isoeng
dc.publisherSpringer Verlag
dc.relation.conferencenameCommunications in Computer and Information Science; Vol. 795 (2018)
dc.relation.ispartofurn:isbn:978-3-319-90596-9
dc.rightsinfo:eu-repo/semantics/closedAccess
dc.subjectComputer science
dc.subjectIdentification (biology)
dc.subjectNatural language processing
dc.subjectIndigenous
dc.subjectTask (project management)
dc.subjectArtificial intelligence
dc.subjectScratch
dc.subjectIndigenous language
dc.subjectFace (sociological concept)
dc.subjectLinguistics
dc.subjectProgramming language
dc.subject.ocdehttps://purl.org/pe-repo/ocde/ford#1.02.01
dc.titleLanguage identification with scarce data: A case study from Peru
dc.typehttp://purl.org/coar/resource_type/c_5794
dc.type.otherConference paper
dc.type.versionhttps://vocabularies.coar-repositories.org/version_types/c_970fb48d4fbd8a85/
