Language identification with scarce data: A case study from Peru

Espichán-Linares, A.; Oncevay, A.

doi:10.1007/978-3-319-90596-9_7

Date

2018

Authors

Espichán-Linares, A.

Oncevay, A.

Publisher

Springer Verlag

URI

http://hdl.handle.net/20.500.14657/205966

DOI

https://doi.org/10.1007/978-3-319-90596-9_7

Abstract

Language identification is a fundamental task in natural language processing, where corpus-based methods dominate state-of-the-art results in multilingual settings. However, the task still needs to be extended to scenarios with scarce data and many classes, which requires analyzing which of the best-known methods fits best. Peru, a multicultural and multilingual country, poses a considerable challenge in this respect. This study therefore proceeds in three steps: (1) building from scratch an annotated digital corpus for 49 Peruvian indigenous languages and dialects, (2) fitting both standard and deep learning approaches to language identification, and (3) statistically comparing the results obtained. As expected, the standard model outperforms the deep learning one, reaching 95.9% average precision, and both the corpus and the model will be useful inputs for more complex tasks in the future.

Keywords

Computer science, Identification (biology), Natural language processing, Indigenous, Task (project management), Artificial intelligence, Scratch, Indigenous language, Face (sociological concept), Linguistics, Programming language

Collections

Artículos (DFI)

Full-text access is restricted to the PUCP community
