Morphological Evaluation of Subword Vocabularies Used by Large Language Models

doi: https://doi.org/10.31810/rsel.54.1.4

Authors

Óscar García-Sierra, A. Fernández-Pampillón Cesteros, M. Ortega-Martín

Keywords:

tokenization; morphemes; subwords; large language models; Spanish language

Abstract

Traditional tokenization methods based on linguistic rules have been replaced by statistical segmentation algorithms. Although these algorithms are more efficient and can build subword vocabularies from large corpora without human supervision, the resulting subwords do not consistently correspond to morphemes. This paper addresses this issue by proposing an evaluation methodology and applying it to assess the morphological quality of the Spanish vocabularies produced by three prominent subword tokenization algorithms (BPE, WordPiece, and Unigram) commonly used in Large Language Models (LLMs). Three gold standards were created to measure the relevance, coherence, and morphological accuracy of the vocabularies of six tokenizers trained on a Spanish corpus, exploring different vocabulary sizes. The evaluation results indicate that none of the three algorithms is suitable for accurately representing Spanish morphology.
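As context for the evaluation, the sketch below shows how such subword vocabularies are typically built: it trains BPE, WordPiece, and Unigram tokenizers on a Spanish plain-text corpus with the Hugging Face tokenizers library and prints how each one segments a morphologically complex word. It is a minimal illustration, not the study's actual setup: the corpus path, the 30,000-token vocabulary size, and the example word are assumptions made for the sketch.

    # Minimal illustrative sketch (not the paper's setup): train BPE, WordPiece
    # and Unigram subword vocabularies on a Spanish corpus and compare how they
    # segment a morphologically complex word. The corpus path and vocabulary
    # size below are assumptions.
    from tokenizers import Tokenizer
    from tokenizers.models import BPE, WordPiece, Unigram
    from tokenizers.trainers import BpeTrainer, WordPieceTrainer, UnigramTrainer
    from tokenizers.pre_tokenizers import Whitespace

    CORPUS = ["corpus_es.txt"]   # hypothetical Spanish plain-text corpus
    VOCAB_SIZE = 30_000          # one of several sizes that could be explored

    def train(model, trainer):
        # Build a tokenizer around the given subword model and train it on the corpus.
        tok = Tokenizer(model)
        tok.pre_tokenizer = Whitespace()
        tok.train(files=CORPUS, trainer=trainer)
        return tok

    trained = {
        "BPE": train(BPE(unk_token="[UNK]"),
                     BpeTrainer(vocab_size=VOCAB_SIZE, special_tokens=["[UNK]"])),
        "WordPiece": train(WordPiece(unk_token="[UNK]"),
                           WordPieceTrainer(vocab_size=VOCAB_SIZE, special_tokens=["[UNK]"])),
        "Unigram": train(Unigram(),
                         UnigramTrainer(vocab_size=VOCAB_SIZE, unk_token="[UNK]",
                                        special_tokens=["[UNK]"])),
    }

    # A rough morphemic segmentation would be institucion-al-iza-ción; the
    # statistical subword splits printed below need not coincide with it.
    word = "institucionalización"
    for name, tok in trained.items():
        print(f"{name:10s} -> {tok.encode(word).tokens}")

Deciding whether splits like these align with morpheme boundaries requires a morphological gold standard, which is what the paper proposes and applies for Spanish.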



Published

2024-07-15

How to Cite

García-Sierra, Óscar, Fernández-Pampillón Cesteros, A., & Ortega-Martín, M. (2024). Morphological Evaluation of Subword Vocabularies Used by Large Language Models. Revista Española de Lingüística, 54(1), 103-130. https://doi.org/10.31810/rsel.54.1.4 (retrieved from http://revista.sel.edu.es/index.php/revista/article/view/2209)
