Corpus Development and POS Tagging Evaluation for Riau Malay Dialect Using Hidden Markov Model in a Low-Resource Setting
Natural Language Processing (NLP) plays a crucial role in enabling machines to process and understand human language. One of the fundamental tasks in NLP is Part-of-Speech (POS) tagging, which serves as the foundation for various downstream applications such as parsing, information extraction, and machine translation. However, the development of POS tagging models for low-resource languages remains a significant challenge due to the limited availability of annotated corpora. This study aims to develop a POS-tagged corpus for Bahasa Melayu Dialek Riau (BMDR) and evaluate the performance of a Hidden Markov Model (HMM) as a baseline approach for POS tagging. The dataset consists of approximately 600 sentences with around 10,000 tokens, which were manually annotated and validated using Inter-Annotator Agreement. The annotated corpus was then divided into training and testing sets with a ratio of 80:20. Experimental results show that the HMM model achieved an accuracy of 86.8%, with precision, recall, and F1-score values of 85.9%, 85.3%, and 85.6%, respectively. The results indicate that HMM remains a competitive approach for POS tagging in low-resource language settings. Error analysis reveals that lexical ambiguity, Out-of-Vocabulary (OOV) words, and limited training data are the primary factors affecting model performance. This research contributes by providing the first annotated POS corpus for BMDR, evaluating the effectiveness of HMM in a low-resource context, and offering insights into linguistic challenges in regional languages. Future work may explore larger datasets and advanced deep learning models to improve tagging performance.
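The HMM baseline and the 80:20 evaluation described above can be illustrated with a minimal sketch. It assumes a first-order (bigram) HMM with add-one smoothing and Viterbi decoding; the abstract does not state the smoothing scheme, tag set, or OOV handling actually used, so those choices are assumptions here. The tiny English corpus and the names `train_hmm`, `log_prob`, `viterbi`, and `accuracy` are hypothetical placeholders, not the BMDR data or the authors' code.

```python
import math
from collections import Counter, defaultdict


def train_hmm(tagged_sentences):
    """Count tag transitions and word emissions from (word, tag) sentences."""
    transitions = defaultdict(Counter)  # previous tag -> counts of next tag
    emissions = defaultdict(Counter)    # tag -> counts of emitted word
    vocab = set()
    for sentence in tagged_sentences:
        prev = "<s>"  # sentence-start pseudo-tag
        for word, tag in sentence:
            word = word.lower()
            transitions[prev][tag] += 1
            emissions[tag][word] += 1
            vocab.add(word)
            prev = tag
    return transitions, emissions, sorted(emissions), vocab


def log_prob(counter, key, num_outcomes, k=1.0):
    """Add-k smoothed log probability (assumed scheme); unseen keys get some mass."""
    return math.log((counter[key] + k) / (sum(counter.values()) + k * num_outcomes))


def viterbi(words, model):
    """Return the most probable tag sequence for `words` under the HMM."""
    if not words:
        return []
    transitions, emissions, tags, vocab = model
    n_tags = len(tags)
    n_words = len(vocab) + 1  # one extra outcome reserved for OOV words
    words = [w.lower() for w in words]
    # best[i][t] = (log-probability of the best path ending in t at i, previous tag)
    best = [{}]
    for t in tags:
        score = (log_prob(transitions["<s>"], t, n_tags)
                 + log_prob(emissions[t], words[0], n_words))
        best[0][t] = (score, None)
    for i in range(1, len(words)):
        best.append({})
        for t in tags:
            emit = log_prob(emissions[t], words[i], n_words)
            best[i][t] = max(
                (best[i - 1][p][0] + log_prob(transitions[p], t, n_tags) + emit, p)
                for p in tags
            )
    # Follow back-pointers from the best final tag.
    path = [max(tags, key=lambda t: best[-1][t][0])]
    for i in range(len(words) - 1, 0, -1):
        path.append(best[i][path[-1]][1])
    return path[::-1]


def accuracy(model, test_sentences):
    """Token-level accuracy of Viterbi predictions against gold tags."""
    correct = total = 0
    for sentence in test_sentences:
        gold = [tag for _, tag in sentence]
        predicted = viterbi([word for word, _ in sentence], model)
        correct += sum(p == g for p, g in zip(predicted, gold))
        total += len(gold)
    return correct / total if total else 0.0


# Placeholder corpus (NOT the BMDR corpus), split 80:20 in order.
corpus = [
    [("the", "DET"), ("dog", "NOUN"), ("runs", "VERB")],
    [("a", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("runs", "VERB")],
    [("a", "DET"), ("dog", "NOUN"), ("sleeps", "VERB")],
    [("the", "DET"), ("bird", "NOUN"), ("sleeps", "VERB")],
]
split = int(0.8 * len(corpus))
train_set, test_set = corpus[:split], corpus[split:]

model = train_hmm(train_set)
predicted_tags = viterbi(["the", "bird", "sleeps"], model)  # "bird" is OOV
test_accuracy = accuracy(model, test_set)
print(predicted_tags, test_accuracy)
```

In this toy setup the held-out word "bird" never appears in training, so its tag is decided by the transition probabilities alone. That is one way an HMM can cope with OOV words, and it also shows why sparse training data and ambiguous contexts lead to the kinds of errors the abstract reports.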
Copyright (c) 2026 Rifky Akbar Vetian, Koko Handoko, Andi Maslan, Alfannisa Annurrullah Fajrin

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.










