Corpus Development and Pos Tagging Evaluation for Riau Malay Dialect Using Hidden Markov Model in A Low-Resource Setting

Rifky Akbar Vetian; Koko Handoko; Andi Maslan; Alfannisa Annurrullah Fajrin

doi:10.59188/eduvest.v6i4.53066

Authors

Rifky Akbar Vetian, Universitas Putera Batam, Indonesia (rifky.vetian@puterabatam.ac.id)
Koko Handoko, Universitas Putera Batam, Indonesia
Andi Maslan, Universitas Putera Batam, Indonesia
Alfannisa Annurrullah Fajrin, Universitas Putera Batam, Indonesia

Vol. 6 No. 4 (2026): Eduvest - Journal of Universal Studies

Articles

April 22, 2026

Abstract

Natural Language Processing (NLP) enables machines to process and understand human language. Part-of-Speech (POS) tagging is one of its fundamental tasks and underpins downstream applications such as parsing, information extraction, and machine translation. Developing POS tagging models for low-resource languages remains a significant challenge, however, because annotated corpora are scarce. This study develops a POS-tagged corpus for the Riau Malay dialect (Bahasa Melayu Dialek Riau, BMDR) and evaluates a Hidden Markov Model (HMM) as a baseline approach for POS tagging. The dataset consists of approximately 600 sentences with around 10,000 tokens, which were manually annotated and validated using Inter-Annotator Agreement. The annotated corpus was then split into training and testing sets at an 80:20 ratio. Experimental results show that the HMM achieved an accuracy of 86.8%, with precision, recall, and F1-score values of 85.9%, 85.3%, and 85.6%, respectively. These results indicate that an HMM remains a competitive approach for POS tagging in low-resource language settings. Error analysis reveals that lexical ambiguity, Out-of-Vocabulary (OOV) words, and limited training data are the primary factors affecting model performance. The study's contributions are the first annotated POS corpus for BMDR, an evaluation of the effectiveness of HMM in a low-resource context, and insights into the linguistic challenges of regional languages. Future work may explore larger datasets and advanced deep learning models to improve tagging performance.
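To make the baseline concrete, the following is a minimal sketch of a bigram HMM POS tagger with Viterbi decoding. It is not the authors' implementation: the add-one smoothing, the function names (`train_hmm`, `log_prob`, `viterbi`), the tag set, and the three toy English sentences are all assumptions introduced here for illustration, and none of the data comes from the BMDR corpus. The sketch also shows how an Out-of-Vocabulary word is handled: with smoothing, an unseen word receives the same small emission probability under every tag, so the tag transitions decide its label.

```python
import math
from collections import Counter, defaultdict

def train_hmm(tagged_sentences):
    """Count tag-to-tag transitions and tag-to-word emissions."""
    trans = defaultdict(Counter)  # trans[prev_tag][tag]
    emit = defaultdict(Counter)   # emit[tag][word]
    vocab = set()
    for sentence in tagged_sentences:
        prev = "<s>"  # sentence-start marker
        for word, tag in sentence:
            trans[prev][tag] += 1
            emit[tag][word] += 1
            vocab.add(word)
            prev = tag
        trans[prev]["</s>"] += 1  # sentence-end marker
    return trans, emit, sorted(emit), vocab

def log_prob(counter, key, n_outcomes):
    """Add-one smoothed log probability (an assumed smoothing choice)."""
    return math.log((counter[key] + 1) / (sum(counter.values()) + n_outcomes))

def viterbi(words, model):
    """Return the most probable tag sequence for a list of words."""
    trans, emit, tags, vocab = model
    if not words:
        return []
    n_trans = len(tags) + 1   # every tag plus the end marker
    n_emit = len(vocab) + 1   # known words plus one slot for OOV words
    # best[i][tag] = (log score of best path ending in tag, previous tag)
    best = [{t: (log_prob(trans["<s>"], t, n_trans)
                 + log_prob(emit[t], words[0], n_emit), None) for t in tags}]
    for i in range(1, len(words)):
        column = {}
        for t in tags:
            e = log_prob(emit[t], words[i], n_emit)
            column[t] = max(
                (best[i - 1][p][0] + log_prob(trans[p], t, n_trans) + e, p)
                for p in tags
            )
        best.append(column)
    # Pick the final tag, including the transition to the end marker.
    last = max(tags, key=lambda t: best[-1][t][0]
               + log_prob(trans[t], "</s>", n_trans))
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        path.append(best[i][path[-1]][1])  # follow backpointers
    return path[::-1]

# Invented toy training data (not from the BMDR corpus).
TOY_TRAIN = [
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("the", "DET"), ("dog", "NOUN"), ("runs", "VERB")],
    [("a", "DET"), ("dog", "NOUN"), ("sleeps", "VERB")],
]
model = train_hmm(TOY_TRAIN)

print(viterbi(["a", "cat", "runs"], model))
print(viterbi(["the", "bird", "sleeps"], model))  # "bird" is OOV
```

In a setup like the one the abstract describes, `train_hmm` would be fitted on the 80% training portion and `viterbi` applied to each sentence of the 20% test portion before computing accuracy, precision, recall, and F1.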

How to Cite

Vetian, R. A., Handoko, K., Maslan, A., & Fajrin, A. A. (2026). Corpus Development and Pos Tagging Evaluation for Riau Malay Dialect Using Hidden Markov Model in A Low-Resource Setting. Eduvest - Journal of Universal Studies, 6(4), 4817–4825. https://doi.org/10.59188/eduvest.v6i4.53066
