Corpus Development and POS Tagging Evaluation for Riau Malay Dialect Using Hidden Markov Model in a Low-Resource Setting

Part-of-Speech Tagging, Hidden Markov Model, Natural Language Processing, Low-Resource Language, Bahasa Melayu Dialek Riau, Corpus Annotation

April 22, 2026

Natural Language Processing (NLP) enables machines to process and understand human language. One of its fundamental tasks is Part-of-Speech (POS) tagging, which underpins downstream applications such as parsing, information extraction, and machine translation. Developing POS tagging models for low-resource languages remains difficult, however, because annotated corpora are scarce. This study develops a POS-tagged corpus for Bahasa Melayu Dialek Riau (BMDR) and evaluates a Hidden Markov Model (HMM) as a baseline POS tagger. The dataset consists of approximately 600 sentences with around 10,000 tokens, which were manually annotated and validated through inter-annotator agreement. The annotated corpus was split 80:20 into training and test sets. In the experiments, the HMM achieved an accuracy of 86.8%, with precision, recall, and F1-score of 85.9%, 85.3%, and 85.6%, respectively. These results indicate that HMM remains a competitive approach for POS tagging in low-resource language settings. Error analysis shows that lexical ambiguity, Out-of-Vocabulary (OOV) words, and limited training data are the primary factors affecting model performance. The contributions of this research are the first annotated POS corpus for BMDR, an evaluation of HMM effectiveness in a low-resource context, and insights into the linguistic challenges of regional languages. Future work may explore larger datasets and advanced deep learning models to improve tagging performance.
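To make the baseline concrete, the following is a minimal sketch of a bigram HMM tagger with Viterbi decoding, not the implementation used in this study. It assumes add-one smoothing for both transition and emission probabilities, which the abstract does not specify, and it uses a tiny English toy corpus as a stand-in for the BMDR data. The function names (train_hmm, viterbi, accuracy) and the tag set are illustrative only.

```python
import math
from collections import Counter, defaultdict

START = "<s>"  # hypothetical sentence-start marker

def train_hmm(tagged_sentences):
    """Count tag transitions and word emissions from (word, tag) sentences."""
    transitions = defaultdict(Counter)  # previous tag -> counts of next tag
    emissions = defaultdict(Counter)    # tag -> counts of emitted words
    vocab = set()
    for sentence in tagged_sentences:
        prev = START
        for word, tag in sentence:
            transitions[prev][tag] += 1
            emissions[tag][word] += 1
            vocab.add(word)
            prev = tag
    return transitions, emissions, sorted(emissions), vocab

def log_prob(counter, key, num_outcomes):
    """Add-one smoothed log probability (an assumption of this sketch)."""
    return math.log((counter[key] + 1) / (sum(counter.values()) + num_outcomes))

def viterbi(words, model):
    """Return the most probable tag sequence for a list of words."""
    transitions, emissions, tags, vocab = model
    if not words:
        return []
    n_tags = len(tags)
    n_words = len(vocab) + 1  # one extra slot shared by all unseen (OOV) words
    # best[i][tag] = (log score of best path ending in tag at position i, previous tag)
    best = [{}]
    for tag in tags:
        score = (log_prob(transitions[START], tag, n_tags)
                 + log_prob(emissions[tag], words[0], n_words))
        best[0][tag] = (score, None)
    for i in range(1, len(words)):
        best.append({})
        for tag in tags:
            emit = log_prob(emissions[tag], words[i], n_words)
            best[i][tag] = max(
                (best[i - 1][p][0] + log_prob(transitions[p], tag, n_tags) + emit, p)
                for p in tags
            )
    # Backtrack from the best final tag.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for i in range(len(words) - 1, 0, -1):
        tag = best[i][tag][1]
        path.append(tag)
    return path[::-1]

def accuracy(model, tagged_sentences):
    """Token-level accuracy of Viterbi predictions against gold tags."""
    correct = total = 0
    for sentence in tagged_sentences:
        words = [w for w, _ in sentence]
        gold = [t for _, t in sentence]
        pred = viterbi(words, model)
        correct += sum(p == g for p, g in zip(pred, gold))
        total += len(gold)
    return correct / total

# Toy English data for illustration only; it is not drawn from the BMDR corpus.
train = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("a", "DET"), ("dog", "NOUN"), ("sleeps", "VERB")],
]
held_out = [[("a", "DET"), ("cat", "NOUN"), ("barks", "VERB")]]

model = train_hmm(train)
print(viterbi(["the", "bird", "sleeps"], model))  # "bird" is OOV in the toy data
print(accuracy(model, held_out))
```

In this sketch an OOV word receives the same smoothed emission probability under every tag, so its tag is decided by the transition probabilities alone. That is one simple way to see why OOV words and lexical ambiguity show up as error sources for an HMM trained on a small corpus.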