Eduvest – Journal of Universal Studies, Volume 4, Number 09, September 2024, p-ISSN 2775-3735, e-ISSN 2775-3727

IMPLEMENTATION OF XGBOOST ALGORITHM TO PREDICT THE SELLING PRICE OF CAYENNE PEPPERS IN DKI JAKARTA
Dhafin Riando¹, Afiyati Afiyati² ¹,²Universitas Mercu Buana, Indonesia. Email: afiyati.reno@mercubuana.ac.id
ABSTRACT
This research focuses on applying the XGBoost algorithm to analyze and predict cayenne pepper prices. The main aim is to exploit XGBoost's capability to manage large datasets and discern intricate patterns for precise price forecasting. The dataset comprises historical cayenne pepper price data, along with pertinent economic and climatic factors. The XGBoost model was developed and validated on this dataset, with its performance assessed using the R² score and cross-validation. The results indicated a high level of accuracy, with an R² score of 0.99 on the training set and 0.92 on the test set, reflecting a strong alignment between predicted and actual prices. Moreover, the model attained an average cross-validation score of 0.96, reinforcing its robustness and reliability. These findings highlight XGBoost's efficacy in agricultural price prediction, offering stakeholders a potent tool for data-driven decision-making. This study enriches the literature on machine learning applications in agriculture and emphasizes XGBoost's potential to enhance predictive accuracy and operational efficiency.
KEYWORDS: XGBoost Algorithm, Price Prediction, Cayenne Pepper, Agricultural Markets, Predictive Analytics.
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License

INTRODUCTION

The price of cayenne pepper in Indonesia often fluctuates significantly, especially during the rainy season or when distribution disruptions occur. This instability is influenced by factors such as unpredictable weather conditions, pest attacks, and complex logistical challenges, particularly in remote areas with inadequate infrastructure. This price volatility affects not only consumers but also farmers and agricultural businesses. Price uncertainty can cause substantial financial losses, reduce farmers' welfare, and disrupt the overall economic stability of the agricultural sector. Therefore, an efficient method is needed to analyze and predict the price of cayenne pepper to enable better and more timely decision-making for all parties involved (Yuditya et al., 2023).

Machine learning algorithms, particularly XGBoost (Extreme Gradient Boosting), have proven useful in various fields for handling large and complex data and providing accurate predictions. XGBoost excels due to its ability to handle datasets with many features and complex interactions between variables. Its speed and its ability to manage missing values and imbalanced data give it advantages over traditional methods. Budholiya et al. (2022) used XGBoost to develop an optimized diagnostic system for predicting heart disease. The results showed that XGBoost provided higher prediction accuracy than conventional methods, allowing the identification of complex risk factors and interactions between various variables (Asikin et al., 2024). This application demonstrates XGBoost's ability to handle complex datasets and support informed decisions in critical contexts.

Additionally, in the study by Ding et al. (2021), XGBoost was applied to predict house prices using highly diverse and complex data. The model successfully identified key variables that influence house prices, such as location, size, and neighborhood amenities, and provided highly accurate price predictions (Et. al., 2021). The success of XGBoost in capturing important correlations between these factors and property prices highlights its potential in price prediction applications. Another study by Ntakaris et al. (2023) illustrates how XGBoost can be used to improve the prediction of students' academic performance by incorporating factors such as attendance records, test scores, and extracurricular activities. The model provides insight into the relative contribution of each factor to academic performance, demonstrating the versatility of this algorithm across different areas of data science (Asselman et al., 2023).

In the context of agriculture, previous studies have demonstrated the effectiveness of XGBoost in predicting agricultural commodity prices. Lakshmi et al. (2020) used XGBoost to forecast the prices of crops such as vegetables and fruits. Their study found that XGBoost can capture complex patterns in price data and produce more accurate predictions than traditional predictive models (Bayona-Oré et al., 2021). Similarly, Kamble et al. (2021) applied XGBoost to predict prices of various agricultural commodities and found that the model can effectively use weather and distribution data to forecast commodity prices. In the specific context of Indonesia, the price of cayenne pepper is strongly influenced by local factors such as high climate variability and gaps in transportation infrastructure, which often cause distribution delays from production areas to markets (Tran et al., 2023). Given the success of XGBoost in various applications, this study aims to apply the XGBoost algorithm to analyze and predict the price of cayenne pepper in Indonesia.

Price information on cayenne pepper will be collected from the National Food Agency (BPN). This data often needs extensive processing to correct missing figures or irregularities. In addition, other factors that affect prices, such as weather conditions, government policies, or global economic events, may not be recorded. We will also consider logistical aspects such as travel time and transportation availability. Using this data, a prediction model will be built that projects prices and captures actual market conditions (Budiarti et al., 2023). By utilizing XGBoost, it is expected that important patterns and variables that affect the price of cayenne pepper can be better identified. The main focus of this research is to develop an XGBoost-based price prediction model for cayenne pepper that provides more accurate forecasts and insight into the variables affecting its price. The model will be tested using cross-validation techniques and historical data to assess its predictive accuracy. It is expected that this research will produce a more accurate and reliable price prediction model for cayenne pepper compared to conventional methods (Ananda et al., 2022).

This model is expected to contribute to increased economic stability, reduced financial risk, and improved welfare for both consumers and farmers. Farmers can use these predictions to plan optimal harvest times and choose markets that offer the best prices, while consumers and traders can better plan their purchases and distribution (Windhy & Jamil, 2021). In addition, this research is expected to contribute to the scientific literature in agriculture and data science and to open up opportunities for further development in the use of machine learning for other agricultural products (Nasution et al., 2021).

RESEARCH METHOD

Referring to Figure 1, the research process proceeds as follows. Data Collection gathers daily prices of red chili peppers in DKI Jakarta from 2021 to 2024. Data Pre-processing cleans and prepares the data by handling missing values and outliers. Data Splitting divides the data into a training set and a test set. Feature Scaling standardizes the features. Modeling uses the XGBoost Regressor with optimized hyperparameters. Model Evaluation assesses performance with R² scores and cross-validation. Result Visualization displays trends, seasonal analysis, the correlation matrix, residuals, and future price predictions.

Figure 1: Research process

1. Data Collection

The dataset used, comprising 1,247 records, contains information on the price of red chili peppers in DKI Jakarta Province from 2021 to 2024. It was downloaded in Excel format and read using the pandas library with the openpyxl engine. The dataset includes the daily price of red chili in various areas of Jakarta, such as South Jakarta, East Jakarta, Central Jakarta, West Jakarta, North Jakarta, and Kepulauan Seribu Regency. In addition, the dataset has a 'Date' column, which is important for extracting temporal features.

2. Data Preprocessing

Data pre-processing is a crucial step to ensure data integrity and quality before analysis or modeling. The pre-processing steps in this study include:

1. Replacing Infinite Values: Infinite values are replaced with NaN (Not a Number) so that the data can be processed further.

2. Filling Missing Values: Missing values are filled using forward-fill and backward-fill methods to maintain data continuity.

3. Deleting Zero Values: Rows with zero prices in various columns are deleted to avoid distortions in the analysis.

4. Resetting the Index: After deleting rows, the index is reset to maintain data consistency.

5. Date Conversion: The 'Date' column is converted to DateTime format for easy extraction of temporal features.

6. Outliers Removal: Outliers in the price column in DKI Jakarta were removed using the Interquartile Range (IQR) method to get a more representative dataset.

7. Feature Additions: New features such as month, year, day of the week, quarter, and whether it is the beginning or end of the month were added. A lag feature (previous day's price) was also created to capture complex temporal patterns in the data.
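The seven steps above can be sketched in pandas as follows. This is a minimal illustration rather than the study's actual code: the price column name `DKI Jakarta`, the 1.5 × IQR fences, the feature names, the decision to drop the first row (which has no lag value), the helper names `load_prices` and `preprocess`, and the small demonstration table are all assumptions.

```python
import numpy as np
import pandas as pd


def load_prices(path: str) -> pd.DataFrame:
    """Read the Excel export with pandas and the openpyxl engine."""
    return pd.read_excel(path, engine="openpyxl")


def preprocess(df: pd.DataFrame, price_col: str = "DKI Jakarta") -> pd.DataFrame:
    """Apply the seven preprocessing steps to a raw price table."""
    df = df.copy()
    region_cols = [c for c in df.columns if c != "Date"]

    # 1. Replace infinite values with NaN.
    df[region_cols] = df[region_cols].replace([np.inf, -np.inf], np.nan)
    # 2. Forward-fill, then backward-fill any remaining gaps.
    df[region_cols] = df[region_cols].ffill().bfill()
    # 3. Drop rows in which any price column is zero (one reading of the text).
    df = df[(df[region_cols] != 0).all(axis=1)]
    # 4. Reset the index after deleting rows.
    df = df.reset_index(drop=True)
    # 5. Convert 'Date' to datetime.
    df["Date"] = pd.to_datetime(df["Date"])
    # 6. Remove outliers outside the conventional 1.5 * IQR fences (assumed multiplier).
    q1, q3 = df[price_col].quantile([0.25, 0.75])
    iqr = q3 - q1
    keep = df[price_col].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    df = df[keep].reset_index(drop=True)
    # 7. Add calendar features and a one-row lag of the price.
    df["month"] = df["Date"].dt.month
    df["year"] = df["Date"].dt.year
    df["day_of_week"] = df["Date"].dt.dayofweek
    df["quarter"] = df["Date"].dt.quarter
    df["is_month_start"] = df["Date"].dt.is_month_start.astype(int)
    df["is_month_end"] = df["Date"].dt.is_month_end.astype(int)
    # The lag is taken by row, so it is the previous *retained* observation.
    df["lag_1"] = df[price_col].shift(1)
    return df.dropna(subset=["lag_1"]).reset_index(drop=True)


# Small made-up table containing an infinite value, a gap, a zero and an outlier.
raw = pd.DataFrame({
    "Date": pd.date_range("2021-07-01", periods=8, freq="D"),
    "DKI Jakarta": [50000, np.inf, 52000, 0, 51000, 500000, 53000, 54000],
    "South Jakarta": [50500, 51000, np.nan, 52000, 51500, 52500, 53500, 54500],
})
clean = preprocess(raw)
print(clean[["Date", "DKI Jakarta", "lag_1", "day_of_week"]])
```

In this sketch the zero-price row and the extreme value are removed before the lag feature is built, so the lag never points at a value that was later discarded.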

3. Data Splitting

To effectively evaluate the model, the dataset is divided into two parts: training set and test set. In this study, the data was split with 20% for the training set and 80% for the test set using sklearn's train_test_split function. This split allows the model to be trained on a fraction of the data and tested on data that has not been seen before, giving a better indication of the model's performance on unseen data.
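A split of this kind can be sketched with `train_test_split` as follows. The proportions follow the text as stated (20% training, 80% test); the placeholder arrays and the `random_state` value are assumptions made only so that the example runs on its own.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder feature matrix and target standing in for the prepared data.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = rng.normal(size=100)

# train_size=0.2 keeps 20% of the rows for training and the rest for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)
```

Note that `train_test_split` shuffles the rows by default, so the test set is a random sample rather than the most recent period.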

4. Feature Scaling

Feature scaling converts data features to a uniform scale. In this study, StandardScaler from sklearn is used to standardize the features so that they have a mean of 0 and a standard deviation of 1. Tree-based models such as XGBoost are generally insensitive to monotonic rescaling of features, so this step is a precaution rather than a requirement. The scaler is fitted on the training data and then used to transform both the training and test data, which prevents information from the test set leaking into training.
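One common way to standardize features without leaking test-set information is to learn the mean and standard deviation from the training data only and reuse them for the test data. The sketch below illustrates this with `StandardScaler`; the synthetic arrays are placeholders, not the study's data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Placeholder training and test features on a price-like scale.
rng = np.random.default_rng(0)
X_train = rng.normal(loc=50_000, scale=5_000, size=(80, 3))
X_test = rng.normal(loc=50_000, scale=5_000, size=(20, 3))

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from training data only
X_test_scaled = scaler.transform(X_test)        # reuse the training statistics

print(X_train_scaled.mean(axis=0).round(6), X_train_scaled.std(axis=0).round(6))
```

The scaled test features will generally not have exactly zero mean and unit variance, which is expected because they are transformed with the training statistics.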

5. Modeling

The modeling process involves the use of the XGBoost Regressor algorithm, which is known for its ability to handle complex data. The hyperparameters of the model were optimized using RandomizedSearchCV to find the best combination of parameters from a predefined range. The optimized parameters include the number of estimators, maximum depth of the tree, learning rate, subsampling ratio, and colsample_bytree. Once the best parameter set is found, the XGBoost model is trained using the scaled training data.

6. Model Evaluation

Model performance is evaluated using several metrics. The R² (R-squared) score is calculated to assess how much of the variance in the data the model captures. R² scores were calculated for both the training and test data, and their average was also reported. In addition, cross-validation was performed using cross_val_score to obtain a more reliable assessment of the model's performance. Further evaluation includes residual analysis to examine the distribution of prediction errors and identify potential model bias.

7. Result Visualization

Visualization of results is critical to understanding and communicating model performance. Key visualizations created include:

1. Price Trend Chart: This visualization shows the historical price trend of red chili in DKI Jakarta, providing insight into seasonal trends and price fluctuations.

2. Seasonal Analysis: Seasonal patterns by month and day of the week were analyzed to identify seasonal effects on red chili prices.

3. Correlation Matrix: The correlation heatmap shows the relationship between features, helping to understand which features affect the price of red chili peppers the most.

4. Residual Plot: A scatter plot of residuals (the difference between predicted and actual prices) helps identify systematic errors in model predictions.

5. Future Price Predictions: Red chili price predictions for the following year are visualized along with confidence intervals to show the uncertainty in the predictions.

RESULT AND DISCUSSION

Results

The Special Capital Region (DKI) of Jakarta, Indonesia's main center of trade and population, was chosen as the study area for several reasons. First, price fluctuations here have a major impact on the local and national economy, as the region has the highest consumption rate of red cayenne pepper in Indonesia [11]. Second, thanks to the regular data collection efforts of the Jakarta National Food Agency, historical information on the price of red cayenne pepper in Jakarta is more complete and stable. Third, the presence of various types of markets, both traditional and modern, facilitates the collection of complete and diverse data [12]. Data collection and research are also easier in Jakarta because of its infrastructure and high accessibility. Price analysis in Jakarta is particularly important for national policy formulation because, as the nation's capital, its trade and economic policies often serve as national standards. Finally, the complex red cayenne pepper market in Jakarta, which is characterized by sudden price changes, offers an interesting challenge for evaluating the effectiveness of the XGBoost algorithm in a dynamic setting [13].

Table 1. Statistical Analysis

Statistical Analysis	Results
R-squared (R2) Score Train	0.99
R-squared (R2) Score Test	0.92
Cross-Validation Score	0.96

Table 1 displays the model evaluation results. With an R² (coefficient of determination) score of 0.99 on the training dataset, the model explains about 99% of the variability in the training data, indicating an almost perfect fit. With an R² score of 0.92 on the test data, the model explains 92% of the variability, indicating strong generalizability and the ability to produce reliable predictions on data it has not seen before. The gap between the two scores suggests a modest degree of overfitting. Additionally, the cross-validation score of 0.96 shows that the model performed well and consistently across different subsets of the training data, which indicates that its performance does not depend on one particular partition of the data.

Figure 1: Average Red Cayenne Pepper Prices in DKI Jakarta by Month and by Day of the Week

The figure above shows an upward trend in the average price of red cayenne pepper in Jakarta between August and December. Prices started to rise in September and peaked in November, when the average price reached 75,000 Indonesian rupiah (IDR) per kilogram, before declining slightly in December. The dry season, short supply, and unfavorable weather patterns may be some of the causes of this price spike. Rising prices of red cayenne pepper, an important commodity for Indonesians, can affect inflation and people's purchasing power. Therefore, to maintain price stability, production should be increased, distribution should be improved, and market intervention is needed.

The figure above also depicts the pattern of red cayenne pepper prices in DKI Jakarta by day of the week. Prices are usually highest on Mondays and Sundays and lowest on Wednesdays and Thursdays. Monday is the most expensive day, with prices reaching IDR 60,200 per kg, while by Thursday the price has gradually dropped to IDR 59,500 per kg. Weather, religious holidays, supply and demand, and other factors affect these fluctuations. The higher prices on Mondays and Sundays may reflect high demand at the beginning and end of the week and possibly limited availability, whereas in the middle of the week supply is more consistent and demand tends to decrease.
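Seasonal averages of this kind are typically obtained by grouping daily prices by calendar month and by day of the week. The sketch below shows the idea on randomly generated prices; the column names and the data are placeholders and do not reproduce the figures reported above.

```python
import numpy as np
import pandas as pd

# Placeholder daily prices for one year.
dates = pd.date_range("2023-01-01", "2023-12-31", freq="D")
rng = np.random.default_rng(0)
prices = pd.DataFrame({
    "Date": dates,
    "price": rng.normal(60_000, 5_000, len(dates)).round(),
})

# Average price per calendar month (1-12) and per weekday name.
monthly_mean = prices.groupby(prices["Date"].dt.month)["price"].mean()
weekday_mean = prices.groupby(prices["Date"].dt.day_name())["price"].mean()

peak_month = monthly_mean.idxmax()
print(f"Month with the highest average price: {peak_month}")
print(weekday_mean.sort_values(ascending=False))
```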

Correlation Matrix

Figure 2 presents a correlation matrix of red chili prices across locations in DKI Jakarta. Each entry is the correlation coefficient between the price of red chili in one region and the price in another, with values ranging from -1 to 1. A positive coefficient indicates that when the price of red chili increases in one region, it also tends to increase in the other. A negative coefficient indicates that the price tends to fall in one region when it rises in the other. A coefficient of 0 indicates no linear relationship between the prices in the two regions.
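A correlation matrix of this kind can be computed directly from a table with one price column per region. The sketch below uses the Pearson coefficient, which is the pandas default and is assumed here because the paper does not name the method; the region prices are synthetic and share a common city-wide component so that they move together.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
common = rng.normal(60_000, 8_000, size=300)  # shared city-wide price movement

# Synthetic regional prices: common movement plus region-specific noise.
prices = pd.DataFrame({
    "Central Jakarta": common + rng.normal(0, 2_000, 300),
    "South Jakarta": common + rng.normal(0, 3_000, 300),
    "East Jakarta": common + rng.normal(0, 4_000, 300),
})

# Pairwise Pearson correlation between all region columns.
corr = prices.corr(method="pearson")
print(corr.round(2))
```

The matrix is symmetric with ones on the diagonal, so only the entries above (or below) the diagonal need to be interpreted.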

Correlation Matrix Interpretation:

All of the region pairs reported below show a strong positive correlation, meaning that red chili prices in the two regions tend to rise and fall together:
Central Jakarta and Kepulauan Seribu (Thousand Islands): 0.92
Central Jakarta and South Jakarta: 0.87
Central Jakarta and East Jakarta: 0.84
Central Jakarta and North Jakarta: 0.81
South Jakarta and East Jakarta: 0.83
South Jakarta and North Jakarta: 0.79
East Jakarta and North Jakarta: 0.76
In each case, an increase in the price of red chili in one region tends to be accompanied by an increase in the other.

Figure 3. Learning Curves and Residual Plot

Learning curves are shown in the figure above to provide important insights into the learning process of machine learning models. At first, the error rapidly decreases, indicating that the model quickly understands the basics of the data. But thereafter, the rate of decline slows down, indicating that the model is having difficulty in discovering new patterns and is approaching its capacity limit. The error eventually stabilizes, indicating that the model has made the best use of the available data. The shape of this curve depends on variables such as data quality, model complexity, and model design. By understanding this learning curve, we can estimate model performance, determine the optimal amount of training data, and compare different models.

The figure also shows a residual plot for the training and testing data. A residual is the difference between the value predicted by the model and the observed value. This plot makes the performance of the model on the training and testing data easier to compare. The blue line represents the residuals of the training data, which are mostly clustered around zero, showing that the model predicted the training values accurately. The green line represents the residuals of the testing data. Larger residuals indicate less accurate predictions on the test data. Overall, the plot shows that the model performs well on both training and testing data, although it is more accurate on the training data. This is expected, because the training data was used to fit the model, whereas the testing data had not been seen by the model before.

Figure 4: Predicted price of red bird's eye chilies in DKI Jakarta in the next 1 year

Figure 4 shows the forecasted price of red cayenne pepper in the Jakarta Special Capital Region. The price of red cayenne pepper increases from July 2021 to January 2024, as seen from the blue line that rises steadily in the historical data. The rising orange line indicates that the upward trend is projected to continue from July 2024 to January 2026. These predictions are only estimates, and the actual price of red cayenne pepper may differ from the forecast. This uncertainty is shown by the gray confidence interval surrounding the projected line. The price of red cayenne pepper may increase due to various factors, including market supply and demand and climatic conditions. As a result, the actual price may show significant short-term variations.

Discussion

This research resulted in an in-depth understanding of the price fluctuations of red cayenne pepper in the Jakarta Special Capital Region from 2021 to 2024. Statistical analysis shows that the applied XGBoost model has excellent performance with R-squared values reaching 0.99 on training data and 0.92 on testing data. This indicates that the model is able to explain most of the variability in the price of red cayenne pepper, both in the data used for training and new data that has never been seen before. In addition, the learning curve showed that the model quickly understood the basic pattern of the data but had difficulty finding new patterns, indicating limitations in the available data.

The findings have significant practical implications for economic and trade policy in Indonesia. The price stability of red cayenne pepper in Jakarta is crucial given its impact on inflation and people's purchasing power. By understanding price fluctuations that are influenced by factors such as seasonality, supply availability, and weather patterns, the government can design a comprehensive strategy to maintain price stability (M'hamdi et al., 2024). This includes increasing production, managing market operations, and promoting diversification, which are important to protect the interests of traders, farmers, and consumers. Thus, these findings not only provide insight into the dynamics of the red cayenne pepper market but also suggest concrete measures to improve food security and the national economy (Komaria et al., 2023).

This research is supported by comprehensive and diverse data collected by the Jakarta National Food Agency, as well as a careful analysis of the price of red cayenne pepper in various areas of DKI Jakarta (Deviyanto & Aji, 2023). Relevant references include previous research on commodity price fluctuations and machine learning techniques, which strengthen the methodology and interpretation of results. In addition, the analysis of correlations between prices in different locations of DKI Jakarta adds interpretative value, enabling a better understanding of price interdependencies within a given geographical area (Kusumiyati et al., 2021).

As a step towards updating knowledge in this area, this study presents new findings in the analysis of red chili pepper prices in Jakarta. While there have been previous studies on red chili price fluctuations, this approach integrates machine learning techniques with a broader and more detailed dataset, resulting in more accurate price predictions and more focused policy recommendations. As such, this study fills a gap in the literature by expanding the understanding of the factors affecting red chili pepper prices and their strategic implications for economic decision-making in Indonesia (Nababan et al., 2023).

CONCLUSION

This study reveals that the price fluctuations of red cayenne pepper in DKI Jakarta from 2021 to 2024 are influenced by factors such as seasonality, supply availability, and demand. Statistical analysis shows that the XGBoost model provides excellent performance, with R-squared values reaching 99% on the training data and 92% on the test data, demonstrating the model's ability to explain price variability effectively. The findings have significant implications for economic and trade policy in Indonesia, with red cayenne pepper price stability being key in managing inflation and people's purchasing power. Suggested strategies include increasing production, managing market operations, and promoting diversification to protect national economic interests. As such, this study not only provides in-depth insights into local market dynamics, but also provides a framework for results-oriented policy actions, which are essential for improving food security and national economic stability in the future.

REFERENCES

Ananda, D., Pertiwi, A., & Muslim, M. A. (2022). Prediksi Rating Aplikasi Playstore Menggunakan Xgboost. ResearchGate, October 2020, 6.

Asikin, M. Z., Fadilah, M. O., Saputro, W. E., Aditia, O., & Ridzki, M. M. (2024). The Influence Of Digital Marketing On Competitive Advantage And Performance of Micro, Small And Medium Enterprises. International Journal of Social Service and Research, 4(03), 963–970.

Asselman, A., Khaldi, M., & Aammou, S. (2023). Enhancing the prediction of student performance based on the machine learning XGBoost algorithm. Interactive Learning Environments, 31(6), 3360–3379. https://doi.org/10.1080/10494820.2021.1928235

Bayona-Oré, S., Cerna, R., & Hinojoza, E. T. (2021). Machine learning for price prediction for agricultural products. WSEAS Transactions on Business and Economics, 18, 969–977. https://doi.org/10.37394/23207.2021.18.92

Budholiya, K., Shrivastava, S. K., & Sharma, V. (2022). An optimized XGBoost based diagnostic system for effective prediction of heart disease. Journal of King Saud University - Computer and Information Sciences, 34(7), 4514–4523. https://doi.org/10.1016/j.jksuci.2020.10.013

Deviyanto, A., & Aji, J. M. M. (2023). Fluktuasi Harga dan Efisiensi Pemasaran Cabai Rawit di Desa Sepanjang Kecamatan Glenmore Kabupaten Banyuwangi. Jurnal Pertanian Agros, 25(1), 529–537.

Et. al., J. A. (2021). Prediction of House Price Using XGBoost Regression Algorithm. Turkish Journal of Computer and Mathematics Education (TURCOMAT), 12(2), 2151–2155. https://doi.org/10.17762/turcomat.v12i2.1870

Budiarti, D. I., Kholijah, G., Yurinanda, S., & Mardhotillah, B. (2023). Price Prediction of Green Cayenne Pepper in Kota Jambi Using Markov Chain. 2(1).

Komaria, V., Maidah, N. El, & Furqon, M. A. (2023). Prediksi Harga Cabai Rawit di Provinsi Jawa Timur Menggunakan Metode Fuzzy Time Series Model Lee. Komputika : Jurnal Sistem Komputer, 12(2), 37–47. https://doi.org/10.34010/komputika.v12i2.10644

Kusumiyati, K., Putri, I. E., & Munawar, A. A. (2021). Model Prediksi Kadar Air Buah Cabai Rawit Domba (Capsicum frutescens L.) Menggunakan Spektroskopi Ultraviolet Visible Near Infrared. Agro Bali: Agricultural Journal, 4(1), 15–22. https://doi.org/10.37637/ab.v0i0.615

M’hamdi, O., Takács, S., Palotás, G., Ilahy, R., Helyes, L., & Pék, Z. (2024). A Comparative Analysis of XGBoost and Neural Network Models for Predicting Some Tomato Fruit Quality Traits from Environmental and Meteorological Data. Plants, 13(5). https://doi.org/10.3390/plants13050746

Nababan, A. A., Jannah, M., Aulina, M., & Andrian, D. (2023). Prediksi Kualitas Udara Menggunakan Xgboost Dengan Synthetic Minority Oversampling Technique (Smote) Berdasarkan Indeks Standar Pencemaran Udara (Ispu). JTIK (Jurnal Teknik Informatika Kaputama), 7(1), 214–219. https://doi.org/10.59697/jtik.v7i1.66

Nasution, M. K., Saedudin, R. R., & Widartha, V. P. (2021). Perbandingan Akurasi Algoritma Naïve Bayes Dan Algoritma Xgboost Pada Klasifikasi Penyakit Diabetes. E-Proceeding of Engineering, 8(5), 9765–9772.

Tran, N.-Q., Nguyen Ngoc, T., Tran, Q., Felipe, A., Huynh, T., Tang, A., & Nguyen, T. (2023). Predicting Agricultural Commodities Prices with Machine Learning: A Review of Current Research.

Windhy, A. M., & Jamil, A. S. (2021). Peramalan Harga Cabai Merah Indonesia : Pendekatan ARIMA. Jurnal Agriekstensia, 20(1), 78–87.

Yuditya, A., Hardjanto, A., & Sehabudin, U. (2023). Fluktuasi Harga dan Integrasi Pasar Cabai Merah Besar (Studi Kasus: Pasar Induk kramat Jati dan Pasar Eceran di DKI Jakarta). Indonesian Journal of Agriculture Resource and Environmental Economics, 2(1), 1–13. https://doi.org/10.29244/ijaree.v2i1.50669