Eduvest - Journal of Universal Studies, Volume 4, Number 09, September 2024
p-ISSN 2775-3735, e-ISSN 2775-3727
IMPLEMENTATION OF XGBOOST ALGORITHM TO PREDICT THE SELLING PRICE OF CAYENNE
PEPPERS IN DKI JAKARTA
Dhafin Riando 1, Afiyati Afiyati 2
1,2 Universitas Mercu Buana, Indonesia
Email: [email protected]
ABSTRACT
This research focuses on applying the XGBoost algorithm to analyze and predict
cayenne pepper prices. The main aim is to exploit XGBoost's capability to
manage large datasets and discern intricate patterns for precise price
forecasting. The dataset comprises historical cayenne pepper price data,
along with pertinent economic and climatic factors. The XGBoost model was
developed and validated on this dataset, with its performance assessed using
R² and cross-validation scores. The results indicated a high level of
accuracy, with an R² score of 0.99 on the training set and 0.92 on the test
set, reflecting a strong alignment between predicted and actual prices.
Moreover, the model attained an average cross-validation score of 0.96,
reinforcing its robustness and reliability. These findings highlight XGBoost's
efficacy in agricultural price prediction, offering stakeholders a potent tool
for data-driven decision-making. This study enriches the literature on machine
learning applications in agriculture and emphasizes XGBoost's potential to
enhance predictive accuracy and operational efficiency.
KEYWORDS
XGBoost Algorithm, Price Prediction, Cayenne Pepper, Agricultural Markets,
Predictive Analytics.
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0
International License.
INTRODUCTION
The price of cayenne pepper in Indonesia
often experiences significant fluctuations, especially during the rainy season
or when distribution disruptions occur. This instability is influenced by
various factors such as unpredictable weather conditions, pest attacks, and
complex logistical challenges, especially in remote areas with inadequate
infrastructure. This price volatility impacts not only consumers but also
farmers and agricultural businesses. Price uncertainty can cause substantial
financial losses, reduce farmers' welfare, and disrupt the overall economic
stability of the agricultural sector. Therefore, an efficient method is needed
to analyze and predict the price of cayenne pepper to enable better and timely
decision-making for all parties involved (Yuditya et al., 2023).
Machine learning algorithms, particularly XGBoost (Extreme Gradient Boosting),
have proven useful in various fields for handling large and complex data and
providing accurate predictions. XGBoost excels due to its ability to handle
datasets with many features and complex interactions between variables. Its
speed and its ability to manage missing values and imbalanced data give it an
advantage over traditional methods. Budholiya et al. (2022) used XGBoost to
develop an optimized diagnostic system for predicting heart disease. Their
results showed that XGBoost provided higher prediction accuracy than
conventional methods, allowing the identification of complex risk factors and
interactions between variables (Asikin et al., 2024).
This application of XGBoost demonstrates
its ability to handle complex datasets and provide informed decisions in
critical contexts.
Additionally, in the study by Ding et al. (2021), XGBoost was applied to
predict house prices using highly diverse and complex data. The model
successfully identified key variables that influence house prices, such as
location, size, and neighborhood amenities, and provided highly accurate price
predictions (Et. al., 2021). The success of XGBoost in capturing important
correlations between these factors and property prices highlights its
potential in price prediction applications. Another study, by Asselman et al.
(2023), illustrates how XGBoost can be used to improve the prediction of
students' academic performance by incorporating factors such as attendance
records, test scores, and extracurricular activities. The model provides
insight into the relative contribution of each factor to academic performance,
demonstrating the versatility of this algorithm across different areas of data
science (Asselman et al., 2023).
In the context of agriculture, previous studies have demonstrated the
effectiveness of XGBoost in predicting agricultural commodity prices. Lakshmi
et al. (2020) used XGBoost to forecast the prices of crops such as vegetables
and fruits. Their study found that XGBoost can capture complex patterns in
price data and produce more accurate predictions than traditional predictive
models (Bayona-Oré et al., 2021). Similarly, Kamble et al. (2021) applied
XGBoost to predict prices of various agricultural commodities and found that
the model can effectively use weather and distribution data to forecast
commodity prices. In the specific context of Indonesia, the price of cayenne
pepper is highly influenced by local factors such as high climate variability
and gaps in transportation infrastructure, which often cause distribution
delays from production areas to markets (Tran et al., 2023). Given the success
of XGBoost in various applications, this study aims to apply the XGBoost
algorithm to analyze and predict the price of cayenne pepper in Indonesia.
Price information on cayenne pepper is collected from the National Food Agency
(BPN). This data often needs extensive processing to correct missing figures
or irregularities. In addition, other factors that affect prices, such as
weather conditions, government policies, or global economic events, may not be
recorded. Logistical aspects such as travel time and transportation
availability are also considered. Using this data, a prediction model is built
that projects prices and captures actual market conditions (Price et al.,
2023). By utilizing XGBoost, it is expected that important patterns and
variables that affect the price of cayenne pepper can be better identified.
The main focus of this research is to develop an XGBoost-based price
prediction model for cayenne pepper that provides more accurate forecasts and
insight into the variables affecting its price. The model is tested using
cross-validation techniques and historical data to verify its predictive
accuracy. It is expected that this research will produce a more accurate and
reliable price prediction model for cayenne pepper than conventional methods
(Ananda et al., 2022).
This model is expected to contribute to increased economic stability, reduced
financial risk, and improved welfare for both consumers and farmers. Farmers
can use these predictions to plan optimal harvest times and choose markets
that offer the best prices, while consumers and traders can better plan their
purchases and distribution (Windhy & Jamil, 2021). In addition, this research
is expected to contribute to the scientific literature in agriculture and data
science and to open up opportunities for further development in the use of
machine learning for other agricultural products (Nasution et al., 2021).
RESEARCH METHOD
As shown in Figure 1, the research process consists of the following steps.
Data Collection gathers daily prices of red chili peppers in DKI Jakarta from
2021 to 2024. Data Pre-processing cleans and prepares the data by handling
missing values and outliers. Data Splitting divides the data into a training
set and a test set. Feature Scaling standardizes the features. Modeling uses
the XGBoost Regressor with optimized hyperparameters. Model Evaluation
assesses performance with R² scores and cross-validation. Results
Visualization displays trends, seasonal analysis, the correlation matrix,
residuals, and future price predictions.
Figure 1: Research process
1. Data Collection
The dataset contains 1,247 records of red chili pepper prices in DKI Jakarta
Province from 2021 to 2024. It was downloaded in Excel format and read using
the pandas library with the openpyxl engine. The dataset includes the daily
price of red chili in the areas of Jakarta: South Jakarta, East Jakarta,
Central Jakarta, West Jakarta, North Jakarta, and Kepulauan Seribu Regency. In
addition, the dataset has a 'Date' column, which is important for extracting
temporal features.
2. Data Preprocessing
Data pre-processing is a crucial step to ensure data integrity and quality
before analysis or modeling. The pre-processing steps in this study include:
1. Replacing Infinite Values: Infinite values are replaced with NaN (Not a
Number) so that the data can be processed further.
2. Filling Missing Values: Missing values are filled using forward-fill and
backward-fill methods to maintain data continuity.
3. Deleting Zero Values: Rows with zero prices in any of the price columns are
deleted to avoid distortions in the analysis.
4. Resetting the Index: After deleting rows, the index is reset to maintain
data consistency.
5. Date Conversion: The 'Date' column is converted to DateTime format for easy
extraction of temporal features.
6. Outlier Removal: Outliers in the DKI Jakarta price column are removed using
the Interquartile Range (IQR) method to obtain a more representative dataset.
7. Feature Additions: New features such as month, year, day of the week,
quarter, and whether the date is the beginning or end of the month are added.
A lag feature (the previous day's price) is also created to capture temporal
patterns in the data.
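The seven steps above can be sketched in pandas roughly as follows. This is a minimal illustration, not the authors' code: the function name `preprocess_prices`, the column names ("Date", "DKI Jakarta"), the feature names, the conventional 1.5 × IQR fence, and dropping the first row (which has no previous-day price) are all assumptions.

```python
import numpy as np
import pandas as pd


def preprocess_prices(df: pd.DataFrame, price_col: str = "DKI Jakarta") -> pd.DataFrame:
    """Hypothetical helper applying the seven pre-processing steps."""
    df = df.copy()
    value_cols = [c for c in df.columns if c != "Date"]

    # 1. Replace infinite values with NaN.
    df[value_cols] = df[value_cols].replace([np.inf, -np.inf], np.nan)
    # 2. Fill missing values: forward-fill first, then backward-fill.
    df[value_cols] = df[value_cols].ffill().bfill()
    # 3-4. Drop rows containing a zero price, then reset the index.
    df = df[(df[value_cols] != 0).all(axis=1)].reset_index(drop=True)
    # 5. Convert the date column to datetime.
    df["Date"] = pd.to_datetime(df["Date"])

    # 6. Remove outliers in the target column (assumed 1.5 * IQR fences).
    q1, q3 = df[price_col].quantile([0.25, 0.75])
    iqr = q3 - q1
    inside = df[price_col].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    df = df[inside].reset_index(drop=True)

    # 7. Calendar features plus a one-row lag of the target price.
    df["month"] = df["Date"].dt.month
    df["year"] = df["Date"].dt.year
    df["day_of_week"] = df["Date"].dt.dayofweek
    df["quarter"] = df["Date"].dt.quarter
    df["is_month_start"] = df["Date"].dt.is_month_start.astype(int)
    df["is_month_end"] = df["Date"].dt.is_month_end.astype(int)
    df["lag_1"] = df[price_col].shift(1)

    # The first row has no previous price, so it is dropped (an assumption).
    return df.dropna(subset=["lag_1"]).reset_index(drop=True)
```

Filling gaps before removing zero rows, as in the order listed above, means a zero price is never propagated into a neighbouring missing value only if the zero is not immediately followed by a gap; a stricter variant would treat zeros as missing before filling.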
3. Data Splitting
To effectively evaluate the model, the dataset is divided into two parts: a
training set and a test set. In this study, the data was split with 80% for
the training set and 20% for the test set using sklearn's train_test_split
function. This split allows the model to be trained on one portion of the data
and tested on data that it has not seen before, giving a better indication of
the model's performance on unseen data.
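A minimal sketch of such a split with sklearn's train_test_split is shown below. The synthetic arrays, the `random_state`, and the held-out fraction passed as `test_size` (0.2 here) are illustrative assumptions, not values taken from the study's code.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the feature matrix and the target price.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = rng.normal(size=100)

# test_size is the fraction of rows held out for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```

By default the rows are shuffled before splitting. Because the features include a lagged price, a chronological split (`shuffle=False`) is a common alternative for time-ordered data.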
4. Feature Scaling
Feature scaling converts data features to a uniform scale. In this study,
StandardScaler from sklearn is used to standardize the features so that they
have a mean of 0 and a standard deviation of 1. Tree-based algorithms such as
XGBoost are largely insensitive to monotonic feature scaling, so this step is
not strictly required for the model, but it keeps the features on a common
scale. The scaler is fitted on the training data and then applied to the test
data, which prevents information leakage from the test set to the training
set.
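One common way to standardize without leaking test-set statistics is to fit the scaler on the training data only and reuse it for the test data. The sketch below uses synthetic arrays as an assumption.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for the training and test feature matrices.
rng = np.random.default_rng(0)
X_train = rng.normal(loc=50.0, scale=10.0, size=(80, 3))
X_test = rng.normal(loc=50.0, scale=10.0, size=(20, 3))

scaler = StandardScaler()
# Mean and standard deviation are learned from the training data only.
X_train_scaled = scaler.fit_transform(X_train)
# The test data is transformed with the training statistics.
X_test_scaled = scaler.transform(X_test)
```

After this step the training features have mean 0 and standard deviation 1 exactly, while the test features are only approximately standardized, which is the expected behaviour.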
5. Modeling
The modeling process uses the XGBoost Regressor algorithm, which is known for
its ability to handle complex data. The hyperparameters of the model were
optimized using RandomizedSearchCV, which samples parameter combinations from
predefined ranges and keeps the best-scoring one. The optimized parameters
include the number of estimators, the maximum depth of the trees, the learning
rate, the subsampling ratio, and the column subsampling ratio per tree
(colsample_bytree). Once the best parameter set is found, the XGBoost model is
trained using the scaled training data.
6. Model Evaluation
Model performance is evaluated using several metrics. An R² (R-squared) score
is calculated to assess how well the model captures the variance in the data.
R² scores were calculated for both training and test data, and the average was
also reported. In addition, cross-validation was performed using
cross_val_score to obtain a more reliable assessment of the model's
performance. Further evaluation includes residual analysis to examine the
distribution of prediction errors and identify potential model bias.
7. Result Visualization
Visualization of results is critical to understanding and communicating model
performance. Key visualizations created include:
1. Price Trend Chart: This visualization shows the historical price trend of
red chili in DKI Jakarta, providing insight into seasonal trends and price
fluctuations.
2. Seasonal Analysis: Seasonal patterns by month and day of the week were
analyzed to identify seasonal effects on red chili prices.
3. Correlation Matrix: The correlation heatmap shows the relationship between
features, helping to understand which features affect the price of red chili
peppers the most.
4. Residual Plot: A scatter plot of residuals (the difference between
predicted and actual prices) helps identify systematic errors in model
predictions.
5. Future Price Predictions: Red chili price predictions for the following
year are visualized along with confidence intervals to show the uncertainty in
the predictions.
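The aggregation behind the seasonal analysis (item 2) can be sketched with a pandas groupby. The column names and the synthetic daily prices are assumptions; the plotting itself is omitted.

```python
import numpy as np
import pandas as pd

# Synthetic daily prices for one year (illustrative values only).
dates = pd.date_range("2023-01-01", "2023-12-31", freq="D")
rng = np.random.default_rng(0)
prices = pd.DataFrame(
    {"Date": dates, "price": 50000 + rng.normal(scale=2000, size=len(dates))}
)

# Average price per calendar month (1-12).
monthly_avg = prices.groupby(prices["Date"].dt.month)["price"].mean()

# Average price per day of the week (Monday ... Sunday).
weekday_avg = prices.groupby(prices["Date"].dt.day_name())["price"].mean()
```

Each resulting Series can then be passed to a bar or line chart, for example with `monthly_avg.plot(kind="bar")` when matplotlib is available.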
RESULT AND DISCUSSION
Results
DKI Jakarta was chosen as the study area for several reasons. The Special
Capital Region (DKI) of Jakarta is Indonesia's main center of trade and
population and has the highest consumption rate of red cayenne pepper in
Indonesia, so price fluctuations here have a major impact on the local and
national economy [11]. Thanks to the regular data collection efforts of the
Jakarta National Food Agency, historical information on the price of red
bird's eye chili in Jakarta is also more complete and stable. In addition, the
presence of various types of markets, both traditional and modern, in Jakarta
facilitates the collection of complete and diverse data [12]. Data collection
and research are easier to carry out in Jakarta due to its excellent
infrastructure and high accessibility. Price analysis in Jakarta is also
particularly important for national policy formulation because, as the
nation's capital, its trade and economic policies often serve as national
standards. Finally, the complicated red cayenne pepper market in Jakarta,
which is characterized by sudden price changes, offers an interesting
challenge for evaluating the effectiveness of the XGBoost algorithm in a
dynamic setting [13].
Table 1. Statistical Analysis
Statistical Analysis | Results
R-squared (R²) Score Train | 0.99
R-squared (R²) Score Test | 0.92
Cross-Validation Score | 0.96
Table 1 displays the model evaluation results. The model predicts both the
training data and previously unseen data with high accuracy. With an R²
(coefficient of determination) score of 0.99 on the training dataset, the
model explains about 99% of the variability in the training data. With an R²
score of 0.92 on the test data, the model explains 92% of the variability in
data that it has never seen before, indicating strong generalizability. The
gap between the two scores suggests a modest degree of overfitting to the
training data. Additionally, the model performed consistently across different
subsets of the training data, as seen from the average cross-validation score
of 0.96, which indicates that its performance does not depend on one
particular split of the data.
Figure 1: Average Red Cayenne Pepper Prices in DKI Jakarta by Month and by Day of the Week
The figure above shows an upward trend in the average price of red cayenne
pepper in Jakarta between August and December. Prices started to rise in
September and peaked in November, when the average price reached IDR 75,000
per kilogram, before declining slightly in December. The dry season, short
supply, and unfavorable weather patterns may be some of the causes of this
price spike. Rising prices of red bird's eye chilies, an important commodity
for Indonesians, can affect inflation and people's purchasing power.
Therefore, to maintain the stability of red chili prices, production should be
increased, distribution should be improved, and market intervention is needed.
The figure above also depicts the price pattern of red cayenne pepper in DKI
Jakarta by day of the week. Prices are usually highest on Mondays and Sundays
and lowest on Wednesdays and Thursdays. Monday is the most expensive day, with
an average price of IDR 60,200 per kg, while by Thursday the price has
gradually dropped to IDR 59,500 per kg, a difference of IDR 700 per kg (about
1.2%). Weather, religious holidays, supply and demand, and other factors
affect these price fluctuations. The higher prices on Mondays and Sundays may
reflect high demand at the beginning and end of the week and possibly limited
availability, whereas in the middle of the week the supply of red cayenne
pepper is more consistent and demand tends to decrease.
Correlation Matrix
Figure 2 shows a correlation matrix of the price of red chili in the various
locations in DKI Jakarta. Each cell of the matrix is the correlation
coefficient between the price of red chili in one region and the price in
another region. The correlation coefficient takes values from -1 to 1. A
positive value indicates that when the price of red chili increases in one
place, it also tends to increase in the other. A negative value indicates that
the price of red chili tends to fall in one region when it increases in the
other. A value of 0 indicates no linear relationship between the prices in the
two regions.
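A correlation matrix of this kind can be computed with pandas' DataFrame.corr. In the sketch below the three regional series are synthetic, and the third one is deliberately constructed to move against the others so that a negative coefficient appears; real regional prices would not be built this way.

```python
import numpy as np
import pandas as pd

# Synthetic regional price series sharing a common random-walk component.
rng = np.random.default_rng(0)
base = 50000 + rng.normal(scale=3000, size=200).cumsum()
regional = pd.DataFrame(
    {
        "Jakarta Selatan": base + rng.normal(scale=500, size=200),
        "Jakarta Timur": base + rng.normal(scale=500, size=200),
        # Artificial series moving in the opposite direction.
        "Jakarta Utara": -base + rng.normal(scale=500, size=200),
    }
)

# Pairwise Pearson correlation coefficients, each in [-1, 1].
corr = regional.corr(method="pearson")
```

The diagonal is always 1 because each series is perfectly correlated with itself, and the matrix is symmetric.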
Correlation Matrix Interpretation:
Figure 3. Learning Curves and Residual Plot
The learning curves in the figure above provide insight into the learning
process of the model. At first, the error decreases rapidly, indicating that
the model quickly captures the basic structure of the data. Thereafter, the
rate of decline slows down, indicating that the model is finding fewer new
patterns and is approaching its capacity limit. The error eventually
stabilizes, indicating that the model has made the best use of the available
data. The shape of this curve depends on variables such as data quality, model
complexity, and model design. By examining the learning curve, we can estimate
model performance, determine the optimal amount of training data, and compare
different models.
The figure also shows the residual plot for the training and testing data. A
residual is the difference between the value predicted by the model and the
observed value. The plot makes the performance of the model on the training
and testing data easier to see. The blue line represents the residuals of the
training data; these are mostly clustered around zero, showing that the model
predicts the training values well. The green line represents the residuals of
the testing data. Larger residuals indicate less accurate predictions of the
values in the test data. Overall, the model performs well on both training and
testing data; however, it is more accurate on the training data. This is
expected because the training data was used to fit the model, while the
testing data is data that the model has not previously seen.
Figure 4: Predicted price of red bird's eye chilies in DKI Jakarta in the next 1 year
Figure 4 shows the forecasted price of red cayenne pepper in the Jakarta
Special Capital Region. The price of red cayenne pepper increases from July
2021 to January 2024, as seen from the blue line that rises steadily in the
historical data. In addition, the rising orange line indicates that the price
trend is predicted to continue upward over the forecast period, from July 2024
to January 2026. However, these predictions are only estimates, and the actual
price of red cayenne pepper may differ from the forecast. This uncertainty is
shown by the gray confidence interval surrounding the projected line. The
price of red cayenne pepper is affected by various factors, including supply
and demand in the market as well as climatic conditions. As a result, the
actual price may experience significant short-term variations.
Discussion
This research resulted in an
in-depth understanding of the price fluctuations of red cayenne pepper in the
Jakarta Special Capital Region from 2021 to 2024. Statistical analysis shows
that the applied XGBoost model has excellent performance with R-squared
values reaching 0.99 on training data and 0.92 on testing data. This indicates
that the model is able to explain most of the variability in the price of red
cayenne pepper, both in the data used for training and new data that has never
been seen before. In addition, the learning curve showed that the model quickly
understood the basic pattern of the data but had difficulty finding new
patterns, indicating limitations in the available data.
The findings have significant practical implications for economic and trade
policy in Indonesia. The price stability of red cayenne pepper in Jakarta is
crucial given its impact on inflation and people's purchasing power. By
understanding price fluctuations that are influenced by factors such as
seasonality, supply availability, and weather patterns, the government can
design a comprehensive strategy to maintain price stability (M'hamdi et al.,
2024). This includes increasing production, managing market operations, and
promoting diversification, which are important to protect the interests of
traders, farmers, and consumers. Thus, these findings not only provide insight
into the dynamics of the red cayenne pepper market but also suggest concrete
measures to improve food security and the national economy (Komaria et al.,
2023).
This research is supported by comprehensive and diverse data collected by the
Jakarta National Food Agency, as well as a careful analysis of the price of
red cayenne pepper in various areas of DKI Jakarta (Deviyanto & Aji, 2023).
Relevant references include previous research on commodity price fluctuations
and machine learning techniques, which strengthen the methodology and the
interpretation of results. In addition, the analysis of correlations between
prices in different locations of DKI Jakarta adds interpretative value,
enabling a better understanding of price interdependencies within a given
geographical area (Kusumiyati et al., 2021).
As a step towards updating knowledge in this area, this study presents new
findings in the analysis of red chili pepper prices in Jakarta. While there
have been previous studies on red chili price fluctuations, this approach
integrates machine learning techniques with a broader and more detailed
dataset, resulting in more accurate price predictions and more focused policy
recommendations. As such, this study fills a gap in the literature by
expanding the understanding of the factors affecting red chili pepper prices
and their strategic implications for economic decision-making in Indonesia
(Nababan et al., 2023).
CONCLUSION
This study reveals that the
price fluctuations of red cayenne pepper in DKI Jakarta from 2021 to 2024 are
influenced by factors such as seasonality, supply availability, and demand.
Statistical analysis shows that the XGBoost model provides excellent
performance, with R-squared values reaching 99% on the training data and
92% on the test data, demonstrating the model's ability to explain price
variability effectively. The findings have significant implications for
economic and trade policy in Indonesia, with red cayenne pepper price stability
being key in managing inflation and people's purchasing power. Suggested
strategies include increasing production, managing market operations, and
promoting diversification to protect national economic interests. As such, this
study not only provides in-depth insights into local market dynamics, but also
provides a framework for results-oriented policy actions, which are essential
for improving food security and national economic stability in the future.
REFERENCES
Ananda, D., Pertiwi, A., & Muslim, M. A. (2022). Prediksi Rating Aplikasi
Playstore Menggunakan Xgboost. ResearchGate, October 2020, 6.
Asikin, M. Z., Fadilah, M. O., Saputro, W. E., Aditia, O., & Ridzki, M. M.
(2024). The Influence Of Digital Marketing On Competitive Advantage And
Performance of Micro, Small And Medium Enterprises. International Journal of
Social Service and Research, 4(03), 963–970.
Asselman, A., Khaldi, M., & Aammou, S. (2023). Enhancing the prediction of
student performance based on the machine learning XGBoost algorithm.
Interactive Learning Environments, 31(6), 3360–3379.
https://doi.org/10.1080/10494820.2021.1928235
Bayona-Oré, S., Cerna, R., & Hinojoza, E. T. (2021). Machine learning for
price prediction for agricultural products. WSEAS Transactions on Business and
Economics, 18, 969–977. https://doi.org/10.37394/23207.2021.18.92
Budholiya, K., Shrivastava, S. K., & Sharma, V. (2022). An optimized XGBoost
based diagnostic system for effective prediction of heart disease. Journal of
King Saud University - Computer and Information Sciences, 34(7), 4514–4523.
https://doi.org/10.1016/j.jksuci.2020.10.013
Deviyanto, A., & Aji, J. M. M. (2023). Fluktuasi Harga dan Efisiensi Pemasaran
Cabai Rawit di Desa Sepanjang Kecamatan Glenmore Kabupaten Banyuwangi. Jurnal
Pertanian Agros, 25(1), 529–537.
Et. al., J. A. (2021). Prediction of House Price Using XGBoost Regression
Algorithm. Turkish Journal of Computer and Mathematics Education (TURCOMAT),
12(2), 2151–2155. https://doi.org/10.17762/turcomat.v12i2.1870
Harga, P., Rawit, C., Di Kota, H., Menggunakan, J., Markov, R., Budiarti, D.
I., Kholijah, G., Yurinanda, S., & Mardhotillah, B. (2023). Price Prediction
of Green Cayenne Pepper in Kota Jambi Using Markov Chain. 2(1), 2023.
Komaria, V., Maidah, N. El, & Furqon, M. A. (2023). Prediksi Harga Cabai Rawit
di Provinsi Jawa Timur Menggunakan Metode Fuzzy Time Series Model Lee.
Komputika: Jurnal Sistem Komputer, 12(2), 37–47.
https://doi.org/10.34010/komputika.v12i2.10644
Kusumiyati, K., Putri, I. E., & Munawar, A. A. (2021). Model Prediksi Kadar
Air Buah Cabai Rawit Domba (Capsicum frutescens L.) Menggunakan Spektroskopi
Ultraviolet Visible Near Infrared. Agro Bali: Agricultural Journal, 4(1),
15–22. https://doi.org/10.37637/ab.v0i0.615
M'hamdi, O., Takács, S., Palotás, G., Ilahy, R., Helyes, L., & Pék, Z. (2024).
A Comparative Analysis of XGBoost and Neural Network Models for Predicting
Some Tomato Fruit Quality Traits from Environmental and Meteorological Data.
Plants, 13(5). https://doi.org/10.3390/plants13050746
Nababan, A. A., Jannah, M., Aulina, M., & Andrian, D. (2023). Prediksi
Kualitas Udara Menggunakan Xgboost Dengan Synthetic Minority Oversampling
Technique (Smote) Berdasarkan Indeks Standar Pencemaran Udara (Ispu). JTIK
(Jurnal Teknik Informatika Kaputama), 7(1), 214–219.
https://doi.org/10.59697/jtik.v7i1.66
Nasution, M. K., Saedudin, R. R., & Widartha, V. P. (2021). Perbandingan
Akurasi Algoritma Naïve Bayes Dan Algoritma Xgboost Pada Klasifikasi Penyakit
Diabetes. E-Proceeding of Engineering, 8(5), 9765–9772.
Tran, N.-Q., Nguyen Ngoc, T., Tran, Q., Felipe, A., Huynh, T., Tang, A., &
Nguyen, T. (2023). Predicting Agricultural Commodities Prices with Machine
Learning: A Review of Current Research.
Windhy, A. M., & Jamil, A. S. (2021). Peramalan Harga Cabai Merah Indonesia:
Pendekatan ARIMA. Jurnal Agriekstensia, 20(1), 78–87.
Yuditya, A., Hardjanto, A., & Sehabudin, U. (2023). Fluktuasi Harga dan
Integrasi Pasar Cabai Merah Besar (Studi Kasus: Pasar Induk kramat Jati dan
Pasar Eceran di DKI Jakarta). Indonesian Journal of Agriculture Resource and
Environmental Economics, 2(1), 1–13.
https://doi.org/10.29244/ijaree.v2i1.50669