Eduvest – Journal of Universal Studies, Volume 4, Number 06, June 2024, p-ISSN 2775-3735, e-ISSN 2775-3727

Flood Prediction based on Weather Parameters in Jakarta using K-Nearest Neighbours Algorithm
Hariman Lumbantobing¹, Irma Ratna Avianti², Kukuh Harisapto³, Suharjito⁴
¹,²,³,⁴Universitas Bina Nusantara, Jakarta, Indonesia
Email: hariman.lumbantobing@binus.ac.id, irma.avianti@binus.ac.id, kukuh.harisapto@binus.ac.id, suharjito@binus.ac.id
ABSTRACT
Flooding is a complex and recurrent hazard in Indonesia, particularly in Jakarta during the rainy season. Floods have been the subject of many efforts, ranging from identifying their causes to reducing their impacts. Floods cause significant damage to infrastructure, the social economy, and human lives. The government continues to develop reliable flood risk maps and plans for long-term flood risk management. According to data from Jakarta Flood Monitoring, an average of 12 sub-districts and 26 urban villages were hit by floods each year between 2016 and 2020, with an average flood duration of nearly 2 days. The flood trend in Jakarta decreased from 2018 to 2019 but increased in 2020. Floods are caused by a variety of factors, including weather, geography, and human actions such as deforestation. Robust flood prediction is required for disaster management; however, it can be difficult owing to changing weather conditions. This study focuses on flood prediction in Jakarta based on weather parameters, using machine learning techniques to provide accurate and real-time predictions. The K-Nearest Neighbours (KNN) algorithm is employed to forecast the areas that will be affected by floods. Among the values k=2 to k=9, the best performance was obtained at k=7, with 92.25% accuracy, 88.89% precision, 92.25% recall, and an F1-measure of 89.52%. Integrating machine learning algorithms that encompass multiple weather variables provides significant utility for comprehensive flood prediction and early warning systems in flood disaster mitigation.
KEYWORDS	Flood Prediction, Weather Parameters, Machine Learning, K-Nearest Neighbours
	*This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International*

INTRODUCTION

Floods are one of the most complicated and widespread disasters (Tayfur et al., 2018). Floods are a common occurrence in Indonesia during the rainy season, particularly in Jakarta. Numerous attempts have been undertaken to prevent and reduce the possibility of floods, starting from the identification of the causes and mitigation of their impact (Shafizadeh-Moghadam et al., 2018). Floods are the most damaging natural disasters, with widespread impacts on infrastructure, social economics, and human life (Chuang et al., 2020). Thus, the government continues to develop reliable and accurate maps of flood risk areas and prepare for future sustainable flood risk management that emphasizes protection and preventative measures (Kamal et al., 2018).

According to data collected between 2016 and 2020 from Jakarta Flood Monitoring (pantaubanjir.jakarta.go.id), the average area impacted by floods in Jakarta, Indonesia over this period was 12 sub-districts and 26 urban villages, and the average duration of flooding was almost 2 days.

Figure 1. Average of Affected Sub-district and Urban-village by Flood in Jakarta, Indonesia

Figure 2. Average of Flood Duration Affected Areas

(Source: pantaubanjir.jakarta.go.id)

Based on the graphs above, the area affected by flooding in Jakarta showed a downward trend from 2018 to 2019 but an upward trend in 2020. Floods can be caused by a combination of several factors, namely high rainfall, topography, and human factors. The increase in human activity and population is closely related to increased land use and a larger covered land area. This reduces the water absorption area and increases the volume of surface water as forests are converted into agricultural and residential areas. This deforestation is accompanied by increased erosion and shallowing of rivers (Riza et al., 2020).

Floods often occur in big cities as a result of rapid urban growth, which causes water catchment areas to be covered by buildings. The occurrence of floods is also linked to human behavior, such as throwing rubbish into waterways and rivers, and to the narrowing of river channels by houses built along the riverbanks (Hernawan et al., 2024). Strong, reliable, and accurate prediction models are needed in hazard assessment and extreme event management, as they contribute greatly to strategies and policies for water resource management and evacuation in the event of a disaster. Advanced systems for predicting floods and other hydrological events in the short and long term are highly prioritized to reduce the damage that occurs (Harianto, 2022). However, predicting flood lead times and the locations where floods will occur can be complicated due to the dynamic climatic conditions in each region. Thus, current major flood prediction models mostly use specific data and rely on simplified assumptions (Putra et al., 2019).

In this study, the researchers conducted flood prediction to delineate the regions that would experience flooding in the Jakarta area. This was achieved using a computational technique intended to yield accurate and real-time predictions. Machine learning methods have been widely employed in predictive analyses involving variables that can influence the predicted outcome. Drawing on a review of the existing literature on flood prediction, and taking into account the available data and desired outcomes, the researchers chose the K-Nearest Neighbors (KNN) algorithm for flood prediction in the Jakarta area, Indonesia.

Riza et al. (2020), in "Advancing Flood Disaster Mitigation in Indonesia using Machine Learning Methods", undertook a comprehensive literature review of publications on the use of machine learning for flood disaster mitigation in the Indonesian context. The empirical data employed for that study covered flood events occurring throughout Indonesia over 15 years, between 2005 and 2019. The research examined a wide array of factors that contribute to flood disasters, while also considering multiple structural and non-structural approaches to mitigating them (Khosravi et al., 2018). The application of machine learning techniques for flood disaster mitigation has been explored in various research studies, using flood event data from diverse regions in Indonesia and different algorithms. The review covers a broad range of factors associated with flooding, beginning with rainfall forecasting and river water level prediction, which serve as integral components of flood forecasting and early warning systems.

In the Banyuwangi region of Indonesia, ANFIS (Adaptive Neuro Fuzzy Inference System) was employed to conduct rainfall forecasts. Two neural network methods were utilized, with the initial method displaying superior accuracy. Similarly, in the Denpasar area, a comparative analysis was conducted in 2016 to predict rainfall, employing Adaline and Multiple Linear Regression. The resulting error rates, namely MSE, were 0.025129 and 0.025953, respectively. Subsequently, the RMSE values were determined to be 0.158522 and 0.161098. In Malang City, rainfall predictions were carried out using ANN (Artificial Neural Network), yielding time-specific forecasts. The monthly, daily, and hourly error rates were determined to be 11.49%, 8.49%, and 19.32%, respectively.

In the Ciliwung region of Indonesia, ANFIS, ANN, and FIS were employed to predict river water levels. The findings demonstrated that ANFIS, using three feature data, performed better than ANN. The outcomes of ANFIS can be used as input for the FIS model, which subsequently enables accurate water level predictions at the Manggarai sluice gate. Within this context, BP-NN (Backpropagation Neural Network) is employed to forecast both rainfall and water levels. Moreover, SVM is incorporated to predict floods for the upcoming six days, with the optimal combination being a 60:40 split of training and testing data.

Similarly, within the Deli Serdang area of North Sumatra, BP-NN was employed to forecast rainfall and river water levels, and SVM was used to predict floods. In the Ular Tajur River area, BP-NN was implemented to forecast river water discharge. With training and testing data combined in a 60:40 ratio, flood predictions can be obtained that serve as an early warning for the subsequent six days. Furthermore, in this area, flood predictions were conducted using the K-NN and Naïve Bayes algorithms, based on rainfall and water levels. These predictions achieved an accuracy rate of 93.4%, with an associated error rate of 6.6%. Numerous other regions in Indonesia have been the subject of research aimed at predicting floods and rainfall using machine learning techniques. The outcome of this literature review will serve as a foundation for BPBD DKI Jakarta in providing flood prediction data, which will be instrumental in making decisions regarding the opening and closing of the Manggarai floodgates. In the future, flood predictions can be facilitated through the implementation of single and double algorithms, incorporating Fuzzy Logic and ANN.

Sankaranarayanan et al. (2020), in "Flood Prediction based on Weather Parameters using Deep Learning", explored flood prediction in India. The predictions were based on factors such as rainfall, humidity, temperature, water flow, and water level. To determine the most accurate model, a deep neural network was compared with other models including SVM, KNN, and Naïve Bayes. The research used a dataset of flood events in India spanning 1990 to 2002 and concluded that the Deep Neural Network model was the most suitable for the Indian dataset.

Gauhar et al. (2021), in "Prediction of Flood in Bangladesh using k-Nearest Neighbors Algorithm," carried out flood predictions specifically for Bangladesh. The researchers employed feature selection and the KNN (k-Nearest Neighbors) algorithm. The dataset consisted of 20544 data points from 32 districts in Bangladesh. The attributes used in the study included rainfall, cloud coverage, relative humidity, minimum temperature, and wind speed. The KNN model yielded a testing accuracy of 94.91%, an average precision of 92%, and an average recall of 91%.

RESEARCH METHOD

The weather records spanning 5 years in Jakarta were sourced from Kaggle, drawing on data compiled by various outlets located in Jakarta. Specifically, details regarding flood occurrences in specific months and years were gathered from diverse sources such as annual flood reports, newspapers, and academic papers. These findings were subsequently compiled into a dataset comprising a total of 90185 entries. Figure 3 shows the dataset, which contains 15 attributes, including average humidity, rainfall, minimum temperature, maximum temperature, average temperature, wind direction at maximum speed, maximum wind speed, location of the station, duration of sunshine, most frequent wind direction, the id of the station that recorded the data, and the station name. These attributes are the independent variables. The last column, flood, is the target (dependent) variable, where 1 means true and 0 means false. About 4.7% of the data was missing and was addressed during the preprocessing and normalization stage.

Figure 3. Dataset Preparation

Given the disparate units, ranges, and magnitudes observed in the dataset's features, it was necessary to standardize the data. This study employed z-score normalization to address this variability. This method transforms the data by centering it around a mean value of zero and adjusting its scale to achieve unit variance. The z-score normalization formula is represented as,

$$z = \frac{x - \mu}{\sigma}$$

Here, x represents the sample data, μ signifies the mean of the training sample, and σ denotes the standard deviation of the training sample. This normalization process ensures that the features are on a comparable scale, facilitating more effective analysis and modeling.
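The normalization step can be illustrated with a minimal sketch, using only the Python standard library. The function names `fit_zscore` and `apply_zscore` and the rainfall values are hypothetical and are not taken from the study's dataset or from the tool the authors used; the sketch only assumes that μ and σ are estimated from the training sample, as stated above.

```python
from statistics import mean, pstdev

def fit_zscore(train_column):
    """Estimate mu and sigma from the training values only."""
    mu = mean(train_column)
    sigma = pstdev(train_column)  # population standard deviation
    return mu, sigma

def apply_zscore(values, mu, sigma):
    """Scale values with z = (x - mu) / sigma; a constant column maps to 0."""
    if sigma == 0:
        return [0.0 for _ in values]
    return [(x - mu) / sigma for x in values]

# Hypothetical rainfall readings (mm), for illustration only.
train_rainfall = [0.0, 2.0, 4.0, 6.0, 8.0]
mu, sigma = fit_zscore(train_rainfall)
scaled = apply_zscore(train_rainfall, mu, sigma)
```

Reusing the training-set μ and σ when scaling the test data keeps both subsets on the same scale.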

Figure 4 shows all attributes after z-score normalization, with no missing data remaining in the dataset. The dataset was then passed to the k-NN algorithm and the results were evaluated.

Figure 4. Dataset Scaling

A. Machine Learning Classifier:

The k-Nearest Neighbors (k-NN) algorithm is a widely used supervised machine learning technique that leverages feature similarity to predict outcomes for new data points. This approach assigns values to predicted data points based on their resemblance to the nearest points in the training set. The k-NN algorithm was used for predicting floods in Jakarta and involves several steps:

1. Choose the k-value, where k represents the number of nearest neighbors.

2. Determine the distances between the training data points and the data point to be classified.

3. Order the training data points by arranging them in ascending sequence according to their distance values.

4. Make predictions based on the majority of the nearest neighbors.

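In our study, the k-value was varied from 2 to 9. We employed a uniform weight function to assign equal weights to all points within each neighborhood. To compute distances, we utilized the Minkowski distance formula,

$$d(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}$$

where x and y are two data points with n features and p is the order of the distance; p = 2 gives the Euclidean distance.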

Figure 1 provides a visual representation of how the k-NN algorithm operates in predicting floods. The data point under consideration is compared to its k-nearest neighbors, and based on the closest and most similar points, it is classified accordingly.
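The steps listed above can be sketched in plain Python as follows. This is an illustrative sketch, not the implementation used in the study: the function names, the two-feature points, and the labels are hypothetical, and a uniform-weight majority vote is assumed.

```python
from collections import Counter

def minkowski(a, b, p=2):
    """Minkowski distance of order p (p=2 is Euclidean, p=1 is Manhattan)."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)

def knn_predict(train_X, train_y, query, k=3, p=2):
    # Step 2: distance from the query to every training point.
    distances = [(minkowski(x, query, p), label)
                 for x, label in zip(train_X, train_y)]
    # Step 3: order the training points by ascending distance.
    distances.sort(key=lambda pair: pair[0])
    # Step 4: uniform-weight majority vote among the k nearest neighbours.
    # With an even k, a tie falls to the label encountered first.
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

# Hypothetical z-scored features (rainfall, humidity); 1 = flood, 0 = no flood.
train_X = [(-1.0, -0.8), (-0.9, -1.1), (-1.2, -0.7),
           (1.1, 0.9), (0.8, 1.2), (1.3, 1.0)]
train_y = [0, 0, 0, 1, 1, 1]

wet_day = knn_predict(train_X, train_y, (1.0, 1.0), k=3)
dry_day = knn_predict(train_X, train_y, (-1.0, -1.0), k=3)
```

Because distances mix several weather attributes, the features are assumed to be normalized beforehand so that no single attribute dominates the sum.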

In this phase, the complete dataset was divided into two subsets, a training dataset and a testing dataset, in an 80:20 ratio: 20% of the data was designated for testing while the remaining 80% was retained for training the model. To achieve this split, we set the test size parameter to 0.2. The random state was fixed at 50 to ensure reproducibility and consistency in the train-test splits across different runs. This deterministic approach guarantees that the same training and test subsets are generated each time the dataset is partitioned, aiding reliable model evaluation and comparison.
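A seeded 80:20 split can be sketched with the standard library as shown below. The helper `train_test_split_indices` is hypothetical and stands in for whatever splitting utility the authors used; it will not reproduce their exact partition, and only illustrates how a fixed seed makes the split repeatable.

```python
import random

def train_test_split_indices(n_rows, test_size=0.2, seed=50):
    """Shuffle row indices with a fixed seed and cut off the test share."""
    rng = random.Random(seed)          # fixed seed -> same shuffle every run
    indices = list(range(n_rows))
    rng.shuffle(indices)
    n_test = round(n_rows * test_size)
    return indices[n_test:], indices[:n_test]   # (train, test)

# 90185 is the number of entries reported for the dataset.
train_idx, test_idx = train_test_split_indices(90185)
```

Calling the function again with the same seed returns the same two index lists, which is the reproducibility property described above.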

To assess the effectiveness of the model, this study employed several key metrics: accuracy, precision, recall, and F1-score. True Positive (TP) signifies instances where the model correctly predicts the occurrence of floods, while True Negative (TN) represents cases where the model correctly predicts the absence of floods. False Positive (FP) occurs when the model incorrectly predicts a flood occurrence, whereas False Negative (FN) indicates instances where the model incorrectly predicts the absence of floods.

The following formulas were utilized to compute these metrics:

Accuracy: This metric gauges the overall correctness of the model's predictions and is calculated as:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

Precision: Precision quantifies the proportion of true flood predictions among all positive predictions, and is expressed as:

$$\text{Precision} = \frac{TP}{TP + FP}$$

Recall: Recall, also known as sensitivity or true positive rate, measures the ability of the model to correctly identify actual flood occurrences, and is given by:

$$\text{Recall} = \frac{TP}{TP + FN}$$

F1-Score: The F1-score is the harmonic mean of precision and recall, offering a balanced assessment of the model's performance, and is calculated as:

$$\text{F1} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$

These metrics provide valuable insights into the model's predictive capabilities, allowing for a comprehensive evaluation of its performance across various aspects of flood prediction.
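As a worked illustration, the four metrics can be computed directly from the confusion counts. The function name and the counts below are hypothetical and are not results from this study.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical counts: 40 floods caught, 40 missed, 20 false alarms.
metrics = classification_metrics(tp=40, tn=900, fp=20, fn=40)
```

With these counts the accuracy is 0.94 while recall is only 0.5, which shows why accuracy alone can look favourable when flood days are rare.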

RESULT ANALYSIS

The system was implemented in Orange Data Mining, as illustrated in Figure 5, a versatile tool for data analysis and modeling. Following preprocessing and dataset scaling, the training data is fed into the k-NN model using an iterative function that varies k from 2 to 9. This iterative approach allows a comprehensive exploration of neighborhood sizes; each k-NN model used the Euclidean metric and uniform weighting.

Figure 5. Implementation of k-NN Model

The k-NN model was evaluated using Test and Score with a stratified two-fold setting, and with random sampling in which the train-test procedure was repeated 50 times with a training set size of 80%. This testing strategy enables the assessment of model performance across many iterations, providing insights into its stability and consistency. By stratifying the dataset into two folds and repeating the train-test process multiple times, the model's performance is evaluated under diverse conditions, reducing the impact of random variation and enhancing the reliability of the results. Additionally, the substantial training set size provides ample data for model training while keeping a sizable portion for validation.
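The repeated random-sampling evaluation described above can be approximated by the following sketch. It is not Orange's implementation: the function names, the seed, and the toy data are hypothetical, and a plain Euclidean k-NN with uniform weights is assumed.

```python
import random
from collections import Counter, defaultdict

def knn_predict(train_X, train_y, query, k):
    """Uniform-weight k-NN vote using Euclidean distance."""
    nearest = sorted(
        (sum((a - b) ** 2 for a, b in zip(x, query)) ** 0.5, label)
        for x, label in zip(train_X, train_y)
    )[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def stratified_split(labels, train_frac, rng):
    """Split row indices so each class keeps about train_frac in training."""
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    train_idx, test_idx = [], []
    for rows in by_class.values():
        rng.shuffle(rows)
        cut = round(len(rows) * train_frac)
        train_idx.extend(rows[:cut])
        test_idx.extend(rows[cut:])
    return train_idx, test_idx

def mean_accuracy_per_k(X, y, k_values=range(2, 10), repeats=50,
                        train_frac=0.8, seed=0):
    """Average test accuracy over repeated stratified train-test splits."""
    rng = random.Random(seed)
    totals = {k: 0.0 for k in k_values}
    for _ in range(repeats):
        train_idx, test_idx = stratified_split(y, train_frac, rng)
        train_X = [X[i] for i in train_idx]
        train_y = [y[i] for i in train_idx]
        for k in k_values:
            hits = sum(knn_predict(train_X, train_y, X[i], k) == y[i]
                       for i in test_idx)
            totals[k] += hits / len(test_idx)
    return {k: total / repeats for k, total in totals.items()}

# Toy, clearly separated data (hypothetical): 20 points per class.
X = [(-2.0 + 0.05 * i, -2.0 + 0.03 * i) for i in range(20)] + \
    [(2.0 - 0.05 * i, 2.0 - 0.03 * i) for i in range(20)]
y = [0] * 20 + [1] * 20
scores = mean_accuracy_per_k(X, y)
```

Averaging over many splits reduces the dependence of the reported score on any single random partition.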

Upon executing the k-NN model with varying values of k, a detailed analysis of performance metrics is conducted, as summarized in Table 1. These metrics provide a comprehensive overview of the system's performance across different k-values. Notably, at k = 7 the system achieves an accuracy of 92.25%, indicating a high proportion of correctly classified instances. Additionally, metrics such as F1-score, precision, and recall offer insights into the model's ability to balance true positives, false positives, and false negatives, which is crucial for tasks where misclassifications carry significant consequences.

Table 1. Performance Measure Index

k	Accuracy (%)	F1 (%)	Precision (%)	Recall (%)	AUC
k=2	91.88273	89.07542	87.6295	91.88273	0.598
k=3	91.06815	89.12275	87.8290	91.06815	0.624
k=4	92.20919	89.28322	88.4142	92.20919	0.640
k=5	91.93661	89.51938	88.5256	91.93661	0.652
k=6	92.36450	89.33794	88.9306	92.36450	0.661
k=7	92.25357	89.51967	88.89151	92.25357	0.671
k=8	92.45325	89.37243	89.38343	92.45325	0.677
k=9	92.35182	89.44414	89.0314	92.35182	0.685

Figure 6. Performance Measurement Graph

The performance of the k-NN model across different values of k is further illustrated in Figure 6, the Performance Measurement Graph. This visual representation facilitates the identification of an optimal k-value by showing the relationship between k and performance metrics such as accuracy, precision, and recall. Notably, the F1-score peaks at k = 7 (89.52%), suggesting a good balance between model complexity and predictive efficacy, although accuracy and precision are marginally higher at k = 8 (Table 1).
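The choice of k depends on which metric is prioritized, and this can be checked with a short sketch over the values transcribed from Table 1. The helper `best_k` is hypothetical and is not part of the study's workflow.

```python
# Values transcribed from Table 1 (accuracy, F1, precision, recall in %).
table_1 = {
    2: {"accuracy": 91.88273, "f1": 89.07542, "precision": 87.6295,  "recall": 91.88273, "auc": 0.598},
    3: {"accuracy": 91.06815, "f1": 89.12275, "precision": 87.8290,  "recall": 91.06815, "auc": 0.624},
    4: {"accuracy": 92.20919, "f1": 89.28322, "precision": 88.4142,  "recall": 92.20919, "auc": 0.640},
    5: {"accuracy": 91.93661, "f1": 89.51938, "precision": 88.5256,  "recall": 91.93661, "auc": 0.652},
    6: {"accuracy": 92.36450, "f1": 89.33794, "precision": 88.9306,  "recall": 92.36450, "auc": 0.661},
    7: {"accuracy": 92.25357, "f1": 89.51967, "precision": 88.89151, "recall": 92.25357, "auc": 0.671},
    8: {"accuracy": 92.45325, "f1": 89.37243, "precision": 89.38343, "recall": 92.45325, "auc": 0.677},
    9: {"accuracy": 92.35182, "f1": 89.44414, "precision": 89.0314,  "recall": 92.35182, "auc": 0.685},
}

def best_k(table, metric):
    """Return the k whose row has the largest value for the given metric."""
    return max(table, key=lambda k: table[k][metric])

best = {metric: best_k(table_1, metric)
        for metric in ("accuracy", "f1", "precision", "recall", "auc")}
```

On these values, k = 7 maximizes the F1-score, k = 8 maximizes accuracy, precision, and recall, and k = 9 maximizes AUC, so the selected k reflects a preference for the F1-score.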

Moreover, the iterative exploration of k-values allows for the identification of trends beyond the optimal point. From k = 8 to k = 9, both accuracy and precision decline slightly (Table 1), indicating a potential loss of generalization beyond a certain neighborhood size. This observation underscores the importance of careful hyperparameter selection and model evaluation to ensure robust performance across varying datasets and application scenarios.

CONCLUSION

Precise flood prediction is crucial for Jakarta, enabling the city to effectively manage the aftermath of flooding. In our study, we applied dataset preprocessing and scaling and then used a KNN model for flood prediction. By evaluating the system's performance across different k values, we identified the optimal parameter. The optimal k value was determined to be 7, yielding 92.25% accuracy, 88.89% precision, 92.25% recall, and an F1-measure of 89.52%. In the future, we anticipate that this study will contribute to advancing flood prediction techniques, ultimately supporting Jakarta in managing future flood events.

A limitation of the research is the lack of access to a more recent dataset for validating the machine learning model for flood prediction. This absence potentially restricts the model's ability to achieve higher accuracy. However, this limitation is not considered a major issue. The study also suggests several directions for future research. Firstly, it highlights the need to address data distribution imbalance, especially in light of advancements in neural network prediction techniques. Secondly, the incorporation of additional topographical factors, such as flood water level, could offer further insights and improve the accuracy of flood prediction models.

REFERENCES

Chuang, M.-T., Chen, T.-L., & Lin, Z.-H. (2020). A review of resilient practice based upon flood vulnerability in New Taipei City, Taiwan. International Journal of Disaster Risk Reduction, 46, 101494.

Gauhar, N., Das, S., & Moury, K. S. (2021). Prediction of flood in Bangladesh using K-nearest neighbors algorithm. 2021 2nd International Conference on Robotics, Electrical and Signal Processing Techniques (ICREST), 357–361.

Harianto, D. W. I. Y. (2022). Analisis Muka Air Banjir Sungai Segeri Pada Persilangan Jalur Ka Lintas Makassar-Parepare. Universitas Bosowa.

Hernawan, A., Savandha, S. D., Karsa, A. H. A. N., Asikin, M. Z., & Fadilah, M. O. (2024). Application of Business Model Canvas in MSMEs in Karangwuni Village. International Journal of Social Service and Research, 4(03), 912–917.

Kamal, A. S. M. M., Shamsudduha, M., Ahmed, B., Hassan, S. M. K., Islam, M. S., Kelman, I., & Fordham, M. (2018). Resilience to flash floods in wetland communities of northeastern Bangladesh. International Journal of Disaster Risk Reduction, 31, 478–488.

Khosravi, K., Pham, B. T., Chapi, K., Shirzadi, A., Shahabi, H., Revhaug, I., Prakash, I., & Bui, D. T. (2018). A comparative assessment of decision trees algorithms for flash flood susceptibility modeling at Haraz watershed, northern Iran. Science of the Total Environment, 627, 744–755.

Putra, F. E. K., Romadhoni, A. Z., & Moe, I. R. (2019). Evaluasi Banjir di Kecamatan Bula Kabupaten Seram Bagian Timur. Media Komunikasi Teknik Sipil, 27(2), 260–267.

Riza, H., Santoso, E. W., Tejakusuma, I. G., & Prawiradisastra, F. (2020). Advancing flood disaster mitigation in Indonesia using machine learning methods. 2020 International Conference on ICT for Smart Society (ICISS), 1–4.

Sankaranarayanan, S., Prabhakar, M., Satish, S., Jain, P., Ramprasad, A., & Krishnan, A. (2020). Flood prediction based on weather parameters using deep learning. Journal of Water and Climate Change, 11(4), 1766–1783.

Shafizadeh-Moghadam, H., Valavi, R., Shahabi, H., Chapi, K., & Shirzadi, A. (2018). Novel forecasting approaches using combination of machine learning and statistical models for flood susceptibility mapping. Journal of Environmental Management, 217, 1–11.

Tayfur, G., Singh, V. P., Moramarco, T., & Barbetta, S. (2018). Flood hydrograph prediction using machine learning methods. Water, 10(8), 968.