Eduvest - Journal of Universal Studies
Volume 2, Number 10, October 2022
p-ISSN 2775-3735, e-ISSN 2775-3727
THE USE OF DISTANCE BLOCKS REPRESENTATIVE TO AVOID EMPTY GROUPS DUE TO NON-UNIQUE MEDOIDS
Kariyam(1,2), Abdurakhman(2), Adhitya Ronnie Effendie(2)
(1) Jurusan Statistika, Fakultas Matematika dan Ilmu Pengetahuan Alam, Universitas Islam Indonesia, Indonesia
(2) Jurusan Matematika, Fakultas MIPA, Universitas Gadjah Mada, Indonesia
Email: [email protected], [email protected], [email protected]
ABSTRACT
Identical objects are practically unavoidable in real data sets. This paper proposes a new indicator and procedure to obtain the initial medoids. The new algorithm guarantees that no empty groups arise and that identical objects fall in the same group, both in the initial and in the final partition. We use six real data sets to evaluate the proposed method and compare the results with other methods in terms of the adjusted Rand index and clustering accuracy. The experimental results show that the performance of the proposed method is comparable with that of other methods.
KEYWORDS: clustering, identical objects, accuracy
|
This work is licensed under a Creative
Commons Attribution-ShareAlike 4.0 International |
|
INTRODUCTION
Research on object grouping techniques and procedures has been carried out with various approaches; with the development of big data, the need for grouping techniques has become inevitable. Ji et al. (2021) proposed a multi-view clustering algorithm for mixed numeric and categorical data. Ji et al. (2020) also proposed a mixed data clustering method based on Cuckoo Search and K-Prototypes. Behzadi et al. (2020) investigated the use of ClicoT (CLustering mixed-type data including Concept Trees) for grouping mixed-type objects, especially nominal and numeric; the method is noise-resistant, and its grouping results can be interpreted well. It can also detect the number of clusters automatically, even without prior knowledge of how many clusters exist. Selosse et al. (2019) proposed the Multiple Latent Block Model (MLBM) to classify mixed-type data sets using a stochastic algorithm approach. Oesting and Schnurr (2020) examined the relative position of an object in a group using a group-size probability distribution approach on ordinal-type data.
Cluster analysis is generally divided into hierarchical, non-hierarchical, and combined techniques (Johnson & Wichern, 2009). One of the most popular non-hierarchical techniques is the K-Means method. As the name implies, K-Means begins by determining k means, either directly or by partitioning the objects into k groups and then calculating the cluster averages. The K-Means method is unsuitable when there are outliers in the data set. To deal with this, the Partitioning Around Medoids (PAM) method, often called K-Medoids, was proposed by Kaufman and Rousseeuw in 1987 and has since undergone development and modification. One problem of k-medoids is its high run-time cost, so several improvements have been developed. Schubert and Rousseeuw (2021) modified the second phase of k-medoids by eagerly performing additional swaps, achieving an O(k) run-time improvement over PAM, CLARA, and CLARANS.
The simple and fast k-medoids (Fast-KM) algorithm is a partitioning method that consists of three phases (Park & Jun, 2009). Fast-KM uses a specific formula to select the set of initial medoids in the first phase. In the second phase, it updates the medoids with the objects that minimize the total within-group distance, and in the third phase it assigns every object to its nearest medoid. The second and third phases repeat until the total distance is stable. Even though this algorithm is simple and fast, it neglects the local optima and empty groups that may arise. For this reason, the simple k-medoids (Simple-KM) algorithm was developed to improve the performance of Fast-KM (Budiaji & Leisch, 2019). Simple-KM always uses the most centrally located object as one member of the initial medoids, while the other medoids are determined randomly. The initialization is repeated s times; the suggested value of s is twenty. When the medoid set contains non-unique objects, for example two medoids with equal values in all their variables, Simple-KM stipulates that the objects closest to the non-unique medoids are assigned to only one of those medoids. To avoid an empty cluster, Simple-KM restricts the non-unique medoids to preserve their labels, which means that identical objects may end up in different groups during the initialization. The flexible k-medoids (Flexible-KM) algorithm was developed to improve the performance of Fast-KM and Simple-KM (Kariyam et al., 2022). Flexible-KM uses an object representing a block defined by a combination of the standard deviation and the sum of the variable values. This algorithm guarantees that no empty groups occur and that identical objects, as non-unique objects, are placed in the same group during initialization.
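To make the failure mode concrete, the following minimal NumPy sketch (our illustration, not code from the cited papers) implements the first phase of Fast-KM as described by Park and Jun (2009); note how identical objects receive tied indicator values, which is exactly the situation that can leave a group empty.

```python
import numpy as np

def fast_km_initial_medoids(D, k):
    """First phase of Fast-KM (Park & Jun, 2009): pick the k objects
    with the smallest v_j = sum_i d_ij / sum_l d_il."""
    row_sums = D.sum(axis=1)                  # sum_l d_il for each object i
    v = (D / row_sums[:, None]).sum(axis=0)   # v_j, accumulated over rows i
    # Identical objects produce identical rows and columns of D and hence
    # tied v_j values; if several of the k smallest v_j belong to identical
    # objects, the chosen medoids are non-unique and a group can end up empty.
    return np.argsort(v)[:k]
```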
This research aims to develop another way to improve the performance of Fast-KM, specifically to avoid the empty groups caused by non-unique medoids. We apply the proposed method to six real data sets from the University of California, Irvine (UCI) repository to evaluate it.
RESEARCH METHOD
Proximity measure and transformation method
We use a simple matching coefficient to get the distance matrix for binary, categorical, and nominal variables (Everitt et al., 2011). The simple matching distance between objects x_i and x_j observed on p categorical variables is

    d_{SM}(x_i, x_j) = \frac{1}{p} \sum_{l=1}^{p} I(x_{il} \neq x_{jl}),    (1)

where I(.) equals one when the two values differ and zero otherwise. For numerical data, we apply the Euclidean distance,

    d_E(x_i, x_j) = \sqrt{ \sum_{l=1}^{p} (x_{il} - x_{jl})^2 },    (2)

or the Manhattan distance,

    d_M(x_i, x_j) = \sum_{l=1}^{p} | x_{il} - x_{jl} |.    (3)

For mixed data (categorical and numeric), we implement a Generalized Distance Function (GDF), which combines the distance on the numerical part of the objects with the simple matching distance on the categorical part,

    d_{GDF}(x_i, x_j) = d_{num}(x_i, x_j) + d_{SM}(x_i, x_j),    (4)

where d_{num} is the Euclidean or Manhattan distance over the numerical variables. We can standardize the numerical data if the data sets are mixed or contain outliers (Jajuga, 2000). The transformation on numerical data is a linear standardization with a pre-determined scale factor f,

    z_{il} = f \, \frac{x_{il} - \min_l}{\max_l - \min_l},    (5)

where \min_l and \max_l are the minimum and maximum of variable l, so that each standardized variable lies in [0, f].
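The following Python sketch (our illustration; the additive GDF combination and the min-max form of Equation (5) are our assumptions based on the description above) implements these measures:

```python
import numpy as np

def simple_matching(a, b):
    """Eq. (1): proportion of mismatching categorical variables."""
    a, b = np.asarray(a), np.asarray(b)
    return float(np.mean(a != b))

def euclidean(a, b):
    """Eq. (2): Euclidean distance between two numeric vectors."""
    return float(np.sqrt(np.sum((np.asarray(a, float) - np.asarray(b, float)) ** 2)))

def manhattan(a, b):
    """Eq. (3): Manhattan distance between two numeric vectors."""
    return float(np.sum(np.abs(np.asarray(a, float) - np.asarray(b, float))))

def gdf(num_a, num_b, cat_a, cat_b, numeric=manhattan):
    """Eq. (4): GDF for mixed data, assumed additive form of the
    numeric-part distance and simple matching on the categorical part."""
    return numeric(num_a, num_b) + simple_matching(cat_a, cat_b)

def standardize(X, f=1.0):
    """Eq. (5), assumed min-max form: each numeric column of X is
    mapped onto [0, f] before distances are computed."""
    X = np.asarray(X, float)
    lo, hi = X.min(axis=0), X.max(axis=0)
    return f * (X - lo) / np.where(hi > lo, hi - lo, 1.0)
```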
External validation measures
We use Purity and the F-measure to validate our method. Table 1 shows the contingency matrix, namely the crosstab between the members of the true partition of the data set (partition C) and the separation produced by a clustering method (partition P) (Wu, 2012). Let n_{gh} denote the number of objects that belong to cluster P_g and true class C_h, with row totals n_{g.}, column totals n_{.h}, and grand total n. The Purity is

    Purity = \frac{1}{n} \sum_{g=1}^{k} \max_h \, n_{gh}.    (6)

The F-measure combines the precision prec(g, h) = n_{gh} / n_{g.} and the recall rec(g, h) = n_{gh} / n_{.h} of each cluster-class pair,

    F = \sum_{h} \frac{n_{.h}}{n} \max_g \frac{2 \, prec(g, h) \, rec(g, h)}{prec(g, h) + rec(g, h)}.    (7)
Table 1
The contingency matrix

            |          Partition C          |
Partition P | C_1    C_2    ...    C_K      | Sum
P_1         | n_11   n_12   ...    n_1K     | n_1.
P_2         | n_21   n_22   ...    n_2K     | n_2.
...         | ...    ...    ...    ...      | ...
P_k         | n_k1   n_k2   ...    n_kK     | n_k.
Sum         | n_.1   n_.2   ...    n_.K     | n
Meanwhile, the accuracy level is the proportion of objects assigned to the correct class under the best one-to-one matching sigma of clusters to classes,

    accuracy = \frac{1}{n} \sum_{g=1}^{k} n_{g, \sigma(g)}.

We also use the adjusted Rand index (ARI) to evaluate the proposed method. Suppose that C is a clustering result under consideration and P is the actual partition; then the ARI is formulated as follows (Warrens & van der Hoef, 2022),

    ARI = \frac{ \sum_{g,h} \binom{n_{gh}}{2} - t }{ \frac{1}{2} \left[ \sum_g \binom{n_{g.}}{2} + \sum_h \binom{n_{.h}}{2} \right] - t },

where t = \sum_g \binom{n_{g.}}{2} \sum_h \binom{n_{.h}}{2} / \binom{n}{2} and \binom{a}{2} = a(a-1)/2 counts the pairs of objects.
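As a concrete reference, here is a small self-contained Python sketch (our illustration) that computes the contingency matrix of Table 1, the Purity of Equation (6), and the Hubert-Arabie ARI exactly as defined above:

```python
import numpy as np
from math import comb

def contingency(true_labels, cluster_labels):
    """Build the Table 1 crosstab: rows = clusters P, columns = classes C."""
    classes = sorted(set(true_labels))
    clusters = sorted(set(cluster_labels))
    N = np.zeros((len(clusters), len(classes)), dtype=int)
    for c, p in zip(true_labels, cluster_labels):
        N[clusters.index(p), classes.index(c)] += 1
    return N

def purity(N):
    """Eq. (6): fraction of objects in the majority class of their cluster."""
    return N.max(axis=1).sum() / N.sum()

def adjusted_rand_index(N):
    """Hubert-Arabie ARI computed from the pair counts of the crosstab."""
    n = int(N.sum())
    sum_cells = sum(comb(int(x), 2) for x in N.ravel())
    a = sum(comb(int(x), 2) for x in N.sum(axis=1))   # cluster (row) totals
    b = sum(comb(int(x), 2) for x in N.sum(axis=0))   # class (column) totals
    t = a * b / comb(n, 2)                            # expected index term
    return (sum_cells - t) / (0.5 * (a + b) - t)

# Example: six objects, one of which is placed in the wrong cluster.
N = contingency([0, 0, 1, 1, 2, 2], [0, 0, 1, 1, 1, 2])
print(purity(N), adjusted_rand_index(N))   # 0.833..., 0.444...
```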
Proposed Method
The simple and fast k-medoids algorithm selects the initial medoids using a set of ordered values of the indicator

    v_j = \sum_{i=1}^{n} \frac{d_{ij}}{\sum_{l=1}^{n} d_{il}}, \quad j = 1, 2, ..., n,

(Park & Jun, 2009). The initial medoids were chosen as the first k objects in ascending order of v_j. This paper considers the empty groups that may occur in simple and fast k-medoids due to non-unique medoids. We adopt the idea of flexible k-medoids to obtain the initial groups, which combines the standard deviation (variance) and the sum of the p variable values, but we compute these indicators from the distance matrix instead. We call this method the k-medoids algorithm based on distance variance blocks (Var-KM). The details of our proposed algorithm are as follows.
Stage 1 (Determine the initial groups):
(i) Calculate the distance matrix d_{ij} for all pairs of objects, i, j = 1, 2, ..., n.
(ii) For each object i, calculate the standard deviation of its distances,

    s_i = \sqrt{ \frac{1}{n-1} \sum_{j=1}^{n} (d_{ij} - \bar{d}_i)^2 },    (8)

where \bar{d}_i = \frac{1}{n} \sum_{j=1}^{n} d_{ij}, and the sum of its distances, T_i = \sum_{j=1}^{n} d_{ij}.
(iii) Arrange all objects in ascending order, first by the standard deviation of Equation (8) and then by the sum of distances T_i. Objects with equal values of both indicators form one block; identical objects always fall in the same block.
(iv) For the first k distinct blocks, take one representative object from each block as an initial medoid, so that no two initial medoids are identical objects.
(v) Determine the members of the k initial groups by assigning each object to its nearest initial medoid.

Stage 2 (Determine the final groups):
(i) Update the current medoid in each cluster based on the object that minimizes the average distance to the other objects in its group. The average distance within cluster g, which has n_g members, is

    \bar{d}(i, G_g) = \frac{1}{n_g} \sum_{j \in G_g} d_{ij}, \quad i \in G_g,

and the new medoid of G_g is the object i that minimizes \bar{d}(i, G_g).
(ii) Obtain the clusters by assigning each object to the nearest medoid, and calculate the total distance from all objects to their medoids,

    TD = \sum_{g=1}^{k} \sum_{i \in G_g} d(i, m_g),

where m_g is the medoid of group G_g.
(iii) Repeat steps (i) and (ii) until the total distance is stable or the set of medoids does not change, up to a pre-determined number of iterations.
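A minimal NumPy sketch of the two stages, under our reading of the steps above (the blocks of step (iv) are implemented as exact ties of the two indicators, and we assume at least k distinct blocks exist; this is an illustration, not the authors' published code):

```python
import numpy as np

def var_km(D, k, max_iter=20):
    """Sketch of Var-KM. D is the (n, n) symmetric distance matrix."""
    n = D.shape[0]
    s = D.std(axis=1, ddof=1)   # Eq. (8): std. dev. of each object's distances
    T = D.sum(axis=1)           # sum of each object's distances
    order = sorted(range(n), key=lambda i: (s[i], T[i]))
    # Stage 1 (iv): one representative per distinct (s_i, T_i) block, so
    # identical objects (which always share a block) never both become medoids.
    medoids, seen = [], set()
    for i in order:
        block = (round(float(s[i]), 6), round(float(T[i]), 6))
        if block not in seen:
            seen.add(block)
            medoids.append(i)
        if len(medoids) == k:
            break
    labels = np.argmin(D[:, medoids], axis=1)   # Stage 1 (v)
    # Stage 2: alternate medoid update and nearest-medoid assignment.
    for _ in range(max_iter):
        new_medoids = []
        for g in range(k):
            members = np.where(labels == g)[0]
            within = D[np.ix_(members, members)].mean(axis=1)
            new_medoids.append(int(members[np.argmin(within)]))
        new_labels = np.argmin(D[:, new_medoids], axis=1)
        if new_medoids == medoids:              # medoid set stable: stop
            break
        medoids, labels = new_medoids, new_labels
    return medoids, labels
```

Because each medoid is at distance zero from itself and no two medoids are identical objects, every group keeps at least its own medoid, so no group can be empty. Applied to the distance matrix of the artificial data in Table 2 below with k = 3, this sketch selects D, C, and B as initial medoids and should reproduce the grouping of Table 3.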
The novelty of Var-KM (the proposed method) lies in the step for determining the initial groups. We use different indicators to search for the initial medoids, namely steps (ii), (iii), and (iv) of the first stage. We combine the second and third phases of Fast-KM into one stage. In addition, we add a rule for obtaining the final groups: a pre-determined number of iterations is run until the total distance is stable or the set of medoids no longer changes. These processes ensure that there are no empty groups and that identical objects, or non-unique medoids, are in the same group, both in the initial and in the final grouping. The source used to calculate the indicators in Var-KM also differs from the flexible k-medoids algorithm: the flexible k-medoids calculates the standard deviation directly from the original or standardized data, whereas Var-KM calculates it from the distance matrix. The second phase of Var-KM likewise differs from the flexible k-medoids. The second phase of Flexible-KM uses B random replications to obtain the final medoids, which means that different random runs could produce different group memberships.
An illustrative example of k-medoids based on distance variance blocks
Suppose we have fourteen objects observed on three numerical variables, as in Table 2.
Table 2
Artificial small data

Variables |  A |  B |  C |  D |  E |  F |  G |  H |  I |  J |   K |  L |  M |  N
X         | 35 | 65 | 70 | 45 | 45 | 20 | 25 | 10 | 30 | 15 | 100 | 90 | 85 | 96
Y         | 10 | 15 | 40 | 45 | 45 | 70 | 80 | 65 | 70 | 70 |  80 | 85 | 90 | 92
Z         | 15 | 20 | 15 | 20 | 20 | 50 | 45 | 50 | 55 | 55 |  20 | 25 | 35 | 30
From Table 2, we get the Euclidean distance matrix for all pairs of objects. Note that objects D and E are identical in all three variables. We cluster the data set into three groups. If we implement the fast k-medoids algorithm based on the indicator v_j, the initial medoids are D, E, and I, marked with asterisks in column (2) of Table 3; since D and E are identical objects, the medoids are non-unique, and one initial group ends up empty. The short script below illustrates how the indicator columns of Table 3 are computed.
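This snippet (our illustration, assuming the sample standard deviation with denominator n - 1 over each object's full distance row, including the zero self-distance) recomputes the Fast-KM indicator v_j and the two Var-KM indicators for the data of Table 2; for object D it reproduces the sum 598.550 and standard deviation 21.741 reported in Table 3.

```python
import numpy as np

# Table 2: fourteen objects (A..N) on variables X, Y, Z.
data = np.array([
    [35, 10, 15], [65, 15, 20], [70, 40, 15], [45, 45, 20], [45, 45, 20],
    [20, 70, 50], [25, 80, 45], [10, 65, 50], [30, 70, 55], [15, 70, 55],
    [100, 80, 20], [90, 85, 25], [85, 90, 35], [96, 92, 30]], dtype=float)
names = list("ABCDEFGHIJKLMN")

# Euclidean distance matrix; objects D and E are identical, so their
# rows and columns of D coincide.
D = np.sqrt(((data[:, None, :] - data[None, :, :]) ** 2).sum(axis=-1))

v = (D / D.sum(axis=1, keepdims=True)).sum(axis=0)   # Fast-KM indicator v_j
s = D.std(axis=1, ddof=1)                            # Var-KM: std. dev. of distances
T = D.sum(axis=1)                                    # Var-KM: sum of distances

for i in np.argsort(v)[:3]:            # Fast-KM medoids; D and E tie on v_j
    print(names[i], round(v[i], 3))
print(names[3], round(s[3], 3), round(T[3], 3))   # D: 21.741  598.55
```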
Table 3. Artificial small data: an example of the proposed method

        Fast-KM        |                        Proposed method
      first phase      |            first phase                |  second phase
Objects |    v_j       | Objects | Std. dev.   | Sum of   | Initial |    Iteration (I)
ordered |              | ordered | of distance | distance | groups  | I-1 | I-2 | I-3**
  (1)   |     (2)      |   (3)   |     (4)     |   (5)    |   (6)   | (7) | (8) | (9)
   D    |   0.798*     |    D    |   21.741    | 598.550* |   G1*   | G3  | G3* | G3*
   E    |   0.798*     |    E    |   21.741    | 598.550  |   G1    | G3  | G3  | G3
   I    |   0.859*     |    C    |   22.102    | 687.943* |   G2*   | G3  | G3  | G3
   F    |   0.879      |    B    |   27.559    | 833.655* |   G3*   | G3  | G3  | G3
   G    |   0.888      |    I    |   28.140    | 647.216  |   G1    | G1  | G1  | G1
   C    |   0.946      |    G    |   28.764    | 668.067  |   G1    | G1  | G1  | G1
   J    |   0.961      |    M    |   29.644    | 752.537  |   G2    | G2  | G2  | G2
   H    |   0.989      |    A    |   30.259    | 904.147  |   G3    | G3* | G3  | G3
   M    |   1.055      |    L    |   31.743    | 755.188  |   G2    | G2* | G2* | G2*
   L    |   1.064      |    F    |   32.307    | 663.725  |   G1    | G1* | G1* | G1*
   B    |   1.157      |    K    |   33.399    | 834.429  |   G2    | G2  | G2  | G2
   N    |   1.178      |    H    |   34.229    | 742.162  |   G1    | G1  | G1  | G1
   K    |   1.179      |    J    |   34.491    | 721.956  |   G1    | G1  | G1  | G1
   A    |   1.249      |    N    |   34.574    | 834.943  |   G2    | G2  | G2  | G2
        Total deviation of groups                         |  489.3  | 227.1 | 175.4 | 175.4

* Medoids   ** Final groups
Meanwhile, in the first phase of Var-KM (the proposed method), objects D, C, and B are the initial medoids, marked with asterisks in column (5) of Table 3. These medoids produce the initial groups shown in column (6) of Table 3. We can see that the identical objects, i.e., D and E, are in the same group. The second phase of the proposed method then finds that the total deviation of the third iteration equals that of the second iteration, as shown in columns (8) and (9). Even though the group members of iteration one are the same as in iterations two and three, the total deviation of iteration one is greater than that of iteration two, so we conclude that the final groups are those of iteration two (equal to iteration three). The profiles of the initial and final groups of the proposed method are shown in Figures 1a and 1b.
Figure 1a. The plot of the initial groups
Figure 1b. The plot of the final groups
Real datasets
Six data sets from the University of California, Irvine repository, namely breast cancer, ionosphere, primary tumour, soybean small, zoo, and heart disease case 2, are used to check the effectiveness of our proposed method (Lichman, 2022). The profiles of the six data sets are given in Table 4.
Table 4
Profile of the real datasets

Data set                      | Objects | Numerical vars | Categorical vars | Classes | Type
1. Breast cancer              |   683   |       9        |        -         |    2    | Numerical
2. Ionosphere                 |   351   |      33        |        -         |    2    | Numerical
3. Primary tumour             |    65   |       -        |       17         |    4    | Categorical
4. Soybean small              |    47   |       -        |       35         |    4    | Categorical
5. Zoo                        |   101   |       1        |       15         |    7    | Mixed
6. Heart disease (HD) case 2  |   303   |       5        |        8         |    2    | Mixed
RESULTS AND DISCUSSION
From the Fast-KM, the Var-KM (proposed), and other methods, we obtained the comparison of the adjusted Rand index (ARI) and clustering accuracy shown in Table 5.
The breast cancer dataset involves nine numerical variables computed from a digitized image describing the cell nuclei characteristics. These data are classified into two classes and consist of 683 non-missing cases. We apply Equation (5) to all features with f = 5 and use the Euclidean distance. The fast k-medoids algorithm fails to group these data into two classes because the medoids are non-unique objects. Breast cancer data contain many identical objects, so a random initialization may also produce empty groups. The k-medoids algorithm based on distance variance blocks (Var-KM) handles this well. The adjusted Rand index and accuracy of Var-KM (the proposed method) are comparable with the methods of Ji et al. (2020) and Ji et al. (2021).
The ionosphere data contain 351 objects with 33 numerical features assigned to two groups. The ARI and accuracy of the fast k-medoids algorithm are similar to those of the proposed method.
Table 5
The comparison of ARI and accuracy from the Var-KM, Fast-KM, and others

                  |       Adjusted Rand index              |     Clustering accuracy (in per cent)
Datasets          | Var   | Fast  | Others                 | Var  | Fast | Others
1 Breast cancer   | 0.869 | Fail  | 0.891a                 | 96.6 | Fail | 97.2a, 95.8d, 95.6e
2 Ionosphere      | 0.173 | 0.146 | 0.292a, 0.173c         | 70.9 | 69.2 | 78.4a
3 Primary tumour  | 0.181 | Fail  | 0.336a                 | 50.8 | Fail | 68.2a
4 Soybean small   | 1.000 | 0.498 | 1.000a,b,c             | 100  | 80.9 | 100a,b, 97.8f, 98.5d
5 Zoo             | 0.963 | Fail  | 0.967a, 0.901d, 0.939e | 92.1 | Fail | 96a, 82.2b, 88.8d, 89.9e
6 HD case 2       | 0.360 | 0.16  | 0.430a                 | 80.2 | 70.3 | 84.2a, 81.2d, 81.0e

a (Kariyam et al., 2022), b (Budiaji & Leisch, 2019), c (Yu et al., 2018), d (Ji et al., 2021), e (Ji et al., 2020), f (Yuan et al., 2020)
As with the breast cancer dataset, the Fast-KM fails to cluster the primary tumour data into four groups. The Fast-KM algorithm produces two empty initial groups because the second, third, and fourth medoids are identical objects. Even though the ARI and accuracy of Var-KM are not very high, this method avoids the empty groups of Fast-KM. The soybean small dataset has 35 variables; three of them have an ordinal tendency, namely the precip, temperature, and germination features. These data consist of 47 items assigned to four classes. We use the simple matching distance for the categorical data and apply Equation (5) with f = 1 to the ordinal data before computing the Manhattan distance. The adjusted Rand index and accuracy of the proposed method are a perfect one and one hundred per cent, respectively. This achievement equals that of the flexible k-medoids (Kariyam et al., 2022) and the simple k-medoids (Budiaji & Leisch, 2019). In comparison, the Fast-KM produces an accuracy of 80.9% with an ARI of 0.498, and the clustering accuracies of Ji et al. (2021) and Yuan et al. (2020) are lower than that of the proposed method.
The zoo dataset has 101 animals assigned to seven clusters, comprising 15 binary attributes and one numerical attribute. We use Equation (5) with factor f = 1 for the numerical data before applying the Manhattan distance. The zoo dataset has 25 blocks of objects with the same standard deviation, and 30 blocks with both the same standard deviation and the same sum over all variables. If the Fast-KM method is used, two initial clusters are empty because the group medoids are non-unique. The Var-KM produces high values of the adjusted Rand index and accuracy, comparable to other methods. For the last real data set, the heart disease case 2 data, the ARI and accuracy of the proposed method are higher than those of the fast k-medoids. The heart disease case 2 data comprise 303 patients assigned to two clusters. These data are mixed variables, with three binary, five categorical, and five numerical attributes. Before implementing the Manhattan distance, we applied Equation (5) with factor f = 5 to the numerical data; for the others, we use simple matching.
Based on Table 5, we conclude that the k-medoids algorithm based on distance variance blocks can avoid the empty groups that may occur in the Fast-KM. The adjusted Rand index and clustering accuracy of Var-KM on the six real datasets are generally comparable to those of other methods.
To balance the evaluation of our proposed method, we calculate Purity and the F-measure for the six real datasets, as shown in Table 6. According to Table 6, the k-medoids algorithm based on distance variance blocks produces Purity and F-measure values above 0.7 for the breast cancer, ionosphere, soybean small, and heart disease case 2 datasets. Even though the Purity and F-measure for the primary tumour and zoo datasets are unsatisfactory, the new method still partitions them into the required number of groups with no empty classes.
Table 6
External validation for the proposed method

Datasets                | Purity | F-measure
1 Breast cancer         | 0.966  | 0.966
2 Ionosphere            | 0.709  | 0.704
3 Primary tumour        | 0.508  | 0.302
4 Soybean small         | 1.000  | 1.000
5 Zoo                   | 0.581  | 0.584
6 Heart disease case 2  | 0.802  | 0.802
CONCLUSION
Several approaches to avoiding empty groups in the k-medoids algorithm have been developed. This paper proposed an alternative method that guarantees that there are no empty groups and that identical objects, as non-unique medoids, stay in the same cluster, both in the initial and in the final groups. The k-medoids algorithm based on distance variance blocks (the proposed method) can handle the clustering of datasets containing identical objects as non-unique medoids. The new algorithm applies to any type of data, i.e., categorical, numerical, or mixed data. Evaluation of the new method on six real datasets, i.e., breast cancer, ionosphere, primary tumour, soybean small, zoo, and heart disease case 2, produces adjusted Rand index and clustering accuracy values comparable with those of other methods. The Var-KM algorithm (new method) enriches the references on partition-based clustering of data sets.
REFERENCES
Behzadi, S., Müller, N.S., Plant, C., and Böhm, C. (2020). Clustering of mixed-type data considering concept hierarchies: problem specification and algorithm. International Journal of Data Science and Analytics, 10: 233-248.
Budiaji, W., and Leisch, F. (2019). Simple k-medoids partitioning algorithm for mixed variable data. Algorithms, 12(177): 1-15.
Dinata, R.K., Retno, S., and Hasdyna, N. (2021). Minimization of the Number of Iterations in K-Medoids Clustering with Purity Algorithm. Revue d'Intelligence Artificielle, 35(3): 193-199.
Everitt, B.S., Landau, S., Leese, M., and Stahl, D. (2011). Cluster Analysis, 5th edn. John Wiley & Sons, Ltd. ISBN: 978-0-470-97844-3. pp. 49-50.
Jajuga, K. (2000). Standardization of Data Set under Different Measurement Scales.
Ji, J., Li, R., Pang, W., He, F., Feng, G., and Zhao, X. (2021). A Multi-View Clustering Algorithm for Mixed Numeric and Categorical Data. IEEE Access, 10: 24913-24924.
Ji, J., Pang, W., Li, Z., He, F., Feng, G., and Zhao, X. (2020). Clustering mixed numeric and categorical data with cuckoo search. IEEE Access, 8: 30988-31003.
Johnson, R.A., and Wichern, D.W. (2009). Applied Multivariate Statistical Analysis. John Wiley & Sons. ISBN: 0-13-187715-1. pp. 680-695.
Kariyam, Abdurakhman, Subanar, and Herni, U. (2022). The Initialization of Flexible K-Medoids Partitioning Methods Using a Combination of Deviation and Sum of Variable Values. Mathematics and Statistics, 10(5): 895-908.
Kaufman, L., and Rousseeuw, P.J. (1990). Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley & Sons, Inc., New York. ISBN: 0-471-73578-7. pp. 28-38, 68-71, 102-104.
Lichman, M. (2021-2022). UCI Machine Learning Repository. University of California, Irvine, CA, USA. Accessed on Nov. 10, 2021, and Jun. 15, 2022.
Nitesh, S., Chawda, B., and Vasant, A. (2022). An improved K-medoids clustering approach based on the crow search algorithm. Journal of Computational Mathematics and Data Science, 3: 100034. https://doi.org/10.1016/j.jcmds.2022.100034
Oesting, M., and Schnurr, A. (2020). Ordinal patterns in clusters of subsequent extremes of regularly varying time series. Applied Intelligence, 50: 1498-1509.
Park, H.S., and Jun, C.H. (2009). A Simple and Fast Algorithm for K-Medoids Clustering. Expert Systems with Applications, 36(2): 3336-3341.
Schubert, E., and Rousseeuw, P.J. (2021). Fast and eager k-medoids clustering: O(k) runtime improvement of the PAM, CLARA, and CLARANS algorithms. Information Systems, 101: 101804.
Selosse, M., Jacques, J., and Biernacki, C. (2019). Model-based co-clustering for mixed type data. Manuscript version under the Elsevier user license.
Warrens, M.J., and van der Hoef, H. (2022). Understanding the Adjusted Rand Index and Other Partition Comparison Indices Based on Counting Object Pairs. Journal of Classification. https://doi.org/10.1007/s00357-022-09413-z
Wu, J. (2012). Advances in K-means Clustering: A Data Mining Thinking. Springer-Verlag, Berlin Heidelberg.
Xu, R., and Wunsch, D.C. II. (2009). Clustering. John Wiley & Sons, Inc., Hoboken, New Jersey. ISBN: 978-0-470-27680-8. pp. 23-24.
Yu, D., Liu, G., Guo, M., and Liu, X. (2018). An improved k-medoids algorithm based on step increasing and optimizing medoids. Expert Systems with Applications, 92: 464-473.
Yuan, F., Yang, Y., and Yuan, T. (2020). A dissimilarity measure for mixed nominal and ordinal attribute data in k-Modes algorithm. Applied Intelligence, 50: 1498-1509.
Zorn, C. (2003). Agglomerative Clustering of Rankings Data, with Application to Prison Rodeo Events. Department of Political Science, Emory University, Atlanta, GA 30322.