K-means Clustering and its use case in the security domain

ARINDAM
3 min read · Jul 19, 2021

What is Clustering?

Clustering is the task of dividing a population of data points into groups such that points in the same group are more similar to each other than to points in other groups.

What is k-means Clustering?

K-means clustering is a method of separating data points into several groups, or “clusters,” each characterized by its midpoint, which we call a centroid. The goal of the k-means algorithm is to find groups in the data, with the number of groups represented by the variable k. Running k-means on a dataset produces:

  • k centroids: one for each of the k clusters identified in the dataset.
  • Labels for the complete dataset, so that each data point is assigned to exactly one cluster.

K-means clustering can be used with various distance measures, such as:

  • Euclidean distance measure
  • Manhattan distance measure
  • Squared Euclidean distance measure
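
To make the three measures concrete, here is a minimal sketch in plain Python. The function names and the sample points `p` and `q` are illustrative choices, not part of any particular library.

```python
import math

def euclidean(a, b):
    # Straight-line distance between two points.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    # Sum of absolute differences along each axis.
    return sum(abs(x - y) for x, y in zip(a, b))

def squared_euclidean(a, b):
    # Euclidean distance without the square root.
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Illustrative points only.
p, q = (1.0, 2.0), (4.0, 6.0)
print(euclidean(p, q))          # 5.0
print(manhattan(p, q))          # 7.0
print(squared_euclidean(p, q))  # 25.0
```

Squared Euclidean distance ranks neighbors the same way as Euclidean distance, so it is often preferred when only the closest centroid matters, because it skips the square root.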

Here’s how it works:

  1. Select K, the number of clusters you want to identify. Let’s select K=3.
  2. Randomly generate K (three) new points on your chart. These will be the centroids of the initial clusters.
  3. Measure the distance between each data point and each centroid and assign each data point to its closest centroid and the corresponding cluster.
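  4. Recalculate the midpoint (centroid) of each cluster.
  5. Repeat steps 3 and 4 until the assignments stop changing or a maximum number of iterations is reached.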
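
The steps above can be sketched in plain Python as follows. This is a minimal illustration under a few assumptions, not a production implementation: the initial centroids are sampled from the data points themselves (one common way to do the random start), the distance is squared Euclidean, and the assignment and update steps repeat until the labels stop changing. The names `kmeans`, `squared_distance`, `mean_point` and the sample `points` are made up for this example.

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_point(points):
    # Coordinate-wise average of a non-empty list of points.
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def kmeans(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    # Step 2 (assumption): use k randomly chosen data points as initial centroids.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Step 3: assign every point to its closest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centroids will not move
        labels = new_labels
        # Step 4: move each centroid to the mean of its cluster.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = mean_point(members)
    return centroids, labels

# Illustrative data: three loose groups of 2-D points.
points = [
    (1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
    (8.0, 8.0), (8.5, 9.0), (9.0, 8.0),
    (1.0, 9.0), (1.5, 8.5), (2.0, 9.0),
]
centroids, labels = kmeans(points, k=3)
print(centroids)
print(labels)
```

Because the starting centroids are random, different seeds can give different clusterings; running the algorithm several times and keeping the best result is a common safeguard.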

How to select the number of clusters (K)?

The most common way to do this is by using an elbow plot, which charts the within-cluster variance against the number of clusters. The goal of the elbow method is to find the bend in the curve, or the “elbow.” After this point, additional clusters do not reduce the within-cluster variance enough to justify additional groupings in the dataset. There is no hard and fast rule here, as it’s often up to the discretion of the data scientist, but looking at an elbow plot tends to be a good place to start.
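
As a rough sketch of how the numbers behind an elbow plot could be produced, the snippet below computes the within-cluster sum of squares (WCSS) for a range of K values. It assumes a basic k-means with a few random restarts, keeping the lowest WCSS found for each K; the helper names and the sample `points` are illustrative.

```python
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_wcss(points, k, restarts=10, max_iter=50, seed=0):
    """Run a basic k-means several times and return the lowest WCSS found."""
    rng = random.Random(seed)
    best = float("inf")
    for _restart in range(restarts):
        centroids = rng.sample(points, k)  # assumption: start from data points
        for _step in range(max_iter):
            # Assign each point to its closest centroid.
            clusters = [[] for _ in range(k)]
            for p in points:
                j = min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                clusters[j].append(p)
            # Recompute centroids; an empty cluster keeps its old centroid.
            updated = [
                tuple(sum(v) / len(members) for v in zip(*members))
                if members else centroids[j]
                for j, members in enumerate(clusters)
            ]
            if updated == centroids:
                break
            centroids = updated
        wcss = sum(
            sq_dist(p, centroids[j])
            for j, members in enumerate(clusters)
            for p in members
        )
        best = min(best, wcss)
    return best

# Illustrative data: three loose groups of 2-D points.
points = [
    (1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
    (8.0, 8.0), (8.5, 9.0), (9.0, 8.0),
    (1.0, 9.0), (1.5, 8.5), (2.0, 9.0),
]
for k in range(1, 7):
    print(k, round(kmeans_wcss(points, k), 2))
```

Plotting these values against K and looking for the point where the curve flattens gives the elbow; for data like the sample above, one would expect the drop to level off around K=3.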

USE CASE

Intrusion Detection and Attack Classification Using K-Means Algorithm

An intrusion detection system mainly distinguishes normal behavior from abnormal behavior and then takes corresponding measures. With a labeled data set, simple data preprocessing and system auditing are enough to feed the data into such a system, but this approach only works for simple behavior analysis, and its premise is that the difference between abnormal and normal data is already known. A clustering algorithm does not need to know in advance which records are normal and which are abnormal: it groups the data, summarizes what the records in each group have in common, and only then makes the distinction.

The k-means algorithm first fixes its input parameter K, then divides the n records in the sample data into K clusters so that records within the same cluster are highly similar and records in different clusters are less similar. The center of each cluster is the average of the records in that group.
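
Once clusters exist and each one has been given a meaning, a new record can be classified by its nearest centroid. The snippet below is a minimal sketch of that idea only. The centroid values, the two-feature layout, and the labels `normal` and `attack` are hypothetical and chosen purely for illustration; in practice the centroids would come from running k-means on preprocessed traffic data.

```python
import math

# Hypothetical centroids over two scaled features (illustrative values only).
labeled_centroids = {
    "normal": (0.2, 0.3),
    "attack": (0.9, 0.8),
}

def classify(record, centroids):
    """Return the label of the centroid closest to the record."""
    return min(centroids, key=lambda label: math.dist(record, centroids[label]))

print(classify((0.25, 0.35), labeled_centroids))  # normal
print(classify((0.85, 0.90), labeled_centroids))  # attack
```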

Once each cluster has been labeled as normal or attack, every classified record falls into one of the 4 classes of the confusion matrix, which are:

  • true positive
  • true negative
  • false positive
  • false negative

Data mining has a complete set of evaluation methods; the main indices adopted here are the detection rate and the false positive rate. The detection rate is the share of actual attacks that are detected, TP / (TP + FN), and the false positive rate is the share of normal records wrongly flagged as attacks, FP / (FP + TN). The higher the detection rate and the lower the false positive rate, the better the detection performance of the data mining algorithm.
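
The two evaluation indices can be computed directly from the confusion matrix counts, using the standard definitions: detection rate = TP / (TP + FN) and false positive rate = FP / (FP + TN). The sketch below shows one way to do it; the `actual` and `predicted` lists are made-up example data, not results from a real system.

```python
def confusion_counts(actual, predicted, positive="attack"):
    """Count true/false positives and negatives for a two-class problem."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive:
            if a == positive:
                tp += 1
            else:
                fp += 1
        else:
            if a == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn

def detection_rate(tp, fn):
    # Share of actual attacks that were flagged.
    return tp / (tp + fn)

def false_positive_rate(fp, tn):
    # Share of normal records wrongly flagged as attacks.
    return fp / (fp + tn)

# Made-up labels for illustration.
actual    = ["attack", "attack", "normal", "normal", "normal", "attack", "normal", "normal"]
predicted = ["attack", "normal", "normal", "attack", "normal", "attack", "normal", "normal"]

tp, fp, tn, fn = confusion_counts(actual, predicted)
print(tp, fp, tn, fn)                          # 2 1 4 1
print(f"{detection_rate(tp, fn):.2f}")         # 0.67
print(f"{false_positive_rate(fp, tn):.2f}")    # 0.20
```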
