K-Means Cluster Analysis
History of K-Means Clustering
The term "k-means" was first used by James MacQueen in 1967, although the idea goes back to Hugo Steinhaus in 1956. Stuart Lloyd of Bell Labs devised the standard algorithm in 1957 as a technique for pulse-code modulation, but it was not published as a journal article until 1982. In 1965, Edward W. Forgy published essentially the same method, which is why it is also referred to as the Lloyd-Forgy algorithm.
Figure: K-means cluster analysis (image created with OriginPro).
Definition of Clustering
Clustering is a collection of techniques for identifying subgroups of observations within a data set. When clustering observations, we want observations in the same group to be similar and observations in different groups to be dissimilar. Because there is no response variable, clustering is an unsupervised approach: it looks for structure among the n observations rather than being trained to predict a known outcome. Clustering helps us determine which observations are alike and, where appropriate, label them accordingly. K-means clustering is one of the simplest and most widely used methods for partitioning a data set into k groups, where k is chosen in advance and each observation is assigned to the group with the nearest mean (centroid).
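To make the partitioning idea concrete, here is a minimal sketch of the standard (Lloyd) iteration in plain Python. It is an illustration under stated assumptions rather than a reference implementation: the function names `squared_distance` and `kmeans`, the `seed` and `max_iter` parameters, and the toy data are all hypothetical, distances are Euclidean, and the starting centroids are simply k randomly chosen observations.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, max_iter=100, seed=0):
    """Partition `points` into k groups with Lloyd's iteration.

    Returns (centroids, labels), where labels[i] is the index of the
    centroid nearest to points[i].
    """
    rng = random.Random(seed)
    # Assumption: start from k distinct observations chosen at random.
    centroids = [tuple(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each observation joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the iteration has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, labels


# Hypothetical toy data: two well-separated groups in two dimensions.
data = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
centroids, labels = kmeans(data, k=2)
print(labels)
print(centroids)
```

On this toy data the first three observations end up sharing one label and the last three the other. Which group receives label 0 depends on the random starting centroids, and on less clearly separated data different starts can lead to different partitions, so the procedure is often run several times.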
Applications in Biological Sciences
The k-means algorithm is widely used across the biological sciences, including population genetics, gene expression analysis, protein structure and function, ecological community analysis, drug discovery and pharmacogenomics, metagenomics and microbiome analysis, and biomedical imaging.
Population Genetics:
In population genetics, K-means clustering may be used to identify genetic clusters or subpopulations within a species using genetic marker data (for example, microsatellites and SNPs). It is useful for researching population structure, genetic diversity, and admixture patterns in natural populations.
Gene Expression Analysis:
K-means clustering is used in transcriptomics and gene expression research to group genes or samples by their expression patterns across experimental conditions or tissues. It helps in discovering co-regulated genes, functional modules, and expression profiles associated with particular biological processes or diseases.
Protein Structure and Function:
In structural biology and proteomics, K-means clustering can be used to study protein structure and function based on amino acid composition, physicochemical properties, or structural motifs. It aids in grouping proteins into functional categories, predicting protein function, and understanding protein-protein interactions.
Ecological Community Analysis:
In ecology, K-means clustering is used to evaluate species abundance data to discover ecological communities or groups of species with similar abundance patterns. It is useful for investigating community structure, species diversity, and ecological interactions within ecosystems.
Drug Discovery and Pharmacogenomics:
In pharmacology and drug development, K-means clustering is used to analyze high-dimensional drug response data, genetic profiles, or the chemical properties of compounds. It aids in identifying drug response subtypes, predicting drug efficacy, and discovering new pharmacological targets or biomarkers.
Metagenomics and Microbiome Analysis:
In metagenomics and microbiome analysis, K-means clustering is used to examine microbial community composition data obtained by sequencing. It aids in identifying groups of related microbial taxa, characterizing community structure, and investigating microbial diversity and ecological relationships.
Biomedical Imaging:
In medical imaging and bioimaging, K-means clustering can be used to segment and categorize biological structures or regions of interest. It aids in image segmentation, feature extraction, and pattern recognition for applications such as tumour detection, cell segmentation, and organ localization.
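As a rough illustration of intensity-based segmentation, the sketch below clusters the pixel values of a tiny grayscale image into k groups and returns a label mask. Everything here is an assumption made for the example: the function name `segment_by_intensity`, the evenly spaced starting centroids, and the toy image are hypothetical, and real imaging pipelines usually work on much richer features than raw intensity.

```python
def segment_by_intensity(image, k=2, max_iter=50):
    """Label each pixel of a grayscale image (a list of rows) with a cluster index.

    The starting centroids are sorted by intensity, so cluster 0 ends up
    as the darkest group.
    """
    pixels = [v for row in image for v in row]
    lo, hi = min(pixels), max(pixels)
    # Assumption: deterministic start, k centroids spread evenly over the range.
    centroids = [lo + (hi - lo) * (j + 0.5) / k for j in range(k)]
    for _ in range(max_iter):
        # Assignment step: nearest centroid by absolute intensity difference.
        labels = [min(range(k), key=lambda j: abs(v - centroids[j])) for v in pixels]
        # Update step: mean intensity of each cluster (empty clusters stay put).
        updated = list(centroids)
        for j in range(k):
            members = [v for v, lab in zip(pixels, labels) if lab == j]
            if members:
                updated[j] = sum(members) / len(members)
        if updated == centroids:
            break  # centroids stopped moving
        centroids = updated
    # Reshape the flat label list back into image rows.
    width = len(image[0])
    mask = [labels[r * width:(r + 1) * width] for r in range(len(image))]
    return mask, centroids


# Hypothetical 3x4 image: a dark background with a bright region on the right.
image = [
    [10, 12, 11, 200],
    [9, 13, 210, 205],
    [11, 198, 202, 207],
]
mask, centroids = segment_by_intensity(image, k=2)
for row in mask:
    print(row)
```

In the resulting mask, label 1 marks the bright region and label 0 the dark background. A fixed, evenly spaced start is used here only to keep the example reproducible.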