Clustering with Missing Features: A Penalized Dissimilarity Measure based approach

Many real-world clustering problems involve incomplete data, in which features are missing or absent for some or all of the data instances. Traditional clustering methods cannot be applied directly to such data without preprocessing by imputation or marginalization. In this article, we put forth the concept of Penalized Dissimilarity Measures, which estimate the actual distance between two data points (the distance between them if they were fully observed) by adding a penalty to the distance computed over the observed features common to both instances. We then propose one such dissimilarity measure, the Feature Weighted Penalty based Dissimilarity (FWPD) measure. Using the proposed measure, we modify the traditional k-means clustering algorithm and the standard hierarchical agglomerative clustering techniques so that they are directly applicable to datasets with missing features. We present time complexity analyses for these new techniques, along with a detailed analysis showing that the new FWPD-based k-means algorithm converges to a local optimum within a finite number of iterations. We have also conducted extensive experiments on various benchmark datasets, showing that the proposed clustering techniques generally yield better results than some of the popular imputation methods commonly used to handle such incomplete data. Finally, we append a possible extension of the proposed dissimilarity measure to the case of absent features (where the unobserved features are known to be non-existent).
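To make the idea of a penalized dissimilarity concrete, here is a minimal Python sketch. It is not taken from the abstract, which gives no formula. It assumes that missing values are coded as NaN, that the distance over commonly observed features is Euclidean and scaled by its largest pairwise value, that each feature's weight is the number of instances in which it is observed, and that distance and penalty are mixed by a hypothetical parameter alpha. The function name penalized_dissimilarity is likewise illustrative; the paper itself should be consulted for the exact FWPD definition.

```python
import numpy as np


def penalized_dissimilarity(X, alpha=0.25):
    """Pairwise penalized dissimilarity for data with NaN-coded missing features.

    Illustrative only: the weighting scheme and the mixing parameter `alpha`
    are assumptions, not the definition given in the paper.
    """
    X = np.asarray(X, dtype=float)
    observed = ~np.isnan(X)  # True where a feature value is present
    # Assumed feature weight: number of instances in which the feature is observed.
    weights = observed.sum(axis=0).astype(float)
    n = X.shape[0]
    dist = np.zeros((n, n))
    penalty = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            common = observed[i] & observed[j]  # features observed in both points
            diff = X[i, common] - X[j, common]
            dist[i, j] = np.sqrt(np.sum(diff ** 2))
            # Penalty: weighted share of features NOT observed in both points.
            penalty[i, j] = weights[~common].sum() / weights.sum()
    d_max = dist.max()
    if d_max > 0:
        dist = dist / d_max  # scale distances into [0, 1]
    return (1 - alpha) * dist + alpha * penalty


# Usage: the third point is missing its first feature.
X = [[0.0, 0.0], [3.0, 4.0], [np.nan, 0.0]]
D = penalized_dissimilarity(X, alpha=0.25)
print(D)
```

Under these assumptions, a pair of points that share only some observed features receives a nonzero penalty even when its observed-feature distance is zero, so an incomplete point is not treated as identical to a complete point it merely happens to match on the shared features.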
by Shounak Datta, Supritam Bhattacharjee, Swagatam Das
https://arxiv.org/pdf/1604.06602v6.pdf