Order-Sensitive Imputation for Clustered Missing Values

Qian Ma, Yu Gu, Wang-chien Lee, Ge Yu

Research output: Contribution to journal › Article

Abstract

Missing values (MVs) appear widely in real-world datasets and hinder the use of many statistical and machine learning algorithms for data analytics, because those algorithms cannot handle incomplete datasets. To address this issue, several MV imputation algorithms have been developed. However, these approaches do not perform well when most of the incomplete tuples are clustered with each other, a situation termed here the Clustered Missing Values Phenomenon, because too few complete tuples lie near an MV to support its imputation. In this paper, we propose the Order-Sensitive Imputation for Clustered Missing values (OSICM) framework, in which missing values are imputed sequentially so that values filled earlier in the process are also used for the later imputation of other MVs. The order of imputation is therefore critical to both the effectiveness and the efficiency of the OSICM framework. We formulate the search for the optimal imputation order as an optimization problem and show that it is NP-hard. Furthermore, we devise an algorithm that finds the exact optimal solution and propose two approximate/heuristic algorithms that trade effectiveness for efficiency. Finally, we conduct extensive experiments on real and synthetic datasets to demonstrate the superiority of the OSICM framework.
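The sequential idea in the abstract can be illustrated with a small sketch. This is not the OSICM algorithm from the paper: the abstract does not give the objective function or the exact and heuristic ordering algorithms, so the nearest-neighbor mean imputer, the greedy ordering rule, and the names `distance` and `impute_sequentially` below are all assumptions made for illustration. The sketch only shows how a tuple imputed earlier can serve as a neighbor for a tuple imputed later.

```python
import math


def distance(a, b):
    """Root-mean-square difference over attributes observed in both tuples."""
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return math.inf
    return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))


def impute_sequentially(rows, k=2):
    """Fill None cells one tuple at a time, reusing earlier fills.

    Assumed greedy order (hypothetical, not from the paper): at each step,
    pick the incomplete tuple whose k nearest complete tuples are closest
    on average, impute it, then treat it as complete.
    """
    data = [list(r) for r in rows]
    complete = [i for i, r in enumerate(data) if None not in r]
    pending = [i for i, r in enumerate(data) if None in r]
    if pending and not complete:
        raise ValueError("at least one complete tuple is required")

    order = []
    while pending:
        best = None
        for i in pending:
            neighbors = sorted(
                complete, key=lambda j: distance(data[i], data[j])
            )[:k]
            score = sum(distance(data[i], data[j]) for j in neighbors) / len(neighbors)
            if best is None or score < best[0]:
                best = (score, i, neighbors)
        _, i, neighbors = best
        # Impute each missing cell with the mean of the chosen neighbors.
        for col, value in enumerate(data[i]):
            if value is None:
                data[i][col] = sum(data[j][col] for j in neighbors) / len(neighbors)
        pending.remove(i)
        complete.append(i)  # the filled tuple can now help later imputations
        order.append(i)
    return data, order


rows = [[1.0, 1.0], [2.0, 2.0], [2.5, None], [3.0, None]]
filled, order = impute_sequentially(rows, k=2)
print(order)   # [2, 3]
print(filled)  # row 3 is imputed from rows 2 (itself imputed) and 1
```

In this toy input, row 2 is filled first (value 1.5), and that filled value then takes part in imputing row 3 (value 1.75). A different order would give different results, which is the sensitivity the abstract refers to.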

Original language: English (US)
Article number: 8330055
Pages (from-to): 166-180
Number of pages: 15
Journal: IEEE Transactions on Knowledge and Data Engineering
Volume: 31
Issue number: 1
DOIs: 10.1109/TKDE.2018.2822662
State: Published - Jan 1 2019

Fingerprint

Heuristic algorithms
Learning algorithms
Learning systems
Hardness
Experiments

All Science Journal Classification (ASJC) codes

  • Information Systems
  • Computer Science Applications
  • Computational Theory and Mathematics

Cite this

@article{fa465fdf40eb45579f260760e09fa4e7,
title = "Order-Sensitive Imputation for Clustered Missing Values",
abstract = "The issue of missing values (MVs) has appeared widely in real-world datasets and hindered the use of many statistical or machine learning algorithms for data analytics due to their incompetence in handling incomplete datasets. To address this issue, several MV imputation algorithms have been developed. However, these approaches do not perform well when most of the incomplete tuples are clustered with each other, coined here as the Clustered Missing Values Phenomenon, which attributes to the lack of sufficient complete tuples near an MV for imputation. In this paper, we propose the Order-Sensitive Imputation for Clustered Missing values (OSICM) framework, in which missing values are imputed sequentially such that the values filled earlier in the process are also used for later imputation of other MVs. Obviously, the order of imputations is critical to the effectiveness and efficiency of OSICM framework. We formulate the searching of the optimal imputation order as an optimization problem, and show its NP-hardness. Furthermore, we devise an algorithm to find the exact optimal solution and propose two approximate/heuristic algorithms to trade off effectiveness for efficiency. Finally, we conduct extensive experiments on real and synthetic datasets to demonstrate the superiority of our OSICM framework.",
author = "Qian Ma and Yu Gu and Wang-chien Lee and Ge Yu",
year = "2019",
month = "1",
day = "1",
doi = "10.1109/TKDE.2018.2822662",
language = "English (US)",
volume = "31",
pages = "166--180",
journal = "IEEE Transactions on Knowledge and Data Engineering",
issn = "1041-4347",
publisher = "IEEE Computer Society",
number = "1",

}

Order-Sensitive Imputation for Clustered Missing Values. / Ma, Qian; Gu, Yu; Lee, Wang-chien; Yu, Ge.

In: IEEE Transactions on Knowledge and Data Engineering, Vol. 31, No. 1, 8330055, 01.01.2019, p. 166-180.

TY - JOUR

T1 - Order-Sensitive Imputation for Clustered Missing Values

AU - Ma, Qian

AU - Gu, Yu

AU - Lee, Wang-chien

AU - Yu, Ge

PY - 2019/1/1

Y1 - 2019/1/1

AB - The issue of missing values (MVs) has appeared widely in real-world datasets and hindered the use of many statistical or machine learning algorithms for data analytics due to their incompetence in handling incomplete datasets. To address this issue, several MV imputation algorithms have been developed. However, these approaches do not perform well when most of the incomplete tuples are clustered with each other, coined here as the Clustered Missing Values Phenomenon, which attributes to the lack of sufficient complete tuples near an MV for imputation. In this paper, we propose the Order-Sensitive Imputation for Clustered Missing values (OSICM) framework, in which missing values are imputed sequentially such that the values filled earlier in the process are also used for later imputation of other MVs. Obviously, the order of imputations is critical to the effectiveness and efficiency of OSICM framework. We formulate the searching of the optimal imputation order as an optimization problem, and show its NP-hardness. Furthermore, we devise an algorithm to find the exact optimal solution and propose two approximate/heuristic algorithms to trade off effectiveness for efficiency. Finally, we conduct extensive experiments on real and synthetic datasets to demonstrate the superiority of our OSICM framework.

UR - http://www.scopus.com/inward/record.url?scp=85058226786&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85058226786&partnerID=8YFLogxK

U2 - 10.1109/TKDE.2018.2822662

DO - 10.1109/TKDE.2018.2822662

M3 - Article

AN - SCOPUS:85058226786

VL - 31

SP - 166

EP - 180

JO - IEEE Transactions on Knowledge and Data Engineering

JF - IEEE Transactions on Knowledge and Data Engineering

SN - 1041-4347

IS - 1

M1 - 8330055

ER -