Definitions

$$ \begin{array}{lrcl} f: & \mathbb{R}^{d} & \longrightarrow & \mathbb{R}^{d} \\ & X = (x_{1}, \ldots, x_{d}) & \stackrel{f}{\longmapsto} & Y = f(X) = (y_{1} = f_{1}(x_{1}), \ldots, y_{d} = f_{d}(x_{d})) \end{array} $$

$f$ will be called the prediction function in subsequent sections.

$$\forall X_{1}, X_{2} \in E, \quad X_{1} \le_{X} X_{2} \iff \|X - X_{1}\|_{E} \le_{K} \|X - X_{2}\|_{E}$$

$$\forall X_{1}, X_{2} \in E, \quad X_{1} =_{X} X_{2} \iff \|X - X_{1}\|_{E} =_{K} \|X - X_{2}\|_{E}$$
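As an illustration, the two relations can be sketched in Python. This is a minimal sketch that assumes $E = \mathbb{R}^{d}$ with the Euclidean norm and $K = \mathbb{R}$; the function names `leq_x` and `eq_x` are hypothetical and do not come from the text.

```python
import math


def leq_x(ref, x1, x2):
    """x1 <=_X x2: x1 is at least as close to the reference point `ref` as x2.

    Assumption: Euclidean norm on R^d, points given as equal-length tuples.
    """
    return math.dist(ref, x1) <= math.dist(ref, x2)


def eq_x(ref, x1, x2):
    """x1 =_X x2: x1 and x2 are equidistant from the reference point `ref`.

    Exact float comparison is used here; real data may need a tolerance.
    """
    return math.dist(ref, x1) == math.dist(ref, x2)
```

Note that two distinct points can satisfy `eq_x`, for example `(1.0, 0.0)` and `(0.0, -1.0)` relative to the origin, so $=_{X}$ is coarser than ordinary equality.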

k-NN multiple imputation

For given $d, n \in \mathbb{N}$, define the finite data set $\mathcal{D} = \{X_{i} \in \mathbb{R}^{d} \mid i \in \{1, \ldots, n\}\}$.

Lemma 1

For $X \in \mathbb{R}^{d}$, $\le_{X}$ is a total order on $\mathcal{D}$, with equality understood as $=_{X}$.

Proof.

k-NN

For $X \in \mathbb{R}^{d}$, $(\mathcal{D}, \le_{X})$ is a totally ordered finite set, so its elements can be indexed in increasing order: $\mathcal{D} = \{X_{1}, \ldots, X_{n}\}$ with $X_{i-1} \le_{X} X_{i}$ for all $i \in \{2,\ldots,n\}$.

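For $k \in \{1,\ldots,n\}$, define the set of the $k$ nearest neighbors of $X$ in $\mathcal{D}$, with $\mathcal{D}$ indexed in increasing order for $\le_{X}$ as above:

$$\mathcal{N}^{k}_{X} = \{X_{i} \in \mathcal{D} \mid i \in \{1, \ldots, k\}\}.$$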
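Selecting the $k$ nearest neighbors of a point under $\le_{X}$ amounts to sorting $\mathcal{D}$ by distance to $X$ and keeping the first $k$ elements. The following is a minimal sketch, assuming the Euclidean norm on $\mathbb{R}^{d}$; the name `k_nearest` is hypothetical, and breaking ties by input order is an arbitrary choice made here, not one stated in the text.

```python
import math


def k_nearest(ref, data, k):
    """Return the k points of `data` closest to `ref`.

    Assumptions: Euclidean norm; points are equal-length tuples.
    `sorted` is stable, so equidistant points keep their input order.
    """
    if not 1 <= k <= len(data):
        raise ValueError("k must satisfy 1 <= k <= n")
    ranked = sorted(data, key=lambda x: math.dist(ref, x))
    return ranked[:k]
```

For instance, with `data = [(3.0,), (1.0,), (2.0,), (10.0,)]`, calling `k_nearest((0.0,), data, 2)` returns `[(1.0,), (2.0,)]`.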

Imputation on a two-feature data set

Define $\mathcal{D^{\prime}} = \{(X_{i},Y_{i}) \in \mathcal{D} \times f(\mathcal{D}) \mid i \in \{1,\ldots,n\}, \, Y_{i} = f(X_{i})\}$, a two-feature data set in which every $X_{i}$ is known and $Y_{i}$ can be missing at random.

Single missing value imputation

Let $(X^{*}, Y^{*}) \in \mathcal{D} \times \mathbb{R}^{d}$ be a partially missing observation of $\mathcal{D^{\prime}}$: $X^{*}$ is known and $Y^{*}$ is to be imputed.

For $k \in \{1,\ldots,n\}$, we take the image $f(\mathcal{N}^{k}_{X^{*}}) = \{Y_{i} = f(X_{i}) \mid X_{i} \in \mathcal{N}^{k}_{X^{*}}\} = \{Y_{1}, \ldots, Y_{k}\}$ of the $k$ nearest neighbors of $X^{*}$ in $\mathcal{D}$ as a sample of candidate imputed values for $Y^{*}$.

(discuss conditions on $f$)

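$Y^{*}$ can be imputed by the mean of the neighbors' images:

$$ Y^{*} = \frac{1}{k} \sum_{i=1}^{k} Y_{i} $$

or by a uniform random draw among them:

$$ Y^{*} \sim \text{Uniform}(f(\mathcal{N}^{k}_{X^{*}})) = \text{Uniform}(\{Y_{1}, \ldots, Y_{k}\}) $$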
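The two imputation rules above, the neighbor mean and the uniform draw, can be sketched as follows. This is a minimal sketch under stated assumptions: the Euclidean norm is used, `pairs` contains only fully observed $(X_{i}, Y_{i})$ pairs, and the names `neighbor_images`, `impute_mean` and `impute_draw` are hypothetical.

```python
import math
import random


def neighbor_images(x_star, pairs, k):
    """Y values of the k observed pairs whose X is closest to x_star.

    Assumption: `pairs` is a list of (X, Y) tuples with no missing Y.
    """
    ranked = sorted(pairs, key=lambda xy: math.dist(x_star, xy[0]))
    return [y for _, y in ranked[:k]]


def impute_mean(x_star, pairs, k):
    """Deterministic rule: componentwise mean of the k neighbor images."""
    ys = neighbor_images(x_star, pairs, k)
    dim = len(ys[0])
    return tuple(sum(y[j] for y in ys) / k for j in range(dim))


def impute_draw(x_star, pairs, k, rng=random):
    """Stochastic rule: one uniform draw among the k neighbor images."""
    return rng.choice(neighbor_images(x_star, pairs, k))
```

The mean rule always returns the same value for a given $X^{*}$, whereas the draw rule returns one of the observed $Y_{i}$; passing a seeded `random.Random` instance as `rng` makes the draw reproducible.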

Imputation of several missing values

For $p \in \mathbb{N}$, let $\{(X^{*}_{j}, Y^{*}_{j}) \in \mathcal{D} \times \mathbb{R}^{d} \mid j \in \{1,\ldots,p\}\}$ be the finite set of partially missing values; describe how the imputed value $Y^{*}_{j}$ is built for each $1 \le j \le p$.

Multiple imputation

For $m \in \mathbb{N}$, describe how the $m$ imputed data sets are built using k-NN imputation with random sampling.
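One possible reading of this construction, offered only as a sketch and not as the method the text will eventually specify, is to repeat the uniform draw independently for each missing value and each of the $m$ data sets. The assumptions are: Euclidean norm, neighbors searched among fully observed pairs only, and independent draws. The name `multiple_imputation` and its parameters are hypothetical.

```python
import math
import random


def multiple_imputation(observed, missing_xs, k, m, seed=None):
    """Build m completed data sets from k-NN imputation with random sampling.

    Assumptions (not from the text): neighbors are searched among the fully
    observed pairs only, and every draw is independent and uniform.
    """
    rng = random.Random(seed)
    # Candidate images depend only on the observed pairs: compute them once.
    candidates = []
    for x_star in missing_xs:
        ranked = sorted(observed, key=lambda xy: math.dist(x_star, xy[0]))
        candidates.append([y for _, y in ranked[:k]])
    datasets = []
    for _ in range(m):
        # One independent uniform draw per missing value.
        imputed = [(x, rng.choice(ys)) for x, ys in zip(missing_xs, candidates)]
        datasets.append(list(observed) + imputed)
    return datasets
```

Each returned data set contains the observed pairs followed by the imputed pairs, so the $m$ data sets differ only in the drawn $Y^{*}_{j}$ values.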

Imputation on an $N$-feature data set, $N \in \mathbb{N}$

Given $f: (X_{1},\ldots,X_{N-1}) \mapsto Y_{N} = f((X_{1},\ldots,X_{N-1}))$, define $\mathcal{D^{\prime}} = \{(X_{1,i},\ldots,X_{N-1,i}, Y_{N,i}) \in \mathbb{R}^{d} \times \ldots \times \mathbb{R}^{d} \mid i \in \{1,\ldots,n\}, \, Y_{N,i} = f((X_{1,i},\ldots,X_{N-1,i}))\}$, a finite $N$-feature data set.