Definitions
- $A$ and $B$ two sets, $g: A \longrightarrow B$ a function:
- $g$ is injective $\iff \forall x, y \in A, f(x) =_{B} f(y) \implies x =_{A} y$
- $g$ is surjective $\iff \forall y \in B, \exists x \in A, y =_{B} f(x)$
- $g$ is bijective $\iff g$ is injective and surjective $\iff \forall y \in B, \exists! x \in A, y =_{B} f(x) \iff \exists g^{-1}: B \longrightarrow A, g \circ g^{-1} = id_A \land g^{-1} \circ g = id_B$
- Subset $g$ image : $A^{\prime} \subset A, g(A^{\prime}) = \{g(x) \in B| \, x \in A^{\prime}\} \subset B$
- Subset inverse $g$ image: $B^{\prime} \subset B, g^{-1}(B^{\prime}) = \{x \in A| \, g(x) \in B^{\prime}\} \subset A$
- For a given $d \in \mathbb{N}$, let's define the function $f$:
$$ \begin{array}{lrcl} f: & \mathbb{R}^{d} & \longrightarrow & \mathbb{R}^{d} \\ & X = (x_{1}, \ldots, x_{d}) & \stackrel{f}{\longmapsto} & Y = f(X) = (y_{1} = f_{1}(x_{1}), \ldots, y_{d} = f_{d}(x_{d})) \end{array} $$
$f$ will be called the prediction function in subsequent sections.
- For a given normed space vector on corpse $K$ $(E, \|~\|_{E})$ and $X \in E$, let' define the binary relation $\le_{X}$ (nearest neighborhood order):
$$\forall X_{1} \in E \land \forall X_{2} \in E, X_{1} \le_{X} X_{2} \iff \|X - X_{1}\|_{E} \le_{K} \|X - X_{2}\|_{E}$$
- For a given normed space vector on corpse $K$ $(E, \|~\|_{E})$ and $X \in E$, let' define the binary relation $=_{X}$ (nearest neighborhood equality):
$$\forall X_{1} \in E \land \forall X_{2} \in E, X_{1} =_{X} X_{2} \iff \|X - X_{1}\|_{E} =_{K} \|X - X_{2}\|_{E}$$
k-NN multiple imputation
For a given $d, n \in \mathbb{N}$, let's define the finite data set $\mathcal{D} = \{X_{i} \in \mathbb{R}^{d} | \, i \in \{1, \ldots, n\}\}$.
Lemme 1
For $X \in \mathbb{R}^{d}$, $\le_{X}$ is a complete order on $\mathcal{D}$.
Proof.
- Reflexivity: $\forall X_{1} \in \mathcal{D}, X_{1} \le_{X} X_{1} \iff \|X - X_{1}\| \le_{\mathbb{R}} \|X - X_{1}\|$
- Antisymmetry: $\forall X_{1} \in \mathcal{D} \land \forall X_{2} \in \mathcal{D}, ((X_{1} \le_{X} X_{2} \iff \|X - X_{1}\| \le_{\mathbb{R}} \|X - X_{2}\|) \land (X_{2} \le_{X} X_{1} \iff \|X - X_{2}\| \le_{\mathbb{R}} \|X - X_{1}\|)) \underset{\le_{\mathbb{R}}\, is\, antisymmetric}\implies \|X - X_{1}\| =_{\mathbb{R}} \|X - X_{2}\| \iff X_{1} =_{X} X_{2}$
- Transitivity: $\forall X_{1} \in \mathcal{D} \land \forall X_{2} \in \mathcal{D} \land \forall X_{3} \in \mathcal{D}, X_{1} \le_{X} X_{2} \iff \|X - X_{1}\| \le_{\mathbb{R}} \|X - X_{2}\| \land X_{2} \le_{X} X_{3} \iff \|X - X_{2}\| \le_{\mathbb{R}} \|X - X_{3}\| \iff \|X - X_{1}\| \le_{\mathbb{R}} \|X - X_{2}\| \le_{\mathbb{R}} \|X - X_{3}\| \underset{\le_{\mathbb{R}}\, is\, transitive}\implies \|X - X_{1}\| \le_{\mathbb{R}} \|X - X_{3}\| \iff X_{1} \le_{X} X_{3}$
- Completeness: order on finite set is complete (trivial proof to be done).
k-NN
For $X \in \mathbb{R}^{d}$, $(\mathcal{D}, \le_{X})$ is a fully ordered finite set: $\mathcal{D} = \{X_{i} | \, \forall i \in \{2,\ldots,n\}, X_{i-1} \le_{X} X_{i} \}$
For $k \in \{1,\ldots,n\}$, let's define:
- $\mathcal{N}^{k}_{X} = \{X_{i} \in \mathcal{D}|\, i \in \{1,\ldots,k\} \land \forall i \in \{2,\ldots,k\}, X_{i-1} \le_{X} X_{i} \}$ the ordered finite set of the $k$ nearest neighbors of $X$ in $\mathcal{D}$
- $\underset{\le_{X}}\max \mathcal{N}^{k}_{X}$ the k-th nearest neighbor of $X$ in $\mathcal{D}$
.
Imputation on $2$ features data set
Let's define $\mathcal{D^{\prime}} = \{(X_{i},Y_{i}) \in \mathcal{D} \times f(\mathcal{D}) | \, \forall i \in \{1,\ldots,n\}, Y_{i} = f(X_{i})\}$ a two features data set where $X_{i}$ is known and $Y_{i}$ can be missing at random.
Unique missing value imputation
$(X^{*}, Y^{*}) \in \mathcal{D} \times \mathbb{R}^{d}$ a partially missing value, as $X^{*}$ is known, to impute in $\mathcal{D^{\prime}}$.
For $k \in \{1,\ldots,n\}$, we consider the image $f(\mathcal{N}^{k}_{X^{*}}) = \{Y_{i} \in f(\mathcal{D})| \, Y_{i} = f(X_{i}) \land \forall i \in \{1, \ldots, k\}, X_{i} \in \mathcal{N}^{k}_{X^{*}}\}$ of the k nearest neighbors of $X^{*}$ in $\mathcal{D}$ as a sample of the possible imputed elements for $Y^{*}$.
(discuss conditions on $f$)
- Impute with the mean:
\[ Y^* = \frac{1}{k} \sum_{i=1}^{k} Y_{i} \]
- Impute with the median:
- Impute with random sampling:
\[ Y^* \sim \text{Uniform}(f(\mathcal{N}^{k}_{X^{*}})) = \text{Uniform}(Y_{1}, \ldots, Y_{k}) \]
Several missing values imputation
For $p \in \mathbb{N}$, $\{(X^*_j, Y^*_j) \in \mathcal{D} \times \mathbb{R}_{d}| \, j \in \{1,\ldots,p\}\}$ the finite set of partially missing values, describe how are built for $1 \le j \le p, Y^*_j$ imputed value.
Multiple imputation
For $m \in \mathbb{N}$, describe how are built the $m$ imputed dataset using k-NN imputation with random sampling.
Imputation on $N \in \mathbb{N}$ features data set
Given $f: (X_{1},\ldots,X_{N-1}) \mapsto Y_N = f((X_{1},\ldots,X_{N-1}))$, let's define $\mathcal{D^{\prime}} = \{(X_{1,i},\ldots,X_{N-1,i}, Y_{N,i}) \in \mathbb{R}^{d} \times \ldots \times \mathbb{R}^{d} | \, \forall i \in \{1,\ldots,n\}, Y_{N,i} = f((X_{1,i},\ldots,X_{N-1,i}))\}$ a $N$ features finite data set.