


Definitions

  • $A$ and $B$ two sets, $g: A \longrightarrow B$ a function:
    • $g$ is injective $\iff \forall x, y \in A, g(x) =_{B} g(y) \implies x =_{A} y$
    • $g$ is surjective $\iff \forall y \in B, \exists x \in A, y =_{B} g(x)$
    • $g$ is bijective $\iff g$ is injective and surjective $\iff \forall y \in B, \exists! x \in A, y =_{B} g(x) \iff \exists g^{-1}: B \longrightarrow A, g^{-1} \circ g = id_A \land g \circ g^{-1} = id_B$
    • Image of a subset $A^{\prime} \subset A$ under $g$: $g(A^{\prime}) = \{g(x) \in B| \, x \in A^{\prime}\} \subset B$
    • Inverse image of a subset $B^{\prime} \subset B$ under $g$: $g^{-1}(B^{\prime}) = \{x \in A| \, g(x) \in B^{\prime}\} \subset A$
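For example, with $g: \mathbb{R} \longrightarrow \mathbb{R}, g(x) = x^{2}$: $g$ is neither injective ($g(-1) = g(1)$) nor surjective (no $x \in \mathbb{R}$ satisfies $g(x) = -1$), and

$$g([0, 2]) = [0, 4], \qquad g^{-1}([0, 4]) = [-2, 2]$$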
  • For a given $d \in \mathbb{N}$, let's define the function $f$:

$$ \begin{array}{lrcl} f: & \mathbb{R}^{d} & \longrightarrow & \mathbb{R}^{d} \\ & X = (x_{1}, \ldots, x_{d}) & \stackrel{f}{\longmapsto} & Y = f(X) = (y_{1} = f_{1}(x_{1}), \ldots, y_{d} = f_{d}(x_{d})) \end{array} $$

$f$ will be called the prediction function in subsequent sections.
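As an illustration, such a componentwise prediction function can be sketched in Python as follows; the component functions chosen here are arbitrary examples, not imposed by the definition.

  import numpy as np

  def make_componentwise_f(component_fns):
      """Build f: R^d -> R^d with f(X) = (f_1(x_1), ..., f_d(x_d))."""
      def f(X):
          return np.array([f_i(x_i) for f_i, x_i in zip(component_fns, X)])
      return f

  # Example with d = 3 and arbitrary component functions.
  f = make_componentwise_f([np.sin, np.exp, lambda x: 2.0 * x])
  print(f(np.array([0.0, 1.0, 2.5])))   # [0.0, 2.718..., 5.0]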

  • For a given normed vector space $(E, \|\cdot\|_{E})$ over a field $K$ and $X \in E$, let's define the binary relation $\le_{X}$ (nearest-neighbor order):

$$\forall X_{1}, X_{2} \in E, X_{1} \le_{X} X_{2} \iff \|X - X_{1}\|_{E} \le_{K} \|X - X_{2}\|_{E}$$

  • For a given normed vector space $(E, \|\cdot\|_{E})$ over a field $K$ and $X \in E$, let's define the binary relation $=_{X}$ (nearest-neighbor equality):

$$\forall X_{1}, X_{2} \in E, X_{1} =_{X} X_{2} \iff \|X - X_{1}\|_{E} =_{K} \|X - X_{2}\|_{E}$$
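These relations are straightforward to express in code; a minimal Python sketch for $E = \mathbb{R}^{d}$, assuming the Euclidean norm (the definitions above allow any norm):

  import numpy as np

  def le_X(X, X1, X2):
      """X1 <=_X X2  iff  ||X - X1|| <= ||X - X2|| (Euclidean norm assumed)."""
      return np.linalg.norm(X - X1) <= np.linalg.norm(X - X2)

  def eq_X(X, X1, X2):
      """X1 =_X X2  iff  ||X - X1|| == ||X - X2|| (up to float tolerance)."""
      return np.isclose(np.linalg.norm(X - X1), np.linalg.norm(X - X2))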

k-NN multiple imputation

For a given $d, n \in \mathbb{N}$, let's define the finite data set $\mathcal{D} = \{X_{i} \in \mathbb{R}^{d} | \, i \in \{1, \ldots, n\}\}$.

For $X \in \mathbb{R}^{d}$, $\le_{X}$ is a total order on $\mathcal{D}$ (with equality understood as $=_{X}$, i.e. a total preorder on $\mathcal{D}$).

Proof.

  • Reflexivity: $\forall X_{1} \in \mathcal{D}, \|X - X_{1}\| \le_{\mathbb{R}} \|X - X_{1}\|$, hence $X_{1} \le_{X} X_{1}$.
  • Antisymmetry: $\forall X_{1}, X_{2} \in \mathcal{D}$, if $X_{1} \le_{X} X_{2}$ and $X_{2} \le_{X} X_{1}$, then $\|X - X_{1}\| \le_{\mathbb{R}} \|X - X_{2}\|$ and $\|X - X_{2}\| \le_{\mathbb{R}} \|X - X_{1}\|$; since $\le_{\mathbb{R}}$ is antisymmetric, $\|X - X_{1}\| =_{\mathbb{R}} \|X - X_{2}\|$, i.e. $X_{1} =_{X} X_{2}$.
  • Transitivity: $\forall X_{1}, X_{2}, X_{3} \in \mathcal{D}$, if $X_{1} \le_{X} X_{2}$ and $X_{2} \le_{X} X_{3}$, then $\|X - X_{1}\| \le_{\mathbb{R}} \|X - X_{2}\| \le_{\mathbb{R}} \|X - X_{3}\|$; since $\le_{\mathbb{R}}$ is transitive, $\|X - X_{1}\| \le_{\mathbb{R}} \|X - X_{3}\|$, i.e. $X_{1} \le_{X} X_{3}$.
  • Totality: $\forall X_{1}, X_{2} \in \mathcal{D}$, the real numbers $\|X - X_{1}\|$ and $\|X - X_{2}\|$ are comparable because $\le_{\mathbb{R}}$ is total, hence $X_{1} \le_{X} X_{2}$ or $X_{2} \le_{X} X_{1}$.

For $X \in \mathbb{R}^{d}$, $(\mathcal{D}, \le_{X})$ is a totally ordered finite set, so its elements can be reindexed such that $\mathcal{D} = \{X_{i} | \, \forall i \in \{2,\ldots,n\}, X_{i-1} \le_{X} X_{i} \}$.

For $k \in \{1,\ldots,n\}$, let's define:

  • $\mathcal{N}^{k}_{X} = \{X_{i} \in \mathcal{D}|\, i \in \{1,\ldots,k\} \land \forall i \in \{2,\ldots,k\}, X_{i-1} \le_{X} X_{i} \}$ the ordered finite set of the $k$ nearest neighbors of $X$ in $\mathcal{D}$
  • $\underset{\le_{X}}\max \mathcal{N}^{k}_{X}$ the k-th nearest neighbor of $X$ in $\mathcal{D}$

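In practice, $\mathcal{N}^{k}_{X}$ can be obtained by sorting $\mathcal{D}$ with respect to $\le_{X}$; a minimal Python sketch, assuming the Euclidean norm on $\mathbb{R}^{d}$:

  import numpy as np

  def k_nearest_neighbors(X, D, k):
      """Return the k points of D (an (n, d) array) closest to X, ordered by
      increasing distance, i.e. the set N^k_X listed in <=_X order."""
      order = np.argsort(np.linalg.norm(D - X, axis=1))   # sort indices by <=_X
      return D[order[:k]]

  # The last element of k_nearest_neighbors(X, D, k) is max_{<=_X} N^k_X,
  # the k-th nearest neighbor of X in D.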

Let's define $\mathcal{D^{\prime}} = \{(X_{i},Y_{i}) \in \mathcal{D} \times f(\mathcal{D}) | \, \forall i \in \{1,\ldots,n\}, Y_{i} = f(X_{i})\}$ a two-feature data set where $X_{i}$ is always known and $Y_{i}$ can be missing at random.

Let $(X^{*}, Y^{*}) \in \mathcal{D} \times \mathbb{R}^{d}$ be a partially missing observation: $X^{*}$ is known and $Y^{*}$ is the value to impute in $\mathcal{D^{\prime}}$.

For $k \in \{1,\ldots,n\}$, we consider the image $f(\mathcal{N}^{k}_{X^{*}}) = \{Y_{i} \in f(\mathcal{D})| \, Y_{i} = f(X_{i}) \land X_{i} \in \mathcal{N}^{k}_{X^{*}}\}$ of the $k$ nearest neighbors of $X^{*}$ in $\mathcal{D}$ as a sample of the possible imputed values for $Y^{*}$.

(discuss conditions on $f$)

  • Impute with the mean:

\[ Y^* = \frac{1}{k} \sum_{i=1}^{k} Y_{i} \]

  • Impute with random sampling:

\[ Y^* \sim \text{Uniform}(f(\mathcal{N}^{k}_{X^{*}})) = \text{Uniform}(Y_{1}, \ldots, Y_{k}) \]
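A minimal Python sketch of both rules, assuming the Euclidean norm on $\mathbb{R}^{d}$ (the function and variable names below are illustrative, not part of the definitions above):

  import numpy as np

  rng = np.random.default_rng(0)

  def impute(X_star, X_known, Y_known, k, method="mean"):
      """Impute Y* for X_star from f(N^k_{X*}), the Y-values of its k nearest
      neighbors in the complete part of D'."""
      distances = np.linalg.norm(X_known - X_star, axis=1)   # ||X* - X_i||
      neighbors = np.argsort(distances)[:k]                  # indices of N^k_{X*}
      Y_sample = Y_known[neighbors]                          # f(N^k_{X*})
      if method == "mean":
          return Y_sample.mean(axis=0)          # Y* = (1/k) * sum_i Y_i
      return Y_sample[rng.integers(k)]          # Y* ~ Uniform(Y_1, ..., Y_k)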

For $p \in \mathbb{N}$, let $\{(X^*_j, Y^*_j) \in \mathcal{D} \times \mathbb{R}^{d}| \, j \in \{1,\ldots,p\}\}$ be the finite set of partially missing values; describe how the imputed values $Y^*_j$, $1 \le j \le p$, are built.

For $m \in \mathbb{N}$, describe how the $m$ imputed data sets are built using k-NN imputation with random sampling.
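One possible construction, sketched below under the same assumptions (Euclidean norm, random-sampling rule): each of the $m$ completed data sets is obtained by re-drawing every $Y^*_j$, $1 \le j \le p$, independently from its own neighbor sample.

  import numpy as np

  rng = np.random.default_rng(0)

  def impute_random(X_star, X_known, Y_known, k):
      """Draw Y* uniformly from the Y-values of the k nearest neighbors of X*."""
      neighbors = np.argsort(np.linalg.norm(X_known - X_star, axis=1))[:k]
      return Y_known[neighbors[rng.integers(k)]]

  def multiple_imputation(X_missing, X_known, Y_known, k, m):
      """Build m completed data sets: in each one, every missing Y*_j is
      re-drawn independently by k-NN random sampling."""
      return [np.array([impute_random(x, X_known, Y_known, k) for x in X_missing])
              for _ in range(m)]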

Given $f: (X_{1},\ldots,X_{N-1}) \mapsto Y_N = f((X_{1},\ldots,X_{N-1}))$, let's define $\mathcal{D^{\prime}} = \{(X_{1,i},\ldots,X_{N-1,i}, Y_{N,i}) \in \mathbb{R}^{d} \times \ldots \times \mathbb{R}^{d} | \, \forall i \in \{1,\ldots,n\}, Y_{N,i} = f((X_{1,i},\ldots,X_{N-1,i}))\}$ an $N$-feature finite data set.
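In this $N$-feature case, one possible sketch (assuming $Y_N$ is the only feature that can be missing and the Euclidean norm is used) runs the same neighbor search on the concatenation of the known features $X_{1},\ldots,X_{N-1}$:

  import numpy as np

  rng = np.random.default_rng(0)

  def knn_impute_last_feature(X_known, Y_known, X_missing, k):
      """X_known: (n, (N-1)*d) array of concatenated known features X_1,...,X_{N-1};
      Y_known: (n, d) array of Y_N = f((X_1,...,X_{N-1}));
      X_missing: (p, (N-1)*d) rows whose Y_N is missing.
      Return a (p, d) array of Y_N values drawn from the k nearest neighbors."""
      imputed = np.empty((len(X_missing), Y_known.shape[1]))
      for j, x in enumerate(X_missing):
          neighbors = np.argsort(np.linalg.norm(X_known - x, axis=1))[:k]
          imputed[j] = Y_known[neighbors[rng.integers(k)]]
      return imputed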
