Evaluate Dataset and Sample Difficulty

This article shares papers on evaluating dataset and sample difficulty.

The main content comes from a group-meeting presentation; the slides are available here.

1. Shannon Mutual Information

In probability theory and information theory, the mutual information (MI) of two random variables measures the degree of mutual dependence between them. Concretely, MI quantifies how much the uncertainty about one random variable (usually measured in bits) is reduced by observing the other.

The mutual information of discrete random variables \(X\) and \(Y\) can be calculated as \[ I(X ; Y)=\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)} \]

where p(x, y) is the joint probability mass function of X and Y, and p(x) and p(y) are the marginal probability mass functions of X and Y, respectively.

Mutual information can be equivalently expressed in terms of entropies: \[ I(X ; Y)=H(Y)-H(Y \mid X)=H(X)-H(X \mid Y) \]
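
As a quick numeric check of these formulas, here is a small sketch (with a made-up \(2 \times 2\) joint distribution; nothing here comes from the papers) that computes the mutual information both from the definition and from the entropy decomposition:

```python
import numpy as np

# Hypothetical joint distribution p(x, y) of two binary variables.
p_xy = np.array([[0.3, 0.2],
                 [0.1, 0.4]])
p_x = p_xy.sum(axis=1, keepdims=True)   # marginal p(x), shape (2, 1)
p_y = p_xy.sum(axis=0, keepdims=True)   # marginal p(y), shape (1, 2)

# Definition: sum over x, y of p(x, y) * log2( p(x, y) / (p(x) p(y)) ).
mi_direct = np.sum(p_xy * np.log2(p_xy / (p_x * p_y)))

# Equivalent form: I(X; Y) = H(Y) - H(Y | X).
h_y = -np.sum(p_y * np.log2(p_y))
h_y_given_x = -np.sum(p_xy * np.log2(p_xy / p_x))   # E_x[ H(Y | X = x) ]
mi_decomposed = h_y - h_y_given_x

print(mi_direct, mi_decomposed)   # both ≈ 0.1245 bits
```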

However, in the context of machine learning, Shannon mutual information conflicts with some of our empirical understanding. Consider the following encryption example:

Under Shannon's theory, the mutual information between the plaintext and the label is equal to the mutual information between the ciphertext and the label, because encryption is an invertible transformation. For a human reader, however, the label can easily be identified from the plaintext but cannot be judged from the ciphertext.

To resolve this conflict, the authors propose a new concept: considering mutual information under computational constraints.

2. An Extension of Shannon Mutual Information: \(\mathcal{V}\)-Information

First, the authors introduce three concepts:

  1. predictive family

    Let \(\Omega=\{f: \mathcal{X} \cup\{\varnothing\} \rightarrow \mathcal{P}(\mathcal{Y})\}\). We say that \(\mathcal{V} \subseteq \Omega\) is a predictive family if it satisfies \[ \forall f \in \mathcal{V}, \forall P \in \operatorname{range}(f), \quad \exists f^{\prime} \in \mathcal{V}, \quad \text { s.t. } \quad \forall x \in \mathcal{X}, f^{\prime}[x]=P, f^{\prime}[\varnothing]=P \] This condition is called optional ignorance: whatever distribution a predictor in the family can output after seeing some input, some predictor in the family can output that same distribution while ignoring the input.

  2. predictive conditional \(\mathcal{V}\)-entropy

    Let \(X, Y\) be two random variables taking values in \(\mathcal{X} \times \mathcal{Y}\), and \(\mathcal{V}\) be a predictive family. Then the predictive conditional \(\mathcal{V}\)-entropy is defined as \[ \begin{aligned} H_{\mathcal{V}}(Y \mid X) & =\inf _{f \in \mathcal{V}} \mathbb{E}_{x, y \sim X, Y}[-\log f[x](y)] \\ H_{\mathcal{V}}(Y \mid \varnothing) & =\inf _{f \in \mathcal{V}} \mathbb{E}_{y \sim Y}[-\log f[\varnothing](y)] \end{aligned} \] We additionally call \(H_{\mathcal{V}}(Y \mid \varnothing)\) the \(\mathcal{V}\)-entropy, and also denote it as \(H_{\mathcal{V}}(Y)\) .

  3. predictive conditional \(\mathcal{V}\)-information

    Let \(X, Y\) be two random variables taking values in \(\mathcal{X} \times \mathcal{Y}\), and \(\mathcal{V}\) be a predictive family. The predictive \(\mathcal{V}\)-information from \(X\) to \(Y\) is defined as \[ I_{\mathcal{V}}(X \rightarrow Y)=H_{\mathcal{V}}(Y \mid \varnothing)-H_{\mathcal{V}}(Y \mid X) \]
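
For intuition, consider the unconstrained special case \(\mathcal{V}=\Omega\), i.e., every map from \(\mathcal{X} \cup\{\varnothing\}\) to \(\mathcal{P}(\mathcal{Y})\) is available. The infima in the \(\mathcal{V}\)-entropies are then attained by the true conditional and marginal distributions (the cross-entropy \(\mathbb{E}[-\log q(y)]\) is minimized by \(q=p\)), so \[ H_{\Omega}(Y \mid X)=\mathbb{E}_{x, y}[-\log p(y \mid x)]=H(Y \mid X), \qquad H_{\Omega}(Y \mid \varnothing)=\mathbb{E}_{y}[-\log p(y)]=H(Y), \] and therefore \(I_{\Omega}(X \rightarrow Y)=H(Y)-H(Y \mid X)=I(X ; Y)\): with unlimited computation, \(\mathcal{V}\)-information reduces to Shannon mutual information. Restricting \(\mathcal{V}\) to a concrete model family is what makes the measure sensitive to computational constraints.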

\(\mathcal{V}\)-information has the following properties:

  • basic properties

    Let \(Y\) and \(X\) be any random variables on \(\mathcal{Y}\) and \(\mathcal{X}\), and let \(\mathcal{V}\) and \(\mathcal{U}\) be any predictive families. Then we have:

    • Monotonicity: If \(\mathcal{V} \subseteq \mathcal{U}\), then \(H_{\mathcal{V}}(Y) \geq H_{\mathcal{U}}(Y), H_{\mathcal{V}}(Y \mid X) \geq H_{\mathcal{U}}(Y \mid X)\).
    • Non-Negativity: \(I_{\mathcal{V}}(X \rightarrow Y) \geq 0\).
    • Independence: If \(X\) is independent of \(Y\), then \(I_{\mathcal{V}}(X \rightarrow Y)=I_{\mathcal{V}}(Y \rightarrow X)=0\).
  • Data processing inequalities (different from Shannon mutual information)

    • Shannon Mutual Information: Letting \(t: \mathcal{X} \rightarrow \mathcal{X}\) be any function, \(t(X)\) cannot have higher mutual information with \(Y\) than \(X: I(t(X) ; Y) \leq I(X ; Y)\).
    • \(\mathcal{V}\)-Information: In the encryption example, taking \(t\) to be the decryption algorithm and \(\mathcal{V}\) to be a class of natural language processing functions, we have \(I_{\mathcal{V}}(t(X) \rightarrow Y)>I_{\mathcal{V}}(X \rightarrow Y) \approx 0\), i.e., processing the input can increase usable information.
  • Asymmetry (different from Shannon mutual information)

    If \(\mathcal{V}\) contains only polynomial-time computable functions and \(h: \mathcal{X} \rightarrow \mathcal{Y}\) is easy to compute but hard to invert (e.g., a cryptographic hash function), then \(I_{\mathcal{V}}(X \rightarrow h(X)) \gg I_{\mathcal{V}}(h(X) \rightarrow X)\).

These properties of \(\mathcal{V}\)-information resolve the conflict in the encryption example above: when \(\mathcal{V}\) is restricted to functions a human or model can actually compute, the plaintext carries a large amount of usable information about the label while the ciphertext carries almost none, even though the two have identical Shannon mutual information with the label.


The above gives the exact definition of \(\mathcal{V}\)-information. In machine learning, however, we do not have access to the true distribution, only a finite dataset sampled from it. The authors therefore address how to estimate \(\mathcal{V}\)-information from a finite-sized dataset:

Let \(X, Y\) be two random variables taking values in \(\mathcal{X}, \mathcal{Y}\), let \(\mathcal{D}=\left\{\left(x_i, y_i\right)\right\}_{i=1}^N \sim X, Y\) denote the set of samples drawn from the joint distribution over \(\mathcal{X}\) and \(\mathcal{Y}\), and let \(\mathcal{V}\) be a predictive family. The empirical \(\mathcal{V}\)-information (under \(\mathcal{D}\)) is the \(\mathcal{V}\)-information under the empirical distribution defined by \(\mathcal{D}\): \[ \hat{I}_{\mathcal{V}}(X \rightarrow Y ; \mathcal{D})=\inf _{f \in \mathcal{V}} \frac{1}{|\mathcal{D}|} \sum_{y_i \in \mathcal{D}} \log \frac{1}{f[\varnothing]\left(y_i\right)}-\inf _{f \in \mathcal{V}} \frac{1}{|\mathcal{D}|} \sum_{x_i, y_i \in \mathcal{D}} \log \frac{1}{f\left[x_i\right]\left(y_i\right)} \] Then we have the following PAC bound on the empirical \(\mathcal{V}\)-information:

Assume \(\forall f \in \mathcal{V}, x \in \mathcal{X}, y \in \mathcal{Y}, \log f[x](y) \in[-B, B]\). Then for any \(\delta \in(0,0.5)\), with probability at least \(1-2 \delta\), we have: \[ \left|I_{\mathcal{V}}(X \rightarrow Y)-\hat{I}_{\mathcal{V}}(X \rightarrow Y ; \mathcal{D})\right| \leq 4 \mathfrak{R}_{|\mathcal{D}|}\left(\mathcal{G}_{\mathcal{V}}\right)+2 B \sqrt{\frac{2 \log \frac{1}{\delta}}{|\mathcal{D}|}} \] where we define the function family \(\mathcal{G}_{\mathcal{V}}=\{g \mid g(x, y)=\log f[x](y), f \in \mathcal{V}\}\), and \(\mathfrak{R}_N(\mathcal{G})\) denotes the Rademacher complexity of \(\mathcal{G}\) with sample number \(N\).
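
The following is a minimal sketch of how such an estimate might be computed in practice (assumptions: \(\mathcal{V}\) is taken to be scikit-learn logistic-regression classifiers and the data is a synthetic toy set; this is illustrative, not the authors' experimental setup). The two infima are approximated by fitting one predictor that sees \(x\) and one constant predictor that ignores it:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

# Toy stand-in for a dataset D = {(x_i, y_i)} sampled from the joint distribution.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

# H_V(Y | X): best predictor in V that may look at x (here V = logistic regression).
f_x = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
h_y_given_x = log_loss(y_te, f_x.predict_proba(X_te))   # mean -log f[x](y), in nats

# H_V(Y | ∅): best predictor in V that ignores x; for log loss this is the
# label marginal estimated on the training split.
p_marg = np.array([np.mean(y_tr == 0), np.mean(y_tr == 1)])
h_y_null = log_loss(y_te, np.tile(p_marg, (len(y_te), 1)))

# Empirical V-information = H_V(Y | ∅) - H_V(Y | X).
print(f"estimated I_V(X -> Y) ≈ {h_y_null - h_y_given_x:.3f} nats")
```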

3. Evaluate Dataset and Sample Difficulty using \(\mathcal{V}\)-Information

Conducting experiments with the \(\mathcal{V}\)-information tools introduced above reveals some interesting phenomena:

  • Larger models achieve both higher accuracy and higher \(\mathcal{V}\)-usable information, because they extract more usable information from the input, which makes recognition easier
  • \(\mathcal{V}\)-information is more sensitive to overfitting than accuracy
  • \(\mathcal{V}\)-information provides a way to compare the difficulty of different datasets for a given model family

Next, a method for evaluating the \(\mathcal{V}\)-information of individual sample points (pointwise \(\mathcal{V}\)-information, PVI) is introduced:

Given random variables \(X, Y\) and a predictive family \(\mathcal{V}\), the pointwise \(\mathcal{V}\)-information (PVI) of an instance \((x, y)\) is \[ \operatorname{PVI}(x \rightarrow y)=-\log _2 g[\varnothing](y)+\log _2 g^{\prime}[x](y) \] where \(g \in \mathcal{V}\) s.t. \(\mathbb{E}[-\log g[\varnothing](Y)]=H_{\mathcal{V}}(Y)\) and \(g^{\prime} \in \mathcal{V}\) s.t. \(\mathbb{E}\left[-\log g^{\prime}[X](Y)\right]=H_{\mathcal{V}}(Y \mid X)\).

PVI is to \(\mathcal{V}\)-information what pointwise mutual information (PMI) is to Shannon information: \[ \begin{aligned} I(X ; Y) & =\mathbb{E}_{x, y \sim P(X, Y)}[\operatorname{PMI}(x, y)] \\ I_{\mathcal{V}}(X \rightarrow Y) & =\mathbb{E}_{x, y \sim P(X, Y)}[\operatorname{PVI}(x \rightarrow y)] \end{aligned} \]

The complete procedure is as follows: finetune one model \(g^{\prime} \in \mathcal{V}\) on the training split \(\left\{\left(x_i, y_i\right)\right\}\), finetune another model \(g \in \mathcal{V}\) on \(\left\{\left(\varnothing, y_i\right)\right\}\) (the same data with the input replaced by the empty string), and then compute \(\operatorname{PVI}(x \rightarrow y)=-\log _2 g[\varnothing](y)+\log _2 g^{\prime}[x](y)\) for each held-out instance \((x, y)\).
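
Continuing the toy logistic-regression sketch above (again illustrative rather than the authors' setup; `f_x`, `p_marg`, `X_te`, and `y_te` are the names defined there), the per-instance PVI values can be computed as:

```python
import numpy as np

# g'[x](.) : class probabilities from the model that sees x.
# g[∅](.)  : class probabilities from the model that ignores x (label marginal).
proba_x = f_x.predict_proba(X_te)
proba_null = np.tile(p_marg, (len(y_te), 1))

# PVI(x -> y) = -log2 g[∅](y) + log2 g'[x](y), evaluated at the true label y.
rows = np.arange(len(y_te))
pvi = -np.log2(proba_null[rows, y_te]) + np.log2(proba_x[rows, y_te])

# Averaging PVI over the held-out set recovers the empirical V-information (in bits).
print(f"mean PVI ≈ {pvi.mean():.3f} bits")
```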


Experiments with PVI also reveal some interesting phenomena:

  • Samples with the lowest PVI frequently turn out to contain labeling errors
  • The PVI threshold above which the model tends to classify samples correctly is around 0.5; a small usage sketch follows below
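
One way to act on these observations (continuing the same toy variables from the snippets above; the 0.5 cutoff simply mirrors the threshold reported here):

```python
import numpy as np

# Lowest-PVI instances are prime candidates for a label audit, and instances
# below the ~0.5 bit threshold are the ones the model is likely to get wrong.
suspect_label_errors = np.argsort(pvi)[:20]      # 20 lowest-PVI held-out instances
likely_misclassified = np.where(pvi < 0.5)[0]
```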


In addition, the original paper presents further interesting experimental findings that are not repeated here; please refer to the original paper for details.

