diff --git a/collections/_posts/2023-10-27-LUPI.md b/collections/_posts/2023-10-27-LUPI.md new file mode 100644 index 00000000..51a5389f --- /dev/null +++ b/collections/_posts/2023-10-27-LUPI.md @@ -0,0 +1,103 @@ +--- +layout: review +title: "LUPI: Learning Using Privileged Information" +tags: machine learning, SVM, privileged information +author: "Juliette Moreau" +cite: + authors: "Vladimir Vapnik and Akshay Vashist" + title: "A new learning paradigm: learning using privileged information. doi: https://doi.org/10.1016/j.neunet.2009.06.042" + venue: "Neural Networks, 2009" +--- + + +# Highlights + +* This paper introduces a new learning paradigm called LUPI. +* The addition of privileged information during the training even if it is not present during the inference improves the performances. +* The paradigm is tested on three different applications. + +# Introduction +Unlike machine learning, where the role of the teacher is not important, in human learning the teacher is very important. He/she provides additional information hidden in explanations, comments or comparisons. The paper explores a new paradigm in which specific forms of privileged information are considered during the training but not during the test phase. The superiority of this paradigm over the classical one is demonstrated. it is called the LUPI paradigm for learning using privileged information. + + +# Problematization +Usually the training is based on a set of pairs: + +$$(x_1,y_1),(x_2, y_2),...,(x_n,y_n) \quad where \quad x_i \in X, \quad y_i \in \{-1,1\} $$ + +following an unknown probability measure $$P(x,y)$$ and the goal is to find among a collection of functions $$f(x, \alpha), \alpha \in \Lambda$$ the function $$y=f(x, \alpha*)$$ that minimizes the number of incorrect classifications. + +In the LUPI paradigm we have triplets: + +$$(x_1, x^*_1, y_1),(x_2, x^*_2, y_2),...,(x_n, x^*_n, y_n) \quad where \quad x_i \in X, \quad x^*_i \in X^*, \quad y_i \in \{-1,1\} $$ + +following the unknown probability measure $$P(x, x^*, y)$$ and the goal is to find among a collection of functions $$f(x, \alpha), \alpha \in \Lambda$$ the function $$y=f(x, \alpha*)$$ that minimizes the number of incorrect classifications. + +The objective is exactly the same, but during training there is additional information that is not available during the test, that's why it's privileged information. + +They imagine several types of privileged information and some examples in the medical domain. For temporal prediction, some information from intermediate steps can be added (knowing the patient's symptoms at 3, 6 and 9 months to predict the evolution at 12 months). For images, holistic descriptions can be used (doctor's description of biopsy to classify between healthy and cancerous). + +# Privileged information in SVM type algorithms: the SVM+ method + +They apply this paradigm to SVM, the goal is to show that the convergence rate is significantly improved with this method. + +Some propositions are demonstrated introducing oracle SVM. + +> Proposition 1: If any vector $$x \in X$$ belongs to one and only one of the classes and there exists an Oracle function with respect to the best decision rule in the admissible set of hyperparameters, then with the probability $$-\eta$$ the following bound holds true +> +> $$ P(y[(w_l,x)+b_l]<0) \leq P(1-\xi^0 <0) + A \frac{h ln\frac{l}{h}-ln (\eta)}{l}$$ +> +> where $$P(y[(w_l,x)+b_l]<0)$$ is the probability of error for the Oracle SVM solution on the training set of size $$l$$, $$P(1-\xi^0 <0)$$ is the probability of error for the best solution in the admissible set of functions, $$h$$ is the VC dimension of the admissible set of hyperplanes, and $$A$$ is a constant. +> +> That is the Oracle solution converges to the best possible solution in the admissible set of solutions with the rate $$O(h/l)$$. + +But in reality a teacher does not know the oracle function, but can supply privileged information instead with the admissible set of the correcting functions $$\phi(x^*, \delta), \delta \in \Delta$$ which defines that values of the oracle function $$ \xi^0(x_i) = \phi(x^*_i, \delta_0), \forall (x_i, x^*_i, y_i)$$ + +![](/collections/images/LUPI/comparaison_SVM_Bayes.jpg) + +> Proposition 2: Under the conditions of proposition 1 with probability $$1-\eta$$ the following bound holds true +> +> $$ P(y[(w_l,x)+b_l]<0) \leq P(1-\phi(x^*, \delta_l<0)+A \frac{(h+h^*) ln\frac{2l}{h+h^*}-ln (\eta)}{l}$$ +> +> where $$P(y[(w_l,x)+b_l]<0)$$ is the probability of error for solution of the problem $$R(w,b,\delta) = \sum_{i=1}^{l} \theta [\phi(x^*_i, \delta)-1]$$ (where $$\theta(u)=1$$ if $$u>0$$) subjects to constraints $$y_i[(w,x_i)+b] \geq -\phi(x^*_i, \delta), i=1,\ldots,l$$ on the training set of size $$l$$, $$P(1-\phi(x^*, \delta_l<0)$$ is the probability of event $$\{\phi(x^*, \delta_l)>1\}$$, $$h$$ is the VC dimension of the admissible set of hyperplanes and $$h^*$$ is the VC dimension of the admissible set of correcting functions. + +For the rest, two models of the set of correcting functions are considered: the general $$X^*$$SVM+ model (an admissible set of non-negative correcting functions is defined in the multi-dimensional $$X^*$$-space) and the particular dSVM+ model (a set of admissible non-negative correcting functions is defined in a special one-dimensional d-space). The "plus" designs the use of privileged information. + +# Three examples of privileged information + +## Advanced technical model as privileged information: protein folding + +Proteins are classified in a human-made hierarchical classification according to their 3D structure in order to know how they are connected through evolution. But this 3D structure is complicated and time-consuming to obtain, whereas the amino acid sequence is easy to obtain, but getting the first from the second is not straightforward. + +The aim is to classify the proteins in this classification, but only with the amino acid sequence thanks to a pattern recognition technique. SVMs are considered to be one of the best techniques for constructing this type of decision rule. + +They focus on finding homologies between amino acid sequences between superfamilies in 80 cases of binary classification with 80 superfamilies that are representative of protein diversity and have enough sequences to be significant. Similarity between sequences is computed with profile-kernel, similarity between 3D structures with MAMMOTH. The set of vectors $$x \in X$$ is the similarity between sequences used for training, while $$x^* \in X^*$$ is the similarity between 3D structures used as privileged information. + +![](/collections/images/LUPI/error_rates_protein_folding.jpg) + +The classical paradigm is in no way superior to the LUPI paradigm. In 3 cases there is no error for either paradigm. In 11 cases there is no difference between them. In all other cases LUPI improves the results. dSVM+ is better than $$X^*$$SVM+, but the classification with the 3D structure remains better. + +## Future events as privileged information: Mackey-Glass series + +There are two parametrizations of such problem: quantitatively when with the value at time $$t$$ we predict the value at $$t+Dt$$ or qualitatively , when with the value at time $$t$$ we estimate if it is greater or lower at time $$t+Dt$$. + +In the first case LUPI can be used with a regression model and in the second case a pattern recognition model can be used. The latter case is studied in the article using chaotic series and the Mackey-Glass equation. These equations are used in several models and one in particular, which shows that the SVM is better than many other techniques for predicting time series, is used as a comparison for the LUPI paradigm. + +The goal is to predict whether $$x(t+T)>x(t)$$ with a 4-dimensional vector of time series observations: only one step ahead, or also 5 and 8 steps ahead of the time of interest. + +![](/collections/images/LUPI/error_rate_Mackey-Glass_series.jpg) + +The Bayesian law is approximated to evaluate the proximity between the error of the Oracle SVM and the Bayesian error considering that the SVM error converges to Bayesian error. + + +## Hollistic description as privileged information: digits classification + +As an example of holistic description as privileged information, they use the MNIST classification between 5 and 8 with resized to 10x10 images. A poetic description of the numbers is produced and translated into a 21-dimensional feature vector that gives characteristics of the number. + +![](/collections/images/LUPI/error_rates_digit_recognition.jpg) + +# Discussion + +3D information is very strong, so classification based on it remains better, but there are numerous cases where the LUPI paradigm gives remarkable results, such as with time series. Finding an even better space $$X^*$$ for the privileged information can improve the results even more. + +The LUPI paradigm is useful to speed up the convergence to the Bayesian solution (especially with few data). diff --git a/collections/images/LUPI/comparaison_SVM_Bayes.jpg b/collections/images/LUPI/comparaison_SVM_Bayes.jpg new file mode 100644 index 00000000..57bcbdc9 Binary files /dev/null and b/collections/images/LUPI/comparaison_SVM_Bayes.jpg differ diff --git a/collections/images/LUPI/error_rate_Mackey-Glass_series.jpg b/collections/images/LUPI/error_rate_Mackey-Glass_series.jpg new file mode 100644 index 00000000..8a302f8b Binary files /dev/null and b/collections/images/LUPI/error_rate_Mackey-Glass_series.jpg differ diff --git a/collections/images/LUPI/error_rates_digit_recognition.jpg b/collections/images/LUPI/error_rates_digit_recognition.jpg new file mode 100644 index 00000000..6c032d9f Binary files /dev/null and b/collections/images/LUPI/error_rates_digit_recognition.jpg differ diff --git a/collections/images/LUPI/error_rates_protein_folding.jpg b/collections/images/LUPI/error_rates_protein_folding.jpg new file mode 100644 index 00000000..4e2067f9 Binary files /dev/null and b/collections/images/LUPI/error_rates_protein_folding.jpg differ