
Machine Learning: Likelihood Ratio Classification

Reading time: ~20 min

In this section, we will continue our study of statistical learning theory by introducing some vocabulary and results specific to binary classification. Borrowing from the language of disease diagnosis, we will call the two classes positive and negative (which, in the medical context, indicate presence or absence of the disease in question). Correctly classifying a positive sample is called a detection, and incorrectly classifying a negative sample is called a false alarm or type I error.

Suppose that \mathcal{X} is the feature set of our classification problem and that \mathcal{Y} = \{+1,-1\} is the set of classes. Denote by (X,Y) a random observation drawn from the probability measure on \mathcal{X} \times \mathcal{Y}. We define p_{+1} to be the probability that a sample is positive and p_{-1} = 1 - p_{+1} to be the probability that a sample is negative. Let f_{+1} be the conditional PMF or PDF of X given the event \{Y = +1\}, and let f_{-1} be the conditional PMF or PDF of X given \{Y = -1\}. We call f_{+1} and f_{-1} the class conditional distributions.

Given a function h: \mathcal{X} \to \mathcal{Y} (which we call a classifier), we define its confusion matrix to be

\begin{align*} \begin{bmatrix} \mathbb{P}(h(X) = +1 | Y = +1) & \mathbb{P}(h(X) = +1 | Y = -1) \\ \mathbb{P}(h(X) = -1 | Y = +1) & \mathbb{P}(h(X) = -1 | Y = -1) \end{bmatrix}.\end{align*}

We call the top-left entry of the confusion matrix the detection rate (or true positive rate, or recall or sensitivity) and the top-right entry the false alarm rate (or false positive rate).
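The entries of the confusion matrix can be estimated from labeled samples by replacing each conditional probability with the corresponding within-class fraction. The following is a minimal sketch of that estimate; the sample data, the threshold classifier h, and the helper name confusion_matrix are made up for illustration.

```julia
# Estimate the confusion matrix of a classifier h from labeled samples.
# The data and the classifier below are hypothetical.
function confusion_matrix(h, xs, ys)
    # rate(pred, cls): fraction of the samples of class `cls` that h labels `pred`
    rate(pred, cls) =
        count(i -> h(xs[i]) == pred && ys[i] == cls, eachindex(xs)) / count(==(cls), ys)
    dr   = rate(1, 1)     # detection rate
    far  = rate(1, -1)    # false alarm rate
    miss = rate(-1, 1)    # positives classified as negative
    tnr  = rate(-1, -1)   # negatives classified as negative
    [dr far; miss tnr]
end

xs = [0.2, 1.5, 2.3, -0.4, 0.9, 3.1, -1.2, 0.1]
ys = [-1, 1, 1, -1, -1, 1, -1, 1]
h(x) = x ≥ 1 ? 1 : -1

C = confusion_matrix(h, xs, ys)
detection_rate   = C[1, 1]
false_alarm_rate = C[1, 2]
```

Each column of the estimated matrix sums to 1, since every sample of a given class receives exactly one of the two predictions.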

The precision of a classifier h is the conditional probability of \{Y = +1\} given \{h(X) = +1\}. Show that a classifier can have high detection rate, low false alarm rate, and low precision.

Solution. Suppose that p_{-1} = 0.999 and that h has detection rate 0.99 and false alarm rate 0.01. Then the precision of h is

\begin{align*} \mathbb{P}(Y = +1 | h(X) = +1) = \frac{\mathbb{P}(\{Y = +1\} \cap \{h(X) = +1\}) }{\mathbb{P}(h(X) = +1)} = \frac{(0.001)(0.99)}{(0.001)(0.99) + (0.999)(0.01)} \approx 0.09.\end{align*}

We see that, unlike detection rate and false alarm rate, precision depends on the value of p_{-1}. If p_{-1} is very high, it can result in low precision even if the classifier has high accuracy within each class.
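To see this dependence numerically, we can evaluate the same Bayes' rule expression for several values of the positive-class probability while holding the detection rate and false alarm rate fixed. This is only a sketch; the function name precision_rate and the list of probabilities are chosen for illustration.

```julia
# Precision from Bayes' rule, given the detection rate `dr`, the false alarm
# rate `far`, and the probability `p_pos` that a sample is positive.
precision_rate(dr, far, p_pos) = p_pos * dr / (p_pos * dr + (1 - p_pos) * far)

# Same classifier quality (0.99 and 0.01), increasingly rare positive class.
for p_pos in (0.5, 0.1, 0.01, 0.001)
    println("p_pos = ", p_pos, "   precision ≈ ",
            round(precision_rate(0.99, 0.01, p_pos), digits = 3))
end
```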

The Bayes classifier

\begin{align*} h(\mathbf{x}) = \begin{cases} +1 & \text{if }p_{+1}f_{+1}(\mathbf{x}) \geq p_{-1}f_{-1}(\mathbf{x}) \\ -1 & \text{otherwise} \\ \end{cases}\end{align*}

minimizes the probability of misclassification. In other words, it is the classifier h for which

\begin{align*} \mathbb{P}(h(X) = +1 \text{ and }Y = -1) + \mathbb{P}(h(X) = -1 \text{ and }Y = +1)\end{align*}

is as small as possible. However, the two types of misclassification often have different real-world consequences, and we might therefore wish to weight them differently. Given t \geq 0, we define the likelihood ratio classifier

\begin{align*} h_t(\mathbf{x}) = \begin{cases} +1 & \text{if }\frac{f_{+1}(\mathbf{x})}{f_{-1}(\mathbf{x})} \geq t \\ -1 & \text{otherwise}. \end{cases}\end{align*}

Show that the likelihood ratio classifier is a generalization of the Bayes classifier.

Solution. If we let t = p_{-1}/p_{+1}, then the inequality \frac{f_{+1}(\mathbf{x})}{f_{-1}(\mathbf{x})} \geq t simplifies to p_{+1}f_{+1}(\mathbf{x}) \geq p_{-1}f_{-1}(\mathbf{x}). Therefore, the Bayes classifier is equal to h_{p_{-1}/p_{+1}}.
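The equivalence can also be checked numerically. The sketch below assumes, purely for illustration, that the class conditional densities are \mathcal{N}(0,1) and \mathcal{N}(2,1) and that p_{+1} = 0.2; the names normal_pdf, bayes, and lr_classifier are not from the text.

```julia
# Hypothetical setup: class -1 has density N(0,1), class +1 has density N(2,1).
normal_pdf(x, μ) = exp(-(x - μ)^2 / 2) / sqrt(2π)
f_pos(x) = normal_pdf(x, 2.0)
f_neg(x) = normal_pdf(x, 0.0)
p_pos = 0.2
p_neg = 1 - p_pos

# Bayes classifier and likelihood ratio classifier, written as in the text.
bayes(x) = p_pos * f_pos(x) ≥ p_neg * f_neg(x) ? 1 : -1
lr_classifier(x, t) = f_pos(x) / f_neg(x) ≥ t ? 1 : -1

# With t = p_neg / p_pos the two classifiers agree at every grid point.
xs = -3:0.25:5
agree = all(bayes(x) == lr_classifier(x, p_neg / p_pos) for x in xs)
```

Choosing t larger than p_{-1}/p_{+1} makes the classifier more reluctant to predict +1, which trades detections for fewer false alarms.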

Receiver Operating Characteristic

If we increase t, then some of the predictions of h_t switch from +1 to -1, while others stay the same. Therefore, neither the detection rate nor the false alarm rate can increase as t increases. Likewise, neither rate can decrease as t decreases. If we let t range over the interval [0,\infty] and plot each ordered pair (\operatorname{FAR}(h_t), \operatorname{DR}(h_t)), then we obtain a curve like the one shown in the figure below. This curve is called the receiver operating characteristic (ROC) of the likelihood ratio classifier.

The ideal scenario is that this curve passes through points near the top left corner of the square, since that means that some of the classifiers in the family \{h_t : t \in [0,\infty]\} have both high detection rate and low false alarm rate. We quantify this idea using the area under the ROC (called the AUROC). This value is close to 1 for an excellent classifier and close to \frac{1}{2} for a classifier whose ROC is the diagonal line from the origin to (1,1).

A graph of the receiver operating characteristic (ROC).
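For a small discrete feature set, the ROC points and the AUROC can be computed directly from the class conditional PMFs. The sketch below uses made-up PMFs on the feature set \{1,2,3,4\}, and it assumes that consecutive ROC points are joined by line segments, so that the area is given by the trapezoid rule; all names are chosen for illustration.

```julia
# Hypothetical class conditional PMFs on the feature set {1, 2, 3, 4}.
f_pos = [0.1, 0.2, 0.3, 0.4]
f_neg = [0.4, 0.3, 0.2, 0.1]
ratio = f_pos ./ f_neg

# h_t predicts +1 exactly on the features whose likelihood ratio is at least t.
detection_rate(t)   = sum(f_pos[ratio .≥ t])
false_alarm_rate(t) = sum(f_neg[ratio .≥ t])

# The rates only change when t crosses one of the ratios; t = Inf gives (0, 0).
thresholds = [0.0; sort(ratio); Inf]
roc = [(false_alarm_rate(t), detection_rate(t)) for t in thresholds]

# Area under the piecewise-linear curve through the ROC points.
pts = sort(roc)
auroc = sum((pts[i+1][1] - pts[i][1]) * (pts[i+1][2] + pts[i][2]) / 2
            for i in 1:length(pts)-1)
```

Both rates shrink together as t grows, so the points move from (1,1) toward the origin, and the resulting area lies between \frac{1}{2} and 1 for these PMFs.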

Suppose that \mathcal{X} = \mathbb{R} and that the class conditional densities for -1 and +1 are normal distributions with unit variances and means 0 and \mu, respectively. For each \mu \in \{1/4,1,4\}, predict the approximate shape of the ROC for the likelihood ratio classifier. Then calculate it explicitly and plot it.

Solution. We predict that the ROC will be nearly diagonal for \mu = \frac{1}{4}, since the class conditional distributions overlap heavily, and therefore any increase in detection rate will induce an approximately equal increase in false alarm rate. When \mu = 4, we expect to get a very large AUROC, since in that case the distributions overlap very little. The \mu = 1 curve will lie between these extremes. To plot these curves, we begin by calculating the likelihood ratio

\begin{align*} \frac{f_{+1}(x)}{f_{-1}(x)} = \frac{\operatorname{e}^{-(x-\mu)^2/2}}{\operatorname{e}^{-x^2/2}} = \operatorname{e}^{\mu x - \mu^2/2}, \end{align*}

so the detection rate for h_t is the probability that an observation drawn from \mathcal{N}(\mu,1) lies in the region where \operatorname{e}^{\mu x - \mu^2/2} \geq t. Solving this inequality for x (with \mu > 0 and t > 0), we find that the detection rate is equal to the probability mass assigned to the interval \left[\frac{\log t}{\mu} + \frac{\mu}{2},\infty\right) by the distribution \mathcal{N}(\mu,1).

Likewise, the false alarm rate is the probability mass assigned to the same interval by the negative class conditional distribution, \mathcal{N}(0,1).
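For reference, the step of solving for x can be written out as follows, assuming \mu > 0 and t > 0. Here \Phi denotes the CDF of the standard normal distribution, a symbol introduced only for this remark.

\begin{align*} \operatorname{e}^{\mu x - \mu^2/2} \geq t \iff \mu x - \frac{\mu^2}{2} \geq \log t \iff x \geq \frac{\log t}{\mu} + \frac{\mu}{2}.\end{align*}

Since subtracting \mu from an observation drawn from \mathcal{N}(\mu,1) gives a standard normal random variable, the two rates are

\begin{align*} \operatorname{DR}(h_t) = 1 - \Phi\left(\frac{\log t}{\mu} - \frac{\mu}{2}\right), \qquad \operatorname{FAR}(h_t) = 1 - \Phi\left(\frac{\log t}{\mu} + \frac{\mu}{2}\right).\end{align*}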

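using Plots, Distributions
# Mass assigned to [log(t)/μ + μ/2, ∞) by N(0,1) and by N(μ,1), respectively
FAR(μ,t) = 1-cdf(Normal(0,1),log(t)/μ + μ/2)
DR(μ,t) = 1-cdf(Normal(μ,1),log(t)/μ + μ/2)
# Points of the ROC as t sweeps a logarithmic grid
ROC(μ) = [(FAR(μ,t),DR(μ,t))
               for t in exp.(-20:0.1:20)]
# One curve per value of μ, then the axis labels
plot(ROC(1/4), label = "μ = 1/4")
plot!(ROC(1), label = "μ = 1")
plot!(ROC(4), label = "μ = 4")
plot!(xlabel = "false alarm rate",
      ylabel = "detection rate")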