# Logistic Regression

In this section we discuss *logistic regression*, which is a discriminative model for binary classification.

**Example**

Consider a binary classification problem where the two classes are equally probable, the class-0 conditional density is a standard multivariate normal distribution in two dimensions, and the class-1 conditional density is a multivariate normal distribution with mean $[1, 1]$ and covariance matrix $I$ (the $2 \times 2$ identity). Find the class boundary for the Bayes classifier.

*Solution.* The Bayes classifier is

$$h(\mathbf{x}) = \begin{cases} 1 & \text{if } f_1(\mathbf{x}) \ge f_0(\mathbf{x}) \\ 0 & \text{otherwise}, \end{cases}$$

where $f_0$ and $f_1$ are the class-0 and class-1 conditional densities (the equal prior probabilities of $\frac{1}{2}$ cancel).

By symmetry, the classifier will predict class 1 for every point above the line $x_1 + x_2 = 1$ and class 0 for every point below the line. We can obtain the same result by solving the equation $f_0(\mathbf{x}) = f_1(\mathbf{x})$. We get

$$\frac{1}{2\pi}e^{-\frac{1}{2}(x_1^2 + x_2^2)} = \frac{1}{2\pi}e^{-\frac{1}{2}\left((x_1-1)^2 + (x_2-1)^2\right)},$$

which simplifies to $x_1 + x_2 = 1$, as desired.
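As a quick numerical sanity check (a sketch using the `Distributions` package, which the next example also uses; the names `f0` and `f1` are introduced here just for illustration), the two class-conditional densities agree on the boundary line and the class-1 density dominates above it:

```julia
using Distributions

# class-conditional densities from the example
f0 = MvNormal([0.0, 0.0], [1.0 0.0; 0.0 1.0])
f1 = MvNormal([1.0, 1.0], [1.0 0.0; 0.0 1.0])

# a point on the boundary x₁ + x₂ = 1 has equal densities under both classes
pdf(f0, [0.3, 0.7]) ≈ pdf(f1, [0.3, 0.7])  # true

# above the line, the class-1 density is larger
pdf(f1, [1.0, 1.0]) > pdf(f0, [1.0, 1.0])  # true
```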

**Example**

Find the regression function $r(\mathbf{x}) = \mathbb{P}(Y = 1 \mid \mathbf{X} = \mathbf{x})$ for the example above. Plot a heatmap of this function.

*Solution.* Let's use the multivariate normal type `MvNormal` from the `Distributions` package.

```julia
using Plots, Distributions, Optim
mycgrad = cgrad([:MidnightBlue, :SeaGreen, :Gold, :Tomato])
gr(aspect_ratio=1, fillcolor=mycgrad) # Plots.jl defaults
A = MvNormal([0,0], [1.0 0; 0 1])  # class-0 conditional distribution
B = MvNormal([1,1], [1.0 0; 0 1])  # class-1 conditional distribution
xgrid = -5:1/2^5:5
ygrid = -5:1/2^5:5
r(x,y) = 0.5pdf(B,[x,y]) / (0.5pdf(A,[x,y]) + 0.5pdf(B,[x,y]))
heatmap(xgrid, ygrid, r)
```

We can see from the heatmap that restricting $r$ to any line of slope 1 yields a function which asymptotes to 0 in the southwest direction and to 1 in the northeast direction, increasing smoothly in between. Such a function is called a **sigmoid** function.
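To see this concretely, here is a quick check (reusing the objects defined in the code block above) that evaluates $r$ along the line $x_2 = x_1$:

```julia
# restrict the regression function to the line x₂ = x₁;
# the values climb smoothly from near 0 to near 1
ts = -4:1:4
[round(r(t, t), digits=3) for t in ts]
```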

Given the regression function $r(\mathbf{x}) = \mathbb{P}(Y = 1 \mid \mathbf{X} = \mathbf{x})$, we can recover the Bayes classifier by predicting class 1 whenever $r(\mathbf{x}) \ge \frac{1}{2}$ and class 0 whenever $r(\mathbf{x}) < \frac{1}{2}$. However, the value of the regression function also conveys the degree of confidence associated with the prediction. If $r(\mathbf{x}_1)$ is just above $\frac{1}{2}$ and $r(\mathbf{x}_2)$ is close to 1, then observations at $\mathbf{x}_1$ and $\mathbf{x}_2$ are both predicted as class 1, but the latter with much more confidence.

The graph in the example above suggests modeling $r$ parametrically as a composition of a linear map and a sigmoid function. Specifically, we posit the model $r(\mathbf{x}) = \sigma(\alpha + \boldsymbol{\beta} \cdot \mathbf{x})$, where $\alpha \in \mathbb{R}$, $\boldsymbol{\beta} \in \mathbb{R}^2$, and $\sigma(u) = \frac{1}{1 + e^{-u}}$.

To select the parameters $\alpha$ and $\boldsymbol{\beta}$, we penalize lack of confident correctness for each training sample. We give a sample $\mathbf{x}_i$ of class 1 the penalty $\log\frac{1}{r(\mathbf{x}_i)}$ (which is small when $r(\mathbf{x}_i)$ is close to 1 and large when $r(\mathbf{x}_i)$ is close to 0), and we give a sample of class 0 the penalty $\log\frac{1}{1 - r(\mathbf{x}_i)}$.
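As a small illustration (the values and the helper name `class1_penalty` are hypothetical, chosen only to show the shape of the penalty), a confidently correct prediction is penalized lightly while a confidently wrong one is penalized heavily:

```julia
# penalty for a class-1 sample as a function of the predicted probability p = r(xᵢ)
class1_penalty(p) = log(1/p)

class1_penalty(0.9)   # ≈ 0.105: confidently correct, small penalty
class1_penalty(0.5)   # ≈ 0.693: uncertain
class1_penalty(0.01)  # ≈ 4.605: confidently wrong, large penalty
```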

**Exercise**

Experiment with different values of $\alpha$ and $\beta$ for the data below and try to get the loss value below 2.45.


```julia
using Optim
Z = [-1.2, -0.8, -0.7, 0.4, -2.4, 1.13]  # class-0 samples
O = [2.2, 1.3, 0.8, 2.5, 2.62]           # class-1 samples
f(α, β, x) = 1/(1 + exp(-α - β*x))
function loss(Z, O, θ)
    α, β = θ
    sum(log(1/(1 - f(α, β, x))) for x in Z) + sum(log(1/f(α, β, x)) for x in O)
end
optimize(θ -> loss(Z, O, θ), [0.0, 1.0])
```
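To inspect the result of the optimization (a sketch; the exact fitted values depend on the run), we can extract the minimizer and the achieved loss:

```julia
# pull out the fitted parameters and the minimal loss value
result = optimize(θ -> loss(Z, O, θ), [0.0, 1.0])
Optim.minimizer(result)  # the fitted [α, β]
Optim.minimum(result)    # the achieved loss; compare with the 2.45 target above
```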

**Example**

Sample 1000 points by choosing one of the two multivariate Gaussian distributions uniformly at random and then sampling from the selected distribution. Find the function of the form $\mathbf{x} \mapsto \sigma(\beta_0 + \beta_1 x_1 + \beta_2 x_2)$ which minimizes the loss

$$L(\boldsymbol{\beta}) = \sum_{i=1}^{1000}\left[ y_i \log\frac{1}{r_{\boldsymbol{\beta}}(\mathbf{x}_i)} + (1 - y_i)\log\frac{1}{1 - r_{\boldsymbol{\beta}}(\mathbf{x}_i)}\right],$$

where $r_{\boldsymbol{\beta}}(\mathbf{x}) = \sigma(\beta_0 + \beta_1 x_1 + \beta_2 x_2)$ and $(\mathbf{x}_i, y_i)$ are the sampled points together with their class labels.

*Solution.* We begin by sampling the points as suggested.

```julia
# sample each point from A (class 0) or B (class 1) with equal probability
observations = [rand(Bool) ? (rand(A), 0) : (rand(B), 1) for i in 1:1000]
cs = [c for ((x,y),c) in observations]
scatter([(x,y) for ((x,y),c) in observations], group=cs, markersize=2)
```

Next, we define the loss function and minimize it:

```julia
σ(u) = 1/(1 + exp(-u))
r(β, x) = σ(β'*[1; x])
C(β, xᵢ, yᵢ) = yᵢ*log(1/r(β, xᵢ)) + (1 - yᵢ)*log(1/(1 - r(β, xᵢ)))
L(β) = sum(C(β, xᵢ, yᵢ) for (xᵢ, yᵢ) in observations)
β̂ = optimize(L, ones(3), BFGS()).minimizer
heatmap(xgrid, ygrid, (x,y) -> r(β̂, [x,y]))
```

We can see that the resulting heatmap looks quite similar to the actual regression function.
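One way to quantify the similarity (a sketch; `r_true` is a helper name introduced here, since the parametric `r(β, x)` above replaced the earlier two-argument definition) is to compare the fitted and exact regression functions across the grid:

```julia
# exact regression function from the first example
r_true(x, y) = 0.5pdf(B, [x,y]) / (0.5pdf(A, [x,y]) + 0.5pdf(B, [x,y]))

# largest discrepancy between the fitted and exact regression functions on the grid
maximum(abs(r(β̂, [x,y]) - r_true(x, y)) for x in xgrid, y in ygrid)
```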

**Example**

In the example above, is it true that $r(\mathbf{x}) = \sigma(\alpha + \boldsymbol{\beta} \cdot \mathbf{x})$ for some $\alpha \in \mathbb{R}$ and $\boldsymbol{\beta} \in \mathbb{R}^2$?

*Solution.* We calculate

$$r(\mathbf{x}) = \frac{\frac{1}{2}f_1(\mathbf{x})}{\frac{1}{2}f_0(\mathbf{x}) + \frac{1}{2}f_1(\mathbf{x})} = \frac{1}{1 + f_0(\mathbf{x})/f_1(\mathbf{x})} = \frac{1}{1 + e^{-(x_1 + x_2 - 1)}},$$

since $\frac{f_0(\mathbf{x})}{f_1(\mathbf{x})} = e^{-\frac{1}{2}(x_1^2 + x_2^2) + \frac{1}{2}\left((x_1-1)^2 + (x_2-1)^2\right)} = e^{1 - x_1 - x_2}$. This expression is equal to $\sigma(\alpha + \boldsymbol{\beta} \cdot \mathbf{x})$ if $\alpha = -1$ and $\boldsymbol{\beta} = [1, 1]$. So the assumption was correct in this example.
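We can confirm this identity numerically at a few points (a quick sketch reusing `A`, `B`, and `σ` from the earlier blocks; `exact` is a helper name introduced here for illustration):

```julia
# the exact regression function should equal σ(-1 + x₁ + x₂) everywhere
exact(x) = 0.5pdf(B, x) / (0.5pdf(A, x) + 0.5pdf(B, x))
all(exact(x) ≈ σ(-1 + x[1] + x[2]) for x in ([0.0, 0.0], [2.0, -1.0], [0.5, 0.5]))  # true
```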

**Exercise**

Consider a binary classification problem on $\mathbb{R}^d$ for which the regression function satisfies $r(\mathbf{x}) = \sigma(\alpha + \boldsymbol{\beta} \cdot \mathbf{x})$ for some $\alpha \in \mathbb{R}$ and $\boldsymbol{\beta} \in \mathbb{R}^d$. Show that the decision boundary is linear.

*Solution.* We solve $r(\mathbf{x}) = \frac{1}{2}$ to find the decision boundary. Since $\sigma(u) = \frac{1}{2}$ exactly when $u = 0$, this equation is equivalent to $\alpha + \boldsymbol{\beta} \cdot \mathbf{x} = 0$, the solution set of which is linear (by definition, since the equation is linear).

This exercise shows that directly applying logistic regression always yields linear decision boundaries. However, we can use logistic regression to find nonlinear decision boundaries by appending components to the feature vectors which are derived from the original features. For example, if we apply the map $[x_1, x_2] \mapsto [x_1, x_2, x_1^2, x_2^2, x_1 x_2]$ to each feature vector, then the linear boundary we discover in $\mathbb{R}^5$ will correspond to a quadric curve in the original feature space $\mathbb{R}^2$.
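As a sketch of that idea (the feature map `ϕ` and the names `r2`, `C2`, `L2`, `β̂2` are introduced here only for illustration, reusing the observations sampled above), we can append the quadratic features and run the same logistic regression machinery:

```julia
# map each feature vector to [x₁, x₂, x₁², x₂², x₁x₂] and fit a logistic model there;
# the resulting boundary is linear in ℝ⁵ but quadric in the original plane
ϕ(x) = [x[1], x[2], x[1]^2, x[2]^2, x[1]*x[2]]
r2(β, x) = σ(β'*[1; ϕ(x)])
C2(β, xᵢ, yᵢ) = yᵢ*log(1/r2(β, xᵢ)) + (1 - yᵢ)*log(1/(1 - r2(β, xᵢ)))
L2(β) = sum(C2(β, xᵢ, yᵢ) for (xᵢ, yᵢ) in observations)
β̂2 = optimize(L2, ones(6), BFGS()).minimizer
heatmap(xgrid, ygrid, (x, y) -> r2(β̂2, [x, y]))
```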