Let’s begin with our simple problem of estimating the parameters for a linear regression model with gradient descent. We will minimize the mean squared error cost function \[ J(\theta) = \frac{1}{N}\sum_{i=1}^{N}(y_{i} - \theta^{T}x_{i})^{2} \] where the gradient \(\nabla J(\theta)\) is in general defined as:
\[ \nabla J(\theta) = \left[ \frac{\partial J}{\partial \theta_{0}},\frac{\partial J}{\partial \theta_{1}}, \cdots, \frac{\partial J}{\partial \theta_{p}} \right]\]
And in the case of linear regression (treating \(\theta\) as a row vector) is:
\[ \nabla J(\theta) = -\frac{2}{N}(y^{T} - \theta X^{T})X \]
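To see where this comes from, write \(J\) in matrix form (with \(\theta\) as a column vector) and differentiate:
\[ J(\theta) = \frac{1}{N}(y - X\theta)^{T}(y - X\theta), \qquad \nabla J(\theta) = -\frac{2}{N}X^{T}(y - X\theta), \]
which, transposed, is the row-vector form above.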
The gradient descent algorithm finds the parameters in the following manner:
repeat while (\(||\nabla J(\theta)|| > \epsilon\)){ \[ \theta := \theta - \eta \nabla J(\theta) \] }
As it turns out, this is quite easy to implement in R as a function which we call gradientR below:
gradientR <- function(y, X, epsilon = 1e-4, eta, iters){
  X = as.matrix(data.frame(rep(1, length(y)), X)) # prepend a column of ones for the intercept
  N = dim(X)[1]
  print("Initialize parameters...")
  theta = t(as.matrix(rnorm(n = dim(X)[2], mean = 0, sd = 1))) # random 1 x (p+1) starting point
  l2loss = c()
  for(i in 1:iters){
    e = t(y) - theta %*% t(X)          # residuals (1 x N)
    l2loss = c(l2loss, sqrt(sum(e^2))) # record the L2 loss for this epoch
    grad = -(2/N) * e %*% X            # gradient of the mean squared error
    theta = theta - eta * (2/N) * grad # descent step (the extra 2/N factor further rescales eta)
    if(sqrt(sum(grad^2)) <= epsilon){  # stop once the gradient norm is small enough
      break
    }
  }
  print("Algorithm converged")
  print(paste("Final gradient norm is", sqrt(sum(grad^2))))
  values <- list("coef" = t(theta), "l2loss" = l2loss)
  return(values)
}
Let’s also make a function that estimates the parameters with the normal equations: \[ \theta = (X^{T}X)^{-1}X^{T}y \]
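This closed form follows from setting the gradient to zero:
\[ \nabla J(\theta) = -\frac{2}{N}X^{T}(y - X\theta) = 0 \quad\Longrightarrow\quad X^{T}X\theta = X^{T}y \quad\Longrightarrow\quad \theta = (X^{T}X)^{-1}X^{T}y \]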
normalest <- function(y, X){
  X = as.matrix(data.frame(rep(1, length(y)), X)) # prepend the intercept column
  theta = solve(t(X) %*% X) %*% t(X) %*% y        # closed-form least-squares solution
  return(theta)
}
Now let’s make up some fake data and see gradient descent in action with \(\eta = 100\) and 1000 epochs:
y = rnorm(n = 10000, mean = 0, sd = 1)
x1 = rnorm(n = 10000, mean = 0, sd = 1)
x2 = rnorm(n = 10000, mean = 0, sd = 1)
x3 = rnorm(n = 10000, mean = 0, sd = 1)
x4 = rnorm(n = 10000, mean = 0, sd = 1)
x5 = rnorm(n = 10000, mean = 0, sd = 1)
ptm <- proc.time()
gdec.eta1 = gradientR(y = y, X = data.frame(x1,x2,x3, x4,x5), eta = 100, iters = 1000)
## [1] "Initialize parameters..."
## [1] "Algorithm converged"
## [1] "Final gradient norm is 9.84293888300476e-05"
proc.time() - ptm
## user system elapsed
## 0.232 0.004 0.237
Let’s check whether we recovered the correct parameter values:
normalest(y=y, X = data.frame(x1,x2,x3,x4,x5))
## [,1]
## rep.1..length.y.. 0.006146655
## x1 0.011377991
## x2 -0.005618812
## x3 0.004558885
## x4 -0.003782078
## x5 -0.001489584
gdec.eta1$coef #Coefficients from gradient descent
## [,1]
## rep.1..length.y.. 0.006162578
## x1 0.011349010
## x2 -0.005620819
## x3 0.004591632
## x4 -0.003783501
## x5 -0.001495606
Pretty close.
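To quantify how close, we can compare the two sets of estimates directly and cross-check against R’s built-in lm(), which solves the same least-squares problem (a quick illustrative check):
max(abs(gdec.eta1$coef - normalest(y = y, X = data.frame(x1, x2, x3, x4, x5)))) # largest coefficient discrepancy
coef(lm(y ~ x1 + x2 + x3 + x4 + x5)) # independent reference fit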
Let’s take a look at the L2-loss for each epoch:
plot(1:length(gdec.eta1$l2loss),gdec.eta1$l2loss,xlab = "Epoch", ylab = "L2-loss")
lines(1:length(gdec.eta1$l2loss),gdec.eta1$l2loss)
What if we decreased the learning rate to \(\eta = 10\)?
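Using the same data, this amounts to rerunning gradientR with eta = 10 (the object name gdec.eta2 below is assumed):
gdec.eta2 = gradientR(y = y, X = data.frame(x1, x2, x3, x4, x5), eta = 10, iters = 1000)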
## [1] "Initialize parameters..."
## [1] "Algorithm converged"
## [1] "Final gradient norm is 0.0493919754917037"
plot(1:length(gdec.eta2$l2loss),gdec.eta2$l2loss,xlab = "Epoch", ylab = "L2-loss")
lines(1:length(gdec.eta2$l2loss),gdec.eta2$l2loss)
Gradient descent often converges slowly because each iteration requires computing the gradient over every single training example.
If we instead update the parameters after each individual training example, we can get excellent estimates despite doing far less work per update.
For stochastic gradient descent, the full-batch gradient
\[ \nabla J(\theta) = -\frac{2}{N}(y^{T} - \theta X^{T})X \]
becomes the gradient of the loss at a single observation:
\[ \nabla J(\theta)_{i} = -2(y_{i} - \theta^{T} X_{i})X_{i} \]
where \(i\) indexes the rows of the data set and the \(1/N\) averaging factor is absorbed into the learning rate.
Thus, the stochastic gradient descent algorithm proceeds as follows for the case of linear regression:
Randomly shuffle the data, then:
repeat { for \(i := 1, \cdots, N\) { \[ \theta := \theta - \eta \nabla J(\theta)_{i} \] } }
Part of the homework assignment will be to write an R function that performs stochastic gradient descent.
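As a starting point, a single epoch of the update loop might look like the following minimal sketch (illustrative only and not a complete solution; sgd_epoch is a hypothetical helper, and X is assumed to already contain the intercept column):
# One epoch of stochastic gradient descent for linear regression
sgd_epoch <- function(y, X, theta, eta){
  for(i in sample(1:nrow(X))){                    # visit the rows in random order
    xi = X[i, ]                                   # i-th observation as a numeric vector
    grad.i = -2 * (y[i] - sum(theta * xi)) * xi   # gradient at a single example
    theta = theta - eta * grad.i                  # update immediately
  }
  return(theta)
}
Shuffling with sample() each epoch keeps the updates from cycling through the data in a fixed order.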