Optimization techniques in Deep Learning

In deep learning, a neural network is built from connected layers (an input layer, hidden layers, and an output layer). In forward propagation we compute the prediction ŷ and then the error function; the error function is also called the loss function or cost function. Optimizers are used to reduce this loss: they update the weights during backpropagation.

 

Gradient Descent:

The earliest and most fundamental optimizer is Gradient Descent. It works as follows:

1. Calculate what a small change in each individual weight would do to the loss function

2. Adjust each individual weight based on its gradient

3. Keep doing steps #1 and #2 until the loss function gets as low as possible

 

During optimization, gradient descent can get stuck in a local minimum. The learning rate helps control this: it is a variable that scales the gradients before they are applied, so the weights change at the right pace, with updates that are neither too big nor too small.
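The update rule above can be sketched in plain Python. This is a toy example (the loss L(w) = (w − 3)² and all parameter values are assumptions for illustration, not from the original post):

```python
# Minimal gradient descent sketch: minimize the toy loss
# L(w) = (w - 3)^2, whose gradient is 2 * (w - 3).

def gradient(w):
    return 2.0 * (w - 3.0)

def gradient_descent(w, lr=0.1, steps=100):
    for _ in range(steps):
        w = w - lr * gradient(w)  # scale the gradient by the learning rate
    return w

w_final = gradient_descent(w=0.0)
print(round(w_final, 4))  # → 3.0, the minimum of the loss
```

With too large a learning rate (say lr=1.1 here) the updates overshoot and diverge; with too small a rate, convergence is needlessly slow.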

 

Stochastic Gradient Descent (SGD):

SGD was introduced to address the cost of computing gradients over a huge dataset. Instead of using the entire dataset on every pass, stochastic gradient descent uses either a single random example or a small batch of examples at a time.
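The batching idea can be sketched as follows. The dataset, model (y = w·x), learning rate, and batch size are all illustrative assumptions:

```python
# Sketch of mini-batch SGD: fit y = w * x by updating w on one
# random mini-batch per step instead of the full dataset.
import random

data = [(x, 2.0 * x) for x in range(1, 11)]  # toy dataset; true weight is 2.0

def sgd(w=0.0, lr=0.01, steps=200, batch_size=5):
    for _ in range(steps):
        batch = random.sample(data, batch_size)  # random mini-batch
        # average gradient of (w*x - y)^2 over the batch
        grad = sum(2 * (w * x - y) * x for x, y in batch) / batch_size
        w -= lr * grad  # update from the batch gradient only
    return w

random.seed(0)
w_est = sgd()  # approaches the true weight 2.0
```

Each step is cheap because it touches only `batch_size` examples, at the cost of noisier gradients.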

There are other optimizers based on gradient descent in common use; a few of them are:

 

Adagrad:

Adagrad adapts the learning rate to individual parameters: some weights end up with different effective learning rates than others. This works really well for sparse data, where many features appear only rarely. Adagrad has a major issue though: because squared gradients accumulate without bound, the adaptive learning rate tends to become vanishingly small over time.
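A minimal sketch of the standard Adagrad update (the loss and hyperparameter values are illustrative assumptions):

```python
# Adagrad sketch: each weight keeps a running sum of its squared
# gradients, so frequently-updated weights get smaller effective steps.
import math

def adagrad_step(w, grad, cache, lr=0.5, eps=1e-8):
    cache = cache + grad ** 2                     # accumulate squared gradient
    w = w - lr * grad / (math.sqrt(cache) + eps)  # per-weight scaled step
    return w, cache

# minimize the toy loss (w - 3)^2
w, cache = 0.0, 0.0
for _ in range(200):
    w, cache = adagrad_step(w, 2 * (w - 3), cache)
```

Note that `cache` only ever grows, which is exactly why the effective learning rate `lr / sqrt(cache)` shrinks over time.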

 

RMSprop:

RMSprop is a refinement of Adagrad. Instead of letting the squared gradients accumulate indefinitely, it keeps an exponentially decaying average of recent squared gradients, which keeps the effective learning rate from shrinking to zero.
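The change from Adagrad is one line: replace the running sum with a decaying average. A sketch of the standard update (loss and hyperparameters again assumed for illustration):

```python
# RMSprop sketch: an exponentially decaying average of squared
# gradients replaces Adagrad's ever-growing sum.
import math

def rmsprop_step(w, grad, avg_sq, lr=0.1, decay=0.9, eps=1e-8):
    avg_sq = decay * avg_sq + (1 - decay) * grad ** 2  # decaying average
    w = w - lr * grad / (math.sqrt(avg_sq) + eps)
    return w, avg_sq

# minimize the toy loss (w - 3)^2
w, avg_sq = 0.0, 0.0
for _ in range(300):
    w, avg_sq = rmsprop_step(w, 2 * (w - 3), avg_sq)
```

Because old squared gradients decay away, the step size stays usable even late in training.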

 

Adam:

Adam stands for adaptive moment estimation, and is another way of using past gradients to scale current updates. Adam also utilizes the concept of momentum by adding a fraction of the previous gradients to the current one, alongside a decaying average of squared gradients. This optimizer has become very widespread, and is a practical default choice for training neural nets.
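A sketch of the standard Adam update, combining both moment estimates with bias correction (the toy loss and hyperparameter values are illustrative assumptions):

```python
# Adam sketch: a decaying average of gradients (first moment,
# momentum) plus a decaying average of squared gradients
# (second moment), each bias-corrected for the early steps.
import math

def adam_step(w, grad, m, v, t, lr=0.1, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad       # first moment (momentum)
    v = b2 * v + (1 - b2) * grad ** 2  # second moment
    m_hat = m / (1 - b1 ** t)          # bias correction
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

# minimize the toy loss (w - 3)^2
w, m, v = 0.0, 0.0, 0.0
for t in range(1, 501):
    w, m, v = adam_step(w, 2 * (w - 3), m, v, t)
```

The defaults shown (b1=0.9, b2=0.999, eps=1e-8) are the commonly cited ones from the Adam paper; in practice they rarely need tuning.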

 
