### Introduction

Many deep learning models learn their objectives through gradient descent. Gradient-descent optimization needs a large number of training samples for a model to converge, which makes it poorly suited to few-shot learning.

In typical deep learning, we train a model to achieve one specific objective. Humans, by contrast, can learn to pursue any objective. Other optimization methods emphasize learn-to-learn mechanisms. In this article, we take a detailed look at the gradient descent method.

### Description

Neural network architectures normally involve a large number of parameters, which are optimized using a gradient descent algorithm. The algorithm takes many iterative steps over many examples to perform well, and it delivers decent performance in the models it trains. In machine learning, gradient descent is used to find the values of a function’s parameters (coefficients) that reduce a cost function as far as possible.

### What is a Gradient?

- In machine learning, a gradient is the derivative of a function that has more than one input variable.
- In mathematical terms, it is known as the slope of a function.
- The gradient simply measures how much the error changes in response to a change in each weight.
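
As a rough illustration of the idea, the sketch below approximates a gradient numerically with central differences. The function `error` and its two weights are hypothetical stand-ins for a model's error surface, and `numerical_gradient` is a helper name introduced only for this example.

```python
def numerical_gradient(f, point, h=1e-6):
    """Approximate the gradient of f at `point` with central differences."""
    grad = []
    for i in range(len(point)):
        forward = list(point)
        backward = list(point)
        forward[i] += h
        backward[i] -= h
        # Partial derivative with respect to the i-th input variable.
        grad.append((f(forward) - f(backward)) / (2 * h))
    return grad


def error(weights):
    # Hypothetical error surface: w0^2 + 3 * w1^2.
    w0, w1 = weights
    return w0 ** 2 + 3 * w1 ** 2


gradient = numerical_gradient(error, [1.0, 2.0])
print(gradient)  # close to the analytic gradient [2 * w0, 6 * w1] = [2.0, 12.0]
```

Each entry of the result says how quickly the error changes when the corresponding weight is nudged, which is the information gradient descent uses to decide its next step.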

### How does gradient descent work?

- Before looking at how gradient descent works, we should review some concepts from linear regression.
- Recall the slope-intercept formula for a line, y = mx + b.
- Here, m denotes the slope and b is the intercept on the y-axis.
- Similarly, recall plotting a scatterplot in statistics and finding the line of best fit.
- That required calculating the error between the actual output and the predicted output (y-hat) using the mean squared error formula.
- The gradient descent algorithm behaves in a similar way, but it is based on a convex function, for example, the one below:

- The starting point is just an arbitrary point at which we evaluate performance.
- From that starting point, we find the derivative (the slope).
- We can use a tangent line to observe how steep the slope is there.
- The slope informs the updates to the parameters, that is, the weights and bias.
- The slope is steepest at the starting point and flattens as new parameters are generated.
- The steepness should gradually decrease until it reaches the lowest point on the curve, known as the point of convergence.
- The objective of gradient descent is to minimize the cost function, that is, the error between predicted and actual y, much like finding the line of best fit in linear regression.
- It needs two things in order to do this: a direction and a learning rate.
- These factors determine the partial derivative calculations of future iterations.
- That allows it to gradually arrive at the local or global minimum, that is, the point of convergence.
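
The steps above can be sketched on a one-dimensional convex function. This is a minimal illustration, not a reference implementation: the cost `(x - 3)^2`, the starting point, the learning rate, and the step count are all assumed values chosen to keep the example small.

```python
def cost(x):
    # Hypothetical convex cost with its minimum at x = 3.
    return (x - 3.0) ** 2


def slope(x):
    # Derivative of the cost: d/dx (x - 3)^2 = 2 * (x - 3).
    return 2.0 * (x - 3.0)


x = 10.0             # arbitrary starting point
learning_rate = 0.1  # step size (assumed value)
slopes = []          # steepness recorded at every step

for step in range(100):
    g = slope(x)
    slopes.append(abs(g))
    # Move against the slope, in the direction of steepest descent.
    x = x - learning_rate * g

print(f"x = {x:.4f}, cost = {cost(x):.8f}")
print(f"first slope = {slopes[0]:.2f}, last slope = {slopes[-1]:.2e}")
```

The recorded slopes shrink as `x` approaches the bottom of the curve, which is the flattening behaviour described in the list.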

### Learning Rate

- The learning rate determines how big the steps are that gradient descent takes in the direction of the local minimum.
- It determines how fast or slow we move towards the optimal weights.
- We need to set the learning rate to a suitable value for gradient descent to reach the local minimum.
- It should be neither too low nor too high.
- This matters because if the steps are too big, gradient descent may never reach the local minimum, since it bounces back and forth across the convex function.
- If we set the learning rate to a very small value, gradient descent will eventually reach the local minimum, but that may take a while.
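
To make the trade-off concrete, here is a small sketch that minimizes `f(x) = x^2` with three learning rates. The function, the three rates, and the step count are assumptions picked for illustration; suitable values depend on the problem at hand.

```python
def run_gradient_descent(learning_rate, start=1.0, steps=50):
    """Minimize f(x) = x^2 (derivative 2x) and return the final x."""
    x = start
    for _ in range(steps):
        x = x - learning_rate * 2.0 * x
    return x


too_small = run_gradient_descent(0.001)  # barely moves in 50 steps
suitable = run_gradient_descent(0.1)     # ends up close to the minimum at 0
too_large = run_gradient_descent(1.1)    # overshoots and bounces further away

for name, value in [("too small", too_small),
                    ("suitable", suitable),
                    ("too large", too_large)]:
    print(f"{name:>9}: x = {value:.6g}")
```

With the largest rate, every step jumps over the minimum and lands further from it than before, so the run diverges instead of converging.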

### The cost or loss function

- It measures the difference, or error, between actual y and predicted y at the current position.
- This improves the machine learning model’s effectiveness by providing feedback to the model so that it can adjust the parameters to reduce the error and find the local or global minimum.
- It iterates continuously, moving along the direction of steepest descent (the negative gradient) until the cost function is close to or at 0.
- The model stops learning at this point.
- Although the terms cost function and loss function are often treated as synonyms, there is a slight difference between them.
- A loss function refers to the error of one training example, while a cost function calculates the average error across a whole training set.
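
The distinction can be sketched in a few lines, assuming squared error as the loss (the mean squared error mentioned earlier). The function names and the sample values are made up for this example.

```python
def squared_error_loss(y_actual, y_predicted):
    """Loss: the error of a single training example."""
    return (y_actual - y_predicted) ** 2


def mean_squared_error_cost(actuals, predictions):
    """Cost: the average loss across the whole training set."""
    losses = [squared_error_loss(y, y_hat)
              for y, y_hat in zip(actuals, predictions)]
    return sum(losses) / len(losses)


# Made-up values for illustration.
actuals = [3.0, 5.0, 7.0]
predictions = [2.0, 5.0, 9.0]

single_loss = squared_error_loss(actuals[0], predictions[0])
total_cost = mean_squared_error_cost(actuals, predictions)
print(single_loss)  # 1.0
print(total_cost)   # (1 + 0 + 4) / 3
```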

### Types of Gradient Descent

There are three main types of gradient descent learning algorithms:

- Batch gradient descent
- Stochastic gradient descent
- Mini-batch gradient descent

#### Batch gradient descent

- It sums the error for each point in a training set.
- It updates the model only after all training examples have been evaluated.
- This process is referred to as a training epoch.
- Although this batching provides computational efficiency, it can still have a long processing time for large training datasets, because all of the data must be stored in memory.
- Batch gradient descent also usually produces a stable error gradient and convergence.
- On the other hand, that convergence point sometimes isn’t ideal, because it can be a local minimum rather than the global one.
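
A minimal sketch of batch gradient descent, fitting the line y = mx + b with a mean squared error cost. The data, learning rate, and epoch count are assumptions chosen so the example converges quickly.

```python
# Made-up data that lies exactly on y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

m, b = 0.0, 0.0       # slope and intercept, starting from an arbitrary point
learning_rate = 0.05  # assumed value
n = len(xs)

for epoch in range(2000):
    grad_m = 0.0
    grad_b = 0.0
    # Accumulate the gradient of the mean squared error over every example.
    for x, y in zip(xs, ys):
        error = (m * x + b) - y
        grad_m += 2.0 * error * x / n
        grad_b += 2.0 * error / n
    # One parameter update per epoch, after the whole set has been seen.
    m -= learning_rate * grad_m
    b -= learning_rate * grad_b

print(f"m = {m:.3f}, b = {b:.3f}")
```

Because every update averages over the full dataset, the path towards the minimum is smooth, at the price of touching every example before each step.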

#### Stochastic gradient descent

- It runs a training epoch for each example within the dataset.
- It updates the parameters for each training example, one at a time.
- Since we only need to hold one training example at a time, the examples are easier to store in memory.
- It can result in losses in computational efficiency when compared to batch gradient descent.
- Its frequent updates can result in noisy gradients.
- This noise can also be helpful in escaping a local minimum and finding the global one.
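
A comparable sketch for stochastic gradient descent, again fitting y = mx + b. The data, learning rate, and epoch count are assumed values, and shuffling the examples each epoch is a common practice added here rather than something the description above requires.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

# Made-up data that lies exactly on y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

m, b = 0.0, 0.0
learning_rate = 0.02  # assumed value

for epoch in range(2000):
    order = list(range(len(xs)))
    random.shuffle(order)  # visit the examples in a random order
    for i in order:
        error = (m * xs[i] + b) - ys[i]
        # Update immediately, using the gradient from this one example only.
        m -= learning_rate * 2.0 * error * xs[i]
        b -= learning_rate * 2.0 * error

print(f"m = {m:.3f}, b = {b:.3f}")
```

Each pass makes as many parameter updates as there are examples, so individual steps are cheap but follow a noisier path than the batch version.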

#### Mini-batch gradient descent

- It combines concepts from both batch gradient descent and stochastic gradient descent.
- It splits the training dataset into small batches and performs an update on each of those batches.
- This approach strikes a balance between the computational efficiency of batch gradient descent and the speed of stochastic gradient descent.
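
The same line-fitting problem can be sketched with mini-batches. The batch size of 2, the learning rate, the data, and the per-epoch shuffle are all assumptions made for illustration.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

# Made-up data that lies exactly on y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0, 11.0]

m, b = 0.0, 0.0
learning_rate = 0.02  # assumed value
batch_size = 2        # assumed value

for epoch in range(2000):
    order = list(range(len(xs)))
    random.shuffle(order)
    # Walk through the shuffled indices one small batch at a time.
    for start in range(0, len(order), batch_size):
        batch = order[start:start + batch_size]
        grad_m = 0.0
        grad_b = 0.0
        for i in batch:
            error = (m * xs[i] + b) - ys[i]
            grad_m += 2.0 * error * xs[i] / len(batch)
            grad_b += 2.0 * error / len(batch)
        # One update per batch, averaged over the examples in it.
        m -= learning_rate * grad_m
        b -= learning_rate * grad_b

print(f"m = {m:.3f}, b = {b:.3f}")
```

Setting `batch_size` to 1 would reduce this loop to stochastic gradient descent, and setting it to the dataset size would reduce it to batch gradient descent.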