Neural networks first gained popularity in fields such as image recognition and natural language processing. Yet there is a growing trend of applying the same toolbox to more "conventional" classification tasks, using features that are neither images nor text. This trend is especially visible in the ad tech industry, which has unique properties favoring neural networks: bid stream, app engagement, and ad performance data all offer an abundance of tagged (labeled) training data. This wealth of data enables data scientists in ad tech to satisfy neural networks' appetite for large training sets, and to achieve excellent results in doing so.

Furthermore, since neural networks can be trained in large batches, feeding the training process multiple observations at a time, they also produce accurate results on highly sparse data, such as bundle usage or GPS location points per user.

**Training neural networks: the concept of “loss”**

Classification models introduce a certain degree of error, reflected as the difference between the desired outcome (the ground truth) and the model's predicted output. Naturally, minimizing this error is desirable. Neural networks measure this error with a loss function (for example, negative log-likelihood) and minimize it through an iterative process that "fixes" the model in small, successive steps, each one shrinking the loss.
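As a toy illustration (not tied to any particular framework), negative log-likelihood can be sketched as the average of -log(p), where p is the probability the model assigned to the true class for each sample:

```python
import math

def negative_log_likelihood(true_class_probs):
    """Average negative log-likelihood.

    true_class_probs: for each sample, the probability the model
    assigned to the correct class. Confident, correct predictions
    (p close to 1) yield a small loss; uncertain ones a large loss."""
    return -sum(math.log(p) for p in true_class_probs) / len(true_class_probs)

good = negative_log_likelihood([0.9, 0.95, 0.85])  # confident model
bad = negative_log_likelihood([0.5, 0.4, 0.6])     # uncertain model
```

The more probability mass the model places on the correct class, the smaller the loss, which is exactly the quantity the training process tries to drive down.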

Neural networks are built from layers, which in turn are made of atomic units called "neurons." The data flowing through the network is transformed by the weights and biases held in each neuron. Minimizing the loss is achieved by nudging these weights and biases in the "correct" direction.
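For a single hypothetical neuron with one weight and one bias, this nudging can be sketched as follows (a minimal gradient-descent step on a squared-error loss, with made-up values):

```python
def train_step(w, b, x, target, lr=0.1):
    """One update of a single neuron y = w*x + b under
    the loss 0.5 * (y - target)**2."""
    pred = w * x + b
    error = pred - target
    grad_w = error * x   # partial derivative of the loss w.r.t. w
    grad_b = error       # partial derivative of the loss w.r.t. b
    # Move both parameters a small step against their gradients
    return w - lr * grad_w, b - lr * grad_b

w, b = 0.0, 0.0
for _ in range(100):
    w, b = train_step(w, b, x=2.0, target=4.0)
# After repeated small steps, w*2.0 + b is close to the target 4.0
```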

**How does the neural network training process determine these “loss-reducing” steps?**

Let’s consider the following analogy:

Assume you are standing on a mountaintop, trying to make your way down to the deepest valley below. From the peak, you don't have to calculate the entire route in advance; rather, you take baby steps in the general downhill direction and re-evaluate your path as you go. However, this process can prove difficult, and you may find yourself in a valley that isn't the deepest one. This is called a "local minimum": you reach a point that appears to be the lowest, while one or more points (and routes leading to them) lie even lower. The best point possible is called the global minimum, representing the lowest level reachable.

If we think of this mountain as a "mountain of loss," descending it is equivalent to minimizing that loss (error), and the global minimum represents the optimal point, where the model's loss is smallest.
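The local-versus-global distinction can be seen on a contrived one-dimensional "mountain." In this sketch (the function and starting points are invented for illustration), plain gradient descent lands in whichever valley is nearest to where it starts:

```python
def f(x):
    """A toy loss surface with two valleys: a shallow one near
    x = 0.95 and a deeper (global) one near x = -1.04."""
    return x**4 - 2 * x**2 + 0.3 * x

def descend(x, lr=0.01, steps=2000):
    """Plain gradient descent on f; the gradient is 4x^3 - 4x + 0.3."""
    for _ in range(steps):
        grad = 4 * x**3 - 4 * x + 0.3
        x -= lr * grad
    return x

local = descend(1.0)    # starts on the right, settles in the shallow valley
best = descend(-1.0)    # starts on the left, settles in the deep valley
```

Starting from the right slope, the descent gets trapped in the shallower valley even though a lower point exists, which is precisely the local-minimum problem described above.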

**Using optimizers to guide the path downhill**

Making your way downhill can prove quite complex: if you take steps that are too large before re-evaluating your route (as part of the iterative process), you may miss the deepest point of the valley and unintentionally climb back up the mountain. Trying to avoid this "overshooting" by taking tiny steps, on the other hand, may make the journey long and costly.
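Overshooting is easy to demonstrate on the simplest possible valley, f(x) = x², whose gradient is 2x (the learning-rate values here are chosen just for the demonstration):

```python
def descend(lr, steps=20, x=1.0):
    """Gradient descent on f(x) = x**2; the minimum is at x = 0."""
    for _ in range(steps):
        x -= lr * 2 * x
    return abs(x)

small_lr = descend(lr=0.1)  # slow but steady progress toward 0
big_lr = descend(lr=1.1)    # each step overshoots 0 and climbs higher
```

With the small learning rate the distance to the minimum shrinks every step; with the large one, each step jumps past the bottom of the valley and ends up farther away than where it started.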

Optimizers provide a way to find a route that reaches the deepest valley. Note that this route isn't pre-calculated in full, but rather constructed step by step as you descend. Following it ultimately yields the smallest loss possible, and thus a more accurate model.

Optimizers use several techniques to find this route. Some control the rate of descent, represented as the learning rate of the training process: the weights and biases of neurons are only "permitted" to change up to a certain extent per step. Controlling the rate of descent usually means taking large steps at the beginning of the downhill journey and smaller ones as you gradually advance. Other optimizers take a more sophisticated approach, accounting for how often each turn appears along the way and treating common and uncommon turns (frequently and infrequently updated parameters) differently. In addition, optimizers may control momentum (the acceleration and deceleration of the downhill journey), reducing "speed" when approaching the global minimum in the deepest valley.
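The momentum idea can be sketched in a few lines. This is classical momentum in its textbook form, again demonstrated on the toy surface f(x) = x² with illustrative hyperparameter values:

```python
def momentum_step(x, velocity, grad, lr=0.1, beta=0.9):
    """Classical momentum update: velocity accumulates a decaying
    average of past gradients, so the 'ball' speeds up on long
    consistent slopes and its oscillations damp out near the bottom."""
    velocity = beta * velocity - lr * grad
    return x + velocity, velocity

x, v = 5.0, 0.0
for _ in range(200):
    x, v = momentum_step(x, v, grad=2 * x)  # gradient of x**2
# x ends up very close to the minimum at 0
```

The `beta` factor plays the role of friction: values near 1 preserve speed built up on the way down, while the decay prevents the ball from racing uphill forever after passing the minimum.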

**An intro to TensorFlow optimizers**

TensorFlow, a neural network framework released by the Google Brain team in November 2015, is considered a "second generation" machine learning system. It offers many "out of the box" optimizers intended to drive the loss down toward the global minimum. They all operate on the same principle: model weights and biases are adjusted according to the gradient of the objective (loss) function, computed as the partial derivative of the loss with respect to each parameter.
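To make "partial derivative of each parameter" concrete, here is a finite-difference estimate of a gradient on an invented two-parameter loss surface. (TensorFlow itself computes gradients by automatic differentiation rather than this numerical method, but the quantity obtained is the same.)

```python
def numerical_gradient(f, params, eps=1e-6):
    """Estimate the partial derivative of f with respect to each
    parameter by nudging one parameter at a time."""
    grads = []
    for i in range(len(params)):
        bumped = list(params)
        bumped[i] += eps
        grads.append((f(bumped) - f(params)) / eps)
    return grads

# Toy loss surface with its minimum at w = (1, -2)
loss = lambda w: (w[0] - 1) ** 2 + (w[1] + 2) ** 2
g = numerical_gradient(loss, [0.0, 0.0])  # approximately [-2.0, 4.0]
```

Each component of the gradient tells the optimizer how sensitive the loss is to that particular parameter, and hence in which direction (and roughly how far) to adjust it.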

**To be continued…**

In my next article, I will offer an in-depth review of several TensorFlow optimizers such as AdaGrad, AdaDelta, Adam, and others, all of which implement different methods and characteristics. In addition, I will discuss stochastic gradient descent, the Nesterov momentum update, and the concept of convergence in neural networks.
