Best practices for model training and serving in machine learning – Part 2

Published on: November 10, 2022

In our previous article, we discussed the parameters that control the training process of machine learning algorithms. We explained how to run and log experiments to find the optimal set of these parameters and achieve good performance for your model. Now we will look at what to do when your network isn’t working as expected and how to convert the final model into a format suitable for deployment.


When training a neural network, you may encounter situations where the network does not work as expected and you need to find the source of the problem. This task can be quite challenging because deep learning models operate like black boxes: their behavior cannot be fully interpreted by humans. However, there are still some general principles that can help you debug your model:

Data inspection

The first thing to do is inspect your input data. Make sure your data is loaded correctly and has passed all pre-processing steps, including augmentation, before entering your network. The easiest way to do this is to visually inspect the images.
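Beyond eyeballing the images, it helps to assert a few basic properties of each pre-processed batch automatically. Below is a minimal NumPy sketch; the function name `check_batch` and the default value range are assumptions for illustration, not part of any particular framework.

```python
import numpy as np

def check_batch(batch, expected_shape, value_range=(0.0, 1.0)):
    """Basic sanity checks on a pre-processed image batch.

    `expected_shape` is the per-image shape, e.g. (3, 224, 224).
    """
    # Images should be stacked along axis 0.
    assert batch.ndim == len(expected_shape) + 1, "batch must stack images along axis 0"
    assert batch.shape[1:] == tuple(expected_shape), f"unexpected image shape {batch.shape[1:]}"
    # NaNs usually mean a broken augmentation or normalization step.
    assert not np.isnan(batch).any(), "NaNs found -- check the augmentation pipeline"
    lo, hi = value_range
    assert batch.min() >= lo and batch.max() <= hi, "values outside the expected range"
    return True
```

You can call such a check on the first batch of every experiment; it costs almost nothing and catches shape or normalization mistakes before hours of wasted training.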


The network often makes wrong predictions in specific cases. Try to identify patterns in these mispredictions; such patterns usually point to examples that are missing from the training dataset.


Outliers are data points that differ significantly from the rest of your dataset. A significant number of outliers reduces the performance of your algorithm. Since your network is unlikely to learn anything useful from them, they can be safely eliminated from the dataset. Outliers can be found by analyzing image metadata (aspect ratio, resolution, brightness, etc.), embeddings, or the confidence scores of your classifier.
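As a concrete illustration of the metadata approach, one simple heuristic is to flag images whose z-score on some metadata value (aspect ratio, mean brightness, and so on) is far from the rest. The sketch below uses only the standard library; the function name `flag_outliers` and the threshold of 3 standard deviations are illustrative assumptions.

```python
import statistics

def flag_outliers(values, z_threshold=3.0):
    """Return indices of values whose z-score exceeds the threshold.

    `values` can be any per-image metadata: aspect ratios, brightness, etc.
    """
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:  # all values identical: nothing to flag
        return []
    return [i for i, v in enumerate(values) if abs(v - mean) / stdev > z_threshold]

# A dataset of mostly 4:3 images with one extreme panorama:
aspect_ratios = [1.33] * 50 + [10.0]
print(flag_outliers(aspect_ratios))  # the panorama's index is flagged
```

The same pattern works on embedding distances or classifier confidence scores once you reduce them to a single number per image.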

Reduced dataset

Running your first experiments on a full dataset is slow and inefficient. Instead, you can start with a small subset that represents your problem. Make sure you can run the entire learning pipeline and get meaningful predictions. Keep in mind that machine learning algorithms are likely to overfit on small datasets.
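One way to build such a subset while keeping it representative is to sample a fixed number of examples per class. A minimal sketch, assuming in-memory lists of samples and labels (the helper name `stratified_subset` is hypothetical):

```python
import random
from collections import defaultdict

def stratified_subset(samples, labels, per_class=10, seed=0):
    """Pick up to `per_class` samples from each class, deterministically."""
    rng = random.Random(seed)  # fixed seed keeps debugging runs reproducible
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)
    subset = []
    for items in by_class.values():
        subset.extend(rng.sample(items, min(per_class, len(items))))
    return subset

samples = list(range(100))
labels = [i % 4 for i in range(100)]
small = stratified_subset(samples, labels, per_class=5)
print(len(small))  # 4 classes x 5 samples = 20
```

With real datasets you would sample file paths or dataset indices rather than the data itself, but the idea is the same: every class stays represented, so the pipeline can be exercised end to end in minutes.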

Model capacity 

Model performance on the training images can tell you a lot about possible problems. Poor performance might mean that your network does not have enough capacity to deal with this particular problem; in that case, try a deeper network with more layers and parameters. Near-ideal training performance is also suspect, as it is probably a result of overfitting. To fix this, make the network smaller or train the current network on more data.
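This reasoning can be condensed into a rough rule of thumb that compares training and validation metrics. The function name and the thresholds below are purely illustrative assumptions; sensible values depend on your task.

```python
def diagnose_fit(train_acc, val_acc, low=0.7, gap=0.15):
    """Crude heuristic mapping train/validation accuracy to a diagnosis."""
    if train_acc < low:
        # The model cannot even fit the training data.
        return "underfitting: try a larger model or longer training"
    if train_acc - val_acc > gap:
        # The model memorizes training data but does not generalize.
        return "overfitting: try a smaller model, regularization, or more data"
    return "ok"

print(diagnose_fit(0.55, 0.50))  # underfitting
print(diagnose_fit(0.99, 0.70))  # overfitting
```

Logging both metrics every epoch (as discussed in Part 1) lets you watch which of these regimes your run is drifting into.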

Vanishing/exploding gradients 

Model training uses the backpropagation algorithm to update network parameters at each training step. The algorithm propagates backward from the output layer to the input layer, calculating the error gradients (that is, derivatives) of the parameters along the way. The gradients may become vanishingly small or extremely large as the backpropagation algorithm progresses towards the input layer.

In the former case, the optimization algorithm never reaches the optimum solution (i.e. the training process gets stuck). In the latter case, the network weights are updated too heavily and the algorithm diverges. If possible, look at the gradients and activations inside the network. 

To cure this error, you can try different weight initialization strategies, change the non-linear activation functions, or apply batch normalization between the convolutional layers. Another solution to the exploding gradient problem is to set a threshold value that the gradients can never exceed (“gradient clipping”).
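The arithmetic behind gradient clipping by global norm can be sketched in plain NumPy: if the combined L2 norm of all gradients exceeds the threshold, every gradient is rescaled by the same factor so the norm lands exactly on the threshold.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient arrays so their combined L2 norm <= max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm <= max_norm:
        return grads  # already within bounds, leave untouched
    scale = max_norm / total_norm
    return [g * scale for g in grads]

grads = [np.array([3.0, 4.0])]          # global norm is 5.0
clipped = clip_by_global_norm(grads, 1.0)
print(np.linalg.norm(clipped[0]))       # 1.0
```

In practice you would use the framework's built-in version, e.g. `torch.nn.utils.clip_grad_norm_` in PyTorch or `tf.clip_by_global_norm` in TensorFlow, called between the backward pass and the optimizer step.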

Model format

Model training is only half the way to a real application, since the model is likely to be executed in a different environment from the one in which it was trained. For example, a model can be trained on a laptop but executed on a mobile phone or an embedded device. Thus, the original model might need to be converted into a format suitable for the specific hardware it will run on.

Today, each deep learning framework has its own model format: .pt for PyTorch, .pb for TensorFlow, .caffemodel for Caffe, or the Intermediate Representation for Intel’s OpenVINO. This causes many difficulties, since not all machine learning operations are implemented in all libraries. ONNX (Open Neural Network Exchange) acts as a mediator for conversion between different machine learning frameworks. Converting your model into the .onnx format is a good idea, as it simplifies the portability of your model and increases cross-framework compatibility.

In this article, we have discussed possible ways to debug your deep learning algorithm when it is not working properly, as well as how to prepare your final model for deployment. The outlined steps will help you create and serve a reliable AI solution for the problem you are trying to solve.
