PyTorch is a deep learning framework for modeling and training neural networks, with optional GPU acceleration. It is a very flexible framework with a Python-first interface that is executed dynamically (operations are performed when the Python instruction is executed), as opposed to compiled systems such as TensorFlow.
The outline of this tutorial is going to be as follows:
You need to know some basics about Python, numpy, machine learning, and deep learning. For a comprehensive introduction to deep learning, I recommend this series of videos, this one, and fast.ai, but of course there are many other resources online.
This notebook is hosted on GitHub under the MIT license. You can get a nicer display using jupyter nbviewer.
You can run this notebook locally using Python >= 3.6 and an installation of PyTorch. You can use PyTorch without a GPU (all the functionalities are supported on CPU); however, if you want GPU acceleration for your computations, you need an Nvidia GPU, CUDA, and cuDNN. Installation details are not provided in this tutorial.
We'll run the notebook on Google Colaboratory, Google Drive's Jupyter notebook tool. It runs on Google's servers, gives access to a GPU, and requires very little configuration. You can save files and install anything on the virtual machine, but the machine will be killed after some time. Be sure to download files that you want to keep (there are none in this tutorial).
Open this notebook in Colaboratory.
After opening, go to Runtime > Change runtime type, and select Python 3 and GPU.
All the dependencies should already be installed on Google Colaboratory. If this is not the case, run the installation cell below.
Install Numpy, PyTorch, TorchVision, and TensorBoard. If you already have them installed, pip will not upgrade anything.
!pip install numpy torch torchvision tensorboard
Commands that start with % (or %% for entire cells) are called magic commands; they are Jupyter notebook extensions. For instance, to time the computation of the sum of the first 100 square numbers, you can use %timeit (or %%timeit for the whole cell).
%%timeit
sum(i**2 for i in range(100))
Commands that start with ! are shell commands. They are run in a subshell of the one where the jupyter-notebook is running. For instance, to see the files in the current folder, you can run the ls command.
!ls
You can query the docstring (help) of anything (module, function, class, method, magic, ...) using ?. For instance, for the Python built-in min function, you can execute:
min?
You can get completion by hitting the [TAB] key. For instance, if you want to know what methods exist on a string (or if you do not remember the exact name of one), you can write
"Hello world!".
and press [TAB].
Notebooks keep a state with all the objects, classes, and functions that you have defined upon executing cells. Unless overwritten, they remain even after you edit or delete the cell, which can easily be confusing. More than ever, it is important to limit the number of global variables (for instance by turning cells into functions).
A good rule of thumb for organizing your code is that you should be confident about restarting the notebook at any time.
Compiled execution graphs, such as TensorFlow's, enable some optimization of the computations, but PyTorch remains highly efficient, with the underlying operations implemented in low-level languages (C, CUDA). In practice, being dynamic means that PyTorch behaves like numpy, doing the actual computation when the instruction is executed. On the contrary, static computation graphs use Python to build a description of what to do, and only perform the computation later on. As in numpy, we take advantage of vectorized operations by applying them over an ndarray. In PyTorch, they are called Tensors, but they behave similarly.
import numpy as np
import torch
Checking the PyTorch version:
torch.__version__
X = torch.tensor([[1, 2], [3, 4]])
# @ is matrix multiplication
X @ X + X
Tensors convert to numpy arrays naturally, so they can be used with numpy. We can also explicitly convert from and to numpy:
X_np = X.numpy()
X_np
Y_np = np.arange(5)
Y = torch.from_numpy(Y_np)
Beware that with either of these two functions, X and X_np (and likewise Y and Y_np) share the same underlying memory. An example of what this means:
Y[3] = -1
Y_np
This is to make efficient use of both frameworks together without having to copy the data every time. To create a new object with a copy of the memory, simply use np.array and torch.tensor:
torch.tensor(Y_np)
np.array(Y)
shares memory | copies memory |
---|---|
`.numpy()` and `torch.from_numpy` | `np.array` and `torch.tensor` |
Tensors can be converted to different data types:
Y.float()
This returns a new tensor with 32 bit floating point data.
The most used are .float(), .int(), .long(), and .byte(). One can also use the more general .to(dtype=torch.int64) with any given PyTorch type. We can query the type of a Tensor using Tensor.dtype.
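As a quick sketch of these conversions (the tensor `Y` here is created just for illustration):

```python
import torch

Y = torch.arange(5)               # default integer type: torch.int64
Y_float = Y.float()               # new tensor with 32-bit floats
Y_byte = Y.to(dtype=torch.uint8)  # same idea with the generic .to()

print(Y.dtype, Y_float.dtype, Y_byte.dtype)
```

Note that each call returns a new tensor; the original `Y` keeps its type.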
Overall, PyTorch is less rich than numpy in the collection of functions it implements. To find the names of the implemented functions, use completion, read the documentation, or search other resources online.
GPUs are processing units with many cores able to do small, simple operations in parallel, together with very fast access to their (GPU) memory. We can use them to parallelize our Tensor operations, such as linear algebra. Neural networks make heavy use of tensor operations and get a nice speed-up on GPUs. Most PyTorch functions can be executed on a GPU.
Initially, GPUs were only intended for graphics processing. With deep learning, GPUs for general-purpose computing are on the rise, with Nvidia dominating the market thanks to its proprietary CUDA framework, which lets one program its GPUs efficiently. Nowadays, all deep learning frameworks run on CUDA. You need to get an Nvidia GPU and install CUDA (+ cuDNN) if you want to use your own hardware.
Hopefully we'll eventually get frameworks that are hardware independent, maybe using OpenCL as is done in PlaidML.
First, let's check whether GPUs are available, and how many. We use torch.cuda.is_available() to know if there are CUDA-compatible GPUs, and torch.cuda.device_count() to know how many of them there are.
torch.cuda.is_available()
torch.cuda.device_count()
To do computation on a GPU, we need to have a tensor in the memory of that GPU, which is not the same across different GPUs, and not the same as the CPU RAM. Once this is done, computation naturally happens on the associated device.
To get a copy of a `Tensor` on a GPU, we just use `Tensor.to(device)` where `device` is:
- `"cuda:0"` for the GPU, here with index 0;
- `"cpu"` for the RAM / CPU computing;
- a `torch.device` object that is just a wrapper around the above.

`Tensor.to()` can also be given:
- a `torch.dtype`, to return a `Tensor` of another type;
- a `Tensor`, to get a copy of the original `Tensor` on the same device and with the same type as the `Tensor` passed as argument.
We use Tensor.device to query the device of a Tensor.
X = torch.tensor([[1., 2.], [3., 4.]])
X_cuda0 = X.to("cuda:0")
X.device
X_cuda0.device
# Computation done on GPU, result tensor is also on GPU
X_2 = X_cuda0 @ X_cuda0
X_2
We cannot do cross-device operations:
try:
    X + X_cuda0
except RuntimeError as e:
    print(e)
Operations on a GPU are asynchronous: they are enqueued and performed in order. This allows fully leveraging multiple GPUs and the CPU at the same time. It is done automatically for the user, and synchronization is performed by PyTorch when required, e.g. when copying a Tensor from one device to another (including for printing). In practice, this means that if you put a CPU operation that waits on the GPU in the middle of your network, you will not fully utilize your CPU and GPU.
There are other ways to manage the device on which computation is done, including Tensor.cuda(device) and Tensor.cpu(), but Tensor.to(device) is the most generic and lets us write device-agnostic code by simply changing the value of a device variable.
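A minimal sketch of such device-agnostic code (the fallback to `"cpu"` is an assumption so the same cell runs on machines without a GPU):

```python
import torch

# Pick the device once; everything below works unchanged on CPU or GPU.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

X = torch.tensor([[1., 2.], [3., 4.]]).to(device)
X_2 = X @ X              # computed on whichever device X lives on
result = X_2.to("cpu")   # bring the result back to RAM for inspection
```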
It is also possible to create a Tensor
directly on a specific device, but this is limited to zeros, ones, and random initialization. This is for more advanced cases that we won't need in this tutorial.
More information can be found in the documentation.
Using PyTorch you cannot share a GPU with other processes. If you share a machine with multiple GPUs with other users without a job scheduler, you might end up getting conflicts.
Your PyTorch code should always assume contiguous GPU indices starting from 0. When running your job, GPU 0 may not be available (run nvidia-smi to see GPU availability). You would then run your code with the environment variable CUDA_VISIBLE_DEVICES set to the GPUs you want to use, for instance CUDA_VISIBLE_DEVICES=2 to use only GPU 2, and CUDA_VISIBLE_DEVICES=1,3 to use GPUs 1 and 3. In PyTorch you will see them as 0, 1, ... This also adjusts the result of torch.cuda.device_count(), etc.
PyTorch is able to numerically compute gradients using reverse mode automatic differentiation, aka backprop' for the cool kids and chain rule for the mathematicians.
$$h: x \mapsto g(f(x))$$
$$h': x \mapsto g'(f(x)) \times f'(x)$$
Or with $y = f(x)$ and $z = g(f(x))$:
$$ \frac{\mathrm{d}z}{\mathrm{d}x} = \frac{\mathrm{d}z}{\mathrm{d}y} \frac{\mathrm{d}y}{\mathrm{d}x} $$
In multiple (input) dimensions, this is:
$$ \frac{\partial z}{\partial x_i} = \sum_j \frac{\partial z}{\partial y_j} \frac{\partial y_j}{\partial x_i} $$
To do this automatically, PyTorch keeps track of the computations performed using a computation graph (built dynamically). When computing gradients, PyTorch successively applies the chain rule to every edge, starting from the output. Here is an example of a compute graph for a two-layer perceptron, from Deep Learning [Goodfellow et al. 2016].
Computing derivatives in reverse mode requires PyTorch to remember the Jacobians, which is memory intensive.
PyTorch computes the derivatives only where required, which defaults to nowhere.
import torch
import torch.autograd as autograd
X = torch.tensor([1, 2, 3], dtype=torch.float32)
l2_norm = X.norm()
try:
    # Gradient of the l2_norm of X, with respect to X
    autograd.grad(l2_norm, X)
except RuntimeError as e:
    print(e)
Indeed, we did not specify that X
was to require a gradient:
X.requires_grad
To specify that X will require a gradient, we need to call the method .requires_grad_(). This lets PyTorch know that we want to compute gradients with respect to this variable, and that it should do what is necessary in order to do so.
X = X.requires_grad_()
l2_norm = X.norm()
# Gradient of the l2_norm of X, with respect to X
autograd.grad(l2_norm, X)
Note that all Tensors that depend on X also need to receive a gradient for the backpropagation to work, so PyTorch sets this automatically:
l2_norm.requires_grad
Also note that after backpropagating, PyTorch will free the computation graph, so we cannot reuse it as is.
try:
    # Try to backpropagate through the graph a second time
    autograd.grad(l2_norm, X)
except RuntimeError as e:
    print(e)
The gradients are often computed in an object-oriented fashion using .backward(). This computes the gradients of everything in the computation graph that has requires_grad=True. The gradients are stored in the .grad attribute of each Tensor. That way, they can be accessed for the gradient descent without having to specify all the gradients one by one.
X = torch.tensor([1, 2, 3], dtype=torch.float32, requires_grad=True)
l2_norm = X.norm()
l2_norm.backward()
X.grad
To reuse a piece of code without storing unnecessary Jacobians, we can use the following context manager (a piece of code that starts with the with keyword). This is convenient for using a neural network without further training. The context manager lets us reuse the exact same code during training and inference.
X = torch.tensor([1, 2, 3], dtype=torch.float32, requires_grad=True)
with torch.no_grad():
    try:
        X.norm().backward()
    except RuntimeError as e:
        print(e)
We can also use detach() to disconnect the computation graph:
X = torch.tensor([1, 2, 3], dtype=torch.float32, requires_grad=True)
# Gradient of square norm of X
autograd.grad(X.norm() * X.norm(), X)
X = torch.tensor([1, 2, 3], dtype=torch.float32, requires_grad=True)
# Gradient of norm of X, times (detached) norm of X
autograd.grad(X.norm().detach() * X.norm(), X)
Overall, the default PyTorch mechanisms are defined according to how gradients are used in neural networks. If you want to know more about automatic differentiation in PyTorch, you can look at this tutorial.
Recall that if $f_\theta$ is a neural network parametrized by $\theta$ (the weights), we optimize over $\theta$ so that the network behaves properly on the input points. More precisely, we have a per-example loss function $\ell$ (e.g. categorical cross-entropy for classification or mean squared error for regression) and want to minimize the generalization error:
$$ \min_{\theta} \mathbb{E}_{x, y} \, \ell(f_{\theta}(x), y) $$
Where $x$ and $y$ follow the unknown true distribution of the data. To make it tractable, we approximate this loss by the empirical training error (on the training data):
$$ \min_\theta \quad \mathcal{L}(\theta) = \frac{1}{N} \sum_{i=1}^N \ell(f_\theta(x_i), y_i) $$
Because computing $f_\theta(x_i)$ and $\nabla_\theta f_\theta(x_i)$ for every $i$ is expensive, we compute them on a subset of the data points to make a gradient descent step over $\theta$. This is known as the stochastic gradient descent algorithm: it is stochastic because we compute the gradient of the loss function only over a mini-batch.
The points sampled to estimate the gradient are known as the batch (or mini-batch), and its size is the batch_size. We change the batch after every gradient step.
We usually sample without replacement. Once all the points have been sampled, we start a new loop over them. Each such loop is known as an epoch.
We monitor the loss on a validation set during training and evaluate the final model on a test set.
If the model is not able to fit the training data, its capacity is too low and the model is said to underfit. If it fits the data nicely but does not generalize to unseen examples, the model is said to overfit, and regularization can be used to mitigate it.
In machine learning, we represent our input points in a higher-dimensional tensor, where (by the usual convention) the first dimension is the example index and the others are feature dimensions. For instance, say we have data for predicting the price of an apartment:
size | downtown | renovated | parking |
---|---|---|---|
30 | false | true | true |
10.4 | false | true | false |
50 | true | false | true |
And the target vector
price |
---|
89.6 |
56 |
10 |
Would be represented as:
X = torch.tensor([
    [30, 0, 1, 1],
    [10.4, 0, 1, 0],
    [50, 1, 0, 1]
])
Y = torch.tensor([89.6, 56, 10])
In the case of Y, the second dimension (dim=1) would have only one feature (the price), so we do not need to add a second axis for the vector.
Depending on the application, the feature dimensions can be organized in more than one dimension to better use the structure. For instance, with images we would have two feature axes, hence a three-dimensional tensor: 0: index, 1: x-axis, and 2: y-axis. With movies we would have a four-dimensional tensor: 0: index, 1: x-axis, 2: y-axis, and 3: time. More advanced structured inputs (sequences, graphs, ...) require designing the tensor more carefully.
In traditional machine learning, we can pass the whole X and Y tensors to an algorithm (linear regression, random forest, SVM). In deep learning, we have far more inputs and use stochastic gradient descent.
A pseudo code for a simple gradient descent would look like the following.
for epoch in range(max_epoch):
    n_batches = (X.size(0) + batch_size - 1) // batch_size
    for i in range(n_batches):
        X_batch = X[i*batch_size: (i+1)*batch_size]
        y_batch = y[i*batch_size: (i+1)*batch_size]
        ##### Detailed later
        # compute predictions
        # compute loss
        # compute gradients
        # apply gradient step
Notice that we didn't truly go through the trouble of sampling properly: we just assume the matrix is shuffled and reuse the same order (in practice this is acceptable).
PyTorch introduces the Dataset and DataLoader classes to do that work. The idea behind the dataloader is that the data might not fit in memory (or GPU memory), so it is loaded only as necessary. Even when that is not an issue, the dataloader comes in handy to avoid rewriting the sampling strategies.
Note: in neural networks, we use float to represent the data, even when it is integer-valued, because we want to do operations with the weights, which are floats.
The Dataset class is just a representation of our data. We have to implement __len__ (the number of examples) and __getitem__ (returns example number i; it doesn't need to support slices).
from torch.utils.data import Dataset
class MyDataset(Dataset):
    def __init__(self, X, Y):
        assert len(X) == len(Y)
        self._X = X
        self._Y = Y

    def __len__(self):
        return len(self._X)

    def __getitem__(self, i):
        return self._X[i], self._Y[i]
We can use it with our data:
ds = MyDataset(X, Y)
ds[0]
One can use the opportunity of the Dataset class to read examples from disk, either one by one in __getitem__, or all at once in __init__. It can also be used to generate new examples on the fly.
In the case of simple Tensors, it is so straightforward that there is a factory function doing it for the previous dataset:
from torch.utils.data import TensorDataset
ds = TensorDataset(X, Y)
ds[0]
The DataLoader combines a Dataset, a Sampler, and a batching function collate_fn to form batches ready to be used by the neural network.
The sampler represents the sampling strategy; it outputs indices that are passed to the Dataset. In practice, it can be constructed automatically by the DataLoader.
collate_fn takes multiple examples and forms a batch. In most cases this is simply concatenation, but it can be changed for more advanced behaviors.
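For illustration, here is a hypothetical `collate_fn` (the name `my_collate` and the toy data are made up) that simply reimplements the default stacking behavior:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

X = torch.arange(12, dtype=torch.float32).reshape(6, 2)
Y = torch.arange(6, dtype=torch.float32)
ds = TensorDataset(X, Y)

def my_collate(examples):
    # `examples` is a list of (x, y) pairs coming from the Dataset;
    # we stack each field into one batch tensor.
    xs, ys = zip(*examples)
    return torch.stack(xs), torch.stack(ys)

loader = DataLoader(ds, batch_size=3, collate_fn=my_collate)
x_batch, y_batch = next(iter(loader))
```

A custom `collate_fn` becomes genuinely useful for variable-length examples (e.g. padding sequences in a batch).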
DataLoader also has other useful parameters, such as the number of workers. Everything is explained in the documentation of the class.
from torch.utils.data import DataLoader
data_loader = DataLoader(ds, batch_size=2, shuffle=True)
for epoch in range(3):  # epoch loop
    print(f"Epoch: {epoch}")
    for batch in data_loader:  # mini-batch loop
        x, y = batch
        print("\t", y)
        # compute loss and gradients
We see that the dataloader served the batches (of size 2) one by one, in a random order (reshuffling between the three epochs). The last batch of every epoch has size 1 because there weren't enough examples left to form a full mini-batch. That is usually not a problem, but it can be controlled with the drop_last option of the DataLoader.
In this section we showed how mini-batches are handled in deep learning and presented PyTorch's convenient way of iterating through them. Using these classes is optional, but it reduces the amount of code to write.
Famous deep learning datasets come with their own classes that download the data on first use, save it to disk, and read it.
More information on data loading can be found in this tutorial.
In a feed-forward neural network, we alternate between layers of different sizes. More complex networks have more complex operations (convolutional, recurrent, gated, attention, stochastic, ...), but in the end, we represent everything as a layer.
For instance, we will say that a matrix multiplication is a Linear layer, and that an elementwise non-linearity (such as ReLU) is another layer.
In PyTorch, we will call our layers and networks Module
s. A Module
can be a layer or a mix of other Module
s.
You can write a Module
to do many things as long as it can be expressed in PyTorch and that a gradient can be computed or estimated.
In PyTorch, all layers and models are instance of the same class Module
. This is because just one layer can be a neural network; and combining neural networks also makes a neural network.
In the following, remember that a Module (a layer or a neural network) is just a succession of operations, or simply a function. The goal of the Module class is simply to keep track of some objects and to interface seamlessly with the rest of PyTorch.
Let's make a simple feed-forward neural network with a couple of Modules (or layers).
import torch
import torch.nn as nn
model = nn.Sequential(
    nn.Linear(4, 20),   # from 4 input neurons to 20 hidden neurons
    nn.ReLU(),          # an elementwise non-linearity
    nn.Linear(20, 20),  # this is a `Module`
    nn.ReLU(),          # this is also a `Module`
    nn.Linear(20, 1)    # output layer: one neuron for our regression
)
That's a neural network with two hidden layers, each with 20 neurons. The input layer has 4 entries (the number of features of our X), which means we need a linear transformation mapping $\mathbb{R}^4$ to $\mathbb{R}^{20}$.
The whole network is a Module
:
isinstance(model, nn.Module)
We can use it on our data:
model(X)
Linear
and ReLU
are also Modules
, and we can use them on some data as well:
issubclass(nn.Linear, nn.Module)
module = nn.Linear(4, 20)
module(X)
A Module has some methods in common with Tensor, such as .to(device), .to(dtype), and .float().
The class-oriented API is what you need to define more complex models. It lets you define how to do the computation using the .forward method.
class MyModel(nn.Module):
    def __init__(self):
        # Initialize the parent class, this is mandatory
        super().__init__()
        # Our previous network.
        # Creating these modules initializes them with different
        # weights. They are different objects representing
        # different variables.
        self.lin1 = nn.Linear(4, 20)
        self.relu1 = nn.ReLU()
        self.lin2 = nn.Linear(20, 20)
        self.relu2 = nn.ReLU()
        self.lin3 = nn.Linear(20, 1)

    # `forward` is just the method name used by PyTorch.
    # The parent class `Module` implements __call__
    # to call our `forward`, but also does some extra work.
    def forward(self, X):
        # This is just some PyTorch computations.
        # Modules do PyTorch operations.
        h = self.lin1(X)
        h = self.relu1(h)
        h = self.lin2(h)
        h = self.relu2(h)
        out = self.lin3(h)
        return out
There is no dark magic in the code above; it's simply a class initializing some Tensors, and then doing some operations on the input X. The same forward computation could be written in numpy (except that we're going to need to compute some gradients). This works the same as before:
model = MyModel()
# The result is not supposed to be the same because
# our linear `Module`s are initialized randomly.
model(X)
Actually, there's a bit more than just the Tensor operations here. The reason for initializing the nn.Modules in __init__ is that PyTorch keeps track of all the Modules set as attributes of the object for us.
That way, when calling methods that need to operate on all submodules, such as .float() or .to(device) (or .train(), .eval(), .parameters(), .apply(), seen later), PyTorch can apply them recursively for us.
When you set as an attribute of a Module another Module, PyTorch keeps track of it to apply some methods recursively. The attribute must be a Module (not a list or dict of modules). However, nn.ModuleList and nn.ModuleDict can be used to create a Module out of other Modules; they behave like a list and a dict respectively.
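A minimal sketch of `nn.ModuleList` (the `DeepNet` name and the layer sizes are made up for illustration):

```python
import torch
import torch.nn as nn

class DeepNet(nn.Module):
    def __init__(self, n_layers=3):
        super().__init__()
        # A plain Python list would hide these layers from PyTorch;
        # nn.ModuleList registers each one as a submodule.
        self.layers = nn.ModuleList(
            [nn.Linear(20, 20) for _ in range(n_layers)]
        )

    def forward(self, X):
        h = X
        for layer in self.layers:
            h = torch.relu(layer(h))
        return h

net = DeepNet()
n_params = len(list(net.parameters()))  # 3 weights + 3 biases
```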
You're free to extend nn.Module the way you want, add other methods, etc. You can have more parameters and options in __init__ and forward. Usually you would put the sizes of your layers and other hyperparameters in your __init__ arguments.
Notice how, in forward, we just feed the result of the previous Module to the next one? That's exactly what the Sequential model from before does; it's just a shorthand.
Let's dig deeper. Our ReLU
doesn't have any parameters, so we don't need to keep track of it, we could simply have only one:
class MyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.relu = nn.ReLU()
        self.lin1 = nn.Linear(4, 20)
        self.lin2 = nn.Linear(20, 20)
        self.lin3 = nn.Linear(20, 1)

    def forward(self, X):
        h = self.lin1(X)
        h = self.relu(h)
        h = self.lin2(h)
        h = self.relu(h)
        out = self.lin3(h)
        return out
Or even:
class MyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.lin1 = nn.Linear(4, 20)
        self.lin2 = nn.Linear(20, 20)
        self.lin3 = nn.Linear(20, 1)

    def forward(self, X):
        h = self.lin1(X)
        h = torch.clamp(h, min=0)  # elementwise max(h, 0), i.e. ReLU
        h = self.lin2(h)
        h = torch.clamp(h, min=0)
        out = self.lin3(h)
        return out
We can also replace the elementwise maximum by a function of only one input:
import torch.nn.functional as F
# Same as an elementwise max(X, 0)
F.relu(X)
In the case of ReLU, this just makes the code slightly more readable, but other functions that we'll use later are more complicated. It's also an opportunity for the developers to optimize the underlying code.
You can have a look at the code for Linear here (it's not hard to read). You'll see that it's very much like our own Module: the class just initializes Parameters in __init__, and uses them to do the affine transformation in forward. F.linear is just a function, without internal Parameters, that returns X @ W.t() + b. This is known as the functional API; it achieves the same as the nn.Module API but in a stateless way (pure functions).
Another important aspect to keep in mind is whether the model is in training mode or evaluation mode. Certain nn.Modules behave differently depending on the mode. For instance, dropout is a regularization technique that randomly sets some outputs of a layer to zero. When the model is evaluated, all outputs are used (with a scaling so that the expectation stays the same at train and test time).
nn.Module keeps track of that for us. All we need to do is say which mode we want, and it propagates to all the nn.Modules in our network.
We use .train()
to put the model in training mode and .eval()
to put it in evaluation mode. The .training
attribute tells us in which mode the network is.
# Put `nn.Module` in train mode
model.train()
# Check if `nn.Module` is in training mode
model.training
# Put `nn.Module` in evaluation mode
model.eval()
# Check if `nn.Module` is in training mode
model.training
If you use the functional API (torch.nn.functional as F) in your .forward method, you can pass self.training to the functions that behave differently at train and test time.
# When using Dropout at train time, a scaling is applied
# to keep the same activation expectation downstream.
F.dropout(X, training=True)
# no dropout at evaluation time
F.dropout(X, training=False)
In our forward loop, applying dropout before the last layer would look like this:
def forward(self, X):
    h = self.lin1(X)
    h = torch.clamp(h, min=0)  # elementwise ReLU
    h = self.lin2(h)
    h = torch.clamp(h, min=0)
    h = F.dropout(h, training=self.training)
    out = self.lin3(h)
    return out
The neural network is a function of both its input $x$ and its parameters (or weights) $\theta$. We'll leave the internal weights for PyTorch to manage. That is the goal of the Parameter class: it wraps a Tensor to let PyTorch know that this tensor needs to be updated. That means both computing its gradient (requires_grad=True) and making the gradient step
$\theta \leftarrow \theta - \alpha \nabla_\theta \mathcal{L}$.
To know which parameters to update, PyTorch recursively looks in your Module's attributes for other Modules, Parameters, or sequences of those.
You only need to use Parameter if you implement a very specific type of layer. If you make a network, you can just reuse the layers already available. Even when building new layers, it's often possible to reuse other layers.
Our Linear has some parameters (a linear matrix transformation and a bias vector, so actually an affine transformation):
for p in nn.Linear(4, 20).parameters():
    print(p)
If you re-execute, you will see different values; that is because it's a different object. If you want to tie weights in your network, you can reuse the same Module object.
ReLU
doesn't have any parameters:
for p in nn.ReLU().parameters():
    print(p)
Our model has all the parameters of its submodules because this method is recursive. Let's print only one:
print(next(model.parameters()))
A good practice is to define a function to initialize the Parameters of the module.
The Module.apply method lets us apply a function to every module (hence submodule) in a Module. To write an initialization function, we can use the initializations provided in torch.nn.init.
Remember that the function will be given every module, including the main one, so we have to filter for the submodules that we wish to initialize.
# `.weight` is the name of the attribute used in
# the `Linear` layer. There is also `.bias`.
# This may be different for other layers with `Parameter`s.
lin = nn.Linear(3, 3)
lin.weight
Therefore we can use
def my_init(m):
    if isinstance(m, nn.Linear):
        nn.init.kaiming_uniform_(m.weight)
        nn.init.constant_(m.bias, 0)
model.apply(my_init)
Calling .apply(func)
is the same as doing:
for m in model.modules():
    my_init(m)
Getting a frozen copy of your model and being able to save it to disk is important for many reasons, e.g. keeping the version of the model (its Parameters) performing the best on the validation set.
The method .state_dict() of a Module returns a dictionary with all the inside information necessary to recover the state of the model (mostly the values of the inner Parameters). The method .load_state_dict(dict) loads that state back.
The function torch.save can be used to serialize (save to disk) torch objects; we can pass it the result of .state_dict(). Similarly, the function torch.load reads it back from disk.
It is possible to use the pickle module directly on torch objects (opening files in binary mode). It is also possible to serialize the whole Module object instead of its state, but there are some caveats.
The state returned by .state_dict() shares its underlying memory with the Module's Parameters. This means that if you keep the states in Python, they will keep changing as you optimize your model. To avoid that, you can save them to disk, or make a deep copy with deepcopy from the copy Python module.
lin = nn.Linear(3, 3)
state = lin.state_dict()
state
# Our state and `Module` parameters are linked.
state["weight"][:] = 0
lin.weight
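A short sketch of taking a frozen snapshot with `deepcopy`, so that later updates to the model do not change it:

```python
import copy

import torch
import torch.nn as nn

lin = nn.Linear(3, 3)
# Deep copy: the snapshot no longer shares memory with the model.
snapshot = copy.deepcopy(lin.state_dict())
lin.weight.data[:] = 0       # simulate further training updates
frozen = snapshot["weight"]  # unaffected by the update above
```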
We use it the following way:
lin = nn.Linear(3, 3)
# Save state dict.
torch.save(lin.state_dict(), "/tmp/my_model_state")
# Modify model.
lin.weight.data[:] = 0
lin.weight
# Reload the parameters.
lin.load_state_dict(torch.load("/tmp/my_model_state"))
# Parameters recovered:
lin.weight
Because a torch.nn.Module holds Parameters (which are Tensors), it is important to move the model to the device on which we want to do our computations.
Use model.to(device) to move an nn.Module to the given device. Compared to Tensor.to(device), this method moves the model in place and returns itself.
If you are doing deep learning, chances are you are doing computer vision or natural language processing. In each case you will need to master the classical convolutional and recurrent layers, respectively.
These layers, and others, work the same as what we have presented, but assume more structure on their input data. Convolutional layers need the data to be structured with width, height, and a number of channels (e.g. RGB for input images). Recurrent layers need a time dimension (careful: by default this is the first one, even before the batch dimension).
CNNs, RNNs, and other types of layers that you can build use weight sharing, meaning that some weights are mathematically the same in multiple operations. For instance, in an RNN, the weights are reused at every time step.
To do this in PyTorch, just do what you would intuitively do: reuse the same Python object (a Tensor or an nn.Module). In the forward pass, the same values are used, while in the backward pass, the gradients from the different children in the compute graph are summed, as they should be according to the chain rule.
A simple example: a neural network that uses the same weights for 5 hidden layers:
class WeightSharingNetwork(nn.Module):
    def __init__(self):
        super().__init__()
        self.lin1 = nn.Linear(4, 20)
        self.lin2 = nn.Linear(20, 20)
        self.lin3 = nn.Linear(20, 1)

    def forward(self, X):
        # First layer 4 -> 20
        h = F.relu(self.lin1(X))
        # lin2 weights reused five times 20 -> 20
        for _ in range(5):
            h = F.relu(self.lin2(h))
        out = self.lin3(h)
        return out

weight_sharing_network = WeightSharingNetwork()
Everything works as usual:
# Forward OK
out = weight_sharing_network(X)
# Backward OK
out.sum().backward()
Now that we have an API for neural networks and automatic gradient computation for the parameters, gradient descent is going to be as easy as $\theta \leftarrow \theta - \alpha \nabla_\theta \mathcal{L}$.
PyTorch provides different optimizers that inherit from the torch.optim.Optimizer base class. The simplest one is optim.SGD, which does exactly the update mentioned previously. optim.Adam (ref here) is a nice go-to optimizer.
To instantiate an optimizer, we need to give it the parameters to optimize over, as well as optimization parameters. Thankfully, we've seen how to get all the parameters of our nn.Module.
To create an optimizer, choose an algorithm from torch.optim and pass it the model parameters (torch.nn.Module.parameters), along with other hyperparameters (learning rate...).
More information on optimizers is available in the documentation.
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
# lr is learning rate, a very important hyperparameter
optimizer = optim.Adam(model.parameters(), lr=1e-3)
Note that there is a weight_decay ($\ell_2$ norm regularization) parameter in optimizers. This is a shorthand: this way, one does not need to add a regularization term over all the parameters to the loss function. Be careful: usually one does not regularize the biases of the model.
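One way to apply weight_decay while leaving the biases unregularized is to use parameter groups. This is only a sketch; the model and hyperparameter values are arbitrary:

```python
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(4, 2)  # arbitrary model for the example

# Split the parameters so that biases get no weight decay.
decay, no_decay = [], []
for name, param in model.named_parameters():
    (no_decay if name.endswith("bias") else decay).append(param)

# Each parameter group can override the default optimizer settings.
optimizer = optim.Adam(
    [
        {"params": decay, "weight_decay": 1e-4},
        {"params": no_decay, "weight_decay": 0.0},
    ],
    lr=1e-3,
)
```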
To write an optimization step we use two functions from torch.optim.Optimizer
.
We need optimizer.zero_grad() to reset the .grad attribute of our parameters to zero (otherwise gradients would be summed; this accumulation is a feature required by backpropagation). Then compute new gradients, and use optimizer.step() to perform a gradient update.
# This is going to make sure all `.grad` attribute
# in the tensor parameters of our network are reset
# to zero.
# Otherwise, as we keep computing gradients, they
# are summed in this `.grad` attribute.
optimizer.zero_grad()
# Forward pass
y_pred = model(x)
# Compute the loss between the predictions and the true labels.
# Here we use the mean square error for regression.
# Note: y_pred has shape `batch_size x 1`, while y has only
# shape `batch_size`. We use `squeeze` to remove the 1-dimension.
# We could also have added `squeeze` at the end of our `.forward`
# method.
loss = F.mse_loss(y_pred.squeeze(1), y)
# Compute the gradients of the parameters in the compute
# graph
loss.backward()
# The optimizer applies a descent step to the parameters.
optimizer.step()
Note: optimizers always minimize; if you want to maximize, you can take the negative of your objective as the loss.
Overall, once we've created our model (nn.Module
), our Dataset
, our DataLoader
, and our Optimizer
the training loop looks like this:
for epoch in range(6): # epoch loop
print(f"Epoch: {epoch}")
for batch in data_loader: # mini batches loop
model.train() # make sure the model is in training mode
x, y = batch
optimizer.zero_grad()
y_pred = model(x)
loss = F.mse_loss(y_pred.squeeze(1), y)
loss.backward()
optimizer.step()
print("\t", f"Loss: {loss.item()}")
This loop is not supposed to converge: the dataset is tiny, every design decision was arbitrary, etc.
This double loop is where the training happens. It can last for days on big models/datasets. The training needs to be babysat: during this loop, it is important to monitor the performance on the training and validation sets, save the parameters to restart in case the program gets killed, save the best set of parameters found so far, etc.
Monitoring the training loop is tedious. We'll present some directions in the last section but won't have time to go into the details.
Optimizers also have an internal state (diminishing learning rate, momentums...). State and serialization work exactly the same as for Modules.
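For instance, a sketch with an arbitrary model, saving to a temporary path:

```python
import torch
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(3, 3)
optimizer = optim.Adam(model.parameters(), lr=1e-3)

# One update so the optimizer has internal state (Adam moment buffers).
model(torch.randn(2, 3)).sum().backward()
optimizer.step()

# Save the optimizer state, exactly like a Module's state_dict.
torch.save(optimizer.state_dict(), "/tmp/my_optimizer_state")

# Restore it into a freshly created optimizer (e.g. when resuming training).
new_optimizer = optim.Adam(model.parameters(), lr=1e-3)
new_optimizer.load_state_dict(torch.load("/tmp/my_optimizer_state"))
```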
- Create a device object once and use Tensor.to(device), torch.rand(..., device=device), etc.
- Do not use the torch.cuda.FloatTensor types; prefer Tensor.float(), Tensor.to(dtype), torch.arange(..., dtype=dtype), etc., where dtype is something like torch.float. PyTorch accepts the device and dtype arguments in most functions, and their default is None: it will keep the current device/type, or use the default one when you construct a tensor.
- Use the Dataset, DataLoader, and nn.Module abstractions, even if your code is simple. It will be easier to grow your code, and it increases readability. Do not make these classes do more than they should; keep simple, modular, easy-to-understand abstractions.
- Set PYTHONPATH="." to make your package visible, or better, install your package in your virtual environment (pip install -e .) and run your scripts with python scripts/this_experiment.py.
- Use argparse to manage all your hyperparameters.
- Save your hyperparameters to a file named with a unique experiment id (e.g. generated with uuid). It literally takes one line of Python to read them back into a dictionary. Use the same id to save your results/models. Keep the command-line arguments for things related to the execution settings, such as CPU/GPU, number of GPUs, visdom port...
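Saving hyperparameters under a unique id could look like this (the file location and hyperparameter names are made up for the example):

```python
import json
import os
import tempfile
import uuid

# Generate a unique id for this experiment run.
run_id = uuid.uuid4().hex
hparams = {"lr": 1e-3, "batch_size": 64, "n_epoch": 5}

# Save the hyperparameters next to where results/models will go.
path = os.path.join(tempfile.gettempdir(), f"{run_id}.json")
with open(path, "w") as f:
    json.dump(hparams, f)

# Reading them back into a dictionary literally takes one line.
loaded = json.load(open(path))
```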
You wouldn't change fonts in the middle of a paper; likewise, mind the quality of your code.
We're going to train a neural network for digit recognition on the dataset called MNIST. This dataset is small (70000 28x28 images) and considered solved. Being able to fit MNIST is required but not sufficient to claim improvement in image recognition.
- Use .view(-1, 28*28) to reshape the images into vectors.
- Move the model and the data to the device (.to(device)). Don't move the whole dataset to the GPU, only your current batch.
- For evaluation, call .eval() on your model and wrap the computation in with torch.no_grad().
- Save and reload the model parameters with its state_dict.
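The evaluation pattern from the hints above, sketched with a stand-in model:

```python
import torch
import torch.nn as nn

model = nn.Linear(28 * 28, 10)   # stand-in for your real network
batch = torch.randn(8, 28 * 28)  # one batch, not the whole dataset

model.eval()           # switch layers like dropout to evaluation mode
with torch.no_grad():  # no compute graph is recorded, saving memory
    logits = model(batch)
```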
Numerical stability: In theory, we need to use a softmax on the last layer to get a probability distribution over classes, then use the cross entropy loss function. In practice this is numerically unstable, so the cross entropy function takes the linear activations directly. Another possibility is to use the log-softmax on the last layer and then use the nll-loss as the loss function.
Note that these functions come in both the nn.Module API and the torch.nn.functional API.
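The equivalence between the two options can be checked directly (random logits and made-up targets, for illustration only):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 10)           # unnormalized scores for a batch of 4
targets = torch.tensor([3, 1, 4, 1])  # true class indices

# cross_entropy takes the raw logits directly (numerically stable).
loss_a = F.cross_entropy(logits, targets)
# Same computation, decomposed: log-softmax then negative log-likelihood.
loss_b = F.nll_loss(F.log_softmax(logits, dim=1), targets)
```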
torchvision
is a package with some utilities for image recognition tasks. It implements the Dataset
class for MNIST. This class is quite nice: it loads the dataset from a given disk location and does everything necessary to serve us the Tensor
s. It will also download the dataset to the disk location if it is not found :)
import random
from typing import Dict, Mapping, Iterable
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torchvision import datasets, transforms
from torch.utils.data import Dataset, DataLoader, Subset
from torch.utils.tensorboard import SummaryWriter
# Set random seed for reproducibility
random.seed(0)
torch.manual_seed(0)
torch.cuda.manual_seed_all(0)
# This is all the data available for training according to MNIST.
# Because we will use a validation dataset, this data is both the
# training and validation data.
train_valid_dataset = datasets.MNIST(
'data', # Path where we're storing the dataset
train=True,
download=True,
transform=transforms.Compose([
transforms.ToTensor(),
transforms.Normalize((0.1307,), (0.3081,))
])
)
# We split the dataset in validation and training data.
# List of all training and validation indices shuffled deterministically
train_valid_indices = list(range(len(train_valid_dataset)))
random.Random(42).shuffle(train_valid_indices)
# Number of element in the validation set
n_valid = 15000
# Indices in the list from n_valid to the end are for training.
train_dataset = Subset(train_valid_dataset, train_valid_indices[n_valid:])
# Indices in the list from the beginning to n_valid are for validation.
valid_dataset = Subset(train_valid_dataset, train_valid_indices[:n_valid])
# The test dataset does not need to be split.
test_dataset = datasets.MNIST(
'data',
train=False,
download=True,
transform=transforms.Compose([
transforms.ToTensor(),
transforms.Normalize((0.1307,), (0.3081,))
])
)
The transforms
are just a way to do dynamic transformations on the data before passing it to the neural network. The first one says we want the image as a Tensor
(otherwise it is a plain image, not usable by PyTorch but convenient for visualizations); the second rescales the pixel values so that they end up in a nice range (in ML we always need to rescale the inputs, usually to the $[-1, 1]$ or $[0, 1]$ range).
Validation: You should not measure anything on the test set until you've finished all training (not even picking the best model on the test set), otherwise it is p-hacking.
We'll create a validation set that you can use to measure performance during training and to try out different hyperparameters. This makes use of the Subset class with randomly shuffled indices.
You can look at the PyTorch example of MNIST for inspiration. Note that this example does not use a validation set, and uses convolution, not presented here.
You can also use the code structure below. Feel free to adapt it to your thinking; this is the way that felt natural to me, and it may not be the best for you.
If you do use it, I recommend starting with train_model
, as it is the main function being run. It will help you figure out what goes in the other functions.
class ConvNeurNet(nn.Module):
"""Convolutional Neural Network for MNIST classification."""
def __init__(self) -> None:
"""Initialize the layers."""
...
def forward(self, inputs: torch.Tensor) -> torch.Tensor:
"""Forward pass of the neural network.
Parameters
----------
inputs:
A tensor with shape [N, 28, 28] representing a set of images, where N is the
number of examples (i.e. images), and 28*28 is the size of the images.
Returns
-------
logits:
Unnormalized output of the last layer of the neural network (no softmax applied), with shape
[N, 10], where N is the number of input examples and 10 is the number of
categories to classify (10 digits).
"""
...
def train_model(
model: nn.Module,
train_loader: DataLoader,
valid_loader: DataLoader,
optimizer: optim.Optimizer,
n_epoch: int,
device: torch.device,
) -> None:
"""Train a model.
This is the main training function that starts from an untrained model and
fully trains it.
Parameters
---------
model:
The neural network to train.
train_loader:
The dataloader with the examples to train on.
valid_loader:
The dataloader with the examples used for validation.
optimizer:
The optimizer initialized with the model parameters.
n_epoch:
The number of epochs (iterations over the complete training set) to train for.
device:
The device (CPU/GPU) on which to train the model.
"""
# For using Tensorboard
writer = SummaryWriter(flush_secs=5)
writer.add_graph(model, next(iter(train_loader))[0])
...
def update_model(
model: nn.Module,
inputs: torch.Tensor,
targets: torch.Tensor,
optimizer: optim.Optimizer,
) -> torch.Tensor:
"""Do a gradient descent iteration on the model.
Parameters
---------
model:
The neural network being trained.
inputs:
The inputs to the model. A tensor with shape [N, 28, 28], where N is the
number of examples.
targets:
The true category for each example, with shape [N].
optimizer:
The optimizer that applies the gradient update, initialized with the model
parameters.
Returns
-------
logits:
Unnormalized output of the last layer of the neural network (no softmax applied), with shape
[N, 10]. Detached from the computation graph.
"""
...
def accuracy_from_logits(logits: torch.Tensor, targets: torch.Tensor) -> float:
"""Compute the accuracy of a minibatch.
Parameters
---------
logits:
Unnormalized output of the last layer of the neural network (no softmax applied), with shape
[N, 10], where N is the number of examples.
targets:
The true category for each example, with shape [N].
Returns
-------
accuracy:
As a percentage so between 0 and 100.
"""
...
def evaluate_on_batch(logits: torch.Tensor, targets: torch.Tensor) -> Dict[str, float]:
"""Compute a number of metrics on the minibatch.
Parameters
---------
logits:
Unnormalized output of the last layer of the neural network (no softmax applied), with shape
[N, 10], where N is the number of examples.
targets:
The true category for each example, with shape [N].
Returns
-------
metrics:
A dictionary mapping metric name to value.
"""
...
def evaluate_model(
model: nn.Module, loader: DataLoader, device: torch.device
) -> Dict[str, float]:
"""Compute some metrics over a dataset.
Parameters
---------
model:
The neural network to evaluate.
loader:
A dataloader over a dataset. The method can be used with the validation
dataloader (during training, for instance), or the test dataloader (after training,
to compute the final performance).
device:
The device (CPU/GPU) on which to run the evaluation.
Returns
-------
metrics:
A dictionary mapping metric name to value.
"""
...
def log_metrics(
writer: SummaryWriter, metrics: Mapping[str, float], step: int, suffix: str
) -> None:
"""Log metrics in Tensorboard.
Parameters
---------
writer:
The summary writer used to log the values.
metrics:
A dictionary mapping metric name to value.
step:
The value on the abscissa axis for the metric curves.
suffix:
A string to append to the name of the metric to group them in Tensorboard.
For instance "Train" on training data, and "Valid" on validation data.
"""
for name, value in metrics.items():
writer.add_scalar(f"{name}/{suffix}", value, step)
# Device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
is_device_gpu = (device.type == "cuda")
def make_dataloader(dataset: Dataset, batch_size: int) -> DataLoader:
"""Factory function to create dataloader for different datasets."""
return DataLoader(
dataset,
batch_size=batch_size,
num_workers=1,
pin_memory=is_device_gpu
)
# This batch size influence optimization. It needs to be tuned.
train_loader = make_dataloader(train_dataset, batch_size=64)
# This batch size is just for evaluation; we want it as big as the GPU can support.
valid_loader = make_dataloader(valid_dataset, batch_size=512)
test_loader = make_dataloader(test_dataset, batch_size=512)
model = ConvNeurNet()
optimizer = optim.Adam(model.parameters(), lr=1e-4)
train_model(
model=model,
train_loader=train_loader,
valid_loader=valid_loader,
optimizer=optimizer,
n_epoch=5,
device=device
)
This cell will launch Tensorboard in the notebook to visualize results. You will need to properly use the log_metrics
function to see something.
%load_ext tensorboard
%tensorboard --logdir=runs
In this section, I'll just give pointers to other features that exists in or around PyTorch. We won't go much in detail but it is useful to know they exist.
PyTorch Hub lets you easily download the most popular (pretrained) models (see doc).
Alternatively, searching the name of the model, along with PyTorch
, in a search engine should give plenty of implementations ready to use and adapt.
Even if you implement a model from scratch, it is good practice to look for existing implementations, as you might learn something about performance, numerical stability, etc. by reading the code.
PyTorch has limited support for sparse Tensor
s (including on GPU). Few operations are implemented, and it is usually not possible to take the gradient with respect to a sparse Tensor
. This is because the gradient of a sparse tensor has no reason to be sparse.
More information in the documentation
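A small sketch of the sparse COO format (the coordinates and values are chosen arbitrarily for the example):

```python
import torch

# A sparse 3x3 matrix in COO format: coordinates plus values.
indices = torch.tensor([[0, 1, 2],   # row coordinates
                        [2, 0, 1]])  # column coordinates
values = torch.tensor([3.0, 4.0, 5.0])
sparse = torch.sparse_coo_tensor(indices, values, size=(3, 3))

# Convert back to a regular dense tensor.
dense = sparse.to_dense()

# Sparse-dense matrix multiplication is one of the supported operations.
result = torch.sparse.mm(sparse, torch.eye(3))
```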
The double for loop (over epochs and batches) can last quite some time (up to days), and it's important to add monitoring, logging, plotting, and checkpointing to it.
Some PyTorch frameworks exist to avoid maintaining an overly large training function, to facilitate code reuse, and to separate the core algorithm from these monitoring operations, while leaving room for customization.
Some general frameworks do that for you. The most popular are Ignite, Lightning, and TorchBearer.
There exist many more task-specific frameworks; have a look at the PyTorch Ecosystem.
Training neural networks can be hard, and one needs to visualize what is happening during training to improve on it. For instance, one can visualize the learning curve (loss function over time / optimization iterations).
Tensorboard is a powerful tool, coming from the TensorFlow ecosystem, to monitor many aspects of neural network training. It is now supported by PyTorch through the torch.utils.tensorboard
module.
Another option is Visdom. It's more flexible so it takes a bit more time to define your curves. A nice thing is that it is not limited to neural networks, and works with numpy and matplotlib, so you can use it with whatever project you have.
Finally, there are some services (with free academic versions) such as tensorboard.dev, comet.ml, and Weights and Biases that host your experiments online, on top of providing visualization.
The most straightforward way to leverage more parallelism in deep neural networks is to use data parallelism (parallelizing across the batch dimension). This can be done across multiple GPUs in PyTorch by wrapping the model in the nn.DataParallel
class (tutorial).
PyTorch also has some implementation for multiple machines, for instance using MPI.
With version 1.0, PyTorch improves on its production setting. There is a new C++ interface to PyTorch, along with TorchScript, a subset of Python that can be compiled and run outside of Python. Both interfaces should be compatible, but they are still experimental.
It is also possible to export (some) models to another framework using ONNX.
Visit the PyTorch documentation and PyTorch tutorials for more features, such as trading memory for compute, and named tensors.