
Tutorial 1: Depth vs width#

Week 2, Day 1: Macrocircuits

By Neuromatch Academy

Content creators: Gabriel Mel de Fontenay

Content reviewers: Surya Ganguli, Xaq Pitkow, Hlib Solodzhuk, Aakash Agrawal, Alish Dipani, Hossein Rezaei, Yousef Ghanbari, Mostafa Abdollahi, Patrick Mineault

Production editors: Konstantine Tsafatinos, Ella Batty, Spiros Chavlis, Samuele Bolotta, Hlib Solodzhuk


Tutorial Objectives#

Estimated timing of tutorial: 1 hour

In this tutorial we will take a closer look at the expressivity of neural networks by observing the following:

  • The universal approximator theorem guarantees that we can approximate any complex function using a network with a single hidden layer. The catch is that the approximating network might need to be extremely wide.

  • We will explore this issue by constructing a complex function and attempting to fit it with shallow networks of varying widths.

  • To create this complex function, we’ll build a random deep neural network. This is an example of the student-teacher setting, where we attempt to fit a known teacher function (the deep network) using a student model (the shallow/wide network).

  • We will find that the deep teacher network can be either very easy or very hard to approximate and that the difficulty level is related to a form of chaos in the network activities.

  • Each layer of a neural network can effectively expand and fold the input it receives from the previous layer. This repeated expansion and folding grants deep neural networks high expressivity - i.e., it allows them to implement a large number of different functions.

Let’s get started!


Setup#

Install and import feedback gadget#

# @title Install and import feedback gadget

!pip install vibecheck datatops --quiet

from vibecheck import DatatopsContentReviewContainer
def content_review(notebook_section: str):
    return DatatopsContentReviewContainer(
        "",  # No text prompt
        notebook_section,
        {
            "url": "https://pmyvdlilci.execute-api.us-east-1.amazonaws.com/klab",
            "name": "neuromatch_neuroai",
            "user_key": "wb2cxze8",
        },
    ).render()


feedback_prefix = "W2D1_T1"

Imports#

# @title Imports

#working with data
import numpy as np

#plotting
import matplotlib.pyplot as plt
import logging
import matplotlib.patheffects as path_effects

#interactive display
import ipywidgets as widgets
from tqdm.notebook import tqdm as tqdm

#modeling
import torch
import torch.nn as nn

Figure settings#

# @title Figure settings

logging.getLogger('matplotlib.font_manager').disabled = True

%matplotlib inline
%config InlineBackend.figure_format = 'retina' # perform high-definition rendering for images and plots
plt.style.use("https://raw.githubusercontent.com/NeuromatchAcademy/course-content/main/nma.mplstyle")

Plotting functions#

# @title Plotting functions

def plot_loss(Es):
    """
    Plot loss progression over time.

    Inputs:
    - Es (np.ndarray): sequence of loss values during training.
    """
    with plt.xkcd():
        plt.semilogy(Es)
        plt.xlabel('Epochs')
        plt.ylabel('Error')
    plt.title("Loss")
    plt.show()

def plot_loss_as_function_of_width(Ws_student, Es_test, Es_train):
    """
    Plot final loss of training as the function of the width of the network.
    """
    with plt.xkcd():
        plt.loglog(Ws_student, Es_test, '.-')
        plt.loglog(Ws_student, Es_train[:,-1], '.-')
        plt.legend(['Test', 'Train'])
        plt.xlabel('Width')
        plt.ylabel('Error')
    plt.title("Loss")
    plt.show()

def plot_students_predictions_vs_teacher_values(Es_train, X_test, y_test):
    """
    Plot loss progression over time and the student's predictions after training versus the true values generated by the teacher.

    Inputs:
    - Es_train (np.ndarray): loss values.
    - X_test (np.ndarray): test input data.
    - y_test (np.ndarray): test output data.
    """
    with plt.xkcd():
        fig, axes = plt.subplots(1,2,figsize=(10,5))
        plt.locator_params(nbins=3)

        axes[0].semilogy(Es_train/float(y_test.var()))
        axes[0].set_xlabel('Epochs')
        axes[0].set_ylabel('Error')

        axes[1].scatter(y_test.detach(), student(X_test).detach())  # note: uses the globally defined `student`
        axes[1].set_xlabel('Teacher')
        axes[1].set_ylabel('Student')

        axes[1].tick_params(axis='y', labelrotation=90)
        axes[1].set_yticks([-0.01,0,0.01])
        axes[1].set_xticks([-0.01,0,0.01])

def expressivity_visualization(layer, projected_traj_1, projected_traj_2, colors):
    """
    Plot projected trajectories for points in the given layer for two different networks.

    Inputs:
    - layer (int): layer of networks to visualize.
    - projected_traj_1 (np.ndarray): standard network projections.
    - projected_traj_2 (np.ndarray): quasilinear network projections.
    - colors (np.ndarray): colors to use in plotting.
    """

    with plt.xkcd():

        fig = plt.figure()
        fig.suptitle(f'Layer {layer}', fontsize=16)

        #standard net
        ax1 = fig.add_subplot(121, projection='3d')
        specific_layer_1 = projected_traj_1[layer]

        for i in range(len(specific_layer_1) - 1):
            ax1.plot([specific_layer_1[i, 0], specific_layer_1[i + 1, 0]], [specific_layer_1[i, 1], specific_layer_1[i + 1, 1]], [specific_layer_1[i, 2], specific_layer_1[i + 1, 2]], color=colors[i])

        for line in ax1.get_lines():
            line.set_path_effects([path_effects.Normal()])

        ax1.set_title('Standard Net')
        ax1.set_xlabel('X')
        ax1.set_ylabel('Y')
        ax1.set_zlabel('Z')

        ax2 = fig.add_subplot(122, projection='3d')
        specific_layer_2 = projected_traj_2[layer]

        for i in range(len(specific_layer_2) - 1):
            ax2.plot([specific_layer_2[i, 0], specific_layer_2[i + 1, 0]], [specific_layer_2[i, 1], specific_layer_2[i + 1, 1]], [specific_layer_2[i, 2], specific_layer_2[i + 1, 2]], color=colors[i])

        for line in ax2.get_lines():
            line.set_path_effects([path_effects.Normal()])

        ax2.set_title('Quasi-Linear Net')
        ax2.set_xlabel('X')
        ax2.set_ylabel('Y')
        ax2.set_zlabel('Z')

        plt.tight_layout()
        plt.show()

Helper functions#

# @title Helper functions

def generate_trajectories(W, D, P, sigma_1, sigma_2):
    """
    Generate trajectories for evenly spaced points from unit circle through networks and project them to 3D space.

    Inputs:
    - W (int): width of each layer.
    - D (int): depth of the network.
    - P (int): number of points from unit circle.
    - sigma_1 (float): standard net standard deviation.
    - sigma_2 (float): quasi-linear net standard deviation.
    """
    #initialize nets
    standard_net = make_MLP(2, W, D)
    initialize_layers(standard_net, sigma_1)

    quasilinear_net = make_MLP(2, W, D)
    initialize_layers(quasilinear_net, sigma_2)

    #sample points from unit circle
    theta = np.linspace(0, 2 * np.pi, P)
    points = np.array([np.cos(theta), np.sin(theta)]).T

    #generate trajectories for first net
    parameters = [param for param in standard_net.parameters()]

    for index in range(len(parameters)):
        if not index:
            traj_1 = [np.tanh(points @ parameters[index].detach().numpy().T)]
        else:
            traj_1.append(np.tanh(traj_1[-1] @ parameters[index].detach().numpy().T))

    #generate trajectories for second net
    parameters = [param for param in quasilinear_net.parameters()]

    for index in range(0, len(parameters)):
        if not index:
            traj_2 = [np.tanh(points @ parameters[index].detach().numpy().T)]
        else:
            traj_2.append(np.tanh(traj_2[-1] @ parameters[index].detach().numpy().T))

    return np.array(traj_1[:-1]), np.array(traj_2[:-1])

Set random seed#

# @title Set random seed

import random
import numpy as np

def set_seed(seed=None, seed_torch=True):
  if seed is None:
    seed = np.random.choice(2 ** 32)
  random.seed(seed)
  np.random.seed(seed)
  if seed_torch:
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.cuda.manual_seed(seed)
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.deterministic = True

set_seed(seed = 42)

Section 1: Introduction#

In this section we will create functions to capture the snippets of code that we will use repeatedly in what follows.

Video 1: Introduction#

The universal approximator theorem (UAT) guarantees that we can approximate any function arbitrarily well using a shallow network - i.e., a network with a single hidden layer (figure below, left). So why do we need depth? The “catch” in the UAT is that approximating a complex function with a shallow network can require a very large number of hidden units - i.e., the network must be very wide. The inability of shallow networks to efficiently implement certain functions suggests that network depth may be one of the brain’s computational “secret sauces”.

Shallow vs deep networks expressivity

To illustrate this fact, we’ll create a complex function and then attempt to fit it with single-hidden-layer neural networks of different widths. What we’ll find is that although the UAT guarantees that sufficiently wide networks can approximate our function, the performance will actually not be very good for our shallow nets of modest width.

One easy way to create a complex function is to build a random deep neural network (figure above, right). We then have a teacher network which generates the ground truth outputs, and a student network whose goal is to learn the mapping implemented by the teacher. This approach - known as the student-teacher setting - is useful for both computational and mathematical study of neural networks since it gives us complete control of the data generation process. Unlike with real-world data, we know the exact distribution of inputs and correct outputs.

Finally, we will show that depending on the distribution of the weights, a random deep neural network can be either very difficult or very easy to approximate with a shallow network. The “complexity” of the function computed by a random deep network thus depends crucially on the weight distribution. One can actually understand the boundary between hard and easy cases as a kind of boundary between chaos and non-chaos in a certain dynamical system. We will confirm that on the non-chaotic side, a random deep neural network can be effectively approximated by a shallow net. This demonstration will be based on ideas from the paper:

Exponential expressivity in deep neural networks through transient chaos, Poole et al., NeurIPS (2016).

Submit your feedback#

# @title Submit your feedback
content_review(f"{feedback_prefix}_introduction")

Video 2: Setup#

Submit your feedback#

# @title Submit your feedback
content_review(f"{feedback_prefix}_setup")

Coding Exercise 1: Create an MLP#

The code below implements a function that takes in an input dimension, a layer width, and a number of layers, and creates a simple MLP in PyTorch. After every linear layer except the output, we insert a hyperbolic tangent nonlinearity (nn.Tanh()).

Convention: Because we will count the input as a layer, a depth of 2 will mean a network with just one hidden layer, followed by the output neuron. A depth of 3 will mean 2 hidden layers, and so on.

Network Implementation#

# @title Network Implementation

def make_MLP(n_in, W, D, nonlin = 'tanh'):
    """
    Create an `nn.Sequential()` fully-connected model in PyTorch with the given parameters.

    Inputs:
    - n_in (int): input dimension.
    - W (int): width of the network.
    - D (int): depth of the network.
    - nonlin (str, default = "tanh"): activation function to use.

    Outputs:
    - net (nn.Sequential): network.
    """

    #activation function
    if nonlin == 'tanh':
        nonlin = nn.Tanh()
    elif nonlin == 'relu':
        nonlin = nn.ReLU()
    else:
        raise ValueError(f"Unsupported nonlinearity: {nonlin}")

    # Assemble D-1 hidden layers and one output layer

    #input layer
    layers = [nn.Linear(n_in, W, bias = False), nonlin]
    for i in range(D - 2):
        #linear layer
        layers.append(nn.Linear(W, W, bias = False))
        #activation function
        layers.append(nonlin)
    #output layer
    layers.append(nn.Linear(W, 1, bias = False))

    return nn.Sequential(*layers)

net = make_MLP(n_in = 10, W = 3, D = 2)
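
As a quick sanity check of the depth convention (a minimal sketch, not part of the exercises), we can print the model we just created: with \(D = 2\) there is a single hidden layer of width 3, followed by the 1-unit output layer.

print(net)
# Sequential(
#   (0): Linear(in_features=10, out_features=3, bias=False)
#   (1): Tanh()
#   (2): Linear(in_features=3, out_features=1, bias=False)
# )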

Now, we implement an auxiliary function which calculates the number of parameters in the MLP.

def get_num_params(n_in,W,D):
    """
    Simple function to compute number of learned parameters in an MLP with given dimensions.

    Inputs:
    - n_in (int): input dimension.
    - W (int): width of the network.
    - D (int): depth of the network.

    Outputs:
    - num_params (int): number of parameters in the network.
    """
    ###################################################################
    ## Fill out the following then remove
    raise NotImplementedError("Student exercise: complete function which calculates the number of parameters in the defined architecture of MLP.")
    ###################################################################

    input_params = ... * ...
    hidden_layers_params = (...) * ...**2
    output_params = ...
    return input_params + hidden_layers_params + output_params

np.testing.assert_allclose(get_num_params(10, 3, 2), 33, err_msg = "Expected value of parameters number is different!")

Click for solution
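
As a cross-check of the closed-form count (a minimal sketch, assuming `net` was built with `make_MLP(n_in = 10, W = 3, D = 2)` as above), PyTorch can also count the learned parameters directly:

# count all learnable parameters of the MLP directly
n_params = sum(p.numel() for p in net.parameters())
print(n_params)  # 33, matching get_num_params(10, 3, 2)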

Submit your feedback#

# @title Submit your feedback
content_review(f"{feedback_prefix}_create_mlp")

Coding Exercise 2: Initialize model weights#

Write a function that, given a model and a \(\sigma\), initializes all weights in the model according to a normal distribution with mean \(0\) and standard deviation

\[\frac{\sigma}{\sqrt{n_{in}}},\]

where \(n_{in}\) is the number of inputs to the layer.
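
This \(1/\sqrt{n_{in}}\) scaling keeps the typical size of each unit's pre-activation independent of the layer width: a sum of \(n_{in}\) inputs, each multiplied by an independent weight of variance \(\sigma^2/n_{in}\), has variance

\[\mathrm{Var}\left(\sum_{j=1}^{n_{in}} w_j x_j\right) = \frac{\sigma^2}{n_{in}} \sum_{j=1}^{n_{in}} x_j^2 \approx \sigma^2 \quad \text{for inputs of typical size } |x_j| \approx 1.\]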

set_seed(42)

def initialize_layers(net,sigma):
    """
    Initialize each weight parameter of the model from a normal distribution with mean 0 and standard deviation sigma/sqrt(n_in), where n_in is the number of inputs to the layer.

    Inputs:
    - net (nn.Sequential): network.
    - sigma (float): standard deviation.
    """
    ###################################################################
    ## Fill out the following then remove
    raise NotImplementedError("Student exercise: set initial values to the weights of MLP.")
    ###################################################################
    for param in ...:
        n_in = param.shape[1]
        nn.init.normal_(param, std = ...)

initialize_layers(net, 1)
np.testing.assert_allclose(next(net.parameters())[0][0].item(), 0.609, err_msg = "Expected value of parameter is different!", atol = 1e-3)

Click for solution

Submit your feedback#

# @title Submit your feedback
content_review(f"{feedback_prefix}_initialize_model_weights")

Coding Exercise 3: Generate a dataset#

Given a network, generate the input data by sampling from a multivariate Gaussian distribution and output data by passing the inputs through the network. Don’t forget to .detach() the outputs - otherwise, gradients will be computed for these (with respect to the teacher weights, which we don’t want).
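
To see what `.detach()` does, here is a tiny illustration with a throwaway linear layer (a sketch, not the exercise solution):

toy = nn.Linear(3, 1, bias=False)      # stand-in "teacher"
x = torch.randn(4, 3)
print(toy(x).requires_grad)            # True: the output is still attached to the computation graph
print(toy(x).detach().requires_grad)   # False: detached outputs are treated as plain data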

set_seed(42)

def make_data(net, n_in, n_examples):
    """
    Generate input data by sampling from a multivariate Gaussian distribution, and output data by passing the inputs through the network.

    Inputs:
    - net (nn.Sequential): network.
    - n_in (int): input dimension.
    - n_examples (int): number of data examples to generate.

    Outputs:
    - X (torch.tensor): input data.
    - y (torch.tensor): output data.
    """
    ###################################################################
    ## Fill out the following then remove
    raise NotImplementedError("Student exercise: complete data generation.")
    ###################################################################
    X = torch.randn(..., ...)
    y = net(...).detach()
    return X, ...

X, y = make_data(net, 10, 10000000)
np.testing.assert_allclose(X[0][0].item(), 1.927, err_msg = "Expected value of data is different!", atol = 1e-3)

Click for solution

Submit your feedback#

# @title Submit your feedback
content_review(f"{feedback_prefix}_generate_dataset")

Coding Exercise 4: Train model and compute loss#

In this coding exercise, write a function that trains a given net on a given dataset. The function's parameters are the network, the training inputs and outputs, the number of epochs, and the learning rate. Use the mean squared error (MSE) as the loss function.
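
As a reminder, the mean squared error between predictions \(\hat{y}_i\) and targets \(y_i\) over \(N\) examples is

\[\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N}\left(\hat{y}_i - y_i\right)^2.\]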

set_seed(42)

def train_model(net, X, y, n_epochs, lr, progressbar=True):
    """
    Perform training of the network.

    Inputs:
    - net (nn.Sequential): network.
    - X (torch.tensor): input data.
    - y (torch.tensor): output data.
    - n_epochs (int): number of epochs to train the model for.
    - lr (float): learning rate for optimizer (we will use `Adam` by default).
    - progressbar (bool, default = True): whether to use additional bar for displaying training progress.

    Outputs:
    - Es (np.ndarray): array which contains loss for each epoch.
    """
    ###################################################################
    ## Fill out the following then remove
    raise NotImplementedError("Student exercise: complete training of the network.")
    ###################################################################

    # Set up optimizer
    loss_fn = ...
    optimizer = torch.optim.Adam(..., lr = ...)

    # Run training loop
    Es = np.zeros(...)
    for n in (tqdm(range(n_epochs)) if progressbar else range(n_epochs)):
        y_pred = net(...)
        loss = loss_fn(..., y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        Es[n] = float(...)

    return Es

Es = train_model(net, X, y, 10, 1e-3)
np.testing.assert_allclose(Es[0], 0.0, err_msg = "Expected value of loss is different!", atol = 1e-3)

Click for solution

Coding Exercise 4 Discussion#

Why do you think we obtain zero error right away (on the first epoch)?

Click for solution

Now, write a helper function that computes the loss of a net on a dataset. It takes the following parameters: the network and the dataset inputs and outputs.

def compute_loss(net, X, y):
    """
    Calculate loss on given network and data.

    Inputs:
    - net (nn.Sequential): network.
    - X (torch.tensor): input data.
    - y (torch.tensor): output data.

    Outputs:
    - loss (float): computed loss.
    """
    ###################################################################
    ## Fill out the following then remove
    raise NotImplementedError("Student exercise: complete loss calculation.")
    ###################################################################
    loss_fn = ...

    y_pred = ...
    loss = loss_fn(..., ...)
    loss = float(...)
    return loss

loss = compute_loss(net, X, y)
np.testing.assert_allclose(loss, 0.0, err_msg = "Expected value of loss is different!", atol = 1e-3)

Click for solution

Submit your feedback#

# @title Submit your feedback
content_review(f"{feedback_prefix}_train_model_and_compute_loss")

Section 2: Fitting a deep network with a shallow network#

Estimated timing to here from start of tutorial: 20 minutes

We will now use the functions we’ve created to experiment with deep network fitting. In particular, we will see to what extent it is possible to fit a deep net using a shallow net. Specifically, we will fix a deep teacher and then fit it with single-hidden-layer nets of varying width. In principle, if the number of hidden units is large enough, the error should be low. Let’s see!

Video 3: Deep network fit with a shallow network#

Submit your feedback#

# @title Submit your feedback
content_review(f"{feedback_prefix}_deep_network_fit_with_a_shallow_network")

Coding Exercise 5: Create learning problem#

Create a “deep” teacher network that accepts inputs of size 5. Give the network a width of 5 and a depth of 5. Use it to generate a training set with 4000 examples and a test set with 1000 examples. Initialize the teacher weights with \(\sigma = 2.0\).

###################################################################
## Fill out the following then remove
raise NotImplementedError("Student exercise: complete set up.")
###################################################################
torch.manual_seed(-1)

# Create teacher
n_in = ...     # input dimension
W_teacher, D_teacher = ..., ...  # teacher width, depth
sigma_teacher = ...     # teacher weight variance
teacher = make_MLP(..., ..., ...)
initialize_layers(..., ...)

# generate train and test set
N_train, N_test = ..., ...
X_train, y_train = make_data(..., ..., ...)
X_test, y_test = make_data(..., ..., ...)

np.testing.assert_allclose(X_test[0][0].item(), 0.19076240062713623, err_msg = "Expected value of data is different!")

Click for solution

Coding Exercise 5 Discussion#

  1. What is the minimum error achievable by an MLP on the generated problem?

  2. What is the minimum error achievable by a 1-hidden-layer MLP?

Click for solution

Submit your feedback#

# @title Submit your feedback
content_review(f"{feedback_prefix}_create_learning_problem")

Coding Exercise 6: Train net with the same architecture#

Create a student network with the same architecture as the teacher network - that is, the same width and depth. Train it and confirm that a network with the same architecture can indeed achieve low test error. You may need to train for a large number of iterations, and you may need to adjust the learning rate as learning proceeds.

First, let’s confirm that the number of training examples is greater than 3 times the number of parameters, so we have enough data to train the network.

n_in = 5
W_student, D_student = 5, 5
student = make_MLP(n_in, W_student, D_student)

# make sure we have enough data
P = get_num_params(n_in, W_student, D_student)
assert(N_train > 3*P)

Now, let’s train the student and observe the loss on a semi-log plot (the y-axis is logarithmic)! Your task is to complete the missing parts of the code. While the model is training, you can move on to the next coding exercise and come back to observe the results (training will take approximately 5 minutes).
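
Note that the printed test loss below is a relative error: the MSE is divided by the variance of the teacher's test outputs, so a value of \(1\) corresponds to a model that simply predicts the mean,

\[E_{\mathrm{rel}} = \frac{\mathrm{MSE}(\hat{y}, y)}{\mathrm{Var}(y)}.\]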

###################################################################
## Fill out the following then remove
raise NotImplementedError("Student exercise: train student on the generated data from teacher.")
###################################################################
lr = 0.003
Es_deep = []
for i in range(4):
    Es_deep.append(train_model(..., ..., ..., 50000, ...))
    # note that we reduce the learning rate after each round of training
    lr /= 3
Es_deep = np.array(Es_deep)
Es_deep = Es_deep.ravel()

# evaluate test error
loss_deep = compute_loss(..., ..., ...) / float(y_test.var())
print("Loss of deep student: ",loss_deep)
plot_loss(Es_deep)

Click for solution

Example output:

Solution hint

Submit your feedback#

# @title Submit your feedback
content_review(f"{feedback_prefix}_train_net_with_the_same_architecture")

Coding Exercise 7: Train a 2-layer neural net with varying width#

Let us now try to fit the deep teacher network with a shallow student network. Let’s give the student a single hidden layer, and let’s study the error as a function of the student width \(W_s\). For a range of widths between, say, 5 and 200, create a student network, train it on the training set, and compute its test error. Training will take approximately 2 minutes.

Then, plot the training and testing errors as a function of width on a log-log plot. How does the error of the shallow network compare to that of the deep network?

D_student = 2  # student depth
Ws_student = np.array([5, 15, 45, 135]) # widths

lr = 1e-3
n_epochs = 20000
Es_shallow_train = np.zeros((len(Ws_student), n_epochs))
Es_shallow_test = np.zeros(len(Ws_student))

###################################################################
## Fill out the following then remove
raise NotImplementedError("Student exercise: train different students on the already generated data from teacher.")
###################################################################

for index, W_student in enumerate(tqdm(Ws_student)):

    student = make_MLP(..., ..., ...)

    # make sure we have enough data
    P = get_num_params(n_in, W_student, D_student)
    assert(N_train > 3*P)

    # train
    Es_shallow_train[index] = train_model(..., ..., ..., ..., lr, progressbar=False)
    Es_shallow_train[index] /= y_test.var()

    # evaluate test error
    loss = compute_loss(..., ..., ...)/y_test.var()
    Es_shallow_test[index] = ...

plot_loss_as_function_of_width(Ws_student, Es_shallow_test, Es_shallow_train)

Click for solution

Example output:

Solution hint

Submit your feedback#

# @title Submit your feedback
content_review(f"{feedback_prefix}_train_two_layer_net_with_varying_width")

Coding Exercise 8: Network size prediction#

Let’s suppose that the test error will continue to improve with increasing width according to the same trend as in the previous plot - which is probably too optimistic, but it will let us do some back-of-the-envelope calculations. Specifically, let us assume there is a linear relationship

\[ \log E=m \log W+b\]

between the log of the width and the log of the error. Fit this linear model from our experiment and use it to predict the number of hidden units needed to achieve a relative error of, say, \(10^{-6}\).
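
Solving the assumed linear relationship for the width gives the back-of-the-envelope estimate

\[W = \exp\!\left(\frac{\log E_{\mathrm{target}} - b}{m}\right),\]

which is what the code below computes.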

error_target = 1e-6

###################################################################
## Fill out the following then remove
raise NotImplementedError("Student exercise: fit linear model and predict the number of hidden units.")
###################################################################

m,b = np.polyfit(np.log(...), np.log(...), 1)
print('Predicted width: ', np.exp((np.log(...) - ...) / ...))

Click for solution

Based on this, do you think that a reasonably sized shallow network could learn this task with low error?

Submit your feedback#

# @title Submit your feedback
content_review(f"{feedback_prefix}_network_size_prediction")

Section 3: Deep networks in the quasilinear regime#

Estimated timing to here from start of tutorial: 45 minutes

We’ve just shown that certain deep networks are difficult to fit. In this section, we will discuss a regime in which a shallow network is able to approximate a deep teacher relatively well.

Video 4: Deep networks in the quasilinear regime#

Submit your feedback#

# @title Submit your feedback
content_review(f"{feedback_prefix}_deep_networks_in_the_quasilinear_regime")

One of the reasons that shallow nets cannot fit deep nets, in general, is that random deep nets, in certain regimes, behave like chaotic systems: each layer can be thought of as a single step of a dynamical system, and the number of layers plays the role of the number of time steps. A deep network, therefore, effectively subjects its input to long-time chaotic dynamics, which are, almost by definition, very difficult to predict accurately. In particular, shallow nets simply cannot capture the complex mapping implemented by deeper networks without resorting to an astronomical number of hidden units. Another way to interpret this behavior is that the many layers of a deep network repeatedly stretch and fold their inputs, allowing the network to implement a large number of complex functions - an idea known as expressivity (Poole et al. 2016).

However, in other regimes, for example, when the weights of the teacher network are small, the dynamics implemented by the teacher network are no longer chaotic. In fact, for small enough weights, they are nearly linear. In this regime, we’d expect a shallow network to be able to approximate a deep teacher relatively well.

For more on these ideas, see the paper

Exponential expressivity in deep neural networks through transient chaos, Poole et al., NeurIPS (2016).
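
As a minimal numerical sketch of this transition (not part of the tutorial’s exercises; it reuses numpy, imported above, and the same \(\sigma/\sqrt{n_{in}}\) weight scaling), we can track how the distance between two nearby inputs evolves layer by layer in a random tanh network. With \(\sigma = 2\) the two points separate (chaotic regime), while with \(\sigma = 0.4\) they collapse together (quasi-linear regime):

def layerwise_distance(sigma, n=100, depth=20, eps=1e-3, seed=0):
    # distance between the images of two nearby inputs after each tanh layer
    rng = np.random.default_rng(seed)
    h1 = rng.standard_normal(n)
    h2 = h1 + eps * rng.standard_normal(n)   # small perturbation of h1
    dists = []
    for _ in range(depth):
        W_layer = rng.standard_normal((n, n)) * sigma / np.sqrt(n)  # sigma/sqrt(n_in) scaling
        h1, h2 = np.tanh(W_layer @ h1), np.tanh(W_layer @ h2)
        dists.append(np.linalg.norm(h1 - h2))
    return np.array(dists)

print("sigma = 2.0:", layerwise_distance(2.0)[[0, 4, 9, 19]])  # distance grows, then saturates at O(1)
print("sigma = 0.4:", layerwise_distance(0.4)[[0, 4, 9, 19]])  # distance shrinks toward zero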

To test this idea, we’ll repeat the exercise above, this time initializing the teacher weights with a small \(\sigma\), say, \(0.4\), so that the teacher network is quasi-linear.

Coding Exercise 9: Create dataset & Train a student network#

Create training and test sets. Initialize the teacher network with \(\sigma_{t} = 0.4\).

###################################################################
## Fill out the following then remove
raise NotImplementedError("Student exercise: complete set up.")
###################################################################
torch.manual_seed(-1)

# Create teacher
n_in = 5     # input dimension
W_teacher, D_teacher = 5, 5  # teacher width, depth
sigma_teacher = ...     # teacher weight variance
teacher = make_MLP(..., ..., ...)
initialize_layers(..., ...)

# generate train and test set
N_train, N_test = 4000, 1000
X_train, y_train = make_data(..., ..., ...)
X_test, y_test = make_data(..., ..., ...)

Click for solution

Give the student network a single hidden layer with \(10\) units. Train it for a similar amount of time as before. Determine the relative MSE.

###################################################################
## Fill out the following then remove
raise NotImplementedError("Student exercise: train student on the generated data from special teacher.")
###################################################################

W_student, D_student = ..., ...  # student width, depth

lr = 1e-3
n_epochs = 20000
Es_shallow_train = np.zeros((len(Ws_student),n_epochs))
Es_shallow_test = np.zeros(len(Ws_student))

student = make_MLP(..., ..., ...)
initialize_layers(student, sigma_teacher)

# make sure we have enough data
P = get_num_params(n_in, W_student, D_student)
assert(N_train > 3*P)

# train
Es_shallow_train = train_model(..., ..., ..., n_epochs, lr, progressbar=True)

# evaluate test error
Es_shallow_test = compute_loss(..., ..., ...)/float(y_test.var())
print('Shallow student loss: ',Es_shallow_test)
plot_students_predictions_vs_teacher_values(Es_shallow_train, X_test, y_test)

Click for solution

Example output:

Solution hint

Submit your feedback#

# @title Submit your feedback
content_review(f"{feedback_prefix}_create_dataset_train_student_network")

Video 5: Conclusion & Interactive Demo#

Submit your feedback#

# @title Submit your feedback
content_review(f"{feedback_prefix}_conclusion_interactive_demo")

Interactive Demo 1: Deep networks expressivity#

In this demo, we invite you to explore the expressivity of two distinct deep networks already introduced earlier: one with \(\sigma = 2\) and another (quasi-linear) with \(\sigma = 0.4\).

We initialize two deep networks, each with \(D=20\) layers of \(W = 100\) hidden units, but with different variances for their random weights. We then generate 400 input points on the unit circle and examine how they propagate through the networks.

To visualize each layer’s activity, we randomly project it into 3 dimensions. The slider below controls which layer you are seeing. On the left, you’ll see how a standard network processes its inputs, and on the right, how a quasi-linear network does so.

Execute the cell to observe interactive widget

# @markdown Execute the cell to observe interactive widget

set_seed(42)

W = 100 #width
D = 20 #depth
P = 400 #number of points
sigma_1 = 2 #standard net
sigma_2 = 0.4 #quasi-linear net

colors = plt.cm.hsv(np.linspace(0, 1, P)) #color
random_projection = np.random.normal(size = (W, 3)) #random projection

traj_1, traj_2 = generate_trajectories(W, D, P, sigma_1, sigma_2)

#project trajectories from 100-D to 3-D
projected_traj_1 = traj_1 @ random_projection
projected_traj_2 = traj_2 @ random_projection

@widgets.interact
def expressivity_interactive_visualization(layer = widgets.IntSlider(description="Layer", min=0, max=18, step=1, value=0)):
    expressivity_visualization(layer, projected_traj_1, projected_traj_2, colors)

Interactive Demo 1 Discussion#

  1. What is the qualitative difference between how trajectories propagate through these two networks? Does it fit what we saw earlier when approximating the teacher with a wide student?

Click for solution

Submit your feedback#

# @title Submit your feedback
content_review(f"{feedback_prefix}_deep_network_expressivity")

Summary#

Estimated timing of tutorial: 1 hour

In this tutorial:

  • We discussed the universal approximator theorem, which guarantees that we can approximate any complex function using a network with a single hidden layer.

  • To test this idea, we built a deep teacher network and attempted to fit it with a shallow student network.

  • We found that achieving good performance requires a very wide network - i.e., a very large number of hidden units.

  • We found that if the teacher network is initialized with very small weights, the fitting becomes very easy.

  • We discussed how the fitting difficulty is related to whether the teacher is initialized in the chaotic regime.

  • Chaotic behavior is related to network expressivity, the network’s ability to implement a large number of complex functions.