Training with logging#

In the previous notebook, the model was trained on the GPU whenever one was available. However, once training finished and the notebook was closed, all information about the course of training and the development of the loss was gone. We would like to keep this information, as it can be relevant for diagnostic purposes later on, for example to check convergence or to spot overfitting.

A separate tool, originally developed within the TensorFlow ecosystem but now supported by PyTorch as well, provides a solution for this: TensorBoard. More information is available here.
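
On the PyTorch side, the central object is torch.utils.tensorboard.SummaryWriter, which writes the event files that tensorboard later displays. A minimal sketch of its use (the directory name demo_logs and the tag demo/value are just illustrative examples, not used elsewhere in this notebook) could look like this:

from torch.utils.tensorboard import SummaryWriter

# write a handful of scalar values into an example log directory
demo_writer = SummaryWriter(log_dir="demo_logs")
for step in range(5):
    demo_writer.add_scalar("demo/value", 0.5 * step, step)
demo_writer.close()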

from torch.utils.tensorboard import SummaryWriter
import torch
cuda_present = torch.cuda.is_available()
ndevices = torch.cuda.device_count()
use_cuda = cuda_present and ndevices > 0
device = torch.device("cuda" if use_cuda else "cpu")  # "cuda" uses the default GPU (index 0); "cuda:1", "cuda:2", ... would address other GPUs
print("number of devices:", ndevices, "\tchosen device:", device, "\tuse_cuda=", use_cuda)
from torch.utils.data import DataLoader
from data import DSBData, get_dsb2018_train_files
from monai.networks.nets import BasicUNet
train_img_files, train_lbl_files = get_dsb2018_train_files()

train_data = DSBData(
    image_files=train_img_files,
    label_files=train_lbl_files,
    target_shape=(256, 256)
)

print(len(train_data))

train_loader = DataLoader(train_data, batch_size=32, shuffle=True, num_workers=1, pin_memory=True)
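
To get a feeling for what the loader yields, it can help to inspect a single batch; with the settings above one would expect shapes along the lines of batch x channels x height x width, i.e. (32, 1, 256, 256). This snippet is just an optional check and not required for training:

# fetch one batch and inspect the tensor shapes
sample_images, sample_labels = next(iter(train_loader))
print("image batch shape:", sample_images.shape)
print("label batch shape:", sample_labels.shape)
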
model = BasicUNet(
    spatial_dims=2,
    in_channels=1,
    out_channels=1,
    features=[16, 16, 32, 64, 128, 16],
    act="relu",
    norm="batch",
    dropout=0.25,
)

# transfer the model to the chosen device
model = model.to(device)

Training a neural network means updating its parameters (weights) using the gradients of a loss function with respect to those parameters, so that the weights are adjusted step by step to minimize the loss.

optimizer = torch.optim.Adam(model.parameters(), lr=1.e-3)
# keep a copy of the first parameter tensor so we can later verify that training changed the weights
init_params = list(model.parameters())[0].clone().detach()

Training is performed by iterating over the batches of the training dataset multiple times. Each full pass over the dataset is termed an epoch.

During or after training, the tensorboard logs (which have been collected with the SummaryWriter object) can be visualized. If you were on your laptop or workstation at home, you could run:

tensorboard --logdir "path/to/logs"

Then open a browser at the URL localhost:6006 (or whichever port the tensorboard server reports it is running on). Alternatively, tensorboard can be accessed from within Jupyter as well:
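
One way to do this is via the notebook integration that ships with the tensorboard package (this assumes tensorboard is installed in the kernel environment; logs is the log directory used further below):

%load_ext tensorboard
%tensorboard --logdir logs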

On Taurus, some special steps need to be taken to visualize the tensorboard logs.

If not done already, spawn a notebook, but this time make sure to choose production under software environment in the advanced spawn configuration. Then wait until the notebook opens and run this notebook.

In order to view the tensorboard logs, the tensorboard jupyterlab extension always checks the same fixed location on the machine it is running on. Hence, you need to make your logs available at that location. To do so, run the following commands:

!mkdir -p /tmp/$USER/tf-logs 
!ln -s $PWD/logs /tmp/$USER/tf-logs #might fail if the destination already exists

Now run the following cell, which performs the model training. While the training runs, you can open the Tensorboard tab from the jupyter lab main page.

max_nepochs = 1
log_interval = 1

# note: SummaryWriter's comment argument is only appended to the auto-generated
# log directory name and has no effect once log_dir is set explicitly
writer = SummaryWriter(log_dir="logs")

model.train(True)

# BCEWithLogitsLoss expects raw, unnormalized scores and combines a sigmoid with
# BCELoss internally for better numerical stability.
# inputs and targets are expected as B x C x H x W
loss_function = torch.nn.BCEWithLogitsLoss(reduction="mean")

for epoch in range(1, max_nepochs + 1):
    for batch_idx, (X, y) in enumerate(train_loader):
        # the inputs and labels have to be on the same device as the model
        X, y = X.to(device), y.to(device)
        
        optimizer.zero_grad()

        prediction_logits = model(X)
        
        batch_loss = loss_function(prediction_logits, y)

        batch_loss.backward()

        optimizer.step()

        if batch_idx % log_interval == 0:
            print(
                "Train Epoch:",
                epoch,
                "Batch:",
                batch_idx,
                "Total samples processed:",
                (batch_idx + 1) * train_loader.batch_size,
                "Loss:",
                batch_loss.item(),
            )
            # use a global step so that values from different epochs do not overwrite each other
            global_step = (epoch - 1) * len(train_loader) + batch_idx
            writer.add_scalar("Loss/train", batch_loss.item(), global_step)
writer.close()
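
A possible quick sanity check, reusing the init_params snapshot taken before training, is to verify that the optimizer actually changed the model weights:

# compare the first parameter tensor before and after training
final_params = list(model.parameters())[0].clone().detach()
print("weights changed during training:", not torch.allclose(init_params, final_params))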

Once you have executed the training cell above, you should see a new folder called logs appear in the current directory. This is where the SummaryWriter stores all run information and where tensorboard picks it up from.
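
If you want to check what was written, you can simply list the folder's contents; the event files created by the SummaryWriter typically have names starting with events.out.tfevents:

!ls -l logs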

Exercise: Let’s do this locally#

The nice thing about having all the logs available on disk is that you can move them around. If you like, try to download the entire logs folder onto your local machine (laptop). Then install tensorboard with pip or conda.
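
One possible way to copy the folder is scp; the user name, host name and remote path below are placeholders that you need to adapt to your own setup:

scp -r <user>@<remote-host>:<path/to/notebook>/logs .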

pip install tensorboard  # can take a while

Then run tensorboard on your local machine, pointing it to the downloaded folder:

tensorboard --port 6006 --logdir /local/path/logs

You can now open a browser window and navigate to localhost:6006. This should open the tensorboard interface.