---
title: monitoring
weight: 10
---

The easiest way to set up logging for AI experiments is to use `mlflow`, a ready-made `python` package.

## installation

To get started we add `mlflow` to our project with a package manager like `poetry` or `uv`

```sh
$ poetry add mlflow
```

and then inside our environment we can run

```sh
$ mlflow server --host 127.0.0.1
```

This sets up a web server on `localhost:5000`, which is only accessible from the machine itself (for local monitoring). If you want to make it accessible to other computers (say locally via LAN, or via the internet) use `--host 0.0.0.0`. Just make sure that [you open the proper port in the firewall (by default port 5000)](/self-sufficiency/networking).

{{% hint info %}}
for example, to serve publicly on port 8889, we run

```sh
$ mlflow server --host 0.0.0.0 --port 8889
```
{{% /hint %}}

## docker-compose

In order to use docker and easily handle/manage updates we can create a `docker-compose.yaml`

```yaml
services:
  mlflow:
    image: ghcr.io/mlflow/mlflow
    container_name: mlflow
    ports:
      - '5000:5000'
    environment:
      MLFLOW_TRACKING_URI: http://0.0.0.0:5000
    volumes:
      - ./mlflow:/mlflow/mlruns
    restart: always
    command: ["mlflow", "server", "--host", "0.0.0.0", "--port", "5000"]
```

This pulls the latest `mlflow` image from the GitHub container registry and keeps it running at all times, so we can access the service from anywhere on port `5000`.

{{% hint info %}}
if we want to serve it on port 8889, we set the mapping under `ports` to `'8889:5000'`
{{% /hint %}}

## demo

To get a ready-made demo, we will do a basic MNIST setup

```python
import mlflow
import torch as T
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import DataLoader
from torchvision import datasets
from torchvision.transforms import ToTensor

# download MNIST (train and test splits) into ./data
train_data = datasets.MNIST(
    root='data',
    train=True,
    transform=ToTensor(),
    download=True
)
test_data = datasets.MNIST(
    root='data',
    train=False,
    transform=ToTensor(),
    download=True
)

# the `params` dictionary is defined in the "parameter logging" section below;
# in the full script it has to exist before this point
loaders = {
    'train': DataLoader(
        train_data,
        batch_size=params['batch_size'],
        shuffle=True,
        num_workers=1
    ),
    'test': DataLoader(
        test_data,
        batch_size=params['batch_size'],
        shuffle=True,
        num_workers=1
    )
}
```

and set up an `ImageClassifier`

```python
class ImageClassifier(nn.Module):
    def __init__(self):
        super(ImageClassifier, self).__init__()
        self.conv1 = nn.Conv2d(1, 10, kernel_size=5)
        self.conv2 = nn.Conv2d(10, 20, kernel_size=5)
        self.conv2_drop = nn.Dropout2d()
        self.fc1 = nn.Linear(320, 50)
        self.fc2 = nn.Linear(50, 10)

    def forward(self, x):
        x = self.conv1(x)
        x = F.max_pool2d(x, 2)
        x = F.relu(x)
        x = self.conv2(x)
        x = self.conv2_drop(x)
        x = F.max_pool2d(x, 2)
        x = F.relu(x)
        x = x.view(-1, 320)
        x = self.fc1(x)
        x = F.relu(x)
        x = F.dropout(x, training=self.training)
        x = self.fc2(x)
        # return raw logits: nn.CrossEntropyLoss applies the softmax internally
        return x


device = T.device('cuda' if T.cuda.is_available() else 'cpu')
model = ImageClassifier().to(device)
optimizer = optim.Adam(model.parameters(), lr=params['learning_rate'])
loss_func = nn.CrossEntropyLoss()
```

### train/test functions

Following the [official documentation](https://mlflow.org/docs/latest/tracking/), we can build a tracking experiment. We will need two functions, `train` and `test`:

```python
def train(epoch):
    """
    Train the model on a single pass of the dataloader,
    and send the metrics to mlflow
    """
    model.train()
    for batch_idx, (data, target) in enumerate(loaders['train']):
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        output = model(data)
        loss = loss_func(output, target)
        loss.backward()
        optimizer.step()
        # fraction of correct predictions in the current batch
        accuracy = (output.argmax(dim=1) == target).float().mean().item()
        if batch_idx % 20 == 0:
            print(
                f"Train Epoch: {epoch} "
                f"[{batch_idx * len(data)}/{len(loaders['train'].dataset)} "
                f"({100 * batch_idx / len(loaders['train']):.0f}%)], "
                f"Loss: {loss.item():.6f}"
            )
            # a global step keeps the metric curves monotonic across epochs
            step = epoch * len(loaders['train']) + batch_idx
            mlflow.log_metric("loss", f"{loss.item():2f}", step=step)
            mlflow.log_metric("accuracy", f"{accuracy:2f}", step=step)


def test(epoch):
    """
    Evaluate the model, and log the results with mlflow
    """
    model.eval()
    loss = 0
    correct = 0
    with T.no_grad():
        for data, target in loaders['test']:
            data, target = data.to(device), target.to(device)
            output = model(data)
            loss += loss_func(output, target).item()
            pred = output.argmax(dim=1, keepdim=True)
            correct += pred.eq(target.view_as(pred)).sum().item()
    # average of the per-batch losses
    loss /= len(loaders['test'])
    accuracy = correct / len(loaders['test'].dataset)
    print(
        f"\nTest set: Average Loss: {loss:.4f}, "
        f"Accuracy: {correct}/{len(loaders['test'].dataset)} "
        f"({100 * correct / len(loaders['test'].dataset):.0f}%)\n"
    )
    mlflow.log_metric("eval_loss", f"{loss:2f}", step=epoch)
    mlflow.log_metric("eval_accuracy", f"{accuracy:2f}", step=epoch)
```

### parameter logging

In order to log the hyperparameters so we can reference them during finetuning, we first need to tell the script where our `mlflow` instance is, and to do this we set

```python
mlflow.set_tracking_uri(uri="http://localhost:5000")
mlflow.set_experiment("MNIST mlflow demo")
```

{{% hint info %}}
`set_tracking_uri` points to the URL we run `mlflow` at. This means that if we run it on `127.0.0.1`, we use `localhost` or `127.0.0.1`. If we set it up on `0.0.0.0` and the experiment runs outside of the mlflow server (i.e. on another computer), we use the IP that reaches the server: either the LAN IP provided by the router (if we are on the same LAN), or the public IP of the server. A sketch of switching between these via an environment variable follows below.

`set_experiment` sets the name of the experiment inside the mlflow instance, and is used for grouping and comparing runs.
{{% /hint %}}
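If the same training script has to reach the server both locally and from another machine, one option is to read the tracking URI from the `MLFLOW_TRACKING_URI` environment variable (which `mlflow` also honours on its own) and fall back to the local server. A minimal sketch, assuming the local server from the installation section:

```python
import os
import mlflow

# MLFLOW_TRACKING_URI is the standard variable mlflow reads; setting the URI
# explicitly here just adds a local fallback for when it is not exported
mlflow.set_tracking_uri(os.environ.get("MLFLOW_TRACKING_URI", "http://localhost:5000"))
mlflow.set_experiment("MNIST mlflow demo")
```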
Now we can define the hyperparameters. In the full script this dictionary has to be created before the data loaders in the demo section, since they read `params['batch_size']`; we log it with `mlflow.log_params` once the run starts, so the parameters end up on the same run as the metrics

```python
params = {
    "batch_size": 64,       # example values, tune as needed
    "learning_rate": 1e-3,
    "num_epochs": 3
}
```

### the loop

We are now ready to let the experiment run. The main training loop needs to run inside the `mlflow` run [***context***](https://realpython.com/python-with-statement/)

```python
with mlflow.start_run():
    mlflow.log_params(params)
    for epoch in range(params['num_epochs']):
        train(epoch)
        test(epoch)
```

and wait.
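Once the loop finishes, the metrics are visible in the web UI at the tracking URI. They can also be pulled back into `python` with `mlflow.search_runs`; a minimal sketch, assuming the server and experiment name from above (the `metrics.*`/`params.*` column names depend on what was actually logged):

```python
import mlflow

mlflow.set_tracking_uri("http://localhost:5000")

# one row per run, as a pandas DataFrame; metrics and parameters are exposed
# as "metrics.<name>" and "params.<name>" columns
runs = mlflow.search_runs(
    experiment_names=["MNIST mlflow demo"],
    order_by=["metrics.eval_accuracy DESC"],
)
print(runs[["run_id", "metrics.eval_accuracy", "params.learning_rate"]].head())
```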