---
title: monitoring
weight: 10
---
The easiest way to set up logging for AI experiments is to use `mlflow`, a ready-made Python package.
## installation
To get started, we add `mlflow` to our project using a reasonable package manager like `poetry` or `uv`
```sh
$ poetry add mlflow
```
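or, if the project uses `uv` instead:
```sh
$ uv add mlflow
```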
Then, inside our environment, we can run
```sh
$ mlflow server --host 127.0.0.1
```
This starts a web server on `localhost:5000`, which is only accessible from the machine itself (local monitoring).
If you want to make it accessible to other computers (via the LAN, or over the internet), use `--host 0.0.0.0` instead. Just make sure that [you open the proper port in the firewall (by default port 5000)](/self-sufficiency/networking)
{{% hint info %}}
for example, to serve publicly on port 8889, we run
```sh
$ mlflow server --host 0.0.0.0 --port 8889
```
{{% /hint %}}
## docker-compose
To run `mlflow` under Docker and handle updates more easily, we can create a `docker-compose.yaml`
```yaml
services:
  mlflow:
    image: ghcr.io/mlflow/mlflow
    container_name: mlflow
    ports:
      - '5000:5000'
    environment:
      MLFLOW_TRACKING_URI: http://0.0.0.0:5000
    volumes:
      - ./mlflow:/mlflow/mlruns
    restart: always
    command: ["mlflow", "server", "--host", "0.0.0.0", "--port", "5000"]
```
This pulls the latest `mlflow` image from the GitHub Container Registry and restarts the container automatically, so the service is always reachable on port `5000`.
{{% hint info %}}
if we want to serve it on port 8889, we change the port mapping to `- '8889:5000'` under `ports:`
{{% /hint %}}
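To bring the service up, and later pull a newer image when updating, we can run
```sh
$ docker compose up -d    # start the service in the background
$ docker compose pull     # fetch the latest image when updating
$ docker compose up -d    # recreate the container with the new image
```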
## demo
To get a ready-made demo, we will build a basic MNIST setup
```python
import mlflow
import torch as T
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import DataLoader
from torchvision import datasets
from torchvision.transforms import ToTensor

# train on the GPU if one is available
device = T.device('cuda' if T.cuda.is_available() else 'cpu')

train_data = datasets.MNIST(
    root='data',
    train=True,
    transform=ToTensor(),
    download=True
)
test_data = datasets.MNIST(
    root='data',
    train=False,
    transform=ToTensor(),
    download=True
)

# `params` holds the hyperparameters; it is defined in the
# "parameter logging" section below
loaders = {
    'train': DataLoader(
        train_data,
        batch_size=params['batch_size'],
        shuffle=True,
        num_workers=1
    ),
    'test': DataLoader(
        test_data,
        batch_size=params['batch_size'],
        shuffle=True,
        num_workers=1
    )
}
```
and set up an `ImageClassifier`
```python
class ImageClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 10, kernel_size=5)
        self.conv2 = nn.Conv2d(10, 20, kernel_size=5)
        self.conv2_drop = nn.Dropout2d()
        self.fc1 = nn.Linear(320, 50)
        self.fc2 = nn.Linear(50, 10)

    def forward(self, x):
        x = self.conv1(x)
        x = F.max_pool2d(x, 2)
        x = F.relu(x)
        x = self.conv2(x)
        x = self.conv2_drop(x)
        x = F.max_pool2d(x, 2)
        x = F.relu(x)
        x = x.view(-1, 320)
        x = self.fc1(x)
        x = F.relu(x)
        x = F.dropout(x, training=self.training)
        # return raw logits: nn.CrossEntropyLoss applies softmax internally,
        # so applying F.softmax here would normalize twice
        return self.fc2(x)

model = ImageClassifier().to(device)
optimizer = optim.Adam(model.parameters(), lr=params['learning_rate'])
loss_func = nn.CrossEntropyLoss()
```
### train/test functions
Following the [official documentation](https://mlflow.org/docs/latest/tracking/), we can build a tracking experiment. We will need two functions, `train` and `test`:
```python
def train(epoch):
    """
    Train the model on a single pass of the dataloader, and send the metrics to mlflow
    """
    model.train()
    for batch_idx, (data, target) in enumerate(loaders['train']):
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        output = model(data)
        loss = loss_func(output, target)
        loss.backward()
        optimizer.step()
        if batch_idx % 20 == 0:
            # fraction of this batch classified correctly
            accuracy = (output.argmax(dim=1) == target).float().mean().item()
            print(
                f"Train Epoch: {epoch} "
                f"[{batch_idx * len(data)}/{len(loaders['train'].dataset)} "
                f"({100. * batch_idx / len(loaders['train']):.0f}%)] "
                f"Loss: {loss.item():.6f}"
            )
            # a step counter that keeps increasing across epochs
            step = epoch * len(loaders['train']) + batch_idx
            mlflow.log_metric("loss", loss.item(), step=step)
            mlflow.log_metric("accuracy", accuracy, step=step)

def test(epoch):
    """
    Evaluate the model, and log results with mlflow
    """
    model.eval()
    loss = 0
    correct = 0
    with T.no_grad():
        for data, target in loaders['test']:
            data, target = data.to(device), target.to(device)
            output = model(data)
            loss += loss_func(output, target).item()
            pred = output.argmax(dim=1, keepdim=True)
            correct += pred.eq(target.view_as(pred)).sum().item()
    # loss_func averages over each batch, so divide by the number of batches
    loss /= len(loaders['test'])
    accuracy = correct / len(loaders['test'].dataset)
    print(
        f"\nTest set: Average loss: {loss:.4f}, "
        f"Accuracy: {correct}/{len(loaders['test'].dataset)} "
        f"({100. * accuracy:.0f}%)\n"
    )
    mlflow.log_metric("eval_loss", loss, step=epoch)
    mlflow.log_metric("eval_accuracy", accuracy, step=epoch)
```
### parameter logging
In order to log the hyperparameters so we can reference them during finetuning, we first need to tell the script where our `mlflow` instance lives. To do this we set
```python
mlflow.set_tracking_uri(uri="http://localhost:5000")
mlflow.set_experiment("MNIST mlflow demo")
```
{{% hint info %}}
`set_tracking_uri` points to the URL where `mlflow` is running. This means that if we serve it on `127.0.0.1`, we use `localhost` or `127.0.0.1`. If we set it up with `0.0.0.0` and the experiment runs outside of the mlflow server (i.e. on another computer), we use the IP that points to the server: either the LAN IP assigned by the router (if we are on a LAN), or the server's public IP.
`set_experiment` is the name of the experiment inside the mlflow instance, and is used for experiment grouping and comparisons.
{{% /hint %}}
Now we can define the hyperparameters. We will log them with `mlflow.log_params` at the start of the run (see the loop below), so they end up in the same run as the metrics
```python
params = {
    "batch_size": 64,       # example values, tune as needed
    "learning_rate": 1e-3,
    "num_epochs": 3
}
```
### the loop
We are now ready to let the experiment run.
The main training loop needs to run inside the `mlflow` [***context***](https://realpython.com/python-with-statement/), where we also log the hyperparameters:
```python
with mlflow.start_run():
    mlflow.log_params(params)
    for epoch in range(params['num_epochs']):
        train(epoch)
        test(epoch)
```
and wait; the metrics and parameters appear live in the `mlflow` UI at `localhost:5000`.
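If we also want the trained weights attached to the run, we can log the model before the context exits. A minimal sketch, assuming the default artifact store (the `"model"` artifact path is just an example name):
```python
with mlflow.start_run():
    mlflow.log_params(params)
    for epoch in range(params['num_epochs']):
        train(epoch)
        test(epoch)
    # store the trained weights as an artifact of the same run
    mlflow.pytorch.log_model(model, "model")
```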