---
title: monitoring
weight: 10
---

The easiest way to set up logging for AI experiments is `mlflow`, a ready-made `python` package.

## installation
To get started we can add `mlflow` to our project, using a reasonable package manager like `poetry` or `uv`

```sh
$ poetry add mlflow
```

and then, inside our environment, we can run
```sh
$ mlflow server --host 127.0.0.1
```

This sets up a web server on `localhost:5000`, accessible only from the machine itself (for local monitoring).
If you want to make it accessible to other computers (say locally via LAN, or via the internet) use `--host 0.0.0.0`. Just make sure that [you open the proper port in the firewall (by default port 5000)](/self-sufficiency/networking).

{{% hint info %}}
for example, to serve publicly on port 8889, we run
```sh
$ mlflow server --host 0.0.0.0 --port 8889
```
{{% /hint %}}
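With the server running, we can check that tracking works end to end before building a real experiment. A minimal sketch, assuming the server above is reachable on `localhost:5000` (the experiment name and the logged values are just placeholders):

```python
import mlflow

# point the client at the tracking server started above
mlflow.set_tracking_uri(uri="http://localhost:5000")
mlflow.set_experiment("smoke-test")

with mlflow.start_run():
    mlflow.log_param("answer", 42)
    mlflow.log_metric("loss", 0.5, step=0)
```

If a run shows up under `smoke-test` in the web UI, the server is wired up correctly.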
## docker-compose
In order to use docker and easily handle/manage updates we can create a `docker-compose.yaml`
```yaml
services:
  mlflow:
    image: ghcr.io/mlflow/mlflow
    container_name: mlflow
    ports:
      - '5000:5000'
    environment:
      MLFLOW_TRACKING_URI: http://0.0.0.0:5000
    volumes:
      - ./mlflow:/mlflow/mlruns
    restart: always
    command: ["mlflow", "server", "--host", "0.0.0.0", "--port", "5000"]
```
This pulls the latest `mlflow` image from github and sets the container to always run, so we can access the service from anywhere on port `5000`.
{{% hint info %}}
if we want to serve it on port 8889 instead, we set the mapping under `ports:` to `- '8889:5000'`
{{% /hint %}}
## demo

To get a ready-made demo, we will do a basic MNIST setup

```python
import mlflow

import torch as T
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

from torch.utils.data import DataLoader

from torchvision import datasets
from torchvision.transforms import ToTensor

# run on the GPU when one is available
device = T.device('cuda' if T.cuda.is_available() else 'cpu')

train_data = datasets.MNIST(
    root='data',
    train=True,
    transform=ToTensor(),
    download=True
    )

test_data = datasets.MNIST(
    root='data',
    train=False,
    transform=ToTensor(),
    download=True
    )

# `params` is the hyperparameter dict defined in the
# "parameter logging" section below; in a single script,
# define it before this point
loaders = {
    'train': DataLoader(
        train_data,
        batch_size=params['batch_size'],
        shuffle=True,
        num_workers=1
        ),

    'test': DataLoader(
        test_data,
        batch_size=params['batch_size'],
        shuffle=True,
        num_workers=1
        )
}
```
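As a quick sanity check of the loaders (a sketch; the shapes assume the default `ToTensor` transform above):

```python
# one batch: [batch_size, 1, 28, 28] images and [batch_size] integer labels
images, labels = next(iter(loaders['train']))
print(images.shape, labels.shape)
```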
Next we set up an `ImageClassifier`
```python
class ImageClassifier(nn.Module):

    def __init__(self):
        super(ImageClassifier, self).__init__()

        self.conv1 = nn.Conv2d(1, 10, kernel_size=5)
        self.conv2 = nn.Conv2d(10, 20, kernel_size=5)
        self.conv2_drop = nn.Dropout2d()
        self.fc1 = nn.Linear(320, 50)
        self.fc2 = nn.Linear(50, 10)

    def forward(self, x):
        x = self.conv1(x)
        x = F.max_pool2d(x, 2)
        x = F.relu(x)
        x = self.conv2(x)
        x = self.conv2_drop(x)
        x = F.max_pool2d(x, 2)
        x = F.relu(x)
        x = x.view(-1, 320)
        x = self.fc1(x)
        x = F.relu(x)
        x = F.dropout(x, training=self.training)
        x = self.fc2(x)
        # return raw logits: nn.CrossEntropyLoss applies
        # log-softmax internally, so no softmax here
        return x

model = ImageClassifier().to(device)
optimizer = optim.Adam(model.parameters(), lr=params['learning_rate'])
loss_func = nn.CrossEntropyLoss()
```
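The `320` in `fc1` is the flattened feature size after the two conv/pool stages: 28×28 inputs shrink to 4×4 maps across 20 channels. A quick way to confirm the wiring (a sketch using a dummy batch):

```python
# dummy batch of one MNIST-sized image -> expect one logit per class
with T.no_grad():
    logits = model(T.randn(1, 1, 28, 28).to(device))
print(logits.shape)  # torch.Size([1, 10])
```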
### train/test functions
Using the [official documentation](https://mlflow.org/docs/latest/tracking/), we can build a tracking experiment.

We will need two functions, `train` and `test`:
```python
def train(epoch):
    """
    Train the model on a single pass of the dataloader,
    and send the metrics to mlflow
    """
    model.train()
    n_batches = len(loaders['train'])
    n_samples = len(loaders['train'].dataset)

    for batch_idx, (data, target) in enumerate(loaders['train']):

        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        output = model(data)

        loss = loss_func(output, target)
        loss.backward()
        # fraction of the current batch classified correctly
        accuracy = (output.argmax(dim=1) == target).float().mean().item()

        optimizer.step()

        if batch_idx % 20 == 0:
            print(
                f"Train Epoch: {epoch} "
                f"[{batch_idx*len(data)}/{n_samples} "
                f"({100*batch_idx/n_batches:.0f}%)], "
                f"Loss: {loss.item():.6f}"
            )

            # a global step keeps the metric axis monotonic across epochs
            step = epoch * n_batches + batch_idx
            mlflow.log_metric("loss", loss.item(), step=step)
            mlflow.log_metric("accuracy", accuracy, step=step)

def test(epoch):
    """
    Evaluate the model, and log results with mlflow
    """
    model.eval()

    loss = 0
    correct = 0

    with T.no_grad():
        for data, target in loaders['test']:
            data, target = data.to(device), target.to(device)
            output = model(data)
            loss += loss_func(output, target).item()
            pred = output.argmax(dim=1, keepdim=True)
            correct += pred.eq(target.view_as(pred)).sum().item()

    # loss_func returns a per-batch mean, so average over batches
    loss /= len(loaders['test'])
    accuracy = correct/len(loaders['test'].dataset)

    print(
        f"\nTest set: Average Loss: {loss:.4f}, "
        f"Accuracy: {correct}/{len(loaders['test'].dataset)} "
        f"({100*accuracy:.0f}%)\n"
    )

    mlflow.log_metric("eval_loss", loss, step=epoch)
    mlflow.log_metric("eval_accuracy", accuracy, step=epoch)
```

### parameter logging

In order to log the hyperparameters, so we can reference them during finetuning, we first need to inform the script where our `mlflow` instance is at, and to do this we set
```python
mlflow.set_tracking_uri(uri="http://localhost:5000")

mlflow.set_experiment("MNIST mlflow demo")
```
{{% hint info %}}
`set_tracking_uri` points to the `url` we run `mlflow` at. This means that if we run it on `127.0.0.1`, we use `localhost` or `127.0.0.1`. If we set it up on `0.0.0.0` and the experiment runs outside of the mlflow server (i.e. on another computer), we use the IP that points to the server: either the LAN IP provided by the router (if we are using a LAN), or the public IP of the server.

`set_experiment` is the name of the experiment inside the mlflow instance, and is used for experiment grouping and comparisons.
{{% /hint %}}

Now we can define the hyperparameters and log them

```python
# example values; tune to taste
params = {
    "batch_size": 100,
    "learning_rate": 1e-3,
    "num_epochs": 3
}
mlflow.log_params(params)
```
Since the dataloaders above read `params`, this block has to run before they are built.

### the loop
We are now ready to let the experiment run.

The main training loop needs to run inside the `mlflow` [***context***](https://realpython.com/python-with-statement/)

```python
with mlflow.start_run():
    for epoch in range(params['num_epochs']):
        train(epoch)
        test(epoch)
```
and wait.
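Once a few runs have finished, the logged values can also be pulled back programmatically instead of through the web UI. A sketch, assuming an `mlflow` version (2.x) where `mlflow.search_runs` accepts `experiment_names` and returns a pandas dataframe:

```python
import mlflow

mlflow.set_tracking_uri(uri="http://localhost:5000")

# one row per run, with metrics.* and params.* columns
runs = mlflow.search_runs(experiment_names=["MNIST mlflow demo"])
print(runs[["run_id", "params.learning_rate", "metrics.eval_accuracy"]])
```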

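The trained weights can be stored next to the metrics too. A sketch using the `pytorch` flavor bundled with `mlflow`, assuming the 2.x `mlflow.pytorch.log_model(model, artifact_path)` signature:

```python
with mlflow.start_run():
    for epoch in range(params['num_epochs']):
        train(epoch)
        test(epoch)
    # after training, save the weights as a run artifact,
    # browsable under the run in the web UI
    mlflow.pytorch.log_model(model, "model")
```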