---
title: monitoring
weight: 10
---

The easiest way to set up logging for AI experiments is `mlflow`, a ready-made `python` package.

## installation
To get started we can add `mlflow` to our project, using a reasonable package manager like `poetry` or `uv`

```sh
$ poetry add mlflow
```

and then, inside our environment, we can run
```sh
$ mlflow server --host 127.0.0.1
```

This sets up a web server on `localhost:5000`, accessible only from the machine itself (for local monitoring).
If you want to make it accessible to other computers (say locally via LAN, or via the internet) use `--host 0.0.0.0`. Just make sure that [you open the proper port in the firewall (by default port 5000)](/self-sufficiency/networking).

{{% hint info %}}
for example, to serve publicly on port 8889, we run
```sh
$ mlflow server --host 0.0.0.0 --port 8889
```
{{% /hint %}}
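With the server running, we can check that tracking works end to end before building a real experiment. A minimal sketch, assuming the server above is reachable on `localhost:5000` (the experiment name and the logged values are just placeholders):

```python
import mlflow

# point the client at the tracking server started above
mlflow.set_tracking_uri(uri="http://localhost:5000")
mlflow.set_experiment("smoke-test")

with mlflow.start_run():
    mlflow.log_param("answer", 42)
    mlflow.log_metric("loss", 0.5, step=0)
```

If a run shows up under `smoke-test` in the web UI, the server is wired up correctly.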
## docker-compose
In order to use docker and easily handle/manage updates we can create a `docker-compose.yaml`
```yaml
services:
  mlflow:
    image: ghcr.io/mlflow/mlflow
    container_name: mlflow
    ports:
      - '5000:5000'
    environment:
      MLFLOW_TRACKING_URI: http://0.0.0.0:5000
    volumes:
      - ./mlflow:/mlflow/mlruns
    restart: always
    command: ["mlflow", "server", "--host", "0.0.0.0", "--port", "5000"]
```
This pulls the latest `mlflow` image from github and sets the container to always run, so we can access the service from anywhere on port `5000`.
{{% hint info %}}
if we want to serve it on port 8889 instead, we set the mapping under `ports:` to `- '8889:5000'`
{{% /hint %}}
## demo

To get a ready-made demo, we will do a basic MNIST setup

```python
import mlflow

import torch as T
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

from torch.utils.data import DataLoader

from torchvision import datasets
from torchvision.transforms import ToTensor

# run on the GPU when one is available
device = T.device('cuda' if T.cuda.is_available() else 'cpu')

train_data = datasets.MNIST(
    root='data',
    train=True,
    transform=ToTensor(),
    download=True
    )

test_data = datasets.MNIST(
    root='data',
    train=False,
    transform=ToTensor(),
    download=True
    )

# `params` is the hyperparameter dict defined in the
# "parameter logging" section below; in a single script,
# define it before this point
loaders = {
    'train': DataLoader(
        train_data,
        batch_size=params['batch_size'],
        shuffle=True,
        num_workers=1
        ),

    'test': DataLoader(
        test_data,
        batch_size=params['batch_size'],
        shuffle=True,
        num_workers=1
        )
}
```
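As a quick sanity check of the loaders (a sketch; the shapes assume the default `ToTensor` transform above):

```python
# one batch: [batch_size, 1, 28, 28] images and [batch_size] integer labels
images, labels = next(iter(loaders['train']))
print(images.shape, labels.shape)
```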
Next we set up an `ImageClassifier`
```python
class ImageClassifier(nn.Module):

    def __init__(self):
        super(ImageClassifier, self).__init__()

        self.conv1 = nn.Conv2d(1, 10, kernel_size=5)
        self.conv2 = nn.Conv2d(10, 20, kernel_size=5)
        self.conv2_drop = nn.Dropout2d()
        self.fc1 = nn.Linear(320, 50)
        self.fc2 = nn.Linear(50, 10)

    def forward(self, x):
        x = self.conv1(x)
        x = F.max_pool2d(x, 2)
        x = F.relu(x)
        x = self.conv2(x)
        x = self.conv2_drop(x)
        x = F.max_pool2d(x, 2)
        x = F.relu(x)
        x = x.view(-1, 320)
        x = self.fc1(x)
        x = F.relu(x)
        x = F.dropout(x, training=self.training)
        x = self.fc2(x)
        # return raw logits: nn.CrossEntropyLoss applies
        # log-softmax internally, so no softmax here
        return x

model = ImageClassifier().to(device)
optimizer = optim.Adam(model.parameters(), lr=params['learning_rate'])
loss_func = nn.CrossEntropyLoss()
```
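The `320` in `fc1` is the flattened feature size after the two conv/pool stages: 28×28 inputs shrink to 4×4 maps across 20 channels. A quick way to confirm the wiring (a sketch using a dummy batch):

```python
# dummy batch of one MNIST-sized image -> expect one logit per class
with T.no_grad():
    logits = model(T.randn(1, 1, 28, 28).to(device))
print(logits.shape)  # torch.Size([1, 10])
```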
### train/test functions
Using the [official documentation](https://mlflow.org/docs/latest/tracking/), we can build a tracking experiment.

We will need two functions, `train` and `test`:
```python
def train(epoch):
    """
    Train the model on a single pass of the dataloader,
    and send the metrics to mlflow
    """
    model.train()
    n_batches = len(loaders['train'])
    n_samples = len(loaders['train'].dataset)

    for batch_idx, (data, target) in enumerate(loaders['train']):

        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        output = model(data)

        loss = loss_func(output, target)
        loss.backward()
        # fraction of the current batch classified correctly
        accuracy = (output.argmax(dim=1) == target).float().mean().item()

        optimizer.step()

        if batch_idx % 20 == 0:
            print(
                f"Train Epoch: {epoch} "
                f"[{batch_idx*len(data)}/{n_samples} "
                f"({100*batch_idx/n_batches:.0f}%)], "
                f"Loss: {loss.item():.6f}"
            )

            # a global step keeps the metric axis monotonic across epochs
            step = epoch * n_batches + batch_idx
            mlflow.log_metric("loss", loss.item(), step=step)
            mlflow.log_metric("accuracy", accuracy, step=step)

def test(epoch):
    """
    Evaluate the model, and log results with mlflow
    """
    model.eval()

    loss = 0
    correct = 0

    with T.no_grad():
        for data, target in loaders['test']:
            data, target = data.to(device), target.to(device)
            output = model(data)
            loss += loss_func(output, target).item()
            pred = output.argmax(dim=1, keepdim=True)
            correct += pred.eq(target.view_as(pred)).sum().item()

    # loss_func returns a per-batch mean, so average over batches
    loss /= len(loaders['test'])
    accuracy = correct/len(loaders['test'].dataset)

    print(
        f"\nTest set: Average Loss: {loss:.4f}, "
        f"Accuracy: {correct}/{len(loaders['test'].dataset)} "
        f"({100*accuracy:.0f}%)\n"
    )

    mlflow.log_metric("eval_loss", loss, step=epoch)
    mlflow.log_metric("eval_accuracy", accuracy, step=epoch)
```

### parameter logging

In order to log the hyperparameters, so we can reference them during finetuning, we first need to inform the script where our `mlflow` instance is at, and to do this we set
```python
mlflow.set_tracking_uri(uri="http://localhost:5000")

mlflow.set_experiment("MNIST mlflow demo")
```
{{% hint info %}}
`set_tracking_uri` points to the `url` we run `mlflow` at. This means that if we run it on `127.0.0.1`, we use `localhost` or `127.0.0.1`. If we set it up on `0.0.0.0` and the experiment runs outside of the mlflow server (i.e. on another computer), we use the IP that points to the server: either the LAN IP provided by the router (if we are using a LAN), or the public IP of the server.

`set_experiment` is the name of the experiment inside the mlflow instance, and is used for experiment grouping and comparisons.
{{% /hint %}}

Now we can define the hyperparameters and log them

```python
# example values; tune to taste
params = {
    "batch_size": 100,
    "learning_rate": 1e-3,
    "num_epochs": 3
}
mlflow.log_params(params)
```
Since the dataloaders above read `params`, this block has to run before they are built.

### the loop
We are now ready to let the experiment run.

The main training loop needs to run inside the `mlflow` [***context***](https://realpython.com/python-with-statement/)

```python
with mlflow.start_run():
    for epoch in range(params['num_epochs']):
        train(epoch)
        test(epoch)
```
and wait.
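Once a few runs have finished, the logged values can also be pulled back programmatically instead of through the web UI. A sketch, assuming an `mlflow` version (2.x) where `mlflow.search_runs` accepts `experiment_names` and returns a pandas dataframe:

```python
import mlflow

mlflow.set_tracking_uri(uri="http://localhost:5000")

# one row per run, with metrics.* and params.* columns
runs = mlflow.search_runs(experiment_names=["MNIST mlflow demo"])
print(runs[["run_id", "params.learning_rate", "metrics.eval_accuracy"]])
```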

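The trained weights can be stored next to the metrics too. A sketch using the `pytorch` flavor bundled with `mlflow`, assuming the 2.x `mlflow.pytorch.log_model(model, artifact_path)` signature:

```python
with mlflow.start_run():
    for epoch in range(params['num_epochs']):
        train(epoch)
        test(epoch)
    # after training, save the weights as a run artifact,
    # browsable under the run in the web UI
    mlflow.pytorch.log_model(model, "model")
```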