Preface: Cloud MLOps is solved; you use Kubernetes and ArgoCD. Edge MLOps is unsolved. Devices sit behind NATs, ride flaky cellular connections, and run on limited disk space. This guide proposes a robust architecture for managing thousands of remote AI nodes using K3s and a Pull-Based GitOps workflow.
1. The Fallacy of "Push" Deployments
In a data center, you "push" code to a server via SSH or kubectl. At the edge, you cannot.
- No Public IP: Devices are behind Carrier-Grade NAT (CGNAT).
- Sleeping/Offline: Solar-powered devices may only wake up for 10 minutes a day.
We must invert control: the device must Pull its configuration, ideally as an idempotent reconciliation loop that converges on the declared state no matter how many cycles it has missed.
2. The Edge Stack (K3s & GitOps)
We need a container orchestrator that is lightweight yet API-compatible with standard tools. K3s (Lightweight Kubernetes) is the industry standard.
- Binary Size: <100MB
- Memory: Runs on 512MB RAM
- Feature Parity: Support for Helm, Secrets, ConfigMaps.
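On the device itself, K3s can be slimmed down further through its config file. A minimal sketch (keys mirror the K3s CLI flags; which components you disable depends on your workload, and the node label is a hypothetical example):
# /etc/rancher/k3s/config.yaml -- read by K3s at startup
write-kubeconfig-mode: "0644"
disable:
  - traefik       # no ingress controller needed for a single inference pod
  - servicelb
node-label:
  - "site=edge"   # hypothetical label for fleet-wide scheduling and queries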
GitOps with FluxCD
We install a GitOps agent (like FluxCD) on the K3s cluster. The agent monitors a Git repository for changes.
# fleet-repo/apps/inference-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: object-detection
spec:
  replicas: 1
  selector:
    matchLabels:
      app: object-detection
  template:
    metadata:
      labels:
        app: object-detection
    spec:
      containers:
        - name: yolo
          image: registry.netprog.io/models/yolo:v4.2.0  # Update this line in Git
          resources:
            limits:
              nvidia.com/gpu: 1
When you want to update 5,000 devices, you simply commit a change to the `image` tag in the Git repo. The 5,000 agents will wake up, see the change, and start pulling the new image at their own pace.
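For reference, the Flux objects that point each device's cluster at this repo might look like the sketch below (the Git URL is hypothetical; the polling interval trades update latency against cellular bandwidth):
# fleet-repo/clusters/edge/sync.yaml
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: fleet-repo
  namespace: flux-system
spec:
  interval: 10m                                  # how often the agent polls Git
  url: https://git.netprog.io/fleet-repo.git     # hypothetical repo URL
  ref:
    branch: main
---
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: apps
  namespace: flux-system
spec:
  interval: 10m
  sourceRef:
    kind: GitRepository
    name: fleet-repo
  path: ./apps
  prune: true                                    # delete resources removed from Git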
3. Atomic OTA Update Strategy
Updating AI models means downloading large files (GBs). If the network cuts out at 99%, the device must not be left with a corrupted model.
The A/B Partition (Software) Approach:
- Current State: Pod V1 is running.
- Update Triggered: K3s pulls Image V2.
- Startup Probe: Pod V2 starts up alongside V1. It loads the model into memory and runs a self-test inference on a dummy image.
- Switchover: Only if the self-test passes does K3s route traffic to V2 and terminate V1.
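In Kubernetes terms, the self-test gate is a startupProbe and the surge behaviour comes from the rollout strategy. A sketch of the relevant fields (the /selftest endpoint is a hypothetical route inside the inference container; /health on port 8080 matches the watchdog in the next section):
# fleet-repo/apps/inference-deployment.yaml (excerpt)
spec:
  strategy:
    rollingUpdate:
      maxSurge: 1          # V2 starts alongside V1
      maxUnavailable: 0    # V1 keeps serving until V2 is ready
  template:
    spec:
      containers:
        - name: yolo
          startupProbe:
            httpGet:
              path: /selftest    # hypothetical: runs one dummy inference
              port: 8080
            failureThreshold: 30
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /health
              port: 8080
            periodSeconds: 15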
4. Automated Rollback Logic
What if the model passes its self-test but degrades at runtime (e.g., a slow memory leak)? We need a watchdog. The sidecar sketch below polls the inference API and node memory; the rollback helper shells out to kubectl, which assumes kubectl and suitable RBAC inside the sidecar.
# Python Watchdog Script (Sidecar)
import subprocess
import time
import psutil
import requests

FAIL_COUNT = 0

def rollback_deployment():
    # One possible implementation (assumes kubectl is in the sidecar image and
    # the ServiceAccount is allowed to roll back Deployments):
    subprocess.run(["kubectl", "rollout", "undo", "deployment/object-detection"], check=False)

while True:
    healthy = True
    # Check 1: Inference API Health
    try:
        r = requests.get("http://localhost:8080/health", timeout=1)
        if r.status_code != 200:
            healthy = False
    except requests.RequestException:
        healthy = False
    # Check 2: Memory Pressure
    if psutil.virtual_memory().percent > 95:
        healthy = False

    # Count consecutive unhealthy passes; a healthy pass resets the counter
    FAIL_COUNT = 0 if healthy else FAIL_COUNT + 1
    if FAIL_COUNT > 5:
        print("CRITICAL: Health checks failed. Triggering Self-Heal.")
        # Trigger the Kubernetes API to roll back the deployment
        rollback_deployment()
        break
    time.sleep(10)
5. Observability (Prometheus Remote Write)
You cannot scrape edge devices from the cloud; the same NAT problem applies. They must push metrics to you. We use Prometheus Agent Mode to scrape local metrics (GPU temperature, inference latency) and `remote_write` them to a central Cortex/Thanos cluster in the cloud.
Bandwidth Tip: Do not ship every raw series. Use 'recording rules' to aggregate data at the edge (e.g., calculate p99 latency locally) and send only the aggregated datapoints to the cloud. Note that Agent Mode does not evaluate rules itself, so either run a thin full-mode Prometheus with short local retention or at least filter what you forward.
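A sketch of the edge-side config, assuming a hypothetical Cortex endpoint at metrics.netprog.io and a DCGM exporter for GPU metrics; the metric names in the filter are illustrative. Since Agent Mode only forwards samples, write_relabel_configs does the filtering here:
# /etc/prometheus/prometheus.yml -- run with --enable-feature=agent
global:
  scrape_interval: 30s
scrape_configs:
  - job_name: inference
    static_configs:
      - targets: ["localhost:8080"]     # inference pod metrics (assumed port)
  - job_name: gpu
    static_configs:
      - targets: ["localhost:9400"]     # NVIDIA DCGM exporter (default port)
remote_write:
  - url: https://metrics.netprog.io/api/v1/push   # hypothetical Cortex/Mimir endpoint
    queue_config:
      batch_send_deadline: 30s
      max_samples_per_send: 500
    write_relabel_configs:
      - source_labels: [__name__]
        regex: "inference_latency_seconds.*|gpu_temperature_celsius|up"
        action: keep                    # drop everything else before it leaves the device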
Conclusion: Edge MLOps is about managing chaos. By treating devices as "cattle, not pets" and using pull-based, atomic workflows, one engineer can manage a fleet of 10,000 nodes as easily as managing 10.