Preface: The modern retail store is a data factory. Hundreds of shoppers interact with thousands of products every hour. Traditional "Store Analytics" (footfall counters) are insufficient. We need comprehensive understanding: gaze detection, queue analysis, and real-time shrinkage (theft) alerts. This requires running Computer Vision (CV) on-premise.
1. Hardware Selection (Jetson vs. x86)
For deployments ranging from a handful of cameras up to 32+ streams, the key metric is inference throughput (FPS) per dollar, within the site's power budget.
| Device | AI Performance (INT8) | Power | Recommended Use |
|---|---|---|---|
| NVIDIA Jetson Orin Nano | 40 TOPS | 15W | Small stores (1-4 cameras) |
| NVIDIA Jetson Orin NX | 100 TOPS | 25W | Medium retail (8-16 cameras) |
| x86 server + NVIDIA A2 GPU | 36 TOPS | ~200W (full system) | Flagship stores (32+ cameras) |
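A quick back-of-envelope check makes the sizing concrete. All numbers below are illustrative assumptions, not benchmarks:

```python
# Back-of-envelope throughput budget (illustrative numbers only).
cameras = 16          # assumed number of RTSP streams
camera_fps = 30       # assumed per-camera frame rate
interval = 2          # batches skipped between inferences (see [primary-gie] below)

frames_per_second = cameras * camera_fps                     # 480 decoded frames/s
inferences_per_second = frames_per_second / (interval + 1)   # 160 inferences/s

print(f"Decode load:    {frames_per_second} frames/s")
print(f"Inference load: {inferences_per_second:.0f} inferences/s")
```

If a device's measured INT8 detector throughput comfortably exceeds the inference load, it is big enough; otherwise raise the frame-skip `interval` (shown in the DeepStream config below) or move up a hardware tier.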
2. The DeepStream Pipeline
Software-decoding 16 streams of 1080p H.264 video will saturate most CPUs before inference even starts. We use NVIDIA DeepStream to keep the entire pipeline (Decode -> Pre-process -> Inference -> Tracker) on the GPU, where the dedicated NVDEC hardware handles the decoding.
```ini
# deepstream_config.txt
[source0]
enable=1
# Type 4 = RTSP
type=4
uri=rtsp://camera-01.local:554/stream

[streammux]
gpu-id=0
# Batch size should match the number of sources
batch-size=16
# Wait up to 40 ms (one frame at 25 FPS) to assemble a full batch
batched-push-timeout=40000
width=1920
height=1080

[primary-gie]
enable=1
model-engine-file=yolov8_int8.engine
labelfile-path=labels.txt
# 0 = run inference on every frame (expensive);
# 2 = skip two frames between inferences
interval=0
```
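With sources, muxer, and detector defined, the stock reference application drives the whole pipeline: `deepstream-app -c deepstream_config.txt`.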
3. Training Custom Models (YOLOv8)
Off-the-shelf models detect "Person" and "Car". Retail needs "Holding Product", "Putting in Pocket", and "Staff Uniform". We fine-tune YOLOv8 on store-specific datasets.
```python
from ultralytics import YOLO

# Load a pretrained nano model as the starting point
model = YOLO('yolov8n.pt')

# Fine-tune on the retail dataset (GPU 0)
results = model.train(
    data='retail_theft_dataset.yaml',
    epochs=100,
    imgsz=640,
    device=0,
)
```
4. TensorRT Optimization
A stock PyTorch model leaves performance on the table. We convert it to a TensorRT engine, which performs layer fusion and kernel auto-tuning, and apply INT8 calibration to reduce precision while maintaining accuracy.
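Recent Ultralytics releases can drive this conversion directly. A minimal sketch, assuming a TensorRT-capable environment (e.g., JetPack on an Orin) and reusing the training dataset YAML to supply calibration images:

```python
from ultralytics import YOLO

# Load the fine-tuned weights (default Ultralytics output path).
model = YOLO('runs/detect/train/weights/best.pt')

# Build a TensorRT engine with INT8 calibration.
model.export(
    format='engine',                   # TensorRT engine output
    int8=True,                         # quantize to INT8
    data='retail_theft_dataset.yaml',  # calibration images
    imgsz=640,
    device=0,
)
```

The resulting `.engine` file is what `model-engine-file` in the `[primary-gie]` section points at; note that parsing YOLOv8's output tensors inside DeepStream's `nvinfer` typically requires a custom bounding-box parser, which is beyond the scope of the config snippet above.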
Result: YOLOv8n jumps from 45 FPS (PyTorch) to 600 FPS (TensorRT INT8) on an Orin Nano.
5. The Event Bus (MQTT)
The edge node does not send video to the cloud; it sends metadata. A pad probe in the DeepStream pipeline extracts detection metadata and publishes it as JSON events over MQTT.
An example payload:

```json
{
  "sensorId": "cam-04",
  "timestamp": "2024-12-27T14:30:00Z",
  "objects": [
    {
      "class": "person",
      "confidence": 0.92,
      "bbox": [100, 200, 50, 150],
      "attributes": {
        "action": "dwelling",
        "duration_seconds": 45
      }
    }
  ]
}
```
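A minimal sketch of the probe side, assuming the DeepStream Python bindings (`pyds`) and `paho-mqtt` >= 2.0; the broker address and topic layout are assumptions:

```python
import gi
gi.require_version('Gst', '1.0')
from gi.repository import Gst

import json
import pyds
import paho.mqtt.client as mqtt

# Hypothetical broker; in production this is the store's local MQTT bus.
client = mqtt.Client(mqtt.CallbackAPIVersion.VERSION2)
client.connect("localhost", 1883)

def tracker_src_pad_probe(pad, info, u_data):
    """Walk DeepStream batch metadata and publish one JSON event per frame."""
    gst_buffer = info.get_buffer()
    if not gst_buffer:
        return Gst.PadProbeReturn.OK
    batch_meta = pyds.gst_buffer_get_nvds_batch_meta(hash(gst_buffer))
    l_frame = batch_meta.frame_meta_list
    while l_frame is not None:
        frame_meta = pyds.NvDsFrameMeta.cast(l_frame.data)
        objects = []
        l_obj = frame_meta.obj_meta_list
        while l_obj is not None:
            obj = pyds.NvDsObjectMeta.cast(l_obj.data)
            rect = obj.rect_params
            objects.append({
                "class": obj.obj_label,
                "confidence": round(obj.confidence, 2),
                "bbox": [int(rect.left), int(rect.top),
                         int(rect.width), int(rect.height)],
            })
            l_obj = l_obj.next
        sensor = f"cam-{frame_meta.source_id:02d}"
        client.publish(f"store/{sensor}/detections",
                       json.dumps({"sensorId": sensor, "objects": objects}))
        l_frame = l_frame.next
    return Gst.PadProbeReturn.OK
```

In a full application this function is attached to the tracker's source pad with `pad.add_probe(Gst.PadProbeType.BUFFER, tracker_src_pad_probe, 0)`; the `action`/`duration_seconds` attributes in the payload above come from downstream dwell logic, which is omitted here.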
This JSON stream is consumed by a local dashboard (Grafana) for the Store Manager and synced to the cloud for Head Office analytics.
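On the consuming side, anything that speaks MQTT can subscribe. A sketch of a dwell-alert consumer, under the same assumptions (`paho-mqtt` >= 2.0, the hypothetical `store/+/detections` topic layout):

```python
import json
import paho.mqtt.client as mqtt

DWELL_ALERT_SECONDS = 60  # assumed alerting threshold

def on_message(client, userdata, msg):
    """Flag any tracked person dwelling past the threshold."""
    event = json.loads(msg.payload)
    for obj in event.get("objects", []):
        attrs = obj.get("attributes", {})
        if (attrs.get("action") == "dwelling"
                and attrs.get("duration_seconds", 0) > DWELL_ALERT_SECONDS):
            print(f"ALERT {event['sensorId']}: person dwelling "
                  f"{attrs['duration_seconds']}s")

client = mqtt.Client(mqtt.CallbackAPIVersion.VERSION2)
client.on_message = on_message
client.connect("localhost", 1883)
client.subscribe("store/+/detections")
client.loop_forever()
```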
Conclusion: By moving the eyes of the AI into the store, we turn "Loss Prevention" from a reactive investigation of yesterday's tapes into a proactive intervention in real-time.