GPU Migration: Accelerating AI Extraction Pipeline with RTX 5060

Date: 2026-05-24
Category: Infrastructure, GPU Computing, MLOps
Tags: NVIDIA, CUDA, Docker, Whisper, CLIP, YOLO, PyTorch

The Problem: CPU Inference Too Slow for Production

Our media platform runs a multimodal AI extraction pipeline: Whisper for audio transcription, CLIP for visual embeddings, and YOLO for object detection. On our production VM (AMD EPYC 7542, 32 GB RAM), a 51-minute video took more than 75 minutes to transcribe. That is slower than real time: the system took longer to process the video than it takes to watch it.

For a content management workflow where editors are waiting on metadata before they can work, this was a blocker.

The Hardware: RTX 5060 on a Dedicated AI Server

We identified a dedicated machine with an NVIDIA GeForce RTX 5060 (8 GB VRAM) running Ubuntu 22.04. The challenge: it was already hosting 18 Docker containers for other projects, plus a systemd-managed GPU worker service consuming up to 3.6 GB VRAM.

Component	Specification
CPU	Intel Core i7-6700 @ 3.40 GHz (4C/8T)
RAM	16 GB DDR4
GPU	NVIDIA GeForce RTX 5060 (8 GB VRAM)
NVIDIA Driver	595.71.05
CUDA	13.2 (driver-level)
Docker	29.4.3 with Compose 5.1.3

Step 1: Enabling Docker GPU Access

By default, Docker containers can't see the host GPU. The nvidia-container-toolkit bridges this gap:

# Add NVIDIA package repository
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey \
  | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg

curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list \
  | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' \
  | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit

# Configure Docker runtime
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

# Verify GPU passthrough
docker run --rm --gpus all nvidia/cuda:12.6.3-base-ubuntu22.04 nvidia-smi

Lesson learned: Installing the toolkit updated NVIDIA userspace libraries (595.58.03 to 595.71.05) but the kernel module stayed at the old version. nvidia-smi failed with "Driver/library version mismatch". A full server reboot was required. After reboot, all 18+ containers plus the GPU worker restarted simultaneously, spiking the load average to 27 on 8 threads.
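A mismatch like this can be detected before containers start failing by comparing the version of the loaded kernel module with the version of the freshly installed userspace libraries. The sketch below is illustrative only: it assumes the kernel module version can be read from /proc/driver/nvidia/version, and the helper names (parse_kernel_module_version, needs_reboot) are hypothetical.

```python
import re
from pathlib import Path
from typing import Optional

# Assumption: the loaded NVIDIA kernel module reports its version in this file.
PROC_VERSION = Path("/proc/driver/nvidia/version")

# Matches dotted version numbers such as 595.58.03 (but not "x86_64").
_VERSION_RE = re.compile(r"\b(\d+\.\d+(?:\.\d+)?)\b")


def parse_kernel_module_version(text: str) -> Optional[str]:
    """Return the driver version found on the 'NVRM version' line, if any."""
    for line in text.splitlines():
        if line.startswith("NVRM version"):
            match = _VERSION_RE.search(line)
            return match.group(1) if match else None
    return None


def needs_reboot(kernel_text: str, userspace_version: str) -> bool:
    """True when the loaded module and the userspace libraries disagree."""
    kernel_version = parse_kernel_module_version(kernel_text)
    return kernel_version is not None and kernel_version != userspace_version


# Example with the two versions from this migration (sample text, not real output).
sample = "NVRM version: NVIDIA UNIX x86_64 Kernel Module  595.58.03  (build info)\n"
print(needs_reboot(sample, "595.71.05"))
```

On a real host, the first argument would be PROC_VERSION.read_text() and the second the version of the installed userspace packages, however you choose to look that up.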

Step 2: Building GPU-Accelerated Docker Images

We used nvidia/cuda:12.6.3-runtime-ubuntu22.04 as our base image — the runtime variant, not devel, because we run pre-compiled models (no CUDA compilation needed).

Docker Compose requests GPU access through the deploy section:

services:
  audio-extraction:
    image: audio-extraction-gpu
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    environment:
      - DEVICE=cuda
      - COMPUTE_TYPE=float16
      - WHISPER_MODEL=large-v3

The key configuration change: DEVICE=cuda and COMPUTE_TYPE=float16 instead of the CPU defaults (cpu / int8). The Python services already had auto-detection — if CUDA is available, they use it. The Whisper service uses faster-whisper with CTranslate2, while vision uses PyTorch with CLIP and YOLO.
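The auto-detection logic is not shown in this post, so the following is only a minimal sketch of how a service might combine the DEVICE and COMPUTE_TYPE variables with a CUDA availability check. The function name resolve_device is hypothetical, and the CPU fallback values (cpu / int8) are the defaults mentioned above.

```python
from typing import Mapping, Tuple


def resolve_device(env: Mapping[str, str], cuda_available: bool) -> Tuple[str, str]:
    """Pick (device, compute_type), falling back to the CPU defaults."""
    device = env.get("DEVICE", "cuda" if cuda_available else "cpu")
    if device == "cuda" and not cuda_available:
        device = "cpu"  # GPU requested, but none is visible in the container

    default_compute = "float16" if device == "cuda" else "int8"
    compute_type = env.get("COMPUTE_TYPE", default_compute)
    if device == "cpu" and compute_type == "float16":
        compute_type = "int8"  # use the CPU default from the post instead
    return device, compute_type


# How cuda_available is determined is up to each service, for example
# torch.cuda.is_available() in the PyTorch-based vision service.
print(resolve_device({"DEVICE": "cuda", "COMPUTE_TYPE": "float16"}, cuda_available=True))
print(resolve_device({"DEVICE": "cuda", "COMPUTE_TYPE": "float16"}, cuda_available=False))

# The result could then be passed to faster-whisper, e.g.:
# model = WhisperModel("large-v3", device=device, compute_type=compute_type)
```

Keeping the decision in one pure function makes the fallback path easy to test without a GPU.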

Step 3: Solving the Blackwell sm_120 Compatibility Issue

This was the trickiest problem. The RTX 5060 uses NVIDIA's Blackwell architecture, which has a brand-new compute capability of 12.0 (sm_120). The initial vision-extraction image used PyTorch with CUDA 12.4 wheels (cu124), which only ship kernels up to sm_90 (Hopper) and therefore include none for Blackwell.

Symptoms: CLIP and YOLO models loaded without errors, but all CUDA kernels silently failed. The models fell back to CPU-like performance inside the GPU container. VRAM showed only model weights loaded (806 MiB) instead of full CUDA context.

The fix: Upgrade to PyTorch 2.11.0 with CUDA 12.8 wheels (cu128), which includes sm_120 Blackwell kernels:

RUN pip install torch==2.11.0+cu128 torchvision==0.22.0+cu128 \
    --index-url https://download.pytorch.org/whl/cu128

After this fix, VRAM usage jumped from 806 MiB to 4,170 MiB — confirming that CUDA kernels were now running properly. CLIP embeddings dropped from 138 ms to 53 ms per frame.
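A quick way to catch this class of problem at container start is to compare the GPU's compute capability with the architectures the installed wheel was built for. The sketch below is a simplified check under stated assumptions: it only looks at sm_XY binary entries (PTX entries such as compute_XY are ignored), the function name is hypothetical, and the CU124_ARCHS list is illustrative rather than copied from a real wheel.

```python
import re
from typing import Iterable, Tuple

# Splits "sm_120" into major 12 / minor 0, tolerating suffixes like "sm_90a".
_SM_RE = re.compile(r"^sm_(\d+)(\d)[a-z]?$")


def wheel_has_kernels_for(capability: Tuple[int, int], arch_list: Iterable[str]) -> bool:
    """True if some compiled arch can run on a GPU with this capability."""
    major, minor = capability
    for arch in arch_list:
        match = _SM_RE.match(arch)
        if not match:
            continue  # not an sm_XY binary entry; ignored in this sketch
        built_major, built_minor = int(match.group(1)), int(match.group(2))
        # Binaries run on the same major architecture with an equal or newer minor.
        if built_major == major and built_minor <= minor:
            return True
    return False


# Illustrative arch list for an older wheel (assumption, not measured).
CU124_ARCHS = ["sm_50", "sm_60", "sm_70", "sm_75", "sm_80", "sm_86", "sm_90"]

print(wheel_has_kernels_for((12, 0), CU124_ARCHS))
print(wheel_has_kernels_for((12, 0), CU124_ARCHS + ["sm_100", "sm_120"]))

# With PyTorch installed, the real inputs would be:
# wheel_has_kernels_for(torch.cuda.get_device_capability(0), torch.cuda.get_arch_list())
```

Failing fast on a False result at startup is cheaper than discovering the problem through low VRAM usage and slow inference.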

Why audio-extraction wasn't affected: The Whisper service uses faster-whisper built on CTranslate2, which has its own CUDA backend independent of PyTorch. It worked with cu124 out of the box.

Step 4: Managing VRAM Constraints (8 GB Budget)

With only 8 GB VRAM, every megabyte counts. We discovered a pre-existing gpu_worker.service (Whisper, BLIP, BART, NLLB, Gemma models) that consumed up to 3,600 MiB when fully loaded.

The conflict: The extraction services needed ~5,368 MiB. Together with the gpu_worker's 3,600 MiB, that totals 8,968 MiB, which exceeds the 8,151 MiB available.

Resolution: Stopped gpu_worker.service by team agreement. Final VRAM allocation:

Service	VRAM Usage	When Active
Whisper large-v3 (float16)	3,856 MiB	During audio transcription
CLIP ViT-B/32 + YOLO v8	746-4,170 MiB	During visual analysis
Combined peak	5,470 MiB / 8,151 MiB (67%)	Concurrent audio + vision

A Bull/Redis task queue serializes jobs, naturally preventing VRAM overflow from concurrent heavy inference.
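The budget arithmetic above can be written down as a small check. This is a sketch using the figures from this section; the function names are hypothetical, and parse_memory_csv assumes the one-line "used, total" output of nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader,nounits.

```python
from typing import Dict, Tuple

TOTAL_VRAM_MIB = 8151  # usable VRAM reported for the RTX 5060 in this post


def fits_in_vram(demands_mib: Dict[str, int], total_mib: int = TOTAL_VRAM_MIB) -> Tuple[bool, int]:
    """Return (fits, headroom_mib) for a set of per-service VRAM demands."""
    required = sum(demands_mib.values())
    return required <= total_mib, total_mib - required


def parse_memory_csv(line: str) -> Tuple[int, int]:
    """Parse a 'used, total' line (MiB, no units) into two integers."""
    used, total = (int(part.strip()) for part in line.split(","))
    return used, total


before = {"extraction services": 5368, "gpu_worker": 3600}
after = {"extraction services": 5368}

print(fits_in_vram(before))  # (False, -817)
print(fits_in_vram(after))   # (True, 2783)
```

A negative headroom, as in the first case, is the 817 MiB shortfall that led to stopping gpu_worker.service.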

Step 5: Rolling Migration Strategy

The CPU services were never stopped. The migration was a rolling cutover — deploy GPU version alongside, test, then switch one URL at a time:

orchestrator (VM2) ─── audio ── VM2 CPU (fallback)
    switches one URL     |
    at a time           └── GPU server (active)  <- switch first
                        ── vision ── VM2 CPU (fallback)
                            |
                            └── GPU server (active)  <- switch second
                        ── nlp ──── VM2 CPU (stays, no GPU benefit)

Each cutover was a single environment variable change:

# Before (CPU)
AUDIO_SERVICE_URL=http://audio-extraction:3008

# After (GPU)
AUDIO_SERVICE_URL=http://192.168.22.111:3008

Rollback: Change the URL back. This takes under 1 minute per service, because the CPU version is still running.
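The cutover in this post was a manual environment variable change. As a possible pre-cutover check, a small script could probe the GPU service and report which URL to use. This is only a sketch: the /health endpoint is a hypothetical assumption (the post does not describe the services' HTTP API), as are the function names.

```python
import urllib.request
from typing import Callable


def http_probe(base_url: str, timeout: float = 2.0) -> bool:
    """True if GET {base_url}/health returns HTTP 200 (hypothetical endpoint)."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as response:
            return response.status == 200
    except OSError:  # covers connection errors, HTTP errors, and timeouts
        return False


def pick_service_url(gpu_url: str, cpu_url: str,
                     probe: Callable[[str], bool] = http_probe) -> str:
    """Prefer the GPU service when it answers, otherwise keep the CPU fallback."""
    return gpu_url if probe(gpu_url) else cpu_url


# Example with a stub probe so nothing touches the network here.
chosen = pick_service_url(
    "http://192.168.22.111:3008",    # GPU server URL from the post
    "http://audio-extraction:3008",  # CPU fallback URL from the post
    probe=lambda url: False,         # pretend the GPU service is down
)
print(f"AUDIO_SERVICE_URL={chosen}")
```

Injecting the probe as a parameter keeps the selection logic testable without a running service.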

Results: Production Benchmarks

Audio Transcription (Whisper large-v3)

Tested on a 51-minute video, end-to-end through the full consumer pipeline:

Phase	CPU (EPYC 7542)	GPU (RTX 5060)	Speedup
Audio load + VAD	18.5s	8.4s	2.2x
Language detection	29.3s	6.0s	4.9x
Whisper transcription	2,940.8s (49.0 min)	546.8s (9.1 min)	5.4x
Total pipeline	4,866.0s (81.1 min)	561.4s (9.4 min)	8.7x

On CPU, the video took longer to transcribe than to watch (about 0.6x real-time speed). On GPU, the full pipeline runs at about 5.4x real-time speed.
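The real-time factors follow directly from the table: media duration divided by total processing time. The sketch below assumes the video is exactly 51:00 long (the post only says "51-minute"), so the GPU figure comes out slightly above the rounded value quoted in the text.

```python
def realtime_factor(media_seconds: float, processing_seconds: float) -> float:
    """Seconds of media processed per second of wall-clock time."""
    return media_seconds / processing_seconds


def speedup(baseline_seconds: float, accelerated_seconds: float) -> float:
    """How many times faster the accelerated run is than the baseline."""
    return baseline_seconds / accelerated_seconds


MEDIA_SECONDS = 51 * 60   # assumption: exactly 51 minutes
CPU_TOTAL = 4866.0        # total pipeline on CPU, from the table
GPU_TOTAL = 561.4         # total pipeline on GPU, from the table

print(round(realtime_factor(MEDIA_SECONDS, CPU_TOTAL), 2))  # 0.63
print(round(realtime_factor(MEDIA_SECONDS, GPU_TOTAL), 2))  # 5.45
print(round(speedup(CPU_TOTAL, GPU_TOTAL), 2))              # 8.67
```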

Vision Analysis (CLIP + YOLO)

Tested on 10 frames of 1920x1080 video:

Metric	CPU (EPYC 7542)	GPU (RTX 5060)	Speedup
CLIP embedding (pure inference)	138 ms	53 ms	2.7x
Full vision pipeline	1,588 ms	1,924 ms	0.8x (CPU-bound OpenCV)

The full vision pipeline is actually slower on the GPU server because it's dominated by CPU-bound OpenCV operations (color analysis, brightness, sharpness). The i7-6700 is slower than the EPYC for these tasks. But CLIP inference — the AI part — is 2.7x faster. At scale across a content library, this matters.
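Separating "pure inference" from "full pipeline" time, as in the table above, requires per-stage timing. The post does not show how its numbers were collected; the following is a minimal sketch of one way to do it, with hypothetical stage names and stand-in workloads in place of the real OpenCV and CLIP calls.

```python
import time
from collections import defaultdict
from contextlib import contextmanager
from typing import Dict, Iterator

# Accumulated wall-clock seconds per named stage.
stage_seconds: Dict[str, float] = defaultdict(float)


@contextmanager
def timed(stage: str) -> Iterator[None]:
    """Add the elapsed time of the enclosed block to stage_seconds[stage]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        stage_seconds[stage] += time.perf_counter() - start


def slowest_stage(timings: Dict[str, float]) -> str:
    """Name of the stage with the largest accumulated time."""
    return max(timings, key=timings.get)


# Stand-in workloads; real code would call OpenCV and the CLIP model here.
with timed("opencv_metrics"):
    sum(i * i for i in range(200_000))
with timed("clip_inference"):
    # For real GPU work, call torch.cuda.synchronize() before leaving the
    # block, because CUDA kernels are launched asynchronously.
    sum(range(1_000))

print(slowest_stage(stage_seconds))
```

Timings like these are what make it visible that the CPU-bound image metrics, not the model, dominate the vision pipeline.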

Full Extraction Pipeline

Extractor	Model	Time	Confidence
Audio (GPU)	Whisper large-v3	11 min 47 sec	0.93
Vision (GPU)	CLIP ViT-B/32	18.7 sec	0.84
NLP (CPU)	DistilBERT	2.6 sec	0.78
Media	FFmpeg	<1 sec	1.00
Total		~13 min

Down from 90-120 minutes on CPU. A 7-9x improvement.

Architecture: Before and After

    Application VMs (192.168.12.x)           GPU Server (192.168.22.x)

    VM1 (Database)                           AI Server
    +-----------------+                      +------------------------+
    | PostgreSQL      |                      | audio-extraction (GPU) |
    | Redis           |<---- network ---->   | vision-extraction (GPU)|
    | MinIO / Kafka   |                      | NVIDIA RTX 5060        |
    +--------+--------+                      +-----------+------------+
             |                                           |
    VM2 (Services)                                       |
    +--------+--------+                                  |
    | orchestrator  --+---- HTTP extraction calls ------+
    | api-gateway     |
    | auth, content   |
    | audio (CPU)     | <- fallback, still running
    | vision (CPU)    | <- fallback, still running
    | nlp (CPU)       | <- stays here (no GPU benefit)
    +-----------------+

Key Takeaways

Match CUDA wheels to your GPU architecture. The RTX 5060 (Blackwell, sm_120) needs PyTorch cu128. cu124 loads models but silently falls back to non-accelerated kernels. Check torch.cuda.get_device_capability(), which returns (12, 0) on this GPU, and confirm that sm_120 appears in torch.cuda.get_arch_list().
VRAM is the bottleneck, not compute. With 8 GB, Whisper large-v3 (3.8 GB) and CLIP+YOLO (up to 4.2 GB) share one card, with a measured combined peak of 5,470 MiB (67% of 8,151 MiB). Serialize heavy jobs via a task queue to prevent OOM.
Rolling migration beats big-bang. Deploy GPU alongside CPU, test through the full consumer path, switch one service at a time. Rollback is a 1-minute URL change.
Not everything benefits from GPU. NLP (spaCy, DistilBERT) at our scale uses 0.13% CPU and 1.19 GB RAM — moving it to GPU adds complexity with zero benefit. Vision's OpenCV pipeline is CPU-bound. Know where your bottleneck actually is.
nvidia-container-toolkit installation may require a reboot. Driver/library version mismatches after install are common. Plan for downtime.
Pre-bake models in Docker images. Whisper large-v3 is 3+ GB. Downloading on first container start means a 3+ minute cold start on HDD. Bake it into the image during build.

This migration was executed on production enterprise infrastructure, processing real media content.
