From Docker Compose to Proxmox: A Developer's Migration Story
Series: Developer’s Guide to Homelab Infrastructure (Part 2 of 7)
In Part 1, I told the story of Claude Code autonomously curling my Proxmox API. That got people’s attention. But it raised a question I kept getting: “Wait, why do you even have a Proxmox setup? What was wrong with Docker Compose?”
Fair question. Let me take you back to where this started.
The Starting Point: Docker Compose on a Raspberry Pi
Six months ago, my entire homelab was a Raspberry Pi 5 running Docker Compose. Everything lived there. Forgejo for Git hosting. PostgreSQL with primary/replica for the AI platform. LiteLLM as an API gateway. OpenWebUI as the frontend. Traefik for routing. Monitoring. Logging. All of it, stacked on a single-board computer.
And honestly? It worked. Docker Compose is beautiful in its simplicity. You write a YAML file, run docker compose up -d, and you have a service. Need another one? Add a block to the file. Need to tear it down? docker compose down. No PhD in distributed systems required.
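To make that concrete, here is a minimal sketch of what a stack like this looks like. The service names, images, and ports are illustrative, not the actual files from this setup:

```yaml
# Illustrative compose file — services, images, and ports are examples.
services:
  openwebui:
    image: ghcr.io/open-webui/open-webui:main
    ports:
      - "3000:8080"
    restart: unless-stopped

  postgres:
    image: postgres:16
    environment:
      POSTGRES_PASSWORD: changeme   # use a secret store in practice
    volumes:
      - pgdata:/var/lib/postgresql/data
    restart: unless-stopped

volumes:
  pgdata:
```

One `docker compose up -d` and both services are running. That immediacy is exactly the appeal described above.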
I had somewhere between 10 and 20 services running this way. The AI platform alone was a multi-service stack – OpenWebUI talking to LiteLLM talking to Ollama and PostgreSQL with PGVector for embeddings and MinIO for document storage. Add in monitoring, logging, a reverse proxy, and the usual homelab suspects, and that little RPi5 was punching well above its weight.
What worked:
- Speed of iteration. Want to try a new service? Write the compose file, deploy, done.
- Familiarity. Every developer knows Docker. Zero learning curve.
- The YAML is the documentation. Your entire infrastructure is right there in a few files.
What didn’t:
- Single point of failure. One SD card corruption, one kernel panic, and everything goes dark. All of it. At once.
- No isolation. Every service sharing the same Docker daemon. No network segmentation. No firewall between services. If one container gets compromised, it has a straight shot to everything else.
- Resource constraints. An ARM single-board computer has limits. And when cAdvisor decides to eat 255% CPU while your CI runner is also fighting for resources, those limits become very real very fast.
But the trigger wasn’t any one of these problems. It was the accumulation.
Why I Left (The Triggers)
I looked at my Ansible directory one day and counted 23 playbook files – just for managing Docker stacks. Twenty-three. For what was supposed to be a simple setup. Each playbook deploying a compose file, managing environment variables, handling secrets, configuring networks. The “simple” Docker Compose approach had grown its own complexity layer that nobody talks about.
Here’s the thing: Docker Compose scales vertically just fine. Add more services, add more YAML blocks. But it doesn’t scale operationally. When you need to manage secrets properly, segment networks, enforce resource limits, handle backups, coordinate deployments across services that depend on each other – you end up building all of that yourself. You build a platform around Docker Compose because Docker Compose isn’t a platform.
The isolation problem kept nagging at me. I had services that handled API keys sitting on the same network as services exposed to the internet. No firewall rules between them. No VLAN segmentation. The only thing separating my PostgreSQL database from the public-facing reverse proxy was… Docker’s default bridge network. That’s not security. That’s hope.
And then there was resilience. I had a single RPi5 running production services. Not “production” in the enterprise sense – but production in the sense that real people used these services daily. When that Pi went down for maintenance, everything went down. No failover. No graceful degradation. Just darkness.
I knew I needed something between Docker Compose and… well, the next thing. Whatever that was.
You might be thinking: “Why not Kubernetes?” Yeah, I thought about it too. The mini PC I bought is literally named k8plus – that should tell you where my head was at initially. But here’s my position, and I’ll stand by it: you don’t need a Kubernetes cluster to start and scale. Kubernetes solves problems I don’t have. It introduces complexity I don’t need. For a homelab running 10-20 services? LXC containers on Proxmox give you the isolation and resource management of VMs with the lightweight footprint of containers. Without the YAML nightmares of Kubernetes manifests, the networking abstractions, the control plane overhead.
The answer wasn’t Kubernetes. It was a proper hypervisor.
Choosing the Stack
Every technology choice in this stack was deliberate. And every single one is open source. That’s not an accident.
Proxmox over ESXi: VMware’s licensing situation has been a slow-motion disaster. But even before Broadcom made it worse, Proxmox was the better choice for what I needed. Native LXC support. Built-in SDN with VLANs and subnets. Firewall at the datacenter, host, and container level. A web UI that doesn’t make you want to throw your monitor. And it’s free. Actually free, not “free tier with an asterisk” free.
OpenTofu over Terraform: HashiCorp’s BSL license change was a wake-up call. OpenTofu is the community fork that kept the open-source promise. Same HCL syntax, same provider ecosystem, same workflow. But without the licensing uncertainty. When you’re building infrastructure that you want to maintain for years, you don’t want to worry about whether your IaC tool will change its license on you. Again.
Ansible over Salt or Puppet: Agentless. That’s the word that matters. I don’t want to install and maintain agents on every LXC container and Raspberry Pi. SSH is already there. Ansible connects, does its thing, disconnects. The YAML playbook format is readable. The Galaxy ecosystem gives you roles for almost everything. And it’s the natural bridge between OpenTofu (which provisions the infrastructure) and the actual configuration of what runs on it.
Forgejo over GitHub: Self-hosted Git with built-in CI/CD through Forgejo Actions. Same Actions workflow syntax as GitHub, so the learning curve is zero. But my code, my runner, my infrastructure. No external dependency for the most critical part of my development workflow.
Bitwarden for secrets management: I was already using Bitwarden personally. The CLI integrates cleanly with Ansible lookups. My LiteLLM playbook pulls its database URL from Bitwarden at deploy time. No more secrets.auto.tfvars files sitting in repositories – which, as I confessed in Part 1, is how Claude Code found my Proxmox API credentials in the first place.
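A sketch of what that pattern can look like in Ansible, using the `community.general.bitwarden` lookup plugin. This assumes the `bw` CLI is installed and unlocked (a valid `BW_SESSION`); the item name, host group, and file paths are hypothetical:

```yaml
# Hypothetical playbook fragment — secret is fetched at deploy time,
# never stored in the repository.
- name: Deploy LiteLLM with its database URL from Bitwarden
  hosts: litellm
  vars:
    # The lookup returns a list of matching values, hence "| first".
    litellm_database_url: "{{ lookup('community.general.bitwarden', 'litellm-db-url', field='password') | first }}"
  tasks:
    - name: Render LiteLLM environment file
      ansible.builtin.template:
        src: litellm.env.j2
        dest: /opt/litellm/.env
        mode: "0600"
```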
The common thread: every choice deliberately avoids vendor lock-in. When your infrastructure depends on tools that can change their terms on you, your infrastructure isn’t really yours. Open source is the only license model where you don’t have to trust someone else’s business decisions.
The Hardware
Let me demystify what “a homelab” actually looks like in physical terms, because the word conjures images of server racks with blinking lights and enterprise cooling. Mine is not that.
The mini PC (k8plus): This is the Proxmox host. A compact, quiet, efficient machine with enough cores and RAM to run 13 LXC containers comfortably. It has link aggregation configured – two network interfaces bonded together for bandwidth and redundancy. It sits on a shelf next to my router. No fans screaming. No dedicated server room. Just a box the size of a thick paperback.
The Raspberry Pi 5 (rpi5): Still in active service. Docker Compose stacks that don’t need LXC isolation continue to run here, with Traefik handling service discovery for them. Some services work better as Docker stacks – the DEQ dashboard ended up moving back here from Proxmox, which I’ll get to.
The Raspberry Pi 3 (rpi3): This thing refuses to die. It’s old. It’s slow. I enabled zswap on it for compressed memory because 1GB of RAM is… not generous. But it still runs Vector for log shipping, and it still participates in the monitoring mesh. Sometimes the right infrastructure decision is “it works, leave it alone.”
The 2009 Buffalo LinkStation: Seventeen years old. Assembled in Japan. Still working. Not fast, but working. It connects to the network via SMBv1 – a protocol so outdated and insecure that modern systems refuse to speak it. My solution? An LXC container that acts as a protocol bridge: it mounts the NAS using SMBv1 inside the container, then re-shares the storage using modern SMBv3 to Proxmox. The insecure traffic is air-gapped inside one container. A terabyte of recycled storage for backups and ISOs, zero dollars spent on new hardware.
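The bridge container boils down to two config fragments: a CIFS mount pinned to SMBv1, and a Samba re-share pinned to SMBv3. All IPs, paths, and share names below are illustrative:

```
# Inside the bridge LXC — IPs, paths, and share names are made up.

# /etc/fstab: mount the legacy NAS with SMBv1
# (only this container ever speaks the insecure protocol)
//192.168.1.50/share  /mnt/linkstation  cifs  vers=1.0,credentials=/root/.smbcred  0  0

# /etc/samba/smb.conf: re-share the mount, modern SMB only
[global]
server min protocol = SMB3

[nas-bridge]
path = /mnt/linkstation
read only = no
```

The `vers=1.0` mount option is what lets the container talk to the old hardware; the `server min protocol = SMB3` line is what keeps that dialect from leaking past the container boundary.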
The Hetzner VPS (vps01): Running Ubuntu 24.04, managed through the same OpenTofu and Ansible stack as everything else. This is the external-facing tier – the edge of the homelab. HAProxy routes public traffic, and Tailscale handles the VPN mesh back to the home network.
This is what “running 19 services behind a reverse proxy” looks like physically: a mini PC, two Raspberry Pis, a NAS that predates the iPhone 4, and a cheap VPS. Total monthly cost: the electricity for the home devices plus about 20 euros for the VPS.
The Migration: 13 Containers, Zero Downtime
The migration itself was a delicate operation. I already had services running on Proxmox that I’d set up manually through the web UI. The challenge wasn’t creating new containers – it was bringing existing containers under Infrastructure as Code control without destroying them.
OpenTofu’s import blocks made this possible. You declare the resource in your .tf file with the configuration it already has, add an import block pointing to its existing ID, and run tofu plan. If the plan shows no changes, your code matches reality. If it shows drift, you fix the code until it matches. Then tofu apply – and suddenly that manually-created container is under version control.
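Here is what that workflow can look like. The resource attributes, VMID, and import ID format are illustrative (shown in the style of the Telmate Proxmox provider, where the ID is `node/lxc/vmid`) and would need to match your actual provider and container:

```hcl
# Illustrative: bring a manually created container under IaC control.
# Attribute values must match what the running container already has.
resource "proxmox_lxc" "forgejo" {
  target_node = "k8plus"
  hostname    = "forgejo"
  cores       = 2
  memory      = 2048
  # ...remaining attributes mirroring the live container...
}

# The import block tells OpenTofu which existing resource this code describes.
import {
  to = proxmox_lxc.forgejo
  id = "k8plus/lxc/101"
}
```

Run `tofu plan`: zero proposed changes means the code matches reality, and `tofu apply` completes the import without touching the running container.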
I wrote the migration plan in a design document: rpi5-to-proxmox-migration-design.md. Not because I’m a process purist, but because importing 13 containers into IaC while they’re running production services is exactly the kind of thing you don’t want to improvise.
The OpenTofu structure is deliberately flat:
- lxc.tf – All LXC container definitions
- vm.tf – Virtual machine definitions
- firewall.tf – Firewall rules, security groups, IPSets
- hetzner.tf – The external VPS infrastructure
- containers.auto.tfvars – Container configuration values
No modules. No nested directories. No abstractions that hide what’s actually being provisioned. When your infrastructure breaks at 11 PM, you want to open one file and see exactly what’s defined. Not chase imports through three levels of module nesting.
Every container has prevent_destroy lifecycle protection. This is non-negotiable. One mistyped tofu destroy command without this protection, and you lose production services. With it, OpenTofu refuses to destroy the resource and throws an error. You have to explicitly remove the protection first, which is exactly the kind of friction that prevents accidents.
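The guard itself is a one-line lifecycle block. The resource name here is illustrative:

```hcl
# With this in place, `tofu destroy` fails with an error for this
# resource instead of deleting it. Resource name is an example.
resource "proxmox_lxc" "postgres" {
  # ...container definition...

  lifecycle {
    prevent_destroy = true
  }
}
```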
The DEQ reversal deserves its own mention. I initially moved the DEQ dashboard into its own LXC container on Proxmox. Seemed logical – everything else was moving there. But it didn’t fit well. The deployment was more complex than it needed to be. The service was lightweight enough that Docker Compose on the RPi5 was actually the better home for it. So I destroyed the LXC container via OpenTofu and retargeted the Ansible playbook back to the RPi5.
Here’s what I learned from that: migration is not a one-way street. Not every service benefits from being an LXC container. The right answer for “where should this run?” is sometimes “where it was before.” The goal isn’t to move everything to the shiny new platform. The goal is to put each service where it makes the most sense.
The Networking Rabbit Hole
If the migration itself was delicate, the networking was a multi-day debugging odyssey. This is the section where I admit I spent more time than I’d like staring at packet captures.
Proxmox SDN gives you enterprise networking features: VLANs, zones, and subnets defined through the web UI or API. My containers live in segmented network zones. Services that need to talk to each other are on the same zone. Services that don’t aren’t. This is the isolation that Docker’s bridge network never gave me.
The HAProxy + Traefik split confused a few people who looked at my setup. Why two reverse proxies? Because they serve different worlds. HAProxy sits on the Proxmox SDN side, routing traffic to 19 services across LXC containers. It also acts as an SSH bastion. Traefik sits on the Docker side, handling service discovery for Docker Compose stacks on the RPi5. They don’t compete. They complement. HAProxy for the hypervisor tier, Traefik for the container tier.
The firewall architecture follows defense-in-depth with a kill-switch pattern. At the datacenter level, a default-deny rule drops everything that isn’t explicitly allowed. Security groups define common access patterns – the sdn-base group, for example, only accepts traffic from IPs in the SDN IPSet. Every new container gets this baseline automatically. Then container-specific rules layer on top.
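As a rough sketch in Proxmox firewall config syntax (the `/etc/pve/firewall/cluster.fw` format) – the group name, IPSet name, and subnet are illustrative, and in practice these are managed through firewall.tf rather than edited by hand:

```
[OPTIONS]
# datacenter-level kill switch: default deny inbound
enable: 1
policy_in: DROP

[IPSET sdn]
# illustrative SDN subnet
10.10.0.0/24

[group sdn-base]
# baseline applied to every container: accept only from the SDN IPSet
IN ACCEPT -source +sdn
```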
And then there was the Tailscale routing saga. This one cost me days.
I needed Tailscale subnet routing so I could access SDN containers remotely through the VPN mesh. Sounds straightforward. The k8plus mini PC advertises the SDN subnets to Tailscale. Remote devices connect through Tailscale. Traffic routes to k8plus, which forwards it to the SDN. Easy, right?
Wrong. First attempt: SNAT approach. Masquerade the Tailscale traffic so it appears to come from k8plus. This worked, but it was ugly – you lose the original source IP, which breaks logging and audit trails.
Second attempt: Firewall IPSet modification. Instead of NAT, add the Tailscale IP ranges to the SDN firewall’s accepted IPSet. Cleaner. No NAT. Original IPs preserved. But it didn’t work.
The culprit: br_netfilter conntrack zones. Packets coming through the Tailscale tunnel were being classified as connection tracking state INVALID by the bridge netfilter module and dropped silently. This is one of those Linux networking corners where bridge filtering, connection tracking, and firewall rules interact in ways that the documentation doesn’t prepare you for.
The breakthrough came when I realized something about the network topology: Proxmox has direct SDN access. The traffic didn’t need to go through the bridge the way I thought it did. Once I understood the actual path packets took, the firewall IPSet approach worked perfectly. No SNAT. No hacks. Just correct rules in the correct place.
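The end state is conceptually simple. Something like the following, where the subnet is made up and the commands assume the standard Tailscale CLI:

```
# On the k8plus host — illustrative subnet, standard Tailscale flags.

# Advertise the SDN subnet into the tailnet:
tailscale up --advertise-routes=10.10.0.0/24

# Enable forwarding so the host will route for remote peers:
sysctl -w net.ipv4.ip_forward=1

# Then approve the advertised route in the Tailscale admin console,
# and allow the Tailscale address range in the SDN firewall IPSet.
```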
The lesson: networking problems are almost never about the configuration you think they’re about. They’re about the mental model you have of how traffic flows, and how that model is wrong.
What I Got Wrong
I believe in being honest about mistakes, because the internet has enough “here’s my perfect setup” posts. Here’s what I got wrong.
57% of my Docker containers had no memory limits. I discovered this during a resource audit and winced. More than half my containers were running unconstrained, free to eat as much RAM as they wanted. On a Raspberry Pi with limited memory, this is a ticking time bomb. One memory leak in one container, and the OOM killer starts making decisions for you. Fixed it. Should have been there from day one.
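The fix is a couple of lines per service in the compose file. Values here are examples, not the audited settings:

```yaml
# Per-service memory caps — limit values are illustrative.
services:
  openwebui:
    image: ghcr.io/open-webui/open-webui:main
    mem_limit: 512m      # hard RAM cap; OOM-kills this container, not the host
    memswap_limit: 512m  # equal to mem_limit: no swap allowance
```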
cAdvisor at 255% CPU. The monitoring tool designed to tell you about resource problems was itself the resource problem. cAdvisor was burning through CPU cycles on the docker-main host, likely because of the way it introspects container filesystems. I had to add CPU optimization flags and tune its collection intervals. Monitoring should observe, not participate.
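The flags in question are standard cAdvisor options; the specific values and the image tag below are examples of the kind of tuning involved, not the exact settings used:

```yaml
# Taming cAdvisor's own footprint — values are illustrative.
services:
  cadvisor:
    image: gcr.io/cadvisor/cadvisor:v0.49.1
    command:
      - --housekeeping_interval=30s              # collect less often
      - --docker_only=true                        # skip non-Docker cgroups
      - --disable_metrics=percpu,sched,tcp,udp    # drop expensive collectors
    mem_limit: 256m
```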
CI runner resource exhaustion. My Forgejo Actions runner uses Docker-in-Docker. Turns out running nested container builds on a resource-constrained host leads to exactly the kind of failures you’d expect. Jobs failing. Builds timing out. DNS resolution breaking inside the nested Docker environment. This required both resource limit tuning and a DNS fix in the runner configuration.
Overcomplicating the DEQ deployment. I covered this above – the lesson about migration not being a one-way street came directly from this mistake.
Multi-day SDN debugging. The Tailscale routing saga I described above took days, not hours. Routing rule priorities. SNAT vs firewall IPSets. Bridge netfilter conntrack zones. Each problem led to another problem. I could have saved time by drawing the actual network path on paper before touching any configuration. Instead, I configured by intuition and debugged by frustration.
What’s Different Now
Standing back and looking at where things are today versus six months ago:
Infrastructure as Code – everything. Every LXC container, every VM, every firewall rule is defined in OpenTofu files and checked into Forgejo. I can rebuild my entire infrastructure from scratch. Not theoretically – actually. The code is the truth.
Secrets in Bitwarden, not in files. Ansible lookups pull secrets from Bitwarden CLI at deploy time. No credentials in repositories. No secrets.auto.tfvars sitting in a directory where an AI agent might read them.
Full monitoring stack. VictoriaMetrics for metrics. Grafana for dashboards. VictoriaLogs for log aggregation. Vector deployed to every host and container, shipping logs to a central store. When something breaks, I can see what happened across every component in a correlated timeline.
Self-hosted CI/CD. Forgejo Actions with a self-hosted runner. Lint checks on Ansible playbooks. Deployment pipelines. Build validation. The same workflow I’d expect in any professional environment, but running on my own hardware, managed by my own code.
AI augmentation. This is the part that makes this homelab different from every other “I migrated to Proxmox” story. I have 6 specialized Claude Code agents for this infrastructure – a CTO orchestrator, plus specialists for OpenTofu, Ansible, networking, documentation, and validation. I have 13 homelab-specific skills: provisioning new containers, importing existing resources, debugging SDN issues, managing firewall rules. When I need to deploy a new service, I don’t start from scratch. I invoke a skill that knows the patterns, the conventions, the gotchas.
I’m not a system guy. I said that in Part 1 and it’s still true. But with the right tools, the right abstractions, and an AI pair programmer that understands my specific setup, I don’t need to be. I need to understand enough to make good decisions and recognize bad ones. The tools handle the rest.
What’s Next
In Part 3, we’ll go deeper into the technical details – the OpenTofu import workflow for bringing existing infrastructure under IaC, Bitwarden secrets integration in Ansible, and the decision framework for what belongs in an LXC container versus a Docker stack. The stuff that’s hard to find in documentation because it sits at the intersection of multiple tools.
If Part 1 was about why AI changes the infrastructure game, and Part 2 is about what the migration actually looks like, Part 3 is about how the individual pieces fit together.
I’m still learning this stuff. Every week I find something I configured wrong, or discover a better pattern, or realize that my mental model of how something works was incomplete. That’s not a bug in the process. That’s the process.
Previous in series: I Let AI Manage My Infrastructure (Part 1)
Next in series: Building an AI Agent Team to Run My Homelab (Part 3)