
Building a Poor Man's PKI using Terraform and cert-manager

When you're dealing with sensitive data, "secure enough" usually isn't enough. Enforcing TLS for all inter-service communication across our infrastructure has always been a strict requirement for us. Even though Azure uses wire encryption on their data plane, relying solely on cloud provider magic feels a bit uncomfortable when you're handling highly sensitive data. Call it paranoia, but we want explicit, end-to-end encryption. We want to own the trust layer.

Now, TLS for our internal services isn't actually new for us. Until recently, we were happily using Let's Encrypt. We used a public domain and relied on DNS challenges to provision certificates for our internal endpoints. It worked great - until it didn't.

We recently made the architectural decision to move all our internal DNS to a completely private zone. Since Let's Encrypt needs to verify domain ownership externally, moving to a private DNS zone meant our DNS challenges were dead in the water.

We needed a new way to sign certificates.

The Obvious Path vs. The Path I Took

Usually, this is the point in the architecture meeting where someone suggests deploying a full-blown service mesh like Istio, setting up a massive HashiCorp Vault infrastructure, or paying for a managed Private CA from a cloud provider.

All of those are valid, "enterprise" solutions. But they also come with a massive catch. Service meshes add a ton of operational complexity and compute overhead. Managed Private CAs cost a ridiculous amount of money just to sit there. We didn't need a bazooka, and we didn't want to burn cash; we needed a reliable, low-maintenance Swiss Army knife.

So, we decided to build a poor man's PKI using Terraform and cert-manager. Here is how we did it.

Step 1: The Root of All Trust (via Terraform)

The journey started at the very top. We needed a single, self-signed Root Certificate Authority (CA).

Instead of running raw OpenSSL commands on someone's laptop and losing the key, we codified the whole thing in Terraform using the hashicorp/tls provider. To make sure we wouldn't have to deal with the headache of expiring root certs anytime soon, we slapped a 20-year expiration date on it. (Future me in 2046 is going to be very annoyed during the rotation, but that's his problem).

resource "tls_private_key" "root" {
  algorithm = "RSA"
  rsa_bits  = 4096
}

resource "tls_self_signed_cert" "root" {
  private_key_pem       = tls_private_key.root.private_key_pem
  validity_period_hours = 175200 # Roughly 20 years
  is_ca_certificate     = true

  subject {
    common_name  = "MyCompany Internal Root CA"
    organization = "MyCompany"
  }
  
  # Standard key usages for a CA certificate
  allowed_uses = [
    "cert_signing",
    "crl_signing",
  ]
}

The beauty of this is that the Root CA's private key never leaves our highly restricted Terraform state backend, keeping it completely out of the blast radius of our application clusters.

Step 2: The Base Images

Having a Root CA is useless if your services don't actually trust it. We didn't want our developers to have to manually configure TLS validation or mount trust stores every time they spun up a new microservice.

So, we took the public certificate of our new Root CA and baked it directly into our company's base Docker images. We updated the system trust stores at the OS level (e.g., update-ca-certificates in Alpine/Debian).

Now, whenever an engineering team builds a service extending from our base images, their code inherently trusts any certificate signed by our internal PKI. No code changes required.
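In practice, the base-image step can be as small as a couple of Dockerfile lines. This is an illustrative sketch (image name and cert filename are made up); `update-ca-certificates` picks up any `.crt` file dropped into `/usr/local/share/ca-certificates`:

```dockerfile
FROM debian:bookworm-slim

# Bake the internal Root CA's public cert into the OS trust store.
# mycompany-root-ca.crt is the PEM public certificate exported from Terraform.
COPY mycompany-root-ca.crt /usr/local/share/ca-certificates/mycompany-root-ca.crt
RUN apt-get update \
 && apt-get install -y --no-install-recommends ca-certificates \
 && update-ca-certificates \
 && rm -rf /var/lib/apt/lists/*
```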

Step 3: Going Regional

We operate across multiple regional clusters. Naturally, pushing the Root CA's private key into every single cluster to sign leaf certificates is a security nightmare waiting to happen. If one region gets compromised, the whole kingdom falls.

To solve this, we used Terraform to create an Intermediate CA for each specific region. We gave these intermediate certs a slightly shorter 15-year expiration.

In Terraform, this means generating a new private key, creating a Certificate Signing Request (CSR), and then using our previously created Root CA to sign it:

resource "tls_private_key" "regional_intermediate" {
  algorithm = "RSA"
  rsa_bits  = 4096
}

resource "tls_cert_request" "regional_intermediate" {
  private_key_pem = tls_private_key.regional_intermediate.private_key_pem

  subject {
    common_name  = "MyCompany Regional CA - US-East"
    organization = "MyCompany"
  }
}

resource "tls_locally_signed_cert" "regional_intermediate" {
  cert_request_pem   = tls_cert_request.regional_intermediate.cert_request_pem
  
  # Signing using the Root CA from Step 1
  ca_private_key_pem = tls_private_key.root.private_key_pem
  ca_cert_pem        = tls_self_signed_cert.root.cert_pem

  validity_period_hours = 131400 # 15 years
  is_ca_certificate     = true

  # Standard key usages for a CA certificate
  allowed_uses = [
    "cert_signing",
    "crl_signing",
  ]
}

Instead of leaving these highly privileged intermediate keys lying around, Terraform automatically provisions and securely saves the resulting intermediate certificate and its private key directly into Azure Key Vault in the respective regions.
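In Terraform terms, that hand-off is just a pair of `azurerm_key_vault_secret` resources. A sketch, assuming a per-region Key Vault resource (`azurerm_key_vault.us_east` here is illustrative):

```hcl
# Store the regional intermediate's cert and key in that region's Key Vault.
resource "azurerm_key_vault_secret" "intermediate_cert" {
  name         = "regional-intermediate-ca-cert"
  value        = tls_locally_signed_cert.regional_intermediate.cert_pem
  key_vault_id = azurerm_key_vault.us_east.id
}

resource "azurerm_key_vault_secret" "intermediate_key" {
  name         = "regional-intermediate-ca-key"
  value        = tls_private_key.regional_intermediate.private_key_pem
  key_vault_id = azurerm_key_vault.us_east.id
}
```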

Step 4: The Kubernetes Magic

So how do the actual services get their TLS certificates? We rely on good ol' cert-manager inside our Kubernetes clusters.

Here is the flow:

  1. Inside each regional k8s cluster, a secret syncing mechanism pulls that region's Intermediate CA (both the cert and the private key) from Azure Key Vault.
  2. We load those credentials into a standard Kubernetes Secret.
  3. We configure a ClusterIssuer in cert-manager to use that Secret as a CA.

apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: regional-ca-issuer
spec:
  ca:
    secretName: regional-intermediate-ca-secret
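
For reference, the Secret the issuer points at is a standard kubernetes.io/tls Secret holding the intermediate pair; cert-manager's CA issuer reads the tls.crt and tls.key keys. The namespace and base64 values here are placeholders:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: regional-intermediate-ca-secret
  namespace: cert-manager
type: kubernetes.io/tls
data:
  tls.crt: <base64-encoded intermediate certificate>
  tls.key: <base64-encoded intermediate private key>
```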

Now, the magic happens. Whenever an engineering team deploys a new service, they just add the standard TLS annotations to their Ingress or Certificate resources. cert-manager talks to our regional ClusterIssuer and mints a short-lived, per-service leaf certificate on the fly.
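For instance, a team could request a cert for their service with a Certificate resource along these lines (service name, namespace, and DNS name are all illustrative):

```yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: payments-api-tls
  namespace: payments
spec:
  secretName: payments-api-tls   # where cert-manager writes the keypair
  duration: 2160h                # 90-day leaf certs
  renewBefore: 360h              # rotate 15 days before expiry
  dnsNames:
    - payments-api.internal.mycompany.example
  issuerRef:
    name: regional-ca-issuer
    kind: ClusterIssuer
```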

Step 5: The Out-of-Cluster Oddballs

Not everything runs neatly inside Kubernetes. We have a few legacy services, background workers, and specialized databases running on plain old VMs outside the cluster.

When these VM-based services try to make an API call to a service inside our Kubernetes cluster, they are greeted by a TLS certificate signed by our internal regional CA. If the VM doesn't trust that CA, the connection drops with a nasty x509: certificate signed by unknown authority error.
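You can reproduce the failure (and the fix) locally with plain OpenSSL. This throwaway sketch mints a root CA and a leaf signed by it; all file and subject names are made up:

```shell
# Mint a self-signed root CA and a leaf certificate signed by it.
openssl req -x509 -newkey rsa:2048 -nodes -keyout root.key -out root.crt \
  -subj "/CN=MyCompany Internal Root CA" -days 7300
openssl req -newkey rsa:2048 -nodes -keyout leaf.key -out leaf.csr \
  -subj "/CN=internal-service.test"
openssl x509 -req -in leaf.csr -CA root.crt -CAkey root.key -CAcreateserial \
  -out leaf.crt -days 90

# Without the root CA in the trust chain, verification fails:
openssl verify leaf.crt || true   # unable to get local issuer certificate

# Pointing at the root CA (the VM trust-store fix, in miniature) makes it pass:
openssl verify -CAfile root.crt leaf.crt   # leaf.crt: OK
```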

Since we already solved this for our Docker containers via base images, we simply applied the same logic to our infrastructure automation. Whether it's through cloud-init scripts during provisioning or configuration management tools like Chef, we inject the public Root CA certificate directly into the VM's OS-level trust store.
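With cloud-init, for example, the injection can be a single ca_certs block in the instance user data (the PEM body is elided here):

```yaml
#cloud-config
ca_certs:
  trusted:
    - |
      -----BEGIN CERTIFICATE-----
      <PEM body of the MyCompany Internal Root CA>
      -----END CERTIFICATE-----
```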

Just like that, the VMs act like native citizens of our internal network, securely validating HTTPS traffic from our k8s services without skipping a beat.

Why This Works for Us

Is it a massive, enterprise-grade identity system? No. But here is why this perfectly hits the sweet spot for our engineering org:

  • Zero Developer Friction: Because the Root CA is baked into the base images and cert-manager handles the leaf certs, developers literally don't have to think about encryption. It just works.
  • Fully Automated: Terraform handles the scary long-lived keys, and cert-manager rotates the short-lived ones.
  • Low Overhead: No heavy sidecars dragging down performance, and no massive monthly bills for a managed PKI service.
  • Private DNS Ready: We completely decoupled our internal TLS from public DNS records, allowing us to lock down our network exactly how we wanted.

Sometimes, ignoring the "enterprise" advice and taking the poor man's route is exactly what you need to build a system that is simple, secure, and perfectly tailored to your actual problems.