You’ve automated everything else. Infrastructure is Terraform. Deployments are CI/CD. Monitoring is Prometheus + Grafana. Secrets are in Vault. Scaling is automatic.
But certificates? Somebody manually requests them from a portal. Somebody downloads a ZIP file. Somebody SCPs it to a server. Somebody remembers to renew it in 11 months. Maybe.
This is the gap. DevOps teams that deploy 50 services a week still manage certificates like it’s a manual IT process from 2010. And then they’re surprised when an expired cert takes down production at 2 AM on a Saturday.
Here’s how to fix it: treat certificates as infrastructure, not tickets.
The DevOps Certificate Manifesto
Certificates should be:
- Declared in code (not requested via email/portal)
- Provisioned automatically (not downloaded and uploaded manually)
- Renewed without human intervention (not tracked in spreadsheets)
- Monitored like any other infrastructure (not discovered during outages)
- Ephemeral where possible (short-lived, disposable, auto-replaced)
If your certificate process requires a human to do anything other than write the initial configuration, it’s not automated enough.
Pattern 1: Certificates as Code (Terraform)
Declare certificates in your infrastructure code. They’re provisioned alongside the infrastructure that uses them.
AWS (ACM + Route53 + ALB)
# Certificate declared in Terraform
resource "aws_acm_certificate" "api" {
  domain_name               = "api.example.com"
  subject_alternative_names = ["api-v2.example.com"]
  validation_method         = "DNS"
  lifecycle {
    create_before_destroy = true
  }
}
# DNS validation (automatic)
resource "aws_route53_record" "cert_validation" {
  for_each = {
    for dvo in aws_acm_certificate.api.domain_validation_options : dvo.domain_name => dvo
  }
  zone_id = data.aws_route53_zone.main.zone_id
  name    = each.value.resource_record_name
  type    = each.value.resource_record_type
  records = [each.value.resource_record_value]
  ttl     = 60
}
# Wait for validation
resource "aws_acm_certificate_validation" "api" {
  certificate_arn         = aws_acm_certificate.api.arn
  validation_record_fqdns = [for r in aws_route53_record.cert_validation : r.fqdn]
}
# Attach to ALB (certificate auto-renews via ACM)
resource "aws_lb_listener" "https" {
  load_balancer_arn = aws_lb.main.arn
  port              = 443
  protocol          = "HTTPS"
  certificate_arn   = aws_acm_certificate_validation.api.certificate_arn
  # ...
}
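Result: terraform apply creates the certificate, validates it via DNS, and attaches it to the load balancer. ACM then renews it automatically as long as the validation records stay in DNS and the certificate remains attached to the load balancer, so there is no ongoing manual maintenance.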
GCP (Managed Certificate + Load Balancer)
resource "google_compute_managed_ssl_certificate" "api" {
  name = "api-cert"
  managed {
    domains = ["api.example.com"]
  }
}
resource "google_compute_target_https_proxy" "api" {
  name             = "api-proxy"
  url_map          = google_compute_url_map.api.id
  ssl_certificates = [google_compute_managed_ssl_certificate.api.id]
}
Pattern 2: Certificates in Kubernetes (cert-manager)
For Kubernetes workloads, cert-manager is the standard. Certificates are Kubernetes resources — managed the same way as Deployments and Services.
The GitOps Way
# In your Helm chart or Kustomize overlay:
# charts/my-app/templates/certificate.yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: {{ .Release.Name }}-tls
  namespace: {{ .Release.Namespace }}
spec:
  secretName: {{ .Release.Name }}-tls-secret
  issuerRef:
    name: letsencrypt-prod
    kind: ClusterIssuer
  dnsNames:
    {{- range .Values.ingress.hosts }}
    - {{ . }}
    {{- end }}
  privateKey:
    algorithm: ECDSA
    size: 256
# values.yaml
ingress:
  hosts:
    - api.example.com
    - api-v2.example.com
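Result: Deploy the app → certificate is automatically provisioned. Delete the app → the Certificate resource is removed with the release (the issued Secret stays behind unless cert-manager runs with --enable-certificate-owner-ref). Scale to 10 environments → each gets its own certificate automatically.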
Monitoring cert-manager (Prometheus)
# ServiceMonitor for cert-manager metrics
# (label and port name must match the cert-manager Service in your install)
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: cert-manager
spec:
  selector:
    matchLabels:
      app: cert-manager
  endpoints:
    - port: http-metrics
# Alert rules (Prometheus rule file)
groups:
  - name: cert-manager
    rules:
      - alert: CertificateNotReady
        expr: certmanager_certificate_ready_status{condition="False"} == 1
        for: 15m
        annotations:
          summary: "Certificate {{ $labels.name }} failed to issue"
      - alert: CertificateExpiringSoon
        expr: (certmanager_certificate_expiration_timestamp_seconds - time()) / 86400 < 7
        annotations:
          summary: "Certificate {{ $labels.name }} expires in < 7 days"
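To see why a 7-day threshold points at a real failure rather than normal churn, it helps to work through the timing. The sketch below assumes cert-manager's default behavior when `renewBefore` is not set (renewal starts once two thirds of the certificate's lifetime has elapsed) and a 90-day certificate such as Let's Encrypt issues. The function name and the issue date are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def renewal_time(not_before: datetime, not_after: datetime) -> datetime:
    """Moment renewal is first attempted: two thirds of the way through the lifetime."""
    lifetime = not_after - not_before
    return not_before + lifetime * 2 / 3


# Hypothetical 90-day certificate issued on 1 January 2025 (UTC)
issued = datetime(2025, 1, 1, tzinfo=timezone.utc)
expires = issued + timedelta(days=90)

renew_at = renewal_time(issued, expires)
alert_at = expires - timedelta(days=7)  # when CertificateExpiringSoon starts firing
failed_days = (alert_at - renew_at).days

print("renewal starts:", renew_at.date())
print("alert fires:   ", alert_at.date())
print("days of failed renewal before the alert:", failed_days)
```

Under these assumptions, the alert only fires after renewal has been failing for roughly three weeks, which is why it deserves a page rather than a ticket.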
Pattern 3: Certificates in CI/CD Pipelines
For services that aren’t in Kubernetes or cloud-managed load balancers:
GitHub Actions: Request + Deploy + Verify
name: Deploy with Certificate
on:
  push:
    branches: [main]
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Request certificate (if needed)
        run: |
          # Check if current cert expires within 30 days
          EXPIRY=$(ssh deploy@server "openssl x509 -enddate -noout -in /etc/ssl/certs/app.pem" | cut -d= -f2)
          EXPIRY_EPOCH=$(date -d "$EXPIRY" +%s)
          NOW_EPOCH=$(date +%s)
          DAYS_LEFT=$(( (EXPIRY_EPOCH - NOW_EPOCH) / 86400 ))
          if [ "$DAYS_LEFT" -lt 30 ]; then
            echo "Certificate expires in $DAYS_LEFT days; renewing"
            ssh deploy@server "certbot renew --deploy-hook 'systemctl reload nginx'"
          fi
      - name: Deploy application
        run: |
          # Your normal deployment steps
          ssh deploy@server "cd /app && git pull && docker-compose up -d"
      - name: Verify certificate
        run: |
          sleep 10
          # Fetch the certificate the live endpoint actually serves
          CERT=$(echo | openssl s_client -connect app.example.com:443 -servername app.example.com 2>/dev/null | openssl x509)
          echo "$CERT" | openssl x509 -noout -subject -enddate
          # Fail if the hostname is missing from the SANs
          echo "$CERT" | openssl x509 -noout -ext subjectAltName | grep -q "app.example.com" || exit 1
          # Fail if the served cert still expires within 30 days (renewal did not take effect)
          echo "$CERT" | openssl x509 -noout -checkend 2592000 || exit 1
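The expiry check in the workflow can also be expressed with Python's standard library, which avoids parsing `openssl` output in shell. This is a minimal sketch, not part of the workflow above: the function names are hypothetical, and the 30-day default mirrors the threshold used in the renewal step.

```python
import socket
import ssl
import time
from typing import Optional


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Return the notAfter field of the certificate served by host:port."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def days_left(not_after: str, now: Optional[float] = None) -> int:
    """Whole days between `now` (epoch seconds) and a notAfter string
    such as 'Jun  1 12:00:00 2025 GMT'."""
    expiry = ssl.cert_time_to_seconds(not_after)
    current = time.time() if now is None else now
    return int((expiry - current) // 86400)


def needs_renewal(not_after: str, threshold_days: int = 30,
                  now: Optional[float] = None) -> bool:
    """True when fewer than `threshold_days` whole days remain."""
    return days_left(not_after, now) < threshold_days
```

A pipeline step could call `needs_renewal(fetch_not_after("app.example.com"))` and exit non-zero on `True`. Because the default context verifies the chain and hostname, an already expired or mismatched certificate raises `ssl.SSLCertVerificationError` during the handshake, which is itself a useful failure signal.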
Pattern 4: Certificate Monitoring as Code
Your monitoring stack should treat certificate expiry the same as disk space or memory usage:
Prometheus + Blackbox Exporter
# prometheus.yml: probe all TLS endpoints
scrape_configs:
  - job_name: 'tls-certificates'
    metrics_path: /probe
    params:
      # Assumes a module named tcp_tls in blackbox.yml (prober: tcp with tls: true).
      # A plain tcp_connect module performs no TLS handshake, so it reports no probe_ssl_* metrics.
      module: [tcp_tls]
    static_configs:
      - targets:
          - api.example.com:443
          - app.example.com:443
          - admin.example.com:443
          - payments.example.com:443
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115
# Alert rules
groups:
  - name: certificates
    rules:
      - alert: TLSCertExpiring30Days
        expr: (probe_ssl_earliest_cert_expiry - time()) / 86400 < 30
        labels:
          severity: warning
        annotations:
          summary: "TLS cert for {{ $labels.instance }} expires in < 30 days"
      - alert: TLSCertExpiring7Days
        expr: (probe_ssl_earliest_cert_expiry - time()) / 86400 < 7
        labels:
          severity: critical
        annotations:
          summary: "CRITICAL: TLS cert for {{ $labels.instance }} expires in < 7 days"
          runbook: "https://wiki.internal/runbooks/certificate-renewal"
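The two alert expressions boil down to simple arithmetic on an expiry timestamp. The sketch below reproduces that arithmetic outside Prometheus so the thresholds are easy to reason about. The function names, the fixed "current time", and the sample expiry values are all hypothetical.

```python
from typing import Dict

WARNING_DAYS = 30   # threshold used by TLSCertExpiring30Days
CRITICAL_DAYS = 7   # threshold used by TLSCertExpiring7Days


def severity(expiry_epoch: float, now_epoch: float) -> str:
    """Classify one endpoint using (expiry - now) / 86400, as the alert rules do."""
    days = (expiry_epoch - now_epoch) / 86400
    if days < CRITICAL_DAYS:
        return "critical"
    if days < WARNING_DAYS:
        return "warning"
    return "ok"


def triage(expiries: Dict[str, float], now_epoch: float) -> Dict[str, str]:
    """Map each instance to its severity."""
    return {instance: severity(ts, now_epoch) for instance, ts in expiries.items()}


now = 1_700_000_000.0  # arbitrary fixed timestamp for the example
sample = {
    "api.example.com:443": now + 60 * 86400,
    "app.example.com:443": now + 20 * 86400,
    "admin.example.com:443": now + 3 * 86400,
}
result = triage(sample, now)
print(result)
```

One difference from the real rules: in Prometheus, a certificate with fewer than 7 days left satisfies both expressions, so both alerts fire unless an Alertmanager inhibition rule suppresses the warning.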
Grafana Dashboard
# Days until expiry for all monitored endpoints
(probe_ssl_earliest_cert_expiry - time()) / 86400
# Count of certificates expiring within 30 days
count(probe_ssl_earliest_cert_expiry - time() < 86400 * 30)
# Certificate issuer distribution (recent blackbox_exporter releases expose the issuer in the `issuer` label)
count by (issuer) (probe_ssl_last_chain_info)
Pattern 5: Internal Certificates with Vault
For internal services that need mTLS or private certificates:
# In your deployment script or Helm chart:
# Keep the key and the temporary JSON readable by the current user only
umask 077
# 1. Authenticate to Vault (using K8s service account or CI JWT)
export VAULT_TOKEN=$(vault write -field=token auth/kubernetes/login \
  role=my-app jwt="$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)")
# 2. Request a short-lived certificate
vault write -format=json pki/issue/internal-service \
  common_name="my-app.production.svc.cluster.local" \
  ttl="72h" > /tmp/cert.json
# 3. Extract cert and key (leaf certificate first, then the CA chain)
jq -r '.data.certificate' /tmp/cert.json > /etc/ssl/app.pem
jq -r '.data.private_key' /tmp/cert.json > /etc/ssl/app-key.pem
jq -r '.data.ca_chain[]' /tmp/cert.json >> /etc/ssl/app.pem
# 4. Clean up
rm /tmp/cert.json
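The extraction step can also be done without a temporary file on disk. The following sketch parses the JSON that `vault write -format=json` prints and writes the bundle with restrictive permissions. The function names and output file names are hypothetical; the field names (`certificate`, `private_key`, `ca_chain`) are the ones the `jq` commands above read.

```python
import json
import os
from pathlib import Path


def write_private(path: Path, content: str) -> None:
    """Create or truncate `path` and write `content`.
    Mode 0600 is applied when the file is created; an existing file keeps its mode."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as handle:
        handle.write(content)


def write_bundle(vault_json: str, out_dir: Path) -> None:
    """Write app.pem (leaf certificate followed by the CA chain) and app-key.pem."""
    data = json.loads(vault_json)["data"]
    chain = [data["certificate"], *data.get("ca_chain", [])]
    write_private(out_dir / "app.pem", "\n".join(chain) + "\n")
    write_private(out_dir / "app-key.pem", data["private_key"] + "\n")
```

A deployment script could pipe the Vault output straight in and call `write_bundle(sys.stdin.read(), Path("/etc/ssl"))`, so the private key never touches `/tmp`.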
With Vault Agent (sidecar), this happens automatically with renewal:
# vault-agent-config.hcl
template {
  source      = "/vault/templates/cert.tpl"
  destination = "/etc/ssl/app.pem"
  command     = "nginx -s reload"
  # Vault Agent re-renders template when cert approaches expiry
  # Nginx reloads automatically
}
The Anti-Patterns (What NOT to Do)
❌ Certificates in Git
# NEVER commit certificates or keys to source control
# .gitignore should include:
*.pem
*.key
*.crt
*.pfx
*.p12
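A `.gitignore` only catches files by name, so a key saved under an unexpected extension can still slip through. As a complement, a pre-commit hook can inspect file contents. The sketch below is one possible check, with a hypothetical function name; it looks for PEM private-key headers and does not detect binary formats such as PFX/P12.

```python
import re
from pathlib import Path
from typing import Iterable, List

# Matches PEM headers such as "BEGIN PRIVATE KEY", "BEGIN RSA PRIVATE KEY",
# "BEGIN EC PRIVATE KEY", and "BEGIN ENCRYPTED PRIVATE KEY".
PRIVATE_KEY_MARKER = re.compile(r"-----BEGIN (?:[A-Z]+ )*PRIVATE KEY-----")


def files_with_private_keys(paths: Iterable[Path]) -> List[Path]:
    """Return the paths whose contents contain a PEM private-key header."""
    hits = []
    for path in paths:
        try:
            text = path.read_text(errors="ignore")
        except OSError:
            continue  # unreadable or missing file: nothing to report
        if PRIVATE_KEY_MARKER.search(text):
            hits.append(path)
    return hits
```

A hook would pass in the staged files and abort the commit when the returned list is non-empty.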
❌ Long-Lived Certificates for Dynamic Infrastructure
If your infrastructure scales up/down daily, don’t use 1-year certificates that require manual renewal. Use short-lived certificates (hours/days) that are issued at deploy time and expire naturally.
❌ Shared Wildcard Certificates Across Environments
# DON'T: Same wildcard cert on dev, staging, and production
*.example.com → deployed everywhere
# DO: Separate certificates per environment
dev.example.com → cert from Let's Encrypt (auto-renewed)
staging.example.com → cert from Let's Encrypt (auto-renewed)
api.example.com → cert from Let's Encrypt (auto-renewed)
Shared wildcards mean: one compromised environment exposes the key for all environments.
❌ Manual Renewal Reminders
If your certificate management strategy involves calendar reminders or Jira tickets for renewal, it’s not automated — it’s a human process pretending to be managed. Automate it or accept that you’ll have outages.
The Maturity Model for DevOps Certificate Management
| Level | Description | Characteristics |
|---|---|---|
| 0 | Chaos | Manual everything. Certs expire without warning. |
| 1 | Tracked | Spreadsheet/monitoring exists. Still manual renewal. |
| 2 | Automated | ACME/cert-manager handles renewal. Monitoring alerts on failure. |
| 3 | Codified | Certificates declared in IaC. Provisioned with infrastructure. |
| 4 | Ephemeral | Short-lived certs. No renewal needed. Issued at deploy, expire naturally. |
Many DevOps teams sit at Level 1-2. The goal is Level 3-4.
FAQ
Q: Should every service have its own certificate?
A: Yes. One certificate per service (or per endpoint). Shared certificates (especially wildcards) create shared risk. If one service’s key is compromised, all services sharing that certificate are affected.
Q: How do I handle certificates for local development?
A: Use mkcert — it generates locally-trusted certificates for localhost and custom domains. No browser warnings, no self-signed cert hacks. For team-wide dev environments, use a shared private CA with cert-manager.
Q: What about certificates for non-HTTP services (databases, message queues)?
A: Same principles apply. Use cert-manager Certificate resources (mount the Secret as a volume), Vault PKI (request at startup), or your CLM platform’s agent. The protocol doesn’t matter — the lifecycle management is the same.
Q: How do I convince my team to invest in certificate automation?
A: Calculate the cost of your last certificate outage (or the next one). Include: engineer time (emergency response at 2 AM), revenue loss, customer trust impact, and the post-mortem time. Compare to the cost of setting up cert-manager (a few hours) or ACME (an afternoon). The ROI is immediate.
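The comparison can be laid out as simple arithmetic. Every number in the sketch below is a placeholder to be replaced with your own figures, and the function name is hypothetical. Customer trust impact is left out because it rarely reduces to a single number.

```python
def outage_cost(engineers: int, response_hours: float, hourly_rate: float,
                revenue_loss_per_hour: float, downtime_hours: float,
                postmortem_hours: float) -> float:
    """Sum emergency response time, post-mortem time, and lost revenue."""
    response = engineers * response_hours * hourly_rate
    postmortem = postmortem_hours * hourly_rate
    revenue = revenue_loss_per_hour * downtime_hours
    return response + postmortem + revenue


# Placeholder inputs, not measurements
incident = outage_cost(
    engineers=3, response_hours=4, hourly_rate=100,
    revenue_loss_per_hour=5000, downtime_hours=2, postmortem_hours=10,
)
automation = 8 * 100  # one working day of setup at the same hourly rate

print("one outage:", incident)
print("automation setup:", automation)
print("ratio:", incident / automation)
```

With these placeholder inputs a single outage costs roughly fifteen times the setup effort. The point of the exercise is the structure of the comparison, not the specific figures.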