Certificate Monitoring and Alerting — Preventing Expiry Outages

Key Takeaways

Monitor the certificate actually being served (network probe), not just the file on disk — they can differ

Alert at 60, 30, and 7 days before expiry. Escalate at each threshold to different teams/channels.

Monitor intermediate CA certificates too — when they expire, every end-entity cert beneath them becomes unverifiable

The goal isn't dashboards — it's ensuring every expiring certificate has an owner who acts before the deadline

Certificate monitoring is the continuous process of checking certificate health across your infrastructure: expiry dates, chain validity, key strength, configuration correctness, and revocation status. Alerting converts monitoring data into actionable notifications — telling the right person, at the right time, that a certificate needs attention. Together, they’re the safety net that catches certificates falling through the cracks of automation, manual processes, or ownership gaps.

Why it matters

Expiry is the #1 TLS outage cause — certificates expire on a fixed date regardless of whether anyone is watching. Monitoring is the only mechanism that converts a silent countdown into a visible action item.
Automation isn’t infallible — ACME renewal can fail (rate limits, DNS propagation, challenge endpoint blocked). cert-manager can fail (RBAC changes, issuer misconfiguration). Monitoring catches automation failures before they become outages.
Ownership gaps — in large organizations, certificates are deployed by different teams. Without monitoring that assigns ownership, expiring certificates have no clear responsible party. The alert must reach someone who can act.
Chain and configuration drift — a certificate may be valid but misconfigured: incomplete chain, wrong protocol version, weak cipher suites. Monitoring catches configuration degradation, not just expiry.
Compliance evidence — auditors want proof that certificates are tracked and managed. Monitoring dashboards and alert history provide this evidence.

How it works

Data collection — gather certificate data via network probes (TLS handshake to endpoints), agent reports (local file inspection), cloud APIs (ACM, Key Vault), and CT log monitoring
Expiry calculation — compute days remaining for every certificate. Flag certificates crossing alert thresholds (60, 30, 14, 7, 1 day).
Health checks — validate: chain completeness, key algorithm strength, signature algorithm (no SHA-1), certificate matches the hostname, OCSP stapling status
Ownership mapping — associate each certificate with a team, service, or individual responsible for renewal
Alert routing — send notifications to the certificate owner via appropriate channels (email at 60 days, Slack at 30 days, PagerDuty at 7 days)
Escalation — if no action is taken after initial alert, escalate to team lead, then to infrastructure management
Verification after renewal — confirm the new certificate is deployed and serving. Close the alert only after verification.
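
The data collection and expiry calculation steps above can be sketched with the Python standard library. This is a minimal illustration under stated assumptions, not a production monitor: the function names (fetch_not_after, days_remaining, crossed_threshold) are hypothetical, and the thresholds are the ones listed in the expiry calculation step.

```python
import socket
import ssl
import time
from typing import Optional

# Alert thresholds from the steps above, in days, tightest first.
THRESHOLDS = (1, 7, 14, 30, 60)


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Return the notAfter string of the leaf certificate the endpoint serves.

    The default context verifies the chain and hostname, so an expired or
    otherwise invalid certificate raises ssl.SSLCertVerificationError.
    """
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def days_remaining(not_after: str, now: Optional[float] = None) -> float:
    """Convert a notAfter string (e.g. 'Jan  1 00:00:00 2030 GMT') to days left."""
    expires = ssl.cert_time_to_seconds(not_after)
    if now is None:
        now = time.time()
    return (expires - now) / 86400


def crossed_threshold(days: float) -> Optional[int]:
    """Return the tightest threshold the certificate has crossed, or None."""
    for threshold in THRESHOLDS:
        if days <= threshold:
            return threshold
    return None
```

A caller would run fetch_not_after against each endpoint on a schedule and treat a verification error as the most severe case, since it means clients are already failing.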

In real systems

Prometheus + blackbox_exporter (open source):

# blackbox.yml - probe TLS endpoints
modules:
  tls_connect:
    prober: tcp
    tcp:
      tls: true

# prometheus.yml - scrape certificate expiry
- job_name: 'tls-certs'
  metrics_path: /probe
  params:
    module: [tls_connect]
  static_configs:
    - targets:
      - api.example.com:443
      - app.example.com:443
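  relabel_configs:
    # Pass the original target to the exporter as the ?target= parameter
    - source_labels: [__address__]
      target_label: __param_target
    # Keep the probed endpoint as the instance label
    - source_labels: [__param_target]
      target_label: instance
    # Scrape the blackbox exporter itself (address is an assumption; adjust to your deployment)
    - target_label: __address__
      replacement: 127.0.0.1:9115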

# Alert rule
- alert: CertificateExpiringSoon
  expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 30
  for: 1h
  labels:
    severity: warning
  annotations:
    summary: "Certificate for {{ $labels.instance }} expires in < 30 days"

AWS Config rule (cloud-native):

{
  "ConfigRuleName": "acm-certificate-expiration-check",
  "Source": {
    "Owner": "AWS",
    "SourceIdentifier": "ACM_CERTIFICATE_EXPIRATION_CHECK"
  },
  "InputParameters": "{\"daysToExpiration\": \"30\"}"
}

Nagios/Icinga check (traditional):

# check_ssl_cert plugin: warning at 30 days, critical at 7 days
# (comments cannot follow a trailing backslash, so they live up here)
/usr/lib/nagios/plugins/check_ssl_cert \
  -H api.example.com \
  -w 30 \
  -c 7 \
  --check-chain \
  --check-ocsp

Grafana dashboard query:

# Days until expiry for all monitored endpoints
(probe_ssl_earliest_cert_expiry - time()) / 86400

# Count of certificates expiring within 30 days
count(probe_ssl_earliest_cert_expiry - time() < 86400 * 30)

Where it breaks

Monitoring the file, not the endpoint — a script checks the certificate file on disk and reports “valid, 60 days remaining.” But the running Nginx process is still serving the old certificate from memory (never reloaded after renewal). The monitoring shows green while clients see an expired cert. Always monitor by connecting to the actual endpoint and inspecting what’s served over TLS — not by reading files.
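
One way to detect this mismatch is to compare the fingerprint of the certificate on disk with the one the endpoint actually serves. The sketch below uses only the Python standard library; the helper names are hypothetical, and it assumes the leaf certificate is the first PEM block in the file (as in a typical fullchain file).

```python
import hashlib
import socket
import ssl

PEM_BEGIN = "-----BEGIN CERTIFICATE-----"
PEM_END = "-----END CERTIFICATE-----"


def fingerprint(der: bytes) -> str:
    """SHA-256 fingerprint of a DER-encoded certificate."""
    return hashlib.sha256(der).hexdigest()


def disk_fingerprint(pem_text: str) -> str:
    """Fingerprint of the first certificate block in a PEM file's contents."""
    start = pem_text.index(PEM_BEGIN)
    end = pem_text.index(PEM_END, start) + len(PEM_END)
    return fingerprint(ssl.PEM_cert_to_DER_cert(pem_text[start:end]))


def served_fingerprint(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Fingerprint of the leaf certificate the endpoint serves right now.

    Verification is disabled on purpose: the goal is to read the served
    bytes even if the certificate is expired, not to trust the peer.
    """
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            der = tls.getpeercert(binary_form=True)
    return fingerprint(der)


def is_stale(pem_on_disk: str, served_fp: str) -> bool:
    """True when the server is not serving the certificate that is on disk."""
    return disk_fingerprint(pem_on_disk) != served_fp
```

If is_stale returns True after a renewal, the usual cause is a process that was never reloaded.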

Alert fatigue — monitoring sends 200 certificate expiry alerts per week. The team starts ignoring them. When a critical production certificate alert fires, it’s lost in the noise. Fix: tiered alerting (informational at 60 days, actionable at 30 days, critical at 7 days) with different channels. Only page on-call for certificates expiring within 7 days.
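
The tiering described above can be expressed as a small routing table. This is a sketch, not a prescribed design: the severity names and channels mirror the ones in this article, while the TIERS structure and function names are hypothetical.

```python
from typing import Optional, Tuple

# (maximum days remaining, severity, channel), tightest tier first.
TIERS = (
    (7, "critical", "pagerduty"),
    (30, "actionable", "slack"),
    (60, "informational", "email"),
)


def route(days_remaining: float) -> Optional[Tuple[str, str]]:
    """Return (severity, channel) for a certificate, or None if no alert is due."""
    for max_days, severity, channel in TIERS:
        if days_remaining <= max_days:
            return severity, channel
    return None


def should_page(days_remaining: float) -> bool:
    """Only the tightest tier is allowed to page the on-call engineer."""
    tier = route(days_remaining)
    return tier is not None and tier[1] == "pagerduty"
```

Because each certificate maps to exactly one tier at a time, a renewal that happens at 45 days never generates a Slack message or a page.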

Intermediate CA expiry not monitored — teams monitor end-entity certificates but forget that intermediate CA certificates also expire. When the intermediate expires, every end-entity certificate it signed becomes unverifiable — even if those end-entity certs are still within their validity period. The chain breaks at the intermediate. Monitor intermediate and root CA certificate expiry with longer lead times (6-12 months).
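
A simple way to apply longer lead times to CA certificates is to key the threshold on the certificate's role in the chain. The sketch below assumes the chain has already been parsed into (subject, role, not_after) tuples by some other tool; the role names, the 180 and 365 day values (picked from within the 6-12 month range above), and the subjects in the usage example are illustrative assumptions.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable, List, Tuple

# Lead time in days per certificate role; CA values are assumptions
# chosen from the 6-12 month range suggested above.
LEAD_TIME_DAYS = {"leaf": 30, "intermediate": 180, "root": 365}

ChainEntry = Tuple[str, str, datetime]  # (subject, role, not_after)


def needs_attention(role: str, not_after: datetime, now: datetime) -> bool:
    """True when the certificate is inside the lead time for its role."""
    return not_after - now <= timedelta(days=LEAD_TIME_DAYS[role])


def chain_alerts(chain: Iterable[ChainEntry], now: datetime) -> List[str]:
    """Return the subjects of all chain certificates that need attention."""
    return [
        subject
        for subject, role, not_after in chain
        if needs_attention(role, not_after, now)
    ]


# Usage with hypothetical data: only the intermediate is inside its lead time.
now = datetime(2025, 1, 1, tzinfo=timezone.utc)
example_chain = [
    ("api.example.com", "leaf", now + timedelta(days=90)),
    ("Example Issuing CA", "intermediate", now + timedelta(days=120)),
    ("Example Root CA", "root", now + timedelta(days=3650)),
]
alerts = chain_alerts(example_chain, now)
```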

Operational insight

The most effective certificate monitoring isn’t the most technically sophisticated — it’s the one with clear ownership mapping. A dashboard showing 500 certificates with expiry dates is useless if nobody knows who owns each one. The monitoring system must answer: “This certificate expires in 14 days — who specifically needs to act?” This requires integrating certificate inventory with service ownership (team directories, CMDB, or service catalogs). Without ownership, monitoring generates alerts that bounce between teams while the clock ticks down to outage.
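
As an illustration of the ownership lookup, the sketch below resolves a hostname to an owner and makes a missing owner visible instead of silently dropping the alert. The Owner type, the OWNERS table, the fallback contact, and the addresses are all hypothetical; in practice the mapping would come from a CMDB or service catalog export.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class Owner:
    team: str
    contact: str


# Hypothetical ownership data, standing in for a CMDB or service catalog.
OWNERS: Dict[str, Owner] = {
    "api.example.com": Owner("platform", "platform-oncall@example.com"),
}
# Hypothetical catch-all so unowned certificates still reach a human.
FALLBACK = Owner("infrastructure", "infra-leads@example.com")


def resolve_owner(hostname: str, owners: Dict[str, Owner]) -> Tuple[Owner, bool]:
    """Return (owner, known); unknown hosts go to the fallback owner."""
    owner = owners.get(hostname)
    if owner is None:
        return FALLBACK, False
    return owner, True


def expiry_notice(hostname: str, days: int, owners: Dict[str, Owner]) -> str:
    """Build a notice that names who specifically needs to act."""
    owner, known = resolve_owner(hostname, owners)
    note = "" if known else " (no registered owner; routed to fallback)"
    return (
        f"{hostname} expires in {days} days; "
        f"action required from {owner.team} <{owner.contact}>{note}"
    )
```

Counting how often the fallback is used gives a rough measure of the ownership gap in the inventory.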
