# Eliminating SSH Key Sprawl: Short-Lived Certificates for Modern Infrastructure

*A KeyGrid Technical Whitepaper*
*Published April 2026*

---

## Executive Summary

SSH remains the backbone of infrastructure access in most enterprises. Yet many organizations still manage SSH authentication the way it was done in 1995: static public keys copied to `authorized_keys` files across hundreds or thousands of servers. This approach creates an ever-growing attack surface with no expiration, no audit trail, and no practical revocation mechanism.

KeyGrid SSH Certificate Authority replaces static key management with short-lived, identity-bound SSH certificates signed by an HSM-backed Certificate Authority. The result is a zero-trust SSH access model where credentials expire automatically, access decisions are driven by corporate identity, and every certificate issuance is logged.

**Key Findings:**

- The average enterprise has 3-10x more SSH keys than employees, with no inventory of which keys grant access to which servers
- Static SSH keys have an average unmanaged lifetime of 2+ years, creating persistent credential exposure
- Short-lived SSH certificates reduce the credential exposure window from years to hours (a reduction of more than 99.9% when an 8-hour certificate replaces a key left in place for 2+ years)
- Organizations report 90% fewer emergency access revocation incidents after adopting SSH CA
- Server-side configuration drops from per-user key management to a single trust anchor

---

## Table of Contents

1. [The SSH Key Management Crisis](#the-ssh-key-management-crisis)
2. [How SSH Certificates Work](#how-ssh-certificates-work)
3. [Architecture: KeyGrid SSH CA](#architecture-keygrid-ssh-ca)
4. [Use Cases](#use-cases)
5. [Identity Integration: OIDC and SSO](#identity-integration-oidc-and-sso)
6. [Revocation and KRL Distribution](#revocation-and-krl-distribution)
7. [Deployment Models](#deployment-models)
8. [Security and Compliance Benefits](#security-and-compliance-benefits)
9. [Implementation Guide](#implementation-guide)
10. [Conclusion](#conclusion)

---

## The SSH Key Management Crisis

### The Hidden Attack Surface

SSH key sprawl is one of the most underestimated security risks in enterprise infrastructure. A typical organization with 500 engineers and 2,000 servers will have:

- **5,000-15,000 SSH public keys** distributed across server `authorized_keys` files
- **No centralized inventory** of which keys grant access to which servers
- **No expiration mechanism** -- keys remain valid until manually removed
- **No audit trail** for when keys were added, by whom, or why
- **Orphaned keys** from departed employees, decommissioned laptops, and forgotten service accounts

When an employee leaves, IT must hunt through every server to remove their keys. In practice, this rarely happens completely. Security audits routinely find active keys belonging to employees who left years ago.

### The Cost of Static Keys

| Challenge | Impact |
|-----------|--------|
| Key rotation | Requires touching every server per user -- organizations simply don't do it |
| Employee offboarding | Keys persist on servers indefinitely after account deactivation |
| Service account keys | Shared across teams, stored in CI/CD secrets, never rotated |
| Compliance | No way to prove who accessed what, when, or demonstrate key lifecycle management |
| Incident response | Compromised key requires emergency rotation across all servers |
| Lateral movement | A stolen key often grants access to every server where it was deployed |

### Why Traditional Approaches Fail

**LDAP/AD-backed SSH** solves centralized authentication but adds a critical runtime dependency -- if the directory server is unreachable, SSH access fails entirely.

**SSH key management tools** provide inventory and rotation workflows but fundamentally still manage static keys. They reduce the chaos but don't eliminate the underlying problem.

**Bastion hosts and jump servers** consolidate access points but don't address key management on the bastion itself. They add latency and create a single point of failure.

The fundamental issue is that SSH public key authentication was designed for a world of a few users and a handful of servers. It was never intended to scale to thousands of keys across dynamic cloud infrastructure.

---

## How SSH Certificates Work

### OpenSSH Certificate Protocol

OpenSSH has supported certificate-based authentication since version 5.4 (2010), yet adoption remains low because operating a Certificate Authority has historically been complex. SSH certificates work differently from X.509 TLS certificates:

1. **A CA generates a signing key pair** (the CA's private key is used to sign certificates)
2. **Users generate an ephemeral key pair** (standard `ssh-keygen`)
3. **The user submits their public key to the CA** along with a request for specific principals (usernames)
4. **The CA signs the public key** with metadata: principals, validity window, extensions, and a unique serial number
5. **The signed certificate is returned to the user** as a `-cert.pub` file
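6. **SSH servers trust any certificate signed by the CA** via a single `TrustedUserCAKeys` configuration line, provided the certificate is inside its validity window and, by default, lists a principal matching the login account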

The critical difference from static keys: **the server never needs to know about individual users.** It trusts the CA, and the CA decides who gets access based on identity.
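To make steps 4 through 6 concrete, the sketch below builds the stock OpenSSH `ssh-keygen -s` command line that a simple file-based CA would run. It only assembles the argument list and does not execute anything. It is not how KeyGrid signs certificates (KeyGrid signs through an HSM); the file paths, key ID, and serial in the example are illustrative assumptions.

```python
from typing import List, Sequence


def build_sign_command(
    ca_key_path: str,
    user_pubkey_path: str,
    key_id: str,
    principals: Sequence[str],
    validity: str = "+8h",
    serial: int = 1,
) -> List[str]:
    """Return the argv for signing a user public key with a file-based CA."""
    if not principals:
        raise ValueError("at least one principal is required")
    return [
        "ssh-keygen",
        "-s", ca_key_path,           # CA private key used for signing
        "-I", key_id,                # Key ID recorded in the certificate
        "-n", ",".join(principals),  # comma-separated list of principals
        "-V", validity,              # validity interval, e.g. +8h from now
        "-z", str(serial),           # certificate serial number
        user_pubkey_path,            # the user's public key to certify
    ]


# Hypothetical paths and identifiers, for illustration only.
command = build_sign_command(
    ca_key_path="/secure/user-ca",
    user_pubkey_path="id_ed25519.pub",
    key_id="jdoe-laptop-2026-04-05",
    principals=["jdoe", "deploy"],
    serial=42,
)
print(" ".join(command))
```

When such a command is run, `ssh-keygen` writes the certificate next to the public key as `id_ed25519-cert.pub`, which matches the `-cert.pub` naming in step 5.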

### Certificate Fields

An SSH certificate contains:

| Field | Purpose |
|-------|---------|
| **Serial** | Unique identifier for revocation tracking |
| **Key ID** | Human-readable label (e.g., "jdoe-laptop-2026-04-05") |
| **Principals** | Allowed usernames (e.g., ["jdoe", "deploy"]) |
| **Valid After / Valid Before** | Time-bounded validity window |
| **Extensions** | Permitted capabilities (PTY, port forwarding, agent forwarding) |
| **Critical Options** | Constraints (source-address, force-command) |
| **Signature** | CA's cryptographic signature over all fields |
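The fields above can be pictured as a small record plus a server-side check. The following is a simplified model for illustration, not OpenSSH's or KeyGrid's actual data structure: the class and function names are hypothetical, and signature verification and critical-option enforcement are deliberately left out.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from typing import Dict, List


@dataclass(frozen=True)
class SshCertificate:
    """Illustrative model of the certificate fields in the table above."""
    serial: int
    key_id: str
    principals: List[str]
    valid_after: datetime
    valid_before: datetime
    extensions: List[str] = field(default_factory=list)
    critical_options: Dict[str, str] = field(default_factory=dict)


def is_acceptable(cert: SshCertificate, login_user: str, now: datetime) -> bool:
    """Simplified check: inside the validity window and principal matches."""
    in_window = cert.valid_after <= now < cert.valid_before
    return in_window and login_user in cert.principals


issued = datetime(2026, 4, 5, 9, 0, tzinfo=timezone.utc)
cert = SshCertificate(
    serial=42,
    key_id="jdoe-laptop-2026-04-05",
    principals=["jdoe", "deploy"],
    valid_after=issued,
    valid_before=issued + timedelta(hours=8),
    extensions=["permit-pty"],
    critical_options={"source-address": "10.0.0.0/8"},
)

print(is_acceptable(cert, "jdoe", issued + timedelta(hours=1)))  # inside window
print(is_acceptable(cert, "root", issued + timedelta(hours=1)))  # wrong principal
print(is_acceptable(cert, "jdoe", issued + timedelta(hours=9)))  # expired
```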

### Why Short-Lived Certificates Change Everything

When certificates are valid for 8 hours instead of forever:

- **Revocation becomes optional** -- most certificates expire before you'd need to revoke them
- **Stolen credentials are time-limited** -- a compromised cert is useless tomorrow
- **No server-side state to manage** -- no `authorized_keys`, no LDAP queries at authentication time
- **Access is always current** -- certificates reflect the user's identity at issuance time, not when their key was first deployed
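The size of the reduction depends on what lifetime the static key is assumed to have had. A short worked calculation, using the 8-hour validity above and treating 2 and 3 years as example unmanaged key lifetimes (the function name is just for illustration):

```python
def exposure_reduction_percent(static_hours: float, cert_hours: float) -> float:
    """Percentage by which the exposure window shrinks."""
    if static_hours <= 0 or cert_hours < 0:
        raise ValueError("lifetimes must be positive")
    return (1 - cert_hours / static_hours) * 100


HOURS_PER_YEAR = 365 * 24  # 8,760

two_year = exposure_reduction_percent(2 * HOURS_PER_YEAR, 8)
three_year = exposure_reduction_percent(3 * HOURS_PER_YEAR, 8)

print(f"2-year key -> 8-hour cert: {two_year:.2f}% smaller window")    # 99.95%
print(f"3-year key -> 8-hour cert: {three_year:.2f}% smaller window")  # 99.97%
```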

---

## Architecture: KeyGrid SSH CA

### Platform Integration

KeyGrid SSH CA is integrated into the KeyGrid PKI platform, reusing the existing infrastructure for HSM-backed key management, multi-tenant isolation, audit logging, and license-based feature gating. It is not a standalone tool -- it is a first-class module alongside X.509 CA operations, ACME, SCEP, EST, and SPIFFE.

### Components

**SSH CA Service** handles certificate lifecycle:
- CA creation with HSM-backed signing keys (Ed25519, ECDSA P-256/P-384, RSA-4096)
- User and host certificate signing
- Serial management with atomic increment
- Certificate record storage for audit and search

**OIDC Integration** enables browser-based certificate issuance:
- Redirects to your corporate Identity Provider (Okta, Azure AD, Google Workspace)
- Maps OIDC claims (email, groups) to SSH principals
- Issues a certificate automatically after successful authentication

**KRL Generator** handles revocation:
- Builds OpenSSH Key Revocation Lists in binary format
- Auto-regenerates when a certificate is revoked
- Serves KRLs via REST endpoint for server consumption

**Policy Engine** controls access:
- Maps identity sources (OIDC claims, LDAP groups, static assignments) to allowed principals
- Enforces maximum validity periods per policy
- Restricts which users can request which principals

### HSM-Backed Security

The SSH CA's signing key never leaves the HSM. All certificate signing operations go through the HSM API:

1. Application builds the SSH certificate body (principals, validity, extensions)
2. Certificate bytes are sent to the HSM for signing
3. HSM returns the signature
4. Application assembles the final signed certificate

This means even a complete compromise of the application server does not expose the CA's private key. An attacker could at most request signatures for as long as the compromise lasts; extracting the key would require compromising the HSM itself.
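The four signing steps can be sketched as an interface boundary: the application only ever sees a `sign()` call, never key material. In the sketch below, `FakeHsm` is a test stand-in that uses HMAC-SHA256 purely so the example runs without hardware; a real HSM would produce an Ed25519, ECDSA, or RSA signature, and the real certificate encoding is more involved than this simplified length-prefixed layout. All names are hypothetical.

```python
import hashlib
import hmac
import struct
from typing import Protocol


class Signer(Protocol):
    """Anything that can sign bytes without revealing its key."""
    def sign(self, data: bytes) -> bytes: ...


class FakeHsm:
    """Stand-in for an HSM (assumption: HMAC replaces a real signature)."""

    def __init__(self, secret: bytes) -> None:
        self._secret = secret  # in a real HSM this never leaves the device

    def sign(self, data: bytes) -> bytes:
        return hmac.new(self._secret, data, hashlib.sha256).digest()


def issue_certificate(body: bytes, signer: Signer) -> bytes:
    """Steps 2-4: send the body for signing, then append the signature."""
    signature = signer.sign(body)
    # Append the signature as a length-prefixed field (simplified layout).
    return body + struct.pack(">I", len(signature)) + signature


hsm = FakeHsm(secret=b"demo-only-secret")
body = b"principals=jdoe,deploy;valid=8h;serial=42"  # step 1 (toy encoding)
signed = issue_certificate(body, hsm)
print(len(signed) - len(body))  # 4-byte length prefix + 32-byte HMAC
```

The design point is that `issue_certificate` depends only on the `Signer` interface, so swapping the stand-in for a PKCS#11-backed implementation would not change the issuance code.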

### Multi-Tenant Isolation

Every SSH CA is scoped to a tenant. Database queries are filtered by `tenant_id` at every layer. A certificate issued by Tenant A's CA cannot authenticate to servers trusting Tenant B's CA. This is the same isolation model used across all KeyGrid services.

---

## Use Cases

### Developer Access to Production Servers

**Before:** Each developer's SSH public key is added to `authorized_keys` on every server they need access to. When they change laptops, a new key must be deployed. When they leave, keys must be removed from every server.

**After:** Servers trust one CA public key. Developers run `ssh-keygen`, submit their public key to KeyGrid (or authenticate via SSO), receive an 8-hour certificate, and SSH normally. When they leave, their IdP account is disabled: no new certificates can be issued, and any outstanding certificate expires within 8 hours. No server-side changes are needed.

### CI/CD Pipeline Access

**Before:** A static SSH private key is stored as a CI/CD secret (GitHub Actions, Jenkins, GitLab). This key often has root-level access and is shared across all pipelines. It never rotates because updating it would break all builds.

**After:** The pipeline calls the KeyGrid API with an API key scoped to its function. It receives a certificate valid for 10 minutes with only the `deploy` principal. The certificate expires shortly after the deployment step completes, so no long-lived SSH private key sits in CI/CD configuration; the only stored secret is the scoped API key, which can request nothing beyond that narrow certificate.

### Contractor and Vendor Access

**Before:** Contractors receive SSH keys that persist long after their engagement ends. Revoking access requires identifying every server they had access to.

**After:** Contractors authenticate via their company's IdP (federated via OIDC). Their certificates are limited to specific principals with a 4-hour maximum validity. When the engagement ends, the contractor is removed from the IdP group. No new certificates can be issued, and any outstanding certificate expires within 4 hours, with no server-side cleanup.

### Host Identity Verification

**Before:** Users connecting to a new server see "The authenticity of host... can't be established. Are you sure you want to continue connecting?" and type "yes" without verifying the fingerprint (TOFU -- trust on first use).

**After:** New servers receive a host certificate from a host CA during provisioning (cloud-init, Ansible, Terraform). Client machines trust the host CA. Connections to certified hosts are verified automatically -- no TOFU prompt, no risk of connecting to an impersonated server.
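On the client side, trusting a host CA comes down to one `@cert-authority` line in `known_hosts` (or the system-wide `ssh_known_hosts`). The helper below formats such a line; the host pattern and the CA key are placeholders, not real KeyGrid values.

```python
def cert_authority_line(host_pattern: str, ca_public_key: str) -> str:
    """Format a known_hosts entry that trusts a host CA for matching hosts."""
    parts = ca_public_key.split()
    if len(parts) < 2:
        raise ValueError("expected '<key-type> <base64-key> [comment]'")
    key_type, key_blob = parts[0], parts[1]
    return f"@cert-authority {host_pattern} {key_type} {key_blob}"


# Placeholder key material for illustration only.
line = cert_authority_line(
    "*.example.com",
    "ssh-ed25519 AAAAC3NzaC1lZDI1NTE5PLACEHOLDER keygrid-host-ca",
)
print(line)
```

Servers present their certificate through the `HostCertificate` directive in `sshd_config`, which is the counterpart to this client-side entry.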

### Emergency Access Revocation

**Before:** A developer's laptop is stolen. The security team must identify every server the developer had access to, connect to each one, and remove the compromised key from `authorized_keys`. This process takes hours or days.

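**After:** The security team revokes the certificate serial in KeyGrid, and the KRL regenerates automatically. Each server rejects the compromised certificate as soon as it next fetches the KRL, so total response time is bounded by the KRL refresh interval (minutes with a 5-minute cron job) rather than hours or days.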

---

## Identity Integration: OIDC and SSO

### Browser-Based Certificate Issuance

KeyGrid SSH CA integrates with any OIDC-compliant Identity Provider. The flow:

1. Developer generates an ephemeral key pair locally
2. Developer opens `https://keygrid.example.com/ssh/auth/login/{ca-id}?pubkey=ssh-ed25519+AAAA...`
3. Browser redirects to the corporate IdP (Okta, Azure AD, Google)
4. Developer authenticates (password + MFA as configured in the IdP)
5. IdP returns an ID token to KeyGrid
6. KeyGrid maps claims to SSH principals and signs the certificate
7. Developer saves the certificate and uses it for SSH
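One practical detail in step 2: an OpenSSH public key contains a space and base64 characters (`+`, `/`, `=`) that must be encoded before being placed in a query string. A minimal sketch of building the login URL, assuming the path and `pubkey` parameter shown above; the CA ID and key are made-up placeholders.

```python
from urllib.parse import quote, urlencode


def build_login_url(base_url: str, ca_id: str, public_key: str) -> str:
    """Build the OIDC login URL with the public key safely URL-encoded."""
    query = urlencode({"pubkey": public_key})  # space -> '+', '+' -> %2B, ...
    return f"{base_url.rstrip('/')}/ssh/auth/login/{quote(ca_id, safe='')}?{query}"


# Placeholder key: real keys are longer, but use the same character set.
url = build_login_url(
    "https://keygrid.example.com",
    "ca-123",
    "ssh-ed25519 AAAA+b/c=",
)
print(url)
```

Without this encoding, a literal `+` in the base64 data would be decoded as a space on the server side and the key would no longer parse.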

### Claims-to-Principals Mapping

The mapping between IdP identity and SSH principals is fully configurable per CA:

```json
{
  "principal_sources": [
    {
      "claim": "email",
      "transform": "local_part",
      "comment": "jdoe@corp.com becomes principal 'jdoe'"
    },
    {
      "claim": "groups",
      "mapping": {
        "ssh-admins": ["root", "admin"],
        "developers": ["deploy"],
        "sre-team": ["root"]
      }
    }
  ]
}
```

This means:
- A user with email `jdoe@corp.com` in group `developers` gets principals `["jdoe", "deploy"]`
- A user in `ssh-admins` additionally gets principals `["root", "admin"]` on top of their email-derived principal -- no manual key deployment needed
- Group membership changes in the IdP are reflected in the next certificate issuance
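One way to read the configuration above is as a function from ID-token claims to a de-duplicated, ordered list of principals. The sketch below is an interpretation of that JSON for illustration, not KeyGrid's actual evaluation logic; in particular, the ordering and de-duplication rules are assumptions.

```python
from typing import Any, Dict, List

CONFIG: Dict[str, Any] = {
    "principal_sources": [
        {"claim": "email", "transform": "local_part"},
        {
            "claim": "groups",
            "mapping": {
                "ssh-admins": ["root", "admin"],
                "developers": ["deploy"],
                "sre-team": ["root"],
            },
        },
    ]
}


def map_principals(claims: Dict[str, Any], config: Dict[str, Any]) -> List[str]:
    """Derive SSH principals from OIDC claims according to the config."""
    principals: List[str] = []
    for source in config["principal_sources"]:
        value = claims.get(source["claim"])
        if value is None:
            continue
        if source.get("transform") == "local_part":
            candidates = [str(value).split("@", 1)[0]]
        elif "mapping" in source:
            groups = value if isinstance(value, list) else [value]
            candidates = [p for g in groups for p in source["mapping"].get(g, [])]
        else:
            candidates = [str(value)]
        for principal in candidates:
            if principal not in principals:  # keep order, drop duplicates
                principals.append(principal)
    return principals


print(map_principals({"email": "jdoe@corp.com", "groups": ["developers"]}, CONFIG))
# ['jdoe', 'deploy']
```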

### Principal Policies

Beyond claims mapping, KeyGrid supports explicit principal policies that define:
- Which identity sources can request which principals
- Maximum validity overrides per policy
- Enable/disable controls for fine-grained access management

---

## Revocation and KRL Distribution

### The Revocation Problem

Traditional SSH key revocation requires modifying `authorized_keys` on every server -- an O(users x servers) operation. SSH certificates solve this with Key Revocation Lists (KRLs).

### OpenSSH KRL Format

A KRL is a compact binary file listing revoked certificate serials. OpenSSH servers configured with `RevokedKeys` check incoming certificates against the KRL before granting access. KeyGrid generates KRLs in the standard OpenSSH binary format.

### Automatic KRL Regeneration

When a certificate is revoked in KeyGrid:
1. The certificate status is updated to `revoked` in the database
2. A new KRL is automatically generated containing all revoked serials
3. The KRL is available at `GET /api/v1/tenants/{tid}/ssh-cas/{cid}/krl`
4. Servers periodically fetch the updated KRL (via cron, systemd timer, or configuration management)

### Why Short-Lived Certificates Minimize Revocation Need

With 8-hour certificates, the window where revocation is necessary is small. If a certificate is compromised at 2 PM and expires at 5 PM, the exposure window is at most 3 hours. In many cases the certificate expires before a compromise is even detected and a revocation requested. KRL-based revocation exists as a safety net for the cases where immediate revocation is critical.
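The arithmetic in this example generalizes: exposure ends at whichever comes first, certificate expiry or the moment a revocation takes effect on the server. A small sketch, with a hypothetical function name and the 2 PM / 5 PM figures from above:

```python
from datetime import datetime, timedelta
from typing import Optional


def exposure_window(
    compromised_at: datetime,
    expires_at: datetime,
    revocation_effective_at: Optional[datetime] = None,
) -> timedelta:
    """Time during which a compromised certificate is still usable."""
    end = expires_at
    if revocation_effective_at is not None:
        end = min(end, revocation_effective_at)
    return max(end - compromised_at, timedelta(0))


compromised = datetime(2026, 4, 5, 14, 0)  # 2 PM
expires = datetime(2026, 4, 5, 17, 0)      # 5 PM

# Expiry alone: 3 hours of exposure.
print(exposure_window(compromised, expires))

# Revoked at 2:15 PM, picked up by the server's next KRL fetch at 2:20 PM.
print(exposure_window(compromised, expires, datetime(2026, 4, 5, 14, 20)))
```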

---

## Deployment Models

### Cloud-Hosted

KeyGrid SSH CA runs as part of the KeyGrid cloud platform. Organizations configure their SSH servers to trust KeyGrid's CA public key and fetch KRLs from the KeyGrid API endpoint. No on-premises infrastructure required.

### On-Premises

For organizations with strict data residency or air-gapped requirements, KeyGrid deploys on-premises. The SSH CA uses the organization's own HSM infrastructure (PKCS#11, AWS CloudHSM, Azure Key Vault, or embedded). All certificate issuance and key storage remains within the organization's perimeter.

### Hybrid

Organizations can run KeyGrid on-premises for production infrastructure while using the cloud-hosted version for development and staging environments. Multi-tenant isolation ensures complete separation between environments.

---

## Security and Compliance Benefits

### Reduced Attack Surface

| Metric | Static Keys | SSH Certificates |
|--------|-------------|-----------------|
| Credential lifetime | Permanent (until manually removed) | 8 hours (configurable) |
| Exposure window | Years | Hours |
| Revocation speed | Hours to days (manual) | Minutes (automatic KRL, bounded by the server's refresh interval) |
| Server-side state | O(users) keys per server | 1 CA trust anchor per server |
| Audit trail | None (key added to file) | Full (who, when, what principals, how authenticated) |

### Compliance Alignment

**SOC 2:** SSH CA provides the access control evidence that auditors require -- every access credential has a creation time, expiration time, requestor identity, and approval chain.

**ISO 27001:** Certificate-based access with automatic expiration supports access control requirements (Annex A.9 in the 2013 edition) and reduces reliance on manual review cycles.

**PCI DSS:** Short-lived credentials with strong authentication (MFA via IdP) address requirements for unique IDs, authentication mechanisms, and access monitoring.

**NIST 800-53:** SSH CA implements IA-5 (Authenticator Management) with automated lifecycle, AC-2 (Account Management) with identity-driven access, and AU-2 (Audit Events) with comprehensive logging.

### Audit Trail

Every certificate issuance in KeyGrid is logged with:
- Tenant ID, CA ID, and certificate serial number
- Requestor identity (user ID, email, or OIDC subject)
- Authentication method (JWT, API key, or OIDC)
- Granted principals and extensions
- Validity window
- Timestamp

This provides a complete, searchable record of SSH access credentials -- something that is fundamentally impossible with static key management.

---

## Implementation Guide

### Server Configuration

Configuring an OpenSSH server to trust KeyGrid-issued certificates requires two lines in `sshd_config`:

```
# Trust user certificates signed by KeyGrid
TrustedUserCAKeys /etc/ssh/keygrid-user-ca.pub

# Revocation list (optional but recommended)
RevokedKeys /etc/ssh/keygrid-krl
```

The CA public key is fetched once from `GET /api/v1/tenants/{tid}/ssh-cas/{cid}/public-key` and written to the server. The KRL can be refreshed periodically via cron:

```
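# Download to a temp file first so a failed fetch never truncates the live KRL
*/5 * * * * curl -fsS -o /etc/ssh/keygrid-krl.tmp https://keygrid.example.com/api/v1/tenants/.../ssh-cas/.../krl && mv /etc/ssh/keygrid-krl.tmp /etc/ssh/keygrid-krl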
```
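For teams that prefer a script to a cron one-liner, the install step can be sketched in Python. The point of the sketch is the write pattern: the new KRL goes to a temporary file in the same directory and is swapped in with an atomic rename, so `sshd` never reads a partially written file. Downloading the bytes (for example with `urllib.request`) is left out, and the empty-file check is a conservative assumption rather than a KeyGrid requirement.

```python
import os
import tempfile


def install_krl(krl_bytes: bytes, dest_path: str) -> None:
    """Atomically replace dest_path with the freshly downloaded KRL."""
    if not krl_bytes:
        raise ValueError("refusing to install an empty KRL")
    dest_dir = os.path.dirname(os.path.abspath(dest_path))
    # Temp file must live on the same filesystem for the rename to be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir, prefix=".krl-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(krl_bytes)
            tmp.flush()
            os.fsync(tmp.fileno())
        os.chmod(tmp_path, 0o644)  # sshd must be able to read the file
        os.replace(tmp_path, dest_path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

This matters because OpenSSH refuses public key authentication for all users when the `RevokedKeys` file cannot be read, so a broken refresh job should leave the previous KRL in place rather than damage it.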

### Client Workflow

For interactive users with OIDC:
1. `ssh-keygen -t ed25519 -f /tmp/ephemeral` (generate ephemeral key)
2. Open KeyGrid OIDC login URL with public key
3. Authenticate via corporate SSO
4. Save returned certificate as `/tmp/ephemeral-cert.pub` (`ssh` automatically loads `<identity>-cert.pub` alongside the private key)
5. `ssh -i /tmp/ephemeral server.example.com`

For API-based automation:
1. Call `POST /api/v1/tenants/{tid}/ssh-cas/{cid}/sign` with public key and desired principals
2. Receive signed certificate in response
3. Use certificate for SSH connections
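A sketch of step 1 using only the Python standard library follows. The endpoint path comes from the list above; the JSON field names (`public_key`, `principals`, `validity_seconds`) and the bearer-token header are assumptions for illustration and should be checked against the actual API reference. The function only builds the request; nothing is sent.

```python
import json
import urllib.request
from typing import List


def build_sign_request(
    base_url: str,
    tenant_id: str,
    ca_id: str,
    api_key: str,
    public_key: str,
    principals: List[str],
    validity_seconds: int = 600,
) -> urllib.request.Request:
    """Build (but do not send) a certificate signing request."""
    url = f"{base_url.rstrip('/')}/api/v1/tenants/{tenant_id}/ssh-cas/{ca_id}/sign"
    payload = {
        "public_key": public_key,              # assumed field name
        "principals": principals,              # assumed field name
        "validity_seconds": validity_seconds,  # assumed field name
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",  # assumed auth scheme
        },
        method="POST",
    )


# Placeholder identifiers and key material for illustration only.
request = build_sign_request(
    "https://keygrid.example.com",
    tenant_id="t-1",
    ca_id="ca-1",
    api_key="example-api-key",
    public_key="ssh-ed25519 AAAAPLACEHOLDER ci@pipeline",
    principals=["deploy"],
)
print(request.get_method(), request.full_url)
```

A pipeline would pass the result to `urllib.request.urlopen(request)` and write the returned certificate next to its ephemeral private key.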

### Migration Strategy

SSH certificates can be adopted incrementally:

1. **Phase 1:** Deploy the CA public key to servers alongside existing `authorized_keys`. Both methods work simultaneously.
2. **Phase 2:** Issue certificates to early adopters. Static keys continue to work for everyone else.
3. **Phase 3:** Remove static keys for users who have been successfully using certificates.
4. **Phase 4:** Set `AuthorizedKeysFile none` on servers, making certificates the only public-key authentication method.

This phased approach keeps disruption low: static keys and certificates work side by side until Phase 4.

---

## Conclusion

SSH key sprawl is a solvable problem. The technology -- SSH certificates -- has existed in OpenSSH since 2010. What has been missing is a production-grade Certificate Authority that integrates with enterprise identity, provides HSM-backed security, and operates at scale with multi-tenant isolation.

KeyGrid SSH CA fills this gap as part of a comprehensive Modern PKI platform. Whether you are issuing X.509 certificates for TLS, signing documents with TrustSign, authenticating network devices with RADIUS, managing workload identity with SPIFFE, or now securing SSH access -- KeyGrid provides a single platform for all your cryptographic identity needs.

Available in Professional and Enterprise editions, deployed in the cloud or on your own infrastructure.

---

*KeyGrid is a product of Cloudfragments. For more information, visit [keygrid.io](https://www.keygrid.io).*
