Operational Runbooks#

Audience: Operations Administrators
Prerequisites: Kleidia deployed
Outcome: Resolve common operational issues

Runbook Overview#

This document provides step-by-step procedures for common operational scenarios.

Agent Pairing Issues#

Symptom#

Users cannot pair agents with the system.

Diagnosis#

# Check agent is running
curl http://127.0.0.1:56123/health

# Check agent discovery
curl http://127.0.0.1:56123/.well-known/kleidia-agent

# Check backend logs
kubectl logs -f deployment/kleidia-services-backend -n kleidia | grep -i agent

Resolution#

Agent Not Running

# On user workstation
# Start agent service
sudo systemctl start kleidia-agent
# Or run manually
./kleidia-agent

Agent Not Detected
- Verify agent is running on localhost:56123
- Check browser console for CORS errors
- Verify user is logged in

Key Registration Failed

# Check backend logs for registration errors
kubectl logs -f deployment/kleidia-services-backend -n kleidia | grep -i register

# Check database for agent keys
kubectl exec -it kleidia-data-postgres-cluster-0 -n kleidia -- \
  psql -U yubiuser -d kleidia -c "SELECT * FROM user_sessions WHERE agent_pubkey IS NOT NULL;"

Device Revocation#

Symptom#

Device needs to be revoked (lost, stolen, compromised, or user departure).

Procedure#

Via Admin UI:
- Navigate to Admin Panel → YubiKeys
- Select device to revoke
- Click “Revoke Device”
- Review confirmation dialog (shows device serial, owner, warning)
- Confirm revocation

Verify Revocation:

# Check device status in database
kubectl exec -it kleidia-data-postgres-cluster-0 -n kleidia -- \
  psql -U yubiuser -d kleidia -c \
  "SELECT id, serial, is_active, deleted_at FROM yubikeys WHERE serial = '<serial-number>';"

Verify Secrets Removed:

# Check Vault secrets (should be removed)
kubectl exec -it kleidia-platform-openbao-0 -n kleidia -- \
  vault kv list yubikeys/data/ | grep <serial-number>

Verify Certificates Revoked:

# Check audit logs for certificate revocation
kubectl logs -f deployment/kleidia-services-backend -n kleidia | grep -i "revoke.*certificate"

Automatic Wipe Behavior#

Important: When a revoked device is connected to an admin workstation (where an agent is running), the system automatically attempts to wipe the PIV application. This ensures the device cannot be used even if physically recovered.

To verify wipe attempt:

Check backend logs for PIV reset attempts
Check agent logs (if available) for reset operations
Verify device PIV status if device is accessible

Troubleshooting#

Device Not Wiped Automatically:

Verify agent is running on admin workstation
Check device is actually connected to admin workstation
Review backend logs for PIV reset errors
Manually reset PIV if needed (via agent or ykman CLI)

Revocation Failed:

Check backend logs for errors
Verify database connectivity
Verify Vault connectivity
Check user permissions (admin role required)

Vault 403 Errors#

Symptom#

Backend returns 403 errors when accessing Vault.

Diagnosis#

# Check Vault status
kubectl exec -it kleidia-platform-openbao-0 -n kleidia -- vault status

# Check backend Vault authentication
kubectl logs -f deployment/kleidia-services-backend -n kleidia | grep -i vault

# Check AppRole credentials
kubectl get secret vault-approle -n kleidia

# Test Vault authentication
kubectl exec -it kleidia-platform-openbao-0 -n kleidia -- \
  vault write auth/approle/login \
    role_id=<role-id> \
    secret_id=<secret-id>

Resolution#

Policy Issues

# Check backend policy
kubectl exec -it kleidia-platform-openbao-0 -n kleidia -- \
  vault policy read kleidia-backend

# Update policy if needed
kubectl exec -it kleidia-platform-openbao-0 -n kleidia -- \
  vault policy write kleidia-backend - <<EOF
path "pki/sign/*" {
  capabilities = ["create", "read", "update"]
}
path "yubikeys/data/*" {
  capabilities = ["create", "read", "update", "delete", "list"]
}
EOF

AppRole Credentials

# Regenerate AppRole credentials
# See Vault Setup documentation

Token Expired

# Restart backend to get new token
kubectl rollout restart deployment/kleidia-services-backend -n kleidia

TLS Certificate Expiry#

Symptom#

Browser shows certificate errors or certificate expired warnings.

Diagnosis#

# Check certificate expiration
echo | openssl s_client -connect kleidia.example.com:443 2>/dev/null | \
  openssl x509 -noout -dates

# Check Let's Encrypt certificates
sudo certbot certificates

Resolution#

Certificate Expired
- Renew certificate through your external load balancer
- Verify certificate is properly configured
Certificate Not Renewing
- Check certificate renewal configuration in your load balancer
- Verify DNS records are correct
- Test certificate renewal manually

Agent Connection Issues#

Symptom#

Agents cannot connect or communicate with backend.

Diagnosis#

# Check agent is running on workstation
curl http://127.0.0.1:56123/health

# Check agent discovery
curl http://127.0.0.1:56123/.well-known/kleidia-agent

# Check backend logs
kubectl logs -f deployment/kleidia-services-backend -n kleidia | grep -i agent

Resolution#

Agent Not Running
- Verify agent is installed on workstation
- Start agent service or run manually
- Check agent logs for errors
Connection Refused
- Verify agent is running on localhost:56123
- Check browser console for CORS errors
- Verify user is logged in
- Check backend is accessible

High Disk Usage#

Symptom#

System running out of disk space.

Diagnosis#

# Check disk usage
df -h

# Check Docker disk usage
docker system df

# Check Kubernetes disk usage
kubectl top nodes

Resolution#

Clean Docker

# Remove unused containers, images, volumes
docker system prune -af
docker image prune -af

Clean Kubernetes

# Remove completed jobs
kubectl delete jobs --field-selector status.successful=1 -n kleidia

# Remove old logs (if log rotation not configured)

Expand Storage
- Add additional disk
- Expand persistent volumes
- Archive old data

Database Performance Issues#

Symptom#

Slow queries, high database load.

Diagnosis#

# Check database connections
kubectl exec -it kleidia-data-postgres-cluster-0 -n kleidia -- \
  psql -U yubiuser -d kleidia -c "SELECT count(*) FROM pg_stat_activity;"

# Check slow queries
kubectl exec -it kleidia-data-postgres-cluster-0 -n kleidia -- \
  psql -U yubiuser -d kleidia -c "
    SELECT query, calls, total_time, mean_time
    FROM pg_stat_statements
    ORDER BY mean_time DESC
    LIMIT 10;
  "

Resolution#

Too Many Connections

# Check connection pool settings
# Reduce connection pool size if needed

Slow Queries

# Vacuum database
kubectl exec -it kleidia-data-postgres-cluster-0 -n kleidia -- \
  psql -U yubiuser -d kleidia -c "VACUUM ANALYZE;"

# Check for missing indexes
kubectl exec -it kleidia-data-postgres-cluster-0 -n kleidia -- \
  psql -U yubiuser -d kleidia -c "
    SELECT schemaname, tablename, attname, n_distinct, correlation
    FROM pg_stats
    WHERE schemaname = 'public'
    ORDER BY abs(correlation) DESC;
  "

Pod CrashLoopBackOff#

Symptom#

Pods restarting repeatedly.

Diagnosis#

# Check pod status
kubectl get pods -n kleidia

# Check pod logs
kubectl logs -f <pod-name> -n kleidia

# Check pod events
kubectl describe pod <pod-name> -n kleidia

Resolution#

Application Errors
- Check application logs for errors
- Verify configuration
- Check dependencies

Resource Constraints

# Check resource limits
kubectl describe pod <pod-name> -n kleidia | grep -A 5 "Limits"

# Increase resources if needed
# Update Helm values and upgrade

Configuration Errors
- Verify environment variables
- Check secrets exist
- Verify service connectivity

Emergency Procedures#

Complete System Restart#

# Restart all pods
kubectl rollout restart deployment -n kleidia

Database Recovery#

# Stop backend
kubectl scale deployment/kleidia-services-backend --replicas=0 -n kleidia

# Restore from backup
gunzip -c backups/20250115/database.sql.gz | \
  kubectl exec -i kleidia-data-postgres-cluster-0 -n kleidia -- \
  psql -U yubiuser -d kleidia

# Restart backend
kubectl scale deployment/kleidia-services-backend --replicas=2 -n kleidia

Vault Recovery#

# Stop backend
kubectl scale deployment/kleidia-services-backend --replicas=0 -n kleidia

# Restore Vault snapshot
kubectl cp backups/20250115/vault-backup.snap \
  kleidia-platform-openbao-0:/tmp/vault-backup.snap -n kleidia

kubectl exec -it kleidia-platform-openbao-0 -n kleidia -- \
  vault operator raft snapshot restore /tmp/vault-backup.snap

# Restart backend
kubectl scale deployment/kleidia-services-backend --replicas=2 -n kleidia

Operational Runbooks#

Runbook Overview#

Agent Pairing Issues#

Symptom#

Diagnosis#

Resolution#

Device Revocation#

Symptom#

Procedure#

Automatic Wipe Behavior#

Troubleshooting#

Vault 403 Errors#

Symptom#

Diagnosis#

Resolution#

TLS Certificate Expiry#

Symptom#

Diagnosis#

Resolution#

Agent Connection Issues#

Symptom#

Diagnosis#

Resolution#

High Disk Usage#

Symptom#

Diagnosis#

Resolution#

Database Performance Issues#

Symptom#

Diagnosis#

Resolution#

Pod CrashLoopBackOff#

Symptom#

Diagnosis#

Resolution#

Emergency Procedures#

Complete System Restart#

Database Recovery#

Vault Recovery#

Related Documentation#