Operational Runbooks#
Audience: Operations Administrators
Prerequisites: Kleidia deployed
Outcome: Resolve common operational issues
Runbook Overview#
This document provides step-by-step procedures for common operational scenarios.
Agent Pairing Issues#
Symptom#
Users cannot pair agents with the system.
Diagnosis#
# Check agent is running
curl http://127.0.0.1:56123/health
# Check agent discovery
curl http://127.0.0.1:56123/.well-known/kleidia-agent
# Check backend logs
kubectl logs -f deployment/kleidia-services-backend -n kleidia | grep -i agent
Resolution#
Agent Not Running
# On user workstation
# Start agent service
sudo systemctl start kleidia-agent
# Or run manually
./kleidia-agent
Agent Not Detected
- Verify agent is running on localhost:56123
- Check browser console for CORS errors
- Verify user is logged in
Key Registration Failed
# Check backend logs for registration errors
kubectl logs -f deployment/kleidia-services-backend -n kleidia | grep -i register
# Check database for agent keys
kubectl exec -it kleidia-data-postgres-cluster-0 -n kleidia -- \
psql -U yubiuser -d kleidia -c "SELECT * FROM user_sessions WHERE agent_pubkey IS NOT NULL;"
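The two agent probes from the Diagnosis steps can be combined into one check. The following is a minimal sketch, not part of Kleidia: the function name `check_agent` is hypothetical, and it assumes `curl` is installed on the workstation.

```shell
# check_agent: hypothetical helper that probes the two agent endpoints used in
# the Diagnosis steps and prints one line per endpoint.
# Returns 0 only if both endpoints answer successfully.
check_agent() {
  local base="${1:-http://127.0.0.1:56123}"
  local path rc=0
  for path in /health /.well-known/kleidia-agent; do
    # -f makes curl fail on HTTP error codes; --max-time bounds a hung agent.
    if curl -fsS --max-time 5 "${base}${path}" >/dev/null 2>&1; then
      echo "OK   ${path}"
    else
      echo "FAIL ${path}"
      rc=1
    fi
  done
  return "$rc"
}
```

Run `check_agent` on the user workstation. A `FAIL` on `/health` points to Agent Not Running, while a `FAIL` only on the discovery endpoint points to Agent Not Detected.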
Device Revocation#
Symptom#
Device needs to be revoked (lost, stolen, compromised, or user departure).
Procedure#
Via Admin UI:
- Navigate to Admin Panel → YubiKeys
- Select device to revoke
- Click “Revoke Device”
- Review confirmation dialog (shows device serial, owner, warning)
- Confirm revocation
Verify Revocation:
# Check device status in database
kubectl exec -it kleidia-data-postgres-cluster-0 -n kleidia -- \
psql -U yubiuser -d kleidia -c \
"SELECT id, serial, is_active, deleted_at FROM yubikeys WHERE serial = '<serial-number>';"
Verify Secrets Removed:
# Check Vault secrets (should be removed)
kubectl exec -it kleidia-platform-openbao-0 -n kleidia -- \
vault kv list yubikeys/data/ | grep <serial-number>
Verify Certificates Revoked:
# Check audit logs for certificate revocation
kubectl logs -f deployment/kleidia-services-backend -n kleidia | grep -i "revoke.*certificate"
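When the database check is scripted, the serial ends up inside a SQL string, so it is worth validating first. The sketch below is an illustration only: `revocation_row` is a hypothetical name, and it assumes device serials are purely numeric. It uses `psql -At` so the output is a single `is_active|deleted_at` line with no headers.

```shell
# revocation_row: hypothetical helper. Prints "is_active|deleted_at" for one
# device. Assumption: serials are numeric, so anything else is rejected before
# it is interpolated into the SQL string.
revocation_row() {
  local serial="$1"
  case "$serial" in
    '' | *[!0-9]*)
      echo "invalid serial: ${serial}" >&2
      return 2
      ;;
  esac
  # -A: unaligned output, -t: rows only, -F: field separator.
  kubectl exec -i kleidia-data-postgres-cluster-0 -n kleidia -- \
    psql -U yubiuser -d kleidia -At -F '|' -c \
    "SELECT is_active, deleted_at FROM yubikeys WHERE serial = '${serial}';"
}
```

If `is_active` is a boolean column, psql prints it as `t` or `f`. How a revoked device is represented in these two columns depends on the Kleidia schema, so compare the result against a device you know to be active.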
Automatic Wipe Behavior#
Important: When a revoked device is connected to an admin workstation (where an agent is running), the system automatically attempts to wipe the PIV application. This ensures the device cannot be used even if physically recovered.
To verify wipe attempt:
- Check backend logs for PIV reset attempts
- Check agent logs (if available) for reset operations
- Verify device PIV status if device is accessible
Troubleshooting#
Device Not Wiped Automatically:
- Verify agent is running on admin workstation
- Check device is actually connected to admin workstation
- Review backend logs for PIV reset errors
- Manually reset PIV if needed (via agent or ykman CLI)
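For the manual reset via the ykman CLI, a guarded wrapper helps avoid wiping the wrong key when several are plugged in. This is a sketch under assumptions: `manual_piv_reset` is a hypothetical name, and it assumes `ykman` is installed on the admin workstation. It relies on `ykman list --serials` and `ykman --device <serial> piv reset --force`.

```shell
# manual_piv_reset: hypothetical wrapper around ykman. It refuses to run unless
# the given serial is attached, then resets only that device's PIV application.
manual_piv_reset() {
  local serial="$1"
  # -x: match the whole line, -F: treat the serial as a fixed string.
  if ! ykman list --serials | grep -qxF "$serial"; then
    echo "device ${serial} is not connected" >&2
    return 1
  fi
  # Destructive: wipes all PIV keys and certificates on the device.
  # --force skips ykman's interactive confirmation prompt.
  ykman --device "$serial" piv reset --force
}
```

Call it as `manual_piv_reset <serial-number>` only after the revocation has been confirmed in the Admin UI, since the reset cannot be undone.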
Revocation Failed:
- Check backend logs for errors
- Verify database connectivity
- Verify Vault connectivity
- Check user permissions (admin role required)
Vault 403 Errors#
Symptom#
Backend returns 403 errors when accessing Vault.
Diagnosis#
# Check Vault status
kubectl exec -it kleidia-platform-openbao-0 -n kleidia -- vault status
# Check backend Vault authentication
kubectl logs -f deployment/kleidia-services-backend -n kleidia | grep -i vault
# Check AppRole credentials
kubectl get secret vault-approle -n kleidia
# Test Vault authentication
kubectl exec -it kleidia-platform-openbao-0 -n kleidia -- \
vault write auth/approle/login \
role_id=<role-id> \
secret_id=<secret-id>
Resolution#
Policy Issues
# Check backend policy
kubectl exec -it kleidia-platform-openbao-0 -n kleidia -- \
vault policy read kleidia-backend
# Update policy if needed (-i without -t, because the policy is read from stdin)
kubectl exec -i kleidia-platform-openbao-0 -n kleidia -- \
vault policy write kleidia-backend - <<EOF
path "pki/sign/*" {
  capabilities = ["create", "read", "update"]
}
path "yubikeys/data/*" {
  capabilities = ["create", "read", "update", "delete", "list"]
}
EOF
AppRole Credentials
# Regenerate AppRole credentials
# See Vault Setup documentation
Token Expired
# Restart backend to get new token
kubectl rollout restart deployment/kleidia-services-backend -n kleidia
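To run the AppRole login test from the Diagnosis steps, the role and secret IDs have to be read out of the `vault-approle` Secret first. The helper below is a sketch: `secret_field` is a hypothetical name, and the key names `role-id` and `secret-id` in the usage comment are assumptions. Check the actual key names with `kubectl describe secret vault-approle -n kleidia`.

```shell
# secret_field: hypothetical helper that decodes one key of a Kubernetes Secret.
# Arguments: secret name, key name, namespace (defaults to kleidia).
secret_field() {
  local secret="$1" key="$2" ns="${3:-kleidia}"
  # Secret values are stored base64-encoded; "index" also handles keys
  # that contain hyphens.
  kubectl get secret "$secret" -n "$ns" \
    -o go-template="{{index .data \"${key}\"}}" | base64 -d
}

# Usage (key names are assumptions, adjust to your deployment):
#   role_id="$(secret_field vault-approle role-id)"
#   secret_id="$(secret_field vault-approle secret-id)"
#   kubectl exec -i kleidia-platform-openbao-0 -n kleidia -- \
#     vault write auth/approle/login role_id="$role_id" secret_id="$secret_id"
```

If the login succeeds but the backend still receives 403 responses, the policy is the more likely cause than the credentials.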
TLS Certificate Expiry#
Symptom#
Browser shows certificate errors or certificate expired warnings.
Diagnosis#
# Check certificate expiration
echo | openssl s_client -connect kleidia.example.com:443 2>/dev/null | \
openssl x509 -noout -dates
# Check Let's Encrypt certificates
sudo certbot certificates
Resolution#
Certificate Expired
- Renew certificate through your external load balancer
- Verify certificate is properly configured
Certificate Not Renewing
- Check certificate renewal configuration in your load balancer
- Verify DNS records are correct
- Test certificate renewal manually
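The expiry check in the Diagnosis steps prints dates that still have to be read by eye. For a scripted check, `openssl x509 -checkend` returns a usable exit status. The following is a minimal sketch: `cert_valid_for_days` is a hypothetical name, and the PEM file path in the usage comment is an example.

```shell
# cert_valid_for_days: hypothetical helper. Succeeds (exit 0) if the
# certificate in the given PEM file is still valid N days from now.
cert_valid_for_days() {
  local pem="$1" days="$2"
  # -checkend takes seconds; it exits 0 if the certificate will NOT
  # expire within that window, and 1 if it will.
  openssl x509 -noout -checkend "$((days * 86400))" -in "$pem" >/dev/null
}

# Usage: fetch the served certificate, then test a 14-day window.
#   echo | openssl s_client -connect kleidia.example.com:443 \
#     -servername kleidia.example.com 2>/dev/null | openssl x509 > /tmp/kleidia.pem
#   cert_valid_for_days /tmp/kleidia.pem 14 || echo "certificate expires within 14 days"
```

Running this from a scheduled job gives early warning before browsers start showing errors.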
Agent Connection Issues#
Symptom#
Agents cannot connect or communicate with backend.
Diagnosis#
# Check agent is running on workstation
curl http://127.0.0.1:56123/health
# Check agent discovery
curl http://127.0.0.1:56123/.well-known/kleidia-agent
# Check backend logs
kubectl logs -f deployment/kleidia-services-backend -n kleidia | grep -i agent
Resolution#
Agent Not Running
- Verify agent is installed on workstation
- Start agent service or run manually
- Check agent logs for errors
Connection Refused
- Verify agent is running on localhost:56123
- Check browser console for CORS errors
- Verify user is logged in
- Check backend is accessible
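After starting the agent, it may take a moment before the health endpoint answers. A small polling loop saves repeated manual `curl` calls. This is a sketch: `wait_for_agent` is a hypothetical name, and the retry count and delay defaults are arbitrary.

```shell
# wait_for_agent: hypothetical helper that polls the agent health endpoint.
# Arguments: number of attempts (default 10), delay in seconds (default 2).
wait_for_agent() {
  local tries="${1:-10}" delay="${2:-2}" i=1
  while [ "$i" -le "$tries" ]; do
    if curl -fsS --max-time 2 http://127.0.0.1:56123/health >/dev/null 2>&1; then
      echo "agent healthy after ${i} attempt(s)"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "agent did not respond after ${tries} attempts" >&2
  return 1
}
```

If the loop times out, check the agent logs for errors before looking at the browser or backend side.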
High Disk Usage#
Symptom#
System running out of disk space.
Diagnosis#
# Check disk usage
df -h
# Check Docker disk usage
docker system df
# Check node CPU and memory usage (kubectl top does not report disk usage)
kubectl top nodes
Resolution#
Clean Docker
# Remove stopped containers, unused networks, all unused images, and build cache
# (volumes are only removed if --volumes is added)
docker system prune -af
docker image prune -af
Clean Kubernetes
# Remove completed jobs
kubectl delete jobs --field-selector status.successful=1 -n kleidia
# Remove old logs (if log rotation not configured)
Expand Storage
- Add additional disk
- Expand persistent volumes
- Archive old data
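To see at a glance which filesystems need attention, the `df` output can be filtered by a usage threshold. The sketch below uses a hypothetical function name, `disk_over_threshold`. It reads `df -P` output on stdin so it can also be run against saved output from another node.

```shell
# disk_over_threshold: hypothetical helper. Reads `df -P` output on stdin and
# prints "mountpoint capacity" for every filesystem at or above the limit.
# Argument: usage limit in percent (default 80).
disk_over_threshold() {
  local limit="${1:-80}"
  # In df -P output, column 5 is capacity (e.g. "91%"), column 6 the mount point.
  awk -v limit="$limit" 'NR > 1 {
    use = $5
    sub(/%/, "", use)
    if (use + 0 >= limit) print $6, $5
  }'
}

# Usage:
#   df -P | disk_over_threshold 85
```

Mount points containing spaces are truncated at the first space, which is usually acceptable for a quick check.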
Database Performance Issues#
Symptom#
Slow queries, high database load.
Diagnosis#
# Check database connections
kubectl exec -it kleidia-data-postgres-cluster-0 -n kleidia -- \
psql -U yubiuser -d kleidia -c "SELECT count(*) FROM pg_stat_activity;"
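# Check slow queries (requires the pg_stat_statements extension;
# on PostgreSQL 13 and later the columns are total_exec_time and mean_exec_time)
kubectl exec -it kleidia-data-postgres-cluster-0 -n kleidia -- \
psql -U yubiuser -d kleidia -c "
SELECT query, calls, total_time, mean_time
FROM pg_stat_statements
ORDER BY mean_time DESC
LIMIT 10;
"
Resolution#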
Too Many Connections
# Check connection pool settings
# Reduce connection pool size if needed
Slow Queries
# Vacuum database
kubectl exec -it kleidia-data-postgres-cluster-0 -n kleidia -- \
psql -U yubiuser -d kleidia -c "VACUUM ANALYZE;"
# Review column statistics for index candidates
# (pg_stats does not list missing indexes directly)
kubectl exec -it kleidia-data-postgres-cluster-0 -n kleidia -- \
psql -U yubiuser -d kleidia -c "
SELECT schemaname, tablename, attname, n_distinct, correlation
FROM pg_stats
WHERE schemaname = 'public'
ORDER BY abs(correlation) DESC;
"
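The connection count from the Diagnosis steps is easier to judge next to the server limit. The following sketch compares the two. The function names `kleidia_psql` and `conn_usage` are hypothetical; the pod, user, and database names are the ones used elsewhere in this runbook.

```shell
# kleidia_psql: hypothetical helper that runs one SQL statement in the database
# pod and prints the bare result (-A unaligned, -t rows only).
kleidia_psql() {
  kubectl exec -i kleidia-data-postgres-cluster-0 -n kleidia -- \
    psql -U yubiuser -d kleidia -At -c "$1"
}

# conn_usage: hypothetical helper that prints current connections against
# the configured max_connections.
conn_usage() {
  local used max
  used="$(kleidia_psql "SELECT count(*) FROM pg_stat_activity;")"
  max="$(kleidia_psql "SHOW max_connections;")"
  echo "${used}/${max} connections ($(( used * 100 / max ))%)"
}
```

A value that stays close to 100% suggests the Too Many Connections path. A low value with slow queries points to the Slow Queries path instead.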
Pod CrashLoopBackOff#
Symptom#
Pods restarting repeatedly.
Diagnosis#
# Check pod status
kubectl get pods -n kleidia
# Check pod logs
kubectl logs -f <pod-name> -n kleidia
# Check pod events
kubectl describe pod <pod-name> -n kleidia
Resolution#
Application Errors
- Check application logs for errors
- Verify configuration
- Check dependencies
Resource Constraints
# Check resource limits
kubectl describe pod <pod-name> -n kleidia | grep -A 5 "Limits"
# Increase resources if needed
# Update Helm values and upgrade
Configuration Errors
- Verify environment variables
- Check secrets exist
- Verify service connectivity
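The last termination reason recorded by Kubernetes helps decide which of the three resolution paths to follow. This is a sketch only: `crash_hint` is a hypothetical name, and it inspects just the first container of the pod.

```shell
# crash_hint: hypothetical helper that reads the last termination reason of the
# first container in a pod and suggests where to look next.
crash_hint() {
  local pod="$1" reason
  reason="$(kubectl get pod "$pod" -n kleidia \
    -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}')"
  case "$reason" in
    OOMKilled)
      echo "OOMKilled: container exceeded its memory limit, see Resource Constraints"
      ;;
    "")
      echo "no previous termination recorded"
      ;;
    *)
      # --previous shows the logs of the crashed container instance.
      echo "${reason}: inspect with kubectl logs --previous ${pod} -n kleidia"
      ;;
  esac
}
```

For a pod in CrashLoopBackOff, `kubectl logs --previous` is often more useful than `kubectl logs -f`, because the current container may not have logged anything yet.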
Emergency Procedures#
Complete System Restart#
# Restart all pods
kubectl rollout restart deployment -n kleidia
Database Recovery#
# Stop backend
kubectl scale deployment/kleidia-services-backend --replicas=0 -n kleidia
# Restore from backup
gunzip -c backups/20250115/database.sql.gz | \
kubectl exec -i kleidia-data-postgres-cluster-0 -n kleidia -- \
psql -U yubiuser -d kleidia
# Restart backend
kubectl scale deployment/kleidia-services-backend --replicas=2 -n kleidia
Vault Recovery#
# Stop backend
kubectl scale deployment/kleidia-services-backend --replicas=0 -n kleidia
# Restore Vault snapshot
kubectl cp backups/20250115/vault-backup.snap \
kleidia-platform-openbao-0:/tmp/vault-backup.snap -n kleidia
kubectl exec -it kleidia-platform-openbao-0 -n kleidia -- \
vault operator raft snapshot restore /tmp/vault-backup.snap
# Restart backend
kubectl scale deployment/kleidia-services-backend --replicas=2 -n kleidia
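Both recovery procedures follow the same pattern: stop the backend, restore, restart the backend. The sketch below wraps that pattern so the original replica count is read from the cluster instead of being hard-coded. `with_backend_stopped` is a hypothetical name, and the sketch assumes the backend should be scaled back up even when the restore command fails.

```shell
# with_backend_stopped: hypothetical wrapper. Scales the backend to zero, runs
# the given command, then restores the previous replica count.
# Returns the exit status of the wrapped command.
with_backend_stopped() {
  local deploy="deployment/kleidia-services-backend" ns="kleidia"
  local replicas status=0
  # Remember the current replica count before scaling down.
  replicas="$(kubectl get "$deploy" -n "$ns" -o jsonpath='{.spec.replicas}')"
  kubectl scale "$deploy" --replicas=0 -n "$ns"
  "$@" || status=$?
  kubectl scale "$deploy" --replicas="$replicas" -n "$ns"
  return "$status"
}

# Usage (database restore from the procedure above):
#   with_backend_stopped sh -c 'gunzip -c backups/20250115/database.sql.gz |
#     kubectl exec -i kleidia-data-postgres-cluster-0 -n kleidia -- \
#     psql -U yubiuser -d kleidia'
```

If you would rather keep the backend down after a failed restore, check the return status and scale it up manually once the data has been verified.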