Operations Guide¶
This section covers operational aspects of running VirtRigaud in production environments.
Overview¶
Operating VirtRigaud involves monitoring, security, maintenance, and troubleshooting. This guide provides best practices and procedures for production deployments.
Core Topics¶
Observability¶
Monitor VirtRigaud with metrics, logs, and alerts:
- Prometheus metrics integration
- Grafana dashboards
- Log aggregation
- Alert rules
- Performance monitoring
Security¶
Secure your VirtRigaud deployment:
- RBAC configuration
- Secret management
- Network policies
- mTLS setup
- Security best practices
See also the Security Configuration section below for detailed configuration guides.
Resilience¶
Build fault-tolerant VirtRigaud deployments:
- High availability setup
- Disaster recovery
- Backup strategies
- Failure scenarios
- Recovery procedures
Upgrade Guide¶
Safely upgrade VirtRigaud:
- Version compatibility
- Upgrade procedures
- CRD migrations
- Rollback strategies
- Breaking changes
Infrastructure-Specific Topics¶
vSphere Hardware Versions¶
Manage VMware hardware compatibility:
- Hardware version selection
- Compatibility matrix
- Upgrade procedures
- Feature availability
Libvirt Host Preparation¶
Prepare hosts for the Libvirt provider:
- Host requirements
- KVM configuration
- Network setup
- Storage configuration
- Security hardening
Security Configuration¶
Detailed security configuration guides:
Bearer Token Authentication¶
Configure token-based authentication:
- Token generation
- Token rotation
- Service accounts
- Token best practices
mTLS Configuration¶
Enable mutual TLS:
- Certificate generation
- Certificate management
- Provider configuration
- Troubleshooting
External Secrets¶
Integrate with secret management systems:
- External Secrets Operator
- Vault integration
- AWS Secrets Manager
- Secret rotation
Network Policies¶
Restrict network communication:
- Policy examples
- Provider isolation
- Egress rules
- Troubleshooting
Production Checklist¶
Before deploying to production:
- Monitoring: Set up metrics and alerts
- Security: Configure RBAC and network policies
- Secrets: Use external secret management
- High Availability: Deploy with multiple replicas
- Backups: Configure backup procedures
- Documentation: Document your configuration
- Testing: Validate in a staging environment
- Runbooks: Create incident response procedures
Common Operational Tasks¶
Scaling Providers¶
# Scale provider deployment
kubectl scale deployment vsphere-provider \
  -n virtrigaud-system \
  --replicas=3
Rotating Credentials¶
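# Update provider credentials. The generated manifest contains only the keys
# listed here, so include every key the provider reads. Add -n <namespace>
# if the Secret is not in your current namespace.
kubectl create secret generic vsphere-creds \
  --from-literal=password=new-password \
  --dry-run=client -o yaml | \
  kubectl apply -f -
# Restart provider to pick up new credentials
kubectl rollout restart deployment vsphere-provider \
  -n virtrigaud-system
# Wait for the restart to complete before checking provider health
kubectl rollout status deployment vsphere-provider \
  -n virtrigaud-system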
Checking Provider Health¶
# Check provider status
kubectl get providers
# Check provider pod status
kubectl get pods -n virtrigaud-system -l app=virtrigaud-provider
# View provider logs
kubectl logs -n virtrigaud-system \
  -l app=virtrigaud-provider \
  --tail=100
Monitoring VM Operations¶
# List all VMs
kubectl get vms -A
# Watch VM status
kubectl get vms -w
# Check VM events
kubectl get events --field-selector involvedObject.kind=VirtualMachine
Troubleshooting¶
Manager Not Starting¶
# Check manager logs
kubectl logs -n virtrigaud-system deployment/virtrigaud-manager
# Check webhook configuration
kubectl get validatingwebhookconfigurations
kubectl get mutatingwebhookconfigurations
# Verify CRDs are installed
kubectl get crds | grep virtrigaud
Provider Connection Issues¶
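# Test provider connectivity (requires curl in the provider image;
# -k skips TLS certificate verification)
kubectl exec -n virtrigaud-system deployment/vsphere-provider -- \
  curl -k https://vcenter.example.com
# Check credentials (values in the output are base64-encoded)
kubectl get secret vsphere-creds -o yaml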
# Verify provider configuration
kubectl describe provider vsphere-provider
VM Creation Failures¶
# Check VM status
kubectl describe vm my-vm
# Check provider logs
kubectl logs -n virtrigaud-system \
  -l app=virtrigaud-provider \
  --tail=100 | grep my-vm
# Check manager logs
kubectl logs -n virtrigaud-system deployment/virtrigaud-manager | grep my-vm
Best Practices¶
Resource Management¶
- Set appropriate resource requests/limits for providers
- Use PodDisruptionBudgets for manager and providers
- Configure autoscaling for high-load scenarios
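As an illustration of the PodDisruptionBudget recommendation, the sketch below renders a minimal manifest. The function name `render_pdb` is hypothetical, and the `app=virtrigaud-provider` label is an assumption taken from the selectors used in the health-check commands in this guide; adjust both to match your deployment.

```shell
# Render a PodDisruptionBudget manifest to stdout.
# Arguments: <name> <namespace> <app-label> <min-available>
# The label key "app" is an assumption based on the selectors used in this guide.
render_pdb() {
  local name="$1"
  local namespace="$2"
  local app_label="$3"
  local min_available="$4"
  cat <<EOF
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: ${name}-pdb
  namespace: ${namespace}
spec:
  minAvailable: ${min_available}
  selector:
    matchLabels:
      app: ${app_label}
EOF
}

# Example: keep at least one provider pod available during voluntary disruptions.
render_pdb vsphere-provider virtrigaud-system virtrigaud-provider 1
```

The output can be piped to `kubectl apply -f -`. A `minAvailable` of 1 only protects availability when the deployment runs at least two replicas.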
Security¶
- Use least-privilege RBAC roles
- Rotate credentials regularly
- Enable audit logging
- Use NetworkPolicies to restrict traffic
- Enable mTLS for provider communication
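To make the NetworkPolicy recommendation concrete, here is a minimal sketch that limits ingress to provider pods to traffic from pods in the same namespace. The function name `render_provider_netpol` is hypothetical, the pod label is assumed from the selectors used elsewhere in this guide, and the port number in the example call is a placeholder, not a documented VirtRigaud port.

```shell
# Render a NetworkPolicy that only allows ingress to provider pods
# from pods in the same namespace, on a single TCP port.
# Arguments: <namespace> <app-label> <port>
render_provider_netpol() {
  local namespace="$1"
  local app_label="$2"
  local port="$3"
  cat <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: provider-ingress
  namespace: ${namespace}
spec:
  podSelector:
    matchLabels:
      app: ${app_label}
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector: {}
      ports:
        - protocol: TCP
          port: ${port}
EOF
}

# Example call; 9443 is a placeholder port, replace it with the port your provider listens on.
render_provider_netpol virtrigaud-system virtrigaud-provider 9443
```

An empty `podSelector` under `from` matches every pod in the policy's own namespace. Policies are only enforced when the cluster's network plugin supports NetworkPolicy.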
Monitoring¶
- Set up Prometheus ServiceMonitor
- Configure alerting rules
- Create Grafana dashboards
- Enable structured logging
- Track key metrics (VM operations, errors, latency)
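The ServiceMonitor item above could look roughly like the following sketch, which assumes the Prometheus Operator CRDs are installed. The function name `render_service_monitor` is hypothetical, and both the Service label and the named port `metrics` are assumptions; check the Services in your installation for the actual values.

```shell
# Render a Prometheus Operator ServiceMonitor manifest to stdout.
# Arguments: <name> <namespace> <app-label> <service-port-name>
# The Service label and port name are assumptions, not confirmed VirtRigaud values.
render_service_monitor() {
  local name="$1"
  local namespace="$2"
  local app_label="$3"
  local port_name="$4"
  cat <<EOF
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: ${name}
  namespace: ${namespace}
spec:
  selector:
    matchLabels:
      app: ${app_label}
  endpoints:
    - port: ${port_name}
      interval: 30s
EOF
}

# Example: scrape provider Services every 30 seconds.
render_service_monitor virtrigaud-provider virtrigaud-system virtrigaud-provider metrics
```

The `port` field refers to a named port on the Service, not a port number, so the Service must expose a port with that name.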
High Availability¶
- Run manager with multiple replicas
- Deploy providers redundantly
- Use topology spread constraints
- Configure pod anti-affinity
- Test failover scenarios
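The topology spread and anti-affinity items can be combined in a single Deployment patch. The sketch below renders one such patch; the function name `render_ha_patch` is hypothetical and the `app` label value is assumed from the selectors used in this guide. Both rules are soft preferences here, so scheduling still succeeds on small clusters.

```shell
# Render a strategic-merge patch that spreads pods across zones and
# prefers placing replicas on different nodes.
# Argument: <app-label> (assumed value of the pods' "app" label)
render_ha_patch() {
  local app_label="$1"
  cat <<EOF
spec:
  template:
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchLabels:
              app: ${app_label}
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                topologyKey: kubernetes.io/hostname
                labelSelector:
                  matchLabels:
                    app: ${app_label}
EOF
}

# Example: write the patch to a file for review before applying it.
render_ha_patch virtrigaud-provider > ha-patch.yaml
```

After reviewing the file, it could be applied with `kubectl patch deployment vsphere-provider -n virtrigaud-system --patch-file ha-patch.yaml`. Switch `whenUnsatisfiable` to `DoNotSchedule` only if you would rather leave pods pending than violate the spread.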
Performance Tuning¶
Manager Optimization¶
# Increase concurrent reconcilers
spec:
  template:
    spec:
      containers:
        - name: manager
          env:
            - name: MAX_CONCURRENT_RECONCILES
              value: "10"
Provider Optimization¶
# Tune provider connection pool
spec:
  template:
    spec:
      containers:
        - name: provider
          env:
            - name: MAX_CONNECTIONS
              value: "20"
            - name: CONNECTION_TIMEOUT
              value: "30s"
Maintenance Windows¶
Planning Maintenance¶
- Notify users of the maintenance window
- Scale down non-critical workloads
- Back up critical resources
- Test in a staging environment
- Execute maintenance tasks
- Verify system health
- Document changes made
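For the backup step in the list above, one simple approach is to export the custom resources to YAML files before maintenance starts. This is a minimal sketch, not a full backup procedure: the function name `backup_virtrigaud` is hypothetical, and it assumes the `vms` and `providers` resource names used with `kubectl get` elsewhere in this guide. Secrets and other dependent objects are not covered.

```shell
# Export VirtRigaud custom resources to YAML files in a target directory.
# Argument: <output-directory>
# Resource names "vms" and "providers" are taken from the kubectl commands in this guide.
backup_virtrigaud() {
  local out_dir="$1"
  local kind
  mkdir -p "$out_dir" || return 1
  for kind in vms providers; do
    # Stop at the first failed export so a partial backup is not mistaken for a complete one.
    kubectl get "$kind" -A -o yaml > "$out_dir/$kind.yaml" || return 1
  done
}

# Usage (directory name is only an example):
#   backup_virtrigaud "virtrigaud-backup-$(date +%Y%m%d-%H%M%S)"
```

Exported manifests include status fields and cluster-assigned metadata, so review them before re-applying to a cluster.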
During Maintenance¶
# Prevent new VM operations (example using labels)
kubectl label namespace production maintenance=true
# Drain nodes if needed
kubectl drain node-1 --ignore-daemonsets
# Perform upgrades
helm upgrade virtrigaud virtrigaud/virtrigaud \
  --namespace virtrigaud-system \
  --version 0.2.3
# Verify health
kubectl get pods -n virtrigaud-system
kubectl get vms -A
Support and Resources¶
- GitHub Issues - Report bugs
- Slack Channel - Community support
- Documentation - Comprehensive guides
- Security Guide - Security best practices
- Observability Guide - Monitoring setup
Next Steps¶
- Set up observability for your deployment
- Configure security policies
- Plan for high availability
- Review the upgrade guide