Operations Guide¶

This section covers operational aspects of running VirtRigaud in production environments.

Overview¶

Operating VirtRigaud involves monitoring, security, maintenance, and troubleshooting. This guide provides best practices and procedures for production deployments.

Core Topics¶

Observability ¶

Monitor VirtRigaud with metrics, logs, and alerts:

Prometheus metrics integration
Grafana dashboards
Log aggregation
Alert rules
Performance monitoring

Security ¶

Secure your VirtRigaud deployment:

RBAC configuration
Secret management
Network policies
mTLS setup
Security best practices

See also the Security Configuration section below for detailed configuration guides.

Resilience ¶

Build fault-tolerant VirtRigaud deployments:

High availability setup
Disaster recovery
Backup strategies
Failure scenarios
Recovery procedures

Upgrade Guide ¶

Safely upgrade VirtRigaud:

Version compatibility
Upgrade procedures
CRD migrations
Rollback strategies
Breaking changes

Infrastructure-Specific Topics¶

vSphere Hardware Versions ¶

Manage VMware hardware compatibility:

Hardware version selection
Compatibility matrix
Upgrade procedures
Feature availability

Libvirt Host Preparation ¶

Prepare hosts for the Libvirt provider:

Host requirements
KVM configuration
Network setup
Storage configuration
Security hardening

Security Configuration¶

Detailed security configuration guides:

Bearer Token Authentication ¶

Configure token-based authentication:

Token generation
Token rotation
Service accounts
Token best practices

mTLS Configuration ¶

Enable mutual TLS:

Certificate generation
Certificate management
Provider configuration
Troubleshooting

External Secrets ¶

Integrate with secret management systems:

External Secrets Operator
Vault integration
AWS Secrets Manager
Secret rotation

Network Policies ¶

Restrict network communication:

Policy examples
Provider isolation
Egress rules
Troubleshooting

Production Checklist¶

Before deploying to production:

Monitoring: Set up metrics and alerts
Security: Configure RBAC and network policies
Secrets: Use external secret management
High Availability: Deploy with multiple replicas
Backups: Configure backup procedures
Documentation: Document your configuration
Testing: Validate in a staging environment
Runbooks: Create incident response procedures

Common Operational Tasks¶

Scaling Providers¶

# Scale provider deployment
kubectl scale deployment vsphere-provider \
  -n virtrigaud-system \
  --replicas=3
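
A scale command returns as soon as the new replica count is recorded, not when the pods are ready. The sketch below pairs the scale with a wait on the rollout. It assumes the provider Deployment lives in virtrigaud-system, as in the command above; the function name and the 120s timeout are illustrative choices, not VirtRigaud defaults.

```shell
# Scale a provider Deployment, then block until the rollout settles
# or the timeout expires.
# Usage: scale_provider vsphere-provider 3
scale_provider() {
  kubectl scale deployment "$1" -n virtrigaud-system --replicas="$2" &&
    kubectl rollout status "deployment/$1" -n virtrigaud-system --timeout=120s
}
```

A non-zero exit status means either the scale request was rejected or the pods did not become ready in time.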

Rotating Credentials¶

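# Update provider credentials. The generated manifest contains only the
# literals listed here, so include every key the provider needs. Add
# -n <namespace> if the Secret is not in your current namespace.
kubectl create secret generic vsphere-creds \
  --from-literal=password=new-password \
  --dry-run=client -o yaml | \
  kubectl apply -f -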

# Restart provider to pick up new credentials
kubectl rollout restart deployment vsphere-provider \
  -n virtrigaud-system
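
When only one key changes, a merge patch is an alternative that leaves the Secret's other keys alone. This is a minimal sketch, not a VirtRigaud-specific procedure: the function names are hypothetical, and the key name is assumed to contain no characters that need JSON escaping.

```shell
# Build a JSON merge patch that sets one key of a Secret.
# Values under "data" must be base64-encoded; tr removes the line
# wrapping that some base64 implementations add to long values.
secret_key_patch() {
  printf '{"data":{"%s":"%s"}}' "$1" "$(printf '%s' "$2" | base64 | tr -d '\n')"
}

# Patch a single key in place; other keys in the Secret are untouched.
# Usage: rotate_secret_key vsphere-creds password 'new-password'
rotate_secret_key() {
  kubectl patch secret "$1" --type merge -p "$(secret_key_patch "$2" "$3")"
}
```

The provider still needs the rollout restart shown above to pick up the new value.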

Checking Provider Health¶

# Check provider status
kubectl get providers

# Check provider pod status
kubectl get pods -n virtrigaud-system -l app=virtrigaud-provider

# View provider logs
kubectl logs -n virtrigaud-system \
  -l app=virtrigaud-provider \
  --tail=100
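
For a quick pass over many pods, the pod listing can be filtered down to the ones that need attention. The sketch below assumes the default column layout of kubectl get pods (NAME, READY, STATUS, ...); the function name is hypothetical.

```shell
# Read "kubectl get pods --no-headers" output on stdin and print the
# names of pods that are not Running or have containers that are not ready.
# Usage:
#   kubectl get pods -n virtrigaud-system -l app=virtrigaud-provider \
#     --no-headers | unhealthy_pods
unhealthy_pods() {
  awk '{
    split($2, ready, "/")   # READY column, for example "1/2"
    if ($3 != "Running" || ready[1] != ready[2]) print $1
  }'
}
```

Empty output means every listed pod is Running with all of its containers ready.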

Monitoring VM Operations¶

# List all VMs
kubectl get vms -A

# Watch VM status (current namespace only; add -A to watch all namespaces)
kubectl get vms -w

# Check VM events
kubectl get events --field-selector involvedObject.kind=VirtualMachine
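
The event query can be narrowed to problems only. This sketch combines the kind selector above with the standard type=Warning event field and sorts by time; the function name is hypothetical.

```shell
# List Warning events for VirtualMachine objects in all namespaces,
# oldest first.
vm_warning_events() {
  kubectl get events -A \
    --field-selector involvedObject.kind=VirtualMachine,type=Warning \
    --sort-by=.lastTimestamp
}
```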

Troubleshooting¶

Manager Not Starting¶

# Check manager logs
kubectl logs -n virtrigaud-system deployment/virtrigaud-manager

# Check webhook configuration
kubectl get validatingwebhookconfigurations
kubectl get mutatingwebhookconfigurations

# Verify CRDs are installed
kubectl get crds | grep virtrigaud

Provider Connection Issues¶

# Test provider connectivity (assumes curl is present in the provider image;
# -k skips TLS certificate verification)
kubectl exec -n virtrigaud-system deployment/vsphere-provider -- \
  curl -k https://vcenter.example.com

# Check credentials (the output contains base64-encoded secret values; handle it with care)
kubectl get secret vsphere-creds -o yaml

# Verify provider configuration
kubectl describe provider vsphere-provider

VM Creation Failures¶

# Check VM status
kubectl describe vm my-vm

# Check provider logs
kubectl logs -n virtrigaud-system \
  -l app=virtrigaud-provider \
  --tail=100 | grep my-vm

# Check manager logs
kubectl logs -n virtrigaud-system deployment/virtrigaud-manager | grep my-vm

Best Practices¶

Resource Management¶

Set appropriate resource requests/limits for providers
Use PodDisruptionBudgets for manager and providers
Configure autoscaling for high-load scenarios
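
The last two points can be sketched with stock kubectl commands. The label selector and namespace are taken from the commands earlier in this guide; the PDB name, the replica bounds, and the CPU target are placeholders to adapt. CPU-based autoscaling also assumes a metrics source such as metrics-server and CPU requests on the provider containers.

```shell
# Keep at least one provider pod available during voluntary disruptions
# such as node drains. The PDB name is hypothetical.
protect_providers() {
  kubectl create poddisruptionbudget virtrigaud-provider-pdb \
    -n virtrigaud-system \
    --selector=app=virtrigaud-provider \
    --min-available=1
}

# Create a HorizontalPodAutoscaler for one provider Deployment.
# Usage: autoscale_provider vsphere-provider
autoscale_provider() {
  kubectl autoscale deployment "$1" -n virtrigaud-system \
    --min=2 --max=5 --cpu-percent=80
}
```

Once an autoscaler owns a Deployment, it overrides replica counts set manually with kubectl scale.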

Security¶

Use least-privilege RBAC roles
Rotate credentials regularly
Enable audit logging
Use NetworkPolicies to restrict traffic
Enable mTLS for provider communication

Monitoring¶

Set up Prometheus ServiceMonitor
Configure alerting rules
Create Grafana dashboards
Enable structured logging
Track key metrics (VM operations, errors, latency)

High Availability¶

Run manager with multiple replicas
Deploy providers redundantly
Use topology spread constraints
Configure pod anti-affinity
Test failover scenarios

Performance Tuning¶

Manager Optimization¶

# Increase concurrent reconcilers
spec:
  template:
    spec:
      containers:
      - name: manager
        env:
        - name: MAX_CONCURRENT_RECONCILES
          value: "10"

Provider Optimization¶

# Tune provider connection pool
spec:
  template:
    spec:
      containers:
      - name: provider
        env:
        - name: MAX_CONNECTIONS
          value: "20"
        - name: CONNECTION_TIMEOUT
          value: "30s"
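
For a quick experiment, the same variables can be set on a running Deployment without editing manifests. This is a sketch under assumptions: the Deployment names come from the commands earlier in this guide, the container names from the fragments above, and the helper name is hypothetical.

```shell
# Set environment variables on one container of a Deployment in
# virtrigaud-system. Changing the pod template triggers a rollout.
# Usage:
#   set_deploy_env virtrigaud-manager manager MAX_CONCURRENT_RECONCILES=10
#   set_deploy_env vsphere-provider provider MAX_CONNECTIONS=20 CONNECTION_TIMEOUT=30s
set_deploy_env() {
  deploy="$1"
  container="$2"
  shift 2
  kubectl set env "deployment/$deploy" -n virtrigaud-system -c "$container" "$@"
}
```

Values set this way are likely to be overwritten by the next helm upgrade unless they are also recorded in the chart values.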

Maintenance Windows¶

Planning Maintenance¶

Notify users of the maintenance window
Scale down non-critical workloads
Backup critical resources
Test in a staging environment
Execute maintenance tasks
Verify system health
Document changes made

During Maintenance¶

# Mark the namespace as under maintenance (example using labels; this blocks
# new VM operations only if a policy or controller acts on the label)
kubectl label namespace production maintenance=true

# Drain nodes if needed
kubectl drain node-1 --ignore-daemonsets

# Perform upgrades
helm upgrade virtrigaud virtrigaud/virtrigaud \
  --namespace virtrigaud-system \
  --version 0.2.3

# Verify health
kubectl get pods -n virtrigaud-system
kubectl get vms -A
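
The commands above drain a node and label a namespace but never undo either step. A closing sketch, reusing the node-1 and production names from the example; the function name and the 180s timeout are illustrative:

```shell
# Wait for the upgraded manager to become ready, then return the drained
# node to service and remove the maintenance label (a trailing "-" on the
# key deletes a label).
finish_maintenance() {
  kubectl rollout status deployment/virtrigaud-manager \
    -n virtrigaud-system --timeout=180s &&
    kubectl uncordon node-1 &&
    kubectl label namespace production maintenance-
}
```

Because the steps are chained with &&, the node stays cordoned and the label stays in place if the manager does not become ready.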

Support and Resources¶

GitHub Issues - Report bugs
Slack Channel - Community support
Documentation - Comprehensive guides
Security Guide - Security best practices
Observability Guide - Monitoring setup

Next Steps¶

Set up observability for your deployment
Configure security policies
Plan for high availability
Review the upgrade guide