VirtRigaud Observability Guide¶

This document describes VirtRigaud's observability features: structured logging, metrics, tracing, health endpoints, dashboards, and alerts.

Overview¶

VirtRigaud provides production-grade observability through:

Structured JSON Logging with correlation IDs and automatic secret redaction
Comprehensive Prometheus Metrics for all components and operations
OpenTelemetry Tracing with gRPC instrumentation
Health Endpoints for liveness and readiness probes
Grafana Dashboards for visualization
Prometheus Alerts for proactive monitoring

Logging¶

Configuration¶

Configure logging via environment variables:

LOG_LEVEL=info              # debug, info, warn, error
LOG_FORMAT=json             # json or console
LOG_SAMPLING=true           # Enable log sampling
LOG_DEVELOPMENT=false       # Development mode

Correlation IDs¶

All log entries include correlation fields:

{
  "level": "info",
  "ts": "2025-01-27T10:30:45.123Z",
  "msg": "VM operation started",
  "correlationID": "req-12345",
  "vm": "default/web-server-1",
  "provider": "default/vsphere-prod",
  "providerType": "vsphere",
  "taskRef": "task-67890",
  "reconcile": "uuid-abcdef"
}
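Because every entry carries the same correlation fields, one request can be followed across components by filtering on `correlationID`. The following is a minimal sketch, assuming logs are emitted as one JSON object per line (`LOG_FORMAT=json`); the helper name `filter_by_correlation` and the sample lines are illustrative and not part of VirtRigaud.

```python
import json
from typing import Iterable, Iterator


def filter_by_correlation(lines: Iterable[str], correlation_id: str) -> Iterator[dict]:
    """Yield parsed log entries whose correlationID equals correlation_id."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON lines, e.g. console-format output
        if isinstance(entry, dict) and entry.get("correlationID") == correlation_id:
            yield entry


# Made-up sample lines shaped like the entry shown above.
sample = [
    '{"level":"info","msg":"VM operation started","correlationID":"req-12345","vm":"default/web-server-1"}',
    '{"level":"info","msg":"unrelated","correlationID":"req-99999"}',
    "not json",
]
matches = list(filter_by_correlation(sample, "req-12345"))
```

In practice the `lines` argument could be an open log file or the piped output of `kubectl logs`.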

Secret Redaction¶

Sensitive information is automatically redacted:

{
  "msg": "Connecting to provider",
  "endpoint": "vcenter://user:[REDACTED]@vc.example.com/Datacenter",
  "userData": "[REDACTED]"
}
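The kind of redaction shown above can be approximated as follows. This is a sketch only, not VirtRigaud's implementation: the key list `SENSITIVE_KEYS` and the URL pattern are assumptions chosen to reproduce the example output.

```python
import re

REDACTED = "[REDACTED]"

# Hypothetical set of field names whose values are always hidden.
SENSITIVE_KEYS = {"password", "token", "userData"}

# Matches the password part of scheme://user:password@host
_URL_PASSWORD = re.compile(
    r"(?P<prefix>[A-Za-z][A-Za-z0-9+.\-]*://[^/\s:@]+:)[^@\s/]+(?=@)"
)


def redact(fields: dict) -> dict:
    """Return a copy of fields with sensitive values replaced."""
    out = {}
    for key, value in fields.items():
        if key in SENSITIVE_KEYS:
            out[key] = REDACTED
        elif isinstance(value, str):
            # Keep scheme and user name, hide only the password.
            out[key] = _URL_PASSWORD.sub(lambda m: m.group("prefix") + REDACTED, value)
        else:
            out[key] = value
    return out


entry = redact({
    "msg": "Connecting to provider",
    "endpoint": "vcenter://user:s3cret@vc.example.com/Datacenter",
    "userData": "#cloud-config",
})
```

URLs without embedded credentials, such as `http://jaeger:14268/api/traces`, pass through unchanged because the pattern requires an `@` after the password.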

Metrics Catalog¶

Manager Metrics¶

Metric	Type	Description	Labels
`virtrigaud_manager_reconcile_total`	Counter	Total reconcile operations	`kind`, `outcome`
`virtrigaud_manager_reconcile_duration_seconds`	Histogram	Reconcile duration	`kind`
`virtrigaud_queue_depth`	Gauge	Work queue depth	`kind`

Provider Metrics¶

Metric	Type	Description	Labels
`virtrigaud_provider_rpc_requests_total`	Counter	RPC requests	`provider_type`, `method`, `code`
`virtrigaud_provider_rpc_latency_seconds`	Histogram	RPC latency	`provider_type`, `method`
`virtrigaud_provider_tasks_inflight`	Gauge	Inflight tasks	`provider_type`, `provider`

VM Operation Metrics¶

Metric	Type	Description	Labels
`virtrigaud_vm_operations_total`	Counter	VM operations	`operation`, `provider_type`, `provider`, `outcome`
`virtrigaud_ip_discovery_duration_seconds`	Histogram	IP discovery time	`provider_type`

Circuit Breaker Metrics¶

Metric	Type	Description	Labels
`virtrigaud_circuit_breaker_state`	Gauge	Circuit breaker state (0=closed, 1=half-open, 2=open)	`provider_type`, `provider`
`virtrigaud_circuit_breaker_failures_total`	Counter	Failures recorded by the circuit breaker	`provider_type`, `provider`
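To make the three gauge values easier to interpret, here is a minimal circuit breaker sketch that uses the same state encoding. It is a generic illustration of the pattern, not VirtRigaud's code; the constructor defaults mirror `CB_FAILURE_THRESHOLD`, `CB_RESET_SECONDS`, and `CB_HALF_OPEN_MAX_CALLS` from the Configuration Reference, and the exact transition rules are assumptions.

```python
import time

CLOSED, HALF_OPEN, OPEN = 0, 1, 2  # same encoding as virtrigaud_circuit_breaker_state


class CircuitBreaker:
    def __init__(self, failure_threshold=10, reset_seconds=60.0,
                 half_open_max_calls=3, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_seconds = reset_seconds
        self.half_open_max_calls = half_open_max_calls
        self._clock = clock
        self.state = CLOSED
        self._failures = 0
        self._opened_at = 0.0
        self._half_open_calls = 0

    def allow(self) -> bool:
        """Return True if a call may proceed in the current state."""
        if self.state == OPEN:
            if self._clock() - self._opened_at < self.reset_seconds:
                return False  # still cooling down
            self.state = HALF_OPEN
            self._half_open_calls = 0
        if self.state == HALF_OPEN:
            if self._half_open_calls >= self.half_open_max_calls:
                return False  # trial budget used up
            self._half_open_calls += 1
        return True

    def record_success(self) -> None:
        self.state = CLOSED
        self._failures = 0

    def record_failure(self) -> None:
        self._failures += 1
        # Any failure during a trial call, or too many failures overall, opens the breaker.
        if self.state == HALF_OPEN or self._failures >= self.failure_threshold:
            self.state = OPEN
            self._opened_at = self._clock()
```

The `clock` parameter exists only so the cooldown can be simulated without waiting.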

Error Metrics¶

Metric	Type	Description	Labels
`virtrigaud_errors_total`	Counter	Errors by reason	`reason`, `component`
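For quick inspection without a Prometheus server, the metrics endpoint output can be aggregated directly. The sketch below parses a simplified subset of the Prometheus text exposition format and sums a counter by one label; the sample values and label values (`timeout`, `auth`, `manager`, `provider`) are invented for illustration.

```python
import re
from collections import defaultdict

# metric_name{label="value",...} value [timestamp]
_SAMPLE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)
_LABEL = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def sum_by_label(text: str, metric: str, label: str) -> dict:
    """Sum all samples of metric, grouped by the value of label."""
    totals = defaultdict(float)
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        m = _SAMPLE.match(line)
        if not m or m.group("name") != metric:
            continue
        labels = dict(_LABEL.findall(m.group("labels") or ""))
        totals[labels.get(label, "")] += float(m.group("value"))
    return dict(totals)


# Invented scrape output for demonstration only.
scrape = """
# TYPE virtrigaud_errors_total counter
virtrigaud_errors_total{reason="timeout",component="manager"} 3
virtrigaud_errors_total{reason="timeout",component="provider"} 4
virtrigaud_errors_total{reason="auth",component="provider"} 1
virtrigaud_queue_depth{kind="VirtualMachine"} 7
"""
by_reason = sum_by_label(scrape, "virtrigaud_errors_total", "reason")
```

The same function works for any counter or gauge in the catalog, for example grouping `virtrigaud_errors_total` by `component` instead of `reason`.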

Tracing¶

Configuration¶

Enable OpenTelemetry tracing:

VIRTRIGAUD_TRACING_ENABLED=true
VIRTRIGAUD_TRACING_ENDPOINT=http://jaeger:14268/api/traces
VIRTRIGAUD_TRACING_SAMPLING_RATIO=0.1
VIRTRIGAUD_TRACING_INSECURE=true
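A sampling ratio of 0.1 is assumed here to mean that roughly one trace in ten is recorded. The sketch below illustrates the general idea of ratio-based sampling keyed on the trace ID, so that every component makes the same decision for the same trace. It is a simplified illustration and not the exact algorithm of any OpenTelemetry SDK; `should_sample` is a hypothetical name.

```python
def should_sample(trace_id: bytes, ratio: float) -> bool:
    """Decide deterministically whether a trace is kept, given a ratio in [0, 1]."""
    if ratio >= 1.0:
        return True
    if ratio <= 0.0:
        return False
    # Interpret the low 8 bytes of the trace ID as an integer in [0, 2**64)
    # and keep the trace if it falls below ratio * 2**64.
    value = int.from_bytes(trace_id[-8:], "big")
    return value < int(ratio * (1 << 64))


low_id = bytes(16)          # all zero bytes: smallest possible value
high_id = b"\xff" * 16      # all 0xff bytes: largest possible value
decisions = (should_sample(low_id, 0.1), should_sample(high_id, 0.1))
```

Because the decision depends only on the trace ID and the ratio, raising the ratio never drops a trace that a lower ratio would have kept.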

Span Structure¶

Key spans include:

vm.reconcile - Full VM reconciliation
vm.create - VM creation operation
provider.validate - Provider validation
rpc.Create - gRPC calls to providers

Trace Attributes¶

Standard attributes:

vm.namespace = "default"
vm.name = "web-server-1"
provider.type = "vsphere"
operation = "Create"
task.ref = "task-12345"

Health Endpoints¶

HTTP Endpoints¶

All components expose health endpoints on port 8080:

GET /healthz - Liveness probe (always returns 200)
GET /readyz - Readiness probe (checks dependencies)
GET /health - Detailed health status (JSON)
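The three endpoints can be checked from a script as sketched below, using only the Python standard library. The base URL is an assumption (for example a `kubectl port-forward` to port 8080); the status codes returned for a failing readiness check are not specified here, so the helper simply reports whatever the server sends.

```python
import urllib.error
import urllib.request
from typing import Optional


def probe(base_url: str, path: str, timeout: float = 2.0) -> Optional[int]:
    """Return the HTTP status code for base_url + path, or None if unreachable."""
    try:
        with urllib.request.urlopen(base_url + path, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # server answered with a non-2xx status
    except OSError:
        return None  # connection refused, DNS failure, or timeout


def summarize(base_url: str) -> dict:
    """Probe the liveness, readiness, and detailed health endpoints."""
    return {path: probe(base_url, path) for path in ("/healthz", "/readyz", "/health")}
```

For example, `summarize("http://localhost:8080")` returns a mapping from each path to its status code, with `None` for endpoints that could not be reached.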

gRPC Health¶

Providers implement the standard grpc.health.v1.Health service for health checks.

Grafana Dashboards¶

Manager Dashboard¶

Reconcile rates and duration
Queue depth monitoring
Error rate tracking
Resource usage (CPU/memory)

Provider Dashboard¶

RPC latency and error rates
Task monitoring
Circuit breaker status
Provider-specific metrics

VM Lifecycle Dashboard¶

Creation success rates
IP discovery times
Failure analysis
Provider comparison

Prometheus Alerts¶

Critical Alerts¶

VirtrigaudProviderDown - Provider unavailable
VirtrigaudManagerDown - Manager unavailable

Warning Alerts¶

VirtrigaudProviderErrorRateHigh - High error rate (>50%)
VirtrigaudReconcileStuck - Slow reconciles (>5min)
VirtrigaudQueueBackedUp - Queue depth >100
VirtrigaudCircuitBreakerOpen - Circuit breaker is open for a provider
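The warning thresholds above can be restated as plain conditions. This sketch only mirrors the listed thresholds; the real rules live in the PrometheusRule manifest and presumably include evaluation windows and label matching that are not shown here. The function name and its inputs are hypothetical, and computing the error rate from RPC counts is an assumption.

```python
def warning_alerts(rpc_total: float, rpc_errors: float,
                   slowest_reconcile_seconds: float,
                   queue_depth: float, breaker_state: int) -> list:
    """Return the names of warning alerts whose thresholds are exceeded."""
    alerts = []
    if rpc_total > 0 and rpc_errors / rpc_total > 0.5:   # error rate above 50%
        alerts.append("VirtrigaudProviderErrorRateHigh")
    if slowest_reconcile_seconds > 5 * 60:                # slower than 5 minutes
        alerts.append("VirtrigaudReconcileStuck")
    if queue_depth > 100:
        alerts.append("VirtrigaudQueueBackedUp")
    if breaker_state == 2:                                # 2 = open
        alerts.append("VirtrigaudCircuitBreakerOpen")
    return alerts


firing = warning_alerts(rpc_total=200, rpc_errors=150,
                        slowest_reconcile_seconds=42,
                        queue_depth=250, breaker_state=0)
```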

Configuration Reference¶

Complete Environment Variables¶

# Logging
LOG_LEVEL=info
LOG_FORMAT=json
LOG_SAMPLING=true
LOG_DEVELOPMENT=false

# Tracing
VIRTRIGAUD_TRACING_ENABLED=false
VIRTRIGAUD_TRACING_ENDPOINT=""
VIRTRIGAUD_TRACING_SAMPLING_RATIO=0.1
VIRTRIGAUD_TRACING_INSECURE=true

# RPC Timeouts
RPC_TIMEOUT_DESCRIBE=30s
RPC_TIMEOUT_MUTATING=4m
RPC_TIMEOUT_VALIDATE=10s
RPC_TIMEOUT_TASK_STATUS=10s

# Retry Configuration
RETRY_MAX_ATTEMPTS=5
RETRY_BASE_DELAY=500ms
RETRY_MAX_DELAY=30s
RETRY_MULTIPLIER=2.0
RETRY_JITTER=true

# Circuit Breaker
CB_FAILURE_THRESHOLD=10
CB_RESET_SECONDS=60s
CB_HALF_OPEN_MAX_CALLS=3

# Rate Limiting
RATE_LIMIT_QPS=10
RATE_LIMIT_BURST=20

# Workers
WORKERS_PER_KIND=2
MAX_INFLIGHT_TASKS=100

# Feature Gates
FEATURE_GATES=""

# Performance
VIRTRIGAUD_PPROF_ENABLED=false
VIRTRIGAUD_PPROF_ADDR=:6060
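To show how the retry settings interact, the sketch below computes the wait before each retry under a common exponential backoff scheme. How VirtRigaud actually applies `RETRY_MULTIPLIER` and `RETRY_JITTER` is not documented here, so the formula, the "full jitter" strategy, and the assumption that five attempts imply four waits are all illustrative.

```python
import random


def backoff_delays(max_attempts=5, base_delay=0.5, max_delay=30.0,
                   multiplier=2.0, jitter=False, rng=random.random):
    """Return the waits in seconds between consecutive attempts."""
    delays = []
    for retry in range(max_attempts - 1):
        # Grow exponentially from base_delay, never exceeding max_delay.
        delay = min(base_delay * multiplier ** retry, max_delay)
        if jitter:
            delay = delay * rng()  # full jitter: uniform in [0, delay)
        delays.append(delay)
    return delays


# Defaults mirror RETRY_MAX_ATTEMPTS=5, RETRY_BASE_DELAY=500ms,
# RETRY_MAX_DELAY=30s, RETRY_MULTIPLIER=2.0 (jitter disabled for clarity).
schedule = backoff_delays()
```

With these defaults the cap of 30s is never reached; it only takes effect with more attempts or a larger multiplier.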

Deployment¶

ServiceMonitor¶

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: virtrigaud-manager
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: virtrigaud
  endpoints:
  - port: metrics
    interval: 30s

PrometheusRule¶

Deploy alerts:

kubectl apply -f deploy/observability/prometheus/alerts.yaml

Grafana Dashboards¶

Import the dashboards from deploy/observability/grafana/ into Grafana.

Troubleshooting¶

High Error Rates¶

Check provider health: kubectl get providers
Review error metrics: virtrigaud_errors_total (broken down by reason and component)
Check circuit breaker state: virtrigaud_circuit_breaker_state (2 means open)
Review provider logs

Slow Operations¶

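Check RPC latency: virtrigaud_provider_rpc_latency_seconds
Review reconcile duration: virtrigaud_manager_reconcile_duration_seconds
Check resource constraints (CPU and memory)
Monitor queue depth and in-flight tasks: virtrigaud_queue_depth, virtrigaud_provider_tasks_inflight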

Memory Issues¶

Monitor process_resident_memory_bytes
Check for goroutine leaks: go_goroutines
Review heap usage: go_memstats_heap_inuse_bytes