Metrics Catalog VirtRigaud exposes comprehensive metrics for monitoring and observability. All metrics are available at the /metrics endpoint on port 8080.
Manager Metrics Reconciliation Metrics Metric Name Type Labels Description virtrigaud_manager_reconcile_total Counter kind, outcome Total number of reconcile loops virtrigaud_manager_reconcile_duration_seconds Histogram kind Time spent in reconcile loops virtrigaud_queue_depth Gauge kind Current queue depth for each resource kind
VM Operation Metrics Metric Name Type Labels Description virtrigaud_vm_operations_total Counter operation, provider_type, provider, outcome Total VM operations virtrigaud_vm_reconfigure_total Counter provider_type, outcome Total VM reconfiguration operations virtrigaud_vm_snapshot_total Counter action, provider_type, outcome Total VM snapshot operations virtrigaud_vm_clone_total Counter linked, provider_type, outcome Total VM clone operations virtrigaud_vm_image_prepare_total Counter provider_type, outcome Total VM image preparation operations
Metric Name Type Labels Description virtrigaud_build_info Gauge version, git_sha, go_version Build information
Provider Metrics gRPC Metrics Metric Name Type Labels Description virtrigaud_provider_rpc_requests_total Counter provider_type, method, code Total gRPC requests virtrigaud_provider_rpc_latency_seconds Histogram provider_type, method gRPC request latency virtrigaud_provider_tasks_inflight Gauge provider_type, provider Number of inflight tasks
Provider-Specific Metrics Metric Name Type Labels Description virtrigaud_ip_discovery_duration_seconds Histogram provider_type Time to discover VM IP addresses
Error Metrics Metric Name Type Labels Description virtrigaud_errors_total Counter reason, component Total errors by reason and component
Label Definitions Common Labels provider_type: The type of provider (vsphere, libvirt) provider: The name of the provider instance outcome: The result of an operation (success, failure, error) kind: The Kubernetes resource kind (VirtualMachine, VMClass, etc.) component: The component generating the metric (manager, provider) Operation-Specific Labels operation: Type of VM operation (Create, Delete, Power, Describe, Reconfigure) method: gRPC method name (CreateVM, DeleteVM, PowerVM, etc.) code: gRPC status code (OK, INVALID_ARGUMENT, DEADLINE_EXCEEDED, etc.) action: Snapshot action (create, delete, revert) linked: Whether a clone is linked (true, false) reason: Error reason (ConnectionFailed, AuthenticationError, etc.) Histogram Buckets Duration histograms use the following buckets (in seconds):
0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10, 30, 60, 120, 300
Example Queries Prometheus Queries Error Rate # Overall error rate
rate ( virtrigaud_vm_operations_total { outcome = " failure "}[ 5m ] ) /
rate ( virtrigaud_vm_operations_total [ 5m ] )
# Provider-specific error rate
rate ( virtrigaud_provider_rpc_requests_total { code != " OK "}[ 5m ] ) /
rate ( virtrigaud_provider_rpc_requests_total [ 5m ] )
Latency # 95th percentile VM creation time
histogram_quantile ( 0.95 ,
rate ( virtrigaud_vm_operations_duration_seconds_bucket { operation = " Create "}[ 5m ] )
)
# gRPC request latency by method
histogram_quantile ( 0.95 ,
rate ( virtrigaud_provider_rpc_latency_seconds_bucket [ 5m ] )
) by ( method )
Throughput # VM operations per second
rate ( virtrigaud_vm_operations_total [ 5m ] )
# Operations by provider
rate ( virtrigaud_vm_operations_total [ 5m ] ) by ( provider_type , provider )
Queue Depth # Current queue depth
virtrigaud_queue_depth
# Average queue depth over time
avg_over_time ( virtrigaud_queue_depth [ 5m ] )
Inflight Tasks # Current inflight tasks
virtrigaud_provider_tasks_inflight
# Inflight tasks by provider
virtrigaud_provider_tasks_inflight by ( provider_type , provider )
Grafana Dashboard Queries VM Creation Success Rate Panel sum ( rate ( virtrigaud_vm_operations_total { operation = " Create ", outcome = " success "}[ 5m ] )) /
sum ( rate ( virtrigaud_vm_operations_total { operation = " Create "}[ 5m ] )) * 100
Provider Health Panel up { job = " virtrigaud-provider "}
Error Rate by Provider Panel sum ( rate ( virtrigaud_errors_total [ 5m ] )) by ( component , provider_type )
ServiceMonitor Configuration Example ServiceMonitor for Prometheus Operator:
apiVersion : monitoring.coreos.com/v1
kind : ServiceMonitor
metadata :
name : virtrigaud-manager
namespace : virtrigaud-system
spec :
selector :
matchLabels :
app.kubernetes.io/name : virtrigaud
app.kubernetes.io/component : manager
endpoints :
- port : metrics
interval : 30s
path : /metrics
---
apiVersion : monitoring.coreos.com/v1
kind : ServiceMonitor
metadata :
name : virtrigaud-providers
namespace : virtrigaud-system
spec :
selector :
matchLabels :
app.kubernetes.io/name : virtrigaud
app.kubernetes.io/component : provider
endpoints :
- port : metrics
interval : 30s
path : /metrics
Alert Rules Example PrometheusRule for common alerts:
apiVersion : monitoring.coreos.com/v1
kind : PrometheusRule
metadata :
name : virtrigaud-alerts
namespace : virtrigaud-system
spec :
groups :
- name : virtrigaud.rules
rules :
- alert : VirtrigaudProviderDown
expr : up{job="virtrigaud-provider"} == 0
for : 5m
labels :
severity : critical
annotations :
summary : "Virtrigaud provider is down"
description : "Provider {{ $labels.instance }} has been down for more than 5 minutes"
- alert : VirtrigaudHighErrorRate
expr : |
rate(virtrigaud_vm_operations_total{outcome="failure"}[5m]) /
rate(virtrigaud_vm_operations_total[5m]) > 0.1
for : 10m
labels :
severity : warning
annotations :
summary : "High error rate in VM operations"
description : "Error rate is {{ $value | humanizePercentage }} for {{ $labels.provider }}"
- alert : VirtrigaudSlowVMCreation
expr : |
histogram_quantile(0.95,
rate(virtrigaud_vm_operations_duration_seconds_bucket{operation="Create"}[5m])
) > 600
for : 15m
labels :
severity : warning
annotations :
summary : "Slow VM creation times"
description : "95th percentile VM creation time is {{ $value }}s"
- alert : VirtrigaudQueueBacklog
expr : virtrigaud_queue_depth > 100
for : 10m
labels :
severity : warning
annotations :
summary : "Queue backlog detected"
description : "Queue depth for {{ $labels.kind }} is {{ $value }}"
Custom Metrics Providers can expose additional custom metrics specific to their implementation:
vSphere Provider Metrics Metric Name Type Labels Description virtrigaud_vsphere_sessions_total Counter datacenter Total vSphere sessions created virtrigaud_vsphere_api_calls_total Counter method, datacenter Total vSphere API calls
Libvirt Provider Metrics Metric Name Type Labels Description virtrigaud_libvirt_connections_total Counter host Total Libvirt connections virtrigaud_libvirt_domains_total Gauge host, state Current number of domains by state
Metric Collection Best Practices Scrape Interval : Use 30s interval for most metrics Retention : Keep metrics for at least 30 days for trending High Cardinality : Be careful with VM names and IDs in labels Aggregation : Use recording rules for frequently queried metrics Alerting : Set up alerts for SLI/SLO violations