VirtRigaud Resilience Guide¶
This document describes the resilience patterns and error handling mechanisms in VirtRigaud.
Overview¶
VirtRigaud implements comprehensive resilience patterns:
- Error Taxonomy - Structured error classification
- Circuit Breakers - Protection against cascading failures
- Exponential Backoff - Intelligent retry strategies
- Timeout Policies - Prevent resource exhaustion
- Rate Limiting - Provider protection
Error Taxonomy¶
Error Types¶
VirtRigaud classifies all errors into specific categories:
| Type | Retryable | Description | Example |
|---|---|---|---|
NotFound | No | Resource doesn't exist | VM not found |
InvalidSpec | No | Invalid configuration | Malformed VM spec |
Unauthorized | No | Authentication failed | Invalid credentials |
NotSupported | No | Unsupported operation | Feature not available |
Retryable | Yes | Transient error | Network timeout |
Unavailable | Yes | Service unavailable | Provider down |
RateLimit | Yes | Rate limited | API quota exceeded |
Timeout | Yes | Operation timeout | Long-running task |
QuotaExceeded | No | Resource quota hit | Storage full |
Conflict | No | Resource conflict | Duplicate name |
Error Creation¶
import "github.com/projectbeskar/virtrigaud/internal/providers/contracts"
// Create specific error types
err := contracts.NewNotFoundError("VM not found", originalErr)
err := contracts.NewRetryableError("Network timeout", originalErr)
err := contracts.NewUnavailableError("Provider unavailable", originalErr)
// Check if error is retryable
if providerErr, ok := err.(*contracts.ProviderError); ok {
if providerErr.IsRetryable() {
// Retry the operation
}
}
Circuit Breaker Pattern¶
Configuration¶
import "github.com/projectbeskar/virtrigaud/internal/resilience"
config := &resilience.Config{
FailureThreshold: 10, // Open after 10 failures
ResetTimeout: 60 * time.Second, // Try again after 60s
HalfOpenMaxCalls: 3, // Allow 3 test calls
}
cb := resilience.NewCircuitBreaker("provider-vsphere", "vsphere", "prod", config)
Usage¶
err := cb.Call(ctx, func(ctx context.Context) error {
// Call the potentially failing operation
return provider.Create(ctx, request)
})
if err != nil {
// Handle error (may be circuit breaker protection)
log.Error(err, "Operation failed")
}
States¶
- Closed - Normal operation, failures are counted
- Open - Fast-fail mode, requests are rejected immediately
- Half-Open - Testing mode, limited requests allowed
Metrics¶
Circuit breaker state is exposed via metrics:
virtrigaud_circuit_breaker_state{provider_type="vsphere",provider="prod"} 0
virtrigaud_circuit_breaker_failures_total{provider_type="vsphere",provider="prod"} 5
Retry Strategies¶
Exponential Backoff¶
import "github.com/projectbeskar/virtrigaud/internal/resilience"
config := &resilience.RetryConfig{
MaxAttempts: 5,
BaseDelay: 500 * time.Millisecond,
MaxDelay: 30 * time.Second,
Multiplier: 2.0,
Jitter: true,
}
err := resilience.Retry(ctx, config, func(ctx context.Context, attempt int) error {
return provider.Describe(ctx, vmID)
})
Backoff Calculation¶
For attempt n:
delay = BaseDelay × Multiplier^n
delay = min(delay, MaxDelay)
if Jitter:
delay += random(0, delay * 0.1)
Example delays with BaseDelay=500ms, Multiplier=2.0: - Attempt 0: 500ms - Attempt 1: 1s - Attempt 2: 2s
- Attempt 3: 4s - Attempt 4: 8s
Predefined Configurations¶
// For frequent, low-latency operations
aggressive := resilience.AggressiveRetryConfig()
// MaxAttempts: 10, BaseDelay: 100ms, Multiplier: 1.5
// For expensive operations
conservative := resilience.ConservativeRetryConfig()
// MaxAttempts: 3, BaseDelay: 1s, Multiplier: 3.0
// Disable retries
none := resilience.NoRetryConfig()
// MaxAttempts: 1
Combined Resilience Policies¶
Policy Builder¶
policy := resilience.NewPolicyBuilder("vm-operations").
WithRetry(resilience.DefaultRetryConfig()).
WithCircuitBreaker(circuitBreaker).
Build()
err := policy.Execute(ctx, func(ctx context.Context) error {
return provider.Create(ctx, request)
})
Integration Example¶
// In VirtualMachine controller
func (r *VirtualMachineReconciler) createVM(ctx context.Context, vm *v1beta1.VirtualMachine) error {
// Get circuit breaker for this provider
cb := r.CircuitBreakerRegistry.GetOrCreate(
"vm-operations",
provider.Spec.Type,
provider.Name,
)
// Create resilience policy
policy := resilience.NewPolicyBuilder("create-vm").
WithRetry(&resilience.RetryConfig{
MaxAttempts: 3,
BaseDelay: 1 * time.Second,
MaxDelay: 30 * time.Second,
Multiplier: 2.0,
Jitter: true,
}).
WithCircuitBreaker(cb).
Build()
// Execute with resilience
return policy.Execute(ctx, func(ctx context.Context) error {
resp, err := provider.Create(ctx, createReq)
if err != nil {
return err
}
vm.Status.ID = resp.ID
vm.Status.TaskRef = resp.TaskRef
return nil
})
}
Timeout Policies¶
RPC Timeouts¶
Different operations have different timeout requirements:
// Operation-specific timeouts
config := &config.RPCConfig{
TimeoutDescribe: 30 * time.Second, // Quick status check
TimeoutMutating: 4 * time.Minute, // Create/Delete/Power
TimeoutValidate: 10 * time.Second, // Provider validation
TimeoutTaskStatus: 10 * time.Second, // Task polling
}
// Usage in gRPC client
timeout := config.GetRPCTimeout("Create")
ctx, cancel := context.WithTimeout(ctx, timeout)
defer cancel()
resp, err := client.Create(ctx, request)
Context Propagation¶
Always respect context deadlines:
func (p *Provider) Create(ctx context.Context, req CreateRequest) error {
// Check if context is already cancelled
select {
case <-ctx.Done():
return ctx.Err()
default:
}
// Perform operation with context
return p.performCreate(ctx, req)
}
Rate Limiting¶
Provider Protection¶
import "golang.org/x/time/rate"
// Configure rate limiter
limiter := rate.NewLimiter(
rate.Limit(config.RateLimit.QPS), // 10 requests per second
config.RateLimit.Burst, // Allow bursts of 20
)
// Check rate limit before operation
if !limiter.Allow() {
return contracts.NewRateLimitError("Rate limit exceeded", nil)
}
// Proceed with operation
return provider.Create(ctx, request)
Per-Provider Limits¶
Each provider instance has its own rate limiter:
type ProviderManager struct {
limiters map[string]*rate.Limiter
}
func (pm *ProviderManager) getLimiter(providerType, provider string) *rate.Limiter {
key := fmt.Sprintf("%s:%s", providerType, provider)
if limiter, exists := pm.limiters[key]; exists {
return limiter
}
// Create new limiter
limiter := rate.NewLimiter(rate.Limit(10), 20)
pm.limiters[key] = limiter
return limiter
}
Condition Mapping¶
VM Conditions¶
VirtRigaud sets standard conditions based on operations:
| Condition | Status | Reason | Description |
|---|---|---|---|
Ready | True | VMReady | VM is ready for use |
Ready | False | ProviderError | Provider operation failed |
Ready | False | ValidationError | Spec validation failed |
Provisioning | True | Creating | VM creation in progress |
Provisioning | False | CreateFailed | VM creation failed |
Provider Conditions¶
| Condition | Status | Reason | Description |
|---|---|---|---|
ProviderRuntimeReady | True | DeploymentReady | Remote runtime ready |
ProviderRuntimeReady | False | DeploymentError | Deployment failed |
ProviderAvailable | True | HealthCheckPassed | Provider healthy |
ProviderAvailable | False | HealthCheckFailed | Provider unhealthy |
Error to Condition Mapping¶
func mapErrorToCondition(err error) metav1.Condition {
if providerErr, ok := err.(*contracts.ProviderError); ok {
switch providerErr.Type {
case contracts.ErrorTypeNotFound:
return metav1.Condition{
Type: "Ready",
Status: metav1.ConditionFalse,
Reason: "ResourceNotFound",
Message: providerErr.Message,
}
case contracts.ErrorTypeUnauthorized:
return metav1.Condition{
Type: "Ready",
Status: metav1.ConditionFalse,
Reason: "AuthenticationFailed",
Message: providerErr.Message,
}
case contracts.ErrorTypeUnavailable:
return metav1.Condition{
Type: "Ready",
Status: metav1.ConditionFalse,
Reason: "ProviderUnavailable",
Message: providerErr.Message,
}
}
}
// Default error condition
return metav1.Condition{
Type: "Ready",
Status: metav1.ConditionFalse,
Reason: "InternalError",
Message: err.Error(),
}
}
Best Practices¶
Error Handling¶
- Always classify errors - Use appropriate error types
- Preserve context - Wrap errors with additional context
- Avoid retrying non-retryable errors - Check error type first
- Set meaningful conditions - Help users understand state
Circuit Breakers¶
- Per-provider instances - Isolate failures
- Appropriate thresholds - Balance protection vs availability
- Monitor state changes - Alert on circuit breaker trips
- Manual override - Provide way to reset if needed
Timeouts¶
- Operation-appropriate - Different timeouts for different ops
- Propagate context - Always pass context through
- Handle cancellation - Check context.Done() regularly
- Resource cleanup - Ensure resources are freed on timeout
Rate Limiting¶
- Provider protection - Prevent overwhelming providers
- Burst handling - Allow reasonable bursts
- Back-pressure - Surface rate limits to users
- Fair sharing - Consider tenant isolation
Configuration Examples¶
Development Environment¶
apiVersion: v1
kind: ConfigMap
metadata:
name: virtrigaud-config
data:
# Relaxed timeouts for development
RPC_TIMEOUT_MUTATING: "10m"
# Aggressive retries for flaky dev environments
RETRY_MAX_ATTEMPTS: "10"
RETRY_BASE_DELAY: "100ms"
# Lower circuit breaker threshold
CB_FAILURE_THRESHOLD: "5"
CB_RESET_SECONDS: "30s"
Production Environment¶
apiVersion: v1
kind: ConfigMap
metadata:
name: virtrigaud-config
data:
# Strict timeouts
RPC_TIMEOUT_MUTATING: "4m"
RPC_TIMEOUT_DESCRIBE: "30s"
# Conservative retries
RETRY_MAX_ATTEMPTS: "3"
RETRY_BASE_DELAY: "1s"
RETRY_MAX_DELAY: "60s"
# Higher circuit breaker threshold
CB_FAILURE_THRESHOLD: "15"
CB_RESET_SECONDS: "120s"
# Rate limiting
RATE_LIMIT_QPS: "20"
RATE_LIMIT_BURST: "50"