VirtRigaud Resilience Guide¶

This document describes the resilience patterns and error handling mechanisms in VirtRigaud.

Overview¶

VirtRigaud combines five resilience patterns:

Error Taxonomy - Structured error classification
Circuit Breakers - Protection against cascading failures
Exponential Backoff - Intelligent retry strategies
Timeout Policies - Prevent resource exhaustion
Rate Limiting - Provider protection

Error Taxonomy¶

Error Types¶

VirtRigaud classifies all errors into specific categories:

Type	Retryable	Description	Example
`NotFound`	No	Resource doesn't exist	VM not found
`InvalidSpec`	No	Invalid configuration	Malformed VM spec
`Unauthorized`	No	Authentication failed	Invalid credentials
`NotSupported`	No	Unsupported operation	Feature not available
`Retryable`	Yes	Transient error	Network timeout
`Unavailable`	Yes	Service unavailable	Provider down
`RateLimit`	Yes	Rate limited	API quota exceeded
`Timeout`	Yes	Operation timeout	Long-running task
`QuotaExceeded`	No	Resource quota hit	Storage full
`Conflict`	No	Resource conflict	Duplicate name
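The Retryable column can be read as a simple lookup. The following is a minimal sketch of that lookup, not the contracts package itself: the `ErrorType` type, the constant names, and `IsRetryable` as a free function are all assumptions made for illustration.

```go
package main

// ErrorType names a category from the taxonomy table. These identifiers are
// illustrative and may not match the constants in the contracts package.
type ErrorType string

const (
	NotFound      ErrorType = "NotFound"
	InvalidSpec   ErrorType = "InvalidSpec"
	Unauthorized  ErrorType = "Unauthorized"
	NotSupported  ErrorType = "NotSupported"
	Retryable     ErrorType = "Retryable"
	Unavailable   ErrorType = "Unavailable"
	RateLimit     ErrorType = "RateLimit"
	Timeout       ErrorType = "Timeout"
	QuotaExceeded ErrorType = "QuotaExceeded"
	Conflict      ErrorType = "Conflict"
)

// retryableTypes holds the rows marked "Yes" in the Retryable column.
var retryableTypes = map[ErrorType]bool{
	Retryable:   true,
	Unavailable: true,
	RateLimit:   true,
	Timeout:     true,
}

// IsRetryable reports whether errors of type t are worth retrying.
// Types missing from the map, including unknown ones, are treated as permanent.
func IsRetryable(t ErrorType) bool {
	return retryableTypes[t]
}
```

Defaulting unknown types to non-retryable is a deliberate choice in this sketch: a new failure mode is surfaced to the user rather than retried blindly.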

Error Creation¶

import (
    "errors"

    "github.com/projectbeskar/virtrigaud/internal/providers/contracts"
)

// Create specific error types
notFoundErr := contracts.NewNotFoundError("VM not found", originalErr)
retryableErr := contracts.NewRetryableError("Network timeout", originalErr)
unavailableErr := contracts.NewUnavailableError("Provider unavailable", originalErr)

// Check if an error returned by a provider call is retryable.
// errors.As also matches a ProviderError that has been wrapped.
var providerErr *contracts.ProviderError
if errors.As(err, &providerErr) && providerErr.IsRetryable() {
    // Retry the operation
}

Circuit Breaker Pattern¶

Configuration¶

import "github.com/projectbeskar/virtrigaud/internal/resilience"

config := &resilience.Config{
    FailureThreshold: 10,              // Open after 10 failures
    ResetTimeout:     60 * time.Second, // Try again after 60s
    HalfOpenMaxCalls: 3,               // Allow 3 test calls
}

cb := resilience.NewCircuitBreaker("provider-vsphere", "vsphere", "prod", config)

Usage¶

err := cb.Call(ctx, func(ctx context.Context) error {
    // Call the potentially failing operation
    return provider.Create(ctx, request)
})

if err != nil {
    // Handle error (may be circuit breaker protection)
    log.Error(err, "Operation failed")
}

States¶

Closed - Normal operation, failures are counted
Open - Fast-fail mode, requests are rejected immediately
Half-Open - Testing mode, limited requests allowed
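The three states form a small state machine. The sketch below is one way to implement those transitions using only the standard library. It is not the `resilience` package: the `Breaker` type, its fields, and the rules "any failure in Half-Open reopens the circuit" and "one success closes it" are assumptions chosen to keep the example short.

```go
package main

import (
	"errors"
	"sync"
	"time"
)

// State is the circuit breaker state.
type State int

const (
	Closed State = iota
	Open
	HalfOpen
)

// ErrOpen is returned when a call is rejected without being attempted.
var ErrOpen = errors.New("circuit breaker is open")

// Breaker is a hypothetical, simplified circuit breaker.
type Breaker struct {
	FailureThreshold int           // consecutive failures before opening
	ResetTimeout     time.Duration // how long to stay Open before probing
	HalfOpenMaxCalls int           // probe calls allowed while Half-Open

	mu            sync.Mutex
	state         State
	failures      int
	halfOpenCalls int
	openedAt      time.Time
}

// Call runs fn unless the breaker is rejecting calls.
func (b *Breaker) Call(fn func() error) error {
	b.mu.Lock()
	if b.state == Open {
		if time.Since(b.openedAt) < b.ResetTimeout {
			b.mu.Unlock()
			return ErrOpen // fast-fail while Open
		}
		// Reset timeout elapsed: start probing.
		b.state, b.halfOpenCalls = HalfOpen, 0
	}
	if b.state == HalfOpen {
		if b.halfOpenCalls >= b.HalfOpenMaxCalls {
			b.mu.Unlock()
			return ErrOpen // probe budget exhausted
		}
		b.halfOpenCalls++
	}
	b.mu.Unlock()

	err := fn() // run without holding the lock

	b.mu.Lock()
	defer b.mu.Unlock()
	if err != nil {
		b.failures++
		if b.state == HalfOpen || b.failures >= b.FailureThreshold {
			b.state, b.openedAt = Open, time.Now()
		}
		return err
	}
	// Any success closes the circuit and clears the failure count.
	b.state, b.failures = Closed, 0
	return nil
}
```

Callers can distinguish a rejected call from a real provider failure with `errors.Is(err, ErrOpen)`, which is useful when deciding whether to requeue or to report the error.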

Metrics¶

Circuit breaker state is exposed via metrics:

virtrigaud_circuit_breaker_state{provider_type="vsphere",provider="prod"} 0
virtrigaud_circuit_breaker_failures_total{provider_type="vsphere",provider="prod"} 5

Retry Strategies¶

Exponential Backoff¶

import "github.com/projectbeskar/virtrigaud/internal/resilience"

config := &resilience.RetryConfig{
    MaxAttempts: 5,
    BaseDelay:   500 * time.Millisecond,
    MaxDelay:    30 * time.Second,
    Multiplier:  2.0,
    Jitter:      true,
}

err := resilience.Retry(ctx, config, func(ctx context.Context, attempt int) error {
    return provider.Describe(ctx, vmID)
})
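The internals of `resilience.Retry` are not shown in this guide. As a rough mental model only, a retry helper of this kind usually loops, stops early on success or on a permanent error, and waits between attempts while watching the context. In the sketch below, `retry`, `errPermanent`, `exampleRetry`, and the fixed doubling of the delay are all hypothetical.

```go
package main

import (
	"context"
	"errors"
	"time"
)

// errPermanent marks failures that must not be retried (hypothetical sentinel).
var errPermanent = errors.New("permanent failure")

// retry calls fn up to maxAttempts times, doubling the delay after each
// failed attempt. It returns early on success, on a permanent error, or
// when ctx is cancelled during a wait.
func retry(ctx context.Context, maxAttempts int, base time.Duration, fn func(attempt int) error) error {
	var err error
	delay := base
	for attempt := 0; attempt < maxAttempts; attempt++ {
		if err = fn(attempt); err == nil || errors.Is(err, errPermanent) {
			return err
		}
		if attempt == maxAttempts-1 {
			break // no point waiting after the final attempt
		}
		timer := time.NewTimer(delay)
		select {
		case <-ctx.Done():
			timer.Stop()
			return ctx.Err()
		case <-timer.C:
		}
		delay *= 2
	}
	return err
}

// exampleRetry fails transientFailures times before succeeding and reports
// how many attempts were made (at most 5).
func exampleRetry(transientFailures int) (attempts int, err error) {
	err = retry(context.Background(), 5, time.Millisecond, func(int) error {
		attempts++
		if attempts <= transientFailures {
			return errors.New("transient")
		}
		return nil
	})
	return attempts, err
}
```

With two transient failures the operation succeeds on the third attempt; with more failures than the attempt budget, the last error is returned to the caller.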

Backoff Calculation¶

For attempt n:

delay = BaseDelay × Multiplier^n
delay = min(delay, MaxDelay)
if Jitter:
    delay += random(0, delay * 0.1)

Example delays with BaseDelay=500ms, Multiplier=2.0:

- Attempt 0: 500ms
- Attempt 1: 1s
- Attempt 2: 2s
- Attempt 3: 4s
- Attempt 4: 8s
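The calculation above translates directly into Go. This is a sketch of the formula as written in this section, not the library's code; `backoffDelay` and `withJitter` are hypothetical names.

```go
package main

import (
	"math"
	"math/rand"
	"time"
)

// backoffDelay returns BaseDelay × Multiplier^attempt, capped at max.
// attempt is zero-based, matching the example delays above.
func backoffDelay(base, max time.Duration, multiplier float64, attempt int) time.Duration {
	d := float64(base) * math.Pow(multiplier, float64(attempt))
	if d > float64(max) {
		return max
	}
	return time.Duration(d)
}

// withJitter adds a random extra of up to 10% of d, as in the pseudocode.
func withJitter(d time.Duration) time.Duration {
	if d <= 0 {
		return d
	}
	// Int63n needs a positive bound; +1 makes the upper end inclusive.
	return d + time.Duration(rand.Int63n(int64(d)/10+1))
}
```

Because jitter is applied after the cap, a jittered delay can exceed MaxDelay by up to 10%. If MaxDelay has to be a hard ceiling, apply the cap again after adding jitter.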

Predefined Configurations¶

// For frequent, low-latency operations
aggressive := resilience.AggressiveRetryConfig()
// MaxAttempts: 10, BaseDelay: 100ms, Multiplier: 1.5

// For expensive operations
conservative := resilience.ConservativeRetryConfig()
// MaxAttempts: 3, BaseDelay: 1s, Multiplier: 3.0

// Disable retries
none := resilience.NoRetryConfig()
// MaxAttempts: 1

Combined Resilience Policies¶

Policy Builder¶

policy := resilience.NewPolicyBuilder("vm-operations").
    WithRetry(resilience.DefaultRetryConfig()).
    WithCircuitBreaker(circuitBreaker).
    Build()

err := policy.Execute(ctx, func(ctx context.Context) error {
    return provider.Create(ctx, request)
})

Integration Example¶

// In VirtualMachine controller
func (r *VirtualMachineReconciler) createVM(ctx context.Context, vm *v1beta1.VirtualMachine) error {
    // Get circuit breaker for this provider
    cb := r.CircuitBreakerRegistry.GetOrCreate(
        "vm-operations", 
        provider.Spec.Type, 
        provider.Name,
    )

    // Create resilience policy
    policy := resilience.NewPolicyBuilder("create-vm").
        WithRetry(&resilience.RetryConfig{
            MaxAttempts: 3,
            BaseDelay:   1 * time.Second,
            MaxDelay:    30 * time.Second,
            Multiplier:  2.0,
            Jitter:      true,
        }).
        WithCircuitBreaker(cb).
        Build()

    // Execute with resilience
    return policy.Execute(ctx, func(ctx context.Context) error {
        resp, err := provider.Create(ctx, createReq)
        if err != nil {
            return err
        }

        vm.Status.ID = resp.ID
        vm.Status.TaskRef = resp.TaskRef
        return nil
    })
}

Timeout Policies¶

RPC Timeouts¶

Different operations have different timeout requirements:

// Operation-specific timeouts
config := &config.RPCConfig{
    TimeoutDescribe:   30 * time.Second,  // Quick status check
    TimeoutMutating:   4 * time.Minute,   // Create/Delete/Power
    TimeoutValidate:   10 * time.Second,  // Provider validation
    TimeoutTaskStatus: 10 * time.Second,  // Task polling
}

// Usage in gRPC client
timeout := config.GetRPCTimeout("Create")
ctx, cancel := context.WithTimeout(ctx, timeout)
defer cancel()

resp, err := client.Create(ctx, request)

Context Propagation¶

Always respect context deadlines:

func (p *Provider) Create(ctx context.Context, req CreateRequest) error {
    // Check if context is already cancelled
    select {
    case <-ctx.Done():
        return ctx.Err()
    default:
    }

    // Perform operation with context
    return p.performCreate(ctx, req)
}
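The up-front check matters most in loops that wait, such as task polling. The sketch below shows a polling loop that stops as soon as the context ends. `pollUntil` and `examplePoll` are hypothetical helpers written for this guide, not VirtRigaud functions.

```go
package main

import (
	"context"
	"time"
)

// pollUntil calls check every interval until it reports done, returns an
// error, or ctx is cancelled or times out.
func pollUntil(ctx context.Context, interval time.Duration, check func(context.Context) (bool, error)) error {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		// Bail out before doing any work if the caller has given up.
		if err := ctx.Err(); err != nil {
			return err
		}
		done, err := check(ctx)
		if err != nil {
			return err
		}
		if done {
			return nil
		}
		// Wait for the next tick, but wake up early on cancellation.
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
		}
	}
}

// examplePoll cancels the context before polling starts, so pollUntil
// returns context.Canceled without calling check at all.
func examplePoll() (checks int, err error) {
	ctx, cancel := context.WithCancel(context.Background())
	cancel()
	err = pollUntil(ctx, time.Hour, func(context.Context) (bool, error) {
		checks++
		return false, nil
	})
	return checks, err
}
```

Waiting in a `select` rather than with `time.Sleep` is what lets the loop react to cancellation between polls instead of only after the full interval has passed.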

Rate Limiting¶

Provider Protection¶

import "golang.org/x/time/rate"

// Configure rate limiter
limiter := rate.NewLimiter(
    rate.Limit(config.RateLimit.QPS),    // 10 requests per second
    config.RateLimit.Burst,              // Allow bursts of 20
)

// Check rate limit before operation
if !limiter.Allow() {
    return contracts.NewRateLimitError("Rate limit exceeded", nil)
}

// Proceed with operation
return provider.Create(ctx, request)
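To make the QPS and burst parameters concrete, here is a standard-library-only token bucket. It illustrates the general idea behind a limiter such as `rate.Limiter`, not that package's actual implementation; `tokenBucket`, `newTokenBucket`, and `allowAt` are hypothetical names.

```go
package main

import (
	"math"
	"sync"
	"time"
)

// tokenBucket holds up to burst tokens and refills at qps tokens per second.
type tokenBucket struct {
	mu     sync.Mutex
	qps    float64
	burst  float64
	tokens float64
	last   time.Time
}

// newTokenBucket starts with a full bucket, so an initial burst is allowed.
func newTokenBucket(qps float64, burst int) *tokenBucket {
	return &tokenBucket{
		qps:    qps,
		burst:  float64(burst),
		tokens: float64(burst),
		last:   time.Now(),
	}
}

// allowAt reports whether one request may proceed at time now,
// consuming a token if so. Taking now as a parameter keeps it testable.
func (b *tokenBucket) allowAt(now time.Time) bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	if elapsed := now.Sub(b.last).Seconds(); elapsed > 0 {
		// Refill for the elapsed time, never beyond the burst size.
		b.tokens = math.Min(b.burst, b.tokens+elapsed*b.qps)
		b.last = now
	}
	if b.tokens < 1 {
		return false
	}
	b.tokens--
	return true
}

// Allow is the non-blocking check against the current time.
func (b *tokenBucket) Allow() bool {
	return b.allowAt(time.Now())
}
```

With QPS 10 and burst 20, the first 20 back-to-back calls pass, further calls are rejected, and one second later 10 more are admitted. A non-blocking check like this rejects excess calls immediately; a blocking variant that waits for a token trades rejected requests for added latency.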

Per-Provider Limits¶

Each provider instance has its own rate limiter:

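type ProviderManager struct {
    mu       sync.Mutex // guards limiters; reconcilers may run concurrently
    limiters map[string]*rate.Limiter
}

func (pm *ProviderManager) getLimiter(providerType, provider string) *rate.Limiter {
    pm.mu.Lock()
    defer pm.mu.Unlock()

    key := fmt.Sprintf("%s:%s", providerType, provider)
    if limiter, exists := pm.limiters[key]; exists {
        return limiter
    }

    // Lazily initialize the map so the zero value is usable
    if pm.limiters == nil {
        pm.limiters = make(map[string]*rate.Limiter)
    }

    // Create new limiter: 10 requests per second, bursts of 20
    limiter := rate.NewLimiter(rate.Limit(10), 20)
    pm.limiters[key] = limiter
    return limiter
}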

Condition Mapping¶

VM Conditions¶

VirtRigaud sets standard conditions based on operations:

Condition	Status	Reason	Description
`Ready`	True	`VMReady`	VM is ready for use
`Ready`	False	`ProviderError`	Provider operation failed
`Ready`	False	`ValidationError`	Spec validation failed
`Provisioning`	True	`Creating`	VM creation in progress
`Provisioning`	False	`CreateFailed`	VM creation failed

Provider Conditions¶

Condition	Status	Reason	Description
`ProviderRuntimeReady`	True	`DeploymentReady`	Remote runtime ready
`ProviderRuntimeReady`	False	`DeploymentError`	Deployment failed
`ProviderAvailable`	True	`HealthCheckPassed`	Provider healthy
`ProviderAvailable`	False	`HealthCheckFailed`	Provider unhealthy

Error to Condition Mapping¶

func mapErrorToCondition(err error) metav1.Condition {
    // errors.As also matches a ProviderError that has been wrapped
    var providerErr *contracts.ProviderError
    if errors.As(err, &providerErr) {
        switch providerErr.Type {
        case contracts.ErrorTypeNotFound:
            return metav1.Condition{
                Type:    "Ready",
                Status:  metav1.ConditionFalse,
                Reason:  "ResourceNotFound",
                Message: providerErr.Message,
            }
        case contracts.ErrorTypeUnauthorized:
            return metav1.Condition{
                Type:    "Ready", 
                Status:  metav1.ConditionFalse,
                Reason:  "AuthenticationFailed",
                Message: providerErr.Message,
            }
        case contracts.ErrorTypeUnavailable:
            return metav1.Condition{
                Type:    "Ready",
                Status:  metav1.ConditionFalse,
                Reason:  "ProviderUnavailable", 
                Message: providerErr.Message,
            }
        }
    }

    // Default error condition
    return metav1.Condition{
        Type:    "Ready",
        Status:  metav1.ConditionFalse,
        Reason:  "InternalError",
        Message: err.Error(),
    }
}

Best Practices¶

Error Handling¶

Always classify errors - Use appropriate error types
Preserve context - Wrap errors with additional context
Avoid retrying non-retryable errors - Check error type first
Set meaningful conditions - Help users understand state
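Classifying, wrapping, and checking before a retry fit together as follows. This is a sketch with a stand-in error type: `providerError`, `wrapWithContext`, and `shouldRetry` are hypothetical and only mimic what a typed provider error could look like.

```go
package main

import (
	"errors"
	"fmt"
)

// providerError is a hypothetical stand-in for a typed provider error.
type providerError struct {
	Kind      string // taxonomy type, e.g. "Timeout"
	Message   string
	Retryable bool
	Err       error // optional underlying cause
}

func (e *providerError) Error() string {
	if e.Err != nil {
		return e.Kind + ": " + e.Message + ": " + e.Err.Error()
	}
	return e.Kind + ": " + e.Message
}

// Unwrap exposes the cause to errors.Is and errors.As.
func (e *providerError) Unwrap() error { return e.Err }

// wrapWithContext adds call-site detail. The %w verb keeps the typed
// error reachable, whereas %v would flatten it into a plain string.
func wrapWithContext(vmID string, err error) error {
	return fmt.Errorf("describe VM %q: %w", vmID, err)
}

// shouldRetry looks for a typed error anywhere in the chain and retries
// only when it is marked retryable. Untyped errors are not retried.
func shouldRetry(err error) bool {
	var pe *providerError
	return errors.As(err, &pe) && pe.Retryable
}
```

Wrapping with `%w` is what allows the classification made close to the provider to survive through several layers of added context up to the controller that decides whether to retry.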

Circuit Breakers¶

Per-provider instances - Isolate failures
Appropriate thresholds - Balance protection vs availability
Monitor state changes - Alert on circuit breaker trips
Manual override - Provide a way to reset if needed

Timeouts¶

Operation-appropriate - Different timeouts for different ops
Propagate context - Always pass context through
Handle cancellation - Check context.Done() regularly
Resource cleanup - Ensure resources are freed on timeout

Rate Limiting¶

Provider protection - Prevent overwhelming providers
Burst handling - Allow reasonable bursts
Back-pressure - Surface rate limits to users
Fair sharing - Consider tenant isolation

Configuration Examples¶

Development Environment¶

apiVersion: v1
kind: ConfigMap
metadata:
  name: virtrigaud-config
data:
  # Relaxed timeouts for development
  RPC_TIMEOUT_MUTATING: "10m"

  # Aggressive retries for flaky dev environments  
  RETRY_MAX_ATTEMPTS: "10"
  RETRY_BASE_DELAY: "100ms"

  # Lower circuit breaker threshold
  CB_FAILURE_THRESHOLD: "5"
  CB_RESET_SECONDS: "30s"

Production Environment¶

apiVersion: v1
kind: ConfigMap
metadata:
  name: virtrigaud-config
data:
  # Strict timeouts
  RPC_TIMEOUT_MUTATING: "4m"
  RPC_TIMEOUT_DESCRIBE: "30s"

  # Conservative retries
  RETRY_MAX_ATTEMPTS: "3"
  RETRY_BASE_DELAY: "1s"
  RETRY_MAX_DELAY: "60s"

  # Higher circuit breaker threshold
  CB_FAILURE_THRESHOLD: "15" 
  CB_RESET_SECONDS: "120s"

  # Rate limiting
  RATE_LIMIT_QPS: "20"
  RATE_LIMIT_BURST: "50"
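All values in these ConfigMaps are strings, so they have to be parsed and validated before use. How VirtRigaud reads them is not covered here. The sketch below shows one way the retry keys from the production example could be parsed with the Go standard library; `retrySettings` and `parseRetrySettings` are hypothetical.

```go
package main

import (
	"fmt"
	"strconv"
	"time"
)

// retrySettings is the typed form of the RETRY_* keys (hypothetical type).
type retrySettings struct {
	MaxAttempts int
	BaseDelay   time.Duration
	MaxDelay    time.Duration
}

// parseRetrySettings converts ConfigMap-style strings into typed values.
// In this sketch every key is required; a missing or malformed value is
// reported together with the key name.
func parseRetrySettings(data map[string]string) (retrySettings, error) {
	var s retrySettings
	var err error
	if s.MaxAttempts, err = strconv.Atoi(data["RETRY_MAX_ATTEMPTS"]); err != nil {
		return retrySettings{}, fmt.Errorf("RETRY_MAX_ATTEMPTS: %w", err)
	}
	if s.BaseDelay, err = time.ParseDuration(data["RETRY_BASE_DELAY"]); err != nil {
		return retrySettings{}, fmt.Errorf("RETRY_BASE_DELAY: %w", err)
	}
	if s.MaxDelay, err = time.ParseDuration(data["RETRY_MAX_DELAY"]); err != nil {
		return retrySettings{}, fmt.Errorf("RETRY_MAX_DELAY: %w", err)
	}
	if s.MaxAttempts < 1 {
		return retrySettings{}, fmt.Errorf("RETRY_MAX_ATTEMPTS must be at least 1, got %d", s.MaxAttempts)
	}
	if s.BaseDelay > s.MaxDelay {
		return retrySettings{}, fmt.Errorf("RETRY_BASE_DELAY %s exceeds RETRY_MAX_DELAY %s", s.BaseDelay, s.MaxDelay)
	}
	return s, nil
}
```

`time.ParseDuration` accepts values with a unit suffix such as "100ms", "60s", or "4m", and rejects a bare number such as "60". Failing at startup on a malformed value is usually preferable to silently falling back to a default.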