
Architecture & Key Concepts

System Overview

GoCloudera is a three-component system: a lightweight agent deployed in your infrastructure, a multi-tenant backend API, and a dashboard frontend.

Key Concepts

Multi-Tenancy

Every piece of data in GoCloudera is scoped to a tenant. The tenantIsolation middleware automatically filters all database queries by tenant_id. Tenants cannot see each other’s data. Each tenant gets their own API key, notification channels, enforcement policies, alert rules, and customization settings.
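As a rough illustration of what tenant-scoped query filtering looks like, here is a minimal sketch. The function name, parameters, and SQL shape are invented for illustration; they are not GoCloudera's actual schema or the tenantIsolation middleware's API.

```python
def scope_to_tenant(base_query: str, params: list, tenant_id: str):
    """Append a tenant_id filter so a query can never cross tenants.

    Hypothetical helper; the real middleware operates on ORM queries,
    but the principle is the same: every query gains a tenant_id clause.
    """
    clause = " AND " if "WHERE" in base_query.upper() else " WHERE "
    return base_query + clause + "tenant_id = %s", params + [tenant_id]

query, params = scope_to_tenant(
    "SELECT * FROM alerts WHERE status = %s", ["open"], "tenant-42"
)
```

Because the filter is appended centrally rather than in each handler, forgetting it in one code path is not possible.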

The Three-Layer Optimization Engine

GoCloudera uses three complementary engines, orchestrated every 5 minutes:

Layer 1 — PolicyEngine (Proactive)

Evaluates enforcement policies against current metrics. Supports composite AND/OR conditions with nesting, range operators, schedule-aware enforcement, maintenance window suppression, and cooldown periods. When conditions are met, it executes actions (stop, scale down, resize) or queues them for approval.

Layer 2 — AnomalyDetectionService (Contextual)

Runs three independent statistical methods on every metric time series: Z-score analysis (> 3 standard deviations), the IQR method (outside 1.5x the interquartile range), and rate-of-change detection (delta exceeds 3 sigma of historical deltas). An anomaly is confirmed when 2 of 3 methods agree (composite confidence >= 0.5). Day-of-week seasonal baselines avoid false positives on predictable patterns.

Layer 3 — RemediationEngine (Reactive)

Responds to active alerts and incidents with automated remediation strategies: cost overrun mitigation, performance degradation response, security incident isolation, resource exhaustion cleanup, and network connectivity failover.
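The 2-of-3 voting in Layer 2 can be sketched in a few lines. The thresholds (3 sigma, 1.5x IQR, 3 sigma of deltas) follow the text above; the day-of-week seasonal baselining and the exact per-method scoring are omitted, so treat this as an illustration of the voting logic, not the production implementation.

```python
import statistics

def detect_anomaly(history: list, current: float):
    """Vote three statistical methods and confirm when >= 2 of 3 agree."""
    mean = statistics.mean(history)
    stdev = statistics.pstdev(history)
    # Method 1: Z-score — more than 3 standard deviations from the mean.
    z_hit = stdev > 0 and abs(current - mean) / stdev > 3

    # Method 2: IQR — outside 1.5x the interquartile range.
    q1, _, q3 = statistics.quantiles(history, n=4)
    iqr = q3 - q1
    iqr_hit = current < q1 - 1.5 * iqr or current > q3 + 1.5 * iqr

    # Method 3: rate of change — delta exceeds 3 sigma of historical deltas.
    deltas = [b - a for a, b in zip(history, history[1:])]
    d_stdev = statistics.pstdev(deltas)
    roc_hit = d_stdev > 0 and abs(current - history[-1]) > 3 * d_stdev

    confidence = sum([z_hit, iqr_hit, roc_hit]) / 3
    return confidence >= 0.5, round(confidence, 2)
```

With a flat baseline around 10, a reading of 50 trips all three methods (confidence 1.0), while a reading of 10 trips none, which is why single-method blips do not raise anomalies.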

Agent Communication

The agent communicates with the backend in two modes:

HTTP mode — The agent POSTs data to /api/sync every 5 minutes and polls /api/actions every 15 seconds for pending commands. Simple, works through firewalls, no persistent connection needed.

gRPC mode — Bidirectional streaming on port 50051. The agent pushes data continuously and receives commands instantly (no polling delay). Uses a bounded write queue (max 1000 messages) with exponential backoff on reconnect, and falls back to HTTP automatically if the stream drops.

You can run both simultaneously (COMM_MODE=both) for redundancy.
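Two details of the gRPC mode are easy to sketch: the bounded write queue and the reconnect backoff. The 1000-message cap comes from the text above; the drop policy (oldest first) and the backoff base/cap values are illustrative assumptions, not confirmed behavior.

```python
from collections import deque

# Bounded write buffer: with maxlen set, the oldest messages are silently
# dropped once 1000 are queued, so a dead stream can't exhaust memory.
# (Drop-oldest is an assumption; the agent's actual policy may differ.)
write_queue: deque = deque(maxlen=1000)

def backoff_delays(base: float = 1.0, cap: float = 60.0, attempts: int = 6):
    """Exponential backoff schedule for reconnects: base * 2^n, capped.

    base/cap values here are placeholders, not the agent's real settings.
    """
    return [min(cap, base * 2 ** n) for n in range(attempts)]
```

The cap matters: without it, a long outage would push retry intervals into hours and delay recovery once the backend returns.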

Action Queue & Approval Workflow

When the PolicyEngine, the RemediationEngine, or a user triggers an action (stop, start, resize, restart, terminate), it enters the ActionQueue. Queued actions can be executed in two ways:
  1. Platform execution — Backend assumes your cloud IAM role via STS (AWS) or service credentials (Azure/GCP) and executes directly.
  2. Agent execution — Backend pushes the command to the agent via gRPC or HTTP polling; the agent executes in your VPC.
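A minimal sketch of choosing between the two execution paths might look like the following. The selection order (platform credentials preferred, agent as fallback, otherwise hold) is a guess for illustration; the text does not specify how GoCloudera actually routes between them.

```python
def choose_executor(platform_creds_configured: bool, agent_connected: bool) -> str:
    """Route a queued action to an executor (illustrative logic only)."""
    if platform_creds_configured:
        return "platform"  # backend assumes the cloud IAM role via STS
    if agent_connected:
        return "agent"     # command pushed over gRPC or picked up by HTTP polling
    return "queued"        # hold until an executor becomes available
```

Agent execution keeps all cloud credentials inside your VPC, which is why some deployments skip platform credentials entirely.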

Cost Data Sources

GoCloudera collects real cost data from cloud billing APIs, not estimates:
  • AWS — Cost Explorer API (ce:GetCostAndUsage), Spot Price History, Reserved Instance utilization, Savings Plans coverage
  • Azure — Retail Prices API (https://prices.azure.com/api/retail/prices), Consumption API for actual usage, cost-by-tag grouping via Consumption API
  • GCP — Cloud Billing Catalog API for list prices, BigQuery billing export for actual costs, label-based cost allocation via BigQuery
When billing APIs aren’t configured, the agent falls back to hourly rate estimation based on instance type pricing.
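The fallback estimation is straightforward rate-times-hours arithmetic. The rate table below is hypothetical (real fallback pricing data is not specified in the text), but it shows the shape of the calculation.

```python
# Hypothetical on-demand hourly rates keyed by instance type; the agent's
# real pricing table is not documented here.
HOURLY_RATES = {"m5.large": 0.096, "m5.xlarge": 0.192}

def estimate_daily_cost(instance_type: str, hours_running: float = 24.0) -> float:
    """Fallback estimate used when no billing API is configured."""
    rate = HOURLY_RATES.get(instance_type)
    if rate is None:
        raise KeyError(f"no rate on file for {instance_type}")
    return round(rate * hours_running, 4)
```

Estimates like this ignore discounts (spot, reserved, savings plans), which is exactly why the billing-API path is preferred when available.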

LLM Spend Tracking

The AI Spend module tracks costs from LLM providers (OpenAI, Anthropic, AWS Bedrock, Azure OpenAI) at the token level. It records input tokens, output tokens, model name, provider, latency, and maps costs to AI workloads (training, inference, fine-tuning, embedding). Unit economics are calculated automatically: cost per 1K tokens, cost per inference request, cost per training run. Budget tracking per workload shows burn rate and projected overage.
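The cost-per-1K-tokens unit metric is the obvious ratio; the sketch below shows the arithmetic (the formula is inferred from the description, not taken from GoCloudera's source).

```python
def cost_per_1k_tokens(total_cost_usd: float, input_tokens: int, output_tokens: int) -> float:
    """Blended cost per 1,000 tokens across input and output."""
    total_tokens = input_tokens + output_tokens
    if total_tokens == 0:
        return 0.0
    return total_cost_usd / total_tokens * 1000
```

For example, $3.00 spent across 500K input and 100K output tokens works out to $0.005 per 1K tokens; tracking this per workload is what makes burn rate and projected overage meaningful.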

Alert Rules vs Enforcement Policies

These are distinct systems that serve different purposes:

Alert Rules are monitoring thresholds that notify humans: "Tell me when GPU utilization exceeds 95% for 10 minutes." They support per-metric thresholds, duration conditions, scope filtering (all instances, by tag, or specific instances), and route to notification channels.

Enforcement Policies are automated actions that respond to conditions: "When daily cost exceeds $1000, scale down the lowest-utilized 50% of instances." They support composite AND/OR logic, budget-aware metrics, schedule constraints, maintenance window suppression, and three execution modes (auto, approval, notify-only).

You can use alert rules to monitor and enforcement policies to act, or use enforcement policies in notify-only mode to do both.
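The composite AND/OR logic with nesting can be sketched as a recursive evaluator. The condition schema below ("all"/"any" keys, "between" for the range operator) is invented for illustration and is not GoCloudera's actual policy format.

```python
def eval_condition(cond: dict, metrics: dict) -> bool:
    """Recursively evaluate a nested AND/OR condition tree over metrics."""
    if "all" in cond:   # AND: every child condition must hold
        return all(eval_condition(c, metrics) for c in cond["all"])
    if "any" in cond:   # OR: at least one child condition must hold
        return any(eval_condition(c, metrics) for c in cond["any"])
    value = metrics[cond["metric"]]
    op = cond["op"]
    if op == ">":
        return value > cond["value"]
    if op == "<":
        return value < cond["value"]
    if op == "between":  # range operator
        lo, hi = cond["value"]
        return lo <= value <= hi
    raise ValueError(f"unknown operator {op!r}")

# "Daily cost over $1000 AND (CPU under 20% OR GPU between 0 and 10%)"
policy = {"all": [
    {"metric": "daily_cost", "op": ">", "value": 1000},
    {"any": [
        {"metric": "cpu_util", "op": "<", "value": 20},
        {"metric": "gpu_util", "op": "between", "value": (0, 10)},
    ]},
]}
```

Because the tree is recursive, arbitrarily deep nesting of AND/OR groups falls out for free.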

Anomaly Detection Audit Trail

Every anomaly detection run writes a row to the analysis_audit_log table with: baseline statistics used (mean, stddev, Q1, Q3, IQR), whether the baseline was day-specific, current metric value, each method’s raw score and trigger status, the composite confidence, and whether it was classified as an anomaly. This enables debugging false positives, tuning thresholds, and regulatory audit compliance.

Inference Feedback Loop

When the platform generates a recommendation (resize suggestion, anomaly alert, cost optimization), users can mark it as accepted, rejected, or modified. This feedback is stored in the inference_feedback table with the user’s reason and the outcome metrics (cost/utilization before and after). Over time, this creates a labeled dataset for training ML models to improve recommendation quality.

Data Model

Core entities and their relationships:

Infrastructure

  • Backend: Node.js (Express) on AWS App Runner
  • Database: PostgreSQL on AWS RDS
  • Cache/Events: Redis on AWS ElastiCache
  • Agent: Python, deployed via Docker/systemd/K8s in customer infrastructure
  • CI/CD: GitHub Actions → ECR → App Runner with PostgreSQL service containers for testing
  • Auth: JWT with Cognito-compatible token format, API key auth for agents
  • gRPC: Port 50051 with optional TLS