API Reference

Base URL: https://api.gocloudera.com/api
Unless noted otherwise, endpoints require authentication via a Bearer token (JWT) or an API key (X-API-Key header). All responses return JSON.

Authentication

Login

POST /auth/login
Body:
{
  "email": "user@company.com",
  "password": "your-password",
  "tenant_id": "uuid-of-tenant"
}
Response:
{
  "success": true,
  "data": {
    "accessToken": "eyJhbG...",
    "refreshToken": "eyJhbG...",
    "user": { "id": 1, "email": "user@company.com", "role": "admin" }
  }
}
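
As an illustration of the login exchange above, here is a minimal Python sketch that builds the request and reads the tokens out of the response envelope. The helper names are hypothetical, the Content-Type header is an assumption (this page only shows JSON bodies), and the request is built but not sent.

```python
import json
import urllib.request

BASE_URL = "https://api.gocloudera.com/api"  # base URL stated at the top of this page


def build_login_request(email, password, tenant_id):
    """Build (but do not send) the POST /auth/login request."""
    body = json.dumps({"email": email, "password": password, "tenant_id": tenant_id})
    return urllib.request.Request(
        f"{BASE_URL}/auth/login",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},  # assumed, not stated in this reference
        method="POST",
    )


def extract_tokens(response_body):
    """Return (accessToken, refreshToken) from a login response body."""
    payload = json.loads(response_body)
    if not payload.get("success"):
        # Error responses carry a human-readable "error" field.
        raise ValueError(payload.get("error", "login failed"))
    data = payload["data"]
    return data["accessToken"], data["refreshToken"]
```

The access token would then be sent as `Authorization: Bearer <accessToken>` on later calls, and the refresh token posted to /auth/refresh-tokens when needed.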

Refresh Tokens

POST /auth/refresh-tokens
Body:
{ "refreshToken": "eyJhbG..." }

List Available Tenants (Public)

GET /auth/tenants
Returns active tenants for the login tenant picker. No authentication required.

GPU Instances

List Instances

GET /instances
Query Parameters:
Param | Type | Description
state | string | Filter by state: running, stopped, terminated
instance_type | string | Filter by instance type (e.g., p3.2xlarge)
gpu_type | string | Filter by GPU type (e.g., V100, A100)
Response:
{
  "success": true,
  "data": [
    {
      "id": 1,
      "instance_id": "i-0abc123def456",
      "cloud_provider": "aws",
      "region": "us-east-1",
      "instance_type": "p3.2xlarge",
      "gpu_type": "V100",
      "gpu_count": 1,
      "state": "running",
      "tags": { "environment": "production", "team": "ml-training" },
      "hourly_cost": 3.06,
      "created_at": "2026-03-01T00:00:00.000Z"
    }
  ],
  "count": 1
}
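
The filters and response shape for GET /instances can be exercised with a small Python sketch like the one below. The function names are hypothetical, and the client-side check on `state` is an added convenience rather than documented server behavior.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.gocloudera.com/api"  # base URL stated at the top of this page
ALLOWED_STATES = {"running", "stopped", "terminated"}


def instances_url(state=None, instance_type=None, gpu_type=None):
    """Build the GET /instances URL, including only the filters that were set."""
    if state is not None and state not in ALLOWED_STATES:
        raise ValueError(f"unsupported state: {state}")
    filters = {"state": state, "instance_type": instance_type, "gpu_type": gpu_type}
    query = urlencode({k: v for k, v in filters.items() if v is not None})
    return f"{BASE_URL}/instances" + (f"?{query}" if query else "")


def total_hourly_cost(response):
    """Sum hourly_cost over the instances in a list response."""
    return sum(item["hourly_cost"] for item in response["data"])
```

For example, `instances_url(state="running", gpu_type="A100")` yields a URL ending in `/instances?state=running&gpu_type=A100`.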

Get Instance Details

GET /instances/:instanceId
Returns the instance along with its latest metrics.

Get Instance Metrics

GET /instances/:instanceId/metrics
Query Parameters:
Param | Type | Default | Description
hours | int | 24 | Lookback period in hours
limit | int | 1000 | Max data points

Get Idle Instances

GET /instances/status/idle
Returns all instances with GPU utilization below the idle threshold.

Start / Stop Instance

POST /instances/:id/start
POST /instances/:id/stop
Creates an entry in the action queue. Returns the action ID for tracking.

Costs

Get Cost Data

GET /costs
Query Parameters:
Param | Type | Default | Description
days | int | 30 | Lookback period
service | string | - | Filter by service
instance_type | string | - | Filter by instance type
start_date | date | - | Start date (YYYY-MM-DD)
end_date | date | - | End date (YYYY-MM-DD)

Get Cost Summary

GET /costs/summary
Returns total cost, cost by service, cost by instance type, and daily breakdown.

Get Cost Trends

GET /costs/trends
Returns cost trend data over time for charting.

Get Budget Status

GET /costs/budget-status
Returns current month budget utilization, burn rate, projected spend, and projected overage.

Get Costs by Tag

GET /costs/by-tags
Query Parameters:
Param | Type | Required | Description
tag_key | string | Yes | Tag key to group by (e.g., environment, team)
Response:
{
  "success": true,
  "data": {
    "groups": [
      { "tag_value": "production", "total_cost": 15240.50, "record_count": 450 },
      { "tag_value": "staging", "total_cost": 3200.00, "record_count": 120 }
    ],
    "time_series": [
      { "date": "2026-03-01", "tag_value": "production", "total_cost": 520.00 },
      { "date": "2026-03-01", "tag_value": "staging", "total_cost": 110.00 }
    ]
  }
}
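
To show how the `groups` array in this response might be consumed, the following Python sketch computes each tag value's percentage share of total cost. The function name and the one-decimal rounding are illustrative choices, not part of the API.

```python
def cost_share_by_tag(response):
    """Map each tag_value to its percentage share of the summed total_cost."""
    groups = response["data"]["groups"]
    total = sum(g["total_cost"] for g in groups)
    if total == 0:
        # Avoid dividing by zero when no cost was recorded.
        return {g["tag_value"]: 0.0 for g in groups}
    return {g["tag_value"]: round(100 * g["total_cost"] / total, 1) for g in groups}
```

Applied to the sample response above, this gives roughly 82.6 for production and 17.4 for staging.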

AI Spend (LLM Cost Tracking)

Get AI Spend

GET /ai-spend
Query Parameters:
Param | Type | Description
spend_type | string | inference, training, fine_tuning, embedding
provider | string | openai, anthropic, aws_bedrock, azure_openai
model_name | string | Filter by model (e.g., gpt-4, claude-3-opus)
workload_id | int | Filter by AI workload
project_id | string | Filter by project

Get AI Spend Summary

GET /ai-spend/summary
Returns total spend, spend by provider, by model, and unit economics (cost per token, per training run).

Get Unit Economics

GET /ai-spend/unit-economics
Returns cost per token, cost per inference, cost per training run across providers and models.

Get Spend by Dimension

GET /ai-spend/by-dimension
Query Parameters:
Param | Type | Description
dimension | string | project_id, team_id, cost_center, business_unit

Get Budget Burn Rate

GET /ai-spend/budget-status
Returns per-workload budget tracking with burn rate and projected overage.

Alerts

List Alerts

GET /alerts
Query Parameters:
Param | Type | Description
status | string | active, resolved, ignored
alert_type | string | Filter by type
instance_id | string | Filter by instance
limit | int | Max results (default 50)

Get Alert Summary

GET /alerts/summary
Returns counts by status and by alert type.

Resolve / Ignore Alert

PATCH /alerts/:alertId/resolve
PATCH /alerts/:alertId/ignore

Acknowledge Alert (Stops Escalation)

POST /alerts/:id/acknowledge

Alert Rules

List Rules

GET /alert-rules
Query Parameters:
Param | Type | Description
metric | string | Filter by metric type
enabled | boolean | Filter by enabled state
severity | string | Filter by severity
page | int | Page number
limit | int | Items per page

Create Rule

POST /alert-rules
Body:
{
  "rule_name": "High GPU Temperature",
  "description": "Alert when GPU temperature exceeds 85C for 10 minutes",
  "metric": "temperature",
  "operator": "gt",
  "threshold": 85,
  "duration_minutes": 10,
  "severity": "high",
  "scope": "tagged",
  "scope_filter": { "tag_key": "environment", "tag_value": "production" },
  "notification_channel_ids": [1, 3],
  "cooldown_minutes": 30
}
Supported Metrics: gpu_utilization, cpu_utilization, memory_utilization, daily_cost, hourly_cost, temperature, error_rate
Supported Operators: gt, lt, gte, lte, eq, not_eq
Scope Options:
  • all — monitors all instances
  • tagged — monitors instances matching a tag filter
  • specific_instance — monitors specific instance IDs
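
A rule body can be checked against the supported metrics, operators, and scopes listed above before it is posted. The Python sketch below is a hypothetical client-side helper. The rule that a `tagged` scope needs a `scope_filter` is an assumption drawn from the example body, not a documented requirement.

```python
SUPPORTED_METRICS = {
    "gpu_utilization", "cpu_utilization", "memory_utilization",
    "daily_cost", "hourly_cost", "temperature", "error_rate",
}
SUPPORTED_OPERATORS = {"gt", "lt", "gte", "lte", "eq", "not_eq"}
SCOPES = {"all", "tagged", "specific_instance"}


def validate_rule(rule):
    """Return a list of problems; an empty list means these checks passed."""
    problems = []
    if rule.get("metric") not in SUPPORTED_METRICS:
        problems.append(f"unsupported metric: {rule.get('metric')}")
    if rule.get("operator") not in SUPPORTED_OPERATORS:
        problems.append(f"unsupported operator: {rule.get('operator')}")
    if rule.get("scope") not in SCOPES:
        problems.append(f"unsupported scope: {rule.get('scope')}")
    # Assumption: a tagged scope is only meaningful with a scope_filter.
    if rule.get("scope") == "tagged" and not rule.get("scope_filter"):
        problems.append("tagged scope needs a scope_filter")
    return problems
```

The server remains the authority on validity; a check like this only catches obvious typos early.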

Get Rule Trigger History

GET /alert-rules/:id/history
Returns when the rule triggered, on which instances, and what actions were taken.

Update / Delete Rule

PUT /alert-rules/:id
DELETE /alert-rules/:id

Enforcement Policies

List Policies

GET /enforcement-policies

Get Policy Templates

GET /enforcement-policies/templates
Returns pre-built policy templates that can be cloned and customized.

Create from Template

POST /enforcement-policies/from-template/:templateId
Body (optional overrides):
{
  "policy_name": "My Custom Idle Policy",
  "conditions": { "operator": "AND", "rules": [...] },
  "execution_mode": "approval_required"
}

Simulate Policy (Dry Run)

POST /enforcement-policies/simulate
Body:
{
  "conditions": {
    "operator": "AND",
    "rules": [
      { "metric": "gpu_utilization", "operator": "lt", "threshold": 10, "duration": 30 },
      { "metric": "daily_cost", "operator": "gt", "threshold": 100 }
    ]
  },
  "lookback_days": 7
}
Response:
{
  "success": true,
  "data": {
    "triggers": [
      { "timestamp": "2026-03-20T14:30:00Z", "instance_id": "i-0abc123", "conditions_met": [...] }
    ],
    "total_triggers": 12,
    "affected_instances": ["i-0abc123", "i-0def456"]
  }
}
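
The simulate request and response above could be handled with a pair of small Python helpers such as these. The names are hypothetical, and the `lookback_days=7` default only mirrors the example body; this page does not state a server-side default.

```python
from collections import Counter


def build_simulation(rules, lookback_days=7, operator="AND"):
    """Assemble a body for POST /enforcement-policies/simulate."""
    return {
        "conditions": {"operator": operator, "rules": list(rules)},
        "lookback_days": lookback_days,
    }


def triggers_per_instance(response):
    """Count how many simulated triggers each instance_id produced."""
    return Counter(t["instance_id"] for t in response["data"]["triggers"])
```

Counting triggers per instance is one way to spot a policy that would fire repeatedly on the same machine before it is enabled.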

Create Policy

POST /enforcement-policies
Body:
{
  "policy_name": "Budget Guard",
  "description": "Scale down when monthly budget is 80% consumed",
  "severity": "high",
  "policy_type": "budget_threshold",
  "execution_mode": "approval_required",
  "cooldown_minutes": 120,
  "schedule": {
    "timezone": "America/New_York",
    "active_hours": { "start": "08:00", "end": "22:00" },
    "active_days": [1, 2, 3, 4, 5]
  },
  "conditions": {
    "operator": "AND",
    "rules": [
      { "metric": "monthly_budget_utilization", "operator": "gte", "threshold": 80 },
      { "metric": "days_remaining_in_month", "operator": "gte", "threshold": 5 }
    ]
  },
  "actions": [
    { "type": "scale_down_instances", "target": "lowest_utilization", "percentage": 30 },
    { "type": "notify_finance_team", "channels": ["slack", "email"] }
  ]
}

Toggle Policy

PATCH /enforcement-policies/:id/toggle

Notification Channels

List Channels

GET /notification-channels

Create Channel

POST /notification-channels
Body examples by type:
Slack:
{
  "name": "Engineering Slack",
  "channel_type": "slack",
  "config": { "webhook_url": "https://hooks.slack.com/services/T.../B.../..." },
  "alert_types": ["cost_threshold", "security_incident"],
  "min_priority": "medium",
  "digest_mode": "instant"
}
PagerDuty:
{
  "name": "Ops PagerDuty",
  "channel_type": "pagerduty",
  "config": { "routing_key": "your-integration-key" },
  "alert_types": ["security_incident"],
  "min_priority": "critical"
}
Email (with digest):
{
  "name": "Finance Team Email",
  "channel_type": "email",
  "config": { "recipients": ["finance@company.com", "cfo@company.com"] },
  "alert_types": ["cost_threshold", "budget_exceeded"],
  "min_priority": "low",
  "digest_mode": "batched",
  "digest_interval_minutes": 30
}

Test Channel

POST /notification-channels/:id/test
Sends a test notification to verify the channel is configured correctly.

Maintenance Windows

Create Window

POST /maintenance-windows
Body:
{
  "name": "Saturday Deploy Window",
  "start_time": "2026-03-28T02:00:00Z",
  "end_time": "2026-03-28T04:00:00Z",
  "suppress_alerts": true,
  "suppress_enforcement": true,
  "scope": { "tags": { "environment": "production" } }
}
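
As a rough illustration of what a window like this implies, the Python sketch below tests whether a timestamp falls inside a window that suppresses alerts or enforcement. It is a client-side approximation: the function names are hypothetical, the half-open interval (start inclusive, end exclusive) is an assumption, and scope matching is left out.

```python
from datetime import datetime


def parse_utc(ts):
    """Parse an ISO 8601 timestamp; a trailing 'Z' is rewritten for older Python versions."""
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def is_suppressed(window, at, kind="alerts"):
    """Return True if `at` lies inside the window and suppress_<kind> is set.

    kind is "alerts" or "enforcement".
    """
    if not window.get(f"suppress_{kind}", False):
        return False
    # Assumption: the window covers [start_time, end_time).
    return parse_utc(window["start_time"]) <= parse_utc(at) < parse_utc(window["end_time"])
```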

Get Active Windows

GET /maintenance-windows/active

Exports

Export any data as CSV or PDF.
GET /exports/instances?format=csv&state=running
GET /exports/costs?format=pdf&days=30
GET /exports/ai-spend?format=csv&provider=openai
GET /exports/alerts?format=csv&status=active
GET /exports/metrics?format=csv&instance_id=i-0abc123&hours=168
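
The export URLs above all follow the same pattern, which can be captured in a short Python sketch. The function name is hypothetical, and the resource and format sets simply list what appears on this page.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.gocloudera.com/api"  # base URL stated at the top of this page
EXPORT_RESOURCES = {"instances", "costs", "ai-spend", "alerts", "metrics"}
EXPORT_FORMATS = {"csv", "pdf"}


def export_url(resource, fmt="csv", **filters):
    """Build a GET /exports/<resource> URL with format first, then any filters."""
    if resource not in EXPORT_RESOURCES:
        raise ValueError(f"unknown export resource: {resource}")
    if fmt not in EXPORT_FORMATS:
        raise ValueError(f"unknown export format: {fmt}")
    return f"{BASE_URL}/exports/{resource}?" + urlencode({"format": fmt, **filters})
```

For instance, `export_url("costs", "pdf", days=30)` produces a URL ending in `/exports/costs?format=pdf&days=30`.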

Data Sync (Agent Endpoint)

Sync Data from Agent

POST /sync
Headers:
X-API-Key: your-tenant-api-key
Body:
{
  "instances": [...],
  "metrics": [...],
  "costs": [...],
  "alerts": [...],
  "ai_spend": [...],
  "errors": [...]
}
The agent also supports gRPC bidirectional streaming on port 50051 for real-time data delivery and instant command execution.
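
For the HTTP variant (POST /sync), a request might be assembled as in the Python sketch below. The helper name is hypothetical, sending an empty list for batches that have no data is an assumption, and the shape of the records inside each array is not specified on this page, so it is left to the caller.

```python
import json
import urllib.request

BASE_URL = "https://api.gocloudera.com/api"  # base URL stated at the top of this page
SYNC_KEYS = ("instances", "metrics", "costs", "alerts", "ai_spend", "errors")


def build_sync_request(api_key, **batches):
    """Build (but do not send) a POST /sync request authenticated with X-API-Key."""
    unknown = set(batches) - set(SYNC_KEYS)
    if unknown:
        raise ValueError(f"unknown sync keys: {sorted(unknown)}")
    # Assumption: keys with no data are sent as empty lists.
    body = {key: batches.get(key, []) for key in SYNC_KEYS}
    return urllib.request.Request(
        f"{BASE_URL}/sync",
        data=json.dumps(body).encode("utf-8"),
        headers={"X-API-Key": api_key, "Content-Type": "application/json"},
        method="POST",
    )
```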

Error Responses

All errors follow this format:
{
  "success": false,
  "error": "Human-readable error message"
}
Common HTTP status codes:
  • 400 — Bad request (missing/invalid parameters)
  • 401 — Unauthorized (invalid or missing token)
  • 403 — Forbidden (insufficient role)
  • 404 — Resource not found
  • 429 — Rate limited
  • 500 — Internal server error
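
One way a client might turn this error envelope into exceptions is sketched below in Python. The `ApiError` class and `parse_response` function are hypothetical, and treating 429 and 500 as retryable is a client policy choice rather than something this reference prescribes.

```python
import json


class ApiError(Exception):
    """Raised when a response signals failure via status code or success=false."""

    def __init__(self, status, message):
        super().__init__(f"{status}: {message}")
        self.status = status
        self.message = message

    @property
    def retryable(self):
        # Assumed policy: only rate limiting and server errors are worth retrying.
        return self.status in (429, 500)


def parse_response(status, body):
    """Return the decoded payload, or raise ApiError for an error response."""
    payload = json.loads(body)
    if status >= 400 or not payload.get("success", False):
        raise ApiError(status, payload.get("error", "unknown error"))
    return payload
```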

Rate Limits

API rate limits are configurable per tenant (default: 1000 requests/minute). Rate limit headers are included in responses:
  • X-RateLimit-Limit — Max requests per window
  • X-RateLimit-Remaining — Remaining requests
  • X-RateLimit-Reset — Window reset timestamp
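
A client could use these headers to decide how long to pause before the next request. The Python sketch below assumes that X-RateLimit-Reset is a Unix timestamp in seconds, which this page does not confirm, and the function name is hypothetical.

```python
def seconds_until_reset(headers, now):
    """Return how long to wait before the next request, in seconds.

    `headers` maps header names to string values; `now` is the current Unix time.
    """
    remaining = int(headers.get("X-RateLimit-Remaining", 1))
    if remaining > 0:
        return 0.0  # budget left in this window, no need to wait
    # Assumption: the reset header is a Unix epoch timestamp in seconds.
    reset_at = float(headers["X-RateLimit-Reset"])
    return max(0.0, reset_at - now)
```

In practice the result would be passed to `time.sleep` with `now=time.time()`. If the reset value turns out to use another format, only the parsing line needs to change.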