API Reference

Base URL: https://api.gocloudera.com/api

Unless noted otherwise, endpoints require authentication via a Bearer token (JWT) or an API key (X-API-Key header). All responses return JSON.

Authentication

Login

POST /auth/login

Body:

{
  "email": "user@company.com",
  "password": "your-password",
  "tenant_id": "uuid-of-tenant"
}

Response:

{
  "success": true,
  "data": {
    "accessToken": "eyJhbG...",
    "refreshToken": "eyJhbG...",
    "user": { "id": 1, "email": "user@company.com", "role": "admin" }
  }
}
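As a minimal sketch of the login exchange above, the following Python builds the request with the standard library and turns a response shaped like the example into an Authorization header. The helper names (`build_login_request`, `bearer_header`) are illustrative and not part of the API, and sending a JSON `Content-Type` header is an assumption. Nothing is sent until the request is passed to `urllib.request.urlopen`.

```python
import json
import urllib.request

BASE_URL = "https://api.gocloudera.com/api"  # base URL given in this reference


def build_login_request(email: str, password: str, tenant_id: str) -> urllib.request.Request:
    """Build (but do not send) the POST /auth/login request."""
    body = json.dumps({"email": email, "password": password, "tenant_id": tenant_id})
    return urllib.request.Request(
        f"{BASE_URL}/auth/login",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},  # assumed; not stated in the reference
        method="POST",
    )


def bearer_header(login_response: dict) -> dict:
    """Turn a successful login response into an Authorization header."""
    if not login_response.get("success"):
        raise ValueError(login_response.get("error", "login failed"))
    return {"Authorization": f"Bearer {login_response['data']['accessToken']}"}
```

The returned header can be merged into the headers of any later authenticated request.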

Refresh Tokens

POST /auth/refresh-tokens

Body:

{ "refreshToken": "eyJhbG..." }

List Available Tenants (Public)

GET /auth/tenants

Returns active tenants for the login tenant picker. No authentication required.

GPU Instances

List Instances

GET /instances

Query Parameters:

Param	Type	Description
`state`	string	Filter by state: `running`, `stopped`, `terminated`
`instance_type`	string	Filter by instance type (e.g., `p3.2xlarge`)
`gpu_type`	string	Filter by GPU type (e.g., `V100`, `A100`)

Response:

{
  "success": true,
  "data": [
    {
      "id": 1,
      "instance_id": "i-0abc123def456",
      "cloud_provider": "aws",
      "region": "us-east-1",
      "instance_type": "p3.2xlarge",
      "gpu_type": "V100",
      "gpu_count": 1,
      "state": "running",
      "tags": { "environment": "production", "team": "ml-training" },
      "hourly_cost": 3.06,
      "created_at": "2026-03-01T00:00:00.000Z"
    }
  ],
  "count": 1
}
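The filters and response shape above could be used roughly as follows. This is a sketch: `instances_url` and `hourly_cost_total` are hypothetical helper names, and the code only builds the URL and reads a response that has already been decoded from JSON.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.gocloudera.com/api"  # base URL given in this reference


def instances_url(state=None, instance_type=None, gpu_type=None) -> str:
    """Build the GET /instances URL, omitting filters that are not set."""
    params = {"state": state, "instance_type": instance_type, "gpu_type": gpu_type}
    query = urlencode({k: v for k, v in params.items() if v is not None})
    return f"{BASE_URL}/instances" + (f"?{query}" if query else "")


def hourly_cost_total(response: dict) -> float:
    """Sum hourly_cost across the instances in a list response."""
    return sum(item["hourly_cost"] for item in response["data"])
```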

Get Instance Details

GET /instances/:instanceId

Returns the instance together with its latest metrics.

Get Instance Metrics

GET /instances/:instanceId/metrics

Query Parameters:

Param	Type	Default	Description
`hours`	int	24	Lookback period in hours
`limit`	int	1000	Max data points

Get Idle Instances

GET /instances/status/idle

Returns all instances whose GPU utilization is below the idle threshold.

Start / Stop Instance

POST /instances/:id/start
POST /instances/:id/stop

Creates an entry in the action queue. Returns the action ID for tracking.

Costs

Get Cost Data

GET /costs

Query Parameters:

Param	Type	Default	Description
`days`	int	30	Lookback period
`service`	string	-	Filter by service
`instance_type`	string	-	Filter by instance type
`start_date`	date	-	Start date (YYYY-MM-DD)
`end_date`	date	-	End date (YYYY-MM-DD)
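To illustrate the date parameters in the table above, here is a small sketch that formats an explicit date range as YYYY-MM-DD. The helper name `costs_url` is hypothetical, and the reference does not say how `days` interacts with `start_date` and `end_date`, so the sketch sends only the explicit range.

```python
from datetime import date
from urllib.parse import urlencode

BASE_URL = "https://api.gocloudera.com/api"  # base URL given in this reference


def costs_url(start: date, end: date, service=None, instance_type=None) -> str:
    """Build a GET /costs URL for an explicit date range."""
    if end < start:
        raise ValueError("end_date must not be earlier than start_date")
    # date.isoformat() yields YYYY-MM-DD, the format listed in the table.
    params = {"start_date": start.isoformat(), "end_date": end.isoformat()}
    if service is not None:
        params["service"] = service
    if instance_type is not None:
        params["instance_type"] = instance_type
    return f"{BASE_URL}/costs?{urlencode(params)}"
```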

Get Cost Summary

GET /costs/summary

Returns total cost, cost by service, cost by instance type, and daily breakdown.

Get Cost Trends

GET /costs/trends

Returns cost trend data over time for charting.

Get Budget Status

GET /costs/budget-status

Returns current month budget utilization, burn rate, projected spend, and projected overage.

Get Costs by Tag

GET /costs/by-tags

Query Parameters:

Param	Type	Required	Description
`tag_key`	string	Yes	Tag key to group by (e.g., `environment`, `team`)

Response:

{
  "success": true,
  "data": {
    "groups": [
      { "tag_value": "production", "total_cost": 15240.50, "record_count": 450 },
      { "tag_value": "staging", "total_cost": 3200.00, "record_count": 120 }
    ],
    "time_series": [
      { "date": "2026-03-01", "tag_value": "production", "total_cost": 520.00 },
      { "date": "2026-03-01", "tag_value": "staging", "total_cost": 110.00 }
    ]
  }
}
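One way to consume the `groups` array in a response like the one above is to convert totals into percentage shares. The function name `cost_share_by_tag` is illustrative and the rounding to one decimal place is an arbitrary choice.

```python
def cost_share_by_tag(response: dict) -> dict:
    """Map each tag_value to its percentage share of the grouped total cost."""
    groups = response["data"]["groups"]
    total = sum(group["total_cost"] for group in groups)
    if total == 0:
        return {}
    return {
        group["tag_value"]: round(100 * group["total_cost"] / total, 1)
        for group in groups
    }
```

Applied to the example response, production accounts for roughly 82.6% of the grouped cost and staging for roughly 17.4%.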

AI Spend (LLM Cost Tracking)

Get AI Spend

GET /ai-spend

Query Parameters:

Param	Type	Description
`spend_type`	string	`inference`, `training`, `fine_tuning`, `embedding`
`provider`	string	`openai`, `anthropic`, `aws_bedrock`, `azure_openai`
`model_name`	string	Filter by model (e.g., `gpt-4`, `claude-3-opus`)
`workload_id`	int	Filter by AI workload
`project_id`	string	Filter by project

Get AI Spend Summary

GET /ai-spend/summary

Returns total spend, spend by provider, by model, and unit economics (cost per token, per training run).

Get Unit Economics

GET /ai-spend/unit-economics

Returns cost per token, cost per inference, cost per training run across providers and models.

Get Spend by Dimension

GET /ai-spend/by-dimension

Query Parameters:

Param	Type	Description
`dimension`	string	`project_id`, `team_id`, `cost_center`, `business_unit`

Get Budget Burn Rate

GET /ai-spend/budget-status

Returns per-workload budget tracking with burn rate and projected overage.

Alerts

List Alerts

GET /alerts

Query Parameters:

Param	Type	Description
`status`	string	`active`, `resolved`, `ignored`
`alert_type`	string	Filter by type
`instance_id`	string	Filter by instance
`limit`	int	Max results (default 50)

Get Alert Summary

GET /alerts/summary

Returns counts by status and by alert type.

Resolve / Ignore Alert

PATCH /alerts/:alertId/resolve
PATCH /alerts/:alertId/ignore

Acknowledge Alert (Stops Escalation)

POST /alerts/:id/acknowledge

Alert Rules

List Rules

GET /alert-rules

Query Parameters:

Param	Type	Description
`metric`	string	Filter by metric type
`enabled`	boolean	Filter by enabled state
`severity`	string	Filter by severity
`page`	int	Page number
`limit`	int	Items per page

Create Rule

POST /alert-rules

Body:

{
  "rule_name": "High GPU Temperature",
  "description": "Alert when GPU temperature exceeds 85C for 10 minutes",
  "metric": "temperature",
  "operator": "gt",
  "threshold": 85,
  "duration_minutes": 10,
  "severity": "high",
  "scope": "tagged",
  "scope_filter": { "tag_key": "environment", "tag_value": "production" },
  "notification_channel_ids": [1, 3],
  "cooldown_minutes": 30
}

Supported Metrics: gpu_utilization, cpu_utilization, memory_utilization, daily_cost, hourly_cost, temperature, error_rate

Supported Operators: gt, lt, gte, lte, eq, not_eq

Scope Options:

all — monitors all instances
tagged — monitors instances matching a tag filter
specific_instance — monitors specific instance IDs
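A client could check a rule body against the supported values listed above before sending it. The sketch below is a local pre-flight check, not the server's validation logic. The function name `validate_rule` is hypothetical, and requiring `scope_filter` when `scope` is `tagged` is an assumption drawn from the example body.

```python
SUPPORTED_METRICS = {
    "gpu_utilization", "cpu_utilization", "memory_utilization",
    "daily_cost", "hourly_cost", "temperature", "error_rate",
}
SUPPORTED_OPERATORS = {"gt", "lt", "gte", "lte", "eq", "not_eq"}
SCOPE_OPTIONS = {"all", "tagged", "specific_instance"}


def validate_rule(rule: dict) -> list:
    """Return a list of problems found in an alert rule body (empty if none)."""
    problems = []
    if rule.get("metric") not in SUPPORTED_METRICS:
        problems.append(f"unsupported metric: {rule.get('metric')!r}")
    if rule.get("operator") not in SUPPORTED_OPERATORS:
        problems.append(f"unsupported operator: {rule.get('operator')!r}")
    if rule.get("scope") not in SCOPE_OPTIONS:
        problems.append(f"unsupported scope: {rule.get('scope')!r}")
    # Assumption: a tagged scope needs a scope_filter, as in the example body.
    if rule.get("scope") == "tagged" and "scope_filter" not in rule:
        problems.append("scope 'tagged' without scope_filter")
    return problems
```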

Get Rule Trigger History

GET /alert-rules/:id/history

Returns when the rule triggered, on which instances, and what actions were taken.

Update / Delete Rule

PUT /alert-rules/:id
DELETE /alert-rules/:id

Enforcement Policies

List Policies

GET /enforcement-policies

Get Policy Templates

GET /enforcement-policies/templates

Returns pre-built policy templates that can be cloned and customized.

Create from Template

POST /enforcement-policies/from-template/:templateId

Body (optional overrides):

{
  "policy_name": "My Custom Idle Policy",
  "conditions": { "operator": "AND", "rules": [...] },
  "execution_mode": "approval_required"
}

Simulate Policy (Dry Run)

POST /enforcement-policies/simulate

Body:

{
  "conditions": {
    "operator": "AND",
    "rules": [
      { "metric": "gpu_utilization", "operator": "lt", "threshold": 10, "duration": 30 },
      { "metric": "daily_cost", "operator": "gt", "threshold": 100 }
    ]
  },
  "lookback_days": 7
}

Response:

{
  "success": true,
  "data": {
    "triggers": [
      { "timestamp": "2026-03-20T14:30:00Z", "instance_id": "i-0abc123", "conditions_met": [...] }
    ],
    "total_triggers": 12,
    "affected_instances": ["i-0abc123", "i-0def456"]
  }
}
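To get a feel for how a `conditions` block like the one in the simulate body might match, the sketch below evaluates it against a single metrics sample on the client side. It is an illustration only and not the service's evaluation logic: it assumes policy rules accept the same operator names listed under Alert Rules, it ignores the `duration` field, and it handles only the `AND` combinator shown in this reference. The name `conditions_met` is hypothetical.

```python
import operator

# Operator names as listed under Alert Rules; assumed to apply to policies too.
OPERATORS = {
    "gt": operator.gt, "lt": operator.lt,
    "gte": operator.ge, "lte": operator.le,
    "eq": operator.eq, "not_eq": operator.ne,
}


def conditions_met(conditions: dict, sample: dict) -> bool:
    """Check every rule against one metrics sample (duration is ignored)."""
    if conditions["operator"] != "AND":
        raise ValueError("only the AND combinator appears in this reference")
    return all(
        OPERATORS[rule["operator"]](sample[rule["metric"]], rule["threshold"])
        for rule in conditions["rules"]
    )
```

For the simulate body above, a sample with `gpu_utilization` of 4 and `daily_cost` of 150 satisfies both rules, while raising utilization to 40 does not.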

Create Policy

POST /enforcement-policies

Body:

{
  "policy_name": "Budget Guard",
  "description": "Scale down when monthly budget is 80% consumed",
  "severity": "high",
  "policy_type": "budget_threshold",
  "execution_mode": "approval_required",
  "cooldown_minutes": 120,
  "schedule": {
    "timezone": "America/New_York",
    "active_hours": { "start": "08:00", "end": "22:00" },
    "active_days": [1, 2, 3, 4, 5]
  },
  "conditions": {
    "operator": "AND",
    "rules": [
      { "metric": "monthly_budget_utilization", "operator": "gte", "threshold": 80 },
      { "metric": "days_remaining_in_month", "operator": "gte", "threshold": 5 }
    ]
  },
  "actions": [
    { "type": "scale_down_instances", "target": "lowest_utilization", "percentage": 30 },
    { "type": "notify_finance_team", "channels": ["slack", "email"] }
  ]
}

Toggle Policy

PATCH /enforcement-policies/:id/toggle

Notification Channels

List Channels

GET /notification-channels

Create Channel

POST /notification-channels

Body examples by type:

Slack:

{
  "name": "Engineering Slack",
  "channel_type": "slack",
  "config": { "webhook_url": "https://hooks.slack.com/services/T.../B.../..." },
  "alert_types": ["cost_threshold", "security_incident"],
  "min_priority": "medium",
  "digest_mode": "instant"
}

PagerDuty:

{
  "name": "Ops PagerDuty",
  "channel_type": "pagerduty",
  "config": { "routing_key": "your-integration-key" },
  "alert_types": ["security_incident"],
  "min_priority": "critical"
}

Email (with digest):

{
  "name": "Finance Team Email",
  "channel_type": "email",
  "config": { "recipients": ["finance@company.com", "cfo@company.com"] },
  "alert_types": ["cost_threshold", "budget_exceeded"],
  "min_priority": "low",
  "digest_mode": "batched",
  "digest_interval_minutes": 30
}

Test Channel

POST /notification-channels/:id/test

Sends a test notification to verify the channel is configured correctly.

Maintenance Windows

Create Window

POST /maintenance-windows

Body:

{
  "name": "Saturday Deploy Window",
  "start_time": "2026-03-28T02:00:00Z",
  "end_time": "2026-03-28T04:00:00Z",
  "suppress_alerts": true,
  "suppress_enforcement": true,
  "scope": { "tags": { "environment": "production" } }
}
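The window timestamps above are ISO 8601 strings in UTC. A client that wants to check locally whether a window covers a given moment could do something like the following. Treating the end time as exclusive is an assumption, and `parse_utc` and `window_is_active` are illustrative names.

```python
from datetime import datetime, timezone


def parse_utc(timestamp: str) -> datetime:
    """Parse a timestamp such as 2026-03-28T02:00:00Z into an aware datetime."""
    # Older Python versions do not accept a trailing "Z" in fromisoformat.
    return datetime.fromisoformat(timestamp.replace("Z", "+00:00"))


def window_is_active(window: dict, now: datetime) -> bool:
    """True if now falls inside [start_time, end_time) of the window."""
    return parse_utc(window["start_time"]) <= now < parse_utc(window["end_time"])
```

Pass a timezone-aware `now`, for example `datetime.now(timezone.utc)`, since comparing naive and aware datetimes raises a `TypeError`.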

Get Active Windows

GET /maintenance-windows/active

Exports

Export any data as CSV or PDF.

GET /exports/instances?format=csv&state=running
GET /exports/costs?format=pdf&days=30
GET /exports/ai-spend?format=csv&provider=openai
GET /exports/alerts?format=csv&status=active
GET /exports/metrics?format=csv&instance_id=i-0abc123&hours=168
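The export URLs above follow one pattern: a resource name, a `format`, and the same filters as the matching list endpoint. A sketch of a builder for them follows. `export_url` is a hypothetical helper, and the set of resources is taken only from the five examples listed.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.gocloudera.com/api"  # base URL given in this reference
EXPORT_RESOURCES = {"instances", "costs", "ai-spend", "alerts", "metrics"}
EXPORT_FORMATS = {"csv", "pdf"}


def export_url(resource: str, fmt: str = "csv", **filters) -> str:
    """Build a GET /exports/<resource> URL with a format and optional filters."""
    if resource not in EXPORT_RESOURCES:
        raise ValueError(f"unknown export resource: {resource!r}")
    if fmt not in EXPORT_FORMATS:
        raise ValueError(f"unknown export format: {fmt!r}")
    query = urlencode({"format": fmt, **filters})
    return f"{BASE_URL}/exports/{resource}?{query}"
```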

Data Sync (Agent Endpoint)

Sync Data from Agent

POST /sync

Headers:

X-API-Key: your-tenant-api-key

Body:

{
  "instances": [...],
  "metrics": [...],
  "costs": [...],
  "alerts": [...],
  "ai_spend": [...],
  "errors": [...]
}
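A sync request with the header and top-level keys shown above might be assembled as in this sketch. The shape of the items inside each array is not documented here, so the caller supplies them as-is. Sending an empty array for keys with no data and sending a JSON `Content-Type` header are both assumptions, and `build_sync_request` is an illustrative name.

```python
import json
import urllib.request

BASE_URL = "https://api.gocloudera.com/api"  # base URL given in this reference
SYNC_KEYS = ("instances", "metrics", "costs", "alerts", "ai_spend", "errors")


def build_sync_request(api_key: str, **batches) -> urllib.request.Request:
    """Build (but do not send) a POST /sync request authenticated by API key."""
    unknown = set(batches) - set(SYNC_KEYS)
    if unknown:
        raise ValueError(f"unknown sync keys: {sorted(unknown)}")
    # Assumption: keys without data are sent as empty arrays.
    payload = {key: batches.get(key, []) for key in SYNC_KEYS}
    return urllib.request.Request(
        f"{BASE_URL}/sync",
        data=json.dumps(payload).encode("utf-8"),
        headers={"X-API-Key": api_key, "Content-Type": "application/json"},
        method="POST",
    )
```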

The agent also supports gRPC bidirectional streaming on port 50051 for real-time data delivery and instant command execution.

Error Responses

All errors follow this format:

{
  "success": false,
  "error": "Human-readable error message"
}

Common HTTP status codes:

400 — Bad request (missing/invalid parameters)
401 — Unauthorized (invalid or missing token)
403 — Forbidden (insufficient role)
404 — Resource not found
429 — Rate limited
500 — Internal server error
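Because every error shares the same envelope, a client can funnel responses through one check. The sketch below assumes the HTTP status and decoded JSON body are already available. `ApiError` and `raise_for_error` are hypothetical names, and marking 429 and 500 as retryable is a client-side choice rather than guidance from this reference.

```python
class ApiError(Exception):
    """Raised when a response uses the documented error envelope."""

    def __init__(self, status: int, message: str):
        super().__init__(f"{status}: {message}")
        self.status = status
        self.message = message
        # Client-side choice: only rate limiting and server errors are retried.
        self.retryable = status in (429, 500)


def raise_for_error(status: int, body: dict) -> dict:
    """Return the body unchanged on success, otherwise raise ApiError."""
    if status >= 400 or not body.get("success", False):
        raise ApiError(status, body.get("error", "unknown error"))
    return body
```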

Rate Limits

API rate limits are configurable per tenant (default: 1000 requests/minute). Rate limit headers are included in responses:

X-RateLimit-Limit — Max requests per window
X-RateLimit-Remaining — Remaining requests
X-RateLimit-Reset — Window reset timestamp
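A client can use these headers to decide how long to pause after a 429. The format of `X-RateLimit-Reset` is not specified here, so this sketch assumes it is a Unix timestamp in seconds. The one-second fallback is arbitrary and `retry_delay` is an illustrative name. The function expects the header dictionary to be keyed exactly as the names are written above.

```python
from typing import Optional


def retry_delay(status: int, headers: dict, now: float) -> Optional[float]:
    """Seconds to wait before retrying a 429, or None if no wait is needed."""
    if status != 429:
        return None
    reset = headers.get("X-RateLimit-Reset")
    if reset is None:
        return 1.0  # arbitrary fallback when the header is missing
    # Assumption: the reset value is a Unix timestamp in seconds.
    return max(0.0, float(reset) - now)
```

In practice `now` would come from `time.time()`; it is a parameter here so the calculation can be checked without waiting.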
