Getting Started with GoCloudera

GoCloudera is an AI infrastructure cost intelligence platform that monitors GPU and LLM spend across AWS, Azure, GCP, Kubernetes, and SageMaker. This guide walks you through setup from zero to cost visibility in under 30 minutes.

Prerequisites

  • A GoCloudera account (sign up at gocloudera.com)
  • GPU instances running on at least one supported cloud provider
  • Python 3.9+ (for the agent) or Docker

Step 1: Log In and Get Your API Key

  1. Log in to the GoCloudera dashboard at https://app.gocloudera.com
  2. Navigate to Cloud Accounts in the left sidebar (admin section)
  3. Click Platform Info to see your tenant’s API key
  4. Copy your API Key and Tenant ID — you’ll need these to configure the agent

Step 2: Install the GPU Agent

The GoCloudera agent runs in your infrastructure and collects GPU metrics, instance data, and cost information. It sends data to the dashboard via HTTPS or gRPC. Choose one of the install options below.

Option A: Docker

docker run -d \
  --name gocloudera-agent \
  --restart unless-stopped \
  -e BACKEND_API_URL=https://api.gocloudera.com \
  -e BACKEND_API_KEY=your-api-key-here \
  -e TENANT_ID=your-tenant-id \
  -e AWS_ENABLED=true \
  -e AWS_REGION=us-east-1 \
  gocloudera/gpu-agent:latest

Option B: Python

git clone https://github.com/gocloudera/unified-gpu-agent.git
cd unified-gpu-agent
pip install -r requirements.txt

# Create config
cat > config.yaml << EOF
backend:
  api_url: https://api.gocloudera.com
  api_key: your-api-key-here
  tenant_id: your-tenant-id

monitoring:
  interval: 300  # 5 minutes

clouds:
  aws:
    enabled: true
    region: us-east-1
EOF

python main.py --config config.yaml
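
Instead of pasting the API key into config.yaml by hand, the file can be generated from environment variables. The following is a minimal sketch, not part of the agent: `render_config` is a hypothetical helper, and it assumes you have exported BACKEND_API_KEY and TENANT_ID (the names used in the Docker option) in your shell. The emitted keys mirror the config.yaml shown above.

```python
import json


def render_config(env, region="us-east-1", interval=300):
    """Return config.yaml text using the keys shown in Option B."""
    missing = [k for k in ("BACKEND_API_KEY", "TENANT_ID") if not env.get(k)]
    if missing:
        raise ValueError("missing environment variables: " + ", ".join(missing))

    def q(value):
        # A JSON string is also a valid double-quoted YAML scalar.
        return json.dumps(value)

    api_url = env.get("BACKEND_API_URL", "https://api.gocloudera.com")
    lines = [
        "backend:",
        "  api_url: " + q(api_url),
        "  api_key: " + q(env["BACKEND_API_KEY"]),
        "  tenant_id: " + q(env["TENANT_ID"]),
        "",
        "monitoring:",
        "  interval: %d  # seconds" % interval,
        "",
        "clouds:",
        "  aws:",
        "    enabled: true",
        "    region: " + q(region),
        "",
    ]
    return "\n".join(lines)


# Demo with placeholder values; pass os.environ in real use.
print(render_config({"BACKEND_API_KEY": "your-api-key-here",
                     "TENANT_ID": "your-tenant-id"}))
```

Writing the returned text to config.yaml and then running python main.py --config config.yaml keeps the key out of shell history and version control.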

Option C: Kubernetes

apiVersion: apps/v1
kind: Deployment
metadata:
  name: gocloudera-agent
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gocloudera-agent
  template:
    metadata:
      labels:
        app: gocloudera-agent
    spec:
      serviceAccountName: gocloudera-agent
      containers:
      - name: agent
        image: gocloudera/gpu-agent:latest
        env:
        - name: BACKEND_API_URL
          value: "https://api.gocloudera.com"
        - name: BACKEND_API_KEY
          valueFrom:
            secretKeyRef:
              name: gocloudera-credentials
              key: api-key
        - name: TENANT_ID
          value: "your-tenant-id"
        - name: K8S_ENABLED
          value: "true"
        - name: K8S_GPU_RESOURCE_NAME
          value: "nvidia.com/gpu"
        resources:
          requests:
            cpu: 100m
            memory: 128Mi
          limits:
            cpu: 500m
            memory: 256Mi
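
The manifest above references a Secret named gocloudera-credentials (key api-key) and a ServiceAccount named gocloudera-agent, but it does not create either. One way to produce both is sketched below, assuming the monitoring namespace already exists. `build_prereqs` is a hypothetical helper; whether the agent also needs RBAC rules is not stated in this guide, so none are generated.

```python
import base64
import json


def build_prereqs(api_key, namespace="monitoring"):
    """Build the Secret and ServiceAccount the Deployment refers to."""
    encoded = base64.b64encode(api_key.encode("utf-8")).decode("ascii")
    secret = {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": "gocloudera-credentials", "namespace": namespace},
        "type": "Opaque",
        # Values under "data" in a Secret must be base64-encoded.
        "data": {"api-key": encoded},
    }
    service_account = {
        "apiVersion": "v1",
        "kind": "ServiceAccount",
        "metadata": {"name": "gocloudera-agent", "namespace": namespace},
    }
    # A v1 List lets both objects be applied from a single file.
    return {"apiVersion": "v1", "kind": "List", "items": [secret, service_account]}


# Demo with a placeholder key; substitute the key from Step 1.
print(json.dumps(build_prereqs("your-api-key-here"), indent=2))
```

Because kubectl accepts JSON as well as YAML, the printed output can be saved to a file such as prereqs.json and applied with kubectl apply -f prereqs.json before the Deployment.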

Step 3: Configure Cloud Access

The agent needs read-only access to your cloud accounts to collect metrics and cost data.

AWS

Create an IAM role with these managed policies:
  • AmazonEC2ReadOnlyAccess
  • AWSCostExplorerReadOnlyAccess (for cost data)
  • AmazonSageMakerReadOnly (if using SageMaker)
Or use this minimal custom policy:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ec2:DescribeInstances",
        "ec2:DescribeInstanceTypes",
        "ec2:DescribeSpotPriceHistory",
        "ce:GetCostAndUsage",
        "ce:GetReservationUtilization",
        "ce:GetSavingsPlansCoverage",
        "pricing:GetProducts"
      ],
      "Resource": "*"
    }
  ]
}
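
If you adapt the custom policy, for example by narrowing it or switching to wildcards, it can help to confirm that it still covers every action listed above. The sketch below is a simplified offline check, not an AWS tool: `missing_actions` is a hypothetical helper that only matches Allow statements against action names and ignores Deny statements, conditions, and resource scoping.

```python
import fnmatch
import json

REQUIRED_ACTIONS = [
    "ec2:DescribeInstances",
    "ec2:DescribeInstanceTypes",
    "ec2:DescribeSpotPriceHistory",
    "ce:GetCostAndUsage",
    "ce:GetReservationUtilization",
    "ce:GetSavingsPlansCoverage",
    "pricing:GetProducts",
]


def _as_list(value):
    # IAM allows a single string or object where a list is also accepted.
    return value if isinstance(value, list) else [value]


def missing_actions(policy, required=REQUIRED_ACTIONS):
    """Return required actions that no Allow statement in the policy covers."""
    patterns = []
    for stmt in _as_list(policy.get("Statement", [])):
        if stmt.get("Effect") == "Allow":
            patterns += [p.lower() for p in _as_list(stmt.get("Action", []))]
    # Action names are compared case-insensitively, with * and ? as wildcards.
    return [
        action for action in required
        if not any(fnmatch.fnmatchcase(action.lower(), p) for p in patterns)
    ]


# Example: a policy that is too narrow for cost and pricing data.
policy = json.loads("""
{"Version": "2012-10-17",
 "Statement": [{"Effect": "Allow",
                "Action": ["ec2:Describe*", "ce:GetCostAndUsage"],
                "Resource": "*"}]}
""")
print(missing_actions(policy))
```

An empty list means every action from the minimal policy is matched by at least one Allow pattern.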
If the agent runs on EC2 or ECS, attach the IAM role to the instance or task. Otherwise, set credentials via environment variables:
export AWS_ACCESS_KEY_ID=AKIA...
export AWS_SECRET_ACCESS_KEY=...

Azure

Create a service principal with Reader and Cost Management Reader roles:
az ad sp create-for-rbac \
  --name "gocloudera-agent" \
  --role Reader \
  --scopes /subscriptions/YOUR_SUBSCRIPTION_ID

# Add cost management access (APP_ID is the appId printed by the previous command)
az role assignment create \
  --assignee APP_ID \
  --role "Cost Management Reader" \
  --scope /subscriptions/YOUR_SUBSCRIPTION_ID
Set environment variables:
export AZURE_ENABLED=true
export AZURE_SUBSCRIPTION_ID=your-subscription-id
export AZURE_TENANT_ID=your-azure-tenant-id
export AZURE_CLIENT_ID=your-client-id
export AZURE_CLIENT_SECRET=your-client-secret

Google Cloud

Create a service account with Compute Viewer and BigQuery Data Viewer (for billing export) roles:
gcloud iam service-accounts create gocloudera-agent \
  --display-name="GoCloudera Agent"

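gcloud projects add-iam-policy-binding YOUR_PROJECT \
  --member="serviceAccount:gocloudera-agent@YOUR_PROJECT.iam.gserviceaccount.com" \
  --role="roles/compute.viewer"

# Grant read access to the billing export dataset
# (assumes the export lives in the same project)
gcloud projects add-iam-policy-binding YOUR_PROJECT \
  --member="serviceAccount:gocloudera-agent@YOUR_PROJECT.iam.gserviceaccount.com" \
  --role="roles/bigquery.dataViewer"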

gcloud iam service-accounts keys create key.json \
  --iam-account=gocloudera-agent@YOUR_PROJECT.iam.gserviceaccount.com
Set environment variables:
export GCP_ENABLED=true
export GCP_PROJECT_ID=your-project-id
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/key.json

Step 4: Verify Data Is Flowing

Once the agent is running, data should appear in the dashboard within 5 minutes (one monitoring cycle).
  1. Open the GoCloudera dashboard
  2. Check the Dashboard page — you should see GPU instance counts and utilization charts
  3. Check GPU Instances — your instances should appear with state, type, and utilization
  4. Check Costs — cost data appears after the first cost collection cycle
If no data appears after 10 minutes:
  • Check agent logs: run docker logs gocloudera-agent, read ./logs/unified_agent.log, or on Kubernetes run kubectl logs deployment/gocloudera-agent -n monitoring
  • Verify your API key is correct
  • Verify cloud credentials have the required permissions
  • Check the agent health: python main.py --health
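
For Docker and Kubernetes installs, a common cause of missing data is an environment variable that was never set. The sketch below lists the variables from this guide that appear to be unset. It is a hypothetical preflight check, not a feature of the agent, and it assumes that a provider counts as enabled when its *_ENABLED flag equals "true".

```python
import os

REQUIRED_ALWAYS = ["BACKEND_API_URL", "BACKEND_API_KEY", "TENANT_ID"]

# Extra variables expected when a provider's *_ENABLED flag is "true".
# AWS keys are omitted because an IAM role can supply credentials instead.
REQUIRED_WHEN_ENABLED = {
    "AWS_ENABLED": ["AWS_REGION"],
    "AZURE_ENABLED": [
        "AZURE_SUBSCRIPTION_ID",
        "AZURE_TENANT_ID",
        "AZURE_CLIENT_ID",
        "AZURE_CLIENT_SECRET",
    ],
    "GCP_ENABLED": ["GCP_PROJECT_ID", "GOOGLE_APPLICATION_CREDENTIALS"],
}


def missing_variables(env):
    """Return the names of expected variables that are unset or empty."""
    required = list(REQUIRED_ALWAYS)
    for flag, names in REQUIRED_WHEN_ENABLED.items():
        if env.get(flag, "").strip().lower() == "true":
            required += names
    return [name for name in required if not env.get(name)]


print("missing:", missing_variables(os.environ) or "none")
```

Run it in the same shell or container environment that starts the agent. For the Python install, the equivalent settings live in config.yaml, so this check does not apply there.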

Step 5: Set Up Alerts

  1. Navigate to Alert Rules in the sidebar
  2. Click Create Rule
  3. Configure your first alert:
    • Metric: GPU Utilization
    • Operator: Less than
    • Threshold: 10%
    • Duration: 30 minutes
    • Severity: Medium
  4. Set up a notification channel in Customization → Notifications:
    • Slack webhook URL for real-time alerts
    • Email for daily digests
    • PagerDuty for critical alerts
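
To make the example rule concrete, here is one plausible reading of "GPU Utilization less than 10% for 30 minutes" when samples arrive once per 5-minute monitoring cycle. This is an illustrative sketch only: `AlertRule` and its fields are hypothetical, and GoCloudera's actual evaluation logic is not described in this guide.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AlertRule:
    threshold_pct: float = 10.0    # "Threshold: 10%"
    duration_minutes: int = 30     # "Duration: 30 minutes"
    interval_seconds: int = 300    # one sample per monitoring cycle

    def samples_needed(self) -> int:
        # 30 minutes at one sample every 5 minutes gives 6 samples.
        return max(1, (self.duration_minutes * 60) // self.interval_seconds)

    def fires(self, utilization_pct: List[float]) -> bool:
        """True if the most recent samples all sit below the threshold."""
        n = self.samples_needed()
        recent = utilization_pct[-n:]
        return len(recent) == n and all(u < self.threshold_pct for u in recent)


rule = AlertRule()
print(rule.samples_needed())
print(rule.fires([55, 4, 3, 2, 5, 1, 0]))   # last six samples are all under 10
print(rule.fires([4, 3, 2, 50, 1, 0]))      # a spike to 50 interrupts the window
```

Under this reading, a single busy sample resets the window, which is why a longer duration makes the alert less noisy for bursty workloads.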

Step 6: Configure Enforcement Policies

  1. Navigate to Enforcement in the sidebar
  2. Use a template to get started:
    • “Idle GPU Auto-Stop” — automatically stops instances idle for 15+ minutes
    • “Weekend Cost Saver” — scales down 75% on Sat/Sun
    • “Dev/Test Auto-Shutdown” — stops dev instances at 7pm
  3. Set the execution mode:
    • Notify Only — sends alerts but takes no action (start here)
    • Approval Required — queues actions for admin approval
    • Auto — executes immediately (use once you trust the rules)
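
The difference between the three execution modes can be pictured as a dispatch on the mode for each action a policy proposes. The sketch below is purely illustrative: `Mode`, `handle`, and the action string are hypothetical names, and the real enforcement engine is not documented here.

```python
from enum import Enum


class Mode(Enum):
    NOTIFY_ONLY = "Notify Only"
    APPROVAL_REQUIRED = "Approval Required"
    AUTO = "Auto"


def handle(action, mode, approval_queue):
    """Return what happens to a proposed action under each execution mode."""
    if mode is Mode.NOTIFY_ONLY:
        return "notified: " + action       # alert only, nothing changes
    if mode is Mode.APPROVAL_REQUIRED:
        approval_queue.append(action)      # waits for an admin decision
        return "queued: " + action
    if mode is Mode.AUTO:
        return "executed: " + action       # acted on immediately
    raise ValueError("unknown mode: %r" % (mode,))


queue = []
for mode in Mode:
    print(handle("stop idle-gpu-instance", mode, queue))
print(queue)
```

Only the Approval Required branch leaves anything pending, which is why it is a natural middle step between Notify Only and Auto.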

What’s Next