Agent Installation & Configuration

The GoCloudera agent is a lightweight Python process that collects GPU metrics, instance metadata, cost data, and LLM spend from your infrastructure and sends them to the GoCloudera dashboard.

Requirements

  • Python 3.9+ (or Docker)
  • Network access to https://api.gocloudera.com (HTTPS, port 443)
  • Optional: port 50051 outbound for gRPC streaming
  • Read-only cloud credentials for the providers you want to monitor
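
Before installing, you can verify the network requirement with a quick reachability probe. The sketch below (illustrative helper functions, not part of the agent) opens a TLS connection to the API endpoint listed above:

```python
import socket
import ssl
from urllib.parse import urlparse

def endpoint_parts(url: str, default_port: int = 443) -> tuple[str, int]:
    """Split an https:// URL into (host, port) for a raw socket check."""
    parsed = urlparse(url)
    return parsed.hostname, parsed.port or default_port

def reachable(url: str, timeout: float = 5.0) -> bool:
    """Return True if a TLS connection to the endpoint succeeds."""
    host, port = endpoint_parts(url)
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with ssl.create_default_context().wrap_socket(sock, server_hostname=host):
                return True
    except OSError:
        return False

if __name__ == "__main__":
    print(reachable("https://api.gocloudera.com"))
```

The same check works for the optional gRPC endpoint by passing a `host:port` URL such as `https://grpc.gocloudera.com:50051`.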

Installation Methods

Docker

docker run -d \
  --name gocloudera-agent \
  --restart unless-stopped \
  -e BACKEND_API_URL=https://api.gocloudera.com \
  -e BACKEND_API_KEY=your-api-key \
  -e TENANT_ID=your-tenant-id \
  -e AWS_ENABLED=true \
  -e AWS_REGION=us-east-1 \
  gocloudera/gpu-agent:latest
The Docker image is based on Python 3.9-slim, runs as a non-root user, and includes a health check endpoint on port 8000.

Python (Direct)

git clone https://github.com/gocloudera/unified-gpu-agent.git
cd unified-gpu-agent
pip install -r requirements.txt
python main.py --config config.yaml

Systemd Service

sudo ./scripts/configure.sh
sudo systemctl start gpu-agent
sudo systemctl enable gpu-agent

Kubernetes DaemonSet

For monitoring GPU nodes in a Kubernetes cluster:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: gocloudera-agent
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: gocloudera-agent
  template:
    metadata:
      labels:
        app: gocloudera-agent
    spec:
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      containers:
      - name: agent
        image: gocloudera/gpu-agent:latest
        env:
        - name: BACKEND_API_URL
          value: "https://api.gocloudera.com"
        - name: BACKEND_API_KEY
          valueFrom:
            secretKeyRef:
              name: gocloudera
              key: api-key
        - name: K8S_ENABLED
          value: "true"
        - name: NVML_ENABLED
          value: "true"
        volumeMounts:
        - name: nvidia
          mountPath: /usr/lib/x86_64-linux-gnu
          readOnly: true
      volumes:
      - name: nvidia
        hostPath:
          path: /usr/lib/x86_64-linux-gnu

Configuration Reference

Configuration can be set via YAML file (--config config.yaml), environment variables, or both. Environment variables take precedence.
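
That precedence can be sketched as a simple overlay: load the YAML values first, then let any set environment variable override them. The mapping below is illustrative and covers only a few variables; this is not the agent's actual loader.

```python
import os

# Illustrative mapping from environment variables to YAML keys;
# the real agent supports many more (see the tables in this section).
ENV_TO_YAML = {
    "BACKEND_API_URL": ("backend", "api_url"),
    "LOG_LEVEL": ("logging", "level"),
    "MONITORING_INTERVAL": ("monitoring", "interval"),
}

def effective_config(yaml_config: dict) -> dict:
    """Overlay environment variables on top of YAML values."""
    merged = {section: dict(values) for section, values in yaml_config.items()}
    for env_var, (section, key) in ENV_TO_YAML.items():
        if env_var in os.environ:
            merged.setdefault(section, {})[key] = os.environ[env_var]
    return merged

# YAML says INFO, but the environment wins:
os.environ["LOG_LEVEL"] = "DEBUG"
cfg = effective_config({"logging": {"level": "INFO"}})
print(cfg["logging"]["level"])  # DEBUG
```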

Core Settings

Environment Variable | YAML Key            | Default           | Description
---------------------|---------------------|-------------------|-----------------------------------
BACKEND_API_URL      | backend.api_url     | (required)        | Dashboard API endpoint
BACKEND_API_KEY      | backend.api_key     | (required)        | Your tenant API key
TENANT_ID            | backend.tenant_id   | default           | Tenant identifier
AGENT_ID             | backend.agent_id    | agent-{TENANT_ID} | Unique agent ID
LOG_LEVEL            | logging.level       | INFO              | DEBUG, INFO, WARNING, ERROR
MONITORING_INTERVAL  | monitoring.interval | 300               | Seconds between monitoring cycles
COMM_MODE            | communication.mode  | http              | http, grpc, or both

AWS Settings

Environment Variable  | YAML Key            | Default       | Description
----------------------|---------------------|---------------|-----------------------
AWS_ENABLED           | clouds.aws.enabled  | false         | Enable AWS monitoring
AWS_REGION            | clouds.aws.region   | us-east-1     | AWS region
AWS_ACCESS_KEY_ID     | -                   | (IAM role)    | AWS access key
AWS_SECRET_ACCESS_KEY | -                   | (IAM role)    | AWS secret key
AWS_SERVICES          | clouds.aws.services | ec2,sagemaker | Services to monitor

Azure Settings

Environment Variable  | YAML Key                     | Default                  | Description
----------------------|------------------------------|--------------------------|-----------------------------
AZURE_ENABLED         | clouds.azure.enabled         | false                    | Enable Azure monitoring
AZURE_SUBSCRIPTION_ID | clouds.azure.subscription_id | (required)               | Subscription ID
AZURE_RESOURCE_GROUP  | clouds.azure.resource_group  | (all)                    | Specific resource group
AZURE_TENANT_ID       | -                            | (DefaultAzureCredential) | Azure AD tenant
AZURE_CLIENT_ID       | -                            | (DefaultAzureCredential) | Service principal client ID
AZURE_CLIENT_SECRET   | -                            | (DefaultAzureCredential) | Service principal secret

GCP Settings

Environment Variable           | YAML Key                   | Default     | Description
-------------------------------|----------------------------|-------------|-----------------------------------
GCP_ENABLED                    | clouds.gcp.enabled         | false       | Enable GCP monitoring
GCP_PROJECT_ID                 | clouds.gcp.project_id      | (required)  | Project ID
GCP_REGION                     | clouds.gcp.region          | us-central1 | Default region
GCP_BILLING_DATASET            | clouds.gcp.billing_dataset | (none)      | BigQuery dataset for real billing
GOOGLE_APPLICATION_CREDENTIALS | -                          | (ADC)       | Path to service account JSON

Kubernetes Settings

Environment Variable  | YAML Key                       | Default        | Description
----------------------|--------------------------------|----------------|-----------------------
K8S_ENABLED           | clouds.kubernetes.enabled      | false          | Enable K8s monitoring
KUBECONFIG_PATH       | clouds.kubernetes.kubeconfig   | ~/.kube/config | Kubeconfig path
KUBERNETES_NAMESPACE  | clouds.kubernetes.namespace    | default        | Namespace to monitor
K8S_GPU_RESOURCE_NAME | clouds.kubernetes.gpu_resource | nvidia.com/gpu | GPU resource name

SageMaker Settings

Environment Variable | YAML Key                  | Default                 | Description
---------------------|---------------------------|-------------------------|-----------------------------
SAGEMAKER_ENABLED    | clouds.sagemaker.enabled  | false                   | Enable SageMaker monitoring
SAGEMAKER_SERVICES   | clouds.sagemaker.services | training_jobs,endpoints | Services to monitor

Local GPU Monitoring (NVML)

Environment Variable       | YAML Key                   | Default | Description
---------------------------|----------------------------|---------|--------------------------------
NVML_ENABLED               | nvml.enabled               | false   | Enable local GPU sampling
NVML_SAMPLE_INTERVAL       | nvml.sample_interval       | 10      | Seconds between GPU samples
NVML_IDLE_GPU_THRESHOLD    | nvml.idle_gpu_threshold    | 10      | GPU% below this = idle
NVML_IDLE_MEMORY_THRESHOLD | nvml.idle_memory_threshold | 10      | Memory% below this = idle
NVML_IDLE_DURATION         | nvml.idle_duration         | 120     | Seconds idle to confirm status
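
The idle settings combine like this: a GPU is reported idle only after both utilization and memory stay below their thresholds for the full idle duration. A rough sketch of that logic (a hypothetical helper, not the agent's actual code):

```python
def is_idle(samples, gpu_threshold=10, mem_threshold=10,
            idle_duration=120, sample_interval=10):
    """samples: list of (gpu_util_pct, mem_util_pct) tuples, newest last.

    A GPU counts as idle when every sample in the trailing
    idle_duration window is below both thresholds.
    """
    window = idle_duration // sample_interval  # e.g. 120 s / 10 s = 12 samples
    if len(samples) < window:
        return False  # not enough history to decide yet
    recent = samples[-window:]
    return all(gpu < gpu_threshold and mem < mem_threshold
               for gpu, mem in recent)

busy = [(85, 60)] * 12   # sustained training load
quiet = [(2, 3)] * 12    # two minutes of near-zero activity
print(is_idle(busy), is_idle(quiet))  # False True
```

With the defaults, a single sample above either threshold in the last 120 seconds resets the idle determination.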

gRPC Settings

Environment Variable | YAML Key                   | Default | Description
---------------------|----------------------------|---------|----------------------
GRPC_TARGET          | communication.grpc.target  | (none)  | gRPC server host:port
GRPC_USE_TLS         | communication.grpc.use_tls | false   | Enable TLS
GRPC_CA_CERT         | communication.grpc.ca_cert | (none)  | CA certificate path

AI Spend Tracking

Environment Variable | YAML Key         | Default | Description
---------------------|------------------|---------|-------------------------
AI_SPEND_ENABLED     | ai_spend.enabled | false   | Enable LLM cost tracking

Recommendations

Environment Variable    | YAML Key                 | Default | Description
------------------------|--------------------------|---------|---------------------------------------
ENABLE_RECOMMENDATIONS  | recommendations.enabled  | false   | Enable RI/SP/Spot recommendations
RECOMMENDATION_INTERVAL | recommendations.interval | 21600   | Seconds between recommendation cycles

YAML Configuration Example

backend:
  api_url: https://api.gocloudera.com
  api_key: gc_live_abc123def456
  tenant_id: my-company

logging:
  level: INFO

monitoring:
  interval: 300

communication:
  mode: both
  grpc:
    target: grpc.gocloudera.com:50051
    use_tls: true

clouds:
  aws:
    enabled: true
    region: us-east-1
    services: ec2,sagemaker

  azure:
    enabled: true
    subscription_id: xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx

  gcp:
    enabled: false

  kubernetes:
    enabled: true
    namespace: ml-workloads
    gpu_resource: nvidia.com/gpu

nvml:
  enabled: true
  sample_interval: 10
  idle_gpu_threshold: 10
  idle_duration: 120

ai_spend:
  enabled: true

recommendations:
  enabled: true
  interval: 21600

Multi-Cloud Setup

You can monitor multiple clouds simultaneously. Enable each provider and provide credentials:
docker run -d \
  --name gocloudera-agent \
  -e BACKEND_API_URL=https://api.gocloudera.com \
  -e BACKEND_API_KEY=your-key \
  -e AWS_ENABLED=true \
  -e AWS_REGION=us-east-1 \
  -e AZURE_ENABLED=true \
  -e AZURE_SUBSCRIPTION_ID=xxx \
  -e GCP_ENABLED=true \
  -e GCP_PROJECT_ID=my-project \
  -e GOOGLE_APPLICATION_CREDENTIALS=/creds/gcp-key.json \
  -v /path/to/gcp-key.json:/creds/gcp-key.json:ro \
  gocloudera/gpu-agent:latest

Health & Diagnostics

Check agent status:
python main.py --status   # Show current status
python main.py --health   # Show health check
Check Docker logs:
docker logs gocloudera-agent --tail 50
Common issues:
  • “Connection refused” — Check BACKEND_API_URL is reachable. Verify firewall allows outbound HTTPS.
  • “401 Unauthorized” — API key is wrong or expired. Regenerate in Cloud Accounts.
  • “No instances found” — Cloud credentials lack required permissions. Check IAM policies.
  • “NVML not available” — NVIDIA drivers not installed, or Docker container doesn’t have GPU access. Mount the NVIDIA libraries.
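
To reproduce the "NVML not available" check outside the agent, you can probe NVML directly. This sketch uses the pynvml bindings (from the nvidia-ml-py package) if they are installed; the helper itself is illustrative:

```python
def nvml_available() -> bool:
    """Return True if the NVIDIA management library can be initialized."""
    try:
        import pynvml  # provided by the nvidia-ml-py package
    except ImportError:
        return False  # bindings not installed
    try:
        pynvml.nvmlInit()
    except pynvml.NVMLError:
        return False  # drivers missing, or libnvidia-ml.so not mounted in the container
    else:
        pynvml.nvmlShutdown()
        return True

print(nvml_available())
```

If this returns False inside Docker, check that the NVIDIA libraries are mounted into the container as shown in the DaemonSet example above.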

Data Collected

The agent collects and sends:
Data Type       | Frequency           | Description
----------------|---------------------|--------------------------------------------------------
Instances       | Every cycle (5 min) | Instance ID, type, state, GPU type, region, tags
Metrics         | Every cycle (5 min) | GPU utilization, memory, temperature, CPU, network I/O
Costs           | Every cycle (5 min) | Hourly/daily cost from billing APIs or estimation
Recommendations | Every 6 hours       | RI/SP/Spot suggestions with projected savings
AI Spend        | Every cycle (5 min) | LLM token counts, costs by provider/model
Health          | Every cycle (5 min) | Agent uptime, last collection time, error counts
All data is sent over HTTPS (encrypted in transit). The agent stores no data locally — it’s stateless and can be restarted at any time without data loss.
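
For a sense of shape, a single metrics record of the kind described above might look like the following. The field names here are hypothetical and chosen only to illustrate the data categories in the table; they are not the agent's actual wire format.

```python
import json

# Hypothetical example record; real field names and structure may differ.
record = {
    "agent_id": "agent-my-company",
    "instance_id": "i-0abc123",
    "gpu_type": "A100",
    "gpu_utilization_pct": 72.5,
    "gpu_memory_pct": 64.1,
    "temperature_c": 61,
    "region": "us-east-1",
}
print(json.dumps(record))
```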