Agent Installation & Configuration

The GoCloudera agent is a lightweight Python process that collects GPU metrics, instance metadata, cost data, and LLM spend from your infrastructure and sends them to the GoCloudera dashboard.

Requirements

  • Python 3.9+ (or Docker)
  • Network access to https://api.gocloudera.com (HTTPS, port 443)
  • Optional: port 50051 outbound for gRPC streaming
  • Read-only cloud credentials for the providers you want to monitor
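
Before installing, you can verify the network requirement with a quick reachability probe. The sketch below (illustrative helper functions, not part of the agent) opens a TLS connection to the API endpoint listed above:

```python
import socket
import ssl
from urllib.parse import urlparse

def endpoint_parts(url: str, default_port: int = 443) -> tuple[str, int]:
    """Split an https:// URL into (host, port) for a raw socket check."""
    parsed = urlparse(url)
    return parsed.hostname, parsed.port or default_port

def reachable(url: str, timeout: float = 5.0) -> bool:
    """Return True if a TLS connection to the endpoint succeeds."""
    host, port = endpoint_parts(url)
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with ssl.create_default_context().wrap_socket(sock, server_hostname=host):
                return True
    except OSError:
        return False

if __name__ == "__main__":
    print(reachable("https://api.gocloudera.com"))
```

The same check works for the optional gRPC endpoint by passing a `host:port` URL such as `https://grpc.gocloudera.com:50051`.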

Installation Methods

Docker

docker run -d \
  --name gocloudera-agent \
  --restart unless-stopped \
  -e BACKEND_API_URL=https://api.gocloudera.com \
  -e BACKEND_API_KEY=your-api-key \
  -e TENANT_ID=your-tenant-id \
  -e AWS_ENABLED=true \
  -e AWS_REGION=us-east-1 \
  gocloudera/gpu-agent:latest
The Docker image is based on Python 3.9-slim, runs as a non-root user, and includes a health check endpoint on port 8000.

Python (Direct)

git clone https://github.com/gocloudera/unified-gpu-agent.git
cd unified-gpu-agent
pip install -r requirements.txt
python main.py --config config.yaml

Systemd Service

sudo ./scripts/configure.sh
sudo systemctl start gpu-agent
sudo systemctl enable gpu-agent

Kubernetes DaemonSet

For monitoring GPU nodes in a Kubernetes cluster:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: gocloudera-agent
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: gocloudera-agent
  template:
    metadata:
      labels:
        app: gocloudera-agent
    spec:
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      containers:
      - name: agent
        image: gocloudera/gpu-agent:latest
        env:
        - name: BACKEND_API_URL
          value: "https://api.gocloudera.com"
        - name: BACKEND_API_KEY
          valueFrom:
            secretKeyRef:
              name: gocloudera
              key: api-key
        - name: K8S_ENABLED
          value: "true"
        - name: NVML_ENABLED
          value: "true"
        volumeMounts:
        - name: nvidia
          mountPath: /usr/lib/x86_64-linux-gnu
          readOnly: true
      volumes:
      - name: nvidia
        hostPath:
          path: /usr/lib/x86_64-linux-gnu

Configuration Reference

Configuration can be set via YAML file (--config config.yaml), environment variables, or both. Environment variables take precedence.
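
That precedence can be sketched as a simple overlay: load the YAML values first, then let any set environment variable override them. The mapping below is illustrative and covers only a few variables; this is not the agent's actual loader.

```python
import os

# Illustrative mapping from environment variables to YAML keys;
# the real agent supports many more (see the tables in this section).
ENV_TO_YAML = {
    "BACKEND_API_URL": ("backend", "api_url"),
    "LOG_LEVEL": ("logging", "level"),
    "MONITORING_INTERVAL": ("monitoring", "interval"),
}

def effective_config(yaml_config: dict) -> dict:
    """Overlay environment variables on top of YAML values."""
    merged = {section: dict(values) for section, values in yaml_config.items()}
    for env_var, (section, key) in ENV_TO_YAML.items():
        if env_var in os.environ:
            merged.setdefault(section, {})[key] = os.environ[env_var]
    return merged

# YAML says INFO, but the environment wins:
os.environ["LOG_LEVEL"] = "DEBUG"
cfg = effective_config({"logging": {"level": "INFO"}})
print(cfg["logging"]["level"])  # DEBUG
```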

Core Settings

Environment Variable | YAML Key            | Default           | Description
---------------------|---------------------|-------------------|-----------------------------------
BACKEND_API_URL      | backend.api_url     | (required)        | Dashboard API endpoint
BACKEND_API_KEY      | backend.api_key     | (required)        | Your tenant API key
TENANT_ID            | backend.tenant_id   | default           | Tenant identifier
AGENT_ID             | backend.agent_id    | agent-{TENANT_ID} | Unique agent ID
LOG_LEVEL            | logging.level       | INFO              | DEBUG, INFO, WARNING, ERROR
MONITORING_INTERVAL  | monitoring.interval | 300               | Seconds between monitoring cycles
COMM_MODE            | communication.mode  | http              | http, grpc, or both

AWS Settings

Environment Variable  | YAML Key            | Default       | Description
----------------------|---------------------|---------------|-----------------------
AWS_ENABLED           | clouds.aws.enabled  | false         | Enable AWS monitoring
AWS_REGION            | clouds.aws.region   | us-east-1     | AWS region
AWS_ACCESS_KEY_ID     | -                   | (IAM role)    | AWS access key
AWS_SECRET_ACCESS_KEY | -                   | (IAM role)    | AWS secret key
AWS_SERVICES          | clouds.aws.services | ec2,sagemaker | Services to monitor

Azure Settings

Environment Variable  | YAML Key                     | Default                  | Description
----------------------|------------------------------|--------------------------|-----------------------------
AZURE_ENABLED         | clouds.azure.enabled         | false                    | Enable Azure monitoring
AZURE_SUBSCRIPTION_ID | clouds.azure.subscription_id | (required)               | Subscription ID
AZURE_RESOURCE_GROUP  | clouds.azure.resource_group  | (all)                    | Specific resource group
AZURE_TENANT_ID       | -                            | (DefaultAzureCredential) | Azure AD tenant
AZURE_CLIENT_ID       | -                            | (DefaultAzureCredential) | Service principal client ID
AZURE_CLIENT_SECRET   | -                            | (DefaultAzureCredential) | Service principal secret

GCP Settings

Environment Variable           | YAML Key                   | Default     | Description
-------------------------------|----------------------------|-------------|-----------------------------------
GCP_ENABLED                    | clouds.gcp.enabled         | false       | Enable GCP monitoring
GCP_PROJECT_ID                 | clouds.gcp.project_id      | (required)  | Project ID
GCP_REGION                     | clouds.gcp.region          | us-central1 | Default region
GCP_BILLING_DATASET            | clouds.gcp.billing_dataset | (none)      | BigQuery dataset for real billing
GOOGLE_APPLICATION_CREDENTIALS | -                          | (ADC)       | Path to service account JSON

Kubernetes Settings

Environment Variable  | YAML Key                       | Default        | Description
----------------------|--------------------------------|----------------|-----------------------
K8S_ENABLED           | clouds.kubernetes.enabled      | false          | Enable K8s monitoring
KUBECONFIG_PATH       | clouds.kubernetes.kubeconfig   | ~/.kube/config | Kubeconfig path
KUBERNETES_NAMESPACE  | clouds.kubernetes.namespace    | default        | Namespace to monitor
K8S_GPU_RESOURCE_NAME | clouds.kubernetes.gpu_resource | nvidia.com/gpu | GPU resource name

SageMaker Settings

Environment Variable | YAML Key                  | Default                 | Description
---------------------|---------------------------|-------------------------|-----------------------------
SAGEMAKER_ENABLED    | clouds.sagemaker.enabled  | false                   | Enable SageMaker monitoring
SAGEMAKER_SERVICES   | clouds.sagemaker.services | training_jobs,endpoints | Services to monitor

Local GPU Monitoring (NVML)

Environment Variable       | YAML Key                   | Default | Description
---------------------------|----------------------------|---------|--------------------------------
NVML_ENABLED               | nvml.enabled               | false   | Enable local GPU sampling
NVML_SAMPLE_INTERVAL       | nvml.sample_interval       | 10      | Seconds between GPU samples
NVML_IDLE_GPU_THRESHOLD    | nvml.idle_gpu_threshold    | 10      | GPU% below this = idle
NVML_IDLE_MEMORY_THRESHOLD | nvml.idle_memory_threshold | 10      | Memory% below this = idle
NVML_IDLE_DURATION         | nvml.idle_duration         | 120     | Seconds idle to confirm status
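
The idle settings combine like this: a GPU is reported idle only after both utilization and memory stay below their thresholds for the full idle duration. A rough sketch of that logic (a hypothetical helper, not the agent's actual code):

```python
def is_idle(samples, gpu_threshold=10, mem_threshold=10,
            idle_duration=120, sample_interval=10):
    """samples: list of (gpu_util_pct, mem_util_pct) tuples, newest last.

    A GPU counts as idle when every sample in the trailing
    idle_duration window is below both thresholds.
    """
    window = idle_duration // sample_interval  # e.g. 120 s / 10 s = 12 samples
    if len(samples) < window:
        return False  # not enough history to decide yet
    recent = samples[-window:]
    return all(gpu < gpu_threshold and mem < mem_threshold
               for gpu, mem in recent)

busy = [(85, 60)] * 12   # sustained training load
quiet = [(2, 3)] * 12    # two minutes of near-zero activity
print(is_idle(busy), is_idle(quiet))  # False True
```

With the defaults, a single sample above either threshold in the last 120 seconds resets the idle determination.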

gRPC Settings

Environment Variable | YAML Key                   | Default | Description
---------------------|----------------------------|---------|----------------------
GRPC_TARGET          | communication.grpc.target  | (none)  | gRPC server host:port
GRPC_USE_TLS         | communication.grpc.use_tls | false   | Enable TLS
GRPC_CA_CERT         | communication.grpc.ca_cert | (none)  | CA certificate path

AI Spend Tracking

Environment Variable | YAML Key         | Default | Description
---------------------|------------------|---------|-------------------------
AI_SPEND_ENABLED     | ai_spend.enabled | false   | Enable LLM cost tracking

Recommendations

Environment Variable    | YAML Key                 | Default | Description
------------------------|--------------------------|---------|---------------------------------------
ENABLE_RECOMMENDATIONS  | recommendations.enabled  | false   | Enable RI/SP/Spot recommendations
RECOMMENDATION_INTERVAL | recommendations.interval | 21600   | Seconds between recommendation cycles

YAML Configuration Example

backend:
  api_url: https://api.gocloudera.com
  api_key: gc_live_abc123def456
  tenant_id: my-company

logging:
  level: INFO

monitoring:
  interval: 300

communication:
  mode: both
  grpc:
    target: grpc.gocloudera.com:50051
    use_tls: true

clouds:
  aws:
    enabled: true
    region: us-east-1
    services: ec2,sagemaker

  azure:
    enabled: true
    subscription_id: xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx

  gcp:
    enabled: false

  kubernetes:
    enabled: true
    namespace: ml-workloads
    gpu_resource: nvidia.com/gpu

nvml:
  enabled: true
  sample_interval: 10
  idle_gpu_threshold: 10
  idle_duration: 120

ai_spend:
  enabled: true

recommendations:
  enabled: true
  interval: 21600

Multi-Cloud Setup

You can monitor multiple clouds simultaneously. Enable each provider and provide credentials:
docker run -d \
  --name gocloudera-agent \
  -e BACKEND_API_URL=https://api.gocloudera.com \
  -e BACKEND_API_KEY=your-key \
  -e AWS_ENABLED=true \
  -e AWS_REGION=us-east-1 \
  -e AZURE_ENABLED=true \
  -e AZURE_SUBSCRIPTION_ID=xxx \
  -e GCP_ENABLED=true \
  -e GCP_PROJECT_ID=my-project \
  -e GOOGLE_APPLICATION_CREDENTIALS=/creds/gcp-key.json \
  -v /path/to/gcp-key.json:/creds/gcp-key.json:ro \
  gocloudera/gpu-agent:latest

Health & Diagnostics

Check agent status:
python main.py --status   # Show current status
python main.py --health   # Show health check
Check Docker logs:
docker logs gocloudera-agent --tail 50
Common issues:
  • “Connection refused” — Check BACKEND_API_URL is reachable. Verify firewall allows outbound HTTPS.
  • “401 Unauthorized” — API key is wrong or expired. Regenerate in Cloud Accounts.
  • “No instances found” — Cloud credentials lack required permissions. Check IAM policies.
  • “NVML not available” — NVIDIA drivers not installed, or Docker container doesn’t have GPU access. Mount the NVIDIA libraries.
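
To reproduce the "NVML not available" check outside the agent, you can probe NVML directly. This sketch uses the pynvml bindings (from the nvidia-ml-py package) if they are installed; the helper itself is illustrative:

```python
def nvml_available() -> bool:
    """Return True if the NVIDIA management library can be initialized."""
    try:
        import pynvml  # provided by the nvidia-ml-py package
    except ImportError:
        return False  # bindings not installed
    try:
        pynvml.nvmlInit()
    except pynvml.NVMLError:
        return False  # drivers missing, or libnvidia-ml.so not mounted in the container
    else:
        pynvml.nvmlShutdown()
        return True

print(nvml_available())
```

If this returns False inside Docker, check that the NVIDIA libraries are mounted into the container as shown in the DaemonSet example above.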

Data Collected

The agent collects and sends:
Data Type       | Frequency           | Description
----------------|---------------------|--------------------------------------------------------
Instances       | Every cycle (5 min) | Instance ID, type, state, GPU type, region, tags
Metrics         | Every cycle (5 min) | GPU utilization, memory, temperature, CPU, network I/O
Costs           | Every cycle (5 min) | Hourly/daily cost from billing APIs or estimation
Recommendations | Every 6 hours       | RI/SP/Spot suggestions with projected savings
AI Spend        | Every cycle (5 min) | LLM token counts, costs by provider/model
Health          | Every cycle (5 min) | Agent uptime, last collection time, error counts
All data is sent over HTTPS (encrypted in transit). The agent stores no data locally — it’s stateless and can be restarted at any time without data loss.
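
For a sense of shape, a single metrics record of the kind described above might look like the following. The field names here are hypothetical and chosen only to illustrate the data categories in the table; they are not the agent's actual wire format.

```python
import json

# Hypothetical example record; real field names and structure may differ.
record = {
    "agent_id": "agent-my-company",
    "instance_id": "i-0abc123",
    "gpu_type": "A100",
    "gpu_utilization_pct": 72.5,
    "gpu_memory_pct": 64.1,
    "temperature_c": 61,
    "region": "us-east-1",
}
print(json.dumps(record))
```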