Agent Installation & Configuration
The GoCloudera agent is a lightweight Python process that collects GPU metrics, instance metadata, cost data, and LLM spend from your infrastructure and sends them to the GoCloudera dashboard.

Requirements
- Python 3.9+ (or Docker)
- Network access to https://api.gocloudera.com (HTTPS, port 443)
- Optional: port 50051 outbound for gRPC streaming
- Read-only cloud credentials for the providers you want to monitor
Installation Methods
Docker (Recommended)
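A typical invocation is sketched below; the image name `gocloudera/agent` is an assumption — use the image reference from your dashboard's setup page. The environment variables map directly to the Configuration Reference further down.

```shell
docker run -d \
  --name gocloudera-agent \
  --restart unless-stopped \
  -e BACKEND_API_URL=https://api.gocloudera.com \
  -e BACKEND_API_KEY=<your-api-key> \
  -e TENANT_ID=<your-tenant> \
  -e AWS_ENABLED=true \
  -e AWS_REGION=us-east-1 \
  gocloudera/agent:latest
```

To sample local GPUs from inside the container, add `--gpus all` and `-e NVML_ENABLED=true` so the container can reach the NVIDIA driver libraries.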
Python (Direct)
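Assuming the agent is distributed as a Python package (the package and entry-point names below are illustrative placeholders, not confirmed names):

```shell
python3 -m venv gocloudera && source gocloudera/bin/activate
pip install gocloudera-agent            # placeholder package name
gocloudera-agent --config config.yaml   # env vars override YAML keys
```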
Systemd Service
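A unit-file sketch for a direct Python install; the install paths, user, and binary name are assumptions — adjust to your layout:

```ini
# /etc/systemd/system/gocloudera-agent.service
[Unit]
Description=GoCloudera monitoring agent
After=network-online.target
Wants=network-online.target

[Service]
ExecStart=/opt/gocloudera/venv/bin/gocloudera-agent --config /etc/gocloudera/config.yaml
Restart=on-failure
RestartSec=10
User=gocloudera
EnvironmentFile=-/etc/gocloudera/agent.env

[Install]
WantedBy=multi-user.target
```

Enable it with `systemctl enable --now gocloudera-agent`.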
Kubernetes DaemonSet
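A minimal DaemonSet manifest could look like the following; the image reference, namespace, secret name, and GPU node-selector label are assumptions for illustration:

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: gocloudera-agent
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: gocloudera-agent
  template:
    metadata:
      labels:
        app: gocloudera-agent
    spec:
      nodeSelector:
        nvidia.com/gpu.present: "true"   # schedule only on GPU nodes
      containers:
        - name: agent
          image: gocloudera/agent:latest
          env:
            - name: BACKEND_API_URL
              value: https://api.gocloudera.com
            - name: BACKEND_API_KEY
              valueFrom:
                secretKeyRef:
                  name: gocloudera-credentials
                  key: api-key
            - name: K8S_ENABLED
              value: "true"
            - name: NVML_ENABLED
              value: "true"
```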
A DaemonSet runs one agent pod on every GPU node in the cluster.

Configuration Reference
Configuration can be set via a YAML file (--config config.yaml), environment variables, or both. Environment variables take precedence.
Core Settings
| Environment Variable | YAML Key | Default | Description |
|---|---|---|---|
| BACKEND_API_URL | backend.api_url | (required) | Dashboard API endpoint |
| BACKEND_API_KEY | backend.api_key | (required) | Your tenant API key |
| TENANT_ID | backend.tenant_id | default | Tenant identifier |
| AGENT_ID | backend.agent_id | agent-{TENANT_ID} | Unique agent ID |
| LOG_LEVEL | logging.level | INFO | DEBUG, INFO, WARNING, or ERROR |
| MONITORING_INTERVAL | monitoring.interval | 300 | Seconds between monitoring cycles |
| COMM_MODE | communication.mode | http | http, grpc, or both |
AWS Settings
| Environment Variable | YAML Key | Default | Description |
|---|---|---|---|
| AWS_ENABLED | clouds.aws.enabled | false | Enable AWS monitoring |
| AWS_REGION | clouds.aws.region | us-east-1 | AWS region |
| AWS_ACCESS_KEY_ID | - | (IAM role) | AWS access key |
| AWS_SECRET_ACCESS_KEY | - | (IAM role) | AWS secret key |
| AWS_SERVICES | clouds.aws.services | ec2,sagemaker | Services to monitor |
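For example, the same settings in YAML form (values are placeholders; prefer an IAM role over static keys, which then stay out of the config entirely):

```yaml
clouds:
  aws:
    enabled: true
    region: us-east-1
    services: ec2,sagemaker
```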
Azure Settings
| Environment Variable | YAML Key | Default | Description |
|---|---|---|---|
| AZURE_ENABLED | clouds.azure.enabled | false | Enable Azure monitoring |
| AZURE_SUBSCRIPTION_ID | clouds.azure.subscription_id | (required) | Subscription ID |
| AZURE_RESOURCE_GROUP | clouds.azure.resource_group | (all) | Specific resource group |
| AZURE_TENANT_ID | - | (DefaultAzureCredential) | Azure AD tenant |
| AZURE_CLIENT_ID | - | (DefaultAzureCredential) | Service principal client ID |
| AZURE_CLIENT_SECRET | - | (DefaultAzureCredential) | Service principal secret |
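A YAML fragment for Azure (placeholder values). Note that service-principal credentials have no YAML keys, so they go via the AZURE_TENANT_ID / AZURE_CLIENT_ID / AZURE_CLIENT_SECRET environment variables; if unset, DefaultAzureCredential is used:

```yaml
clouds:
  azure:
    enabled: true
    subscription_id: 00000000-0000-0000-0000-000000000000
    resource_group: ml-workloads   # omit to monitor all resource groups
```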
GCP Settings
| Environment Variable | YAML Key | Default | Description |
|---|---|---|---|
| GCP_ENABLED | clouds.gcp.enabled | false | Enable GCP monitoring |
| GCP_PROJECT_ID | clouds.gcp.project_id | (required) | Project ID |
| GCP_REGION | clouds.gcp.region | us-central1 | Default region |
| GCP_BILLING_DATASET | clouds.gcp.billing_dataset | (none) | BigQuery dataset for real billing |
| GOOGLE_APPLICATION_CREDENTIALS | - | (ADC) | Path to service account JSON |
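A YAML fragment for GCP (the project and dataset names are placeholders; billing_dataset should point at your BigQuery billing-export dataset if you have one):

```yaml
clouds:
  gcp:
    enabled: true
    project_id: my-project
    region: us-central1
    billing_dataset: my_billing_export   # omit to fall back to cost estimation
```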
Kubernetes Settings
| Environment Variable | YAML Key | Default | Description |
|---|---|---|---|
| K8S_ENABLED | clouds.kubernetes.enabled | false | Enable K8s monitoring |
| KUBECONFIG_PATH | clouds.kubernetes.kubeconfig | ~/.kube/config | Kubeconfig path |
| KUBERNETES_NAMESPACE | clouds.kubernetes.namespace | default | Namespace to monitor |
| K8S_GPU_RESOURCE_NAME | clouds.kubernetes.gpu_resource | nvidia.com/gpu | GPU resource name |
SageMaker Settings
| Environment Variable | YAML Key | Default | Description |
|---|---|---|---|
| SAGEMAKER_ENABLED | clouds.sagemaker.enabled | false | Enable SageMaker monitoring |
| SAGEMAKER_SERVICES | clouds.sagemaker.services | training_jobs,endpoints | Services to monitor |
Local GPU Monitoring (NVML)
| Environment Variable | YAML Key | Default | Description |
|---|---|---|---|
| NVML_ENABLED | nvml.enabled | false | Enable local GPU sampling |
| NVML_SAMPLE_INTERVAL | nvml.sample_interval | 10 | Seconds between GPU samples |
| NVML_IDLE_GPU_THRESHOLD | nvml.idle_gpu_threshold | 10 | GPU% below this = idle |
| NVML_IDLE_MEMORY_THRESHOLD | nvml.idle_memory_threshold | 10 | Memory% below this = idle |
| NVML_IDLE_DURATION | nvml.idle_duration | 120 | Seconds idle to confirm status |
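The three idle settings combine: a GPU is reported idle only when both utilization and memory stay below their thresholds for the full idle duration. A sketch of that logic as implied by the table (the agent's actual implementation may differ):

```python
def classify_idle(samples, gpu_threshold=10, mem_threshold=10,
                  idle_duration=120, sample_interval=10):
    """Return True if the GPU has been idle for at least idle_duration seconds.

    samples: most-recent-last list of (gpu_util_pct, mem_util_pct) tuples,
    one taken every sample_interval seconds.
    """
    needed = idle_duration // sample_interval  # consecutive idle samples required
    recent = samples[-needed:]
    if len(recent) < needed:
        return False  # not enough history yet to confirm idleness
    return all(gpu < gpu_threshold and mem < mem_threshold
               for gpu, mem in recent)

print(classify_idle([(85, 60)] * 12))  # False: GPU is busy
print(classify_idle([(2, 1)] * 12))    # True: 120 s below both thresholds
```

With the defaults, 12 consecutive samples (120 s / 10 s) must fall below both thresholds before the status flips.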
gRPC Settings
| Environment Variable | YAML Key | Default | Description |
|---|---|---|---|
| GRPC_TARGET | communication.grpc.target | (none) | gRPC server host:port |
| GRPC_USE_TLS | communication.grpc.use_tls | false | Enable TLS |
| GRPC_CA_CERT | communication.grpc.ca_cert | (none) | CA certificate path |
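A fragment enabling gRPC streaming with TLS (the target host and certificate path are placeholders):

```yaml
communication:
  mode: grpc            # or "both" to send over HTTP and gRPC
  grpc:
    target: grpc.example.com:50051
    use_tls: true
    ca_cert: /etc/gocloudera/ca.pem
```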
AI Spend Tracking
| Environment Variable | YAML Key | Default | Description |
|---|---|---|---|
| AI_SPEND_ENABLED | ai_spend.enabled | false | Enable LLM cost tracking |
Recommendations
| Environment Variable | YAML Key | Default | Description |
|---|---|---|---|
| ENABLE_RECOMMENDATIONS | recommendations.enabled | false | Enable RI/SP/Spot recommendations |
| RECOMMENDATION_INTERVAL | recommendations.interval | 21600 | Seconds between recommendation cycles |
YAML Configuration Example
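A single-provider config.yaml assembled from the keys in the reference above (all values are placeholders):

```yaml
backend:
  api_url: https://api.gocloudera.com
  api_key: <your-api-key>
  tenant_id: acme

logging:
  level: INFO

monitoring:
  interval: 300        # seconds between monitoring cycles

communication:
  mode: http

clouds:
  aws:
    enabled: true
    region: us-east-1
    services: ec2,sagemaker

nvml:
  enabled: true
  sample_interval: 10
```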
Multi-Cloud Setup
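A sketch combining several providers in one config.yaml (subscription and project values are placeholders; credentials for each provider are supplied via the environment variables or roles described above):

```yaml
clouds:
  aws:
    enabled: true
    region: us-east-1
  azure:
    enabled: true
    subscription_id: 00000000-0000-0000-0000-000000000000
  gcp:
    enabled: true
    project_id: my-project
  kubernetes:
    enabled: true
    kubeconfig: ~/.kube/config
```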
You can monitor multiple clouds simultaneously: enable each provider and provide its credentials.

Health & Diagnostics
Check agent status and logs when troubleshooting. Common errors:
- “Connection refused” — Check that BACKEND_API_URL is reachable. Verify the firewall allows outbound HTTPS.
- “401 Unauthorized” — API key is wrong or expired. Regenerate in Cloud Accounts.
- “No instances found” — Cloud credentials lack required permissions. Check IAM policies.
- “NVML not available” — NVIDIA drivers not installed, or Docker container doesn’t have GPU access. Mount the NVIDIA libraries.
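When triaging the errors above, the agent's own output is the first place to look. These commands assume the container/unit names used in the installation examples (adjust to your setup):

```shell
docker logs --tail 100 gocloudera-agent            # Docker install
journalctl -u gocloudera-agent --since "10 min ago" # systemd install
curl -sS -o /dev/null -w '%{http_code}\n' https://api.gocloudera.com  # connectivity
```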
Data Collected
The agent collects and sends:

| Data Type | Frequency | Description |
|---|---|---|
| Instances | Every cycle (5 min) | Instance ID, type, state, GPU type, region, tags |
| Metrics | Every cycle (5 min) | GPU utilization, memory, temperature, CPU, network I/O |
| Costs | Every cycle (5 min) | Hourly/daily cost from billing APIs or estimation |
| Recommendations | Every 6 hours | RI/SP/Spot suggestions with projected savings |
| AI Spend | Every cycle (5 min) | LLM token counts, costs by provider/model |
| Health | Every cycle (5 min) | Agent uptime, last collection time, error counts |