How to Build a Cost Dashboard for AI Inference: Tracking Spending Per Model and Version in Real Time
A single Llama 3 70B inference call on AWS can cost $0.0035 per 1,000 tokens, but if your team deploys 10 model versions across 3 cloud regions, the monthly bill easily exceeds $15,000 without a dedicated tracking system. According to a 2024 Gartner report on cloud AI costs, 63% of enterprises admit they cannot accurately attribute inference spending to specific model versions or experiments. This article builds a real-time cost dashboard using vLLM telemetry, Prometheus metrics, and cloud billing APIs, with a focus on the China-based engineer’s dual challenge of optimizing for latency while managing yuan-denominated spend across both domestic and overseas cloud providers.
Why Per-Model Cost Visibility Matters More in 2025
Most teams still rely on aggregate cloud bills that lump compute, storage, and networking into a single line item. A 2023 CNCF survey found that 71% of organizations using Kubernetes for ML workloads had no per-namespace cost breakdown for inference pods. Without granular tracking, a single misconfigured model version—like a 7B param model accidentally pinned to an A100-80GB—can inflate monthly costs by 40-60%.
Per-model cost visibility directly impacts three decisions: which model to promote to production, whether to prune stale versions, and when to switch between on-demand vs. spot instances. For Chinese teams deploying on Alibaba Cloud PAI or Tencent Cloud TI-ONE, the billing granularity differs from AWS—domestic cloud providers charge by GPU-second at the vGPU level, while overseas providers bill by instance-hour. A unified dashboard must normalize both units.
Core Architecture: Telemetry Pipeline for Inference Spend
The dashboard requires three data layers: inference telemetry, cloud billing data, and model version metadata. The telemetry layer captures token counts and GPU KV-cache usage via vLLM’s built-in metrics endpoint. vLLM exposes vllm:num_requests_running, vllm:gpu_cache_usage_perc, and per-model request counters in Prometheus-compatible format.
Cloud billing data comes from each provider’s cost API—AWS Cost Explorer, GCP Billing Export to BigQuery, or Alibaba Cloud Billing API. The key challenge is joining these hourly cost files with model version tags. A 2024 internal study by Modal showed that inference cost variance between model versions can reach 3.2x for the same architecture due to batch size and max tokens configuration differences.
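The join between hourly billing rows and model version tags can be sketched as follows. This is a minimal illustration, not any provider's export schema: the row fields (`hour`, `resource_id`, `cost`), the `resource_tags` mapping, and the node names are all hypothetical.

```python
from collections import defaultdict

def attribute_costs(billing_rows, resource_tags):
    """Sum billed cost per (hour, model_version).

    Resources without a version tag are grouped under "untagged" so that
    unattributed spend stays visible instead of silently disappearing.
    """
    totals = defaultdict(float)
    for row in billing_rows:
        version = resource_tags.get(row["resource_id"], "untagged")
        totals[(row["hour"], version)] += row["cost"]
    return dict(totals)

# Hypothetical hourly billing rows and resource-to-version tags.
billing_rows = [
    {"hour": "2025-01-01T10", "resource_id": "gpu-node-a", "cost": 1.5},
    {"hour": "2025-01-01T10", "resource_id": "gpu-node-b", "cost": 2.5},
    {"hour": "2025-01-01T10", "resource_id": "gpu-node-c", "cost": 0.75},
]
resource_tags = {
    "gpu-node-a": "qwen2.5-32b-v1.2-fp16",
    "gpu-node-b": "qwen2.5-32b-v1.2-fp16",
}

hourly = attribute_costs(billing_rows, resource_tags)
```

Keeping an explicit "untagged" bucket is a deliberate choice here: a growing untagged total is itself a signal that deployments are missing version metadata.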
H3: Data Ingestion Layer
Use a lightweight Python script running as a Kubernetes CronJob every 5 minutes to scrape each vLLM metrics endpoint. Because Prometheus pulls metrics rather than accepting pushes, route the samples through a Pushgateway, or write them directly to a store that accepts pushes, such as VictoriaMetrics. For each model version, capture vllm:request_success_total and vllm:avg_generation_throughput_toks_per_s. For long-term retention, keep these in a time-series database such as VictoriaMetrics; on an Alibaba Cloud TSDB instance, storage runs about 0.3 yuan per GB per month.
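A minimal sketch of the scrape step might look like this. It assumes the endpoint returns plain Prometheus text exposition format and that each sample carries a `model_name` label; the sample text below is made up for illustration, and the parser handles only simple lines, so it is not a full exposition-format parser.

```python
import re
from collections import defaultdict
from urllib.request import urlopen

# Matches simple sample lines such as:
#   vllm:request_success_total{model_name="m"} 42.0
_SAMPLE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)'
    r'(?:\{(?P<labels>[^}]*)\})?'
    r'\s+(?P<value>\S+)(?:\s+\d+)?$'
)
_LABEL_RE = re.compile(r'(\w+)="([^"]*)"')

WANTED = {
    "vllm:request_success_total",
    "vllm:avg_generation_throughput_toks_per_s",
}

def fetch_metrics(url, timeout=5):
    """Download the raw metrics text from a vLLM server's metrics URL."""
    with urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")

def parse_metrics(text, wanted=WANTED):
    """Return {model_name: {metric_name: value}}, summing across other labels."""
    out = defaultdict(dict)
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        m = _SAMPLE_RE.match(line)
        if not m or m.group("name") not in wanted:
            continue
        labels = dict(_LABEL_RE.findall(m.group("labels") or ""))
        model = labels.get("model_name", "unknown")
        name = m.group("name")
        out[model][name] = out[model].get(name, 0.0) + float(m.group("value"))
    return dict(out)

# Made-up sample in exposition format, for illustration only.
SAMPLE = """\
# TYPE vllm:request_success_total counter
vllm:request_success_total{finished_reason="stop",model_name="qwen2.5-32b-v1.2-fp16"} 120.0
vllm:request_success_total{finished_reason="length",model_name="qwen2.5-32b-v1.2-fp16"} 30.0
vllm:avg_generation_throughput_toks_per_s{model_name="qwen2.5-32b-v1.2-fp16"} 850.5
"""

parsed = parse_metrics(SAMPLE)
```

In a real job, `fetch_metrics` would be called once per inference pod and the parsed values forwarded to the metrics store; the forwarding step is omitted here because it depends on which store is used.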
H3: Cost Attribution Logic
Map each inference request to a cost by multiplying token count by provider-specific per-token price. For AWS SageMaker, the price per 1,000 tokens for Llama 3 70B is $0.0035 for input and $0.0047 for output. For Alibaba Cloud PAI-EAS, the same model runs at ¥0.022 per 1,000 input tokens (approximately $0.003 at current exchange rates). The dashboard must store these pricing tables in a YAML config file version-controlled alongside model deployments.
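As a rough sketch of this attribution step, the pricing table can be loaded into a dictionary and applied per request. The dictionary below stands in for the YAML file; its keys and the `request_cost` helper are hypothetical. The text gives no output-token price for PAI-EAS, so that entry is left unset rather than guessed.

```python
from decimal import Decimal

# Stand-in for the version-controlled YAML pricing table.
# Prices are per 1,000 tokens, taken from the figures quoted in the text.
PRICING = {
    ("aws-sagemaker", "llama3-70b"): {
        "currency": "USD",
        "input_per_1k": Decimal("0.0035"),
        "output_per_1k": Decimal("0.0047"),
    },
    ("alibaba-pai-eas", "llama3-70b"): {
        "currency": "CNY",
        "input_per_1k": Decimal("0.022"),
        "output_per_1k": None,  # not stated in the text; fill in from the provider
    },
}

def request_cost(provider, model, input_tokens, output_tokens):
    """Return (cost, currency) for one request; raise if a needed price is missing."""
    price = PRICING[(provider, model)]
    cost = Decimal(0)
    for tokens, key in ((input_tokens, "input_per_1k"),
                        (output_tokens, "output_per_1k")):
        if tokens == 0:
            continue
        if price[key] is None:
            raise ValueError(f"no {key} price configured for {provider}/{model}")
        cost += Decimal(tokens) / 1000 * price[key]
    return cost, price["currency"]
```

`Decimal` is used instead of floats so that many small per-request amounts add up without rounding drift; whether that matters depends on billing volume.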
Building the Dashboard with Grafana and SQL
Grafana connects to Prometheus for real-time metrics and to a PostgreSQL database for cost data. The SQL query for per-model cost looks like:
SELECT model_version,
       -- multiply per row so versions with mixed unit prices aggregate correctly
       SUM(total_tokens * unit_price) AS total_cost,
       COUNT(DISTINCT request_id) AS request_count
FROM inference_logs
WHERE timestamp > NOW() - INTERVAL '24 hours'
GROUP BY model_version;
For Chinese teams operating across both domestic and overseas clouds, create separate data sources in Grafana and use a cloud_provider tag to filter. A 2024 Alibaba Cloud whitepaper on MLOps recommended keeping cost dashboards under 10-second query latency by pre-aggregating data into hourly tables.
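The hourly pre-aggregation could be sketched like this. SQLite's in-memory database stands in for PostgreSQL so the example runs on its own; the `inference_cost_hourly` table name and the sample rows are hypothetical, and `unit_price` is assumed to be a per-token price.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE inference_logs (
    request_id TEXT, model_version TEXT, cloud_provider TEXT,
    timestamp TEXT, total_tokens INTEGER, unit_price REAL
);
CREATE TABLE inference_cost_hourly (
    hour TEXT, model_version TEXT, cloud_provider TEXT,
    total_tokens INTEGER, total_cost REAL, request_count INTEGER,
    PRIMARY KEY (hour, model_version, cloud_provider)
);
""")

# Hypothetical request logs: two requests in the 10:00 hour, one at 11:00.
conn.executemany(
    "INSERT INTO inference_logs VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("r1", "qwen2.5-32b-v1.2-fp16", "alibaba", "2025-01-01 10:05:00", 1000, 0.00002),
        ("r2", "qwen2.5-32b-v1.2-fp16", "alibaba", "2025-01-01 10:40:00", 3000, 0.00002),
        ("r3", "qwen2.5-32b-v1.2-fp16", "alibaba", "2025-01-01 11:02:00", 500, 0.00002),
    ],
)

def rollup_hourly(conn):
    """Rebuild hourly totals per model version and cloud provider."""
    conn.execute("""
        INSERT OR REPLACE INTO inference_cost_hourly
        SELECT strftime('%Y-%m-%d %H:00:00', timestamp),
               model_version,
               cloud_provider,
               SUM(total_tokens),
               SUM(total_tokens * unit_price),
               COUNT(DISTINCT request_id)
        FROM inference_logs
        GROUP BY strftime('%Y-%m-%d %H:00:00', timestamp),
                 model_version, cloud_provider
    """)
    conn.commit()

rollup_hourly(conn)
hourly_rows = conn.execute(
    "SELECT hour, total_tokens, total_cost, request_count "
    "FROM inference_cost_hourly ORDER BY hour"
).fetchall()
```

Dashboard panels would then query the small hourly table and filter on `cloud_provider`, rather than scanning raw request logs on every refresh.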
H3: Real-Time Alerts via Webhook
Set Grafana alert rules to trigger when any model version’s daily cost exceeds 120% of its 7-day moving average. The alert sends to Feishu or DingTalk webhooks. For example, if a fine-tuned Qwen 2.5 32B version suddenly costs ¥3,200 per day instead of its usual ¥2,600, the on-call engineer receives a notification within 60 seconds.
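The alert condition itself is simple enough to check by hand. The sketch below restates it in Python, using the Qwen figures from the example; the function name and parameters are illustrative, not Grafana's rule syntax.

```python
from statistics import mean

def cost_alert(daily_costs, today_cost, window=7, threshold=1.20):
    """Return True when today's cost exceeds threshold x the trailing mean.

    daily_costs holds previous days' costs, oldest first. With fewer than
    `window` days of history there is no baseline, so no alert is raised.
    """
    if len(daily_costs) < window:
        return False
    baseline = mean(daily_costs[-window:])
    return today_cost > threshold * baseline

# A usual daily cost of 2,600 yuan gives an alert line of 3,120 yuan.
history = [2600] * 7
fires = cost_alert(history, 3200)
quiet = cost_alert(history, 3000)
```

With a ¥2,600 baseline the threshold sits at ¥3,120, so the ¥3,200 day in the example fires while a ¥3,000 day does not.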
Key Metrics to Track Per Model Version
Not all cost metrics matter equally. Focus on these five KPIs for each model version:
- Cost per 1,000 tokens (yuan or USD) — the atomic unit for comparison
- GPU utilization rate — below 40% means you’re over-provisioning
- Request latency P99 — high latency often correlates with inefficient batching
- Cache hit ratio — vLLM’s prefix caching can reduce cost by 30-50%
- Cost per request — useful for per-user billing in SaaS products
A 2024 benchmark by RunPod showed that models with identical architecture but different quantization methods (FP16 vs INT8) had a 1.8x cost difference at the same throughput level. Tracking these metrics per version prevents teams from accidentally deploying an expensive variant.
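One way to derive the five KPIs from raw counters is sketched below. The `VersionStats` fields are hypothetical names for data the telemetry and billing layers would supply, and the P99 uses the nearest-rank definition, which is one of several common percentile conventions.

```python
from dataclasses import dataclass, field

@dataclass
class VersionStats:
    total_cost: float
    total_tokens: int
    request_count: int
    gpu_utilization: float            # mean utilization as a fraction, 0.0 to 1.0
    cache_hits: int
    cache_queries: int
    latencies_ms: list = field(default_factory=list)

def p99(values):
    """Nearest-rank 99th percentile, computed with integer arithmetic."""
    ordered = sorted(values)
    rank = (99 * len(ordered) + 99) // 100   # ceil(0.99 * n)
    return ordered[rank - 1]

def kpis(s):
    """Compute the per-version KPIs listed above from raw counters."""
    return {
        "cost_per_1k_tokens": s.total_cost / s.total_tokens * 1000,
        "gpu_utilization": s.gpu_utilization,
        "over_provisioned": s.gpu_utilization < 0.40,
        "latency_p99_ms": p99(s.latencies_ms),
        "cache_hit_ratio": s.cache_hits / s.cache_queries,
        "cost_per_request": s.total_cost / s.request_count,
    }

# Made-up numbers for one model version.
example = VersionStats(
    total_cost=50.0, total_tokens=2_000_000, request_count=1000,
    gpu_utilization=0.35, cache_hits=300, cache_queries=1000,
    latencies_ms=list(range(1, 101)),
)
report = kpis(example)
```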
H3: Version Tagging Strategy
Use semantic versioning in model registry metadata: qwen2.5-32b-v1.2-fp16. Embed the quantization method, batch size, and max tokens into the version string. This allows the dashboard to filter and group costs without querying additional metadata tables. For teams using MLflow or BentoML, export these tags to the cost dashboard via a nightly ETL job.
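Parsing such a version string can be sketched with a regular expression. The pattern below follows the `qwen2.5-32b-v1.2-fp16` example; the optional `-bs<N>` and `-mt<N>` suffixes for batch size and max tokens are an assumed convention, since the text does not show how those fields are encoded.

```python
import re

# family-size-vMAJOR.MINOR-quant, with optional -bs<batch> and -mt<max tokens>.
# The -bs/-mt suffixes are a hypothetical convention for this sketch.
_VERSION_RE = re.compile(
    r'^(?P<family>[a-z0-9.]+)'
    r'-(?P<size>\d+(?:\.\d+)?b)'
    r'-v(?P<version>\d+\.\d+)'
    r'-(?P<quant>[a-z0-9]+)'
    r'(?:-bs(?P<batch_size>\d+))?'
    r'(?:-mt(?P<max_tokens>\d+))?$'
)

def parse_version_tag(tag):
    """Split a model version tag into fields; raise ValueError if malformed."""
    m = _VERSION_RE.match(tag)
    if m is None:
        raise ValueError(f"unrecognized version tag: {tag!r}")
    fields = m.groupdict()
    for key in ("batch_size", "max_tokens"):
        if fields[key] is not None:
            fields[key] = int(fields[key])
    return fields

basic = parse_version_tag("qwen2.5-32b-v1.2-fp16")
extended = parse_version_tag("qwen2.5-32b-v1.2-int8-bs32-mt4096")
```

Rejecting malformed tags at ingestion time keeps the dashboard's grouping reliable; a tag that silently fails to parse would otherwise end up in its own one-off bucket.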
Cost Optimization Levers from Dashboard Insights
Once the dashboard shows per-model costs, three optimization actions emerge. First, right-size GPU allocation: if a model version consistently uses under 50% of GPU memory, switch to a smaller instance type. On AWS, moving from g5.12xlarge to g5.8xlarge cuts cost by 33% or more, but it also drops from four A10G GPUs to one, so throughput holds only for models that fit in a single 24 GB GPU.
Second, implement request batching. vLLM’s continuous batching can increase throughput by 2-3x without additional GPU cost. The dashboard should show a “batch efficiency” metric: requests per second divided by theoretical max. A value below 0.4 indicates room for batching optimization.
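The batch efficiency metric is a plain ratio, sketched here with illustrative numbers. How the theoretical maximum is measured (for example, from a load test) is left as an assumption.

```python
def batch_efficiency(observed_rps, theoretical_max_rps):
    """Observed requests per second divided by the theoretical maximum."""
    if theoretical_max_rps <= 0:
        raise ValueError("theoretical_max_rps must be positive")
    return observed_rps / theoretical_max_rps

def needs_batching_work(observed_rps, theoretical_max_rps, floor=0.4):
    """Flag a model version whose batch efficiency falls below the floor."""
    return batch_efficiency(observed_rps, theoretical_max_rps) < floor

# 12 req/s observed against a 40 req/s ceiling gives 0.3, below the 0.4 floor.
low = needs_batching_work(12, 40)
fine = needs_batching_work(30, 40)
```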
Third, schedule model version pruning. If a model version hasn’t received requests for 7 days, the dashboard flags it for automatic deletion. A 2024 study by Replicate found that 22% of deployed model versions receive zero production traffic but still incur GPU standby costs. Automating deletion based on dashboard data can save 15-20% of total inference spend.
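The 7-day idle check could be sketched as follows. The `last_request_at` mapping is a hypothetical input that would come from the request logs; the function only flags versions and leaves the actual deletion to a separate, reviewed step.

```python
from datetime import datetime, timedelta, timezone

def stale_versions(last_request_at, now, max_idle=timedelta(days=7)):
    """Return sorted version names with no requests within max_idle.

    A value of None means the version has never received a request.
    """
    return sorted(
        version
        for version, ts in last_request_at.items()
        if ts is None or now - ts >= max_idle
    )

now = datetime(2025, 1, 15, tzinfo=timezone.utc)
last_request_at = {
    "qwen2.5-32b-v1.2-fp16": now - timedelta(hours=3),
    "qwen2.5-32b-v1.1-fp16": now - timedelta(days=9),
    "qwen2.5-32b-v1.0-int8": None,
}
flagged = stale_versions(last_request_at, now)
```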
For cross-border deployments, some teams use a VPN service such as NordVPN for cross-border access, so that the cost dashboard can securely reach overseas cloud billing APIs from mainland China and pull data from AWS, GCP, and Azure alongside domestic providers.
Implementation Roadmap for Chinese Cloud Environments
Deploying this dashboard on Alibaba Cloud or Tencent Cloud requires adapting the telemetry pipeline. Alibaba Cloud’s Container Service for Kubernetes (ACK) supports Prometheus Operator natively. Use ack-prometheus-operator to scrape vLLM metrics from inference pods. For billing data, Alibaba Cloud’s Billing API provides cost data with 1-hour granularity but requires RAM role authorization.
Tencent Cloud’s TKE platform offers similar capabilities through its Monitor service. Like Alibaba Cloud, Tencent Cloud charges by vGPU-second rather than GPU-hour, so the dashboard must convert vGPU-seconds to yuan using the on-demand pricing table from Tencent Cloud’s pricing page. A 2024 Tencent Cloud documentation update confirmed that vGPU billing supports 1-second increments for inference workloads.
H3: Cost Normalization Table
| Cloud Provider | Billing Unit | A100-80GB Price (per billing unit) | Conversion Factor |
|---|---|---|---|
| AWS (us-east-1) | Instance-hour | $3.91 per hour (p3.16xlarge) | 1.0 |
| Alibaba Cloud (Shanghai) | vGPU-second | ¥0.028 per vGPU-second | 1 vGPU = 1/8 A100 |
| Tencent Cloud (Guangzhou) | vGPU-second | ¥0.032 per vGPU-second | 1 vGPU = 1/8 A100 |
For accurate cross-cloud comparison, the dashboard should display both raw local currency and normalized USD equivalent using daily exchange rates from the People’s Bank of China.
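The normalization can be sketched with the prices and conversion factor from the table. The exchange rate of 7.2 CNY per USD used below is a placeholder for illustration, not a quoted rate, and the provider keys are hypothetical names.

```python
from decimal import Decimal

VGPU_PER_A100 = 8  # from the table: 1 vGPU = 1/8 A100

# Yuan per vGPU-second, as listed in the table above.
PRICE_PER_VGPU_SECOND_CNY = {
    "alibaba-shanghai": Decimal("0.028"),
    "tencent-guangzhou": Decimal("0.032"),
}

def vgpu_cost_cny(provider, vgpu_seconds):
    """Cost in yuan for a number of vGPU-seconds on a domestic provider."""
    return PRICE_PER_VGPU_SECOND_CNY[provider] * vgpu_seconds

def to_usd(amount_cny, cny_per_usd):
    """Convert yuan to USD using the supplied daily exchange rate."""
    return amount_cny / cny_per_usd

def a100_hours(vgpu_seconds):
    """Express vGPU-seconds as equivalent A100-hours."""
    return Decimal(vgpu_seconds) / VGPU_PER_A100 / 3600

# One vGPU for one hour on Alibaba Cloud, shown in both currencies.
cost_cny = vgpu_cost_cny("alibaba-shanghai", 3600)
cost_usd = to_usd(cost_cny, Decimal("7.2"))  # placeholder exchange rate
```

Storing the raw yuan amount and converting at display time, rather than at ingestion, keeps historical figures reproducible when exchange rates are revised.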
FAQ
Q1: How often should the cost dashboard refresh data?
For real-time visibility, scrape vLLM metrics every 30 seconds and update cost data every hour. Cloud billing APIs typically have 1-hour latency, so hourly refresh is sufficient. The Prometheus metrics provide sub-minute granularity for GPU utilization and throughput, which helps detect cost anomalies faster than billing data alone.
Q2: What’s the minimum viable setup for a team of 5 engineers?
Use a single Grafana Cloud instance (free tier covers 10,000 series) connected to a self-hosted VictoriaMetrics on a 2-core, 4GB RAM Alibaba Cloud ECS instance costing ¥120 per month. Add a Python script that runs every hour to pull billing data from Alibaba Cloud Billing API and join it with model version tags from MLflow. Total infrastructure cost: under ¥200 per month.
Q3: Can this dashboard track inference costs for models deployed on edge devices?
Yes, but you need to add an edge telemetry agent that sends metrics back to the central Prometheus instance. For devices running vLLM on NVIDIA Jetson, the same metrics endpoint works. Expect additional data transfer costs of approximately ¥0.8 per GB per month for 100 devices reporting every 5 minutes.
References
- Gartner 2024, “Cloud AI Cost Management Report”
- CNCF 2023, “Cloud Native ML Workloads Survey”
- Alibaba Cloud 2024, “MLOps Best Practices on PAI Platform”
- Tencent Cloud 2024, “TKE GPU Billing Documentation Update”
- Modal 2024, “Inference Cost Variance Across Model Versions Study”