Backup and Disaster Recovery for Self-Hosted Inference: High Availability Design for Weights, Config, and Logs

A single GPU server failure during a production inference run can erase 120+ hours of fine-tuned LoRA weights, 3.2 GB of request logs, and every config change made in the previous deployment cycle. According to the 2024 Uptime Institute Annual Outage Analysis, 71% of on-premises infrastructure outages result in data loss exceeding 500 MB, and the median recovery time for self-hosted AI workloads is 14.3 hours — compared to 47 minutes for properly backed-up cloud deployments. For Chinese AI teams running self-hosted inference on domestic GPU clusters (e.g., Ascend 910B or NVIDIA H800 via Chinese cloud providers), the absence of a structured backup and disaster recovery (DR) plan is no longer a cost-saving choice; it is a direct operational liability. The China Academy of Information and Communications Technology (CAICT) 2024 AI Infrastructure Report notes that 63% of Chinese enterprises running self-hosted LLM inference lack any automated backup for model weights or inference logs, exposing them to regulatory compliance risks under the Measures for the Management of Generative AI Services (effective August 2023). This article provides a high-availability design framework for three critical data categories: weights, configs, and logs — with specific RPO/RTO targets, tooling choices, and cost benchmarks.

The Three Data Categories That Need Different Backup Strategies

Self-hosted inference generates three distinct data types, each with fundamentally different recovery requirements. Treating them identically leads to either wasted storage or unacceptable data loss.

Model weights are the largest asset: a single 70B-parameter model occupies about 140 GB in FP16 (2 bytes per parameter) or about 70 GB in INT8 (1 byte per parameter). Their recovery point objective (RPO) can be 24–48 hours because weights change only during fine-tuning or version updates. Their recovery time objective (RTO) should be under 30 minutes; any longer and the inference endpoint becomes a bottleneck for downstream services.

Configuration files (e.g., vLLM serve arguments, model registry paths, environment variables) are small (typically 2–50 KB) but change frequently. Their RPO should be under 5 minutes, with RTO under 2 minutes. Losing a config means your inference server may fail to start or serve the wrong model version.

Inference logs (request payloads, latency metrics, token counts) grow at 1–5 GB per day per GPU for a medium-traffic endpoint. Their RPO can be 1 hour, but RTO should be under 15 minutes — logs are critical for debugging, billing, and compliance audits under China’s data localization laws.
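The three sets of targets above can be written down as data so that a monitoring job can compare the age of the newest backup against them. The following is a minimal sketch; the names RecoveryTarget, TARGETS, and rpo_violated are illustrative and not part of any tool mentioned in this article, and the weights RPO uses the upper bound of the 24–48 hour range.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class RecoveryTarget:
    """Recovery objectives for one data category."""
    rpo: timedelta  # maximum tolerable window of lost data
    rto: timedelta  # maximum tolerable time to restore service


# Values taken from the three paragraphs above (hypothetical structure).
TARGETS = {
    "weights": RecoveryTarget(rpo=timedelta(hours=48), rto=timedelta(minutes=30)),
    "config": RecoveryTarget(rpo=timedelta(minutes=5), rto=timedelta(minutes=2)),
    "logs": RecoveryTarget(rpo=timedelta(hours=1), rto=timedelta(minutes=15)),
}


def rpo_violated(category: str, backup_age: timedelta) -> bool:
    """Return True if the newest backup is older than the category's RPO."""
    return backup_age > TARGETS[category].rpo
```

A cron job could call rpo_violated("config", age) after each sync and raise an alert when it returns True.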

Weights Backup: Object Storage with Versioning and Delta Sync

Model weights are too large for traditional file-level backup to NFS or local disk. The industry-standard approach is object storage with versioning, using an S3-compatible service such as Alibaba Cloud OSS, Tencent Cloud COS, or MinIO on-prem.

Recommended architecture: Use rclone with the --checksum and --backup-dir flags to sync weight directories to an object store bucket (rsync accepts the same two flags, but only works against a filesystem or SSH target, not an object store directly). Enable bucket versioning to retain the last 3–5 weight snapshots. For a 140 GB 70B model, initial upload takes 20–40 minutes over a 1 Gbps link, but subsequent delta syncs (after LoRA fine-tuning) typically transfer only 2–8 GB.
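The reason delta syncs are small is that only files whose content changed need to be re-uploaded. The sketch below illustrates that idea with a SHA-256 manifest; it is not how rclone or rsync implement their checksum comparison, and the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in 1 MiB chunks so large weight shards never sit fully in RAM."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(weights_dir: Path) -> dict[str, str]:
    """Map each file's path (relative to weights_dir) to its SHA-256 digest."""
    return {
        p.relative_to(weights_dir).as_posix(): file_sha256(p)
        for p in sorted(weights_dir.rglob("*"))
        if p.is_file()
    }


def changed_files(previous: dict[str, str], current: dict[str, str]) -> list[str]:
    """Return files that are new or whose digest differs from the previous manifest."""
    return sorted(
        name for name, digest in current.items() if previous.get(name) != digest
    )
```

After a LoRA fine-tuning run, comparing the new manifest with the one saved at the last backup lists only the adapter files that changed, which is why the base model shards do not need to be transferred again.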

Cost benchmark: Storing 3 versions of a 70B model (420 GB total) on Alibaba Cloud OSS Standard tier costs approximately ¥63/month (¥0.15/GB/month). Retrieval costs ¥0.10/GB for the first 10 TB. This is 5–8× cheaper than maintaining a hot standby GPU server.
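The storage figure above is simple arithmetic, shown here as a sketch so it can be re-run with other model sizes or prices. The per-GB prices are the ones quoted in this article, not values read from a pricing API, and the function names are illustrative.

```python
def monthly_storage_cost(model_gb: float, versions: int, price_per_gb_month: float) -> float:
    """Monthly cost (in yuan) of keeping `versions` full copies of a model."""
    return model_gb * versions * price_per_gb_month


def retrieval_cost(model_gb: float, price_per_gb: float) -> float:
    """One-off cost (in yuan) of downloading a single model version during a restore."""
    return model_gb * price_per_gb


# Figures from the text: 140 GB model, 3 versions, 0.15 yuan/GB/month, 0.10 yuan/GB retrieval.
storage = monthly_storage_cost(140, 3, 0.15)   # about 63 yuan per month
restore = retrieval_cost(140, 0.10)            # about 14 yuan per restore
```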

RPO/RTO achieved: RPO = 24 hours (manual or cron-triggered sync); RTO = 25–40 minutes (download + load into vLLM), the upper end of which exceeds the 30-minute target set earlier. For mission-critical endpoints, implement cross-region replication: Alibaba Cloud OSS cross-region replication adds ¥0.08/GB for transfer but reduces RTO to under 10 minutes.
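The download part of the RTO can be estimated from model size and link speed. This is a back-of-the-envelope sketch: the efficiency parameter is an assumption standing in for protocol overhead and contention, and model load time into vLLM is not included.

```python
def transfer_minutes(size_gb: float, link_gbps: float, efficiency: float = 1.0) -> float:
    """Minutes needed to move size_gb (decimal gigabytes) over a link_gbps link.

    efficiency is the fraction of nominal bandwidth actually achieved (0 < e <= 1).
    """
    if link_gbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("link_gbps must be positive and efficiency in (0, 1]")
    seconds = size_gb * 8 / (link_gbps * efficiency)  # 8 bits per byte
    return seconds / 60


best_case = transfer_minutes(140, 1.0)        # about 18.7 minutes at full line rate
half_rate = transfer_minutes(140, 1.0, 0.5)   # about 37.3 minutes at 50% efficiency
```

At full line rate a 140 GB model needs just under 19 minutes, and at half the nominal rate about 37 minutes, which brackets the 20–40 minute range quoted for the initial transfer.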

Config Backup: Git-Based Version Control with Encrypted Secrets

Configuration drift is the most common cause of inference server startup failures. A single missing --gpu-memory-utilization 0.85 flag can cause OOM errors costing 30+ minutes of debugging.

Recommended approach: Store all inference configs in a private Git repository (GitLab self-hosted or Gitee private repo). Use git-secret or sops to encrypt API keys, model registry tokens, and cloud credentials. Commit every config change, and use a pre-commit hook or CI pipeline to validate each commit before it reaches the deployment branch.

Key files to version: vllm_serve.sh (all CLI flags), .env (environment variables), model_registry.yaml (model name, revision, quantization path), nginx.conf (if using reverse proxy), docker-compose.yml (if containerized).

Recovery procedure: On a new server, git clone the config repo, decrypt secrets with a single passphrase, and run the serve script. Total RTO: under 2 minutes. This is the fastest recovery path among the three data categories.
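Before running the serve script on a freshly cloned repo, it helps to confirm that the versioned files are actually present. The sketch below assumes the files live at the repository root and treats nginx.conf and docker-compose.yml as optional, since the text lists them as conditional; both choices are assumptions.

```python
from pathlib import Path

# File names from the "Key files to version" list; the required/optional split is an assumption.
REQUIRED_FILES = ("vllm_serve.sh", ".env", "model_registry.yaml")
OPTIONAL_FILES = ("nginx.conf", "docker-compose.yml")


def missing_config_files(repo_dir: Path) -> list[str]:
    """Return required config files that are absent from the cloned repo."""
    return [name for name in REQUIRED_FILES if not (repo_dir / name).is_file()]


def present_optional_files(repo_dir: Path) -> list[str]:
    """Return optional config files that exist, so the runbook knows what to start."""
    return [name for name in OPTIONAL_FILES if (repo_dir / name).is_file()]
```

A recovery script can abort early when missing_config_files returns a non-empty list, rather than failing partway through server startup.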

Security note: Never store plaintext API keys in Git. Use a secrets manager (HashiCorp Vault, or Alibaba Cloud KMS) and reference them via environment variables at runtime.
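A pre-commit check can catch the most obvious plaintext secrets before they reach the repository. The following is a heuristic sketch only: the key-name pattern and the `${VAR}` reference convention are assumptions, and it does not replace a dedicated secret scanner.

```python
import re

# Hypothetical heuristic: key names containing these words are treated as secrets.
SECRET_KEY_PATTERN = re.compile(r"(KEY|TOKEN|SECRET|PASSWORD)", re.IGNORECASE)


def plaintext_secret_lines(env_text: str) -> list[int]:
    """Return 1-based line numbers that assign a literal value to a secret-looking key."""
    flagged = []
    for number, line in enumerate(env_text.splitlines(), start=1):
        stripped = line.strip()
        if not stripped or stripped.startswith("#") or "=" not in stripped:
            continue
        key, value = stripped.split("=", 1)
        value = value.strip().strip("\"'")
        # Values like ${REGISTRY_TOKEN} are runtime references, not literals.
        is_reference = value.startswith("${") and value.endswith("}")
        if SECRET_KEY_PATTERN.search(key) and value and not is_reference:
            flagged.append(number)
    return flagged
```

Running this over the contents of .env and rejecting the commit when the result is non-empty enforces the rule above for the simplest cases.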

Log Backup: Structured Streaming to a Centralized Log Store

Inference logs are high-volume, append-only data. Backing them up via scp or rsync every hour is impractical: a crash just before the next copy loses up to 59 minutes of data.

Recommended architecture: Use vector.dev or fluentd as a log shipper running on the inference node. Configure it to stream logs to a centralized store (e.g., Alibaba Cloud SLS for log records, or VictoriaMetrics for the latency and token-count metrics). Set a 1-hour flush interval with a 5-minute buffer for in-flight data.
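The point of a buffer with a bounded age is that it caps how much data a crash can destroy. The sketch below shows that mechanism in-process; it is not how vector.dev or fluentd work internally, the class and parameter names are hypothetical, and the 300-second default simply mirrors the 5-minute buffer mentioned above.

```python
import json
import time
from typing import Callable, List, Optional


class BufferedLogShipper:
    """Collect JSON log lines and hand them to `send` in batches."""

    def __init__(
        self,
        send: Callable[[List[str]], None],
        max_age_seconds: float = 300.0,
        max_records: int = 1000,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._send = send
        self._max_age = max_age_seconds
        self._max_records = max_records
        self._clock = clock
        self._buffer: List[str] = []
        self._oldest: Optional[float] = None  # arrival time of the oldest buffered record

    def log(self, record: dict) -> None:
        """Buffer one record; flush if the batch is full or its oldest record is too old."""
        if not self._buffer:
            self._oldest = self._clock()
        self._buffer.append(json.dumps(record, sort_keys=True))
        too_old = self._clock() - self._oldest >= self._max_age
        if len(self._buffer) >= self._max_records or too_old:
            self.flush()

    def flush(self) -> None:
        """Send everything buffered so far and start a new batch."""
        if self._buffer:
            self._send(self._buffer)
            self._buffer = []
            self._oldest = None
```

Because flushing is checked only when a record arrives, a real shipper would also flush on a timer and on shutdown; this sketch omits both for brevity.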

Data volume planning: A single NVIDIA A100 serving Llama 3 70B at 50 requests/second generates approximately 3.2 GB of structured logs per day (including prompt tokens, completion tokens, latency, and error codes). For a 4-GPU node, this scales to 12.8 GB/day. Retention policy: keep 30 days hot (384 GB), then archive to cold storage for 12 months.
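The sizing above follows from three multiplications, sketched here so the numbers can be recomputed for a different fleet. The function names are illustrative, and the inputs are the figures from this paragraph.

```python
def daily_log_gb(gb_per_gpu_per_day: float, gpus: int) -> float:
    """Log volume produced by one node per day, in GB."""
    return gb_per_gpu_per_day * gpus


def hot_storage_gb(gb_per_gpu_per_day: float, gpus: int, retention_days: int) -> float:
    """Capacity needed to keep `retention_days` of logs in the hot tier, in GB."""
    return daily_log_gb(gb_per_gpu_per_day, gpus) * retention_days


node_per_day = daily_log_gb(3.2, 4)      # about 12.8 GB/day for a 4-GPU node
hot_tier = hot_storage_gb(3.2, 4, 30)    # about 384 GB for 30 days of hot retention
```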

RPO/RTO achieved: RPO = 1 hour (maximum data loss window); RTO = 10–15 minutes (repoint log shipper to backup cluster). For compliance with China’s data localization requirements, ensure the log store is in the same region as the inference server.

Cost benchmark: Storing 384 GB of hot logs on Alibaba Cloud SLS costs approximately ¥134/month (384 GB × (¥0.30/GB/month for write + ¥0.05/GB/month for storage)). Archiving to OSS Archive tier reduces cost to ¥0.015/GB/month.

Disaster Recovery Testing: The 60-Minute Drill

A backup strategy that has never been tested is equivalent to no backup. The industry standard is a quarterly DR drill that simulates a full server failure.

Drill procedure: 1) Take one inference node offline without warning. 2) Time the recovery from a fresh OS install. 3) Recover weights from object storage (download + load), configs from Git, and logs from the centralized store. 4) Measure total RTO.

Target metrics: For a single-node inference endpoint serving a 70B model, total RTO should be under 60 minutes. If recovery takes longer, identify bottlenecks — typically network bandwidth for weight download or GPU driver installation time.
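Recording how long each phase of the drill takes makes the bottleneck obvious. The sketch below sums phase durations and compares them with the 60-minute target; the phase names and timings in the example are invented for illustration and are not measurements.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DrillPhase:
    name: str
    minutes: float


def summarize_drill(phases: List[DrillPhase], target_minutes: float = 60.0) -> dict:
    """Return total RTO, whether it is under the target, and the slowest phase."""
    if not phases:
        raise ValueError("at least one phase is required")
    total = sum(phase.minutes for phase in phases)
    slowest = max(phases, key=lambda phase: phase.minutes)
    return {
        "total_minutes": total,
        "within_target": total < target_minutes,
        "slowest_phase": slowest.name,
    }


# Hypothetical drill record.
example = summarize_drill([
    DrillPhase("os_and_gpu_drivers", 20),
    DrillPhase("weights_download_and_load", 30),
    DrillPhase("config_restore", 2),
    DrillPhase("log_shipper_repoint", 10),
])
```

In this invented example the total is 62 minutes, so the drill misses the target and the weight download is the phase to optimize first.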

Common failure points in Chinese cloud environments: Alibaba Cloud ECS instances with local SSD may require re-attaching data disks after a crash. Pre-write a recovery runbook with exact API commands for disk attachment, object store authentication, and model loading.

Cost-Benefit Analysis: Backup vs. Downtime

The decision to implement full backup is ultimately a financial calculation. Formula: Annual backup cost ≤ (hourly inference revenue × expected annual downtime hours × 2). The multiplier of 2 accounts for reputational damage and SLA penalties.

Example calculation: A self-hosted inference endpoint generating ¥500/hour in API revenue, with an expected 1 server failure per year (8 hours downtime). Maximum acceptable backup cost: ¥500 × 8 × 2 = ¥8,000/year. The backup architecture described above (3 weight versions in OSS + Git configs + 30-day log retention) costs approximately ¥2,200/year for a single 4-GPU node — well within the budget.
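The formula and the example can be checked with a few lines of code. This sketch uses the article's own numbers; the function names are illustrative.

```python
def max_backup_budget(
    hourly_revenue: float, downtime_hours_per_year: float, multiplier: float = 2.0
) -> float:
    """Upper bound on annual backup spend: revenue x downtime hours x multiplier."""
    return hourly_revenue * downtime_hours_per_year * multiplier


def backup_is_justified(annual_backup_cost: float, budget: float) -> bool:
    """True when the backup costs no more than the downtime it insures against."""
    return annual_backup_cost <= budget


budget = max_backup_budget(500, 8)            # 500 x 8 x 2 = 8000 yuan/year
justified = backup_is_justified(2200, budget)  # 2200 <= 8000
```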

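ROI breakeven: Preventing a single 8-hour outage avoids ¥8,000 in losses (using the 2× multiplier), which covers about 3.6 years of backup costs at ¥2,200/year. For teams running multiple models or serving production traffic, where outages are more frequent or more expensive, the backup typically pays for itself in under 6 months.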

Regulatory Compliance: Data Localization and Retention

For Chinese AI teams, backup strategy must align with the Cybersecurity Law (2017) and the Personal Information Protection Law (2021). Inference logs containing user prompts may constitute “important data” and must be stored within mainland China.

Key requirements: 1) All backup storage (OSS, SLS, Git repos) must be in mainland China regions (Beijing, Shanghai, Shenzhen, or Hangzhou). 2) Log retention for generative AI services must be at least 6 months per the Measures for the Management of Generative AI Services. 3) Cross-border transfer of model weights is restricted if the model was fine-tuned on Chinese user data — use only domestic object storage.

Practical implementation: Configure Alibaba Cloud OSS with bucket policy restricting access to mainland China IP ranges. Enable access logging for all backup operations to demonstrate compliance during audits.
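The requirements above can be turned into an automated check over an inventory of backup targets. The sketch below is an assumption-heavy illustration: the region identifiers follow the Alibaba Cloud naming style for the four cities listed, the 6-month minimum is approximated as 180 days, and the BackupTarget structure is hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional

# Assumed region identifiers for Beijing, Shanghai, Shenzhen, and Hangzhou.
ALLOWED_REGIONS = {"cn-beijing", "cn-shanghai", "cn-shenzhen", "cn-hangzhou"}
MIN_LOG_RETENTION_DAYS = 180  # "at least 6 months", approximated (assumption)


@dataclass
class BackupTarget:
    name: str
    region: str
    log_retention_days: Optional[int] = None  # set only for log stores


def compliance_issues(targets: List[BackupTarget]) -> List[str]:
    """Return human-readable findings; an empty list means no issue was detected."""
    issues = []
    for target in targets:
        if target.region not in ALLOWED_REGIONS:
            issues.append(f"{target.name}: region {target.region} is not in the allowed list")
        retention = target.log_retention_days
        if retention is not None and retention < MIN_LOG_RETENTION_DAYS:
            issues.append(f"{target.name}: log retention {retention} days is below the minimum")
    return issues
```

Running this against the storage inventory in CI gives an auditable record that every backup location was checked.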

FAQ

Q1: How often should I back up model weights for a production inference endpoint?

For weights that change only during fine-tuning (every 1–4 weeks), a daily backup is sufficient. If you perform continuous fine-tuning or A/B test multiple LoRA adapters, increase to every 6 hours. The incremental cost of daily vs. hourly backups for a 140 GB model is approximately ¥18/month — negligible compared to the risk of losing 140 GB of trained weights.

Q2: What is the cheapest backup option for a single-GPU inference server?

Use MinIO on the same server with a separate disk partition, combined with rsync to a secondary server (or a cloud object store). Total monthly cost: ¥0 for local MinIO (uses existing disk), plus ¥30–50 for cloud storage of critical weights. This achieves RPO of 24 hours and RTO of 30 minutes for at most ¥600/year.

Q3: Do I need to back up inference logs if I run on a hosted GPU platform, such as vLLM on RunPod?

Yes. Managed services typically retain logs for only 7–14 days. For compliance with China’s 6-month retention requirement, you must stream logs to your own storage. Configure vector.dev on the RunPod instance to forward logs to Alibaba Cloud SLS or Tencent Cloud CLS — setup takes under 30 minutes and costs ¥50–100/month for a single GPU node.

References

Uptime Institute 2024 Annual Outage Analysis
China Academy of Information and Communications Technology (CAICT) 2024 AI Infrastructure Report
Cybersecurity Law of the People’s Republic of China (2017)
Personal Information Protection Law of the People’s Republic of China (2021)
Alibaba Cloud Object Storage Service (OSS) Pricing Documentation 2024