Back to news
NewsJune 26, 2026· 2 min read

Spin up a vLLM server on Hugging Face in one command, pay per second

Hugging Face Jobs now lets you launch a private, OpenAI-compatible LLM endpoint with a single CLI command. No infrastructure setup required — useful for evals, batch runs, and testing before committing to production.

Our Take

This is a convenience wrapper, not a capability breakthrough; it trades operational flexibility for speed, and only makes sense if you're willing to manage job lifecycle yourself.

Why it matters

Practitioners who run frequent one-off inference tasks (evals, batch generation, model testing) can skip infrastructure provisioning and pay only for compute time. The per-second billing and immediate teardown make it practical for temporary workloads that don't justify a long-lived endpoint.

Do this week

Infrastructure engineer: test a large model on your target hardware flavor before committing to Inference Endpoints, so you can right-size both the GPU and the context/batch settings for your actual workload.

Hugging Face exposes vLLM as a job primitive

Hugging Face Jobs now includes pre-configured Docker support for vLLM, letting you run hf jobs run --flavor a10g-large --expose 8000 vllm/vllm-openai:latest vllm serve Qwen/Qwen3-4B and get back a publicly routable OpenAI-compatible endpoint in minutes. The command handles port exposure, token authentication, and billing per second of GPU runtime. No Kubernetes. No server provisioning.

The endpoint accepts standard OpenAI API calls (curl, Python client, streaming) and gates access behind your Hugging Face token. Pricing runs at $1.50/hour for an A10G GPU; larger models scale across multiple H200s with tensor parallelism. You can SSH into the running job for debugging and monitoring.

The same pattern works with other OpenAI-compatible servers (llama.cpp, SGLang), though the post focuses on vLLM. Job cleanup is explicit (hf jobs cancel <job_id>), with optional timeout auto-kill as a safety net.

Fast iteration on models without operational overhead

This removes two friction points: the time to provision infrastructure and the cognitive load of thinking you need a long-lived service for a short-lived experiment. You can test whether a 122B model fits your latency budget on H200 hardware, run batch inference on a dataset, or validate a prompt before rolling into production.

The per-second billing model matters for this use case. An a10g-large running for 30 minutes costs $0.75. An hour-long eval run is $1.50. There's no minimum commitment, no idle-time charges, no scale-to-zero operational complexity to configure. Practitioners working in notebooks or CI/CD can point at the endpoint the same way they'd point at an OpenAI API key.

The trade-off: you are responsible for job lifecycle. If you forget to cancel, you keep paying. If you need production niceties (finer-grained access control, auto-scaling, SLA guarantees), Hugging Face points you to Inference Endpoints instead, which layers those operational features on top.

When to use Jobs vs. Inference Endpoints

Use Jobs for experiments, evals, batch generation, and model kicks-the-tires work where you control the lifecycle and cost-per-run matters more than operational automation. Pick the smallest hardware flavor that fits your model, set a reasonable timeout, and clean up when done.

Use Inference Endpoints if you're standing up a durable service: public or protected access patterns, scale-to-zero during idle periods, or team-level access control. The operational overhead is worth it when the endpoint is a product dependency, not a temporary test harness.

For larger models, the post recommends H200 flavors as better value than A100s or other options. If you get out-of-memory or cache-block errors, dial down --max-model-len and --max-num-seqs as a first debugging step. SSH access makes that diagnosis faster than log-tailing from the outside.

#Open Source#Developer Tools#LLM
Share:
Keep reading

Related stories