Our Take
The real shift is not the gallery itself; it's that agents can now compose models across organizations with zero integration code, turning documentation into discoverability.
Why it matters
Multimedia AI (image generation, 3D reconstruction, video) has been hard to chain because integration required SDKs, weights, and GPU setup. If every model exposes a plain-text schema, agents can assemble pipelines the way they already glue together code libraries.
Do this week
Document your Hugging Face Space with agents.md so agents can discover and call it without custom integration; test it by pointing Claude Code or your preferred agent at the schema URL.
Agent wires two Spaces into a 3D monument pipeline with no manual glue code
A developer prompted a coding agent to build a rotating 3D gallery of Paris monuments. The agent never touched an image generator or 3D reconstruction tool directly. Instead, it called two Hugging Face Spaces via curl:
- ideogram-ai/ideogram4 (image generation): received a monument name, returned a clean specimen photograph.
- VAST-AI/TripoSplat (3D reconstruction): received an image, returned a 3D Gaussian splat (.ply file).
The agent then handled the remaining work itself: flipped Y-axis coordinates, auto-framed each splat, compressed .ply files to .ksplat format, wrote a Three.js viewer with scroll-to-switch and drag-to-rotate controls, and deployed the result as a static Hugging Face Space. The only human inputs were taste decisions ("zoom out," "replace the obelisk").
The mechanism that made this possible: every Gradio Space on Hugging Face now exposes an agents.md file. Fetching https://huggingface.co/spaces/VAST-AI/TripoSplat/agents.md returns a plain-text schema including the API endpoint, call template, polling method, file upload format, and auth requirement. No client library. No vendor integration. An agent reads it once and can drive the Space end to end.
Documentation becomes the distribution layer for agents
Mitchell Hashimoto calls this the "building-block economy": AI is strong at wiring proven components together and weak at building everything from scratch. This thesis held for code libraries; now it's happening in multimedia.
The hard part of using state-of-the-art image, video, TTS, or 3D reconstruction models was never the model itself. It was integration: managing SDKs, downloading weights, provisioning GPUs, handling input formats, polling for results. If each model becomes a callable, documented primitive (a Space with agents.md), agents can compose them the way they compose npm packages.
The result is dramatic compression of friction. "Turn a prompt into a rotating 3D monument" used to be a project requiring manual setup and testing. Here it was a pipeline step a coding agent assembled in one pass. The barrier was integration, and agents.md removes it.
Expose your model as a Spaces-compatible API with plain-text documentation
If you host a model on Hugging Face or anywhere agents can reach it, write or generate an agents.md that covers: the API schema, the exact curl call template (with parameter names and types), the polling endpoint for async jobs, file upload protocol, and any auth hints. Test by pasting the agents.md URL into Claude Code or your preferred agent and asking it to call your endpoint.
Agents prefer what is documented and reachable. A model with a clear agents.md will be selected over an equivalent model without it, because the agent can use it without writing custom integration code. This is the same dynamic that made npm libraries dominate: discoverability and reusability beat raw capability when friction is low.