
Proposal: Agent-first CLI #45218

Draft
LysandreJik wants to merge 5 commits into main from agent-first-cli

Conversation


@LysandreJik LysandreJik commented Apr 3, 2026

This PR offers a new, agentic surface for transformers. It applies what other tools do with their CLIs to transformers, taking many common transformers use cases and exposing them as dedicated CLI endpoints.

I recommend reading this first: https://github.com/huggingface/transformers/blob/agent-first-cli/src/transformers/cli/agentic/README.md

In my eyes, the advantage is twofold:

  • We offer a nicer surface for agents to rely on when using transformers. Instead of relying on the docs, understanding our pipelines and their limitations, or working with Python scripts and transformers primitives, we're offering single, dedicated entry points for each relevant task.
  • It clearly documents the usage of all primitives in many different situations. The current approach deliberately does not leverage pipeline: pipelines are great user-facing interfaces, but they're not easy to customize and are, by definition, more limiting than what is used here.

This is a draft for discussion; if we want to move forward, there are some interesting ways to leverage this:

  • This should be entirely automated using agentic tooling and should not be an interface that we maintain by hand. The objective is simple and defined within its own directory and modules, so this would not add overhead for the team beyond some light reviews here and there.
  • The examples here could also be put forward within our documentation and within our CI.
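To make the discussion concrete, here is a minimal sketch of how such subcommands could be registered with standard `argparse` sub-parsers. All names, flags, and defaults below are illustrative assumptions for discussion, not the code in this PR:

```python
# Hypothetical sketch of an agentic CLI dispatcher using argparse sub-parsers.
# Subcommand and flag names mirror the examples in this proposal, but this is
# NOT the PR's actual implementation.
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog="transformers")
    sub = parser.add_subparsers(dest="command", required=True)

    # `transformers classify ...`
    classify = sub.add_parser("classify", help="text classification")
    classify.add_argument("--model", default=None)
    classify.add_argument("--text", required=True)
    classify.add_argument(
        "--labels",
        default=None,
        help="comma-separated labels; presence could switch to zero-shot",
    )

    # `transformers generate ...`
    generate = sub.add_parser("generate", help="text generation")
    generate.add_argument("--model", default=None)
    generate.add_argument("--prompt", required=True)
    generate.add_argument("--stream", action="store_true")

    return parser


args = build_parser().parse_args(
    ["classify", "--text", "The stock market crashed today.",
     "--labels", "politics,finance,sports"]
)
print(args.command, args.labels)
```

One sub-parser per task keeps each endpoint's surface small and independently documentable, which is the property the proposal is after.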

Example commands:

Text Inference

  1. Classify text into categories (supervised)

    transformers classify --model distilbert/distilbert-base-uncased-finetuned-sst-2-english --text "Great movie!"
  2. Classify text into arbitrary categories without training (zero-shot)

    transformers classify --text "The stock market crashed today." --labels "politics,finance,sports"
  3. Extract named entities from text (NER)

    transformers ner --model dslim/bert-base-NER --text "Apple CEO Tim Cook met with President Biden in Washington."
  4. Tag tokens with labels (POS tagging, chunking)

    transformers token-classify --model vblagoje/bert-english-uncased-finetuned-pos --text "The cat sat on the mat."

[...]

Text Generation

  1. Generate text from a prompt

    transformers generate --model meta-llama/Llama-3.2-1B-Instruct --prompt "Once upon a time"
  2. Stream text generation token-by-token

    transformers generate --model meta-llama/Llama-3.2-1B-Instruct --prompt "Hello" --stream
  3. Generate with sampling (temperature, top-p, top-k)

    transformers generate --prompt "The future of AI" --temperature 0.7 --top-p 0.9
  4. Generate with beam search

    transformers generate --prompt "Translate this:" --num-beams 4
  5. Run speculative decoding with a draft model

    transformers generate --model meta-llama/Llama-3.1-8B-Instruct --assistant-model meta-llama/Llama-3.2-1B-Instruct --prompt "Explain gravity."

[...]
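The sampling examples above map naturally onto `generate()` keyword arguments. One detail worth encoding in the CLI: in transformers, `temperature`, `top_p`, and `top_k` only take effect when `do_sample=True`, so the CLI would likely need to set it implicitly. A sketch with a hypothetical helper:

```python
# Sketch of mapping CLI sampling flags onto `model.generate()` kwargs.
# The helper name is an assumption; the do_sample behavior reflects how
# sampling parameters work in transformers.
def generation_kwargs(temperature=None, top_p=None, top_k=None, num_beams=None):
    kwargs = {}
    if num_beams is not None:
        kwargs["num_beams"] = num_beams
    if any(v is not None for v in (temperature, top_p, top_k)):
        kwargs["do_sample"] = True  # sampling flags are no-ops without this
        if temperature is not None:
            kwargs["temperature"] = temperature
        if top_p is not None:
            kwargs["top_p"] = top_p
        if top_k is not None:
            kwargs["top_k"] = top_k
    return kwargs


print(generation_kwargs(temperature=0.7, top_p=0.9))
```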

Vision

  1. Classify an image into categories

    transformers image-classify --model google/vit-base-patch16-224 --image photo.jpg
  2. Classify an image into arbitrary categories without training (zero-shot)

    transformers image-classify --model google/siglip-base-patch16-224 --image photo.jpg --labels "cat,dog,bird,fish"
  3. Detect objects in an image with bounding boxes

    transformers detect --model PekingU/rtdetr_r18vd_coco_o365 --image street.jpg

[...]
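For `detect`, the object-detection pipeline returns a list of dicts shaped like `{"score": ..., "label": ..., "box": {"xmin": ...}}`, and an agent-friendly CLI would plausibly want a score threshold flag. A sketch of that post-processing, with the helper name and `--threshold` flag as assumptions:

```python
# Hypothetical post-processing for `transformers detect`: filter pipeline-style
# detection dicts by confidence before printing. Not the PR's actual code.
def filter_detections(detections, threshold=0.5):
    return [d for d in detections if d["score"] >= threshold]


sample = [
    {"score": 0.97, "label": "car",
     "box": {"xmin": 10, "ymin": 20, "xmax": 200, "ymax": 120}},
    {"score": 0.31, "label": "bird",
     "box": {"xmin": 5, "ymin": 5, "xmax": 30, "ymax": 25}},
]
print(filter_detections(sample))  # keeps only the car
```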

Audio

  1. Transcribe speech to text

    transformers transcribe --model openai/whisper-small --audio recording.wav
  2. Transcribe speech with word-level timestamps

    transformers transcribe --model openai/whisper-small --audio recording.wav --timestamps true --json
  3. Classify an audio clip into categories

    transformers audio-classify --model MIT/ast-finetuned-audioset-10-10-0.4593 --audio clip.wav

[...]
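The `--timestamps true --json` example implies serializing timestamped output. The ASR pipeline with `return_timestamps` yields `{"text": ..., "chunks": [{"text": ..., "timestamp": (start, end)}]}`, and the timestamp tuples need flattening before `json.dumps`. A sketch, with the formatter name and output schema as assumptions:

```python
# Hypothetical --json formatter for transcription results. The input shape
# matches ASR pipeline output with return_timestamps; the output schema here
# is illustrative, not the PR's.
import json


def transcription_to_json(result):
    payload = {
        "text": result["text"],
        "words": [
            {"word": c["text"], "start": c["timestamp"][0], "end": c["timestamp"][1]}
            for c in result.get("chunks", [])
        ],
    }
    return json.dumps(payload)


sample = {
    "text": "hello world",
    "chunks": [
        {"text": "hello", "timestamp": (0.0, 0.4)},
        {"text": "world", "timestamp": (0.5, 0.9)},
    ],
}
print(transcription_to_json(sample))
```

Structured JSON output is arguably the key affordance for agents: it makes every endpoint's result machine-parseable without scraping free-form text.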

Video

  1. Classify a video clip into categories

    transformers video-classify --model MCG-NJU/videomae-base-finetuned-kinetics --video clip.mp4

[...]

Multimodal

  1. Answer a question about an image (visual QA)

    transformers vqa --model vikhyatk/moondream2 --image chart.png --question "What is the trend shown?"
  2. Answer a question about a document image (document QA)

    transformers document-qa --model impira/layoutlm-document-qa --image invoice.png --question "What is the total amount?"
  3. Generate a caption for an image

    transformers caption --model vikhyatk/moondream2 --image sunset.jpg

@LysandreJik
Member Author

For potential reviewers: this is very much a draft and hasn't been tested. It is mainly here for discussion to see whether this is something we would like to pursue, and, if so, if you have any specific thoughts regarding how we'd want to approach it. Thank you!

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
