
Proposal: Agent-first CLI #45218

Draft
LysandreJik wants to merge 5 commits into main from agent-first-cli

Conversation


@LysandreJik LysandreJik commented Apr 3, 2026

This PR offers a new, agentic surface for transformers. It applies what other tools do with their CLIs to transformers, taking many common transformers use cases and exposing them as dedicated CLI endpoints.

I recommend reading this first: https://github.com/huggingface/transformers/blob/agent-first-cli/src/transformers/cli/agentic/README.md

In my eyes, the advantage is twofold:

  • We offer a nicer surface for agents to rely on when using transformers. Instead of relying on the docs, understanding our pipelines and their limitations, or working with Python scripts and transformers primitives, we're offering single, dedicated entry points for each relevant task.
  • It clearly documents the usage of all primitives in many different situations. The current approach deliberately does not leverage pipeline: pipelines are great user-facing interfaces, but they're not easy to customize and are, by definition, more limiting than what is used here.

This is a draft for discussion; if we want to move forward, there are some interesting ways to leverage this:

  • This should be entirely automated using agentic tooling and should not be an interface that we maintain by hand. The objective is simple and defined within its own directory and modules, so this would not add overhead for the team beyond some light reviews here and there.
  • The examples here could also be put forward within our documentation and within our CI.
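To make the discussion concrete, here is a minimal sketch of how such subcommands could be registered with standard `argparse` sub-parsers. All names, flags, and defaults below are illustrative assumptions for discussion, not the code in this PR:

```python
# Hypothetical sketch of an agentic CLI dispatcher using argparse sub-parsers.
# Subcommand and flag names mirror the examples in this proposal, but this is
# NOT the PR's actual implementation.
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog="transformers")
    sub = parser.add_subparsers(dest="command", required=True)

    # `transformers classify ...`
    classify = sub.add_parser("classify", help="text classification")
    classify.add_argument("--model", default=None)
    classify.add_argument("--text", required=True)
    classify.add_argument(
        "--labels",
        default=None,
        help="comma-separated labels; presence could switch to zero-shot",
    )

    # `transformers generate ...`
    generate = sub.add_parser("generate", help="text generation")
    generate.add_argument("--model", default=None)
    generate.add_argument("--prompt", required=True)
    generate.add_argument("--stream", action="store_true")

    return parser


args = build_parser().parse_args(
    ["classify", "--text", "The stock market crashed today.",
     "--labels", "politics,finance,sports"]
)
print(args.command, args.labels)
```

One sub-parser per task keeps each endpoint's surface small and independently documentable, which is the property the proposal is after.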

Example commands:

Text Inference

  1. Classify text into categories (supervised)

    transformers classify --model distilbert/distilbert-base-uncased-finetuned-sst-2-english --text "Great movie!"
  2. Classify text into arbitrary categories without training (zero-shot)

    transformers classify --text "The stock market crashed today." --labels "politics,finance,sports"
  3. Extract named entities from text (NER)

    transformers ner --model dslim/bert-base-NER --text "Apple CEO Tim Cook met with President Biden in Washington."
  4. Tag tokens with labels (POS tagging, chunking)

    transformers token-classify --model vblagoje/bert-english-uncased-finetuned-pos --text "The cat sat on the mat."

[...]

Text Generation

  1. Generate text from a prompt

    transformers generate --model meta-llama/Llama-3.2-1B-Instruct --prompt "Once upon a time"
  2. Stream text generation token-by-token

    transformers generate --model meta-llama/Llama-3.2-1B-Instruct --prompt "Hello" --stream
  3. Generate with sampling (temperature, top-p, top-k)

    transformers generate --prompt "The future of AI" --temperature 0.7 --top-p 0.9
  4. Generate with beam search

    transformers generate --prompt "Translate this:" --num-beams 4
  5. Run speculative decoding with a draft model

    transformers generate --model meta-llama/Llama-3.1-8B-Instruct --assistant-model meta-llama/Llama-3.2-1B-Instruct --prompt "Explain gravity."

[...]
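The sampling examples above map naturally onto `generate()` keyword arguments. One detail worth encoding in the CLI: in transformers, `temperature`, `top_p`, and `top_k` only take effect when `do_sample=True`, so the CLI would likely need to set it implicitly. A sketch with a hypothetical helper:

```python
# Sketch of mapping CLI sampling flags onto `model.generate()` kwargs.
# The helper name is an assumption; the do_sample behavior reflects how
# sampling parameters work in transformers.
def generation_kwargs(temperature=None, top_p=None, top_k=None, num_beams=None):
    kwargs = {}
    if num_beams is not None:
        kwargs["num_beams"] = num_beams
    if any(v is not None for v in (temperature, top_p, top_k)):
        kwargs["do_sample"] = True  # sampling flags are no-ops without this
        if temperature is not None:
            kwargs["temperature"] = temperature
        if top_p is not None:
            kwargs["top_p"] = top_p
        if top_k is not None:
            kwargs["top_k"] = top_k
    return kwargs


print(generation_kwargs(temperature=0.7, top_p=0.9))
```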

Vision

  1. Classify an image into categories

    transformers image-classify --model google/vit-base-patch16-224 --image photo.jpg
  2. Classify an image into arbitrary categories without training (zero-shot)

    transformers image-classify --model google/siglip-base-patch16-224 --image photo.jpg --labels "cat,dog,bird,fish"
  3. Detect objects in an image with bounding boxes

    transformers detect --model PekingU/rtdetr_r18vd_coco_o365 --image street.jpg

[...]
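For `detect`, the object-detection pipeline returns a list of dicts shaped like `{"score": ..., "label": ..., "box": {"xmin": ...}}`, and an agent-friendly CLI would plausibly want a score threshold flag. A sketch of that post-processing, with the helper name and `--threshold` flag as assumptions:

```python
# Hypothetical post-processing for `transformers detect`: filter pipeline-style
# detection dicts by confidence before printing. Not the PR's actual code.
def filter_detections(detections, threshold=0.5):
    return [d for d in detections if d["score"] >= threshold]


sample = [
    {"score": 0.97, "label": "car",
     "box": {"xmin": 10, "ymin": 20, "xmax": 200, "ymax": 120}},
    {"score": 0.31, "label": "bird",
     "box": {"xmin": 5, "ymin": 5, "xmax": 30, "ymax": 25}},
]
print(filter_detections(sample))  # keeps only the car
```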

Audio

  1. Transcribe speech to text

    transformers transcribe --model openai/whisper-small --audio recording.wav
  2. Transcribe speech with word-level timestamps

    transformers transcribe --model openai/whisper-small --audio recording.wav --timestamps true --json
  3. Classify an audio clip into categories

    transformers audio-classify --model MIT/ast-finetuned-audioset-10-10-0.4593 --audio clip.wav

[...]
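The `--timestamps true --json` example implies serializing timestamped output. The ASR pipeline with `return_timestamps` yields `{"text": ..., "chunks": [{"text": ..., "timestamp": (start, end)}]}`, and the timestamp tuples need flattening before `json.dumps`. A sketch, with the formatter name and output schema as assumptions:

```python
# Hypothetical --json formatter for transcription results. The input shape
# matches ASR pipeline output with return_timestamps; the output schema here
# is illustrative, not the PR's.
import json


def transcription_to_json(result):
    payload = {
        "text": result["text"],
        "words": [
            {"word": c["text"], "start": c["timestamp"][0], "end": c["timestamp"][1]}
            for c in result.get("chunks", [])
        ],
    }
    return json.dumps(payload)


sample = {
    "text": "hello world",
    "chunks": [
        {"text": "hello", "timestamp": (0.0, 0.4)},
        {"text": "world", "timestamp": (0.5, 0.9)},
    ],
}
print(transcription_to_json(sample))
```

Structured JSON output is arguably the key affordance for agents: it makes every endpoint's result machine-parseable without scraping free-form text.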

Video

  1. Classify a video clip into categories

    transformers video-classify --model MCG-NJU/videomae-base-finetuned-kinetics --video clip.mp4

[...]

Multimodal

  1. Answer a question about an image (visual QA)

    transformers vqa --model vikhyatk/moondream2 --image chart.png --question "What is the trend shown?"
  2. Answer a question about a document image (document QA)

    transformers document-qa --model impira/layoutlm-document-qa --image invoice.png --question "What is the total amount?"
  3. Generate a caption for an image

    transformers caption --model vikhyatk/moondream2 --image sunset.jpg

@LysandreJik
Member Author

For potential reviewers: this is very much a draft and hasn't been tested. It is mainly here for discussion to see whether this is something we would like to pursue, and, if so, if you have any specific thoughts regarding how we'd want to approach it. Thank you!

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
