Principles for agent-native CLIs

(twitter.com)

100 points | by blumpy22 20 hours ago

23 comments

  • Reubend 10 hours ago
    > A nicely aligned table with ANSI colors is for humans. An agent extracting a post ID needs JSON.

    Wrong. While table formatting can confuse an LLM in some cases, a natural language output in pure text is almost always better than JSON for small amounts of data. After all, LLMs have more natural language training data than JSON training data.

    The fallacy that LLMs need machine readable outputs just because they're machines is pervasive and it's a huge misconception about the way these models work.

    On the other hand, I agree that large amounts of data should be outputted in a machine readable way so that the LLM can run scripts over it for more advanced parsing.

    • meander_water 10 hours ago
      Agree re: JSON. But surprisingly, HTML and LaTeX perform slightly better than Markdown for more complex tables.

      Check out this paper - https://arxiv.org/abs/2506.13405

    • cpard 7 hours ago
      I totally agree with what you are saying here and it’s really confusing to me why anyone would think that json is a good format for LLMs. There’s so much redundant text in json. LLMs don’t need that and my experience is that as the document gets bigger it actually hurts the LLM.
    • rolha-capoeira 10 hours ago
      I don't disagree, but I'm wondering if there's any evidence of this available.

      > After all, LLMs have more natural language training data than JSON training data.

      While that is true, data also doesn't usually look like natural language (e.g. a collection of financial records). And when it does (e.g. a collection of chat messages), I wonder if it's more confusing when unstructured, even if small.

      I expect most frontier models to handle these cases just fine either way, so it may largely depend on context—specifically, how much there is, and where the attention shakes out. Ultimately, a claim one way or the other, for something this context-dependent, would have to be backed up by a lot of testing and would probably conclude that, "in most cases, you should do this."

    • iagooar 6 hours ago
      Yes and no. The LLM that sees a JSON structure can decide to use tools to extract and format data as needed, whereas it cannot do the same with natural language.

      The Unix philosophy of small, composable tools is still valid in the era of stochastic machines!

      • Reubend 6 hours ago
        As I said, I agree with you in the case of big outputs. But for small outputs, tool calls can be reliably created from the NL version. There's no need for JSON.
    • well_ackshually 30 minutes ago
      [dead]
  • wolttam 18 hours ago
    Getting agents used to using `--force` to bypass prompts seems like a bad idea. `--force` is for when the action failed (or would fail) for some reason and you want it to definitely happen this time.

    I think `--yes` or `--yes-do-the-dangerous-thing` is leagues better.

    • staticshock 11 hours ago
      A pattern I like for CLIs is that by default each command runs in dry-run mode, and only with `--commit` is it allowed to do dangerous things. Kind of like `git clean` vs `git clean --force`, except that `--force` feels like a bad name for the distinction. Likewise, `--dry-run` implies that the command does the dangerous thing by default, which is bad. `--commit` gets the balance right, it sounds right, and it's sufficiently self-explanatory.

      (Oh, and there's no shorthand, like `-c`. It's `--commit` or bust.)
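A minimal sketch of that pattern with Python's argparse (the `mytool` name and `clean` behavior are hypothetical, not from the comment):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Dry-run by default: the command only mutates state when --commit is passed.
    p = argparse.ArgumentParser(prog="mytool")
    p.add_argument("paths", nargs="+")
    # Deliberately no "-c" shorthand: typing out --commit is the point.
    p.add_argument("--commit", action="store_true",
                   help="actually delete; without it, only print what would happen")
    return p

def clean(paths, commit=False):
    # Returns the actions taken (or that would be taken) for each path.
    actions = []
    for path in paths:
        if commit:
            actions.append(f"removed {path}")
        else:
            actions.append(f"would remove {path} (re-run with --commit)")
    return actions
```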

    • tekacs 18 hours ago
      In the case of an LLM, it can also bias the model toward using that sort of flag more often, which is less than ideal when it then uses a more ordinary Unix command where that flag means something dangerous.
    • dimes 15 hours ago
      CLIs should check isatty and, if it returns false, disable any interactive functionality because it won’t work.
      • rixed 11 hours ago
        Please don't do that; `expect` has to die.
        • dimes 33 minutes ago
          I don’t mean that `expect` should be used. But flags like `--no-interactive` are unnecessary. CLIs can just check `isatty == false` instead of requiring an explicit flag.
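A sketch of that check, assuming a Python CLI (the `confirm` helper is illustrative, not from the thread):

```python
import sys

def interactive() -> bool:
    # Agents, pipes, and CI all give isatty() == False; a human at a
    # terminal gives True. No explicit --no-interactive flag needed.
    return sys.stdin.isatty() and sys.stdout.isatty()

def confirm(prompt: str, assume_yes: bool = False) -> bool:
    if assume_yes:
        return True
    if not interactive():
        # Nobody can answer a prompt here: refuse rather than hang.
        return False
    return input(f"{prompt} [y/N] ").strip().lower() == "y"
```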
    • Pxtl 11 hours ago
      I'm all about the "-ForceDoTheDangerousThing" when I'm making tools (most of my shell scripting is pwsh).

      The naked "-Force" has always been a mistake on even minimally complex tools.

    • ihsw 18 hours ago
      `--non-interactive` has precedent too.
    • hajekt2 17 hours ago
      [flagged]
  • tfrancisl 17 hours ago
    I don't want "agent-native CLIs" to proliferate because I'd rather we design CLIs for human use and programmatic (automation) use first. Agents are good at vomiting JSON between tool calls; I am not, and never will be.

    Too many tools stray so wildly from UNIX principles. If we design for agents first we will likely see more and more of this.

    • theshrike79 17 hours ago
      The point IMO in "agent-native CLIs" is to make them match the statistical average.

      Let the Agent use the CLI and if it guesses the wrong option, you make that the RIGHT option.

      Every time it doesn't guess something right, you change it.

      • pmontra 16 hours ago
        I would naively suppose that the agent is able to read the man page or run the help command of the tool. They usually contain plenty of information. But bending the tool to suit the agent has some value. The GNU-AI suite of userland tools? Unfortunately it's possible that every model will settle on a different average. If that's the case we can't bend to every model. Models will have to bend to whatever we want to use.
        • theshrike79 16 hours ago
          Of course it can read the man page and run cmd --help.

          Now you've wasted context on, what? Learning how to use the tool. And it will waste context on it every single time. (You can write skills to mitigate this a bit, but still).

          The alternative is to make the tool work as the user (an LLM in this case) expects it to work, without having to resort to the manual.

        • riknos314 15 hours ago
          If the parameter names mostly standardize across tools because the models learn to predict those names, then humans will also learn to predict those flag names, so this actually has the potential to make tools more human-friendly and easier to learn.
      • tfrancisl 17 hours ago
        > Let the Agent use the CLI and if it guesses the wrong option, you make that the RIGHT option

        This sounds backwards and presumes that the statistics machines which are LLMs are getting it right when they "average" out to the wrong command. No: fix the agent's behavior, don't change the CLI to accommodate it.

        • rsalus 11 hours ago
          The real solution is to simply provide hints in responses so that the model can self-correct, e.g., recommended next actions, commands to fetch schema definitions, etc.
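One way to sketch that kind of self-correcting error, using the stdlib `difflib` (the command set here is invented for illustration):

```python
import difflib

VALID_COMMANDS = ["get", "list", "create", "delete"]  # hypothetical command set

def resolve(cmd: str) -> str:
    if cmd in VALID_COMMANDS:
        return cmd
    # Enumerate the valid set and suggest the closest match so the
    # model (or human) can self-correct in a single retry.
    close = difflib.get_close_matches(cmd, VALID_COMMANDS, n=1)
    hint = f" Did you mean '{close[0]}'?" if close else ""
    raise SystemExit(
        f"unknown command '{cmd}'. Valid commands: {', '.join(VALID_COMMANDS)}.{hint}"
    )
```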
        • alchemist1e9 17 hours ago
          I don’t remember the specific examples off the top of my head (some are definitely ffmpeg commands), but I do know that when LLMs keep hallucinating command-line flags that don’t exist for a specific command, their “suggestion” is often very reasonable, and so many developers are adding support to their tools for common hallucinations.
          • tfrancisl 16 hours ago
            Not to belabor my point, but I think "adding support to tools for common hallucinations" is a bad idea. Sounds like something a vibecoded project being spammed with issues by agents might do. Not so much a serious, mature project, though.
            • alchemist1e9 16 hours ago
              Well, we will have to agree to disagree. It’s true that LLMs might vibe-coding spam, but the interesting difference is that, generally speaking, their “suggestions” are very reasonable and represent, in hindsight, useful changes that make the commands more useful for everyone, humans included.
              • QuercusMax 12 hours ago
                If an option exists but it's got a poorly named flag, adding a flag alias is probably a good idea for usability in general. Most CLI tools probably don't report telemetry about failed executions, though, cuz that would be very creepy.
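A flag alias is cheap to add with argparse; this is a sketch, and the `--jobs`/`--parallel` pair is an invented example rather than anything from the thread:

```python
import argparse

# The canonical flag is --jobs, but suppose users (and models) keep guessing
# --parallel: accept both option strings, store them under one destination.
p = argparse.ArgumentParser(prog="mytool")
p.add_argument("--jobs", "--parallel", dest="jobs", type=int, default=1,
               help="number of worker processes")
```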
    • alchemist1e9 17 hours ago
      It’s also likely that agents would also be better if they didn’t deal with json vomit either. I’m optimistic that agent frameworks will eventually come full circle and realize concise teletype linear CLIs aka old school UNIX is actually very effective and efficient for agents as well as humans!
  • pseudosavant 14 hours ago
    I'm all in on agent-first CLIs. The CLIs I've been building have still been easier for me to use as a human than the average CLI tool. It isn't like CLI tools have famously simple or consistent arguments from tool to tool anyway.

    I find it so much more successful to have an agent interact with a CLI than an API or MCP. I can just ask: query my dev DB for an ideal URL to test a new page. It'll find the right users, resources, etc and create an excellent test URL to quickly validate the behavior of my changes. I can have it get the latest spec from Confluence, or find the latest PR build for a workitem.

    If you have an API, you should really look at providing a CLI for it too.

    Plugging my tools/examples:

    - https://github.com/pseudosavant/confluence-fetch

    - https://github.com/pseudosavant/azwi

    - https://github.com/pseudosavant/sql-agent-cli

    • rsalus 11 hours ago
      agree, although the pattern I've been following is to provide a self-contained CLI for portability and usability purposes, and then an "mcp" subcommand which launches an MCP server over stdio. ultimately the "CLI" and "MCP" surfaces act as thin facades over the same functional layer.
  • zbentley 15 hours ago
    For reasons other commenters have expressed better than I could, the idea of "agent-native CLIs" seems like a poor one.

    Why not just do the "mycli skill-path" idea from the article, and skip the rest? Basically:

    1. Add regular, for-humans-or-programs flags and modes to your CLI as single-purpose, composable features (otherwise known as "how we've always added lots of features to a CLI without legislating a particular use-case"). Doing this in a messy way makes a messy CLI, same as it ever was. Don't do it in a messy way.

    2. When requested, have the CLI itself, or its manual/website, puke out a skill file which directs agents in how to compose those things for likely LLM uses of the CLI. Talking hardcoded, static text here. Nothing crazy.

    In other words, a "manpage for LLMs" or "manpage-as-skill" option. That's a lot more flexible and easier to change and update than an entire made-for-LLMs behavior layer. So you'd have "man mytool" and "skill mytool" available as separate documents, emphasizing separate capabilities of the same underlying CLI. "skill mytool" would be for use by LLMs or for piping "skill mytool > SKILL.md" or whatever.

    This is a little bit analogous to Git's notion of "porcelain" and "plumbing" (not that Git's a particularly sterling example of composable, friendly UX). The composable or special-case-only APIs still exist for direct use, are dogfooded internally for the human-user-intended paths, and a pre-baked document exists directing LLMs/users in how to use those lower-level details effectively.

    Sure, LLMs can read your manpage/helpdoc, or website, or source code, and figure things out, but that's slow and costs tokens and command-approval loops. This is a marginal efficiency proposal at best, but hopefully one that discourages people from writing bimodal, tortured CLIs just for the sake of LLM-friendliness.

    Is that nuts?

  • debarshri 17 hours ago
    I think every CLI is agent native when invoked from claude or any coding agents.

    I was really surprised today. We at Adaptive [1] build an access management platform for accessing psql, mysql, VMs, k8s, etc. When you use `adaptive connect <db-name>`, it creates a just-in-time tunnel and connects the user to the database. You cannot do traditional psql operations through it; that design is by choice.

    Today I was trying to invoke it via Claude, and, god damn, it found a way to connect. It created a pseudo-shell in Python, passed the queries through it, and treated our CLI like a tool. A human would likely never have done this, partly because you would think about the risks and good practice/bad practice, and would be scared to write and execute code like that. It just did it and achieved the goal.

    [1] https://adaptive.live

  • lacymorrow 11 hours ago
    One thing I'd add from building a shell plugin that routes natural language to agents: the detection heuristic matters way more than flag conventions.

    We spent a lot of time on when to run something as a shell command vs send it to an LLM. The hard lesson: false positives are much worse than false negatives. "git push --force" accidentally going to an LLM instead of executing is the kind of thing that kills user trust instantly. Our heuristic ended up very conservative.

    The bigger surprise was the real-time visual indicator. We added a small color signal showing "this goes to shell" vs "this goes to agent" as you type, and it changed how people wrote more than anything else. Before it, people hedged natural language queries with shell-like syntax just in case. After it, they wrote normally.

    On the isatty point — right for automation. But there's a third mode worth thinking about: "orchestrated interactive," where a human is watching the agent use your CLI and needs to step in. Pure non-interactive breaks that entirely.

  • rahimnathwani 17 hours ago
    This guy took inspiration from gog cli (steipete's cli for Google Workspace, which predates gws cli and is apparently more agent-friendly and token-efficient):

    https://github.com/mvanhorn/cli-printing-press

    He made a whole bunch of agent-friendly CLIs: https://printingpress.dev/

    https://github.com/mvanhorn/printing-press-library/tree/main...

  • qudat 15 hours ago
    The entire concept that we need to cater CLIs to agents at all should tell us how far away they are from being “junior devs” or “an intern” and I reject the premise.

    A lack of structured output has never been a blocker for agents to work, that’s a traditional coding problem.

    “Write good help text and error messages” is just good design which is self evident.

    • rsalus 11 hours ago
      not really.. I never understand the inclination to be reductive. the patterns emerging can be fairly novel.
  • BobbyJo 11 hours ago
    If AI became a forcing function for cross industry semantic consistency in public tooling, I would be so happy.
  • peterldowns 15 hours ago
    This is really good, particularly the async tasks part. Hadn't thought about that. We'll be thinking about these lessons for the next version of our agent CLI.
  • ChrisArchitect 17 hours ago
  • jiehong 17 hours ago
    This reminds me that agents sometimes really like heredoc in shells, and waste tokens retrying with a file.
  • sandermvanvliet 18 hours ago
    Is it me or are all these articles about using AI effectively and building for AI just, you know, things that we should have been doing all along?

    It feels like most of the “rules” are “don’t be an ass to your consumer”.

    • tom_ 15 hours ago
      Doing stuff for other people: generally low-status work, to one degree or another

      Doing stuff for the machine: the behaviour of a pragmatic, nuanced builder. A forward-thinking agentic AI pioneer, executing and shipping at the unexplored boundary of modes of human creativity #building #shipping #executing

    • bensyverson 18 hours ago
      Partially, but I think if you design for agents, their needs are different enough from a human's that you end up making different choices.

      I found myself nodding along to the linked tweet/article. Recently I did many rounds of iterative user-centered design with an agent to improve the CLI interface in Jobs [0], a task manager for LLMs. The resulting CLI follows most of these principles.

      One great idea from the tweet that I will be adding: a `feedback` subcommand, for the agent to capture feedback while they work.

      [0]: https://github.com/bensyverson/jobs

  • isityettime 16 hours ago
    I broadly agree with the article. But I think it's wrong about the failures of past command-line interface design. The author writes:

    > There's a deeper assumption underneath all of it. The classic Command Line Interface Guidelines treat a human at a terminal as the primary user, with agents as a tolerated secondary audience. That's no longer the right default. Cloudflare puts it directly in their post: "Increasingly, agents are the primary customer of our APIs." Their whole schema approach is built around that. HeyGen launched their CLI with "agent" in the marketing copy. Design for agents first, and humans benefit. Designing for humans first and bolting on agent support is what produces the inconsistent, prompt-prone, stdout-only CLIs the first five principles exist to correct.

    I don't think that's true at all. If you're someone who has lived in the terminal for a few years, you will have a sense of taste that naturally leads you to do the right thing. If you've used Git and systemctl and you know why p7zip feels alien on Unix and you have cursed a command where `-h` doesn't mean help, nobody needs to tell you basically any of this. If you've ever met jq, you don't need anyone to tell you that `--json` is a very valuable thing to have. You also don't need anyone to tell you what a uniform hierarchy of flags and options with different scopes should look like; if you've used a program that uses subcommands, even a shitty one, you know what a good one should look like.

    When command-line tools (or inconsistent collections thereof) are difficult for AI in the ways the article describes, it's because they're shit. When command-line tools are shit, it's because nobody is taking the design of those interfaces seriously at all, typically some combination of:

      - the interface isn't "designed" at all, it's just naively evolved.
      - you're leaving writing a CLI tool to someone who tolerates the command-line but doesn't live in it
      - the object is treated as only a human/interactive interface or only a programming interface when in fact it's always both
      - your suite of tools has diffuse ownership and nobody thinks command-line interfaces are important enough to have standards for
    
    If you treat a GUI as unseriously as that, it invariably turns to a pile of shit, too!

    Anybody who ought to be writing one has already internalized all the right norms. Most of it comes for free from living in the shell. Put one person in charge and it'll be uniform. If you can't, writing a style guide and enforcing it with linters and tests is a great idea. But this is just taking command-line interfaces seriously as interfaces. It has pretty much nothing to do with AI except at the edges (e.g., json-flavored companion to --help).

  • walski 17 hours ago
    Definitively super human ultra intelligence by the end of Q4!!!!11 Also not able to use tools, which are not explicitly built for machine consumption.
  • glenngillen 11 hours ago
    So I've been working heavily on redesigning the CLI for where I currently work. I took the approach of building it from the ground up to be agent-first, primarily because we already had a good sense on what was missing for the human sense but any agent implications were entirely unknown. I'm really happy with that decision and we've ended up with a much better human experience as a result too. I plan to write up our experience at some point, but in the interim a few comments on the linked principles.

    When I worked at Heroku basically all of these principles were true (though usually described slightly differently or for different reasons) back then too. These are just good CLI design principles, nothing agent-native about them. Build small sharp commands that don't require interactivity, follow *nix conventions so users can pipe in/out results to build workflows beyond what you initially imagined, provide useful help and examples, if there's a reasonable guess about the next thing a person should do offer it as a suggestion, be consistent in your terminology, be consistent in data format (e.g., don't expect a shortform name of a resource as the input in one place and the integer ID in another), for information that is important for the context in which to execute a command (e.g., which user, which org, etc) provide an environment level config and a per-command config option.

    Just lots of generally helpful advice for people. Turns out it's helpful to agents too.

    Something that seems like agent-specific conventional wisdom that I'm not fully bought into: JSON as the output format. For all but the most trivial outputs the LLM does not actually seem to want JSON output and will instead jump through various hoops to turn it into something it can parse more easily. We experimented with TOON[1] as a format and immediately confirmed the reduced token output claims. However when benchmarking actual real use cases TOON performed worse than both JSON and having the LLM just consume the human output. Digging further into that was eye-opening as it revealed the reason JSON did so well was less to do with the LLM understanding JSON and more its knowledge of the extensive ecosystem around JSON as a format that already exists. Looking at all the various tool calls we could see it'd make heavy use of piping JSON data to `jq`, `cut`, `awk`, `sort`, `wc`, etc. to get the data into the shape it needed. Failing that it would fall back to writing temporary python scripts to get it into the correct shape.

    Capturing all of those logs to understand the performance differences felt like a form of usability testing we used to do at Heroku too. I suddenly saw the way someone (something in this case) was using the tool in ways I didn't entirely expect. Many of them to essentially get answers to perfectly reasonable questions that we should be surfacing in a better way to both humans and agents alike. It's like I managed to squash hundreds of usability tests into a couple of days. It was pretty simple to add additional flexibility into the CLI commands and clearer messaging in other places which drastically reduced the need for the LLM to post-process the data no matter what format they received it in.

    So we still support JSON as a data format because it's genuinely useful for a bunch of reasons. But we also have something more LLM friendly (TOON-like, but not entirely compliant in specific circumstances where we can see it's inefficient) to be as efficient with token usage as we can be. That's about the only agent-only addition to the CLI in the end. Despite building it agent-first, it's helped us get to a better human product.

    [1] https://github.com/toon-format/toon
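The tty-vs-pipe split described above can be sketched like this in Python (purely illustrative; the table layout and flag are assumptions, not this commenter's actual code):

```python
import json
import sys

def emit(rows, force_json=False):
    # One code path decides the format: compact JSON when piped to jq/awk
    # (or when --json is passed), an aligned table when a human is looking.
    if force_json or not sys.stdout.isatty():
        print(json.dumps(rows, separators=(",", ":")))
        return
    cols = list(rows[0])
    widths = {c: max(len(c), *(len(str(r[c])) for r in rows)) for c in cols}
    print("  ".join(c.ljust(widths[c]) for c in cols))
    for r in rows:
        print("  ".join(str(r[c]).ljust(widths[c]) for c in cols))
```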

  • light_hue_1 6 hours ago
    What's remarkable about this list is that every single piece of advice is exactly the opposite of what you should do!

    > 1. Non-interactive by default
    >
    > Commands have to run without interactive prompts when an agent invokes them. When a subagent spawns a background process, there's nothing answering the prompt. The command hangs.

    If a command has a terminal attached to stdin/stdout, you can be in interactive mode. If it doesn't, you can be in non-interactive mode. This isn't even wrong, it's just nonsense.

    > 2. Structured, parseable output
    >
    > A nicely aligned table with ANSI colors is for humans. An agent extracting a post ID needs JSON.

    It's a natural language agent. It obviously doesn't need JSON. And JSON is extremely wasteful in terms of tokens.

    > 3. Errors that teach, and enumerate
    >
    > The original principle was "fail fast with actionable errors." That still holds, with one refinement I missed the first time. When the failure is "you passed an invalid value for X," the error should include the valid set.

    Except that the valid set can be huge. You're much better off describing what's valid. It's a natural language agent. Talk to it!

    > 4. Safe retries and explicit mutation boundaries
    >
    > Agents retry. Humans glance at a duplicate row and notice; agents don't.

    What does this have to do with agents? Yeah, if possible make your operations idempotent. If not, well, .. then don't? Humans will make exactly the same mistakes as agents here.

    > 5. Bounded responses, at every layer
    >
    > Tokens cost money and context. Big outputs are sometimes justified, but the default should be narrow.

    How do you know what your agent needs? Let the agent bound and select. Agents are perfectly capable of sending your output through grep or through head/tail.

    > 6. Cross-CLI vocabulary consistency
    >
    > This is the principle I'm most certain about, and the one most under-stated in the original.
    >
    > Agents don't memorize one CLI at a time. They build a generalized model of what CLIs do, drawn from every CLI they've seen. When your tool uses info for what every other tool calls get, the agent doesn't fail; it succeeds slowly, with extra retries, after burning tokens on --help. Multiply that across thousands of agent invocations per week and the cost is real.

    Agents need to deal with hundreds of CLIs that are all inconsistent. What matters is that you describe to the agent how each CLI works. It doesn't matter if they're consistent.

    > 7. Three-layer introspection
    >
    > The original principle here was "progressive help discovery": top-level --help lists commands, subcommand --help shows usage. That's still true, but it's now the bottom layer of a three-layer stack. Each layer answers a different question.

    Truly the worst advice. Take a simple clean output that's easy to understand in one go and turn it into a crazy complex json that requires multiple inferences to understand. Not only are you wasting compute, you're wasting your time waiting for that compute.

    > 8. Async-aware execution
    >
    > Most CLIs treat async APIs the way the underlying HTTP endpoint does: submit returns a job ID, poll returns a status, that's the agent's problem. Two failure modes follow. Either the agent writes its own poll loop (wasting tokens and getting it subtly wrong), or it doesn't, and the workflow fails because the result wasn't ready when the next step ran.

    No, it got worse. Horrific advice. Take a simple API that an agent can easily wait for in the background and turn it into a stateful monster that can clog everything up with junk. Oh and now when you have multiple agents they get to have a fun conflict.

    > 9. Persistent identity through profiles
    >
    > Agents don't show up once. They show up tomorrow, and the day after, and a week from now, in a different shell, with the same underlying intent and a different specific input. Stateless leaf-shaped CLIs make every invocation re-specify the same eight flags.

    Ok, I was wrong. 8 was bad. 9 is much much worse. It's a guarantee that your agents will get things wrong. Why? Because agents forget all the time!

    > 10. Two-way I/O
    >
    > The original principle 6 (composable and predictable structure) covered stdin/stdout pipelining. That's still true. But agents don't only consume CLIs through pipes, and the CLI doesn't only emit through stdout. There are two new mechanisms worth adding: a way for the CLI to emit artifacts where the agent actually needs them, and a way for the agent to report friction back.

    This literally exists in a form that every single agent knows: bash pipes and redirects. They've been trained on billions of examples of this. Now instead of just using that, you're adding a custom version that will just confuse the agent.

    I'm not sure I could have written a worse list if I tried.

    • terrabitz 1 hour ago
      Yeah, there were a couple of good points, but a lot felt off when I read this article (beyond the fact that it was largely LLM-generated).

      That said, I will say I personally love an optional --wait flag. I've written so many bash scripts where I have to do the status looping manually when all I want is to just do the operation, then do something else once it's complete. For the most part I'm willing to sacrifice a little control there for simplicity.
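The status looping a `--wait` flag replaces looks roughly like this, as a sketch (the state names `pending`/`running` are assumptions, not from any particular API):

```python
import time

def wait_for(get_status, timeout=300.0, interval=2.0):
    # Poll until the job leaves its in-flight states, with a hard deadline
    # so a caller (human script or agent) can never hang forever.
    deadline = time.monotonic() + timeout
    while True:
        status = get_status()
        if status not in ("pending", "running"):
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job not finished after {timeout}s")
        time.sleep(interval)
```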

      I 100% agree with your take on the "Two Way I/O". I hate having to figure out how to coerce tools to give me the right output file when all I want is for them to cleanly write the output to stdout, the progress messages and errors to stderr, and let me deal with how they get redirected. This is a core principle that's existed in CLI tools since forever. Agents and humans are both very capable of stringing together other tools to get the results you want.

  • frb 2 hours ago
    [dead]
  • micalo 10 hours ago
    [flagged]
  • Amber-chen 14 hours ago
    [flagged]
  • arian_ 16 hours ago
    [dead]