Prompt-based GUI and terminal automation #3463
Replies: 6 comments
-
For such an assistant to become usable, it must be both real-time and always-learning. Real-time means it will listen to the user, monitor feedback from the computer, and react accordingly. Always-learning means it will watch and imitate the user's actions, ask the user questions, and search online to learn more.
-
Indeed, that would be very cool. Efforts like this exist, e.g. https://robotme.org/. We'll probably not get into this in the first version, but in subsequent versions it is definitely on the table!
-
Sounds a bit like speech recognition software (e.g. Dragon NaturallySpeaking) that can perform specific actions like clicking somewhere, opening programs, or dictating text, combined with intent recognition like the current voice assistants (Alexa, Siri, ...), but more flexible in what it can understand. The end product could be an app that runs in the background of your PC or smartphone and that you can talk to: ask it any question and command it to do stuff on the computer for you. Here is my research for "Linux Voice Interface": https://pad.nixnet.services/d1W89tL8Qj6-65-UJcp5SA?view Especially check out Almond aka Genie from Stanford. Maybe you can collaborate with them to create an open-source, privacy-preserving voice assistant. Integration with Home Assistant would also be great.
-
It has now been partly implemented: as part of my ideology, the project Cybergod has been released. Here's the program in action: cybergod_with_background.mp4. If anyone is interested in Cybergod, please join the official Discord group.
-
I developed a terminal interaction environment for agents, capable of converting all info from the terminal into meaningful text, including cursor and styling information. The terminal environment can also be captured as an image with the cursor highlighted in red. OpenDevin is working on this right now.
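The cursor-aware text serialization described above can be sketched roughly like this; this is an illustrative reconstruction, not the actual environment code, and the `render_screen` helper and marker character are my own invention:

```python
# Sketch: serialize a terminal screen into plain text with the cursor
# position made explicit, so an LLM agent can "see" where input will go.

def render_screen(lines, cursor_row, cursor_col, marker="█"):
    """Return the screen as text, with the cursor cell replaced by a marker."""
    out = []
    for r, line in enumerate(lines):
        if r == cursor_row:
            # Pad the line so the cursor column always exists.
            padded = line.ljust(cursor_col + 1)
            line = padded[:cursor_col] + marker + padded[cursor_col + 1:]
        out.append(line)
    return "\n".join(out)

screen = ["$ ls", "README.md  src", "$ "]
print(render_screen(screen, cursor_row=2, cursor_col=2))
```

A real implementation would sit behind a terminal emulator that parses the ANSI escape stream and also tracks per-cell styling, but the agent-facing output is essentially this kind of annotated text grid.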
-
GUI and terminal automation as an agent capability is a compelling direction: it's one of the few things that lets AI operate in the real computer environment rather than just at the LLM API layer. A few things we found matter for production GUI automation agents:
- Capability declaration: what can this agent actually click/type vs. what is off-limits? Without explicit capability bounds, an automation agent can end up doing things the user didn't intend. We use capability manifests at spawn time: the agent knows it can interact with [application X, terminal, browser Y] but not [email client, file system outside /tmp].
- Action attribution: if the agent takes a destructive action (deletes a file, sends an email), you need an audit trail that shows exactly what the agent was instructed to do vs. what it inferred was correct. Signed execution receipts for each discrete action.
- Cost and rate limits: screenshot capture plus vision model calls are expensive. Agents need to know their budget for vision calls and decide when to screenshot vs. when to infer state from accessible text.
- Graceful fallback: GUI elements change and applications update. A robust automation agent needs to handle "element not found" without silently failing.
For the terminal automation specifically, we've been building this as part of KinthAI's agent execution environment, i.e. agents that can run terminal commands within a sandboxed context: https://blog.kinthai.ai/openclaw-multi-tenancy-why-vm-per-user-doesnt-scale covers the isolation model. Are you targeting single-user desktop automation or multi-user server-side automation?
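The first two points above (capability manifests and signed execution receipts) could look roughly like this minimal sketch; it assumes nothing about KinthAI's actual API, and every name here (`MANIFEST`, `check_capability`, `execution_receipt`) is hypothetical:

```python
import hashlib
import hmac
import json
import time

# Hypothetical spawn-time capability manifest for one agent.
MANIFEST = {"allowed": {"terminal", "browser"}}
SECRET = b"per-agent signing key"  # would come from the agent runtime

def check_capability(target):
    """Refuse any action outside the declared capability bounds."""
    if target not in MANIFEST["allowed"]:
        raise PermissionError(f"agent has no capability for {target!r}")

def execution_receipt(target, action, instructed_by):
    """Record what the agent did and why, signed so the trail can't be forged."""
    record = {
        "target": target,
        "action": action,
        "instructed_by": instructed_by,  # user instruction vs. agent inference
        "ts": time.time(),
    }
    payload = json.dumps(record, sort_keys=True).encode()
    record["sig"] = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    return record

check_capability("terminal")  # allowed: no exception
receipt = execution_receipt("terminal", "rm /tmp/x", "user")
print(receipt["sig"][:8])
```

An HMAC only proves the receipt was produced by a holder of the key; a production audit trail would more likely use asymmetric signatures so verifiers don't need the signing secret.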


-
I have always wanted to make a bot that executes GUI and terminal tasks like a human, such as "check and clean up disks", "make a funny video and upload it to YouTube", "edit and test this bash script till it is bug-free", "talk to people on Twitter and post ads".
Of course, these tasks can be done by domain-specific software, but since ChatGPT shows promising capabilities and Open-Assistant is working on it, I wonder if it can target human-level computer operations and become a real killer assistant.
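A task like "edit and test this bash script till it is bug-free" is essentially a bounded run-observe-repair loop. A minimal sketch, assuming bash is available and with the model-driven repair step stubbed out (`fix_script` is a hypothetical placeholder):

```python
import os
import subprocess
import tempfile

def run_bash(script_text):
    """Run a bash script, returning (exit_code, stderr)."""
    with tempfile.NamedTemporaryFile("w", suffix=".sh", delete=False) as f:
        f.write(script_text)
        path = f.name
    try:
        proc = subprocess.run(["bash", path], capture_output=True, text=True)
        return proc.returncode, proc.stderr
    finally:
        os.unlink(path)

def fix_script(script_text, stderr):
    """Placeholder: a real assistant would ask an LLM to patch the script."""
    return script_text

script = "echo hello"
for _ in range(3):  # bounded retries, not "till forever"
    code, err = run_bash(script)
    if code == 0:
        break
    script = fix_script(script, err)
```

The retry bound matters: without it, a repair step that never converges would loop indefinitely, which is exactly the kind of silent failure an assistant must avoid.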