Published on 00/00/0000
Last updated on 00/00/0000
Over the last five months, we’ve been building and experimenting with an agent (codename ActionEngine) that can operate across two of the most fundamental digital environments: the browser and the terminal. Through the architecture we built, the design decisions we made, and the lessons we learned, we explored what it means for an agent to observe, plan, and take meaningful action across real tools.
Our work sits alongside efforts like OpenAI Operator, Manus, and Magentic-UI, all of which are looking into how agents can reason about the world and take steps to change it, whether that means running shell commands, filling out web forms, or interacting with legacy interfaces.
We were inspired by a desire to build something useful for Cisco customers, as we noted that network engineers, QA testers, and infrastructure engineers tend to work across a multitude of different tools. Because their work spans so many separate domains and services, from SSH (Secure Shell) terminals and old internal admin panels to SaaS products and intranet services, common workflows often involve a brittle sequence of steps that don’t benefit from modern automation.
It’s also not enough to have an agent that can simply use these tools. We wanted to teach it how to be a useful collaborator by enabling ActionEngine to learn your common workflows on the fly, watching and observing your behavior. By bringing these two pieces together, we hope to bridge an important trust gap that modern agentic apps often leave open.
Like many emerging agentic systems, we leaned heavily into a plan-then-act pattern: the agent builds a to-do list before taking any concrete action. That plan then becomes the agent’s internal scratchpad, a persistent structure that helps it avoid redundant steps, recover from failure, and explain its reasoning throughout a complex, multi-step flow. The plan is also critical to the trust-building process: the user can see, modify, and collaboratively create it, ensuring the agent stays on track and operates within acceptable boundaries.
A planning system like this appears in tools such as Manus and Claude Code, and we drew inspiration from those implementations. Our learning mode leverages the plan system by observing your actions and using that information to build a composable plan. This grants the agent the ability to adapt and refine its approach based on real-time feedback and user interactions.
By continuously learning from the user's behavior, the agent can create more efficient workflows that align closely with the user's preferences and habits.
In practice, this means that as the agent observes the user navigating through tasks, it can identify patterns and suggest optimizations. For instance, if a user frequently performs a specific sequence of actions, the agent can automate that sequence in future interactions, thereby saving time and reducing cognitive load.
In learning mode, the agent watches your actions, identifies recurring patterns, and builds a composable plan from what it observes.
This approach allows our agent to operate with both autonomy and accountability. It knows what it’s supposed to do, and it lets you audit, modify, or insert human-in-the-loop approvals anywhere along the way.
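As a rough illustration, the plan scratchpad described above can be modeled as an ordered list of steps with statuses. The class and field names here are hypothetical, not ActionEngine’s actual schema:

```python
from dataclasses import dataclass, field

# Hypothetical sketch of a plan-then-act scratchpad; names are illustrative,
# not ActionEngine's actual data model.
@dataclass
class PlanStep:
    description: str
    status: str = "pending"  # pending | in_progress | blocked | complete

@dataclass
class Plan:
    goal: str
    steps: list = field(default_factory=list)

    def add_step(self, description: str) -> None:
        self.steps.append(PlanStep(description))

    def mark_step(self, index: int, status: str) -> None:
        # Persisting status lets the agent skip finished work, recover from
        # failure, and explain where it is in a multi-step flow.
        self.steps[index].status = status

    def next_pending(self):
        return next((s for s in self.steps if s.status == "pending"), None)

plan = Plan(goal="Renew TLS cert on staging")
plan.add_step("SSH into staging host")
plan.add_step("Run certbot renew")
plan.mark_step(0, "complete")
```

Because the structure persists between turns, a human can also inspect or edit it before the agent acts, which is where the human-in-the-loop approvals hook in.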
One of the hardest parts about building agents that operate across rich, complex environments is context.
Web pages are chaotic. Dumping raw HTML into a large language model (LLM) rarely gives good results: it’s too much information, most of it is irrelevant, and it’s not structured in a way that the model can use effectively. We explored a few different techniques to compress and prioritize what goes into the context. This process involves a lot of prompt engineering, reducing a huge amount of raw page state into something that is actionable and easily interpretable by the agent. One approach, taken by the browser-use library, flattens the DOM into a numbered list of interactive elements:
0[:]<a title=Cisco.com Worldwide #fw-c-header__logo href="https://www.cisco.com">
1[:]<button #accordion-825b090c07-item-80272c0e79-button text="Products and Services" aria-label="Products and Services" aria-expanded="false">
2[:]<button #accordion-825b090c07-item-f57cb5a340-button text="Solutions" aria-label="Solutions" aria-expanded="false">
3[:]<button #accordion-825b090c07-item-803d0a291a-button text="Support" aria-label="Support" aria-expanded="false">
4[:]<button #accordion-825b090c07-item-3265282a49-button text="Learn" aria-label="Learn" aria-expanded="false">
...
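Producing an indexed listing like the one above can be sketched with the standard-library HTML parser. This is a minimal illustration of the flattening idea, not browser-use’s actual implementation:

```python
from html.parser import HTMLParser

# Minimal sketch of flattening a page into an indexed list of interactive
# elements, similar in spirit to the browser-use representation shown above.
class InteractiveElementParser(HTMLParser):
    INTERACTIVE = {"a", "button", "input", "select", "textarea"}

    def __init__(self):
        super().__init__()
        self.elements = []

    def handle_starttag(self, tag, attrs):
        if tag in self.INTERACTIVE:
            # Keep only attributes with values; each element gets a numeric
            # index the model can reference when choosing an action.
            attr_str = " ".join(f'{k}="{v}"' for k, v in attrs if v is not None)
            self.elements.append(f"{len(self.elements)}[:]<{tag} {attr_str}>")

parser = InteractiveElementParser()
parser.feed('<div><a href="https://www.cisco.com">Home</a>'
            '<button aria-expanded="false">Solutions</button></div>')
for line in parser.elements:
    print(line)
```

A real implementation also needs visibility checks, text extraction, and deduplication, which is exactly where the compression trade-offs discussed below come from.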
We also experimented with tools like r.jina.ai which can turn any webpage into a Markdown format. This is great for reading comprehension since Markdown is naturally compatible with LLMs. This format is best for web documents like blog posts, READMEs, or wikis.
Title: AI Infrastructure, Secure Networking, and Software Solutions
URL Source: http://cisco.com/
Markdown Content:
AI Infrastructure, Secure Networking, and Software Solutions - Cisco
===============
* [Skip to main content](http://cisco.com/#fw-c-content)
* [Skip to search](http://cisco.com/#fw-c-header__button--search)
* [Skip to footer](http://cisco.com/#fw-c-footer)
[](https://www.cisco.com/ "Cisco.com Worldwide")
### Products and Services
Back
Products and Services
Close
[Products and Services Home](https://www.cisco.com/c/en/us/products/index.html)
...
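Fetching a Markdown view like the one above is a one-liner against the Jina Reader service, which returns a Markdown rendering of a page when you prefix its URL with `https://r.jina.ai/`. A minimal sketch:

```python
import urllib.request

# The Jina Reader service renders a page as Markdown when its URL is
# prefixed with https://r.jina.ai/ .
def reader_url(page_url: str) -> str:
    return "https://r.jina.ai/" + page_url

def fetch_markdown(page_url: str) -> str:
    # Network call; run only where outbound HTTP is allowed.
    with urllib.request.urlopen(reader_url(page_url)) as resp:
        return resp.read().decode("utf-8")

# Example (requires network access):
# md = fetch_markdown("https://www.cisco.com/")
```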
Each of these approaches has trade-offs. Browser-use gives structure but little visibility into the surrounding text; Markdown gives detail but no clickable affordances. We’re still iterating on this, but the key engineering challenge is to compress context intelligently: give the model just enough that it knows how to act, but not so much that it becomes overwhelmed.
In building ActionEngine, we settled on a multi-agent flow of isolated nodes, each responsible for a small subset of the overall agentic process.
We created three separate agents: a “planning” agent, a “thinking” agent, and an “executor” agent. Each has its own system prompt and context, allowing it to think broadly about the task at hand while being responsible only for its own decisions.
The planning agent is responsible for long-term thinking. It’s the “supervisor”: it manages access to the planning tool, which lets the application create and manage the to-do lists.
It can create new plans, call mark_steps to mark individual steps as complete, in_progress, blocked, or finished, or call update_plan to change the plan drastically.
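As an illustration, these three planning-tool actions could be exposed to the model as function-calling schemas. The parameter names below are hypothetical, not ActionEngine’s actual tool definitions:

```python
# Hypothetical function-calling schemas for the three planning-tool actions
# described above; ActionEngine's actual parameter names may differ.
PLANNING_TOOLS = [
    {
        "name": "create_plan",
        "description": "Create a new to-do list for the current task.",
        "parameters": {
            "type": "object",
            "properties": {
                "steps": {"type": "array", "items": {"type": "string"}},
            },
            "required": ["steps"],
        },
    },
    {
        "name": "mark_steps",
        "description": "Set the status of individual plan steps.",
        "parameters": {
            "type": "object",
            "properties": {
                "indices": {"type": "array", "items": {"type": "integer"}},
                "status": {
                    "type": "string",
                    "enum": ["complete", "in_progress", "blocked", "finished"],
                },
            },
            "required": ["indices", "status"],
        },
    },
    {
        "name": "update_plan",
        "description": "Rewrite the plan when the approach changes drastically.",
        "parameters": {
            "type": "object",
            "properties": {
                "steps": {"type": "array", "items": {"type": "string"}},
            },
            "required": ["steps"],
        },
    },
]
```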
The thinking agent is the “brain” and communication layer of the application. It doesn’t access any tools but instead uses structured outputs to generate a brain state of thoughts.
These thoughts then surface up to the user so they can follow along with the decision-making process of the autonomous agent. For example, the agent may say, “Hmm, I’m not seeing the right element on this page,” which can help the user understand why the agent is behaving in a particular way.
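The “brain state” the thinking agent emits can be pictured as a small structured-output record. The field names here are illustrative assumptions, not the actual schema:

```python
from dataclasses import dataclass

# Hypothetical structured output for the thinking agent's "brain state";
# field names are illustrative.
@dataclass
class BrainState:
    observation: str   # what the agent currently sees
    reasoning: str     # why it matters for the task
    next_intent: str   # what it plans to try next

state = BrainState(
    observation="I'm not seeing the right element on this page.",
    reasoning="The login form may be hidden behind an expandable menu.",
    next_intent="Click the 'Support' accordion button and re-scan the DOM.",
)
# Each field surfaces to the user so they can follow the decision-making.
print(state.observation)
```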
The executor agent is the “engineer.” It has access to the terminal tool, the browser_use tool, and the terminate tool. It receives the full context of all the environments the application oversees: the web page, the terminal states, the plan, and the agent’s thoughts. It uses all of this information to determine the next best step, and it is responsible for deciding when the task is fully complete and when to hand control back to the user.
As of now, each of these agents is called sequentially in a loop during the flow, but we are eager to experiment with different ways of structuring these agents to see if they lead to better outcomes.
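The sequential loop over the three agents can be sketched as follows, with the LLM-backed agents stubbed out as plain callables for illustration:

```python
# Sketch of the sequential three-agent loop; planner, thinker, and executor
# stand in for LLM-backed agents and are stubbed here for illustration.
def run_agent_loop(task, planner, thinker, executor, max_turns=20):
    context = {"task": task, "plan": None, "thoughts": [], "done": False}
    for _ in range(max_turns):
        context["plan"] = planner(context)            # supervisor: long-term plan
        context["thoughts"].append(thinker(context))  # brain state, shown to user
        context = executor(context)                   # engineer: concrete action
        if context["done"]:
            break
    return context

# Dry run with stub agents:
result = run_agent_loop(
    "say hello",
    planner=lambda ctx: ctx["plan"] or ["print greeting"],
    thinker=lambda ctx: "The plan has one step; executing it now.",
    executor=lambda ctx: {**ctx, "done": True},
)
```

Restructuring this loop (e.g., letting the planner run only when the executor reports a blocked step) is the kind of experiment mentioned above.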
We also wanted this agent to be able to play well with others. That’s where the AGNTCY Agent Connect Protocol (ACP) comes in.
ACP is a shared spec for how agents expose their abilities via REST. It’s framework-agnostic, so you can use LangGraph, AutoGen, CrewAI, or any other system; if you follow the ACP pattern, your agent can interoperate with any other ACP-compliant agent.
This interoperability is a huge win for composability. It means you can mix and match components and avoid lock-in to any particular agent framework. To make implementation easier, we leveraged our own WorkflowSrv, a package aimed at building and deploying ACP-compliant servers with minimal setup.
If you’re using AGNTCY components already within your system, it would be easy to extend your multi-agent application with browser/terminal use capabilities by communicating with ActionEngine over ACP.
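To make the idea of exposing agent abilities over REST concrete, here is a hypothetical request dispatcher. The routes and payload shapes below are invented for illustration; the AGNTCY ACP spec defines the real ones:

```python
# Hypothetical REST dispatcher illustrating an agent exposed over HTTP.
# Routes and payloads are illustrative only; consult the AGNTCY ACP spec
# for the actual interface.
def handle_request(method, path, body=None):
    if method == "GET" and path == "/capabilities":
        # Advertise what this agent can do so peers can discover it.
        return 200, {"tools": ["browser_use", "terminal", "terminate"]}
    if method == "POST" and path == "/runs":
        task = (body or {}).get("task")
        if not task:
            return 400, {"error": "missing 'task'"}
        # A real server would enqueue the task for the agent loop here.
        return 202, {"run_id": "run-1", "status": "accepted"}
    return 404, {"error": "not found"}

status, payload = handle_request("POST", "/runs", {"task": "check router config"})
```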
Security was top of mind throughout this project. Because the agent is executing terminal commands and operating within a browser, we wanted to make sure it ran in a completely isolated environment.
We sandbox everything via Docker containers, and each agent instance runs in its own isolated environment.
This isn’t novel on its own, but it’s a core part of why we felt comfortable using and developing this tool internally. And it gives us the flexibility to build toward multi-tenant or ephemeral agent instances in the future.
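A per-instance sandbox launch might look like the following. The flags shown are standard Docker options, but the image name and the specific restrictions are assumptions, not ActionEngine’s actual configuration (a browser agent would, for instance, need some network access rather than `--network none`):

```python
# Sketch of launching one sandboxed agent instance via Docker.
# Image name and restriction flags are illustrative assumptions.
def sandbox_command(image: str, name: str) -> list:
    return [
        "docker", "run", "--rm", "--detach",
        "--name", name,
        "--network", "none",   # no network unless explicitly granted
        "--memory", "1g",      # cap resources per instance
        "--cap-drop", "ALL",   # drop Linux capabilities
        image,
    ]

cmd = sandbox_command("actionengine-sandbox:latest", "agent-1")
# import subprocess; subprocess.run(cmd, check=True)  # where Docker is available
```

Ephemeral instances then reduce to starting and discarding containers per task, which is what makes the multi-tenant direction plausible.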
To evaluate ActionEngine, we leverage the Mind2Web benchmark, focusing on the action_repr data for our metric calculation. Our evaluation process involves two key phases: executing tasks and collecting run IDs, followed by processing traces and evaluating the agent's performance.
After a task is completed, we retrieve the trace data and extract action representations from the agent's tool calls. These extracted actions are then compared against expected actions using the G-Eval metric from DeepEval. G-Eval, an LLM-as-a-judge approach, allows us to define "good" performance in natural language, and DeepEval handles the underlying scoring.
Our evaluation criteria account for equivalences in element types and actions, ensuring we focus on semantic interaction rather than identical HTML tags. This approach acknowledges the variations in action_reprs due to our custom filtering logic that aims to align them with the Mind2Web format.
We use the correctness metric from G-Eval, and the results provide a score along with a brief explanation to aid in debugging. For our dataset, we specifically select a subset of websites that do not include robot/human verification or cookie acceptance, as these can impede agent execution.
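Before handing traces to the LLM judge, the extracted and expected actions need to be normalized so that equivalent element types compare equal. The `action_repr` format and the equivalence groups below are illustrative sketches, not our exact filtering logic (the real scoring goes through DeepEval’s G-Eval):

```python
import re

# Sketch of normalizing and pre-comparing action representations; the
# action_repr format and tag equivalences shown are illustrative.
def normalize(action_repr: str) -> tuple:
    # e.g. "[button]  Solutions -> CLICK" -> ("button", "solutions", "click")
    m = re.match(r"\[(\w+)\]\s*(.*?)\s*->\s*(\w+)", action_repr)
    if not m:
        return ("", action_repr.strip().lower(), "")
    tag, text, op = m.groups()
    return (tag.lower(), text.strip().lower(), op.lower())

def semantically_equal(predicted, expected, tag_equivalences=({"a", "link"},)):
    # Treat equivalent element types (e.g. <a> vs "link") as a match so the
    # comparison focuses on the semantic interaction, not identical HTML tags.
    p, e = normalize(predicted), normalize(expected)
    tags_match = p[0] == e[0] or any(
        p[0] in group and e[0] in group for group in tag_equivalences
    )
    return tags_match and p[1] == e[1] and p[2] == e[2]

print(semantically_equal("[a]  Support -> CLICK", "[link] Support -> CLICK"))  # True
```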
Right now, we consider this a research project. We’re not pushing for adoption, monetization, or sustained development. It was months of deep exploration into how agents could move beyond chat and into the real, multi-surface workflows we all live in.
We’ve made the code available here on GitHub, and we’re happy to chat with anyone who’s curious, skeptical, or inspired. At the very least, we hope this adds to the ongoing conversation about what kind of tools agents will need to thrive in messy, real-world environments.
The future of AI agents isn’t just about being smart. It’s about being useful, composable, and learnable. That’s the bet we made here, and we’re excited to see where it leads.