Published on 00/00/0000
Last updated on 00/00/0000
Over the last five months, we’ve been building and experimenting with an agent (codename ActionEngine) that can operate across two of the most fundamental digital environments: the browser and the terminal. Through the architecture we built, the design decisions we made, and the lessons we learned, we explored what it means for an agent to observe, plan, and take meaningful action across real tools.
Our work sits alongside efforts like OpenAI Operator, Manus, and Magentic-UI, all of which are looking into how agents can reason about the world and take steps to change it, whether that means running shell commands, filling out web forms, or interacting with legacy interfaces.
We were inspired by a desire to build something useful for Cisco customers, as we noted that network engineers, QA testers, and infrastructure engineers tend to work across a multitude of different tools. Because their work spans so many separate domains and services, from SSH (Secure Shell) terminals and old internal admin panels to SaaS products and intranet services, common workflows often involve a brittle sequence of steps that don’t benefit from modern automation.
It’s also not enough to have an agent that can simply use these tools. We wanted to teach it how to be a useful collaborator by enabling ActionEngine to learn your common workflows on the fly, watching and observing your behavior. By bringing these two pieces together, we hope to bridge an important trust gap that modern agentic apps often leave open.
Like many emerging agentic systems, we leaned heavily into a plan-then-act pattern: the agent builds a to-do list before taking any concrete action. That plan then becomes the agent’s internal scratchpad, a persistent structure that helps it avoid redundant steps, recover from failure, and explain its reasoning throughout a complex, multi-step flow. The plan is also critical to the trust-building process: the user can see, modify, and collaboratively create it, ensuring the agent stays on track and operates within acceptable boundaries.
A planning system like this appears in tools such as Manus and Claude Code, and we drew inspiration from those implementations. Our learning mode leverages the plan system by observing your actions and using that information to build a composable plan. This grants the agent the ability to adapt and refine its approach based on real-time feedback and user interactions.
By continuously learning from the user's behavior, the agent can create more efficient workflows that align closely with the user's preferences and habits.
In practice, this means that as the agent observes the user navigating through tasks, it can identify patterns and suggest optimizations. For instance, if a user frequently performs a specific sequence of actions, the agent can automate that sequence in future interactions, thereby saving time and reducing cognitive load.
In learning mode, the agent watches your actions, identifies recurring patterns, and builds a composable plan from what it observes.
This approach allows our agent to operate with both autonomy and accountability. It knows what it’s supposed to do, and it lets you audit, modify, or insert human-in-the-loop approvals anywhere along the way.
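As a rough illustration, the plan scratchpad described above can be modeled as an ordered list of steps with statuses. The class and field names here are hypothetical, not ActionEngine’s actual schema:

```python
from dataclasses import dataclass, field

# Hypothetical sketch of a plan-then-act scratchpad; names are illustrative,
# not ActionEngine's actual data model.
@dataclass
class PlanStep:
    description: str
    status: str = "pending"  # pending | in_progress | blocked | complete

@dataclass
class Plan:
    goal: str
    steps: list = field(default_factory=list)

    def add_step(self, description: str) -> None:
        self.steps.append(PlanStep(description))

    def mark_step(self, index: int, status: str) -> None:
        # Persisting status lets the agent skip finished work, recover from
        # failure, and explain where it is in a multi-step flow.
        self.steps[index].status = status

    def next_pending(self):
        return next((s for s in self.steps if s.status == "pending"), None)

plan = Plan(goal="Renew TLS cert on staging")
plan.add_step("SSH into staging host")
plan.add_step("Run certbot renew")
plan.mark_step(0, "complete")
```

Because the structure persists between turns, a human can also inspect or edit it before the agent acts, which is where the human-in-the-loop approvals hook in.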
One of the hardest parts about building agents that operate across rich, complex environments is context.
Web pages are chaotic. Dumping raw HTML into a large language model (LLM) rarely gives good results: it’s too much information, most of it is irrelevant, and it’s not structured in a way that the model can use effectively. We explored a few different techniques to compress and prioritize what goes into the context. This process involves a lot of prompt engineering, reducing a huge amount of raw page state into something that is actionable and easily interpretable by the agent. One approach, taken by the browser-use library, flattens the DOM into a numbered list of interactive elements:
0[:]<a title=Cisco.com Worldwide #fw-c-header__logo href="https://www.cisco.com">
1[:]<button #accordion-825b090c07-item-80272c0e79-button text="Products and Services" aria-label="Products and Services" aria-expanded="false">
2[:]<button #accordion-825b090c07-item-f57cb5a340-button text="Solutions" aria-label="Solutions" aria-expanded="false">
3[:]<button #accordion-825b090c07-item-803d0a291a-button text="Support" aria-label="Support" aria-expanded="false">
4[:]<button #accordion-825b090c07-item-3265282a49-button text="Learn" aria-label="Learn" aria-expanded="false">
...
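Producing an indexed listing like the one above can be sketched with the standard-library HTML parser. This is a minimal illustration of the flattening idea, not browser-use’s actual implementation:

```python
from html.parser import HTMLParser

# Minimal sketch of flattening a page into an indexed list of interactive
# elements, similar in spirit to the browser-use representation shown above.
class InteractiveElementParser(HTMLParser):
    INTERACTIVE = {"a", "button", "input", "select", "textarea"}

    def __init__(self):
        super().__init__()
        self.elements = []

    def handle_starttag(self, tag, attrs):
        if tag in self.INTERACTIVE:
            # Keep only attributes with values; each element gets a numeric
            # index the model can reference when choosing an action.
            attr_str = " ".join(f'{k}="{v}"' for k, v in attrs if v is not None)
            self.elements.append(f"{len(self.elements)}[:]<{tag} {attr_str}>")

parser = InteractiveElementParser()
parser.feed('<div><a href="https://www.cisco.com">Home</a>'
            '<button aria-expanded="false">Solutions</button></div>')
for line in parser.elements:
    print(line)
```

A real implementation also needs visibility checks, text extraction, and deduplication, which is exactly where the compression trade-offs discussed below come from.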
We also experimented with tools like r.jina.ai which can turn any webpage into a Markdown format. This is great for reading comprehension since Markdown is naturally compatible with LLMs. This format is best for web documents like blog posts, READMEs, or wikis.
Title: AI Infrastructure, Secure Networking, and Software Solutions
URL Source: http://cisco.com/
Markdown Content:
AI Infrastructure, Secure Networking, and Software Solutions - Cisco
===============
* [Skip to main content](http://cisco.com/#fw-c-content)
* [Skip to search](http://cisco.com/#fw-c-header__button--search)
* [Skip to footer](http://cisco.com/#fw-c-footer)
[](https://www.cisco.com/ "Cisco.com Worldwide")
### Products and Services
Back
Products and Services
Close
[Products and Services Home](https://www.cisco.com/c/en/us/products/index.html)
...
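Fetching a Markdown view like the one above is a one-liner against the Jina Reader service, which returns a Markdown rendering of a page when you prefix its URL with `https://r.jina.ai/`. A minimal sketch:

```python
import urllib.request

# The Jina Reader service renders a page as Markdown when its URL is
# prefixed with https://r.jina.ai/ .
def reader_url(page_url: str) -> str:
    return "https://r.jina.ai/" + page_url

def fetch_markdown(page_url: str) -> str:
    # Network call; run only where outbound HTTP is allowed.
    with urllib.request.urlopen(reader_url(page_url)) as resp:
        return resp.read().decode("utf-8")

# Example (requires network access):
# md = fetch_markdown("https://www.cisco.com/")
```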
Each of these approaches has trade-offs. Browser-use gives structure but little visibility into the surrounding text; Markdown gives detail but no clickable affordances. We’re still iterating on this, but the key engineering challenge is to compress context intelligently: give the model just enough that it knows how to act, but not so much that it becomes overwhelmed.
In building ActionEngine, we settled on a multi-agent flow of isolated nodes, each responsible for a small subset of the overall agentic process.
We created three separate agents: a “planning” agent, a “thinking” agent, and an “executor” agent. Each has its own system prompt and context, allowing it to think broadly about the task at hand while being responsible only for its own decisions.
The planning agent is responsible for long-term thinking. It’s the “supervisor”: it manages access to the planning tool, which lets the application create and manage the to-do lists.
It can create new plans, call mark_steps to mark individual steps as complete, in_progress, blocked, or finished, or call update_plan to change the plan drastically.
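As an illustration, these three planning-tool actions could be exposed to the model as function-calling schemas. The parameter names below are hypothetical, not ActionEngine’s actual tool definitions:

```python
# Hypothetical function-calling schemas for the three planning-tool actions
# described above; ActionEngine's actual parameter names may differ.
PLANNING_TOOLS = [
    {
        "name": "create_plan",
        "description": "Create a new to-do list for the current task.",
        "parameters": {
            "type": "object",
            "properties": {
                "steps": {"type": "array", "items": {"type": "string"}},
            },
            "required": ["steps"],
        },
    },
    {
        "name": "mark_steps",
        "description": "Set the status of individual plan steps.",
        "parameters": {
            "type": "object",
            "properties": {
                "indices": {"type": "array", "items": {"type": "integer"}},
                "status": {
                    "type": "string",
                    "enum": ["complete", "in_progress", "blocked", "finished"],
                },
            },
            "required": ["indices", "status"],
        },
    },
    {
        "name": "update_plan",
        "description": "Rewrite the plan when the approach changes drastically.",
        "parameters": {
            "type": "object",
            "properties": {
                "steps": {"type": "array", "items": {"type": "string"}},
            },
            "required": ["steps"],
        },
    },
]
```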
The thinking agent is the “brain” and communication layer of the application. It doesn’t access any tools but instead uses structured outputs to generate a brain state of thoughts.
These thoughts then surface up to the user so they can follow along with the decision-making process of the autonomous agent. For example, the agent may say, “Hmm, I’m not seeing the right element on this page,” which can help the user understand why the agent is behaving in a particular way.
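The “brain state” the thinking agent emits can be pictured as a small structured-output record. The field names here are illustrative assumptions, not the actual schema:

```python
from dataclasses import dataclass

# Hypothetical structured output for the thinking agent's "brain state";
# field names are illustrative.
@dataclass
class BrainState:
    observation: str   # what the agent currently sees
    reasoning: str     # why it matters for the task
    next_intent: str   # what it plans to try next

state = BrainState(
    observation="I'm not seeing the right element on this page.",
    reasoning="The login form may be hidden behind an expandable menu.",
    next_intent="Click the 'Support' accordion button and re-scan the DOM.",
)
# Each field surfaces to the user so they can follow the decision-making.
print(state.observation)
```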
The executor agent is the “engineer.” It has access to the terminal tool, the browser_use tool, and the terminate tool. It receives the full context of all the environments the application oversees: the web page, the terminal states, the plan, and the agent’s thoughts. It uses all of this information to determine the next best step, and it is responsible for deciding when the task is fully complete and when to hand control back to the user.
As of now, each of these agents is called sequentially in a loop during the flow, but we are eager to experiment with different ways of structuring these agents to see if they lead to better outcomes.
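The sequential loop over the three agents can be sketched as follows, with the LLM-backed agents stubbed out as plain callables for illustration:

```python
# Sketch of the sequential three-agent loop; planner, thinker, and executor
# stand in for LLM-backed agents and are stubbed here for illustration.
def run_agent_loop(task, planner, thinker, executor, max_turns=20):
    context = {"task": task, "plan": None, "thoughts": [], "done": False}
    for _ in range(max_turns):
        context["plan"] = planner(context)            # supervisor: long-term plan
        context["thoughts"].append(thinker(context))  # brain state, shown to user
        context = executor(context)                   # engineer: concrete action
        if context["done"]:
            break
    return context

# Dry run with stub agents:
result = run_agent_loop(
    "say hello",
    planner=lambda ctx: ctx["plan"] or ["print greeting"],
    thinker=lambda ctx: "The plan has one step; executing it now.",
    executor=lambda ctx: {**ctx, "done": True},
)
```

Restructuring this loop (e.g., letting the planner run only when the executor reports a blocked step) is the kind of experiment mentioned above.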
We also wanted this agent to be able to play well with others. That’s where the AGNTCY Agent Connect Protocol (ACP) comes in.
ACP is a shared spec for how agents expose their abilities via REST. It’s framework-agnostic, so you can use LangGraph, AutoGen, CrewAI, or any other system; if you follow the ACP pattern, your agent can interoperate with any other ACP-compliant agent.
This interoperability is a huge win for composability. It means you can mix and match components and avoid lock-in to any particular agent framework. To make implementation easier, we leveraged our own WorkflowSrv, a package aimed at building and deploying ACP-compliant servers with minimal setup.
If you’re using AGNTCY components already within your system, it would be easy to extend your multi-agent application with browser/terminal use capabilities by communicating with ActionEngine over ACP.
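To make the idea of exposing agent abilities over REST concrete, here is a hypothetical request dispatcher. The routes and payload shapes below are invented for illustration; the AGNTCY ACP spec defines the real ones:

```python
# Hypothetical REST dispatcher illustrating an agent exposed over HTTP.
# Routes and payloads are illustrative only; consult the AGNTCY ACP spec
# for the actual interface.
def handle_request(method, path, body=None):
    if method == "GET" and path == "/capabilities":
        # Advertise what this agent can do so peers can discover it.
        return 200, {"tools": ["browser_use", "terminal", "terminate"]}
    if method == "POST" and path == "/runs":
        task = (body or {}).get("task")
        if not task:
            return 400, {"error": "missing 'task'"}
        # A real server would enqueue the task for the agent loop here.
        return 202, {"run_id": "run-1", "status": "accepted"}
    return 404, {"error": "not found"}

status, payload = handle_request("POST", "/runs", {"task": "check router config"})
```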
Security was top of mind throughout this project. Because the agent is executing terminal commands and operating within a browser, we wanted to make sure it ran in a completely isolated environment.
We sandbox everything via Docker containers, and each agent instance runs in its own isolated environment.
This isn’t novel on its own, but it’s a core part of why we felt comfortable using and developing this tool internally. And it gives us the flexibility to build toward multi-tenant or ephemeral agent instances in the future.
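A per-instance sandbox launch might look like the following. The flags shown are standard Docker options, but the image name and the specific restrictions are assumptions, not ActionEngine’s actual configuration (a browser agent would, for instance, need some network access rather than `--network none`):

```python
# Sketch of launching one sandboxed agent instance via Docker.
# Image name and restriction flags are illustrative assumptions.
def sandbox_command(image: str, name: str) -> list:
    return [
        "docker", "run", "--rm", "--detach",
        "--name", name,
        "--network", "none",   # no network unless explicitly granted
        "--memory", "1g",      # cap resources per instance
        "--cap-drop", "ALL",   # drop Linux capabilities
        image,
    ]

cmd = sandbox_command("actionengine-sandbox:latest", "agent-1")
# import subprocess; subprocess.run(cmd, check=True)  # where Docker is available
```

Ephemeral instances then reduce to starting and discarding containers per task, which is what makes the multi-tenant direction plausible.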
To evaluate ActionEngine, we leverage the Mind2Web benchmark, focusing on the action_repr data for our metric calculation. Our evaluation process involves two key phases: executing tasks and collecting run IDs, followed by processing traces and evaluating the agent's performance.
After a task is completed, we retrieve the trace data and extract action representations from the agent's tool calls. These extracted actions are then compared against expected actions using the G-Eval metric from DeepEval. G-Eval, an LLM-as-a-judge approach, allows us to define "good" performance in natural language, and DeepEval handles the underlying scoring.
Our evaluation criteria account for equivalences in element types and actions, ensuring we focus on semantic interaction rather than identical HTML tags. This approach acknowledges the variations in action_reprs due to our custom filtering logic that aims to align them with the Mind2Web format.
We use the correctness metric from G-Eval, and the results provide a score along with a brief explanation to aid in debugging. For our dataset, we specifically select a subset of websites that do not include robot/human verification or cookie acceptance, as these can impede agent execution.
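Before handing traces to the LLM judge, the extracted and expected actions need to be normalized so that equivalent element types compare equal. The `action_repr` format and the equivalence groups below are illustrative sketches, not our exact filtering logic (the real scoring goes through DeepEval’s G-Eval):

```python
import re

# Sketch of normalizing and pre-comparing action representations; the
# action_repr format and tag equivalences shown are illustrative.
def normalize(action_repr: str) -> tuple:
    # e.g. "[button]  Solutions -> CLICK" -> ("button", "solutions", "click")
    m = re.match(r"\[(\w+)\]\s*(.*?)\s*->\s*(\w+)", action_repr)
    if not m:
        return ("", action_repr.strip().lower(), "")
    tag, text, op = m.groups()
    return (tag.lower(), text.strip().lower(), op.lower())

def semantically_equal(predicted, expected, tag_equivalences=({"a", "link"},)):
    # Treat equivalent element types (e.g. <a> vs "link") as a match so the
    # comparison focuses on the semantic interaction, not identical HTML tags.
    p, e = normalize(predicted), normalize(expected)
    tags_match = p[0] == e[0] or any(
        p[0] in group and e[0] in group for group in tag_equivalences
    )
    return tags_match and p[1] == e[1] and p[2] == e[2]

print(semantically_equal("[a]  Support -> CLICK", "[link] Support -> CLICK"))  # True
```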
Right now, we consider this a research project. We’re not pushing for adoption, monetization, or sustained development. It was months of deep exploration into how agents could move beyond chat and into the real, multi-surface workflows we all live in.
We’ve made the code available here on GitHub, and we’re happy to chat with anyone who’s curious, skeptical, or inspired. At the very least, we hope this adds to the ongoing conversation about what kind of tools agents will need to thrive in messy, real-world environments.
The future of AI agents isn’t just about being smart. It’s about being useful, composable, and learnable. That’s the bet we made here, and we’re excited to see where it leads.