Skip to content

Introduction

Implementation of the Cognition Layer Algorithm ($A_1$) with Xplore

Xplore is a Graph-based agent that cycles into a loop focusing on describing, planning subgoals and reviewing or reacting to the current status by using the $A_2$ algorithm. It is a minimalist improvement of RePlan enhancing the reasoning capabilities by enforcing reasoning over the images obtained from the display.

Design Insights

Using LangGraph we can design a graph of agents that interacts with each other in order to obtain complex behaviors. The Xplore core graph is the following:

  • General Planner: This agent is responsible for generating a general plan to satisfy a query. This plan will be generated as a set of strings (plan) with a description of the status of the system and a reasoning on how the plan could be solved. This reasoning contains 3 mandatory labels to complete:

  • description: This ensures the agent reason about the current status of the system, identifying relevant properties such as which windows are open, the user OS, etc.

  • reasoning: This label makes the agent reason about how the goal can be reached.
  • plan: This label is where the plan (set of subgoals as strings) are shortly described following the previous reasoning.

  • Subgoal Planner: This agent is responsible for generating an action or set of actions (with the names according to the actions registered in the action space) in order to satisfy a subgoal. This agent contains 2 mandatory labels:

  • reasoning: A reasoning about what the user has requested and how that subgoal could be reached.
  • steps: The set of actions to execute in order to complete the goal. Note we force the agent to reason again about the subgoal, since this subgoal could be not fully described, so this step reinforces the precission and accuracy of the action generated.

  • Interpreter: This node is responsible of translating the set of actions generated by previous nodes into Exelent code that will be sent to $A_2$. When the interpreter has executed all the actions, it will return.

  • Review Completed: This node is responsible for calling an agent that will receive an image of the previous status of the machine and a posterior status after executing all the actions generated. Then it will remove from the subgoal list those subgoals that have been already reached. If all subgoals have been completed, it ends the execution, else it sends all the subgoals to the subgoal planner.

Vision Capabilities

In contrast with Planex or RePlan, Xplore have vision capabilities to integrate images from the display into the reasoning. For reaching this, each node that executes in Xplore, is provided with a screenshot of the current status of the display. LLM's such as OpenAI GPT-4o model enable the possibility of passing this images as zero-shot prompts, enhancing the reasoning of the agent fully adapting to the multiple variables of different systems.

An example screenshot passed to Xplore

Results

Key Advantages

  • Adaptability: By using visual capabilities, xplore is able to adapt to multiple environments, taking care of the arrangement of the windows, buttons, etc.
  • Target Aiming: Xplore uses both long-term and short-term planning for building the following actions. This awareness boosts the planning process by simplifying uncertain future status and focusing closer actions.
  • Explainability: By reasoning after each step we can define why the agent selects each action and what is its pourpuse.
  • Cause-Effect Aware: By reviewing the effects of each action the agent takes, makes the agent able to react to incorrect steps and correct the plan to a viable solution.

Key Disadvantages

  • Velocity: By reasoning and using visual capabilities in each action the velocity of each action has been reduced to 50s/task (aprox.).
  • Backward Planning: The Agent can't recover from certain actions where the next subgoal cannot be completed without replanning all the task or adding correction subgoals.
  • Hallucinations and Icon difficulties: Not all elements from the display are correctly interpreted from the agent, where visual icons, buttons or drawings are misunderstood.

In the next video, we provide an example execution of Xplore. Video