Introduction

The MouseAgent is a vision-based agent that uses an LLM's image-description capabilities to generate intelligent mouse movements and interactions. Its main goal is to locate elements on the user's display, move the mouse to them, and interact with them.

To achieve this, several machine-learning models could be applied; however, using an LLM's vision capabilities enables graphical reasoning: the agent not only generates an action but also reasons about the arrangement of the apps, buttons, and other elements on the display, adapting to multiple scenarios better than fine-tuning or specialized models.

Design Insights

MouseAgent uses an algorithm inspired by macOS Voice Control, where the user speaks numbers to indicate where the mouse should move. We replicate this behavior with the following steps:

  1. Take a screenshot: This is the first input to the LLM; the agent must describe and analyse the general arrangement of the elements on screen and reason about how to locate the specified element.
display screenshot

A screenshot taken from an example environment.

  2. Draw a Grid: Split the screenshot into multiple cells, each with its corresponding number, so the agent can refer to a sector of the screen by generating just a label.
gridded screenshot

Each sector of the screen is labeled as a numbered cell.

  3. Select Cell: The grid image is passed to the agent, which reasons about which cells could contain the target element and generates a number to zoom into.

  4. Zoom: Resize the selected cell and repeat steps 3 and 4 until the desired accuracy is reached.

gridded screenshot

An example of a zoomed cell. Here the agent could find the "Code" button.

  5. Inverse Resolution: Use inverse coordinate calculation to determine the pixel coordinates of the center of the selected sector on the display, and move the mouse to that location.
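The inverse-resolution step above can be sketched with a few lines of arithmetic. This is a hypothetical illustration, not the actual Grid API: the function names and the row-major, zero-based cell labeling are assumptions.

```python
# Hypothetical sketch of inverse resolution: given the cell chosen at
# each zoom level, recover the display pixel at the center of the final
# cell. Assumes row-major cell labels starting at 0.

def cell_region(region, index, size):
    """Sub-region (x, y, w, h) of cell `index` in a size x size grid."""
    x, y, w, h = region
    row, col = divmod(index, size)
    cell_w, cell_h = w / size, h / size
    return (x + col * cell_w, y + row * cell_h, cell_w, cell_h)

def inverse_resolution(display, selections):
    """Walk the (cell index, grid size) selections down to one pixel."""
    region = (0.0, 0.0, float(display[0]), float(display[1]))
    for index, size in selections:
        region = cell_region(region, index, size)
    x, y, w, h = region
    return (round(x + w / 2), round(y + h / 2))

# 1920x1080 display: a 10x10 grid first, then a 5x5 zoom.
print(inverse_resolution((1920, 1080), [(23, 10), (12, 5)]))  # → (672, 270)
```

Two levels of selection are enough here to pin the target down to a cell of roughly 38 x 22 px, whose center is the click coordinate.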

Implementation

All the grid drawing and computation is encapsulated in the Grid class in this file, which implements the cell-selection and zooming logic as shown below:


    # Loading the screenshot
    from PIL import Image
    import matplotlib.pyplot as plt

    image_path = "screenshot.png"
    image = Image.open(image_path)
    grid = Grid.from_image(image, size=10)

    # grid.image contains the grid drawing
    plt.imshow(grid.image)
    plt.show()

    # You can zoom into a cell
    selection = int(input("Cell: "))
    sub_grid = grid.zoom(selection, size=5)
    plt.imshow(sub_grid.image)
    plt.show()

    # You can also access the cells through the cells attribute
    print(grid.cells[9])

The MouseDescriptor class uses the Grid class to move the mouse, while MouseAgent builds a graph that executes the algorithm described in the previous section.
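The loop that graph runs can be sketched roughly as below. FakeGrid and pick_cell are illustrative stand-ins, not the project's real classes: FakeGrid mimics only the two operations the loop needs, and pick_cell stands in for the vision-model call.

```python
# Hypothetical sketch of the select/zoom loop; FakeGrid and pick_cell
# are stand-ins, not the actual Grid or MouseAgent API.

class FakeGrid:
    def __init__(self, region, size):
        self.region = region          # (x, y, w, h) on the display
        self.size = size              # grid is size x size cells

    def zoom(self, index, size):
        """Return a new grid drawn over the selected cell."""
        x, y, w, h = self.region
        row, col = divmod(index, self.size)
        return FakeGrid((x + col * w / self.size,
                         y + row * h / self.size,
                         w / self.size, h / self.size), size)

    def center(self):
        x, y, w, h = self.region
        return (x + w / 2, y + h / 2)

def find_target(grid, pick_cell, iterations=2, zoom_size=5):
    """Let the model pick a cell, zoom in, repeat; return the center."""
    for _ in range(iterations):
        grid = grid.zoom(pick_cell(grid), size=zoom_size)
    return grid.center()

# With a picker that always chooses cell 0, the final coordinates fall
# in the top-left corner region of the display.
print(find_target(FakeGrid((0, 0, 1920, 1080), size=10), lambda g: 0))
```

In the real agent, the picker is the LLM prompted with `grid.image`; the loop structure stays the same.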

The MouseAgent can be called with just one command:

```python
MouseAgent.find("Amazon Icon at Firefox")
```

Note: more iterations of the MouseAgent yield higher accuracy; however, using more than 2 iterations has been shown to be ineffective for typical user displays.
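Quick arithmetic supports that note. Assuming the grid sizes from the example above (a 10x10 first grid, then 5x5 zooms) and a common 1920x1080 display (an assumption), the cell size after each iteration is:

```python
# Back-of-the-envelope cell sizes per iteration; the 1920x1080 display
# is an assumption, the grid sizes (10, then 5) match the Grid example.

def cell_sizes(width, height, sizes):
    """Cell (w, h) in pixels after each successive grid split."""
    out, w, h = [], width, height
    for s in sizes:
        w, h = w / s, h / s
        out.append((round(w, 1), round(h, 1)))
    return out

for i, (w, h) in enumerate(cell_sizes(1920, 1080, [10, 5, 5]), start=1):
    print(f"iteration {i}: cell is about {w} x {h} px")
# iteration 1: cell is about 192.0 x 108.0 px
# iteration 2: cell is about 38.4 x 21.6 px
# iteration 3: cell is about 7.7 x 4.3 px
```

After two iterations a cell is already around 38 x 22 px, roughly the size of a button, so further iterations add little for ordinary displays.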