
Highlights

  • Graphical user interface (GUI) automation requires agents that can understand and interact with user screens. However, using general-purpose LLMs as GUI agents faces several challenges: 1) reliably identifying interactable icons within the user interface, and 2) understanding the semantics of various elements in a screenshot and accurately associating the intended action with the corresponding region on the screen. OmniParser closes this gap by ‘tokenizing’ UI screenshots from pixel space into structured elements that are interpretable by LLMs. This enables the LLMs to do retrieval-based next-action prediction given a set of parsed interactable elements (see the prompt-serialization sketch after this list).
  • OmniParser V2 takes this capability to the next level. Compared to its predecessor, it achieves higher accuracy in detecting smaller interactable elements and faster inference, making it a useful tool for GUI automation. In particular, OmniParser V2 is trained with a larger set of interactive element detection data and icon functional caption data. By decreasing the image size of the icon caption model, OmniParser V2 reduces latency by 60% compared to the previous version. Notably, OmniParser + GPT-4o achieves a state-of-the-art average accuracy of 39.6 on the recently released grounding benchmark ScreenSpot Pro, which features high-resolution screens and tiny target icons. This is a substantial improvement on GPT-4o’s original score of 0.8.
  • To enable faster experimentation with different agent settings, we created OmniTool, a dockerized Windows system that incorporates a suite of essential tools for agents. Out of the box, OmniParser can be used with a variety of state-of-the-art LLMs: OpenAI (4o/o1/o3-mini), DeepSeek (R1), Qwen (2.5VL), and Anthropic (Sonnet), combining the screen understanding, grounding, action planning, and execution steps (a stub agent loop illustrating this pipeline appears below).
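
The ‘tokenization’ idea from the first highlight can be made concrete with a small sketch: parsed screen elements are serialized into a text-only prompt so the LLM can select the next action by element ID rather than by raw pixel coordinates. The `ParsedElement` schema and `elements_to_prompt` helper below are hypothetical illustrations, not OmniParser’s actual output format.

```python
from dataclasses import dataclass

# Hypothetical schema for a parsed UI element; OmniParser's real output fields
# (bounding boxes, interactivity flags, functional captions) may be named differently.
@dataclass
class ParsedElement:
    element_id: int
    bbox: tuple          # normalized (x1, y1, x2, y2)
    interactable: bool
    caption: str         # functional description of the icon or control


def elements_to_prompt(elements, task):
    """Serialize parsed elements into a text prompt so an LLM can pick the
    next action by element ID instead of raw pixel coordinates."""
    lines = [f"Task: {task}", "Interactable elements:"]
    for el in elements:
        if el.interactable:
            lines.append(f"  [{el.element_id}] {el.caption} @ bbox={el.bbox}")
    lines.append("Reply with the element ID to act on and the action type (click/type).")
    return "\n".join(lines)


# Example with mock parser output:
elements = [
    ParsedElement(0, (0.02, 0.01, 0.08, 0.05), True, "Back navigation button"),
    ParsedElement(1, (0.85, 0.01, 0.98, 0.05), True, "Search input field"),
    ParsedElement(2, (0.10, 0.20, 0.90, 0.80), False, "Article body text"),
]
print(elements_to_prompt(elements, "Search for OmniParser V2"))
```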
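
Likewise, the pipeline OmniTool orchestrates can be sketched as an observe-parse-plan-act loop. Everything below is a stand-in stub under assumed interfaces, not OmniTool’s or OmniParser’s actual API.

```python
# Minimal agent loop: parse the screen, ask an LLM to plan, execute, repeat.
import time


class StubParser:
    """Stand-in for OmniParser: turns a screenshot into structured elements."""
    def parse(self, screenshot):
        return [{"id": 0, "caption": "Submit button", "interactable": True}]


class StubLLM:
    """Stand-in for the planning model (e.g. GPT-4o, R1, Qwen 2.5VL, Sonnet)."""
    def complete(self, prompt):
        return "click 0"  # a real model would reason over the parsed elements


class StubExecutor:
    """Stand-in for the dockerized Windows environment that runs actions."""
    def capture_screen(self):
        return b""

    def perform(self, action):
        print(f"executing: {action}")


def run_agent(task, parser, llm, executor, max_steps=3):
    for _ in range(max_steps):
        screenshot = executor.capture_screen()   # 1) observe the current screen
        elements = parser.parse(screenshot)      # 2) ground: pixels -> structured elements
        prompt = f"Task: {task}\nElements: {elements}\nNext action?"
        action = llm.complete(prompt)            # 3) plan the next action
        if action == "done":
            break
        executor.perform(action)                 # 4) act, then loop
        time.sleep(0.1)                          # let the UI settle before re-parsing


run_agent("Submit the form", StubParser(), StubLLM(), StubExecutor())
```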