The future success of AI assistants hinges on UX
Why Planning, Execution, and Confirmation will get us there
For AI to gain adoption by consumers, for it to be as deeply integrated in our lives as smartphones are today, it needs to provide overwhelming value.
The greatest value in AI assistants is in the things they can do for us. It's no surprise, therefore, that there is so much investment into creating agentic AI.
Today, the killer task [1] that AI models are good at (at a consumer-grade level) is information retrieval and synthesis [2] on publicly available data. But this isn't enough.
We need assistants that can do complex and arbitrary tasks on our private data while also interfacing with various applications.
For example, one specific category of tasks is:
Information retrieval and synthesis on personal data
—
How are my individual stocks performing this month?
[Access Brokerage Account]
→ [Pull up Account Holdings]
→ [Filter for individual stocks]
→ [Extract/Retrieve monthly performance]
→ [Synthesize contextual response]
—
When was the last time we had a meeting discussing our business outlook more than one year out?
[Access Account with Meeting Notes]
→ [Semantic search through notes]
→ [Synthesize response]
—
Which pizza place did I eat at in Lisbon with the large windows, wood-fired oven, and about a half mile from my Airbnb?
Option 1
[Access Google Maps]
→ [Find visited locations in Lisbon]
→ [Filter restaurants with pizza in description]
→ [Access Airbnb Account]
→ [Load Trips]
→ [Find Trip in Lisbon]
→ [Extract location]
→ [Select Restaurant ~0.5 miles from Airbnb]
—
Option 2
[Access Google Photos]
→ [Access Airbnb Account]
→ [Load Trips]
→ [Find Trip in Lisbon]
→ [Extract location]
→ [Find photos with location metadata ~0.5 miles away]
→ [Filter photos with pizza/wood-fired oven/large window]
→ [Match locations to establishments in Google Maps]
→ [Select restaurant with pizza in description]
—
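To make the shape of these action chains concrete, here is a minimal sketch of how an assistant might represent such a plan internally. Everything in it is hypothetical and for illustration only: the `Step` structure, app names, action names, and inputs simply mirror Option 2 of the pizza example above and do not refer to any real API.

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    """One action in an assistant's plan (hypothetical structure)."""
    app: str                 # which application the step touches
    action: str              # what the assistant intends to do there
    inputs: dict = field(default_factory=dict)

# Option 2 of the Lisbon pizza question, expressed as an ordered plan.
plan = [
    Step("airbnb", "load_trips"),
    Step("airbnb", "extract_location", {"trip": "Lisbon"}),
    Step("google_photos", "find_photos_near", {"radius_miles": 0.5}),
    Step("google_photos", "filter_photos",
         {"keywords": ["pizza", "wood-fired oven", "large windows"]}),
    Step("google_maps", "match_locations_to_establishments"),
    Step("google_maps", "select_restaurant", {"category": "pizza"}),
]

for i, step in enumerate(plan, start=1):
    print(f"{i}. [{step.app}] {step.action} {step.inputs}")
```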
These are all scenarios that aren't possible today, but they are perhaps within the realm of possibility in a year or two, as AI assistant software is developed with embedded access to arbitrary UIs and the capability to interact with them. This is why the promise of the Rabbit R1 [3] (current instantiation notwithstanding) and its research on Large Action Models [4] (put succinctly: a foundation model for human-designed UIs) is so interesting. It is also why there is a lot to be bullish about in this year's releases from Apple and Google, simply because they are the gatekeepers of their respective mobile OS walled gardens.
However, we need to generalize further and expect AI assistants to go beyond information retrieval and synthesis, to tasks that are inherently riskier and/or for which the user's intent is harder to predict.
Any other rote task that is typically accomplished through a UI today
Send an email to Amazon’s email support about the shampoo I just bought missing a pump. Request another item be sent out, and barring that a refund.
[Access Amazon Account]
→ [Pull up purchased items]
→ [Select the related item from context]
→ [Start an email support message]
→ [Compose message as requested]
→ [Send email]
—
Schedule a bill pay for all my outstanding credit card bills 11 days before their due date.
[Access Bank Account]
→ [Open Bill Pay Screen]
→ [Select unpaid credit card bills]
→ [For each bill, schedule payment]
Assumptions:
1. Pay the statement balance (as opposed to the minimum payment)
2. Use default checking account for payment
3. If insufficient funds in checking, initiate transfer from savings
—
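The assumptions listed above are exactly the details an assistant has to fill in on its own, and they will matter again in the confirmation discussion below. As a rough sketch (the structure and field names are made up for illustration), each assumption can be attached to the step it affects so it can be surfaced rather than silently applied:

```python
# Hypothetical sketch: the bill pay plan with its implicit assumptions made explicit.
bill_pay_plan = [
    {"action": "access_bank_account"},
    {"action": "open_bill_pay_screen"},
    {"action": "select_unpaid_credit_card_bills"},
    {
        "action": "schedule_payment_for_each_bill",
        "assumptions": [
            "pay the statement balance (not the minimum payment)",
            "use the default checking account",
            "if checking has insufficient funds, transfer from savings",
        ],
    },
]

# Every assumption is a candidate confirmation point rather than a silent default.
for step in bill_pay_plan:
    for assumption in step.get("assumptions", []):
        print(f"{step['action']}: assumes '{assumption}'")
```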
The immediate reaction to this is: well, should we really trust AI agents, built on technology that is often wrong (or, as some say, hallucinates, though this is imprecise), to actually do these tasks as intended?
In short, the answer is no, at least not directly. But there is a way to get nearly all the value from AI agents actually doing these tasks while accommodating the fact that they are sometimes wrong.
The Planning, Execution, and Confirmation Framework
We need assistants (AI agents) built on systems that adopt the following framework: a loop over three states, Planning, Execution, and Confirmation, connected by the edges below.
Let's walk through all the edges.
—
[Planning] → [Execution]
The agent must plan what it needs to do, based on the goals it is given and its understanding of the app it is working with, and then execute that plan.
[Execution] → [Planning]
After executing one or more actions, the environment (defined by the app) changes, and the agent needs to decide what to do next to accomplish the given goals.
[Planning+Execution] → [Confirmation]
We reach a step that requires confirmation. One of three cases can happen. In the examples below, we'll reference the rote tasks we discussed before: Email Amazon and Bill Pay.
The application supplies a directive that the next action requires confirmation
e.g. sending an email/text always necessitates a confirmation [Email Amazon]
The agent independently decides that the next action requires confirmation, based on its own evaluation of how consequential the action is for the user's data, or on the user's history of rejecting similar actions
e.g. for credit card bills, the user has sometimes paid the minimum balance and sometimes paid the statement balance, so this requires confirmation [Bill Pay Assumption 1]
e.g. user is generally sensitive to financial transactions (has rejected many finance-related steps in the past), so account selection for bill pay needs confirmation [Bill Pay Assumption 2]
The agent and application jointly decide that the next action requires confirmation
e.g. the application flags that people may want to confirm savings-to-checking transfers, and the agent notes that the user generally prefers (and has indicated before) to split their funds across multiple accounts and banks for security reasons. For these reasons, the transfer from savings to checking requires confirmation [Bill Pay Assumption 3]
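A minimal sketch of how these three cases might combine is below. The function name, thresholds, and signals (application flags, the agent's own sensitivity estimate, the user's past rejection rate) are illustrative assumptions, not a prescribed design.

```python
# Hypothetical sketch of the confirmation triggers; names and thresholds are illustrative.
def needs_confirmation(action, app_flagged_actions, agent_sensitivity, past_rejection_rate,
                       sensitivity_threshold=0.7, rejection_threshold=0.3):
    # Case 1: the application supplies a directive (e.g. sending an email or text).
    if action in app_flagged_actions:
        return True
    # Case 2: the agent independently judges the action to be high-stakes, based on
    # its own sensitivity estimate or the user's history of rejecting similar steps.
    if agent_sensitivity >= sensitivity_threshold or past_rejection_rate >= rejection_threshold:
        return True
    # Case 3 (joint) is covered implicitly: if either side raises a flag, we confirm.
    return False

# The savings-to-checking transfer from the Bill Pay example:
print(needs_confirmation(
    action="transfer_savings_to_checking",
    app_flagged_actions={"send_email", "transfer_savings_to_checking"},
    agent_sensitivity=0.9,      # the user is sensitive to financial transactions
    past_rejection_rate=0.4,    # the user has rejected finance-related steps before
))  # -> True
```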
[Confirmation] → [Planning+Execution]
The human in the loop (us) decides whether or not to confirm the next action. If confirmed, proceed to execution. If rejected, go back to planning, carrying the user's explanation of why it was rejected.
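Putting the edges together, here is a rough sketch of the whole loop. The helpers the caller supplies (`plan_next_action`, `execute`, `requires_confirmation`, `ask_user`) are hypothetical stand-ins; the point is the shape of the Planning, Execution, and Confirmation cycle, not a production agent.

```python
from dataclasses import dataclass

@dataclass
class Decision:
    approved: bool
    reason: str = ""

def run_task(goal, plan_next_action, execute, requires_confirmation, ask_user, app_state=None):
    """Illustrative plan -> execute -> confirm loop; callers supply all behavior."""
    feedback = None
    while True:
        # [Planning]: choose the next action from the goal, current app state, and any feedback.
        action = plan_next_action(goal, app_state, feedback)
        if action is None:                      # goal accomplished, nothing left to do
            return app_state
        # [Planning+Execution] -> [Confirmation]: pause when the app or the agent requires it.
        if requires_confirmation(action, app_state):
            decision = ask_user(action)         # [Confirmation]
            if not decision.approved:
                feedback = decision.reason      # [Confirmation] -> [Planning]
                continue
        # [Execution] -> [Planning]: act, observe the changed environment, then replan.
        app_state = execute(action, app_state)
        feedback = None

# Toy usage: a one-step "send email" task where sending always requires confirmation.
result = run_task(
    goal="email Amazon support",
    plan_next_action=lambda goal, state, fb: None if state == "sent" else "send_email",
    execute=lambda action, state: "sent",
    requires_confirmation=lambda action, state: action == "send_email",
    ask_user=lambda action: Decision(approved=True),
)
print(result)  # -> sent
```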
Now we can take the rote task examples above and re-imagine them with confirmations instead.
Send an email to Amazon’s email support about the shampoo I just bought missing a pump. Request another item be sent out, and barring that a refund.
1. [Access Amazon Account]
2. [Pull up purchased items]
3. [Select the related item from context]
4. [Start an email support message]
5. [Compose message as requested]
6. Before sending email, show the user the email and require confirmation
    - If rejected, go back to step 2 and readjust any steps as needed based on feedback
7. [Send Email]
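One way the "application supplies a directive" case could show up here is for the email application itself to declare which of its actions always need a confirmation (step 6 above). The manifest format and names below are purely hypothetical.

```python
# Hypothetical app-side declaration: actions that must always be confirmed by the user.
EMAIL_APP_MANIFEST = {
    "actions": ["open_mailbox", "compose_message", "send_email"],
    "always_confirm": ["send_email"],   # i.e. show the drafted email before sending
}

def app_requires_confirmation(action, manifest=EMAIL_APP_MANIFEST):
    return action in manifest["always_confirm"]

print(app_requires_confirmation("compose_message"))  # False
print(app_requires_confirmation("send_email"))       # True
```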
—
Schedule a bill pay for all my outstanding credit card bills 11 days before their due date.
1. [Access Bank Account]
2. [Open Bill Pay Screen]
3. [Select unpaid credit card bills]
4. Before selecting the amount to pay for each bill, show the user the possible options (statement balance, minimum payment, or any amount in between) and require confirmation
    - User selects from the options or rejects entirely
    - If rejected, go back to step 2 and readjust as needed based on feedback
5. Select the default checking account for all bill payments, but require confirmation
    - If rejected, based on feedback go back to the start of step 5 (if the payment account needs adjusting) or further back (if there is a more significant error)
6. Before scheduling all payments, show the user the payment details, including the amount, schedule date, payment accounts, etc., and require confirmation
    - If rejected, based on feedback go to the step that makes the most sense and adjust
7. [For each bill, schedule payment]
    - If the bank app indicates insufficient funds in the checking account, require confirmation before initiating a transfer from savings
    - If rejected, based on feedback go back to the start of step 7 (if the amount of the transfer or the account to transfer from needs adjusting), or further back/cancel (if the new information changes the user's intent)
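The "go back to step N" behavior in this walkthrough could be encoded as explicit rejection routes on each confirmation point, so the agent knows where to resume planning. The structure and step numbers below are illustrative only.

```python
# Hypothetical rejection routes: each confirmation point records where planning
# should resume when the user rejects it (a step number, or None to replan freely).
REJECTION_ROUTES = {
    "choose_payment_amount":  2,    # reopen bill pay and readjust the selected bills
    "choose_payment_account": 5,    # re-pick the payment account
    "schedule_all_payments":  None, # go to whichever step best matches the feedback
    "transfer_from_savings":  7,    # redo payment scheduling with an adjusted transfer
}

def resume_point(confirmation_point, feedback):
    """Where the agent should pick planning back up after a rejection (illustrative)."""
    route = REJECTION_ROUTES.get(confirmation_point)
    return route if route is not None else f"replan using feedback: {feedback}"

print(resume_point("choose_payment_account", "use my joint checking instead"))  # -> 5
```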
—
Overall, the confirmation loop resolves the problem of AI errors by relying on the human user directly when it makes sense to.
In fact, AI assistants interacting with traditional UIs designed for humans may just be a temporary stopgap. In the AI agent <> Human hybrid future, applications would be built with virtually unlimited flexibility, exposing functionality that could never be exposed in a UI built for humans. Instead of just providing portfolio performance at 3 months, 1 year, and 5 years, brokerage apps could supply the full, granular history as needed. And the standard account management interface present in any authenticated user-facing application today could be deprecated entirely in favor of more flexible control over one's settings, private data, and support requests.
The future of human-facing UI, then, would be relegated exclusively to the confirmation loops, sometimes supplied by the application, and often generated on-the-fly by the assistant to provide the context necessary for a particular confirmation.
If this makes you feel vaguely uneasy, as it does me, take solace in the fact that this AI-assisted way of interacting with complex systems that affect our lives is perhaps a more natural state of affairs than being constrained to the subset of functionality exposed in UIs today.
Before we end, a quick note on why this framework will be necessary even as AI models get better and more capable in the future.
But why can’t we just build assistants that always get it right?
Why do we have to accommodate "dumb" AI models in the first place? Why can't we just build better models (trained on larger sets of higher-quality tokens, with more parameters, and with better-designed architectures) that always get it right?
The answer has to do with the fundamental difference between deterministic and stochastic computation.
Deterministic computation is what the computers we have grown up with provide: transistor-based computation that is virtually 100% correct (with built-in error correction). This powers nearly everything we do today, and it leads to the expectation that when we do something, be it initiating a bank transfer, typing a message, or loading a webpage, it happens consistently and as we expect. Any failures are typically transient and the result of the systems we've built around this computation.
But the asymptotic behavior of systems in general follows the Law of Diminishing Returns [5], which suggests that each improvement (from 80% → 90%, 90% → 95%, 95% → 99%, and so on) requires progressively more effort. Suppose, then, that we relax the 100% correctness constraint and walk backwards. Intuitively, a system of computation that targets 95% correctness would require exponentially less effort, or, equivalently, perform exponentially better at the same level of effort.
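As a toy illustration of that intuition, suppose (purely as an assumption for illustration, not something the argument depends on) that effort grows like 1/(1 - correctness), so each halving of the error rate doubles the required effort. Relaxing the target from near-perfect to 95% then shrinks the effort by orders of magnitude:

```python
# Toy diminishing-returns model (an illustrative assumption): effort ~ 1 / (1 - correctness),
# so each halving of the error rate doubles the required effort.
def relative_effort(correctness):
    return 1.0 / (1.0 - correctness)

for p in (0.80, 0.90, 0.95, 0.99, 0.999, 0.99999):
    print(f"{p:>7} correct -> relative effort {relative_effort(p):>9,.0f}")
# 0.8 -> 5, 0.9 -> 10, 0.95 -> 20, 0.99 -> 100, 0.999 -> 1,000, 0.99999 -> 100,000
```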
The flexibility afforded by such a shift isn't just theoretical; we're already beneficiaries of it today. Transformer-based stochastic computation powers the LLMs that will increasingly power the applications of tomorrow. Instead of being subject to the 100% correctness paradigm of carefully clicking and typing through UIs tens of times to accomplish a basic task, we'll increasingly flock to the 95% correctness paradigm, where our intent is probabilistically inferred from just a sentence [6] describing what we want to happen, with a natural error-correcting confirmation loop.
Acknowledgements
Special thanks to Dmitry Vagner for the conversation that sparked this post.
[1] https://en.wikipedia.org/wiki/Killer_application
[2] A shortlist of other tasks includes artistic inspiration, data wrangling, and guided assistance on textual tasks (e.g. coding or writing). However, none of these is sufficiently impactful to convince a majority of people to create an OpenAI account or purchase an AI device.
[3] https://www.rabbit.tech/rabbit-r1
[4] https://www.rabbit.tech/research
[5] https://en.wikipedia.org/wiki/Diminishing_returns
[6] And taking it one step further, intent could be proactively derived from our private data and past history of actions, so that AI assistants anticipate what we want accomplished, subject to the confirmation loop as necessary.