For anyone looking to try this in an E2E testing context, we just released a library for Playwright called ZeroStep (https://zerostep.com/) that lets you script AI-based actions, assertions, and extractions.
This is a working example that tests the core "book a meeting" workflow in Calendly:
import { test, expect } from '@playwright/test'
import { ai } from '@zerostep/playwright'

test.describe('Calendly', () => {
  test('book the next available timeslot', async ({ page }) => {
    await page.goto('https://calendly.com/zerostep-test/test-calendly')

    await ai('Verify that a calendar is displayed', { page, test })
    await ai('Dismiss the privacy modal', { page, test })
    await ai('Click on the first available day of the month', { page, test })
    await ai('Click on the first available time in the sidebar', { page, test })
    await ai('Click the Next button', { page, test })
    await ai('Fill out the form with realistic values', { page, test })
    await ai('Submit the form', { page, test })

    // getByText() returns a Locator synchronously, so checking that it is
    // defined would always pass; assert visibility of the confirmation instead.
    await expect(page.getByText('You are scheduled')).toBeVisible()
  })
})
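Assuming ZeroStep credentials are configured in your environment, this runs like any other Playwright spec via npx playwright test.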
It would be much easier to consider this as a solution if it would _output_ the generated test steps, and/or cache them and only modify them when needed.
Your example above has 7 ai() calls in one test; let's say it's usually closer to 5. We have hundreds of tests, every single PR runs the E2E suite, and we open a handful of PRs a day. Let's call it 5. We're already looking at thousands of invocations a day. Based on your pricing, that would be incredibly expensive.
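To make that concrete, here's the back-of-envelope math (the test count is an assumed stand-in for "hundreds", and $0.01/call is your paid rate):

// Rough cost estimate; callsPerTest, testCount, and prRunsPerDay are assumptions.
const callsPerTest = 5
const testCount = 200
const prRunsPerDay = 5
const costPerCall = 0.01  // $0.01 per ai() call on the paid plans

const callsPerDay = callsPerTest * testCount * prRunsPerDay  // 5,000 calls/day
const costPerMonth = callsPerDay * costPerCall * 30          // ~$1,500/month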
Pricing is listed on https://zerostep.com - you get 1,000 ai() calls per month for free, and then the cheapest paid plan is 2,000 ai() calls per month for $20, 4,000 for $40, etc. So basically you pay a penny per ai() call.
In terms of reliability - we have a hard dependency on the OpenAI API, so that's what will affect reliability the most. We're using GPT-3.5 and GPT-4 models, which have been fairly reliable, but we'll bump to GPT-4-Turbo eventually. Right now GPT-4-Turbo is listed as "not suited for production use" in OpenAI's docs: https://platform.openai.com/docs/models
That's one aspect of reliability, but the one I was more curious about was determinism. If I repeatedly run the same test suite on the same code base and the same data and configuration, am I guaranteed to get the same test results every time, or is it possible for ai() to change its mind about what actions to take?
Ah got it. So GPT is non-deterministic, but we somewhat handle that by having a caching layer in our AI. Basically, if you make an ai() call and we see that the page state is identical to a previous invocation of that exact AI prompt, then we won't consult the AI and will instead return the cached result. We did this mainly to reduce costs and speed up execution of the 2nd-to-nth run of the same test, but it does make the AI a bit more deterministic.
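To illustrate the technique (a minimal sketch of the general idea, not our exact implementation): key the cache on a hash of the prompt plus a snapshot of page state, and only consult the model on a miss.

import { createHash } from 'crypto'

type Action = { kind: string; selector?: string; value?: string }

const cache = new Map<string, Action>()

async function cachedAiAction(
  prompt: string,
  pageState: string,
  askModel: (prompt: string, state: string) => Promise<Action>,
): Promise<Action> {
  // An identical prompt plus identical page state hashes to the same key.
  const key = createHash('sha256').update(prompt).update(pageState).digest('hex')

  const hit = cache.get(key)
  if (hit) return hit  // cache hit: skip the non-deterministic, billable model call

  const action = await askModel(prompt, pageState)
  cache.set(key, action)
  return action
}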
There are some new features in GPT-4-Turbo that will let us handle determinism better, and we will be exploring that once GPT-4-Turbo is stable.
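Specifically, features like the new seed parameter, which requests best-effort reproducible sampling. A rough sketch of what using it looks like (the model name and prompt here are placeholders):

import OpenAI from 'openai'

const openai = new OpenAI()  // reads OPENAI_API_KEY from the environment

// With a fixed seed and temperature 0, repeated calls with identical inputs
// should usually (though not strictly always) return the same completion.
const completion = await openai.chat.completions.create({
  model: 'gpt-4-1106-preview',  // the GPT-4-Turbo preview model
  seed: 42,
  temperature: 0,
  messages: [{ role: 'user', content: 'Click the Next button' }],
})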
That makes a lot of sense, thank you for the explanation. I'll have to explore this the next time I'm building page tests. I've considered doing it myself, but I'm much happier using a relatively inexpensive product than maintaining a creaky homebuilt version.
This seems useful given that pieces of software often don't work with each other, so a human has to manually move data from one to the other.
In most cases, if users always have to do A->B, does it make more sense to build the automation in code instead of using AI? That automation can be built by engineers who are themselves assisted by AI.
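For example, the Calendly flow above could be hand-coded in plain Playwright with no model in the loop; the selectors below are guesses at the page structure, not verified ones:

import { test, expect } from '@playwright/test'

test('book the next available timeslot (hand-coded)', async ({ page }) => {
  await page.goto('https://calendly.com/zerostep-test/test-calendly')

  // All selectors below are illustrative guesses; the real page may differ.
  await page.getByRole('button', { name: /accept|agree/i }).click()            // privacy modal
  await page.locator('[data-testid="calendar"] button:enabled').first().click() // first open day
  await page.getByRole('button', { name: /\d{1,2}:\d{2}/ }).first().click()    // first timeslot
  await page.getByRole('button', { name: 'Next' }).click()

  // (form filling omitted for brevity)
  await expect(page.getByText('You are scheduled')).toBeVisible()
})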
Been excited for this type of neural network since they announced it months and months ago. Imagine this type of agent in conjunction with a framework like AutoGen.