This is a straightforward plan to build the UI state capture system.
Goal: Get browser automation working and start integrating the AI.
What we need to do:
- Set up the project structure (folders, files)
- Install dependencies (Playwright, Google Generative AI library)
- Create a basic browser automation script
- Test that we can open a browser, navigate to a website, and take screenshots
- Test that we can click buttons and fill forms
- Set up Gemini API integration
- Build a basic task parser that can extract app name and action from a task description
What we'll have at the end:
- Browser automation that works (can navigate, click, take screenshots)
- Task parser that can understand natural language and extract structured info
Goal: Connect all the pieces together so we have a full end-to-end flow.
What we need to do:
- Build the navigation planner (uses LLM to generate step-by-step plan)
- Build the browser controller (executes actions like click, fill, navigate)
- Build the main Agent B orchestrator (ties everything together)
- Implement element finding (how to find buttons, forms, etc. by text, placeholder, role)
- Add screenshot capture system
- Test the full flow with one simple task
What we'll have at the end:
- A system that takes a task description, parses it, plans the steps, executes them, and captures screenshots
- At least one working task end-to-end
Goal: Make it more reliable and test it on multiple different tasks.
What we need to do:
- Improve state detection (wait for modals to fully load, forms to appear, etc.)
- Better element finding strategies (handle edge cases)
- Add basic error handling
- Test Task 1: Create task in Asana
- Test Task 2: Filter database in Notion
- Test Task 3: Create database in Notion
- Fix any issues that come up
What we'll have at the end:
- System that can handle 3 different tasks
- Screenshots captured at the right moments (not too early, not too late)
- Better error handling so it doesn't crash on edge cases
Goal: Handle authentication properly and test the remaining tasks.
What we need to do:
- Implement session/cookie management (save login state so we don't have to log in every time)
- Test Task 4: Pick another task
- Test Task 5: Pick another task
- Organize screenshot output better (clean folder structure)
- Add metadata to screenshots (what step it is, what action was taken)
- Create proper dataset folder structure
What we'll have at the end:
- Session persistence (login once, reuse for all tasks)
- All 5 tasks working reliably
- Well-organized dataset structure that's ready to submit
Goal: Prepare everything for submission.
What we need to do:
- Write README with setup instructions
- Record Loom video showing the system working and explaining the approach
- Finalize the dataset (make sure all screenshots are good quality)
- Write task descriptions for the dataset
- Test everything one more time to make sure it works
What we'll have at the end:
- README.md with clear instructions
- Loom video link
- Dataset folder with 5 tasks, organized and documented
- GitHub repo ready to share