Focus: Multimodal Storytelling with Interleaved Output
Build an agent that thinks and creates like a creative director, seamlessly weaving together text, images, audio, and video in a single, fluid output stream. Leverage Gemini's native interleaved output to generate rich, mixed-media responses that combine narration with visuals, explanations with generated imagery, or storyboards with voiceover, all in one cohesive flow. Examples include interactive storybooks (text + generated illustrations inline), a marketing asset generator (copy + visuals + video in one go), educational explainers (narration woven with diagrams), and a social content creator (caption + image + hashtags together).
Mandatory Tech: Must use Gemini's interleaved/mixed output capabilities. The agents are hosted on Google Cloud.
TaleSpark/
├── app.py                          # FastAPI backend
├── requirements.txt                # Python dependencies
├── README.md                       # This file
├── frontend/                       # Vue 3 + TypeScript + Vite
│   ├── package.json
│   ├── tsconfig.json
│   ├── vite.config.ts
│   ├── index.html
│   ├── public/
│   │   └── favicon.svg
│   └── src/
│       ├── main.ts                 # Entry point
│       ├── App.vue                 # Root component
│       ├── types.ts                # TypeScript definitions
│       ├── styles/
│       │   └── main.css            # Global styles + CSS variables
│       ├── composables/
│       │   ├── useAppState.ts      # Global state management
│       │   ├── useSSE.ts           # Server-Sent Events streaming
│       │   └── useThreeScene.ts    # Three.js particle system
│       └── components/
│           ├── WelcomeScreen.vue   # Hero with animated logo + particles
│           ├── StorySetup.vue      # Genre selection + prompt input
│           ├── StoryViewer.vue     # Streaming story display
│           ├── StoryComplete.vue   # Celebration + stats
│           ├── LoadingScreen.vue   # Animated quill writing
│           ├── GenreCard.vue       # 3D tilt genre cards
│           ├── SceneCard.vue       # Image + text + typewriter
│           └── AudioPlayer.vue     # Custom audio player
├── dist/                           # Built frontend (production)
├── static/                         # Generated images/audio
└── plans/
    └── frontend-architecture.md
flowchart LR
%% Styles
classDef frontend fill:#d4edda,stroke:#28a745,stroke-width:2px;
classDef backend fill:#cce5ff,stroke:#007bff,stroke-width:2px;
classDef ai fill:#f8d7da,stroke:#dc3545,stroke-width:2px;
classDef cloud fill:#fff3cd,stroke:#ffc107,stroke-width:2px;
%% Nodes
subgraph Frontend
UI[Web Browser]:::frontend
end
subgraph Backend
API[FastAPI Endpoint]:::backend
EQ[(Event Queue)]:::backend
TQ[(TTS Text Queue)]:::backend
LLM_W[Task 1: LLM Producer]:::backend
TTS_W[Task 2: TTS Worker]:::backend
FS[(Local Static Files)]:::backend
end
subgraph The_Brain
LLM[Gemini 2.5 Pro]:::ai
IMG[Imagen 3]:::ai
TTS[GCP TTS API]:::cloud
end
%% Flow 1: Initialization
UI -->|1. POST Prompt| API
API -->|Starts| LLM_W
API -->|Starts| TTS_W
%% Flow 2: Task 1 (Text & Image Interleaved)
LLM_W -->|2. Stream Chat| LLM
LLM -.->|Text Chunks| LLM_W
LLM_W -->|3. Tool Pause| IMG
IMG -.->|Image Data| LLM_W
%% Flow 3: Queues Routing
LLM_W -->|Push Text/Img Event| EQ
LLM_W -->|Push Sentences| TQ
%% Flow 4: Task 2 (Parallel Audio)
TQ -->|Pop Sentences| TTS_W
TTS_W -->|4. Synthesize| TTS
TTS -.->|MP3 Data| TTS_W
TTS_W -->|Save File| FS
TTS_W -->|Push Audio Event| EQ
%% Flow 5: Output to Frontend
EQ -->|5. SSE Stream| UI
UI -.->|6. Fetch MP3/JPG| FS
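
The two-task, two-queue flow in the diagram can be sketched with `asyncio` alone. This is a minimal illustration, not the code in `app.py`: `fake_story_chunks` is a hypothetical stand-in for the Gemini stream, the image and audio file names are invented, each chunk is assumed to be one full sentence, and the `data: ...` framing is the standard SSE wire format rather than something confirmed from the backend.

```python
import asyncio
import json

DONE = None  # sentinel: no more items will arrive on this queue


async def fake_story_chunks(prompt: str):
    """Hypothetical stand-in for the Gemini streaming call."""
    for chunk in ["Once upon a time. ", "A dragon spoke."]:
        await asyncio.sleep(0)  # yield control, as a network read would
        yield chunk


async def llm_producer(prompt: str, events: asyncio.Queue, tts_text: asyncio.Queue) -> None:
    """Task 1: push image/text events and hand sentences to the TTS queue."""
    # Stand-in for the "tool pause" in which the image model is called.
    await events.put({"type": "image", "src": "/static/img_0.jpg"})
    async for chunk in fake_story_chunks(prompt):
        await events.put({"type": "text", "chunk": chunk})
        await tts_text.put(chunk.strip())  # assumption: one sentence per chunk
    await tts_text.put(DONE)


async def tts_worker(events: asyncio.Queue, tts_text: asyncio.Queue) -> None:
    """Task 2: turn queued sentences into audio events while Task 1 keeps running."""
    index = 0
    while (sentence := await tts_text.get()) is not DONE:
        # A real worker would synthesize `sentence` and save the MP3 under static/.
        await events.put({"type": "audio", "src": f"/static/aud_{index}.mp3"})
        index += 1
    await events.put(DONE)  # the worker finishes last, so it closes the stream


async def sse_stream(prompt: str):
    """Drain the event queue and yield SSE-framed strings."""
    events: asyncio.Queue = asyncio.Queue()
    tts_text: asyncio.Queue = asyncio.Queue()
    tasks = [
        asyncio.create_task(llm_producer(prompt, events, tts_text)),
        asyncio.create_task(tts_worker(events, tts_text)),
    ]
    while (event := await events.get()) is not DONE:
        yield f"data: {json.dumps(event)}\n\n"
    await asyncio.gather(*tasks)


async def collect(prompt: str) -> list[str]:
    """Convenience helper: run the pipeline and return every frame."""
    return [frame async for frame in sse_stream(prompt)]
```

Calling `asyncio.run(collect("A young dragon..."))` returns the frames in arrival order. In a FastAPI app, a generator like `sse_stream` would typically be wrapped in a streaming response with the `text/event-stream` media type.
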
- Python 3.10+
- Node.js 18+
- Google Cloud project with Gemini API enabled
# 1. Install Python dependencies
pip install -r requirements.txt
# 2. Install frontend dependencies
Invoke-WebRequest https://get.pnpm.io/install.ps1 -UseBasicParsing | Invoke-Expression
cd frontend
pnpm install
# 3. Configure Google Cloud (set PROJECT_ID in app.py)
# Required: Google Cloud project with Vertex AI enabled

# Terminal 1: Start the FastAPI backend
python app.py
# Backend runs at http://localhost:8000
# Terminal 2: Start Vue dev server (hot reload)
cd frontend
pnpm run dev
# Frontend runs at http://localhost:5173

The frontend proxies API requests to the backend:
- /api/* → http://localhost:8000/api/*
- /static/* → http://localhost:8000/static/*
# Build frontend
cd frontend
pnpm run build
# This creates the dist/ folder with static files
# Run production server
python app.py
# Serves the built frontend from dist/

- Three.js Particle Background: Ambient golden particles float upward, react to mouse movement, and change color per genre
- Genre Theming: 5 distinct themes (Fantasy, Sci-Fi, Mystery, Fairy Tale, Adventure) via CSS custom properties
- GSAP Animations: Smooth page transitions, logo entrance, button glows, card 3D tilts
- Real-time Streaming: Server-Sent Events deliver story content as it's generated
- Typewriter Effect: Text streams in character by character with a cursor
- Custom Audio Player: Styled player with progress bar and auto-play
- Responsive Design: Works on desktop, tablet, and mobile
- Gemini 2.5 Pro: Generates story text with interleaved tool calls
- Imagen 3.0: Generates scene images
- Google Cloud Text-to-Speech API: Converts story text into spoken narration
- Server-Sent Events: Streams content in real time
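
Story text arrives from the model in chunks, while narration works on whole sentences, so something has to regroup the stream before it reaches the TTS queue. The sketch below shows one way to do that. `SentenceBuffer` is a hypothetical helper, not a class from `app.py`, and it assumes a sentence ends at `.`, `!`, or `?` followed by whitespace, which will mis-split abbreviations such as "Dr. Smith".

```python
import re

# Assumption: a sentence boundary is whitespace that follows ., ! or ?
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


class SentenceBuffer:
    """Accumulates streamed text chunks and releases complete sentences."""

    def __init__(self) -> None:
        self._pending = ""

    def feed(self, chunk: str) -> list[str]:
        """Add a chunk and return the sentences it completed, if any."""
        self._pending += chunk
        # Everything before the last boundary is complete; the tail stays buffered.
        *complete, self._pending = _SENTENCE_END.split(self._pending)
        return [sentence for sentence in complete if sentence]

    def flush(self) -> list[str]:
        """Return any trailing text once the stream has ended."""
        rest, self._pending = self._pending.strip(), ""
        return [rest] if rest else []
```

A producer would call `feed()` on every text chunk and push each returned sentence to the TTS queue, then call `flush()` once after the model stream ends so the final sentence is not lost.
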
- Create a Google Cloud project
- Enable Vertex AI API
- Set PROJECT_ID in app.py:
PROJECT_ID = "your-project-id"

For production, you might want to use environment variables:
export PROJECT_ID="your-project-id"
export LOCATION="us-central1"

| Layer | Technology |
|---|---|
| Frontend Framework | Vue 3 + TypeScript |
| Build Tool | Vite |
| Animations | GSAP |
| 3D Effects | Three.js |
| Styling | CSS Custom Properties |
| Backend | FastAPI (Python) |
| AI | Google Gemini + Imagen |
| Audio | Google Cloud Text-to-Speech |

| Method | Endpoint | Description |
|---|---|---|
| GET | / | Serve frontend |
| POST | /api/generate | Generate story (SSE stream) |
Request:
{
"prompt": "A young dragon discovers it can speak human languages..."
}

Response: Server-Sent Events stream
{"type": "image", "src": "/static/img_abc123.jpg"}
{"type": "text", "chunk": "Once upon a "}
{"type": "text", "chunk": "time, in a land..."}
{"type": "audio", "src": "/static/aud_def456.mp3"}

License: MIT
Built for the Gemini Live Agent Challenge.
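
For anyone scripting against `/api/generate` outside the bundled frontend, the response events documented above can be consumed roughly as follows. This is a sketch under assumptions: `parse_sse` and `assemble` are hypothetical helper names, and each event is assumed to arrive as a single `data:` line holding one JSON object with the `type`, `chunk`, and `src` fields shown in the example response.

```python
import json
from typing import Iterable, Iterator


def parse_sse(lines: Iterable[str]) -> Iterator[dict]:
    """Yield the JSON payload of every `data:` line in an SSE stream."""
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith("data:"):
            # Assumption: one complete JSON object per data line.
            yield json.loads(line[len("data:"):].strip())


def assemble(events: Iterable[dict]) -> dict:
    """Collect events into the story text plus image and audio URLs."""
    story = {"text": "", "images": [], "audio": []}
    for event in events:
        if event["type"] == "text":
            story["text"] += event["chunk"]
        elif event["type"] == "image":
            story["images"].append(event["src"])
        elif event["type"] == "audio":
            story["audio"].append(event["src"])
    return story


# The example events from the response section, framed as SSE lines.
SAMPLE = [
    'data: {"type": "image", "src": "/static/img_abc123.jpg"}',
    "",
    'data: {"type": "text", "chunk": "Once upon a "}',
    "",
    'data: {"type": "text", "chunk": "time, in a land..."}',
    "",
    'data: {"type": "audio", "src": "/static/aud_def456.mp3"}',
    "",
]

story = assemble(parse_sse(SAMPLE))
```

Blank lines and any non-`data:` fields are skipped, so the same parser can be fed the decoded lines of a streaming HTTP response.
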