This HTML file demonstrates a text-to-speech application using the Gemini API and WebSockets. It allows you to type a text message into an input field, send it to the Gemini model, and receive an audio response that is played back in the browser. The application uses the browser's built-in WebSocket and AudioContext APIs to handle real-time communication and audio playback.
This example focuses on demonstrating:
- Low-level interaction with the Gemini API using WebSockets.
- Handling audio responses from the API and playing them back in the browser.
- Managing user input and displaying messages.
- Implementing audio chunk queuing and playback with
AudioContext.
The application's functionality can be broken down into several key components:
- API Endpoint: The application connects to the Gemini API using a specific WebSocket endpoint URL:
This URL includes the API host, the service path, and an API key for authentication. Remember to replace
wss://generativelanguage.googleapis.com/ws/google.ai.generativelanguage.v1alpha.GenerativeService.BidiGenerateContent?key=${apiKey}${apiKey}with your actual API key. - WebSocket Object: A new
WebSocketobject is created in JavaScript, initiating the connection:const ws = new WebSocket(endpoint);
- Event Handlers: Event handlers are defined to manage the connection's lifecycle and handle incoming messages:
onopen: Triggered when the connection is successfully opened.onmessage: Triggered when a message is received from the server.onerror: Triggered if an error occurs during the connection.onclose: Triggered when the connection is closed.
onopenHandler: When theonopenevent is triggered, the application sends a setup message to the API.- Setup Message Structure: The setup message is a JSON object that configures the interaction:
const setupMessage = { setup: { model: "models/gemini-2.0-flash-exp", generation_config: { response_modalities: ["AUDIO"] } } };
model: Specifies the Gemini model to use ("models/gemini-2.0-flash-exp"in this case).generation_config: Configures the generation parameters. Here,response_modalitiesis set to["AUDIO"]to request audio output.
- Sending the Message: The setup message is stringified and sent to the server using
ws.send(). - Input Disabled: Initially, the user input field and send button are disabled. They are only enabled after the setup is complete.
onmessageHandler: Theonmessageevent handler receives messages from the server.- Handling Different Response Types: The code handles either
BloborJSONdata. It convertsBlobdata to text and parses the text as JSON. - Response Parsing: The received message is parsed as a JSON object using
JSON.parse(). - Message Types: The code checks for two types of messages:
setupComplete: Indicates that the setup process is finished.serverContent: Contains the model's response, which in this case will be audio data.
- Enabling Input: When a
setupCompletemessage is received, the application enables the user input field and the send button. sendUserMessage()Function: This function is called when the user clicks the "Send" button or presses Enter in the input field.- User Message Structure: The user message is a JSON object:
const contentMessage = { client_content: { turns: [{ role: "user", parts: [{ text: message }] }], turn_complete: true } };
client_content: Contains the conversation content.turns: An array representing the conversation turns.role: Indicates the role of the speaker ("user" in this case).parts: An array of content parts (in this case, a single text part containing the user's message).turn_complete: Set totrueto signal the end of the user's turn.
- Sending the Message: The content message is stringified and sent to the server using
ws.send(). - Clearing Input: The input field is cleared after the message is sent.
serverContentwith Audio: When aserverContentmessage containing audio data is received, the application extracts the base64-encoded audio data.inlineData: The audio data is found inresponse.serverContent.modelTurn.parts[0].inlineData.data.playAudioChunk(): This function is called to handle the audio chunk.- Audio Queue: Audio is pushed into an
audioQueuearray for processing. - Audio Playback Management:
isPlayingAudioflag ensures that chunks are played sequentially, one after the other.
ensureAudioInitialized(): This function initializes theAudioContextwhen the first audio chunk is received. This is done lazily to comply with browser autoplay policies. It sets a sample rate of 24000.- Lazy Initialization: The
AudioContextis only created when the first audio chunk is received. This is because some browsers restrict audio playback unless it's initiated by a user action. - Sample Rate: The sample rate is set to 24000 Hz, which is a common sample rate for speech audio.
- Lazy Initialization: The
playAudioChunk(): This function adds an audio chunk to a queue (audioQueue) and initiates audio playback if it's not already playing.processAudioQueue(): This function is responsible for processing and playing audio chunks from the queue.- Chunk Handling: It retrieves an audio chunk from the queue.
- Base64 Decoding: The base64-encoded audio chunk is decoded to an
ArrayBufferusingbase64ToArrayBuffer(). - PCM to Float32 Conversion: The raw PCM16LE (16-bit little-endian Pulse Code Modulation) audio data is converted to Float32 format using
convertPCM16LEToFloat32(). This is necessary becauseAudioContextworks with floating-point audio data. - Creating an
AudioBuffer: AnAudioBufferis created with a single channel, the appropriate length, and a sample rate of 24000 Hz. The Float32 audio data is then copied into theAudioBuffer. - Creating an
AudioBufferSourceNode: AnAudioBufferSourceNodeis created, which acts as a source for the audio data. TheAudioBufferis assigned to the source node. - Connecting to Destination: The source node is connected to the
AudioContext's destination (the speakers). - Starting Playback:
source.start(0)starts the playback of the audio chunk immediately. onendedEvent: A promise is used with theonendedevent of the source node to ensure that the next chunk in the queue is only played after the current chunk has finished playing. This is crucial for maintaining the correct order and avoiding overlapping audio.
base64ToArrayBuffer(): Converts a base64-encoded string to anArrayBuffer.convertPCM16LEToFloat32(): Converts PCM16LE audio data to Float32 format.logMessage(): Appends a message to theoutputdiv on the HTML page.
onerrorHandler: Logs WebSocket errors to the console and displays an error message on the page.oncloseHandler: Logs information about the connection closure.
This example demonstrates a basic text-to-speech application using the Gemini API with WebSockets. It showcases:
- Establishing a WebSocket connection and sending a setup message.
- Handling user input and sending text messages to the API.
- Receiving audio responses in base64-encoded chunks.
- Decoding and converting audio data to a format suitable for playback.
- Using
AudioContextto play the audio in the browser sequentially, one chunk after the other. - Implementing basic error handling and connection closure.
This example provides a starting point for building more sophisticated applications that can generate audio responses from the Gemini model and play them back in real time, all within the browser environment using low-level WebSockets and AudioContext for audio management.