This HTML file demonstrates a single exchange with the Gemini language model using WebSockets, illustrating the fundamental principles of interacting with the API at a low level, without using an SDK. The application establishes a WebSocket connection, sends a hardcoded user message, and displays the model's response in the browser.
This example is primarily for educational purposes, showcasing how to use the browser's built-in WebSocket API to communicate with the Gemini API directly. It is not intended to be a full-fledged chat application but rather a simplified demonstration of the underlying communication mechanism.
The application's functionality can be broken down into several key components:
- API Endpoint: The application connects to the Gemini API using a specific WebSocket endpoint URL:
This URL includes the API host, the service path, and an API key for authentication. Replace
wss://generativelanguage.googleapis.com/ws/google.ai.generativelanguage.v1alpha.GenerativeService.BidiGenerateContent?key=${apiKey}${apiKey}with your actual API key. - WebSocket Object: A new
WebSocketobject is created in JavaScript, initiating the connection:const ws = new WebSocket(endpoint);
- Event Handlers: Event handlers are defined to manage the connection's lifecycle and handle incoming messages:
onopen: Triggered when the connection is successfully opened.onmessage: Triggered when a message is received from the server.onerror: Triggered if an error occurs during the connection.onclose: Triggered when the connection is closed.
- API Requirement: The Gemini API requires a setup message to be sent as the very first message after the WebSocket connection is established. This is crucial for configuring the session.
onopenHandler: Theonopenevent handler, which is triggered when the connection is open, is responsible for sending this setup message.- Setup Message Structure: The setup message is a JSON object that conforms to the
BidiGenerateContentSetupformat as defined in the API documentation:const setupMessage = { setup: { model: "models/gemini-2.0-flash-exp", generation_config: { response_modalities: ["text"] } } };
model: Specifies the Gemini model to use ("models/gemini-2.0-flash-exp"in this case).generation_config: Configures the generation parameters, such as theresponse_modalities(set to"text"for text-based output). You can also specify other parameters liketemperature,top_p,top_k, etc., withingeneration_configas needed.
- Sending the Message: The setup message is stringified and sent to the server using
ws.send():ws.send(JSON.stringify(setupMessage));
onmessageHandler: Theonmessageevent handler receives messages from the server.- Data Handling: The code handles potential
Blobdata usingnew Response(event.data).text(), but in this text-only example, it directly parses the message as JSON. - Response Parsing: The received message is parsed as a JSON object using
JSON.parse(). - Message Types: The code specifically checks for a
BidiGenerateContentSetupCompletemessage type, indicated by thesetupCompletefield in the response.
setupCompleteCheck: The code includes a conditional check to ensure that asetupCompletemessage is received before sending any user content:if (response.setupComplete) { // ... Send user message ... }
- Why This Is Important: This check is essential because the API will not process user content messages until the setup is complete. Sending content before receiving confirmation that the setup is complete will likely result in an error or unexpected behavior. The API might close the connection if messages other than the initial setup message are sent before the setup is completed.
- Triggered by
setupComplete: Only after thesetupCompletemessage is received and processed does the application send a user message to the model. - User Message Structure: The user message is a JSON object conforming to the
BidiGenerateContentClientContentformat:const contentMessage = { client_content: { turns: [{ role: "user", parts: [{ text: "Hello! Are you there?" }] }], turn_complete: true } };
client_content: Contains the conversation content.turns: An array representing the conversation turns.role: Indicates the role of the speaker ("user" in this case).parts: An array of content parts (in this case, a single text part).text: The actual user message (hardcoded to "Hello! Are you there?").turn_complete: Set totrueto signal the end of the user's turn.
- Sending the Message: The content message is stringified and sent to the server using
ws.send().
serverContentHandling: When aserverContentmessage is received (which contains the model's response), the application extracts the response text.- Response Extraction: The model's response is accessed using
response.serverContent.modelTurn.parts[0]?.text. - Displaying the Response: The
logMessage()function displays the model's response in theoutputdiv on the HTML page.
onerrorHandler: Theonerrorevent handler logs any WebSocket errors to the console and displays an error message on the page.oncloseHandler: Theoncloseevent handler logs information about the connection closure, including the reason and status code.
logMessage()Function: This utility function creates a new paragraph element (<p>) and appends it to theoutputdiv, displaying the provided message on the page.
This example focuses on demonstrating the low-level interaction with the Gemini API using WebSockets. It highlights the importance of the setup phase and demonstrates how to:
- Establish a raw WebSocket connection without relying on an SDK.
- Send a properly formatted
BidiGenerateContentSetupmessage as the first message to configure the session. - Wait for and verify the
setupCompletemessage before sending any user content. - Send properly formatted
BidiGenerateContentClientContentmessages containing user input. - Parse and interpret the JSON responses from the API, including
BidiGenerateContentSetupCompleteandBidiGenerateContentServerContentmessages. - Handle basic events like
onopen,onmessage,onerror, andonclose.
By examining this code, you can gain a deeper understanding of the underlying communication protocol and message formats used by the Gemini API, particularly the mandatory setup phase. This knowledge can be valuable for debugging, troubleshooting, or building custom integrations that require more control than an SDK might offer.
Note: This is a simplified example for educational purposes. A real-world chat application would involve more complex features like:
- Dynamic user input.
- Handling multiple conversation turns.
- Maintaining conversation history.
- Potentially integrating audio or video.
This example provides a solid foundation for understanding the basic principles involved in interacting with the Gemini API at a low level using WebSockets, especially the crucial setup process.