feat: implement chatroom for multi-llm conversation by dolaameng · Pull Request #162 · Kaggle/kaggle-benchmarks

dolaameng · 2026-05-14T01:02:49Z

What

ChatRoom is a shared conversation context for kaggle-benchmarks that lets
multiple LLMs converse with full awareness of each other's identities and roles.

Key capabilities:

Perspective-aware message routing — each LLM sees its own messages as
assistant and peers' messages as attributed user messages.
Automatic participant isolation — the same LLM instance can be reused for
multiple participants without identity collisions.
Private information — messages can be restricted to specific participants
via visible_to, and multi-turn private conversations are supported via
private_channel.
Code-driven participants — game engines, moderators, and rule systems can
participate alongside LLMs as Actor instances.

Why

Multi-agent evaluation (debate, negotiation, social deduction, cooperative games)
is an increasingly important dimension for frontier LLM benchmarking, but the
existing Chat API doesn't support it natively:

LLMs are unaware of each other. Each agent has an isolated chat context.
The user manually forwards messages between them, stripping and re-injecting
roles. LLMs have no idea they are talking to another LLM.
Boilerplate is high. Existing multi-agent benchmarks in this repo
(dungeon_adventure.py,
game_tic_tac_toe.py,
pgg.py)
each re-implement ~40–160 lines of manual orchestration.
No conversation memory. Some benchmarks create a brand new Chat
every turn, leaving LLMs with zero memory of previous turns.

How

Core Abstraction

A ChatRoom is a shared conversation space. Users register participants via
add_participant(), then drive the conversation inside a with room: block
using two primitives:

Primitive	Purpose	LLM Call?
`room.post(msg)`	Narrator broadcast (system-level directives)	No
`participant.reply()` / `actor.say(msg)`	A participant speaks	Depends on type

room = kbench.ChatRoom(system_prompt="A structured debate.")
alice = room.add_participant(llm, name="Alice", system_prompt="Argue FOR.")
bob = room.add_participant(llm, name="Bob", system_prompt="Argue AGAINST.")

with room:
    room.post("Topic: Should we phase out fossil fuels by 2035?")
    alice.reply()
    bob.reply()

Participant Registration (`add_participant`)

room.add_participant(actor, *, name=, avatar=, system_prompt=):

Identity awareness — auto-injects a participant roster into each LLM's
system prompt so every agent knows who else is in the room.
Message routing — when any participant speaks, their message is
automatically visible to all other participants.
Automatic isolation — for LLMChat participants, creates an independent
clone so the same LLM can be reused for multiple participants.

Perspective Projection

The core mechanism that makes multi-agent conversations work. When an LLM calls
reply(), the room:

Builds a system prompt — concatenates participant roster + room prompt +
personal prompt.
Projects the message history — creates a viewer-specific copy where the
viewer's own messages have role assistant and all peer messages have role
user with name prefixes (e.g., [Bob]: ...).
Calls the LLM with the projected history, then appends the response to
the ground-truth log.

Original messages are never mutated — projection always creates new objects.

Private Information

visible_to — any room.post() can be restricted to a subset of
participants. Hidden messages are filtered out during perspective projection.
private_channel — creates a child ChatRoom for multi-turn private
conversations (e.g., werewolf night phase). Members retain context across
both the main room and the channel; non-members never see channel messages.

Code-Driven Participants (`Actor`)

Non-LLM participants (game engines, moderators) participate as Actor instances
using actor.say(msg). Other participants see attributed messages just like LLM
messages (e.g., [Game]: Board: X|O|_).

Integration with Existing Framework

Context manager — ChatRoom integrates with the existing contexts.enter()
system, so chats.get_current_chat() returns the active room.
Chat hierarchy — entering a ChatRoom auto-registers it as a nested step
in the parent chat, so the full transcript renders in the Panel UI without any
UI modifications.
Structured outputs — reply(schema=...) works for typed responses (e.g.,
game moves, votes).
Caching — participant isolation ensures cache IDs are based on the
underlying model identity, not participant names.

Example: Before & After

Tic-Tac-Toe

Before — fresh chat each turn, zero memory:

while not game.is_game_over():
    with kbench.chats.new(...):           # brand new context every turn
        move = llm.prompt(state, schema=action_schema)

After — full history, attributed turns:

room = kbench.ChatRoom()
game_engine = kbench.Actor(name="Game", avatar="🎮")
room.add_participant(game_engine)
player_x = room.add_participant(llm, name="Player X")
player_o = room.add_participant(llm, name="Player O")

with room:
    while not game.is_game_over():
        game_engine.say(f"Board:\n{game.get_board()}")
        move = players[game.current].reply(schema=TicTacToeMove)
        game.make_move(move)

Key Files

File	Role
chats.py	`ChatRoom` class, perspective projection, visibility filtering
llms.py	`LLMChat.reply()` implementation
base.py	`Actor.say()` implementation
test_chatroom.py	Test suite

Open Questions

reply() visibility — should reply() support visible_to for private
LLM-generated responses? Deferred for now; all reply() output is public.
Serialization — how to serialize room conversations in run.json
(ground-truth log vs. per-participant projections vs. both).

…om framework - Implemented ChatRoom context manager for multi-agent perspective-aware message routing. - Added identity awareness, system prompt enrichment, and automatic roster injection. - Added support for private channels, visible_to restrictions, and interleaved histories. - Refactored game_werewolf_chatroom.py to dynamically scale up to 7 players. - Added unit tests verifying multi-directional privacy and sealed bid isolation.

…oting bugs - Fixed cache_id in runs.py and slug in serialization.py to resolve the actual model version identifier instead of participant name. - Fixed werewolf game vote extraction robustness in game_werewolf_chatroom.py to prevent false-positives on mentioned names. - Fixed test_corporate_takeover_chatroom assertion types in test_chatroom.py.

…sing findings

…fix streaming typeerrors - Switched game_werewolf_chatroom.py to use structured WerewolfVote outputs. - Enabled kbench.config.enable_interactive_mode() and player.stream_responses = True for live onstream rendering. - Added survival/existence validation to voting loops to prevent crashes during ties or votes on dead players. - Fixed panel.py new_chunk streaming TypeError by extracting string content from LLMResponse chunk objects. - Updated tests/test_chatroom.py werewolf mock responses to return structured JSON.

…sign examples

…update design doc - Remove 4-player backward compat; run_werewolf now strictly requires 7 players. - Upgrade Alice/Bob wolf prompts with double-bluff and distancing strategies. - Update test_werewolf_chatroom to simulate a full 2-round, 7-player game. - Add section 9.5 to design.md for Panel streaming bug fix. - Remove example-specific structured voting details from design doc.

…vatars - Replace fuzzy name matching with explicit eligible name lists in vote prompts. - Use neutral role-agnostic avatars to avoid spoiling werewolf identities.

develra

Reviewed at a high level to the best of my ability in a time-boxed way (30 minutes) - mostly looking at the examples and tests. LGTM - neat feature, but def might have missed some more subtle issues.

dolaameng · 2026-05-29T15:34:18Z

+        # Uses role="user" because LLM APIs require user/assistant alternation.
+        self._narrator = actors.Actor(name=name, role="user", avatar="📢")
+
+    def add_participant(


Do we still need to clone the llm?

dolaameng added 13 commits May 14, 2026 00:47

docs: Update design.md with multi-agent cache, identity, and vote par…

83662c7

…sing findings

docs: Clean up example-specific details from design.md

d5aca19

feat(chatroom): Add reason field to WerewolfVote schema and update de…

9033944

…sign examples

fix(werewolf): List exact player names in vote prompts, use neutral a…

38c5f29

…vatars - Replace fuzzy name matching with explicit eligible name lists in vote prompts. - Use neutral role-agnostic avatars to avoid spoiling werewolf identities.

fix tests

74c2d58

Merge branch 'ci' into dolaameng/chatroom

419edf6

refactor

fb059a8

merge ci

3bfb4a0

turing test

633ced9

dolaameng force-pushed the dolaameng/chatroom branch 2 times, most recently from bc7701a to 633ced9 Compare May 27, 2026 22:34

dolaameng added 7 commits May 27, 2026 22:44

remove room.run from pr

64b040a

tidy up

106100c

add pizza order example

c928c65

refactor add_participant api

73ea0ed

add golden tests

e6e2fa1

change talk to say/reply

cd1378e

minor

2e84931

dolaameng force-pushed the dolaameng/chatroom branch from ca3ee6b to 2e84931 Compare May 28, 2026 22:04

dolaameng added 3 commits May 28, 2026 22:20

refactor chats

1d5de50

refactor reply

fe03067

refactor chatroom

7922f61

dolaameng requested review from develra and s-alexey May 29, 2026 03:33

dolaameng marked this pull request as ready for review May 29, 2026 03:34

Update design doc to match current implementation

9b3ec0e

dolaameng force-pushed the dolaameng/chatroom branch from e8e876e to 9b3ec0e Compare May 29, 2026 04:02

develra approved these changes May 29, 2026

View reviewed changes

dolaameng commented May 29, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: implement chatroom for multi-llm conversation#162

feat: implement chatroom for multi-llm conversation#162
dolaameng wants to merge 24 commits into
cifrom
dolaameng/chatroom

dolaameng commented May 14, 2026 •

edited

Loading

Uh oh!

develra left a comment

Uh oh!

dolaameng May 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

dolaameng commented May 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What

Why

How

Core Abstraction

Participant Registration (add_participant)

Perspective Projection

Private Information

Code-Driven Participants (Actor)

Integration with Existing Framework

Example: Before & After

Tic-Tac-Toe

Key Files

Open Questions

Uh oh!

develra left a comment

Choose a reason for hiding this comment

Uh oh!

dolaameng May 29, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

dolaameng commented May 14, 2026 •

edited

Loading

Participant Registration (`add_participant`)

Code-Driven Participants (`Actor`)