This toolkit contains tools to extract conversational features and analyze social phenomena in conversations, using a `single unified interface <https://convokit.cornell.edu/documentation/architecture.html>`_ inspired by (and compatible with) scikit-learn. Several large conversational datasets are included together with scripts exemplifying the use of the toolkit on these datasets. The latest version is `4.1.0 <https://github.com/CornellNLP/ConvoKit/releases/tag/v4.1.0>`_ (released Mar. 10, 2026); follow the project on GitHub to keep track of updates.
+
+Quick Links
+-----------
+
+* :doc:`installation` - Get started with ConvoKit
+* :doc:`datasets` - Browse available conversational datasets
+* :doc:`features` - Explore analysis features and APIs
* `Discord Community <https://discord.gg/WMFqMWgz6P>`_
+
+Documentation
+-------------
+
+Documentation is hosted `here <https://convokit.cornell.edu/documentation/>`_.
+
+If you are new to ConvoKit, great places to get started are:
+
+* The `Core Concepts tutorial <https://convokit.cornell.edu/documentation/architecture.html>`_ for an overview of ConvoKit's object model
+* The `High-level tutorial <https://convokit.cornell.edu/documentation/tutorial.html>`_ for a walkthrough of importing ConvoKit, loading a Corpus, and using its functions
+
+For an overview, watch our SIGDIAL talk introducing the toolkit:
Join our `Discord community <https://discord.gg/WMFqMWgz6P>`_ to:
+
+* Get help with installation and usage
+* Stay updated on the latest releases
+* Discuss progress, features, and issues
+* Share your work and connect with others
+
+Citation
+--------
+
+If you use ConvoKit code or datasets, please acknowledge the respective components in addition to:
+
+Jonathan P. Chang, Caleb Chiam, Liye Fu, Andrew Wang, Justine Zhang, Cristian Danescu-Niculescu-Mizil. 2020.
+"ConvoKit: A Toolkit for the Analysis of Conversations". *Proceedings of SIGDIAL*.
+
+Funding
+-------
+
+*ConvoKit is funded in part by the U.S. National Science Foundation under Grant No. IIS-1750615 (CAREER). Any opinions, findings, and conclusions in this work are those of the author(s) and do not necessarily reflect the views of Cornell University or the National Science Foundation.*
@@ -334,7 +335,7 @@ A collection of 1,155 five-minute telephone conversations between two participan
 </div>

 Stanford Politeness Corpus
-------------------------
+--------------------------

 .. raw:: html

@@ -466,7 +467,7 @@ Fora Corpus

 <div class="dataset-card" data-tags="small size, speaker info, utterance labels, timestamps, group, in person, various topics">

-Fora corpus is a dataset of 262 annotated transcripts of multi-person facilitated dialogues regarding issues like education, elections, and public health, primarily through the sharing of personal experience. The corpus is available by request from the authors (https://github.com/schropes/fora-corpus) and ConvoKit contains code for converting the transcripts into ConvoKit format, as detailed below.
+Fora corpus is a dataset of 262 annotated transcripts of multi-person facilitated dialogues regarding issues like education, elections, and public health, primarily through the sharing of personal experience. The corpus is available by request from the authors (https://github.com/schropes/fora-corpus) and ConvoKit contains code for converting the transcripts into ConvoKit format, as detailed in the documentation.

 * **Tags:** small size, speaker info, utterance labels, timestamps, group, in person, various topics
@@ -476,12 +477,13 @@ Fora corpus is a dataset of 262 annotated transcripts of multi-person facilitate
 </div>

 Unintended Offense Corpus
--------------
+-------------------------

 .. raw:: html

 <div class="dataset-card" data-tags="online, asynchronous, outcome, labels, utterance labels, timestamps, Twitter/X, medium size, short conversations, various topics, politeness">
-A collection of unintentionally offensive Tweets and replies in which a Tweet in the exchange was offensive to someone, followed by an indication that the poster meant no offense.
+
+A collection of unintentionally offensive Tweets and replies in which a Tweet in the exchange was offensive to someone, followed by an indication that the poster meant no offense. ConvoKit contains code for converting the data into ConvoKit format, as detailed in the documentation.

 * **Tags:** online, asynchronous, outcome, labels, utterance labels, timestamps, Twitter/X, medium size, short conversations, various topics, politeness
@@ -491,14 +493,15 @@ A collection of unintentionally offensive Tweets and replies in which a Tweet in
 </div>

 Ubuntu Chat Logs
--------------
+----------------

 .. raw:: html

 <div class="dataset-card" data-tags="online, dyadic, asymmetric, synchronous, outcome, labels, utterance labels, speaker info, timestamps, small size, medium conversations, customer support, problem solving, derailment">

 A collection of conversations featuring pairs of speakers where one speaker is assisting the other through Ubuntu chat logs to help them solve their problem.

+* **Download name:** ``ubuntu-chat-logs``
 * **Tags:** online, dyadic, asymmetric, synchronous, outcome, labels, utterance labels, speaker info, timestamps, small size, medium conversations, customer support, problem solving, derailment
@@ -507,14 +510,15 @@ A collection of conversations featuring pairs of speakers where one speaker is a
 </div>

 Contextual Abuse Corpus
--------------
+-----------------------

 .. raw:: html

 <div class="dataset-card" data-tags="online, asynchronous, utterance, labels, timestamps, Reddit, medium size, short conversations, various topics">

 A dataset of annotated Reddit entries labeled into one or more of six primary categories of abuse. Secondary categories, labels annotated in the context of the conversation thread, and rationales are also included as part of the dataset.

+* **Download name:** ``contextual-abuse``
 * **Tags:** online, asynchronous, utterance, labels, timestamps, Reddit, medium size, short conversations, various topics
@@ -523,14 +527,15 @@ A dataset of annotated Reddit entries labeled into one or more of six primary ca
 </div>

 NewsInterview Corpus
--------------
+--------------------

 .. raw:: html

 <div class="dataset-card" data-tags="dyadic, asymmetric, synchronous, speaker info, summaries, timestamps, media, medium size, medium conversations, various topics, interviews, Q&A">

 A collection of two-person informational interviews from National Public Radio (NPR) and Cable News Network (CNN), focusing on journalistic interviews between interviewers and sources from 2000 to 2020.

+* **Download name:** ``news-interview``
 * **Tags:** dyadic, asymmetric, synchronous, speaker info, summaries, timestamps, media, medium size, medium conversations, various topics, interviews, Q&A
@@ -539,14 +544,15 @@ A collection of two-person informational interviews from National Public Radio (
 </div>

 Emotional Support Conversation Corpus
--------------
+-------------------------------------

 .. raw:: html

 <div class="dataset-card" data-tags="online, dyadic, asymmetric, synchronous, outcome, labels, utterance labels, speaker info, medium size, medium conversations, various topics, support">

 This dataset contains approximately 1,300 conversations collected between emotional support seekers and supporters.

+* **Download name:** ``emotional-support``
 * **Tags:** online, dyadic, asymmetric, synchronous, outcome, labels, utterance labels, speaker info, medium size, medium conversations, various topics, support