25.3 million pull request review comments on GitHub, from January 2015 through December 2018.
xz-compressed CSV, with columns:
COMMENT_ID - identifier of the comment in the parent dataset, GH Archive
COMMIT_ID - commit hash to which the review comment is attached
URL - path to the GitHub pull request the comment comes from
AUTHOR - GitHub user who wrote the comment
CREATED_AT - creation date of the comment
BODY - raw content of the comment
Python:
# too big for pandas.read_csv, so stream it record by record
import csv
import lzma

# open the archive in text mode; newline="" lets csv handle line breaks inside quoted fields
with lzma.open("review_comments.csv.xz", "rt", encoding="utf-8", newline="") as archf:
    reader = csv.DictReader(archf)
    for record in reader:
        print(record)

The dataset was generated from GH Archive in the following notebook.
Comments that exceeded Python's default csv.field_size_limit of 128 KB were discarded (about 10 comments).
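
To illustrate the size rule, here is a minimal sketch of a filter that drops any row with a field longer than the csv module's limit. The helper name `keep_row` and the sample rows are hypothetical and only approximate the effect described above; the notebook that built the dataset may have handled oversized comments differently.

```python
import csv

# Current per-field limit of the csv module, in characters
# (131072, i.e. 128 * 1024, unless it has been changed).
LIMIT = csv.field_size_limit()


def keep_row(row, limit=LIMIT):
    """Hypothetical helper: True when every field in the row fits the limit."""
    return all(len(value) <= limit for value in row.values())


# Made-up rows for illustration; the second BODY is one character too long.
rows = [
    {"AUTHOR": "alice", "BODY": "looks good"},
    {"AUTHOR": "bob", "BODY": "x" * (LIMIT + 1)},
]
kept = [row for row in rows if keep_row(row)]
print(len(kept))
```

Only the first row survives the filter, which mirrors how a handful of very long comments ended up missing from the dataset.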
We gathered some statistics about the dataset.
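
As one example of a statistic that can be computed without loading the whole file into memory, the sketch below streams the archive and counts comments per author. It assumes the CSV has a header row with the column names listed above; the function name `count_by_author` and the in-memory sample are made up for illustration and are not taken from the original analysis.

```python
import csv
import io
import lzma
from collections import Counter


def count_by_author(source):
    """Stream an xz-compressed CSV (path or binary file object) and count rows per AUTHOR."""
    counts = Counter()
    with lzma.open(source, "rt", encoding="utf-8", newline="") as text:
        for record in csv.DictReader(text):
            counts[record["AUTHOR"]] += 1
    return counts


# Tiny in-memory stand-in for review_comments.csv.xz, with made-up rows.
sample = "AUTHOR,BODY\nalice,nit\nbob,LGTM\nalice,typo\n"
archive = io.BytesIO(lzma.compress(sample.encode("utf-8")))
print(count_by_author(archive).most_common(1))
```

For the real dataset, pass "review_comments.csv.xz" instead of the in-memory archive. Memory use then grows with the number of distinct authors rather than with the number of comments.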