
Example data - statistics extracted from patch annotations

The stats/ subdirectory contains JSON files generated with the diff-gather-stats script and its various subcommands. The script processes the output of the diff-annotate from-repo command, which is stored in the annotations/ subdirectory. These files were generated by running the DVC pipeline (defined in the /dvc.yaml file) with the dvc repro command.

Using a DVC pipeline makes it possible to regenerate only the files that are out of date, re-running only the stages that need it.
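To make the idea of "regenerate only what needs it" concrete, here is a minimal sketch of a content-hash staleness check. It is a simplified illustration and not DVC's actual implementation: the function names `file_digest` and `stage_is_stale`, the `recorded_hashes` mapping, and the choice of SHA-256 are all assumptions made for this example.

```python
import hashlib
from pathlib import Path


def file_digest(path):
    """Return the hex SHA-256 digest of a file's contents (illustrative choice of hash)."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def stage_is_stale(dependencies, recorded_hashes):
    """Return True if any dependency is unrecorded or differs from its recorded hash.

    `recorded_hashes` is a hypothetical mapping from a dependency path (as str)
    to the digest stored after the stage last ran.
    """
    for dep in dependencies:
        if recorded_hashes.get(str(dep)) != file_digest(dep):
            return True
    return False
```

A stage whose dependencies all match their recorded digests can be skipped; any mismatch means the stage, and whatever depends on its outputs, has to run again.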

Here is the graph of the DVC pipeline stages, as a Mermaid flowchart:

flowchart LR
        node1["annotate@{0.a,0.b,1.c}"]
        node3["clone@{0,1}"]
        node4["purpose-counter@{0,1}"]
        node5["purpose-per-file@{0,1}"]
        node6["lines-stats@{0,1}"]
        node7["timeline@{0,1}"]
        node8["timeline.purpose-to-type@{0,1}"]
        node1-->node4
        node1-->node5
        node1-->node6
        node1-->node7
        node1-->node8
        node3-->node1

Variables in the DAG of DVC stages above:

You can also see the whole, up-to-date, interactive graph of stages and their dependencies at https://dagshub.com/ncusi/PatchScope#repo-graph-view.
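As a rough illustration of how the stage graph determines what has to be re-run, the sketch below encodes the edges of the flowchart shown earlier in plain Python. The names `EDGES`, `downstream`, and `run_order` are hypothetical helpers for this example, not part of DVC or of this project, and the `@{...}` suffixes are dropped to keep it short.

```python
from graphlib import TopologicalSorter

# Edges copied from the Mermaid flowchart: stage -> stages that consume its output.
EDGES = {
    "clone": ["annotate"],
    "annotate": [
        "purpose-counter",
        "purpose-per-file",
        "lines-stats",
        "timeline",
        "timeline.purpose-to-type",
    ],
}


def downstream(stage, edges=EDGES):
    """Return, sorted by name, every stage reachable from `stage`."""
    seen = []
    stack = list(edges.get(stage, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.append(current)
            stack.extend(edges.get(current, []))
    return sorted(seen)


def run_order(edges=EDGES):
    """Return one valid execution order, with dependencies before dependents."""
    # TopologicalSorter expects node -> predecessors, so invert the edges.
    predecessors = {}
    for source, targets in edges.items():
        predecessors.setdefault(source, set())
        for target in targets:
            predecessors.setdefault(target, set()).add(source)
    return list(TopologicalSorter(predecessors).static_order())
```

For example, `downstream("annotate")` lists the five statistics stages, which are the ones affected when annotations change, while `downstream("timeline")` is empty because nothing consumes its output.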

The JSON files in stats/ are analyzed by the Jupyter notebooks in the /notebooks/ directory; see /notebooks/README.md.
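Before working with one of the statistics files, it can help to check its top-level shape. The following is a minimal sketch that makes no assumption about the schema of the files; `summarize_stats_file` is a hypothetical helper, not something provided by this project or its notebooks.

```python
import json
from pathlib import Path


def summarize_stats_file(path):
    """Load a JSON file and describe its top-level structure without assuming a schema."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    if isinstance(data, dict):
        # Report how many keys there are and show the first few.
        return {"type": "dict", "size": len(data), "sample_keys": list(data)[:5]}
    if isinstance(data, list):
        return {"type": "list", "size": len(data), "sample_keys": []}
    # Scalars (numbers, strings, null) have no meaningful size.
    return {"type": type(data).__name__, "size": 1, "sample_keys": []}
```

It could be called as `summarize_stats_file("data/examples/stats/tensorflow.timeline.json")` from the top directory of the project.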

Projects and repositories

The list of example repositories taken into consideration is borrowed from the GitVision app demo site.

  • Large repositories:
    • TensorFlow: A comprehensive machine learning library by Google
      This repo provides a great example of a large, complex open-source project with a very active community.
    • ...

Other repositories were selected by the authors of this project:

  • Qtile: A full-featured, hackable tiling window manager written and configured in Python
    This repo is a medium-sized, but quite active project.

Repositories are cloned into ~/example_repositories/. On the author's workstation, this directory is a symbolic link to the /mnt/data/python-diff-annotator/example_repositories/ directory.

The cloning can be done by running the "clone" stage of the DVC pipeline.

NOTE: all commands are assumed to be run from the top directory of the project, not from its examples/stats/ subdirectory.

Generating annotation data (for 'tensorflow')

The annotation data for further processing is generated directly from the repo in the "flat" format. It was generated with the following commands:

diff-annotate from-repo \
  --output-dir=data/examples/annotations/tensorflow/ezhulenev/ \
  ~/example_repositories/tensorflow/ \
  --author=ezhulenev@google.com

and

diff-annotate from-repo \
  --output-dir=data/examples/annotations/tensorflow/yong.tang/ \
  ~/example_repositories/tensorflow/ \
  --author=yong.tang.github@outlook.com

both using the "annotate" stage of the DVC pipeline.

The "flat" format has the following structure: <output_dir>/<commit_id>.json.
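Because the "flat" layout puts one `<commit_id>.json` file directly under the output directory, the annotated commits can be enumerated from the file names alone. This is a minimal sketch under that assumption; `list_annotated_commits` is a hypothetical helper, and it does not look inside the JSON files.

```python
from pathlib import Path


def list_annotated_commits(output_dir):
    """Map commit id -> annotation file for a flat <output_dir>/<commit_id>.json layout."""
    return {
        path.stem: path  # the file name without ".json" is taken as the commit id
        for path in sorted(Path(output_dir).glob("*.json"))
    }
```

For the first command above, it would be called as `list_annotated_commits("data/examples/annotations/tensorflow/ezhulenev/")`.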

Generating stats data (for 'tensorflow')

Statistics computed from annotations were saved in JSON files, one file per type of statistics.

  • tensorflow.purpose-per-file.json (1.8 MB) was generated with the following command
    using the "purpose-per-file" stage of the DVC pipeline:

    diff-gather-stats --annotations-dir='' \
      purpose-per-file \
      data/examples/stats/tensorflow.purpose-per-file.json \
      data/examples/annotations/tensorflow/
    
  • tensorflow.lines-stats.json (9.8 MB) was generated with the following command
    using the "lines-stats" stage of the DVC pipeline:

    diff-gather-stats --annotations-dir='' \
      lines-stats \
      data/examples/stats/tensorflow.lines-stats.json \
      data/examples/annotations/tensorflow/
    
  • tensorflow.timeline.json (3.2 MB) was generated with the following command
    using the "timeline" stage of the DVC pipeline:

    diff-gather-stats --annotations-dir='' \
      timeline \
      data/examples/stats/tensorflow.timeline.json \
      data/examples/annotations/tensorflow/
    
  • tensorflow.timeline.purpose-to-type.json (3.2 MB) was generated with the following command
    using the "timeline.purpose-to-type" stage of the DVC pipeline:

    diff-gather-stats --annotations-dir='' \
      timeline \
      --purpose-to-annotation=data \
      --purpose-to-annotation=documentation \
      --purpose-to-annotation=markup \
      --purpose-to-annotation=other \
      --purpose-to-annotation=project \
      --purpose-to-annotation=test \
      data/examples/stats/tensorflow.timeline.purpose-to-type.json \
      data/examples/annotations/tensorflow/
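The four commands above share one pattern: a subcommand, optional flags, an output file named `<project>.<stage>.json`, and the annotations directory. The sketch below assembles the same argument lists in Python, using only the options that appear in the commands above. `gather_stats_argv` and `PURPOSES` are hypothetical names introduced for illustration and are not part of the diff-gather-stats tool.

```python
import shlex

# Purpose names taken from the --purpose-to-annotation options shown above.
PURPOSES = ("data", "documentation", "markup", "other", "project", "test")


def gather_stats_argv(project, stage, subcommand=None, options=()):
    """Build the argument list for one diff-gather-stats call, mirroring the commands above."""
    if subcommand is None:
        subcommand = stage  # most stages are named after their subcommand
    return [
        "diff-gather-stats",
        "--annotations-dir=",  # equivalent to --annotations-dir='' in the shell
        subcommand,
        *options,
        f"data/examples/stats/{project}.{stage}.json",
        f"data/examples/annotations/{project}/",
    ]


argv = gather_stats_argv(
    "tensorflow",
    "timeline.purpose-to-type",
    subcommand="timeline",
    options=[f"--purpose-to-annotation={purpose}" for purpose in PURPOSES],
)
print(shlex.join(argv))
```

The resulting list could be passed to `subprocess.run(argv, check=True)` from the top directory of the project, assuming the diff-gather-stats script is installed and on the PATH.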