Proposal: Stop overloading of notes column #1864

@lukas2510

Hi IDoFT maintainers,

Context

  • Motivation: I started a programmatic analysis of the dataset, focusing on tests that have a fix, in order to infer the origin of flakiness from the fix commit. While doing so, I ran into inconsistent evidence formats.
  • I analyzed only rows marked as “fixed” (DeveloperFixed, Accepted, InspiredAFix, FixedOrder) across pr-data.csv, gr-data.csv, py-data.csv.
  • Goal: verify the presence/quality of fix evidence using the existing “PR Link” and “Notes” fields for automation (e.g., to identify the origin of flakiness from the fix commit).
  • Reproducibility: the scripts and generated CSVs are available on my fork/branch: https://github.com/lukas2510/idoft/tree/origin-of-flakiness
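For readers who want to reproduce the row selection without the fork, here is a minimal sketch of the filter described above. It is not the script from my branch; the column name `Status`, the helper `fixed_rows`, and the sample rows are assumptions made for illustration.

```python
import csv
import io

# Statuses treated as "fixed" in the analysis above.
FIXED_STATUSES = {"DeveloperFixed", "Accepted", "InspiredAFix", "FixedOrder"}


def fixed_rows(csv_file, status_column="Status"):
    """Yield rows whose status is one of the fixed statuses.

    `status_column` is an assumed header name; adjust it to the real file.
    """
    for row in csv.DictReader(csv_file):
        if (row.get(status_column) or "").strip() in FIXED_STATUSES:
            yield row


# Made-up sample rows, not real dataset entries.
sample = io.StringIO(
    "Test,Status,PR Link,Notes\n"
    "a.b.TestX,Accepted,https://github.com/org/repo/pull/1,\n"
    "a.b.TestY,Opened,https://github.com/org/repo/pull/2,\n"
    "a.b.TestZ,DeveloperFixed,,see commit\n"
)
rows = list(fixed_rows(sample))
```

The same function can be pointed at pr-data.csv, gr-data.csv, and py-data.csv by passing an opened file instead of the in-memory sample.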

Artifacts (full, reproducible samples)
Here is an overview of the datasets: edge_cases_report.csv

  • Fork/branch with scripts and outputs: https://github.com/lukas2510/idoft/tree/origin-of-flakiness
  • data_transformation/output/edge_cases/edge_case_pr_link_and_notes_samples.csv
  • data_transformation/output/edge_cases/edge_case_old_dataset_repo_link_samples.csv
  • data_transformation/output/edge_cases/edge_case_idoft_repo_link_samples.csv
  • data_transformation/output/edge_cases/edge_case_no_evidence_samples.csv
  • data_transformation/output/edge_cases/edge_case_bare_commit_sha_samples.csv
  • data_transformation/output/edge_cases/edge_case_multiple_links_in_notes_samples.csv
  • data_transformation/output/edge_cases/edge_case_pr_commit_url_samples.csv
  • data_transformation/output/edge_cases/edge_case_fork_link_samples.csv
  • data_transformation/output/edge_cases/edge_case_branch_tree_link_samples.csv

Why this matters

  • The Notes column is currently overloaded (issues, PRs, commits, branches, external references). This makes automation brittle and increases manual review.
  • A small, consistent structure would make it much easier to automate origin-of-flakiness analysis and keep the dataset uniform.
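To illustrate the overloading, the sketch below shows the kind of heuristic a consumer currently needs just to tell apart the link types that can appear in a single Notes cell. The regular expressions, the category names, and the function `classify_notes` are my own assumptions for this example, not an agreed taxonomy.

```python
import re

# Rough URL matcher; stops at whitespace and common trailing delimiters.
URL_RE = re.compile(r"https?://[^\s,;)\]]+")

GH = r"^https?://github\.com/[^/]+/[^/]+"
SHA = r"[0-9a-fA-F]{7,40}"

# Ordered from most to least specific; the first match wins.
KIND_PATTERNS = [
    ("pr_commit", re.compile(GH + r"/pull/\d+/commits/" + SHA)),
    ("commit", re.compile(GH + r"/commit/" + SHA)),
    ("pull_request", re.compile(GH + r"/pull/\d+")),
    ("issue", re.compile(GH + r"/issues/\d+")),
    ("branch", re.compile(GH + r"/tree/.+")),
]


def classify_notes(notes):
    """Return a list of (kind, url) pairs for every URL found in a Notes cell."""
    results = []
    for url in URL_RE.findall(notes or ""):
        kind = "external"  # fallback for anything that is not a known GitHub pattern
        for name, pattern in KIND_PATTERNS:
            if pattern.match(url):
                kind = name
                break
        results.append((kind, url))
    return results
```

Every branch of this classifier is a guess about intent; a dedicated column would make most of it unnecessary.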

Minimal change proposal (grouped for clarity)

  1. Fix commit column & link format unification
    Example of a confusing row here
  • Add a single optional column: Commit Link (or Fix_Commit_Link). Keep existing PR Link.
  • Evidence hierarchy: Commit Link (best) > PR Link (second-best). If both exist, Commit Link is authoritative.
  • Accepted commit evidence formats (one of):
    • Full URL: https://github.com/<org>/<repo>/commit/<sha>
    • PR commit view: https://github.com/<org>/<repo>/pull/<num>/commits/<sha> (normalize/store as /commit/<sha>)
    • Bare 40-char SHA (should be resolved/expanded to full URL where possible)
  • Move any commit links currently embedded in Notes into the new Commit Link column.
  2. Untangle idoft-wrapped / self-referential links
    Example of a confusing row here
  • Pattern: Notes contains an idoft issue linking onward to the actual PR or commit.
  • Proposed rule: Store the final PR/commit in Commit Link / PR Link; keep the idoft issue only as contextual provenance in Notes.
  • Remove links to TestingResearchIllinois/flaky-test-dataset because that repository no longer exists.
  • Result: Notes becomes lighter and holds only contextual information.
  3. Cross-repo redirection
    Example of a confusing row here
  • Pattern: Initial PR (original repo) in PR Link; final accepted fix in a different repo only in Notes.
  • Proposed rule: PR Link (or Commit Link) should point to the authoritative PR/commit where the fix landed (target repo). Initial exploratory/redirected PR kept in Notes.
  • Benefit: Downstream automation can fetch the diff from the correct repository without extra heuristics.
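The normalization and evidence hierarchy from item 1 could be sketched roughly as follows. This is only an illustration under assumptions: the function names are hypothetical, abbreviated SHAs are accepted inside URLs, and a bare SHA is resolved against the row's project URL, which would be wrong for the cross-repo cases in item 3.

```python
import re

SHA = r"[0-9a-fA-F]{7,40}"
GH = r"^https://github\.com/([^/]+)/([^/]+)"

COMMIT_URL = re.compile(GH + r"/commit/(" + SHA + r")/?$")
PR_COMMIT_URL = re.compile(GH + r"/pull/\d+/commits/(" + SHA + r")/?$")
BARE_SHA = re.compile(r"^[0-9a-fA-F]{40}$")


def normalize_commit_link(evidence, project_url=None):
    """Return a canonical .../commit/<sha> URL, or None if not recognized.

    `project_url` is an assumed fallback used only to expand a bare SHA.
    """
    evidence = (evidence or "").strip()
    for pattern in (COMMIT_URL, PR_COMMIT_URL):
        match = pattern.match(evidence)
        if match:
            org, repo, sha = match.groups()
            return f"https://github.com/{org}/{repo}/commit/{sha.lower()}"
    if BARE_SHA.match(evidence) and project_url:
        return f"{project_url.rstrip('/')}/commit/{evidence.lower()}"
    return None


def best_evidence(commit_link, pr_link):
    """Apply the proposed hierarchy: Commit Link first, then PR Link."""
    return commit_link or pr_link or None
```

A migration script could run `normalize_commit_link` over the links currently found in Notes and write the result into the new column, leaving unrecognized values for manual review.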

Questions for maintainers

  • In fixed-status rows with neither a PR link nor a commit link (currently 47 cases), is data missing, or are these rows valid (e.g., private/internal fixes)? The cases can be found here.
  • What do you think of my approach to removing the overloading of the Notes column? I am open to other suggestions as well.

Thanks for your work on IDoFT! I’m happy to prepare a PR if this direction sounds good.
