Add details on Human Evaluations#11

Open
PuneetKohli wants to merge 4 commits into OSU-NLP-Group:main from careerflow:add-human-evals

@PuneetKohli

Add Human Evaluation Guidelines Document

Summary

This PR adds comprehensive human evaluation guidelines to help submitters conduct consistent, high-quality evaluations for the Online Mind2Web benchmark. The document establishes standardized evaluation protocols and introduces Careerflow.ai as the official trusted evaluation partner.

Motivation

Human evaluation is required for all submissions to Online Mind2Web, but ensuring consistency and fairness across evaluations conducted by different submitters has been a challenge. Without standardized guidelines, evaluator subjectivity and varying interpretations can lead to biased or incomparable results, undermining the integrity of the benchmark.

Changes

  • Added HUMAN_EVALUATIONS.md: A comprehensive guide covering:

    • Standardized evaluation process (multi-annotator review with QA)
    • Detailed evaluation criteria and label definitions (0: Failure, 1: Success, 2: Not Executable)
    • Specific guidelines for filter verification, sorting requirements, and edge cases
    • Two submission options: conduct your own evaluation (with proof requirements) or use the official partner
    • Information about Careerflow.ai as the official evaluation partner
  • Updated README.md: Added a link to the guidelines document in the Evaluation Results section
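As an illustration of how the three labels and the multi-annotator review could fit together, here is a minimal sketch. It assumes a strict-majority vote with ties escalated to QA; the PR description does not say how consensus is actually computed, and the names `Label` and `consensus` are hypothetical, not taken from HUMAN_EVALUATIONS.md.

```python
from collections import Counter
from enum import IntEnum
from typing import Optional, Sequence


class Label(IntEnum):
    """Label values as listed in the PR description."""
    FAILURE = 0
    SUCCESS = 1
    NOT_EXECUTABLE = 2


def consensus(labels: Sequence[Label]) -> Optional[Label]:
    """Return the strict-majority label, or None when there is none.

    Assumption: a None result means the task is escalated to QA review.
    The actual consensus rule is not specified in this PR.
    """
    if not labels:
        return None
    label, count = Counter(labels).most_common(1)[0]
    # Require more than half of the annotators to agree.
    return label if count * 2 > len(labels) else None
```

Under this assumed rule, two annotators marking a task as Success and one marking it as Failure would yield Success, while an even split would yield no consensus and go to QA.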

Benefits

  1. Consistency: Establishes clear, standardized evaluation criteria that all submitters can follow
  2. Fairness: Ensures all evaluations are conducted using the same rigorous process
  3. Time Savings: Provides an official evaluation partner option that eliminates the need for submitters to recruit and manage evaluators
  4. Transparency: Documents evaluation requirements and processes clearly
  5. Quality Assurance: Multi-annotator consensus model reduces individual evaluator bias

Notes

  • The document is written from the benchmark maintainers' perspective (using "we") since it's part of the official repository
  • Submitters can choose to conduct their own evaluations (with proof requirements) or use the official partner

Related

This addresses the need for standardized human evaluation guidelines mentioned in discussions about ensuring fair comparisons across submissions.

@PuneetKohli
Author

@XueTianci Let's merge this?