This document turns the private-dataset strategy into a contract-focused execution plan.
Use it when you want to build the first internal golden set for memory-path-engine from real contract text.
The first private contract dataset should answer one question:
On real contract documents, does graph-aware retrieval surface the right evidence, path shape, and exception logic better than flat retrieval?
This is not a document-archive project. It is an evaluation asset.
The first version should therefore optimize for:
- stable source documents
- high-quality evidence labels
- enough diversity to expose graph and exception behavior
- low enough scope that a small team can finish it in 2-4 weeks
Recommended first pilot:
- 30 contracts
- 120-180 golden cases
- 4-6 cases per contract
- one primary language per pilot if possible
Recommended first report:
- `lexical_baseline`
- `embedding_baseline`
- `structure_only`
- `weighted_graph`
Add `activation_spreading_v1` only after the golden set is stable.
Before annotation starts, build a document inventory.
Suggested columns:
| Column | Required | Purpose |
|---|---|---|
| `doc_id` | yes | Stable internal document identifier |
| `file_name` | yes | Source filename |
| `file_path` | yes | Original storage path |
| `contract_type` | yes | MSA, NDA, procurement, software license, DPA, services, etc. |
| `template_family` | yes | Standard template family or `non_standard` |
| `template_version` | no | Version if known |
| `language` | yes | `zh`, `en`, `bilingual` |
| `business_line` | no | Business grouping for later stratification |
| `counterparty_type` | no | Supplier, customer, partner, employee, etc. |
| `signed_date` | no | Useful for version drift and legacy wording |
| `page_count` | no | Complexity proxy |
| `has_appendix` | yes | `yes` / `no` |
| `has_sow_or_order_form` | yes | `yes` / `no` |
| `risk_level` | yes | `low`, `medium`, `high` |
| `exception_density` | yes | `low`, `medium`, `high` by quick reviewer estimate |
| `termination_complexity` | yes | `low`, `medium`, `high` |
| `selected_for_pilot` | yes | `yes` / `no` |
| `selection_reason` | no | Why this contract entered the pilot |
Minimum rule:
- do not annotate anything before this inventory exists
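The inventory rule is easier to enforce if the sheet is checked mechanically before annotation starts. The following is a minimal sketch, assuming the inventory is exported as rows of string cells keyed by the column names above; the function name `validate_inventory` and the exact error wording are illustrative, not part of any existing tool.

```python
# Required columns, taken from the inventory table above.
REQUIRED_COLUMNS = [
    "doc_id", "file_name", "file_path", "contract_type", "template_family",
    "language", "has_appendix", "has_sow_or_order_form", "risk_level",
    "exception_density", "termination_complexity", "selected_for_pilot",
]
LEVEL_COLUMNS = ("risk_level", "exception_density", "termination_complexity")
YES_NO_COLUMNS = ("has_appendix", "has_sow_or_order_form", "selected_for_pilot")
LEVELS = {"low", "medium", "high"}
YES_NO = {"yes", "no"}


def validate_inventory(rows):
    """Return a list of problems; an empty list means the inventory passes."""
    problems = []
    seen_ids = set()
    for i, row in enumerate(rows, start=1):
        # Every required column must be present and non-blank.
        for col in REQUIRED_COLUMNS:
            if not (row.get(col) or "").strip():
                problems.append(f"row {i}: missing required column {col}")
        # doc_id must be unique so cases can join back to the inventory.
        doc_id = (row.get("doc_id") or "").strip()
        if doc_id:
            if doc_id in seen_ids:
                problems.append(f"row {i}: duplicate doc_id {doc_id}")
            seen_ids.add(doc_id)
        # Enumerated columns must use the documented values.
        for col in LEVEL_COLUMNS:
            value = (row.get(col) or "").strip()
            if value and value not in LEVELS:
                problems.append(f"row {i}: {col} must be low, medium, or high")
        for col in YES_NO_COLUMNS:
            value = (row.get(col) or "").strip()
            if value and value not in YES_NO:
                problems.append(f"row {i}: {col} must be yes or no")
    return problems
```

Rows read with `csv.DictReader` from a spreadsheet export can be passed in directly.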
Do not choose the first 30 contracts randomly.
Use stratified sampling with the following target mix:
- 8-10 MSA / services agreements
- 5-6 procurement or supply agreements
- 4-5 software license / SaaS agreements
- 3-4 DPA / privacy / security agreements
- 3-4 NDA / confidentiality or framework agreements
- 3-4 non-standard or heavily negotiated contracts
If your corpus is mostly Chinese:
- 20-24 Chinese contracts
- 4-6 English contracts
- 2-4 bilingual contracts
If bilingual quality is noisy, move bilingual contracts into phase 2 instead of phase 1.
- at least 10 contracts with clear termination sections
- at least 10 contracts with visible exception wording
- at least 8 contracts with appendices, schedules, or order-form dependencies
- at least 6 contracts with negotiated deviations from a standard template
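The coverage minimums above can be checked against the inventory before the selection is frozen. This is a rough sketch only: the mapping from each minimum to inventory columns is an assumption (for example, treating `medium` or `high` `termination_complexity` as "clear termination sections", and `template_family == "non_standard"` as "negotiated deviations"). Adjust the predicates to whatever your reviewers actually record.

```python
# (minimum count, predicate over an inventory row).
# The predicates are assumptions about how inventory columns map to the
# coverage minimums; they are not defined by the plan itself.
COVERAGE_RULES = {
    "termination": (
        10, lambda r: r.get("termination_complexity") in {"medium", "high"}),
    "exception": (
        10, lambda r: r.get("exception_density") in {"medium", "high"}),
    "appendix_or_order_form": (
        8, lambda r: r.get("has_appendix") == "yes"
        or r.get("has_sow_or_order_form") == "yes"),
    "negotiated_deviation": (
        6, lambda r: r.get("template_family") == "non_standard"),
}


def coverage_gaps(rows):
    """Return {rule_name: (actual, minimum)} for every unmet minimum."""
    selected = [r for r in rows if r.get("selected_for_pilot") == "yes"]
    gaps = {}
    for name, (minimum, predicate) in COVERAGE_RULES.items():
        count = sum(1 for r in selected if predicate(r))
        if count < minimum:
            gaps[name] = (count, minimum)
    return gaps
```

An empty result means the current selection satisfies all four minimums under these assumed mappings.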
Exclude:
- scanned image PDFs without reliable OCR
- contracts with severe redaction that breaks clause logic
- duplicates of the same template unless version comparison is the explicit goal
- contracts whose clause numbering is too broken to assign stable node ids quickly
Do not force annotators to write benchmark JSON directly.
Use a tabular annotation sheet first, then export to JSON.
Suggested columns:
| Column | Required | Purpose |
|---|---|---|
| `case_id` | yes | Stable case id |
| `doc_id` | yes | Join back to the inventory |
| `query` | yes | Business-style question |
| `answer_short` | yes | Short canonical answer for human review |
| `case_family` | yes | One of the taxonomy families below |
| `difficulty` | yes | `easy`, `medium`, `hard` |
| `tags` | yes | Comma-separated tags such as `multi_hop`, `exception`, `termination` |
| `evidence_node_ids` | yes | Gold evidence nodes |
| `minimum_evidence_matches` | yes | Usually `1` or full evidence count |
| `path_scope` | no | `best_path` or `any_path` |
| `path_steps` | no | Ordered node ids plus optional edge types |
| `required_edge_types` | no | Needed edge types |
| `required_semantic_roles` | no | Roles such as `exception`, `remedy`, `condition` |
| `required_contradiction_pairs` | no | Gold contradiction or override pairs |
| `annotator_a` | yes | Initial annotator |
| `annotator_b` | yes | Reviewer annotator |
| `review_status` | yes | `draft`, `reviewed`, `approved`, `rejected` |
| `review_notes` | no | Why a label changed |
The first pilot should cover the following question families.
Purpose:
- baseline sanity checks
- confirm the parser and node ids are stable
Examples:
- "What is the invoice payment term?"
- "How many days does the customer have to cure a breach?"
Target share:
20-25%
Purpose:
- validate graph-aware retrieval
- connect precondition, trigger, and consequence clauses
Examples:
- "If the customer materially breaches and fails to cure, what may the supplier do next?"
- "When termination happens, what post-termination obligations remain?"
Target share:
25-30%
Purpose:
- validate `exception_to`, anomaly-like wording, and semantic roles
Examples:
- "Does the standard 30-day payment rule still apply if the deliverables are defective?"
- "When does a limitation-of-liability cap not apply?"
Target share:
20-25%
Purpose:
- detect rule tension between clauses, schedules, or amendments
Examples:
- "The main agreement says thirty days, but the order form says fifteen days. Which governs?"
- "One section allows subcontracting; another forbids disclosure to third parties. What constraint actually applies?"
Target share:
10-15%
Purpose:
- useful for procedural contract logic such as notice -> cure -> termination -> return/delete/certify
Examples:
- "After notice and failed cure, what is the next contractual consequence?"
- "What happens after termination notice is delivered and the cure period expires?"
Target share:
10-15%
Purpose:
- prevent misleading wins from clause wording overlap
Examples:
- two payment clauses in different contexts
- one termination clause for convenience and another for breach
Target share:
10% embedded across the above families
Your gold labels must target stable unit ids, not free text.
Recommended rule:
`node_id = "{document_stem}:{unit_number}"`

Where `unit_number` comes from the parser's final clause or sentence segmentation layer.
Required discipline:
- once a pilot dataset is frozen, do not silently renumber node ids
- if parsing changes, create a new dataset version
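The node id rule can be kept honest with a pair of small helpers. The sketch below assumes `unit_number` is a non-negative integer, as in ids like `msa_2024_v3:21`, and that the document stem contains no colon or whitespace; the helper names are hypothetical.

```python
import re

# Assumed shape: "<document_stem>:<integer unit_number>".
NODE_ID_PATTERN = re.compile(r"^(?P<stem>[^:\s]+):(?P<unit>\d+)$")


def make_node_id(document_stem, unit_number):
    """Build a node id, rejecting stems that would make the id ambiguous."""
    if not document_stem or ":" in document_stem or " " in document_stem:
        raise ValueError(f"invalid document stem: {document_stem!r}")
    if not isinstance(unit_number, int) or unit_number < 0:
        raise ValueError(f"invalid unit number: {unit_number!r}")
    return f"{document_stem}:{unit_number}"


def parse_node_id(node_id):
    """Split a node id back into (document_stem, unit_number)."""
    match = NODE_ID_PATTERN.match(node_id)
    if match is None:
        raise ValueError(f"malformed node id: {node_id!r}")
    return match["stem"], int(match["unit"])


def unresolved_node_ids(node_ids, snapshot_ids):
    """Return the ids that do not exist in the frozen snapshot, in order."""
    return [n for n in node_ids if n not in snapshot_ids]
```

Running `unresolved_node_ids` against the frozen snapshot after any parser change is a cheap way to detect silent renumbering.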
- Convert contracts into stable text.
- Split into structural units.
- Assign candidate `node_id`s.
- Run light rule tagging for likely:
  - `exception`
  - `termination`
  - `notice`
  - `cure`
  - `liability`
  - `payment`
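Light rule tagging can be as simple as keyword matching per structural unit. The sketch below is illustrative only: the English keyword patterns are assumptions, not a vetted lexicon, and a Chinese-heavy corpus would need its own patterns. The tags are candidate hints for annotators, not gold labels.

```python
import re

# Illustrative English patterns only; tune these per corpus and language.
TAG_PATTERNS = {
    "exception": re.compile(
        r"\b(except|unless|notwithstanding|provided that)\b", re.IGNORECASE),
    "termination": re.compile(r"\bterminat(e|es|ed|ion)\b", re.IGNORECASE),
    "notice": re.compile(r"\bnotice\b", re.IGNORECASE),
    "cure": re.compile(r"\b(cure|cures|cured)\b", re.IGNORECASE),
    "liability": re.compile(r"\b(liability|liable)\b", re.IGNORECASE),
    "payment": re.compile(r"\b(pay|payment|invoice|fees?)\b", re.IGNORECASE),
}


def tag_unit(text):
    """Return the sorted list of candidate tags whose pattern matches."""
    return sorted(
        tag for tag, pattern in TAG_PATTERNS.items() if pattern.search(text)
    )


clause = ("Either party may terminate this Agreement upon written notice "
          "if the other party fails to cure a material breach within 30 days.")
print(tag_unit(clause))  # ['cure', 'notice', 'termination']
```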
Annotator A fills:
- `query`
- `answer_short`
- `evidence_node_ids`
- optional `path`
- optional semantic and contradiction fields
Annotator B independently checks:
- whether evidence is minimal and sufficient
- whether the path is over-constrained
- whether the case should really be tagged `exception` or `contradiction`
A reviewer resolves disagreements and writes a short note when changing:
- evidence set
- path shape
- contradiction pair
- case family
Only approved rows are exported to benchmark JSON.
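The export step can be a short script. The sketch below makes several assumptions about cell encoding that the sheet definition does not fix: `tags` and `evidence_node_ids` are comma-separated, `path_steps` uses `;` between steps and `node_id|edge_type` within a step, and `match_mode` is supplied by the exporter because the sheet has no column for it. All function names are hypothetical.

```python
import json


def split_list(cell, sep=","):
    """Split a delimited cell into trimmed, non-empty items."""
    return [part.strip() for part in (cell or "").split(sep) if part.strip()]


def parse_path_steps(cell):
    """Parse 'node_a; node_b|edge_type' into ordered step dicts."""
    steps = []
    for raw in split_list(cell, sep=";"):
        node_id, _, edge = raw.partition("|")
        steps.append({
            "node_id": node_id.strip(),
            "via_edge_type": edge.strip() or None,
        })
    return steps


def row_to_case(row, match_mode="prefix"):
    """Convert one annotation-sheet row into a benchmark case dict."""
    expectation = {
        "evidence_node_ids": split_list(row.get("evidence_node_ids")),
        "minimum_evidence_matches": int(row["minimum_evidence_matches"]),
    }
    path_scope = (row.get("path_scope") or "").strip()
    if path_scope:
        expectation["path_scope"] = path_scope
    steps = parse_path_steps(row.get("path_steps"))
    if steps:
        expectation["path"] = {"match_mode": match_mode, "steps": steps}
    return {
        "case_id": row["case_id"],
        "query": row["query"],
        "tags": split_list(row.get("tags")),
        "expectation": expectation,
    }


def export_approved(rows):
    """Keep only rows whose review_status is exactly 'approved'."""
    return [
        row_to_case(row) for row in rows
        if (row.get("review_status") or "").strip() == "approved"
    ]
```

The result can be written out with `json.dump(export_approved(rows), fh, ensure_ascii=False, indent=2)`, which keeps Chinese query text readable in the output file.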
The pilot should not be accepted unless:
- at least 120 approved cases exist
- evidence-label agreement between A and B is >= 0.85
- every case has valid `node_id`s that resolve in the frozen document snapshot
- at least 30% of cases are non-trivial (`multi_hop`, `exception`, or `contradiction`)
- at least 10 hard-negative pairs exist across the dataset
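These acceptance gates can be evaluated in one pass over the exported cases. The following is a minimal sketch, assuming each case is a dict with `tags` and `expectation.evidence_node_ids`; the agreement score and the hard-negative pair count are passed in as inputs because this plan does not specify how either is computed. The function name is hypothetical.

```python
NON_TRIVIAL_TAGS = {"multi_hop", "exception", "contradiction"}


def acceptance_failures(cases, snapshot_node_ids, evidence_agreement,
                        hard_negative_pairs):
    """Return a list of failed gates; an empty list means the pilot passes."""
    failures = []
    if len(cases) < 120:
        failures.append(f"only {len(cases)} approved cases (need 120)")
    if evidence_agreement < 0.85:
        failures.append(
            f"evidence agreement {evidence_agreement:.2f} is below 0.85")
    # Every gold evidence node must exist in the frozen snapshot.
    unresolved = sorted({
        node_id
        for case in cases
        for node_id in case["expectation"]["evidence_node_ids"]
        if node_id not in snapshot_node_ids
    })
    if unresolved:
        failures.append(f"{len(unresolved)} node ids do not resolve")
    # A case is non-trivial if it carries at least one of the listed tags.
    non_trivial = sum(1 for c in cases if NON_TRIVIAL_TAGS & set(c["tags"]))
    if cases and non_trivial / len(cases) < 0.30:
        failures.append(
            f"non-trivial share {non_trivial / len(cases):.0%} is below 30%")
    if hard_negative_pairs < 10:
        failures.append(
            f"only {hard_negative_pairs} hard-negative pairs (need 10)")
    return failures
```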
Use simple tools first:
- spreadsheet or Airtable for inventory
- spreadsheet or internal labeling sheet for golden annotations
- export script to benchmark JSON
Do not start with a complex custom annotation app unless your team already has one.
```json
{
  "case_id": "pilot-msa-termination-001",
  "query": "If the customer materially breaches the agreement and does not cure after notice, what may happen next and what obligation remains after termination?",
  "tags": ["contract", "termination", "multi_hop"],
  "expectation": {
    "evidence_node_ids": ["msa_2024_v3:21", "msa_2024_v3:22"],
    "minimum_evidence_matches": 1,
    "path_scope": "best_path",
    "path": {
      "match_mode": "prefix",
      "steps": [
        { "node_id": "msa_2024_v3:22", "via_edge_type": null },
        { "node_id": "msa_2024_v3:21", "via_edge_type": "depends_on" }
      ]
    }
  }
}
```

- build inventory
- choose first 30 contracts
- freeze source snapshots
- finish parsing and stable node ids
- annotate first 40-60 cases
- review and adjudicate first half
- annotate second half
- export approved cases
- run the first private benchmark comparison
- write the first error analysis memo
After the first pilot is stable, expand in this order:
- more contracts from the same families
- bilingual contracts
- amendment chains and appendices
- dynamic-memory sequential cases
- comparison across template versions