Commit a65e9b1

committed
some fixes
1 parent 3147acf commit a65e9b1

7 files changed

Lines changed: 336 additions & 117 deletions

generate_synthetic_table/prompts/academic.yaml

Lines changed: 47 additions & 16 deletions
@@ -94,37 +94,68 @@ generate_qa_from_image: |
 
 generate_synthetic_table: |
   You are a Synthetic Data Generator specializing in Academic Data.
-  Your task is to generate a new HTML table that mirrors the structure of the provided original table but contains entirely new, realistic synthetic academic data.
+
+  **CRITICAL INSTRUCTION: DO NOT COPY ORIGINAL DATA**
+  Your task is to generate a new HTML table with the SAME STRUCTURE as the original but COMPLETELY DIFFERENT academic data values.
+  The goal is to create realistic synthetic academic data that looks like it could come from the same domain, but with entirely different students, courses, and metrics.
 
   **Inputs:**
-  1. **Original Table Structure:**
+  1. **Original Table Structure (for structure reference ONLY - DO NOT copy the data values):**
   {html}
 
-  2. **Table Summary:**
+  2. **Table Summary (describes the data patterns to follow):**
   {summary}
 
   **Requirements:**
-  1. **Structure:** Keep the exact same HTML structure.
-  2. **Data:** Replace ALL cell values with new, synthetic academic data.
-     - Use realistic Korean student names, university names, course titles, and grades.
-     - Contexts: Transcripts, Research Papers, Enrollment Stats, Faculty Lists.
-     - Do NOT use real private data.
-  3. **Consistency:** Ensure mathematical consistency (e.g., sum of credits, correct GPA calculations if visible).
-  4. **Output:** Return ONLY the raw HTML string starting with `<table>` and ending with `</table>`.
+  1. **Structure:** Keep the exact same HTML structure (rows, columns, headers, merges).
+  2. **Headers:** Keep header text the same (column names, category labels).
+  3. **Data Transformation - MANDATORY:**
+     - **ALL data cell values MUST be replaced with completely new synthetic values.**
+     - **DO NOT copy any original data values** - generate fresh, realistic alternatives.
+     - For student names: Generate new Korean student names (e.g., "김철수" → "이영희", "학생A" → "학생B")
+     - For university names: Generate new Korean university names
+     - For course titles: Generate new course names
+     - For grades/scores: Generate new realistic values
+     - For model names (if research table): Generate new model/method names
+     - For dates: Generate new plausible dates
+  4. **Domain Consistency:**
+     - Ensure academic logic (credits sum correctly, GPA calculations valid)
+     - Use realistic Korean academic terminology
+     - Contexts: Transcripts, Research Papers, Enrollment Stats, Faculty Lists
+  5. **Output:** Return ONLY the raw HTML string starting with `<table>` and ending with `</table>`.
+
+  **Example Transformation:**
+  - Original: "서울대학교" → Synthetic: "고려대학교"
+  - Original: "학점 4.2" → Synthetic: "학점 3.8"
+  - Original: "BERT-Large" → Synthetic: "RoBERTa-Base"
+
+  Remember: The synthetic table should look like a completely different academic dataset from the same domain.
 
 generate_synthetic_table_from_image: |
   You are a Synthetic Data Generator specializing in Academic Data.
-  Your task is to generate a new HTML table that mirrors the structure of the provided image but contains entirely new, realistic synthetic academic data.
+
+  **CRITICAL INSTRUCTION: DO NOT TRANSCRIBE - GENERATE NEW DATA**
+  Your task is NOT to OCR/transcribe the image. Instead, you must:
+  1. Understand the table's STRUCTURE from the image
+  2. Understand it's an ACADEMIC table
+  3. Generate COMPLETELY NEW synthetic academic data that fits the domain but uses different values
 
   **Inputs:**
-  1. **Image:** An image of an academic table.
+  1. **Image:** An image of an academic table. Use this to understand structure and domain ONLY.
 
   **Requirements:**
   1. **Structure Preservation:** Accurately reconstruct the table structure.
-  2. **Data Generation:** Replace ALL cell values with new, synthetic academic data.
-     - Use realistic Korean student names, course titles, grades, research topics.
-  3. **Styling:** Use **Tailwind CSS** classes (same as default).
+  2. **Headers:** Keep header text (column names, category labels) the same as in the image.
+  3. **Data Generation - CRITICAL:**
+     - **DO NOT copy the data values from the image** - this is NOT an OCR task
+     - Generate COMPLETELY NEW synthetic academic values for all data cells
+     - For student/model names: Generate new names (different from what you see)
+     - For grades/scores: Generate new realistic values
+     - For course/research topics: Generate new titles
+  4. **Styling:** Use **Tailwind CSS** classes (same as default).
      - `class="border-collapse border border-slate-400 w-full text-sm text-left rtl:text-right text-gray-500"` on `<table>`.
      - `class="border border-slate-300 p-2 bg-gray-50 font-semibold"` on `<th>`.
      - `class="border border-slate-300 p-2"` on `<td>`.
-  4. **Output Format:** Return ONLY the raw HTML string starting with `<table>` and ending with `</table>`.
+  5. **Output Format:** Return ONLY the raw HTML string starting with `<table>` and ending with `</table>`.
+
+  Remember: The output should be a new synthetic academic dataset, not a transcription of the original.
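
The `generate_synthetic_table` prompt carries two placeholders, `{html}` and `{summary}`, which must be filled before the prompt is sent to a model. The diff does not show how this repository does that, so the following is only a minimal sketch, assuming the prompt is handled as a plain string. `PROMPT_TEMPLATE` is an abbreviated stand-in for the YAML value, and `render_prompt` is a hypothetical helper, not a function from this commit.

```python
import re

# Abbreviated, hypothetical stand-in for the YAML prompt value.
# Only the {html} and {summary} placeholders are taken from the diff.
PROMPT_TEMPLATE = """\
**Inputs:**
1. **Original Table Structure (for structure reference ONLY - DO NOT copy the data values):**
{html}

2. **Table Summary (describes the data patterns to follow):**
{summary}
"""


def render_prompt(template: str, **values: str) -> str:
    """Fill {name} placeholders in a single pass over the template.

    Unknown placeholders are left untouched, and braces inside the
    substituted values are never re-interpreted.
    """

    def substitute(match: re.Match) -> str:
        key = match.group(1)
        return values.get(key, match.group(0))

    return re.sub(r"\{(\w+)\}", substitute, template)


prompt = render_prompt(
    PROMPT_TEMPLATE,
    html="<table><tr><th>과목</th><th>학점</th></tr><tr><td>자료구조</td><td>3</td></tr></table>",
    summary="Course list with credit counts.",
)
print(prompt)
```

A single-pass substitution is used here instead of `str.format` so that literal braces in the table HTML (inline CSS, JSON fragments) cannot raise or be substituted a second time.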

generate_synthetic_table/prompts/business.yaml

Lines changed: 46 additions & 16 deletions
@@ -94,37 +94,67 @@ generate_qa_from_image: |
 
 generate_synthetic_table: |
   You are a Synthetic Data Generator specializing in Business Data.
-  Your task is to generate a new HTML table that mirrors the structure of the provided original table but contains entirely new, realistic synthetic business data.
+
+  **CRITICAL INSTRUCTION: DO NOT COPY ORIGINAL DATA**
+  Your task is to generate a new HTML table with the SAME STRUCTURE as the original but COMPLETELY DIFFERENT business data values.
+  The goal is to create realistic synthetic business data that looks like it could come from the same domain, but with entirely different companies, employees, products, and metrics.
 
   **Inputs:**
-  1. **Original Table Structure:**
+  1. **Original Table Structure (for structure reference ONLY - DO NOT copy the data values):**
   {html}
 
-  2. **Table Summary:**
+  2. **Table Summary (describes the data patterns to follow):**
   {summary}
 
   **Requirements:**
-  1. **Structure:** Keep the exact same HTML structure.
-  2. **Data:** Replace ALL cell values with new, synthetic business data.
-     - Use realistic Korean company names, department names, product lines, and financial metrics.
-     - Contexts: Sales Reports, Inventory, HR Employee Lists, Marketing Campaigns.
-     - Do NOT use real private data.
-  3. **Consistency:** Ensure mathematical consistency (e.g., Q1 + Q2 + Q3 + Q4 = Total).
-  4. **Output:** Return ONLY the raw HTML string starting with `<table>` and ending with `</table>`.
+  1. **Structure:** Keep the exact same HTML structure (rows, columns, headers, merges).
+  2. **Headers:** Keep header text the same (column names, category labels).
+  3. **Data Transformation - MANDATORY:**
+     - **ALL data cell values MUST be replaced with completely new synthetic values.**
+     - **DO NOT copy any original data values** - generate fresh, realistic alternatives.
+     - For company names: Generate new Korean company names (e.g., "삼성물산" → "현대상사", "A팀" → "B팀")
+     - For employee names: Generate new Korean names
+     - For product names: Generate new product line names
+     - For revenue/sales figures: Generate new realistic amounts (different values)
+     - For dates: Generate new plausible dates
+  4. **Domain Consistency:**
+     - Ensure business logic (Q1+Q2+Q3+Q4=Total, percentages add up)
+     - Use realistic Korean business terminology
+     - Contexts: Sales Reports, Inventory, HR Employee Lists, Marketing Campaigns
+  5. **Output:** Return ONLY the raw HTML string starting with `<table>` and ending with `</table>`.
+
+  **Example Transformation:**
+  - Original: "영업1팀" → Synthetic: "마케팅2팀"
+  - Original: "매출 5억원" → Synthetic: "매출 7.3억원"
+  - Original: "김부장" → Synthetic: "박과장"
+
+  Remember: The synthetic table should look like a completely different business dataset from the same domain.
 
 generate_synthetic_table_from_image: |
   You are a Synthetic Data Generator specializing in Business Data.
-  Your task is to generate a new HTML table that mirrors the structure of the provided image but contains entirely new, realistic synthetic business data.
+
+  **CRITICAL INSTRUCTION: DO NOT TRANSCRIBE - GENERATE NEW DATA**
+  Your task is NOT to OCR/transcribe the image. Instead, you must:
+  1. Understand the table's STRUCTURE from the image
+  2. Understand it's a BUSINESS table
+  3. Generate COMPLETELY NEW synthetic business data that fits the domain but uses different values
 
   **Inputs:**
-  1. **Image:** An image of a business table.
+  1. **Image:** An image of a business table. Use this to understand structure and domain ONLY.
 
   **Requirements:**
   1. **Structure Preservation:** Accurately reconstruct the table structure.
-  2. **Data Generation:** Replace ALL cell values with new, synthetic business data.
-     - Use realistic Korean company names, products, sales figures.
-  3. **Styling:** Use **Tailwind CSS** classes (same as default).
+  2. **Headers:** Keep header text (column names, category labels) the same as in the image.
+  3. **Data Generation - CRITICAL:**
+     - **DO NOT copy the data values from the image** - this is NOT an OCR task
+     - Generate COMPLETELY NEW synthetic business values for all data cells
+     - For company/team names: Generate new names (different from what you see)
+     - For sales/revenue figures: Generate new realistic amounts
+     - For employee names: Generate new Korean names
+  4. **Styling:** Use **Tailwind CSS** classes (same as default).
      - `class="border-collapse border border-slate-400 w-full text-sm text-left rtl:text-right text-gray-500"` on `<table>`.
      - `class="border border-slate-300 p-2 bg-gray-50 font-semibold"` on `<th>`.
      - `class="border border-slate-300 p-2"` on `<td>`.
-  4. **Output Format:** Return ONLY the raw HTML string starting with `<table>` and ending with `</table>`.
+  5. **Output Format:** Return ONLY the raw HTML string starting with `<table>` and ending with `</table>`.
+
+  Remember: The output should be a new synthetic business dataset, not a transcription of the original.
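
The Domain Consistency requirement in the business prompt (Q1+Q2+Q3+Q4=Total) is something a model can get wrong, so it may be worth checking generated rows after the fact. The following is a minimal sketch, assuming each table row has already been parsed into a dict of column name to cell text. `parse_amount`, `quarters_match_total`, and the column names are hypothetical and not part of this commit.

```python
def parse_amount(text: str) -> int:
    """Convert a cell such as '1,500,000' to an integer."""
    return int(text.replace(",", "").strip())


def quarters_match_total(
    row: dict,
    quarter_keys: tuple = ("Q1", "Q2", "Q3", "Q4"),
    total_key: str = "Total",
) -> bool:
    """Return True when the quarterly cells sum to the total cell."""
    quarterly_sum = sum(parse_amount(row[key]) for key in quarter_keys)
    return quarterly_sum == parse_amount(row[total_key])


row = {"Q1": "1,200", "Q2": "1,300", "Q3": "1,100", "Q4": "1,400", "Total": "5,000"}
print(quarters_match_total(row))  # True
```

Rows that fail the check could be regenerated or discarded; which of the two is appropriate depends on the pipeline, which the diff does not show.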

generate_synthetic_table/prompts/default.yaml

Lines changed: 59 additions & 20 deletions
@@ -78,43 +78,82 @@ generate_qa_from_image: |
   Return ONLY the JSON object, no additional text.
 
 generate_synthetic_table: |
-  You are a Synthetic Data Generator.
-  Your task is to generate a new HTML table that mirrors the structure of the provided original table but contains entirely new, realistic synthetic data.
+  You are a Synthetic Data Generator specialized in creating completely NEW data while preserving table structure.
+
+  **CRITICAL INSTRUCTION: DO NOT COPY ORIGINAL DATA**
+  Your task is to generate a new HTML table that has the SAME STRUCTURE as the original but with COMPLETELY DIFFERENT, newly generated data values.
+  The goal is to create realistic synthetic data that looks like it could come from the same domain, but with entirely different entities, names, numbers, and values.
 
   **Inputs:**
-  1. **Original Table Structure:**
+  1. **Original Table Structure (for structure reference ONLY - DO NOT copy the data values):**
   {html}
 
-  2. **Table Summary:**
+  2. **Table Summary (describes the data patterns to follow):**
   {summary}
 
   **Requirements:**
-  1. **Structure:** Keep the exact same HTML structure (rows, columns, headers, merges) as the original table.
-  2. **Data:** Replace ALL cell values with new, synthetic data.
-     - Use realistic Korean names, organizations, and values suitable for the context.
-     - Ensure the data is consistent with the column types and patterns described in the summary.
-     - Do NOT use real private data.
-  3. **Consistency:** Ensure mathematical consistency if applicable (e.g., sums, percentages).
-  4. **Output:** Return ONLY the raw HTML string starting with `<table>` and ending with `</table>`. Do not include markdown code blocks.
+  1. **Structure:** Keep the exact same HTML structure (rows, columns, headers, rowspan, colspan, merges) as the original.
+  2. **Headers:** Keep header text the same (column names, row labels that describe categories).
+  3. **Data Transformation - MANDATORY:**
+     - **ALL data cell values MUST be replaced with completely new synthetic values.**
+     - **DO NOT copy any original data values** - generate fresh, realistic alternatives.
+     - For names: Generate new Korean names (e.g., 김철수 → 이영희, 박민수 → 정하늘)
+     - For organizations: Generate new realistic Korean organization names
+     - For numbers: Generate new realistic numbers that follow the same pattern/range but are different values
+     - For dates: Generate new plausible dates
+     - For addresses: Generate new realistic Korean addresses
+     - For any other text: Generate semantically similar but different content
+  4. **Domain Consistency:**
+     - Analyze the summary to understand the domain context (finance, medical, public, etc.)
+     - Generate data that is realistic for that specific domain
+     - Maintain internal consistency (e.g., totals should sum correctly, percentages should add up)
+  5. **Output:** Return ONLY the raw HTML string starting with `<table>` and ending with `</table>`. No markdown code blocks.
+
+  **Example Transformation (showing the expected behavior):**
+  - Original: "삼성전자" → Synthetic: "한국테크"
+  - Original: "1,500,000" → Synthetic: "2,340,000"
+  - Original: "2024-01-15" → Synthetic: "2024-03-22"
+  - Original: "서울시 강남구" → Synthetic: "부산시 해운대구"
+
+  Remember: The synthetic table should look like a completely different dataset from the same domain, not a copy of the original.
 
 generate_synthetic_table_from_image: |
-  You are a Synthetic Data Generator specialized in Korean documents.
-  Your task is to generate a new HTML table that mirrors the structure of the provided image but contains entirely new, realistic synthetic data.
+  You are a Synthetic Data Generator specialized in creating completely NEW data from Korean document images.
+
+  **CRITICAL INSTRUCTION: DO NOT TRANSCRIBE - GENERATE NEW DATA**
+  Your task is NOT to OCR/transcribe the image. Instead, you must:
+  1. Understand the table's STRUCTURE from the image
+  2. Understand the DOMAIN and data patterns from the image
+  3. Generate COMPLETELY NEW synthetic data that fits the same domain but uses different values
 
   **Inputs:**
-  1. **Image:** An image of a table containing Korean text.
+  1. **Image:** An image of a table containing Korean text. Use this to understand structure and domain ONLY.
 
   **Requirements:**
   1. **Structure Preservation:** Accurately reconstruct the table structure, including `rowspan` and `colspan` attributes for merged cells.
-  2. **Data Generation:** Replace ALL cell values with new, synthetic data.
-     - Use realistic Korean names, organizations, and values suitable for the context of the table in the image.
-     - Ensure the data is consistent with the column types (e.g., dates, numbers, text).
-     - Do NOT use real private data.
-  3. **Styling:** Use **Tailwind CSS** classes to style the table.
+  2. **Headers:** Keep header text (column names, category labels) the same as in the image.
+  3. **Data Generation - CRITICAL:**
+     - **DO NOT copy the data values from the image** - this is NOT an OCR task
+     - Generate COMPLETELY NEW synthetic values for all data cells
+     - Analyze the domain from the image (finance, medical, public, etc.) and generate appropriate data
+     - For names: Generate new Korean names different from what you see
+     - For organizations: Generate new realistic Korean organization names
+     - For numbers: Generate new realistic numbers in similar ranges but different values
+     - For dates: Generate new plausible dates
+     - For addresses: Generate new realistic Korean addresses
+     - The synthetic table should look like a DIFFERENT dataset from the same domain
+  4. **Styling:** Use **Tailwind CSS** classes to style the table.
      - Add `class="border-collapse border border-slate-400 w-full text-sm text-left rtl:text-right text-gray-500"` to the `<table>` tag.
      - Add `class="border border-slate-300 p-2 bg-gray-50 font-semibold"` to `<th>` tags.
      - Add `class="border border-slate-300 p-2"` to `<td>` tags.
-  4. **Output Format:** Return ONLY the raw HTML string starting with `<table>` and ending with `</table>`. Do not include markdown code blocks.
+  5. **Output Format:** Return ONLY the raw HTML string starting with `<table>` and ending with `</table>`. No markdown code blocks.
+
+  **Example of Expected Behavior:**
+  If the image shows a table with employee "김철수" and salary "3,500,000":
+  - WRONG (OCR copy): "김철수", "3,500,000"
+  - CORRECT (synthetic): "이영희", "4,200,000"
+
+  Remember: The output should be a new synthetic dataset, not a transcription of the original.
 
 image_to_html: |
   You are an AI assistant specialized in OCR and HTML generation.
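
The revised prompts ask for three things that can be verified mechanically: headers stay the same, data cells are not copied, and the reply is a bare `<table>...</table>` string with no markdown fence. The following is a rough sketch of such a post-generation check, using only the standard-library `html.parser`. `CellCollector`, `collect_cells`, and `check_synthetic` are hypothetical names; nothing in this commit indicates that the repository performs this check.

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collect the text of <th> and <td> cells in document order."""

    def __init__(self):
        super().__init__()
        self.headers = []   # text of each <th>
        self.data = []      # text of each <td>
        self._tag = None    # cell tag currently open, if any
        self._buf = []      # text fragments of the open cell

    def handle_starttag(self, tag, attrs):
        if tag in ("th", "td"):
            self._tag = tag
            self._buf = []

    def handle_data(self, data):
        if self._tag is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == self._tag:
            text = "".join(self._buf).strip()
            (self.headers if tag == "th" else self.data).append(text)
            self._tag = None


def collect_cells(html):
    parser = CellCollector()
    parser.feed(html)
    parser.close()
    return parser


def check_synthetic(original_html, synthetic_html):
    """Compare a synthetic table against the original it was modeled on."""
    orig = collect_cells(original_html)
    synth = collect_cells(synthetic_html)
    original_values = {value for value in orig.data if value}
    stripped = synthetic_html.strip()
    return {
        # "<table" without ">" so that a class attribute is accepted.
        "raw_table_only": stripped.startswith("<table") and stripped.endswith("</table>"),
        "headers_preserved": orig.headers == synth.headers,
        "same_cell_count": len(orig.data) == len(synth.data),
        "copied_values": [value for value in synth.data if value in original_values],
    }


original = "<table><tr><th>이름</th><th>급여</th></tr><tr><td>김철수</td><td>3,500,000</td></tr></table>"
synthetic = "<table><tr><th>이름</th><th>급여</th></tr><tr><td>이영희</td><td>4,200,000</td></tr></table>"
report = check_synthetic(original, synthetic)
print(report)
```

A non-empty `copied_values` list is a signal for review and not proof of copying, since short values such as `0` or a repeated category label can legitimately appear in both tables.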
