Feature/problems in db #21
Merged
19 commits
5a1b114 Remove example app (XanderVertegaal)
2384fb8 Add problem app, model, migration (XanderVertegaal)
ce0410a Import SICK command (XanderVertegaal)
78ebdc5 Import FraCaS data (XanderVertegaal)
157ce44 Remove default path in import_sick (XanderVertegaal)
9f5691b Update README (XanderVertegaal)
d29f4bf Remove example app from frontend (XanderVertegaal)
105960b Black (XanderVertegaal)
b2d7717 Fix validator; squash migrations (XanderVertegaal)
0b7e423 Add TQDM (XanderVertegaal)
0a63f18 Add logger (XanderVertegaal)
a12e07a Unified Problem model (XanderVertegaal)
b79e1cb Remove progress util (XanderVertegaal)
1e8ccbb Add types and converters for Sick and Fracas problems (XanderVertegaal)
d32798d Rework import scripts (XanderVertegaal)
07b75e6 Replace TextField with JSONField (XanderVertegaal)
a6c1eaf Rerun pip-compile (XanderVertegaal)
0d8becf Use logger instead of print (XanderVertegaal)
84bfdcd Remove unnecessary atomic() (XanderVertegaal)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -31,3 +31,6 @@ venv/ | |
| ENV/ | ||
| env.bak/ | ||
| venv.bak/ | ||
|
|
||
| # Data files | ||
| problem/data/* | ||
This file was deleted.
This file was deleted.
This file was deleted.
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,3 @@ | ||
| import logging | ||
|
|
||
| logger = logging.getLogger('LangProAnnotator') |
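The import commands in this PR report their summary through `logger.info`. A named logger without its own level inherits the root logger's threshold, which is WARNING by default, so INFO records are dropped unless logging is configured somewhere. Whether this project sets `LOGGING` in its Django settings is not visible in this diff. The following is a minimal sketch using only the standard library; the handler, format string, and messages are assumptions made for illustration.

```python
import io
import logging

logger = logging.getLogger("LangProAnnotator")
logger.propagate = False  # keep the demo output confined to our own handler

# Capture output in memory so the effect of the level is easy to inspect.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(levelname)s %(name)s: %(message)s"))
logger.addHandler(handler)

# Mirror the default threshold that would be inherited from the root logger.
logger.setLevel(logging.WARNING)
logger.info("import finished (dropped)")  # below the threshold, not emitted

# Lower the threshold so that INFO records pass through.
logger.setLevel(logging.INFO)
logger.info("import finished (kept)")

output = stream.getvalue()
print(output, end="")
```

In a Django project the same effect would normally be reached through the `LOGGING` setting rather than by attaching handlers in code.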
File renamed without changes.
File renamed without changes.
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,6 @@ | ||
| from django.apps import AppConfig | ||
|
|
||
|
|
||
| class ProblemConfig(AppConfig): | ||
| default_auto_field = "django.db.models.BigAutoField" | ||
| name = "problem" |
File renamed without changes.
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,125 @@ | ||
| import json | ||
| import xml.etree.ElementTree as ET | ||
|
|
||
| from django.core.management.base import BaseCommand | ||
| from django.db import transaction | ||
| from tqdm import tqdm | ||
|
|
||
| from langpro_annotator.logger import logger | ||
| from problem.services import get_fracas_problems | ||
| from problem.models import Problem | ||
|
|
||
|
|
||
| class Command(BaseCommand): | ||
| help = "Import FraCaS problems from fracas.xml." | ||
|
|
||
| def add_arguments(self, parser): | ||
| parser.add_argument( | ||
| "--fracas_path", | ||
| type=str, | ||
| default="problem/data/fracas.xml", | ||
| help="Path to the fracas.xml file.", | ||
| ) | ||
|
|
||
| def handle(self, *args, **options): | ||
| fracas_path = options["fracas_path"] | ||
| self.import_fracas_problems(fracas_path) | ||
|
|
||
| @staticmethod | ||
| def _text_from_element(element: ET.Element) -> str: | ||
| """ | ||
| Extracts stripped text from an XML element, returning an empty string if the element is None or has no text. | ||
| """ | ||
| return element.text.strip() if element is not None and element.text else "" | ||
|
|
||
| @staticmethod | ||
| def _annotate_section_subsections(tree: ET.ElementTree) -> None: | ||
| """ | ||
| Annotates each problem in the XML tree with its corresponding section, subsection, and subsubsection. | ||
| """ | ||
| current_section = None | ||
| current_subsection = None | ||
| current_subsubsection = None | ||
|
|
||
| root = tree.getroot() | ||
|
|
||
| for element in root: | ||
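| if element.tag == "comment" and element.attrib.get("class") == "section": | ||
| current_section = element.text.strip() | ||
| # A new section begins, so headings from the previous section no longer apply. | ||
| current_subsection = None | ||
| current_subsubsection = None | ||
| elif ( | ||
| element.tag == "comment" and element.attrib.get("class") == "subsection" | ||
| ): | ||
| current_subsection = element.text.strip() | ||
| current_subsubsection = None | ||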
| elif ( | ||
| element.tag == "comment" | ||
| and element.attrib.get("class") == "subsubsection" | ||
| ): | ||
| current_subsubsection = element.text.strip() | ||
| elif element.tag == "problem": | ||
| if current_section: | ||
| element.set("section", current_section) | ||
| if current_subsection: | ||
| element.set("subsection", current_subsection) | ||
| if current_subsubsection: | ||
| element.set("subsubsection", current_subsubsection) | ||
|
|
||
| def import_fracas_problems(self, fracas_path: str) -> None: | ||
| tree = ET.parse(fracas_path) | ||
| self._annotate_section_subsections(tree) | ||
| root = tree.getroot() | ||
| all_problems = root.findall("problem") | ||
|
|
||
| created = 0 | ||
| skipped = 0 | ||
|
|
||
| existing_fracas_problems = get_fracas_problems() | ||
| existing_fracas_ids = {p.fracas_id for p in existing_fracas_problems} | ||
|
|
||
| for problem in tqdm(all_problems, desc="Importing FraCaS problems"): | ||
| problem_id = problem.get("id") | ||
| if problem_id is None: | ||
| raise ValueError( | ||
| "Problem ID is missing in the XML file for problem: {}".format( | ||
| problem | ||
| ) | ||
| ) | ||
|
|
||
| if int(problem_id) in existing_fracas_ids: | ||
| skipped += 1 | ||
| continue | ||
|
|
||
| question = self._text_from_element(problem.find("q")) | ||
| hypothesis = self._text_from_element(problem.find("h")) | ||
| answer = self._text_from_element(problem.find("a")) | ||
| note = self._text_from_element(problem.find("note")) | ||
|
|
||
| section = problem.get("section") | ||
| subsection = problem.get("subsection") | ||
| fracas_answer = problem.get("fracas_answer") | ||
| fracas_nonstandard = problem.get("fracas_nonstandard") == "true" | ||
|
|
||
| premise_nodes = problem.findall("p") | ||
| premises = [node.text.strip() for node in premise_nodes if node.text] | ||
|
|
||
| Problem.objects.create( | ||
| type=Problem.ProblemType.FRACAS, | ||
| content=json.dumps( | ||
| { | ||
| "fracas_id": int(problem_id), | ||
| "question": question, | ||
| "hypothesis": hypothesis, | ||
| "answer": answer, | ||
| "fracas_answer": fracas_answer, | ||
| "fracas_non_standard": fracas_nonstandard, | ||
| "note": note, | ||
| "section_name": section, | ||
| "subsection_name": subsection, | ||
| "premises": premises, | ||
| } | ||
| ), | ||
| ) | ||
| created += 1 | ||
|
|
||
| logger.info( | ||
| f"FraCaS problems import complete! Created: {created} | Skipped: {skipped}" | ||
| ) |
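The section annotation and field extraction above can be tried on a small in-memory document. This is a minimal sketch, not the PR's code: the XML snippet is invented for illustration (the real fracas.xml is not shown in this diff), `annotate_sections` and `text_of` are hypothetical standalone variants of the command's helpers, and clearing the subsection when a new section starts is an assumption about the intended behaviour.

```python
import xml.etree.ElementTree as ET

# Invented sample in the shape the import command reads; not taken from fracas.xml.
SAMPLE = """
<fracas-problems>
  <comment class="section">1 QUANTIFIERS</comment>
  <comment class="subsection">1.1 Conservativity</comment>
  <problem id="1" fracas_answer="yes">
    <p idx="1">Every cat sleeps.</p>
    <q>Does every cat sleep?</q>
    <h>Every cat sleeps.</h>
    <a>Yes</a>
  </problem>
  <comment class="section">2 PLURALS</comment>
  <problem id="2" fracas_answer="unknown">
    <p idx="1">Two dogs bark.</p>
    <h>Three dogs bark.</h>
  </problem>
</fracas-problems>
"""


def text_of(element):
    """Return stripped text, or "" when the element or its text is missing."""
    return element.text.strip() if element is not None and element.text else ""


def annotate_sections(root):
    """Copy the most recent section/subsection heading onto each <problem>."""
    section = subsection = None
    for element in root:
        kind = element.attrib.get("class") if element.tag == "comment" else None
        if kind == "section":
            section = text_of(element)
            subsection = None  # assumption: a new section clears the old subsection
        elif kind == "subsection":
            subsection = text_of(element)
        elif element.tag == "problem":
            if section:
                element.set("section", section)
            if subsection:
                element.set("subsection", subsection)


root = ET.fromstring(SAMPLE)
annotate_sections(root)

# Build the same kind of dictionary the command stores in Problem.content.
records = [
    {
        "fracas_id": int(p.get("id")),
        "section_name": p.get("section"),
        "subsection_name": p.get("subsection"),
        "premises": [text_of(node) for node in p.findall("p")],
        "question": text_of(p.find("q")),
        "hypothesis": text_of(p.find("h")),
    }
    for p in root.findall("problem")
]

for record in records:
    print(record)
```

The second problem has no `<q>` element, so its question falls back to an empty string, which is the case `_text_from_element` guards against.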
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,55 @@ | ||
| import csv | ||
| import json | ||
|
|
||
| from django.core.management.base import BaseCommand | ||
| from tqdm import tqdm | ||
|
|
||
| from langpro_annotator.logger import logger | ||
| from problem.models import Problem | ||
| from problem.services import get_sick_problems | ||
|
|
||
|
|
||
| class Command(BaseCommand): | ||
| help = "Import SICK problems from SICK.txt (a TSV file)." | ||
|
|
||
| def add_arguments(self, parser): | ||
| parser.add_argument( | ||
| "--sick_path", | ||
| type=str, | ||
| default="problem/data/SICK.txt", | ||
| help="Path to the SICK.txt file.", | ||
| ) | ||
|
|
||
| def handle(self, *args, **options): | ||
| sick_path = options["sick_path"] | ||
| self.import_sick_problems(sick_path) | ||
|
|
||
| def import_sick_problems(self, sick_path: str) -> None: | ||
| """ | ||
| Import SICK problems from SICK.txt (a TSV file) and enter them into the database. | ||
| """ | ||
|
|
||
| skipped = 0 | ||
| created = 0 | ||
|
|
||
| existing_sick_problems = get_sick_problems() | ||
| existing_pair_ids = {p.pair_id for p in existing_sick_problems} | ||
|
|
||
| with open(sick_path, "r", encoding="utf-8") as file: | ||
| reader = csv.DictReader(file, delimiter="\t") | ||
| problem_list = list(reader) | ||
|
|
||
| for problem in tqdm(problem_list, desc="Importing SICK problems"): | ||
| if problem["pair_ID"] in existing_pair_ids: | ||
| skipped += 1 | ||
| continue | ||
|
|
||
| created += 1 | ||
| Problem.objects.create( | ||
| type=Problem.ProblemType.SICK, | ||
| content=json.dumps(problem), | ||
| ) | ||
|
|
||
| logger.info( | ||
| f"SICK problems import complete! Created: {created} | Skipped: {skipped}" | ||
| ) | ||
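The skip-if-present logic above can be exercised without a database. The sketch below reads an invented two-row TSV sample from memory; column names other than `pair_ID` are assumptions, and `select_new_rows` is a hypothetical helper, not part of the PR. One detail it makes visible: `csv.DictReader` yields every field as a string, so the set of existing IDs has to hold strings for the membership test to match. Whether `get_sick_problems` returns string or integer `pair_id` values is not visible in this diff.

```python
import csv
import io

# Invented sample; only the pair_ID column name is taken from the command above.
SAMPLE_TSV = (
    "pair_ID\tsentence_A\tsentence_B\tentailment_label\n"
    "1\tA dog runs.\tAn animal moves.\tENTAILMENT\n"
    "2\tA man sings.\tNobody sings.\tCONTRADICTION\n"
)


def select_new_rows(handle, existing_ids):
    """Return (rows not yet imported, number of rows skipped)."""
    reader = csv.DictReader(handle, delimiter="\t")
    new_rows = []
    skipped = 0
    for row in reader:
        if row["pair_ID"] in existing_ids:
            skipped += 1
            continue
        new_rows.append(row)
    return new_rows, skipped


# IDs stored as strings match the strings produced by the reader.
new_rows, skipped = select_new_rows(io.StringIO(SAMPLE_TSV), existing_ids={"1"})

# IDs stored as integers never match, so nothing would be skipped.
rows_int, skipped_int = select_new_rows(io.StringIO(SAMPLE_TSV), existing_ids={1})

print(len(new_rows), skipped)
print(len(rows_int), skipped_int)
```

If the stored IDs turned out to be integers, converting with `int(problem["pair_ID"])` before the membership test, as the FraCaS command does with `int(problem_id)`, would be one way to keep re-runs idempotent.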
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,34 @@ | ||
| # Generated by Django 4.2.20 on 2025-05-22 13:40 | ||
|
|
||
| from django.db import migrations, models | ||
|
|
||
|
|
||
| class Migration(migrations.Migration): | ||
|
|
||
| initial = True | ||
|
|
||
| dependencies = [] | ||
|
|
||
| operations = [ | ||
| migrations.CreateModel( | ||
| name="Problem", | ||
| fields=[ | ||
| ( | ||
| "id", | ||
| models.BigAutoField( | ||
| auto_created=True, | ||
| primary_key=True, | ||
| serialize=False, | ||
| verbose_name="ID", | ||
| ), | ||
| ), | ||
| ( | ||
| "type", | ||
| models.CharField( | ||
| choices=[("sick", "Sick"), ("fracas", "FraCaS")], max_length=255 | ||
| ), | ||
| ), | ||
| ("content", models.JSONField()), | ||
| ], | ||
| ), | ||
| ] |
Empty file.
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,14 @@ | ||
| from django.db import models | ||
|
|
||
|
|
||
| class Problem(models.Model): | ||
| class ProblemType(models.TextChoices): | ||
| SICK = "sick", "Sick" | ||
| FRACAS = "fracas", "FraCaS" | ||
|
|
||
| type = models.CharField( | ||
| max_length=255, | ||
| choices=ProblemType.choices, | ||
| ) | ||
|
|
||
| content = models.JSONField() |
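Django's `JSONField` serializes the assigned Python value itself when saving. The sketch below uses only the standard `json` module to imitate that step (it does not touch Django or a database) and compares handing the field a dict with handing it a string that was already produced by `json.dumps`, as the import commands above do. The `store` and `load` helpers are hypothetical stand-ins for the field's save and load steps, and the payload is invented.

```python
import json


def store(value):
    """Stand-in for the save step: encode the Python value as JSON text."""
    return json.dumps(value)


def load(column_text):
    """Stand-in for the load step: decode JSON text back to a Python value."""
    return json.loads(column_text)


payload = {"pair_ID": "1", "sentence_A": "A dog runs."}

# Passing the dict directly: the value comes back as a dict.
as_dict = load(store(payload))

# Passing an already serialized string: the value comes back as a str.
as_string = load(store(json.dumps(payload)))

print(type(as_dict).__name__)
print(type(as_string).__name__)
```

With a pre-serialized string, the loaded value needs a second `json.loads` before its keys can be read, and key lookups on the column would likely not match. Whether the services in this PR decode the string again is not visible in this diff.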