Data Engineering - Project 1

Assignment

Choose a computer science research topic (e.g., query optimization) and download scientific papers on that topic in HTML format.
From the HTML source files, extract:
- tables
- tables’ captions
- footnotes in the tables or in the captions
- paragraphs containing references to the extracted tables (paragraph text, reference text or link)
For each paper:
- Using XPath, extract all the tables.
- For each table, extract its caption, its footnotes (if present), and the paragraphs that reference it.
- Store the extracted data in a .json file named exactly as the paper's identifier.
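The per-paper steps above can be sketched as follows. This is only an illustration under stated assumptions, not a prescribed solution: it assumes each table sits inside a `<figure>` element with an `id`, that the caption is a `<figcaption>`, that footnotes are `<p class="footnote">` children of the figure, and that referencing paragraphs link to the table with `href="#<id>"`. Real papers use different layouts, so the XPath expressions will need adapting. The `SAMPLE` document and the `extract_tables` and `text_of` names are hypothetical. To stay free of dependencies, the sketch uses the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath and needs well-formed markup; for real HTML pages a lenient parser with full XPath support, such as `lxml`, is the more usual choice.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample standing in for a downloaded paper.
SAMPLE = """<html><body>
<p>Results are summarized in <a href="#T1">Table 1</a>.</p>
<p>This paragraph does not mention any table.</p>
<figure id="T1">
  <figcaption>Table 1: Runtime per query.</figcaption>
  <table><tr><th>Query</th><th>Time (s)</th></tr><tr><td>Q1</td><td>0.42</td></tr></table>
  <p class="footnote">Measured on a single node.</p>
</figure>
</body></html>"""


def text_of(element):
    """Return the element's full text content with whitespace collapsed."""
    return " ".join("".join(element.itertext()).split())


def extract_tables(html_source):
    """Return {table_id: {caption, table, footnotes, references}} for one paper."""
    root = ET.fromstring(html_source)
    extracted = {}
    # Assumed layout: every table is wrapped in a <figure> that carries an id.
    for figure in root.findall(".//figure[@id]"):
        table = figure.find("./table")
        if table is None:
            continue  # a figure without a table, e.g. an image
        table_id = figure.get("id")
        caption = figure.find("./figcaption")
        footnotes = figure.findall("./p[@class='footnote']")
        # Parents of links that point at this table's anchor.
        parents = root.findall(f".//a[@href='#{table_id}']/..")
        extracted[table_id] = {
            "caption": text_of(caption) if caption is not None else "",
            "table": ET.tostring(table, encoding="unicode", method="html").strip(),
            "footnotes": [text_of(note) for note in footnotes],
            # Keep only parents that are paragraphs (links nested deeper are missed).
            "references": [text_of(p) for p in parents if p.tag == "p"],
        }
    return extracted
```

Calling `extract_tables(SAMPLE)` returns a dictionary with a single entry keyed `T1`. The `/..` step only reaches the direct parent of the link, so a link wrapped in, say, `<em>` inside the paragraph would be missed; that is one of the cases where the expression has to be refined.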
Structure of the JSON file:

{ 
    "id_table_1": 
    {
        "caption": "text",
        "table": "html_table",
        "footnotes": ["footnote1", "footnote2", ...],
        "references": ["paragraph",  "paragraph", ...]
    },

    "id_table_2": {
        ...
    }
}

Store the JSON files in a directory named "extraction".
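Writing one file per paper into that directory might look like the sketch below. The function name `save_extraction`, the sample identifier `paper-0001`, and the contents of `EXAMPLE_TABLES` are hypothetical; only the key names and the directory name come from the assignment.

```python
import json
from pathlib import Path

# Hypothetical data in the structure described above.
EXAMPLE_TABLES = {
    "T1": {
        "caption": "Table 1: Runtime per query.",
        "table": "<table><tr><td>Q1</td><td>0.42</td></tr></table>",
        "footnotes": ["Measured on a single node."],
        "references": ["Results are summarized in Table 1."],
    }
}


def save_extraction(paper_id, tables, output_dir="extraction"):
    """Write one paper's tables to <output_dir>/<paper_id>.json and return the path."""
    directory = Path(output_dir)
    directory.mkdir(parents=True, exist_ok=True)  # create the directory if needed
    path = directory / f"{paper_id}.json"
    with path.open("w", encoding="utf-8") as handle:
        # ensure_ascii=False keeps non-ASCII characters readable in the file.
        json.dump(tables, handle, ensure_ascii=False, indent=4)
    return path


if __name__ == "__main__":
    # "paper-0001" is a placeholder for the real paper identifier.
    print(save_extraction("paper-0001", EXAMPLE_TABLES))
```

Using the identifier as the file stem keeps the mapping between a paper and its JSON file unambiguous. If identifiers contain characters that are not valid in file names, they would need to be handled separately.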
To assess the quality of the extraction, check:
- Whether all tables were extracted.
- Whether all footnotes, captions, and paragraphs with references were extracted. If something is missing, refine the XPath expression.
- Note that every table should have at least one referencing paragraph.
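Part of this check can be automated on the JSON files alone. The sketch below flags entries with missing keys, an empty caption, an empty table, or no referencing paragraph. Footnotes are optional, so an empty list is accepted. The names `check_tables` and `check_directory` are hypothetical. The script cannot tell whether a table was skipped entirely; that would require comparing against the number of tables in the source HTML.

```python
import json
from pathlib import Path

REQUIRED_KEYS = ("caption", "table", "footnotes", "references")


def check_tables(tables):
    """Return a list of human-readable problems found in one paper's tables."""
    problems = []
    for table_id, entry in tables.items():
        missing = [key for key in REQUIRED_KEYS if key not in entry]
        if missing:
            problems.append(f"{table_id}: missing keys {missing}")
            continue  # the remaining checks need all keys to be present
        if not entry["caption"].strip():
            problems.append(f"{table_id}: empty caption")
        if not entry["table"].strip():
            problems.append(f"{table_id}: empty table")
        if not entry["references"]:
            problems.append(f"{table_id}: no referencing paragraph")
    return problems


def check_directory(directory="extraction"):
    """Map each JSON file name to its problems; files without problems are omitted."""
    report = {}
    for path in sorted(Path(directory).glob("*.json")):
        tables = json.loads(path.read_text(encoding="utf-8"))
        problems = check_tables(tables)
        if problems:
            report[path.name] = problems
    return report
```

An empty report from `check_directory()` means none of these particular problems were found; it does not prove that the extraction is complete.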
