
parse_pubmed_paragraph does not propagate parent section titles for nested <sec> #174

@wzhou122

Hi,

I am trying to use the parse_pubmed_paragraph function to get section names and text, and then apply a regular expression to the section name to find Methods sections. However, after looking into the current implementation, I found that it might not do what I expected.

I just want to double-check that I understand it correctly.

Currently, the parse_pubmed_paragraph function in pubmed_oa_parser.py uses paragraph.find("../title"), which only goes up one level and returns the title of the immediate parent section. This means a paragraph in a nested subsection (e.g., Methods > Step 1 > Step 2) only gets "Step 2" as its section name, not "Methods", so it is missed when filtering for "Methods" sections.

For example, given the XML below:

<sec>
  <title>Methods</title>
  <p>Direct paragraph in Methods</p>
  <sec>
    <title>Step 1</title>
    <p>Paragraph in Methods > Step 1</p>
    <sec>
      <title>Step 2</title>
      <p>Paragraph in Methods > Step 1 > Step 2</p>
    </sec>
  </sec>
</sec>
<sec>
  <title>Results</title>
  <p>Paragraph in Results</p>
</sec>

With the current implementation, I would expect:

Paragraph in "Methods" gets section = "Methods" ✓
Paragraph in "Step 1" gets section = "Step 1" (missing parent "Methods") ✗
Paragraph in "Step 2" gets section = "Step 2" (missing parents "Methods > Step 1") ✗
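
To make the expected behavior concrete, here is a minimal sketch of how the full section path could be collected by walking up through every ancestor <sec>. It is not the library's code: it uses the standard library's xml.etree.ElementTree instead of lxml, the <body>, <sec> and <p> wrappers around the example are my assumption about the structure, and section_paths is a hypothetical helper name.

```python
import re
import xml.etree.ElementTree as ET

# Assumed structure for the example above (nested <sec> elements, one <p> each).
XML = """<body>
  <sec>
    <title>Methods</title>
    <p>Direct paragraph in Methods</p>
    <sec>
      <title>Step 1</title>
      <p>Paragraph in Methods &gt; Step 1</p>
      <sec>
        <title>Step 2</title>
        <p>Paragraph in Methods &gt; Step 1 &gt; Step 2</p>
      </sec>
    </sec>
  </sec>
  <sec>
    <title>Results</title>
    <p>Paragraph in Results</p>
  </sec>
</body>"""


def section_paths(root):
    """Hypothetical helper: return (section_path, text) for every <p>.

    The path joins the titles of all ancestor <sec> elements, outermost first.
    """
    # ElementTree elements do not know their parent, so build a lookup table.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    rows = []
    for p in root.iter("p"):
        titles = []
        node = parent_of.get(p)
        while node is not None:
            if node.tag == "sec":
                title = node.find("title")
                if title is not None:
                    titles.append("".join(title.itertext()).strip())
            node = parent_of.get(node)
        path = " > ".join(reversed(titles))
        rows.append((path, "".join(p.itertext()).strip()))
    return rows


rows = section_paths(ET.fromstring(XML))
for path, text in rows:
    print(f"{path!r}: {text}")

# Filtering on the full path now also catches the nested paragraphs.
methods = [text for path, text in rows if re.search(r"methods", path, re.I)]
```

With a path like this, a regex for "Methods" matches all three paragraphs under Methods, while the Results paragraph is still excluded.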

Please let me know if I am understanding it correctly. Thanks!
