Hi,
I am trying to use the parse_pubmed_paragraph function to get section names and text, and then apply a regular expression to the section name to find Methods sections. However, looking into the current implementation, I found that it might not do what I expected.
I just want to double-check that I understand it correctly.
Currently, the parse_pubmed_paragraph function in pubmed_oa_parser.py uses paragraph.find("../title"), which only goes up one level to get the title of the immediate parent section. This means paragraphs in nested subsections (e.g., Methods > Step 1 > Step 2) only get "Step 2" as their section name, not "Methods", so they are missed when filtering for "Methods" sections.
For example, given the XML below:
<sec>
  <title>Methods</title>
  <p>Direct paragraph in Methods</p>
  <sec>
    <title>Step 1</title>
    <p>Paragraph in Methods > Step 1</p>
    <sec>
      <title>Step 2</title>
      <p>Paragraph in Methods > Step 1 > Step 2</p>
    </sec>
  </sec>
</sec>
<sec>
  <title>Results</title>
  <p>Paragraph in Results</p>
</sec>
Paragraph in "Methods" gets section = "Methods" ✓
Paragraph in "Step 1" gets section = "Step 1" (missing parent "Methods") ✗
Paragraph in "Step 2" gets section = "Step 2" (missing parents "Methods > Step 1") ✗
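To make the expected behavior concrete, here is a minimal sketch of collecting the full chain of ancestor section titles for each paragraph. It is not pubmed_parser's actual code: it uses only the standard library's xml.etree.ElementTree, the `<sec>`/`<p>` nesting is my assumption about the article structure, and `section_paths` is a hypothetical helper name.

```python
import re
import xml.etree.ElementTree as ET

# Assumed structure: nested <sec> elements, each with a <title> and <p> children.
XML = """
<body>
  <sec>
    <title>Methods</title>
    <p>Direct paragraph in Methods</p>
    <sec>
      <title>Step 1</title>
      <p>Paragraph in Methods > Step 1</p>
      <sec>
        <title>Step 2</title>
        <p>Paragraph in Methods > Step 1 > Step 2</p>
      </sec>
    </sec>
  </sec>
  <sec>
    <title>Results</title>
    <p>Paragraph in Results</p>
  </sec>
</body>
"""


def section_paths(root):
    """Return (immediate_section, full_section_path, text) for every <p>."""
    # ElementTree keeps no parent pointers, so build a child -> parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    rows = []
    for p in root.iter("p"):
        titles = []
        node = parent_of.get(p)
        # Walk up through all ancestors, collecting each one's direct <title>.
        while node is not None:
            title = node.find("title")
            if title is not None and title.text:
                titles.append(title.text.strip())
            node = parent_of.get(node)
        titles.reverse()  # outermost section first
        immediate = titles[-1] if titles else ""
        rows.append((immediate, " > ".join(titles), p.text))
    return rows


rows = section_paths(ET.fromstring(XML.strip()))

# Filtering on the immediate section name misses the nested paragraphs;
# filtering on the full path keeps them.
methods_re = re.compile(r"methods", re.IGNORECASE)
by_immediate = [text for sec, _, text in rows if methods_re.search(sec)]
by_full_path = [text for _, path, text in rows if methods_re.search(path)]
print(len(by_immediate), len(by_full_path))
```

With the full path, a regular expression for "Methods" matches all three paragraphs under that section, while matching on the immediate title alone matches only the direct one.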
Please let me know if I am understanding it correctly. Thanks!