Commit f92cfab

added datatables and corpus FIRST PASS MAY HAVE BUGS
1 parent 6e31194 commit f92cfab

444 files changed

Lines changed: 85593 additions & 9 deletions

.DS_Store

-8 KB
Binary file not shown.

corpus_module/README.md

Lines changed: 237 additions & 0 deletions
@@ -0,0 +1,237 @@
# Corpus Module

A standalone Python module for managing and analyzing document corpora. Extracted from the AmiLib project for use in the pygetpapers Streamlit UI and other projects.

## Features

- Hierarchical corpus management (directories and files)
- Search functionality across corpus files
- Query management and execution
- Results visualization and HTML generation
- Integration with pygetpapers output
- DataTables generation from corpus data
- Support for multiple document formats

15+
## Installation
16+
17+
```bash
18+
pip install corpus-module
19+
```
20+
21+
## Quick Start
22+
23+
```python
24+
from corpus_module import AmiCorpus, AmiCorpusContainer
25+
26+
# Create a new corpus
27+
corpus = AmiCorpus(
28+
topdir="my_documents",
29+
globstr="**/*.html",
30+
make_descendants=True,
31+
mkdir=True
32+
)
33+
34+
# Create a container for a specific document type
35+
reports = corpus.create_corpus_container(
36+
"reports",
37+
bib_type="report",
38+
mkdir=True
39+
)
40+
41+
# Create a document
42+
doc = reports.create_document(
43+
"analysis.html",
44+
text="<html><body><h1>Analysis Report</h1></body></html>"
45+
)
46+
47+
# List files in corpus
48+
files = corpus.list_files("**/*.html")
49+
print(f"Found {len(files)} HTML files")
50+
```
51+
## Search Functionality

```python
from corpus_module import CorpusQuery, CorpusSearch

# Create a query
query = CorpusQuery(
    query_id="climate_search",
    phrases=["climate change", "global warming", "carbon emissions"],
    outfile="results.html"
)

# Search corpus files
infiles = corpus.list_files("**/*.html")
results = CorpusSearch.search_files_with_phrases_write_results(
    infiles=infiles,
    phrases=query.phrases,
    outfile=query.outfile,
    debug=True
)
```

## DataTables Integration

```python
# Create DataTables from corpus
corpus.make_datatables(
    indir="my_documents",
    outdir="output",
    outfile_h="corpus_table.html"
)

# Create DataTables with filenames
corpus.create_datatables_html_with_filenames(
    html_glob="**/*.html",
    labels=["File", "Type", "Size"],
    table_id="corpus_files",
    outpath="files_table.html"
)
```

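The README does not show what the generated table looks like. As a rough illustration only, the sketch below builds the kind of `<table>` markup (a `thead` of labels plus a `tbody` of rows) that the DataTables JavaScript library expects to enhance. `filename_table` is a hypothetical helper written with the standard library; it is not part of `corpus_module`, and the real output of `create_datatables_html_with_filenames` may differ.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def filename_table(paths, labels, table_id):
    """Build a <table> element with one row per path: name, suffix, size in bytes.

    Hypothetical helper for illustration; not the corpus_module implementation.
    """
    table = ET.Element("table", {"id": table_id})
    head_row = ET.SubElement(ET.SubElement(table, "thead"), "tr")
    for label in labels:
        ET.SubElement(head_row, "th").text = label
    tbody = ET.SubElement(table, "tbody")
    for path in paths:
        path = Path(path)
        # Files that do not exist are reported with size 0
        size = path.stat().st_size if path.exists() else 0
        row = ET.SubElement(tbody, "tr")
        for value in (path.name, path.suffix.lstrip("."), str(size)):
            ET.SubElement(row, "td").text = value
    return table


table = filename_table(["reports/analysis.html"], ["File", "Type", "Size"], "corpus_files")
html_text = ET.tostring(table, encoding="unicode")
print(html_text)
```

The `table_id` matters because DataTables is normally attached to a table by its `id` on the page that embeds this markup.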
## Query Management

```python
# Create multiple queries
climate_query = corpus.get_or_create_corpus_query(
    query_id="climate",
    phrases=["climate", "temperature", "weather"]
)

energy_query = corpus.get_or_create_corpus_query(
    query_id="energy",
    phrases=["energy", "power", "electricity"]
)

# Run multiple queries
results = corpus.search_files_with_queries([
    "climate",
    "energy"
], debug=True)

# Process results
for query_id, html_result in results.items():
    print(f"Query {query_id} found results")
    # Save or process HTML result
```

## Advanced Usage

### Hierarchical Corpus Structure

```python
# Create a complex corpus structure
corpus = AmiCorpus("research_papers", mkdir=True)

# Create year-based containers
for year in ["2020", "2021", "2022"]:
    year_container = corpus.create_corpus_container(year, bib_type="year", mkdir=True)

    # Create subject containers within each year
    for subject in ["climate", "energy", "health"]:
        subject_container = year_container.create_corpus_container(
            subject,
            bib_type="subject",
            mkdir=True
        )

        # Add documents
        subject_container.create_document(
            f"paper_{subject}_{year}.html",
            text=f"<html><body><h1>{subject} research from {year}</h1></body></html>"
        )
```

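The README implies, through `mkdir=True`, that each container maps to a directory and each document to a file, though it does not say so explicitly. Under that assumption, the on-disk layout of the year/subject example can be approximated with the standard library alone. `build_layout` is a hypothetical helper, not a `corpus_module` function.

```python
import tempfile
from pathlib import Path


def build_layout(topdir, years, subjects):
    """Create topdir/<year>/<subject>/paper_<subject>_<year>.html files.

    Hypothetical helper: mirrors the year/subject loop with plain directories.
    """
    created = []
    for year in years:
        for subject in subjects:
            subject_dir = Path(topdir) / year / subject
            subject_dir.mkdir(parents=True, exist_ok=True)
            doc = subject_dir / f"paper_{subject}_{year}.html"
            doc.write_text(
                f"<html><body><h1>{subject} research from {year}</h1></body></html>",
                encoding="utf-8",
            )
            created.append(doc)
    return created


with tempfile.TemporaryDirectory() as tmp:
    files = build_layout(tmp, ["2020", "2021"], ["climate", "energy"])
    # The same recursive glob used elsewhere in this README finds every document
    found = sorted(p.relative_to(tmp).as_posix() for p in Path(tmp).glob("**/*.html"))
    print(found)
```

A layout like this is what makes a recursive pattern such as `**/*.html` useful: one glob reaches documents at any depth of the hierarchy.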
### Custom Search with XPath

```python
# Search with custom XPath for paragraph elements
results = CorpusSearch.search_files_with_phrases_write_results(
    infiles=corpus.list_files("**/*.html"),
    phrases=["methane", "emissions"],
    para_xpath="//p[@class='content']",  # Custom XPath
    outfile="methane_results.html"
)
```

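To make the effect of `para_xpath` concrete, here is a minimal sketch of XPath-restricted phrase matching. It uses the standard library's `xml.etree.ElementTree` as a stand-in for lxml, so the path must be relative (`.//p[...]`) and the input must be well-formed XHTML. `paragraphs_with_phrases` is a hypothetical function, and the case-insensitive matching is an assumption; the README does not say how `CorpusSearch` compares phrases.

```python
import xml.etree.ElementTree as ET


def paragraphs_with_phrases(xhtml_text, phrases, para_xpath=".//p"):
    """Return (phrase, paragraph_text) pairs for paragraphs selected by para_xpath.

    Hypothetical sketch; matching is case-insensitive by assumption.
    """
    root = ET.fromstring(xhtml_text)
    hits = []
    for para in root.findall(para_xpath):
        text = "".join(para.itertext())
        lowered = text.lower()
        for phrase in phrases:
            if phrase.lower() in lowered:
                hits.append((phrase, text))
    return hits


sample = (
    "<html><body>"
    "<p class='content'>Methane emissions rose sharply.</p>"
    "<p class='footer'>Methane is mentioned here too.</p>"
    "<p class='content'>Nothing relevant.</p>"
    "</body></html>"
)

hits = paragraphs_with_phrases(sample, ["methane", "emissions"], ".//p[@class='content']")
print(hits)
```

Only the first paragraph contributes hits: the footer paragraph mentions methane but is excluded by the `class` predicate, which is the point of narrowing the XPath.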
### Integration with pygetpapers

```python
# After running pygetpapers, create DataTables from results
import json
from pathlib import Path

# Assuming pygetpapers created a directory with results
pygetpapers_dir = Path("pygetpapers_output")

# Create DataTables from pygetpapers JSON results
if (pygetpapers_dir / "eupmc_result.json").exists():
    AmiCorpus.make_datatables(
        indir=pygetpapers_dir,
        outdir=pygetpapers_dir,
        outfile_h="search_results_table.html"
    )
```

## Streamlit Integration

```python
import streamlit as st
from corpus_module import AmiCorpus
import lxml.etree as ET

# Create corpus
corpus = AmiCorpus("documents", globstr="**/*.html")

# Create DataTable
htmlx, tbody = corpus.create_datatables_html_with_filenames(
    html_glob="**/*.html",
    labels=["File", "Type"],
    table_id="corpus_files"
)

# Display in Streamlit
st.components.v1.html(
    ET.tostring(htmlx, encoding='unicode'),
    height=600
)
```

## Configuration

The module supports various configuration options:

```python
# Create corpus with specific options
corpus = AmiCorpus(
    topdir="documents",
    globstr="**/*.{html,pdf,txt}",  # Multiple file types
    make_descendants=True,          # Create containers for all subdirectories
    mkdir=True,                     # Create directories if they don't exist
    eupmc=True                      # Enable EuropePMC specific features
)
```

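One caveat about the multi-type `globstr`: Python's `glob` and `pathlib` do not expand brace patterns such as `{html,pdf,txt}`. Whether that pattern matches anything therefore depends on how `AmiCorpus` processes it, which this README does not say. If it turns out not to be expanded, one workaround is to filter by suffix instead. The sketch below shows both behaviours using only the standard library; `list_by_suffix` is a hypothetical helper, not part of `corpus_module`.

```python
import tempfile
from pathlib import Path


def list_by_suffix(topdir, suffixes):
    """Recursively collect files under topdir whose suffix is in suffixes."""
    wanted = {s.lower() for s in suffixes}
    return sorted(
        p for p in Path(topdir).rglob("*")
        if p.is_file() and p.suffix.lower() in wanted
    )


with tempfile.TemporaryDirectory() as tmp:
    for name in ["a.html", "sub/b.pdf", "sub/c.txt", "sub/d.csv"]:
        path = Path(tmp) / name
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text("x", encoding="utf-8")

    # pathlib treats the braces literally, so this matches no files
    brace_matches = list(Path(tmp).glob("**/*.{html,pdf,txt}"))
    names = [p.name for p in list_by_suffix(tmp, [".html", ".pdf", ".txt"])]
    print(len(brace_matches), names)
```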
## Dependencies

- **lxml**: XML/HTML processing
- **datatables-module**: For DataTables generation
- **pathlib**: Path manipulation (Python standard library)
- **json**: JSON processing (Python standard library)

## Contributing

1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Add tests
5. Submit a pull request

## License

This project is licensed under the MIT License - see the LICENSE file for details.

## Acknowledgments

This module was extracted from the [AmiLib](https://github.com/amilib/amilib) project and adapted for standalone use.

corpus_module/__init__.py

Lines changed: 21 additions & 0 deletions
@@ -0,0 +1,21 @@
"""
Standalone Corpus Module

This module provides functionality for managing and analyzing document corpora.
Extracted from amilib for use in pygetpapers Streamlit UI and other projects.

Features:
- Hierarchical corpus management (directories and files)
- Search functionality across corpus files
- Query management and execution
- Results visualization and HTML generation
- Integration with pygetpapers output
- Datatables generation from corpus data
"""

from .corpus import AmiCorpus, AmiCorpusContainer
from .query import CorpusQuery
from .search import CorpusSearch

__version__ = "0.1.0"
__all__ = ["AmiCorpus", "AmiCorpusContainer", "CorpusQuery", "CorpusSearch"]
