Commit 7feea83

Merge pull request #586 from laughingman7743/add-claude-md
Add CLAUDE.md development guide for AI assistants
2 parents 1815c62 + 17a5346 commit 7feea83

1 file changed

Lines changed: 153 additions & 0 deletions

CLAUDE.md

@@ -0,0 +1,153 @@
# PyAthena Development Guide for AI Assistants

## Project Overview
PyAthena is a Python DB API 2.0 (PEP 249) compliant client library for Amazon Athena. It enables Python applications to execute SQL queries against data stored in S3 using AWS Athena's serverless query engine.

**License**: MIT
**Version**: See `pyathena/__init__.py`
**Python Support**: See `requires-python` in `pyproject.toml`

## Key Architectural Principles

### 1. DB API 2.0 Compliance
- Strictly follow PEP 249 specifications for all cursor and connection implementations
- Maintain compatibility with standard Python database usage patterns
- All cursor implementations must support the standard methods: `execute()`, `fetchone()`, `fetchmany()`, `fetchall()`, `close()`

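The method requirement above can be checked mechanically. The following is a minimal sketch, not part of PyAthena: `missing_cursor_methods` is a hypothetical helper, and `sqlite3` is used only as a stand-in DB API 2.0 driver so that the example runs without AWS access.

```python
import sqlite3

# Methods that the guide lists as mandatory for every cursor implementation.
REQUIRED_CURSOR_METHODS = ("execute", "fetchone", "fetchmany", "fetchall", "close")


def missing_cursor_methods(cursor: object) -> list[str]:
    """Return the names of required DB API methods the cursor does not provide."""
    return [
        name
        for name in REQUIRED_CURSOR_METHODS
        if not callable(getattr(cursor, name, None))
    ]


# sqlite3 is only a stand-in DB API 2.0 driver so the sketch runs offline.
connection = sqlite3.connect(":memory:")
cursor = connection.cursor()
missing = missing_cursor_methods(cursor)
cursor.close()
connection.close()
```

A check like this could sit in a unit test that is parameterized over every cursor class, so a new cursor type cannot silently omit one of the standard methods.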
### 2. Multiple Cursor Types
The project supports different cursor implementations for various use cases:
- **Standard Cursor** (`pyathena.cursor.Cursor`): Basic DB API cursor
- **Async Cursor** (`pyathena.async_cursor.AsyncCursor`): For asynchronous operations
- **Pandas Cursor** (`pyathena.pandas.cursor.PandasCursor`): Returns results as DataFrames
- **Arrow Cursor** (`pyathena.arrow.cursor.ArrowCursor`): Returns results in Apache Arrow format
- **Spark Cursor** (`pyathena.spark.cursor.SparkCursor`): For PySpark integration

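Because several of these cursors depend on optional packages, it can help to import them lazily. The sketch below assumes the dotted paths listed above are importable; the short keys (`"standard"`, `"pandas"`, and so on) and both helper functions are hypothetical names introduced for illustration.

```python
import importlib

# Dotted paths copied from the list above; the short keys are hypothetical labels.
CURSOR_PATHS = {
    "standard": "pyathena.cursor.Cursor",
    "async": "pyathena.async_cursor.AsyncCursor",
    "pandas": "pyathena.pandas.cursor.PandasCursor",
    "arrow": "pyathena.arrow.cursor.ArrowCursor",
    "spark": "pyathena.spark.cursor.SparkCursor",
}


def split_dotted_path(path: str) -> tuple[str, str]:
    """Split 'package.module.ClassName' into ('package.module', 'ClassName')."""
    module_name, _, class_name = path.rpartition(".")
    return module_name, class_name


def load_cursor_class(kind: str) -> type:
    """Import a cursor class lazily so optional dependencies load only on demand."""
    module_name, class_name = split_dotted_path(CURSOR_PATHS[kind])
    # Requires PyAthena (and the matching optional extras) to be installed.
    module = importlib.import_module(module_name)
    return getattr(module, class_name)
```

The class returned by `load_cursor_class("pandas")` would typically be handed to the connection, for example through the `cursor_class` argument of `pyathena.connect`; check the project documentation for the exact options supported by the installed version.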
### 3. Type System and Conversion
- Data type conversion is handled in `pyathena/converter.py`
- Custom converters can be registered for specific Athena data types
- Always preserve type safety and handle NULL values appropriately
- Follow the type mapping defined in the converters for each cursor type

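To make the conversion rules concrete, here is a simplified sketch of a type-name-to-parser mapping with NULL handling. It is not the code in `pyathena/converter.py`; `CONVERTERS`, `convert`, and `register_converter` are hypothetical names, and the set of type names is only a small illustrative subset.

```python
from datetime import date
from decimal import Decimal
from typing import Any, Callable, Dict, Optional

# Hypothetical mapping from Athena type names to parsers for the string payload.
CONVERTERS: Dict[str, Callable[[str], Any]] = {
    "integer": int,
    "bigint": int,
    "double": float,
    "decimal": Decimal,
    "boolean": lambda text: text.lower() == "true",
    "date": date.fromisoformat,
    "varchar": str,
}


def convert(athena_type: str, value: Optional[str]) -> Any:
    """Convert one raw value, passing NULL through as None."""
    if value is None:
        return None
    # Unknown types fall back to the raw string instead of failing.
    return CONVERTERS.get(athena_type, str)(value)


def register_converter(athena_type: str, func: Callable[[str], Any]) -> None:
    """Register or override the parser for one Athena type name."""
    CONVERTERS[athena_type] = func
```

Checking for `None` before dispatching keeps every individual parser free of NULL handling, which is one way to satisfy the "handle NULL values appropriately" rule in a single place.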
## Development Guidelines

### Code Style and Quality
```bash
# Format code (auto-fix imports and format)
make fmt

# Run all checks (lint, format check, type check)
make chk

# Run tests (includes running checks first)
make test

# Run SQLAlchemy-specific tests
make test-sqla

# Run full test suite with tox
make tox

# Build documentation
make docs
```

### Testing Requirements
1. **Unit Tests**: All new features must include unit tests
2. **Integration Tests**: Test actual AWS Athena interactions when modifying query execution logic
3. **SQLAlchemy Compliance**: Ensure SQLAlchemy dialect tests pass when modifying dialect code
4. **Mock AWS Services**: Use `moto` or similar for testing AWS interactions without real resources

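As one illustration of testing without real AWS resources, the sketch below swaps in `unittest.mock.MagicMock` from the standard library rather than `moto`, so it has no third-party dependencies. `submit_query` is a hypothetical helper, not a PyAthena function; the `start_query_execution` call and its keyword arguments follow the boto3 Athena client.

```python
from unittest.mock import MagicMock


def submit_query(client, sql: str, output_location: str) -> str:
    """Start a query through a boto3-style Athena client and return its ID."""
    response = client.start_query_execution(
        QueryString=sql,
        ResultConfiguration={"OutputLocation": output_location},
    )
    return response["QueryExecutionId"]


def test_submit_query_returns_execution_id() -> None:
    # MagicMock stands in for boto3.client("athena"); no AWS call is made.
    client = MagicMock()
    client.start_query_execution.return_value = {"QueryExecutionId": "query-123"}

    query_id = submit_query(client, "SELECT 1", "s3://example-bucket/staging/")

    assert query_id == "query-123"
    client.start_query_execution.assert_called_once_with(
        QueryString="SELECT 1",
        ResultConfiguration={"OutputLocation": "s3://example-bucket/staging/"},
    )
```

A mock like this verifies only the shape of the request. Integration tests against Athena itself are still needed when query execution logic changes, as item 2 above states.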
### Common Development Tasks

#### Adding a New Feature
1. Check if it aligns with DB API 2.0 specifications
2. Consider impact on all cursor types (standard, async, pandas, arrow, spark)
3. Update type hints and ensure mypy passes
4. Add comprehensive tests
5. Update documentation if adding public APIs

#### Modifying Query Execution
- The core query execution logic is in `cursor.py` and `async_cursor.py`
- Always handle query cancellation properly (SIGINT should cancel running queries)
- Respect the `kill_on_interrupt` parameter
- Maintain compatibility with Athena engine versions 2 and 3

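The cancellation rule can be pictured as a polling loop that stops the query when the user interrupts it. This is a sketch under stated assumptions, not PyAthena's implementation: `wait_for_query` and `poll_interval` are hypothetical names, and PyAthena's own behavior after an interrupt may differ. The `get_query_execution` and `stop_query_execution` calls follow the boto3 Athena client.

```python
import time

# Athena query states after which no further state change occurs.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "CANCELLED"}


def wait_for_query(
    client,
    query_id: str,
    kill_on_interrupt: bool = True,
    poll_interval: float = 1.0,
) -> str:
    """Poll until the query reaches a terminal state and return that state."""
    try:
        while True:
            response = client.get_query_execution(QueryExecutionId=query_id)
            state = response["QueryExecution"]["Status"]["State"]
            if state in TERMINAL_STATES:
                return state
            time.sleep(poll_interval)
    except KeyboardInterrupt:
        # Ctrl-C (SIGINT) surfaces as KeyboardInterrupt in the main thread.
        if kill_on_interrupt:
            client.stop_query_execution(QueryExecutionId=query_id)
        raise
```

Stopping the query before re-raising means an interrupted script does not leave a billable query running in Athena, while `kill_on_interrupt=False` leaves the query untouched for callers that want it to finish.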
#### Working with AWS Services
- All AWS interactions use `boto3`
- Credentials are managed through the standard AWS credential chain
- Always handle AWS exceptions appropriately (see `error.py`)
- S3 operations for result retrieval are in `result_set.py`

### Project Structure Conventions

```
pyathena/
├── {cursor_type}/        # Cursor-specific implementations
│   ├── __init__.py
│   ├── cursor.py         # Cursor implementation
│   ├── converter.py      # Type converters
│   └── result_set.py     # Result handling

├── sqlalchemy/           # SQLAlchemy dialect implementations
│   ├── base.py           # Base dialect
│   ├── {dialect}.py      # Specific dialects (rest, pandas, arrow)
│   └── requirements.py   # SQLAlchemy requirements

└── filesystem/           # S3 filesystem abstractions
```

### Important Implementation Details

#### Parameter Formatting
- Two parameter styles are supported: `pyformat` (default) and `qmark`
- Parameter formatting logic is in `formatter.py`
- PyFormat: `%(name)s` style
- Qmark: `?` style
- Always escape special characters in parameter values

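The escaping rule for the `pyformat` style can be sketched as follows. This is a deliberately simplified illustration, not the logic in `formatter.py`: `escape_value` and `format_pyformat` are hypothetical helpers that cover only `None`, booleans, numbers, and strings.

```python
from typing import Any, Mapping


def escape_value(value: Any) -> str:
    """Render one Python value as a SQL literal (simplified)."""
    if value is None:
        return "NULL"
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "TRUE" if value else "FALSE"
    if isinstance(value, (int, float)):
        return str(value)
    # Double any single quote so the value cannot terminate the string literal.
    return "'" + str(value).replace("'", "''") + "'"


def format_pyformat(sql: str, params: Mapping[str, Any]) -> str:
    """Substitute %(name)s placeholders with escaped literals."""
    return sql % {name: escape_value(value) for name, value in params.items()}


query = format_pyformat(
    "SELECT * FROM users WHERE name = %(name)s AND age > %(age)s",
    {"name": "O'Brien", "age": 30},
)
print(query)
```

The embedded quote in `O'Brien` is doubled, so it stays inside the string literal. One limitation of this sketch is that a literal `%` in the SQL text (for example in a `LIKE` pattern) would have to be written as `%%`.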
#### Result Set Handling
- Results are typically staged in S3 (configured via `s3_staging_dir`)
- Large result sets should be streamed, not loaded entirely into memory
- Different result set implementations exist for different data formats (CSV, JSON, Parquet)

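Streaming can be as simple as reading the staged CSV in fixed-size batches. In the sketch below, `iter_row_batches` is a hypothetical helper and `io.StringIO` stands in for a result object read from the staging location; it does not reflect how `result_set.py` is written.

```python
import csv
import io
from itertools import islice
from typing import Iterator, List, TextIO


def iter_row_batches(stream: TextIO, batch_size: int = 1000) -> Iterator[List[List[str]]]:
    """Yield lists of at most batch_size rows, skipping the header row."""
    reader = csv.reader(stream)
    next(reader, None)  # discard the header line if present
    while True:
        batch = list(islice(reader, batch_size))
        if not batch:
            return
        yield batch


# io.StringIO stands in for a result object streamed from the staging location.
sample = io.StringIO("id,name\n1,a\n2,b\n3,c\n")
batches = list(iter_row_batches(sample, batch_size=2))
```

Only one batch is held in memory at a time, so peak memory depends on `batch_size` rather than on the size of the full result.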
#### Error Handling
- All exceptions inherit from `pyathena.error.Error`
- Follow DB API 2.0 exception hierarchy
- Provide meaningful error messages that include Athena query IDs when available

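A reduced version of the PEP 249 hierarchy, with the query ID folded into the message, might look like the sketch below. These classes only illustrate the pattern and are not the definitions in `pyathena/error.py`; in particular, the `query_id` argument and attribute are assumptions made for this example.

```python
from typing import Optional


class Error(Exception):
    """Root of the hierarchy, as PEP 249 requires."""

    def __init__(self, message: str, query_id: Optional[str] = None) -> None:
        # Append the query ID so failures can be found in the Athena query history.
        if query_id is not None:
            message = f"{message} (query_id={query_id})"
        super().__init__(message)
        self.query_id = query_id


class DatabaseError(Error):
    """Errors related to the database itself."""


class OperationalError(DatabaseError):
    """Errors in database operation, such as a failed or cancelled query."""


class ProgrammingError(DatabaseError):
    """Errors caused by the caller, such as invalid SQL."""


error = OperationalError("Query failed", query_id="abc-123")
```

Because every class derives from `Error`, callers can catch the root type to handle any driver failure, or a specific subclass when they can react more precisely.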
### Performance Considerations
1. **Result Caching**: Utilize Athena's result reuse feature (engine v3) when possible
2. **Batch Operations**: Support `executemany()` for bulk operations
3. **Memory Efficiency**: Stream large results instead of loading all into memory
4. **Connection Pooling**: Connections are relatively lightweight, but avoid creating excessive connections

### Security Best Practices
1. **Never log sensitive data** (credentials, query results with PII)
2. **Support encryption** (SSE-S3, SSE-KMS, CSE-KMS) for S3 operations
3. **Validate and sanitize** all user inputs, especially in query construction
4. **Use parameterized queries** to prevent SQL injection

### Debugging Tips
1. Enable debug logging: `logging.getLogger("pyathena").setLevel(logging.DEBUG)`
2. Check Athena query history in AWS Console for failed queries
3. Verify S3 permissions for both staging directory and data access
4. Use `EXPLAIN` to inspect query plans and `SHOW` statements to inspect table metadata

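Setting the level alone produces no output unless a handler is attached somewhere in the logger hierarchy. The snippet below is a minimal sketch of the first tip with an explicit handler; the format string is an arbitrary choice.

```python
import logging
import sys

logger = logging.getLogger("pyathena")
logger.setLevel(logging.DEBUG)

# Attach a handler so records are printed even if the root logger is unconfigured.
handler = logging.StreamHandler(sys.stderr)
handler.setFormatter(logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s"))
logger.addHandler(handler)
```

Keep the security guidance in mind while debugging: debug output may include query text, so it should not be enabled where logs could expose sensitive data.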
### Common Pitfalls to Avoid
1. Don't assume all Athena data types map directly to Python types
2. Remember that Athena queries are asynchronous, so always wait for completion
3. Handle the case where S3 results might be deleted or inaccessible
4. Don't forget to close cursors and connections to clean up resources
5. Be aware of Athena service quotas and rate limits

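For pitfall 4, `contextlib.closing` guarantees that `close()` is called even when an exception is raised. The sketch below uses `sqlite3` purely as a stand-in DB API 2.0 driver so that it runs offline; the same pattern applies to any object that exposes a `close()` method.

```python
import sqlite3
from contextlib import closing

# closing() calls .close() on exit, whether or not the body raised an exception.
with closing(sqlite3.connect(":memory:")) as connection:
    with closing(connection.cursor()) as cursor:
        cursor.execute("SELECT 1")
        rows = cursor.fetchall()
```

After the outer block exits, both the cursor and the connection are closed, and further calls on them raise an error instead of silently holding resources.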
### Release Process
1. Update version in `pyathena/__init__.py`
2. Ensure all tests pass
3. Create a git tag for the release
4. Build and publish to PyPI

## Contact and Resources
- **Repository**: https://github.com/laughingman7743/PyAthena
- **Documentation**: https://laughingman7743.github.io/PyAthena/
- **Issues**: Report bugs or request features via GitHub Issues
- **AWS Athena Docs**: https://docs.aws.amazon.com/athena/
