|
1 | 1 | """ |
2 | | -# TextSpitter — User Guide |
| 2 | +# TextSpitter Documentation |
3 | 3 |
|
4 | | -Welcome to the TextSpitter documentation. |
5 | | -TextSpitter extracts plain text from documents and source-code files with a |
6 | | -single call, normalising every input type (file path, `BytesIO`, `SpooledTemporaryFile`, |
7 | | -raw `bytes`) into a `str`. |
| 4 | +## Welcome to TextSpitter |
| 5 | +
|
| 6 | +**Transforming documents into insights, effortlessly and efficiently.** |
| 7 | +
|
| 8 | +TextSpitter extracts plain text from documents and source-code files with a single call. |
| 9 | +It normalises every input type — file paths, `BytesIO` streams, `SpooledTemporaryFile` objects, |
| 10 | +and raw `bytes` — into plain strings, making it ideal for LLM pipelines, search engines, |
| 11 | +and data-processing workflows. |
8 | 12 |
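As a rough illustration of what normalising every input type can involve, the sketch below reduces each supported input to raw bytes using only the standard library. `read_all_bytes` and `Source` are hypothetical names chosen for this example; they are not part of TextSpitter's API, and the library's own input resolution may work differently.

```python
from io import BytesIO
from pathlib import Path
from tempfile import SpooledTemporaryFile
from typing import Union

# Hypothetical alias for the input kinds named in the introduction.
Source = Union[str, Path, bytes, BytesIO, SpooledTemporaryFile]


def read_all_bytes(source: Source) -> bytes:
    """Hypothetical helper: return the raw bytes behind any supported input."""
    if isinstance(source, (bytes, bytearray)):
        return bytes(source)
    if isinstance(source, (str, Path)):
        return Path(source).read_bytes()
    # File-like objects: rewind first so an already-consumed stream is not empty.
    source.seek(0)
    return source.read()
```

Rewinding before reading is a deliberate choice in this sketch: upload handlers often hand over a stream whose position is already at the end.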
|
9 | 13 | --- |
10 | 14 |
|
11 | | -## Pages in this guide |
| 15 | +## 📚 Start Here |
| 16 | +
|
| 17 | +Choose your path based on what you want to do: |
| 18 | +
|
| 19 | +<details open> |
| 20 | +<summary><strong>⚡ I want to extract text right now</strong></summary> |
| 21 | +
|
| 22 | +Start with **[Quick Start](quickstart.html)** to install and run your first extraction in under 2 minutes. |
| 23 | +
|
| 24 | +```python |
| 25 | +from TextSpitter import TextSpitter |
| 26 | +
|
| 27 | +text = TextSpitter(filename="report.pdf") |
| 28 | +print(text[:500]) |
| 29 | +``` |
| 30 | +
|
| 31 | +</details> |
| 32 | +
|
| 33 | +<details> |
| 34 | +<summary><strong>🎯 I need to understand how TextSpitter works</strong></summary> |
| 35 | +
|
| 36 | +Read the **[Technical Overview](overview.html)** for architecture, module design, and implementation details. |
| 37 | +
|
| 38 | +Covers: three-layer design, input resolution, PDF fallback chains, encoding strategy, and logging. |
| 39 | +
|
| 40 | +</details> |
| 41 | +
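The fallback-chain idea mentioned above can be sketched generically: try each reader in order and return the first usable result. `first_successful` is a hypothetical helper written for this illustration, not TextSpitter code. Treating an empty result as a failure is an assumption of the sketch; the library may fall back only on exceptions.

```python
from typing import Callable, List, Sequence


def first_successful(readers: Sequence[Callable[[bytes], str]], data: bytes) -> str:
    """Return the first non-empty text produced by any reader, in order."""
    errors: List[str] = []
    for reader in readers:
        try:
            text = reader(data)
        except Exception as exc:  # a real chain would catch narrower errors
            errors.append(f"{reader.__name__}: {exc}")
            continue
        if text.strip():
            return text
        # Assumption: an empty result counts as a failure and triggers fallback.
        errors.append(f"{reader.__name__}: empty result")
    raise ValueError("all readers failed: " + "; ".join(errors))
```

In a PDF setting, the `readers` sequence would hold one wrapper function per backend, ordered from preferred to last resort.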
|
| 42 | +<details> |
| 43 | +<summary><strong>🔍 I want to learn by example</strong></summary> |
| 44 | +
|
| 45 | +Follow the **[Tutorial](tutorial.html)** for a format-by-format walkthrough covering: |
| 46 | +- PDF extraction (with PyMuPDF + pypdf fallback) |
| 47 | +- DOCX extraction via python-docx |
| 48 | +- TXT & CSV with encoding handling |
| 49 | +- Source code files (50+ extensions) |
| 50 | +- Direct `FileExtractor` and `WordLoader` usage |
| 51 | +
|
| 52 | +</details> |
| 53 | +
|
| 54 | +<details> |
| 55 | +<summary><strong>💼 I'm building a real application</strong></summary> |
| 56 | +
|
| 57 | +Check **[Common Use Cases](usecases.html)** for production patterns: |
| 58 | +- Web APIs (FastAPI, Django/DRF) |
| 59 | +- Cloud storage (AWS S3) |
| 60 | +- LLM pipelines (LangChain, OpenAI embeddings) |
| 61 | +- Batch processing (directory trees, parallel extraction) |
| 62 | +- Logging strategies |
| 63 | +
|
| 64 | +</details> |
12 | 65 |
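For the batch-processing pattern listed above, a minimal sketch of walking a directory tree and extracting files in parallel might look like the following. `extract_tree` and its parameters are hypothetical names for this example. The extraction function is passed in as a callable, so the sketch makes no claim about whether TextSpitter itself is safe to call from several threads.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable, Dict, Iterable


def extract_tree(
    root: Path,
    extract: Callable[[Path], str],
    suffixes: Iterable[str] = (".pdf", ".docx", ".txt", ".csv"),
    max_workers: int = 4,
) -> Dict[Path, str]:
    """Apply `extract` to every matching file under `root`, using a thread pool."""
    wanted = {suffix.lower() for suffix in suffixes}
    paths = sorted(
        path
        for path in Path(root).rglob("*")
        if path.is_file() and path.suffix.lower() in wanted
    )
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so texts line up with paths.
        texts = list(pool.map(extract, paths))
    return dict(zip(paths, texts))
```

A call such as `extract_tree(Path("docs"), lambda p: TextSpitter(filename=str(p)))` would plug in the extraction call shown elsewhere on this page. Any exception raised by `extract` propagates to the caller in this sketch.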
|
13 | | -| Page | Description | |
14 | | -|------|-------------| |
15 | | -| `TextSpitter.guide.overview` | Architecture and design decisions | |
16 | | -| `TextSpitter.guide.quickstart` | Install and run your first extraction | |
17 | | -| `TextSpitter.guide.tutorial` | Format-by-format walkthrough | |
18 | | -| `TextSpitter.guide.usecases` | FastAPI, S3, LangChain, batch processing … | |
19 | | -| `TextSpitter.guide.recipes` | Copy-paste snippets | |
| 66 | +<details> |
| 67 | +<summary><strong>📋 I need a code snippet</strong></summary> |
| 68 | +
|
| 69 | +Browse **[Recipes](recipes.html)** for copy-paste snippets covering: |
| 70 | +- Input handling (BytesIO, SpooledTemporaryFile, raw bytes) |
| 71 | +- Format-specific extraction |
| 72 | +- Error and encoding handling |
| 73 | +- Testing patterns |
| 74 | +
|
| 75 | +</details> |
20 | 76 |
|
21 | 77 | --- |
22 | 78 |
|
23 | | -## Supported formats |
| 79 | +## ✨ Supported Formats |
24 | 80 |
|
25 | | -| Format | Reader | Notes | |
| 81 | +| Format | Method | Notes | |
26 | 82 | |--------|--------|-------| |
27 | | -| PDF | `pdf_file_read` | PyMuPDF → pypdf fallback | |
28 | | -| DOCX | `docx_file_read` | python-docx paragraph extraction | |
29 | | -| TXT | `text_file_read` | UTF-8 → latin-1 → UTF-8-replace | |
30 | | -| CSV | `csv_file_read` | Same encoding cascade as TXT | |
31 | | -| Source code | `code_file_read` | 50 + extensions | |
| 83 | +| **PDF** | `pdf_file_read()` | PyMuPDF → pypdf fallback | |
| 84 | +| **DOCX** | `docx_file_read()` | python-docx paragraph extraction | |
| 85 | +| **TXT** | `text_file_read()` | UTF-8 → latin-1 → UTF-8-replace | |
| 86 | +| **CSV** | `csv_file_read()` | Same encoding cascade as TXT | |
| 87 | +| **Source code** | `code_file_read()` | 50+ extensions (py, js, ts, go, rs, java, …) | |
32 | 88 |
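The encoding cascade in the table (UTF-8 → latin-1 → UTF-8-replace) can be pictured with the sketch below. `decode_with_fallback` is a hypothetical name, and the code is only an approximation of the idea, not the library's actual `text_file_read()` implementation.

```python
def decode_with_fallback(data: bytes) -> str:
    """Decode bytes by trying strict UTF-8, then latin-1, then lossy UTF-8."""
    for encoding in ("utf-8", "latin-1"):
        try:
            return data.decode(encoding)
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte value, so in this naive ordering the line below
    # is a safety net that is not expected to run; it is kept to mirror the
    # three-step cascade listed in the table.
    return data.decode("utf-8", errors="replace")
```

One consequence worth knowing: because latin-1 accepts any byte sequence, input in some other legacy encoding will decode without error but may contain the wrong characters.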
|
33 | 89 | --- |
34 | 90 |
|
35 | | -## Quick example |
| 91 | +## 🚀 Quick Start |
| 92 | +
|
| 93 | +### Install |
| 94 | +
|
| 95 | +```sh |
| 96 | +pip install textspitter |
| 97 | +
|
| 98 | +# With optional loguru logging |
| 99 | +pip install "textspitter[logging]" |
| 100 | +``` |
| 101 | +
|
| 102 | +### Extract |
36 | 103 |
|
37 | 104 | ```python |
38 | 105 | from TextSpitter import TextSpitter |
39 | 106 |
|
| 107 | +# From a file |
40 | 108 | text = TextSpitter(filename="report.pdf") |
41 | | -print(text[:200]) |
| 109 | +
|
| 110 | +# From a stream (pdf_bytes: the PDF's raw bytes, loaded elsewhere) |
| 111 | +from io import BytesIO |
| 112 | +text = TextSpitter(file_obj=BytesIO(pdf_bytes), filename="report.pdf") |
| 113 | +
|
| 114 | +# From raw bytes (docx_bytes: the DOCX file's raw bytes, loaded elsewhere) |
| 115 | +text = TextSpitter(file_obj=docx_bytes, filename="contract.docx") |
42 | 116 | ``` |
43 | 117 |
|
44 | | -Install with optional loguru logging: |
| 118 | +### CLI |
45 | 119 |
|
46 | 120 | ```sh |
47 | | -pip install "textspitter[logging]" |
| 121 | +# Single file to stdout |
| 122 | +textspitter report.pdf |
| 123 | +
|
| 124 | +# Multiple files combined into one output file |
| 125 | +textspitter chapter1.pdf chapter2.pdf -o book.txt |
48 | 126 | ``` |
| 127 | +
|
| 128 | +--- |
| 129 | +
|
| 130 | +## 🔗 Navigation |
| 131 | +
|
| 132 | +| Page | Purpose | Best for | |
| 133 | +|------|---------|----------| |
| 134 | +| [Overview](overview.html) | Architecture & design | Understanding the internals | |
| 135 | +| [Quick Start](quickstart.html) | Installation & first extraction | Getting started fast | |
| 136 | +| [Tutorial](tutorial.html) | Format-by-format guide | Learning by example | |
| 137 | +| [Use Cases](usecases.html) | Production patterns | Building real applications | |
| 138 | +| [Recipes](recipes.html) | Code snippets | Copy-paste solutions | |
| 139 | +
|
| 140 | +--- |
| 141 | +
|
| 142 | +## 📖 Full API Reference |
| 143 | +
|
| 144 | +For complete API documentation including class definitions, method signatures, and parameters, |
| 145 | +see the **TextSpitter module reference** in the sidebar. |
| 146 | +
|
49 | 147 | """ |