|
1 | 1 | """ |
2 | | -# TextSpitter Documentation |
| 2 | +API Reference and User Guides. |
3 | 3 |
|
4 | | -## Welcome to TextSpitter |
5 | | -
|
6 | | -**Transforming documents into insights, effortlessly and efficiently.** |
7 | | -
|
8 | | -TextSpitter extracts plain text from documents and source-code files with a single call. |
9 | | -It normalises every input type — file paths, `BytesIO` streams, `SpooledTemporaryFile` objects, |
10 | | -and raw `bytes` — into plain strings, making it ideal for LLM pipelines, search engines, |
11 | | -and data-processing workflows. |
12 | | -
|
13 | | ---- |
14 | | -
|
15 | | -## 📚 Start Here |
16 | | -
|
17 | | -Choose your path based on what you want to do: |
18 | | -
|
19 | | -<details open> |
20 | | -<summary><strong>⚡ I want to extract text right now</strong></summary> |
21 | | -
|
22 | | -Start with **[Quick Start](quickstart.html)** to install and run your first extraction in under 2 minutes. |
23 | | -
|
24 | | -```python |
25 | | -from TextSpitter import TextSpitter |
26 | | -
|
27 | | -text = TextSpitter(filename="report.pdf") |
28 | | -print(text[:500]) |
29 | | -``` |
30 | | -
|
31 | | -</details> |
32 | | -
|
33 | | -<details> |
34 | | -<summary><strong>🎯 I need to understand how TextSpitter works</strong></summary> |
35 | | -
|
36 | | -Read the **[Technical Overview](overview.html)** for architecture, module design, and implementation details. |
37 | | -
|
38 | | -Covers: three-layer design, input resolution, PDF fallback chains, encoding strategy, and logging. |
39 | | -
|
40 | | -</details> |
41 | | -
|
42 | | -<details> |
43 | | -<summary><strong>🔍 I want to learn by example</strong></summary> |
44 | | -
|
45 | | -Follow the **[Tutorial](tutorial.html)** for a format-by-format walkthrough covering: |
46 | | -- PDF extraction (with PyMuPDF + pypdf fallback) |
47 | | -- DOCX extraction via FastAPI |
48 | | -- TXT & CSV with encoding handling |
49 | | -- Source code files (50+ extensions) |
50 | | -- Direct `FileExtractor` and `WordLoader` usage |
51 | | -
|
52 | | -</details> |
53 | | -
|
54 | | -<details> |
55 | | -<summary><strong>💼 I'm building a real application</strong></summary> |
56 | | -
|
57 | | -Check **[Common Use Cases](usecases.html)** for production patterns: |
58 | | -- Web APIs (FastAPI, Django/DRF) |
59 | | -- Cloud storage (AWS S3) |
60 | | -- LLM pipelines (LangChain, OpenAI embeddings) |
61 | | -- Batch processing (directory trees, parallel extraction) |
62 | | -- Logging strategies |
63 | | -
|
64 | | -</details> |
65 | | -
|
66 | | -<details> |
67 | | -<summary><strong>📋 I need a code snippet</strong></summary> |
68 | | -
|
69 | | -Browse **[Recipes](recipes.html)** for copy-paste snippets covering: |
70 | | -- Input handling (BytesIO, SpooledTemporaryFile, raw bytes) |
71 | | -- Format-specific extraction |
72 | | -- Error and encoding handling |
73 | | -- Testing patterns |
74 | | -
|
75 | | -</details> |
76 | | -
|
77 | | ---- |
78 | | -
|
79 | | -## ✨ Supported Formats |
80 | | -
|
81 | | -| Format | Method | Notes | |
82 | | -|--------|--------|-------| |
83 | | -| **PDF** | `pdf_file_read()` | PyMuPDF → pypdf fallback | |
84 | | -| **DOCX** | `docx_file_read()` | python-docx paragraph extraction | |
85 | | -| **TXT** | `text_file_read()` | UTF-8 → latin-1 → UTF-8-replace | |
86 | | -| **CSV** | `csv_file_read()` | Same encoding cascade as TXT | |
87 | | -| **Source code** | `code_file_read()` | 50+ extensions (py, js, ts, go, rs, java, …) | |
88 | | -
|
89 | | ---- |
90 | | -
|
91 | | -## 🚀 Quick Start |
92 | | -
|
93 | | -### Install |
94 | | -
|
95 | | -```sh |
96 | | -pip install textspitter |
97 | | -
|
98 | | -# With optional loguru logging |
99 | | -pip install "textspitter[logging]" |
100 | | -``` |
101 | | -
|
102 | | -### Extract |
103 | | -
|
104 | | -```python |
105 | | -from TextSpitter import TextSpitter |
106 | | -
|
107 | | -# From a file |
108 | | -text = TextSpitter(filename="report.pdf") |
109 | | -
|
110 | | -# From a stream |
111 | | -from io import BytesIO |
112 | | -text = TextSpitter(file_obj=BytesIO(pdf_bytes), filename="report.pdf") |
113 | | -
|
114 | | -# From raw bytes |
115 | | -text = TextSpitter(file_obj=docx_bytes, filename="contract.docx") |
116 | | -``` |
117 | | -
|
118 | | -### CLI |
119 | | -
|
120 | | -```sh |
121 | | -# Single file to stdout |
122 | | -textspitter report.pdf |
123 | | -
|
124 | | -# Multiple files to combined output |
125 | | -textspitter chapter1.pdf chapter2.pdf -o book.txt |
126 | | -``` |
127 | | -
|
128 | | ---- |
129 | | -
|
130 | | -## 🔗 Navigation |
131 | | -
|
132 | | -| Page | Purpose | Best for | |
133 | | -|------|---------|----------| |
134 | | -| [Overview](overview.html) | Architecture & design | Understanding the internals | |
135 | | -| [Quick Start](quickstart.html) | Installation & first extraction | Getting started fast | |
136 | | -| [Tutorial](tutorial.html) | Format-by-format guide | Learning by example | |
137 | | -| [Use Cases](usecases.html) | Production patterns | Building real applications | |
138 | | -| [Recipes](recipes.html) | Code snippets | Copy-paste solutions | |
139 | | -
|
140 | | ---- |
141 | | -
|
142 | | -## 📖 Full API Reference |
143 | | -
|
144 | | -For complete API documentation including class definitions, method signatures, and parameters, |
145 | | -see the **TextSpitter module reference** in the sidebar. |
| 4 | +See the [main documentation](../index.html) for quick start, tutorials, use cases, and recipes. |
146 | 5 |
|
| 6 | +Module Overview: |
| 7 | +- `TextSpitter.main.WordLoader` — Format dispatcher |
| 8 | +- `TextSpitter.core.FileExtractor` — Low-level file reader |
| 9 | +- `TextSpitter.logger` — Optional loguru / stdlib logging shim |
147 | 10 | """ |
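The new docstring describes `TextSpitter.logger` as an optional loguru / stdlib logging shim. The diff does not show how that module is written, so the following is only a minimal sketch of the general pattern, not TextSpitter's actual code. The function name `get_logger` and the logger name `textspitter_sketch` are hypothetical and chosen for illustration.

```python
import logging


def get_logger(name: str = "textspitter_sketch"):
    """Return loguru's logger when it is installed, else a stdlib logger.

    Hypothetical helper: the name and signature are assumptions,
    not part of the TextSpitter API.
    """
    try:
        # loguru is an optional dependency; the import fails if it is absent.
        from loguru import logger
        return logger
    except ImportError:
        std_logger = logging.getLogger(name)
        # A NullHandler avoids "no handler" warnings when the host
        # application has not configured logging itself.
        if not std_logger.handlers:
            std_logger.addHandler(logging.NullHandler())
        return std_logger


log = get_logger()
log.info("extraction started")
```

Both logger objects expose `info`, `warning`, and `error`, so calling code can stay the same whether or not the `textspitter[logging]` extra is installed.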