Commit 825cbda

Merge pull request #7 from SimonFrings/documentation
Update documentation and examples to match version in early access
2 parents 7727fa9 + 49cbbc5 commit 825cbda

7 files changed

Lines changed: 462 additions & 9 deletions

.gitignore

Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
1+
/composer.lock
2+
/vendor/

README.md

Lines changed: 304 additions & 9 deletions
@@ -2,6 +2,34 @@
22

33
Streaming TSV (Tab-Separated Values) parser and encoder for [ReactPHP](https://reactphp.org/).
44

5+
**Table of contents**
6+
7+
* [Support us](#support-us)
8+
* [Quickstart example](#quickstart-example)
9+
* [TSV format](#tsv-format)
10+
* [Usage](#usage)
11+
* [TsvDecoder](#tsvdecoder)
12+
* [TsvEncoder](#tsvencoder)
13+
* [Install](#install)
14+
* [Tests](#tests)
15+
* [License](#license)
16+
* [More](#more)
17+
18+
## Support us
19+
20+
[![A clue·access project](https://raw.githubusercontent.com/clue-access/clue-access/main/clue-access.png)](https://github.com/clue-access/clue-access)
21+
22+
*This project is currently under active development,
23+
you're looking at a temporary placeholder repository.*
24+
25+
The code is available in early access to my sponsors here: https://github.com/clue-access/reactphp-tsv
26+
27+
Do you sponsor me on GitHub? Thank you for supporting sustainable open-source, you're awesome! ❤️ Have fun with the code! 🎉
28+
29+
Seeing a 404 (Not Found)? Sounds like you're not in the early access group. Consider becoming a [sponsor on GitHub](https://github.com/sponsors/clue) for early access. Check out [clue·access](https://github.com/clue-access/clue-access) for more details.
30+
31+
This way, more people get a chance to take a look at the code before the public release.
32+
533
## Quickstart example
634

735
TSV (Tab-Separated Values) is a very simple text-based format for storing a
@@ -43,26 +71,293 @@ Carol's birthday is 2006-01-01
4371
Dave's birthday is 1995-01-01
4472
```
4573
74+
## TSV format
75+
76+
TSV (Tab-Separated Values) is a very simple text-based format for storing a
77+
large number of (uniform) records, such as a list of temperature records or log
78+
entries.
79+
80+
```
81+
name birthday ip
82+
Alice 2017-01-01 1.1.1.1
83+
Carol 2006-01-01 2.1.1.1
84+
Dave 1995-01-01 3.1.1.1
85+
```
86+
87+
While this may look somewhat trivial, this simplicity comes at a price. TSV is
88+
limited to untyped, two-dimensional data, so there's no standard way of storing
89+
any nested structures or to differentiate a boolean value from a string or
90+
integer.
91+
92+
While TSV may look somewhat similar to CSV (Comma-Separated Values or less
93+
commonly Character-Separated Values), it is more than just a small variation.
94+
95+
* TSV always uses a tab stop (`\t`) as a delimiter between fields, CSV uses a
96+
comma (`,`) by default, but some applications use variations such as a
97+
semicolon (`;`) or other application-dependent characters (this is
98+
particularly common for systems in Europe (and elsewhere) that use a comma as
99+
decimal separator).
100+
* TSV always uses field names in the first row, CSV allows for optional field
101+
names (which is application-dependent).
102+
* TSV always uses the same number of fields for all rows, CSV allows for rows
103+
with differing numbers of fields (though this is rarely used).
104+
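* CSV requires quoting for fields that contain the delimiter, while TSV has no quoting and does not allow tabs within fields.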
105+
* CSV supports newlines and thus requires more advanced parsing rules
106+
* The MIME type for CSV is `text/csv`, while the MIME type for TSV is `text/tab-separated-values`.
107+
* TSV is defined in a [simple document](https://www.iana.org/assignments/media-types/text/tab-separated-values),
108+
while CSV is defined in a dedicated [RFC 4180](https://tools.ietf.org/html/rfc4180).
109+
However, many applications started using some CSV-variant long before this
110+
standard was defined, so parsing rules differ somewhat between implementations.
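
The practical effect of the quoting difference in the list above can be sketched with plain PHP string functions. This is only an illustration and not how this library parses its input; the sample lines are made up for this example.

```php
<?php

// Illustration only: plain PHP string functions, not this library's parser.

// A TSV line can be split on the tab character, because TSV fields
// must not contain tabs.
$tsvLine = "Alice\t2017-01-01\t1.1.1.1";
$tsvFields = explode("\t", $tsvLine);

// A CSV line may contain quoted fields with embedded commas,
// so a naive split on "," yields the wrong number of fields.
$csvLine = 'Alice,"Hello, World!",1.1.1.1';
$naiveFields = explode(',', $csvLine);

// A quote-aware CSV parser is needed to recover the three fields.
$csvFields = str_getcsv($csvLine, ',', '"', '\\');

echo count($tsvFields), PHP_EOL;   // 3
echo count($naiveFields), PHP_EOL; // 4
echo count($csvFields), PHP_EOL;   // 3
```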
111+
112+
TSV files are commonly limited to only ASCII characters for best interoperability.
113+
However, legacy TSV files often use ISO 8859-1 encoding or some other
114+
variant. Newer TSV files are usually best saved as UTF-8 and may thus also
115+
contain special characters from the Unicode range. The text-encoding is usually
116+
application-dependent, so your best bet would be to convert to (or assume) UTF-8
117+
consistently.
118+
119+
Despite its shortcomings, TSV is widely used and this is unlikely to change any
120+
time soon. In particular, TSV is a very common export format for a lot of tools
121+
to interface with spreadsheet processors (such as Excel, Calc etc.). This means
122+
that TSV is often used for historical reasons and using TSV to store structured
123+
application data is usually not a good idea nowadays – but exporting to TSV for
124+
known applications continues to be a very reasonable approach.
125+
126+
As an alternative, if you want to process structured data in a more modern
127+
JSON-based format, you may want to use [clue/reactphp-ndjson](https://github.com/clue/reactphp-ndjson)
128+
to process newline-delimited JSON (NDJSON) files (`.ndjson` file extension).
129+
130+
```json
131+
{"name":"Alice","age":30,"comment":"Yes, I like cheese"}
132+
{"name":"Bob","age":50,"comment":"Hello\nWorld!"}
133+
```
134+
135+
## Usage
136+
137+
### TsvDecoder
138+
139+
The `TsvDecoder` (parser) class can be used to make sure you only get back
140+
complete, valid TSV elements when reading from a stream.
141+
It wraps a given
142+
[`ReadableStreamInterface`](https://github.com/reactphp/stream#readablestreaminterface)
143+
and exposes its data through the same interface, but emits the TSV elements
144+
as parsed values instead of just chunks of strings:
145+
146+
```
147+
name age
148+
Alice 20
149+
Carol 30
150+
```
151+
152+
```php
153+
$stdin = new React\Stream\ReadableResourceStream(STDIN);
154+
$stream = new Clue\React\Tsv\TsvDecoder($stdin);
155+
156+
$stream->on('data', function ($data) {
157+
// data is a parsed element from the TSV stream
158+
// line 1: $data = array('name' => 'Alice', 'age' => '20');
159+
// line 2: $data = array('name' => 'Carol', 'age' => '30');
160+
var_dump($data);
161+
});
162+
```
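
As the comments in the example above show, every field arrives as a string (`'20'`, not `20`), because TSV is untyped. A minimal sketch of casting fields in application code could look like this; `castUser()` is a hypothetical helper for illustration and not part of this library.

```php
<?php

// Hypothetical helper (not part of this library): TSV is untyped, so the
// application has to cast each string field to the type it expects.
function castUser(array $row)
{
    return array(
        'name' => $row['name'],
        'age' => (int) $row['age'],
    );
}

// The input has the same shape as the records shown in the example above.
$user = castUser(array('name' => 'Alice', 'age' => '20'));

var_dump($user['age']); // int(20)
```

Such a helper could be called from inside the `data` event handler before the record is used any further.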
163+
164+
ReactPHP's streams emit chunks of data strings and make no assumption about their lengths.
165+
These chunks do not necessarily represent complete TSV elements, as an
166+
element may be broken up into multiple chunks.
167+
This class reassembles these elements by buffering incomplete ones.
168+
169+
Accordingly, the `TsvDecoder` limits the maximum buffer size (maximum line
170+
length) to avoid buffer overflows due to malformed user input. Usually, there
171+
should be no need to change this value, unless you know you're dealing with some
172+
unreasonably long lines. It accepts an additional argument if you want to change
173+
this from the default of 64 KiB:
174+
175+
```php
176+
$stream = new Clue\React\Tsv\TsvDecoder($stdin, 64 * 1024);
177+
```
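
The reassembly of chunks into complete lines, together with a size limit on the buffer, can be sketched in plain PHP as follows. This is a simplified illustration under stated assumptions and not the library's actual implementation; `LineBuffer` and `push()` are hypothetical names.

```php
<?php

// Simplified illustration of line buffering, NOT the library's implementation.
// LineBuffer and push() are hypothetical names used only for this sketch.
class LineBuffer
{
    private $buffer = '';
    private $maxLength;

    public function __construct($maxLength = 65536)
    {
        $this->maxLength = $maxLength;
    }

    /** @return string[] complete lines found after appending this chunk */
    public function push($chunk)
    {
        $this->buffer .= $chunk;
        $lines = array();

        // extract every complete line that is currently in the buffer
        while (($pos = strpos($this->buffer, "\n")) !== false) {
            $lines[] = substr($this->buffer, 0, $pos);
            $this->buffer = (string) substr($this->buffer, $pos + 1);
        }

        // whatever remains is an incomplete line; refuse to let it grow forever
        if (strlen($this->buffer) > $this->maxLength) {
            throw new OverflowException('Buffer size exceeded');
        }

        return $lines;
    }
}

$buffer = new LineBuffer();
$first = $buffer->push("name\tage\nAli"); // one complete line, "Ali" stays buffered
$second = $buffer->push("ce\t20\n");      // completes the buffered line
```

Note that this sketch only limits the incomplete remainder, so a long line that arrives in a single chunk would still be returned.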
178+
179+
If the underlying stream emits an `error` event or the plain stream contains
180+
any data that does not represent a valid TSV stream,
181+
it will emit an `error` event and then `close` the input stream:
182+
183+
```php
184+
$stream->on('error', function (Exception $error) {
185+
// an error occurred, stream will close next
186+
});
187+
```
188+
189+
If the underlying stream emits an `end` event, it will flush any incomplete
190+
data from the buffer, thus either possibly emitting a final `data` event
191+
followed by an `end` event on success or an `error` event for
192+
incomplete/invalid TSV data as above:
193+
194+
```php
195+
$stream->on('end', function () {
196+
// stream successfully ended, stream will close next
197+
});
198+
```
199+
200+
If either the underlying stream or the `TsvDecoder` is closed, it will forward
201+
the `close` event:
202+
203+
```php
204+
$stream->on('close', function () {
205+
// stream closed
206+
// possibly after an "end" event or due to an "error" event
207+
});
208+
```
209+
210+
The `close(): void` method can be used to explicitly close the `TsvDecoder` and
211+
its underlying stream:
212+
213+
```php
214+
$stream->close();
215+
```
216+
217+
The `pipe(WritableStreamInterface $dest, array $options = array()): WritableStreamInterface`
218+
method can be used to forward all data to the given destination stream.
219+
Please note that the `TsvDecoder` emits decoded/parsed data events, while many
220+
(most?) writable streams expect only data chunks:
221+
222+
```php
223+
$stream->pipe($logger);
224+
```
225+
226+
For more details, see ReactPHP's
227+
[`ReadableStreamInterface`](https://github.com/reactphp/stream#readablestreaminterface).
228+
229+
### TsvEncoder
230+
231+
The `TsvEncoder` (serializer) class can be used to make sure anything you write to
232+
a stream ends up as valid TSV elements in the resulting TSV stream.
233+
It wraps a given
234+
[`WritableStreamInterface`](https://github.com/reactphp/stream#writablestreaminterface)
235+
and accepts its data through the same interface, but handles any data as complete
236+
TSV elements instead of just chunks of strings:
237+
238+
```php
239+
$stdout = new React\Stream\WritableResourceStream(STDOUT);
240+
$stream = new Clue\React\Tsv\TsvEncoder($stdout);
241+
242+
$stream->write(array('name' => 'Alice', 'age' => '20'));
243+
$stream->write(array('name' => 'Carol', 'age' => '30'));
244+
```
245+
246+
```
247+
name age
248+
Alice 20
249+
Carol 30
250+
```
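
To make the mapping from the written records to the TSV output above more concrete, here is a minimal sketch in plain PHP. It is not the library's implementation; `encodeTsv()` is a hypothetical helper, and rejecting tabs and newlines is an assumption based on TSV having no quoting.

```php
<?php

// Illustration only, NOT the library's implementation.
// encodeTsv() is a hypothetical helper name used for this sketch.
function encodeTsv(array $records)
{
    $out = '';
    $header = null;

    foreach ($records as $record) {
        foreach ($record as $value) {
            // TSV has no quoting, so tabs and newlines cannot be represented
            if (strpbrk((string) $value, "\t\r\n") !== false) {
                throw new InvalidArgumentException('Field contains tab or newline');
            }
        }

        if ($header === null) {
            // the field names of the first record become the first row
            $header = array_keys($record);
            $out .= implode("\t", $header) . "\n";
        }

        $out .= implode("\t", $record) . "\n";
    }

    return $out;
}

echo encodeTsv(array(
    array('name' => 'Alice', 'age' => '20'),
    array('name' => 'Carol', 'age' => '30'),
));
```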
251+
252+
If the underlying stream emits an `error` event or the given data contains
253+
any data that can not be represented as a valid TSV stream,
254+
it will emit an `error` event and then `close` the output stream:
255+
256+
```php
257+
$stream->on('error', function (Exception $error) {
258+
// an error occurred, stream will close next
259+
});
260+
```
261+
262+
If either the underlying stream or the `TsvEncoder` is closed, it will forward
263+
the `close` event:
264+
265+
```php
266+
$stream->on('close', function () {
267+
// stream closed
268+
// possibly after an "end" event or due to an "error" event
269+
});
270+
```
271+
272+
The `end(mixed $data = null): void` method can be used to optionally emit
273+
any final data and then soft-close the `TsvEncoder` and its underlying stream:
274+
275+
```php
276+
$stream->end();
277+
```
278+
279+
The `close(): void` method can be used to explicitly close the `TsvEncoder` and
280+
its underlying stream:
281+
282+
```php
283+
$stream->close();
284+
```
285+
286+
For more details, see ReactPHP's
287+
[`WritableStreamInterface`](https://github.com/reactphp/stream#writablestreaminterface).
288+
46289
## Install
47290
48-
[![A clue·access project](https://raw.githubusercontent.com/clue-access/clue-access/main/clue-access.png)](https://github.com/clue-access/clue-access)
291+
The recommended way to install this library is [through Composer](https://getcomposer.org/).
292+
[New to Composer?](https://getcomposer.org/doc/00-intro.md)
49293
50-
*This project is currently under active development,
51-
you're looking at a temporary placeholder repository.*
294+
This project does not yet follow [SemVer](https://semver.org/).
295+
This will install the latest supported version:
52296
53-
The code is available in early access to my sponsors here: https://github.com/clue-access/reactphp-tsv
297+
While in [early access](#support-us), you first have to manually change your
298+
`composer.json` to include these lines to access the supporters-only repository:
54299
55-
Do you sponsor me on GitHub? Thank you for supporting sustainable open-source, you're awesome! ❤️ Have fun with the code! 🎉
300+
```json
301+
{
302+
"repositories": [
303+
{
304+
"type": "vcs",
305+
"url": "https://github.com/clue-access/reactphp-tsv"
306+
}
307+
]
308+
}
309+
```
56310
57-
Seeing a 404 (Not Found)? Sounds like you're not in the early access group. Consider becoming a [sponsor on GitHub](https://github.com/sponsors/clue) for early access. Check out [clue·access](https://github.com/clue-access/clue-access) for more details.
311+
Then install this package as usual:
58312
59-
This way, more people get a chance to take a look at the code before the public release.
313+
```bash
314+
$ composer require clue/reactphp-tsv:dev-main
315+
```
316+
317+
This project aims to run on any platform and thus does not require any PHP
318+
extensions and supports running on legacy PHP 5.3 through current PHP 8+.
319+
It's *highly recommended to use the latest supported PHP version* for this project.
320+
321+
## Tests
322+
323+
To run the test suite, you first need to clone this repo and then install all
324+
dependencies [through Composer](https://getcomposer.org/):
325+
326+
```bash
327+
$ composer install
328+
```
329+
330+
To run the test suite, go to the project root and run:
60331
61-
Rock on 🤘
332+
```bash
333+
$ vendor/bin/phpunit
334+
```
62335
63336
## License
64337
65-
This project will be released under the permissive [MIT license](LICENSE).
338+
This project is released under the permissive [MIT license](LICENSE).
66339
67340
> Did you know that I offer custom development services and issuing invoices for
68341
sponsorships of releases and for contributions? Contact me (@clue) for details.
342+
343+
## More
344+
345+
* If you want to learn more about processing streams of data, refer to the documentation of
346+
the underlying [react/stream](https://github.com/reactphp/stream) component.
347+
348+
* If you want to process a more common text-based format,
349+
you may want to use [clue/reactphp-csv](https://github.com/clue/reactphp-csv)
350+
to process Comma-Separated Values (CSV) files (`.csv` file extension).
351+
352+
* If you want to process structured data in a more modern JSON-based format,
353+
you may want to use [clue/reactphp-ndjson](https://github.com/clue/reactphp-ndjson)
354+
to process newline-delimited JSON (NDJSON) files (`.ndjson` file extension).
355+
356+
* If you want to process compressed TSV files (`.tsv.gz` file extension)
357+
you may want to use [clue/reactphp-zlib](https://github.com/clue/reactphp-zlib)
358+
on the compressed input stream before passing the decompressed stream to the TSV decoder.
359+
360+
* If you want to create compressed TSV files (`.tsv.gz` file extension)
361+
you may want to use [clue/reactphp-zlib](https://github.com/clue/reactphp-zlib)
362+
on the resulting TSV encoder output stream before passing the compressed
363+
stream to the file output stream.

examples/91-benchmark-count.php

Lines changed: 36 additions & 0 deletions
@@ -0,0 +1,36 @@
1+
<?php
2+
3+
// To run the example, execute the following command:
4+
// $ php examples/91-benchmark-count.php < examples/users.tsv
5+
//
6+
// If you want to achieve more convincing results,
7+
// you can run the same example with some larger TSV files.
8+
// Take a look at:
9+
// @link https://datasets.imdbws.com/
10+
11+
use React\EventLoop\Loop;
12+
13+
require __DIR__ . '/../vendor/autoload.php';
14+
15+
if (extension_loaded('xdebug')) {
16+
echo 'NOTICE: The "xdebug" extension is loaded, this has a major impact on performance.' . PHP_EOL;
17+
}
18+
19+
$decoder = new Clue\React\Tsv\TsvDecoder(new React\Stream\ReadableResourceStream(STDIN));
20+
21+
$count = 0;
22+
$decoder->on('data', function () use (&$count) {
23+
++$count;
24+
});
25+
26+
$start = microtime(true);
27+
$report = Loop::addPeriodicTimer(0.05, function () use (&$count, $start) {
28+
printf("\r%d records in %0.3fs...", $count, microtime(true) - $start);
29+
});
30+
31+
$decoder->on('close', function () use (&$count, $report, $start) {
32+
$now = microtime(true);
33+
Loop::cancelTimer($report);
34+
35+
printf("\r%d records in %0.3fs => %d records/s\n", $count, $now - $start, $count / ($now - $start));
36+
});
