docs/cookbook.md
Lines changed: 21 additions & 45 deletions (+21 -45)
@@ -5,8 +5,6 @@
 Use the [`obstore.list`][] method.

 ```py
-import obstore as obs
-
 store = ... # store of your choice

 # Recursively list all files below the 'data' path.
@@ -15,7 +13,7 @@ store = ... # store of your choice
 prefix = "data"

 # Get a stream of metadata objects:
-list_stream = obs.list(store, prefix)
+list_stream = store.list(prefix)

 # Print info
 for batch in list_stream:
@@ -32,12 +30,10 @@ Instead, you may consider passing `return_arrow=True` to [`obstore.list`][] to r
 This Arrow integration requires the [`arro3-core` dependency](https://kylebarron.dev/arro3/latest/), a lightweight Arrow implementation. You can pass the emitted `RecordBatch` to [`pyarrow`](https://arrow.apache.org/docs/python/index.html) (zero-copy) by passing it to [`pyarrow.record_batch`][] or to [`polars`](https://pola.rs/) (also zero-copy) by passing it to `polars.DataFrame`.
 # Convert to pyarrow (zero-copy), then to pandas for easy export to a
@@ -86,30 +81,28 @@ The Arrow record batch looks like the following:

 ## Fetch objects

-Use the [`obstore.get`][] function to fetch data bytes from remote storage or files in the local filesystem.
+Use the `get` method to fetch data bytes from remote storage or files in the local filesystem.

 ```py
-import obstore as obs
-
 store = ... # store of your choice

 # Retrieve a specific file
 path = "data/file01.parquet"

 # Fetch just the file metadata
-meta = obs.head(store, path)
+meta = store.head(path)
 print(meta)

 # Fetch the object including metadata
-result = obs.get(store, path)
+result = store.get(path)
 assert result.meta == meta

 # Buffer the entire object in memory
 buffer = result.bytes()
 assert len(buffer) == meta.size

 # Alternatively stream the bytes from object storage
-stream = obs.get(store, path).stream()
+stream = store.get(path).stream()

 # We can now iterate over the stream
 total_buffer_len = 0
@@ -125,9 +118,7 @@ Using the response as an iterator ensures that we don't buffer the entire file
 into memory.

 ```py
-import obstore as obs
-
-resp = obs.get(store, path)
+resp = store.get(path)

 with open("output/file", "wb") as f:
     for chunk in resp:
@@ -139,65 +130,56 @@ with open("output/file", "wb") as f:
 Use the [`obstore.put`][] function to atomically write data. `obstore.put` will automatically use [multipart uploads](https://docs.aws.amazon.com/AmazonS3/latest/userguide/mpuoverview.html) for large input data.

 ```py
-import obstore as obs
-
 store = ... # store of your choice
 path = "data/file1"
 content = b"hello"
-obs.put(store, path, content)
+store.put(path, content)
 ```

 You can also upload local files:

 ```py
 from pathlib import Path
-import obstore as obs

 store = ... # store of your choice
 path = "data/file1"
 content = Path("path/to/local/file")
-obs.put(store, path, content)
+store.put(path, content)
 ```

 Or file-like objects:

 ```py
-import obstore as obs
-
 store = ... # store of your choice
 path = "data/file1"
 with open("path/to/local/file", "rb") as content:
-    obs.put(store, path, content)
+    store.put(path, content)
 ```

 Or iterables:

 ```py
-import obstore as obs
-
 def bytes_iter():
     for i in range(5):
         yield b"foo"

 store = ... # store of your choice
 path = "data/file1"
 content = bytes_iter()
-obs.put(store, path, content)
+store.put(path, content)
 ```

 Or async iterables:

 ```py
-import obstore as obs
-
 async def bytes_stream():
     for i in range(5):
         yield b"foo"

 store = ... # store of your choice
 path = "data/file1"
 content = bytes_stream()
-obs.put(store, path, content)
+store.put(path, content)
 ```

 ## Copy objects from one store to another
@@ -209,16 +191,14 @@ Perhaps you have data in one store, say AWS S3, that you need to copy to another
 Download the file, collect its bytes in memory, then upload it. Note that this will materialize the entire file in memory.

 ```py
-import obstore as obs
-
 store1 = ... # store of your choice
 store2 = ... # store of your choice

 path1 = "data/file1"
 path2 = "data/file2"

-buffer = obs.get(store1, path1).bytes()
-obs.put(store2, path2, buffer)
+buffer = store1.get(path1).bytes()
+store2.put(path2, buffer)
 ```

 ### Local file
@@ -227,22 +207,21 @@ First download the file to disk, then upload it.

 ```py
 from pathlib import Path
-import obstore as obs

 store1 = ... # store of your choice
 store2 = ... # store of your choice

 path1 = "data/file1"
 path2 = "data/file2"

-resp = obs.get(store1, path1)
+resp = store1.get(path1)

 with open("temporary_file", "wb") as f:
     for chunk in resp:
         f.write(chunk)

 # Upload the path
-obs.put(store2, path2, Path("temporary_file"))
+store2.put(path2, Path("temporary_file"))
 ```

 ### Streaming
@@ -254,30 +233,27 @@ It's easy to **stream** a download from one store directly as the upload to anot
 Using the async API is currently required to use streaming copies.

 ```py
-import obstore as obs
-
 store1 = ... # store of your choice
 store2 = ... # store of your choice

 path1 = "data/file1"
 path2 = "data/file2"

 # This only constructs the stream, it doesn't materialize the data in memory
-resp = await obs.get_async(store1, path1)
+resp = await store1.get_async(path1)
 # A streaming upload is created to copy the file to path2
-await obs.put_async(store2, path2, resp)
+await store2.put_async(path2, resp)
 ```

 Or, by customizing the chunk size and the upload concurrency you can control memory overhead.

 ```py
-resp = await obs.get_async(store1, path1)
+resp = await store1.get_async(path1)
 chunk_size = 5 * 1024 * 1024  # 5MB
 stream = resp.stream(min_chunk_size=chunk_size)

 # A streaming upload is created to copy the file to path2
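
Reviewer note: the last hunk says that chunk size and upload concurrency together control memory overhead in a streaming copy. The sketch below illustrates that relationship only; it does not use obstore. `download_chunks`, `upload_part`, `streaming_copy` and `MAX_CONCURRENCY` are hypothetical names invented for this illustration, and the "upload" is a no-op stand-in for a network call. Under those assumptions, the bytes held at any moment stay at or below `chunk_size * max_concurrency`.

```python
import asyncio

CHUNK_SIZE = 5 * 1024 * 1024  # 5 MiB, the chunk size used in the snippet above
MAX_CONCURRENCY = 2  # hypothetical cap on simultaneous part uploads


async def download_chunks(total_size, chunk_size):
    """Hypothetical stand-in for a streamed download; yields chunks lazily."""
    remaining = total_size
    while remaining > 0:
        size = min(chunk_size, remaining)
        yield bytes(size)
        remaining -= size


async def streaming_copy(total_size, chunk_size, max_concurrency):
    """Copy chunk by chunk; return (bytes copied, peak bytes held at once)."""
    slots = asyncio.Semaphore(max_concurrency)
    in_flight = 0
    peak = 0
    copied = 0

    async def upload_part(chunk):
        nonlocal in_flight, copied
        await asyncio.sleep(0)  # stands in for the network upload of one part
        copied += len(chunk)
        in_flight -= len(chunk)
        slots.release()

    stream = download_chunks(total_size, chunk_size)
    tasks = []
    while True:
        await slots.acquire()  # do not read more data until a slot is free
        try:
            chunk = await stream.__anext__()
        except StopAsyncIteration:
            slots.release()
            break
        in_flight += len(chunk)
        peak = max(peak, in_flight)
        tasks.append(asyncio.create_task(upload_part(chunk)))
    await asyncio.gather(*tasks)
    return copied, peak


if __name__ == "__main__":
    total = 4 * CHUNK_SIZE
    copied, peak = asyncio.run(streaming_copy(total, CHUNK_SIZE, MAX_CONCURRENCY))
    print(f"copied={copied} bytes, peak buffered={peak} bytes")
```

Acquiring a slot before reading the next chunk is what keeps the bound tight: reading first and waiting afterwards would hold one extra chunk in memory.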