File: en/lessons/fetch-and-parse-data-with-openrefine.md
Lines changed: 16 additions & 12 deletions
@@ -126,7 +126,7 @@ To make examining the HTML easier, click on the URL in *Column 1* to open the li
 In this case the sonnets page does not have distinctive semantic markup, but each poem is contained inside a single `<p>` element.
 Thus, if all the paragraphs are selected, the sonnets can be extracted from the group.
 
-{% include figure.html caption="Each sonnet is a \<p\> with lines separated by \<br /\>" filename="refine-sonnet-markup.png" %}
+{% include figure.html caption="Each sonnet is a \<p\> with lines separated by \<br />" filename="refine-sonnet-markup.png" %}
 
 On the *fetch* column, click on the menu arrow > *edit column* > *Add column based on this column*.
 Give the new column the name "parse", then click in the *Expression* text box.
@@ -245,8 +245,8 @@ The entities will be replaced with normal whitespace.
 [GREL array functions](https://github.com/OpenRefine/OpenRefine/wiki/GREL-Array-Functions) provide a powerful way to manipulate text data and can be used to finish processing the sonnets.
 Any string value can be turned into an array using the `split()` function by providing the character or expression that separates the items (basically the opposite of `join()`).
 
-In the sonnets each line ends with `<br />`, providing a convenient separator for splitting.
-The expression `value.split("<br />")` will create an array of the lines of each sonnet.
+In the sonnets each line ends with `<br>`, providing a convenient separator for splitting.
+The expression `value.split("<br>")` will create an array of the lines of each sonnet.
 Index numbers and slices can then be used to populate new columns.
 Keep in mind that Refine will not output an array directly to a cell.
 Be sure to select one element from the array using an index number or convert it back to a string with `join()`.
@@ -258,8 +258,8 @@ Trim automatically removes all leading and trailing white space in a cell, an es
 Using these concepts, a single line can be extracted and trimmed to create clean columns representing the sonnet number and first line.
 Create two new columns from the *parse* column using these names and expressions:
 
-- "number", `value.split("<br />")[0].trim()`
-- "first", `value.split("<br />")[1].trim()`
+- "number", `value.split("<br>")[0].trim()`
+- "first", `value.split("<br>")[1].trim()`
 
 {% include figure.html caption="GREL split and trim" filename="refine-add-num-column.png" %}
 
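For readers who want to check the split-and-trim logic outside OpenRefine, the two expressions in this hunk can be approximated in Python. This is a minimal sketch, not part of the lesson: the function name `number_and_first` and the sample cell are hypothetical, and Python's `str.split()` and `str.strip()` stand in for GREL's `split()` and `trim()`.

```python
def number_and_first(cell: str) -> tuple[str, str]:
    """Return the sonnet number and first line from one parsed cell."""
    # Like value.split("<br>"): one list element per sonnet line.
    parts = cell.split("<br>")
    # Like [0].trim() and [1].trim(): pick one element, strip white space.
    return parts[0].strip(), parts[1].strip()


# Hypothetical sample cell; real cells hold a complete sonnet.
sample = " I <br> From fairest creatures we desire increase,<br> That thereby beauty's rose might never die,"
number, first = number_and_first(sample)
```

The separator passed to `split()` has to match the markup in the cell exactly; a cell that contains `<br>` will not be divided by a split on `<br />`.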
@@ -271,18 +271,18 @@ From the *parse* column, create a new column named "text", and click in the *Exp
 A `forEach()` statement asks for an array, a variable name, and an expression applied to the variable.
 Following the form `forEach(array, variable, expression)`, construct the loop using these parameters:
 
-- array: `value.split("<br />")`, creates an array from the lines of the sonnet in each cell.
+- array: `value.split("<br>")`, creates an array from the lines of the sonnet in each cell.
 - variable: `line`, each item in the array is then represented as the variable (it could be anything, `v` is often used).
 - expression: `line.trim()`, each item is then evaluated separately with the specified expression. In this case, `trim()` cleans the white space from each sonnet line in the array.
 
-At this point, the statement should look like `forEach(value.split("<br />"), line, line.trim())` in the *Expression* box.
+At this point, the statement should look like `forEach(value.split("<br>"), line, line.trim())` in the *Expression* box.
 Notice that the *Preview* now shows an array where the first element is the sonnet number.
 Since the results of the `forEach()` are returned as a new array, additional array functions can be applied, such as slice and join.
 Add `slice(1)` to remove the sonnet number, and `join("\n")` to concatenate the lines in to a string value (`\n` is the symbol for new line in plain text).
 Thus, the final expression to extract and clean the full sonnet text is:
 Finally, numeric columns can be added using the `length()` function.
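The loop, slice, and join steps described in this hunk can be mirrored in Python as a rough check of the logic. This is an illustrative sketch under stated assumptions, not the lesson's own GREL expression: `sonnet_text` and the sample cell are hypothetical names, a list comprehension stands in for `forEach()`, and `[1:]` stands in for `slice(1)`.

```python
def sonnet_text(cell: str) -> str:
    """Trim every line, drop the sonnet number, and join with newlines."""
    # Like forEach(value.split("<br>"), line, line.trim()).
    trimmed = [line.strip() for line in cell.split("<br>")]
    # Like slice(1) followed by join("\n").
    return "\n".join(trimmed[1:])


# Hypothetical two-line sample; real cells hold a complete sonnet.
sample = " I <br> First line, <br> Second line. "
text = sonnet_text(sample)
```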
@@ -417,19 +417,19 @@ GREL's `parseJson()` function allows us to select a key name to retrieve the cor
 Add a new column based on *fetch* with the name "items" and enter this expression:
 
 ```
-value.parseJson()['items'].join("|||")
+value.parseJson()['items'].join("^^^")
 ```
 
 {% include figure.html caption="parse json items" filename="refine-parse-items.png" %}
 
 Selecting `['items']` exposes the array of newspaper records nested inside the JSON response.
 The `join()` function concatenates the array with the given separator resulting in a string value.
-Since the newspaper records contain an OCR text field, the strange separator "|||" is necessary to ensure that it is unique and can be used to split the values.
+Since the newspaper records contain an OCR text field, the strange separator "^^^" is necessary to ensure that it is unique and can be used to split the values.
 
 ## Split Multivalued Cells
 
 With the individual newspapers isolated, separate rows can be created by splitting the cells.
-On the *items* column, select *Edit cells* > *Split multivalued cells*, and enter the join used in the last step, `|||`.
+On the *items* column, select *Edit cells* > *Split multivalued cells*, and enter the join used in the last step, `^^^`.
 After the operation, the top of the project table should read 20 rows.
 Clicking on Show as *records* should read 4, representing the original CSV rows.
 
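The join-then-split round trip in this hunk can be sketched in Python to show why the separator must not occur inside the records themselves. This is only an approximation of what OpenRefine does: the function names and the sample response are hypothetical, and `json.dumps()` is an assumption standing in for how each record is turned back into a string.

```python
import json

SEPARATOR = "^^^"  # the separator used on the "+" side of this hunk


def items_cell(response_text: str) -> str:
    """Approximate value.parseJson()['items'].join("^^^")."""
    items = json.loads(response_text)["items"]
    # Serialize each record, then concatenate with the separator.
    return SEPARATOR.join(json.dumps(item) for item in items)


def split_rows(cell: str) -> list[str]:
    """Approximate Split multivalued cells: one string per record."""
    return cell.split(SEPARATOR)


# Hypothetical response with two records; the OCR text contains "|".
response = json.dumps(
    {"items": [{"lccn": "a1", "ocr_eng": "x | y"}, {"lccn": "b2", "ocr_eng": "z"}]}
)
rows = split_rows(items_cell(response))
```

If the separator also appeared inside an OCR field, the split would cut that record in two and the pieces would no longer parse as JSON.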
@@ -461,6 +461,10 @@ Create a new column from *items* for each newspaper metadata element by parsing
 - "lccn", `value.parseJson()['lccn']`
 - "text", `value.parseJson()['ocr_eng']`
 
+<div class="alert alert-info">
+Some users of this lesson have noted that a recent change to the output of OCR'ed text from the Library of Congress introduces unexpected line breaks in the text column. These can be removed using the Expression <code>value.replace("\n","")</code>. (Nov. 2022)
+</div>
+
 After the desired information is extracted, the *items* column can be removed by selecting *Edit column* > *Remove this column*.
 
 {% include figure.html caption="Final ChronAm project columns" filename="refine-chronam-final.png" %}
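The field extraction and the line-break cleanup from the added note can be tried out in Python as follows. This is a hedged sketch rather than the lesson's method: `extract_fields` and the sample record are hypothetical, and only the two keys shown in this hunk (`lccn` and `ocr_eng`) are assumed to exist.

```python
import json


def extract_fields(item_cell: str) -> dict[str, str]:
    """Pull the lccn and OCR text out of one record, removing line breaks."""
    record = json.loads(item_cell)  # like value.parseJson()
    return {
        "lccn": record["lccn"],
        # Like value.replace("\n", ""): deletes the breaks, adds no space.
        "text": record["ocr_eng"].replace("\n", ""),
    }


# Hypothetical record with a line break inside the OCR text.
sample = json.dumps({"lccn": "sn0000", "ocr_eng": "line one\nline two"})
fields = extract_fields(sample)
```

Because the replacement string is empty, words on either side of a removed break are joined directly; replacing with a single space instead would keep them apart.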