Commit 18b88ac

Merge pull request #2732 from programminghistorian/Issue-2687
Issue 2687
2 parents 347e02e + 1d4b377 commit 18b88ac

4 files changed

Lines changed: 16 additions & 12 deletions

en/lessons/fetch-and-parse-data-with-openrefine.md

Lines changed: 16 additions & 12 deletions
Original file line number | Diff line number | Diff line change
@@ -126,7 +126,7 @@ To make examining the HTML easier, click on the URL in *Column 1* to open the li
126126
In this case the sonnets page does not have distinctive semantic markup, but each poem is contained inside a single `<p>` element.
127127
Thus, if all the paragraphs are selected, the sonnets can be extracted from the group.
128128

129-
{% include figure.html caption="Each sonnet is a \<p\> with lines separated by \<br /\>" filename="refine-sonnet-markup.png" %}
129+
{% include figure.html caption="Each sonnet is a \<p\> with lines separated by \<br />" filename="refine-sonnet-markup.png" %}
130130

131131
On the *fetch* column, click on the menu arrow > *edit column* > *Add column based on this column*.
132132
Give the new column the name "parse", then click in the *Expression* text box.
@@ -245,8 +245,8 @@ The entities will be replaced with normal whitespace.
245245
[GREL array functions](https://github.com/OpenRefine/OpenRefine/wiki/GREL-Array-Functions) provide a powerful way to manipulate text data and can be used to finish processing the sonnets.
246246
Any string value can be turned into an array using the `split()` function by providing the character or expression that separates the items (basically the opposite of `join()`).
247247

248-
In the sonnets each line ends with `<br />`, providing a convenient separator for splitting.
249-
The expression `value.split("<br />")` will create an array of the lines of each sonnet.
248+
In the sonnets each line ends with `<br>`, providing a convenient separator for splitting.
249+
The expression `value.split("<br>")` will create an array of the lines of each sonnet.
250250
Index numbers and slices can then be used to populate new columns.
251251
Keep in mind that Refine will not output an array directly to a cell.
252252
Be sure to select one element from the array using an index number or convert it back to a string with `join()`.
@@ -258,8 +258,8 @@ Trim automatically removes all leading and trailing white space in a cell, an es
258258
Using these concepts, a single line can be extracted and trimmed to create clean columns representing the sonnet number and first line.
259259
Create two new columns from the *parse* column using these names and expressions:
260260

261-
- "number", `value.split("<br />")[0].trim()`
262-
- "first", `value.split("<br />")[1].trim()`
261+
- "number", `value.split("<br>")[0].trim()`
262+
- "first", `value.split("<br>")[1].trim()`
263263

264264
{% include figure.html caption="GREL split and trim" filename="refine-add-num-column.png" %}
265265

@@ -271,18 +271,18 @@ From the *parse* column, create a new column named "text", and click in the *Exp
271271
A `forEach()` statement asks for an array, a variable name, and an expression applied to the variable.
272272
Following the form `forEach(array, variable, expression)`, construct the loop using these parameters:
273273

274-
- array: `value.split("<br />")`, creates an array from the lines of the sonnet in each cell.
274+
- array: `value.split("<br>")`, creates an array from the lines of the sonnet in each cell.
275275
- variable: `line`, each item in the array is then represented as the variable (it could be anything, `v` is often used).
276276
- expression: `line.trim()`, each item is then evaluated separately with the specified expression. In this case, `trim()` cleans the white space from each sonnet line in the array.
277277

278-
At this point, the statement should look like `forEach(value.split("<br />"), line, line.trim())` in the *Expression* box.
278+
At this point, the statement should look like `forEach(value.split("<br>"), line, line.trim())` in the *Expression* box.
279279
Notice that the *Preview* now shows an array where the first element is the sonnet number.
280280
Since the results of the `forEach()` are returned as a new array, additional array functions can be applied, such as slice and join.
281281
Add `slice(1)` to remove the sonnet number, and `join("\n")` to concatenate the lines in to a string value (`\n` is the symbol for new line in plain text).
282282
Thus, the final expression to extract and clean the full sonnet text is:
283283

284284
```
285-
forEach(value.split("<br />"), line, line.trim()).slice(1).join("\n")
285+
forEach(value.split("<br>"), line, line.trim()).slice(1).join("\n")
286286
```
287287

288288
{% include figure.html caption="GREL forEach expression" filename="refine-foreach.png" %}
@@ -291,7 +291,7 @@ Click "OK" to create the column.
291291
Following the same technique, add another new column from *parse* named "last" to represent the final couplet lines using:
292292

293293
```
294-
forEach(value.split("<br />"), line, line.trim()).slice(-3).join("\n")
294+
forEach(value.split("<br>"), line, line.trim()).slice(-3).join("\n")
295295
```
296296

297297
Finally, numeric columns can be added using the `length()` function.
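
The split, trim, slice, and join chain touched by the hunks above can be sketched outside OpenRefine. This is a rough Python analogue, not GREL, and the sample cell is a hypothetical stand-in rather than real lesson data. It also shows why the separator string has to match the markup exactly: splitting on `<br />` when the cell contains `<br>` leaves the value unsplit.

```python
# Hypothetical stand-in for one cell of the *parse* column (not real lesson data).
cell = " I<br> first line, <br> second line <br> third line <br> fourth line "


def clean_lines(value, sep="<br>"):
    """Split on sep and trim each piece.

    Roughly mirrors forEach(value.split(sep), line, line.trim()).
    """
    return [line.strip() for line in value.split(sep)]


lines = clean_lines(cell)

number = lines[0]              # like value.split("<br>")[0].trim()
first = lines[1]               # like value.split("<br>")[1].trim()
text = "\n".join(lines[1:])    # like .slice(1).join("\n")
last = "\n".join(lines[-3:])   # like .slice(-3).join("\n")

# A separator that does not occur in the cell yields a single-element list,
# so every index other than 0 would be out of range.
unsplit = cell.split("<br />")
```

Here `unsplit` holds only the original string, which is the situation the `<br />` to `<br>` change guards against when the fetched markup uses the shorter form.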
@@ -417,19 +417,19 @@ GREL's `parseJson()` function allows us to select a key name to retrieve the cor
417417
Add a new column based on *fetch* with the name "items" and enter this expression:
418418

419419
```
420-
value.parseJson()['items'].join("|||")
420+
value.parseJson()['items'].join("^^^")
421421
```
422422

423423
{% include figure.html caption="parse json items" filename="refine-parse-items.png" %}
424424

425425
Selecting `['items']` exposes the array of newspaper records nested inside the JSON response.
426426
The `join()` function concatenates the array with the given separator resulting in a string value.
427-
Since the newspaper records contain an OCR text field, the strange separator "|||" is necessary to ensure that it is unique and can be used to split the values.
427+
Since the newspaper records contain an OCR text field, the strange separator "^^^" is necessary to ensure that it is unique and can be used to split the values.
428428

429429
## Split Multivalued Cells
430430

431431
With the individual newspapers isolated, separate rows can be created by splitting the cells.
432-
On the *items* column, select *Edit cells* > *Split multivalued cells*, and enter the join used in the last step, `|||`.
432+
On the *items* column, select *Edit cells* > *Split multivalued cells*, and enter the join used in the last step, `^^^`.
433433
After the operation, the top of the project table should read 20 rows.
434434
Clicking on Show as *records* should read 4, representing the original CSV rows.
435435
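
The join-then-split round trip in this hunk depends on the separator never appearing inside a record. A minimal Python sketch of that idea follows. The JSON response is a hypothetical, heavily shortened stand-in, and serializing each record with `json.dumps` is an assumption made to imitate how the joined cell is later parsed again.

```python
import json

# Hypothetical, shortened response; real records carry many more fields.
response = (
    '{"items": ['
    '{"lccn": "sn00000001", "ocr_eng": "a | b"}, '
    '{"lccn": "sn00000002", "ocr_eng": "c || d"}'
    ']}'
)

items = json.loads(response)["items"]
serialized = [json.dumps(item) for item in items]

# A separator absent from the text splits back into the original records.
SEP = "^^^"
rows = SEP.join(serialized).split(SEP)
records = [json.loads(row) for row in rows]

# Illustrative collision: a separator that also occurs in the OCR text
# produces more pieces than there are records.
collided = "|".join(serialized).split("|")
```

With the unique separator, `rows` has one entry per record. With the colliding one, `collided` has extra fragments that no longer parse as JSON.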

@@ -461,6 +461,10 @@ Create a new column from *items* for each newspaper metadata element by parsing
461461
- "lccn", `value.parseJson()['lccn']`
462462
- "text", `value.parseJson()['ocr_eng']`
463463

464+
<div class="alert alert-info">
465+
Some users of this lesson have noted that a recent change to the output of OCR'ed text from the Library of Congress introduces unexpected line breaks in the text column. These can be removed using the Expression <code>value.replace("\n","")</code>. (Nov. 2022)
466+
</div>
467+
464468
After the desired information is extracted, the *items* column can be removed by selecting *Edit column* > *Remove this column*.
465469

466470
{% include figure.html caption="Final ChronAm project columns" filename="refine-chronam-final.png" %}
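
The line-break cleanup mentioned in the added note can be previewed with a short Python sketch. The OCR snippet is hypothetical. Replacing with an empty string matches the expression in the note. Replacing with a space is shown only as an alternative to consider when breaks fall between words.

```python
# Hypothetical OCR snippet with line breaks inside a sentence (not real data).
ocr_text = "The quick brown\nfox jumps over\nthe lazy dog"

# Same idea as value.replace("\n","") in the note: drop the breaks entirely.
removed = ocr_text.replace("\n", "")

# Alternative: substitute a space so adjacent words stay separated.
spaced = ocr_text.replace("\n", " ")
```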