ggsql/doc/syntax/clause/draw.qmd at 134f847b4e33bade7ae93294bd764f6258ce42b5 · posit-dev/ggsql · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
---
title: "Create layers with `DRAW`"
---

`DRAW` is perhaps the most important clause in ggsql as it defines a layer in your visualisation. A layer is a single instance of a visual representation of a dataset. A visualisation can contain multiple layers, either showing the same dataset in different ways or showing different data in each. Layers are drawn in the order they are defined, so that layers defined first will be behind layers defined last.

## Clause syntax
The `DRAW` clause takes a number of subclauses, all of them optional if the `VISUALISE` clause provides a global mapping and data source.

```ggsql
DRAW <layer-type>
  MAPPING <mapping>, ... FROM <data-source>
  REMAPPING <mapping>, ...
  SETTING <parameter/aesthetic> => <value>, ...
  FILTER <condition>
  PARTITION BY <column>, ...
  ORDER BY <column>, ...
```

The only required part is the layer type immediately following the `DRAW` clause, which specifies the type of layer to draw, e.g. `point` or `histogram`. It defines how the remaining settings are interpreted. The [main syntax page](../index.qmd#layers) has a list of all available layer types

### `MAPPING`
```ggsql
MAPPING <mapping>, ... FROM <data-source>
```
The `MAPPINGS` clause define how data from the dataset are related to visual aesthetics or statistical properties. Multiple mappings can be provided by separating them with a comma. Mapped aesthetics are always scaled by their respective scale. This means that if you map the value 'red' to fill, then fill will not take the color red, but whatever the scale decides should represent the string 'red'. Layer mappings are merged with the global mapping from the `VISUALISE` clause with the one in the layer taking precedence. This means that it is not necessary to provide any mappings in the `DRAW` clause if sufficient global mappings are provided.

`MAPPING` may have a `FROM` clause that defines the source of the data. If this is missing the global data source from `VISUALISE` is used.

The `mapping` can take one of three forms and all three can be mixed in the same `MAPPING` clause:

* `<data> AS <aesthetic/property>` (explicit mapping)
* `<data>` (implicit mapping)
* `*` (wildcard mapping)

`data` can be either:

* *Column name*: If you provide the name of a column in the layer data (or global data in the absence of layer data) then the values in that column are mapped to the aesthetic or property. If the name of the column is the same as the aesthetic or property you can provide it without the following `AS <aesthetic/property>` (implicit mapping).
* *Constant*: If you provide a constant like a string, number, or boolean then this value is repeated for every record in the data and mapped to the given aesthetic or property. When mapping a constant you must use the explicit form since the aesthetic/property cannot be derived.
* `null`: If you map `null` to an aesthetic you prevent that aesthetic from being inherited from the global mapping without mapping any data to it. `null` can only be used with explicit mappings.

If an asterisk is given (wildcard mapping) it indicate that every column in the layer data with a name matching a supported aesthetic or property are implicitly mapped to said aesthetic or property. If the aesthetic or property has been mapped elsewhere then that gains precedence (i.e. if writing `MAPPING *, revenue AS y` then y will take on the data in the revenue column even if a y column exist in the data)

An `aesthetic` is a visual characteristic of what you are rendering. Different aesthetics are available depending on the layer type since e.g. linetype is not relevant for points and shape are not relevant for lines.

A `property` is a value used by the statistical transformation done by the layer, e.g. the weight property in the histogram layer that allows weighted histogram calculation.

> Mapped aesthetics are always scaled by their respective scale, whereas properties are not (there are no scales for properties).

The documentation for each layer type provides an overview of the aesthetics and properties available for them.

A layer may use a data source different than the global data by appending a `FROM <data-source>` to the mapping. The data source can be either:

* *Table/CTE*: If providing an unquoted identifier it is assumed that the data is available in the backend, either as a CTE defined in the pre-query, or as a proper table in the database.
* *Filepath*: If a string is provided (single quoted), it is assumed to point to a file that can be read directly by the backend.

### `REMAPPING`
```ggsql
REMAPPING <mapping>, ...
```

Some layer types like histogram runs the data through a statistical transformation in order to arrive at the data that needs to be displayed (e.g. count per bin). During this, one or more columns are added to the data and you can access these in the `REMAPPING` clause. Layers that perform statistical transformations always have a default remapping (e.g. `count AS y` in histogram) so it never required to specify a remapping, but you can do so, either to change the default (e.g. `density AS y`to access the calculated density instead of count), or to use the calculated property for another aesthetic (e.g. `count AS fill` to also shade the bars according to their height).

Remappings have to be explicit since the property name never coincide with an aesthetic. Further, remappings must always map to a visual aesthetic since the statistical properties have already been consumed.

### `SETTING`
```ggsql
SETTING <parameter/aesthetic> => <value>, ...
```

The `SETTING` clause can be used for two different things:

* *Setting parameters*: Some layers take additional arguments that control how they behave. Often, but not always, these modify the statistical transformation in some way. An example would be the binwidth parameter in histogram which controls the width of each bin during histogram calculation. This is not a statistical property since it is not related to each record, but to the calculation as a whole.
* *Setting aesthetics*: If you wish to set a specific aesthetic to a literal value, e.g. 'red' (as in the color red) then you can do so in the `SETTING` clause. Aesthetics that are set will not go through a scale but will use the provided value as-is. You cannot set an aesthetic to a column, only to a scalar literal value.

#### Position
A special setting is `position` which controls how overlapping objects are repositioned to avoid overlapping etc. Position adjustments have special mapping requirements so all position adjustments will not be relevant for all layer types. Different layers have different defaults as detailed in their documentation. You can read about each different position adjustment at [their own documentation sites](../index.qmd#position-adjustments).

#### Aggregate
Some layers support aggregation of their data through the `aggregate` setting. These layers will state this. `aggregate` collapses each group to a single row, replacing every numeric mapping in place with its aggregated value. Groups are defined by `PARTITION BY` together with all discrete mappings.

The setting takes a single string or an array of strings. Each string is one of:

* **Untargeted** — `'<func>'` (no prefix). With one untargeted aggregation the function applies to every numeric mapping that doesn't have a targeted aggregation. With two untargeted aggregations the first is used for the lower side of range layers (e.g. `x`/`xmin`) plus all non-range layers, and the second is used for the upper side of range layers (e.g. `xend`/`xmax`). More than two untargeted aggregations is an error.
* **Targeted** — `'<aes>:<func>'`. Applies `func` to the named aesthetic only (`<aes>` is a user-facing name like `x`, `y`, `xmin`, `xmax`, `xend`, `yend`, `color`, `size`, …). A target overrides any untargeted aggregation for that aesthetic.

A numeric mapping is dropped from the layer with a warning, when it has neither a target nor an applicable default.

The simple functions are:

* `'count'`: Non-null tally of the bound column.
* `'sum'` and `'prod'`: The sum or product
* `'min'`, `'max'`, `'range'`, and `'mid'`: Extremes, max - min, and (min + max) / 2
* `'mean'`, and `'median'`: Central tendency
* `'geomean'`, `'harmean'`, and `'rms'`: Geometric, harmonic, and root-mean-square
* `'sdev'`, `'var'`, `'iqr'`, and `'se'`: Standard deviation, variance, interquartile range, and standard error
* `'p05'`, `'p10'`, `'p25'`, `'p50'`, `'p75'`, `'p90'`, and `'p95'`: Percentiles
* `'first'` and `'last'`: The first or last value in the group, in row order. Note that the row order within a group is engine-defined unless the source query has an `ORDER BY` — these are most useful when the upstream SQL provides an explicit ordering.
* `'diff'`: `last - first`. The change between the first and last value in row order — same ordering caveat applies.

For band functions you combine an offset with an expansion, potentially multiplied. An example could be `'mean-1.96sdev'` which does exactly what you'd expect it to be. The general form is `<offset>±<multiplier><expansion>` with `<multiplier>` being optional (defaults to `1`).

Allowed offsets are: `'mean'`, `'median'`, `'geomean'`, `'harmean'`, `'rms'`, `'sum'`, `'prod'`, `'min'`, `'max'`, `'mid'`, and `'p05'`–`'p95'`

Allowed expansions are: `'sdev'`, `'se'`, `'var'`, `'iqr'`, and `'range'`

You can also target the same aesthetic more than once to produce **multiple rows per group** — one for each function. For example `aggregate => ('y:min', 'y:max')` emits a min row and a max row per group, so a single `DRAW line` produces two summary lines that connect within each group rather than across them. When multiple rows are created a synthetic `aggregate` column is made that tags each row with the aggregation function. You can use this with a `REMAPPING` to drive another aesthetic — e.g. `REMAPPING aggregate AS stroke` to colour the two lines differently. The column's value is built from the per-row function names of the *exploded* targets, deduplicated, and joined with `/`:

* `aggregate => ('y:min', 'y:max')` → rows tagged `'min'`, `'max'`.
* `aggregate => ('y:min', 'y:max', 'color:median')` → rows tagged `'min'`, `'max'` (the single-function `color` target is recycled across rows and is not part of the label).
* `aggregate => ('y:min', 'y:max', 'color:sum', 'color:prod')` → rows tagged `'min/sum'`, `'max/prod'`.
* `aggregate => ('y:mean', 'y:max', 'color:mean', 'color:prod')` → rows tagged `'mean'`, `'max/prod'` (the duplicate `'mean'` collapses).

When several aesthetics are targeted with the same number of functions, they explode in lockstep (row 1 uses each aesthetic's first function, row 2 the second, and so on); aesthetics with a single function — and the unprefixed defaults — are reused unchanged across every row. Mixing different lengths above 1 is an error.

In the single-row (reduction) case aggregation applies in place — no `REMAPPING` is needed and no synthetic column is added. Only the multi-row (explosion) case described above introduces the synthetic `aggregate` column.

### `FILTER`
```ggsql
FILTER <condition>
```

You may not want to use all data provided from the data source in the layer. You can limit the data to plot with the `FILTER` clause. The content of `condition` is used directly in a `WHERE` clause when querying the backend for the layer data, so whatever type of expression you database backend supports there will work.

### `PARTITION BY`
```ggsql
PARTITION BY <column>, ...
```

During drawing the records in the layer data are grouped by all the discrete data that has been mapped. This grouping is not relevant to all layer types but critical for some, e.g. line layers where the grouping defines which records are connected with a line.

Often the implicit grouping from the aesthetic mapping is enough, e.g. mapping a discrete value to colour will create one line per colour, but sometimes you need a grouping not reflected in the aesthetic mapping. In that case you can use the `PARTITION BY` clause to define data columns used for grouping in addition to the ones from the mapping.

### `ORDER BY`
```ggsql
ORDER BY <column>, ...
```

For some layers the order of records in the data is important, e.g. the path layer which connect records in the order they are provided. Since databases often doesn't guarantee a specific order of the data, the `ORDER BY` clause can be used to enforce such and order. Even for layers where the order doesn't immediately seem to matter it may have an effect, e.g. an overplottet scatterplot where the records in the end of the data are plottet on top of the one in the start.

## Layer orientation
Some layer types treat the two axes differently. For instance, a boxplot has categories along a discrete axis and summary statistics along a continuous one. While we are used to seeing boxplots with categories along the x-axis, this is not a necessity. The orientation can be deduced directly from the mappings in the layer. So, if you map discrete data to the x axis and continuous data to the y axis you get a boxplot in the standard orientation, whereas if you switch the mapping the boxes will "lay down" instead. The vast majority of layers that have an orientation also have a unique mapping pattern that allows us to deduce the orientation directly from the mapping. The few layers where the mapping is ambiguous (e.g. `line`) have an `orientation` setting that allows you to set the orientation explicitly.