Computing stats between groups #6476

@kieran-mace

When a group-level stat needs knowledge of the other groups, it would be useful for compute_group to have access to the rest of the data.

I would like to be able to create a new computed variable, bin_prop, in StatBin that returns, for each bin, the proportion of that bin's data that belongs to the group.
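
To make the intended definition concrete, here is a minimal base R sketch of the arithmetic on made-up data. The `bin` and `group` vectors are hypothetical and only illustrate the calculation: for each bin, a group's count is divided by the total count of all groups in that bin.

```r
# Hypothetical toy data: the bin each observation falls in, and its group
bin   <- c("b1", "b1", "b1", "b2", "b2", "b2", "b2")
group <- c("p",  "p",  "q",  "p",  "q",  "q",  "q")

# Per-group counts in each bin (rows = groups, columns = bins)
counts <- table(group, bin)

# Divide every column (bin) by its total, so each bin sums to 1 across groups
bin_prop <- prop.table(counts, margin = 2)

print(bin_prop)
```

In this toy case group `p` accounts for 2 of the 3 observations in `b1` and 1 of the 4 in `b2`, which is the per-bin share I would like StatBin to expose.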

In the example below, I want to analyze the number of plays by each player on the Lakers. I will use geom_freqpoly to show the counts, but what I really want is each player's proportion of the plays within each bin.

Set up data

library(lubridate)
library(ggplot2) 
library(dplyr)


# set up data
laker_player_plays = lakers |> 
  tibble::as_tibble() |> 
  filter(team == 'LAL', stringr::str_length(player) > 0) |> 
  mutate(date = ymd(date))

Just counts. This is close to what I want, but I would love to use after_stat(bin_prop) instead.

# I'd like to do this, but instead create a new property `bin_prop` that shows the percentage of plays by that player
ggplot(laker_player_plays) +
  geom_freqpoly(aes(x = date,
                    color = player,
                    y = after_stat(count)
  ),
  binwidth = 31)

Side note

I do see that something equivalent can be done with geom_histogram + position = 'fill', but I do not believe this is done by the stat layer. Maybe it is done by the scales layer?

# I do notice this is done to some extent using geom_histogram + position = fill, but I believe this position is not computed during the stat step
ggplot(laker_player_plays) +
  geom_histogram(aes(x = date, fill = player), position = 'fill', binwidth = 31)
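
A quick way to check where the rescaling happens is to inspect the built layer data with ggplot2's `layer_data()`. The sketch below uses a small made-up data frame (`toy` and its columns `x` and `g` are hypothetical, not part of the Lakers example). If the `count` column still holds raw counts while `ymin`/`ymax` are rescaled to the 0 to 1 range, that would suggest the normalisation comes from the position adjustment rather than from the stat.

```r
library(ggplot2)

# Hypothetical toy data: two groups sharing the same x range
toy <- data.frame(
  x = c(1, 1, 2, 2, 2, 1, 2, 2),
  g = c("a", "a", "a", "a", "a", "b", "b", "b")
)

p <- ggplot(toy) +
  geom_histogram(aes(x = x, fill = g),
                 position = "fill", binwidth = 1, boundary = 0.5)

# Data after the stat and position steps have both been applied
built <- layer_data(p)

# `count` comes from the stat; `ymin`/`ymax` reflect the position adjustment
print(built[, c("group", "x", "count", "ymin", "ymax")])
```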

Desired output

Here is an example of what I'd like to achieve, but using stats instead of precomputing proportion_of_plays ahead of time:

# This is the type of plot I think we should be able to create, without having to pre-calculate the proportions (should be computed in StatBin)
# calculate breaks, for solutions that can't use stat_bin

breaks = seq(min(laker_player_plays$date), max(laker_player_plays$date)+31, by = 31)

laker_player_plays |> 
  mutate(date_group = cut(date, breaks = breaks)) |>
  group_by(player, date_group) |> 
  count(name = 'plays') |> 
  group_by(date_group) |> 
  mutate(proportion_of_plays = plays/sum(plays)) |> 
  ggplot(aes(x = date_group, 
             y = proportion_of_plays,
             color = player,
             group = player)) +
  geom_point() +
  geom_line() +
  scale_y_continuous(labels=scales::percent)

Created on 2025-05-22 with reprex v2.1.1

Suggested API

ggplot(laker_player_plays) +
  geom_freqpoly(aes(x = date,
                    color = player,
                    y = after_stat(bin_prop)
  ),
  binwidth = 31)

I've attempted to create a PR for this, but noticed that each group is computed independently. Is there a solution or workaround you would propose, so that a PR could compute bin_prop in StatBin even though it requires proportions across groups? I do see that after_stat(prop) is available for geom_bar, so I suspect this pattern has been solved before?
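
One possible direction, offered as a rough sketch rather than a proposed implementation: since compute_group only sees one group at a time, the cross-group step could live in compute_panel, which sees every group in the panel. The code below defines a hypothetical `StatBinProp` (my name, not an existing ggplot2 object) that lets StatBin bin each group as usual and then normalises `count` across groups within each bin. It assumes the default x orientation and that all groups in a panel share the same bin positions, and it uses a made-up `toy` data frame instead of the Lakers data.

```r
library(ggplot2)

# Hypothetical stat: StatBin plus a cross-group `bin_prop` computed variable
StatBinProp <- ggproto("StatBinProp", StatBin,
  compute_panel = function(self, data, scales, ...) {
    # Let StatBin do its usual per-group binning first
    binned <- ggproto_parent(StatBin, self)$compute_panel(data, scales, ...)
    if (nrow(binned) == 0) return(binned)

    # Total count per bin across all groups (assumes groups share bin centres)
    totals <- ave(binned$count, binned$x, FUN = sum)

    # Each group's share of its bin; empty bins get 0 instead of NaN
    binned$bin_prop <- ifelse(totals > 0, binned$count / totals, 0)
    binned
  }
)

# Hypothetical toy data: two groups over a shared x range
toy <- data.frame(
  x = c(1, 1, 2, 2, 2, 1, 2, 2),
  g = c("a", "a", "a", "a", "a", "b", "b", "b")
)

p <- ggplot(toy) +
  geom_freqpoly(aes(x = x, colour = g, y = after_stat(bin_prop)),
                stat = StatBinProp, binwidth = 1, boundary = 0.5)

# Inspect the computed values: y holds bin_prop for each group and bin
d <- layer_data(p)
print(d[, c("group", "x", "y")])
```

Because the stat object is passed directly to `stat =`, geom_freqpoly does not add its usual `pad = TRUE` behaviour, so the line ends would look slightly different from the default freqpoly.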
