Improving serialization of ggplot2-objects #6856

@koefoeden

Hi! First of all, thanks for maintaining an amazing tool! I'm wondering if the maintainers would reconsider their stance on improving how serialization of ggplot2 objects is handled. Currently, as described in multiple old issues (#3619, #3994, #4056), serialized ggplot objects retain any (potentially large) objects that were present during the "plot build process". See below for another minimal reprex:

library(ggplot2)

f <- function(create_large_object = FALSE) {
  if (create_large_object) {
    object_in_env <- rnorm(1e6)
  }

  p <- ggplot(cars) +
    geom_point(aes(speed, dist))

  path <- tempfile(fileext = ".rds")
  saveRDS(p, path)
  (file.info(path)$size) / 1024^2
}

cat("without large object: ", f(FALSE), "MB\n") # returns ~1 MB
cat("with large object: ", f(TRUE), "MB\n") # returns ~8 MB
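As far as I understand it, the underlying mechanism is not specific to ggplot2: anything that keeps a reference to the environment it was created in drags that environment along when serialized. The base R sketch below is my own illustration of that, not code from ggplot2; the names `serialized_size_mb` and `g` are made up for the example.

```r
# Base R only: a closure keeps a reference to the environment it was created
# in, and serialize() writes that environment out together with the function.
serialized_size_mb <- function(create_large_object = FALSE) {
  if (create_large_object) {
    object_in_env <- rnorm(1e6)  # never used by `g`, but lives in the same frame
  }
  g <- function() NULL           # enclosing environment is this call's frame
  length(serialize(g, NULL)) / 1024^2
}

size_without <- serialized_size_mb(FALSE)
size_with <- serialized_size_mb(TRUE)
cat("without large object:", size_without, "MB\n")
cat("with large object:   ", size_with, "MB\n")
```

The closure `g` never touches `object_in_env`, yet the serialized size grows by roughly the size of that vector, which mirrors what happens to the plot in the reprex above.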

I think this behavior is quite unexpected and likely causes a lot of frustration, especially because neither the behavior itself nor the fact that serialization is not recommended is well documented. Furthermore, I argue that, in contrast to previous opinions by the maintainers, there is a lot of value in being able to serialize ggplots without these problems. As an example, serialization enables a structured and "tiered" analysis process, where initial plots from an exploratory phase (potentially saved both as image files and as ggplot objects) can be re-labelled, combined and/or re-formatted using the regular, powerful ggplot semantics when it's time for publication (or a similar goal). This is especially true when using ggplots in data analysis pipelines, facilitated by "targets" or similar frameworks, where serialization is used heavily, and where concerns of reproducibility and ggplot2 version incompatibility are naturally handled.

Furthermore, I think the suggested alternatives, such as changing how the plotting is done (e.g. building the plot in its own environment), are often not feasible, because the plots will be created by external packages outside of the user's control. Another suggested alternative is to re-create plots from scratch, but this of course introduces a whole host of other problems. I am a huge fan of how ggplot2 has become a standard for plotting in R, but I think it is a big limitation that you cannot use serialization to edit the plots yourself after the fact in these problematic cases.
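For reference, this is roughly what I understand the "own environment" workaround to look like. It is a sketch under my own assumptions: `make_plot` and `rds_size_mb` are hypothetical helpers, and I am assuming the plot only holds on to the frame in which `ggplot()` and `aes()` were called.

```r
library(ggplot2)

# Hypothetical helper: its frame only contains `data`, and its enclosing
# environment is wherever it was defined (here the global environment),
# not the frame that holds the large object.
make_plot <- function(data) {
  ggplot(data) +
    geom_point(aes(speed, dist))
}

rds_size_mb <- function(isolate = FALSE) {
  object_in_env <- rnorm(1e6)  # large object unrelated to the plot
  p <- if (isolate) {
    make_plot(cars)            # plot built in the helper's small frame
  } else {
    ggplot(cars) +             # plot built in this frame, next to the large object
      geom_point(aes(speed, dist))
  }
  path <- tempfile(fileext = ".rds")
  on.exit(unlink(path))
  saveRDS(p, path)
  file.info(path)$size / 1024^2
}

cat("built inline:      ", rds_size_mb(FALSE), "MB\n")
cat("built in a helper: ", rds_size_mb(TRUE), "MB\n")
```

This only helps when you control the code that calls `ggplot()`. When the plot is returned by a function in someone else's package, there is no place to apply this pattern, which is exactly the situation I am describing.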

I realize that I have no grasp on how challenging/problematic this is to revise - but I wonder if it might be possible to (at least) alleviate some of the most common and unexpected cases, such as the reprex above, where a totally irrelevant object is contaminating the serialized file.

Looking forward to hearing what you think!
