% Options for packages loaded elsewhere
%DIF LATEXDIFF DIFFERENCE FILE
%DIF DEL index-revised.tex Wed Jun 4 09:33:00 2025
%DIF ADD index.tex Wed Jun 4 09:38:05 2025
\PassOptionsToPackage{unicode}{hyperref}
\PassOptionsToPackage{hyphens}{url}
\PassOptionsToPackage{dvipsnames,svgnames,x11names}{xcolor}
%
\documentclass[
10pt,
]{article}
\usepackage{amsmath,amssymb}
\usepackage{iftex}
\ifPDFTeX
\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}
\usepackage{textcomp} % provide euro and other symbols
\else % if luatex or xetex
\usepackage{unicode-math}
\defaultfontfeatures{Scale=MatchLowercase}
\defaultfontfeatures[\rmfamily]{Ligatures=TeX,Scale=1}
\fi
\usepackage{lmodern}
\ifPDFTeX\else
% xetex/luatex font selection
\setmainfont[]{Helvetica}
\setmonofont[]{Roboto}
\fi
% Use upquote if available, for straight quotes in verbatim environments
\IfFileExists{upquote.sty}{\usepackage{upquote}}{}
\IfFileExists{microtype.sty}{% use microtype if available
\usepackage[]{microtype}
\UseMicrotypeSet[protrusion]{basicmath} % disable protrusion for tt fonts
}{}
\makeatletter
\@ifundefined{KOMAClassName}{% if non-KOMA class
\IfFileExists{parskip.sty}{%
\usepackage{parskip}
}{% else
\setlength{\parindent}{0pt}
\setlength{\parskip}{6pt plus 2pt minus 1pt}}
}{% if KOMA class
\KOMAoptions{parskip=half}}
\makeatother
\usepackage{xcolor}
\setlength{\emergencystretch}{3em} % prevent overfull lines
\setcounter{secnumdepth}{2}
% Make \paragraph and \subparagraph free-standing
\makeatletter
\ifx\paragraph\undefined\else
\let\oldparagraph\paragraph
\renewcommand{\paragraph}{
\@ifstar
\xxxParagraphStar
\xxxParagraphNoStar
}
\newcommand{\xxxParagraphStar}[1]{\oldparagraph*{#1}\mbox{}}
\newcommand{\xxxParagraphNoStar}[1]{\oldparagraph{#1}\mbox{}}
\fi
\ifx\subparagraph\undefined\else
\let\oldsubparagraph\subparagraph
\renewcommand{\subparagraph}{
\@ifstar
\xxxSubParagraphStar
\xxxSubParagraphNoStar
}
\newcommand{\xxxSubParagraphStar}[1]{\oldsubparagraph*{#1}\mbox{}}
\newcommand{\xxxSubParagraphNoStar}[1]{\oldsubparagraph{#1}\mbox{}}
\fi
\makeatother
\providecommand{\tightlist}{%
\setlength{\itemsep}{0pt}\setlength{\parskip}{0pt}}
\usepackage{longtable,booktabs,array}
\usepackage{calc} % for calculating minipage widths
% Correct order of tables after \paragraph or \subparagraph
\usepackage{etoolbox}
\makeatletter
\patchcmd\longtable{\par}{\if@noskipsec\mbox{}\fi\par}{}{}
\makeatother
% Allow footnotes in longtable head/foot
\IfFileExists{footnotehyper.sty}{\usepackage{footnotehyper}}{\usepackage{footnote}}
\makesavenoteenv{longtable}
\usepackage{graphicx}
\makeatletter
\newsavebox\pandoc@box
\newcommand*\pandocbounded[1]{% scales image to fit in text height/width
\sbox\pandoc@box{#1}%
\Gscale@div\@tempa{\textheight}{\dimexpr\ht\pandoc@box+\dp\pandoc@box\relax}%
\Gscale@div\@tempb{\linewidth}{\wd\pandoc@box}%
\ifdim\@tempb\p@<\@tempa\p@\let\@tempa\@tempb\fi% select the smaller of both
\ifdim\@tempa\p@<\p@\scalebox{\@tempa}{\usebox\pandoc@box}%
\else\usebox{\pandoc@box}%
\fi%
}
% Set default figure placement to htbp
\def\fps@figure{htbp}
\makeatother
% definitions for citeproc citations
\NewDocumentCommand\citeproctext{}{}
\NewDocumentCommand\citeproc{mm}{%
\begingroup\def\citeproctext{#2}\cite{#1}\endgroup}
\makeatletter
% allow citations to break across lines
\let\@cite@ofmt\@firstofone
% avoid brackets around text for \cite:
\def\@biblabel#1{}
\def\@cite#1#2{{#1\if@tempswa , #2\fi}}
\makeatother
\newlength{\cslhangindent}
\setlength{\cslhangindent}{1.5em}
\newlength{\csllabelwidth}
\setlength{\csllabelwidth}{3em}
\newenvironment{CSLReferences}[2] % #1 hanging-indent, #2 entry-spacing
{\begin{list}{}{%
\setlength{\itemindent}{0pt}
\setlength{\leftmargin}{0pt}
\setlength{\parsep}{0pt}
% turn on hanging indent if param 1 is 1
\ifodd #1
\setlength{\leftmargin}{\cslhangindent}
\setlength{\itemindent}{-1\cslhangindent}
\fi
% set entry spacing
\setlength{\itemsep}{#2\baselineskip}}}
{\end{list}}
\usepackage{calc}
\newcommand{\CSLBlock}[1]{\hfill\break\parbox[t]{\linewidth}{\strut\ignorespaces#1\strut}}
\newcommand{\CSLLeftMargin}[1]{\parbox[t]{\csllabelwidth}{\strut#1\strut}}
\newcommand{\CSLRightInline}[1]{\parbox[t]{\linewidth - \csllabelwidth}{\strut#1\strut}}
\newcommand{\CSLIndent}[1]{\hspace{\cslhangindent}#1}
\usepackage{orcidlink}
\definecolor{mypink}{RGB}{219, 48, 122}
\usepackage[dvipsnames]{xcolor} % colors
\renewcommand{\thefootnote}{\arabic{footnote}}
\newcommand{\ear}[1]{{\textcolor{blue}{#1}}}
\newcommand{\svp}[1]{{\textcolor{RedOrange}{#1}}}
\newcommand{\hh}[1]{{\textcolor{Green}{#1}}}
\makeatletter
\@ifpackageloaded{caption}{}{\usepackage{caption}}
\AtBeginDocument{%
\ifdefined\contentsname
\renewcommand*\contentsname{Table of contents}
\else
\newcommand\contentsname{Table of contents}
\fi
\ifdefined\listfigurename
\renewcommand*\listfigurename{List of Figures}
\else
\newcommand\listfigurename{List of Figures}
\fi
\ifdefined\listtablename
\renewcommand*\listtablename{List of Tables}
\else
\newcommand\listtablename{List of Tables}
\fi
\ifdefined\figurename
\renewcommand*\figurename{Figure}
\else
\newcommand\figurename{Figure}
\fi
\ifdefined\tablename
\renewcommand*\tablename{Table}
\else
\newcommand\tablename{Table}
\fi
}
\@ifpackageloaded{float}{}{\usepackage{float}}
\floatstyle{ruled}
\@ifundefined{c@chapter}{\newfloat{codelisting}{h}{lop}}{\newfloat{codelisting}{h}{lop}[chapter]}
\floatname{codelisting}{Listing}
\newcommand*\listoflistings{\listof{codelisting}{List of Listings}}
\makeatother
\makeatletter
\makeatother
\makeatletter
\@ifpackageloaded{caption}{}{\usepackage{caption}}
\@ifpackageloaded{subcaption}{}{\usepackage{subcaption}}
\makeatother
\usepackage{bookmark}
\IfFileExists{xurl.sty}{\usepackage{xurl}}{} % add URL line breaks if available
\urlstyle{same} % disable monospaced font for URLs
\hypersetup{
pdftitle={A Guide to Designing Experiments to Test Statistical Graphics},
colorlinks=true,
linkcolor={blue},
filecolor={Maroon},
citecolor={Blue},
urlcolor={red},
pdfcreator={LaTeX via pandoc}}
\title{A Guide to Designing Experiments to Test Statistical
Graphics\thanks{The author(s) received no specific funding for this
work.}}
%% Author information
\author{%
%
Emily Robinson%\footnote{Email: erobin17@calpoly.edu}
~\orcidlink{0000-0001-9800-7304}%
%
\\{\footnotesize Statistics Department}, {\footnotesize California
Polytechnic State University}\\%
Heike Hofmann%\footnote{Email: hhofmann4@unl.edu}
~\orcidlink{0000-0001-6216-5183}%
%
\\{\footnotesize Statistics Department}, {\footnotesize University of
Nebraska - Lincoln}\\%
Susan Vanderplas%\footnote{Email: svanderplas2@unl.edu}
~\orcidlink{0000-0002-3803-0972}%
\footnote{Corresponding author: Susan Vanderplas, svanderplas2@unl.edu}%
\\{\footnotesize Statistics Department}, {\footnotesize University of
Nebraska - Lincoln}\\
}
%DIF PREAMBLE EXTENSION ADDED BY LATEXDIFF
%DIF UNDERLINE PREAMBLE %DIF PREAMBLE
\RequirePackage[normalem]{ulem} %DIF PREAMBLE
\RequirePackage{color}\definecolor{RED}{rgb}{1,0,0}\definecolor{BLUE}{rgb}{0,0,1} %DIF PREAMBLE
\providecommand{\DIFadd}[1]{{\protect\color{blue}\uwave{#1}}} %DIF PREAMBLE
\providecommand{\DIFdel}[1]{{\protect\color{red}\sout{#1}}} %DIF PREAMBLE
%DIF SAFE PREAMBLE %DIF PREAMBLE
\providecommand{\DIFaddbegin}{} %DIF PREAMBLE
\providecommand{\DIFaddend}{} %DIF PREAMBLE
\providecommand{\DIFdelbegin}{} %DIF PREAMBLE
\providecommand{\DIFdelend}{} %DIF PREAMBLE
\providecommand{\DIFmodbegin}{} %DIF PREAMBLE
\providecommand{\DIFmodend}{} %DIF PREAMBLE
%DIF FLOATSAFE PREAMBLE %DIF PREAMBLE
\providecommand{\DIFaddFL}[1]{\DIFadd{#1}} %DIF PREAMBLE
\providecommand{\DIFdelFL}[1]{\DIFdel{#1}} %DIF PREAMBLE
\providecommand{\DIFaddbeginFL}{} %DIF PREAMBLE
\providecommand{\DIFaddendFL}{} %DIF PREAMBLE
\providecommand{\DIFdelbeginFL}{} %DIF PREAMBLE
\providecommand{\DIFdelendFL}{} %DIF PREAMBLE
\newcommand{\DIFscaledelfig}{0.5}
%DIF HIGHLIGHTGRAPHICS PREAMBLE %DIF PREAMBLE
\RequirePackage{settobox} %DIF PREAMBLE
\RequirePackage{letltxmacro} %DIF PREAMBLE
\newsavebox{\DIFdelgraphicsbox} %DIF PREAMBLE
\newlength{\DIFdelgraphicswidth} %DIF PREAMBLE
\newlength{\DIFdelgraphicsheight} %DIF PREAMBLE
% store original definition of \includegraphics %DIF PREAMBLE
\LetLtxMacro{\DIFOincludegraphics}{\includegraphics} %DIF PREAMBLE
\newcommand{\DIFaddincludegraphics}[2][]{{\color{blue}\fbox{\DIFOincludegraphics[#1]{#2}}}} %DIF PREAMBLE
\newcommand{\DIFdelincludegraphics}[2][]{% %DIF PREAMBLE
\sbox{\DIFdelgraphicsbox}{\DIFOincludegraphics[#1]{#2}}% %DIF PREAMBLE
\settoboxwidth{\DIFdelgraphicswidth}{\DIFdelgraphicsbox} %DIF PREAMBLE
\settoboxtotalheight{\DIFdelgraphicsheight}{\DIFdelgraphicsbox} %DIF PREAMBLE
\scalebox{\DIFscaledelfig}{% %DIF PREAMBLE
\parbox[b]{\DIFdelgraphicswidth}{\usebox{\DIFdelgraphicsbox}\\[-\baselineskip] \rule{\DIFdelgraphicswidth}{0em}}\llap{\resizebox{\DIFdelgraphicswidth}{\DIFdelgraphicsheight}{% %DIF PREAMBLE
\setlength{\unitlength}{\DIFdelgraphicswidth}% %DIF PREAMBLE
\begin{picture}(1,1)% %DIF PREAMBLE
\thicklines\linethickness{2pt} %DIF PREAMBLE
{\color[rgb]{1,0,0}\put(0,0){\framebox(1,1){}}}% %DIF PREAMBLE
{\color[rgb]{1,0,0}\put(0,0){\line( 1,1){1}}}% %DIF PREAMBLE
{\color[rgb]{1,0,0}\put(0,1){\line(1,-1){1}}}% %DIF PREAMBLE
\end{picture}% %DIF PREAMBLE
}\hspace*{3pt}}} %DIF PREAMBLE
} %DIF PREAMBLE
\LetLtxMacro{\DIFOaddbegin}{\DIFaddbegin} %DIF PREAMBLE
\LetLtxMacro{\DIFOaddend}{\DIFaddend} %DIF PREAMBLE
\LetLtxMacro{\DIFOdelbegin}{\DIFdelbegin} %DIF PREAMBLE
\LetLtxMacro{\DIFOdelend}{\DIFdelend} %DIF PREAMBLE
\DeclareRobustCommand{\DIFaddbegin}{\DIFOaddbegin \let\includegraphics\DIFaddincludegraphics} %DIF PREAMBLE
\DeclareRobustCommand{\DIFaddend}{\DIFOaddend \let\includegraphics\DIFOincludegraphics} %DIF PREAMBLE
\DeclareRobustCommand{\DIFdelbegin}{\DIFOdelbegin \let\includegraphics\DIFdelincludegraphics} %DIF PREAMBLE
\DeclareRobustCommand{\DIFdelend}{\DIFOdelend \let\includegraphics\DIFOincludegraphics} %DIF PREAMBLE
\LetLtxMacro{\DIFOaddbeginFL}{\DIFaddbeginFL} %DIF PREAMBLE
\LetLtxMacro{\DIFOaddendFL}{\DIFaddendFL} %DIF PREAMBLE
\LetLtxMacro{\DIFOdelbeginFL}{\DIFdelbeginFL} %DIF PREAMBLE
\LetLtxMacro{\DIFOdelendFL}{\DIFdelendFL} %DIF PREAMBLE
\DeclareRobustCommand{\DIFaddbeginFL}{\DIFOaddbeginFL \let\includegraphics\DIFaddincludegraphics} %DIF PREAMBLE
\DeclareRobustCommand{\DIFaddendFL}{\DIFOaddendFL \let\includegraphics\DIFOincludegraphics} %DIF PREAMBLE
\DeclareRobustCommand{\DIFdelbeginFL}{\DIFOdelbeginFL \let\includegraphics\DIFdelincludegraphics} %DIF PREAMBLE
\DeclareRobustCommand{\DIFdelendFL}{\DIFOdelendFL \let\includegraphics\DIFOincludegraphics} %DIF PREAMBLE
%DIF AMSMATHULEM PREAMBLE %DIF PREAMBLE
\makeatletter %DIF PREAMBLE
\let\sout@orig\sout %DIF PREAMBLE
\renewcommand{\sout}[1]{\ifmmode\text{\sout@orig{\ensuremath{#1}}}\else\sout@orig{#1}\fi} %DIF PREAMBLE
\makeatother %DIF PREAMBLE
%DIF COLORLISTINGS PREAMBLE %DIF PREAMBLE
\RequirePackage{listings} %DIF PREAMBLE
\RequirePackage{color} %DIF PREAMBLE
\lstdefinelanguage{DIFcode}{ %DIF PREAMBLE
%DIF DIFCODE_UNDERLINE %DIF PREAMBLE
moredelim=[il][\color{red}\sout]{\%DIF\ <\ }, %DIF PREAMBLE
moredelim=[il][\color{blue}\uwave]{\%DIF\ >\ } %DIF PREAMBLE
} %DIF PREAMBLE
\lstdefinestyle{DIFverbatimstyle}{ %DIF PREAMBLE
language=DIFcode, %DIF PREAMBLE
basicstyle=\ttfamily, %DIF PREAMBLE
columns=fullflexible, %DIF PREAMBLE
keepspaces=true %DIF PREAMBLE
} %DIF PREAMBLE
\lstnewenvironment{DIFverbatim}{\lstset{style=DIFverbatimstyle}}{} %DIF PREAMBLE
\lstnewenvironment{DIFverbatim*}{\lstset{style=DIFverbatimstyle,showspaces=true}}{} %DIF PREAMBLE
\lstset{extendedchars=true,inputencoding=utf8}
%DIF END PREAMBLE EXTENSION ADDED BY LATEXDIFF
\begin{document}
\maketitle
\begin{abstract}
In this paper, we discuss considerations and methods for experimentally
testing visualizations. We discuss levels of user engagement with
graphics, common issues when developing a sampling or data generation
model, the importance of pilot testing, and data analysis methods. Along
the way, we also provide recommendations of how to avoid some of the
unique pitfalls of human testing in statistical and visualization
research.
\end{abstract}
\section{Introduction}\label{introduction}
Data visualizations are a critically important tool for communicating
scientific information to the public in what creators hope is an
easy-to-digest, visually attractive form. There are many strategies for
creating charts and graphs, from Tufte-esque minimalism (Tufte, 1991) to
charts designed with extra imagery and aesthetic appeal that draw the
viewer's attention and persist in memory (Cairo, 2012). For a specific
type of data, there are also usually many different chart forms to
display that data: for instance, if we have a set of categorical data
and we wish to show the relative proportions of each category, we could
do so using a stacked bar chart or the polar equivalent, a pie chart.
There have been several attempts to list out all of the types of charts
(Ribecca, 2022), create a taxonomy of charts (Bertin \& Berg, 1983;
Desnoyers, 2011), and even to create charts using a domain-specific
grammar of graphics (Wilkinson, 1999) that is also useful for
classification. One extremely useful reference is the website From Data
to Viz (https://www.data-to-viz.com/), which uses a decision tree to
show different visualizations compatible with the data; R, Python,
D3.js, and React code are provided to demonstrate how to create those
visualizations. With all of the different design choices available, how
are chart creators to know what is the best approach for communicating
data to the appropriate audience?
While there are heuristics, general guidelines, and best practices
(Allen \& Erhardt, 2016; Few, 2006; Haemer, 1948; Kosslyn, 2006; Joint
Committee on Standards for Graphic Presentation, 1915) for creating
useful and visually attractive data
displays, the best way to establish the efficacy of various design
decisions is to test the visualization on humans, evaluating different
variants under controlled conditions (Cleveland et al., 1988; Cleveland
\& McGill, 1985). Empirical assessments of visualizations, when
carefully designed, allow statisticians to determine which
representation of the same data is most effective along one or more
dimension(s) of interest: estimation or prediction accuracy, within- or
between-group comparisons, response time, and the ability to make
real-world decisions are common goals for charts.
It is extremely challenging to design studies which strike the right
balance between experimental control (i.e.~internal validity) and
generalizability to a wider context (i.e.~external validity). Simply
asking people to read quantities off of a graph may not generalize
beyond the questions asked or the data used in the chart (Croxton, 1932;
Croxton \& Stryker, 1927; Eells, 1926; Huhn, 1927), but designing a
study that is sufficiently robust to those issues requires manipulation
or control of so many factors that the number of participants and trials
quickly becomes daunting or unaffordable. In addition, when conducting
graphics experiments, researchers are in the unusual position of being
both the subject matter expert and the statistician, providing an
uncommon degree of control over not just the experimental design but
also the specific treatments, levels, and experimental protocols. The
number
of choices required to develop, pilot, and run an experiment can be
overwhelming. In this paper, we attempt to distill the experience gained
from conducting several different types of graphics experiments (Hofmann
et al., 2012; Robinson, 2022; Vanderplas et al., 2019, 2024; Vanderplas
\& Hofmann, 2015, 2017), discussing the use of different testing methods
(Vanderplas et al., 2020), the process of designing a graphical
experiment, and analysis of the resulting empirical data. It is our hope
that this paper will lower the barriers that exist for conducting
empirical graphics research and reduce the probability of costly
mistakes.
Section~\ref{sec-testing-methods} discusses different methods for
testing graphics, and which methods best address different levels of
user engagement. In Section~\ref{sec-model-dev}, we discuss the process
of developing the data-generating model used to control the statistical
features of data in the tested visualizations. Model development is a
nuanced and iterative process that ultimately determines the success and
generalizability of the experimental results. In
Section~\ref{sec-exp-dev}, we discuss the design of the experimental
protocol - the choice of platform, number and type of trials, and flow
of the experiment. We briefly consider different experimental design
considerations in Section~\ref{sec-exp-design}, but focus primarily on
factors specific to graphics experiments, and then move to the
importance of pilot testing in Section~\ref{sec-pilot-test}. Finally, we
provide some common analysis strategies in Section~\ref{sec-analysis},
including strategies for handling the unexpected data features which are
so common in graphical testing experiments.
\section{Testing Methods and User Engagement}\label{sec-testing-methods}
There are many different testing methods used to empirically assess
statistical graphics. This paper uses studies conducted online without
additional equipment as primary examples, though many of the same
considerations apply to in-person experiments conducted using additional
equipment, including 3D printed charts, eye-tracking equipment, and
interactive data displays. Online experiments have lower overhead, offer
relatively fast data collection, and provide useful results for
well-designed experiments. The toolkit used for these experiments is
R-based (R Core Team, 2022), and includes ggplot2 (Wickham, 2016) and
Shiny (Chang et al., 2021) as primary components. In many experiments,
we customized the Shiny interface with JavaScript and D3 (Bostock et
al., 2011), enabling interactive graphics, use of SVGs, and other useful
extensions. While we prefer this set of tools, most of the observations
described here apply to a wide variety of different workflows for
graphical experimentation, including in-person experiments.
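To make the toolkit concrete, the sketch below shows a minimal Shiny
application of the sort described above: a ggplot2 stimulus is
displayed, and a yes/no perceptual response is recorded along with the
time taken. It is an illustrative skeleton rather than one of our study
applications; the stimulus model, question wording, and the
\texttt{responses.csv} output file are all placeholders.
\begin{verbatim}
library(shiny)
library(ggplot2)

ui <- fluidPage(
  plotOutput("stimulus"),
  radioButtons("detect", "Do you see a linear trend?",
               choices = c("Yes", "No"), selected = character(0)),
  actionButton("submit", "Submit")
)

server <- function(input, output, session) {
  shown_at <- Sys.time()
  dat <- data.frame(x = runif(30))
  dat$y <- 0.5 * dat$x + rnorm(30, sd = 0.2)   # placeholder stimulus model

  output$stimulus <- renderPlot(ggplot(dat, aes(x, y)) + geom_point())

  observeEvent(input$submit, {
    req(input$detect)   # require a response before logging
    # append the response and the elapsed time (in seconds) to a file
    write.table(
      data.frame(response = input$detect,
                 seconds  = as.numeric(difftime(Sys.time(), shown_at,
                                                units = "secs"))),
      "responses.csv", append = TRUE, sep = ",",
      col.names = FALSE, row.names = FALSE)
  })
}

shinyApp(ui, server)
\end{verbatim}
In practice, responses are usually written to a database rather than a
flat file, and participant and trial identifiers are recorded alongside
each response.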
It is important to consider the level of user engagement which is
necessary to complete a particular visual or graphical task. For
instance, testing whether someone can detect an effect such as a linear
trend in noisy data is a perceptual question. Perceptual questions are
often examined experimentally using methods which allow the user to
interact with the data on a basic visual level: users are presented with
a visual stimulus and answer yes/no questions to indicate whether the
effect is detected. Numerical estimation is another common task when
testing graphics: in these experiments, the participant views a chart,
estimates the requested numerical quantity, and enters the estimate into
the application through a numerical input, slider, or other form
element. Sometimes, it is possible to set up a scenario where the user
adjusts the plot using a set of controls designed to provide a fixed set
of interactive operations. This type of user engagement was used to
assess the strength of the sine illusion (Vanderplas \& Hofmann, 2015):
users adjusted the strength of a transformation designed to correct the
illusion until the lines appeared to be the same length, as shown in
Figure~\ref{fig-sine-illusion}, providing a direct measure of the
magnitude of the sine illusion's effect. In other situations, it may be
preferable to have the user directly interact with the visual stimulus.
In Vanderplas et al. (2024), participants were asked to rotate and
interact with a 3D rendered bar chart; the application recorded user
interactions and corresponding rotation matrices, providing insight into
the visual comparisons the user may have been performing. This
information was used as a supplement to the explicitly provided
estimates, providing some contextual information as well as the ability
to identify the level of participant engagement with the questions. When
experiments are conducted as part of classroom experiential learning, it
is sometimes helpful to be able to separate the low-effort participants
from those who were fully intellectually engaged in the task.
Interactive graphics provide another level of user engagement that can
be much more open-ended. With interactive graphics, researchers can ask
participants to directly annotate plots, toggle aesthetics, and
highlight groups and plot features. Careful implementation of the
experiment application may allow for each of these interactions to be
recorded and analyzed, producing a rich, if messy, set of data that may
allow researchers to tease apart visual estimation error from common
shortcuts such as rounding used during direct numerical estimation.
\begin{figure}
\centering{
\pandocbounded{\includegraphics[keepaspectratio]{images/sine_illusion_screenshot.png}}
}
\caption{\label{fig-sine-illusion}Direct adjustment of a plot in a
perceptual task. In this experiment, designed to assess the strength of
the sine illusion, the user adjusts the plot using - and + buttons,
which control the strength of a transformation designed to correct the
effect of the sine illusion. When the user is satisfied that the lines
are of equal length, they select the `Finished' button to move to the
next task. The experiment used a psychophysics experimental design, the
method of adjustment, but leveraged the interactive Shiny interface to
record the entire sequence of adjustments made by the user for each
trial. A demo version of this application can be found at
https://shiny.srvanderplas.com/sine-illusion/.}
\end{figure}%
Visual inference (Buja et al., 2009; Wickham et al., 2010) is another
useful testing tool for perceptual questions such as ``which chart
displays this data more clearly'' (Hofmann et al., 2012) while
simultaneously assessing the statistical significance of the graphical
finding in a chart. Visual inference charts are often called ``lineups''
in analogy to the criminal procedure where the suspect is placed in a
line with several other individuals with similar characteristics. In a
graphical lineup procedure, there is a target plot containing the real
data, embedded in an array of (typically) 19 innocent ``null'' plots
generated through resampling or simulation, for a total of 20 panels. If
viewers consistently pick the target plot at a higher rate than any of
the null plots, the target plot is said to be visually significant (Loy
\& Hofmann, 2013; Majumder et al., 2013) and a ``see'' value, the visual
analogue of a \(p\)-value (Chowdhury et al., 2020), can be calculated
using the \texttt{vinference} R package or the process described in
Vanderplas et al. (2021). The details of this calculation are beyond the
scope of this broader discussion of how to test charts, but more detail
on visual inference is provided in \textless insert citation to visual
inference WIRE article under development\textgreater.
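For readers who want to see the mechanics, the following sketch
constructs a simple one-target lineup by hand: nineteen null panels are
generated by permuting the response (one possible null-generating
scheme among many), and axis context is stripped as discussed later in
this section. The \texttt{nullabor} R package offers a fuller
implementation of these protocols and should be preferred in practice.
\begin{verbatim}
library(ggplot2)
set.seed(42)

real <- data.frame(x = runif(50))
real$y <- real$x + rnorm(50, sd = 0.3)   # placeholder "real" data

target <- sample(20, 1)                  # position of the target panel
panels <- do.call(rbind, lapply(1:20, function(i) {
  d <- real
  if (i != target) d$y <- sample(d$y)    # permutation null
  cbind(d, panel = i)
}))

# 20-panel lineup; axis text and titles are removed to strip context
ggplot(panels, aes(x, y)) +
  geom_point() +
  facet_wrap(~ panel) +
  theme(axis.text = element_blank(), axis.title = element_blank(),
        axis.ticks = element_blank())
\end{verbatim}
Recording \texttt{target} alongside each participant's selection is all
that is needed to tally how often the target panel is picked relative
to the null panels.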
In another variation of the statistical lineup procedure, data generated
from two models are compared, with target plots from each model embedded
in the array of \(K\) total plots. The \(K-2\) null plots are
constructed from a mixture model that blends the two competing models
(Vanderplas \& Hofmann, 2017). Viewers are asked to select the panel(s)
which are most different, and the primary source of information is
trials in which viewers selected the target from one model but not the
other, indicating that the display method used allowed viewers to
differentiate one model's data (but not the other) from the nulls
created through a mixture model. This variation allows the experimenter
to assess graphical design choices to determine whether they effectively
emphasize structural differences in the data (Vanderplas \& Hofmann,
2017).
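A sketch of the data generation for this two-target variant is below,
continuing the style of the previous example; \texttt{model\_A},
\texttt{model\_B}, and the 50/50 mixture weight are hypothetical
stand-ins, and Vanderplas \& Hofmann (2017) describe the mixture
construction actually used.
\begin{verbatim}
set.seed(1)
n <- 50; K <- 20

model_A <- function(n) {              # stand-in: linear trend
  x <- runif(n); data.frame(x = x, y = x + rnorm(n, sd = 0.25))
}
model_B <- function(n) {              # stand-in: two clusters
  x <- runif(n); data.frame(x = x, y = round(x) + rnorm(n, sd = 0.25))
}
mixture <- function(n, lambda = 0.5) {
  # each point is drawn from model A with probability lambda
  n_A <- rbinom(1, n, lambda)
  rbind(model_A(n_A), model_B(n - n_A))
}

targets <- sample(K, 2)               # positions of the two target panels
panels <- lapply(1:K, function(i) {
  if (i == targets[1]) model_A(n)
  else if (i == targets[2]) model_B(n)
  else mixture(n)                     # K - 2 mixture-model nulls
})
# each panels[[i]] can then be plotted as in the previous sketch
\end{verbatim}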
One advantage of the visual inference technique is that the experimenter
can ask a very general question, such as ``which of these plots is the
most different?'', rather than a specific question about the displayed
data which may require more quantitative sophistication. All of the
necessary information to make the decision is embedded in the choice of
the model used to generate the null plots. This feature is extremely
convenient when conducting the experiment and even allows small children
to complete the task. The downside is that as a result, visual inference
experiments do not allow experimenters to assess the viewer's
understanding of the information shown in the chart. In most cases,
visual inference experiments remove any contextual information from the
charts, including axis labels and values, plot titles, and so on, in
order to encourage participants to make decisions based solely on the
graphical presentation. This lack of context is a double-edged sword:
visual inference can involve participants who do not have any
mathematical training or instincts (including children), but researchers
also cannot use this technique to assess higher levels of engagement
with a chart, such as estimation, prediction, or reasoning based on
displayed information.\\
To assess the viewer's \emph{understanding} of information shown in a
chart, we must ask questions and allow the user to provide feedback.
User feedback may be collected on a numerical scale or through the use
of written comments, recorded ``think-aloud'' processes, and other more
qualitative interaction methods. In some studies, asking users to
interpret a chart within a larger scenario can be effective, as in
Figure~\ref{fig-estimation-describe}, while in others it is more helpful
to ask users to explain answers. In visual inference studies, asking
users why a specific panel was chosen has been demonstrated to provide
rich insight into otherwise confusing numerical results (Vanderplas \&
Hofmann, 2017).
Think-aloud methods ask the viewer to narrate their internal thought
process, either during or after completing a task (Haak et al., 2003).
These recordings (or transcripts) can provide valuable insights into
conscious cognition, and are often used when conducting usability
studies. While we have not to date recorded users talking out loud about
what they are seeing during a study, think-aloud methods could easily be
implemented within a Shiny application, with audio recordings saved to
the server for transcription and analysis (Dunbar, 1995; Kirschenbaum,
2003; Trafton et al., 2000). It is even possible that these recordings
could be automatically transcribed using speech-to-text models. We have
used think-aloud methods informally during pilot studies to ``harden''
graphical experiments and verify the selection of parameters used in an
experiment. The success of this approach, combined with the few studies
which used think-aloud to assess charts (Haider et al., 2021; Kulhavy et
al., 1992; Lee et al., 2016), suggests that think-aloud methods are an
often-overlooked but useful tool for assessing data visualizations.
\begin{figure}
\centering{
\pandocbounded{\includegraphics[keepaspectratio]{images/Estimation_describe_plot_crop.png}}
}
\caption{\label{fig-estimation-describe}This question asks users to
write out a description of how the population of Ewoks changes over
time, without any further cues, to determine whether participants
default to multiplicative or additive language descriptions.}
\end{figure}%
Of course, in an online, asynchronous experiment, every user interaction
with the testing materials (typically hosted on a web page) can also be
recorded along with time stamps, mouse positions, browser size and
screen resolution, and other information. While we have not used this
type of information heavily in our experimental analyses thus far, in
most experiments we collect time stamp data in order to assess how long
participants spend on each question. Typically, the first round of test
questions takes the longest for participants to complete. Additional
replicates do not usually affect accuracy (i.e.~there is no immediate
learning effect) until, after `too many' trials, cognitive fatigue
proves detrimental to accuracy (Chowdhury et al., 2018). This sweet spot
between replicates and fatigue depends on the cognitive burden in each
test and should factor into designing the experiment. In some
experiments, we have provided participants with supportive tools, such
as ``scratch pads'' and calculators built into the Shiny application to
support the complex calculations required to answer higher-level
numerical estimation questions (Figure~\ref{fig-estimation-calc}). In
order to be supportive, the tools must be easy to use, but assuming this
bar is met, the tools can reduce participant cognitive load while
recording a wealth of information. This information provides real
insight into how participants were looking at the data, what strategies
they tried and discarded for reading the chart, and what visual
estimation methods were used. While systematic analysis and modeling of
this data may be difficult, as it is usually messy and often must be
manually coded, the insights provided can be extremely useful. However,
unless participants are required to use these tools, it is difficult to
gather comprehensive information - those participants who do not use
supportive tools likely differ in meaningful ways from those who do. As
a result, the information gathered from supportive tools likely does not
generalize to the entire sample.
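As an illustration of this kind of event logging, the fragment below
(from inside a Shiny server function) appends every interaction with a
supportive tool to a log file; \texttt{input\$calc\_key} is a
hypothetical input representing a calculator keypress.
\begin{verbatim}
# inside server(input, output, session): log each calculator keypress
observeEvent(input$calc_key, {
  cat(sprintf("%s,%s,%s\n",
              session$token,                     # anonymous session id
              input$calc_key,                    # which key was pressed
              format(Sys.time(), "%Y-%m-%d %H:%M:%OS3")),
      file = "interaction_log.csv", append = TRUE)
})
\end{verbatim}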
\begin{figure}
\centering{
\pandocbounded{\includegraphics[keepaspectratio]{images/Estimation_numerical_screenshot_crop.png}}
}
\caption{\label{fig-estimation-calc}This question asks participants for
a numerical estimate, but provides a basic calculator and scratchpad.
All user interactions with the calculator and scratchpad are logged,
providing insight into the user's thought process and estimation
strategy.}
\end{figure}%
One of the most difficult components of designing an experiment which
asks users to directly estimate information from a chart using a full
scenario (background information, etc., as well as contextual details
from the chart) is that the questions must be extremely carefully
constructed. Mathematics education researchers provide guidelines for
selecting different levels of questioning in order to assess graph
comprehension: literal reading of the data, reading between the data,
and reading beyond the data (Curcio, 1987; Friel et al., 2001; Glazer,
2011; Wood, 1968). In a recent study, we identified questions based on
this framework to evaluate direct estimates and extend those estimates
to make comparisons between two points.
Even when great care is taken with the construction of the question,
participant answer accuracy is fundamentally limited by the fact that
many participants do not read and interpret the question with the care
and precision with which it was written. Questions that ask participants to
e.g.~estimate the multiplicative change in a quantity at two time points
may be misunderstood as asking for an estimate of the additive
difference, and the resulting estimates are then one or more orders of
magnitude off of the correct answer. This is one area where lineup
methods are convenient - they do not depend on participants to
understand the nuances of language or scenarios built around the chart
under investigation. However, in some situations it may be sufficient to
ask participants to estimate direct numerical quantities that have
little contextual information, as done in Vanderplas et al. (2019) when
assessing the accuracy of framed plots re-created from the Statistical
Atlas.
Another useful measurement strategy is to require participants to
\emph{engage directly} with an interactive visualization. This is useful
in a directed task, where users are asked to interact with the chart in
a specific way and the result is recorded, but it is also possible to
use interactive visualizations in an open-ended task, recording how
users engage with the graphic in an exploratory (as opposed to
goal-directed) manner. In one recent experiment, we asked participants
to forecast an exponential trend, with data presented on either a linear
or log scale. Using JavaScript code modified from New York Times
interactive graphics ``You Draw It'' features (Katz, 2017), we had users
draw trend lines with their computer mouse and make forecasts directly
on interactive charts, with the data and user-drawn predictions recorded
to our database (Robinson et al., 2023b). With interactive graphics
rendered using JavaScript (or other web libraries), the only limit to
the types of questions one can ask in testing graphics is one's ability
to write code to interact with the visualization library. This type of
testing method can be extremely natural for participants, but it also is
hard to generalize when discussing testing methods because of the
potential range of applications where it might be employed.
Whichever testing method is chosen should be appropriate to the type of
question under investigation and the level of visual and cognitive
engagement required to answer that question. While lineups are excellent
tools for assessing perceptual questions, they cannot address questions
aimed at understanding how people use charts within the wider context of
a story or practical task; this requires more direct methods with higher
ecological validity.
All of the testing methods described here require significant work to
develop a strategy for data generation appropriate for testing the
underlying question. For instance, when testing the perception of
exponential growth, we had to develop a model which would generate data
with varying growth rates, but where the data had a pre-specified domain
and range. If the null plots fail to capture the key visual
characteristics - such as trend, spread, or clustering - then any
standout visual differences may be attributed to those unintended
features, rather than the perceptual cue being tested. In other words,
if the nulls are too obviously different, participants might detect the
real plot for the wrong reason. Each testing method has specific
requirements, but it is important to carefully calibrate the model
parameters to allow for some variability, but not too much, and to
ensure that participants can succeed at the task and do not feel like
they are being made to analyze random noise. This Goldilocks-style
problem is the focus of the next section.
\section{Experiment Development Life
Cycle}\label{experiment-development-life-cycle}
Developing a graphics experiment is often a highly iterative process,
but it can help to approach the design process by first optimizing the
model and data generation method \DIFaddbegin \DIFadd{(Section~\ref{sec-model-dev}) }\DIFaddend before
spending time on optimizing the specific stimuli or customizing the data
collection platform \DIFaddbegin \DIFadd{(Section~\ref{sec-exp-dev})}\DIFaddend . This is important
because the model parameters and data generation process inform the
\DIFdelbegin \DIFdel{experiment }\DIFdelend \DIFaddbegin \DIFadd{experimental }\DIFaddend structure and thus impact decisions made downstream.
Once the model and data generating mechanism are set, it is useful to
revisit the primary questions of interest and determine how to measure
the responses effectively. Secondary measures, such as response time,
free responses, and confidence level should also be determined. These
choices will inform the choice of a data collection platform and may
also inform the participant recruitment method.
Next, we recommend developing a preliminary data analysis plan,
specifying the general category of model which will be used
(e.g.~generalized linear mixed-effects model, t-test) and the contrasts
which are most interesting. This sets up the experimental design
decisions, but also ensures that as the data collection platform and
process is developed, any design constraints are considered. Development
of the data collection application is the next step, using draft
graphics and the set of participant response measures of interest.
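For example, if the primary response is whether each participant
correctly identified the target panel, the preliminary plan might
specify a model along the following lines (a sketch using the
\texttt{lme4} package; the variable names are placeholders):
\begin{verbatim}
library(lme4)

# correct:    0/1 indicator of target identification
# aesthetic:  plot design factor under test
# difficulty: data-generation parameter level
# random intercepts for participants and for the simulated data sets,
# assuming data sets are reused across participants
fit <- glmer(correct ~ aesthetic * difficulty +
               (1 | participant) + (1 | dataset),
             data = responses, family = binomial)
summary(fit)
\end{verbatim}
Writing the model down at this stage makes design constraints explicit:
here, for instance, the data collection platform must record which
simulated data set underlies each trial, or the crossed random effects
cannot be estimated.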
There are at least three stages of testing in a graphics experiment:
informal tests, a pilot study, and the main experiment. The informal
tests are critical for identifying issues with the data collection
application, but can also be used to calibrate the number of tasks
required of each participant. As the number and complexity of tasks
increases, the number of trials we can ask participants to complete
during a session decreases. The informal testing stage allows
researchers to consider the \DIFdelbegin \DIFdel{tradeoffs }\DIFdelend \DIFaddbegin \DIFadd{trade-offs }\DIFaddend inherent in the decision to
reduce the amount of information collected for each task, reduce the
number of tasks, or mitigate participant fatigue in other ways.
During preliminary testing, we use an optimistic number of trials per
participant, so that we can determine when participants become overly
fatigued. For instance, we might ask test participants to evaluate 20
graphical lineups (400 total panels), even though we expect to reduce the
number to 10 or 15 during the main experiment. We test the application
in individual or focus group sessions, often using graduate students,
colleagues, social media acquaintances, and conscripted family members.
After these participants complete the study, we ask questions about
fatigue to determine what range of trials per participant is reasonable.
At the end of our initial experiment tests, we have enough information
to determine the basic parameters of the experimental design (e.g.~how
many blocks in an incomplete block design can we have with the factors
under investigation). The number of trials a single participant can
complete without excessive fatigue impacts the number of blocks and the
strategy by which we allocate trials to each participant.
In addition, we must consider how long participants take to complete the
required number of trials. Completion time is used to determine
participant compensation (if using a participant recruiting platform).
Ethics boards and some recruitment platforms require that participants
be paid a reasonable wage for their time (currently, around \$15 US per
hour), and platforms may ask for median completion time and
automatically reject submissions from participants who are too far under
or over the specified time limits; they may also require additional
participant payments if the median time estimate is too far below the
actual average completion time during the experiment. Platforms may also
calculate fees based on both the participant payment and number of
participants recruited, with additional fees to recruit
e.g.~demographically representative samples; as a result, it can be
advantageous to balance cognitive load concerns with the fee structure
used by the selected recruitment platform.
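A back-of-the-envelope sketch of how these pieces interact (the \$15
per hour figure is from above; every other number is hypothetical):
\begin{verbatim}
median_minutes <- 12      # median completion time from informal testing
wage_per_hour  <- 15      # target hourly rate, US dollars
n_participants <- 200
platform_fee   <- 1/3     # hypothetical fee, as a fraction of payments

payment_each <- wage_per_hour * median_minutes / 60   # $3.00
total_cost   <- n_participants * payment_each * (1 + platform_fee)
total_cost                                            # $800
\end{verbatim}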
The findings from the initial test of the experimental procedure are
then used to revise the data collection procedure in preparation for one
or more pilot studies. It is important to ensure that the software
platform, trial allocation, and other components of the experiment are
functioning as desired before a formal pilot study is conducted. In some
cases, the pilot study is as simple as a ``soft launch'' of the main
experiment, where the total number of trials is pre-specified and only a
few trials are released initially to ensure that data collection works
as expected. In others, the pilot study is conducted first, and results
from that study are used to determine the sample size for the main
experiment. At this point, data collection, analysis, and reporting
proceed much as in any other experiment.
\section{Developing a Model}\label{sec-model-dev}
Once the graphical task has been identified, it is necessary to develop
a model which can be used to explore the graphical features of interest
in a precise manner. This is the single longest part of the entire
experimental design and execution process, in part because choosing a
model that replicates important visual features of the data is extremely
complex (Cook et al., 2021; Hullman \& Gelman, 2021; Vanderplas, 2021).
There are two main options when developing a statistical model for
graphical testing: start with a large data set and sample from that data
set (Hofmann et al., 2012), or start from a model and sample data from
that model's generating process (Robinson, 2022; Vanderplas \& Hofmann,
2015, 2017). This decision is largely determined by the availability of
a large data set containing the requisite features of interest and the
qualities being manipulated in the experiment. For instance, Hofmann et
al. (2012) used samples of different sizes from a pre-existing data set
to manipulate the amount of signal in each comparison; with a small
sample, there is less signal and the same amount of noise, making the
true plot harder to spot. In many situations, though, a convenient data
set with the right properties is harder to acquire, and it becomes
necessary to develop a sampling model to generate data for user
evaluation.
The tools we discuss in the remainder of this section can be applied
both to pre-existing data sets and to model-based sampling methods.
\subsection{Screening Parameters with
Simulation}\label{screening-parameters-with-simulation}
The choice of the parameter space used in testing is crucial for
gaining insight from a study without overburdening participants with
overlong studies. Choosing an appropriate space for testing
parameters is a well-known problem in psychometric testing: the space
considered should cover the range from `only some activation' to
`almost full activation' of an appropriate psychometric function (Schütt
et al., 2016; Valentin et al., 2024). When testing charts, visual
assessment is obviously key, but researchers can make use of statistical
indices related to the testing condition to narrow the parameter space
to a reasonable and efficient subset from which maximal information can
be acquired.\\
These statistical indices may also serve as quantitative proxies for the
difficulty of the visual task. To identify a statistical proxy for
visual difficulty that may help with narrowing the parameter space, it
can be useful to consider numerical measures used to estimate the same
types of visual information that will be assessed in the experiment. For
instance, we have used:
\begin{itemize}
\tightlist
\item
\(R^2\) as a measure of the strength of a linear relationship,
\item
Gini inequality as a measure of the strength of clustering, and
\item
lack-of-fit statistics to assess the amount of curvature in an
exponential relationship (shown in
Figure~\ref{fig-lof-density-curves}).
\end{itemize}
Then, a wide range of potential combinations of parameter values or
sampling strategies can be explored and summarized graphically; if the
numerical statistic cannot differentiate between the null and target
under a condition, it is reasonable to think that a visual inspection of
the data may also not show significant results. As with any measure, it
is important that difficulty levels span a range from easy to hard; we
do not learn anything from finding out that everyone can distinguish all
of the combinations. This portion of the design is somewhat analogous to
selecting a range of doses of a chemical in a dose-response experiment.
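A sketch of this screening step for the exponential setting of
Figure~\ref{fig-lof-density-curves} is shown below. The data generation
follows the model in the figure caption (with \(\alpha = 1\) and
\(\theta = 0\) as placeholder defaults), while the lack-of-fit measure
used here, an F-statistic for adding a quadratic term to a linear fit,
is simply one convenient stand-in for the statistic actually used.
\begin{verbatim}
set.seed(1)

simulate_lof <- function(beta, sigma, alpha = 1, theta = 0, n = 50) {
  x <- seq(0, 1, length.out = n)
  y <- alpha * exp(beta * x + rnorm(n, sd = sigma)) + theta
  # stand-in lack-of-fit statistic: F-test for curvature beyond linear
  anova(lm(y ~ x), lm(y ~ poly(x, 2)))$F[2]
}

# screen a grid of candidate parameter values, 1000 replicates each
grid <- expand.grid(beta = c(0.1, 1, 3), sigma = c(0.05, 0.15, 0.25))
grid$median_lof <- mapply(
  function(b, s) median(replicate(1000, simulate_lof(b, s))),
  grid$beta, grid$sigma)
grid   # retain combinations whose distributions separate cleanly
\end{verbatim}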
\begin{figure}
\DIFdelbeginFL %DIFDELCMD < \centering{
%DIFDELCMD <
%DIFDELCMD < \pandocbounded{\includegraphics[keepaspectratio]{index-revised_files/figure-pdf/fig-lof-density-curves-1.pdf}}
%DIFDELCMD <
%DIFDELCMD < }
%DIFDELCMD < %%%
\DIFdelendFL \DIFaddbeginFL \centering{
\pandocbounded{\includegraphics[keepaspectratio]{index_files/figure-pdf/fig-lof-density-curves-1.pdf}}
}
\DIFaddendFL
\caption{\label{fig-lof-density-curves}Density plot of the lack of fit
statistic showing separation of selected difficulty levels: High
(obvious curvature), Medium (noticeable curvature), and Low (almost
linear). Each density plot is the result of 1000 simulations from a
model \(y_i = \alpha\cdot e^{\beta\cdot x_i + \epsilon_i} + \theta\),
where \(\epsilon \sim N(0, \sigma^2)\). \(\alpha\) and \(\theta\) were
selected after manipulation of \(\beta\) and \(\sigma\) to ensure that
all data generated had similar \(y\) ranges so as not to provide visual
cues about model differences outside of the plot curvature.}
\end{figure}%
While this approach is certainly more critical for model-based sampling
methods, it is also important when data are generated by sampling from a
larger data set. In that case, the parameters are more often sample
sizes and stratification methods, but it is still
important to iteratively assess the data generating procedure through
simulation. Using numerical proxies for visual characteristics of data
displays, such as curvature, linearity, scatter, and dispersion, can assist
with identifying optimal parameter settings to use across different
experimental conditions. Even with this strategy, it is still critical
to fine-tune the parameter choices with visual calibration and pilot
testing.
\subsection{Fine-Tuning Parameter
Choices}\label{fine-tuning-parameter-choices}
Once an appropriate set of parameters is identified using the numerical
screening method, it is important to calibrate these parameter
selections visually. No numerical statistic is a perfect measure of what
we actually see: at best, it is an approximation of what we might
potentially see. We have found it to be useful to have one experimenter
calibrate the model parameters at a gross level, and then have another
experimenter narrow in on the parameters which are visually reasonable
within the selected range. Then, both experimenters visually inspect a large
number of plots generated using those parameters to get a sense for how
difficult the task at hand is (this strategy is also described by Lu et
al. (2022)). At some point, all experimenters become so visually
saturated with the nuances of the data generating mechanism that it may
become necessary to ``sanity check'' the protocol with family members,
friends, and colleagues. These informal focus groups provide extremely
useful feedback and can help to counteract the visual saturation of
being immersed in the design of a visualization experiment for months at
a time.
\subsection{Visual Assessment is
Critical}\label{visual-assessment-is-critical}
We cannot overstate the importance of visual assessment of your model
stimuli, preferably with fresh eyes. We highly recommend performing
several rounds of think-aloud pilot testing (e.g., focus groups) before
deploying an experiment. In support of this assessment, we offer up a
cautionary tale of our own experience: that of Vanderplas \& Hofmann
(2017), where we designed an experiment to test which plot aesthetics
promoted discovery of linear trends and/or clusters.
The experiment was a \DIFdelbegin \DIFdel{2x3x3 }\DIFdelend \DIFaddbegin \DIFadd{\(2\times 3\times 3\) }\DIFaddend factorial exploration of
three data generating parameters, with 3 replicates at each parameter
combination (54 data sets) and 10 aesthetic combinations (for a total of
540 lineups). Each lineup had 20 different sub-panels, so we should have
carefully visually inspected some 10,800 different panels. As is evident
from the fact that we're telling this story as a cautionary tale, we
missed a critical problem with our data-generating mechanism: when
clusters were assigned to randomly generated data after the fact, we
didn't control the cluster size, leading to clusters of one or two
points in relatively few sub-panels. This became particularly noticeable
when bounding ellipses were added to the plot, as the method used to
generate those ellipses required at least 3 points in the cluster. The
missing boundary ellipse in the corresponding sub-panels escaped our
notice during the stimuli proof-reading phase of the experiment, but did
not escape the notice of our participants, who only needed to examine
about 10 lineups each (around 200 panels). An example of one of the
problematic lineups is shown in Figure~\ref{fig-lineup-problems}: many
participants selected panel 16 because of the missing ellipse; not a
wrong choice, but certainly not the effect we intended to test.
\begin{figure}
\centering{
\pandocbounded{\includegraphics[keepaspectratio]{images/lineup-missing-ellipse.png}}
}
\caption{\label{fig-lineup-problems}A lineup from Vanderplas \& Hofmann
(2017). Panel 10 shows the clustered target data and panel 17 shows the
target data with a strong linear relationship; either of these target
panels was the expected choice. Unfortunately, panel 16 has only two
bounding ellipses shown, which is an unintentional difference that
resulted from a faulty method for assigning clusters to null plots; many
participants selected this panel instead of one of the target panels.}
\end{figure}%
One reason why it is so difficult to generate sampling models for visual
explorations is that our visual system is \DIFdelbegin \DIFdel{optimized for }\DIFdelend \DIFaddbegin \DIFadd{very good at creating groups
and }\DIFaddend identifying differences between \DIFdelbegin \DIFdel{groups}\DIFdelend \DIFaddbegin \DIFadd{them (Lupyan, 2008; Peterson \&
Berryhill, 2013; Pomerantz \& Portillo, 2011; Zeki \& Stutters, 2013)}\DIFaddend .
This ability can interfere with the natural \DIFaddbegin \DIFadd{inclination }\DIFaddend to use the null
sampling models that might be used in equivalent numerical tests when
running experiments that use visualizations. \DIFaddbegin \DIFadd{Numerical tests consider
one facet of the data; the power of graphical testing is that it allows
examination of a battery of hypotheses simultaneously. Unfortunately,
that makes designing an effective null sampling model much more
difficult than in the numerical test case, as the researcher must
effectively control for several different hypotheses when generating
null data.
}
\DIFaddend We re-ran the experiment using a different clustering method that
controlled the number of points in each group. Instead of noticing the
number of ellipses, participants used the differences in size
and shape of the ellipses formed when clustering after the data
generating procedure. That is, participants could still detect the
artificial nature of the induced clusters using other features. While it
can be difficult to get the data generating method right, it is
essential to conducting visual experiments that generalize well beyond
the effects shown in a single data set or phenomenon. This is also why
it is critical to include independent replications of the simulated
parameters, so that the results reflect variability due to the data
generation process, not just the specifics of a single simulated
dataset. The time and effort invested in this step at the outset of the
experiment pays dividends when it allows for clear generalization of the
experimental results to an entire statistical concept rather than a
single data set.
\section{Protocol Development}\label{sec-exp-dev}
It would be difficult to develop a full data generating model without
some idea of the experimental protocol: the basic equipment required for
the experiment, some idea of what questions users will answer, where and
how data will be \DIFdelbegin \DIFdel{collecteed}\DIFdelend \DIFaddbegin \DIFadd{collected}\DIFaddend , and so on. These experimental design factors
are fairly natural for scientists to accumulate over the course of
imagining and planning an experiment. When conducting graphical tests,
however, there are additional considerations beyond those taught in a
standard experimental design course. Experimenters must carefully