\documentclass{article}
\usepackage{changepage}
\usepackage{graphicx}
\begin{document}
\centerline{Supplementary Information for ``Co-citations in context: disciplinary heterogeneity is relevant''}
\centerline{James Bradley, Sitaram Devarakonda, Avon Davey, Siyu Liu, Dmitriy Korobskiy, Tandy Warnow, George Chacko}
\section{Discussion of algorithmic approach in Uzzi et al.\ (2013)}
The MCMC algorithmic approach to citation switching in Uzzi et al.\ (2013), DOI: 10.1126/science.1240474, builds three dicts containing publication, reference, and year-of-publication information and uses them as lookup tables for various operations. In plain language, an iteration process selects each publication in turn. Each reference in the selected publication is then replaced by a random selection from the \emph{set} of eligible references published in the same year. If the replacement candidate differs from the reference to be replaced, the replacement is made. If it is the same, up to 20 tries are made to find a non-self replacement. This process is applied to all the references in the set of publications being analyzed. Thus, reference a in three publications A, B, C could be replaced by references [b,c,d] or [b,b,d] but not [a,b,c]. Second, a reciprocal switch is made with a publication that cites the replacement. Thus, if publication A cites reference a published in year X, then a is substituted with reference b, also published in year X, and a randomly selected publication that cites b, say publication B, will have b replaced with a. See `satyam\_mukherjee\_mcmc.py', kindly provided by the authors of that paper (DOI: 10.1126/science.1240474).

Our approach is roughly similar. References are first grouped by year of publication, and then the sample function in R is used on the \emph{multi-set} of potential replacements to permute all references in a single step. A check is then run to see whether the permutation has created any duplicate references within a publication. Publications with duplicate references are then deleted (typically $\leq 0.2\%$). See `permute\_script.R'. A key difference is that the pool of replacement candidates is the \emph{set} in one case and the \emph{multi-set} in the other. In the first approach, every substitution is independent for instances of the same reference. Using the \emph{multi-set} accounts for existing citation frequency when selecting possible replacements. Thus a publication in year X that has accumulated 10,000 citations is more likely to be selected than a publication that is cited only once. Reference a in publications A, B, C could be replaced by references [a,b,c]. This process is very fast in comparison, and we have scaled it up even further by porting it to the Spark environment. In a recent comparison using WoS publications from 1985 (391,860 publications and 5,588,861 total references), ten simulations using `satyam\_mukherjee\_mcmc.py' took roughly 22 hours per simulation on a 32 GB CentOS VM. Run times using our approach on comparable hardware amounted to roughly 30 seconds per shuffle plus an overhead of 3--4 hours to consolidate the data. We also performed 1,000 simulations in 60 hours using a small Spark cluster for the much larger 2005 data set (886,648 publications and 19,036,324 total references).
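
The difference between the two candidate pools can be made concrete with a small calculation. The notation below ($S_X$, $c_r$, $N_X$) is introduced here for illustration and does not appear in either script, and the 20-try cap and the reciprocal switch are ignored. Let $S_X$ be the set of distinct references published in year X, let $c_r$ be the number of times reference $r$ is cited in the data set, and let $N_X = \sum_{r \in S_X} c_r$ be the size of the corresponding multi-set. For the candidate that fills a single reference slot,
\[
P_{\mathrm{set}}(r) = \frac{1}{|S_X|}, \qquad P_{\mathrm{mset}}(r) = \frac{c_r}{N_X}.
\]
Under a uniform permutation of the multi-set, a reference with $c_r = 10{,}000$ is therefore $10{,}000$ times as likely to land in a given slot as a reference with $c_r = 1$, whereas a uniform draw from the set treats the two alike.
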

For input data we selected all publications of type `Article' in WoS for a given year. Articles are then filtered to those with at least two references. Further, only references that have complete records in the Web of Science Core Collection are considered. This eliminates references that point cryptically to other data sources or are mere placeholders. Publications and references are mapped to their respective journals using ISSNs as identifiers. Where a reference has more than one ISSN, the most popular one is assigned to ensure that each reference is associated with only one journal.

For $n \leq 1000$ simulations on disciplinary networks (immunology, metabolism, applied physics), permute\_script.R is used to generate $n$ files, each with shuffled references. Typically, run times are less than 2 minutes per simulation on a 32 GB CentOS box with 8 vCPUs in MS Azure. The permutation\_testing\_script.sh shell script is then run, which calls four Python scripts in turn.
\begin{enumerate}
\item observed\_frequency.py: generates journal pair frequencies for the year slice of the WoS or disciplinary data set being analyzed.
\item background\_frequency.py: generates journal pair frequencies for the background model implemented using permute\_script.R.
\item journal\_count.py: joins all permuted files generated by background\_frequency.py and calculates the mean, sd, and z-scores.
\item Table\_generator.py: generates the final output file, which contains all publications and reference pairs along with observed frequencies and z-scores.
\end{enumerate}
Thus, the workflow is (i) generate the year slice (input data), (ii) generate background models by shuffling references, (iii) calculate journal pair frequencies, and (iv) consolidate observed and simulated frequencies into a single table and calculate z-scores.
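
The z-score in this workflow is presumably the standard one. The source names the quantities (mean, sd, z-score) but does not print the formula, so the expression below is a sketch under that assumption, with notation introduced here. For a journal pair $(i,j)$ with observed frequency $O_{ij}$ and simulated frequencies $B_{ij}^{(1)}, \ldots, B_{ij}^{(n)}$ from the $n$ shuffled files,
\[
\mu_{ij} = \frac{1}{n}\sum_{k=1}^{n} B_{ij}^{(k)}, \qquad
\sigma_{ij} = \sqrt{\frac{1}{n-1}\sum_{k=1}^{n}\bigl(B_{ij}^{(k)} - \mu_{ij}\bigr)^{2}}, \qquad
z_{ij} = \frac{O_{ij} - \mu_{ij}}{\sigma_{ij}}.
\]
Whether the scripts use the $n-1$ or the $n$ denominator for the standard deviation is not stated. A pair whose simulated frequency is identical in every simulation (for example, a pair that never occurs in the shuffled data) has $\sigma_{ij} = 0$ and no defined z-score under this formula.
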
This workflow tends to slow down with large data sets such as WoS in 2005, with approximately 886,000 publications and 5.8 million journal pairs. Consequently, the entire process has been ported to Spark, and the steps of provisioning a cluster, copying source data from the ERNIE PostgreSQL database to Spark, conducting in-memory calculations, and copying a final table back to PostgreSQL have been automated (see Spark folder).\clearpage
\begin{table}
\begin{adjustwidth}{-2cm}{}
\caption{Comparison of Citation Switching Algorithms. The citation switching algorithm of Uzzi et al.\ (2013), \emph{umsj}, has been implemented in Python. Ten simulations of the WoS 1985 data set were executed on a 32 GB, 8 vCPU CentOS 7.4 virtual machine to generate the data in Col 2. Each simulation took roughly 22 hours to complete. Scaling up the experiment, 10 or 1000 simulations of our modification of this algorithm, \emph{repcs}, were executed on a 4-node Apache Spark cluster for the WoS 1985 data set. The 1000 simulations completed in less than 50 hours (roughly 3 minutes per simulation).}
\vspace{3mm}
\begin{centering}
\scalebox{0.9}{
\begin{tabular}{|r lllll l|}
\hline
& data set & WoS 1985 & WoS 1985 & WoS 1985 & WoS 1995 & WoS 2005 \\
\hline
1 & Input publications & 391,860 & 391,860 & 391,860 & 537,160 & 886,648 \\
2 & Input journals & 8,075 & 8,075 & 8,075 & 10,983 & 15,203 \\
3 & Observed input journal pairs & 1,277,349 & 1,277,349 & 1,277,349 & 2,373,226 & 5,847,432 \\
4 & Simulated journal pairs & 961,487 & 959,765 & 1,200,403 & 2,288,225 & 5,835,794 \\
5 & Journal pair coverage & 75.27\% & 75.14\% & 93.97\% & 96.41\% & 99.80\% \\
6 & Min z-score & -132.71 & -148.14 & -104.50 & -215.96 & -273.708 \\
7 & Q1 z-score & -2.131 & -2.151 & -1.43 & -1.49 & -1.56 \\
8 & Median z-score & -0.536 & -0.54 & -0.24 & -0.25 & 0.555 \\
9 & Q3 z-score & 3.333 & 3.365 & 4.29 & 4.15 & 2.423 \\
10 & Max z-score & 16,598.534 & 22,015.891 & 12,028.55 & 12,662.15 & 6,152.57 \\
11 & Environment & CentOS 7.4 & Spark 2.3 & Spark 2.3 & Spark 2.3 & Spark 2.3 \\
12 & Number of simulations & 10 & 10 & 1000 & 1000 & 1000 \\
13 & Run time & 2186h (22 hr /sim) & $<$ 1 hr & $<$ 50h & 50h & 60h \\
14 & Algorithm & umsj & repcs & repcs & repcs & repcs \\
\hline
\end{tabular}}
\end{centering}
\end{adjustwidth}
\end{table}
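The derived quantities in the comparison of citation switching algorithms can be checked by hand. The reading below is an inference from the tabulated numbers, not a formula given in the source: row 5 (journal pair coverage) appears to be the ratio of row 4 to row 3. For the first data column,
\[
\frac{961{,}487}{1{,}277{,}349} \approx 0.7527 ,
\]
which agrees with the reported 75.27\%. Likewise, the per-simulation figure quoted in the caption for the 1000-simulation Spark run follows from
\[
\frac{50\ \mathrm{h} \times 60\ \mathrm{min/h}}{1000\ \mathrm{simulations}} = 3\ \mathrm{min/simulation}.
\]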
\begin{table}
\begin{adjustwidth}{-2cm}{}
\begin{centering}
\caption{Profile of Disciplinary Networks. Data shown represent the results of 1000 simulations for the applied physics (ap), immunology (imm),
and metabolism (metab) disciplinary networks. Summary statistics for z-scores are provided as
well as the number of publications in each data set that were the input to the simulation process.}
\vspace{3mm}
\scalebox{0.9}{
\begin{tabular}{|r l ll ll ll l|}
\hline
& data set & Input Publications & Journal Pairs & Min & Q1 & Median & Q3 & Max \\
\hline
1 & ap1985 & 10,298 & 34,267 & -23.05 & -0.94 & -0.21 & 3.03 & 1,490.42 \\
2 & ap1995 & 21,012 & 60,340 & -45.36 & -0.97 & -0.24 & 2.86 & 646.03 \\
3 & ap2005 & 35,600 & 199,928 & -47.76 & -0.80 & -0.20 & 3.53 & 2,158.47 \\
4 & imm85 & 17,942 & 159,107 & -48.33 & -1.09 & -0.27 & 2.49 & 934.63 \\
5 & imm95 & 22,759 & 319,855 & -59.56 & -1.10 & -0.28 & 2.37 & 1,507.61 \\
6 & imm2005 & 28,539 & 751,950 & -74.54 & -0.99 & -0.30 & 1.84 & 2,560.51 \\
7 & metab1985 & 67,342 & 431,993 & -97.00 & -1.46 & -0.34 & 2.34 & 4,193.49 \\
8 & metab1995 & 100,350 & 865,406 & -132.85 & -1.56 & -0.37 & 2.16 & 3,998.44 \\
9 & metab2005 & 159,910 & 2,349,005 & -127.81 & -1.60 & -0.41 & 1.83 & 3,472.77 \\
\hline
\end{tabular}}
\end{centering}
\end{adjustwidth}
\end{table}
\begin{table}
\begin{adjustwidth}{-1cm}{}
\begin{centering}
\caption{Comparison of top 5\% cited publications vs all publications in the applied physics (ap), immunology (imm), metabolism (metab), and WoS data sets. Categories combine high or low conventionality (HC, LC) with high or low novelty (HN, LN).
Numbers shown are the percentage of publications in each group. Data are shown for reference years 1985, 1995, and 2005.}
\vspace{3mm}
\scalebox{0.8}{
\begin{tabular}{|r rlrrrrrrr r|}
\hline
& year & category & ap\_5 & ap\_all & imm\_5 & imm\_all & metab\_5 & metab\_all & wos\_5 & wos\_all \\
\hline
1 & 1985 & HCHN & 34 & 25 & 29 & 33 & 24 & 29 & 6 & 6 \\
2 & 1985 & HCLN & 15 & 25 & 21 & 17 & 26 & 21 & 44 & 44 \\
3 & 1985 & LCHN & 50 & 47 & 48 & 49 & 48 & 48 & 32 & 29 \\
4 & 1985 & LCLN & 1 & 3 & 2 & 1 & 2 & 2 & 18 & 21 \\
5 & 1995 & HCHN & 35 & 27 & 29 & 34 & 26 & 31 & 6 & 7 \\
6 & 1995 & HCLN & 15 & 23 & 21 & 16 & 24 & 19 & 44 & 43 \\
7 & 1995 & LCHN & 50 & 48 & 48 & 49 & 48 & 48 & 33 & 29 \\
8 & 1995 & LCLN & 0 & 2 & 2 & 1 & 2 & 2 & 17 & 21 \\
9 & 2005 & HCHN & 31 & 29 & 36 & 36 & 30 & 30 & 8 & 7 \\
10 & 2005 & HCLN & 19 & 20 & 14 & 14 & 20 & 20 & 42 & 43 \\
11 & 2005 & LCHN & 48 & 49 & 50 & 49 & 49 & 49 & 30 & 27 \\
12 & 2005 & LCLN & 2 & 2 & 0 & 1 & 1 & 1 & 20 & 23 \\
\hline
\end{tabular}}
\end{centering}
\end{adjustwidth}
\end{table}
\begin{table}
\begin{centering}
\caption{Statistical Significance of Deviation from a Random Distribution of Hits. These are hypothesis test data for the null hypothesis that hits are distributed among the categories randomly in proportion to the
number of articles in each category, using a chi-square goodness-of-fit test. Two definitions of novelty are tested: articles whose 1st percentile $z$-score is negative (Novelty Def.: $1\%$) and articles whose 10th percentile $z$-score is negative (Novelty Def.: $10\%$).
Rejecting the null hypothesis supports the alternative hypothesis that hit rates vary among the categories. The $p$ values indicate the significance of the difference between the observed number of hits and the
expected number of hits given by the random null model. $\dagger$ denotes that the chi-square goodness-of-fit test is not valid because the expected number of hits in at least one category was less than
the required five hits. This was caused by a small number of articles in a category combined with a low overall hit rate ($1\%$ or $2\%$) and a definition of novelty ($1\%$) that resulted in few articles
being defined as being of low novelty. Results that are significant at the 0.05 level are shown in bold font and those significant at the 0.10 level are shown in italics.}
\vspace{3mm}
\scalebox{0.7}{
\begin{tabular}{|l crr r|}
\hline
Data && \multicolumn{1}{c}{Highly Cited} & \multicolumn{2}{c}{$p$ value} \\
Set & Year & \multicolumn{1}{c}{Min. Percentile} & Novelty Def.: $1\%$ & Novelty Def.: $10\%$ \\
\hline
Immunology & 1985 & 1\% & $\dagger$ 0.000 & $\dagger$ 0.000 \\
Immunology & 1985 & 2\% & $\dagger$ 0.000 & $\dagger$ 0.000 \\
Immunology & 1985 & 5\% & $\dagger$ 0.000 & \textbf{0.000} \\
Immunology & 1985 & 10\% & \textbf{0.000} & \textbf{0.000} \\
Immunology & 1995 & 1\% & $\dagger$ 0.000 & $\dagger$ 0.000 \\
Immunology & 1995 & 2\% & $\dagger$ 0.000 & $\dagger$ 0.000 \\
Immunology & 1995 & 5\% & $\dagger$ 0.000 & \textbf{0.000} \\
Immunology & 1995 & 10\% & \textbf{0.000} & \textbf{0.000} \\
Immunology & 2005 & 1\% & $\dagger$ 0.000 & $\dagger$ 0.000 \\
Immunology & 2005 & 2\% & $\dagger$ 0.000 & $\dagger$ 0.000 \\
Immunology & 2005 & 5\% & $\dagger$ 0.000 & \textbf{0.000} \\
Immunology & 2005 & 10\% & \textbf{0.000} & \textbf{0.000} \\
Metabolism & 1985 & 1\% & \textbf{0.000} & \textbf{0.000} \\
Metabolism & 1985 & 2\% & \textbf{0.000} & \textbf{0.000} \\
Metabolism & 1985 & 5\% & \textbf{0.000} & \textbf{0.000} \\
Metabolism & 1985 & 10\% & \textbf{0.000} & \textbf{0.000} \\
Metabolism & 1995 & 1\% & \textbf{0.000} & \textbf{0.000} \\
Metabolism & 1995 & 2\% & \textbf{0.000} & \textbf{0.000} \\
Metabolism & 1995 & 5\% & \textbf{0.000} & \textbf{0.000} \\
Metabolism & 1995 & 10\% & \textbf{0.000} & \textbf{0.000} \\
Metabolism & 2005 & 1\% & $\dagger$ 0.000 & \textbf{0.000} \\
Metabolism & 2005 & 2\% & \textbf{0.000} & \textbf{0.000} \\
Metabolism & 2005 & 5\% & \textbf{0.000} & \textbf{0.000} \\
Metabolism & 2005 & 10\% & \textbf{0.000} & \textbf{0.000} \\
Applied Physics & 1985 & 1\% & $\dagger$ 0.027 & $\dagger$ 0.025 \\
Applied Physics & 1985 & 2\% & $\dagger$ 0.000 & $\dagger$ 0.000 \\
Applied Physics & 1985 & 5\% & \textbf{0.000} & \textbf{0.000} \\
Applied Physics & 1985 & 10\% & \textbf{0.000} & \textbf{0.000} \\
Applied Physics & 1995 & 1\% & $\dagger$ 0.010 & $\dagger$ 0.013 \\
Applied Physics & 1995 & 2\% & \textbf{0.000} & \textbf{0.000} \\
Applied Physics & 1995 & 5\% & \textbf{0.000} & \textbf{0.000} \\
Applied Physics & 1995 & 10\% & \textbf{0.000} & \textbf{0.000} \\
Applied Physics & 2005 & 1\% & $\dagger$ 0.000 & \textbf{0.000} \\
Applied Physics & 2005 & 2\% & \textbf{0.000} & \textbf{0.000} \\
Applied Physics & 2005 & 5\% & \textbf{0.000} & \textbf{0.000} \\
Applied Physics & 2005 & 10\% & \textbf{0.000} & \textbf{0.000} \\
Web of Science & 1985 & 1\% & \textbf{0.000} & \textbf{0.000} \\
Web of Science & 1985 & 2\% & \textbf{0.000} & \textbf{0.000} \\
Web of Science & 1985 & 5\% & \textbf{0.000} & \textbf{0.000} \\
Web of Science & 1985 & 10\% & \textbf{0.000} & \textbf{0.000} \\
Web of Science & 1995 & 1\% & \textbf{0.000} & \textbf{0.000} \\
Web of Science & 1995 & 2\% & \textbf{0.000} & \textbf{0.000} \\
Web of Science & 1995 & 5\% & \textbf{0.000} & \textbf{0.000} \\
Web of Science & 1995 & 10\% & \textbf{0.000} & \textbf{0.000} \\
Web of Science & 2005 & 1\% & \textbf{0.000} & \textbf{0.000} \\
Web of Science & 2005 & 2\% & \textbf{0.000} & \textbf{0.000} \\
Web of Science & 2005 & 5\% & \textbf{0.000} & \textbf{0.000} \\
Web of Science & 2005 & 10\% & \textbf{0.000} & \textbf{0.000} \\
\hline
\end{tabular}}
\end{centering}
\end{table}
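For reference, the test statistic behind these $p$ values can be written as follows. The symbols are introduced here for illustration, and the count of four categories assumes the novelty and conventionality categories used elsewhere in this supplement. With $n_k$ articles and $O_k$ observed hits in category $k$, and with $N = \sum_k n_k$ articles and $H = \sum_k O_k$ hits overall, the null model gives
\[
E_k = H\,\frac{n_k}{N}, \qquad
\chi^2 = \sum_{k=1}^{K} \frac{(O_k - E_k)^2}{E_k},
\]
with $K - 1$ degrees of freedom ($3$ for $K = 4$ categories). The $\dagger$ entries are those for which some $E_k < 5$. As a purely hypothetical example, a category holding $1\%$ of the articles in a data set with $H = 200$ hits would have $E_k = 2$, which is below the threshold.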
\pagenumbering{gobble}
\begin{table}
\begin{centering}
\caption{Hit Rates By Category. The hit rate is the percentage of publications in the referenced category that are in the top $1\%$, $2\%$, $5\%$, or $10\%$ of papers according to citation count (see column 3), for novel articles defined as those whose 10th percentile $z$-score is negative. The $z$-scores are computed using the local network. The category with the highest hit rate is boldfaced (the second highest is also boldfaced if it is within 0.3\% and greater than the overall percentage of articles considered to be hits). We also evaluated novelty defined as the 1st percentile of $z$-scores being negative. We report here on novelty defined at the most stringent parameter setting.}
\scalebox{0.7}{
\begin{tabular}{ccrrrrr}
\hline
Data && \multicolumn{1}{c}{Highly Cited} \\
Set & Year & \multicolumn{1}{c}{Min. Percentile} & LNLC & LNHC & HNLC & HNHC \\
\hline
Immunology & 1985&1\%& 0.0& \bf{1.5}& 0.6& \bf{1.4} \\
Immunology & 1985&2\%& 0.0& \bf{3.0}& 1.2& \bf{2.8} \\
Immunology & 1985&5\%& 1.5& 6.5& 3.2& \bf{7.1} \\
Immunology & 1985&10\%& 3.0& 12.0& 7.1& \bf{14.0} \\
Immunology & 1995&1\%& 0.0& \bf{1.9}& 0.5& 1.4 \\
Immunology & 1995&2\%& 0.0& \bf{3.4}& 1.1& 2.8 \\
Immunology & 1995&5\%& 0.6& \bf{7.2}& 3.2& 6.8 \\
Immunology & 1995&10\%& 1.7& \bf{12.8}& 7.6& \bf{12.9} \\
Immunology & 2005&1\%& 0.0& \bf{1.5}& 0.6& \bf{1.3} \\
Immunology & 2005&2\%& 0.0& \bf{2.7}& 1.3& \bf{2.7} \\
Immunology & 2005&5\%& 0.0& 6.0& 3.6& \bf{6.7} \\
Immunology & 2005&10\%& 2.0& 10.8& 8.1& \bf{12.9} \\
Metabolism & 1985&1\%& 0.1& \bf{1.5}& 0.5& \bf{1.5} \\
Metabolism & 1985&2\%& 0.3& \bf{2.7}& 1.3& \bf{2.9} \\
Metabolism & 1985&5\%& 0.7& 6.6& 3.4& \bf{7.0} \\
Metabolism & 1985&10\%& 2.3& 12.3& 7.2& \bf{13.8} \\
Metabolism & 1995&1\%& 0.1& \bf{1.7}& 0.6& \bf{1.4} \\
Metabolism & 1995&2\%& 0.3& \bf{3.1}& 1.2& \bf{2.8} \\
Metabolism & 1995&5\%& 0.7& \bf{7.0}& 3.4& \bf{6.7} \\
Metabolism & 1995&10\%& 1.9& \bf{13.0}& 7.4& \bf{13.3} \\
Metabolism & 2005&1\%& 0.6& \bf{1.3}& 0.7& \bf{1.3} \\
Metabolism & 2005&2\%& 1.0& \bf{2.5}& 1.5& \bf{2.6} \\
Metabolism & 2005&5\%& 2.3& 6.0& 3.9& \bf{6.6} \\
Metabolism & 2005&10\%& 4.1& 11.4& 8.1& \bf{12.6} \\
Applied Physics & 1985&1\%& 0.0& 0.5& \bf{1.2}& \bf{1.2} \\
Applied Physics & 1985&2\%& 0.9& 0.9& \bf{2.5}& \bf{2.4} \\
Applied Physics & 1985&5\%& 2.8& 3.0& 5.5& \bf{6.5} \\
Applied Physics & 1985&10\%& 5.2& 6.7& 10.6& \bf{13.2} \\
Applied Physics & 1995&1\%& 0.2& 0.7& \bf{1.2}& \bf{1.0} \\
Applied Physics & 1995&2\%& 0.2& 1.3& \bf{2.5}& 2.1 \\
Applied Physics & 1995&5\%& 0.9& 3.4& \bf{6.0}& 5.2 \\
Applied Physics & 1995&10\%& 4.7& 7.9& \bf{12.3}& 10.9 \\
Applied Physics & 2005&1\%& 0.8& 0.6& \bf{1.1}& \bf{1.3} \\
Applied Physics & 2005&2\%& 1.1& 1.3& \bf{2.1}& \bf{2.3}\\
Applied Physics & 2005&5\%& 1.6& 3.3& 5.4& \bf{5.9} \\
Applied Physics & 2005&10\%& 3.9& 7.5& 10.7& \bf{11.4} \\
Web of Science & 1985&1\%& 0.4& 1.2& 1.0& \bf{1.6} \\
Web of Science & 1985&2\%& 0.9& 2.4& 2.0& \bf{3.3} \\
Web of Science & 1985&5\%& 2.6& 5.9& 5.1& \bf{8.4} \\
Web of Science & 1985&10\%& 5.8& 11.4& 10.4& \bf{15.8} \\
Web of Science & 1995&1\%& 0.4& 1.3& 0.9& \bf{1.7} \\
Web of Science & 1995&2\%& 0.9& 2.4& 1.9& \bf{3.3} \\
Web of Science & 1995&5\%& 2.5& 6.0& 5.0& \bf{8.0} \\
Web of Science & 1995&10\%& 5.6& 11.5& 10.4& \bf{15.6} \\
Web of Science & 2005&1\%& 0.4& 1.2& 1.0& \bf{1.7} \\
Web of Science & 2005&2\%& 0.9& 2.3& 2.0& \bf{3.4} \\
Web of Science & 2005&5\%& 2.5& 5.7& 5.3& \bf{8.1} \\
Web of Science & 2005&10\%& 5.6& 11.2& 10.8& \bf{15.0} \\
\hline
\end{tabular}}
\end{centering}
\end{table}
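In symbols introduced here for illustration, if category $k$ contains $n_k$ publications, of which $h_k$ fall in the top $q\%$ by citation count, the tabulated hit rate is
\[
\mathrm{hit\ rate}_k = 100 \times \frac{h_k}{n_k}.
\]
If hits were distributed at random, every category would have a hit rate of approximately $q$, so entries well above the value in column 3 mark categories that are over-represented among highly cited papers.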
\begin{table}
\begin{centering}
\caption{Explanatory Power of Novelty and Conventionality. This table lists $p$-values in the form of cumulative right-hand tail probabilities for the observed number of hits in the Low Novelty, High Novelty, Low Conventionality, and High Conventionality categories under the sampling distribution generated by the null hypothesis of a random distribution of hit articles in proportion to the number of articles in each of the categories. A small $p$-value, therefore, indicates a number of hits that exceeds the expected number. Results that indicate statistically significant numbers of hits in excess of the expected number at the 0.05 level using a two-tailed test are highlighted in bold font, and those significant at the 0.10 level are italicized. These data are for the case where novel citation patterns are defined by whether an article's 10th percentile $z$-score is negative. The $z$-scores are computed using the local network. We report these data because this is the most stringent definition of novelty.}
\vspace{2mm}
\scalebox{0.7}{
\begin{tabular}{|c rrrrr r|}
\hline
& & & \multicolumn{4}{c}{Cumulative Probabilities} \\
Data & & \multicolumn{1}{c}{Highly Cited} & \multicolumn{2}{c}{Conventionality} & \multicolumn{2}{c}{Novelty} \\ % inserts table
Set & Year& \multicolumn{1}{c}{Min. Percentile} & \multicolumn{1}{c}{Low} & \multicolumn{1}{c}{High} & \multicolumn{1}{c}{Low} & \multicolumn{1}{c}{High} \\
\hline
Immunology & 1985 & 1\% & 1.000 & \textbf{0.000} & \textbf{0.020} & 0.866 \\
Immunology & 1985 & 2\% & 1.000 & \textbf{0.000} & \textbf{0.002} & 0.940 \\
Immunology & 1985 & 5\% & 1.000 & \textbf{0.000} & \textbf{0.002} & 0.924 \\
Immunology & 1985 & 10\% & 1.000 & \textbf{0.000} & \textbf{0.017} & 0.853 \\
Immunology & 1995 & 1\% & 1.000 & \textbf{0.000} & \textbf{0.000} & 0.985 \\
Immunology & 1995 & 2\% & 1.000 & \textbf{0.000} & \textbf{0.000} & 0.991 \\
Immunology & 1995 & 5\% & 1.000 & \textbf{0.000} & \textbf{0.000} & 0.988 \\
Immunology & 1995 & 10\% & 1.000 & \textbf{0.000} & \textbf{0.000} & 0.972 \\
Immunology & 2005 & 1\% & 1.000 & \textbf{0.000} & \textbf{0.005} & 0.882 \\
Immunology & 2005 & 2\% & 1.000 & \textbf{0.000} & \textbf{0.007} & 0.862 \\
Immunology & 2005 & 5\% & 1.000 & \textbf{0.000} & \textbf{0.012} & 0.837 \\
Immunology & 2005 & 10\% & 1.000 & \textbf{0.000} & 0.265 & 0.609 \\
Metabolism & 1985 & 1\% & 1.000 & \textbf{0.000} & \textbf{0.000} & 0.991 \\
Metabolism & 1985 & 2\% & 1.000 & \textbf{0.000} & \textbf{0.000} & 0.986 \\
Metabolism & 1985 & 5\% & 1.000 & \textbf{0.000} & \textbf{0.000} & 0.999 \\
Metabolism & 1985 & 10\% & 1.000 & \textbf{0.000} & \textbf{0.000} & 0.997 \\
Metabolism & 1995 & 1\% & 1.000 & \textbf{0.000} & \textbf{0.000} & 1.000 \\
Metabolism & 1995 & 2\% & 1.000 & \textbf{0.000} & \textbf{0.000} & 1.000 \\
Metabolism & 1995 & 5\% & 1.000 & \textbf{0.000} & \textbf{0.000} & 1.000 \\
Metabolism & 1995 & 10\% & 1.000 & \textbf{0.000} & \textbf{0.000} & 1.000 \\
Metabolism & 2005 & 1\% & 1.000 & \textbf{0.000} & \textbf{0.000} & 0.997 \\
Metabolism & 2005 & 2\% & 1.000 & \textbf{0.000} & \textbf{0.000} & 0.998 \\
Metabolism & 2005 & 5\% & 1.000 & \textbf{0.000} & \textbf{0.000} & 0.998 \\
Metabolism & 2005 & 10\% & 1.000 & \textbf{0.000} & \textbf{0.000} & 0.996 \\
Applied Physics & 1985 & 1\% & 0.177 & 0.860 & 0.998 & 0.071 \\
Applied Physics & 1985 & 2\% & 0.066 & 0.952 & 1.000 & \textbf{0.013} \\
Applied Physics & 1985 & 5\% & 0.200 & 0.818 & 1.000 & \textbf{0.003} \\
Applied Physics & 1985 & 10\% & 0.390 & 0.625 & 1.000 & \textbf{0.000} \\
Applied Physics & 1995 & 1\% & 0.062 & 0.955 & 0.996 & 0.090 \\
Applied Physics & 1995 & 2\% & \textbf{0.018} & 0.988 & 1.000 & \textbf{0.011} \\
Applied Physics & 1995 & 5\% & \textbf{0.002} & 0.999 & 1.000 & \textbf{0.001} \\
Applied Physics & 1995 & 10\% & \textbf{0.000} & 1.000 & 1.000 & \textbf{0.000} \\
Applied Physics & 2005 & 1\% & 0.319 & 0.706 & 1.000 & \textit{0.028} \\
Applied Physics & 2005 & 2\% & 0.272 & 0.748 & 1.000 & \textbf{0.015} \\
Applied Physics & 2005 & 5\% & 0.117 & 0.897 & 1.000 & \textbf{0.000} \\
Applied Physics & 2005 & 10\% & 0.102 & 0.909 & 1.000 & \textbf{0.000} \\
Web of Science & 1985 & 1\% & 1.000 & \textbf{0.000} & 0.986 & \textbf{0.002} \\
Web of Science & 1985 & 2\% & 1.000 & \textbf{0.000} & 1.000 & \textbf{0.000} \\
Web of Science & 1985 & 5\% & 1.000 & \textbf{0.000} & 1.000 & \textbf{0.000} \\
Web of Science & 1985 & 10\% & 1.000 & \textbf{0.000} & 1.000 & \textbf{0.000} \\
Web of Science & 1995 & 1\% & 1.000 & \textbf{0.000} & 0.969 & \textbf{0.007} \\
Web of Science & 1995 & 2\% & 1.000 & \textbf{0.000} & 1.000 & \textbf{0.000} \\
Web of Science & 1995 & 5\% & 1.000 & \textbf{0.000} & 1.000 & \textbf{0.000} \\
Web of Science & 1995 & 10\% & 1.000 & \textbf{0.000} & 1.000 & \textbf{0.000} \\
Web of Science & 2005 & 1\% & 1.000 & \textbf{0.000} & 1.000 & \textbf{0.000} \\
Web of Science & 2005 & 2\% & 1.000 & \textbf{0.000} & 1.000 & \textbf{0.000} \\
Web of Science & 2005 & 5\% & 1.000 & \textbf{0.000} & 1.000 & \textbf{0.000} \\
Web of Science & 2005 & 10\% & 1.000 & \textbf{0.000} & 1.000 & \textbf{0.000} \\
\hline
\end{tabular}}
\end{centering}
\end{table}
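The right-hand tail probability reported in each cell can be written, in notation introduced here for illustration, as
\[
p = P\bigl(X \geq h_{\mathrm{obs}} \mid H_0\bigr) = \sum_{x \geq h_{\mathrm{obs}}} P\bigl(X = x \mid H_0\bigr),
\]
where $X$ is the number of hits falling in the category under the null model and $h_{\mathrm{obs}}$ is the observed number. The source does not name the sampling distribution, so none is assumed here. A value near $0$ indicates more hits than expected, and a value near $1$ indicates fewer.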
%%%%%
\end{document}