Commit 6832e43

Update plots for April 2026 crawl (CC-MAIN-2026-17)
Signed-off-by: Luca Foppiano <luca@foppiano.org>
1 parent 971c1eb commit 6832e43

43 files changed

Lines changed: 5556 additions & 5126 deletions

plots/charsets-top-100.html

Lines changed: 44 additions & 44 deletions
@@ -2,9 +2,9 @@
 <thead>
 <tr style="text-align: right;">
 <th>crawl</th>
-<th>CC-MAIN-2026-04</th>
 <th>CC-MAIN-2026-08</th>
 <th>CC-MAIN-2026-12</th>
+<th>CC-MAIN-2026-17</th>
 </tr>
 <tr>
 <th>charset</th>
@@ -22,15 +22,15 @@
 </tr>
 <tr>
 <th>&lt;unknown&gt;</th>
-<td>1.6272</td>
 <td>1.8164</td>
 <td>1.8232</td>
+<td>1.8918</td>
 </tr>
 <tr>
 <th>Big5</th>
-<td>0.0253</td>
 <td>0.0219</td>
 <td>0.0198</td>
+<td>0.0298</td>
 </tr>
 <tr>
 <th>Big5-HKSCS</th>
@@ -40,117 +40,117 @@
 </tr>
 <tr>
 <th>EUC-JP</th>
-<td>0.1285</td>
 <td>0.1311</td>
 <td>0.1340</td>
+<td>0.1266</td>
 </tr>
 <tr>
 <th>EUC-KR</th>
-<td>0.0770</td>
 <td>0.0785</td>
 <td>0.0819</td>
+<td>0.0771</td>
 </tr>
 <tr>
 <th>GB18030</th>
-<td>0.0165</td>
 <td>0.0149</td>
 <td>0.0165</td>
+<td>0.0150</td>
 </tr>
 <tr>
 <th>GB2312</th>
-<td>0.2195</td>
 <td>0.2307</td>
 <td>0.2388</td>
+<td>0.2011</td>
 </tr>
 <tr>
 <th>GBK</th>
-<td>0.0991</td>
 <td>0.1013</td>
 <td>0.1000</td>
+<td>0.0934</td>
 </tr>
 <tr>
 <th>IBM420</th>
-<td>0.0040</td>
 <td>0.0051</td>
 <td>0.0049</td>
+<td>0.0051</td>
 </tr>
 <tr>
 <th>IBM424</th>
-<td>0.0018</td>
 <td>0.0023</td>
 <td>0.0021</td>
+<td>0.0011</td>
 </tr>
 <tr>
 <th>IBM500</th>
-<td>0.0010</td>
 <td>0.0011</td>
 <td>0.0011</td>
+<td>0.0008</td>
 </tr>
 <tr>
 <th>IBM855</th>
-<td>0.0000</td>
 <td>NaN</td>
 <td>0.0000</td>
+<td>NaN</td>
 </tr>
 <tr>
 <th>IBM866</th>
-<td>0.0002</td>
 <td>0.0003</td>
 <td>0.0002</td>
+<td>0.0001</td>
 </tr>
 <tr>
 <th>ISO-2022-JP</th>
-<td>0.0009</td>
 <td>0.0010</td>
 <td>0.0012</td>
+<td>0.0010</td>
 </tr>
 <tr>
 <th>ISO-8859-1</th>
-<td>4.3993</td>
 <td>6.6472</td>
 <td>3.3383</td>
+<td>2.7125</td>
 </tr>
 <tr>
 <th>ISO-8859-13</th>
-<td>0.0001</td>
+<td>0.0000</td>
 <td>0.0000</td>
 <td>0.0000</td>
 </tr>
 <tr>
 <th>ISO-8859-15</th>
-<td>0.0456</td>
 <td>0.0466</td>
 <td>0.0478</td>
+<td>0.0444</td>
 </tr>
 <tr>
 <th>ISO-8859-16</th>
-<td>0.0003</td>
 <td>0.0002</td>
 <td>0.0003</td>
+<td>0.0003</td>
 </tr>
 <tr>
 <th>ISO-8859-2</th>
-<td>0.0824</td>
 <td>0.0906</td>
 <td>0.0882</td>
+<td>0.0817</td>
 </tr>
 <tr>
 <th>ISO-8859-3</th>
-<td>0.0000</td>
 <td>0.0002</td>
 <td>0.0003</td>
+<td>0.0003</td>
 </tr>
 <tr>
 <th>ISO-8859-4</th>
 <td>0.0006</td>
-<td>0.0006</td>
+<td>0.0007</td>
 <td>0.0007</td>
 </tr>
 <tr>
 <th>ISO-8859-5</th>
-<td>0.0015</td>
 <td>0.0016</td>
 <td>0.0015</td>
+<td>0.0008</td>
 </tr>
 <tr>
 <th>ISO-8859-6</th>
@@ -160,27 +160,27 @@
 </tr>
 <tr>
 <th>ISO-8859-7</th>
-<td>0.0046</td>
 <td>0.0043</td>
 <td>0.0043</td>
+<td>0.0040</td>
 </tr>
 <tr>
 <th>ISO-8859-8</th>
-<td>0.0006</td>
 <td>0.0007</td>
 <td>0.0008</td>
+<td>0.0007</td>
 </tr>
 <tr>
 <th>ISO-8859-9</th>
-<td>0.0198</td>
 <td>0.0196</td>
 <td>0.0201</td>
+<td>0.0235</td>
 </tr>
 <tr>
 <th>KOI8-R</th>
-<td>0.0057</td>
 <td>0.0067</td>
 <td>0.0066</td>
+<td>0.0076</td>
 </tr>
 <tr>
 <th>KOI8-U</th>
@@ -190,45 +190,45 @@
 </tr>
 <tr>
 <th>Shift_JIS</th>
-<td>0.1606</td>
 <td>0.1749</td>
 <td>0.1813</td>
+<td>0.1479</td>
 </tr>
 <tr>
 <th>TIS-620</th>
-<td>0.0037</td>
 <td>0.0038</td>
 <td>0.0038</td>
+<td>0.0036</td>
 </tr>
 <tr>
 <th>US-ASCII</th>
-<td>0.0194</td>
 <td>0.0189</td>
 <td>0.0200</td>
+<td>0.0203</td>
 </tr>
 <tr>
 <th>UTF-16</th>
-<td>0.0025</td>
 <td>0.0027</td>
 <td>0.0023</td>
+<td>0.0022</td>
 </tr>
 <tr>
 <th>UTF-16BE</th>
 <td>0.0002</td>
 <td>0.0002</td>
-<td>0.0002</td>
+<td>0.0003</td>
 </tr>
 <tr>
 <th>UTF-16LE</th>
-<td>0.0012</td>
 <td>0.0015</td>
 <td>0.0015</td>
+<td>0.0039</td>
 </tr>
 <tr>
 <th>UTF-32</th>
 <td>0.0001</td>
 <td>0.0001</td>
-<td>0.0001</td>
+<td>0.0000</td>
 </tr>
 <tr>
 <th>UTF-32LE</th>
@@ -238,57 +238,57 @@
 </tr>
 <tr>
 <th>UTF-8</th>
-<td>92.3263</td>
 <td>89.8190</td>
 <td>93.0693</td>
+<td>93.7434</td>
 </tr>
 <tr>
 <th>windows-1250</th>
-<td>0.0668</td>
 <td>0.0681</td>
 <td>0.0689</td>
+<td>0.0666</td>
 </tr>
 <tr>
 <th>windows-1251</th>
-<td>0.4681</td>
 <td>0.4820</td>
 <td>0.5023</td>
+<td>0.4838</td>
 </tr>
 <tr>
 <th>windows-1252</th>
-<td>0.1271</td>
 <td>0.1433</td>
 <td>0.1557</td>
+<td>0.1526</td>
 </tr>
 <tr>
 <th>windows-1253</th>
-<td>0.0020</td>
 <td>0.0019</td>
 <td>0.0024</td>
+<td>0.0024</td>
 </tr>
 <tr>
 <th>windows-1254</th>
-<td>0.0100</td>
 <td>0.0104</td>
 <td>0.0129</td>
+<td>0.0123</td>
 </tr>
 <tr>
 <th>windows-1255</th>
 <td>0.0059</td>
 <td>0.0059</td>
-<td>0.0059</td>
+<td>0.0053</td>
 </tr>
 <tr>
 <th>windows-1256</th>
-<td>0.0310</td>
 <td>0.0303</td>
 <td>0.0276</td>
+<td>0.0240</td>
 </tr>
 <tr>
 <th>windows-1257</th>
-<td>0.0059</td>
 <td>0.0062</td>
 <td>0.0059</td>
+<td>0.0053</td>
 </tr>
 <tr>
 <th>windows-31j</th>
@@ -300,13 +300,13 @@
 <th>x-iso-8859-11</th>
 <td>0.0001</td>
 <td>0.0001</td>
-<td>0.0001</td>
+<td>0.0000</td>
 </tr>
 <tr>
 <th>x-windows-874</th>
-<td>0.0070</td>
 <td>0.0072</td>
 <td>0.0067</td>
+<td>0.0057</td>
 </tr>
 <tr>
 <th>x-windows-949</th>

plots/charsets.csv

Lines changed: 49 additions & 0 deletions
@@ -3523,3 +3523,52 @@ CC-MAIN-2026-12,windows-31j,7656,7636,0.0004
 CC-MAIN-2026-12,x-iso-8859-11,2578,2562,0.0001
 CC-MAIN-2026-12,x-windows-874,132392,132120,0.0067
 CC-MAIN-2026-12,x-windows-949,71,70,0.0000
+CC-MAIN-2026-17,<other>,2318,2314,0.0001
+CC-MAIN-2026-17,<unknown>,41467368,41467368,1.8918
+CC-MAIN-2026-17,Big5,652206,651651,0.0298
+CC-MAIN-2026-17,Big5-HKSCS,772,772,0.0000
+CC-MAIN-2026-17,EUC-JP,2774903,2767865,0.1266
+CC-MAIN-2026-17,EUC-KR,1691059,1688850,0.0771
+CC-MAIN-2026-17,GB18030,328133,327466,0.0150
+CC-MAIN-2026-17,GB2312,4407671,4401246,0.2011
+CC-MAIN-2026-17,GBK,2047427,2043670,0.0934
+CC-MAIN-2026-17,IBM420,110960,110719,0.0051
+CC-MAIN-2026-17,IBM424,25193,25072,0.0011
+CC-MAIN-2026-17,IBM500,17952,17908,0.0008
+CC-MAIN-2026-17,IBM866,2526,2523,0.0001
+CC-MAIN-2026-17,ISO-2022-JP,22833,22808,0.0010
+CC-MAIN-2026-17,ISO-8859-1,59456401,59246052,2.7125
+CC-MAIN-2026-17,ISO-8859-13,620,620,0.0000
+CC-MAIN-2026-17,ISO-8859-15,973840,967873,0.0444
+CC-MAIN-2026-17,ISO-8859-16,6935,6917,0.0003
+CC-MAIN-2026-17,ISO-8859-2,1791551,1785905,0.0817
+CC-MAIN-2026-17,ISO-8859-3,7325,7305,0.0003
+CC-MAIN-2026-17,ISO-8859-4,14715,14662,0.0007
+CC-MAIN-2026-17,ISO-8859-5,16730,16638,0.0008
+CC-MAIN-2026-17,ISO-8859-6,449,447,0.0000
+CC-MAIN-2026-17,ISO-8859-7,88024,87744,0.0040
+CC-MAIN-2026-17,ISO-8859-8,14271,14251,0.0007
+CC-MAIN-2026-17,ISO-8859-9,516036,514479,0.0235
+CC-MAIN-2026-17,KOI8-R,167093,166532,0.0076
+CC-MAIN-2026-17,KOI8-U,1905,1905,0.0001
+CC-MAIN-2026-17,Shift_JIS,3241088,3230774,0.1479
+CC-MAIN-2026-17,TIS-620,79240,79071,0.0036
+CC-MAIN-2026-17,US-ASCII,445179,444631,0.0203
+CC-MAIN-2026-17,UTF-16,47193,47017,0.0022
+CC-MAIN-2026-17,UTF-16BE,6708,6706,0.0003
+CC-MAIN-2026-17,UTF-16LE,84922,84770,0.0039
+CC-MAIN-2026-17,UTF-32,1021,1017,0.0000
+CC-MAIN-2026-17,UTF-32LE,3372,3352,0.0002
+CC-MAIN-2026-17,UTF-8,2054795645,2042701966,93.7434
+CC-MAIN-2026-17,windows-1250,1459400,1453648,0.0666
+CC-MAIN-2026-17,windows-1251,10605458,10541589,0.4838
+CC-MAIN-2026-17,windows-1252,3344637,3329213,0.1526
+CC-MAIN-2026-17,windows-1253,52150,51994,0.0024
+CC-MAIN-2026-17,windows-1254,270316,267976,0.0123
+CC-MAIN-2026-17,windows-1255,116790,116111,0.0053
+CC-MAIN-2026-17,windows-1256,526124,524625,0.0240
+CC-MAIN-2026-17,windows-1257,115165,114923,0.0053
+CC-MAIN-2026-17,windows-31j,7881,7830,0.0004
+CC-MAIN-2026-17,x-iso-8859-11,855,852,0.0000
+CC-MAIN-2026-17,x-windows-874,125810,125471,0.0057
+CC-MAIN-2026-17,x-windows-949,130,129,0.0000
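
Review note: the header of plots/charsets.csv is not visible in this diff, so the meaning of the columns is a guess. The added rows are consistent with the fifth column being the first count expressed as a percentage of the crawl total. A minimal Python sketch of that sanity check, using three rows copied from the hunk above (the field names `count_a` and `count_b` are hypothetical placeholders, not the file's real header):

```python
import csv
import io

# Three rows copied verbatim from the CC-MAIN-2026-17 additions.
SAMPLE = """\
CC-MAIN-2026-17,UTF-8,2054795645,2042701966,93.7434
CC-MAIN-2026-17,ISO-8859-1,59456401,59246052,2.7125
CC-MAIN-2026-17,<unknown>,41467368,41467368,1.8918
"""

# Hypothetical field names: the real CSV header is not shown in the diff.
FIELDS = ["crawl", "charset", "count_a", "count_b", "percent"]


def load_rows(text):
    """Parse header-less CSV text into dicts with numeric fields converted."""
    rows = []
    for rec in csv.DictReader(io.StringIO(text), fieldnames=FIELDS):
        rec["count_a"] = int(rec["count_a"])
        rec["count_b"] = int(rec["count_b"])
        rec["percent"] = float(rec["percent"])
        rows.append(rec)
    return rows


def implied_total(row):
    """Crawl total implied by one row, assuming percent = 100 * count_a / total."""
    return row["count_a"] * 100.0 / row["percent"]


if __name__ == "__main__":
    for row in load_rows(SAMPLE):
        print(row["charset"], round(implied_total(row)))
```

If the assumption holds, the implied totals of different rows should agree up to the rounding of the percentage to four decimals. Rows with very small percentages (for example 0.0001) are too coarsely rounded for this check, and rows printed as 0.0000 would divide by zero.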

plots/crawler/crawldb_status.png

2.54 KB
2.86 KB

plots/crawler/metrics.png

1.52 KB

plots/crawler/url_protocols.png

1.48 KB
255 Bytes
-40.4 KB
3.05 KB

plots/crawlsize/cumulative.csv

Lines changed: 1 addition & 0 deletions
@@ -121,3 +121,4 @@ CC-MAIN-2025-51,298445856189,320445985715,97883155597
 CC-MAIN-2026-04,301026872854,322775616470,98499220111
 CC-MAIN-2026-08,302693544790,324942608192,99125333660
 CC-MAIN-2026-12,304458347420,326917453426,99725480448
+CC-MAIN-2026-17,306800659832,329109389726,100385941968
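
Review note: since plots/crawlsize/cumulative.csv holds running totals, the contribution of the new crawl is the difference between the last two rows. A small Python sketch of that arithmetic follows; the column meanings are not stated in this diff, so the three value columns are treated as unnamed cumulative counters.

```python
# Last two rows of cumulative.csv, copied from the hunk above.
# The three numeric columns are unnamed here because the header is not shown.
PREVIOUS = ("CC-MAIN-2026-12", 304458347420, 326917453426, 99725480448)
CURRENT = ("CC-MAIN-2026-17", 306800659832, 329109389726, 100385941968)


def deltas(prev, curr):
    """Per-column increase between two consecutive cumulative rows."""
    return tuple(c - p for p, c in zip(prev[1:], curr[1:]))


if __name__ == "__main__":
    print(CURRENT[0], deltas(PREVIOUS, CURRENT))
    # prints: CC-MAIN-2026-17 (2342312412, 2191936300, 660461520)
```

A negative delta in any column would indicate a row appended out of order or a counter that was not carried forward.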
