program-explorer/notes at main · aconz2/program-explorer · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
virt-customize -a ~/Downloads/Fedora-Cloud-Base-Generic.x86_64-40-1.14.qcow2 --root-password password:hello
qemu-system-x86_64 -smp 2 -enable-kvm -m 2048 -drive file=~/Downloads/Fedora-Cloud-Base-Generic.x86_64-40-1.14.qcow2

taskset -c 0 hyperfine --shell=none --warmup=500 --runs=1000 'gcc --version'

kata not supported with podman


# had to get the full to get runc. using containerd not from dnf because outside toolbox.
wget https://github.com/containerd/nerdctl/releases/download/v1.7.6/nerdctl-full-1.7.6-linux-amd64.tar.gz
wget https://github.com/containerd/containerd/releases/download/v1.7.18/containerd-1.7.18-linux-amd64.tar.gz
# tar -xf ...

from inside toolbox wasn't working, was getting
FATA[0009] failed to extract layer sha256:d3e8d42f967c9c00049f90237e1bf4a460d18c28895292d2bb4a0702f661a745: failed to mount /var/lib/containerd/tmpmounts/containerd-mount2679852185: invalid argument: unknown

outside toolbox
sudo bin/containerd
→ sudo ln -s $(readlink -f bin/runc) /usr/local/bin/runc
sudo ./nerdctl run --rm --network=none gcc:14.1.0 gcc --version

→ sudo ./nerdctl run --runtime io.containerd.kata.v2 --rm --network=none gcc:14.1.0 gcc --version

/usr/share/kata-containers/defaults/configuration.toml: file /var/cache/kata-containers/vmlinuz.container does not exist

this is saying the kernel isn't there

wget https://github.com/kata-containers/kata-containers/releases/download/3.6.0/kata-static-3.6.0-amd64.tar.xz
tar -xf -C kata-static-3.6.0-amd64
→ sudo ln -s $(readlink -f kata-static-3.6.0-amd64/kata/share/kata-containers/kata-containers-initrd.img) /var/cache/kata-containers/kata-containers-initrd.img
→ sudo ln -s $(readlink -f kata-static-3.6.0-amd64/kata/share/kata-containers/vmlinuz.container) /var/cache/kata-containers/vmlinuz.container

get vmlinuz and initramfs from image
→ virt-get-kernel -a ~/Downloads/Fedora-Cloud-Base-Generic.x86_64-40-1.14.qcow2
vmlinuz is a compressed version of the whole linux kernel
initramfs

https://mergeboard.com/blog/2-qemu-microvm-docker/
https://gist.github.com/mikaelhg/7a67901affe56bdf22eb398606945a23
https://github.com/qemu/qemu/blob/master/docs/system/i386/microvm.rst
https://documentation.suse.com/sles/12-SP5/html/SLES-all/cha-qemu-running.html

/dev/sd* is scsi
/dev/hd* is hard drive
/dev/vd* is virtualized

blkid in the rescue shell works
cat /proc/mounts
shows that:
/dev/vda1 label=p.legacy
/dev/vda2 label=dfi
/dev/vda3 label=boot type=ext4
/dev/vda4 label=fedora type=btrfs
trying with -append "root=/dev/vda{1,2,3,4}" didn't work, 3 and 4 didn't blow up with file type errors but not /sysroot or something

→ virt-cat -a ~/Downloads/Fedora-Cloud-Base-Generic.x86_64-40-1.14.qcow2 /etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="no_timer_check net.ifnames=0 console=tty1 console=ttyS0,115200n8 rootflags=subvol=root"

with /dev/vda4
/sysroot has /root /home and /var
btrfs sub list /sysroot shows these are subvolumes

look in bootbasic2.sh for working examples

EDITOR=vi virt-edit fedora-cloud-base.raw /usr/local/bin/boot.sh

→ virt-cat fedora-cloud-base.raw /etc/systemd/system/myboot.service
[Unit]
Description=My boot service

[Service]
Type=oneshot
ExecStart=/usr/local/bin/boot.sh
StandardOutput=journal+console

[Install]
WantedBy=multi-user.target

zcat kata-alpine-3.18.initrd | cpio -t

rdinit=/bin/bash

https://github.com/jqueuniet/docker-to-initramfs/blob/master/context/init

inside podman run --rm -it alpine:3.20.1
apk add linux-lts
mkinitfs --help
cat /etc/mkinitfs/mkinitfs.conf
cat /usr/share/mkinitfs/initramfs-init

wget https://raw.githubusercontent.com/torvalds/linux/master/scripts/extract-ikconfig
wget https://raw.githubusercontent.com/torvalds/linux/master/scripts/extract-vmlinux
bash extract-ikconfig ~/Downloads/kata-static-3.6.0-amd64/kata/share/kata-containers/vmlinux-6.1.62-132
bash extract-vmlinux alpine/vmlinuz-lts > alpine/vmlinux-lts
# https://wiki.gentoo.org/wiki/Custom_Initramfs

virt-df alpine/nocloud_alpine-3.20.1.raw

mkdir iso
sudo mount -t auto -o loop alpine-virt-3.20.1-x86_64.iso iso
less iso/boot/config-6.6.34-1-virt

→ git clone --single-branch --branch v6.6 --depth 1 https://github.com/torvalds/linux
$ wget https://raw.githubusercontent.com/cloud-hypervisor/cloud-hypervisor/main/resources/linux-config-x86_64
cp linux-config-x86_64 .config
make -j vmlinux

wget https://cloud.debian.org/images/cloud/bookworm/latest/debian-12-nocloud-amd64.qcow2
qemu-img convert debian-12-nocloud-amd64.qcow2 debian-12-nocloud-amd64.raw  # 2 GB!
wget https://github.com/cloud-hypervisor/cloud-hypervisor/releases/download/v40.0/cloud-hypervisor-static
chmod +x cloud-hypervisor-static

wget https://github.com/wagoodman/dive/releases/download/v0.12.0/dive_0.12.0_linux_amd64.tar.gz
dive --source podman distroless/static-debian12
make -j vmlinux

okay can use bash makecpioinit.sh to create an initramfs with busybox
using #!/bin/busybox as the shebang line ran busybox as the init, not my shell script
so use #!/bin/busybox sh

wget https://www.busybox.net/downloads/binaries/1.35.0-x86_64-linux-musl/busybox

vsock uses a context identifier cid and a port. cid is assigned at vm start time and is unique per vm.
from inside the guest, connecting to the host is always with cid=2.
we can have many connections, one per port, and per cid / sock
https://github.com/cloud-hypervisor/cloud-hypervisor/blob/main/docs/vsock.md
a process on the host creates a socket at the path given to cloud-hypervisor with _<port> appended to it (!)
socat - UNIX-LISTEN:/tmp/ch.vsock_42
cloud-hypervisor ... --vsock cid=123,socket=/tmp/ch.vsock
and connections from host to guest start with a message "CONNECT <port>\n" to establish connection to the right port

todo could get cloud hypervisor to catch kernel panics with --pvpanic and --event-monitor but not sure this actually exits, I think the process still needs to exit/kill

mkdir /tmp/gcc
id=$(podman create docker.io/library/gcc:14.1.0)
podman export "$id" | tar -xC /tmp/gcc
mksquashfs /tmp/gcc gcc-squashfs.sqfs
podman rm "$id"

id=$(podman create docker.io/library/gcc:14.1.0)
podman export "$id" | sqfstar gcc-14.sqfs
podman rm "$id"

sudo mount -t squashfs -o loop gcc-squashfs.sqfs /tmp/gcc-mnt
mkfs.ext4 -d /tmp/gcc gcc-ext4.img 2g

was getting mount: mounting /dev/vda on /mnt/bundle/rootfs failed: No such device
updated linux config with CONFIG_MISC_FILESYSTEMS=y
make menuconfig
Filesystems > misc filesystems > squashfs

now getting a crun error could not join cgroup
b/c /proc/self/cgroup is empty
was because needed to use mount -t cgroup2

was getting pivot_root: invalid argument
# https://github.com/containers/crun/issues/56 pivot_root appears to not work with tmpfs apparently is unsafe?
maybe workaround https://github.com/containers/bubblewrap/issues/592
some more notes
https://news.ycombinator.com/item?id=23167383

okay I could make an ext4 image from the cpio, but cloud hypervisor doesn't support --disk multiple times, so we couldn't mount an ext4 rootfs and a container bundle squashfs anyways!

okay so it looks like kata switches off pivot_root when using the agent as init, as seen in sandbox, it sends that config off to crio or runc so we don't actually see that codepath.
https://kernel.org/doc/Documentation/filesystems/ramfs-rootfs-initramfs.txt says
both initrd and initramfs are cpio so we can't quite tell just from file
busybox mount shows we have rootfs on / both before and after the magic switcheroo configuration
crun does pivot_root . .
lets verify that with strace
building strace static needed LDFLAGS='-static' and dnf install glibc-static

got gdb server running with qemu by using -gdb tcp::1234 (two fing colons was getting no error about it)
add -device pvpanic-pci so that it exits on shutdown
use -S so it pauses at startup
can't use pvpanic with microvm, but thats okay

connect with lldb -o 'gdb-remote localhost:1234' ~/Repos/linux/vmlinux
search for symbols with
image lookup -r -n '.*pivot_root.*'
break set -r '.*pivot_root.*'
then 'c' to continue
but our breakpoint isn't being hit! b/c its not being set error: 34 sending the breakpoint request

building kernel with debug info CONFIG_DEBUG_INFO=y CONFIG_DEBUG_INFO_SPLIT=y
this produces .dwo files spread around the dir
use hardware breakpoints and disable kaslr
the symbol we're actually hitting is __x64_sys_pivot_root

okay so I think a bare pivot_root at / of pivot_root . . fails because new_mnt->mnt_flags has MNT_LOCKED
we don't seem to actually need an unshare

/proc/<pid|self>/mountinfo
https://www.kernel.org/doc/Documentation/filesystems/proc.txt

for getting crun to build statically:
dnf install libseccomp-static libcap-static glibc-static
CFLAGS='-static -Wl,-static' ./configure --disable-systemd --enable-embedded-yajl && make --trace

okay getting seccomp violation --seccomp log and syscall=72 by looking in journalctl
this is fcntl and no idea why, shows for vmm,vcpu0,__console,__rng like all the things

tried putting the sqfs image in the initramfs and it wasn't working and was slower anyways to boot even without running

for cloud-hypervisor
cargo build --profile=profiling --target x86_64-unknown-linux-musl
cargo clean && RUSTFLAGS='-C force-frame-pointers=y' cargo build --profile profiling
cargo clean && RUSTFLAGS='-C force-frame-pointers=y' cargo build --profile profiling --target x86_64-unknown-linux-musl
cargo clean && cargo build --profile profiling --target x86_64-unknown-linux-musl --features tracing

→ venv/bin/python -i analyzesqfs.py gcc-13.3.0.sqfs gcc-14.1.0.sqfs
gcc-13.3.0.sqfs     421.45 Mb (compressed)    1347.61 Mb (uncompressed)      22806 files       2850 dirs
gcc-14.1.0.sqfs     432.04 Mb (compressed)    1381.28 Mb (uncompressed)      22894 files       2852 dirs
20179     832.07 Mb shared 60.24%

under normal circumstances, the vmm unlinks /tmp/ch.sock BUT then what is that even for, it binds it and listens

trying to figure out why busybox doesn't have the uuid in blkid
on my system doing
sudo strace --trace=open,openat,read blkid --cache-file /dev/null |& less

was trying to do this from within the toolbox container and blkid wasn't getting enough permissions
solution was to
→ sudo podman run --privileged --rm -it fedora

hmm I think its getting the label from somehwere deep

you can use multiple disks! with --disk path=gcc-14.1.0.sqfs,readonly=on,id=gcc14 path=gcc-13.3.0.sqfs,readonly=on,id=gcc13

pmem file has to be aligned to 2MB (I think so it can be hugepaged even though we don't have to run with hugepages)
okay getting a sigbus when using --memory size=1024M,hugepages=on
is at /dev/pmem0
find /run/bundle/rootfs > /dev/null
  with pmem0 is 240ms
  with vda   is 265ms

how does the guest get notified of hotplug events?

okay you can just write to /dev/pmem. opening in append mode doesnt actually append

from looking at boot times, two things I'm disabling are
CONFIG_BLK_DEV_NULL_BLK=n  # used for benchmarking or something
CONFIG_TASKSTATS=n         # process stats over netlink socket
CONFIG_BLK_DEV_RAM=n       # used for block ramdisks
CONFIG_ZRAM=n              # we're never gonna swap out
CONFIG_SWAP=n              # mainly to disable zswap
CONFIG_HID=n
CONFIG_FB_CMDLINE
CONFIG_BACKLIGHT_CLASS_DEVICE=n
CONFIG_LCD_CLASS_DEVICE
CONFIG_VGA_CONSOLE
okay lots more now

using --memory size=1024M,thp=on works bug ,hugpages=on doesn't because the host has thp enabled
→ cat /sys/kernel/mm/transparent_hugepage/enabled
always [madvise] never

google/bloaty needs dwarf v4
bloaty vmlinux -d compileunits -n 0 -s vm > bloaty.out
only interesting thing so far is that AMD_MEM_ENCRYPT takes up nontrivial space
CONFIG_NLS

so with the http api, once the vm shuts down, the cloud hypervisor process itself exits, so we can't save time by keeping it around. if we have the init just sleep we could reboot I think
when the guest is sleeping, doing a vm.shutdown or vm.delete does not cause the ch proc to exit
overall getting very similar times for command line and api :(

build cloud-hypervisor with --features tracing
~/Repos/cloud-hypervisor/scripts/ch-trace-visualiser.py
not that informative
okay can also build with --features dhat-heap to get heap tracing

pre seccompiler changes dhat-heap-001.json
dhat: Total:     537,569 bytes in 5,029 blocks
dhat: At t-gmax: 146,902 bytes in 399 blocks
dhat: At t-end:  1,996 bytes in 11 blocks

to override the cargo deps we do
[patch.crates-io]
seccompiler = { path = "../seccompiler" }

image lookup -r -n .*create_pkg_length.*
break set -r .*create_pkg_length.*
ldb -o 'break set -r .*create_pkg_length.*' ../cloud-hypervisor/target/x86_64-unknown-linux-musl/profiling/cloud-hypervisor -- --api-socket /tmp/chapi.sock
then bash startchapi.sh

getting a permission denied to clone something about console. was passing --console off in the command line but the json api payload had console=tty so duh

sudo perf probe -x ../cloud-hypervisor/target/x86_64-unknown-linux-musl/profiling/cloud-hypervisor --funcs
sudo perf probe -x ../cloud-hypervisor/target/x86_64-unknown-linux-musl/profiling/cloud-hypervisor --funcs --no-demangle --filter='*to_aml_bytes*'

okay moved tracer::start() to start_vmm_thread and trace_scoped!("get_seccomp_filter"
vmm get_seccomp_filter                   0.01ms
vmm get_seccomp_filter                   0.05ms
main get_seccomp_filter                  0.28ms

so not much there
and for acpi tables, I see:

vmm create_facp_table                    0.00ms
vmm create_dsdt_table                    0.47ms
vmm create_acpi_tables                   0.61ms

added CONFIG_FTRACE and CONFIG_BOOTTIME_TRACING along with kernel commandline kernel.tp_printk

patched a github dep with
[patch."https://github.com/rust-vmm/acpi_tables"]
acpi_tables = { path = "../acpi_tables" }

→ jq < $(podman inspect lucid_varahamihira --format '{{.OCIConfigPath}}')

so I'm getting a mkdir /output Read-only file system error when specifying /output as a bind mount in config.json I think because the squashfs rootfs is read only. So I think we need to do the overlayfs thing

installed squashfs-tools-ng-devel

→ (cd peinit && bash build.sh) && bash makeinitramfs.sh && bash cloudhypervisormyinit.sh

if i do rootless with uid 0/0 and no user namespace, it works
        rootless with uid 0/0 and a  user namespace, it doesn't work
        rootless with uid 1000/1000 and no usernamespace, it works

https://github.com/containers/crun/issues/1536

okay Command::spawn creates a new socketpair itself...

getting error from the ch http server that something is einval, traced to an accept (accept4) call that is einval.

now trying with unixlistener and unixstream like fro the test, but maybe getting enotasock=88 b/c unixlistener uses cloexec

crun seems to be messing up the id_map and I can't get the static build to work anymore so try to get that working and then use it to debug the mysterious thing, when it works without a linux.uidMapping in config.json, it does a

512<3> openat(AT_FDCWD, "/proc/513/gid_map", O_WRONLY|O_CREAT|O_TRUNC|O_CLOEXEC, 0700) = 6
512<3> write(6, "0\0001000\0001\0", 9)  = -1 EINVAL (Invalid argument)
512<3> openat(AT_FDCWD, "/proc/513/setgroups", O_WRONLY|O_CREAT|O_TRUNC|O_CLOEXEC, 0700) = 6
512<3> write(6, "deny", 4)              = 4
512<3> openat(AT_FDCWD, "/proc/513/gid_map", O_WRONLY|O_CREAT|O_TRUNC|O_CLOEXEC, 0700) = 6
512<3> write(6, "0 1000 1\n", 9)        = 9
515<3> +++ exited with 1 +++
512<3> openat(AT_FDCWD, "/proc/513/uid_map", O_WRONLY|O_CREAT|O_TRUNC|O_CLOEXEC, 0700) = 6
512<3> write(6, "0\0001000\0001\0", 9)  = -1 EINVAL (Invalid argument)
512<3> openat(AT_FDCWD, "/proc/513/uid_map", O_WRONLY|O_CREAT|O_TRUNC|O_CLOEXEC, 0700) = 6
512<3> write(6, "0 1000 1\n", 9)        = 9

and when it fails it does

512<3> openat(AT_FDCWD, "/proc/513/gid_map", O_WRONLY|O_CREAT|O_TRUNC|O_CLOEXEC, 0700) = 6
512<3> write(6, "0\00010000\0001000\0", 13) = -1 EINVAL (Invalid argument)

yay this came from a bug! https://github.com/containers/crun/issues/1585

okay crun building
on f39 b/c f40 breaks something with pthread
dnf install libseccomp-devel libcap-devel libseccomp-static libcap-static glibc-static
./autogen.sh
export CFLAGS='-static -static-libgcc -Wl,-static'
./configure  --disable-systemd --enable-embedded-yajl
make -j8

crun can change the container process uid_map because it is in the parent user namespace

using umoci: pretty slow to do the unpacking...
skopeo copy docker://library/gcc:14.1.0 oci:/tmp/gcc14:14.1.0
skopeo inspect containers-storage:localhost/maggie-rstudio | less
./umoci unpack --rootless --uid-map 0:1000:1000 --gid-map 0:1000:1000 --image /tmp/gcc14:14.1.0 /tmp/rootfs

okay trying crane
→ wget https://github.com/google/go-containerregistry/releases/download/v0.20.2/go-containerregistry_Linux_x86_64.tar.gz
https://github.com/google/go-containerregistry
crane config docker.io/library/gcc:14.1.0
# pulls and saves single layer tar
crane export docker.io/library/gcc:14.1.0 exportgcc14.1.0.tar
# pulls and saves single tar
crane pull docker.io/library/gcc:14.1.0 gcc14.1.0.tar
# pulls and saves in oci format
crane pull --format=oci docker.io/library/gcc:14.1.0 gcc14.1.0.tar
# flatten saved image
# but I guess it can't do so for an oci folder with layer sharing??
crane export - gcc-flattened.14.1.0.tar < gcc14.1.0.tar
when doing crane pull --format=oci it defaults to pulling all architectures!
okay still can I do a pull with layer caching without duplicating?
crane pull --format=oci --platform linux/amd64 --cache_path layers docker.io/library/gcc:14.1.0 gcc14again
this at least uses copy_file_range so btrfs should not actually copy

created a service account with "Artifact Registry Writer"
downloaded key, then
→ cat ~/Downloads/compute-247219-1b24685cd41f.json | podman login -u _json_key --password-stdin https://us-central1-docker.pkg.dev
# podman build -f whiteouttest.Containerfile -t whiteouttest:001
→ podman push whiteouttest:001 us-central1-docker.pkg.dev/compute-247219/testingrepo/whiteouttest:001

okay so on gcr if we push a layer with an opaque whiteout it actually does run correctly

→ skopeo copy containers-storage:localhost/whiteouttest:001 docker-archive:whiteout.tar
→ crane export - - < whiteout.tar | tar tvf -

crane confirmed does not support opaque whiteouts

okay so https://github.com/sylabs/oci-tools/blob/main/pkg/mutate/squash.go#L297 has what I need, lets write a cmdline tool to use that

go mod init peimage
add deps in import, then go mod tidy

→ crane pull gcc:14.1.0 --platform linux/amd64 --format oci --annotate-ref oci

go run peimage.go export /tmp/peimage/ocismall busybox:1.37 busybox:1.36.1 busybox:1.36.0  > /tmp/foo.tar
# doesn't play well with stdin unfortunately
mkfs.erofs --tar =f /tmp/peimage/ocismall.erofs /tmp/foo.tar

go run peimage.go image /tmp/peimage/ocismall.sqfs  /tmp/peimage/ocismall busybox:1.37 busybox:1.36.1 busybox:1.36.0
go run peimage.go image /tmp/peimage/ocismall.erofs /tmp/peimage/ocismall busybox:1.37 busybox:1.36.1 busybox:1.36.0

diff <(cd perunner && cargo run -- --index ../ocismall.sqfs sh -c 'find' 2> /dev/null) <(cd perunner && cargo run -- --index ../ocismall.erofs sh -c 'find' 2> /dev/null)
< ./proc/1/map_files/5639f1954000-5639f1962000
< ./proc/1/map_files/5639f1962000-5639f1a21000
---
> ./proc/1/map_files/563187792000-5631877a0000
> ./proc/1/map_files/5631877a0000-56318785f000

note to self that find | sha256sum isn't reliable because of the above, the proc maps will come back with differences

this does better by pruning out find
diff <(cd perunner && cargo run -- --index ../ocismall.sqfs find ! -path './proc*' 2> /dev/null) <(cd perunner && cargo run -- --index ../ocismall.erofs find ! -path './proc*' 2> /dev/null)
cat <(cd perunner && cargo run -- --index ../ocismall.sqfs sh -c 'find ! -path "./proc*" | sha256sum' 2> /dev/null) <(cd perunner && cargo run -- --index ../ocismall.erofs sh -c 'find ! -path "./proc*" | sha256sum' 2> /dev/null)

right now I'm testing with sqfs zstd level 15 and erofs no compression (and when it does use zstd it defaults to 3) so maybe at some point will need to add options in peimage.go to export those

we seem to get a event!("vm", "shutdown") event whether we do shutdown from the guest or kernel panic. This is triggered by the EpollDispatch::Exit getting read in vmm and comes from the exit_evt.write(1) in vcpu.rs

ACPI
using libc::reboot(libc::LINUX_REBOOT_CMD_POWER_OFF)
the second is for sure the ACPI shutdown
  cloud-hypervisor: 107.561860ms: <vcpu0> INFO:devices/src/acpi.rs:51 -- ACPI event 128 80 10000000
  cloud-hypervisor: 107.597857ms: <vcpu0> INFO:devices/src/acpi.rs:51 -- ACPI event 52 34 110100
  cloud-hypervisor: 107.624958ms: <vcpu0> INFO:devices/src/acpi.rs:70 -- ACPI Shutdown signalled

using libc::reboot(libc::LINUX_REBOOT_CMD_POWER_OFF)
  just instantly reboots

for multi platform images,
→ crane manifest index.docker.io/library/busybox:1.36.1 | jq
returns an image index with the architecture+os data pulled out of the config
remember that an image manifest gives the config + rootfs

podman run --rm -it --network=none gcc:14.1.0
podman-oci-config | jq -S
./build.sh && (cd perunner && target/debug/perunner --spec --console echo hi) | jq -S

pingora write timeouts don't work great because the write completes instantly and then the kernel holds the buffer, interestingly curl --limit-rate 1 keeps reading even after the process is shut down. not sure why the keepalive timeout isn't getting kicked though

okay so if we want user to resolve for named things, we have to lookup /etc/passwd and /etc/group in the container's rootfs, so we have to do this after flattening and probably after it is in an erofs since that handles weird shit like symlinks. Read the entry using
https://github.com/moby/sys/blob/main/user/user.go
and
https://pkg.go.dev/gvisor.dev/gvisor/pkg/erofs#Inode
podman does this in pkg/lookup/lookup.go GetUserGroupInfo

thinking ovh eco kimsufi
https://eco.us.ovhcloud.com/?display=list&range=kimsufi

okay so docker podman and kata all add a default PATH of
podman
https://github.com/containers/podman/blob/9f1fee2a0b119024750975e2b16c1a89edc615d9/pkg/env/env.go#L19
/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin

kata
https://github.com/kata-containers/kata-containers/blob/7d34ca44205a1e4c786306b2e61de9eeda851978/src/tools/genpolicy/src/containerd.rs#L167
/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin

containerd
https://github.com/docker-archive/containerd/blob/26183b3a69a36f426785cdefe1bc9e8e233596d1/cmd/ctr/run_unix.go#L55
/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin

moby
https://github.com/moby/moby/blob/04b03cfc0a99f50088ac9c811a9992e5b05463bc/oci/defaults.go#L12
/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin

but google cloud run does not
google cloud run busybox 1.37.0-glibc (which doesn't have PATH)
does need to be configured with PATH=/bin to make
entrypoint: sh
cmd: -c env
work
and it gives
No older entries found matching current filter.
CLOUD_RUN_EXECUTION=test003-cl69x
SHLVL=1
HOME=/root
CLOUD_RUN_TASK_INDEX=0
CLOUD_RUN_TASK_ATTEMPT=0
PATH=/bin
CLOUD_RUN_TASK_COUNT=1
PWD=/
CLOUD_RUN_JOB=test003

in podman we get
→ podman run --rm busybox env
container=podman
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
HOME=/root
HOSTNAME=b74d80f01fab

→ podman run --unsetenv=container --rm busybox env
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
HOME=/root
HOSTNAME=96c0f1d08056

TODO gather some examples
test.c:
int main() {
    return 42;
}
gcc /run/pe/input/test.c -O2 -S -o -

okay in making the arch svg, why are we unpacking into /run/input and then bind mounting that to /run/pe/input? why not just directly into /run/pe/input?
and also since we never actually make /run/pe/input couldn't we skip the overlay and still do the mounts? I think I did that because crun compalined it couldn't bind to /run/pe/input in the container because /run/pe didn't exist

ugh podman --cpuset-cpus is complaining about cgroups v2 delgation; I get
→ cat "/sys/fs/cgroup/user.slice/user-$(id -u).slice/user@$(id -u).service/cgroup.controllers"
cpu io memory pids
which doesn't have cpuset
even though I can do taskset -c just fine

fixed by
sudo mkdir /etc/systemd/system/user@.service.d
cat > /etc/systemd/system/user@.service.d/delegate.conf << EOF
[Service]
Delegate=cpu cpuset io memory pids
EOF
systemctl daemon-reload

caddy run -c caddy/dev.caddyfile

can we have the lb and worker not even have net access and only proxy over a file? and between lb and worker should we use h2c? between caddy and lb h2c?

okay so with uds, it looks like
cargo run --bin worker -- --uds /run/user/$(id -u)/program-explorer/worker1.sock
cargo run --bin lb -- --uds /run/user/$(id -u)/program-explorer/lb.sock --worker uds:/run/user/$(id -u)/program-explorer/worker1.sock

curl -v --unix-socket /run/user/1000/program-explorer/lb.sock http://localhost/api/v1/images
TODO update testclient to take unix socket

→ podman unshare
→ podman volume vmount systemd-pe-worker-images
→ cp peimage/busybox.erofs ~/.local/share/containers/storage/volumes/systemd-pe-worker-images/_data/

journalctl --user -feu pe-server-lb.service

TODO okay so if I use hyperthreads, then waiting for wall clock 1 second is not the same as 1 second of execution if you have a busy neighbor. Should I get a better way to wait for 1 second rusage? Or ditch hyperthreads for the VM? still could be all the cloud-hypervisor things and vm sharing the 2 hyperthreads, but not 1 worker per hyperthread. yeah that is probably what I want

tar unlinks an existing file if it exists
rsync writes to "foo.8ajWj" tmp name then does a rename

revisiting the dedupe, mkfs.erofs only does dedupe with -Ededupe, for silkeh/clang 17&19
mkfs.erofs has their own sha256 impl which is probably really slow
build size
2m45s 634M clang17_19_dedupe.erofs
  54s 858M clang17_19.erofs
  27s 426M clang17.erofs
  27s 432M clang19.erofs
  21s 1.3G clang19_nozstd.erofs

GOPROC=1 mkfs.erofs --workers=1
  54s clang17.erofs

GOPROC=1 mkfs.erofs --workers=2
  38s clang17.erofs

GOPROC=1 mkfs.erofs --workers=4
  30s clang17.erofs

GOPROC=1 mkfs.erofs --workers=8
  27s clang17.erofs

GOPROC=2 mkfs.erofs --workers=1
  54s clang17.erofs

perf record of peimage image build shows top 3 at 8.35 6.79 3.71 % in compress.flate

in just building clang19 by itself, we get Filesystem total deduplicated bytes (of source files): 34435072 (34M)

playing with index.docker.io/library/postgres
# this works, but needs 5 seconds for postgres to launch, spends a while on "syncing data to disk"
podman run --rm -it index.docker.io/library/postgres:latest bash -c 'POSTGRES_HOST_AUTH_METHOD=trust docker-entrypoint.sh postgres & sleep 5; psql -U postgres -c "select 1;"'

some timings on gzip tar reading, counting number of files, dirs, links, symlinks
139M sha256/38e599e367e116e9ce85f0015f97ce8e035cde32148f442bac57f5d9a3571e7d
nfiles=1781 nlinks=10 nslinks=32 ndirs=184
gzip 1.13 1633 ms (sh -c gzip -cd file > /dev/null)
Go        1543 ms
rust with default rust backend 720 ms
rust with cloudflare-zlib 538 ms
rust with libz-ng 517 ms

some timings on erofs compression type
mkfs.erofs with --workers=1
silkeh/clang:{17,19}
      size, buildtime, runtime,
zstd: 858 MB, 108s, 312 ms
lz4: 1279 MB,  45s, 214 ms
noz  2585 MB,  42s, 197 ms

just building the squashed tar stream (peimage export) for clang:{17,19} takes 42s writing to a tmpfile, the mkfs.erofs for that takes 2.68s (5.4 with lz4 compression)

the pax extension header I am hitting with gcc:13.3.0 NetLock cert path with utf8 chars uses the key `path` and/or `linkpath`

initial results for building with new squasher is looking nice
mkfifo fifo
perf stat sh -c '../target/release/squash-oci /mnt/storage/program-explorer/ocidir index.docker.io/silkeh/clang:17 > fifo & mkfs.erofs --tar=f --workers=1 -zlz4 /tmp/clang17.lz4.erofs fifo'
    3.988766890 seconds time elapsed
    3.056409000 seconds user
    2.166840000 seconds sys

2.56s elapsed with --workers=0 without compression to /tmp
3.00s elapsed with --workers=0 without compression to spinning hdd (with sync at end)

perf stat sh -c '../target/release/squash-oci /mnt/storage/program-explorer/ocidir index.docker.io/silkeh/clang:17 > fifo & mkfs.erofs --tar=f --workers=0 /mnt/storage/program-explorer/tmp/clang-17.erofs fifo; sync /mnt/storage/progam-explorer/tmp/clang-17.erofs'

1.96s with PathBuf for gcc:13.3.0
1.90 with OsString

so with the switch from tmpfile to memfd for the io file to ch, we are leaving all those fds without cloexec and so every child will have access to them. Currently only ch is called as a child but would be nice to be able to enforce that like "disallow Command" everywhere except where approved. And we should maybe be (in preexec) dup2'ing that the memfd fd to 3 then close_range(4, ~0) so that the ch process only has the fd's we want. I believe this can work for kernel, initramfs, and image as well

Initial perf of squashing to erofs with peerofs on tmpfs for silkeh/clang:17 is:
    2.091406319 seconds time elapsed
    1.479035000 seconds user
    0.600486000 seconds sys

which is pretty good! And only uses a single core, no compression of course.
in perf record, 77% is in inflate_fast_avx2 and 5% is in crc32fast::specialized::pclmulqdq::calculate
libz-ng has the ability to do the crc32 while it is copying the bytes out, but flate2 bypasses that because it first reads the gz header, so it never sees it. Put some info at https://github.com/rust-lang/flate2-rs/issues/117#issuecomment-2848248615 and added the feature nocrc to peimage but essentially no difference in real speed.
Unfortunately there isn't much to do to make the squash process go faster as inflate is the bottleneck. TODO whether it is worthwhile to inflate once on download and store either uncompressed or as zstd or lz4. Really depends on how often layers get reused.

xattrs:
dnf install -y attr
setfattr -n user.MYATTR -v value filename
getfattr -m '.+' -d filename

to get an image index for a tag (don't think you can get an index for a digest),
curl -v -H 'Accept: application/vnd.oci.image.index.v1+json' -v https://quay.io/v2/fedora/fedora/manifests/42
then use the digest to get the manifest
curl -v -H 'Accept: application/vnd.oci.image.manifest.v1+json' -v https://quay.io/v2/fedora/fedora/blobs/sha256:c21a0d95f9633eabc079773cd3d160cbacfd99fd1c7c636be86
dbfc4a96e9c4d

example testing
(cd peimage-service && env RUST_LOG=trace cargo run -- --listen /tmp/peimageservice.sock --auth ~/Secure/container-registries.json)
(cd peserver && env RUST_LOG=trace cargo run --bin worker -- --image-service /tmp/peimageservice.sock --uds /tmp/peworker.sock --worker-cpuset 30:1:2)
(cd peserver && env RUST_LOG=trace cargo run --bin lb -- --tcp localhost:6188 --worker uds:/tmp/peworker.sock)
(cd peserver && cargo run --bin testclient -- echo hi)

socat - UNIX-LISTEN:/tmp/ch.vsock_42
cloud-hypervisor --kernel vmlinux --initramfs target/debug/initramfs --cpus boot=1 --memory size=1024M --cmdline console=hvc0 --
vsock cid=42,socket=/tmp/ch.vsock --api-socket=/tmp/ch-api.sock
./ch-remote-static --api-socket /tmp/ch-api.sock pause
mkdir /tmp/vmsnap
./ch-remote-static --api-socket /tmp/ch-api.sock snapshot file:///tmp/vmsnap
./ch-remote-static --api-socket /tmp/ch-api.sock info | jq
under vsock we see the id is _vsock0
./ch-remote-static --api-socket /tmp/ch-api.sock remove-device _vsock0

/tmp/vmsnap/memory-ranges is 1G and compresses down to 18M with gzip. it is mainly 0's and c's in upper ranges

initial snapshot test
taskset -c 30,31 cargo run --bin snapshot-test
looks like boot to vsock ready is 115 ms and restoring from snapshot is 152 ms
I assume some of this is from reading the memory range from disk? true

read_volatile_from took 23 ms for 128M
13 ms in resume for resizing

env PATH=~/Repos/cloud-hypervisor/target/debug/:"$PATH" cargo run --bin snapshot-test && python ../viewtrace.py

env PATH=~/Repos/cloud-hypervisor/target/profiling/:"$PATH" cargo run --bin snapshot-test && perf script -F +pid > /tmp/ch.perf

https://profiler.firefox.com/

use -Elegacy-compress to get COMPRESSED_FULL
mkfs.erofs test.erofs testerofs/ -d5 -zlz4 -Elegacy-compress
have vmlinux in testerofs
cargo run --features lz4 --bin erofs-dump -- test.erofs
the inode gives compressed_blocks=3039, size=35331992
but at lci 243, I see the blkaddr shoot up from 2 to 4259842 so that is weird. the cluster_offset
according to some printf debugging of dump.erofs --cat, I see a max lcn of 8625
yeah so at 241 I see m_la go from 0 to 991297 with m_pa=8192 (previously always 4096)
looking at the loading of lclusters, I think 0-241 inclusive forms 1 pcluster maybe? b/c I think when it loads an lcn it also loads the head lcn, then the pattern repeats with 242-495
okay something is actually wrong, getting head2 type when shouldn't be... oof it was b/c I was missing repr(C)

okay so the first physical cluster should be 991297 long (lots of zeros in vmlinux at the beginning)
242*4096 + 64 = 991297;
so each LCI maps to one block
the next head is LCI 242 with cluster_offset=65

okay so first cluster should be 991297 and that decodes the first block correctly
so to compute that size, we have to first figure out how many blocks are in our pcluster, which we can get by looking at the next lci (if it exists) and delta[1] + 1. That also lets us look up the clusterofs of the next head cluster which we should add to the total length like block_size * n_clusters + clusterofs. Clusterofs should always be < the the block size
This is equivalent to:
the logical address in the decompressed file that an LCI corresponds to is the lci_index * block_size + cluster_offset. This logical address should always be in the range [lci_index * block_size, (lci_index + 1) * block_size]
so the decompressed size is the difference in logical address of consecutive LCI
ie for two heads i and j:
  (j*block_size + lci[j].cluster_offset) - (i*block_size + lci[i].cluster_offset)
  ==
  j*block_size + j_cluster_offset - i*block_size - i_cluster_offset
  ==
  (j-i)*block_size + j_cluster_offset - i_cluster_offset
This also applies for Plain type blocks, but we only ever have j=i+1

0: cluseter_offset=0, decoded_length=991297 = 242*4096 + 65, n_clusters=242
242: cluster_offset=64, decoded_length=1041954 = 254*4096 + 1635 - 65, n_clusters=254
total real size is

35332080
we write 35333642 which is 1562 too many bytes
agree up to 16777216
this is lcn 4096 oddly enough
so up to block 4095 (written 16778854) we are good
4096 has cluster_offset 0