System Config Info

This tool automatically collects system information on the tested GPU nodes, covering the following categories: System, Memory, CPU, Disk, Networking, Accelerator, and PCIe.

Usage

Usage on local machine

  1. Install SuperBench on the local machine with root privileges.

  2. Collect the system info by running sb node info --output-dir ${output_dir} with root privileges.

  3. After the command finishes, the system info of the local node is written to ${output_dir}/sys_info.json.
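
Once the report exists, it can be inspected from a script. The following is a minimal sketch, assuming only that sys_info.json is a JSON object at the top level; the helper names `load_sys_info` and `top_level_keys` are hypothetical and not part of SuperBench.

```python
import json
from pathlib import Path


def load_sys_info(output_dir):
    """Load <output_dir>/sys_info.json and return the parsed content."""
    path = Path(output_dir) / "sys_info.json"
    with path.open(encoding="utf-8") as f:
        return json.load(f)


def top_level_keys(sys_info):
    """Return the sorted top-level keys of the parsed report."""
    return sorted(sys_info)
```

For example, `top_level_keys(load_sys_info("outputs"))` lists whatever top-level entries the report contains, where "outputs" stands for the directory passed to --output-dir.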

Usage on multiple remote machines

  1. Install SuperBench on the local machine.

  2. Deploy SuperBench onto the remote machines.

  3. Prepare the host file of the tested GPU nodes using Ansible Inventory on the local machine.

  4. Once SuperBench is installed and the host file is ready, collect the system info automatically with the sb run --get-info command. See SuperBench CLI for the detailed command.

sb run --get-info -f host.ini --output-dir ${output_dir} -C superbench.enable=none

  5. After the command finishes, the system info of each node is written to ${output_dir}/nodes/${node_name}/sys_info.json.
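
With one report per node, a common follow-up is to check whether the nodes are configured identically. The sketch below assumes the directory layout described above and treats each report as nested JSON objects; the function names are hypothetical, and the dotted-key flattening is an illustrative choice rather than anything SuperBench provides.

```python
import json
from pathlib import Path


def load_node_infos(output_dir):
    """Map node name -> parsed sys_info.json found under <output_dir>/nodes/<node>/."""
    infos = {}
    for path in sorted(Path(output_dir).glob("nodes/*/sys_info.json")):
        with path.open(encoding="utf-8") as f:
            infos[path.parent.name] = json.load(f)
    return infos


def flatten(value, prefix=""):
    """Flatten nested dicts into {"a.b.c": leaf} pairs; lists stay as leaves."""
    if not isinstance(value, dict):
        return {prefix: value}
    flat = {}
    for key, child in value.items():
        flat.update(flatten(child, f"{prefix}.{key}" if prefix else str(key)))
    return flat


def differing_keys(infos):
    """Return the dotted keys whose values are not identical on every node."""
    flats = [flatten(info) for info in infos.values()]
    keys = set().union(*flats)
    diff = []
    for key in sorted(keys):
        # Serialize values so that unhashable leaves (lists) can be compared;
        # a key missing on some node is represented by None.
        seen = {json.dumps(flat[key], sort_keys=True) if key in flat else None
                for flat in flats}
        if len(seen) > 1:
            diff.append(key)
    return diff
```

Fields such as hostnames or timestamps are expected to differ between nodes, so the result is a starting point for review rather than a list of problems.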

Parameter and Details

System

| SubCategory | Key | Command | Description | Example |
| --- | --- | --- | --- | --- |
| OS | `system-manufacturer` | `dmidecode -s system-manufacturer` | manufacturer of the system | Microsoft Corporation |
| | `system-product name(virtual machine)` | `dmidecode -s system-product-name` | product name or virtual machine | Virtual Machine |
| | `operating_system` | `cat /proc/version` | version of the currently running OS | `Ubuntu 9.3.0-17ubuntu1~20.04` |
| | `uname` | `uname` | brief system information | `Linux sb-test-wu-000000 5.8.0-1039-azure #42~20.04.1-Ubuntu` |
| Docker | `docker_server_version` | `docker version` | server version of the Docker engine | 20.10.3 |
| | `docker_client_version` | `docker version` | client version of the Docker engine | 20.10.3 |
| VM | `vmbus` | `lsvmbus` | devices attached to the Hyper-V VMBus | `"VMBUS ID 1": "[Dynamic Memory]", "VMBUS ID 2": "Synthetic mouse", ...` |
| Kernel | `kernel_modules` | `lsmod` | list of active kernel modules | `"Module": "binfmt_misc", "Size": "24576", "Used": "1" ...` |
| | `kernel_parameters` | `sysctl` | kernel parameters | `"abi.vsyscall32": "1", "debug.exception-trace": "1", ...` |
| DMI | `dmidecode` | `dmidecode` | DMI table dump (info on hardware components) | `"dmidecode": "# dmidecode 3.2\nGetting SMBIOS data from sysfs..."` |
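
The kernel_modules example suggests that each lsmod row becomes a record with "Module", "Size", and "Used" fields. A minimal sketch of such a conversion follows; the function `parse_lsmod` and the sample text are hypothetical, and the parsing SuperBench actually performs may differ.

```python
# Hypothetical lsmod output: a header row followed by one row per module.
SAMPLE_LSMOD = """\
Module                  Size  Used by
binfmt_misc            24576  1
ib_core               319488  2 ib_ipoib,mlx5_ib
"""


def parse_lsmod(text):
    """Turn lsmod output into a list of {"Module", "Size", "Used"} records."""
    modules = []
    for line in text.strip().splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 3:
            continue  # ignore malformed or empty rows
        modules.append({"Module": fields[0], "Size": fields[1], "Used": fields[2]})
    return modules
```

Values are kept as strings here to match the quoted values in the table's example.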

Memory

| SubCategory | Key | Command | Description | Example |
| --- | --- | --- | --- | --- |
| General | `model` | `dmidecode -t memory` | distinct model name of the memory | Samsung M393A4K40DB3-CWE |
| | `type` | `dmidecode -t memory` | distinct type of memory | DDR4-3200 |
| | `clock frequency` | `dmidecode -t memory` | distinct clock frequency of memory | 3200 MT/s |
| | `channels` | `dmidecode -t memory` | the number of memory chips | 16 |
| | `capacity` | `lsmem` | the total capacity of memory | 511.9G |
| | `block_size` | `lsmem` | the block size of memory | 128M |
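
Values such as 511.9G and 128M are human-readable strings, so comparing them numerically requires a conversion. The sketch below assumes that the suffixes denote powers of 1024 (so G means GiB); this interpretation and the helper `size_to_bytes` are assumptions for illustration.

```python
# Assumption: K/M/G/T are binary multiples (powers of 1024).
UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def size_to_bytes(text):
    """Convert a size string such as "128M" or "511.9G" to a byte count."""
    text = text.strip()
    if text and text[-1].upper() in UNITS:
        return int(float(text[:-1]) * UNITS[text[-1].upper()])
    return int(float(text))  # no suffix: value is already in bytes
```

Because the reported capacity is rounded to one decimal place, the converted byte count is approximate.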

CPU

| SubCategory | Key | Command | Description | Example |
| --- | --- | --- | --- | --- |
| General | `architecture` | `lscpu` | architecture of the CPU | x86_64 |
| | `model name` | `lscpu` | model name of the CPU | AMD EPYC 7662 64-Core Processor |
| | `cpu op-mode` | `lscpu` | CPU mode: 32-bit/64-bit | 32-bit, 64-bit |
| | `byte order` | `lscpu` | byte order | Little Endian |
| | `address size` | `lscpu` | size of address | 48 bits physical, 48 bits virtual |
| | `cpus` | `lscpu` | logical CPU core count | 256 |
| | `On-line CPU(s) list` | `lscpu` | on-line logical CPU cores | 0-255 |
| | `Thread(s) per core` | `lscpu` | threads per core | 2 |
| | `Core(s) per socket` | `lscpu` | cores per socket | 64 |
| | `Socket(s)` | `lscpu` | socket count | 2 |
| | `NUMA node(s)` | `lscpu` | NUMA node count | 4 |
| | `L<x> caches` | `lscpu` | cache size | `"L1d cache": "4 MiB", "L1i cache": "4 MiB", "L2 cache": "64 MiB", "L3 cache": "512 MiB"` |
| | `NUMA node<x> CPU(s)` | `lscpu` | CPU core list of the NUMA node | `"NUMA node0 CPU(s)": "0-31,128-159", "NUMA node1 CPU(s)": "32-63,160-191", "NUMA node2 CPU(s)": "64-95,192-223", "NUMA node3 CPU(s)": "96-127,224-255"` |
| | `Flags` | `lscpu` | CPU flags | fpu vme de pse tsc msr pae mce cx8 apic ... |
| | `max_speed` | `sudo dmidecode -t processor \| grep "Speed"` | distinct CPU max frequency | 3700 MHz |
| | `current_speed` | `sudo dmidecode -t processor \| grep "Speed"` | distinct CPU current frequency | 2000 MHz |
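
The example values in this table are mutually consistent: 2 sockets × 64 cores × 2 threads gives 256 logical CPUs, and the four NUMA CPU lists also cover 256 CPUs. A minimal sketch of that cross-check, using the example values above, is shown here; the helper names are hypothetical, and the socket/core/thread product is only expected to match when all cores are online.

```python
def expand_cpu_list(spec):
    """Expand an lscpu-style list such as "0-31,128-159" into a list of ints."""
    cpus = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start, end = part.split("-")
            cpus.extend(range(int(start), int(end) + 1))  # ranges are inclusive
        else:
            cpus.append(int(part))
    return cpus


# Example values copied from the CPU table.
EXAMPLE = {
    "cpus": 256,
    "Thread(s) per core": 2,
    "Core(s) per socket": 64,
    "Socket(s)": 2,
    "NUMA node(s)": 4,
    "NUMA node0 CPU(s)": "0-31,128-159",
    "NUMA node1 CPU(s)": "32-63,160-191",
    "NUMA node2 CPU(s)": "64-95,192-223",
    "NUMA node3 CPU(s)": "96-127,224-255",
}


def expected_logical_cpus(info):
    """Sockets x cores per socket x threads per core."""
    return (int(info["Socket(s)"]) * int(info["Core(s) per socket"])
            * int(info["Thread(s) per core"]))


def numa_cpu_count(info):
    """Total number of CPUs listed across all "NUMA node<x> CPU(s)" entries."""
    return sum(len(expand_cpu_list(value)) for key, value in info.items()
               if key.startswith("NUMA node") and key.endswith("CPU(s)"))
```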

Disk

| SubCategory | Key | Command | Description | Example |
| --- | --- | --- | --- | --- |
| FileSystem | `filesystem` | `df -Th` | the name/path of the filesystem | /dev/nvme0n1p2 |
| | `avail` | `df -Th` | available size of the filesystem | 1.4T |
| | `size` | `df -Th` | total size of the filesystem | 1.8T |
| | `type` | `df -Th` | the type of the filesystem | ext4 |
| | `block_size` | `blockdev --getbsz /dev/<device>` | the block size of the filesystem | 4096 |
| | `4k_alignment` | `parted $DEVICE align-check opt 1` (for example, with `DEVICE=/dev/sdb1`) | whether the filesystem is 4K-aligned | 1 aligned |
| BlockDevice | `name` | `lsblk -e 7 -o NAME,ROTA,SIZE,MODEL` | the name of the block device | nvme0n1 |
| | `model` | `lsblk -e 7 -o NAME,ROTA,SIZE,MODEL` | the model name of the block device | VO001920KXAVP |
| | `rotational` | `lsblk -e 7 -o NAME,ROTA,SIZE,MODEL` | whether the device is rotational, that is, HDD or SSD | 0 |
| | `size` | `lsblk -e 7 -o NAME,ROTA,SIZE,MODEL` | the total size of the block device | 1.8T |
| | `block_size` | `fdisk -l -u /dev/{} \| grep "Sector size"` | the sector size of the block device | Sector size (logical/physical): 512 bytes / 512 bytes |
| General | `mapping` | `mount` | mount relationship between filesystem and block device | |
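
The block_size value of a block device is reported as the raw fdisk line rather than as numbers. The following sketch extracts the logical and physical sector sizes from a line in the format shown in the example; the helper `parse_sector_size` is hypothetical.

```python
import re

# Matches the fdisk line shown in the example column of the table.
SECTOR_RE = re.compile(
    r"Sector size \(logical/physical\):\s*(\d+) bytes / (\d+) bytes"
)


def parse_sector_size(line):
    """Return (logical, physical) sector sizes in bytes, or None if no match."""
    match = SECTOR_RE.search(line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))
```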

Networking

| SubCategory | Key | Command | Description | Example |
| --- | --- | --- | --- | --- |
| NIC | `nic_logical_name` | `lshw -c network` | logical name of the NIC | ib1 |
| | `nic_model` | `lshw -c network` | model name of the NIC | Mellanox Technologies MT28908 Family [ConnectX-6] |
| | `nic_firmware` | `lshw -c network` | firmware version | 20.30.1004 (MT_0000000594) |
| | `nic_driver` | `lshw -c network` | driver version | mlx5_core[ib_ipoib] 5.3-1.0.0 |
| | `nic_speed` | `lshw -c network` | speed spec of the NIC | 200 Gbit/s |
| | `nic_disabled` | `lshw -c network` | whether the NIC is disabled | false |
| IB | `device_info` | `ibv_devinfo -v` | list of device information for each IB device | `"hca_id:\tmlx5_0": ...` |
| | `device_status` | `ibstat` | list of device status for each IB device | `"CA 'mlx5_0'": ...` |
| General | `ofed_version` | `ofed_info -s` | the version of OFED | MLNX_OFED_LINUX-5.3-1.0.5.0: |
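
To compare nic_speed across NICs, the string can be converted to a number. This is a minimal sketch assuming the value has the shape shown in the example ("200 Gbit/s") with decimal SI prefixes; the helper `speed_to_bits_per_second` is hypothetical.

```python
import re

_SPEED_RE = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*([KMG])bit/s\s*$")
_FACTORS = {"K": 10**3, "M": 10**6, "G": 10**9}  # network rates use decimal prefixes


def speed_to_bits_per_second(text):
    """Convert a string such as "200 Gbit/s" to bits per second."""
    match = _SPEED_RE.match(text)
    if match is None:
        raise ValueError(f"unrecognized speed: {text!r}")
    return int(float(match.group(1)) * _FACTORS[match.group(2)])
```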

Accelerator

| SubCategory | Key | Command | Description | Example (NVIDIA) | Example (AMD) |
| --- | --- | --- | --- | --- | --- |
| General | `driver_version` | `nvidia-smi -q -x` / `rocm-smi -a` | driver version | 460.27.04 | 5.9.25 |
| | `topology` | `nvidia-smi topo -m` / `rocm-smi --showtopo` | GPU connection topology (nvidia only) | / | / |
| | `nvidia-container-runtime_version` | `nvidia-container-runtime -v` | version of nvidia-container-runtime (nvidia only) | 1.0.0-rc92 | N/A |
| | `nvidia-fabricmanager_version` | `nv-fabricmanager --version` | version of nvidia-fabricmanager (nvidia only) | 460.27.04 | N/A |
| | `nv_peer_mem_version` | `dpkg -l \| grep 'nvidia-peer-memory'` | version of nv_peer_mem (nvidia only) | 1.1-0 | N/A |
| GPUCard | `rocm_info` | `rocm-smi -a` & `rocm-smi --showmeminfo vram` | AMD GPU info of each `gpu<index>`, including firmware, frequency, memory, etc. (amd only) | N/A | `"card0": ... "card1": ...` |
| | `nvidia_info` | `nvidia-smi -q` | NVIDIA GPU info list of each GPU, including firmware, frequency, memory, etc. (nvidia only) | `"timestamp": "Fri Aug 20 05:36:24 2021", "driver_version": "460.27.04", "cuda_version": "11.2", "attached_gpus": "8", "gpu": [...] ...` | N/A |
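
The driver_version row uses the XML output of nvidia-smi -q -x, whose top-level fields match the keys shown in the nvidia_info example. The sketch below reads those fields from a heavily trimmed, hypothetical XML sample; real output contains many more elements per GPU, and this is not necessarily how SuperBench parses it.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed sample in the shape of `nvidia-smi -q -x` output.
SAMPLE_XML = """<nvidia_smi_log>
  <timestamp>Fri Aug 20 05:36:24 2021</timestamp>
  <driver_version>460.27.04</driver_version>
  <cuda_version>11.2</cuda_version>
  <attached_gpus>2</attached_gpus>
  <gpu id="00000001:00:00.0"><product_name>GPU A</product_name></gpu>
  <gpu id="00000002:00:00.0"><product_name>GPU B</product_name></gpu>
</nvidia_smi_log>"""


def summarize_nvidia_smi(xml_text):
    """Extract the top-level driver fields and the GPU ids from the XML."""
    root = ET.fromstring(xml_text)
    return {
        "driver_version": root.findtext("driver_version"),
        "cuda_version": root.findtext("cuda_version"),
        "attached_gpus": root.findtext("attached_gpus"),
        "gpu_ids": [gpu.get("id") for gpu in root.findall("gpu")],
    }
```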

PCIe

| SubCategory | Key | Command | Description | Example |
| --- | --- | --- | --- | --- |
| General | `topology` | `lspci -t -vvv` | topology of installed PCI devices | / |
| | `device_info` | `lspci -vvv` | device info on installed PCI devices | 00:00.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Root Complex... |
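
The first line of each lspci device entry has the form "slot class: device", as in the example above. A minimal sketch for splitting such a header line follows; the helper `parse_lspci_header` is hypothetical and ignores the indented detail lines that lspci -vvv prints under each device.

```python
import re

# Slot is [domain:]bus:device.function in hexadecimal.
_LSPCI_RE = re.compile(
    r"^(?P<slot>(?:[0-9a-f]{4}:)?[0-9a-f]{2}:[0-9a-f]{2}\.[0-7])\s+"
    r"(?P<cls>[^:]+):\s+(?P<device>.+)$"
)


def parse_lspci_header(line):
    """Split an lspci device header into slot, class, and device name."""
    match = _LSPCI_RE.match(line.strip())
    if match is None:
        return None
    return {
        "slot": match.group("slot"),
        "class": match.group("cls"),
        "device": match.group("device"),
    }
```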