Commit a8f0aaa
feat(crashtracking)!: collect all threads (#1878)
# What does this PR do?
When a process crashes, the crashtracker now captures a full stack trace
for every live non-crashing thread and includes these traces in the crash
report under `error.threads`.
Previously only the crashing thread's stack was reported. Customers had
no visibility into what other threads were doing at the time of the
crash, making it much harder to diagnose race conditions, deadlocks, and
cross-thread interactions.
Thread collection runs entirely in the receiver process after it has
finished reading the crash pipe. Because the crashed process stays alive
in its signal handler until the receiver calls `finish()`, all threads
remain valid ptrace targets for the entire collection window.
The receiver:
1. Enumerates threads from `/proc/<pid>/task/`
2. Attaches to each non-crashing thread with `PTRACE_SEIZE` +
`PTRACE_INTERRUPT`
3. Uses libunwind remote unwinding (`_UPT_create` / `unw_init_remote` /
`unw_step_remote`) to walk each thread's stack
4. Reads thread names and states from `/proc/<pid>/task/<tid>/stat`

Thread collection is opt-in via
`CrashtrackerConfiguration::set_collect_all_threads(true)` and is capped
at 256 threads. Users can set a lower limit with
`CrashtrackerConfiguration::set_max_threads()`. The collection timeout
is the remaining slice of the receiver's deadline after the crash pipe
is fully read, so the total receiver lifetime is always bounded by
`config.timeout()`.
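The `/proc`-based parts of the steps above (enumerating thread IDs and reading each thread's name and state) can be sketched with the standard library alone. This is a minimal illustration, not the crashtracker's actual code: `ThreadStat`, `list_tids`, and `parse_stat` are hypothetical names, and the ptrace attach and libunwind unwinding steps are left out.

```rust
use std::fs;
use std::io;

/// Name and state of one thread, as found in a `stat` line.
/// (Hypothetical type, for illustration only.)
#[derive(Debug, PartialEq)]
struct ThreadStat {
    name: String,
    state: char,
}

/// Lists the thread IDs of `pid` by reading the entries of /proc/<pid>/task/.
fn list_tids(pid: u32) -> io::Result<Vec<u32>> {
    let mut tids: Vec<u32> = Vec::new();
    for entry in fs::read_dir(format!("/proc/{pid}/task"))? {
        // Each entry is a directory named after a numeric thread ID;
        // anything that does not parse as a number is skipped.
        let file_name = entry?.file_name();
        if let Some(tid) = file_name.to_str().and_then(|s| s.parse::<u32>().ok()) {
            tids.push(tid);
        }
    }
    tids.sort_unstable();
    Ok(tids)
}

/// Parses the thread name (`comm`) and state from the contents of
/// /proc/<pid>/task/<tid>/stat, which look like `1234 (name) R 1 ...`.
/// The name may itself contain spaces or parentheses, so it is taken as
/// everything between the first '(' and the last ')'.
fn parse_stat(stat: &str) -> Option<ThreadStat> {
    let open = stat.find('(')?;
    let close = stat.rfind(')')?;
    if close < open {
        return None;
    }
    let name = stat[open + 1..close].to_string();
    // The state is the first character of the field after the name.
    let state = stat[close + 1..].split_whitespace().next()?.chars().next()?;
    Some(ThreadStat { name, state })
}
```

A caller would combine the two by reading `/proc/<pid>/task/<tid>/stat` for each ID returned by `list_tids` and passing the contents to `parse_stat`. Threads can exit between enumeration and the read, so a missing file should be treated as a skipped thread rather than a fatal error.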
# Breaking changes
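The data schema is updated to 1.7; there is now a `Threads` object.
The errors intake upload still sends an array of `ThreadData`, as that is
the schema that the enrichment pipeline expects.
`CrashtrackerConfiguration` now has two new fields, `collect_all_threads`
and `max_threads`, with matching setters and builder methods: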
```
pub struct CrashtrackerConfiguration {
    additional_files: Vec<String>,
    collect_all_threads: bool, // <-- NEW
    create_alt_stack: bool,
    demangle_names: bool,
    endpoint: Option<Endpoint>,
    max_threads: usize, // <-- NEW
    resolve_frames: StacktraceCollection,
    signals: Vec<i32>,
    timeout: Duration,
    unix_socket_path: Option<String>,
    use_alt_stack: bool,
}
```
```
config.set_collect_all_threads(true);
config.set_max_threads(128); // how many threads to collect
```
```
CrashtrackerConfiguration::builder()
    .collect_all_threads(true)
    .max_threads(128)
    .build()?;
```
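As a rough illustration of how the two bounds from the description (the 256-thread cap and the remaining-deadline timeout) could be computed, here is a small standalone sketch. It assumes the user-supplied limit is simply clamped to the cap, which the description implies but does not spell out. `MAX_THREADS_CAP`, `effective_max_threads`, and `collection_budget` are hypothetical names, not part of the crate's API.

```rust
use std::time::{Duration, Instant};

/// Hard upper bound on collected threads, per the PR description.
/// (Hypothetical constant name.)
const MAX_THREADS_CAP: usize = 256;

/// Clamps a user-requested thread limit to the hard cap.
/// Assumption: values above the cap are silently reduced to it.
fn effective_max_threads(requested: usize) -> usize {
    requested.min(MAX_THREADS_CAP)
}

/// Time left for thread collection: whatever remains of the receiver's
/// deadline (`receiver_start + timeout`) at the moment `now`, i.e. once the
/// crash pipe has been fully read. Returns zero if the deadline has passed.
fn collection_budget(receiver_start: Instant, timeout: Duration, now: Instant) -> Duration {
    let deadline = receiver_start + timeout;
    deadline.saturating_duration_since(now)
}
```

Because the budget is derived from a single deadline rather than a separate timer, a slow pipe read shrinks the collection window instead of extending the receiver's total lifetime.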
# Additional notes
A possible enhancement is to collect the crashing thread in the receiver
as well, since enumerating all threads already touches the crashing
thread. The receiver would handle the crashing thread first, then walk
the remaining threads.
This is left for a future PR/investigation, because it changes the core
unwinding logic for the crashing thread.
# How to test the change?
Crash a multi-threaded program with all-thread collection turned on.
Bin tests were also added to cover the end-to-end flow. Sample output:
```
{
"is_crash": true,
"kind": "UnixSignal",
"message": "Process terminated with SEGV_MAPERR (SIGSEGV)",
"thread_name": "crashtracker_bi",
"source_type": "Crashtracking",
"stack": {
"format": "Datadog Crashtracker 1.0",
"frames": [
{
"ip": "0x5d1e89d2adcb",
"module_base_address": "0x5d1e89cfd000",
"sp": "0x7ffc9f327330",
"build_id": "8d4cd090dde4a270bed5fb7f8168dcc1291051e0",
"build_id_type": "GNU",
"file_type": "ELF",
"path": "/home/bits/go/src/github.com/DataDog/libdatadog/target/release/crashtracker_bin_test",
"relative_address": "0x000000000002ddcb",
"column": 13,
"file": "/home/bits/go/src/github.com/DataDog/libdatadog/bin_tests/src/bin/crashtracker_bin_test.rs",
"function": "cause_segfault",
"line": 37
},
...
{
"ip": "0x5d1e89d29405",
"module_base_address": "0x5d1e89cfd000",
"sp": "0x7ffc9f327d70",
"build_id": "8d4cd090dde4a270bed5fb7f8168dcc1291051e0",
"build_id_type": "GNU",
"file_type": "ELF",
"path": "/home/bits/go/src/github.com/DataDog/libdatadog/target/release/crashtracker_bin_test",
"relative_address": "0x000000000002c405",
"function": "_start"
}
],
"incomplete": false
},
"threads": [
{
"crashed": false,
"name": "ct_worker_0",
"stack": {
"format": "Datadog Crashtracker 1.0",
"frames": [
{
"ip": "0x5d1e89d34b57",
"sp": "0x778c6131ed48",
"build_id": "8d4cd090dde4a270bed5fb7f8168dcc1291051e0",
"build_id_type": "GNU",
"file_type": "ELF",
"path": "/home/bits/go/src/github.com/DataDog/libdatadog/target/release/crashtracker_bin_test",
"relative_address": "0x0000000000037b57",
"column": 9,
"file": "/home/bits/go/src/github.com/DataDog/libdatadog/bin_tests/src/modes/unix/test_017_multi_thread_collection.rs",
"function": "worker_fn_0",
"line": 21
},
{
"ip": "0x5d1e89d34b2c",
"sp": "0x778c6131ed50",
"build_id": "8d4cd090dde4a270bed5fb7f8168dcc1291051e0",
"build_id_type": "GNU",
"file_type": "ELF",
"path": "/home/bits/go/src/github.com/DataDog/libdatadog/target/release/crashtracker_bin_test",
"relative_address": "0x0000000000037b2c",
"column": 18,
"file": "/home/bits/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/sys/backtrace.rs",
"function": "__rust_begin_short_backtrace<bin_tests::modes::unix::test_017_multi_thread_collection::{impl#0}::post::{closure_env#0}, ()>",
"line": 158
},
{
"ip": "0x5d1e89d349e4",
"sp": "0x778c6131ed70",
"build_id": "8d4cd090dde4a270bed5fb7f8168dcc1291051e0",
"build_id_type": "GNU",
"file_type": "ELF",
"path": "/home/bits/go/src/github.com/DataDog/libdatadog/target/release/crashtracker_bin_test",
"relative_address": "0x00000000000379e4",
"column": 5,
"file": "/home/bits/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/ops/function.rs",
"function": "call_once<std::thread::{impl#0}::spawn_unchecked_::{closure_env#1}<bin_tests::modes::unix::test_017_multi_thread_collection::{impl#0}::post::{closure_env#0}, ()>, ()>",
"line": 250
},
{
"ip": "0x5d1e89d8223f",
"sp": "0x778c6131edf0",
"build_id": "8d4cd090dde4a270bed5fb7f8168dcc1291051e0",
"build_id_type": "GNU",
"file_type": "ELF",
"path": "/home/bits/go/src/github.com/DataDog/libdatadog/target/release/crashtracker_bin_test",
"relative_address": "0x000000000008523f",
"column": 17,
"file": "/rustc/ded5c06cf21d2b93bffd5d884aa6e96934ee4234/library/std/src/sys/thread/unix.rs",
"function": "std::sys::thread::unix::Thread::new::thread_start",
"line": 126
},
{
"ip": "0x778c613c7ac3",
"sp": "0x778c6131ee20",
"build_id": "4f7b0c955c3d81d7cac1501a2498b69d1d82bfe7",
"build_id_type": "GNU",
"file_type": "ELF",
"path": "/usr/lib/x86_64-linux-gnu/libc.so.6",
"relative_address": "0x0000000000094ac3",
"column": 8,
"file": "./nptl/pthread_create.c",
"function": "start_thread",
"line": 442
},
{
"ip": "0x778c614598c0",
"sp": "0x778c6131eec0",
"build_id": "4f7b0c955c3d81d7cac1501a2498b69d1d82bfe7",
"build_id_type": "GNU",
"file_type": "ELF",
"path": "/usr/lib/x86_64-linux-gnu/libc.so.6",
"relative_address": "0x00000000001268c0",
"column": 0,
"file": "./misc/../sysdeps/unix/sysv/linux/x86_64/clone3.S",
"function": "__clone3",
"line": 83
}
],
"incomplete": false
},
"state": "R"
},
{
"crashed": false,
"name": "ct_worker_1",
"stack": {
"format": "Datadog Crashtracker 1.0",
"frames": [
{
"ip": "0x5d1e89d34894",
"sp": "0x778c6111dd48",
"build_id": "8d4cd090dde4a270bed5fb7f8168dcc1291051e0",
"build_id_type": "GNU",
"file_type": "ELF",
"path": "/home/bits/go/src/github.com/DataDog/libdatadog/target/release/crashtracker_bin_test",
"relative_address": "0x0000000000037894",
"column": 9,
"file": "/home/bits/go/src/github.com/DataDog/libdatadog/bin_tests/src/modes/unix/test_017_multi_thread_collection.rs",
"function": "worker_fn_1",
"line": 29
},
{
"ip": "0x5d1e89d34869",
"sp": "0x778c6111dd50",
"build_id": "8d4cd090dde4a270bed5fb7f8168dcc1291051e0",
"build_id_type": "GNU",
"file_type": "ELF",
"path": "/home/bits/go/src/github.com/DataDog/libdatadog/target/release/crashtracker_bin_test",
"relative_address": "0x0000000000037869",
"column": 18,
"file": "/home/bits/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/sys/backtrace.rs",
"function": "__rust_begin_short_backtrace<bin_tests::modes::unix::test_017_multi_thread_collection::{impl#0}::post::{closure_env#1}, ()>",
"line": 158
},
{
"ip": "0x5d1e89d34719",
"sp": "0x778c6111dd70",
"build_id": "8d4cd090dde4a270bed5fb7f8168dcc1291051e0",
"build_id_type": "GNU",
"file_type": "ELF",
"path": "/home/bits/go/src/github.com/DataDog/libdatadog/target/release/crashtracker_bin_test",
"relative_address": "0x0000000000037719",
"column": 5,
"file": "/home/bits/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/ops/function.rs",
"function": "call_once<std::thread::{impl#0}::spawn_unchecked_::{closure_env#1}<bin_tests::modes::unix::test_017_multi_thread_collection::{impl#0}::post::{closure_env#1}, ()>, ()>",
"line": 250
},
{
"ip": "0x5d1e89d8223f",
"sp": "0x778c6111ddf0",
"build_id": "8d4cd090dde4a270bed5fb7f8168dcc1291051e0",
"build_id_type": "GNU",
"file_type": "ELF",
"path": "/home/bits/go/src/github.com/DataDog/libdatadog/target/release/crashtracker_bin_test",
"relative_address": "0x000000000008523f",
"column": 17,
"file": "/rustc/ded5c06cf21d2b93bffd5d884aa6e96934ee4234/library/std/src/sys/thread/unix.rs",
"function": "std::sys::thread::unix::Thread::new::thread_start",
"line": 126
},
{
"ip": "0x778c613c7ac3",
"sp": "0x778c6111de20",
"build_id": "4f7b0c955c3d81d7cac1501a2498b69d1d82bfe7",
"build_id_type": "GNU",
"file_type": "ELF",
"path": "/usr/lib/x86_64-linux-gnu/libc.so.6",
"relative_address": "0x0000000000094ac3",
"column": 8,
"file": "./nptl/pthread_create.c",
"function": "start_thread",
"line": 442
},
{
"ip": "0x778c614598c0",
"sp": "0x778c6111dec0",
"build_id": "4f7b0c955c3d81d7cac1501a2498b69d1d82bfe7",
"build_id_type": "GNU",
"file_type": "ELF",
"path": "/usr/lib/x86_64-linux-gnu/libc.so.6",
"relative_address": "0x00000000001268c0",
"column": 0,
"file": "./misc/../sysdeps/unix/sysv/linux/x86_64/clone3.S",
"function": "__clone3",
"line": 83
}
],
"incomplete": false
},
"state": "R"
}
]
}
```
1 parent 982b6bd commit a8f0aaa
21 files changed
Lines changed: 1758 additions & 66 deletions
File tree
- bin_tests
- src
- modes
- unix
- tests
- docs/RFCs
- artifacts
- libdd-crashtracker/src
- collector
- crash_info
- receiver