SST Debug Stories

This repository contains a collection of debug use case examples for SST. These are small, artificial examples illustrating situations that might occur in an SST simulation where a debugger could be used to detect or analyze behavior. They are simple SST models with small topologies. Some examples demonstrate debugger features available today, but other cases might serve to inspire possible new debugger features or companion tools.

For each debug story we include a "use case report": a short write-up that explains the scenario, what behavior to observe, and how the SST debugger can be used to investigate it. Many reports also include thoughts and wishlist items for debugger improvements. The stories and links to their reports are in the story status table. For cross-cutting debugger ideas that come up across multiple stories, see the wish list document, which also includes a catalog of the wishlist items mentioned in individual stories.

Overview

  • All stories are launched from a single SST simulation configuration script, runStory.py, which takes the name of the story to run. Valid story names are listed in the first column of the story status table.
  • This repository is still a work in progress. All of the use cases listed below are implemented; our current effort is focused on hand-verifying each case and evaluating how it can be addressed with the SST debugger as it exists today.
  • All stories are built around a single SST component named Node, implemented in Node.cpp and Node.h.

How to Run

From this directory:

  1. Build and run in one step:

    ./doit <storyName>

  2. Or run manually:

    make clean && make

    sst --interactive-stop ./runStory.py <storyName>

Where <storyName> is any valid story name from the story descriptions section.

Story Status

This table lists the use case stories included in this repository and summarizes their status. For a short description of each story, see the story descriptions section.

All stories have been implemented, so we're now focused on ensuring that they have been implemented properly and writing "use case reports" for each.

The "Verified?" column indicates whether each story has been hand-verified: ✅ means it has been verified, ❌ means something is wrong, and ❓ means that I have read the code and believe it to be correct but do not know of an easy way to confirm that it works as intended today.

In the "Use Case Report" column I use ♦ symbols to indicate how mature I believe the report is. One diamond indicates that the report includes an example script showing how to use the SST debugger to address the case, but that I haven't yet thought deeply about how effective it is. Two diamonds indicate that the report has more content and some thoughts on wishlist items for the SST debugger. Three diamonds indicate that I view the content as complete.

| Story | Verified? | Use Case Report | Notes |
|---|---|---|---|
| **Event Tracing** | | | |
| wrongPath | | ♦♦ | works in debugger but requires advanced topology knowledge and the event to set a side effect on components |
| infiniteLoop | | | |
| unexpectedDisappear | | | |
| missedDeadline | | | |
| outOfOrderReceipt | | | |
| duplicateSepTimes | | | |
| duplicateSameTime | | | |
| **Event Processing** | | | |
| broadcastStorm | | | |
| badMerge | | | |
| **Incorrect Topology** | | | |
| missingLink | | | |
| wrongLink | | | |
| unexpectedDuplicateLink | | | |
| **Deadlock** | | | |
| directDeadlock | | | |
| indirectDeadlock | | | |
| **Fault Detection And Attribution** | | | |
| detectWhenComponentBecomesInvalid | | | |
| badInvariantBetweenComponents | | | |
| componentsLoseParity | | | |
| divergedModels | | | |
| componentCausesSegfault | | | |
| badInitialState | | | |
| badTerminatingState | | | |
| findFirstToComplete | | | |
| determineWhatNotComplete | | | |
| **Load Imbalances** | | | |
| findEventHeavyComponent | | | |
| findSlowProcessingComponent | | | |
| findMemHeavyComponent | | | |
| findMemHeavyEvent | | | |
| findStarvedComponent | | | |

Story Descriptions

| Category | Stories |
|---|---|
| Event Tracing | wrongPath, infiniteLoop, unexpectedDisappear, missedDeadline, outOfOrderReceipt, duplicateSepTimes, duplicateSameTime |
| Event Processing | broadcastStorm, badMerge |
| Incorrect Topology | missingLink, wrongLink, unexpectedDuplicateLink |
| Deadlock | directDeadlock, indirectDeadlock |
| Fault Detection And Attribution | detectWhenComponentBecomesInvalid, badInvariantBetweenComponents, componentsLoseParity, divergedModels (divergedModels_A and divergedModels_B), componentCausesSegfault, badInitialState, badTerminatingState, findFirstToComplete, determineWhatNotComplete |
| Load Imbalances | findEventHeavyComponent, findSlowProcessingComponent, findMemHeavyComponent, findMemHeavyEvent, findStarvedComponent |

Event Tracing

wrongPath

(back to table)

An event propagates through the model. Its intended path is A -> B -> C, but B misroutes the event to D instead.

wrongPath flowchart
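One way to reason about this story offline is to compare the hops an event actually took against its intended path. The sketch below is plain Python and is not SST debugger functionality; the function name `first_divergence` and the hard-coded observed hop list are assumptions made for illustration.

```python
def first_divergence(intended, observed):
    """Return (index, expected, actual) for the first differing hop, or None."""
    for i, (want, got) in enumerate(zip(intended, observed)):
        if want != got:
            return i, want, got
    # One path is a prefix of the other: report the first missing or extra hop.
    if len(intended) != len(observed):
        i = min(len(intended), len(observed))
        want = intended[i] if i < len(intended) else None
        got = observed[i] if i < len(observed) else None
        return i, want, got
    return None


intended = ["A", "B", "C"]  # intended path from the story description
observed = ["A", "B", "D"]  # hypothetical recorded trace: B misroutes to D

print(first_divergence(intended, observed))  # (2, 'C', 'D')
```

The hop before the reported index (B here) is the component that made the wrong forwarding decision.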

infiniteLoop

(back to table)

An event is supposed to move onward to D, but A, B, and C keep forwarding it in a cycle, creating an infinite loop.

infiniteLoop flowchart
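Given a recorded list of the components an event has visited, a revisit can be spotted with a small helper. This is a plain-Python sketch, not an SST debugger feature; `find_cycle` and the sample trace are hypothetical. A revisit alone does not prove the loop is infinite, but it points at where to look.

```python
def find_cycle(hops):
    """Return the repeating segment of a hop trace, or None if no component repeats."""
    first_seen = {}
    for i, name in enumerate(hops):
        if name in first_seen:
            # The segment between the two visits is the cycle.
            return hops[first_seen[name]:i]
        first_seen[name] = i
    return None


trace = ["A", "B", "C", "A", "B", "C", "A"]  # hypothetical trace; the event never reaches D
print(find_cycle(trace))  # ['A', 'B', 'C']
```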

unexpectedDisappear

(back to table)

The intended path is A -> B -> C -> D, but the event vanishes at C because it is never forwarded onward.

unexpectedDisappear flowchart

missedDeadline

(back to table)

D is expected to receive an event by a target time, but the A -> B -> C -> D path uses enough link latency that arrival is late; the goal is to locate which link is causing the slowdown.

missedDeadline flowchart
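The reasoning behind this story can be sketched as a running sum of link latencies along the path. The latencies and deadline below are made-up values, not the ones used in the repository, and `arrival_times` is a hypothetical helper written in plain Python rather than anything provided by SST.

```python
def arrival_times(path, latency_ns, start_ns=0):
    """Return [(component, arrival time in ns)] for an event sent along path."""
    t = start_ns
    times = [(path[0], t)]
    for src, dst in zip(path, path[1:]):
        t += latency_ns[(src, dst)]
        times.append((dst, t))
    return times


path = ["A", "B", "C", "D"]
# Hypothetical per-link latencies in ns; the real values live in runStory.py.
latency_ns = {("A", "B"): 1, ("B", "C"): 10, ("C", "D"): 1}
deadline_ns = 5  # hypothetical deadline for arrival at D

times = arrival_times(path, latency_ns)
missed = times[-1][1] > deadline_ns
slowest_link = max(latency_ns, key=latency_ns.get)

print(times)         # [('A', 0), ('B', 1), ('C', 11), ('D', 12)]
print(missed)        # True
print(slowest_link)  # ('B', 'C')
```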

outOfOrderReceipt

(back to table)

E is intended to see ev1 before ev2, but the two events are launched on different branches and arrive in the opposite order: C starts at 3ns while A starts at 5ns, and all links have a latency of 1ns.

outOfOrderReceipt flowchart

duplicateSepTimes

(back to table)

D is expected to receive a given event once, but A injects it at setup and again on later ticks, so repeated deliveries occur at different times.

duplicateSepTimes flowchart

duplicateSameTime

(back to table)

B is expected to receive a given event once, but A injects it twice at setup.

duplicateSameTime flowchart
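Both duplicate stories reduce to the same check: group deliveries by event identity and flag any identity seen more than once. The sketch below assumes a log of `(event_id, time_ns)` pairs is available; that log format, the event ids, and `find_duplicates` are all hypothetical and are not produced by the SST debugger in this form.

```python
from collections import defaultdict


def find_duplicates(deliveries):
    """Map event id -> list of delivery times, for ids delivered more than once."""
    seen = defaultdict(list)
    for event_id, time_ns in deliveries:
        seen[event_id].append(time_ns)
    return {eid: times for eid, times in seen.items() if len(times) > 1}


# Hypothetical logs: the same event at different times, and twice at the same time.
sep_times = [("ev1", 3), ("ev2", 4), ("ev1", 7)]
same_time = [("ev1", 1), ("ev1", 1)]

print(find_duplicates(sep_times))  # {'ev1': [3, 7]}
print(find_duplicates(same_time))  # {'ev1': [1, 1]}
```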

Event Processing

broadcastStorm

(back to table)

A broadcasts an event too widely, sending it to all six of its neighbors at startup.

broadcastStorm flowchart

badMerge

(back to table)

C receives values from A and B and should merge them correctly, but it multiplies 10 * 2 instead of performing the intended add-style merge before sending the result to D.

badMerge flowchart
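The arithmetic in this story is small enough to write out. The sketch assumes the intended "add-style" merge is a plain sum, which the description above does not state explicitly; both function names are hypothetical and the real logic lives in Node.cpp.

```python
def merge_intended(a, b):
    # Assumption: the intended merge is a simple addition.
    return a + b


def merge_buggy(a, b):
    # The behavior the story describes: the values are multiplied instead.
    return a * b


a, b = 10, 2  # values from the story description
print(merge_intended(a, b))  # 12
print(merge_buggy(a, b))     # 20
```

Seeing 20 arrive at D where 12 was expected is the symptom to trace back to C.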

Incorrect Topology

missingLink

(back to table)

The intended topology includes a B <-> C connection, but that link is absent.

missingLink flowchart

wrongLink

(back to table)

The intended topology is A -> B, but A is connected to C instead.

wrongLink flowchart

unexpectedDuplicateLink

(back to table)

A and B are linked twice instead of once.

unexpectedDuplicateLink flowchart
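All three topology stories can be framed as a difference between an intended link list and the links actually configured. The following plain-Python sketch treats links as undirected pairs and uses multisets so that a doubled link shows up; `diff_links` and the sample link lists are hypothetical and do not read anything from SST.

```python
from collections import Counter


def normalize(link):
    """Treat links as undirected: ('B', 'A') and ('A', 'B') are the same link."""
    return tuple(sorted(link))


def diff_links(intended, actual):
    """Return (missing, unexpected) links, counting duplicates."""
    want = Counter(normalize(link) for link in intended)
    have = Counter(normalize(link) for link in actual)
    missing = sorted((want - have).elements())
    unexpected = sorted((have - want).elements())
    return missing, unexpected


# wrongLink: A should connect to B but is connected to C instead.
print(diff_links([("A", "B")], [("A", "C")]))
# ([('A', 'B')], [('A', 'C')])

# unexpectedDuplicateLink: A and B are linked twice.
print(diff_links([("A", "B")], [("A", "B"), ("B", "A")]))
# ([], [('A', 'B')])
```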

Deadlock

directDeadlock

(back to table)

A waits for an event from B while B waits for an event from A, so neither side ever makes progress.

directDeadlock flowchart

indirectDeadlock

(back to table)

This is the same wait cycle as in directDeadlock, but with B sitting between A and C as a relay, so the blocked endpoints are separated by an intermediate component.

indirectDeadlock flowchart
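A deadlock of this kind is a cycle in a "waits for" graph. The sketch below assumes each blocked component waits on at most one other component and that this relationship has already been extracted into a dictionary; both the dictionary and `find_wait_cycle` are hypothetical, and obtaining that information from a live simulation is the hard part the debugger would need to help with.

```python
def find_wait_cycle(waits_for, start):
    """Follow waits_for edges from start; return the cycle reached, or None."""
    order = []
    seen = {}
    node = start
    while node is not None and node not in seen:
        seen[node] = len(order)
        order.append(node)
        node = waits_for.get(node)  # None when the component is not blocked
    if node is None:
        return None
    return order[seen[node]:]


direct = {"A": "B", "B": "A"}    # directDeadlock: A and B wait on each other
indirect = {"A": "C", "C": "A"}  # indirectDeadlock, modeled with relay B not itself blocked

print(find_wait_cycle(direct, "A"))    # ['A', 'B']
print(find_wait_cycle(indirect, "A"))  # ['A', 'C']
```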

Fault Detection And Attribution

detectWhenComponentBecomesInvalid

(back to table)

A starts valid and then flips its valid flag to false on a 40ns clock tick, modeling a component whose state becomes invalid during execution.

detectWhenComponentBecomesInvalid flowchart

badInvariantBetweenComponents

(back to table)

A cross-component invariant is supposed to hold, but C follows a different update rule when it receives certain values, breaking the invariant.

badInvariantBetweenComponents flowchart

componentsLoseParity

(back to table)

A and B are expected to stay in matching state over time, but their scripted values diverge at cycle 40 when they become 5 and 7.

componentsLoseParity flowchart

divergedModels (divergedModels_A and divergedModels_B substories)

(back to table)

This pair of stories represents separate models that are intended to retain parity with each other throughout execution, but at timestamp 40, divergedModels_A uses value 5 while divergedModels_B uses value 7.

divergedModels flowchart
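For both componentsLoseParity and divergedModels, the question is when two value histories first disagree. A minimal sketch, assuming each history has been captured as a dictionary from timestamp to value: only the values 5 and 7 at time 40 come from the descriptions above, while the earlier samples and the `first_mismatch` helper are invented for illustration.

```python
def first_mismatch(trace_a, trace_b):
    """Return (time, value_a, value_b) at the earliest shared time the traces differ."""
    for t in sorted(set(trace_a) & set(trace_b)):
        if trace_a[t] != trace_b[t]:
            return t, trace_a[t], trace_b[t]
    return None


# Hypothetical samples; only the values at time 40 come from the story.
trace_a = {10: 1, 20: 1, 30: 1, 40: 5}
trace_b = {10: 1, 20: 1, 30: 1, 40: 7}

print(first_mismatch(trace_a, trace_b))  # (40, 5, 7)
```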

componentCausesSegfault

(back to table)

Component C asserts once its clock reaches cycle 50 or later. The goal is to identify which component is responsible for the segfault and at what point in time the segfault occurs.

componentCausesSegfault flowchart

badInitialState

(back to table)

Four unconnected components are intended to initialize to the same state, but C starts with a different value than the others.

badInitialState flowchart

badTerminatingState

(back to table)

Similar to badInitialState, but the issue is that C changes to a different value before the simulation terminates. The goal is to identify which component has the bad value just prior to termination.

badTerminatingState flowchart

findFirstToComplete

(back to table)

The goal is to determine which component finishes first; the completion order is D first, then B, then C, then A.

findFirstToComplete flowchart

determineWhatNotComplete

(back to table)

The goal is to find components that never mark complete when the simulation ought to be done; here A, D, and E finish, while B and C never do.

determineWhatNotComplete flowchart

Load Imbalances

findEventHeavyComponent

(back to table)

The goal is to identify which component processes the most events; in this four-node ring each component sends to its neighbor to the right.

findEventHeavyComponent flowchart

findSlowProcessingComponent

(back to table)

The goal is to identify a component that is noticeably slower at processing than the others; all nodes send one event at startup to their right neighbor, but the event received by B takes much longer to process.

findSlowProcessingComponent flowchart

findMemHeavyComponent

(back to table)

The goal is to spot a component with unusually high memory usage; four unconnected components allocate different local buffer sizes, with B holding by far the largest payload.

findMemHeavyComponent flowchart

findMemHeavyEvent

(back to table)

The goal is to spot an unusually large event; each node in a ring sends one rightward event with a payload buffer, and one of those messages is much larger than the others.

findMemHeavyEvent flowchart

findStarvedComponent

(back to table)

The intended pattern is that all components should receive work, but one does not; in the current ring with uneven send quotas, C receives no events while the others do.

findStarvedComponent flowchart
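For findEventHeavyComponent and findStarvedComponent, a per-component tally of received events answers both questions at once. The sketch below assumes a flat list of receiver names has been collected somehow; that log and the `event_counts` helper are hypothetical, and the counts shown are not taken from the actual stories.

```python
from collections import Counter


def event_counts(components, received_by):
    """Count received events per component, including components that got none."""
    counts = Counter(received_by)
    return {name: counts[name] for name in components}


components = ["A", "B", "C", "D"]
received_by = ["B", "B", "D", "A", "B", "D"]  # hypothetical log of event receivers

counts = event_counts(components, received_by)
heaviest = max(counts, key=counts.get)
starved = [name for name, n in counts.items() if n == 0]

print(counts)    # {'A': 1, 'B': 3, 'C': 0, 'D': 2}
print(heaviest)  # B
print(starved)   # ['C']
```

Listing every component up front matters: a starved component never appears in the log, so it would be invisible in a tally built from the log alone.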

Adding a New Story

  1. Add the story name to NODE_STORY_LIST in Node.cpp.
  2. Add setup_<story> and handleEvent_<story> in Node.h and Node.cpp.
  3. Add the story string to VALID_STORIES in runStory.py.
  4. Add a story_<story>() function in runStory.py.
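Steps 3 and 4 can be pictured with the following standalone sketch of name-based dispatch. It is an assumption about how runStory.py might be organized, not a copy of it: the story name `myNewStory`, the `build_story` helper, and the placeholder return values are hypothetical, and the real story functions would construct SST components and links rather than return strings. The C++ side (steps 1 and 2) is not shown.

```python
# Hypothetical, abbreviated stand-in for the list in runStory.py.
VALID_STORIES = ["wrongPath", "myNewStory"]


def story_wrongPath():
    # Placeholder: the real function would build the SST topology.
    return "topology for wrongPath"


def story_myNewStory():
    # Step 4: the new story gets its own story_<story>() function.
    return "topology for myNewStory"


def build_story(name):
    """Validate the story name, then call the matching story_<name>() function."""
    if name not in VALID_STORIES:
        raise ValueError(f"unknown story {name!r}; valid: {', '.join(VALID_STORIES)}")
    return globals()[f"story_{name}"]()


print(build_story("myNewStory"))  # topology for myNewStory
```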

Legacy Cases

Older standalone cases are stored in:

  • old/infiniteLoopTest/
  • old/loadImbalance/