Skip to content

i#2006 generalize drcachesim: add threading pool support for writing out traces#2319

Open
zhaoqin wants to merge 34 commits into
masterfrom
i2006-drcachesim-threadpool
Open

i#2006 generalize drcachesim: add threading pool support for writing out traces#2319
zhaoqin wants to merge 34 commits into
masterfrom
i2006-drcachesim-threadpool

Conversation

@zhaoqin

@zhaoqin zhaoqin commented Mar 30, 2017

Copy link
Copy Markdown
Contributor

Adds two options -num_threads and -queue_size to the tracer for asynchronous
trace write by the sideline threads.
This CL only adds the two options, and the real implementation is to be added.

… write

Adds two options -num_threads and -queue_size to tracer for asynchronous
trace write by sideline threading pool.
This CL only adds the two options, real implementation is to be added.

@derekbruening derekbruening left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

spelling in title: "drcachesime"

Small incremental reviews are good in general but the repository has to be in a valid state at any time, so unimplemented options need to be labeled as unimplemented to avoid confusion. But maybe it's better to only add an option when there's something implemented that it controls.

Comment thread clients/drcachesim/common/options.cpp Outdated
"0 means synchronized write without any sideline threads.");

droption_t<unsigned int> op_queue_size
(DROPTION_SCOPE_ALL, "queue_size", 4000, "The queue size for ",

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"for "??

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment thread clients/drcachesim/common/options.cpp Outdated
droption_t<unsigned int> op_queue_size
(DROPTION_SCOPE_ALL, "queue_size", 4000, "The queue size for ",
"For the offline analysis mode (when -offline is requested), specifies the size "
"of the queue for batching trace data to be written out by sideline threads.");

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are the units here? If bytes this is way too small. If buffers it seems large for a default: 128MB for 32K buffers.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment thread clients/drcachesim/common/options.cpp Outdated
droption_t<bytesize_t> op_num_threads
(DROPTION_SCOPE_ALL, "num_threads", 0, "Number of threads for writing traces",
"For the offline analysis mode (when -offline is requested), specifies the number "
"of sideline threads are used to write traces file out. "

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

grammar: s/are/to be/
grammar: s/traces file/trace files/

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment thread clients/drcachesim/common/options.cpp Outdated
"for later offline analysis. No simulator is executed.");

droption_t<bytesize_t> op_num_threads
(DROPTION_SCOPE_ALL, "num_threads", 0, "Number of threads for writing traces",

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this be SCOPE_CLIENT?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment thread clients/drcachesim/common/options.cpp Outdated
"0 means synchronized write without any sideline threads.");

droption_t<unsigned int> op_queue_size
(DROPTION_SCOPE_ALL, "queue_size", 4000, "The queue size for ",

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just SCOPE_CLIENT, right?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment thread suite/tests/CMakeLists.txt Outdated
torunonly_ci(tool.drcacheoff.${testname} ${exetgt} drcachesim
"offline-${testname}.c" # for templatex basename
"-offline" "" "")
"-offline -num_threads 0" "" "")

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this the default?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

was trying to test if the option can be specified.
removed.

Adds two options -num_threads and -queue_size to the tracer for asynchronous
trace write by the sideline threads.
This CL only adds the two options, and the real implementation is to be added.
@zhaoqin

zhaoqin commented Mar 31, 2017

Copy link
Copy Markdown
Contributor Author

PTAL

zhaoqin and others added 4 commits March 30, 2017 22:05
Adds two options -num_threads and -queue_size to the tracer for asynchronous
trace write by the sideline threads.
This CL only adds the two options, and the real implementation is to be added.
This CL uses client sideline threads to overlap the profiling and data output.
We use a producer-consumer queue to pass traces from profiling thread to sideline threads.
Tow options -num_threads and -queue_size are added to specify the threading pool parameters.
A simple sideline test is also added.

TODO: enable sideline thread for burst and static tests.

@derekbruening derekbruening left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Part-way through.

Offline we talked a little about whether this needs to be limited to offline -- seems like the pool can be positioned at a point shared by both.

Comment thread clients/drcachesim/tracer/tracer.cpp Outdated
per_thread_t *data = (per_thread_t *) drmgr_get_tls_field(drcontext, tls_idx);
ssize_t size = towrite_end - towrite_start;
if (file_ops_func.handoff_buf != NULL) {
if (op_num_threads.get_value()) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style violation: non-bool as bool

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment thread clients/drcachesim/tracer/tracer.cpp Outdated

/* setup sideline thread and queue */
if (op_offline.get_value() && op_num_threads.get_value() > 0) {
/* XXX support thread pool with more than one threads */

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: "XXX:"
nit: "one thread"

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add multi-thread support

Comment thread clients/drcachesim/tracer/tracer.cpp Outdated
/* XXX support thread pool with more than one threads */
if (op_num_threads.get_value() > 1) {
NOTIFY(0, "Usage error: only one sideline thread is supported");
DR_ASSERT(false);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be dr_abort() if fatal -- that's what the other usage errors do

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment thread clients/drcachesim/tracer/tracer.cpp Outdated
uint i;
queue_size = op_queue_size.get_value() / max_buf_size;
// We round up queue_size to the nearest power of 2.
for (i = 1; i <= queue_size; i <<= 1) /*do nothing*/;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style violation

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment thread suite/tests/CMakeLists.txt Outdated
set(tool.drcacheoff.${testname}_basedir
macro (torunonly_drcacheoff testname exetgt sideline)
if (${sideline})
set(tname "${testname}-sideline")

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: these are all using . and _ rather than - as a separator: "tool.drcacheoff.burst_static"

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed since the default is the sideline.
Will add test without sideline

Comment thread suite/tests/CMakeLists.txt Outdated
set(tool.drcacheoff.${testname}_basedir
macro (torunonly_drcacheoff testname exetgt sideline)
if (${sideline})
set(tname "${testname}-sideline")

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than exploding the option space, can we just make the thread pool the default? This will make testing much simpler.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment thread clients/drcachesim/common/options.cpp Outdated
"For the offline analysis mode (when -offline is requested), specifies the number "
"of sideline threads to be used to write trace files out. "
"0 means synchronized write without any sideline threads.");
"Only 0 and 1 is currently supported. "

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

grammar: are

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed

Comment thread clients/drcachesim/common/options.cpp Outdated
"(NYI) The maximum size of the queue for traces before they are written out",
"For the offline analysis mode (when -offline is requested), specifies the maximum "
"size (number of bytes) of the queue for traces before they are written out "
"The approximate maximum size of the queue for traces before they are written out",

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed offline: the description can be clearer. As written it sounds like a minimum size, where nothing is written at all until it's reached. Really what you mean is the size at which we'll block the app threads and wait for the queue to shrink.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment thread clients/drcachesim/tracer/tracer.cpp Outdated
static volatile uint queue_tail = 0;
static void *sideline_exit_event;
struct buf_entry_t {
file_t file; // file descriptor, 0 means the end of profiling.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: INVALID_FILE, not 0

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@derekbruening derekbruening left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems better to support multiple threads now, which is just a couple of lines more of code, and ensures the design won't have to change. It looks like this CL's design will have to change, which we would avoid by having multiple threads to start with.


/* sideline */
static drvector_t buf_queue;
static uint queue_size = 0;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggest: comment explaining what this is: probably current total size of all buffers, rather than the length of the vector?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment thread clients/drcachesim/tracer/tracer.cpp Outdated
static uint queue_size = 0;
static uint queue_size_mask = 0;
static volatile uint queue_front = 0;
static volatile uint queue_tail = 0;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm guessing this is a circular buffer and these are array indices? Suggest: comments.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

/* sideline */
static drvector_t buf_queue;
static uint queue_size = 0;
static uint queue_size_mask = 0;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, no, if this is based on queue_size, then queue_size must be the length of the vector? Why do we need a separate var for that: it's already in buf_queue, right?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment thread clients/drcachesim/tracer/tracer.cpp Outdated
static uint queue_size_mask = 0;
static volatile uint queue_front = 0;
static volatile uint queue_tail = 0;
static void *sideline_exit_event;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't we need a separate one per thread?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment thread clients/drcachesim/tracer/tracer.cpp Outdated
}
drvector_lock(&buf_queue);
if (((queue_front + 1) & queue_size_mask) != queue_tail) {
drvector_set_entry(&buf_queue, queue_front, entry);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the invariant is that the front is empty? Maybe a comment on that up above? How about an assert that the current front file is INVALID_FILE?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was this done?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add comment on the queue_front variable.

Comment thread clients/drcachesim/tracer/tracer.cpp Outdated
if (file_ops_func.handoff_buf != NULL) {
if (op_num_threads.get_value()) {
trace_buf_enqueue(data->file, towrite_start, size);
} else if (file_ops_func.handoff_buf != NULL) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So you've put a priority here: the handoff is going to be ignored if there's a thread pool. That should be reflected in its docs.

Maybe we should just remove the handoff feature completely, since it's not needed anymore: it can be duplicated by using a thread pool and having file_ops_func.write do whatever you want.

I also think the thread pool should be the default.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed handoff

Comment thread clients/drcachesim/common/options.cpp Outdated
"For the offline analysis mode (when -offline is requested), specifies the number "
"of sideline threads to be used to write trace files out. "
"Only 0 and 1 is currently supported. "
"0 means synchronized write without any sideline threads. ");

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMHO the pool should be on by default, for code simplicity.

I'm not sure about only supporting 1 thread: we need to know that the design here supports many threads, and seeing the stop-profiling scheme below I'm not sure that it does. Adding >1 thread seems like just a few lines of code: array of events, for loop when touching those events. It seems better to include multi-thread support now, to ensure the design works (again see below where I'm not sure it does).

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A threading pool with more than one thread has a complex problem, i.e, sideline thread vs open files.
The simplest way is one open file per sideline thread, which requires us to change raw2trace.
If we want to keep current one file per app-thread, writing to same file with multiple threads is complex.
I will go with the one file per sideline thread.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't remember seeing any mention of this in the CL? Did I miss it? E.g., the scheme for putting an entry in the queue to terminate the sideline thread function isn't marked as temporary, so the reader can't tell why only one thread is supported and why we can't just add an array and for loops to easily increase the pool to 100 threads.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

set it default.

Comment thread clients/drcachesim/tracer/tracer.cpp Outdated
if (op_offline.get_value()) {
if (op_num_threads.get_value() > 0) {
trace_buf_enqueue((file_t)0, NULL, 0); // end of profiles
dr_event_wait(sideline_exit_event);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needs to be a for loop, one event per thread, right?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment thread clients/drcachesim/tracer/tracer.cpp Outdated
queue_size = i;
if (queue_size < 2) {
NOTIFY(0, "Usage error: queue size too small\n");
DR_ASSERT(false);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dr_abort

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are other comments like this that I did not see any reply to.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment thread clients/drcachesim/tracer/tracer.cpp Outdated
get_queue_size();
drvector_init(&buf_queue, queue_size, false, NULL);
dr_create_client_thread(sideline_run, NULL);
sideline_exit_event = dr_event_create();

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for loop around these two lines

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

zhaoqin and others added 3 commits April 18, 2017 07:21
Added threading pool for asynchronous trace dump.
- used a circular buffer for producer consumer queue,
- added fork handling,
- updated trace format to write trace per sideline thread instead of app thread,
- removed handoff callbacks.
@zhaoqin

zhaoqin commented Apr 25, 2017

Copy link
Copy Markdown
Contributor Author

Just updated the threading pool implementation.
PTAL.

@derekbruening derekbruening changed the title i#2006 generalize drcachesime: add two options for asynchronous trace… i#2006 generalize drcachesim: add two options for asynchronous trace… Apr 25, 2017
@zhaoqin

zhaoqin commented Apr 25, 2017 via email

Copy link
Copy Markdown
Contributor Author

@zhaoqin

zhaoqin commented Apr 25, 2017 via email

Copy link
Copy Markdown
Contributor Author

@derekbruening derekbruening left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks pretty good though there seem to be a few issues: memory leak, ignoring return values, etc.. Most comments are suggestions or style nits.

Don't really want to spend more time it so approving with requests for changes.

Comment thread clients/drcachesim/tracer/instru.h Outdated
virtual int append_thread_exit(byte *buf_ptr, thread_id_t tid) = 0;
virtual int append_iflush(byte *buf_ptr, addr_t start, size_t size) = 0;
virtual int append_thread_header(byte *buf_ptr, thread_id_t tid) = 0;
// This is a per-thread-file-writeout header.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"per-thread-file-writeout" means every time we write to a per-thread file -- so how would that be different from the per-buffer-writeout header? Do you mean just "per-thread-file"?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done
I was trying to say the write-out/sideline thread instead of app thread.

Comment thread clients/drcachesim/common/options.cpp Outdated
"for later offline analysis. No simulator is executed.");

droption_t<unsigned int> op_num_threads
(DROPTION_SCOPE_CLIENT, "num_threads", 1, "Number of sideline threads for writing traces",

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why 1? It seems too small. How was it chosen?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We set it 1 so the online simulation would work here.
could this been dynamically set here?
Ideally, it would be something like 1/4 of number of cpus or something. Any cross-platform way to get number of CPUs?


int
offline_instru_t::append_thread_header(byte *buf_ptr, thread_id_t tid)
offline_instru_t::append_thread_file_header(byte *buf_ptr)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Kind of confusing to have "thread" in the name: makes me wonder why there's a pid in the header but no tid? Maybe just "append_file_header".

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment thread suite/tests/CMakeLists.txt Outdated
torunonly_ci(tool.reuse ${ci_shared_app} drcachesim
"reuse_distance.c" # for templatex basename
"-ipc_name drtestpipe6 -simulator_type reuse_distance -reuse_distance_threshold 256" "" "")
"-ipc_name drtestpipe6 -num_threads 0 -simulator_type reuse_distance -reuse_distance_threshold 256" "" "")

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

? Why no thread pool?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We had the thread pool test for drcachesim online simulation
Here I want to test the online simulation without threading pool, unless we do not need to support that any more.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK but it is confusing b/c this looks like there's some relationship to the reuse-distance tool: like the reuse-distance tool doesn't support the thread pool or sthg. I would suggest either a comment explaining, or adding a new test run that looks just an existing run except it has -num_threads 0.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done, add a separate test instead.

Comment thread suite/tests/CMakeLists.txt Outdated
prefix_cmd_if_necessary(drcachesim_path ON ${drcachesim_path})

macro (torunonly_drcacheoff testname exetgt)
macro (torunonly_drcacheoff testestname exetgt)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

? "testestname"? The original is better: is this is a mistake?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done
my stupid reg replace mistake.

Comment thread clients/drcachesim/tracer/tracer.cpp Outdated
dr_global_alloc(op_num_threads.get_value() * sizeof(sideline_exit_events[0]));
for (i = 0; i < op_num_threads.get_value(); i++) {
sideline_exit_events[i] = dr_event_create();
sideline_args[i].id = i;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: extra spaces

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment thread clients/drcachesim/tracer/tracer.cpp Outdated
sideline_args[i].id = i;
if (op_offline.get_value())
sideline_args[i].file = create_thread_file((ptr_int_t)i);
dr_create_client_thread(sideline_run, &sideline_args[i]);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Check the return value

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment thread clients/drcachesim/tracer/tracer.cpp Outdated
}
if (op_num_threads.get_value() > 1) {
NOTIFY(0, "Usage error: only one sideline thread is supported "
"for online simulation\n");

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why? Please explain in a comment: is this temporary or permanent?
Is this documented in options.cpp?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.
only one named pipe, so one thread makes sense, right?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will something not work correctly if there's more than one thread in the pool? Or it will work, you just think there's no perf improvement?

Comment thread clients/drcachesim/tracer/tracer.cpp Outdated
entry->data = data;
entry->size = size;
entry->tid = tid;
drvector_set_entry(&circular_buf_queue, queue_put, entry);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this redundant?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed

Comment thread clients/drcachesim/tracer/tracer.cpp Outdated
queue_get = 0;
}

void event_fork_init(void *drcontext)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style violation: \n

Also, should be static

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Added threading pool for asynchronous trace dump.
- used a circular buffer for producer consumer queue,
- added fork handling,
- updated trace format to write trace per sideline thread instead of app thread,
- removed handoff callbacks.

@zhaoqin zhaoqin left a comment

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated based on the review.

Comment thread clients/drcachesim/common/options.cpp Outdated
"for later offline analysis. No simulator is executed.");

droption_t<unsigned int> op_num_threads
(DROPTION_SCOPE_CLIENT, "num_threads", 1, "Number of sideline threads for writing traces",

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We set it 1 so the online simulation would work here.
could this been dynamically set here?
Ideally, it would be something like 1/4 of number of cpus or something. Any cross-platform way to get number of CPUs?

Comment thread clients/drcachesim/tracer/instru.h Outdated
virtual int append_thread_exit(byte *buf_ptr, thread_id_t tid) = 0;
virtual int append_iflush(byte *buf_ptr, addr_t start, size_t size) = 0;
virtual int append_thread_header(byte *buf_ptr, thread_id_t tid) = 0;
// This is a per-thread-file-writeout header.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done
I was trying to say the write-out/sideline thread instead of app thread.


int
offline_instru_t::append_thread_header(byte *buf_ptr, thread_id_t tid)
offline_instru_t::append_thread_file_header(byte *buf_ptr)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

online_instru_t::append_thread_header(byte *buf_ptr, thread_id_t tid)
online_instru_t::append_thread_file_header(byte *buf_ptr)
{
// The caller separately calls append_tid for us which is all we need.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed

Comment thread suite/tests/CMakeLists.txt Outdated
torunonly_ci(tool.reuse ${ci_shared_app} drcachesim
"reuse_distance.c" # for templatex basename
"-ipc_name drtestpipe6 -simulator_type reuse_distance -reuse_distance_threshold 256" "" "")
"-ipc_name drtestpipe6 -num_threads 0 -simulator_type reuse_distance -reuse_distance_threshold 256" "" "")

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We had the thread pool test for drcachesim online simulation
Here I want to test the online simulation without threading pool, unless we do not need to support that any more.

Comment thread clients/drcachesim/tracer/tracer.cpp Outdated
dr_global_alloc(op_num_threads.get_value() * sizeof(sideline_exit_events[0]));
for (i = 0; i < op_num_threads.get_value(); i++) {
sideline_exit_events[i] = dr_event_create();
sideline_args[i].id = i;

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment thread clients/drcachesim/tracer/tracer.cpp Outdated
sideline_args[i].id = i;
if (op_offline.get_value())
sideline_args[i].file = create_thread_file((ptr_int_t)i);
dr_create_client_thread(sideline_run, &sideline_args[i]);

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment thread clients/drcachesim/tracer/tracer.cpp Outdated
queue_get = 0;
}

void event_fork_init(void *drcontext)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment thread clients/drcachesim/tracer/tracer.cpp Outdated
}
if (op_num_threads.get_value() > 1) {
NOTIFY(0, "Usage error: only one sideline thread is supported "
"for online simulation\n");

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.
only one named pipe, so one thread makes sense, right?

Comment thread clients/drcachesim/tracer/tracer.cpp Outdated
redzone_size = instru->sizeof_entry() * MAX_NUM_ENTRIES;
max_buf_size = trace_buf_size + redzone_size;
buf_hdr_slots_size = instru->sizeof_entry() * BUF_HDR_SLOTS;
/* get the header size by faking an unit header write */

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@zhaoqin

zhaoqin commented Apr 26, 2017 via email

Copy link
Copy Markdown
Contributor Author

@zhaoqin

zhaoqin commented Apr 26, 2017 via email

Copy link
Copy Markdown
Contributor Author

@zhaoqin

zhaoqin commented Apr 27, 2017

Copy link
Copy Markdown
Contributor Author

PTAL
there are some non-trivial changes, so it would be better to take another round of review.

@zhaoqin

zhaoqin commented Apr 27, 2017

Copy link
Copy Markdown
Contributor Author

New commit message:
i#2006 generalize drcachesime: add threading pool support for writing out traces

Added threading pool for asynchronous trace writing out.

  • added two options for threading pool configuration,
  • used a circular buffer for producer-consumer queue,
  • added fork handling,
  • updated trace format to write trace per sideline thread instead of app thread,
  • updated raw2trace to handle the new trace format,
  • added new tests for different threading pool configurations,
  • removed handoff callbacks and corresponding tests.

@zhaoqin zhaoqin changed the title i#2006 generalize drcachesim: add two options for asynchronous trace… i#2006 generalize drcachesime: add threading pool support for writing out traces Apr 27, 2017

@derekbruening derekbruening left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two main comments:

  1. If this slows everything down it should be off by default until perf is improved

  2. We need to add back thread and file footers for raw2trace: this change removes both thread and file truncation sanity checks

Comment thread clients/drcachesim/common/options.cpp Outdated
"for later offline analysis. No simulator is executed.");

droption_t<unsigned int> op_num_threads
(DROPTION_SCOPE_CLIENT, "num_threads", IF_WINDOWS_ELSE(0, 1),

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs a comment: why is it different for Windows?

Also, if the pool is a perf hit (esp w/ only 1 thread) for offline (but we expect to make it a net win with tuning, right?), should it be disabled by default on all platforms to avoid slowing everyone down?

If this really depends on online vs offline then have the default set in C++ code and describe it in the docs here.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add comment

Comment thread clients/drcachesim/tracer/raw2trace.cpp Outdated
@@ -403,13 +403,13 @@ raw2trace_t::merge_and_process_thread_files()
// The current thread we're processing is tidx. If it's set to thread_files.size()

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

current file

I guess leaving as "tidx" is fine: less code churn.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done
agree

}
}
VPRINT(2, "Next thread in timestamp order is %u @0x" ZHEX64_FORMAT_STRING
"\n", (uint)tids[next_tidx], times[next_tidx]);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why was this removed? I found it useful for seeing the operation: which file is next.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

restored

Comment thread clients/drcachesim/tracer/raw2trace.cpp Outdated
WARN("Input file for thread %d is truncated", (uint)tids[tidx]);
in_entry.extended.type = OFFLINE_TYPE_EXTENDED;
in_entry.extended.ext = OFFLINE_EXT_TYPE_FOOTER;
VPRINT(2, "Finish read file %d\n", tidx);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Finished reading

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

offline_entry_t entry;
if (thread_files[tidx]->read((char*)&entry, sizeof(entry)) ||
!thread_files[tidx]->eof())
FATAL_ERROR("Footer is not the final entry");

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're now missing two checks:

  1. That each thread ends with a footer: this is important to know whether we had thread truncation. This was checked above at lines 458-462. I think we have to put this check back in (will have to iterate the threads at the end I guess), along with the warning and workaround.

  2. That each file is not truncated. We probably need a new kind of footer now, a file footer entry.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

buf_entry_t *entry;
do {
/* quick racy check without locking */
while (((queue_put + 1) & queue_size_mask) == queue_get) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You said you saw major performance issues involving contention on the queue: please add comments about that so we know this code needs improvement.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment thread suite/tests/CMakeLists.txt Outdated
"${PROJECT_SOURCE_DIR}/clients/drcachesim/tests")
set(tool.drcachesim.threads_rawtemp ON) # no preprocessor
set(tool.drcachesim.threads_timeout 150) # This test is long.
set(tool.drcachesim.threads_timeout 200) # This test is long.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The pool slows all of these down so much we need longer timeouts?? That's not good. Sounds like this should definitely be off by default until its performance is improved.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment thread clients/drcachesim/tracer/tracer.cpp Outdated
write_trace_data(file_t file, void *data, ssize_t size)
{
ssize_t written = file_ops_func.write_file(file, data, size);
DR_ASSERT(written == size);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This used to be NOTIFY0 with an understandable message and dr_abort, which seems better than "written==size" being printed

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@derekbruening

Copy link
Copy Markdown
Contributor

Also, the commit message should talk about the performance, since that's a big TODO item: reading the current commit message it sounds like this thread pool is all finished.

@derekbruening derekbruening changed the title i#2006 generalize drcachesime: add threading pool support for writing out traces i#2006 generalize drcachesim: add threading pool support for writing out traces Apr 27, 2017
… out traces

Added threading pool for asynchronous trace writing out.

- added two options for threading pool configuration,
- used a circular buffer for producer-consumer queue,
- added fork handling,
- updated trace format to write trace per sideline thread instead of app thread,
- added a new offline ext type,
- updated raw2trace to handle the new trace format,
- added new tests for different threading pool configurations,
- removed handoff callbacks and corresponding tests.

TODO: the initial threading pool implementation caused a big slowdown
on online simulation, we need more optimization and tuning to improve
the performance.

@zhaoqin zhaoqin left a comment

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PTAL

Comment thread clients/drcachesim/tracer/raw2trace.cpp Outdated
@@ -403,13 +403,13 @@ raw2trace_t::merge_and_process_thread_files()
// The current thread we're processing is tidx. If it's set to thread_files.size()

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done
agree

}
}
VPRINT(2, "Next thread in timestamp order is %u @0x" ZHEX64_FORMAT_STRING
"\n", (uint)tids[next_tidx], times[next_tidx]);

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

restored

Comment thread clients/drcachesim/tracer/raw2trace.cpp Outdated
WARN("Input file for thread %d is truncated", (uint)tids[tidx]);
in_entry.extended.type = OFFLINE_TYPE_EXTENDED;
in_entry.extended.ext = OFFLINE_EXT_TYPE_FOOTER;
VPRINT(2, "Finish read file %d\n", tidx);

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment thread clients/drcachesim/tracer/tracer.cpp Outdated
write_trace_data(file_t file, void *data, ssize_t size)
{
ssize_t written = file_ops_func.write_file(file, data, size);
DR_ASSERT(written == size);

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment thread suite/tests/CMakeLists.txt Outdated
"${PROJECT_SOURCE_DIR}/clients/drcachesim/tests")
set(tool.drcachesim.threads_rawtemp ON) # no preprocessor
set(tool.drcachesim.threads_timeout 150) # This test is long.
set(tool.drcachesim.threads_timeout 200) # This test is long.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

buf_entry_t *entry;
do {
/* quick racy check without locking */
while (((queue_put + 1) & queue_size_mask) == queue_get) {

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

offline_entry_t entry;
if (thread_files[tidx]->read((char*)&entry, sizeof(entry)) ||
!thread_files[tidx]->eof())
FATAL_ERROR("Footer is not the final entry");

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@zhaoqin

zhaoqin commented Apr 27, 2017

Copy link
Copy Markdown
Contributor Author

PTAL

@derekbruening derekbruening left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two cases where the current code has a nice message or a good check but this CL removes it: requesting reinstatement of the prior code for those two cases.

A question about the thread exit ordering which sounds like it may be a big deal. If I misunderstood and the exit is last then this shouldn't need another review.

Comment thread clients/drcachesim/tracer/raw2trace.cpp Outdated
VPRINT(2, "Finish reading file %d at pos %u\n",
tidx, (int)thread_files[tidx]->tellg());
// We need do another failed read to make ->eof() return true.
if (thread_files[tidx]->read((char*)&in_entry, 1)) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The previous code that you deleted seems better, "if (read entry || !eof) fatal..."

Comment thread clients/drcachesim/tracer/raw2trace.cpp Outdated
}
} while (file_count > 0);
// If we used threading pool during profiling, the trace for a thread may be
// seen after the thread exit entry, so we can only check whether the unique

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What? But you can't put a thread exit entry in the final trace here and then put a memref entry for that thread afterward. Is that what's happening due to the timestamp changes? We have to enforce that order, don't we?

Comment thread clients/drcachesim/tracer/raw2trace.cpp Outdated
// tid and tid exit matches.
VPRINT(1, "%u thread(s) are seen\n", (uint)tid_set.size());
if (tid_set != tid_exit_set) {
WARN("Profile might be corrupted.");

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A more specific message, please! Again the prior code was IMHO better: "Trace for thread %d is truncated".

/* quick racy check without locking */
while (((queue_put + 1) & queue_size_mask) == queue_get) {
/* XXX: use futex/semaphore for better performance. */
/* XXX: there is a big slowdown for the online simulation

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought you said offline was slow too? You plan to put the numbers in the issue tracker, right?

zhaoqin added 5 commits April 28, 2017 11:26
… out traces

Added threading pool for asynchronous trace writing out.

- added two options for threading pool configuration,
- used a circular buffer for producer-consumer queue,
- added fork handling,
- updated trace format to write trace per sideline thread instead of app thread,
- updated raw2trace to handle the new trace format,
- added new tests for different threading pool configurations.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants