i#2006 generalize drcachesim: add threading pool support for writing out traces#2319
i#2006 generalize drcachesim: add threading pool support for writing out traces#2319zhaoqin wants to merge 34 commits into
Conversation
… write Adds two options -num_threads and -queue_size to tracer for asynchronous trace write by sideline threading pool. This CL only adds the two options, real implementation is to be added.
derekbruening
left a comment
There was a problem hiding this comment.
spelling in title: "drcachesime"
Small incremental reviews are good in general but the repository has to be in a valid state at any time, so unimplemented options need to be labeled as unimplemented to avoid confusion. But maybe it's better to only add an option when there's something implemented that it controls.
| "0 means synchronized write without any sideline threads."); | ||
|
|
||
| droption_t<unsigned int> op_queue_size | ||
| (DROPTION_SCOPE_ALL, "queue_size", 4000, "The queue size for ", |
| droption_t<unsigned int> op_queue_size | ||
| (DROPTION_SCOPE_ALL, "queue_size", 4000, "The queue size for ", | ||
| "For the offline analysis mode (when -offline is requested), specifies the size " | ||
| "of the queue for batching trace data to be written out by sideline threads."); |
There was a problem hiding this comment.
What are the units here? If bytes this is way too small. If buffers it seems large for a default: 128MB for 32K buffers.
| droption_t<bytesize_t> op_num_threads | ||
| (DROPTION_SCOPE_ALL, "num_threads", 0, "Number of threads for writing traces", | ||
| "For the offline analysis mode (when -offline is requested), specifies the number " | ||
| "of sideline threads are used to write traces file out. " |
There was a problem hiding this comment.
grammar: s/are/to be/
grammar: s/traces file/trace files/
| "for later offline analysis. No simulator is executed."); | ||
|
|
||
| droption_t<bytesize_t> op_num_threads | ||
| (DROPTION_SCOPE_ALL, "num_threads", 0, "Number of threads for writing traces", |
There was a problem hiding this comment.
Shouldn't this be SCOPE_CLIENT?
| "0 means synchronized write without any sideline threads."); | ||
|
|
||
| droption_t<unsigned int> op_queue_size | ||
| (DROPTION_SCOPE_ALL, "queue_size", 4000, "The queue size for ", |
There was a problem hiding this comment.
Just SCOPE_CLIENT, right?
| torunonly_ci(tool.drcacheoff.${testname} ${exetgt} drcachesim | ||
| "offline-${testname}.c" # for templatex basename | ||
| "-offline" "" "") | ||
| "-offline -num_threads 0" "" "") |
There was a problem hiding this comment.
Isn't this the default?
There was a problem hiding this comment.
was trying to test if the option can be specified.
removed.
Adds two options -num_threads and -queue_size to the tracer for asynchronous trace write by the sideline threads. This CL only adds the two options, and the real implementation is to be added.
|
PTAL |
Adds two options -num_threads and -queue_size to the tracer for asynchronous trace write by the sideline threads. This CL only adds the two options, and the real implementation is to be added.
This CL uses client sideline threads to overlap the profiling and data output. We use a producer-consumer queue to pass traces from profiling thread to sideline threads. Tow options -num_threads and -queue_size are added to specify the threading pool parameters. A simple sideline test is also added. TODO: enable sideline thread for burst and static tests.
derekbruening
left a comment
There was a problem hiding this comment.
Part-way through.
Offline we talked a little about whether this needs to be limited to offline -- seems like the pool can be positioned at a point shared by both.
| per_thread_t *data = (per_thread_t *) drmgr_get_tls_field(drcontext, tls_idx); | ||
| ssize_t size = towrite_end - towrite_start; | ||
| if (file_ops_func.handoff_buf != NULL) { | ||
| if (op_num_threads.get_value()) { |
There was a problem hiding this comment.
style violation: non-bool as bool
|
|
||
| /* setup sideline thread and queue */ | ||
| if (op_offline.get_value() && op_num_threads.get_value() > 0) { | ||
| /* XXX support thread pool with more than one threads */ |
There was a problem hiding this comment.
nit: "XXX:"
nit: "one thread"
There was a problem hiding this comment.
add multi-thread support
| /* XXX support thread pool with more than one threads */ | ||
| if (op_num_threads.get_value() > 1) { | ||
| NOTIFY(0, "Usage error: only one sideline thread is supported"); | ||
| DR_ASSERT(false); |
There was a problem hiding this comment.
Should be dr_abort() if fatal -- that's what the other usage errors do
| uint i; | ||
| queue_size = op_queue_size.get_value() / max_buf_size; | ||
| // We round up queue_size to the nearest power of 2. | ||
| for (i = 1; i <= queue_size; i <<= 1) /*do nothing*/; |
| set(tool.drcacheoff.${testname}_basedir | ||
| macro (torunonly_drcacheoff testname exetgt sideline) | ||
| if (${sideline}) | ||
| set(tname "${testname}-sideline") |
There was a problem hiding this comment.
nit: these are all using . and _ rather than - as a separator: "tool.drcacheoff.burst_static"
There was a problem hiding this comment.
removed since the default is the sideline.
Will add test without sideline
| set(tool.drcacheoff.${testname}_basedir | ||
| macro (torunonly_drcacheoff testname exetgt sideline) | ||
| if (${sideline}) | ||
| set(tname "${testname}-sideline") |
There was a problem hiding this comment.
Rather than exploding the option space, can we just make the thread pool the default? This will make testing much simpler.
| "For the offline analysis mode (when -offline is requested), specifies the number " | ||
| "of sideline threads to be used to write trace files out. " | ||
| "0 means synchronized write without any sideline threads."); | ||
| "Only 0 and 1 is currently supported. " |
| "(NYI) The maximum size of the queue for traces before they are written out", | ||
| "For the offline analysis mode (when -offline is requested), specifies the maximum " | ||
| "size (number of bytes) of the queue for traces before they are written out " | ||
| "The approximate maximum size of the queue for traces before they are written out", |
There was a problem hiding this comment.
As discussed offline: the description can be clearer. As written it sounds like a minimum size, where nothing is written at all until it's reached. Really what you mean is the size at which we'll block the app threads and wait for the queue to shrink.
| static volatile uint queue_tail = 0; | ||
| static void *sideline_exit_event; | ||
| struct buf_entry_t { | ||
| file_t file; // file descriptor, 0 means the end of profiling. |
There was a problem hiding this comment.
nit: INVALID_FILE, not 0
derekbruening
left a comment
There was a problem hiding this comment.
It seems better to support multiple threads now, which is just a couple of lines more of code, and ensures the design won't have to change. It looks like this CL's design will have to change, which we would avoid by having multiple threads to start with.
|
|
||
| /* sideline */ | ||
| static drvector_t buf_queue; | ||
| static uint queue_size = 0; |
There was a problem hiding this comment.
suggest: comment explaining what this is: probably current total size of all buffers, rather than the length of the vector?
| static uint queue_size = 0; | ||
| static uint queue_size_mask = 0; | ||
| static volatile uint queue_front = 0; | ||
| static volatile uint queue_tail = 0; |
There was a problem hiding this comment.
I'm guessing this is a circular buffer and these are array indices? Suggest: comments.
| /* sideline */ | ||
| static drvector_t buf_queue; | ||
| static uint queue_size = 0; | ||
| static uint queue_size_mask = 0; |
There was a problem hiding this comment.
Hmm, no, if this is based on queue_size, then queue_size must be the length of the vector? Why do we need a separate var for that: it's already in buf_queue, right?
| static uint queue_size_mask = 0; | ||
| static volatile uint queue_front = 0; | ||
| static volatile uint queue_tail = 0; | ||
| static void *sideline_exit_event; |
There was a problem hiding this comment.
Don't we need a separate one per thread?
| } | ||
| drvector_lock(&buf_queue); | ||
| if (((queue_front + 1) & queue_size_mask) != queue_tail) { | ||
| drvector_set_entry(&buf_queue, queue_front, entry); |
There was a problem hiding this comment.
So the invariant is that the front is empty? Maybe a comment on that up above? How about an assert that the current front file is INVALID_FILE?
There was a problem hiding this comment.
add comment on the queue_front variable.
| if (file_ops_func.handoff_buf != NULL) { | ||
| if (op_num_threads.get_value()) { | ||
| trace_buf_enqueue(data->file, towrite_start, size); | ||
| } else if (file_ops_func.handoff_buf != NULL) { |
There was a problem hiding this comment.
So you've put a priority here: the handoff is going to be ignored if there's a thread pool. That should be reflected in its docs.
Maybe we should just remove the handoff feature completely, since it's not needed anymore: it can be duplicated by using a thread pool and having file_ops_func.write do whatever you want.
I also think the thread pool should be the default.
| "For the offline analysis mode (when -offline is requested), specifies the number " | ||
| "of sideline threads to be used to write trace files out. " | ||
| "Only 0 and 1 is currently supported. " | ||
| "0 means synchronized write without any sideline threads. "); |
There was a problem hiding this comment.
IMHO the pool should be on by default, for code simplicity.
I'm not sure about only supporting 1 thread: we need to know that the design here supports many threads, and seeing the stop-profiling scheme below I'm not sure that it does. Adding >1 thread seems like just a few lines of code: array of events, for loop when touching those events. It seems better to include multi-thread support now, to ensure the design works (again see below where I'm not sure it does).
There was a problem hiding this comment.
A threading pool with more than one thread has a complex problem, i.e, sideline thread vs open files.
The simplest way is one open file per sideline thread, which requires us to change raw2trace.
If we want to keep current one file per app-thread, writing to same file with multiple threads is complex.
I will go with the one file per sideline thread.
There was a problem hiding this comment.
I don't remember seeing any mention of this in the CL? Did I miss it? E.g., the scheme for putting an entry in the queue to terminate the sideline thread function isn't marked as temporary, so the reader can't tell why only one thread is supported and why we can't just add an array and for loops to easily increase the pool to 100 threads.
| if (op_offline.get_value()) { | ||
| if (op_num_threads.get_value() > 0) { | ||
| trace_buf_enqueue((file_t)0, NULL, 0); // end of profiles | ||
| dr_event_wait(sideline_exit_event); |
There was a problem hiding this comment.
Needs to be a for loop, one event per thread, right?
| queue_size = i; | ||
| if (queue_size < 2) { | ||
| NOTIFY(0, "Usage error: queue size too small\n"); | ||
| DR_ASSERT(false); |
There was a problem hiding this comment.
There are other comments like this that I did not see any reply to.
| get_queue_size(); | ||
| drvector_init(&buf_queue, queue_size, false, NULL); | ||
| dr_create_client_thread(sideline_run, NULL); | ||
| sideline_exit_event = dr_event_create(); |
There was a problem hiding this comment.
for loop around these two lines
Added threading pool for asynchronous trace dump. - used a circular buffer for producer consumer queue, - added fork handling, - updated trace format to write trace per sideline thread instead of app thread, - removed handoff callbacks.
|
Just updated the threading pool implementation. |
|
I put comment for queue_front/queue_tail (replaced with
queue_put/queue_get), which I feel clearer.
…On Mon, Apr 24, 2017 at 9:30 PM, Derek Bruening ***@***.***> wrote:
***@***.**** commented on this pull request.
------------------------------
In clients/drcachesim/tracer/tracer.cpp
<#2319 (comment)>:
> +trace_buf_enqueue(file_t file, void *data, ssize_t size)
+{
+ bool done = false;
+ buf_entry_t *entry = (buf_entry_t *)dr_global_alloc(sizeof(*entry));
+ entry->file = file;
+ entry->data = data;
+ entry->size = size;
+ do {
+ /* quick racy check without locking */
+ while (((queue_front + 1) & queue_size_mask) == queue_tail) {
+ /* the queue is full, waiting */;
+ dr_thread_yield();
+ }
+ drvector_lock(&buf_queue);
+ if (((queue_front + 1) & queue_size_mask) != queue_tail) {
+ drvector_set_entry(&buf_queue, queue_front, entry);
Was this done?
—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub
<#2319 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/ABkcDPeoJc9n0--B16LYEo6jU2371lx6ks5rzUy6gaJpZM4Mu6lp>
.
|
|
I kind of rewriting a lot, including removing entry->file.
…On Mon, Apr 24, 2017 at 9:30 PM, Derek Bruening ***@***.***> wrote:
***@***.**** commented on this pull request.
------------------------------
In clients/drcachesim/tracer/tracer.cpp
<#2319 (comment)>:
> +{
+ buf_entry_t *entry;
+ bool profiling = true;
+ do {
+ /* quick racy check without locking */
+ while (queue_tail == queue_front) {
+ /* the queue is empty, waiting */
+ dr_thread_yield();
+ }
+ entry = trace_buf_dequeue();
+ if (entry == NULL)
+ continue;
+ if (entry->data != NULL) {
+ file_ops_func.write_file(entry->file, entry->data, entry->size);
+ dr_raw_mem_free(entry->data, max_buf_size);
+ } else if (entry->file != 0) {
This one?
—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub
<#2319 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/ABkcDEldw-0AzHlp_15RdV0_fdw_1Qcoks5rzUzBgaJpZM4Mu6lp>
.
|
derekbruening
left a comment
There was a problem hiding this comment.
Looks pretty good though there seem to be a few issues: memory leak, ignoring return values, etc.. Most comments are suggestions or style nits.
Don't really want to spend more time it so approving with requests for changes.
| virtual int append_thread_exit(byte *buf_ptr, thread_id_t tid) = 0; | ||
| virtual int append_iflush(byte *buf_ptr, addr_t start, size_t size) = 0; | ||
| virtual int append_thread_header(byte *buf_ptr, thread_id_t tid) = 0; | ||
| // This is a per-thread-file-writeout header. |
There was a problem hiding this comment.
"per-thread-file-writeout" means every time we write to a per-thread file -- so how would that be different from the per-buffer-writeout header? Do you mean just "per-thread-file"?
There was a problem hiding this comment.
done
I was trying to say the write-out/sideline thread instead of app thread.
| "for later offline analysis. No simulator is executed."); | ||
|
|
||
| droption_t<unsigned int> op_num_threads | ||
| (DROPTION_SCOPE_CLIENT, "num_threads", 1, "Number of sideline threads for writing traces", |
There was a problem hiding this comment.
Why 1? It seems too small. How was it chosen?
There was a problem hiding this comment.
We set it 1 so the online simulation would work here.
could this been dynamically set here?
Ideally, it would be something like 1/4 of number of cpus or something. Any cross-platform way to get number of CPUs?
|
|
||
| int | ||
| offline_instru_t::append_thread_header(byte *buf_ptr, thread_id_t tid) | ||
| offline_instru_t::append_thread_file_header(byte *buf_ptr) |
There was a problem hiding this comment.
Kind of confusing to have "thread" in the name: makes me wonder why there's a pid in the header but no tid? Maybe just "append_file_header".
| torunonly_ci(tool.reuse ${ci_shared_app} drcachesim | ||
| "reuse_distance.c" # for templatex basename | ||
| "-ipc_name drtestpipe6 -simulator_type reuse_distance -reuse_distance_threshold 256" "" "") | ||
| "-ipc_name drtestpipe6 -num_threads 0 -simulator_type reuse_distance -reuse_distance_threshold 256" "" "") |
There was a problem hiding this comment.
? Why no thread pool?
There was a problem hiding this comment.
We had the thread pool test for drcachesim online simulation
Here I want to test the online simulation without threading pool, unless we do not need to support that any more.
There was a problem hiding this comment.
OK but it is confusing b/c this looks like there's some relationship to the reuse-distance tool: like the reuse-distance tool doesn't support the thread pool or sthg. I would suggest either a comment explaining, or adding a new test run that looks just an existing run except it has -num_threads 0.
There was a problem hiding this comment.
done, add a separate test instead.
| prefix_cmd_if_necessary(drcachesim_path ON ${drcachesim_path}) | ||
|
|
||
| macro (torunonly_drcacheoff testname exetgt) | ||
| macro (torunonly_drcacheoff testestname exetgt) |
There was a problem hiding this comment.
? "testestname"? The original is better: is this is a mistake?
There was a problem hiding this comment.
done
my stupid reg replace mistake.
| dr_global_alloc(op_num_threads.get_value() * sizeof(sideline_exit_events[0])); | ||
| for (i = 0; i < op_num_threads.get_value(); i++) { | ||
| sideline_exit_events[i] = dr_event_create(); | ||
| sideline_args[i].id = i; |
| sideline_args[i].id = i; | ||
| if (op_offline.get_value()) | ||
| sideline_args[i].file = create_thread_file((ptr_int_t)i); | ||
| dr_create_client_thread(sideline_run, &sideline_args[i]); |
There was a problem hiding this comment.
Check the return value
| } | ||
| if (op_num_threads.get_value() > 1) { | ||
| NOTIFY(0, "Usage error: only one sideline thread is supported " | ||
| "for online simulation\n"); |
There was a problem hiding this comment.
Why? Please explain in a comment: is this temporary or permanent?
Is this documented in options.cpp?
There was a problem hiding this comment.
done.
only one named pipe, so one thread makes sense, right?
There was a problem hiding this comment.
Will something not work correctly if there's more than one thread in the pool? Or it will work, you just think there's no perf improvement?
| entry->data = data; | ||
| entry->size = size; | ||
| entry->tid = tid; | ||
| drvector_set_entry(&circular_buf_queue, queue_put, entry); |
There was a problem hiding this comment.
Isn't this redundant?
| queue_get = 0; | ||
| } | ||
|
|
||
| void event_fork_init(void *drcontext) |
There was a problem hiding this comment.
style violation: \n
Also, should be static
Added threading pool for asynchronous trace dump. - used a circular buffer for producer consumer queue, - added fork handling, - updated trace format to write trace per sideline thread instead of app thread, - removed handoff callbacks.
zhaoqin
left a comment
There was a problem hiding this comment.
updated based on the review.
| "for later offline analysis. No simulator is executed."); | ||
|
|
||
| droption_t<unsigned int> op_num_threads | ||
| (DROPTION_SCOPE_CLIENT, "num_threads", 1, "Number of sideline threads for writing traces", |
There was a problem hiding this comment.
We set it 1 so the online simulation would work here.
could this been dynamically set here?
Ideally, it would be something like 1/4 of number of cpus or something. Any cross-platform way to get number of CPUs?
| virtual int append_thread_exit(byte *buf_ptr, thread_id_t tid) = 0; | ||
| virtual int append_iflush(byte *buf_ptr, addr_t start, size_t size) = 0; | ||
| virtual int append_thread_header(byte *buf_ptr, thread_id_t tid) = 0; | ||
| // This is a per-thread-file-writeout header. |
There was a problem hiding this comment.
done
I was trying to say the write-out/sideline thread instead of app thread.
|
|
||
| int | ||
| offline_instru_t::append_thread_header(byte *buf_ptr, thread_id_t tid) | ||
| offline_instru_t::append_thread_file_header(byte *buf_ptr) |
| online_instru_t::append_thread_header(byte *buf_ptr, thread_id_t tid) | ||
| online_instru_t::append_thread_file_header(byte *buf_ptr) | ||
| { | ||
| // The caller separately calls append_tid for us which is all we need. |
| torunonly_ci(tool.reuse ${ci_shared_app} drcachesim | ||
| "reuse_distance.c" # for templatex basename | ||
| "-ipc_name drtestpipe6 -simulator_type reuse_distance -reuse_distance_threshold 256" "" "") | ||
| "-ipc_name drtestpipe6 -num_threads 0 -simulator_type reuse_distance -reuse_distance_threshold 256" "" "") |
There was a problem hiding this comment.
We had the thread pool test for drcachesim online simulation
Here I want to test the online simulation without threading pool, unless we do not need to support that any more.
| dr_global_alloc(op_num_threads.get_value() * sizeof(sideline_exit_events[0])); | ||
| for (i = 0; i < op_num_threads.get_value(); i++) { | ||
| sideline_exit_events[i] = dr_event_create(); | ||
| sideline_args[i].id = i; |
| sideline_args[i].id = i; | ||
| if (op_offline.get_value()) | ||
| sideline_args[i].file = create_thread_file((ptr_int_t)i); | ||
| dr_create_client_thread(sideline_run, &sideline_args[i]); |
| queue_get = 0; | ||
| } | ||
|
|
||
| void event_fork_init(void *drcontext) |
| } | ||
| if (op_num_threads.get_value() > 1) { | ||
| NOTIFY(0, "Usage error: only one sideline thread is supported " | ||
| "for online simulation\n"); |
There was a problem hiding this comment.
done.
only one named pipe, so one thread makes sense, right?
| redzone_size = instru->sizeof_entry() * MAX_NUM_ENTRIES; | ||
| max_buf_size = trace_buf_size + redzone_size; | ||
| buf_hdr_slots_size = instru->sizeof_entry() * BUF_HDR_SLOTS; | ||
| /* get the header size by faking an unit header write */ |
|
It probably will work but may lose accuracy, because more than one threads
are writing to the pipe at the same time.
I could remove the check
…On Wed, Apr 26, 2017 at 12:28 AM, Derek Bruening ***@***.***> wrote:
***@***.**** commented on this pull request.
------------------------------
In clients/drcachesim/tracer/tracer.cpp
<#2319 (comment)>:
> @@ -971,11 +1213,18 @@ drmemtrace_client_main(client_id_t id, int argc, const char *argv[])
droption_parser_t::usage_short(DROPTION_SCOPE_ALL).c_str());
dr_abort();
}
- if (!op_offline.get_value() && op_ipc_name.get_value().empty()) {
- NOTIFY(0, "Usage error: ipc name is required\nUsage:\n%s",
- droption_parser_t::usage_short(DROPTION_SCOPE_ALL).c_str());
- dr_abort();
- } else if (op_offline.get_value() && op_outdir.get_value().empty()) {
+ if (!op_offline.get_value()) {
+ if (op_ipc_name.get_value().empty()) {
+ NOTIFY(0, "Usage error: ipc name is required\nUsage:\n%s",
+ droption_parser_t::usage_short(DROPTION_SCOPE_ALL).c_str());
+ dr_abort();
+ }
+ if (op_num_threads.get_value() > 1) {
+ NOTIFY(0, "Usage error: only one sideline thread is supported "
+ "for online simulation\n");
Will something not work correctly if there's more than one thread in the
pool? Or it will work, you just think there's no perf improvement?
—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub
<#2319 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/ABkcDP-WKX2T3rlcNp-tERh3Sxx7PIvCks5rzsfcgaJpZM4Mu6lp>
.
|
|
No, still working on the corrrectness, saw some hangs.
…On Wed, Apr 26, 2017 at 11:04 AM, Derek Bruening ***@***.***> wrote:
***@***.**** commented on this pull request.
------------------------------
In clients/drcachesim/tracer/tracer.cpp
<#2319 (comment)>:
> } else {
/* pass pid and tid to the simulator to register current thread */
- proc_info = (byte *)buf;
- DR_ASSERT(BUFFER_SIZE_BYTES(buf) >= 3*instru->sizeof_entry());
- proc_info += instru->append_thread_header(proc_info, dr_get_thread_id(drcontext));
- proc_info += instru->append_tid(proc_info, dr_get_thread_id(drcontext));
- proc_info += instru->append_pid(proc_info, dr_get_process_id());
- write_trace_data(drcontext, (byte *)buf, proc_info);
-
- /* put buf_base to TLS plus header slots as starting buf_ptr */
- BUF_PTR(data->seg_base) = data->buf_base + buf_hdr_slots_size;
+ byte *buf = (byte *)
+ dr_raw_mem_alloc(max_buf_size, DR_MEMPROT_READ | DR_MEMPROT_WRITE, NULL);
Did you measure the performance for online and for offline? We have some
numbers for SPEC in the tracker and it would be good to see what effect the
pool has on a simple app like that.
—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub
<#2319 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/ABkcDJ3KEU1q4IxHVxjTwOkSKuO4ktqHks5rz1zjgaJpZM4Mu6lp>
.
|
|
PTAL |
|
New commit message: Added threading pool for asynchronous trace writing out.
|
derekbruening
left a comment
There was a problem hiding this comment.
Two main comments:
-
If this slows everything down it should be off by default until perf is improved
-
We need to add back thread and file footers for raw2trace: this change removes both thread and file truncation sanity checks
| "for later offline analysis. No simulator is executed."); | ||
|
|
||
| droption_t<unsigned int> op_num_threads | ||
| (DROPTION_SCOPE_CLIENT, "num_threads", IF_WINDOWS_ELSE(0, 1), |
There was a problem hiding this comment.
This needs a comment: why is it different for Windows?
Also, if the pool is a perf hit (esp w/ only 1 thread) for offline (but we expect to make it a net win with tuning, right?), should it be disabled by default on all platforms to avoid slowing everyone down?
If this really depends on online vs offline then have the default set in C++ code and describe it in the docs here.
| @@ -403,13 +403,13 @@ raw2trace_t::merge_and_process_thread_files() | |||
| // The current thread we're processing is tidx. If it's set to thread_files.size() | |||
There was a problem hiding this comment.
current file
I guess leaving as "tidx" is fine: less code churn.
| } | ||
| } | ||
| VPRINT(2, "Next thread in timestamp order is %u @0x" ZHEX64_FORMAT_STRING | ||
| "\n", (uint)tids[next_tidx], times[next_tidx]); |
There was a problem hiding this comment.
Why was this removed? I found it useful for seeing the operation: which file is next.
| WARN("Input file for thread %d is truncated", (uint)tids[tidx]); | ||
| in_entry.extended.type = OFFLINE_TYPE_EXTENDED; | ||
| in_entry.extended.ext = OFFLINE_EXT_TYPE_FOOTER; | ||
| VPRINT(2, "Finish read file %d\n", tidx); |
| offline_entry_t entry; | ||
| if (thread_files[tidx]->read((char*)&entry, sizeof(entry)) || | ||
| !thread_files[tidx]->eof()) | ||
| FATAL_ERROR("Footer is not the final entry"); |
There was a problem hiding this comment.
We're now missing two checks:
-
That each thread ends with a footer: this is important to know whether we had thread truncation. This was checked above at lines 458-462. I think we have to put this check back in (will have to iterate the threads at the end I guess), along with the warning and workaround.
-
That each file is not truncated. We probably need a new kind of footer now, a file footer entry.
| buf_entry_t *entry; | ||
| do { | ||
| /* quick racy check without locking */ | ||
| while (((queue_put + 1) & queue_size_mask) == queue_get) { |
There was a problem hiding this comment.
You said you saw major performance issues involving contention on the queue: please add comments about that so we know this code needs improvement.
| "${PROJECT_SOURCE_DIR}/clients/drcachesim/tests") | ||
| set(tool.drcachesim.threads_rawtemp ON) # no preprocessor | ||
| set(tool.drcachesim.threads_timeout 150) # This test is long. | ||
| set(tool.drcachesim.threads_timeout 200) # This test is long. |
There was a problem hiding this comment.
The pool slows all of these down so much we need longer timeouts?? That's not good. Sounds like this should definitely be off by default until its performance is improved.
| write_trace_data(file_t file, void *data, ssize_t size) | ||
| { | ||
| ssize_t written = file_ops_func.write_file(file, data, size); | ||
| DR_ASSERT(written == size); |
There was a problem hiding this comment.
This used to be NOTIFY0 with an understandable message and dr_abort, which seems better than "written==size" being printed
|
Also, the commit message should talk about the performance, since that's a big TODO item: reading the current commit message it sounds like this thread pool is all finished. |
… out traces Added threading pool for asynchronous trace writing out. - added two options for threading pool configuration, - used a circular buffer for producer-consumer queue, - added fork handling, - updated trace format to write trace per sideline thread instead of app thread, - added a new offline ext type, - updated raw2trace to handle the new trace format, - added new tests for different threading pool configurations, - removed handoff callbacks and corresponding tests. TODO: the initial threading pool implementation caused a big slowdown on online simulation, we need more optimization and tuning to improve the performance.
| @@ -403,13 +403,13 @@ raw2trace_t::merge_and_process_thread_files() | |||
| // The current thread we're processing is tidx. If it's set to thread_files.size() | |||
| } | ||
| } | ||
| VPRINT(2, "Next thread in timestamp order is %u @0x" ZHEX64_FORMAT_STRING | ||
| "\n", (uint)tids[next_tidx], times[next_tidx]); |
| WARN("Input file for thread %d is truncated", (uint)tids[tidx]); | ||
| in_entry.extended.type = OFFLINE_TYPE_EXTENDED; | ||
| in_entry.extended.ext = OFFLINE_EXT_TYPE_FOOTER; | ||
| VPRINT(2, "Finish read file %d\n", tidx); |
| write_trace_data(file_t file, void *data, ssize_t size) | ||
| { | ||
| ssize_t written = file_ops_func.write_file(file, data, size); | ||
| DR_ASSERT(written == size); |
| "${PROJECT_SOURCE_DIR}/clients/drcachesim/tests") | ||
| set(tool.drcachesim.threads_rawtemp ON) # no preprocessor | ||
| set(tool.drcachesim.threads_timeout 150) # This test is long. | ||
| set(tool.drcachesim.threads_timeout 200) # This test is long. |
| buf_entry_t *entry; | ||
| do { | ||
| /* quick racy check without locking */ | ||
| while (((queue_put + 1) & queue_size_mask) == queue_get) { |
| offline_entry_t entry; | ||
| if (thread_files[tidx]->read((char*)&entry, sizeof(entry)) || | ||
| !thread_files[tidx]->eof()) | ||
| FATAL_ERROR("Footer is not the final entry"); |
|
PTAL |
derekbruening
left a comment
There was a problem hiding this comment.
Two cases where the current code has a nice message or a good check but this CL removes it: requesting reinstatement of the prior code for those two cases.
A question about the thread exit ordering which sounds like it may be a big deal. If I misunderstood and the exit is last then this shouldn't need another review.
| VPRINT(2, "Finish reading file %d at pos %u\n", | ||
| tidx, (int)thread_files[tidx]->tellg()); | ||
| // We need do another failed read to make ->eof() return true. | ||
| if (thread_files[tidx]->read((char*)&in_entry, 1)) { |
There was a problem hiding this comment.
The previous code that you deleted seems better, "if (read entry || !eof) fatal..."
| } | ||
| } while (file_count > 0); | ||
| // If we used threading pool during profiling, the trace for a thread may be | ||
| // seen after the thread exit entry, so we can only check whether the unique |
There was a problem hiding this comment.
What? But you can't put a thread exit entry in the final trace here and then put a memref entry for that thread afterward. Is that what's happening due to the timestamp changes? We have to enforce that order, don't we?
| // tid and tid exit matches. | ||
| VPRINT(1, "%u thread(s) are seen\n", (uint)tid_set.size()); | ||
| if (tid_set != tid_exit_set) { | ||
| WARN("Profile might be corrupted."); |
There was a problem hiding this comment.
A more specific message, please! Again the prior code was IMHO better: "Trace for thread %d is truncated".
| /* quick racy check without locking */ | ||
| while (((queue_put + 1) & queue_size_mask) == queue_get) { | ||
| /* XXX: use futex/semaphore for better performance. */ | ||
| /* XXX: there is a big slowdown for the online simulation |
There was a problem hiding this comment.
I thought you said offline was slow too? You plan to put the numbers in the issue tracker, right?
… out traces Added threading pool for asynchronous trace writing out. - added two options for threading pool configuration, - used a circular buffer for producer-consumer queue, - added fork handling, - updated trace format to write trace per sideline thread instead of app thread, - updated raw2trace to handle the new trace format, - added new tests for different threading pool configurations.
Adds two options -num_threads and -queue_size to the tracer for asynchronous
trace write by the sideline threads.
This CL only adds the two options, and the real implementation is to be added.