Profiling OpenMP-parallelized C++ code
What is the easiest way to profile a C++ program parallelized with OpenMP, on a machine on which one has no sudo rights?
I would recommend using the Intel VTune Amplifier XE profiler.
The Basic Hotspots analysis doesn't require root privileges, and you can even install the tool without being in sudoers.
For OpenMP analysis it's best to compile with the Intel OpenMP implementation and set the environment variable KMP_FORKJOIN_FRAMES to 1 before running the profile session. This enables the tool to visualize the time regions from the fork point to the join point of each parallel region, which gives a good idea of where you had sufficient parallelism and where you did not. By using a grid grouping such as Frame Domain / Frame Type / Function, you can also correlate the parallel regions with what was happening on the CPUs, which allows finding the functions that didn't scale.
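To make the fork-to-join idea concrete, a region can also be timed by hand. The sketch below is only an illustration of the concept, not how VTune or KMP_FORKJOIN_FRAMES works internally; `FrameTiming` and `time_parallel_region` are hypothetical names introduced here. It assumes the region is a plain `#pragma omp parallel` without a `num_threads` clause, and it falls back to a single thread when compiled without OpenMP.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical record of one fork-to-join "frame" (not a VTune data structure).
struct FrameTiming {
    double elapsed = 0.0;      // seconds from the fork point to the join point
    int threads = 1;           // number of threads that ran the region
    std::vector<double> busy;  // seconds each thread spent inside the body
};

static double now_seconds() {
    using clock = std::chrono::steady_clock;
    return std::chrono::duration<double>(clock::now().time_since_epoch()).count();
}

// Runs body(thread_id, thread_count) inside one parallel region and times it.
template <typename Body>
FrameTiming time_parallel_region(Body body) {
    FrameTiming frame;
    int max_threads = 1;
#ifdef _OPENMP
    max_threads = omp_get_max_threads();
#endif
    frame.busy.assign(static_cast<std::size_t>(max_threads), 0.0);

    const double fork_point = now_seconds();
    #pragma omp parallel
    {
        int tid = 0;
        int nthreads = 1;
#ifdef _OPENMP
        tid = omp_get_thread_num();
        nthreads = omp_get_num_threads();
#endif
        assert(tid < max_threads);
        const double start = now_seconds();
        body(tid, nthreads);
        // Each thread writes only its own slot, so no synchronization is needed.
        frame.busy[static_cast<std::size_t>(tid)] = now_seconds() - start;
        if (tid == 0) frame.threads = nthreads;  // only one thread writes this
    }
    const double join_point = now_seconds();

    frame.elapsed = join_point - fork_point;
    frame.busy.resize(static_cast<std::size_t>(frame.threads));
    return frame;
}
```

If the largest entry in `busy` is close to `elapsed` while the others are much smaller, the region was imbalanced; if all entries are close to `elapsed`, it was well balanced. A profiler gives the same picture without having to touch the code.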
For example, imagine a simple code like the one below, which runs some balanced work, then some serial work, and then some imbalanced work, calling the delay() function for all of these and making sure that delay() doesn't get inlined. This imitates a real workload in which all kinds of unfamiliar functions may be invoked from parallel regions, making it harder to tell whether the parallelism was good or bad by looking only at a hot-functions profile:
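```cpp
#include <cstdio>
#include <omp.h>

// delay() and calibrate() are defined elsewhere and are not part of this
// listing; the signatures below are assumed from how they are called.
void delay(int amount);
void calibrate();

// Every thread does the same amount of work.
void __attribute__ ((noinline)) balanced_work() {
    printf("starting ideal parallel\n");
    #pragma omp parallel
    delay(3000000);
}

// Only the initial thread works; the others are idle.
void __attribute__ ((noinline)) serial_work() {
    printf("starting serial work\n");
    delay(3000000);
}

// Threads do different amounts of work between barriers.
void __attribute__ ((noinline)) imbalanced_work() {
    printf("starting parallel imbalance\n");
    #pragma omp parallel
    {
        int mythread = omp_get_thread_num();
        int nthreads = omp_get_num_threads();
        delay(1000000);
        printf("first barrier %d\n", mythread);
        #pragma omp barrier
        // Higher thread ids work longer, so lower ids wait at the barrier.
        delay(mythread * 25000 + 200000);
        printf("second barrier %d\n", mythread);
        #pragma omp barrier
        // The imbalance is reversed: lower thread ids now work longer.
        delay((nthreads - 1 - mythread) * 25000 + 200000);
        printf("join barrier %d\n", mythread);
    }
}

int main(int argc, char **argv) {
    setvbuf(stdout, NULL, _IONBF, 0);  // unbuffered output
    calibrate();
    balanced_work();
    serial_work();
    imbalanced_work();
    printf("bye bye\n");
}
```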
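The answer does not show delay() or calibrate(). The following is a minimal sketch of what they might look like, assuming that delay() takes an amount in microseconds and burns CPU in a spin loop, and that calibrate() measures how fast that loop runs. The names `spin` and `g_iters_per_usec` and the probe size are my own assumptions, not part of the original code.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Assumption: spin-loop iterations per microsecond, measured by calibrate().
static double g_iters_per_usec = 1.0;

// Busy loop; the volatile local keeps the compiler from removing the loop
// and avoids sharing a variable between threads.
static void spin(std::uint64_t iterations) {
    volatile std::uint64_t sink = 0;
    for (std::uint64_t i = 0; i < iterations; ++i) {
        sink = sink + i;
    }
}

// Times a fixed number of iterations to estimate the loop speed.
void calibrate() {
    const std::uint64_t probe = 10000000;  // arbitrary probe size
    const auto start = std::chrono::steady_clock::now();
    spin(probe);
    const auto stop = std::chrono::steady_clock::now();
    const double usec =
        std::chrono::duration<double, std::micro>(stop - start).count();
    if (usec > 0.0) {
        g_iters_per_usec = static_cast<double>(probe) / usec;
    }
    assert(g_iters_per_usec > 0.0);
}

// Burns CPU for roughly `usec` microseconds (assumed unit). noinline keeps
// delay() visible as its own function in the profile.
void __attribute__ ((noinline)) delay(int usec) {
    if (usec <= 0) return;
    spin(static_cast<std::uint64_t>(usec * g_iters_per_usec));
}
```

A busy loop is used rather than a sleep because a sleeping thread consumes no CPU time, so it would not show up as a hot function in a CPU profile.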
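For the example program, a typical function profile will show that most of the time is spent in the delay() function. On the other hand, viewing the data with frame grouping and CPU usage information in VTune will give an idea of what is serial, what is imbalanced and what is balanced. Here is what you might see in VTune:
[VTune screenshot]
Here one can see that: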
- There were 13.671 seconds of elapsed time spent executing the imbalanced region. One can see the imbalance in the CPU usage breakdown.
- There were 3.652 seconds of elapsed time that were pretty well balanced. There is some red time here, which is likely system effects and would be worth investigating in a real-world case.
- And then there are about 4 seconds of serial time. Figuring out that it's 4 seconds is a bit tricky: you have to take the elapsed time from the summary (21.276 in this case) and subtract 13.671 and 3.652 from it, which yields about four (3.953). But easy enough.
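The subtraction in the last point can be written down as a small helper. This is only a sketch of the arithmetic; `serial_seconds` is a hypothetical name, and it assumes that all times are in seconds and that the parallel-region frames do not overlap (no nested regions).

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Serial time is the part of the total elapsed time that is not covered
// by any parallel-region frame.
double serial_seconds(double total_elapsed,
                      const std::vector<double>& frame_seconds) {
    assert(total_elapsed >= 0.0);
    const double parallel =
        std::accumulate(frame_seconds.begin(), frame_seconds.end(), 0.0);
    return total_elapsed - parallel;
}
```

With the numbers above, `serial_seconds(21.276, {13.671, 3.652})` gives approximately 3.953, which matches the roughly 4 seconds of serial work.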
Hope this helps.