parallel builds

A few weeks ago I showed how to determine the number of agents used during an ElectricAccelerator build, using some simple analysis of the annotation file it generates. But, I made the unfortunate choice of a pie chart to display the results, and a couple of readers called me to task for that decision. Pie charts, of course, are notoriously hard to use effectively. So, it was back to the drawing board. After some more experimentation, this is what I came up with:

UPDATE:

Some readers have said that this graph is confusing. Blast! OK, here’s how I read it:

The y-axis is number of agents in use. The x-axis is cumulative build time, so a point at x-coordinate 3000 means that for a total of 3000 seconds the build used that many agents or more. Therefore in the graph above, I can see that this build used 48 agents for about 2200 seconds; it used 47 or more agents for about 2207 seconds; etc.

Similarly, you can determine how long the build ran with N agents by finding the line at y-coordinate N and comparing the x-coordinates of the start and end of that line. For example, in the graph above the line for 1 agent starts at about 3100 seconds and ends at about 4100 seconds, so the build used just one agent for a total of about 1000 seconds.

Here’s what I like about this version:

At a glance we can see that this build used 48 agents most of the time, but that it used only one agent for a good chunk of time too.
We can get a sense of the health of the build, or it’s parallel-friendliness, by the shape of the curve— a perfect build will have a steep drop-off far to the right; anything less than that indicates an opportunity for improvement.
We can see all data points, even those of little significance (for example, this build used exactly 35 agents for several seconds). The pie chart stripped out such data points to avoid cluttering the display.
We can plot multiple builds on a single graph.
It’s easier to implement than the pie chart.

Here are some more examples:

Example of a build with great parallelism

Example of a build with good parallelism

Example of a graph showing two builds at once

A glitch in the matrix

While I was generating these graphs, I ran into an interesting problem: in some cases, the algorithm reported that more agents were in use than there were agents on the cluster! Besides being impossible, this skewed my graphs by needlessly inflating the range of the y-axis. Upon further investigation, I found instances of back-to-back jobs on a single agent with start and end times that overlapped, like this:

<job id="J00000001">
  <timing invoked="1.0000" completed="2.0002" node="linbuild1-1"/>
</job>
<job id="J00000002">
  <timing invoked="2.0000" completed="3.0000" node="linbuild1-1"/>
</job>

Based on this data, it appears that there were two jobs running simultaneously on a single agent. This is obviously incorrect, but the naive algorithm I used last time cannot handle this inconsistency — it will erroneously consider this build to have had two agents in use for the brief interval between 2.0000 seconds and 2.0002 seconds, when in reality there was only one agent in use.

There is a logical explanation for how this can happen — and no, it’s not a bug — but it’s beyond the scope of this article. For now, suffice to say that it is to do with making high-resolution measurements of time on a multi-core system. The more pressing question at the moment is, how do we deal with this inconsistency?

Refining the algorithm

To compensate for overlapping timestamps, I added a preprocessing phase that looks for places where the start time of a job on a given agent is earlier than the end time of the previous job to run on that agent. Any time the algorithm detects this situation, it combines the two jobs into a single “pseudo-job” with the start time of the first job, and the end time of the last job:

    $anno indexagents
    foreach agent [$anno agents] {
        set pseudo(start)  -1
        set pseudo(finish) -1
        foreach job [$anno agent jobs $agent] {
            set start  [$anno job start  $job]
            set finish [$anno job finish $job]
            if { $pseudo(start) == -1 } {
                set pseudo(start)  $start
                set pseudo(finish) $finish
            } else {
                if { int($start * 100) <= int($pseudo(finish) * 100) } {
                    set pseudo(finish) $finish
                } else {
                    lappend events \
                        [list $pseudo(start)  $JOB_START_EVENT] \
                        [list $pseudo(finish) $JOB_END_EVENT]
                    set pseudo(start)  $start
                    set pseudo(finish) $finish
                }
            }
        }
    }

With the data thus triaged, we can continue with the original algorithm: sort the list of start and end events by time, then scan the list, incrementing the count of agents in use for each start event, and decrementing it for each end event.

Availability

You can find the updated code here at GitHub. One comment on packaging: I wrote this version of the code as an ElectricInsight report, rather than as a stand-alone script. The installation instructions are simple:

Download AgentUtilization.tcl
Copy the file to one of the following locations:
- <install dir>/ElectricInsight/reports
- (Unix only) $HOME/.ecloud/ElectricInsight/reports
- (Windows only) %USERPROFILE%/Electric Cloud/ElectricInsight/reports
Restart ElectricInsight.

Give it a try!

#!tclsh load annolib.so proc CountAgents {annofile} { global anno total set xml [open $annofile r] set anno [anno create] $anno load $xml # These values will tell us what type of event we have later. set START_EVENT 1 set END_EVENT -1 # Iterate through all the jobs in the build. set first [$anno jobs begin] set last [$anno jobs end] for {set job $first} {$job != $last} {set job [$anno job next $job]} { # Get the timing information for this job. If this job was not # actually run, its timing information will be empty. set t [lindex [$anno job timing $job] 0] if { [llength $t] == 0 } { continue } foreach {start end agent} $t { break } # Add a start and an end event for this job to the master list. lappend events [list $start $START_EVENT] [list $end $END_EVENT] } # Order the events chronologically. set events [lsort -real -increasing -index 0 $events] # Scan the list of events. Every time we see a START event, increment # the count of agents in use; every time we see an END event, decrement # the count. This way, "count" always reflects the number of agents # in use. set count 0 set last 0 foreach event $events { foreach {t e} $event { break } if { ![info exists total($count)] } { set total($count) 0 } # Add the time interval between the current and the previous event # to the total time for "count". set total($count) [expr {$total($count) + ($t - $last)}] # Update the in-use counter. I chose the event type values # so that we can simply add the event type to the counter. incr count $e # Track the current time, so we can compute the size of the next # interval. set last $t } } CountAgents [lindex $argv end]

set output "-raw" if { [llength $argv] >= 2 } { set output [lindex $argv 0] } switch -- $output { "-raw" { foreach count [lsort -integer [array names total]] { if { $total($count) > 0.0001 } { puts "$count $total($count)" } } } "-text" { set duration [$anno duration] puts "Agents in use by portion of build time" foreach count [lsort -integer [array names total]] { set len [expr {round(double($total($count)*70) / $duration)}] if { $len > 0 } { puts [format "%2d %s" $count [string repeat * $len]] } } } "-google" { set url "http://chart.apis.google.com/chart" append url "?chs=300x225" append url "&cht=p" append url "&chtt=Agents+in+use+by+portion+of+build+time" append url "&chco=3399CC" set lbl "" set dat "" set lblsep "" set datsep "" set duration [$anno duration] foreach count [lsort -integer [array names total]] { set pct [expr {($total($count) * 100) / $duration}] if { $pct >= 1.0 } { append lbl $lblsep$count append dat $datsep[format "%0.2f" $pct] set lblsep "|" set datsep "," } } append url "&chd=t:$dat" append url "&chl=$lbl" puts $url } }

eric melski's blog.melski.net

musings on the craft of software development, data visualization and more

An Agent Utilization Report for ElectricInsight

A glitch in the matrix

Refining the algorithm

Availability

How many agents did my build use?

Counting agents in use

A glitch in the matrix

Refining the algorithm

Availability

Share this:

Counting agents in use

Share this: