An Agent Utilization Report for ElectricInsight

A few weeks ago I showed how to determine the number of agents used during an ElectricAccelerator build, using some simple analysis of the annotation file it generates. But, I made the unfortunate choice of a pie chart to display the results, and a couple of readers called me to task for that decision. Pie charts, of course, are notoriously hard to use effectively. So, it was back to the drawing board. After some more experimentation, this is what I came up with:

UPDATE:

Some readers have said that this graph is confusing. Blast! OK, here’s how I read it:

The y-axis is number of agents in use. The x-axis is cumulative build time, so a point at x-coordinate 3000 means that for a total of 3000 seconds the build used that many agents or more. Therefore in the graph above, I can see that this build used 48 agents for about 2200 seconds; it used 47 or more agents for about 2207 seconds; etc.

Similarly, you can determine how long the build ran with N agents by finding the line at y-coordinate N and comparing the x-coordinates of the start and end of that line. For example, in the graph above the line for 1 agent starts at about 3100 seconds and ends at about 4100 seconds, so the build used just one agent for a total of about 1000 seconds.

Here’s what I like about this version:

  • At a glance we can see that this build used 48 agents most of the time, but that it used only one agent for a good chunk of time too.
  • We can get a sense of the health of the build, or it’s parallel-friendliness, by the shape of the curve— a perfect build will have a steep drop-off far to the right; anything less than that indicates an opportunity for improvement.
  • We can see all data points, even those of little significance (for example, this build used exactly 35 agents for several seconds). The pie chart stripped out such data points to avoid cluttering the display.
  • We can plot multiple builds on a single graph.
  • It’s easier to implement than the pie chart.

Here are some more examples:

Example of a build with great parallelismExample of a build with good parallelism
Example of a build with OK parallelismExample of a graph showing two builds at once

A glitch in the matrix

While I was generating these graphs, I ran into an interesting problem: in some cases, the algorithm reported that more agents were in use than there were agents on the cluster! Besides being impossible, this skewed my graphs by needlessly inflating the range of the y-axis. Upon further investigation, I found instances of back-to-back jobs on a single agent with start and end times that overlapped, like this:

<job id="J00000001">
  <timing invoked="1.0000" completed="2.0002" node="linbuild1-1"/>
</job>
<job id="J00000002">
  <timing invoked="2.0000" completed="3.0000" node="linbuild1-1"/>
</job>

Based on this data, it appears that there were two jobs running simultaneously on a single agent. This is obviously incorrect, but the naive algorithm I used last time cannot handle this inconsistency — it will erroneously consider this build to have had two agents in use for the brief interval between 2.0000 seconds and 2.0002 seconds, when in reality there was only one agent in use.

There is a logical explanation for how this can happen — and no, it’s not a bug — but it’s beyond the scope of this article. For now, suffice to say that it is to do with making high-resolution measurements of time on a multi-core system. The more pressing question at the moment is, how do we deal with this inconsistency?

Refining the algorithm

To compensate for overlapping timestamps, I added a preprocessing phase that looks for places where the start time of a job on a given agent is earlier than the end time of the previous job to run on that agent. Any time the algorithm detects this situation, it combines the two jobs into a single “pseudo-job” with the start time of the first job, and the end time of the last job:

    $anno indexagents
    foreach agent [$anno agents] {
        set pseudo(start)  -1
        set pseudo(finish) -1
        foreach job [$anno agent jobs $agent] {
            set start  [$anno job start  $job]
            set finish [$anno job finish $job]
            if { $pseudo(start) == -1 } {
                set pseudo(start)  $start
                set pseudo(finish) $finish
            } else {
                if { int($start * 100) <= int($pseudo(finish) * 100) } {
                    set pseudo(finish) $finish
                } else {
                    lappend events \
                        [list $pseudo(start)  $JOB_START_EVENT] \
                        [list $pseudo(finish) $JOB_END_EVENT]
                    set pseudo(start)  $start
                    set pseudo(finish) $finish
                }
            }
        }
    }

With the data thus triaged, we can continue with the original algorithm: sort the list of start and end events by time, then scan the list, incrementing the count of agents in use for each start event, and decrementing it for each end event.

Availability

You can find the updated code here at GitHub. One comment on packaging: I wrote this version of the code as an ElectricInsight report, rather than as a stand-alone script. The installation instructions are simple:

  1. Download AgentUtilization.tcl
  2. Copy the file to one of the following locations:
    • <install dir>/ElectricInsight/reports
    • (Unix only) $HOME/.ecloud/ElectricInsight/reports
    • (Windows only) %USERPROFILE%/Electric Cloud/ElectricInsight/reports
  3. Restart ElectricInsight.

Give it a try!

How many agents did my build use?

When you run a parallel build, how many jobs are actually running in parallel during the life of the build? If you’re using ElectricAccelerator, you can load the build annotation file in ElectricInsight and eyeball it, as long as you have a small, uncongested cluster. But if you have a big cluster, and lots of other builds running simultaneously, the build may touch many more distinct agents than it actually uses simultaneously at any given point. It’d be great to see a simple chart like this:

With this graph I can see at a glance that this build used 48 agents most of the time, although there was a lot of time when it used only one agent, probably due to serializations in the build. In this post I’ll show you how to generate a report like this using data from an annotation file.

Counting agents in use

Counting the agents in use over the lifetime of the build is a simple algorithm: make a list of all the job start and end events in the build, sorted by time. Then scan the list, incrementing the count of agents in use every time you find a start event, and decrementing it every time you find an end event. Here’s the code, using annolib, the annotation analysis library:

#!tclsh
load annolib.so

proc CountAgents {annofile} {
    global anno total

    set xml  [open $annofile r]
    set anno [anno create]
    $anno load $xml

    # These values will tell us what type of event we have later.

    set START_EVENT  1
    set END_EVENT   -1

    # Iterate through all the jobs in the build.

    set first [$anno jobs begin]
    set last  [$anno jobs end]
    for {set job $first} {$job != $last} {set job [$anno job next $job]} {
        # Get the timing information for this job.  If this job was not
        # actually run, its timing information will be empty.

        set t [lindex [$anno job timing $job] 0]
        if { [llength $t] == 0 } {
            continue
        }
        foreach {start end agent} $t {
            break
        }

        # Add a start and an end event for this job to the master list.

        lappend events [list $start $START_EVENT] [list $end $END_EVENT]
    }

    # Order the events chronologically.

    set events [lsort -real -increasing -index 0 $events]

    # Scan the list of events.  Every time we see a START event, increment
    # the count of agents in use; every time we see an END event, decrement
    # the count.  This way, "count" always reflects the number of agents
    # in use.

    set count 0
    set last  0
    foreach event $events {
        foreach {t e} $event { break }
        if { ![info exists total($count)] } {
            set total($count) 0
        }

        # Add the time interval between the current and the previous event 
        # to the total time for "count".

        set total($count) [expr {$total($count) + ($t - $last)}]

        # Update the in-use counter.  I chose the event type values
        # so that we can simply add the event type to the counter.

        incr count $e

        # Track the current time, so we can compute the size of the next
        # interval.

        set last $t
    }
}

CountAgents [lindex $argv end]

After this code runs, we’ll have the amount of time spent using one agent, two agents, three agents, etc. in the global array total. The only thing left to do is output the result in a usable form:

set output "-raw"
if { [llength $argv] >= 2 } {
    set output [lindex $argv 0]
}
switch -- $output {
    "-raw" {
        foreach count [lsort -integer [array names total]] {
            if { $total($count) > 0.0001 } {
                puts "$count $total($count)"
            }
        }
    }

    "-text" {
        set duration [$anno duration]
        puts "Agents in use by portion of build time"
        foreach count [lsort -integer [array names total]] {
            set len [expr {round(double($total($count)*70) / $duration)}]
            if { $len > 0 } {
                puts [format "%2d %s" $count [string repeat * $len]]
            }
        }
    }

    "-google" {
        set url "http://chart.apis.google.com/chart"
        append url "?chs=300x225"
        append url "&cht=p"
        append url "&chtt=Agents+in+use+by+portion+of+build+time"
        append url "&chco=3399CC"
        set lbl ""
        set dat ""
        set lblsep ""
        set datsep ""
        set duration [$anno duration]
        foreach count [lsort -integer [array names total]]  {
            set pct [expr {($total($count) * 100) / $duration}]
            if { $pct >= 1.0 } {
                append lbl $lblsep$count
                append dat $datsep[format "%0.2f" $pct]
                set lblsep "|"
                set datsep ","
            }
        }
        append url "&chd=t:$dat"
        append url "&chl=$lbl"
        puts $url
    }
}

This gives us three choices for the output format:

  • -raw, which just dumps the raw data, one entry per line.
  • -text, which formats the data as a simple ASCII bar chart.
  • -google, which emits a Google Charts URL you can put into your browser to see a chart like the one at the top of this post.

For example, if I run this script as tclsh count_agents.tcl -text sample.xml, the output looks like this:

Agents in use by portion of build time 0 *** 1 ***************** 2 *** 3 * 4 * 5 * 47 * 48 ************************************

So that’s it: another trivial annolib script, another slick build visualization!