Profiling a multi-tiered, distributed, web application


Question:

I would like to profile a complex web application from the server PoV.

As described on Wikipedia and in the Stack Overflow profiling tag description, profiling (in one of its forms) means getting a list (or a graphical representation) of the APIs/components of an application, each with the number of calls made to it and the time spent in it at run-time.

Note that unlike a traditional one-program/one-language application, a web server application may be:

  • Distributed over multiple machines
  • Different components may be written in different languages
  • Different components may be running on top of different OSes, etc.

So the traditional "Just use a profiler" answer is not easily applicable to this problem.

I'm not looking for:

  • Coarse performance stats like the ones provided by various log-analysis tools (e.g. Analog), nor for
  • Client-side, per-page performance stats like the ones presented by tools such as Google's PageSpeed or Yahoo!'s YSlow (waterfall diagrams, browser component load times)

Instead, I'm looking for a classic profiler-style report:

  • number of calls
  • call durations

by function/API/component-name, on the server-side of the web application.

Bottom line, the question is:

How can one profile a multi-tiered, multi-platform, distributed web application?

A free-software based solution is much preferred.

I have been searching the web for a solution for a while and couldn't find anything satisfactory that fits my needs, except for some pretty expensive commercial offerings. In the end, I bit the bullet, thought about the problem, and wrote my own solution, which I wanted to share freely.

I'm posting my own solution since this practice is encouraged on SO.

This solution is far from perfect. For example, it operates at a very high level (individual URLs), which may not be good for all use-cases. Nevertheless, it has helped me immensely in trying to understand where my web-app spends its time.

In the spirit of open source and knowledge sharing, I welcome any other approaches and solutions from others, especially superior ones.

Answer 1:

Thinking of how traditional profilers work, it should be straightforward to come up with a general free-software solution to this challenge.

Let's break the problem into two parts:

  • Collecting the data
  • Presenting the data

Collecting the data

Assume we can break our web application into its individual constituent parts (APIs, functions) and measure the time it takes each of these parts to complete. Each part is called thousands of times a day, so we can collect this data over a full day or so, on multiple hosts. When the day is over we would have a pretty big and relevant data-set.

Epiphany #1: replace 'function' with 'URL', and our existing web-logs are "it". The data we need is already there:

  • Each part of a web API is defined by the request URL (possibly with some parameters)
  • The round-trip times (often in microsecond resolution) appear on each log line
  • We have a day's (or week's, or month's) worth of lines with this data handy

So if we have access to standard web-logs for all the distributed parts of our web application, part one of our problem (collecting the data) is solved.
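
As a concrete illustration, here is a minimal extraction sketch in R. It assumes an Apache-style access log named 'access.log' in which the request path is the 7th whitespace-separated field and the response time in microseconds (%D) was appended as the last field; both the file name and the field positions are assumptions to adapt to your own log format:

#!/usr/bin/Rscript --vanilla
# Sketch: extract (URL, latency-in-microseconds) pairs from an
# Apache-style access log with %D appended as the last field.
# 'access.log' and the field positions are assumptions.
fields <- strsplit(readLines('access.log'), ' +')
url <- sapply(fields, function(f) f[7])                      # request path
lat <- sapply(fields, function(f) as.numeric(f[length(f)])) # %D in microseconds
write.table(data.frame(url, lat), file='api-lat.tsv', sep='\t',
            quote=FALSE, row.names=FALSE, col.names=FALSE)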

Presenting the data

Now we have a big data-set, but still no real insight. How can we gain insight?

Epiphany #2: visualize our (multiple) web-server logs directly.

A picture is worth a thousand words. Which picture can we use?

We need to condense hundreds of thousands, or millions, of lines from multiple web-server logs into a short summary that tells most of the story about our performance. In other words: the goal is to generate a profiler-like report, or even better, a graphical profiler report, directly from our web logs.

Imagine we could map:

  • Call-latencies to one dimension
  • Number of calls to another dimension, and
  • The function identities to a color (essentially a 3rd dimension)

One such picture, a stacked density chart of latencies by API, appears below (function names were made up for illustrative purposes).

The chart:

[Image: stacked density chart of call counts (in 1000s) vs. latency on a log10 x-axis, one color per API]

Some observations from this example:

  • We have a tri-modal distribution, representing 3 radically different 'worlds' in our application:
      • The fastest responses are centered around ~300 microseconds of latency. These responses come from our Varnish cache
      • The second fastest, taking a bit less than 0.01 seconds on average, come from various APIs served by our middle-layer web application (Apache/Tomcat)
      • The slowest responses, centered around 0.1 seconds and sometimes taking several seconds, involve round-trips to our SQL database

We can see how dramatic the effect of caching can be on an application (note that the x-axis is on a log10 scale)

We can specifically see which APIs tend to be fast vs slow, so we know what to focus on.

We can see which APIs are called most often each day. We can also see that some of them are so rarely called that it is hard to even see their color on the chart.

How to do it?

The first step is to pre-process the logs and extract the needed subset of the data. A trivial utility like Unix 'cut' applied to multiple logs may be sufficient here. You may also need to collapse multiple similar URLs into shorter strings describing the function/API, like 'registration' or 'purchase' (a sketch of this step appears after the sample below). If you have a multi-host unified log view generated by a load-balancer, this task may be easier. We extract only the names of the APIs (URLs) and their latencies, so we end up with one big file holding a pair of TAB-separated columns:

API_Name   Latency_in_microSecs


func_01    32734
func_01    32851
func_06    598452
...
func_11    232734
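
A minimal sketch, in R, of the URL-collapsing step mentioned above, continuing from the extraction sketch earlier (the patterns and API names here are made-up examples; substitute a mapping that matches your own URL scheme):

# Sketch: collapse raw request URLs into short API names.
# The patterns and names below are hypothetical examples.
api.name <- function(u) {
    if (grepl('^/register', u)) return('registration')
    if (grepl('^/buy|^/checkout', u)) return('purchase')
    sub('\\?.*$', '', u)    # default: bare path, query-string stripped
}
url <- sapply(url, api.name)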

Now we run the R script below on the resulting data pairs to produce the wanted chart (using Hadley Wickham's wonderful ggplot2 library). Voilà!

The code to generate the chart

Finally, here's the code to produce the chart from the API+Latency TSV data file:

#!/usr/bin/Rscript --vanilla
#
# Generate stacked chart of API latencies by API from a TSV data-set
#
# ariel faigon - Dec 2012
#
.libPaths(c('~/local/lib/R',
         '/usr/lib/R/library',
         '/usr/lib/R/site-library'
))

suppressPackageStartupMessages(library(ggplot2))
# grid lib needed for 'unit()':
suppressPackageStartupMessages(library(grid))

#
# Constants: width, height, resolution, font-colors and styles
# Adapt to taste
#
wh.ratio = 2
WIDTH = 8
HEIGHT = WIDTH / wh.ratio
DPI = 200
FONTSIZE = 11
MyGray = gray(0.5)

title.theme   = element_text(family="FreeSans", face="bold.italic",
                        size=FONTSIZE)
x.label.theme = element_text(family="FreeSans", face="bold.italic",
                        size=FONTSIZE-1, vjust=-0.1)
y.label.theme = element_text(family="FreeSans", face="bold.italic",
                       size=FONTSIZE-1, angle=90, vjust=0.2)
x.axis.theme  = element_text(family="FreeSans", face="bold",
                        size=FONTSIZE-1, colour=MyGray)
y.axis.theme  = element_text(family="FreeSans", face="bold",
                        size=FONTSIZE-1, colour=MyGray)

#
# Function generating well-spaced & well-labeled y-axis (count) breaks
#
yscale_breaks <- function(from.to) {
    from <- 0
    to <- from.to[2]
    # round to 10 ceiling
    to <- ceiling(to / 10.0) * 10
    # Count major breaks on 10^N boundaries, include the 0
    n.maj = 1 + ceiling(log(to) / log(10))
    # if major breaks are too few, add minor-breaks half-way between them
    n.breaks <- ifelse(n.maj < 5, max(5, n.maj*2+1), n.maj)
    breaks <- as.integer(seq(from, to, length.out=n.breaks))
    breaks
}

#
# -- main
#

# -- process the command line args:  [tsv_file [png_file]]
#    (use defaults if they aren't provided)
#
argv <- commandArgs(trailingOnly = TRUE)
if (is.null(argv) || (length(argv) < 1)) {
    argv <- c(Sys.glob('*api-lat.tsv')[1])
}
tsvfile <- argv[1]
stopifnot(! is.na(tsvfile))
pngfile <- ifelse(is.na(argv[2]), paste(tsvfile, '.png', sep=''), argv[2])

# -- Read the data from the TSV file into an internal data.frame d
d <- read.csv(tsvfile, sep='\t', header=FALSE)

# -- Give each data column a human readable name
names(d) <- c('API', 'Latency')

#
# -- Convert microseconds Latency (our weblog resolution) to seconds
#
d <- transform(d, Latency=Latency/1e6)

#
# -- Trim the latency axis:
#       Drop the few 0.001% extreme-slowest outliers on the right
#       to prevent them from pushing the bulk of the data to the left
Max.Lat <- quantile(d$Latency, probs=0.99999)
d <- subset(d, Latency < Max.Lat)

#
# -- API factor pruning
#       Drop rows where the API accounts for less than a small % of total calls
#
Rare.APIs.pct <- 0.001
if (Rare.APIs.pct > 0.0) {
    d.N <- nrow(d)
    API.counts <- table(d$API)
    d <- transform(d, CallPct=100.0*API.counts[d$API]/d.N)
    d <- d[d$CallPct > Rare.APIs.pct, ]
    d.N.new <- nrow(d)
}

#
# -- Adjust legend item-height & font-size
#    to the number of distinct APIs we have
#
API.count <- nlevels(as.factor(d$API))
Legend.LineSize <- ifelse(API.count < 20, 1.0, 20.0/API.count)
Legend.FontSize <- max(6, as.integer(Legend.LineSize * (FONTSIZE - 1)))
legend.theme  = element_text(family="FreeSans", face="bold.italic",
                        size=Legend.FontSize,
                        colour=gray(0.3))


# -- set latency (X-axis) breaks and labels (should be made more generic)
lat.breaks <- c(0.00001, 0.0001, 0.001, 0.01, 0.1, 1, 10)
lat.labels <- sprintf("%g", lat.breaks)
#
# -- Generate the chart using ggplot
#
p <- ggplot(data=d, aes(x=Latency, y=after_stat(count)/1000.0,
                        group=API, fill=API)) +
      # stacked histogram; bins are computed on the log10-transformed scale
      geom_histogram(binwidth=0.01) +
      scale_x_log10(breaks=lat.breaks, labels=lat.labels) +
      scale_y_continuous(breaks=yscale_breaks) +
      ggtitle('APIs Calls & Latency Distribution') +
      xlab('Latency in seconds - log(10) scale') +
      ylab('Call count (in 1000s)') +
      theme(
            plot.title=title.theme,
            axis.title.y=y.label.theme,
            axis.title.x=x.label.theme,
            axis.text.x=x.axis.theme,
            axis.text.y=y.axis.theme,
            legend.text=legend.theme,
            legend.key.height=unit(Legend.LineSize, "line")
      )

#
# -- Save the plot into the png file
#
ggsave(p, file=pngfile, width=WIDTH, height=HEIGHT, dpi=DPI)
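
Assuming the script is saved as, say, api-lat-chart.r (the name is arbitrary) and made executable, it can be run directly on the TSV file, with an optional output-file argument:

./api-lat-chart.r api-lat.tsv api-lat.png

With no arguments, it falls back to the first *api-lat.tsv file in the current directory and derives the PNG file name from it.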


Answer 2:

Your discussion of "back in the day" profiling practice is true. There's just one little problem it always had:

  • In non-toy software, it may find something, but it doesn't find much, for a bunch of reasons.

The thing about opportunities for higher performance is: if you don't find them, the software doesn't break, so you can just pretend they don't exist. That is, until a different method is tried and they are found.

In statistics, this is called a type 2 error - a false negative. An opportunity is there, but you didn't find it. What it means is if somebody does know how to find it, they're going to win, big time. Here's probably more than you ever wanted to know about that.

So if you're looking at the same kind of stuff in a web app - invocation counts, time measurements - you're not liable to do better than the same kind of non-results.

I'm not into web apps, but I did a fair amount of performance tuning in a protocol-based factory-automation app many years ago. I used a logging technique. I won't say it was easy, but it did work. The people I see doing something similar are here, where they use what they call a waterfall chart. The basic idea is, rather than casting a wide net and getting a lot of measurements, you trace through a single logical thread of transactions, analyzing where delays occur that don't have to.

So if results are what you're after, I'd look down that line of thinking.