I would like to profile a complex web application from the server PoV.
According to the Wikipedia link above and the Stack Overflow profiling tag description, profiling (in one of its forms) means getting a list (or a graphical representation) of the APIs/components of an application, each with its number of calls and the time spent in it during run-time.
Note that, unlike a traditional one-program/one-language application, a web server application may be:
- Distributed over multiple machines
- Composed of components written in different languages
- Running those components on top of different OSes, etc.
So the traditional "Just use a profiler" answer is not easily applicable to this problem.
I'm not looking for:
- Coarse performance stats like the ones provided by various log-analysis tools (e.g. Analog), nor for
- Client-side, per-page performance stats like the ones presented by tools such as Google's PageSpeed or Yahoo!'s YSlow (waterfall diagrams, browser component load times)
Instead, I'm looking for a classic profiler-style report:
- number of calls
- call durations
by function/API/component-name, on the server-side of the web application.
Bottom line, the question is:
How can one profile a multi-tiered, multi-platform, distributed web application?
A free-software based solution is much preferred.
I had been searching the web for a solution for a while and couldn't find anything satisfactory to fit my needs, except for some pretty expensive commercial offerings. In the end, I bit the bullet, thought about the problem, and wrote my own solution which I wanted to share freely.
I'm posting my own solution since this practice is encouraged on SO.
This solution is far from perfect. For example, it works at a very high level (individual URLs), which may not be good for all use-cases. Nevertheless, it has helped me immensely in trying to understand where my web app spends its time.
In the spirit of open source and knowledge sharing, I welcome any other, especially superior, approaches and solutions from others.
Your discussion of "back in the day" profiling practice is true. There's just one little problem it always had:
The thing about opportunities for higher performance is, if you don't find them, the software doesn't break, so you can just pretend they don't exist. That is, until a different method is tried, and they are found.
In statistics, this is called a type 2 error - a false negative. An opportunity is there, but you didn't find it. What it means is if somebody does know how to find it, they're going to win, big time. Here's probably more than you ever wanted to know about that.
So if you're looking at the same kind of stuff in a web app - invocation counts, time measurements - you're not liable to do better than the same kind of non-results.
I'm not into web apps, but I did a fair amount of performance tuning in a protocol-based factory-automation app many years ago. I used a logging technique. I won't say it was easy, but it did work. The people I see doing something similar are here, where they use what they call a waterfall chart. The basic idea is, rather than casting a wide net and getting a lot of measurements, to trace through a single logical thread of transactions, analyzing where delays occur that don't have to.
So if results are what you're after, I'd look down that line of thinking.
Thinking of how traditional profilers work, it should be straightforward to come up with a general, free-software solution to this challenge.
Let's break the problem into two parts:
Collecting the data
Assume we can break our web application into its individual constituent parts (APIs, functions) and measure the time it takes each of these parts to complete. Each part is called thousands of times a day, so we could collect this data over a full day or so on multiple hosts. When the day is over, we would have a pretty big and relevant data-set.
Epiphany #1: substitute 'URL' for 'function', and our existing web-logs are "it" - the data we need is already there.
So if we have access to standard web-logs for all the distributed parts of our web application, part one of our problem (collecting the data) is solved.
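Most web servers can log per-request latency (for example, Apache's %D field or nginx's $request_time), so the collection step is often just a matter of parsing what is already on disk. Here is a minimal sketch in R of that parsing; the log file names, field layout, and microsecond unit are assumptions, so adjust the regular expressions to your own log format:

```r
# Minimal sketch: extract (URL, latency) pairs from standard access logs.
# ASSUMPTIONS: the request line looks like "GET /path HTTP/1.1" and each log
# line ends with a numeric response-time field (e.g. Apache's %D, microseconds).

read_access_log <- function(path) {
  lines <- readLines(path)
  # request path from the quoted request line
  url <- sub('.*"[A-Z]+ ([^ ]+) HTTP/[^"]*".*', "\\1", lines)
  # trailing numeric field taken as the latency
  latency_us <- as.numeric(sub(".* ([0-9]+)[[:space:]]*$", "\\1", lines))
  data.frame(url = url, latency_ms = latency_us / 1000,
             stringsAsFactors = FALSE)
}

# one combined data-set from all hosts/tiers (file names are made up)
logs  <- c("web1-access.log", "web2-access.log", "api1-access.log")
calls <- do.call(rbind, lapply(logs, read_access_log))
```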
Presenting the data
Now we have a big data-set, but still no real insight. How can we gain insight?
Epiphany #2: visualize our (multiple) web-server logs directly.
A picture is worth a thousand words. Which picture can we use?
We need to condense hundreds of thousands, or millions, of lines from multiple web-server logs into a short summary that tells most of the story about our performance. In other words, the goal is to generate a profiler-like report - or, even better, a graphical profiler report - directly from our web logs.
Imagine we could map each API to a color, each call's latency to a position on the x-axis, and the number of calls to the amount of area drawn. One such picture, a stacked-density chart of latencies by API, appears below (function names were made up for illustrative purposes).
The Chart:
Some observations from this example:
- We can see how dramatic caching effects can be on an application (note that the x-axis is on a log10 scale)
- We can specifically see which APIs tend to be fast vs. slow, so we know what to focus on
- We can see which APIs are called most often each day. We can also see that some of them are called so rarely that it is hard to even see their color on the chart (the table sketch below quantifies such observations)
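Such observations can also be condensed into the classic, text-only profiler report - call counts and durations per API. Here is a minimal sketch using dplyr; it assumes a data frame named calls with columns api and latency_ms, i.e. the per-call pairs whose extraction is described in the next section:

```r
# A profiler-style report: call counts and latency summaries per API.
# ASSUMPTION: 'calls' has columns 'api' and 'latency_ms' (milliseconds).
library(dplyr)

report <- calls %>%
  group_by(api) %>%
  summarise(n_calls  = n(),
            total_ms = sum(latency_ms),
            mean_ms  = mean(latency_ms),
            p95_ms   = quantile(latency_ms, 0.95)) %>%
  arrange(desc(total_ms))   # biggest total-time consumers first

print(report, n = 20)
```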
How to do it?
The first step is to pre-process the logs and extract the subset of data we need. A trivial utility like Unix 'cut' over the multiple logs may be sufficient here. You may also need to collapse multiple similar URLs into shorter strings describing the function/API, like 'registration' or 'purchase'. If you have a multi-host unified log view generated by a load-balancer, this task may be easier. We extract only the names of the APIs (URLs) and their latencies, so we end up with one big file with a pair of TAB-separated columns (see the sketch below).
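As an illustration, here is a small R sketch of this collapsing step that continues from the parsing sketch above; the patterns and the names 'registration', 'purchase', and 'search' are made-up examples to be replaced with your application's own URL structure:

```r
# Minimal sketch: collapse raw URLs into short API/function names and write
# the two-column TSV. Patterns and names below are illustrative only.

api_of <- function(url) {
  api <- sub("\\?.*$", "", url)                 # drop query strings
  api[grepl("^/register", api)] <- "registration"
  api[grepl("^/checkout", api)] <- "purchase"
  api[grepl("^/search",   api)] <- "search"
  api
}

calls$api <- api_of(calls$url)
write.table(calls[, c("api", "latency_ms")], "api-latency.tsv",
            sep = "\t", quote = FALSE, row.names = FALSE, col.names = FALSE)
```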
Now we run the R script below on the resulting data pairs to produce the wanted chart (using Hadley Wickham's wonderful ggplot2 library). Voilà!
The code to generate the chart
Finally, here's the code to produce the chart from the API+Latency TSV data file.
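A minimal sketch of such a script is shown below; the input file name, the headerless two-column layout, and the millisecond unit are assumptions carried over from the extraction sketch above, so adapt them to your own data:

```r
# Minimal sketch of the charting step, assuming a headerless TSV with two
# columns: API name and latency in milliseconds.
library(ggplot2)

d <- read.delim("api-latency.tsv", header = FALSE,
                col.names = c("api", "latency_ms"))

p <- ggplot(d, aes(x = latency_ms, fill = api)) +
  geom_density(aes(y = after_stat(count)),   # weight each API by its call count
               position = "stack", alpha = 0.7) +
  scale_x_log10() +                          # latencies span orders of magnitude
  labs(x = "Latency (ms, log10 scale)", y = "Calls (density-scaled)",
       title = "API latency profile from web-server logs")

ggsave("latency-profile.png", p, width = 10, height = 6)
```

Using after_stat(count) rather than the default density makes each API's area proportional to how often it is called, which is what lets rarely-called APIs nearly disappear from the chart, as noted in the observations above.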