I’ve written previously about many of the useful types of analysis you can perform using the time-taken field in your web logs. However, I would be remiss if I did not point out some limitations of the data it provides.
When poring over my web log files, I would occasionally come across truly bizarre entries. I would see a static JavaScript file listed as having taken 7 seconds. How could the web server take that long to process a static file?
Time-taken measures how long it took to process a request, but when does processing start, and when does it end? Before the web application can actually start to generate a response, it has to receive all of the request headers and form variables from the client. Then, once the response is generated, it needs to be transmitted back to the client. The server can only send data as fast as the client is able to accept it, so there is time spent after the response is generated waiting for the transmission to complete. Only once the client acknowledges that it has all of the data can the connection be closed.
So, is this time included in the time-taken value? Unfortunately, the answer is yes.
Normally, the time needed to upload request data from the client and download response data to the client is pretty trivial, and it doesn’t impact the analysis. However, there are a few cases where these times become significant:
· Uploading a large amount of data during a post-back or file upload
· Downloading a large response or file attachment
· A client with a very slow connection speed
· A client from very far away
If you combine a slow connection with a large file transfer, you can end up seeing some high times in your log files. Don’t worry; you don’t have a major bug in your application somewhere. It’s just that a lot of time was spent before real processing could start or after it had ended.
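To get a feel for how quickly transfer time adds up, here is a rough back-of-the-envelope sketch. The function name, file size, link speed, and round-trip time are all made-up values for illustration, and the formula ignores TCP slow start, packet loss, and protocol overhead, so treat the result as an optimistic lower bound rather than a prediction.

```python
def estimated_transfer_seconds(size_bytes, bandwidth_bits_per_sec, round_trip_sec=0.0):
    """Rough lower bound on the time needed to move size_bytes over a link.

    Ignores TCP slow start, retransmits, and header overhead.
    """
    return round_trip_sec + (size_bytes * 8) / bandwidth_bits_per_sec


# Hypothetical case: a 100 KB static script fetched by a client on a
# 128 kbit/s connection with a 200 ms round trip.
seconds = estimated_transfer_seconds(100 * 1024, 128_000, round_trip_sec=0.2)
print(f"Estimated transfer time: {seconds:.1f} s")  # about 6.6 s
```

Under those assumed numbers, a perfectly healthy server could still log several seconds for a file it served almost instantly.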
Network time is not always included, however. Microsoft documents this for IIS 6 and IIS 7, stating that when using TCP buffering, or when sending back very small responses (under 2 KB) in certain conditions, the network time will not be a factor.
I haven’t been able to find as clear an answer on whether time-taken includes network time in Tomcat or Apache. There is some interesting chatter about it on the web, but I didn’t manage to turn up anything as definitive as the Microsoft article. However, the conventional wisdom seems to be that it does incorporate network time in those servers as well.
So what to do? Does this invalidate using time-taken data for performance tuning?
No, but you need to interpret the results more carefully. Here are a few tips:
1. Pay attention to request and response size (these are other fields you can turn on in web logs). Do you see a correlation between request/response size and response time? If you do, you may just be seeing network effects and not an actual performance problem in your servers themselves.
2. Look at the IP address. Do particularly large response times seem to come from clients that are far away (see geolocating an IP address)?
3. Do you have pages that are known to have large response sizes, like file downloads? Consider excluding them when generating 95th percentile response times to get a more accurate number.
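As a minimal sketch of tips 1 and 3, the snippet below checks how strongly response size tracks time-taken and then compares the 95th percentile with and without large responses. Everything here is an assumption for illustration: the rows are invented, already-parsed log entries (URI, response bytes, time-taken in milliseconds, which is the unit IIS uses), the 1 MB cutoff is arbitrary, and the percentile uses the simple nearest-rank method rather than whatever your log analysis tool does.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# Hypothetical parsed log rows: (uri, response bytes, time-taken in ms).
rows = [
    ("/default.aspx", 12_000, 180),
    ("/search.aspx", 30_000, 260),
    ("/scripts/site.js", 95_000, 700),
    ("/reports/export.aspx", 4_000_000, 21_000),
    ("/login.aspx", 8_000, 150),
]

sizes = [size for _, size, _ in rows]
times = [ms for _, _, ms in rows]

# Tip 1: a coefficient near 1.0 suggests network transfer dominates the timings.
r_value = pearson(sizes, times)

# Tip 3: drop known-large responses before computing the percentile.
LARGE_RESPONSE_BYTES = 1_000_000  # assumed cutoff, tune for your own site
small_times = [ms for _, size, ms in rows if size < LARGE_RESPONSE_BYTES]

p95_all = percentile(times, 95)
p95_small = percentile(small_times, 95)

print(f"size/time correlation: {r_value:.2f}")
print(f"95th percentile, all requests: {p95_all} ms")
print(f"95th percentile, excluding large responses: {p95_small} ms")
```

With real logs you would want far more than five rows, and a single huge download can dominate the correlation, so it is worth eyeballing a scatter plot as well.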
One more thing to keep in mind: while the numbers may not reflect actual host processing time, they still translate to real time experienced by a user. While some of these causes may feel beyond your control, like a user who is far away or on a slow connection, there is more you can do:
· Use HTTP compression to reduce response sizes
· Use appropriate image file formats and image color palettes to keep image file sizes small
· Use Expires and cache-control headers to avoid unneeded refreshes of static content
· Use a minifier or comment stripper to squeeze unneeded weight out of JavaScript and CSS files
· Use a CDN like Akamai to cache content closer to end users (okay, that one costs big $$$)
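To illustrate the first item, here is a small sketch of how much gzip can shrink a text response. The payload is a made-up, highly repetitive script, so the savings are exaggerated compared with real files. In practice you would turn compression on in the web server rather than compress in application code.

```python
import gzip

# Hypothetical response body: repetitive text compresses unusually well.
body = ("function add(a, b) { return a + b; } // helper\n" * 200).encode("utf-8")

compressed = gzip.compress(body)
ratio = len(compressed) / len(body)

print(f"original: {len(body)} bytes")
print(f"gzipped:  {len(compressed)} bytes ({ratio:.1%} of original)")
```

Fewer bytes on the wire means less time spent waiting on the client, which shows up directly in time-taken for slow connections.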
