

Boosting Apache Performance by using Reverse Proxies

By René Pfeiffer and pooz

Once upon not so very long ago, a lone Web server was in distress. Countless Web browsers had laid siege to its port. The bandwidth was exhausted; the CPUs were busy; the database was moaning. The head of the IT department approached Pooz and me, asking for an improvement. Upgrading either the hardware or the Internet connection was not an option, so we tried to find out what else we could do - caches to the rescue!

Caches Wherever You Go

Every computer lives on caching. Your CPU has one, and so do your disk drive, your operating system, your video card, and of course your Web browser. Caches are designed to keep a copy of data that is accessed often. The CPU caches store instructions and data: instead of accessing system memory for the next instruction or piece of data, the CPU retrieves it from the cache. The Web browser, in turn, is more interested in caching files such as images, cascading style sheets, documents, and the like. This speeds up Web surfing, because certain format elements appear quite frequently in Web pages; rather than repeatedly downloading the same image or file, the browser re-uses items found in its cache. This is especially true for pages generated by a content management system (CMS).

Now, if we can find a way of telling the Web browser that the copy in its cache is still valid, we can save some of our bandwidth at the Web server. In the case of our CMS, which is Typo3, we can also save both CPU time and database accesses, provided we can publish the expiration time of the generated HTML documents as well.

You can also insert an additional cache between the Web browsers and your server, to reduce server requests still further. This cache is called a reverse proxy, sometimes also a gateway or surrogate cache. Classical proxies work for their clients, but a reverse proxy works for the server. This proxy has a disk and memory cache of its own, which can be used to offload static content from the Apache server. The following picture illustrates where the caches are and what they do.

Overview of caches involved in Web browsing

The green lines mark cache hits. A cache hit is valid content (i.e., not expired) that is found in a cache and can be copied from there; hits often don't reach the Web server at all. Some clients may still ask the Web server whether the content has changed, but this short question doesn't generate much traffic: the Web server simply answers with a "HTTP/1.x 304 Not Modified" header and no additional data. The red lines mark cache misses. A miss occurs when the cache doesn't find the requested object and has to request it from the target server. The object is then copied to disk or memory, and served to the client. Whenever another request for it reaches the cache, the local copy is used for as long as it is valid.
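On the wire, such a revalidation could look like the following minimal sketch (host, path, and dates are made-up examples). The client sends the modification date of its cached copy, and the server confirms that nothing has changed:

GET /images/logo.png HTTP/1.1
Host: www.example.net
If-Modified-Since: Mon, 02 Oct 2006 02:04:36 GMT

HTTP/1.1 304 Not Modified
Date: Tue, 03 Oct 2006 10:24:35 GMT

The response carries no body at all, so the client keeps using its cached copy.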

Cache Control Headers

How does a cache know when to use a local copy and when to ask the server? Well, it depends. A browser cache looks for messages from the Web server. The server can use cache control headers to give advice. Let's look at an example. The request "GET http://www.luchs.at/linuxgazette/index.html HTTP/1.1" fetches a Web page whose HTTP headers look like this.

HTTP/1.x 200 OK
Date: Tue, 03 Oct 2006 10:24:35 GMT
Server: Apache
Last-Modified: Mon, 02 Oct 2006 02:04:36 GMT
Etag: "e324ac5-6d7d5500"
Accept-Ranges: bytes
Cache-Control: max-age=142800
Expires: Thu, 05 Oct 2006 02:04:36 GMT
Vary: Accept-Encoding
Content-Encoding: gzip
Content-Length: 3028
Content-Type: text/html; charset=ISO-8859-1
X-Cache: MISS from bazaar.office.lan
X-Cache-Lookup: MISS from bazaar.office.lan:3128
Via: 1.0 bazaar.office.lan:3128 (squid/2.6.STABLE1)
Proxy-Connection: keep-alive

The server gives you the HTML document. In addition, the HTTP header contains several fields that matter for caching:

Date: tells you when the response was generated.
Last-Modified: says when the content itself was last changed.
Etag: is an opaque identifier for this particular version of the resource; a client can later ask whether the tag has changed instead of downloading the whole document again.
Expires: gives an absolute date after which the local copy must be considered stale.
Cache-Control: max-age=142800 carries the same information as a relative lifetime: the copy may be reused for 142800 seconds (about 40 hours) after the Date: above.

Cache-Control: is better than Expires:, because the latter requires the machines to use a synchronised time source. Cache-Control: is also more general, but is only understood by HTTP/1.1 clients and caches. There is some data included that wasn't sent by the Apache server: the last four HTTP header fields were inserted by the local Squid proxy in our office. They tell us that we made a cache miss.

Server Side Cache Configuration

Now let's turn to our servers, and see what we can configure there.

Apache's mod_expires

Even though Cache-Control: is the better header, we first look at a way to generate an Expires: header for served content. The Apache Web server has a module for this, called mod_expires. Most distributions include it in their Apache packages. You can also compile it as a module and load it after installing your own Apache build. Either way, you can then create Expires: headers, either in the global configuration or per virtual host. A sample setup could look like this (for Apache 2.0.x):

<IfModule mod_expires.c>
    ExpiresActive On
    ExpiresByType text/html "modification plus 3 days"
    ExpiresByType text/xml  "modification plus 3 days"
    ExpiresByType image/gif "access plus 4 weeks"
    ExpiresByType image/jpeg "access plus 4 weeks"
    ExpiresByType image/png "access plus 4 weeks"
    ExpiresByType video/quicktime "access plus 2 months"
    ExpiresByType audio/mpeg "access plus 2 months"
    ExpiresByType application/pdf "modification plus 2 months"
    ExpiresByType application/postscript "modification plus 2 months"
    ExpiresByType application/xml "modification plus 2 weeks"
</IfModule>

The first line activates the module; if you forget it, mod_expires won't do anything. The remaining lines set the expiration period per MIME type. mod_expires automatically calculates and inserts a matching Cache-Control: max-age header as well, which is nice. You can see the result in the header dump above: Expires: is exactly Last-Modified: plus three days, and max-age counts the seconds remaining until that date. You can use either "modification plus ..." or "access plus ...". "modification" works only with files that Apache reads from disk, which means you have to use "access" if you want to set Expires: headers for dynamically generated content as well. Be careful, though! CGI scripts that must not be cached are supposed to set their own expiration date in the past, to guarantee an immediate reload - but some developers don't care, and mod_expires will break such badly written CGIs harshly. I once spent an hour digging through horrible code to find out why a login script didn't work anymore: the developer had forgotten to set the expiration time correctly, so I adapted the server config for this particular virtual host as a workaround. Also, be sure to select suitable expiration periods; the values above are only examples, and you might have different requirements, depending on how frequently your content changes.
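For illustration, here is a minimal sketch of a CGI script that protects itself by sending its own cache headers. The script name and its output are made up; the point is that mod_expires leaves responses alone that already carry an Expires: header, so a script that sets one is safe from the configuration above:

#!/usr/bin/env python
# login.cgi - a hypothetical dynamic page that must never be cached.
# The CGI protocol expects response headers first, then a blank line,
# then the body. An Expires: date in the past plus an explicit
# Cache-Control: header forces browsers and proxies to refetch, and
# mod_expires does not touch responses that already set these headers.

print("Content-Type: text/html; charset=ISO-8859-1")
print("Expires: Thu, 01 Jan 1970 00:00:00 GMT")
print("Cache-Control: no-cache, no-store, must-revalidate")
print("Pragma: no-cache")
print()                      # blank line ends the header section
print("<html><body>Please log in.</body></html>")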

Squid Reverse Proxy

The Squid proxy has a metric ton of configuration directives. If you have no experience with Squid, this can seem a bit overwhelming at first. Therefore, I present only a minimal config file that does what we intend to do; the capabilities of Squid are worth a second look, though. I will assume we are running Squid 2.6.x, built from source and installed in /usr/local/squid/.

The reverse proxy takes the place of the original Web server: it has to intercept every request, in order to compare it with its cache content. Let's assume we have two machines: the Apache Web server, listening on 172.16.23.42, and the Squid reverse proxy, listening on 172.16.23.43, where all client requests arrive.

The local /usr/local/squid/etc/squid.conf defines what our Squid should do. We begin with the IP addresses, and tell it to listen for incoming requests on port 80.

http_port       172.16.23.43:80 vhost vport
http_port       127.0.0.1:80
icp_port        0
cache_peer      172.16.23.42 parent 80 0 originserver default

ICP denotes the Internet Cache Protocol. We don't need it, and turn it off by setting its port to 0. cache_peer tells our reverse proxy to forward every request it cannot handle itself to the Web server. Next, we have to define the access rules. In contrast to the situation with client proxies, a reverse proxy for a public Web server has to answer requests from everybody. Warning: this is a good reason not to mix forward and reverse proxy setups; if you do, you may end up with an open proxy, which is a bad thing.

acl         all src 0.0.0.0/0.0.0.0
acl         manager proto cache_object
acl         localhost src 127.0.0.1/255.255.255.255
acl         accel_hosts dst 172.16.23.42 172.16.23.43
http_access allow accel_hosts
http_access allow manager localhost
http_access deny manager
http_access deny all
deny_info   http://www.example.net/ all

The acl lines define groups. accel_hosts are our two servers, and http_access allow accel_hosts allows everyone to access them. The other lines are from the Squid default configuration, and deactivate the cache manager URL, which we don't need right now. The last line is a safeguard against unwanted error pages (Squid has a set of its own, and they differ from the Apache error pages): users are sent to our front page in case there is any trouble with a request. You can view the full squid.conf separately, because the rest "only" deals with the cache setup and tuning. (Take care: the config is taken from a server with 2 GB RAM and lots of disks; you might want to reduce the cache memory size.) As I said, Squid is capable of doing many wonderful things. As soon as Squid is up and running, we are ready to send our users to the reverse proxy.
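To give you an idea what that tuning part revolves around, these are the kinds of directives involved - a minimal sketch with illustrative values, not the exact ones from our server:

# How much memory the cache may use for objects held in RAM
cache_mem             1024 MB
# Objects bigger than this are not cached at all
maximum_object_size   16384 KB
# A 20 GB disk cache with 16 first-level and 256 second-level directories;
# the aufs storage type requires a Squid built with aufs support
cache_dir             aufs /usr/local/squid/var/cache 20000 16 256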

Statistics

You have to be careful if you rely on accurate statistics from your Web server logs. A good deal of the HTTP requests will be intercepted by the Squid reverse proxy. This means that the Apache server sees fewer requests, and that they all originate from the IP address of the proxy server - which was the very idea of our setup. You can still collect Apache-like logs on Squid, if you change the log format.

logformat       combined %{Host}>h %>a %ui %un [%tl] "%rm %ru  HTTP/%rv" %Hs %<st "%{Referer}>h" "%{User-Agent}>h" %Ss:%Sh
logformat       vcombined %{Host}>h %>a %ui %un [%tl] "%rm %ru  HTTP/%rv" %Hs %<st "%{Referer}>h" "%{User-Agent}>h"
access_log      /var/log/squid/access.log combined
access_log      /var/log/squid/vaccess.log vcombined

To incorporate the proxy's traffic into your log analysis, you have to copy the logs from the reverse proxy and merge them with your Apache logs (see the sketch below). As soon as your Web setup uses a proxy or even load balancing techniques, maintaining accurate statistics gets quite tricky.
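A small script can do the merge by sorting on the request timestamp. Here is a minimal sketch in Python; the file names are made up, and both logs are assumed to use the combined format configured above:

#!/usr/bin/env python
# merge_logs.py - merge two access logs in combined format by request time.
# The timestamp field looks like [03/Oct/2006:10:24:35 +0200] in both logs.
import heapq
import re
from datetime import datetime

TIMESTAMP = re.compile(r'\[([^\]]+)\]')

def request_time(line):
    """Parse the bracketed timestamp of a combined log line."""
    stamp = TIMESTAMP.search(line).group(1)
    return datetime.strptime(stamp, '%d/%b/%Y:%H:%M:%S %z')

with open('apache_access.log') as apache, open('squid_access.log') as squid:
    # Both logs are written in arrival order, so each file is already
    # sorted and heapq.merge can interleave them lazily.
    for line in heapq.merge(apache, squid, key=request_time):
        print(line, end='')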

Activating the Cache

After you have configured Apache and Squid, you are ready to test everything. Start with a single virtual host reserved for testing purposes: change the DNS records to point to the reverse proxy machine, check the logs, surf around, and analyse the headers. When you are confident, move the other DNS records, too. A side note for debugging: you can force a "real" reload in Internet Explorer and Mozilla Firefox by holding down the Shift key while pressing the "Reload" button; an ordinary reload may just hit the local cache now.
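If you prefer to analyse the headers without a browser, you can talk to the proxy directly. A hypothetical session (host name and header values are made up) might look like this:

$ telnet www.example.net 80
HEAD /index.html HTTP/1.1
Host: www.example.net
Connection: close

HTTP/1.1 200 OK
Date: Tue, 03 Oct 2006 10:24:35 GMT
Expires: Thu, 05 Oct 2006 02:04:36 GMT
Cache-Control: max-age=142800
X-Cache: HIT from proxy.example.net

An X-Cache: HIT line tells you that the reverse proxy answered from its cache, without bothering the Apache server.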

You won't get a good impression of what has changed just by looking at the logs. I recommend a monitoring system with statistics, such as Munin, so that you can see graphically what your servers are doing. Here are two graphs from test servers, taken during a load simulation.

Graph showing requests per day for the Squid proxy
Graph showing requests per day for the Apache server

In the first graph, red shows cache misses and green shows cache hits; below it, you can see the requests that reach the Apache server behind the reverse proxy. The shapes of the graphs are similar, but keep in mind that every request shown in green on the Squid server never reaches the Apache at all, and thus reduces the load. If you compare the results, you will see that only about every second request gets through to the Apache server.

Summary

Now you know what you can achieve with the resources of Apache and Squid. Our Web server handled the traffic spikes well, the CPU load went down by 50%, and all the surfers were happy again. You can do a lot more if you use multiple Internet connections and load balancing on the firewall or your router; fortunately, we didn't need that in our case.

Useful links

No animals or software were harmed while preparing this article. You might wish to take a look at the following tools and articles; they may just save your Web server.


René Pfeiffer



René was born in the year of Atari's founding and the release of the game Pong. From his early youth on, he has been taking things apart to see how they work. He couldn't even pass construction sites without looking for electrical wires that might seem interesting. His interest in computing began when his grandfather bought him a 4-bit microcontroller with 256 bytes of RAM and a 4096-byte operating system, forcing him to learn assembler before any other language.

After finishing school, he went to university to study physics. He then gathered experience with a C64, a C128, two Amigas, DEC's Ultrix, OpenVMS, and finally GNU/Linux on a PC in 1997. He has been using Linux ever since, and still likes to take things apart and put them together again. The freedom of tinkering brought him close to the Free Software movement, where he puts some effort into the right to understand how things work. He is also involved with civil liberty groups focusing on digital rights.

Since 1999, he has been offering his skills as a freelancer. His main activities include system and network administration, scripting, and consulting. In 2001, he started giving lectures on computer security at the Technikum Wien. Apart from staring into computer monitors, inspecting hardware, and talking to network equipment, he is fond of scuba diving, writing, and photographing with his digital camera. He would like to have a go at storytelling and roleplaying again, as soon as he finds some more spare time on his backup devices.


pooz



pooz is a system administrator and Web application hacker working in Vienna, Austria. Free/Open Source software has been his tool of choice since the early 90s.


Copyright © 2006, René Pfeiffer and pooz. Released under the Open Publication license unless otherwise noted in the body of the article. Linux Gazette is not produced, sponsored, or endorsed by its prior host, SSC, Inc.

Published in Issue 132 of Linux Gazette, November 2006
