Web Content Caching Explained

Web content caching can be a confusing and complicated minefield!

It’s one of those features that really sits between the network and the application. In many businesses this grey area has no real ownership and as such, mistakes are often made.

I want to take a little time to explore and hopefully simplify this complex area.

So first, one must address the question: what is a content cache, and why bother using one?

A content cache is a piece of software or an appliance designed to sit in front of the application server. Its job is to intercept certain requests and respond on behalf of the application, thus reducing the number of hits sent to the backend servers.

Typically the requests that are intercepted have already been seen before, so the cache will store the responses to these requests, and if it intercepts a similar request later it can respond in the same way.

A simple example would be an image cache. The client (i.e. the browser) would make a request to the web server for an image. The first time this image is requested the cache will have to get it from the web server; however, for subsequent requests it can simply serve it directly.

The idea is that the web server will have a lot less work to do as the cache can serve a lot of the content. The implications of this really depend on the exact setup, application and content; however, they typically fit into the following:

1. Reduce load on the web/application servers – Save on application server hardware and licensing costs.

2. Reduce load on middleware and backend DB systems.

3. Serve the content faster – Caches can be very fast!

4. Cache the content closer to your users. This is an interesting one. The cache does not always have to be at the same location as the rest of the system. You could use a good cache to build your own simple content distribution system to ensure your users get the content from a source as close to them as possible. Who said that a CDN (Content Distribution Network) has to cost the earth!

OK so you have decided that you may want to use a cache, so what next? The biggest problem with content caching is simply deciding what you want to cache and for how long. It sounds simple, but failure to get this right in the past has given many a network manager / application owner a genuine fear of caching. You don't want one user getting the account balance of the previous user!

However, to really make the most of the capabilities that a modern content cache can offer, we must understand the full picture: the application behaviour, the cache setup and the browser behaviour.

Let us take a quick look at what typically happens when an image is requested from a web server by a browser.

1. The user requests an image to look at.

2. The cache will notice that it has not seen this request before and therefore will request the image from the web server.

3. The web server will get the image and send it back to the cache.

Important: the web server will decide if this image can be cached or not. It will instruct the upstream devices, such as the cache and ultimately the client, whether it should be cached or not. It does this by using specific HTTP headers; a minimal example is sketched just after this list. (I won't geek out on this in this article as it can get overly complex.)

4. Now it gets interesting, because the cache will need to decide which way to go. Does it obey the rules from the web server or does it override them with its own rules? We will discuss this later on.

5. So the image is now sent back to the client and, just to complicate things, the client will also cache the content and look out for the cache control headers.
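
To make the header part of step 3 a bit more concrete, here is a minimal sketch of an application marking a response as cacheable. It is written as PHP purely for illustration (the image path is made up); the directives themselves are standard HTTP Cache-Control values that both the content cache and the browser understand.

<?php
// Tell shared caches (such as our content cache) and the browser that this
// response may be stored and reused for 24 hours (86400 seconds).
header('Cache-Control: public, max-age=86400');

// For per-user content, such as an account balance page, you would do the
// opposite and forbid caching altogether:
// header('Cache-Control: private, no-store');

// Then serve the image itself (hypothetical path).
header('Content-Type: image/png');
readfile('/var/www/images/logo.png');

In step 4 the cache then chooses whether to honour those directives or apply its own policy instead.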

So we can see that not only do we have to decide on the application caching rules but also how we want the cache to behave upstream. If we do this right we can even reduce the number of hits reaching the cache, and thus the network, in the first place.

Now we have a rough idea of what is going on, so how do we implement a cache?

A general method is described in “light” detail below.

1. Decide what you want to cache and for how long – for example, all images for 24 hours or all *.asp for 2 seconds. These rules can get complicated.

2. Be certain you are happy with these rules.

3. Now check again.

4. Configure the cache to remove all cache control headers from the response from the web server. In other words, ignore any caching control set up on the web server – we assume we know how to create a more accurate set of rules.

5. Configure the cache with these new rules and also the expiry dates/durations.

6. Configure the cache to add some new cache controls for the client; a rough sketch of what these can look like follows this list. Client-side cache control is tough, as browsers respect these rules with varying degrees of success. This is not really in the scope of this article, but I may revisit it.

7. Document – these rules can be complex and can involve speaking to people in different parts of the business to understand the application, so it's worth documenting this info whilst it is fresh.

8. Test and test again – the more testing the better. Remember that part of testing is manually emptying not only the content cache's store but the client's cache too.
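
As promised in step 6, here is a rough illustration of how the client-facing cache controls can differ from what the cache itself uses. It is shown as PHP header() calls at the origin purely to keep the example simple; in a real deployment the cache would typically rewrite the headers itself, and the lifetimes here are only example numbers.

<?php
// s-maxage applies to shared caches (the content cache): hold for 24 hours.
// max-age applies to the browser: trust its private copy for only 5 minutes,
// so a purge on the cache reaches users reasonably quickly.
header('Cache-Control: public, s-maxage=86400, max-age=300');

The split lifetime is the design choice worth noting: the shared cache absorbs the load, while the browser checks back often enough that a purge is not undone by stale client copies.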

That’s it for now folks.

I am writing a follow-up to this, talking in a little more detail about typical cache settings and also helping to answer questions such as, 'will a cache be of any help for my application and, if so, what type of cache should I be looking at?'

Speeding PHP Using APC PHP Cache

If you look at a PHP source file you will notice one thing: it's a source file. Not particularly surprising, but think about when you deploy a PHP application: what do you deploy? PHP source files. Now for many other languages (Java, C, etc.), when you deploy an application you deploy the compiled file. So, the question you want to ask yourself is this: how much time does a PHP application spend compiling source files vs running the code? I'll answer that for you: a lot.

There are advantages to being able to deploy source files though. It makes it easy to do on the fly modifications or bug fixes to a program, much like we used to do in the early BASIC languages. Just change the file and the next time it’s accessed your change is reflected. So, how do we keep the dynamic nature of PHP, but not recompile our files every time they are accessed?

A PHP cache. It's surprising to me that this concept isn't built into the base PHP engine, but perhaps that's because some companies can sell this add-on to speed up PHP. Luckily for us, some companies/open source projects provide this plug-in to PHP at no charge. These plug-ins are generally known as PHP accelerators; some of them do some optimization and then caching, and some only do caching. I'm not going to pass judgement on which one is the best (any of them is better than nothing), but I decided to use APC, the Alternative PHP Cache. I chose this one because it is still in active development and is open source and free.

The Alternative PHP Cache can be found at php.net; just look down the left column for APC. It comes in source form, so you will need to compile it before installing it. Don't worry about that part: if you're using Red Hat 4 or CentOS 4 I'll tell you exactly how to do it. If you're using something else, you'll need the same tools, but getting the tools might be a bit different.

1. The Tools

Do you know how many web sites, forums and blogs I went to with my error messages before I found the answer as to what I was missing when I was trying to install APC – the Alternative PHP Cache? Two days' worth, but I finally found the correct combination, and it's really quite obvious, as is everything, once you know the answer. There are three sets of dev tools that you will need.

1a. You'll need a package called “Development Tools”; this will include all the important dev tools like the GCC compiler, etc.

1b. You'll need a package called php-devel, which, as you might guess, contains the development tools for PHP.

1c. You'll need a package called httpd-devel, which of course contains the dev tools for the Apache web server.

On Red Hat or CentOS getting these should be as easy as the following 3 commands:

yum groupinstall "Development Tools"

yum install php-devel

yum install httpd-devel

You’ll do these three one at a time and follow any instructions (usually just saying yes).

Now it's time to follow the instructions contained in the APC package. Since these may change over time I'm not going to go through them; they are very complete. If you follow the instructions and get an apc.so file out of it, then you're all set; just modify your php.ini file and you're good to go.
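
For reference, the php.ini additions look roughly like the lines below. The setting names are APC's documented ones, but treat the path and the numbers as placeholders; the right values depend on where your apc.so ended up and how big your sites are.

; Load the freshly built extension (adjust the path if your build
; put apc.so somewhere other than the default extension directory).
extension=apc.so

; Turn the opcode cache on and give it some shared memory to work with.
apc.enabled=1
; Older APC builds expect a plain number of megabytes here (e.g. 64)
; rather than a size string like 64M.
apc.shm_size=64

Restart Apache afterwards and check the output of phpinfo() for an APC section to confirm the extension actually loaded.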

There are two problems that I encountered that you may encounter too. The first is an error when running phpize. I ignored this error and everything succeeded okay, but not before I spent hours looking for the solution to it. Here is the error:

configure.in:9: warning: underquoted definition of PHP_WITH_PHP_CONFIG

run info '(automake)Extending aclocal'

or see http://sources.redhat.com/automake/automake.html#Extending-aclocal

configure.in:32: warning: underquoted definition of PHP_EXT_BUILDDIR

configure.in:33: warning: underquoted definition of PHP_EXT_DIR

configure.in:34: warning: underquoted definition of PHP_EXT_SRCDIR

configure.in:35: warning: underquoted definition of PHP_ALWAYS_SHARED

acinclude.m4:19: warning: underquoted definition of PHP_PROG_RE2C

People would have had me updating my PHP version from 4.3.9 and everything else under the sun to get rid of this error, but in the end it didn’t matter. My APC compiled and installed nicely and I am good to go.

The other slight problem that I ran into was the location of php-config. The install instructions wanted me to do the following:

./configure --enable-apc-mmap --with-apxs \
  --with-php-config=/usr/local/php/bin/php-config

However, my php-config is in /usr/bin/php-config; making that change allowed this part to work.

So, have at it; once it's done you can expect to see huge improvements in your web site response times and reductions in your CPU load. One more quick note: my server hosts about 20 web sites, but only 3 or 4 are really busy. To reduce the memory footprint of caching everything for all 20 sites I used the apc.filters property. Although this property is slightly flawed for non-qualified includes, it worked nicely for my Serendipity blogs. Your mileage with this property will vary according to the software you are using and how it does its includes.
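
For what it's worth, an apc.filters entry along these lines is what I am describing (the site paths here are made up for illustration). The patterns are matched against the path as it appears in the include/require call, which is exactly why unqualified includes can slip past it.

; A comma-separated list of regular expressions; a leading "-" means
; "do not cache files matching this pattern". These exclude two
; hypothetical low-traffic sites from the opcode cache.
apc.filters = "-/var/www/quiet-site-one/.*,-/var/www/quiet-site-two/.*"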

Structuring Data With Cache

When it comes to the world of computers, there is nothing more important than data. A long time ago, data was kept on paper, but paper proved to be unreliable because it was vulnerable to things like water, temperature, and other destructive elements like bugs. Once computers came along, businesses and other facilities like hospitals had a much easier way to deal with volumes of data. For someone who works in the world of data, Cache MUMPS jobs offer a chance to help organize data and make dealing with it more efficient and manageable.

The Best Way To Access and Manage Data

There are many places in this world that rely on the ability to manage their data storage. One of the main places where data is important is a hospital. The MUMPS system was created at Massachusetts General Hospital, and the software was designed specifically to help hospitals run better. Hospitals deal with massive amounts of data, from patient records to admissions, and MUMPS helps make everything work better. Cache MUMPS jobs are all about data, and here is more information about the benefits of the Cache system:

– Store data more efficiently: Instead of having to comb through a massive database trying to find the right file, a Cache system makes sure that data is stored efficiently, so people can easily find what they are looking for without having to spend a lot of time searching.

– Keep track of the stored data: One of the biggest duties that Cache MUMPS jobs deal with is keeping track of the stored data. There are always ways to take data and organize it so it can be stored and accessed better, and people who take care of data in a hospital understand better than anyone just how much data a hospital can deal with on a daily basis. Data is extremely important to doctors, because a doctor treating a patient is going to need to know that person's entire history, which requires access to a lot of data.

– Create a directory of data: One of the best ways to keep data organized is by creating a directory. People who deal with data can look into creating a directory that helps every hospital department keep track of its own data. A directory can be named with a person's last name, the year, the name of the department, the name of the doctor, or in countless other ways that make the data more organized. With a directory, different departments can get the data they need quickly and easily.

The MUMPS system was created in a hospital for a hospital, and many hospitals use it today. One of the most important aspects of a hospital system is controlling the data. Cache MUMPS jobs deal with data, and there is no place that has massive amounts of data like a hospital. The job of people who work with Cache is to take data, make it easy to access, and keep it organized.