Latest Publications

Maximizing Elasticity in the Cloud

Running a production application in the cloud can be great because it’s possible to add and remove servers from a cluster dynamically using a provisioning API. These automatic additions and removals can be triggered by system utilization levels that you measure, such as concurrent network connections, memory utilization, or CPU utilization. When you need more capacity, you can add more servers, and when they are not needed anymore, you simply turn them back off. You only pay for the time those servers were running, so it’s more economical than having a large number of servers deployed all the time.
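The trigger described above can be sketched as a simple threshold rule. This is only an illustration, not any provider's API: the function name `desired_server_count`, the 75% and 25% CPU thresholds, and the minimum and maximum cluster sizes are all assumptions picked for the example. A real deployment would pass the result to whatever provisioning API its cloud offers.

```php
<?php
// Hypothetical threshold-based scaling rule. All names and numbers here are
// illustrative assumptions, not part of any real provisioning API.
function desired_server_count($current, $avgCpuPercent, $min = 2, $max = 20) {
  if ($avgCpuPercent > 75) {
    $target = $current + 1;   // Busy: add one server
  } elseif ($avgCpuPercent < 25) {
    $target = $current - 1;   // Idle: remove one server
  } else {
    $target = $current;       // Within the comfortable band: no change
  }
  // Never go below the minimum or above the maximum cluster size
  return max($min, min($max, $target));
}

echo desired_server_count(4, 90), "\n";  // 5
echo desired_server_count(4, 10), "\n";  // 3
echo desired_server_count(2, 10), "\n";  // 2 (already at the minimum)
```

Changing the cluster by one server per measurement keeps the sketch simple; a production rule would likely also wait between changes so the cluster does not flap.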

Most simple web clusters rely on a single database server that all the application servers connect to. This way, all of the application servers have concurrent access to the same data. This can be problematic in the elastic use case when workloads increase and more servers are added to the cluster. If the work is bottlenecked on storing or accessing data in the database server, adding application servers will not help. It will actually make the problem worse, because each new application server sends more connections and queries to the same overloaded database.

I spoke on a panel at ZendCon yesterday, which was covered in an InfoWorld article that published my remarks. The article says:

Panelists also debated use of SQL and database connectivity in clouds. SQL as a design pattern for storage “is not ideal for cloud applications,” said Adrian Otto, senior technical strategist for Rackspace Cloud. Afterward, he described SQL issues as “typically the No. 1 bottleneck” to elasticity in the cloud. With elasticity, applications use more or fewer application servers based on demand. Otto recommended that developers who want elasticity should have a decentralized data model that scales horizontally. “SQL itself isn’t the problem. The problem is row-oriented data in an application,” which causes performance bottlenecks, said Otto.

The author, Paul Krill, did a good job here of accurately reporting my position on this subject. Data stored in SQL databases is arranged in tables of rows and columns. A new piece of data adds a new row. Each row has multiple columns that separate the fields of a single record in the table. The truth is that most web applications work very well with this data design pattern. Those should continue to use SQL databases with row-oriented data. However, there are some applications where data may be arranged differently to make reading the data more efficient.

If you have a big table of data, and you want to pull out just a little bit of it using a query, the database server must determine the location of that data in the table by consulting the table’s index, and return the portion that matches the constraints given in the query. This makes reading data relatively expensive from a computational perspective. If the data were arranged in lots of columns instead, it could be retrieved more efficiently, and it could be more easily distributed over a larger number of servers, yielding the horizontal scalability that cloud applications want. This works very well in cases where the number of reads is very high, but the data is not updated very frequently in proportion to the reads.

Let’s use a blog application as an example. Blog posts are written once, and maybe updated a few times, possibly once each time a comment is submitted. However, on a busy web site, a blog post may be read millions of times. If the posts were stored in a column-oriented storage system like Cassandra, they could be quickly and easily retrieved using the ID number of the blog post. The listing of recent blog posts can also be arranged in a column so that the front page of the blog site, with its listing of articles, can be generated from a single lookup. Using this approach requires that the data be properly arranged as it’s stored, putting the computational burden on the (infrequent) write rather than on the (frequent) read.
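To make the blog example concrete, here is a minimal sketch of doing the arranging at write time. A plain PHP array stands in for the distributed store, and the names `save_post` and `recent_posts`, as well as the ten-entry limit on the listing, are assumptions made for illustration. This is not the Cassandra API.

```php
<?php
// In-memory stand-in for a distributed key/column store. The layout and the
// names used here are hypothetical; this is not the Cassandra API.
function save_post(array &$store, $id, $title, $body) {
  // The post itself is stored under its id, so a read is a single key lookup
  $store['posts'][$id] = array('title' => $title, 'body' => $body);

  // Maintain the front-page listing at write time instead of querying for it
  $recent = isset($store['recent_posts']) ? $store['recent_posts'] : array();
  unset($recent[$id]);                       // An update moves the post to the top
  $recent = array($id => $title) + $recent;  // Newest first
  $store['recent_posts'] = array_slice($recent, 0, 10, true);
}

$store = array();
save_post($store, 1, 'Hello', 'First post');
save_post($store, 2, 'Elasticity', 'Second post');

echo $store['posts'][2]['title'], "\n";                       // Elasticity
echo implode(',', array_keys($store['recent_posts'])), "\n";  // 2,1
```

Both reads at the end are direct lookups with no searching or sorting, which is the point: the write did that work once so that millions of reads do not have to.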

Using a distributed system to store data in columns allows the data to be evenly distributed over an arbitrary number of servers, eliminating the central data bottleneck. Adding more servers in the correct proportion of application servers and storage servers can result in true horizontal scalability, meaning that capacity increases in direct proportion to the number of servers in the cluster.
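One simple way to spread data over several storage servers is to hash each key and use the hash to pick a server. The sketch below assumes made-up node names and a hypothetical `node_for_key` function. It uses plain modulo hashing only to stay short. Cassandra, for example, uses consistent hashing instead, so that adding a server moves only a fraction of the keys.

```php
<?php
// Hypothetical hash partitioning: map each key to one of N storage nodes.
// The node names are invented for this example.
function node_for_key($key, array $nodes) {
  $hash = crc32((string)$key) & 0x7fffffff;  // Force a non-negative integer
  return $nodes[$hash % count($nodes)];
}

$nodes = array('storage-1', 'storage-2', 'storage-3');
$counts = array_fill_keys($nodes, 0);
for ($id = 1; $id <= 3000; $id++) {
  $counts[node_for_key("post:$id", $nodes)]++;
}
print_r($counts);  // Shows how many of the 3000 keys landed on each node
```

Because the node is computed from the key alone, any application server can find the right storage server without asking a central coordinator.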

Why doesn’t everyone do this already? For some good reasons:

  1. The concept of running applications in clouds is still relatively new. The related technology is still maturing.
  2. Existing software tends to use SQL already. If you want to use an existing CMS platform, chances are it will require a central SQL database.
  3. Most heavy-read workloads can be scaled well using data caching techniques. If applications don’t write data very often, it may not be necessary to scale beyond a single database server.
  4. You must anticipate exactly how the application will use the data, and arrange it just right.
  5. It may be harder to analyze the data. Once your data is arranged in a column store, you may not be able to query it in arbitrary ways. You may only be able to pull it out using its ID numbers, or by systematically scanning all of it to find the parts you want.
  6. Distributed data storage (aka NoSQL) systems like Cassandra, HBase, Redis, etc. are complicated, and there is a considerable learning curve associated with setting them up and maintaining them. In some cases these systems are not as good in terms of data durability or data consistency as the prevailing SQL database systems. These tradeoffs can be difficult to navigate.
  7. Today’s software developers are very familiar with SQL as a data storage and access paradigm. They can very quickly develop software that relies on the ACID qualities of a SQL database.

If you have an application that you want to deploy into a cloud, and you want it to be very elastic, you should think about how you arrange your data. If you use a centralized data design, you will probably hit scalability bottlenecks when you add lots of servers. You should aim to decentralize the data so that you can easily add more servers to horizontally scale the environment without stumbling on the limits of the database server. This is particularly important in situations where you need the application to write a lot of data, and a cache is not a suitable solution for you.

Over time, the reasons not to use column-oriented data will diminish, and better tools and services will make it easier to do. Until then, I suggest that you carefully consider whether you need maximum elasticity. If not, then it’s perfectly appropriate to keep using the same centralized row-oriented data paradigm. Use a cache like memcached in cases where you have heavy reads, and when it’s acceptable to show slightly outdated information to readers. The truth is that traditional solutions work really well for most web applications. However, if you have one of the less common situations where you need true horizontal scalability, take a good look at a different arrangement for your data, and the systems and tools to make that possible for you in the cloud.
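The caching approach for heavy reads can be sketched as follows. A plain array stands in for memcached so the example runs anywhere, and the function name `cached_fetch`, the cache layout, and the 60-second lifetime are assumptions for illustration only. The lifetime is what bounds how outdated the information shown to readers can be.

```php
<?php
// Read-through cache sketch with a time-to-live. A plain array stands in
// for memcached; all names and the 60-second TTL are hypothetical.
function cached_fetch(array &$cache, $key, $ttl, $loader, $now) {
  if (isset($cache[$key]) && $cache[$key]['expires'] > $now) {
    return $cache[$key]['value'];          // Fresh enough: skip the database
  }
  $value = call_user_func($loader, $key);  // Miss or expired: read the database
  $cache[$key] = array('value' => $value, 'expires' => $now + $ttl);
  return $value;
}

$cache = array();
$dbReads = 0;
$loadPost = function ($key) use (&$dbReads) {
  $dbReads++;                              // Count the expensive reads
  return "contents of $key";
};

cached_fetch($cache, 'post:1', 60, $loadPost, 1000);  // Miss: reads the database
cached_fetch($cache, 'post:1', 60, $loadPost, 1030);  // Hit: served from cache
cached_fetch($cache, 'post:1', 60, $loadPost, 1061);  // Expired: reads again
echo $dbReads, "\n";  // 2
```

Three requests cost only two database reads here. On a busy site with many reads inside each 60-second window, the saving is far larger.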

Better Luhn Formula CC Validator for PHP

I was doing some work integrating with a payment gateway in a PHP application, and decided it would be a good idea to validate credit card numbers using the Luhn algorithm prior to forwarding them to the payment gateway for processing. I looked for existing PHP implementations, and found a few.

The more I found the less I liked any of them. Some of them actually had bugs or typos and did not work at all, and most of them would incorrectly validate a credit card number that was all zeros.

I wrote my own that I’m pretty happy with. It’s a good deal more efficient than most that I found. It does not repeat the same math on the same figures like some of them out there do.


<?php

/*
 *   Copyright 2011 Adrian Otto
 *
 *   Licensed under the Apache License, Version 2.0 (the "License");
 *   you may not use this file except in compliance with the License.
 *   You may obtain a copy of the License at
 *
 *       http://www.apache.org/licenses/LICENSE-2.0
 *
 *   Unless required by applicable law or agreed to in writing, software
 *   distributed under the License is distributed on an "AS IS" BASIS,
 *   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 *   See the License for the specific language governing permissions and
 *   limitations under the License.
 */

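function luhn_validate($s) {
  $s = (string)$s;
  // Reject anything that is not all digits, and don't allow all zeros
  if (!ctype_digit($s) || '' === ltrim($s, '0')) { return false; }
  $sum = 0;
  $i = strlen($s);      // One past the last character
  $parity = $i % 2;     // Index parity of the digits that get doubled
  while ($i-- > 0) {    // Iterate all digits backwards
    $d = (int)$s[$i];
    $sum += $d;         // Add the current digit
    // For every second digit counting from the right, add the digit again.
    // Adjust for doubled values of 10+ by subtracting 9.
    if ($parity == ($i % 2)) { $sum += ($d > 4) ? ($d - 9) : $d; }
  }
  return (0 == ($sum % 10));
}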

?>

The function is only a handful of lines of code. Can you make it better without making it harder to read and understand? Please let me know.
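To see the Luhn arithmetic itself, it can help to trace one number digit by digit. The helper below is a separate illustration and is not part of the validator: the name `luhn_contributions` is made up, and it trades the validator's compactness for readability. It uses the test number 79927398713, which is commonly used to demonstrate the algorithm.

```php
<?php
// Illustrative helper (hypothetical, not part of the validator): returns what
// each digit contributes to the Luhn sum, listed from the rightmost digit.
function luhn_contributions($s) {
  $out = array();
  $double = false;                // The rightmost digit is not doubled
  for ($i = strlen($s) - 1; $i >= 0; $i--) {
    $d = (int)$s[$i];
    if ($double) {
      $d *= 2;
      if ($d > 9) { $d -= 9; }    // Same as adding the two digits of 10..18
    }
    $out[] = $d;
    $double = !$double;           // Every second digit from the right is doubled
  }
  return $out;
}

$steps = luhn_contributions('79927398713');
echo implode('+', $steps), ' = ', array_sum($steps), "\n";
// 3+2+7+7+9+6+7+4+9+9+7 = 70, a multiple of 10, so the number passes
```

Note that the doubling pattern is counted from the right, which is why the length of the number matters when the digits are indexed from the left.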

I’m Paranoid, just like you!

By: Adrian Otto

Over the years I’ve administered email systems that provided service to thousands of end users’ mailboxes. In the early years, in the 1990s, most woes of a mail system administrator were about how to manage the setup of email accounts and related client settings, and changing passwords when they were forgotten by end users.

As the internet became more and more commercialized, spam exploded in our faces. Everyone hates spam. Mail administrators hate it with a passion. They are doing everything they can to try to fight it… they filter, they black-hole, they tattle to abuse@whatever.com about it. Sometimes their own users send spam, and they get black-holed and need to jump through hoops to undo the damage.

At the time I reached my breaking point I managed email for about a dozen domain names, probably about two hundred mailboxes in total. I hated it. I hated every waking moment of it. The RBLs that worked one day did not work the next. I’m convinced that e-mail system administration is the nastiest, dirtiest job there is for a sysadmin.

People kept suggesting to me that I outsource email, which I shrugged off. I had problems with outsourcing:

1) I’m Paranoid about Uptime.

It’s hard for me to trust other people, let alone trust a company. And trusting a company with something as important as my email??? No way. I’m a control freak, and I was going to keep control at all costs. Yes, I hated email system administration. I wasn’t even a sysadmin any more, but I still did it just so that I could control it. It needed to be highly available. I simply could not trust anyone to do it better than me.

2) I’m Paranoid about Security.

Although email is inherently an insecure communication mechanism, all sorts of highly sensitive information is in there anyway. What would happen if a competitor somehow got control of our email and read it? They could learn all of our secrets. No way, I’m keeping control of the security so that I know it’s locked down as much as humanly possible.

3) I’m Paranoid about Reliability and Control.

If something goes wrong, I want to be able to fix it quick. If I host it, I have full control of everything in the system. I can find what’s wrong and fix it fast. I’m really good at that.

I became a source code contributor for an open source email filtering system called bogofilter that uses Bayes filters to learn what’s spam and what’s not and filter based on that. I thought my spam filtering setup was the bomb! It worked great!

I got busier and busier with my work. I administered my email systems less and less. The better they worked, the less I would work on them because I had other fish to fry. The spammers got smarter and smarter, and soon enough my super cool spam filtering setup was becoming less and less effective.

So in 2006 something happened. I got super frustrated with spam administration. I was tired of having to keep finding or inventing better mouse traps to trap that nasty spam. So I thought to myself… There is an unlimited desire to send spam. Why? Because it works. If it did not work, the spammers would not be so determined to keep doing it. They are doing everything they can to outsmart you to get mail in your inbox. They keep getting smarter and smarter.

I thought some more… It’s like viruses. The hackers keep making better viruses, and the virus scanner software companies keep making their virus scanners better to clean them up and block them out. I needed something like virus scan, but for my email. I thought about all the technical ways to do it. I started hunting the web to find answers. I just wanted SOMEONE… anyone to handle this spam nonsense for me.

In the process, I stumbled across a company called “Webmail.us” (later acquired by Rackspace and now called “Rackspace Email”). They had a great web site, and said (at the time) they had 700,000 mailboxes in service. They had a complete spam filtering solution built in. The mailbox hosting was cheap. So cheap I could not ignore it. They were charging less for complete hosting of mailboxes than I was willing to pay for outsourced spam filtering.

In 2006 I did an experiment. I put my own domain name where I get my home email on webmail.us to see how it worked. I told myself that if it worked really well that I might switch all my email over to it, and wash my hands of email sysadmin work and all the spam nonsense that goes along with it. I did it for a month. It worked great. It was fast, it never went down. I got no spam. I was thrilled!

I did the unthinkable. I outsourced my email!

One by one I migrated all of my domains, and all my mail users over to the hosted system. I have never looked back. The system has been rock solid. The few problems I’ve seen over the past three years have been really minor, and solved more quickly than I would have been able to solve using my own systems. I had been converted.

I was so happy to finally be free of all the nuisance of administering email and spam filtering systems. It was great. Years later I ended up working with Rackspace, and told them the story of how I used and loved the email platform. I later met the people behind the system, and it was no wonder that it works as well as it does.

If you are still administering your own email, especially if you are running an Exchange system in your own office building, you need to take a serious look in the mirror and ask yourself why you are not outsourcing it to Rackspace Email. The truth is:

1) It’s more expensive to host it internally. Run the numbers.
2) Your uptime is a lot worse. Measure it.
3) Your security is no stronger. Audit it.
4) You are paranoid, just like me. Yes, you are.

You trust your bank with your money. You trust your phone company not to spy on all your phone calls. You do this stuff without worrying about it. These things are much bigger leaps of trust than outsourcing your email.

From me to you… do yourself a favor. Run the same experiment I did. You’ll be delighted. I work for Rackspace now, so my view is corrupt, right? Don’t take my word for it, because you’re paranoid. Just try it and see.