<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Adrian Otto&#039;s Blog</title>
	<atom:link href="http://adrianotto.com/feed/" rel="self" type="application/rss+xml" />
	<link>http://adrianotto.com</link>
	<description>For those who care about technical details</description>
	<lastBuildDate>Sun, 14 Mar 2010 17:34:33 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9.2</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Bandwidth != Network Performance</title>
		<link>http://adrianotto.com/2010/03/bandwidth-network-performance/</link>
		<comments>http://adrianotto.com/2010/03/bandwidth-network-performance/#comments</comments>
		<pubDate>Sun, 14 Mar 2010 17:34:33 +0000</pubDate>
		<dc:creator>Adrian Otto</dc:creator>
				<category><![CDATA[Cloud]]></category>
		<category><![CDATA[Development]]></category>
		<category><![CDATA[General]]></category>
		<category><![CDATA[Linux]]></category>
		<category><![CDATA[memcached]]></category>
		<category><![CDATA[best practices]]></category>
		<category><![CDATA[performance]]></category>

		<guid isPermaLink="false">http://adrianotto.com/?p=237</guid>
		<description><![CDATA[You might think that if you want faster internet performance, you can simply get a connection to the internet that has higher bandwidth. When you get a &#8220;faster&#8221; internet connection you may observe faster downloads. But it&#8217;s less frequently the additional bandwidth, and more frequently reduced latency that actually produces increased interactive web performance. This [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://adrianotto.com/wp-content/uploads/2010/03/rj45.jpg"><img class="alignright size-full wp-image-302" title="rj45" src="http://adrianotto.com/wp-content/uploads/2010/03/rj45.jpg" alt="" width="240" height="240" /></a>You might think that if you want faster internet performance, you can simply get a connection to the internet that has higher bandwidth. When you get a &#8220;faster&#8221; internet connection you may observe faster downloads. But it&#8217;s less frequently the additional bandwidth, and more frequently reduced latency that actually produces increased interactive web performance. This post explains why.</p>
<p>First of all, let&#8217;s review some definitions:</p>
<ul>
<li><strong>Bandwidth</strong>: The amount of data that can be passed along a communications channel in a given period of time.</li>
<li><strong>Latency</strong>: The time it takes for a packet to cross a network connection, from sender to receiver.</li>
<li><strong>Speed</strong>: Fast and rapid moving, going, traveling, proceeding, or performing; swiftness.</li>
<li><strong>Throughput</strong>: The quantity of data transmitted by a computer network over a given period of time.</li>
</ul>
<p>Now, all of these terms are related, and I want to highlight some of the minutiae here:</p>
<p><strong>Bandwidth</strong></p>
<p>The higher the bandwidth is on a network connection, the more data it&#8217;s capable of transmitting in a given period of time. Higher bandwidth is better.</p>
<p><strong>Latency</strong></p>
<p>This is very important, because latency effectively limits the amount of bandwidth you can consume if you are using a synchronous data transmission, like a TCP/IP download, where the sender waits for acknowledgments from the receiver. Lower latency is better, and will yield faster speed.</p>
<p><strong>Throughput</strong></p>
<p>Throughput is another way of expressing speed. The higher the throughput, the faster your network communications will be. Note that your maximum possible throughput is your bandwidth. Actual throughput is equal to or less than your bandwidth.</p>
<p><strong>Speed</strong></p>
<p>If your network is high speed, you should observe high bandwidth, low latency, and high throughput.</p>
<h3>Latency and Throughput are Inversely Related</h3>
<p>For TCP/IP transmissions, the higher your latency is, the lower your throughput will be. Let&#8217;s explore why. The most common use of TCP/IP is for the web, which uses the HTTP protocol. HTTP works by making a TCP/IP connection to a remote server, issuing a request for a document, and then receiving the response. The protocol is text based. A simple HTTP transmission is illustrated below.</p>
<p>Client Request:</p>
<pre>GET / HTTP/1.1
User-Agent: Wget
Host: www.example.com
</pre>
<p>Server Response:</p>
<pre>HTTP/1.1 200 OK
Server: Apache/2.2.3 (Red Hat)
Last-Modified: Tue, 15 Nov 2005 13:24:10 GMT
ETag: "b300b4-1b6-4059a80bfd280"
Accept-Ranges: bytes
Content-Type: text/html; charset=UTF-8
Connection: Keep-Alive
Date: Wed, 18 Nov 2009 22:36:34 GMT
Age: 1010
Content-Length: 438

  Example Web Page

You have reached this web page by typing "example.com",
"example.net",
  or "example.org" into your web browser.

These domain names are reserved for use in documentation and are not available
  for registration. See &lt;a href="http://www.rfc-editor.org/rfc/rfc2606.txt"&gt;RFC
  2606&lt;/a&gt;, Section 3.
</pre>
<p>Here is a trace of the TCP/IP packets that make up that request:</p>
<pre>14:57:47.146665 IP 192.168.144.2.39556 &gt; 192.0.32.10.80: S 3717672264:3717672264(0) win 5840
14:57:47.220092 IP 192.168.144.2.39556 &gt; 192.0.32.10.80: . ack 1 win 183
14:57:47.220309 IP 192.168.144.2.39556 &gt; 192.0.32.10.80: P 1:123(122) ack 1 win 183  (GET Request)
14:57:47.300962 IP 192.0.32.10.80 &gt; 192.168.144.2.39556: P 1:728(727) ack 123 win 4502  (200 OK Response)
14:57:47.300993 IP 192.168.144.2.39556 &gt; 192.0.32.10.80: . ack 728 win 228
14:57:47.302035 IP 192.168.144.2.39556 &gt; 192.0.32.10.80: F 123:123(0) ack 728 win 228
14:57:47.375475 IP 192.0.32.10.80 &gt; 192.168.144.2.39556: . ack 124 win 4502
14:57:47.375499 IP 192.0.32.10.80 &gt; 192.168.144.2.39556: F 728:728(0) ack 124 win 4502
14:57:47.375510 IP 192.168.144.2.39556 &gt; 192.0.32.10.80: . ack 729 win 228
</pre>
<p>Notice that the connection takes 10 packets (the server&#8217;s SYN-ACK reply to the first packet is not shown in the trace above). It&#8217;s a three way handshake to set up the TCP session, then a round trip to send the data, then two more round trips to close down the connection. Each time the server receives a packet from the client, the connection may wait in the server&#8217;s connection queue to be processed, which can further increase the interactive protocol latency. Consider the impact of high latency on a connection like this. Suppose that it takes 0.2 seconds for each round trip. The four round trips take 0.8 seconds in total, so that connection would have a total throughput of 727 bytes downloaded in 0.8 seconds. That&#8217;s a rate of 909 bytes/sec. Your internet connection might be 15 Mb/sec, but the bandwidth did not matter. Latency caused the throughput to be low.</p>
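<p>The arithmetic in that example can be checked with a short sketch. The function name and its parameters are illustrative assumptions, not part of any real tool; it only divides the payload size by the time spent waiting on round trips.</p>

```python
def effective_throughput(payload_bytes, round_trips, rtt_seconds):
    """Bytes per second for a transfer whose duration is dominated by round trips."""
    total_seconds = round_trips * rtt_seconds
    return payload_bytes / total_seconds

# Handshake (1 round trip) + request/response (1) + connection close (2) = 4 round trips.
rate = effective_throughput(payload_bytes=727, round_trips=4, rtt_seconds=0.2)
print(f"{rate:.0f} bytes/sec")
```

<p>Bandwidth does not appear anywhere in this calculation. For a transfer this small, halving the round trip time doubles the rate, while a bigger pipe changes nothing.</p>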
<p>Now, you might be wondering why we can&#8217;t just improve networking technology to make latency lower. We can, but that&#8217;s not going to help much, because we are still bounded by the speed of light, among other factors. <strong>The speed of light is slow when you consider the distance it has to travel to cross continents on the earth.</strong> Let&#8217;s look at some math to explain that:</p>
<ul>
<li>The speed of light in vacuum is 299,792,458 m/s.</li>
<li>The speed of light in fiber optic cable is ~200,000,000 m/s.</li>
<li>The distance from Anaheim, CA to New York is 4,494,898 meters.</li>
<li>The one-way latency to New York is 4,494,898 / 200,000,000 = 22.47 ms.</li>
<li>The round-trip time between Anaheim, CA and New York is 44.95 ms.</li>
<li>The current ping time from Anaheim, CA to New York is 72 ms, as the trace below shows.</li>
</ul>
<pre>Tracing the route to sl-gw33-nyc.sprintlink.net (144.228.243.82)
  1 sl-crs1-ana-0-14-2-0.sprintlink.net (144.232.11.9) 0 msec
    sl-crs2-ana-0-14-2-0.sprintlink.net (144.232.11.11) 0 msec
    sl-crs1-ana-0-14-2-0.sprintlink.net (144.232.11.9) 4 msec
  2 sl-crs2-fw-0-13-3-0.sprintlink.net (144.232.19.197) 28 msec
    sl-crs2-fw-0-9-5-0.sprintlink.net (144.232.20.130) 28 msec
    sl-crs1-fw-0-3-3-0.sprintlink.net (144.232.9.65) 28 msec
  3 sl-crs2-kc-0-0-0-2.sprintlink.net (144.232.19.141) 40 msec
    144.232.20.57 40 msec
    sl-crs1-kc-0-5-5-0.sprintlink.net (144.232.24.9) 40 msec
  4 sl-crs2-chi-0-13-5-0.sprintlink.net (144.232.20.109) 52 msec
    sl-crs1-chi-0-1-0-3.sprintlink.net (144.232.18.214) 56 msec
    sl-crs2-chi-0-15-2-0.sprintlink.net (144.232.24.206) 52 msec
  5 sl-crs1-nyc-0-8-0-3.sprintlink.net (144.232.18.123) 72 msec
    sl-crs2-nyc-0-8-0-1.sprintlink.net (144.232.20.119) 72 msec
    sl-crs1-chi-0-10-3-0.sprintlink.net (144.232.9.148) 72 msec
  6 sl-gw33-nyc-14-0-0.sprintlink.net (144.232.6.56) 72 msec *
    sl-gw33-nyc-15-0-0.sprintlink.net (144.232.6.58) 72 msec
</pre>
<p>This round trip time includes all of the switching and routing to get the packet through its full round trip. That means that even if all switching and routing were instantaneous, and we had a perfectly straight fiber path between all points on the earth, we could only reduce latency by about 40%. We cannot accelerate the speed of light, so without a significant advance in data transmission technology (perhaps a quantum physics approach) we must accept the speed of light as a performance boundary.</p>
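<p>For anyone who wants to rerun the numbers, here is a minimal sketch of the speed-of-light calculation. The constant names are my own; the values are the ones quoted in the list and the trace above.</p>

```python
FIBER_LIGHT_SPEED_M_PER_S = 200_000_000  # approximate speed of light in fiber
DISTANCE_M = 4_494_898                   # Anaheim, CA to New York
MEASURED_RTT_MS = 72                     # ping time observed in the trace

one_way_ms = DISTANCE_M / FIBER_LIGHT_SPEED_M_PER_S * 1000
best_case_rtt_ms = 2 * one_way_ms
possible_reduction = (MEASURED_RTT_MS - best_case_rtt_ms) / MEASURED_RTT_MS

print(f"one-way: {one_way_ms:.2f} ms")
print(f"best-case round trip: {best_case_rtt_ms:.2f} ms")
print(f"possible reduction: {possible_reduction:.0%}")
```

<p>The last figure works out to roughly 38%, which is where the &#8220;about 40%&#8221; ceiling comes from.</p>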
<h3>Making Web Sites Faster</h3>
<p>If you&#8217;re a web content publisher, you can set up your systems to work around these natural limitations. One way to make interactive web performance faster is to place copies of your data in various geographic locations that are physically closer to your end users. Using a <a href="http://en.wikipedia.org/wiki/Content_delivery_network" target="_blank">CDN</a> for your media content is one way to do this. You can also make your web server as fast as possible so that your dynamically generated content can be processed as quickly as possible. Using <a href="http://memcached.org/" target="_blank">memcached</a> to speed up your web application can help. Also, take a look at some <a href="http://developer.yahoo.com/performance/rules.html" target="_blank">best practices</a> for web developers for good performance.</p>
]]></content:encoded>
			<wfw:commentRss>http://adrianotto.com/2010/03/bandwidth-network-performance/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Put WiFi on your cell phone&#8217;s SIM Card!</title>
		<link>http://adrianotto.com/2010/02/put-wifi-on-your-sim-card/</link>
		<comments>http://adrianotto.com/2010/02/put-wifi-on-your-sim-card/#comments</comments>
		<pubDate>Mon, 15 Feb 2010 17:31:30 +0000</pubDate>
		<dc:creator>Adrian Otto</dc:creator>
				<category><![CDATA[General]]></category>
		<category><![CDATA[VoIP]]></category>
		<category><![CDATA[iPhone]]></category>
		<category><![CDATA[Mobile]]></category>
		<category><![CDATA[Wireless]]></category>

		<guid isPermaLink="false">http://adrianotto.com/?p=268</guid>
		<description><![CDATA[Have you ever wanted to surf the web from your laptop using the internet connection on your cell phone without connecting any wires, and with no hassle goofing around with software? Well guess what, for you happiness is close at hand!
Today Sagem Orga made a press release that raised my eyebrows. They have a new [...]]]></description>
			<content:encoded><![CDATA[<p>Have you ever wanted to surf the web from your laptop using the internet connection on your cell phone without connecting any wires, and with no hassle goofing around with software? Well guess what, for you happiness is close at hand!</p>
<p><img class="alignright size-full wp-image-269" title="wifi_sim" src="http://adrianotto.com/wp-content/uploads/2010/02/wifi_sim.jpg" alt="" width="126" height="119" />Today <a href="http://www.sagem-orga.com/" target="_blank">Sagem Orga</a> made a <a href="http://www.sagem-orga.com/index.php?mySID=f57afcdfff43f2fcba150c2e7d8d046a&amp;myELEMENT=World%20premier:%20Sagem%20Orga%20and%20Telefonica%20turn%20the%20SIM%20card%20into%20a%20Wi-Fi%20hotspot&amp;searchstring=SIMFi&amp;suchart=volltext" target="_blank">press release</a> that raised my eyebrows. They have a new SIM card (<a href="http://en.wikipedia.org/wiki/Sim" target="_blank">the identification chip in your GSM cell phone</a>) that has WiFi capability right on the chip. This is exciting, because it would enable otherwise ordinary cell phones to be used as WiFi internet gateways, running both WiFi and 3G data connections at the same time.</p>
<p><img class="alignnone size-full wp-image-274" title="laptop-to-phone-to-internet" src="http://adrianotto.com/wp-content/uploads/2010/02/laptop-to-phone-to-internet.png" alt="" width="538" height="198" /></p>
<p>This is something that most phones simply cannot do. The ones that can require a software program running on the phone to turn it into a router that relays WiFi traffic to the internet through a 3G data connection on the cell phone network. Getting this on a Blackberry, for example, is a huge nuisance, if your service provider supports it at all.</p>
<p>Well, that nuisance may be a thing of the past once the new “SIMFi” technology hits the market. Imagine just plugging the snazzy new card into your phone, joining its WiFi network from your laptop, and accessing the internet from practically anywhere. How cool is that!?!</p>
<p>There has been a discussion on <a href="http://mobile.slashdot.org/story/10/02/12/1824229/Wi-Fi-In-a-SIM-Card" target="_blank">Slashdot about this</a>. One of the interesting comments was about the need for a 2.4 GHz antenna, which can actually fit fine on the SIM card itself, as long as it&#8217;s bent around a bit. An obvious question with any WiFi product is &#8220;what&#8217;s the implication for battery life?&#8221;. It will definitely be shorter. Hopefully this device will have some sort of tunable transmit power adjustment for the WiFi signal so power consumption can be kept to a minimum. After all, your laptop and your cell phone will only be an arm&#8217;s length apart when you are using this setup anyway, so range is not a major concern.</p>
<p>Yes, I do love technical gadgets. The thought of where this could go is very exciting. I&#8217;ll be the first on the waiting list for this!</p>
]]></content:encoded>
			<wfw:commentRss>http://adrianotto.com/2010/02/put-wifi-on-your-sim-card/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>CPU Time stolen from a virtual machine?</title>
		<link>http://adrianotto.com/2010/02/time-stolen-from-a-virtual-machine/</link>
		<comments>http://adrianotto.com/2010/02/time-stolen-from-a-virtual-machine/#comments</comments>
		<pubDate>Wed, 03 Feb 2010 16:42:59 +0000</pubDate>
		<dc:creator>Adrian Otto</dc:creator>
				<category><![CDATA[Cloud]]></category>
		<category><![CDATA[General]]></category>
		<category><![CDATA[Linux]]></category>
		<category><![CDATA[performance]]></category>
		<category><![CDATA[Xen]]></category>

		<guid isPermaLink="false">http://adrianotto.com/?p=258</guid>
		<description><![CDATA[Those of you studying the vmstat(8) man page may be wondering what the &#8216;st&#8217; figure is in the CPU column. The manual refers to it as &#8220;Time stolen from a virtual machine&#8221;. More specifically:
It&#8217;s the time the hypervisor scheduled something else to run instead of something within your VM. This might be time for another [...]]]></description>
			<content:encoded><![CDATA[<p>Those of you studying the vmstat(8) man page may be wondering what the &#8216;st&#8217; figure is in the CPU column. The manual refers to it as &#8220;<em>Time stolen from a virtual machine</em>&#8221;. More specifically:</p>
<p>It&#8217;s the time the hypervisor scheduled something else to run instead of something within your VM. This might be time for another VM, or for the Hypervisor host itself. If no time were stolen, this time would be used to run your CPU workload or your idle thread.</p>
<p>There is some disagreement circulating about whether the Hypervisor will steal idle time, or only preempted time. In other words, it has been suggested that stolen time is where your local kernel scheduler within the VM wanted to run something but the Hypervisor made that impossible. I have found that stolen time does in fact count borrowed idle time, where the local scheduler actually had nothing to run. For example, here are some vmstat values from a VM that&#8217;s got a very low cpu workload on it:</p>
<pre>
vmstat -S M 1 10
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  0    121     42     53    460    0    0     0     1    0    1  0  0 89  0 10
 0  0    121     42     53    460    0    0     0    28 1014   39  0  0 90  0 10
 0  0    121     42     53    460    0    0     0     0 1016   36  0  0 91  0  9
 0  0    121     42     53    460    0    0     0     0 1024   32  0  0 93  0  7
 0  0    121     42     53    460    0    0     0     0 1019   40  0  0 91  0  9
 0  0    121     42     53    460    0    0     0     0 1015   32  0  0 90  0 10
 0  0    121     42     53    460    0    0     0     0 1022   34  0  0 92  0  8
 0  0    121     42     53    460    0    0     0     0 1016   36  0  0 91  0  9
 0  0    121     42     53    460    0    0     0     0 1013   34  0  0 92  0  8
 0  0    121     42     53    460    0    0     0     0 1028   43  0  0 93  0  7
</pre>
<p>As you can see, user time (us), system time (sy), and iowait time (wa) are zero, but idle time is not 100%. This normally indicates that your system is doing something, but in this case idle time is actually the sum of the <em>id</em> and <em>st</em> columns.</p>
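<p>If you want to check this on your own VM, the sum can be computed from captured vmstat output. This is a rough sketch: the function name is hypothetical, and it assumes the column layout shown above, with a header row that names the <em>id</em> and <em>st</em> columns.</p>

```python
def idle_including_steal(vmstat_text):
    """Return id + st for each sample row of vmstat output."""
    lines = [line.split() for line in vmstat_text.strip().splitlines()]
    # The column header row is the one that names the "id" and "st" columns.
    header = next(cols for cols in lines if "id" in cols and "st" in cols)
    id_col, st_col = header.index("id"), header.index("st")
    totals = []
    for cols in lines:
        # Sample rows have one numeric field per header column.
        if len(cols) == len(header) and all(c.isdigit() for c in cols):
            totals.append(int(cols[id_col]) + int(cols[st_col]))
    return totals

sample = """\
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  0    121     42     53    460    0    0     0     1    0    1  0  0 89  0 10
 0  0    121     42     53    460    0    0     0    28 1014   39  0  0 90  0 10
 0  0    121     42     53    460    0    0     0     0 1016   36  0  0 91  0  9
"""
print(idle_including_steal(sample))
```

<p>For the three sample rows this gives 99, 100 and 100. vmstat rounds each column separately, so a row can land a point short of 100.</p>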
<p>In this example, I really don&#8217;t care that I have a nonzero <em>st</em> column because my workload is basically idle all the time anyway.</p>
<p>If you are on a cloud host where you purchase a small sliver of a server, you should expect to see nonzero values in this column when you run vmstat. If you have a heavy CPU load and need more processing power, you can solve this problem by upgrading to a larger VM server size so that you command a larger portion of the physical host.</p>
]]></content:encoded>
			<wfw:commentRss>http://adrianotto.com/2010/02/time-stolen-from-a-virtual-machine/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>ED Strikes Again?</title>
		<link>http://adrianotto.com/2010/02/ed-strikes-again/</link>
		<comments>http://adrianotto.com/2010/02/ed-strikes-again/#comments</comments>
		<pubDate>Tue, 02 Feb 2010 22:28:19 +0000</pubDate>
		<dc:creator>Adrian Otto</dc:creator>
				<category><![CDATA[General]]></category>

		<guid isPermaLink="false">http://adrianotto.com/?p=251</guid>
		<description><![CDATA[It&#8217;s not the ED you are thinking of. Nope, it&#8217;s actually the External Dependency.
One piece of advice that I continually dispense is to try to reduce dependencies on remote web sites when coding your own. The problem strikes most dramatically when you run a very busy site, and you have some feed or resource that [...]]]></description>
			<content:encoded><![CDATA[<p>It&#8217;s not the ED you are thinking of. Nope, it&#8217;s actually the <span style="color: #ff0000;"><strong>E</strong></span>xternal <span style="color: #ff0000;"><strong>D</strong></span>ependency.</p>
<p>One piece of advice that I continually dispense is to try to reduce dependencies on remote web sites when coding your own. The problem strikes most dramatically when you run a very busy site, and you have some feed or resource that you download from a remote site. That remote site crashes, and oops, so does yours. It also happens when your busy site gets more traffic than the corresponding requests to the remote site can handle.</p>
<p>I ran into this again today. One site that I host was consuming a remote feed from a site that has a much smaller capacity than my customer does. The site on my end gets over 10 million page views a day (peak ~2000 page views per second). The capacity mismatch became very apparent when something went wrong on the remote end.</p>
<p>The code logic was:</p>
<ol>
<li>If you have a cached version of the feed, and it&#8217;s fresh, then use it.</li>
<li>If the cached entry is expired, then fetch a new one, and replace the one in cache.</li>
</ol>
<p>This logic is fundamentally flawed for busy sites. It seems sensible, but think about what happens when the cached entry expires, and the remote site is responding very slowly. All of a sudden a stampede of requests starts stacking up, all trying to get the feed in parallel. That makes the remote site&#8217;s crash even worse. The remote site tries to reboot, and you quickly crash it again. The sequence repeats indefinitely.</p>
<p>Why? Because the window of time during which the cache is invalid gets wider and wider as the remote site gets slower and slower. The longer that window is open, the more traffic the remote site will get from cache misses.</p>
<p>A clean solution is to update the cache asynchronously using a scheduled batch job that keeps a local cache of the data. Only replace the cached copy when the feed has actually changed. The logic in the web application changes to:</p>
<ol>
<li>Always use the data in the cached file.</li>
</ol>
<p>The feed site is consulted on regular intervals using a scheduled batch job (cron), and the cached data is updated if it&#8217;s able to get a response. If the remote site is down or too slow, then the application simply continues to use the version it had before. Problem solved!</p>
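<p>Here is a minimal sketch of what such a batch job could look like. It is an illustration under assumptions, not the code from the site in question: the function name, the <code>fetch</code> hook (there so the job can be exercised without a network), and the file permissions are all choices of mine.</p>

```python
import os
import tempfile
import urllib.request

def refresh_cache(feed_url, cache_path, fetch=None, timeout=10):
    """Fetch the feed once and atomically replace the cache file.

    Returns True if the cache was updated, False if the old copy was kept.
    """
    if fetch is None:
        def fetch(url):
            with urllib.request.urlopen(url, timeout=timeout) as response:
                return response.read()
    try:
        data = fetch(feed_url)
    except Exception:
        # Remote site is down or too slow: keep serving the previous copy.
        return False
    if not data:
        return False
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as existing:
            if existing.read() == data:
                return False  # unchanged, nothing to write
    # Write to a temporary file in the same directory, then rename it over
    # the old file so web requests never see a partially written cache.
    directory = os.path.dirname(os.path.abspath(cache_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as tmp:
        tmp.write(data)
    os.chmod(tmp_path, 0o644)  # mkstemp creates 0600; the web server needs read access
    os.replace(tmp_path, cache_path)
    return True
```

<p>Run from cron every few minutes, only this one job ever talks to the remote site, no matter how many page views you serve. The web application just reads the cache file.</p>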
<p>Why is this not a best practice for all web developers? Because most web sites don&#8217;t get enough traffic for it to matter much. But, if you&#8217;ve got a busy site, and you don&#8217;t want it to crash when your remote feeds do, then you might want to consider getting that data asynchronously, or at least use a cache update procedure that&#8217;s serialized.</p>
<p>Here is <a href="http://cloudsites.rackspacecloud.com/index.php/How_to_download_data_from_remote_web_servers_efficiently" target="_blank">an example</a> of a non-blocking serialization approach that works for PHP applications.</p>
<p>So all you web developers out there who like to consume RSS feeds on the server-side of your web application&#8230; don&#8217;t say I didn&#8217;t warn you. Go look at all your code and make sure you don&#8217;t have a dependency on a remote site. If you do, you now know at least two ways to solve that problem.</p>
]]></content:encoded>
			<wfw:commentRss>http://adrianotto.com/2010/02/ed-strikes-again/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Putting Entropy in the Cloud</title>
		<link>http://adrianotto.com/2009/11/putting-entropy-in-the-cloud/</link>
		<comments>http://adrianotto.com/2009/11/putting-entropy-in-the-cloud/#comments</comments>
		<pubDate>Tue, 24 Nov 2009 04:48:56 +0000</pubDate>
		<dc:creator>Adrian Otto</dc:creator>
				<category><![CDATA[Cloud]]></category>
		<category><![CDATA[Development]]></category>
		<category><![CDATA[Linux]]></category>
		<category><![CDATA[Entropy]]></category>
		<category><![CDATA[Random]]></category>
		<category><![CDATA[RNG]]></category>
		<category><![CDATA[Xen]]></category>

		<guid isPermaLink="false">http://adrianotto.com/?p=247</guid>
		<description><![CDATA[I was browsing through twitter mentions of @adrian_otto and found one posted by Ian Thompson mentioning an article about weak randomness in the cloud. It suggests that because there may be insufficient entropy sources on a Cloud Server or instance that it may make it easier to guess random number sequences because different cloud servers [...]]]></description>
			<content:encoded><![CDATA[<p>I was browsing through twitter mentions of <a href="http://twitter.com/adrian_otto" target="_blank">@adrian_otto</a> and found one posted by <a href="http://twitter.com/MystirrE" target="_blank">Ian Thompson</a> mentioning <a href="http://bit.ly/34Wom8" target="_blank">an article</a> about weak randomness in the cloud. It suggests that a Cloud Server or instance may have insufficient entropy sources, and that different cloud servers may have similar or even identical entropy pools (or worse yet identical host keys) when created. That would make random number sequences easier to guess, and encryption algorithms that depend on them easier to break.</p>
<p>Yes, if you have similar entropy pools it is easier to break encryption dependent on it. It&#8217;s reasonably easy to work around this and make sure your entropy pool is uniquely initialized. You can consult the <a href="http://linux.die.net/man/4/random" target="_blank">random manual for the Linux Kernel</a> for information about how to seed your entropy pool with a particular set of data. If you are running an application in the cloud that utilizes encryption, and you are concerned about the initial state of your entropy pool, you can solve that. Use this procedure:</p>
<p>1) Seed your own pool from a long running system that has sufficient entropy in it, rather than relying on what you read from the kernel at startup.</p>
<p>2) Produce a network service that you use to seed your initial entropy pools. This service could be as simple as an entropy file that you create on pseudo-random time intervals, and just discard them as you serve them to cloud server instances (as they boot up) so you never serve the same one twice. At boot time from your VM, simply connect to wherever you run this service and download an input file to seed your entropy pool with. Restrict access to this so that it&#8217;s only available to your own server instances.</p>
<p>3) Make sure that your custom entropy pool initialization takes place prior to starting your encryption software.</p>
<p>4) If you are creating an AMI, or other server image that you plan to clone, be sure that it does not have a host key generated yet. Delete it and allow your initialization scripts to create it when the server is created (after the steps above), rather than making copies of the same one.</p>
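<p>As a rough illustration of the seeding steps, the sketch below mixes a downloaded seed file into the kernel pool and then deletes it. The function name and the idea of a local seed file are assumptions for the example. It relies on the behavior described in the random manual: data written to the random device is mixed into the pool, although the kernel does not credit it as additional entropy.</p>

```python
import os

def seed_pool_from_file(seed_path, device_path="/dev/urandom"):
    """Mix the contents of seed_path into the pool, then discard the seed."""
    with open(seed_path, "rb") as seed_file:
        seed = seed_file.read()
    if not seed:
        raise ValueError("seed file is empty")
    # Writing to the random device updates the pool with the written data.
    with open(device_path, "wb") as device:
        device.write(seed)
    # Never reuse a seed: remove it once it has been consumed.
    os.remove(seed_path)
    return len(seed)
```

<p>You would call this from a boot script before any encryption software starts and before the host key is generated. The <code>device_path</code> parameter exists mainly so the function can be tried against an ordinary file.</p>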
<p>If you don&#8217;t trust what /dev/random or /dev/urandom emit, you can optionally use OpenSSL with <a href="http://prngd.sourceforge.net/" target="_blank">prngd</a> or <a href="http://egd.sourceforge.net/" target="_blank">egd</a> as alternate entropy sources, and potentially feed in your own sensory input data. If you want to go hardcore, you could add environmental noise such as resistor noise on the microphone input of a sound card, or some other sensory data. There is <a href="http://vanheusden.com/aed/" target="_blank">existing software for doing just that</a>. There&#8217;s all sorts of possibilities. Among them are a number of hardware solutions for RNG, most of which are pretty expensive and are not options for a cloud environment. There are sources of random numbers provided <a href="http://random.irb.hr/">as a service</a> from <a href="http://www.random.org/" target="_blank">various sources</a>.</p>
<p>There are things that we can do as Cloud Computing service providers to pre-initialize your entropy pools for you when the given server instance is created so the procedure above would be redundant. This still leaves the question as to the quality of the <a href="http://en.wikipedia.org/wiki/Random_number_generator" target="_blank">RNG</a> available to you on a cloud server.</p>
<p>There are two standard randomness sources that you should know about:</p>
<p>/dev/random   = produces actual entropy, if you have some, and blocks otherwise.<br />
/dev/urandom = produces available entropy regardless of quality, but does not block.</p>
<p>The Linux kernel has a paravirtual entropy driver which provides kernel-side support for the virtual <a href="http://en.wikipedia.org/wiki/Random_number_generator" target="_blank">RNG</a> hardware. The kernel compile option CONFIG_HW_RANDOM_VIRTIO enables it, and it can be built as a kernel module. There are drivers that run within the hypervisor host kernel that connect this with the RNG hardware available on the server (if any).</p>
<p>drivers/char/hw_random/amd-rng.ko = H/W RNG driver for AMD chipsets<br />
drivers/char/hw_random/intel-rng.ko = H/W RNG driver for Intel chipsets<br />
drivers/char/hw_random/virtio-rng.ko = VirtIO Random Number Generator support</p>
<p>The way it works is that the hypervisor host (dom0) runs <a href="http://linux.die.net/man/8/rngd/" target="_blank">rngd</a> to read data from /dev/hwrandom (using the Intel or AMD modules mentioned above) and feed it into /dev/random. The guest VM (domU) then does the same thing: you run rngd in the guest VM to feed the data from the virtual RNG device into the guest kernel. rngd can mix data from both /dev/random and /dev/urandom so you get as much random data as you need in a non-blocking fashion. You can consult the kernel <a href="http://lwn.net/Articles/282721/" target="_blank">source code</a> to learn more.</p>
<p>What happens if multiple guest VMs are reading this data at the same time using this arrangement? I&#8217;m not sure if it&#8217;s possible to deplete the entropy pool of the hypervisor host and produce <a href="http://en.wikipedia.org/wiki/Pseudorandom_number_generator" target="_blank">PRNG</a> patterns that are therefore less random. So if one guest VM emptied the entropy pool by aggressively reading from the /dev/hwrandom device, you might cause someone else&#8217;s guest VM to get less data. This could be solved if there were simply a rate limit enforced on the consumption of RNG data allowed per guest VM. There is <a href="http://lwn.net/Articles/283103/" target="_blank">further discussion</a> of that as well.</p>
<p>The truth is that for most needs you can have reasonably secure encryption by simply having an ordinary PRNG source like /dev/urandom that&#8217;s properly initialized with random data. I suggest that you use that approach in your cloud deployments.</p>
]]></content:encoded>
			<wfw:commentRss>http://adrianotto.com/2009/11/putting-entropy-in-the-cloud/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Writing Code That Scales</title>
		<link>http://adrianotto.com/2009/11/writing-code-that-scales/</link>
		<comments>http://adrianotto.com/2009/11/writing-code-that-scales/#comments</comments>
		<pubDate>Wed, 18 Nov 2009 16:38:59 +0000</pubDate>
		<dc:creator>Adrian Otto</dc:creator>
				<category><![CDATA[Cloud]]></category>
		<category><![CDATA[Development]]></category>

		<guid isPermaLink="false">http://adrianotto.com/?p=234</guid>
		<description><![CDATA[Check my post from today on the Rackspace Cloud blog. It covers several tips on planning ahead when writing a web-scale application.
]]></description>
			<content:encoded><![CDATA[<p>Check my <a href="http://www.rackspacecloud.com/blog/2009/11/18/writing-code-that-scales/" target="_blank">post from today</a> on the Rackspace Cloud blog. It covers several tips on planning ahead when writing a web-scale application.</p>
]]></content:encoded>
			<wfw:commentRss>http://adrianotto.com/2009/11/writing-code-that-scales/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Maytag Repair Man Reports No Problem</title>
		<link>http://adrianotto.com/2009/11/maytag-repair-man-reports-no-problem/</link>
		<comments>http://adrianotto.com/2009/11/maytag-repair-man-reports-no-problem/#comments</comments>
		<pubDate>Fri, 13 Nov 2009 14:37:46 +0000</pubDate>
		<dc:creator>Adrian Otto</dc:creator>
				<category><![CDATA[General]]></category>

		<guid isPermaLink="false">http://adrianotto.com/?p=182</guid>
		<description><![CDATA[Recently my clothes dryer had not been working well. Clothes inside were damp and warm rather than hot and dry at the end of a cycle. We called in the Maytag repair man. Actually I booked him on-line, and got an appointment just over 24 hours later. In all fairness, he was not a Maytag [...]]]></description>
			<content:encoded><![CDATA[<p><img class="alignleft size-full wp-image-184" title="maytag_repair_man" src="http://adrianotto.com/wp-content/uploads/2009/11/maytag_repair_man.png" alt="maytag_repair_man" width="150" height="188" />Recently my clothes dryer had not been working well. Clothes inside were damp and warm rather than hot and dry at the end of a cycle. We called in the <a href="http://www.maytag.com/" target="_blank">Maytag</a> repair man. Actually I booked him on-line, and got an appointment just over 24 hours later. In all fairness, he was not a Maytag repairman, but someone from their authorized service network. He listened to my description of the problem and proceeded to disconnect the exhaust vent hose attachment at the back of the dryer. OMG it was totally packed with lint!</p>
<p><a href="http://www.maytag.com/"><img class="alignright size-full wp-image-183" title="Maytag Dryer" src="http://adrianotto.com/wp-content/uploads/2009/11/maytag_dryer.png" alt="Maytag Dryer" width="131" height="209" /></a>He explained that the dryer has an overheat safety sensor. When the exhaust temperature gets close to the point where it might catch lint on <span style="color: #ff0000;">fire</span>, it shuts down the gas, so the dryer cools off. With the vent blocked, the dryer cooks the air in the exhaust path. My clothes were getting hot at the start of a cycle, and just tumbling after that. No wonder they would not dry! I happily paid him for telling me that there was nothing to fix!</p>
<p>I&#8217;ve had this washer and dryer for almost a decade, and have never had a problem. Now I had my first problem, and found that it was not even the appliance that had the issue, but simply that I had never cleaned out the exhaust path. I have since learned that <a href="http://www.ehow.com/how_5529188_troubleshoot-dryer-exhaust-vent.html" target="_blank">this should be done regularly</a>.</p>
<p>After a few minutes with my shop vac sucking lint out of the exhaust hose, inside the dryer exhaust plenum, and up the external vent pipe, everything was working perfectly again. First of all, I&#8217;m happy that the dryer had this safety feature so my house did not catch fire. According to the US Consumer Product Safety Commission, <a href="http://www.cpsc.gov/cpscpub/pubs/5022.html" target="_blank">people die from this all the time</a>. <img class="alignright size-full wp-image-201" title="dollar" src="http://adrianotto.com/wp-content/uploads/2009/11/dollar.jpg" alt="dollar" width="143" height="62" hspace="10"/>Secondly, I&#8217;m thrilled that there was nothing mechanically wrong with my dryer. Thirdly, I&#8217;m looking forward to saving some money on my gas bill, which should decrease now.</p>
<p><a href="http://www.maytag.com/"><img src="http://adrianotto.com/wp-content/uploads/2009/11/maytag_logo.png" alt="Maytag Logo" title="Maytag Logo" width="94" height="58" class="alignleft size-full wp-image-206" /></a>I am such a happy Maytag customer! I used to laugh at the Maytag Repair Man ads. Not anymore. They have earned my complete respect for their safe and reliable appliances. When you buy a washer/dryer, you want reliability. Maytag is the real McCoy.</p>
]]></content:encoded>
			<wfw:commentRss>http://adrianotto.com/2009/11/maytag-repair-man-reports-no-problem/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Remus Project: Full Memory Mirroring!</title>
		<link>http://adrianotto.com/2009/11/remus-project-full-memory-mirroring/</link>
		<comments>http://adrianotto.com/2009/11/remus-project-full-memory-mirroring/#comments</comments>
		<pubDate>Thu, 12 Nov 2009 22:30:10 +0000</pubDate>
		<dc:creator>Adrian Otto</dc:creator>
				<category><![CDATA[Cloud]]></category>
		<category><![CDATA[Development]]></category>
		<category><![CDATA[Linux]]></category>
		<category><![CDATA[VoIP]]></category>
		<category><![CDATA[memcached]]></category>
		<category><![CDATA[Remus]]></category>
		<category><![CDATA[Xen]]></category>

		<guid isPermaLink="false">http://adrianotto.com/?p=163</guid>
		<description><![CDATA[Imagine that you have a cluster with two machines side by side in an active/standby configuration. Let&#8217;s say you have your data replicated, and the systems are basically identical except for the IP address and hostname. You can use heartbeat to share an IP address such that if the primary fails, the secondary takes over. [...]]]></description>
			<content:encoded><![CDATA[<p><img class="alignright size-full wp-image-166" title="Mirrored Servers" src="http://adrianotto.com/wp-content/uploads/2009/11/server-mirror.jpg" alt="Mirrored Servers" width="130" height="90" />Imagine that you have a cluster with two machines side by side in an active/standby configuration. Let&#8217;s say you have your data replicated, and the systems are basically identical except for the IP address and hostname. You can use heartbeat to share an IP address such that if the primary fails, the secondary takes over. You can also perform the equivalent using &#8220;live migration&#8221; features in a Xen or VMWare hypervisor. The problem with these sorts of fail-overs is that any active TCP/IP sessions end up getting broken, and new connections must be established between clients and the application.</p>
<p>Okay, here&#8217;s something that fixes that problem: the <a href="http://dsg.cs.ubc.ca/remus/" target="_blank">Remus Project</a>. The approach is brilliant. At regular intervals it ships the changed memory pages from one host to the other. Memory reads do not need to be replicated, only writes, and writes to the same location don&#8217;t all need to be replicated, only the most recent write. The primary node simply delays its response to TCP/IP packets (output buffering) until after it has confirmed that the standby node has received the replicated memory data. Very, very clever.</p>
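<p>To make the idea concrete, here is a minimal Python sketch of the checkpoint-and-release cycle described above. It is a toy model, not Remus code: the <code>Primary</code> and <code>Backup</code> classes and their methods are hypothetical names invented for illustration, and memory is modeled as a plain dictionary of pages.</p>
<pre>
```python
class Backup:
    """Standby node: stores whatever pages the primary ships to it."""

    def __init__(self):
        self.pages = {}

    def apply(self, dirty_pages):
        # Merge the shipped pages and acknowledge receipt.
        self.pages.update(dirty_pages)
        return True


class Primary:
    """Active node: tracks dirty pages and buffers outgoing packets."""

    def __init__(self, backup):
        self.backup = backup
        self.pages = {}
        self.dirty = {}    # page number to latest value since last checkpoint
        self.outbox = []   # packets held back from clients

    def write(self, page, value):
        self.pages[page] = value
        # Repeated writes to one page overwrite each other here,
        # so only the most recent write is ever replicated.
        self.dirty[page] = value

    def send(self, packet):
        # Output buffering: nothing leaves until the next checkpoint is acked.
        self.outbox.append(packet)

    def checkpoint(self):
        acked = self.backup.apply(dict(self.dirty))
        if not acked:
            return []      # keep buffering; the backup is not up to date
        self.dirty.clear()
        released, self.outbox = self.outbox, []
        return released    # now safe to put on the wire


backup = Backup()
primary = Primary(backup)
primary.write(7, "first")
primary.write(7, "second")   # same page written twice in one interval
primary.send("HTTP 200 OK")
released = primary.checkpoint()
print(released, backup.pages)
```
</pre>
<p>In this sketch the client only ever sees a response after the standby holds the memory state that produced it, which is why a fail-over would not contradict anything the client has already been told.</p>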
<p>Here are the key features listed on the Remus web site:</p>
<ul>
<li>The backup VM is an <em>exact copy</em> of the primary VM. When     failure happens, it continues running on the backup host as if     failure had never occurred.</li>
<li>The backup is <em>completely up-to-date</em>. Even active TCP     sessions are maintained without interruption.</li>
<li>Protection is <em>transparent</em>. Existing guests can be     protected without modifying them in any way.</li>
</ul>
<p><a href="http://www.xen.org/"><img class="alignright size-full wp-image-170" title="Xen Logo" src="http://adrianotto.com/wp-content/uploads/2009/11/xen_logo.gif" alt="Xen Logo" width="149" height="67" /></a>Okay, I&#8217;ve been running HA systems in multiple geographies now for about a decade. I&#8217;ve experimented with lots and lots of clustering and replication technology. Most of the time when I hear about something new, I cringe and wonder if it&#8217;s just another thing that&#8217;s using the same old tricks I&#8217;ve been using for years, or if it&#8217;s something truly innovative and truly <a href="http://en.wikipedia.org/wiki/Open_source" target="_blank">open source</a>. Before you go making comments that VMware has this feature or that feature, relax. This post is not about VMware. It&#8217;s about open source Xen.</p>
<p>Now, you might already be wondering if this would work if you separated the two nodes to run in separate locations. The short answer is maybe. You would still need a very clever network configuration to re-route your traffic dynamically to the new location. For those of us that do operate our own Autonomous Systems, that may seem possible with a BGP route update. But here&#8217;s the bummer&#8230; The additional latency it would introduce would bring your performance to a screeching halt. You could probably afford to have about 25ms of average latency between two locations and get away with it. The cut-over would still be better than nothing, but you&#8217;d better have a rock solid network in there, and you&#8217;d better be ready to pump lots of bandwidth over it. Plan for 100Mb/sec if you checkpoint every 100ms.</p>
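<p>As a rough back-of-the-envelope check on those numbers, the Python sketch below works out how much changed memory fits into each checkpoint at 100Mb/sec with a 100ms interval. The function name is made up for illustration, and the 4 KiB page size is an assumption rather than a figure from the Remus project.</p>
<pre>
```python
def checkpoint_payload_bytes(link_mbps, interval_ms):
    """Bytes that a link of link_mbps can carry in one checkpoint interval."""
    bits_per_second = link_mbps * 1_000_000
    bits_per_checkpoint = bits_per_second * interval_ms / 1000
    return bits_per_checkpoint / 8


PAGE_SIZE = 4096  # assumed page size in bytes (4 KiB)

payload = checkpoint_payload_bytes(100, 100)   # 100 Mb/sec, 100 ms interval
pages = int(payload / PAGE_SIZE)

print(f"{payload:,.0f} bytes per checkpoint, about {pages} pages")
```
</pre>
<p>Under these assumptions each checkpoint can carry roughly 1.25 MB, or about 305 dirty pages. A workload that dirties memory faster than that would need a fatter pipe or a longer checkpoint interval.</p>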
<p><a href="http://www.memcached.org/"><img class="size-full wp-image-164 alignright" style="margin-left: 10px; margin-right: 10px;" title="memcached logo" src="http://adrianotto.com/wp-content/uploads/2009/11/memcache_logo.png" alt="memcache_logo" hspace="10" width="76" height="75" /></a>This would be great for a read-heavy application like a web cache, or some <a href="http://www.memcached.org" target="_blank">memcached</a> applications. People ask on the memcached mailing list all the time how they can set up replication and HA. The answer is always &#8220;it&#8217;s a cache&#8230; not a database.&#8221; Well, for those of you that want to do HA for a memcached system, give Remus a try.</p>
<p><img class="alignright size-full wp-image-174" title="trixbox logo" src="http://adrianotto.com/wp-content/uploads/2009/11/trixbox_logo.png" alt="trixbox logo" />Let&#8217;s not stop there. Imagine you have a SIP call control platform or <a href="http://www.trixbox.org/" target="_blank">Trixbox</a> system, and you don&#8217;t want to lose all your active calls in the event of a system crash. Pretty much any mission-critical application that supports long-running connections over TCP/IP could benefit.</p>
<p>Remus has been around for some time, so why am I so excited now? It&#8217;s now part of <a href="http://www.xen.org" target="_blank">Xen</a>! You don&#8217;t need to do anything special on the master or slave node to use it! Whoot! Now I&#8217;m impressed. Anyone out there have experience running it? I&#8217;d love to hear your thoughts.</p>
]]></content:encoded>
			<wfw:commentRss>http://adrianotto.com/2009/11/remus-project-full-memory-mirroring/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Advice for backing up your Macs</title>
		<link>http://adrianotto.com/2009/11/advice-for-backing-up-your-macs/</link>
		<comments>http://adrianotto.com/2009/11/advice-for-backing-up-your-macs/#comments</comments>
		<pubDate>Thu, 12 Nov 2009 06:23:21 +0000</pubDate>
		<dc:creator>Adrian Otto</dc:creator>
				<category><![CDATA[Cloud]]></category>
		<category><![CDATA[General]]></category>
		<category><![CDATA[Backup]]></category>
		<category><![CDATA[OSX]]></category>

		<guid isPermaLink="false">http://adrianotto.com/?p=149</guid>
		<description><![CDATA[My wife asked me today if I could give a colleague some advice for how to back up a bunch of Macs. I&#8217;ll share my advice with you here. Over the past two decades I&#8217;ve used so many different backup systems and software and hardware combinations, I can&#8217;t even count them all. So this raises the [...]]]></description>
			<content:encoded><![CDATA[<p>My wife asked me today if I could give a colleague some advice for how to back up a bunch of Macs. I&#8217;ll share my advice with you here. Over the past two decades I&#8217;ve used so many different backup systems and software and hardware combinations, I can&#8217;t even count them all. So this raises the question, what do I do at home?</p>
<p><a href="http://www.apple.com/macosx/what-is-macosx/time-machine.html"><img title="Time Machine" src="http://images.apple.com/macosx/what-is-macosx/images/timemachine_title20090608.jpg" alt="" hspace="10" width="285" height="67" /></a></p>
<p>I use the <a href="http://www.apple.com/macosx/what-is-macosx/time-machine.html">Time Machine</a> software built into Leopard (and newer) <a href="http://www.apple.com/macosx/" target="_blank">OSX</a>. I use a locally connected USB 2.0 drive. A Firewire drive would also be good. Here is a drive that I like because it has lots of capacity, is reasonably affordable and compact, and runs quietly.</p>
<p><a href="http://www.buy.com/prod/fantom-greendrive-1tb-usb-2-0-and-esata-external-hard-drive-2-year/q/loc/101/208503758.html" target="_blank"><img class="alignright size-full wp-image-154" title="Fantom Drive" src="http://adrianotto.com/wp-content/uploads/2009/11/FantomDrive.png" alt="Fantom Drive" width="163" height="190" />Fantom GreenDrive Pro 2TB eSATA and USB 2.0 7200RPM 32MB External Hard Drive</a></p>
<p>Another that&#8217;s half the capacity, but cheaper:</p>
<p><a href="http://www.buy.com/prod/fantom-greendrive-1tb-usb-2-0-and-esata-external-hard-drive-2-year/q/loc/101/208503758.html" target="_blank">Fantom GreenDrive 1TB USB 2.0 and eSATA External Hard Drive</a></p>
<p>Now, for home use a 2TB drive is probably enough for all your computers. At first I networked them all together to use just one drive on one of my computers shared to all the others so that all the backups were on the one big drive. I later decided that every computer should have its own drive for backups. Why? A few reasons:</p>
<ol>
<li>To conserve electricity. Backup snapshots should be taken and archived while you are using the computer. With a shared drive, the host machine may not respond over the network when it is asleep, depending on how it&#8217;s set up, meaning you need to keep that host machine powered up all the time, wasting electricity.</li>
<li>Each computer does its backups when it gets used, and in the idle time before it falls asleep again. It works much better for me this way.</li>
<li>Immediate restores. Having a local drive on each computer makes restoration super fast. It&#8217;s not like a network or tape backup where you need to wait for your data to transfer back on to your hard drive to begin using it.</li>
</ol>
<p>It&#8217;s easy to set up Time Machine. Connect the drive, open &#8220;Time Machine Preferences&#8221; and select the drive.</p>
<p>I re-initialized mine using Disk Utility first so that it had a journaled Mac OS filesystem on it instead of the default FAT formatting that comes from the factory.</p>
<p><a href="http://www.apple.com/macosx/what-is-macosx/time-machine.html"><img class="alignleft" style="margin-left: 10px; margin-right: 10px;" title="Time Machine Desktop" src="http://images.apple.com/macosx/what-is-macosx/images/timemachine_imac20090608.jpg" alt="" hspace="10" width="267" height="220" /></a></p>
<p>One really nice thing about Time Machine is that you can easily revert to a prior point in time in the event you accidentally mess something up, get a virus, or whatever. It&#8217;s about the easiest tool I&#8217;ve ever used. It rotates backups hourly, daily, weekly, etc. and deletes old backups to make room for new ones. It&#8217;s totally automatic, whereas with other tools you need to set that all up yourself.</p>
<p>This sort of local backup does not help if your house or office gets burglarized or burns down because you lose both the primary and backup copy of the data.</p>
<p><img class="alignright size-full wp-image-156" title="Jungle Disk" src="http://adrianotto.com/wp-content/uploads/2009/11/jd-logo.png" alt="Jungle Disk" width="222" height="50" />Another option is to use <a href="http://www.jungledisk.com/" target="_blank">JungleDisk</a> to back your data up to the cloud. That has two advantages: you only pay for the storage you actually use, and the backups are off site, so if you have a theft or fire, you can still restore, potentially somewhere else. A disadvantage is that it requires adequate internet connectivity. Your upload speed needs to be fast enough to accommodate all of the data you produce within each backup interval. If your network is already constrained on available bandwidth, running backups over it could potentially aggravate matters. In short, if you have a big fat internet connection, then use <a href="http://www.jungledisk.com/" target="_blank">JungleDisk</a>.</p>
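<p>If you want to check whether your connection is fat enough, the arithmetic is simple. The Python sketch below is only an illustration: the function name and the example figures (900 MB of changed data per hourly backup) are made up, and it ignores protocol overhead and compression.</p>
<pre>
```python
def required_upload_mbps(changed_megabytes, interval_hours):
    """Sustained upload rate (in Mb/sec) needed to ship the changed data
    before the next backup interval starts."""
    bits = changed_megabytes * 8 * 1_000_000
    seconds = interval_hours * 3600
    return bits / seconds / 1_000_000


def can_keep_up(upload_mbps, changed_megabytes, interval_hours):
    """True if the uplink can carry each interval's changes in time."""
    return upload_mbps >= required_upload_mbps(changed_megabytes, interval_hours)


needed = required_upload_mbps(900, 1)   # hypothetical: 900 MB changed per hour
print(f"Need about {needed:.1f} Mb/sec of sustained upload")
print("A 1 Mb/sec uplink keeps up:", can_keep_up(1, 900, 1))
```
</pre>
<p>With those example numbers you would need about 2 Mb/sec of sustained upload, so a 1 Mb/sec uplink would fall further behind with every backup.</p>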
]]></content:encoded>
			<wfw:commentRss>http://adrianotto.com/2009/11/advice-for-backing-up-your-macs/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Scale -&gt; Complexity -&gt; Reliability -&gt; Support</title>
		<link>http://adrianotto.com/2009/09/scale-complexity-reliability-support/</link>
		<comments>http://adrianotto.com/2009/09/scale-complexity-reliability-support/#comments</comments>
		<pubDate>Fri, 25 Sep 2009 15:50:15 +0000</pubDate>
		<dc:creator>Adrian Otto</dc:creator>
				<category><![CDATA[Cloud]]></category>
		<category><![CDATA[Development]]></category>
		<category><![CDATA[Linux]]></category>

		<guid isPermaLink="false">http://adrianotto.com/?p=137</guid>
		<description><![CDATA[Linux magazine released an article today by Joe Brockmeier titled Rethinking Gmail: Reliability Matters. The article makes some good points, and makes an obvious statement that to some, email is a mission critical application. I don&#8217;t dispute the points. I&#8217;d like to discuss why these systems fail to begin with, and how as an end [...]]]></description>
			<content:encoded><![CDATA[<p>Linux magazine released an article today by Joe Brockmeier titled <a href="http://www.linux-mag.com/id/7542" target="_blank">Rethinking Gmail: Reliability Matters</a>. The article makes some good points, and makes an obvious statement that to some, email is a mission critical application. I don&#8217;t dispute the points. I&#8217;d like to discuss why these systems fail to begin with, and how as an end user you can have realistic expectations for web scale systems.</p>
<p>First of all, running a &#8220;web scale&#8221; application means you have millions of end users. Running a system at that scale demands a certain level of complexity. A &#8220;cloud computing&#8221; system used to address &#8220;web scale&#8221; requirements drives complexity. The more complex a system is, the higher the risk that it will fail as a result of its own complexity. Therefore, web scale systems are more difficult to provide on a reliable basis than simpler systems.</p>
<p>The simple truth of the matter is that all systems fail at one time or another. No matter how well designed a system is, and how well you test it, eventually something will happen that you were not prepared for, and an outage will occur. System designers must be disciplined to plan for potential problems so they can be predicted and mitigated before they occur in production. However, it&#8217;s only a matter of time until an outage does occur. Anyone who tells you that you can have a perfect reliability record forever is a blathering idiot. Don&#8217;t be tempted to align your expectations based on what idiots say.</p>
<p>Can you design a system to be highly reliable? Of course. Can a complex system exhibit a reliability record that&#8217;s higher than a simple one? Of course. However, if the system is driven by software, and that software is complex, then it will contain human errors in a ratio proportional to its complexity. Simply put, the more code there is, the more chance it will contain bugs, or design defects. Yes, these can be mitigated, but I maintain that this problem cannot be solved 100%, and that unsolved defects eventually lead to service outages.</p>
<p>Not convinced? In 1986 the <a href="http://en.wikipedia.org/wiki/Space_Shuttle_Challenger_disaster" target="_blank">Space Shuttle <em>Challenger</em></a> exploded. Why? Because the decision making procedures were flawed. Human error ultimately resulted in the death of seven astronauts. Blame the problem on a mechanical failure of an O-ring? No. Flawed O-ring design and a bad decision making process led to death. The same thing happens in computer networks. Even when the software or configurations are not flawed, human error can still lead to system outages. It happens all the time.</p>
<p>Ever heard of a service provider offering a 100% uptime guarantee? You think that means they are going to be up 100% of the time. No, it does not. It means that you will get a discount on your next bill if the system is not up 100% of the time. In severe cases it may give you the option to terminate your service contract. That&#8217;s it, plain and simple. If you look long and hard at these guarantees, you will see that the penalties never compensate you for the actual damage of the service being unavailable. It&#8217;s a marketing tactic.</p>
<p>As an end user of web scale systems, set some realistic expectations for yourself. The system will break sometimes. I&#8217;m sure that your service providers will do everything they reasonably can to avoid outages. In his article, Brockmeier makes a good point that for free services there&#8217;s no simple way to extend you a discount. That does not mean that they care any less about uptime. They care. The bottom line is that ALL large scale systems have an imperfect reliability record. Compare Gmail&#8217;s reliability record with your own internal corporate email systems. Your reliability is higher? You lie! Measure it, and be honest.</p>
<p>So now that we are being honest, and expect that sometimes systems will fail, I&#8217;d like to make my main point. <strong>When systems do fail, keeping customers satisfied is about how you <em>respond</em> to the problem, and how you commit to fixing it so that it won&#8217;t keep happening</strong>. To do this well, here are some guidelines:</p>
<p>1) <strong>No Excuses</strong>. Customers don&#8217;t want to hear about how this problem is not your fault, or how you never expected this. Simply accept responsibility. Be sincere and humble, and commit to taking care of the problem.</p>
<p>2) <strong>Communicate</strong>. Focusing all your energy on the solution and ignoring the suffering subscriber base during an outage is a mistake. Take enough time to get your facts together, verify them, and use them to keep your subscribers well informed during an outage. If you notice a significant outage before your customers do, find a way to tell them before they notice. They will appreciate your proactive notification.</p>
<p>3) <strong>Analyze and Correct</strong>. Once service is restored, scrutinize the problem&#8217;s root cause, and find a way to prevent a recurrence of the problem.</p>
<p>I could keep listing more and more things here, but these three are the most important to remember.</p>
<p>In conclusion, I agree 100% with Brockmeier&#8217;s article, but there is more to the story. Reliability does matter. But in addition, realistic expectations matter just as much.</p>
]]></content:encoded>
			<wfw:commentRss>http://adrianotto.com/2009/09/scale-complexity-reliability-support/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
