Latest Publications

CPU Time stolen from a virtual machine?

Those of you studying the vmstat(8) man page may be wondering what the ‘st’ figure in the CPU columns is. The manual describes it as “Time stolen from a virtual machine”. More specifically:

It’s the time the hypervisor scheduled something else to run instead of something within your VM. This might be time for another VM, or for the hypervisor host itself. If no time were stolen, this time would be used to run your CPU workload or your idle thread.

There is some disagreement circulating about whether the hypervisor will steal idle time, or only preempted time. In other words, it has been suggested that stolen time only counts periods when the local kernel scheduler within the VM wanted to run something but the hypervisor made that impossible. I have found that stolen time does in fact count borrowed idle time, where the local scheduler actually had nothing to run. For example, here are some vmstat values from a VM that has a very low CPU workload on it:

vmstat -S M 1 10
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  0    121     42     53    460    0    0     0     1    0    1  0  0 89  0 10
 0  0    121     42     53    460    0    0     0    28 1014   39  0  0 90  0 10
 0  0    121     42     53    460    0    0     0     0 1016   36  0  0 91  0  9
 0  0    121     42     53    460    0    0     0     0 1024   32  0  0 93  0  7
 0  0    121     42     53    460    0    0     0     0 1019   40  0  0 91  0  9
 0  0    121     42     53    460    0    0     0     0 1015   32  0  0 90  0 10
 0  0    121     42     53    460    0    0     0     0 1022   34  0  0 92  0  8
 0  0    121     42     53    460    0    0     0     0 1016   36  0  0 91  0  9
 0  0    121     42     53    460    0    0     0     0 1013   34  0  0 92  0  8
 0  0    121     42     53    460    0    0     0     0 1028   43  0  0 93  0  7

As you can see, user time (us), system time (sy), and iowait time (wa) are zero, but the id column is not 100%. This normally indicates that your system is doing something, but in this case the true idle time is the sum of the id and st columns.
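To see where a figure like this comes from, here is a minimal sketch that derives the same percentages from two samples of the aggregate "cpu" line in /proc/stat, where steal is the eighth counter. The function names are my own for illustration, and I am assuming a Linux kernel recent enough to report the steal field (older kernels are treated as zero steal).

```python
import time

# Order of the first eight counters on the aggregate "cpu" line of /proc/stat.
FIELDS = ["user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal"]


def parse_cpu_line(line):
    """Return the first eight tick counters from a 'cpu ...' line."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    values = [int(v) for v in parts[1:1 + len(FIELDS)]]
    # Very old kernels report fewer fields; treat the missing ones as zero.
    values += [0] * (len(FIELDS) - len(values))
    return values


def cpu_percentages(before, after):
    """Percentage of time spent in each state between two samples."""
    deltas = [b - a for a, b in zip(parse_cpu_line(before), parse_cpu_line(after))]
    total = sum(deltas)
    if total <= 0:
        return {name: 0.0 for name in FIELDS}
    return {name: 100.0 * d / total for name, d in zip(FIELDS, deltas)}


def sample_cpu(interval=1.0, path="/proc/stat"):
    """Take two samples `interval` seconds apart and return the percentages."""
    with open(path) as f:
        first = f.readline()
    time.sleep(interval)
    with open(path) as f:
        second = f.readline()
    return cpu_percentages(first, second)
```

Calling sample_cpu()["steal"] on the VM above should report a value in the same range as the st column, and adding the "idle" and "steal" entries gives the combined idle figure discussed here.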

In this example, I really don’t care that I have a nonzero st column because my workload is basically idle all the time anyway.

If you are on a cloud host where you purchase a small sliver of a server, you should expect to see nonzero values in this column when you run vmstat. If you have a heavy CPU load and need more processing power, you can solve this problem by upgrading to a larger VM server size so that you command a larger portion of the physical host.

ED Strikes Again?

It’s not the ED you are thinking of. Nope, it’s actually the External Dependency.

One piece of advice that I continually dispense is to try to reduce dependencies on remote web sites when coding your own. The problem strikes most dramatically when you run a very busy site, and you have some feed or resource that you download from a remote site. That remote site crashes, and oops, so does yours. It also happens when your busy site generates more requests to the remote site than the remote site can handle.

I ran into this again today. One site that I host was consuming a remote feed from a site that has a much smaller capacity than my customer does. The site on my end gets over 10 million page views a day (peak ~2000 page views per second). The capacity mismatch became very apparent when something went wrong on the remote end.

The code logic was:

  1. If you have a cached version of the feed, and it’s fresh, then use it.
  2. If the cached entry is expired, then fetch a new one, and replace the one in cache.

This logic is fundamentally flawed for busy sites. It seems sensible, but think about what happens when the cached entry expires, and the remote site is responding very slowly. All of a sudden a stampede of requests starts stacking up, all trying to get the feed in parallel. It crashes the remote site even worse. The remote site tries to reboot, and you quickly crash it again. The sequence repeats indefinitely.

Why? Because the window of time during which the cache is invalid gets wider and wider as the remote site gets slower and slower. The longer that window is open, the more traffic the remote site will get from cache misses.
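A rough back-of-the-envelope sketch shows how quickly that window fills up. It assumes, as a simplification of my own, that every page view arriving while the cache is expired starts its own fetch, and it uses the ~2000 page views per second peak mentioned above.

```python
def fetches_in_flight(page_views_per_second, remote_response_seconds):
    """Approximate number of parallel feed fetches while the cache is expired.

    Assumes every page view that arrives during the expired window triggers
    its own fetch, which is what the naive cache logic does.
    """
    return round(page_views_per_second * remote_response_seconds)


# A healthy remote site answering in 0.2 seconds already sees a burst of
# requests at peak; a struggling one answering in 5 seconds sees far more.
healthy = fetches_in_flight(2000, 0.2)
struggling = fetches_in_flight(2000, 5)
```

The slower the remote site answers, the more requests pile onto it, which slows it further.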

A clean solution is to update the cache asynchronously using a scheduled batch job that keeps a local cache of the data. Only replace the cached data when the feed has actually changed. The logic in the web application changes to:

  1. Always use the data in the cached file.

The feed site is consulted at regular intervals using a scheduled batch job (cron), and the cached data is updated if it’s able to get a response. If the remote site is down or too slow, then the application simply continues to use the version it had before. Problem solved!
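Here is a minimal sketch of what such a batch job could look like. The feed URL, cache path, and function names are placeholders I made up for illustration. The important details are the timeout on the fetch and the atomic rename, so that the web application never sees a half-written file.

```python
import os
import tempfile
import urllib.request

FEED_URL = "https://feeds.example.com/news.xml"  # hypothetical feed location
CACHE_PATH = "/var/cache/myapp/feed.xml"         # hypothetical cache file


def fetch_feed(url=FEED_URL, timeout=10):
    """Download the feed, giving up after `timeout` seconds."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def refresh_cache(fetch, cache_path=CACHE_PATH):
    """Replace the cache file only if the fetch succeeds.

    Returns True if the cache was updated, False if the old copy was kept.
    """
    try:
        data = fetch()
    except Exception:
        return False  # remote site down or too slow: keep the old copy
    if not data:
        return False  # an empty response is treated as a failure
    directory = os.path.dirname(os.path.abspath(cache_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
        os.chmod(tmp_path, 0o644)          # mkstemp creates the file as 0600
        os.replace(tmp_path, cache_path)   # atomic on the same filesystem
    except OSError:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        return False
    return True


def read_cached_feed(cache_path=CACHE_PATH):
    """What the web application does on every request: read the local file."""
    with open(cache_path, "rb") as f:
        return f.read()
```

A script that calls refresh_cache(fetch_feed) could then be run from cron every few minutes, for example with a crontab entry such as `*/5 * * * * /usr/bin/python3 /opt/myapp/refresh_feed.py` (the script path is again hypothetical).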

Why is this not a best practice for all web developers? Because most web sites don’t get enough traffic for it to matter much. But, if you’ve got a busy site, and you don’t want it to crash when your remote feeds do, then you might want to consider getting that data asynchronously, or at least use a cache update procedure that’s serialized.

Here is an example of a non-blocking serialization approach that works for PHP applications.
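To give a feel for the idea, here is a rough Python sketch of non-blocking serialization. This is not the PHP example referred to above. It is my own illustration, and it assumes a Unix-like system where flock(2) advisory locks are available. Only the one request that wins the lock refreshes the cache, and everyone else immediately carries on with the stale copy instead of waiting.

```python
import fcntl


def refresh_if_unlocked(lock_path, refresh):
    """Run refresh() only if no other process is already refreshing.

    Returns True if this caller performed the refresh, or False if another
    caller holds the lock (in which case the stale cache should be served).
    """
    with open(lock_path, "a") as lock_file:
        try:
            # LOCK_NB makes the call fail immediately instead of waiting.
            fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except (BlockingIOError, PermissionError):
            return False
        try:
            refresh()
            return True
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)
```

With this in place, at most one fetch hits the remote site at a time, no matter how many page views arrive while the cache is expired.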

So all you web developers out there who like to consume RSS feeds on the server-side of your web application… don’t say I didn’t warn you. Go look at all your code and make sure you don’t have a dependency on a remote site. If you do, you now know at least two ways to solve that problem.

Putting Entropy in the Cloud

I was browsing through Twitter mentions of @adrian_otto and found one posted by Ian Thompson mentioning an article about weak randomness in the cloud. The article suggests that a cloud server or instance may have insufficient entropy sources, so different cloud servers may start with similar or even identical entropy pools (or, worse yet, identical host keys) when they are created. That would make their random number sequences easier to guess, and the encryption algorithms that depend on them easier to break.

Yes, if servers have similar entropy pools, it is easier to break encryption that depends on them. It’s reasonably easy to work around this and make sure your entropy pool is uniquely initialized. You can consult the random(4) manual page for the Linux kernel for information about how to seed your entropy pool with a particular set of data. If you are running an application in the cloud that uses encryption, and you are concerned about the initial state of your entropy pool, you can address that with this procedure:

1) Seed your own pool from a long running system that has sufficient entropy in it, rather than relying on what you read from the kernel at startup.

2) Produce a network service that you use to seed your initial entropy pools. This service could be as simple as a set of entropy files that you create at pseudo-random time intervals and discard as you serve them to cloud server instances (as they boot up), so you never serve the same one twice. At boot time, your VM simply connects to wherever you run this service and downloads an input file to seed its entropy pool with. Restrict access so that the service is only available to your own server instances.

3) Make sure that your custom entropy pool initialization takes place prior to starting your encryption software.

4) If you are creating an AMI, or other server image that you plan to clone, be sure that it does not have a host key generated yet. Delete it and allow your initialization scripts to create it when the server is created (after the entropy pool has been seeded), rather than making copies of the same one.
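The seeding part of the procedure above can be sketched as follows. This assumes the seed file has already been downloaded from your own seeding service to a local path (the function and parameter names are mine). Per the random(4) manual page, writing to /dev/urandom mixes the data into the kernel pool but does not raise the kernel's entropy estimate; crediting entropy requires the privileged RNDADDENTROPY ioctl, which this sketch does not use.

```python
def seed_entropy_pool(seed_path, pool_device="/dev/urandom"):
    """Mix the contents of a seed file into the kernel entropy pool.

    Returns the number of seed bytes written. `pool_device` is a parameter
    only so that the function can be exercised against an ordinary file.
    """
    with open(seed_path, "rb") as seed_file:
        seed = seed_file.read()
    if not seed:
        raise ValueError("seed file is empty")
    with open(pool_device, "wb") as pool:
        pool.write(seed)
    return len(seed)
```

Run this from an early boot script, before the host key is generated and before any encryption software starts.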

If you don’t trust what /dev/random or /dev/urandom emit, you can optionally use OpenSSL with prngd or egd as alternate entropy sources, and potentially feed in your own sensory input data. If you want to go hardcore, you could add environmental noise such as resistor noise on the microphone input of a sound card, or some other sensory data. There is existing software for doing just that. There are all sorts of possibilities. Among them are a number of hardware RNG solutions, most of which are pretty expensive and are not options for a cloud environment. Random numbers are also available as a service from various providers.

There are things that we can do as Cloud Computing service providers to pre-initialize your entropy pools for you when the given server instance is created so the procedure above would be redundant. This still leaves the question as to the quality of the RNG available to you on a cloud server.

There are two standard randomness sources that you should know about:

/dev/random = produces actual entropy, if you have some, and blocks otherwise.
/dev/urandom = produces available entropy regardless of quality, but does not block.

The Linux kernel has a paravirtual entropy driver which provides kernel-side support for the virtual RNG hardware. The kernel compile option CONFIG_HW_RANDOM_VIRTIO enables it, and it can be built as a kernel module. There are drivers that run within the hypervisor host kernel that connect this with the RNG hardware available on the server (if any).

drivers/char/hw_random/amd-rng.ko = H/W RNG driver for AMD chipsets
drivers/char/hw_random/intel-rng.ko = H/W RNG driver for Intel chipsets
drivers/char/hw_random/virtio-rng.ko = VirtIO Random Number Generator support

How it works: the hypervisor host (dom0) runs rngd to read data from /dev/hwrandom (using the Intel or AMD modules mentioned above) and feed it into /dev/random. The guest VM (domU) then does the same thing: you run rngd in the guest to feed the data from the virtual RNG device into the guest kernel. The rngd daemon can mix data from both /dev/random and /dev/urandom so you get as much random data as you need in a non-blocking fashion. You can consult the kernel source code to learn more.

What happens if multiple guest VMs are reading this data at the same time using this arrangement? I’m not sure whether it’s possible to deplete the entropy pool of the hypervisor host and produce PRNG patterns that are therefore less random. If one guest VM emptied the entropy pool by aggressively reading from the /dev/hwrandom device, it might cause someone else’s guest VM to get less data. This could be solved if there were simply a rate limit enforced on the consumption of RNG data allowed per guest VM. There is further discussion of that as well.

The truth is that for most needs you can have reasonably secure encryption by simply having an ordinary PRNG source like /dev/urandom that’s properly initialized with random data. I suggest that you use that approach in your cloud deployments.
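In application code you rarely need to open the device yourself. As a small illustration (the function name is mine), Python's standard secrets module draws from the operating system's random source via os.urandom, so it benefits directly from a properly seeded pool:

```python
import secrets


def new_session_key(num_bytes=32):
    """Return `num_bytes` of key material from the OS random source."""
    return secrets.token_bytes(num_bytes)
```

The quality of what this returns depends on the kernel pool having been initialized with good data at boot.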