ED Strikes Again?
It’s not the ED you are thinking of. Nope, it’s actually the External Dependency.
One piece of advice that I continually dispense is to try to reduce dependencies on remote web sites when coding your own. The problem strikes most dramatically when you run a very busy site, and you have some feed or resource that you download from a remote site. That remote site crashes, and oops, so does yours. It also happens when your busy site sends more requests to the remote site than that site can handle.
I ran into this again today. One site that I host was consuming a remote feed from a site that has a much smaller capacity than my customer does. The site on my end gets over 10 million page views a day (peak ~2000 page views per second). The capacity mismatch became very apparent when something went wrong on the remote end.
The code logic was:
- If you have a cached version of the feed, and it’s fresh, then use it.
- If the cached entry is expired, then fetch a new one, and replace the one in cache.
This logic is fundamentally flawed for busy sites. It seems sensible, but think about what happens when the cached entry expires, and the remote site is responding very slowly. All of a sudden a stampede of requests starts stacking up, all trying to get the feed in parallel. That makes things even worse for the remote site and can crash it outright. The remote site tries to reboot, and you quickly crash it again. The sequence repeats indefinitely.
Why? Because the window of time during which the cache is invalid gets wider and wider as the remote site gets slower and slower. The longer that window is open, the more traffic the remote site will get from cache misses.
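To put rough numbers on that window, here is a back-of-the-envelope sketch. It assumes every request that arrives while a refresh is in flight misses the cache and fires its own fetch. The 2000 requests per second is the peak figure mentioned above; the latencies and the function name are made up for illustration.

```python
def stampede_size(requests_per_second: float, fetch_seconds: float) -> int:
    """Estimate how many requests miss the cache while one slow fetch is in flight.

    Assumes every request arriving during the fetch also misses and
    starts its own fetch against the remote site.
    """
    return round(requests_per_second * fetch_seconds)


# Hypothetical remote latencies, from healthy to badly overloaded.
for latency in (0.2, 2.0, 20.0):
    misses = stampede_size(2000, latency)
    print(f"{latency:5.1f}s fetch -> roughly {misses} parallel requests to the remote site")
```

The estimate grows linearly with the remote latency, which is exactly the feedback loop: the slower the remote site gets, the more requests you throw at it.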
A clean solution is to update the cache asynchronously using a scheduled batch job that keeps a local cache of the data. Only replace the cached copy when the feed has actually changed. The logic in the web application changes to:
- Always use the data in the cached file.
The feed site is consulted on regular intervals using a scheduled batch job (cron), and the cached data is updated if it’s able to get a response. If the remote site is down or too slow, then the application simply continues to use the version it had before. Problem solved!
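As a minimal sketch of such a batch job, assuming a Python script run from cron: the feed URL, the cache path, and the function names below are all hypothetical. The script fetches the feed with a timeout, leaves the old file alone if anything goes wrong or nothing changed, and otherwise swaps the new copy in atomically so the web application never reads a half-written file.

```python
import os
import tempfile
import urllib.request

FEED_URL = "https://feeds.example.com/news.xml"  # hypothetical remote feed
CACHE_PATH = "/var/cache/myapp/feed.xml"         # hypothetical local cache file


def fetch_feed(url: str = FEED_URL, timeout: float = 10.0) -> bytes:
    """Download the feed, giving up after `timeout` seconds."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()


def refresh_cache(fetch, cache_path: str) -> bool:
    """Replace the cache file with fresh data. Return True only if it changed."""
    try:
        data = fetch()
    except Exception:
        return False  # remote is down or too slow: keep the previous version
    if not data:
        return False  # an empty response is treated as a failure
    try:
        with open(cache_path, "rb") as f:
            if f.read() == data:
                return False  # feed has not changed, nothing to do
    except FileNotFoundError:
        pass  # first run: no cached copy yet

    # Write to a temp file in the same directory, then rename over the
    # old file so readers see either the old copy or the new one.
    directory = os.path.dirname(cache_path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
        os.chmod(tmp_path, 0o644)  # mkstemp creates 0600; let the web server read it
        os.replace(tmp_path, cache_path)
    except OSError:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    return True


def main() -> None:
    """Entry point for the cron job."""
    refresh_cache(fetch_feed, CACHE_PATH)
```

Saved as a script that calls `main()`, this could be scheduled with a crontab line along the lines of `*/5 * * * * python3 /path/to/refresh_feed.py`. The web application itself only ever opens the cache file.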
Why is this not a best practice for all web developers? Because most web sites don’t get enough traffic for it to matter much. But, if you’ve got a busy site, and you don’t want it to crash when your remote feeds do, then you might want to consider getting that data asynchronously, or at least use a cache update procedure that’s serialized.
Here is an example of a non-blocking serialization approach that works for PHP applications.
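The same idea can be sketched in Python rather than PHP. This is only an illustration under stated assumptions: it relies on `fcntl.flock`, so it is Unix-only, and the function names, paths, and `max_age` parameter are invented for the example. Whoever grabs the lock first refreshes the expired cache; everyone else skips the refresh without waiting and serves the stale copy.

```python
import fcntl
import os
import time


def try_refresh(cache_path: str, lock_path: str, fetch) -> bool:
    """Refresh the cache if no other process is already doing so."""
    with open(lock_path, "a") as lock_file:
        try:
            # LOCK_NB makes this fail immediately instead of waiting.
            fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            return False  # someone else is refreshing: do not pile on
        try:
            data = fetch()
        except Exception:
            return False  # remote is down or slow: keep the stale copy
        tmp_path = cache_path + ".tmp"
        with open(tmp_path, "wb") as tmp:
            tmp.write(data)
        os.replace(tmp_path, cache_path)  # atomic swap for readers
        return True
    # The lock is released when lock_file is closed by the with block.


def get_feed(cache_path: str, lock_path: str, max_age: float, fetch):
    """Return the cached feed bytes, or None if there is no cached copy yet."""
    try:
        age = time.time() - os.path.getmtime(cache_path)
    except FileNotFoundError:
        age = None
    if age is None or age > max_age:
        try_refresh(cache_path, lock_path, fetch)
    try:
        with open(cache_path, "rb") as f:
            return f.read()
    except FileNotFoundError:
        return None
```

With this in place the remote site sees at most one request at a time from your servers, no matter how slow it gets. The one request holding the lock still waits on the remote site, so a short fetch timeout is a sensible companion to this approach.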
So all you web developers out there who like to consume RSS feeds on the server-side of your web application… don’t say I didn’t warn you. Go look at all your code and make sure you don’t have a dependency on a remote site. If you do, you now know at least two ways to solve that problem.