OpenStack Object Storage is Great For…
Soon, the OpenStack Object Storage software will be released. It’s available now as a Developer Preview if you would like to contribute, or perhaps if you’re just curious. The first release is expected later this month. This is a fantastic piece of software that really hits the mark for scalability, high availability, and performance.
About OpenStack Object Storage
OpenStack Object Storage was originally developed by Rackspace, and was released as Open Source Software earlier this year as part of the OpenStack Project. It was written for hosting the Rackspace Cloud Files service. Its original project code name was swift, so you may see references to that in various documentation.
OpenStack Object Storage aggregates commodity servers to work together in clusters for reliable, redundant, and large-scale storage of static objects. Objects are written to multiple hardware devices in the datacenter, with the OpenStack software responsible for ensuring data replication and integrity across the cluster. Storage clusters can scale horizontally by adding new nodes, which are automatically configured. Should a node fail, OpenStack works to replicate its content from other active nodes. Because OpenStack uses software logic to ensure data replication and distribution across different devices, inexpensive commodity hard drives and servers can be used in lieu of more expensive equipment. [1]
The system uses a flat namespace, and has the concepts of an account (how you access the system), a container (like a directory) and an object (like a file). You can have an arbitrary number of accounts, each with an arbitrary number of containers. Each container can hold an arbitrary number of objects.
OpenStack Object Storage is very good at storing unstructured data using an object name as a lookup key (like a filename). You access your data from a web client using the web service REST API, not like a filesystem. Download an object (like a file) using an HTTP GET request, fetch object metadata with an HTTP HEAD request, delete an object with an HTTP DELETE request, etc. There are multiple language bindings so you can access your files in OpenStack Object Storage from your favorite language natively (Java, Python, Perl, PHP, .NET, etc.).
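As a rough illustration of the request style described above, here is a minimal Python sketch that builds, but does not send, GET, HEAD, and DELETE requests for a single object. The storage URL, the token value, and the helper name `object_request` are placeholders invented for this example; in a real deployment the URL and token come from the authentication service.

```python
from urllib.parse import quote
from urllib.request import Request

# Hypothetical values for illustration only.
STORAGE_URL = "https://storage.example.com/v1/AUTH_demo"
AUTH_TOKEN = "example-token"


def object_request(method, container, obj, token=AUTH_TOKEN, base=STORAGE_URL):
    """Build (but do not send) an HTTP request for one object."""
    url = f"{base}/{quote(container)}/{quote(obj)}"
    return Request(url, method=method, headers={"X-Auth-Token": token})


download = object_request("GET", "photos", "2010/beach.jpg")     # fetch the data
metadata = object_request("HEAD", "photos", "2010/beach.jpg")    # metadata only
removal = object_request("DELETE", "photos", "2010/beach.jpg")   # delete it
```

Passing any of these to `urllib.request.urlopen` would send it. The language bindings mentioned above wrap the same kind of calls.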
The system has no central point of failure, so it’s extremely fault tolerant, and the data and related metadata are distributed throughout the system, so there are no central scalability constraints. You can store arbitrary amounts of data in the system in both large and small sizes. It performs very well, even under very high levels of concurrency. It keeps multiple replicas of each object, so it’s reliable, and the storage is very durable, without any expensive hardware. You don’t need any RAID on any of the servers unless you want it for additional performance.
Use OpenStack Object Storage For…
Here are some good use cases for OpenStack Object Storage:
- Storing media libraries (photos, music, videos, etc.)
- Archiving video surveillance files
- Archiving phone call audio recordings
- Archiving compressed log files
- Archiving backups (<5GB each object)
- Storing and loading of OS Images, etc.
- Storing file populations that grow continuously on a practically infinite basis.
- Storing small files (<50 KB). OpenStack Object Storage is great at this.
- Storing billions of files.
- Storing Petabytes (millions of Gigabytes) of data.
Recognize the Limitations
Objects must be <5GB
This is an arbitrary size limit, but it cannot be set to an unlimited value because of the system design. If you want to store a backup or anything else larger than 5GB, you’ll need a way of breaking it up into chunks and storing a manifest of the parts, so you can join them back together when you want to download the data and use it again.
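One way to picture the chunk-and-manifest approach is the sketch below. It is an assumption-laden illustration, not a feature of the system: a plain dict stands in for a container, and the segment naming scheme, the manifest layout, and the function names are all made up for this example.

```python
import hashlib
import io


def split_into_segments(stream, name, segment_size):
    """Yield (segment_name, data) pairs no larger than segment_size bytes."""
    index = 0
    while True:
        data = stream.read(segment_size)
        if not data:
            break
        yield f"{name}/{index:08d}", data
        index += 1


def upload_segmented(store, stream, name, segment_size):
    """Write segments into `store` (a dict standing in for a container)
    and return a manifest listing each segment's name, size, and checksum."""
    manifest = []
    for segment_name, data in split_into_segments(stream, name, segment_size):
        store[segment_name] = data
        manifest.append({
            "name": segment_name,
            "size": len(data),
            "sha256": hashlib.sha256(data).hexdigest(),
        })
    return manifest


def download_segmented(store, manifest):
    """Rejoin the segments in manifest order, verifying each checksum."""
    parts = []
    for entry in manifest:
        data = store[entry["name"]]
        if hashlib.sha256(data).hexdigest() != entry["sha256"]:
            raise ValueError(f"checksum mismatch in {entry['name']}")
        parts.append(data)
    return b"".join(parts)
```

In practice the manifest itself would be stored as one more small object, or kept in whatever database tracks your backups.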
Not a Filesystem
It uses a REST API, or a language binding that consumes the REST API. It does not use the typical POSIX filesystem semantics like open(), read(), write(), seek(), and close().
No User Quotas
There are no maximums that can be configured on a per-user basis to limit how much storage is used.
No Directory Hierarchies
You can create an arbitrary number of containers, but there is no nested container capability. You can simulate a directory structure using creative object names, but this is limited to a maximum string length. If you only need a shallow hierarchy, or don’t have long directory names, this might be fine. Just remember that I warned you this is generally a bad idea.
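For the shallow case, the "creative object names" trick amounts to treating a delimiter in the name as a path separator and filtering on a prefix. The sketch below does that on the client side over a plain list of names; the function name and the choice of "/" as delimiter are assumptions for illustration.

```python
def list_pseudo_directory(object_names, prefix="", delimiter="/"):
    """Return the immediate 'files' and 'subdirectories' under prefix."""
    files, subdirs = [], set()
    for name in object_names:
        if not name.startswith(prefix):
            continue
        rest = name[len(prefix):]
        head, sep, _ = rest.partition(delimiter)
        if sep:
            # Something deeper lives under this name: report the "directory".
            subdirs.add(prefix + head + delimiter)
        else:
            files.append(name)
    return sorted(files), sorted(subdirs)


names = [
    "photos/2010/a.jpg",
    "photos/2010/b.jpg",
    "photos/2011/c.jpg",
    "photos/index.html",
    "readme.txt",
]
files, subdirs = list_pseudo_directory(names, "photos/")
```

Note that every level of "directory" eats into the maximum object name length, which is one reason deep hierarchies are a poor fit.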
No writing to a byte offset in a file
The only way to update a file is to essentially overwrite it. The system creates a new version of an object each time you upload one with the same name.
No ACLs
Per-Container ACLs will probably be added in a later release. Per-Object ACLs will probably not be supported, but maybe.
No Append Support
It’s possible that this may be added at a later time using a versioning trick.
No File Locking
Most filesystems integrate with the kernel to offer advisory locking. This is not possible with OpenStack Object Storage.
Eventual Consistency
Don’t expect version consistency between multiple nodes when data is being updated.
If you upload a new version of an object, and immediately GET that object from another client, you may get a previous version of the file. There is no way to know which version of a given object the system is responding with, unless you set version metadata on each object yourself. If there is any problem with the network, you may get outdated versions of objects, or still see objects that were deleted elsewhere because the local node does not yet know about the deletion.
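If you do set version metadata yourself, a client can detect a stale read and retry. The following is only a sketch under assumptions: the metadata header name, the integer versioning scheme, and the `fetch` callable (anything that returns a headers dict and a body) are invented for this example.

```python
import time

# Assumed name for a custom per-object metadata header.
VERSION_HEADER = "X-Object-Meta-Version"


def version_headers(version):
    """Headers to send along with an upload so the version can be read back."""
    return {VERSION_HEADER: str(version)}


def is_stale(response_headers, expected_version):
    """True if the response carries an older version than the caller expects."""
    lowered = {k.lower(): v for k, v in response_headers.items()}
    seen = int(lowered.get(VERSION_HEADER.lower(), 0))
    return seen < expected_version


def fetch_at_least(fetch, expected_version, attempts=5, delay=0.5):
    """Call fetch() until it returns a new-enough version or attempts run out."""
    for attempt in range(attempts):
        headers, body = fetch()
        if not is_stale(headers, expected_version):
            return body
        if attempt < attempts - 1:
            time.sleep(delay)
    raise TimeoutError(f"version {expected_version} not visible after {attempts} attempts")
```

This only works if the writer tells the reader which version to expect, for example through the application's own database.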
No Support for Data Encryption
You must encrypt the data yourself. The current version does not have SSL support either. Use an SSL proxy to work around this by terminating the SSL sessions on the same network where the OpenStack Object Storage system runs.
Not Compatible With Web Browsers
You must supply a storage token header to authorize each request. Regular web browsers can’t do this. This can be solved using a proxy between the client and the system to handle token authentication. This is not a problem if you are using one of the language bindings. They will take care of this when you integrate your web app with the system.
Not a Database
It supports no querying or processing of data on the servers. All you can do is list the objects within a given container. There is no way to search based on object metadata. You need to keep your own external search indexes.
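An external search index can be as simple as a mapping from metadata values to object names that you update whenever you upload or delete. The class below is a hypothetical in-memory sketch; a real deployment would more likely keep this in a database, and none of these names come from the system itself.

```python
from collections import defaultdict


class MetadataIndex:
    """In-memory index from (key, value) metadata pairs to object references."""

    def __init__(self):
        self._by_pair = defaultdict(set)
        self._by_object = {}

    def add(self, container, obj, metadata):
        """Index an object; re-indexing replaces its earlier metadata."""
        self.remove(container, obj)
        ref = (container, obj)
        self._by_object[ref] = dict(metadata)
        for pair in metadata.items():
            self._by_pair[pair].add(ref)

    def remove(self, container, obj):
        """Drop an object from the index, if present."""
        ref = (container, obj)
        for pair in self._by_object.pop(ref, {}).items():
            self._by_pair[pair].discard(ref)

    def search(self, **criteria):
        """Return objects whose metadata matches every key=value given."""
        sets = [self._by_pair.get(pair, set()) for pair in criteria.items()]
        return sorted(set.intersection(*sets)) if sets else []
```

For example, after `index.add("calls", "a.wav", {"agent": "42"})`, calling `index.search(agent="42")` returns the matching (container, object) pairs without touching the storage cluster.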
Don’t try to frequently update large objects.
All updates produce a new version of an object, because objects are immutable.
Don’t store unlimited objects per container
You can store as many objects in a container as you wish. However, your per-object upload latency will increase considerably once you reach a certain point. I found the optimal number of objects per container to be just under one million. This number will vary depending on your equipment, and how heavy a workload it’s subjected to.
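One simple way to stay under that point is to spread objects over a fixed set of containers chosen by a hash of the object name. This is a sketch of that idea, not something the system does for you; the container naming scheme, the default shard count, and the one-million limit used as a default are illustrative assumptions.

```python
import hashlib


def container_for(object_name, base="media", shard_count=64):
    """Pick a container deterministically from a hash of the object name."""
    digest = hashlib.sha256(object_name.encode("utf-8")).digest()
    shard = int.from_bytes(digest[:8], "big") % shard_count
    return f"{base}-{shard:04d}"


def shards_needed(expected_objects, per_container_limit=1_000_000):
    """Smallest shard count that keeps the average container at or under the limit."""
    return -(-expected_objects // per_container_limit)  # ceiling division
```

Because the mapping is deterministic, a reader can recompute the container from the object name alone. The trade-off is that changing `shard_count` later moves objects between containers, so it is worth measuring your own limit first and sizing generously.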
Changing Swift Into a Filesystem
You might think of using FUSE to access objects and containers in OpenStack Object Storage as files and directories with a filesystem interface, but you’ll quickly discover that this is only really good for very simple use cases. Most of the things you need to implement what we think of as a filesystem are missing.
If you are a developer, and you are thinking of building a filesystem on top of OpenStack Object Storage using objects as blocks, that could possibly work, but would probably not perform very well compared to existing alternatives that are actually designed for distributed block storage. The blocks would need to be pretty large to keep the network/protocol overhead down. Frequent writing is not likely to work well. Most users of filesystems are not expecting eventual consistency behavior. They want strong data consistency. You would also want some strategy to handle read/write concurrency with some locking capability. Plus, you would need to have a way to keep track of the blocks like a filesystem does in some data structure or database. Frankly speaking, OpenStack Object Storage is probably not the right tool for the job.
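To make the bookkeeping concrete, here is a hypothetical sketch of just one piece of such a design: mapping a byte range in a file onto the block objects that would hold it. The object naming scheme and the 4 MB default block size are assumptions for illustration, and locking, consistency, and the block index are deliberately left out.

```python
def blocks_for_range(file_id, offset, length, block_size=4 * 1024 * 1024):
    """List (object_name, start_in_block, end_in_block) for a byte range."""
    if length <= 0:
        return []
    first = offset // block_size
    last = (offset + length - 1) // block_size
    spans = []
    for block in range(first, last + 1):
        block_start = block * block_size
        start = max(offset, block_start) - block_start
        end = min(offset + length, block_start + block_size) - block_start
        spans.append((f"{file_id}/block-{block:010d}", start, end))
    return spans
```

Since objects can only be overwritten whole, even a one-byte write means downloading the entire block object, changing it, and uploading it again, which is why frequent small writes fare so badly.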
Conclusion
You should probably only use OpenStack Object Storage for use cases it’s intended for. If what you really want is a clustered filesystem, you’re probably better off looking at other solutions like Lustre, GlusterFS, GFS, OCFS, etc. Keep in mind that each of these has its own strengths and weaknesses. Pay particular attention to what they are designed for, and use them accordingly. If you want to use OpenStack Object Storage for something that it was designed for, then you will probably be very happy with it. Keep in mind that it’s a blob storage system. It’s not a filesystem, not a file server, not a database, etc. To learn more about OpenStack Object Storage, please check out the Developer Documentation.
So it’s not a database (cos its use case is unstructured content) and it’s not a DMS like Documentum etc. So why would someone want to use this?
Well if you are looking for a database for structured data, then you don’t want OpenStack Object Storage, plain and simple.
Most document-oriented systems don’t perform well when you have a _ton_ of small files in them. Many have the critical flaw of a centralized metadata resource. We have proven that this approach does not scale if you plan to have billions or trillions of files amounting to petabytes of data. If you only have a small population of files, then OpenStack Object Storage is probably not the right fit.
Another potential use case is when you need a near-line cache for unstructured data that is never automatically purged to make room for new content. If a small cyclical memory cache does not work because you need guaranteed persistence of the cache, you can put the smaller in-memory cache next to the application (if needed), and the OpenStack Object Storage system behind it as a backing store. This can help a lot when generating the cached object from its original data is particularly expensive, and you have the ability to delete a cached element at the end of some workflow. Simply add nodes to make sure you have enough capacity to hold the working set of cached data. This can be much cheaper than scaling out an in-memory system like memcached to very large storage capacities.
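The arrangement described above, a small in-memory cache next to the application with the object store behind it, might look like this sketch. The class name is hypothetical, and a dict-like `backing_store` stands in for the object storage client.

```python
from collections import OrderedDict


class TwoTierCache:
    """Small in-memory LRU in front of a persistent backing store."""

    def __init__(self, backing_store, memory_slots=1024):
        self.backing = backing_store   # dict-like stand-in for the object store
        self.memory = OrderedDict()
        self.memory_slots = memory_slots

    def _remember(self, key, value):
        self.memory[key] = value
        self.memory.move_to_end(key)
        while len(self.memory) > self.memory_slots:
            self.memory.popitem(last=False)   # evicts from memory only

    def get(self, key, generate):
        """Return the cached value, generating and persisting it on a full miss."""
        if key in self.memory:
            self.memory.move_to_end(key)
            return self.memory[key]
        if key in self.backing:
            value = self.backing[key]
        else:
            value = generate(key)             # the expensive step
            self.backing[key] = value
        self._remember(key, value)
        return value

    def delete(self, key):
        """Explicit removal at the end of a workflow."""
        self.memory.pop(key, None)
        self.backing.pop(key, None)
```

The point of the design is that eviction only ever happens in the memory tier. The backing store keeps every entry until `delete` is called, so the expensive `generate` step runs once per key.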
Otto, how do you see swift scalability when it comes to storing small files (let’s say items in the scale of 40-100kB), and billions of them?
Facebook apparently used haystack to solve this (as their NFS was blowing up).
Karri,
The OpenStack Object Storage (formerly swift) software is a very good fit for small files in the range you indicated. From what I know about haystack, they share similar designs. Swift can keep up with scaling writes and storage capacity simply by adding additional nodes and re-balancing.
During my extensibility testing I found that if I kept the number of objects per container to under ~1 million, that it performed very well, even with billions of very small files spread between several thousand containers. I tested the extreme case by creating billions of 1-byte files, and that worked fine, provided I did not try to store them all in the same container. Feel free to run your own study, and share your results.
Keep in mind that the performance barrier for number of objects per container for ideal performance will vary depending on the specifics of your equipment, and potentially even the version of SQLite software running on it. I suppose it’s possible to patch the container servers so that they automatically stripe the container information across multiple SQLite files, and relieve pressure on that particular bottleneck. You’d have to judge the complexity trade-off to see if that makes sense for you.
I suppose that you will also run into the same constraint when you have millions of containers, in which case you would want to split the files between multiple accounts, after which time, you’d hit the constraint at roughly a million accounts. So if you have more than 1×10^18 files (roughly 35.5 million to 88.8 million Petabytes in your case) to store, you’ll definitely want to work on solving that, or accept a performance penalty. To put things into perspective, if each of your nodes had 67TB of storage on-line, that would mean you’d have somewhere between roughly 543 million and 1.36 billion servers on-line. I’m guessing that you probably don’t have a server budget measured in trillions of dollars, so that’s probably not too shocking.
In short, yes, if you need something like haystack, chances are that OpenStack Object Storage will fit that need nicely.
Cheers,
Adrian
It seems to be exactly what I am looking for. I need to keep video surveillance files.
I’m curious what kind of transaction rates you experienced when you say OpenStack is a good fit for lots of small files. We’re running a proof of concept on 3 rack units (our 3 zones). We’ve got 3 proxies and 4 Linux VMs where we run JMeter servers to push a heavy load round robin-ing through the proxies. We’ve got a use case for storing millions of items (~18M) per day all in the 15k or less range. Around 0.5PB of total storage (7 year retention).
However, even with 40 JMeter load threads (spread across the 4 VMs) we aren’t seeing anything close to the throughput we need to see.
I know you can’t comment on why we’re slow but I’m curious what you were able to achieve in your experience.
Thanks,
Matt
Matthew,
I ran my testing on a nice big environment in Rackspace’s multi-million dollar test lab. I was able to push transaction rates of hundreds of thousands of transactions per second in a read/write mix using an array of clients not much larger than what you described. In general, the key to expanding capacity is to add more nodes. If you double the number of nodes, your system performance will double, with a few exceptions…
Don’t put too many objects in the same container. I found when I ran my testing that the write performance of a single container would drop off considerably once it had about a million objects in it. So if you are simply dropping millions of objects into the same container you won’t see optimal performance no matter how many nodes you add because you are bottlenecked on the container. When I only put a quarter million files into each container I got a very dramatic performance boost. Where this practical limit falls in terms of number of objects will depend on your equipment.
I found that performance was good up until I reached the “tipping point” in terms of number of objects in a container, and then once that point was reached performance became progressively slower until it was in the unacceptable range that you described. So, I urge you to do some testing and chart out what your performance is at various different numbers of objects and find the cutoff that’s optimal for your hardware setup. I did submit this performance limitation as a “bug” prior to product release because it can be solved with a more elegant system design for containers. A solution will make its way into the product at some point, if not already. In the meantime, simply know where the effective performance cliffs are for your installation, and create a new container as you approach them.
Note that you will get a similar problem if you simply create millions of containers, because you’ll hit the same sort of bottleneck at the account level. So fill up your containers to a level where they perform nicely, and create new ones as needed. If you get to the point when you have a million or so containers, simply create a new account and start filling that up. This will work until you have 10^18 files (A Quintillion) in a single swift system. Simply put, if each of your containers holds a million files nicely, then for your use case you would need to create a new account every 152 years or so. If it only holds 100,000 files nicely per container, then you’ll need to make a new account in just over 15 years. If you keep creating new accounts on this interval… you have an effectively unlimited storage capacity in the system as long as you continue to add new nodes at the rate proportional to the rate of adding new files.
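The account rollover intervals quoted above follow from simple arithmetic on the ~18 million items per day mentioned in the question. A short sketch, with function and parameter names chosen for this example:

```python
def years_until_account_full(files_per_day, files_per_container,
                             containers_per_account=1_000_000):
    """Years of ingest one account can absorb before a new one is needed."""
    capacity = files_per_container * containers_per_account
    return capacity / files_per_day / 365.25


def total_capacity(per_container, containers_per_account=1_000_000,
                   accounts=1_000_000):
    """Files that fit when every level holds about a million entries."""
    return per_container * containers_per_account * accounts


roomy = years_until_account_full(18_000_000, 1_000_000)   # about 152 years
tight = years_until_account_full(18_000_000, 100_000)     # just over 15 years
ceiling = total_capacity(1_000_000)                       # 10**18 files
```

Plugging in the per-container cutoff you measure on your own hardware gives the interval for your installation.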
If you are concerned with read performance, you can increase the replica count so that you have more replicas of the data, and the proxies can route client requests between them. Chances are if you need more than the default replica count, you’re probably better off front-ending swift with a CDN like Rackspace does with Akamai for the Cloud Files service. That way you can end up with effectively thousands of copies of the data distributed more closely to the end users.
More advice… most system engineers seem reluctant to waste storage capacity. However, sometimes adding a TON of storage is just what you need in order to get enough disk drives in a system to handle the desired transaction rate. Where you may only need 0.5PB of storage, you might decide to provision much more than that in order to get the throughput you want. Don’t fret about that. Put in the hardware you need, even if it’s leaving unused storage. So long as you can afford the expenditure, don’t allow the irrational fear of wasting storage to get in the way of your performance needs.
And another bit of advice… compress your data before you put it in swift, even if you have tons of available storage. The smaller your data is, the higher your potential transaction rate will be. The CPU time needed to compress the data is not significant in comparison to the time it takes to write the data out to disk. You can easily compress data way faster than you can stream the compressed data to the drives. Adding SSD media is not going to change that dramatically either.
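Compressing before upload takes only a few lines with the Python standard library. This sketch uses gzip as one possible codec; the function names and the sample payload are made up for the example.

```python
import gzip


def compress_for_upload(data, level=6):
    """Return (compressed_body, ratio) where ratio = compressed / original size."""
    body = gzip.compress(data, compresslevel=level)
    ratio = (len(body) / len(data)) if data else 1.0
    return body, ratio


def restore_after_download(body):
    """Undo compress_for_upload on the way back out."""
    return gzip.decompress(body)


sample = b"2010-10-01 12:00:00 GET /index.html 200\n" * 1000
body, ratio = compress_for_upload(sample)
```

Repetitive data such as logs shrinks dramatically, while already-compressed media (JPEG, MP3, most video) will not, so it is worth checking the ratio on a sample of your own data.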
Adrian
Adrian -
I am working on a large OpenStack Swift/Nova implementation and I am looking in to ways to stress test it and understand the best metrics for building out the production system.
One question that I would like to explore is the different behavior of a Swift implementation where there are separate account/container/object store nodes in the zones. It would seem to me that it would be dependent on the use case. While we are still working on this, in this case it is likely that we might have either millions of accounts with few containers and objects, or few accounts with millions of containers with few objects in them. Essentially think of this system as being a consumer grade dropbox type application. Are there any test criteria that we can use to see what the behavior would be between these two configurations?
very nice documentation, one question on “If you only have a small population of files, than OpenStack Object Storage is probably not the right fit.”
In a cloud solution, I need several nodes to share configuration. Besides object storage, which is the better way? A database or a filesystem does not seem to fit in the cloud.
thanks
If you have a small number of files (they easily fit on a single node), it’s not worth the complexity of having a fully distributed storage system like OpenStack Object Storage. You are probably better off simply placing them on a single server using a traditional approach and having a backup copy, either on another server or in an archive. If you need to grow beyond what will fit on a single node, then it makes sense to set up an OpenStack Object Storage cluster (typically starting at 4 nodes) to support that growth, and allow you to continue to expand.
Although it is possible to deploy all the components of the system on a single node, and later add more nodes as needed, my general attitude about that is that it’s overkill.
The system does discriminate between account, container, and object services, so each of those can be scaled independently. So if you end up with very large numbers of accounts or containers, you have a way to address that. In general, the more entries you have within a given node, the lower your overall performance expectations should be. From my experience, performance was pretty consistent until a node had about a million entries, after which operations that required modification of metadata did slow down considerably. So, as long as you are distributing the large number of accounts or containers over a sensible number of nodes, that should not be a major concern.
If you just have a small population of files, and want to share them between multiple servers, I would use a network file service like NFS (*nix) or CIFS2/SMB (Windows) to a single file server. If it’s a mission critical application that you can’t solve with a simple backup discipline, or is going to see very high activity levels, then using OpenStack Object Storage may be justified, because that’s roughly the same complexity as any active-passive high availability fileserver configuration would be.
Thanks for the feedback. So is it also suitable to run a file server on top of cloud storage? I ask since I consider a file server inside the cloud a bad idea.
It depends.
It’s possible to do stuff like run Gluster, and have that store its chunks in swift. As long as your overall performance requirements are not too restrictive, that could work fine. However, the more layers of network services that you layer between the application and its data, the more you will slow it down. Remember also that there is a direct relationship between software+system complexity and the rate of software defects. In general it’s best to keep things simple where possible. Depending on what solution you select you may be forced to make compromises between availability rates, performance, and consistency.