Understanding Caringo erasure coding and replication in storage cluster patents

In 2005, Caringo founders Mark Goros, Jonathan Ring and Paul Carpentier saw an anomaly: consumer-grade storage was getting cheaper, but enterprise storage arrays remained very expensive. They decided to band together to build a new storage system. Carpentier had co-developed Content Addressable Storage at his prior company, FilePool, which was later bought by EMC and became EMC Centera.
 
U.S. Patent and Trademark Office patent number 8,799,746 was awarded to Caringo on September 16, 2014 for erasure coding and replication in storage clusters. The patent covers combining replication and erasure coding, both managed by the storage cluster itself, to take advantage of the benefits of each without needing control computers or control databases. It includes:
 

  • A method of specifying replication or erasure coding by “instructions from a client, an inherent property of the object, the metadata of the object, a setting of the cluster, or by other means.”
  • A method of replacing content associated with a unique identifier.
  • A method of using the manifest or replicated objects in the case of a failed disk scenario.
  • A method of relocating an object within a storage cluster without the need for an extra control computer or control database.
  • A method of using metadata to specify the transformation from replication to erasure coding automatically based on trigger conditions.
  • A method of moving an object from one cluster to another and automatically converting to the storage format used by the second cluster, dictated by default cluster settings, by user metadata of the object, or by instructions from the program initiating the move.
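
The first method in the list, choosing replication or erasure coding from several possible sources, can be pictured with a minimal sketch. Everything in it is an assumption made for illustration: the function name `resolve_protection`, the `protection` metadata key, the two string labels, and above all the precedence order (client instruction, then object metadata, then cluster setting). It is not Caringo Swarm's API, and the patent summary above does not state a precedence.

```python
# Hypothetical sketch: pick a protection scheme for an object from the
# sources named in the patent summary. The names and the precedence order
# are illustrative assumptions, not Caringo Swarm's actual behavior.

VALID_SCHEMES = ("replication", "erasure_coding")


def resolve_protection(client_instruction=None, object_metadata=None,
                       cluster_default="replication"):
    """Return the first valid scheme found, checking the client
    instruction, then the object's metadata, then the cluster setting."""
    metadata = object_metadata or {}
    candidates = (
        client_instruction,            # explicit request from the client
        metadata.get("protection"),    # hypothetical metadata key
        cluster_default,               # cluster-wide setting
    )
    for choice in candidates:
        if choice in VALID_SCHEMES:
            return choice
    raise ValueError("no valid protection scheme specified")


if __name__ == "__main__":
    print(resolve_protection())
    print(resolve_protection(object_metadata={"protection": "erasure_coding"}))
    print(resolve_protection(client_instruction="replication",
                             object_metadata={"protection": "erasure_coding"}))
```

The point of the sketch is only that the decision can be made from information travelling with the request or the object, so no separate control database has to be consulted.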

 
Caringo’s approach lets users take advantage of the footprint efficiency of erasure coding and the low processing utilization and rapid access of replication, without introducing the complexity or single point of failure associated with control computers and databases.
 
Caringo Swarm is the company’s software-defined object storage. It leverages simple, emergent behavior with decentralized coordination, turning standard hardware into a reliable pool of shared resources that adapts to any workload or use case while offering a foundation for new data services.
 
Following the patent award, DataStorage&Asean contacted Caringo CEO Mark Goros regarding the company’s object storage strategy, successes and direction.
 
DSA: Object storage has been around as a discussion point for a few years now on the back of big data interests. Is big data the be-all and end-all of object storage?
 
Mark: Actually, Object Storage has much more to do with scalability, affordability, long-term data protection, and metadata than with Big Data. Companies are storing more data than ever before. They need a way to store it cost-effectively (Caringo runs on any x86 servers and allows you to use the most cost-effective options available in the market for your use case). They need a way to be sure that they can seamlessly scale up and out. File Systems do not scale. They slow down as they get fuller and they break as they are stretched to their limits.
 
Caringo Swarm scales seamlessly. To add storage to your Caringo Swarm, simply plug another node (server) into the same subnet, power it on, and the Caringo Cluster Services Node will netboot the server. In 90 seconds or less, it will be up and running as a storage node taking read and write requests. No configuration, no provisioning, just seamless scaling of more space and more performance. No other vendor can do this. Others restrict how you add storage and how little or how much you must add at a time. Some actually make you rebalance your storage system or even take the system down! With Caringo Swarm, you scale as much as you need, when you need to, and you never bring the system down.
 
Companies are storing data for longer periods of time. There are regulations that even require some enterprises or hospitals to store data and protect it for very long periods. This requires a system that can protect data now, for 3 months, or for 30 years! You are guaranteed data integrity, access and protection for as long as you need it. You can even change the protection mechanism to reflect the value of the data (replication/erasure coding). File systems cannot do this. Other Object vendors cannot do this either. File systems slow down as they get fuller. Caringo Swarm goes as fast with 10 objects as it does with 10 billion objects.
 
Adding metadata to data empowers your organization to use data today and into the future. Data you store today may have new, unintended uses in the future. Metadata stored with the data (in a true object model; Caringo is the only one without a separate metadata database) makes your data more valuable today and into the future. Metadata is the key to value in object storage.
 
What we see in the industry is companies taking data from many different repositories, copying that data into HDFS and then running Hadoop or big data analysis jobs on that data and then throwing the copied data away. As far as Caringo is concerned, this is a big waste of resources. Data stored in Caringo Swarm can be directly accessed by Hadoop jobs. As well, we have a metadata search capability that provides many opportunities for analytical views of data in Caringo Swarm. So, for big data, if you keep your data in Swarm, Big Data Analysis becomes much easier, faster and more cost-effective.
 
DSA: What are the alternative technologies promoted by other object storage vendors and why are these inferior to Caringo's architecture?
 
Mark: There are three types of object storage vendors: those that lock you into their hardware model, open source, and pure software. Caringo is a pure software model that will run on cost-effective industry-standard hardware with no vendor lock-in. It is architecturally pure: every node runs the same code, so every node can perform every function. This is important because if you have special-purpose nodes (as every object vendor other than Caringo does), they represent bottlenecks and single points of failure. Caringo uses a pure swarm architecture, which gives us the flexibility, robustness, performance and protection that other vendors cannot match.
 
Open source is fine for experiments. But when something goes wrong, who is going to fix it? You have to maintain a huge supply of developers to even have a chance at running a production application with open source storage. As well, there is no quality assurance in the open source world. If you depend on the integrity, accessibility and growth of your data, you would be foolish indeed to depend on open source.
 
Other object storage vendors have inflexible architectures that are based on file systems. They are simply inferior to Caringo Swarm. Here's an example: If I want to protect a 100K object using an erasure coding method of 10,6, that means I create 10 slices of data and 6 slices of parity. In a Caringo Swarm, it would be most efficient to just replicate that object, which would take up 200K. If you wanted to use EC (erasure coding) with Caringo, that object would take up 160K (16 slices of 10K each). If you used EC with another vendor who has a file system below the object layer with a standard 64K default block size, each data slice, which should be 10K (100K / 10 data segments), would occupy a full 64K block, so the data slices would use 640K (64K minimum block * 10) and the parity segments would take another 384K (6 * 64K). So that object would take up 1,024K. That is a total waste of space. Why not replicate it? Most other vendors offer EC or replication but not both. Using either protection model, Caringo Swarm takes either 160K or 200K for this object. Others would take 1,024K.
 
Multiply this by 1 billion objects and you can get a feel for the inefficiency of competing vendors.
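
The arithmetic in this example can be checked with a short sketch. It uses only the figures quoted above (a 100K object, a 10,6 scheme, two replicas, a 64K block size); the function names and the `block_kb` parameter are hypothetical, and the model assumes each slice is simply rounded up to a whole number of file system blocks, ignoring any other overhead.

```python
import math

# Illustrative footprint model using the figures from the interview.
# Function and parameter names are assumptions made for this sketch.


def replicated_size(object_kb, replicas=2):
    """Space used when whole copies of the object are stored."""
    return object_kb * replicas


def ec_size(object_kb, data_slices, parity_slices, block_kb=0):
    """Space used by erasure coding. If block_kb is non-zero, each slice
    is rounded up to a whole number of blocks of that size."""
    slice_kb = object_kb / data_slices
    if block_kb:
        slice_kb = math.ceil(slice_kb / block_kb) * block_kb
    return slice_kb * (data_slices + parity_slices)


if __name__ == "__main__":
    replicated = replicated_size(100)            # 200
    ec_plain = ec_size(100, 10, 6)               # 160.0
    ec_on_fs = ec_size(100, 10, 6, block_kb=64)  # 1024
    print(replicated, ec_plain, ec_on_fs)
    # How many times larger the block-padded layout is than plain EC.
    print(ec_on_fs / ec_plain)                   # 6.4
```

Under these assumptions the block-padded layout is 6.4 times the size of the unpadded erasure-coded object, and that ratio holds however many objects of this size are stored.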
 
DSA: What applications are best suited to take advantage of Caringo Swarm? Please cite examples.
 
Mark: Any use case that involves HTTP, the internet, lots of data, growing data, or data that has to be protected is a great use case for Caringo Swarm. Essentially anything except real-time transaction systems and databases.
 
Some examples:
 
Web 2.0 (Ask.com, MassiveMedia, Twoo.com, Terremark, Rightmove)
 
Video archiving and surveillance (City of Austin, Bexar County)
 
Throughput and resilience (Verizon, KineticD)
 
Medical imaging (JHU, the US Department of Defense medical health systems, Kure hospital in Japan)

* Editor's note: The IDC MarketScape report Object-Based Storage 2013 Vendor Assessment listed 13 object storage vendors that it considers to be representative of the market (in alphabetical order): Amplidata, Basho Technologies, Caringo, Cleversafe, Cloudian, DataDirect Networks (DDN), EMC, Hitachi Data Systems (HDS), Huawei, NEC, NetApp, Scality, and Tarmin.
