DSA recently met with Andy Walls, Chief Architect and CTO of IBM FlashSystem. Andy is also an IBM Fellow, IBM's most prestigious honor. He is a pioneer in enabling MLC flash in the enterprise and has developed storage and server products that achieve consistently high performance while still providing the endurance and availability that enterprises require. A bit of trivia about Andy: he has filed over 100 patents, holds a BSEE from UC Santa Barbara, and has worked for IBM his entire 35-year career.
DSA attended a media briefing and managed to fire off two questions that we felt delved deep into his storage knowledge. We have detailed his answers in full here, as the insight could be very helpful for anyone wanting to understand NVMe, or wondering what a real industry guru looks for in an enterprise storage array.
DSA: Is NVMe better suited for specific types of use cases?
Andy Walls: NVMe has two “personalities”. One of them is 2.5-inch drives or add-in cards that run over PCI Express, which you can plug into servers. As an example, in our new 9100 platform, we took a FlashCore module and put it into 2.5-inch drives that sit in a 2U enclosure. The beauty of NVMe in a storage product is that when the storage stack accesses data off the media, it uses a memory protocol instead of an HDD protocol like SCSI. So it allows you to get better performance and more parallelism. That’s inside the storage product.
Then when you look at an application server, when it’s connected to its external storage, it does so over Fibre Channel or iSCSI. These are still HDD-based protocols: the drivers are heavy, there’s not as much parallelism, and they consume quite a bit of CPU horsepower.
So NVMe can come in, as NVMe over Fabrics or NVMe over Fibre Channel, and do two things. It can reduce latency, which may not be the most important thing, but it can also reduce the CPU utilisation spent handling IO and free that CPU for applications.
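To put rough numbers on the point about CPU utilisation, here is a back-of-the-envelope sketch. The per-IO cycle counts, core count, and clock speed are purely illustrative assumptions, not figures from the interview or from any IBM product; the sketch only shows the shape of the saving when a leaner protocol path spends fewer cycles per IO.

```python
# Back-of-the-envelope sketch of the CPU cost of IO handling.
# All numbers below are illustrative assumptions, not measured figures.

def io_cpu_utilisation(iops, cycles_per_io, cores=16, clock_hz=2.5e9):
    """Fraction of total CPU capacity spent just servicing IOs."""
    return (iops * cycles_per_io) / (cores * clock_hz)

# Assumed costs: a heavier legacy SCSI/FC driver path vs a leaner NVMe path.
scsi = io_cpu_utilisation(500_000, cycles_per_io=30_000)
nvme = io_cpu_utilisation(500_000, cycles_per_io=10_000)

print(f"SCSI path: {scsi:.1%} of CPU")  # cycles the host loses to IO handling
print(f"NVMe path: {nvme:.1%} of CPU")  # the difference goes to applications
```

Whatever the exact constants, the gap between the two results is the CPU headroom Walls describes being handed back to the applications.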
So in terms of use cases, there are analytics workloads like Spark clusters and Hadoop that naturally use Ethernet when you scale out. When you scale out like that, having a low-latency connection to your storage is very important, especially as we go to things like 3D XPoint. So I think cognitive applications, as well as Spark and Hadoop, will be able to take advantage of NVMe.
DSA: Looking at your background, you have been at the sharp end of designing storage for 37 years. When it comes to flash, we always hear people make claims about IOPS. But in your view, what are the key technical metrics people should assess when choosing an enterprise array?
Andy Walls: Good question! If you consider our FlashSystem 900, it’s a 2U product and it’s the fastest external storage in the world. It can deliver 1.3 million read IOPS, and 900,000 read/write IOPS.
That’s the best performance in the world, but how many customers do I have that drive 1.3 million IOPS? Zero. Nobody actually drives 1.3 million IOPS all the time.
So you might ask why build an array with such high IOPS capability? If we build in the capability to have that kind of IOPS, it means you no longer have to worry about bursts and spikes and periods of time when a lot of things come in at one time.
If I can deliver high IOPS, that gives you low latency, so that for workloads that are “bursty” and “spiky”, we will deliver low latency no matter what.
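The link between IOPS headroom and latency can be made concrete with Little's Law (outstanding IOs = throughput × latency), a standard queueing result rather than anything stated in the interview. The burst size and IOPS figures below are illustrative assumptions: with the same number of in-flight IOs during a spike, an array with a higher IOPS ceiling drains its queues faster, so each IO waits less.

```python
# Little's Law: outstanding_ios = iops * latency_seconds.
# Rearranged: latency = outstanding_ios / iops. With a fixed burst of
# in-flight IOs, more absorbable IOPS means lower queueing delay.
# Numbers are illustrative assumptions only.

def latency_us(outstanding_ios, iops):
    """Average time (microseconds) each IO spends in the system."""
    return outstanding_ios / iops * 1e6

burst = 256  # assumed in-flight IOs during a morning spike

fast = latency_us(burst, iops=1_300_000)  # high-IOPS array: ~197 us
slow = latency_us(burst, iops=200_000)    # lower ceiling:   ~1280 us
print(fast, slow)
```

This is why "unused" IOPS capability is not wasted: it is what keeps response time flat when bursts arrive.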
I would say the most important thing is not high IOPS, it is consistent performance.
Consistent low latency means consistent response time all the time. Let me use a real-world example to show what this means. In the finance sector, when the market opens in the morning, and you get a burst of activity, the array will still deliver you the same response time as a few hours earlier before the markets opened.
A lot of aspects contribute to this low latency. Think of it this way: what if a customer has a flash card that fails? You still have to deliver consistent response time even in the face of that. What if a lot of activity comes in at once, somebody starts writing to the device to load a new application, and at the same time other critical applications are running? I still have to be able to deliver consistent performance even when that is going on.
So, the most important thing is not the number of IOPS. If we handle bursts, spikes, failures, and massive throughput such as we see during backup windows, and still deliver the response time that you need all the time, that’s what is key.
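Consistency of the kind Walls describes is usually judged by tail percentiles rather than averages, a common practice in storage benchmarking though not something the interview itself specifies. The synthetic latency samples below are assumptions for illustration: a 1% slow path barely moves the mean but dominates what unlucky requests experience.

```python
import statistics

# Synthetic latencies in microseconds, purely illustrative:
# 99% of IOs are fast, 1% hit an assumed slow path (e.g. a failure or burst).
latencies = [100] * 990 + [5_000] * 10

def percentile(samples, p):
    """Nearest-rank percentile of a list of samples."""
    ordered = sorted(samples)
    rank = max(1, round(p / 100 * len(ordered)))
    return ordered[rank - 1]

print(statistics.mean(latencies))   # ~149 us: the average looks healthy
print(percentile(latencies, 99.5))  # 5000 us: the tail tells the real story
```

An array built for consistent low latency is one whose tail percentiles stay close to its median, even during failures and spikes.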
We need consistent low latency all the time. What you don’t want to happen is that in the middle of the night, because a batch job is running at the moment you go to check your credit card balance, you wait a long time because the system is overloaded.
So the really important thing is not high IOPS alone, it’s building a system for low latency under all circumstances.