03-05-2016 | By Electropages
Altera’s Robert Pierce and NVMdurance’s Conor Ryan and Joe Sullivan explain how managing flash with an FPGA can extend the memory’s lifetime and cut the cost of owning and running SSDs.
The emergence of NAND flash memory has changed the way we store information in everything from mobile devices to large data centers.
Solid-state disks (SSDs) built using NAND flash are smaller, faster, quieter and cooler-running than their hard-disk equivalents. Unfortunately, SSDs have both an endurance problem (they wear out more quickly than HDDs) and a retention problem: the charge representing each data bit leaks away over time.
One reason for this is that to increase the storage of each chip, manufacturers have both shrunk the memory cell’s area, and started using each cell to store multiple bits by sensing the amount of charge it stores, rather than just the presence or absence of any charge.
In a three-bit-per-cell (TLC) design, for example, each cell stores and senses eight voltage levels. This increases storage density, at the cost of increased complexity and access time, reduced data reliability and faster wear-out.
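The idea of sensing multiple charge levels per cell can be illustrated with a small sketch. This is a toy model, not a real device interface: a TLC cell stores one of eight nominal levels, and a read distinguishes them with seven threshold comparisons.

```python
def encode_tlc(bits):
    """Map a 3-bit tuple to one of eight nominal levels (0-7)."""
    b2, b1, b0 = bits
    return (b2 << 2) | (b1 << 1) | b0

def read_tlc(voltage, thresholds):
    """Recover the level index by counting how many read thresholds
    the cell voltage exceeds -- one comparison per threshold."""
    return sum(voltage > t for t in thresholds)

# Seven read thresholds placed between the eight nominal levels.
THRESHOLDS = [0.5, 1.5, 2.5, 3.5, 4.5, 5.5, 6.5]

level = encode_tlc((1, 0, 1))        # level 5
print(read_tlc(level, THRESHOLDS))   # prints 5
```

Note the cost implied by the model: an SLC read needs one comparison, while a TLC read needs up to seven, which is one source of the increased access time mentioned above.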
Table 1. A comparison of the different classes of flash memory (Source: NVMdurance)
Error-correcting codes (ECC) in the flash controller can help identify and correct the resultant errors, but this comes at a cost because extra data has to be written and read to implement the ECC.
One such ECC scheme is the Bose-Chaudhuri-Hocquenghem (BCH) method, which offers a predictable operating time and relatively straightforward implementation for solutions requiring up to 50 bits of error correction per data block.
But BCH doesn’t scale well with increasing error rates, and so has been replaced in many TLC implementations by low-density parity-check (LDPC) codes.
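The scaling problem can be made concrete with the standard textbook bound for binary BCH codes: correcting t bit errors costs roughly m·t parity bits, where m is set by the codeword length. The sketch below uses that approximation (it is a bound, not an exact figure for any particular controller).

```python
import math

def bch_parity_bits(data_bits, t):
    """Approximate parity overhead for a binary BCH code correcting
    t bit errors: about m*t parity bits, where m = ceil(log2(n + 1))
    for a codeword of n bits (standard textbook bound)."""
    m = math.ceil(math.log2(data_bits + 1))
    return m * t

# Overhead for a 4 KB (32768-bit) sector grows linearly with t,
# which is one reason BCH becomes costly as raw error rates rise.
for t in (8, 24, 50):
    print(t, bch_parity_bits(4096 * 8, t))  # 128, 384, 800 parity bits
```

LDPC codes trade this predictable linear overhead for better correction efficiency near the channel limit, at the price of the variable-latency decoding described next.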
LDPC uses both hard-decision and soft-decision decoding. Hard decoding works much like BCH, while soft decoding adds further correction using hints extracted by characterising the degradation of the NAND chip at the foundry.
These hints are coded into a read-retry mechanism, in which a re-read is tailored to the degraded state of the NAND device. These re-reads occur unpredictably, and more frequently as the device ages, which makes it difficult to predict how long a read will take. This can introduce unacceptable delays for file retrieval, leading to an SSD being assessed as having reached the end of its useful life.
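A read-retry loop of this kind can be sketched generically. Everything here is hypothetical and vendor-neutral: the retry table, the `raw_read` and `ecc_decode` callables, and the toy simulation are stand-ins for the device-specific mechanisms the article describes.

```python
def read_with_retry(raw_read, ecc_decode, retry_table):
    """Try each threshold setting in order until ECC decode succeeds.
    Each extra pass adds latency, which is why retries make read
    times hard to predict."""
    for thresholds in retry_table:
        data = raw_read(thresholds)
        ok, corrected = ecc_decode(data)
        if ok:
            return corrected
    raise IOError("uncorrectable after all retry settings")

# Toy simulation: only the third threshold setting yields decodable data,
# so this read costs three passes instead of one.
retry_table = ["default", "shift-1", "shift-2"]
raw_read = lambda thresholds: thresholds
ecc_decode = lambda data: (data == "shift-2", data)
print(read_with_retry(raw_read, ecc_decode, retry_table))  # prints shift-2
```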
A better approach would be to reduce the error rate in the data stream, rather than to correct the errors after they have occurred.
Errors arise from several sources, chief among them the wear-out of the cells and the gradual loss of stored charge.
The wear-out rate of flash is related to the stress caused by each write. Higher stress, in the form of higher voltages or longer write times, leads to faster wear-out. However, if not enough electrons are injected onto the floating gate, the flash will suffer from retention issues and data will be lost.
This means that there is a trade-off between retention and endurance, and that early in a cell’s life, data writes can be achieved using low levels of stress, while later on, higher stress can be used to ensure retention.
Flash factories use a process called trimming to discover and set internal control registers with parameters such as voltages, write times, and read thresholds to achieve the best trade-off between retention and endurance.
In general, trim values are fixed, even though in some applications, such as SSDs for data centers, it could make sense to trade retention for endurance, since data is rarely kept on SSDs for long in these facilities.
Adjusting trim values can minimise wear during the early part of a device’s life, and applying more stress later in its life can ensure reliability, leading to longer overall operating lifetimes.
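A staged trim schedule of this kind can be sketched as a simple lookup keyed on accumulated program/erase cycles. The stage boundaries and parameter values below are invented for illustration; real trim registers and their safe ranges are device-specific and set at the factory.

```python
# Hypothetical schedule: gentle writes early in life, stepped-up stress
# later to guarantee retention as the cells wear. All numbers are invented.
TRIM_STAGES = [
    (5_000,  {"program_voltage": "low",    "program_time": "short"}),
    (20_000, {"program_voltage": "medium", "program_time": "medium"}),
    (None,   {"program_voltage": "high",   "program_time": "long"}),
]

def trim_for(pe_cycles):
    """Pick the trim set matching the current program/erase cycle count."""
    for limit, trims in TRIM_STAGES:
        if limit is None or pe_cycles < limit:
            return trims

print(trim_for(1_000)["program_voltage"])   # prints low
print(trim_for(50_000)["program_voltage"])  # prints high
```

The interesting design question is when to step between stages; as the article explains next, that decision has to track the measured degradation of the flash, not just a fixed cycle count.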
Using this management approach, SSD manufacturers can tailor their designs for different uses: one SSD model might need twelve months’ retention, while a hyper-scale customer might want an SSD that can be reprogrammed on demand to achieve retention figures of just one week.
This sort of active management relies on the SSD controller being able to monitor the degradation of the flash and decide when to change parameter sets. This is what NVMdurance Navigator does.
Figure 2. Managing flash with NVMdurance Navigator (Source: NVMdurance)
Actively managing flash is a delicate balancing act: too aggressive early on and the flash will wear out more quickly than necessary, but not aggressive enough and its retention may be compromised.
Flash cells don’t wear at the same rate, so a set of registers that gives one cell 500 cycles will only get 400 cycles from another. This variation is usually dealt with by guard-banding the flash, so that even the weaker cells attain the specified endurance – although this wastes storage cycles on the stronger cells. There are ways to exploit these spare cycles: blocks of memory that look less likely to reach the target cycling level can be ‘rested’ while their load is carried by the ‘spare’ cycles of other blocks.
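The block-resting idea can be sketched as an allocation policy: route new writes to whichever block has the most estimated endurance headroom, so weaker blocks sit idle. The per-block endurance estimates here are hypothetical; in practice they would come from the controller's degradation monitoring.

```python
def pick_block(blocks):
    """blocks: {block_id: (cycles_used, estimated_max_cycles)}.
    Return the block with the most estimated cycles remaining,
    implicitly 'resting' blocks with little margin left."""
    return max(blocks, key=lambda b: blocks[b][1] - blocks[b][0])

blocks = {
    "A": (350, 400),  # weaker block, 50 cycles of margin -> rested
    "B": (200, 500),  # stronger block, 300 cycles of margin
}
print(pick_block(blocks))  # prints B
```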
Active flash management can’t work without optimal, or near optimal, parameters for every stage of the device’s life. Current devices have anywhere from 50 to 300 control registers, and this is rising.
Parameters are usually developed through a time-consuming process that combines engineering experience and massive characterisation efforts. The complexity of current devices is making this approach unmanageable, especially for actively managed flash for which each new set of trim settings demands a new characterisation run.
The NVMdurance Pathfinder tool automates the process of discovering and testing register values. Its machine-learning engine varies register values and monitors the effects in both hardware and in simulation models. Hundreds of millions of permutations are tried, and the results of the hardware trials are also used to improve the simulations.
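The Pathfinder engine itself is proprietary, but the general shape of such a search can be sketched: sample register settings, score each against a lifetime model, and keep the best. The register names, ranges, and the toy "lifetime" model below are all invented stand-ins for the real hardware trials and simulators.

```python
import random

random.seed(0)  # reproducible toy run

def simulated_lifetime(regs):
    """Toy stand-in for a lifetime model: score peaks when each
    register is at an arbitrary 'sweet spot'."""
    sweet = {"v_pgm": 14, "t_pgm": 6, "v_read": 3}
    return -sum((regs[k] - sweet[k]) ** 2 for k in sweet)

def search(n_trials=2000):
    """Plain random search over the register space, keeping the best."""
    best, best_score = None, float("-inf")
    for _ in range(n_trials):
        regs = {"v_pgm": random.randint(10, 20),
                "t_pgm": random.randint(1, 10),
                "v_read": random.randint(1, 6)}
        score = simulated_lifetime(regs)
        if score > best_score:
            best, best_score = regs, score
    return best

best = search()
print(best, simulated_lifetime(best))
```

In this toy setting, random search quickly lands at or near the sweet spot. The article's point is that with 50 to 300 interacting registers the space is vastly larger, which is why a machine-learning engine that feeds hardware results back into its simulation models becomes necessary.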
The goal is to produce a set of register values that, when managed by NVMdurance Navigator, help the flash last longer.
Figure 3. Automatically discovering parameter sets with machine learning (Source: NVMdurance)
What does all this mean in practice? For end users, such as data centre managers, increasing the life expectancy of NAND arrays cuts the cost of owning and running an SSD. The ability to upgrade the management algorithms while the drive is in service may offer another way to extend the service life.
Data centre managers also worry about the lifetime consistency of the throughput and performance of the drives they use. Conventional controllers cannot counter the performance decline that comes with flash-cell ageing, driven by issues such as large writes and multi-pass error correction.
The growing complexity of flash management algorithms has also made SSD controller designs complex, and limited the diversity of NAND devices that each controller can support.
Altera, NVMdurance and Mobiveil are addressing these issues with a jointly developed, reconfigurable and upgradeable active flash memory controller on a single FPGA SoC. It uses the NVMdurance software to extend the lifetime of flash, and paves the way for a new class of field-reconfigurable SSDs.
Not only can the flash be replaced when it has worn out, truly commoditising the SSD, but the controller can also be reconfigured for different uses. It is this ability to easily modify the hardware that makes this an ideal approach for actively managing flash.
Robert Pierce is a Computer and Storage Architect at Intel Programmable Solutions Group. He is responsible for the strategic direction of the Computer and Storage product line and the development of advanced systems and IP for the market segment. He works with strategic customers to enable new capabilities that improve cost, power and performance, and develops emerging market segments to expand the capabilities of FPGAs and improve current solutions. Prior to Intel PSG/Altera, he worked at a variety of industry-leading semiconductor companies, including Cadence and Infineon.
Conor Ryan is the co-founder and CTO for Software at NVMdurance. He is responsible for the software that discovers and tests the parameter sets. Ryan has a computer science background and is a well-published author in machine learning. He was recently a Fulbright Scholar at the Massachusetts Institute of Technology Computer Science and Artificial Intelligence Laboratory (CSAIL). For 16 years prior to founding NVMdurance, Ryan worked at the University of Limerick, Ireland.
Joe Sullivan is the co-founder and CTO for Hardware at NVMdurance. He is responsible for all the low-level interaction with the solid-state disks, including socket and interface design. Sullivan has been working on electronic design and machine-learning systems for over 10 years. Prior to co-founding NVMdurance, he spent 12 years as a Senior Lecturer at Limerick Institute of Technology. His background is in engineering, and he spent more than ten years with Analog Devices specialising in non-volatile memory.