Cooperative Recovery of Distributed Storage Systems from Multiple Losses with Network Coding

Dr. Yuchong HU
Wednesday, 20 October, 2010
2:30 - 3:30 pm
Room 833, Ho Sin Hang Engineering Building The Chinese University of Hong Kong

Currently, the information technology industry has stepped into the storage era from the computing era in a way, which makes mass storage a trend. Distributed storage systems, which disperse a data file into cheap storage nodes (small servers or PCs) across wide areas in such a way that a data collector can retrieve the data from nearby nodes, are suitable for mass storage because of their low cost and high scalability. However, due to the low availability of the storage nodes, ensuring high data reliability has become a crucial problem. To ensure data reliability depends on data fault-tolerant technologies, and one key issue of fault tolerance is how to repair effectively to minimize the system resources consumed by repairing node failures. This talk summarizes my previous research on repair mechanisms of distributed storage systems based on network coding. Dimakis et al. firstly introduced network coding into distributed storage systems and proposed Regenerating Codes (RC) which optimally trade the bandwidth needed for repairing one node failure with the amount of data stored per node of the network. However, Dimakis's work mainly focuses on repairing one node failure. Actually, there exist many scenarios where multiple failures occur in real systems. For example, some distributed storage systems like Total Recall repair failures with lazy repair policy, where a recovery is triggered only when the total amount of losses reaches a given threshold. Besides the lazy repair policy, there are many other situations where multiple nodes fail at the same time in real storage systems, such as churn, i.e., a large percent of nodes often join and/or leave the network simultaneously. Moreover, some systems have to face with the sudden disconnections of multiple nodes because the power is cut off over a wide area, or a large number of maliciously controlled nodes leave the network at the same time. So in my work, a Mutually Cooperative Recovery (MCR) mechanism is presented for multiple node failures. The lower bound of maintenance bandwidth based on MCR is analyzed, and a repair algorithm is proposed based on random linear network coding. Because this algorithm is proved to reach the lower bound, the algorithm is the bandwidth-optimal repair algorithm based on MCR. By numerical comparisons of MCR with the RC, the maintenance bandwidth is reduced by 10% while the storage capacity is decreased by 20%.


YuChong Hu received the B.S. from the School for the Gifted Young, University of Science & Technology of China, Anhui, China, in 2005. He received the Ph.D degree at the School of Computer Science, University of Science & Technology of China in 2010. Now he is a postdoctoral Fellow in the Institute of Network Coding, CUHK. His research interests include network coding and distributed storage.