TY - GEN
T1 - LHS
T2 - 2012 International Conference on Machine Learning and Cybernetics, ICMLC 2012
AU - Zhang, Zhike
AU - Jiang, Zejun
AU - Liu, Zhiqiang
AU - Peng, Chengzhang
PY - 2012
Y1 - 2012
N2 - Indexing in RAM is important to information retrieval. Deduplication systems rely on information retrieval methods to find duplicate data chunks quickly, and the chunk-lookup disk bottleneck is one of the most important problems in this setting. Previous methods greatly reduce the RAM used by the index, so that the index need not be read from disk for every chunk lookup. However, these methods still need several TB of RAM to hold the index for dozens of PB of utilized storage space. We design Linear Hashing with Key Groups (LHs), a variation of Linear Hashing, to organize and address bins. Based on LHs, we propose a novel method of information retrieval in deduplication that avoids an in-RAM index by using LHs to compute the address of a bin. A bin contains the chunk IDs of the files similar to a given file. Consequently, no index needs to be maintained in RAM for this purpose. Our method does not decrease deduplication efficiency compared with Extreme Binning, while requiring one disk read per file. For every file, our method first computes the file's bin address using LHs, loads the bin, and then deduplicates the file against the loaded bin. Experimental results show that, although our method needs no index in RAM, its deduplication efficiency is slightly better than that of Extreme Binning.
KW - chunk-lookup disk bottleneck problem
KW - Deduplication
KW - index
KW - linear hashing
UR - http://www.scopus.com/inward/record.url?scp=84871605238&partnerID=8YFLogxK
U2 - 10.1109/ICMLC.2012.6359555
DO - 10.1109/ICMLC.2012.6359555
M3 - Conference contribution
AN - SCOPUS:84871605238
SN - 9781467314855
T3 - Proceedings - International Conference on Machine Learning and Cybernetics
SP - 1312
EP - 1318
BT - Proceedings of 2012 International Conference on Machine Learning and Cybernetics, ICMLC 2012
Y2 - 15 July 2012 through 17 July 2012
ER -