LHS: A novel method of information retrieval avoiding an index using linear hashing with key groups in deduplication

Zhike Zhang; Zejun Jiang; Zhiqiang Liu; Chengzhang Peng

doi:10.1109/ICMLC.2012.6359555

School of Cybersecurity

Northwestern Polytechnical University Xian

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Peer-reviewed

3 citations (Scopus)

Abstract

Indexing in RAM is important to information retrieval. Deduplication systems rely on information retrieval methods to find duplicate data chunks quickly, and the chunk-lookup disk bottleneck is one of the most important problems they face. Previous methods greatly reduce the RAM usage of the index so that it need not be read from disk for every chunk search. However, these methods still need several TB of RAM to hold the index for dozens of PB of storage. We design Linear Hashing with Key Groups (LHs), a variation of Linear Hashing, to organize and address bins. Based on LHs, we propose a novel method of information retrieval in deduplication that avoids an index in RAM by using LHs to compute the address of a bin. A bin contains the chunk IDs of the files similar to a given file, so no index needs to be maintained in RAM to locate it. Our method does not decrease deduplication efficiency compared with Extreme Binning, while needing one disk read for every file. For every file, our method first computes the file's bin address using LHs, loads the bin, and then deduplicates the file against the loaded bin. Experimental results show that, although our method needs no index in RAM, its deduplication efficiency is slightly better than that of Extreme Binning.
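
The abstract does not spell out how LHs groups keys, so the Python sketch below only illustrates the classic linear-hashing address computation that LHs is described as a variation of, followed by the per-file flow the abstract outlines (compute a bin address, load the bin, deduplicate against it). The function names, the SHA-1 chunk IDs, the in-memory `bins` dictionary standing in for disk, and the choice of the smallest chunk ID as the file's key are all assumptions made for illustration, not details taken from the paper.

```python
import hashlib


def linear_hash_address(key: int, level: int, split_pointer: int, n0: int = 1) -> int:
    """Classic linear hashing: map a key to a bucket without a directory.

    n0 is the initial bucket count, level the number of completed doubling
    rounds, and split_pointer the next bucket due to be split.
    """
    address = key % (n0 * 2 ** level)
    if address < split_pointer:
        # This bucket was already split in the current round, so use the
        # next-level hash function to choose between it and its new sibling.
        address = key % (n0 * 2 ** (level + 1))
    return address


def chunk_id(chunk: bytes) -> int:
    """Hypothetical chunk ID: the SHA-1 digest read as an integer."""
    return int.from_bytes(hashlib.sha1(chunk).digest(), "big")


def deduplicate_file(chunks, bins, level, split_pointer, n0=1):
    """Return the bin address and the chunks of one file not already in its bin.

    `bins` is a dict from bin address to a set of chunk IDs. It stands in
    for on-disk bins, so one lookup here models one disk read.
    Assumes `chunks` is non-empty.
    """
    # Assumption: the file's key is its smallest chunk ID.
    file_key = min(chunk_id(c) for c in chunks)
    address = linear_hash_address(file_key, level, split_pointer, n0)
    known_ids = bins.setdefault(address, set())  # "load" the bin
    new_chunks = []
    for chunk in chunks:
        cid = chunk_id(chunk)
        if cid not in known_ids:
            known_ids.add(cid)
            new_chunks.append(chunk)
    return address, new_chunks
```

With `level=2`, `split_pointer=2`, and `n0=1`, addresses fall in the range 0 to 5: keys that land in bucket 0 or 1 under `key % 4` are rehashed with `key % 8`, and all other keys keep their `key % 4` address. Because the address is computed from the key and two small integers, no per-bin index has to be kept in RAM in this sketch.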

Original language	English
Title of host publication	Proceedings of 2012 International Conference on Machine Learning and Cybernetics, ICMLC 2012
Pages	1312-1318
Number of pages	7
DOI	https://doi.org/10.1109/ICMLC.2012.6359555
Publication status	Published - 2012
Event	2012 International Conference on Machine Learning and Cybernetics, ICMLC 2012 - Xian, Shaanxi, China. Duration: 15 Jul 2012 → 17 Jul 2012

Publication series

Name	Proceedings - International Conference on Machine Learning and Cybernetics
Volume	4
ISSN (Print)	2160-133X
ISSN (Electronic)	2160-1348

Conference

Conference	2012 International Conference on Machine Learning and Cybernetics, ICMLC 2012
Country/Territory	China
City	Xian, Shaanxi
Period	15/07/12 → 17/07/12

Access to document

10.1109/ICMLC.2012.6359555

Cite this

Zhang, Z., Jiang, Z., Liu, Z., & Peng, C. (2012). LHS: A novel method of information retrieval avoiding an index using linear hashing with key groups in deduplication. In Proceedings of 2012 International Conference on Machine Learning and Cybernetics, ICMLC 2012 (pp. 1312-1318). Article 6359555 (Proceedings - International Conference on Machine Learning and Cybernetics; Vol. 4). https://doi.org/10.1109/ICMLC.2012.6359555

@inproceedings{2efe0c806bab4b44a6c65c80434bc1a4,

title = "LHS: A novel method of information retrieval avoiding an index using linear hashing with key groups in deduplication",

abstract = "Indexing of RAM is important to information retrieval. In deduplication systems, we need to use methods of information retrieval to find duplicate data chunks quickly. Chunk-lookup disk bottleneck problem is one of the most important problems in the information retrieval of deduplication systems. Previous methods can reduce RAM usage of index a lot to avoid reading index from disk for every chunk search. However, these methods still need several TB of RAM to hold the index for dozens of PB of storage space utilization. We design Linear Hashing with Key Groups(LHs), a variation of Linear Hashing, to organize and address bins. Based on LHs, we propose a novel method of information retrieval in deduplication, which can avoid an index in RAM by utilizing LHs to compute the address of a bin. A bin contains the chunk IDs of the similar files to a file. Then, we do not need to maintain an index in RAM to do the same thing. Our method does not decrease the deduplication efficiency compared with Extreme Binning, when it needs one disk read for every file. For every file, our method firstly computes the bin address of this file using LHs, loads the bin and then deduplicates the file against the loaded bin. Experimental results show that, while our method does not need an index in RAM, the deduplication efficiency of our method is slightly better than that of Extreme Binning.",

keywords = "chunk-lookup disk bottleneck problem, Deduplication, index, linear hashing",

author = "Zhike Zhang and Zejun Jiang and Zhiqiang Liu and Chengzhang Peng",

year = "2012",

doi = "10.1109/ICMLC.2012.6359555",

language = "English",

isbn = "9781467314855",

series = "Proceedings - International Conference on Machine Learning and Cybernetics",

pages = "1312--1318",

booktitle = "Proceedings of 2012 International Conference on Machine Learning and Cybernetics, ICMLC 2012",

note = "2012 International Conference on Machine Learning and Cybernetics, ICMLC 2012 ; Conference date: 15-07-2012 Through 17-07-2012",

}

Zhang, Z, Jiang, Z, Liu, Z & Peng, C 2012, LHS: A novel method of information retrieval avoiding an index using linear hashing with key groups in deduplication. in Proceedings of 2012 International Conference on Machine Learning and Cybernetics, ICMLC 2012., 6359555, Proceedings - International Conference on Machine Learning and Cybernetics, vol. 4, pp. 1312-1318, 2012 International Conference on Machine Learning and Cybernetics, ICMLC 2012, Xian, Shaanxi, China, 15/07/12. https://doi.org/10.1109/ICMLC.2012.6359555

LHS: A novel method of information retrieval avoiding an index using linear hashing with key groups in deduplication. / Zhang, Zhike; Jiang, Zejun; Liu, Zhiqiang et al.
Proceedings of 2012 International Conference on Machine Learning and Cybernetics, ICMLC 2012. 2012. p. 1312-1318 6359555 (Proceedings - International Conference on Machine Learning and Cybernetics; Vol. 4).

TY - GEN

T1 - LHS

T2 - 2012 International Conference on Machine Learning and Cybernetics, ICMLC 2012

AU - Zhang, Zhike

AU - Jiang, Zejun

AU - Liu, Zhiqiang

AU - Peng, Chengzhang

PY - 2012

Y1 - 2012

AB - Indexing of RAM is important to information retrieval. In deduplication systems, we need to use methods of information retrieval to find duplicate data chunks quickly. Chunk-lookup disk bottleneck problem is one of the most important problems in the information retrieval of deduplication systems. Previous methods can reduce RAM usage of index a lot to avoid reading index from disk for every chunk search. However, these methods still need several TB of RAM to hold the index for dozens of PB of storage space utilization. We design Linear Hashing with Key Groups(LHs), a variation of Linear Hashing, to organize and address bins. Based on LHs, we propose a novel method of information retrieval in deduplication, which can avoid an index in RAM by utilizing LHs to compute the address of a bin. A bin contains the chunk IDs of the similar files to a file. Then, we do not need to maintain an index in RAM to do the same thing. Our method does not decrease the deduplication efficiency compared with Extreme Binning, when it needs one disk read for every file. For every file, our method firstly computes the bin address of this file using LHs, loads the bin and then deduplicates the file against the loaded bin. Experimental results show that, while our method does not need an index in RAM, the deduplication efficiency of our method is slightly better than that of Extreme Binning.

KW - chunk-lookup disk bottleneck problem

KW - Deduplication

KW - index

KW - linear hashing

UR - http://www.scopus.com/inward/record.url?scp=84871605238&partnerID=8YFLogxK

U2 - 10.1109/ICMLC.2012.6359555

DO - 10.1109/ICMLC.2012.6359555

M3 - Conference contribution

AN - SCOPUS:84871605238

SN - 9781467314855

T3 - Proceedings - International Conference on Machine Learning and Cybernetics

SP - 1312

EP - 1318

BT - Proceedings of 2012 International Conference on Machine Learning and Cybernetics, ICMLC 2012

Y2 - 15 July 2012 through 17 July 2012

ER -

Zhang Z, Jiang Z, Liu Z, Peng C. LHS: A novel method of information retrieval avoiding an index using linear hashing with key groups in deduplication. In Proceedings of 2012 International Conference on Machine Learning and Cybernetics, ICMLC 2012. 2012. p. 1312-1318. 6359555. (Proceedings - International Conference on Machine Learning and Cybernetics). doi: 10.1109/ICMLC.2012.6359555
