Probabilistic analysis of data structures for locality-sensitive hashing functions

Shuxia Gao; Chaochao Li

doi:10.62852/csa/2025/134

Vol. 2025 No. 2 (2025)

Articles

Probabilistic analysis of data structures for locality-sensitive hashing functions

PDF

Shuxia Gao,
Chaochao Li

more info

Shuxia Gao
China Science and Technology University Press, No. 96 Jin Zhai Road, Hefei, Anhui Province,230026, China

Chaochao Li
China Science and Technology University Press, No. 96 Jin Zhai Road, Hefei, Anhui Province,230026, China

Published 17-03-2025

Keywords

Locality-Sensitive Hashing (LSH),
Collision Probability,
Probabilistic Analysis of Algorithms

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

How to Cite

[1]

S. Gao and C. Li, “Probabilistic analysis of data structures for locality-sensitive hashing functions”, Camb. Sci. Adv., vol. 2025, no. 2, pp. 13–16, Mar. 2025, doi: 10.62852/csa/2025/134.

Abstract

For the nearest neighbor search problem in high-dimensional spaces, Locality-Sensitive Hashing (LSH)has shown excellent performance in terms of query cost and disk space utilization. Under the traditional analysis model, LSH is considered a randomized algorithm, with the only uncertainty being the choice of the hash function. In this research, the collision probability obtained under this model is referred to as the hash-function-based collision probability. In this paper, a different analysis model is used to conduct a theoretical analysis of LSH. The motivation for this work is twofold:1) Under the existing analysis model, users must generate random data structures for each query point to achieve the theoretical effect, which is impractical in real applications.2) The performance metric that users are concerned with is the expected collision probability of a random query point in a single data structure. Based on this, this paper derives the collision probability of random point pairs under the Hamming distance for any single hash function. The collision probability derived under this model is referred to as the query-based collision probability. It is also proven that in the Hamming space, the two types of collision probabilities are identical.

PDF

References

ANDONI A, INDYK P. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. Commun [J].ACM,2008,51(1):117-122. DOI: https://doi.org/10.1145/1327452.1327494
KE Yan, SUKTHANKARR, HUSTON L. An efficient parts-based near-duplicate and sub-image retrieval system [C]//ACM Multimedia. New York, NY, USA:ACM,2004:869-876. DOI: https://doi.org/10.1145/1027527.1027729
GAN Junhao, FENG Jianlin, FANG Qiong, et al. Locality-sensitive hashing scheme based on dynamic collision counting[C]//SIGMOD. Scottsdale, AZ, USA:ACM,2012:541-552. DOI: https://doi.org/10.1145/2213836.2213898
INDYK P, MOTWANI R. Approximate nearest neighbors: Towards removing the curse of dimensionality[C]//STOC. Dallas, Texas, USA:ACM, 1998:604-613. DOI: https://doi.org/10.1145/276698.276876
LV Qin, JOSEPHSON W, WANG Zhe, et al. Multiprobe ls h: Efficient indexing for high- dimensional similarity search[C]//VLDB. Vienna, Austria:ACM,2007:950-961.
WANG Hong Ya, CAO Jiao, SHU LC, et al. Locality sensitive hashing revisited: Filling the gap between theory and algorithm analysis [C] //CIKM. San Francisco, CA, USA: ACM, 2013:1969- 1978. DOI: https://doi.org/10.1145/2505515.2505765
BRODER AZ, CHARIKAR M, FRIEZE AM, et al. Min-wise independent permutations (extended abstract) [C]//STOC. Dallas, Texas, USA:ACM,1998,60(3):327-336. DOI: https://doi.org/10.1145/276698.276781
CHARIKAR M. Similarity estimation techniques from rounding algorithms [C]//STOC. Montreal, Quebec, Canadapages: ACM,2002:380-388. DOI: https://doi.org/10.1145/509907.509965
DATAR M, IMMORLICA N, INDYK P, et al. Locality-sensitive hashing scheme based on p-stable distributions [C]//So CG. Brooklyn, New York, USA:ACM,2004:253-262. DOI: https://doi.org/10.1145/997817.997857
TAO Yufei, YI Ke, SHENG Cheng, et al. Quality and efficiency in high dimensional nearest neighbor search [C]//SIGMOD. Providence, Rhode Island, USA:ACM,2009:563-576. DOI: https://doi.org/10.1145/1559845.1559905
SUNDARAM N, TURMUKHAMETOVA A, SATISH N, et al. Streaming similarity search over one billion tweets' using parallel locality-sensitive hashing[J]. PVLDB, 2013,6(14): 1930-1941. DOI: https://doi.org/10.14778/2556549.2556574
MITZENMACHER M, UPFAL E. Probability and computing- randomized algorithms and probabilistic analysis[M]. Cambridge: Cambridge University Press,2005. DOI: https://doi.org/10.1017/CBO9780511813603
BRODER AZ, GLASSMANSC, MANASSE MS, et al. Syntactic clustering of the web[J]. Computer Networks,1997,29(8-13): 1157-1166. DOI: https://doi.org/10.1016/S0169-7552(97)00031-7
KLEINBERGJM. Two algorithms for nearest-neighbor search in high dimensions[C]//STOC. El Paso, Texas, USA:ACM,1997:599- 608. DOI: https://doi.org/10.1145/258533.258653

Probabilistic analysis of data structures for locality-sensitive hashing functions

Keywords

How to Cite

Download Citation

Abstract

References