Vol. 2025 No. 2 (2025)
Articles

Probabilistic analysis of data structures for locality-sensitive hashing functions

Shuxia Gao
China Science and Technology University Press, No. 96 Jin Zhai Road, Hefei, Anhui Province,230026, China
Chaochao Li
China Science and Technology University Press, No. 96 Jin Zhai Road, Hefei, Anhui Province,230026, China

Published 17-03-2025

Keywords

  • Locality-Sensitive Hashing (LSH),
  • Collision Probability,
  • Probabilistic Analysis of Algorithms

How to Cite

[1]
S. Gao and C. Li, “Probabilistic analysis of data structures for locality-sensitive hashing functions”, Camb. Sci. Adv., vol. 2025, no. 2, pp. 13–16, Mar. 2025, doi: 10.62852/csa/2025/134.

Abstract

For the nearest neighbor search problem in high-dimensional spaces, Locality-Sensitive Hashing (LSH)has shown excellent performance in terms of query cost and disk space utilization. Under the traditional analysis model, LSH is considered a randomized algorithm, with the only uncertainty being the choice of the hash function. In this research, the collision probability obtained under this model is referred to as the hash-function-based collision probability. In this paper, a different analysis model is used to conduct a theoretical analysis of LSH. The motivation for this work is twofold:1) Under the existing analysis model, users must generate random data structures for each query point to achieve the theoretical effect, which is impractical in real applications.2) The performance metric that users are concerned with is the expected collision probability of a random query point in a single data structure. Based on this, this paper derives the collision probability of random point pairs under the Hamming distance for any single hash function. The collision probability derived under this model is referred to as the query-based collision probability. It is also proven that in the Hamming space, the two types of collision probabilities are identical.

References

  1. ANDONI A, INDYK P. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. Commun [J].ACM,2008,51(1):117-122. DOI: https://doi.org/10.1145/1327452.1327494
  2. KE Yan, SUKTHANKARR, HUSTON L. An efficient parts-based near-duplicate and sub-image retrieval system [C]//ACM Multimedia. New York, NY, USA:ACM,2004:869-876. DOI: https://doi.org/10.1145/1027527.1027729
  3. GAN Junhao, FENG Jianlin, FANG Qiong, et al. Locality-sensitive hashing scheme based on dynamic collision counting[C]//SIGMOD. Scottsdale, AZ, USA:ACM,2012:541-552. DOI: https://doi.org/10.1145/2213836.2213898
  4. INDYK P, MOTWANI R. Approximate nearest neighbors: Towards removing the curse of dimensionality[C]//STOC. Dallas, Texas, USA:ACM, 1998:604-613. DOI: https://doi.org/10.1145/276698.276876
  5. LV Qin, JOSEPHSON W, WANG Zhe, et al. Multiprobe ls h: Efficient indexing for high- dimensional similarity search[C]//VLDB. Vienna, Austria:ACM,2007:950-961.
  6. WANG Hong Ya, CAO Jiao, SHU LC, et al. Locality sensitive hashing revisited: Filling the gap between theory and algorithm analysis [C] //CIKM. San Francisco, CA, USA: ACM, 2013:1969- 1978. DOI: https://doi.org/10.1145/2505515.2505765
  7. BRODER AZ, CHARIKAR M, FRIEZE AM, et al. Min-wise independent permutations (extended abstract) [C]//STOC. Dallas, Texas, USA:ACM,1998,60(3):327-336. DOI: https://doi.org/10.1145/276698.276781
  8. CHARIKAR M. Similarity estimation techniques from rounding algorithms [C]//STOC. Montreal, Quebec, Canadapages: ACM,2002:380-388. DOI: https://doi.org/10.1145/509907.509965
  9. DATAR M, IMMORLICA N, INDYK P, et al. Locality-sensitive hashing scheme based on p-stable distributions [C]//So CG. Brooklyn, New York, USA:ACM,2004:253-262. DOI: https://doi.org/10.1145/997817.997857
  10. TAO Yufei, YI Ke, SHENG Cheng, et al. Quality and efficiency in high dimensional nearest neighbor search [C]//SIGMOD. Providence, Rhode Island, USA:ACM,2009:563-576. DOI: https://doi.org/10.1145/1559845.1559905
  11. SUNDARAM N, TURMUKHAMETOVA A, SATISH N, et al. Streaming similarity search over one billion tweets' using parallel locality-sensitive hashing[J]. PVLDB, 2013,6(14): 1930-1941. DOI: https://doi.org/10.14778/2556549.2556574
  12. MITZENMACHER M, UPFAL E. Probability and computing- randomized algorithms and probabilistic analysis[M]. Cambridge: Cambridge University Press,2005. DOI: https://doi.org/10.1017/CBO9780511813603
  13. BRODER AZ, GLASSMANSC, MANASSE MS, et al. Syntactic clustering of the web[J]. Computer Networks,1997,29(8-13): 1157-1166. DOI: https://doi.org/10.1016/S0169-7552(97)00031-7
  14. KLEINBERGJM. Two algorithms for nearest-neighbor search in high dimensions[C]//STOC. El Paso, Texas, USA:ACM,1997:599- 608. DOI: https://doi.org/10.1145/258533.258653