Nearest neighbor search
Nearest neighbor search (NNS), as a form of proximity search, is the optimization problem of finding the point in a given set that is closest (or most similar) to a given point. Closeness is typically expressed in terms of a dissimilarity function: the less similar the objects, the larger the function values.
Formally, the nearest-neighbor (NN) search problem is defined as follows: given a set S of points in a space M and a query point q ∈ M, find the closest point in S to q. Donald Knuth in vol. 3 of The Art of Computer Programming (1973) called it the post-office problem, referring to an application of assigning to a residence the nearest post office. A direct generalization of this problem is a k-NN search, where we need to find the k closest points.
Most commonly M is a metric space and dissimilarity is expressed as a distance metric, which is symmetric and satisfies the triangle inequality. Even more commonly, M is taken to be the d-dimensional vector space, where dissimilarity is measured using the Euclidean distance, Manhattan distance or another distance metric. However, the dissimilarity function can be arbitrary. One example is asymmetric Bregman divergence, for which the triangle inequality does not hold.[1]
Applications
The nearest neighbor search problem arises in numerous fields of application, including:
- Pattern recognition – in particular for optical character recognition
- Statistical classification – see k-nearest neighbor algorithm
- Computer vision – for point cloud registration[2]
- Computational geometry – see Closest pair of points problem
- Cryptanalysis – for lattice problem[3]
- Databases – e.g. content-based image retrieval
- Coding theory – see maximum likelihood decoding
- Semantic Search
- Data compression – see MPEG-2 standard
- Robotic sensing[4]
- Recommendation systems, e.g. see Collaborative filtering
- Internet marketing – see contextual advertising and behavioral targeting
- DNA sequencing
- Spell checking – suggesting correct spelling
- Plagiarism detection
- Similarity scores for predicting career paths of professional athletes.
- Cluster analysis – assignment of a set of observations into subsets (called clusters) so that observations in the same cluster are similar in some sense, usually based on Euclidean distance
- Chemical similarity
- Sampling-based motion planning
Methods
Various solutions to the NNS problem have been proposed. The quality and usefulness of the algorithms are determined by the time complexity of queries as well as the space complexity of any search data structures that must be maintained. The informal observation usually referred to as the curse of dimensionality states that there is no general-purpose exact solution for NNS in high-dimensional Euclidean space using polynomial preprocessing and polylogarithmic search time.
Exact methods
Linear search
The simplest solution to the NNS problem is to compute the distance from the query point to every other point in the database, keeping track of the "best so far". This algorithm, sometimes referred to as the naive approach, has a running time of O(dN), where N is the cardinality of S and d is the dimensionality of M. There are no search data structures to maintain, so the linear search has no space complexity beyond the storage of the database. Naive search can, on average, outperform space partitioning approaches on higher dimensional spaces.[5]
The absolute distance is not required for distance comparison, only the relative distance. In geometric coordinate systems the distance calculation can be sped up considerably by omitting the square root calculation from the distance calculation between two coordinates. The distance comparison will still yield identical results.
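A minimal sketch of the naive linear scan that uses this shortcut, comparing squared distances and taking a single square root only for the final answer (the point set S and query q below are hypothetical examples):

```python
import numpy as np

def linear_nearest_neighbor(points, query):
    """Naive O(dN) scan: compare squared distances to avoid per-point square roots."""
    query = np.asarray(query)
    best_index, best_sq_dist = -1, float("inf")
    for i, p in enumerate(points):
        sq_dist = float(np.sum((np.asarray(p) - query) ** 2))  # squared Euclidean distance
        if sq_dist < best_sq_dist:
            best_index, best_sq_dist = i, sq_dist
    return best_index, np.sqrt(best_sq_dist)

# Hypothetical usage: N = 1000 points in d = 3 dimensions
S = np.random.rand(1000, 3)
q = np.array([0.5, 0.5, 0.5])
idx, dist = linear_nearest_neighbor(S, q)
```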
Space partitioning
Since the 1970s, the branch and bound methodology has been applied to the problem. In the case of Euclidean space, this approach encompasses spatial index or spatial access methods. Several space-partitioning methods have been developed for solving the NNS problem. Perhaps the simplest is the k-d tree, which iteratively bisects the search space into two regions containing half of the points of the parent region. Queries are performed via traversal of the tree from the root to a leaf by evaluating the query point at each split. Depending on the distance specified in the query, neighboring branches that might contain hits may also need to be evaluated. For constant dimension, the average query time is O(log N)[6] in the case of randomly distributed points, while the worst-case complexity is O(kN^(1−1/k)).[7] Alternatively, the R-tree data structure was designed to support nearest neighbor search in a dynamic context, as it has efficient algorithms for insertions and deletions such as the R* tree.[8] R-trees can yield nearest neighbors not only for Euclidean distance, but can also be used with other distances.
In the case of general metric space, the branch-and-bound approach is known as the metric tree approach. Particular examples include vp-tree and BK-tree methods.
Using a set of points taken from a 3-dimensional space and put into a BSP tree, and given a query point taken from the same space, a possible solution to the problem of finding the nearest point-cloud point to the query point is given in the following description of an algorithm.
(Strictly speaking, "the" nearest point may not be unique; in practice we usually only care about finding any one of the point-cloud points that lie at the shortest distance to the query point.) The idea is, at each branching of the tree, to guess that the closest point in the cloud resides in the half-space containing the query point. This may not be the case, but it is a good heuristic. After recursively solving the problem for the guessed half-space, compare the distance returned by that result with the shortest distance from the query point to the partitioning plane. This latter distance is that between the query point and the closest possible point that could exist in the half-space not searched. If it is greater than the distance returned by the earlier result, then there is no need to search the other half-space; otherwise, the problem must also be solved for the other half-space, its result compared with the former, and the better of the two returned. The performance of this algorithm is nearer to logarithmic than linear time when the query point is near the cloud, because as the distance between the query point and the closest point-cloud point approaches zero, the algorithm needs only a look-up using the query point as a key to get the correct result.
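The following sketch illustrates this guess-and-prune strategy on a k-d tree, whose axis-aligned splits are a special case of the BSP partitioning described above (points and query are hypothetical; production implementations add balancing and iterative traversal):

```python
import math

class KDNode:
    def __init__(self, point, axis, left=None, right=None):
        self.point, self.axis, self.left, self.right = point, axis, left, right

def build_kdtree(points, depth=0):
    """Recursively bisect the point set along alternating axes (median split)."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return KDNode(points[mid], axis,
                  build_kdtree(points[:mid], depth + 1),
                  build_kdtree(points[mid + 1:], depth + 1))

def nearest(node, query, best=None):
    """Branch-and-bound descent: search the query's half-space first, then visit the
    other half-space only if the splitting plane is closer than the best distance so far."""
    if node is None:
        return best
    dist = math.dist(node.point, query)
    if best is None or dist < best[1]:
        best = (node.point, dist)
    diff = query[node.axis] - node.point[node.axis]
    near, far = (node.left, node.right) if diff < 0 else (node.right, node.left)
    best = nearest(near, query, best)
    if abs(diff) < best[1]:            # the unsearched half-space could still contain a closer point
        best = nearest(far, query, best)
    return best

# Hypothetical usage
tree = build_kdtree([(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)])
point, distance = nearest(tree, (9, 2))
```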
Approximation methods
An approximate nearest neighbor search algorithm is allowed to return points whose distance from the query is at most c times the distance from the query to its nearest points. The appeal of this approach is that, in many cases, an approximate nearest neighbor is almost as good as the exact one. In particular, if the distance measure accurately captures the notion of user quality, then small differences in the distance should not matter.[9]
Greedy search in proximity neighborhood graphs
Proximity graph methods (such as navigable small world graphs[10] and HNSW[11][12]) are considered the current state-of-the-art for the approximate nearest neighbors search.
The methods are based on greedy traversal of proximity neighborhood graphs G(V, E) in which every point x_i ∈ S is uniquely associated with a vertex v_i ∈ V. The search for the nearest neighbors to a query q in the set S takes the form of searching for a vertex in the graph G(V, E). The basic algorithm – greedy search – works as follows: search starts from an enter-point vertex v_i ∈ V by computing the distances from the query q to each vertex of its neighborhood {v_j : (v_i, v_j) ∈ E}, and then finds the vertex with the minimal distance value. If the distance value between the query and the selected vertex is smaller than the one between the query and the current element, then the algorithm moves to the selected vertex, and it becomes the new enter-point. The algorithm stops when it reaches a local minimum: a vertex whose neighborhood does not contain a vertex that is closer to the query than the vertex itself.
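A sketch of this greedy routine on a toy proximity graph (the adjacency lists, points and entry vertex below are hypothetical; practical systems such as HNSW extend this idea with hierarchical layers and candidate queues):

```python
import math

def greedy_graph_search(neighbors, points, query, entry):
    """Greedy routing: move to the closest neighbor of the current vertex until no
    neighbor is closer to the query, i.e. a local minimum is reached."""
    current = entry
    current_dist = math.dist(points[current], query)
    while True:
        best, best_dist = current, current_dist
        for v in neighbors[current]:                 # scan the neighborhood of the current vertex
            d = math.dist(points[v], query)
            if d < best_dist:
                best, best_dist = v, d
        if best == current:                          # local minimum: stop
            return current, current_dist
        current, current_dist = best, best_dist

# Hypothetical graph: vertices 0..4 with adjacency lists as the proximity graph
points = [(0, 0), (1, 0), (2, 1), (3, 3), (5, 5)]
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
vertex, dist = greedy_graph_search(neighbors, points, (4, 4), entry=0)
```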
The idea of proximity neighborhood graphs was exploited in multiple publications, including the seminal paper by Arya and Mount,[13] in the VoroNet system for the plane,[14] in the RayNet system for 𝔼^n,[15] and in the Navigable Small World,[10] Metrized Small World[16] and HNSW[11][12] algorithms for the general case of spaces with a distance function. These works were preceded by a pioneering paper by Toussaint, in which he introduced the concept of a relative neighborhood graph.[17]
Locality sensitive hashing
Locality sensitive hashing (LSH) is a technique for grouping points in space into 'buckets' based on some distance metric operating on the points. Points that are close to each other under the chosen metric are mapped to the same bucket with high probability.[18]
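A minimal sketch using one common LSH family, random-hyperplane hashing for angular (cosine) similarity; the number of bits, the data and the query below are illustrative assumptions rather than part of any particular library:

```python
import numpy as np

rng = np.random.default_rng(0)

def lsh_signature(vector, hyperplanes):
    """Random-hyperplane LSH: each hyperplane contributes one sign bit, and vectors
    with a small angle between them tend to agree on most bits."""
    return tuple((hyperplanes @ vector > 0).astype(int))

# Hypothetical setup: hash 1000 points in 64 dimensions into buckets keyed by 8-bit signatures
d, n_bits = 64, 8
hyperplanes = rng.standard_normal((n_bits, d))
points = rng.standard_normal((1000, d))

buckets = {}
for i, p in enumerate(points):
    buckets.setdefault(lsh_signature(p, hyperplanes), []).append(i)

query = rng.standard_normal(d)
candidates = buckets.get(lsh_signature(query, hyperplanes), [])  # only these are checked exactly
```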
Nearest neighbor search in spaces with small intrinsic dimension
The cover tree has a theoretical bound that is based on the dataset's doubling constant. The bound on search time is O(c^12 log n) where c is the expansion constant of the dataset.
Projected radial search
In the special case where the data is a dense 3D map of geometric points, the projection geometry of the sensing technique can be used to dramatically simplify the search problem. This approach requires that the 3D data is organized by a projection to a two-dimensional grid and assumes that the data is spatially smooth across neighboring grid cells with the exception of object boundaries. These assumptions are valid when dealing with 3D sensor data in applications such as surveying, robotics and stereo vision but may not hold for unorganized data in general. In practice this technique has an average search time of O(1) or O(K) for the k-nearest neighbor problem when applied to real world stereo vision data.[4]
Vector approximation files
In high-dimensional spaces, tree indexing structures become useless because an increasing percentage of the nodes need to be examined anyway. To speed up linear search, a compressed version of the feature vectors stored in RAM is used to prefilter the datasets in a first run. The final candidates are determined in a second stage using the uncompressed data from the disk for distance calculation.[19]
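A rough sketch of this two-stage idea, with plain uniform 8-bit quantization standing in for the VA-file's per-component bit allocation; the data sizes and shortlist length are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize(vectors, bits=8):
    """Uniform per-component quantization: the compact codes act as a RAM-resident
    approximation of the full-precision vectors."""
    lo, hi = vectors.min(axis=0), vectors.max(axis=0)
    scale = (2**bits - 1) / np.where(hi > lo, hi - lo, 1.0)
    return np.round((vectors - lo) * scale).astype(np.uint8), lo, scale

def two_stage_search(vectors, codes, lo, scale, query, shortlist=20):
    """Stage 1: rank by distances on the compressed codes (prefilter).
    Stage 2: re-rank the shortlist with exact full-precision distances."""
    q_code = np.round((query - lo) * scale)
    approx = np.linalg.norm(codes.astype(float) - q_code, axis=1)
    candidates = np.argsort(approx)[:shortlist]
    exact = np.linalg.norm(vectors[candidates] - query, axis=1)
    return candidates[np.argmin(exact)]

# Hypothetical data: 10,000 vectors in 128 dimensions
X = rng.standard_normal((10_000, 128)).astype(np.float32)
codes, lo, scale = quantize(X)
q = rng.standard_normal(128).astype(np.float32)
best = two_stage_search(X, codes, lo, scale, q)
```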
Compression/clustering based search
The VA-file approach is a special case of a compression based search, where each feature component is compressed uniformly and independently. The optimal compression technique in multidimensional spaces is Vector Quantization (VQ), implemented through clustering. The database is clustered and the most "promising" clusters are retrieved. Huge gains over VA-File, tree-based indexes and sequential scan have been observed.[20][21] Also note the parallels between clustering and LSH.
Variants
There are numerous variants of the NNS problem and the two most well-known are the k-nearest neighbor search and the ε-approximate nearest neighbor search.
k-nearest neighbors
k-nearest neighbor search identifies the top k nearest neighbors to the query. This technique is commonly used in predictive analytics to estimate or classify a point based on the consensus of its neighbors. k-nearest neighbor graphs are graphs in which every point is connected to its k nearest neighbors.
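A brute-force sketch of the selection step, keeping the k best candidates in a bounded heap (the index structures described above replace the full scan, but the top-k bookkeeping is the same; the data is illustrative):

```python
import heapq
import math

def k_nearest(points, query, k):
    """Return the k closest points as (distance, index) pairs, using a size-k max-heap."""
    heap = []  # stores (-distance, index), so the farthest retained candidate sits at the root
    for i, p in enumerate(points):
        d = math.dist(p, query)
        if len(heap) < k:
            heapq.heappush(heap, (-d, i))
        elif d < -heap[0][0]:                 # closer than the current k-th best: replace it
            heapq.heapreplace(heap, (-d, i))
    return sorted((-neg_d, i) for neg_d, i in heap)

# Hypothetical usage
pts = [(0, 0), (1, 1), (2, 2), (5, 5), (0.5, 0.2)]
print(k_nearest(pts, (0, 0), k=3))
```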
Approximate nearest neighbor
In some applications it may be acceptable to retrieve a "good guess" of the nearest neighbor. In those cases, we can use an algorithm which doesn't guarantee to return the actual nearest neighbor in every case, in return for improved speed or memory savings. Often such an algorithm will find the nearest neighbor in a majority of cases, but this depends strongly on the dataset being queried.
Algorithms that support the approximate nearest neighbor search include locality-sensitive hashing, best bin first and balanced box-decomposition tree based search.[22]
Nearest neighbor distance ratio
Nearest neighbor distance ratio does not apply the threshold to the direct distance from the query point to the candidate neighbor, but to a ratio of it that depends on the distance to the previous neighbor. It is used in CBIR to retrieve pictures through a "query by example" using the similarity between local features. More generally it is involved in several matching problems.
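A sketch of such a ratio test as it is commonly applied in local-feature matching; the 0.8 threshold and the 2-D "descriptors" below are illustrative assumptions:

```python
import math

def passes_ratio_test(query, candidates, ratio=0.8):
    """Accept the nearest candidate only if it is sufficiently closer than the
    second-nearest one, i.e. the distance ratio falls below the threshold."""
    dists = sorted(math.dist(query, c) for c in candidates)
    if len(dists) < 2:
        return True
    return dists[0] < ratio * dists[1]

# Hypothetical descriptors (2-D for readability)
print(passes_ratio_test((0, 0), [(0.1, 0.0), (2.0, 2.0), (3.0, 1.0)]))  # True: clear winner
print(passes_ratio_test((0, 0), [(1.0, 0.0), (1.05, 0.0)]))             # False: ambiguous match
```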
Fixed-radius near neighbors
Fixed-radius near neighbors is the problem where one wants to efficiently find all points given in Euclidean space within a given fixed distance from a specified point. The distance is assumed to be fixed, but the query point is arbitrary.
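For low-dimensional data this is often solved by hashing the points into a uniform grid whose cell size equals the radius, so only the query's cell and its adjacent cells need to be scanned. A 2-D sketch (data and radius are illustrative):

```python
import math
from collections import defaultdict

def build_grid(points, radius):
    """Bucket points into square cells of side length `radius`, so every neighbor
    within `radius` of a query lies in the query's cell or one of the adjacent cells."""
    grid = defaultdict(list)
    for p in points:
        cell = tuple(int(math.floor(c / radius)) for c in p)
        grid[cell].append(p)
    return grid

def fixed_radius_neighbors(grid, query, radius):
    cx, cy = (int(math.floor(c / radius)) for c in query)
    hits = []
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            for p in grid.get((cx + dx, cy + dy), []):
                if math.dist(p, query) <= radius:   # exact check within the candidate cells
                    hits.append(p)
    return hits

# Hypothetical 2-D usage
pts = [(0.1, 0.1), (0.4, 0.2), (2.0, 2.0)]
grid = build_grid(pts, radius=0.5)
print(fixed_radius_neighbors(grid, (0.0, 0.0), radius=0.5))
```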
All nearest neighbors
For some applications (e.g. entropy estimation), we may have N data-points and wish to know which is the nearest neighbor for every one of those N points. This could, of course, be achieved by running a nearest-neighbor search once for every point, but an improved strategy would be an algorithm that exploits the information redundancy between these N queries to produce a more efficient search. As a simple example: when we find the distance from point X to point Y, that also tells us the distance from point Y to point X, so the same calculation can be reused in two different queries.
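A brute-force sketch that exploits exactly this symmetry, computing each pairwise distance once and updating both endpoints (the asymptotically faster algorithms cited in the next paragraph are considerably more involved):

```python
import math

def all_nearest_neighbors(points):
    """Brute-force all-NN: each pairwise distance is computed once, since
    d(X, Y) = d(Y, X), and used to update the answer for both points."""
    n = len(points)
    best = [(None, float("inf"))] * n   # (index of nearest neighbor, distance) per point
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(points[i], points[j])
            if d < best[i][1]:
                best[i] = (j, d)
            if d < best[j][1]:
                best[j] = (i, d)
    return best

# Hypothetical usage
print(all_nearest_neighbors([(0, 0), (1, 0), (5, 5)]))
```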
Given a fixed dimension, a semi-definite positive norm (thereby including every Lp norm), and n points in this space, the nearest neighbour of every point can be found in O(n log n) time and the m nearest neighbours of every point can be found in O(mn log n) time.[23][24]
See also
- Ball tree
- Closest pair of points problem
- Cluster analysis
- Content-based image retrieval
- Curse of dimensionality
- Digital signal processing
- Dimension reduction
- Fixed-radius near neighbors
- Fourier analysis
- Instance-based learning
- k-nearest neighbor algorithm
- Linear least squares
- Locality sensitive hashing
- Maximum inner-product search
- MinHash
- Multidimensional analysis
- Nearest-neighbor interpolation
- Neighbor joining
- Principal component analysis
- Range search
- Similarity learning
- Singular value decomposition
- Sparse distributed memory
- Statistical distance
- Time series
- Voronoi diagram
- Wavelet
References
Citations
- ^ Cayton, Lawrence (2008). "Fast nearest neighbor retrieval for Bregman divergences". Proceedings of the 25th International Conference on Machine Learning. pp. 112–119. doi:10.1145/1390156.1390171. ISBN 9781605582054. S2CID 12169321.
- ^ Qiu, Deyuan, Stefan May, and Andreas Nüchter. "GPU-accelerated nearest neighbor search for 3D registration." International conference on computer vision systems. Springer, Berlin, Heidelberg, 2009.
- ^ Becker, Ducas, Gama, and Laarhoven. "New directions in nearest neighbor searching with applications to lattice sieving." Proceedings of the twenty-seventh annual ACM-SIAM symposium on Discrete algorithms (pp. 10-24). Society for Industrial and Applied Mathematics.
- ^ a b Bewley, A.; Upcroft, B. (2013). Advantages of Exploiting Projection Structure for Segmenting Dense 3D Point Clouds (PDF). Australian Conference on Robotics and Automation.
- ^ Weber, Roger; Schek, Hans-J.; Blott, Stephen (1998). "A quantitative analysis and performance study for similarity search methods in high dimensional spaces" (PDF). VLDB '98 Proceedings of the 24th International Conference on Very Large Data Bases. pp. 194–205.
- ^ Andrew Moore. "An introductory tutorial on KD trees" (PDF). Archived from the original (PDF) on 2016-03-03. Retrieved 2008-10-03.
- ^ Lee, D. T.; Wong, C. K. (1977). "Worst-case analysis for region and partial region searches in multidimensional binary search trees and balanced quad trees". Acta Informatica. 9 (1): 23–29. doi:10.1007/BF00263763. S2CID 36580055.
- ^ Roussopoulos, N.; Kelley, S.; Vincent, F. D. R. (1995). "Nearest neighbor queries". Proceedings of the 1995 ACM SIGMOD international conference on Management of data – SIGMOD '95. p. 71. doi:10.1145/223784.223794. ISBN 0897917316.
- ^ Andoni, A.; Indyk, P. (2006-10-01). "Near-Optimal Hashing Algorithms for Approximate Nearest Neighbor in High Dimensions". 2006 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS'06). pp. 459–468. CiteSeerX 10.1.1.142.3471. doi:10.1109/FOCS.2006.49. ISBN 978-0-7695-2720-8.
- ^ a b Malkov, Yury; Ponomarenko, Alexander; Logvinov, Andrey; Krylov, Vladimir (2012), Navarro, Gonzalo; Pestov, Vladimir (eds.), "Scalable Distributed Algorithm for Approximate Nearest Neighbor Search Problem in High Dimensional General Metric Spaces", Similarity Search and Applications, vol. 7404, Berlin, Heidelberg: Springer Berlin Heidelberg, pp. 132–147, doi:10.1007/978-3-642-32153-5_10, ISBN 978-3-642-32152-8, retrieved 2024-01-16
- ^ a b Malkov, Yury; Yashunin, Dmitry (2016). "Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs". arXiv:1603.09320 [cs.DS].
- ^ a b Malkov, Yu A.; Yashunin, D. A. (2020-04-01). "Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs". IEEE Transactions on Pattern Analysis and Machine Intelligence. 42 (4): 824–836. arXiv:1603.09320. doi:10.1109/TPAMI.2018.2889473. ISSN 0162-8828. PMID 30602420.
- ^ Arya, Sunil; Mount, David (1993). "Approximate Nearest Neighbor Queries in Fixed Dimensions". Proceedings of the Fourth Annual ACM/SIGACT-SIAM Symposium on Discrete Algorithms, 25–27 January 1993, Austin, Texas: 271–280.
- ^ Olivier, Beaumont; Kermarrec, Anne-Marie; Marchal, Loris; Rivière, Etienne (2006). "Voro Net: A scalable object network based on Voronoi tessellations" (PDF). 2007 IEEE International Parallel and Distributed Processing Symposium. Vol. RR-5833. pp. 23–29. doi:10.1109/IPDPS.2007.370210. ISBN 1-4244-0909-8. S2CID 8844431.
- ^ Olivier, Beaumont; Kermarrec, Anne-Marie; Rivière, Etienne (2007). "Peer to Peer Multidimensional Overlays: Approximating Complex Structures". Principles of Distributed Systems. Lecture Notes in Computer Science. Vol. 4878. pp. 315–328. CiteSeerX 10.1.1.626.2980. doi:10.1007/978-3-540-77096-1_23. ISBN 978-3-540-77095-4.
- ^ Malkov, Yury; Ponomarenko, Alexander; Krylov, Vladimir; Logvinov, Andrey (2014). "Approximate nearest neighbor algorithm based on navigable small world graphs". Information Systems. 45: 61–68. doi:10.1016/j.is.2013.10.006. S2CID 9896397.
- ^ Toussaint, Godfried (1980). "The relative neighbourhood graph of a finite planar set". Pattern Recognition. 12 (4): 261–268. Bibcode:1980PatRe..12..261T. doi:10.1016/0031-3203(80)90066-7.
- ^ A. Rajaraman & J. Ullman (2010). "Mining of Massive Datasets, Ch. 3".
- ^ Weber, Roger; Blott, Stephen. "An Approximation-Based Data Structure for Similarity Search" (PDF). S2CID 14613657. Archived from the original (PDF) on 2017-03-04.
- ^ Ramaswamy, Sharadh; Rose, Kenneth (2007). "Adaptive cluster-distance bounding for similarity search in image databases". ICIP.
- ^ Ramaswamy, Sharadh; Rose, Kenneth (2010). "Adaptive cluster-distance bounding for high-dimensional indexing". TKDE.
- ^ Arya, S.; Mount, D. M.; Netanyahu, N. S.; Silverman, R.; Wu, A. (1998). "An optimal algorithm for approximate nearest neighbor searching" (PDF). Journal of the ACM. 45 (6): 891–923. CiteSeerX 10.1.1.15.3125. doi:10.1145/293347.293348. S2CID 8193729. Archived from the original (PDF) on 2016-03-03. Retrieved 2009-05-29.
- ^ Clarkson, Kenneth L. (1983), "Fast algorithms for the all nearest neighbors problem", 24th IEEE Symp. Foundations of Computer Science, (FOCS '83), pp. 226–232, doi:10.1109/SFCS.1983.16, ISBN 978-0-8186-0508-6, S2CID 16665268.
- ^ Vaidya, P. M. (1989). "An O(n log n) Algorithm for the All-Nearest-Neighbors Problem". Discrete and Computational Geometry. 4 (1): 101–115. doi:10.1007/BF02187718.
Sources
- Andrews, L. (November 2001). "A template for the nearest neighbor problem". C/C++ Users Journal. 19 (11): 40–49. ISSN 1075-2838.
- Arya, S.; Mount, D.M.; Netanyahu, N. S.; Silverman, R.; Wu, A. Y. (1998). "An Optimal Algorithm for Approximate Nearest Neighbor Searching in Fixed Dimensions". Journal of the ACM. 45 (6): 891–923. CiteSeerX 10.1.1.15.3125. doi:10.1145/293347.293348. S2CID 8193729.
- Beyer, K.; Goldstein, J.; Ramakrishnan, R.; Shaft, U. (1999). "When is nearest neighbor meaningful?". Proceedings of the 7th ICDT.
- Chen, Chung-Min; Ling, Yibei (2002). "A Sampling-Based Estimator for Top-k Query". ICDE: 617–627.
- Samet, H. (2006). Foundations of Multidimensional and Metric Data Structures. Morgan Kaufmann. ISBN 978-0-12-369446-1.
- Zezula, P.; Amato, G.; Dohnal, V.; Batko, M. (2006). Similarity Search – The Metric Space Approach. Springer. ISBN 978-0-387-29146-8.
Further reading
- Shasha, Dennis (2004). High Performance Discovery in Time Series. Berlin: Springer. ISBN 978-0-387-00857-8.
External links
- Nearest Neighbors and Similarity Search – a website dedicated to educational materials, software, literature, researchers, open problems and events related to NN searching. Maintained by Yury Lifshits
- Similarity Search Wiki – a collection of links, people, ideas, keywords, papers, slides, code and data sets on nearest neighbours