Elasticsearch: Investigate various implementations of ann search for vector fields

Created on 21 May 2019  ·  45 Comments  ·  Source: elastic/elasticsearch

ann (approximate nearest neighbours) will be a licensed feature of Elasticsearch (not OSS).

We plan to implement prototypes of various algorithms for ann for different distance metrics:

  • LSH and multi-probe LSH for euclidean distance (see the sketch after this list)
  • partition trees for euclidean/cosine distance
  • clustering-based approaches, including product quantization
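
For readers unfamiliar with the first bullet, here is a minimal, purely illustrative sketch of random-projection (E2LSH-style) hashing for euclidean distance. It is not the implementation planned for Elasticsearch; all names and parameters are invented for the example.

```python
# Illustrative random-projection (E2LSH-style) hashing for euclidean distance.
# This is NOT the Elasticsearch implementation; names and parameters are invented.
import numpy as np

rng = np.random.default_rng(42)

def make_hash_table(dim, n_hashes=8, w=4.0):
    """One LSH table: n_hashes random projections with bucket width w."""
    a = rng.normal(size=(n_hashes, dim))   # random projection directions
    b = rng.uniform(0, w, size=n_hashes)   # random offsets
    return a, b, w

def hash_vector(v, table):
    a, b, w = table
    return tuple(np.floor((a @ v + b) / w).astype(int))

# Index: map each bucket key to the ids of the vectors hashing into it
dim, n_vectors = 128, 10_000
data = rng.normal(size=(n_vectors, dim))
table = make_hash_table(dim)
buckets = {}
for i, v in enumerate(data):
    buckets.setdefault(hash_vector(v, table), []).append(i)

# Query: collect candidates from the query's bucket, then re-rank them exactly
query = rng.normal(size=dim)
candidates = buckets.get(hash_vector(query, table), [])
top = sorted(candidates, key=lambda i: np.linalg.norm(data[i] - query))[:10]
print(top)
```

Multi-probe LSH extends this by also checking buckets adjacent to the query's bucket, trading extra probes for fewer hash tables.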

We are interested in users' feedback about:

  • application domains
  • what distance metrics are used (euclidean vs. cosine vs. Hamming, etc.)
  • what data types are used in vectors (integers, bits, longs, floats, etc.)
Labels: :Search/Ranking, >feature, Search, feedback_needed, stalled

Most helpful comment

@dpmccabe there's also the general case of creating a document embedding based on your entire corpus of documents, encoding each document in that embedding, then encoding a search query into that vector space and searching by similarity. It's a general mechanism for search that can work better than Elasticsearch's existing methods on their own, or even better in combination with them.

It is surprising this doesn’t already exist!

All 45 comments

Pinging @elastic/es-search

You should also consider Covering-LSH - it, with some recent optimizations, appears to solve the false-negative problem common with LSH approaches.

https://arxiv.org/abs/1507.03225

There appear to be some early implementations in C++ already available; however, these do not appear to include the Hadamard code examples from the later revisions of the paper that offer performance improvements.

https://github.com/kasperisager/hemingway

@mayya-sharipova are you interested in using existing commodity approximate-nearest-neighbour implementations or in implementing an algorithm from scratch? Annoy is a simple-to-use, simple-to-tune library created by Spotify to perform ann search from an index stored on disk. It indexes vectors by integer ids, which I think you also do behind the scenes. It looks like they also have Java bindings, so it could fit well with your stack: https://github.com/spotify/annoy-java
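
For reference, Annoy's Python API looks roughly like the sketch below (dimension, metric, and tree count are arbitrary example values):

```python
# Minimal Annoy usage sketch (values here are arbitrary example settings).
import random
from annoy import AnnoyIndex

dim = 100
index = AnnoyIndex(dim, "angular")  # also supports "euclidean", "hamming", "dot", ...
for i in range(10_000):
    index.add_item(i, [random.gauss(0, 1) for _ in range(dim)])  # items keyed by integer id

index.build(10)            # number of trees; more trees -> better recall, larger index
index.save("vectors.ann")  # the index is memory-mapped from disk when loaded

query = [random.gauss(0, 1) for _ in range(dim)]
print(index.get_nns_by_vector(query, 10))  # ids of the 10 approximate nearest neighbours
```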

@markhuds Thanks for your suggestion. We are currently evaluating different ann approaches based on performance, but also on how they fit into our Elasticsearch/Lucene infrastructure. We have considered Spotify Annoy's approach as well. We still need to make a decision.

We are interested in your use case for the ann search. For what use case have you found annoy compelling?

@mayya-sharipova Isn't there an issue with using k-means as a possible solution in that you must also continually solve the 'optimal cluster count' problem as the index grows?

K-means could make sense if the domain were well known and considerable historical data existed to create meaningful centroids from, and to inform the cluster-count decision. K-means also assumes that you don't expect vector attributes to vary much from the current state, doesn't it?

@wmelton

Isn't there an issue with using k-means as a possible solution in that you must also continually solve the 'optimal cluster count' problem as the index grows?

Indeed, you are right: a sample on which k-means centroids are found should stay representative even as the index grows. It seems to me that for all ann algorithms the sample on which you build a model should be representative enough, and if your new data varies from the current state then the model has to be rebuilt. We still need to figure out how to organize this in Elasticsearch.
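
To make the trade-off concrete, here is a purely illustrative sketch of a clustering-based (IVF-style) ANN index using scikit-learn; none of this reflects Elasticsearch internals. Centroids are learned on a sample, every vector is assigned to its nearest centroid, and a query only probes a few clusters, so if new data drifts away from the training sample, recall degrades until the centroids are recomputed.

```python
# Purely illustrative k-means coarse quantization (IVF-style), using scikit-learn.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
data = rng.normal(size=(50_000, 64))

# Train centroids on a (hopefully representative) sample of the corpus
sample = data[rng.choice(len(data), 5_000, replace=False)]
kmeans = KMeans(n_clusters=100, n_init=10, random_state=0).fit(sample)

# Assign every indexed vector to its nearest centroid (an "inverted list")
assignments = kmeans.predict(data)

def ann_search(query, n_probe=5, k=10):
    """Search only the n_probe clusters whose centroids are closest to the query."""
    dists_to_centroids = np.linalg.norm(kmeans.cluster_centers_ - query, axis=1)
    probe = np.argsort(dists_to_centroids)[:n_probe]
    candidate_ids = np.flatnonzero(np.isin(assignments, probe))
    exact = np.linalg.norm(data[candidate_ids] - query, axis=1)
    return candidate_ids[np.argsort(exact)[:k]]

print(ann_search(rng.normal(size=64)))
```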

@mayya-sharipova We intend to use output from a deep ranking model that embeds products and queries into a retrieval space such that doing a nearest neighbor search with a query vector will retrieve optimal orderings of product vectors. This will allow us to integrate machine learning techniques to improve overall relevance while maintaining the rich filtering, aggregating, and sorting capabilities we currently enjoy from elasticsearch.

@mayya-sharipova I'd like to use dense vectors of floats (dimensions of 1x2048 -- last layer from Resnet50) for visual search.

@RyanZotti Thanks for your use case. From 7.6 it will be possible to create vectors of 2048 dims (currently they are restricted to 1024 dims).

When will this be available in the enterprise edition? This is a critical feature for modern search.

@mayya-sharipova Check this out: https://github.com/facebookresearch/faiss

It would be nice to use some of these suggested distance metrics in more_like_this queries. Suppose I'm storing a dense vector for every document in an index and I feel that this vector represents the document's place in the corpus better than an analyzer + tf-idf could. Then I could query for similar documents using cosine similarity or some other metric.

This would eliminate a round-trip network request (getting the source document's vector just to send it right back as part of a new query).
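
For what it's worth, exact (brute-force) cosine scoring over a dense_vector field is already possible with a script_score query in 7.x. A minimal sketch with the Python client follows (index and field names are hypothetical); note that, unlike more_like_this, the caller still has to supply the query vector itself, which is exactly the round trip described above.

```python
# Brute-force cosine scoring over a dense_vector field via script_score (ES 7.6+ syntax).
# The index name "articles" and field name "doc_vector" are hypothetical.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
query_vector = [0.1, 0.2, 0.3]  # must match the mapped dims of the dense_vector field

resp = es.search(
    index="articles",
    body={
        "size": 10,
        "query": {
            "script_score": {
                "query": {"match_all": {}},
                "script": {
                    # +1.0 keeps scores non-negative, as Elasticsearch requires
                    "source": "cosineSimilarity(params.query_vector, 'doc_vector') + 1.0",
                    "params": {"query_vector": query_vector},
                },
            }
        },
    },
)
print([hit["_id"] for hit in resp["hits"]["hits"]])
```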

@dpmccabe there's also the general case of creating a document embedding based on your entire corpus of documents, encoding each document in that embedding, then encoding a search query into that vector space and searching by similarity. It's a general mechanism for search that can work better than Elasticsearch's existing methods on their own, or even better in combination with them.

It is surprising this doesn’t already exist!

@mayya-sharipova it's great to see ANN in Elasticsearch is being worked on. Embedding-driven search is something we use here at Cookpad. A scalable solution to nearest vector search within Elasticsearch would be very useful. Much of the rest of our search stack is Elasticsearch, so moving ANN into Elasticsearch is more attractive than monolithic ANN systems (e.g., Faiss, Annoy, etc.).

In terms of scale, we're looking to score over a corpus on the order of 10^6 to 10^7 vectors, and the corpus will continue to grow quickly. dense_vector is great for a two-stage query-then-rescore approach, but of course will be expensive to scale. (For the curious: we ran some exact NN (ENN) benchmarking experiments here to see how far this would take us.)

Metrics-wise, cosine distance is suitable for our current applications of ANN. The primary content so far is text and images – no surprise that one of our key data types here is recipes! Text and image are jointly combined into a single embedding per data item. The upper limit of 1024 to 2048 dimensions will be fine for now.

Other features that are interesting:

  • Incrementally write to an ANN index and have it available for search as soon as possible, rather than waiting for a whole re-index.
  • More-like-this for embeddings similarity.

Very excited about ANN search coming to Elastic. As @rjurney mentioned, it's a cornerstone of modern search. And outside of the commercial space, I've met research scientists and academics who'd love to drop their embeddings into Elasticsearch – rather than Faiss or Annoy – to get the combined benefit of scalable ANN + simple and flexible querying + a route to multi-node parallelism if needed. They'd love it if this were OSS rather than licensed.

@mattjw Thanks a lot for the details on your use case; interesting to see your benchmarks as well.

@mayya-sharipova no problem! If you want any more information or context, please don't hesitate to ask.

@mayya-sharipova excited to see that you aim to incorporate ANN within ES. Similar to what @mattjw mentioned above, there are several IR scenarios where one would like to combine simple Lucene-based search / querying, with efficient, large-scale ANN on dense vectors (embeddings)...
Given that you are looking for existing implementations, please notice that Erik Bernhardsson, the author of Annoy, provides a benchmark for the most popular ANN frameworks (including FAISS & Annoy) with frequent updates (see https://github.com/erikbern/ann-benchmarks). Also, another recent approach is https://milvus.io/, which combines features from both FAISS and MS SPTAG into a production-oriented system...

@ometaxas Thank you for your suggestions. We are using ann-benchmarks for prototype testing, it is indeed a very useful platform. Good to learn about https://milvus.io/, we will explore it more.

I have a couple of doubts:

  • As I see it, currently elasticsearch uses LSH for nearest neighbor search, which is going to be replaced by k-means (in the near future), right?
  • Where do you store those index files, and in what format? (This question may sound general, but I am new to elasticsearch and would like to know these things.)

Thanks

@RamanRajarathinam Currently we don't have a solution for ann search, only brute-force exact knn search.
We are investigating algorithms and prototypes for ann search.

You can be inspired by FAISS, ANNOY or HNSW for fast ANN :)

@mayya-sharipova Another system similar to https://milvus.io/ is https://gnes.ai/ Both of them are Chinese companies (and moving very fast)

It is now a race for vector ANN superiority 🚀
Please don't take too long to investigate! Many 2020 product roadmaps depend on this 🤣

I have looked into this problem in the area of text embedding for some time now. There are various applications, but the one we are mainly interested in is using our own semantic embeddings (either pre-trained, such as BERT, or custom-trained), then using ANN to restrict the search space, and finally using some custom similarity score - preferably cosine distance - to find the best-matching docs. The beauty of this approach is that you could even support + and - operators in the query using vector algebra, in the following sense:

Keyword search:
(+) MUST CONTAIN
(-) MUST NOT CONTAIN

Semantic search:
(+) SHOULD CONTAIN, i.e. move the query vector into the direction of the +term
(-) SHOULD NOT CONTAIN, i.e. move the query vector away from the -term
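
A minimal numpy sketch of that query-vector algebra; the embed() function here is a toy stand-in for a real encoder such as BERT, and the alpha weight is an arbitrary example value.

```python
# Sketch of +/- query operators via vector algebra. embed() is a toy stand-in for a
# real text encoder (BERT, word2vec, ...); alpha is an arbitrary example weight.
import numpy as np

def embed(text: str, dim: int = 300) -> np.ndarray:
    """Toy deterministic 'embedding' so the example runs; replace with a real model."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=dim)
    return v / np.linalg.norm(v)

def build_query_vector(query, plus_terms=(), minus_terms=(), alpha=0.5):
    """Move the query embedding towards +terms and away from -terms, then re-normalise."""
    q = embed(query)
    for term in plus_terms:          # (+) SHOULD CONTAIN
        q = q + alpha * embed(term)
    for term in minus_terms:         # (-) SHOULD NOT CONTAIN
        q = q - alpha * embed(term)
    return q / np.linalg.norm(q)     # unit length, ready for a cosine-distance ANN lookup

print(build_query_vector("jaguar", plus_terms=["animal"], minus_terms=["car"])[:5])
```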

From what I have seen so far, there are essentially three different approaches available for ANN:

  1. kd-trees
  2. LSH and derivatives of it
  3. representing vectors as strings, restricting the string lengths (e.g. by using floor functions, rounding and/or truncation), and then "bucketing" strings that fall into the same bucket (see the sketch after this list). Probably works only for Euclidean distances though? Cannot find a good paper on this right now.
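
A guess at what approach 3 might look like in practice; this is purely speculative, and the quantisation step and token format are invented for the example.

```python
# Speculative illustration of "vectors as strings + bucketing": quantise each component,
# emit position-tagged tokens, and treat vectors sharing tokens as candidate neighbours.
# The quantisation step and token format are invented for this example.
import numpy as np

def to_tokens(v, step=0.5):
    """One string token per dimension, encoding which bucket the component falls into."""
    return {f"{i}_{int(np.floor(x / step))}" for i, x in enumerate(v)}

a = np.array([0.12, 0.93, -0.40])
b = np.array([0.20, 0.97, -0.35])
shared = to_tokens(a) & to_tokens(b)
print(sorted(shared))  # more shared tokens ~ closer vectors, dimension by dimension
```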

We have played around using kd-trees with Elasticsearch and have a working prototype, but there is an 8-dimension limit on the vectors that would require recompilation to overcome (we need ~300 dimensions for text representation; we might be able to reduce to 150 dimensions according to existing research, but not beyond), and it's questionable whether we would avoid JVM memory issues or the like by doing so.

Having ANN as a licensed feature rather than OSS - if this is based on X-Pack licensing model where you end up paying per node, then this would be a no-go area for us. We have indexes of hundreds of millions of docs, and there is no way we could justify the corresponding costs with our employer.

Thanks for your feedback @fkostadinov - glad to hear you find this useful.
Regarding:

Having ANN as a licensed feature rather than OSS - if this is based on X-Pack licensing model where you end up paying per node, then this would be a no-go area for us. We have indexes of hundreds of millions of docs, and there is no way we could justify the corresponding costs with our employer.

Please note that what we have developed so far in this area is released under our Basic license tier, which means it is free of charge, so there are no licensing costs.

I think there are quite a number of recent academic works on implementing ANN using Elasticsearch, including one of the 2019 Elastic Search Award winners. Are we also planning to try them out?

One thing I have a question about: I understand that the existence of a vector type for matching in Solr's streaming expressions enables rapid prototyping of vector operations, including vector match. Will Elasticsearch have a comparable capability?

@dragon-warrior-nyc do you have links to any source code for these?

@mayya-sharipova
You might find the following papers helpful in terms of their ANN implementations on Elasticsearch as well as their applications.
[1] Rygl, J., Pomikálek, J., Řehůřek, R., Růžička, M., Novotný, V. and Sojka, P., 2017. Semantic Vector Encoding and Similarity Search Using Fulltext Search Engines. In ACL RepL4NLP.
[2] Amato, G., Bolettieri, P., Carrara, F., Falchi, F. and Gennaro, C., 2018. Large-Scale Image Retrieval with Elasticsearch. In _SIGIR._
[3] Mu, C., Zhao, J., Yang, G., Zhang, J. and Yan, Z., 2018. Towards practical visual search engine within elasticsearch. In _SIGIR eCom._

For using Elasticsearch to conduct k-NN in Hamming space specifically, an efficient method called FENSHSES is proposed by the following paper:
[4] Mu, C., Zhao, J., Yang, G., Yang, B. and Yan, Z., 2019. Fast and Exact Nearest Neighbor Search in Hamming Space on Full-Text Search Engines. In _SISAP._

The comparison between FENSHSES and FAISS could be found in the following paper as well:
[5] Mu, C., Yang, B. and Yan, Z., 2019. An Empirical Comparison of FAISS and FENSHSES for Nearest Neighbor Search in Hamming Space. In _SIGIR eCom._

@rjurney I have listed related recent papers above ^. I do not think they published their code, but their methods are quite clearly described in their published conference papers.

For HNSW, I found this piece of work very useful https://github.com/nmslib/hnswlib
Paper: https://arxiv.org/abs/1603.09320
This method is extremely fast, but you have to trade accuracy for speed. The ideal use case is landing-page vector-based recommendations.

For better accuracy, VPTree might be better.
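
For anyone wanting to try it standalone, hnswlib's Python API looks roughly like this (parameter values are arbitrary example settings):

```python
# Minimal hnswlib usage sketch (standalone library; parameter values are arbitrary).
import hnswlib
import numpy as np

dim, n = 128, 10_000
data = np.random.random((n, dim)).astype(np.float32)

index = hnswlib.Index(space="cosine", dim=dim)                # also "l2" and "ip"
index.init_index(max_elements=n, ef_construction=200, M=16)   # build-time speed/recall knobs
index.add_items(data, np.arange(n))

index.set_ef(50)  # query-time knob: higher ef -> better recall, slower queries
labels, distances = index.knn_query(data[:5], k=10)
print(labels.shape)  # (5, 10)
```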

Amazon just integrated KNN search into their "Elasticsearch" service offering. It seems to be based on this project: https://github.com/nmslib/nmslib, maybe worth looking into?
https://aws.amazon.com/about-aws/whats-new/2020/03/build-k-nearest-neighbor-similarity-search-engine-with-amazon-elasticsearch-service/?nc1=h_ls

We have decided to contribute to and adopt Lucene implementations of ann search, so the development of ann search has moved there. Relevant Lucene issues: https://issues.apache.org/jira/browse/LUCENE-9004, https://issues.apache.org/jira/browse/LUCENE-9322, https://issues.apache.org/jira/browse/LUCENE-9136

@mayya-sharipova Is there an estimate on when these implementations will be available?

@dgrahn Thanks for checking it. Since the development of ann search was moved to Open Source Lucene, there will not be any estimates. You can subscribe to watch the Lucene issues I mentioned above, and if there is some progress on them, you will be notified.

@mayya-sharipova great to see this reopened.

There are a heap of application domains where our clients would like to make use of ANN for doc retrieval.

A bit hard to tell, but are things stalled around https://issues.apache.org/jira/browse/LUCENE-9322 ?

@BountyCountry Thanks for sharing your applications.

A bit hard to tell, but are things stalled around https://issues.apache.org/jira/browse/LUCENE-9322 ?

This indeed may progress more slowly; since vectors would be a new format, it would be nice to get approval from the community to make sure we cover a broad enough set of cases.

@mayya-sharipova - thanks Mayya! Yes, we are relatively agnostic to formats, although it would be nice to have both dense and sparse vector queries available.

This repo looks pretty on the money! Have you played with it at all? http://elastiknn.klibisz.com/api/#nearest-neighbor-queries
Perhaps it could be added as a supported plugin while the Lucene route is going slowly?

LMK if you all have questions or would like to contribute to elastiknn. There's also a rough roadmap in the wiki.

This has been open for more than a year. Is this feature dead? Are any resources being put into building this or is it still in the investigative stage? Is this at all important to Elastic? This is a basic feature of modern search and Elastic lacks it.

@mayya-sharipova Is that going to change in the next six months? Year? Two years? Five years? What is the timeline?

+1 @rjurney
With more searches shifting either completely or partially to semantic-based approaches, it's really important to have at least some implementations of ann search. Since elasticsearch is already integrated with our systems, it would be great if this could be done instead of us having to move to something like FAISS or Annoy.

@rjurney @joseph-macraty, and others: I don't mean to hijack the thread, but I have made a lot of progress on an implementation of KNN/ANN for elasticsearch here: https://github.com/alexklibisz/elastiknn
It should be ready for people to start experimenting with; there are installation and API docs here: http://elastiknn.klibisz.com/
Some highlights (copying from the docs):

  • Datatypes to efficiently store floating-point and boolean vectors in Elasticsearch documents.
  • Exact nearest neighbor queries for five similarity functions: L1, L2, Angular, Jaccard, and Hamming.
  • Approximate nearest neighbor queries using Locality Sensitive Hashing and related algorithms for all five similarity functions.
  • Compose nearest neighbor queries with standard Elasticsearch queries.
  • Implemented using standard Elasticsearch and Lucene constructs, so indexing and queries scale horizontally with Elasticsearch.

Right now my main focus is benchmarking, specifically integrating it with ann-benchmarks.
After that I'd like to get it published for several ES versions (right now I'm only on 7.6.2), and add some more user-friendly tutorials.

This is all a side/hobby project for me, so I appreciate any help. A huge help right now would be having people just try it out and provide issues/feedback.

At this point it is crystal clear that this feature will not be solved by Elastic HQ; it will be solved by the community, who will advertise that fact here. Then the winning package will be merged into Elastic.

@alexklibisz we are successfully using your plugin on 7.6.2 with nearest-neighbour queries (angular) against dense vectors. Everything worked out of the box and was snappy, so a huge thank you for your work. @mayya-sharipova - this one needs backing :)

@mayya-sharipova +1 to @BountyCountry's comment.

Adding my few cents here,

The common problem with most vector-store libraries is that they don't support updating/deleting elements in an existing index. Integrating standard vector search libs into Lucene or ES seems a good idea, since we can make use of functionality like the merge policy, the isAlive bitset, and post-processing of results. Although Milvus is built on top of various vector libs, it handles all these fundamental problems on its own.

People who are looking for KNN support on ES can use either elastiknn or opendistro-KNN.

  1. Opendistro-KNN has integrated nmslib via an ES plugin (Java bindings) to support KNN (a rough usage sketch follows after this list). This plugin is also available in AWS. I am pretty sure they chose nmslib merely by looking at its performance on ANN benchmarks. I observed a few things while using Opendistro:

    • Graphs are built when you call iw.commit(), and merged when segments get merged, so if you're going to index ~1M documents it is going to take a lot of time.
    • The larger the dimension, the longer it takes to build the graph.
    • nmslib doesn't read its graph (index files) via mmap, which leads to higher memory consumption.
  2. On the other hand, elastiknn implemented everything from scratch. As @alexklibisz said, we really need some benchmark scores to compare elastiknn with other native KNN libs (faiss, annoy, nmslib, hnswlib, etc.). I haven't explored elastiknn much, so I can't give my feedback on it.

I think both of these are the right place to start unless you're waiting for ES to ship a native ANN feature.
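
For anyone evaluating the Open Distro route, usage looks roughly like the sketch below; the index and field names are made up, and the exact mapping/query syntax should be checked against the Open Distro k-NN documentation.

```python
# Rough sketch of the Open Distro k-NN plugin (index/field names are made up;
# check the Open Distro documentation for the authoritative mapping and query syntax).
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

es.indices.create(
    index="products",
    body={
        "settings": {"index": {"knn": True}},  # enables graph construction per segment
        "mappings": {
            "properties": {"embedding": {"type": "knn_vector", "dimension": 128}}
        },
    },
)

resp = es.search(
    index="products",
    body={
        "size": 10,
        "query": {"knn": {"embedding": {"vector": [0.1] * 128, "k": 10}}},
    },
)
print([hit["_id"] for hit in resp["hits"]["hits"]])
```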

@RamanRajarathinam I have some initial benchmarking results here: http://elastiknn.klibisz.com/performance/. I started working on Elastiknn before opendistro k-nn was released. I haven't measured it exactly, but I wouldn't be surprised if opendistro k-nn is faster than elastiknn, since it's using a C++ library under the hood. However, introducing a subprocess with non-trivial memory usage comes with some deployment tradeoffs, like you mentioned.
