Everyone I talk to who is building some vector db based thing sooner or later realizes they also care about the features of a full-text search engine.
They care about filtering, they care to some degree about direct lexical matches, they care about paging, getting groups / facet counts, etc.
Vectors, IMO, are just one feature that a regular search engine should have. Currently Vespa does the best job of this, though lately it seems the Lucene-based engines (Elasticsearch and OpenSearch) are working hard to compete.
My company is using vector search with Elasticsearch. It’s working well so far. IMO Elastic will eat most vector-first/only products because of its strength at full-text search, plus all the other stuff it does.
I tend to agree - search, and particularly search-for-humans, is really a team sport - meaning, very rarely do you have a single search algo operating in isolation. You have multiple passes, you filter results through business logic.
Having said that, I think pgvector has a chance for less scale-intense needs - embedding as a column in your existing DB and a join away from your other models is where you want search.
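To make the "embedding as a column, a join away from your other models" point concrete, here's a minimal sketch of the pattern in plain Python with sqlite and brute-force cosine similarity (pgvector does the same thing inside Postgres, with real index support; the tables and vectors here are made up for illustration):

```python
import math
import sqlite3

# Embeddings live in a column next to ordinary rows, so semantic
# search is just a scored query away from a normal relational join.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT, category TEXT)")
db.execute("CREATE TABLE product_embeddings (product_id INTEGER, embedding TEXT)")
db.executemany("INSERT INTO products VALUES (?, ?, ?)", [
    (1, "red running shoes", "footwear"),
    (2, "blue hiking boots", "footwear"),
    (3, "canvas tote bag", "bags"),
])
db.executemany("INSERT INTO product_embeddings VALUES (?, ?)", [
    (1, "0.9,0.1,0.0"),
    (2, "0.8,0.2,0.1"),
    (3, "0.0,0.1,0.9"),
])

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def search(query_vec, category):
    # The join with the rest of the schema is where this pattern shines:
    # structured filters and vector similarity in one query path.
    rows = db.execute(
        "SELECT p.name, e.embedding FROM products p "
        "JOIN product_embeddings e ON e.product_id = p.id "
        "WHERE p.category = ?", (category,))
    scored = [(cosine(query_vec, [float(v) for v in emb.split(",")]), name)
              for name, emb in rows]
    return [name for _, name in sorted(scored, reverse=True)]

results = search([1.0, 0.0, 0.0], "footwear")
```

In Postgres with pgvector the brute-force loop becomes an `ORDER BY embedding <-> :query` with an index behind it, but the shape of the query is the same.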
I don’t get why you’d want to bolt RBAC onto these new vector dbs, unless it’s because they’ve caused this problem in the first place…
They have beef with ES since Amazon took the software, made a bunch of cash on it, and never contributed back. Elastic called them out and it started a feud.
I'd go on ES over Amazon-built software any day. I worked on RDS and I've used RDS at several companies, it's a mess.
Longer story:
One day one of our tables went missing on Aurora, and we couldn't figure out why; it was still in the schema, etc. Devops panicked and restarted the instance, and then another table was missing. We ended up creating 10 empty tables and restarting it until the missing table was one of those.
We contacted RDS support after that, and the conclusion of their 3 month investigation is: "Yeah, it's not supposed to do that."
There are some really smart people working at Amazon; unfortunately the incentive is to push new stuff out and get promoted ASAP. If you can do that better than others, and before your house of cards falls, you're safe. If the house of cards crumbles after you're gone, it's someone else's problem.
>Longer story: One day one of our table went missing on Aurora, we couldn't figure out why, it was in the schema, etc. Devops panicked and restarted the instance, and then another table was missing. We ended up creating 10 empty tables and restarted it until it hit one of those.
Are there any reports of this? How come this is the first time I've heard of it? How can companies trust these kinds of managed DB services?
We worked with dedicated support on this, but I don't think they had enough knowledge to dig deep into it and just gave up. There is a huge backlog of critical issues at most AWS services. It looks great from the outside in, but the sausage making process is extremely messy.
Amazon forked ElasticSearch into OpenSearch. When deciding which platform to go with (we are an AWS customer) I decided to stick with the company whose future depends on their search product (Elastic), not the one that could lose interest and walk away and suffer almost no consequences (AWS). If OpenSearch is still around in 5 years, and keeping pace with ElasticSearch, then maybe I'd consider it the next time I'm making this choice.
Also there's a lot more to ElasticSearch than full-text search (aggregations, lifecycle management, Kibana). Doesn't seem like Kendra is going to be a replacement for our use case.
Until very recently, “dense retrieval” was not even as good as bm25, and still is not always better.
I think a lot of people use dense retrieval in applications where sparse retrieval is still adequate and much more flexible, because it has the hype behind it. Hybrid approaches also exist and can help balance the strengths and weaknesses of each.
Vectors can also work in other tasks, but largely people seem to be using them for retrieval only, rather than applying them to multiple tasks.
A lot of these things are use-case dependent. Even the characteristics of BM25 vary a lot depending on whether the query is over- or under-specified, the nature of the query, and so on.
I don't think there will ever be an answer to what is the best way of doing information retrieval, for a search-engine-scale corpus of documents, that is superior for every type of query.
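For anyone who hasn't looked at what BM25 actually computes, here is a toy Okapi BM25 scorer in plain Python (the corpus is made up, and `k1`/`b` are the conventional defaults); the length-normalization term in the denominator is one of the query-dependent characteristics mentioned above:

```python
import math
from collections import Counter

K1, B = 1.5, 0.75  # standard Okapi BM25 parameters

docs = [
    "the quick brown fox".split(),
    "the lazy dog sleeps".split(),
    "quick quick fox jumps over the lazy dog".split(),
]
N = len(docs)
avgdl = sum(len(d) for d in docs) / N
df = Counter(term for d in docs for term in set(d))

def idf(term):
    # Smoothed inverse document frequency: rarer terms score higher.
    return math.log(1 + (N - df[term] + 0.5) / (df[term] + 0.5))

def bm25(query, doc):
    tf = Counter(doc)
    score = 0.0
    for term in query:
        if term not in tf:
            continue
        num = tf[term] * (K1 + 1)
        # Term frequency saturates, and longer docs are penalized via B.
        den = tf[term] + K1 * (1 - B + B * len(doc) / avgdl)
        score += idf(term) * num / den
    return score

scores = [bm25(["quick", "fox"], d) for d in docs]
best = max(range(N), key=lambda i: scores[i])
```

Note that the short doc 0 beats the long doc 2 here even though doc 2 repeats "quick": term-frequency saturation plus length normalization at work.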
more commonly you use approximate KNN vector search with LLM-based embeddings, which can find many fitting documents that bm25 and similar would never manage to
the tricky part is to properly combine the results
Vector search is not exclusively in the domain of text search. There is always image/video search.
But pre-filtering is important, since you want to reduce the set of items to be matched on, and it feels like Elasticsearch/OpenSearch are faring better in this regard. Mixed scoring derived from both sparse and dense calculations is also important, which is another strength of ES/OS.
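The pre-filtering idea is just: apply the structured predicate first, then run the vector match only over the survivors (as opposed to post-filtering, where a top-k vector search can return fewer than k results after filtering). A brute-force sketch with made-up documents:

```python
import math

# Pre-filtering: restrict the candidate set with a structured predicate,
# then run the (here brute-force) vector match over the survivors only.
docs = {
    "a": {"lang": "en", "vec": [1.0, 0.0]},
    "b": {"lang": "de", "vec": [0.99, 0.01]},  # closest vector, wrong language
    "c": {"lang": "en", "vec": [0.0, 1.0]},
}

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    return dot / (nu * nv)

def filtered_knn(query, predicate, k=2):
    candidates = {d: m for d, m in docs.items() if predicate(m)}
    ranked = sorted(candidates,
                    key=lambda d: cosine(query, candidates[d]["vec"]),
                    reverse=True)
    return ranked[:k]

hits = filtered_knn([1.0, 0.0], lambda m: m["lang"] == "en")
```

Doc "b" would have been the nearest neighbor, but the filter removes it before matching; production engines do the same thing against an ANN index rather than a brute-force scan.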
much more mature and feature-rich than much of the competition listed in the article
to some degree it's more a platform you can use to efficiently and flexibly build your own more complicated search system, which is both a benefit and a drawback
some good parts:
- very flexible text search (bm25), more so than Elasticsearch (or at least easier to use / better documented when it comes to advanced features)
- fast, flexible-enough vector search, with good filtering capabilities
- built-in support for defining more complicated search pipelines, including multi-phase search (also known as reranking)
- a quite nice approach for fine-grained control over what kinds of indices are built for which fields
- safety checks when doing schema changes, to make sure you don't accidentally break anything, which you can override if you are sure that's what you want
- a ton of control in a cluster over where which search-system resources get allocated (e.g. which schemas get stored on which storage clusters, which cluster nodes should act as storage nodes, which should e.g. only do pre- or post-processing steps in a search pipeline, and which should e.g. be used for calculating embeddings using some LLM or similar). Not something you need for demos, but definitely something you need once your customers have enough data.
- child documents, and document references
- multiple vectors per document
- quite an interesting set of data types for fields, and related ways you can use them in a search pipeline
- a flexible, reasonably easy-to-use system for plugins/extensions (though Java only)
- support for building search pipelines which have sub-searches in external, potentially non-Vespa systems
- really well documented
Though the main benefit *and drawback* is that it's not just a vector database, but a full-fledged search system platform.
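The multi-phase search point above is worth making concrete. The pattern (which Vespa's ranking phases make first-class) is: score everything with a cheap first-phase function, then rerank only the top candidates with an expensive one. A toy sketch with made-up scores:

```python
# Two-phase ranking: cheap score over all docs, expensive score over
# only the first-phase survivors.
def first_phase(doc):
    return doc["bm25"]  # cheap lexical score, computed for every match

def second_phase(doc):
    # expensive hybrid score, computed only for the top candidates
    return 0.3 * doc["bm25"] + 0.7 * doc["dense"]

docs = [
    {"id": "d1", "bm25": 9.0, "dense": 0.20},
    {"id": "d2", "bm25": 7.0, "dense": 0.90},
    {"id": "d3", "bm25": 1.0, "dense": 0.95},  # never reaches phase 2
    {"id": "d4", "bm25": 6.0, "dense": 0.80},
]

# Phase 1: keep the top 3 by the cheap score.
candidates = sorted(docs, key=first_phase, reverse=True)[:3]
# Phase 2: rerank just those survivors.
reranked = sorted(candidates, key=second_phase, reverse=True)
top = [d["id"] for d in reranked]
```

The trade-off is visible in the toy data: d3 has the best dense score but is pruned by the cheap phase, which is exactly why sizing the first-phase cutoff matters.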
generally if you have multiple embeddings for the same document you have two choices:
- create one document for each embedding and make sure non-embedding-specific attributes are the same across all of these document clones -- vespa makes this more convenient by having child documents
- have a field with multiple vectors, i.e. there are multiple vectors in the HNSW index which point to the same document -- vespa supports this, too. It's what I meant.
vespa is currently the only vector-search-enabled search system which supports both in a convenient way, but then there are so many "vector databases" popping up every month that I might have missed some
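The second option above has one subtlety worth spelling out: when several vectors point at the same document, the query side has to deduplicate, typically by keeping each document's best-scoring vector. A brute-force sketch (field names and vectors are made up; a real engine does this against an ANN index):

```python
# Multiple vectors per document: score every vector, keep only the
# best hit per document id, then rank documents by that best score.
vectors = [
    ("doc1", [1.0, 0.0]),  # e.g. title embedding
    ("doc1", [0.7, 0.7]),  # e.g. body embedding
    ("doc2", [0.0, 1.0]),
]

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def search(query, k=2):
    best = {}
    for doc_id, vec in vectors:
        score = dot(query, vec)
        if score > best.get(doc_id, float("-inf")):
            best[doc_id] = score  # max-aggregate per document
    return sorted(best, key=best.get, reverse=True)[:k]

hits = search([0.0, 1.0])
```

Without the dedup step, doc1 could occupy several result slots with its different embeddings, which is rarely what you want in the final ranking.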
Check out FeatureBase, when you get a chance. Vectors and super fast operations on sets. I'm using it for managing keyterms extracted from the text and stored along with the vectors.