Recently, semantic search has exploded. Everyone is chatting with their data, using the same basic recipe (sketched in code after the list):
1. Take a set of documents and split them into paragraphs.
2. Convert each paragraph into an embedding vector that represents its coordinates in semantic space.
3. Given a question, embed it in the same space. Find a few relevant paragraphs via the semantic similarity between the vectors.
4. Tell GPT to answer the question using the information contained in them, with a prompt like “You are an expert on the hermeneutics of ancient Sumerian tablets. The user just asked you a question about this subject. Answer it using only the information contained in the following paragraphs.”
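Here is a minimal sketch of that recipe, assuming sentence-transformers for the embeddings and OpenAI's current Python client for the answer. The paragraphs, model names, and the top-k value are illustrative placeholders, not anything in particular I recommend:

```python
# Minimal semantic-search recipe: embed paragraphs, retrieve by similarity,
# hand the best matches to GPT. Contents and model names are placeholders.
from openai import OpenAI
from sentence_transformers import SentenceTransformer, util

paragraphs = [
    "Sumerian tablets were written in cuneiform script on wet clay.",
    "The Epic of Gilgamesh survives on twelve tablets found at Nineveh.",
]

embedder = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")
paragraph_vectors = embedder.encode(paragraphs, convert_to_tensor=True)

def answer(question: str, k: int = 3) -> str:
    # Embed the question in the same space and pick the top-k paragraphs.
    query_vector = embedder.encode(question, convert_to_tensor=True)
    hits = util.semantic_search(query_vector, paragraph_vectors, top_k=k)[0]
    context = "\n\n".join(paragraphs[hit["corpus_id"]] for hit in hits)

    system_prompt = (
        "You are an expert on the hermeneutics of ancient Sumerian tablets. "
        "The user just asked you a question about this subject. Answer it "
        "using only the information contained in the following paragraphs:\n\n"
        + context
    )
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content

print(answer("What was the Epic of Gilgamesh written on?"))
```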
There are two ingredients necessary for this recipe: the language model that will answer the question, and the embeddings model that will pick the source material from the knowledge base. Right now the options for the LLM are very limited. As I write this, OpenAI is clearly in the lead, and there is little reason to use anything besides GPT-4 or GPT-3.5. There are many more options for embeddings, though. Unlike the GPT models, OpenAI’s embeddings are not clearly superior: if you look at benchmarks such as this one, you will find models that score higher than ada-002. In particular, the Instructor models (xl and large) do very well. Of course, benchmarks don’t mean much in isolation. What matters to you is the right compromise between variables such as cost, performance, and speed.
As an example, I decided to download all my tweets (about 20k) and build a semantic searcher on top. For my first prototype I used mpnet-v2 from sentence-transformers, a relatively small model (438 MB) that should run on any CPU or GPU. It worked fine as long as I used relatively common words the model had seen, but it didn’t do so well for my tweets in other languages (mostly Spanish). The next step was to try the Instructor models. They are larger, but I have an 8 GB GPU on my machine that can load instructor-xl into memory. I tried both the large and xl models, and my subjective impression was that xl was indeed more accurate.
Here’s a snippet of code, if you want to try it with your own tweets. It uses Chroma to store the embeddings.
Could I do better with OpenAI?
Before running the experiment, it’s worth being aware of the costs involved in using OpenAI’s embeddings. For one, you can’t download the model and run it offline: you depend on having internet connectivity, and on the reliability of OpenAI’s APIs.
Secondly, you have to trust OpenAI to keep the model around. What if you embed millions of documents and then, one day, ada-002 is discontinued? What if your usage explodes, and you find yourself embedding millions of queries per day at whatever price OpenAI happens to charge for the API?
So I decided to run the same queries against both sets of embeddings, presenting the results in random order so I couldn’t tell which model produced which. It turns out I could not tell the difference. So here’s the procedure I recommend:
Try the lightest embedding model first
If it doesn’t work, try a beefier model and do a blind comparison
If you are already using a relatively large model like Instructor XL, only then try a blind test against ada-002 from OpenAI. If you really find that OpenAI is better for your application, then go for it.
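A sketch of that blind test, assuming you have already built one Chroma collection per model as in the earlier snippet; the collection names and test queries here are made up:

```python
# Blind A/B test between two embedding models over the same tweets.
# Each query runs against both collections; the two result lists are
# printed in random order as A and B, and your picks are tallied.
import os
import random

import chromadb
from chromadb.utils import embedding_functions

client = chromadb.PersistentClient(path="tweet_index")

# Query each collection with the embedding function it was built with,
# so the query lands in the matching vector space.
collections = {
    "mpnet": client.get_collection(
        "tweets_mpnet",
        embedding_function=embedding_functions.SentenceTransformerEmbeddingFunction(
            model_name="sentence-transformers/all-mpnet-base-v2"
        ),
    ),
    "ada-002": client.get_collection(
        "tweets_ada",
        embedding_function=embedding_functions.OpenAIEmbeddingFunction(
            api_key=os.environ["OPENAI_API_KEY"],
            model_name="text-embedding-ada-002",
        ),
    ),
}

votes = dict.fromkeys(collections, 0)
queries = ["tweets about embeddings", "tuits sobre paella"]  # your own test set

for query in queries:
    order = random.sample(list(collections), k=2)  # hide which model is which
    for label, name in zip("AB", order):
        results = collections[name].query(query_texts=[query], n_results=3)
        print(f"--- list {label} ---")
        for doc in results["documents"][0]:
            print(doc)
    choice = input("Which list is better, A or B? ").strip().upper()
    votes[order["AB".index(choice)]] += 1

print(votes)
```

If the votes come out roughly even, you can’t tell the difference, and the cheaper (or self-hosted) model wins by default.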
If you do go with OpenAI, one word of advice: make sure you don’t spend $50M embedding the whole internet, become successful, and then depend on OpenAI’s API to run your search engine!
Most embedding models are "English only"; not many are labeled as multilingual. It's a really annoying problem how much (free/academic) NLP work is focused on English.
At least on Hugging Face, the models you mention are also labeled "English only". The training datasets used for these embedding models might contain other languages, but it's not clear how well the semantics of the various languages are aligned, or whether non-English text has been filtered out entirely.
ada-002 is very good with multiple languages and also very good with long paragraphs (its relatively high dimensionality helps here). Your "Probably not" may hold for English-only and simple stuff; I'm not so sure about more complex searches. If your embeddings don't work really well, you need more prompt space for the final question answering with GPT. That can turn into the opposite of "saving money" very fast...
This is very true. I have experimented with both OpenAI and Instructor XL embeddings to query questions on medical records, and the responses based on Instructor embeddings are much longer and more detailed. Instructor seems to create embeddings that are also task-specific, which could account for its outperformance. Thanks for sharing!