Should you use OpenAI's embeddings? Probably not, and here's why.
Recently, semantic search has exploded. Everyone is chatting with their data, using the same basic recipe:
Take a set of documents and split them into paragraphs.
Convert each paragraph into an embedding: a vector that represents its coordinates in the semantic space.
Given a question, embed it in the same space. Find a few relevant paragraphs via semantic similarity between the vectors.
Tell GPT to answer the question using the information contained in them, with a prompt like “You are an expert on the hermeneutics of ancient Sumerian tablets. The user just asked you a question about this subject. Answer it using only the information contained in the following paragraphs.”
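The recipe above can be sketched in a few lines. This is only a minimal illustration, not a real semantic searcher: `embed` here is a toy bag-of-words stand-in (an assumption made so the example runs without downloading anything), and the names `cosine`, `top_k` and `build_prompt` are hypothetical. In practice you would swap `embed` for a call to your embedding model of choice and keep the rest.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a sparse word-count vector.

    A real embedding model returns a dense vector that captures meaning;
    this one only captures word overlap.
    """
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(question, paragraphs, k=2):
    """Return the k paragraphs most similar to the question."""
    q = embed(question)
    ranked = sorted(paragraphs, key=lambda p: cosine(q, embed(p)), reverse=True)
    return ranked[:k]


def build_prompt(question, paragraphs, k=2):
    """Assemble the prompt that would be sent to the language model."""
    context = "\n\n".join(top_k(question, paragraphs, k))
    return (
        "Answer the question using only the information contained in the "
        "following paragraphs.\n\n" + context + "\n\nQuestion: " + question
    )
```

With a real model you would embed the paragraphs once and store the vectors, rather than re-embedding them on every query as this sketch does.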
There are two ingredients necessary for this recipe: the language model that will answer the question, and the embedding model that will pick the source material from the knowledge base. Right now, the options for the LLM are very limited. As I write this, OpenAI is clearly in the lead, and there is little reason to use anything besides GPT-4 or GPT-3.5. However, there are many more options for embeddings. Unlike the GPT models, OpenAI’s embeddings are not clearly superior. If you look at benchmarks such as this one, you will find models that score higher than ada-002. In particular, the Instructor models (xl and large) do very well. Of course, benchmarks don’t mean much in isolation. What matters to you is the right compromise between variables such as cost, performance, and speed.
As an example, I decided to download all my tweets (about 20k) and build a semantic searcher on top. For my first prototype I used mpnet-v2 from sentence-transformers, a relatively small model (438 MB) that should run on any CPU or GPU. It worked fine as long as I used relatively common words that the model had seen, but it didn’t do so well for my tweets in other languages (Spanish, mostly). The next step was to try the Instructor models. They are larger, but I have an 8 GB GPU on my machine that can load instructor-xl into memory. I tried both the large and xl models, and my subjective impression was that xl was indeed more accurate.
Could I do better with OpenAI?
Before running the experiment, it’s worth being aware of the costs involved in using OpenAI’s embeddings. For one, you can’t download the model and run it offline: you depend on internet connectivity, and on the reliability of the OpenAI APIs.
Secondly, you have to trust OpenAI to keep the model around in the future. What if you embedded millions of documents and then one day ada-002 is discontinued? What if your usage explodes, and you find yourself embedding millions of queries per day at whatever cost OpenAI currently charges for the API?
So I decided to run my queries against both sets of embeddings in random order, without knowing which model had produced which results, to see if I could tell the difference. Turns out, I could not. So here’s the procedure I recommend:
Try the lightest embedding model first
If it doesn’t work, try a beefier model and do a blind comparison
If you are already using a relatively large model like Instructor XL, only then try a blind test against ada-002 from OpenAI. If you really find that OpenAI is better for your application, then go for it.
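A blind comparison like the one in this procedure could be sketched as follows. This is only one way to set it up, not the exact script I used: `blind_trial`, `blind_comparison`, `search_a`, `search_b` and `judge` are all hypothetical names. Each searcher is assumed to be a function that takes a query and returns a list of results, and the judge is whatever picks the better list (you, at the keyboard) without seeing which model produced it.

```python
import random


def blind_trial(query, search_a, search_b, judge, rng=random):
    """Run one query through both searchers and ask the judge to pick.

    The two result lists are shown in random order, so the judge cannot
    tell which model produced which. Returns "a", "b" or "tie".
    """
    results = [("a", search_a(query)), ("b", search_b(query))]
    rng.shuffle(results)
    # The judge sees only the query and the two unlabeled result lists,
    # and returns 0 (first is better), 1 (second is better) or None (tie).
    choice = judge(query, results[0][1], results[1][1])
    return "tie" if choice is None else results[choice][0]


def blind_comparison(queries, search_a, search_b, judge, seed=0):
    """Tally the judge's preferences over a list of queries."""
    rng = random.Random(seed)
    tally = {"a": 0, "b": 0, "tie": 0}
    for query in queries:
        tally[blind_trial(query, search_a, search_b, judge, rng)] += 1
    return tally
```

For an interactive test, `judge` could print both result lists and read your answer with `input()`. If the tally ends up close to even, or mostly ties, you cannot tell the models apart and the lighter one wins by default.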
If you do go with OpenAI, one word of advice: make sure you don’t spend $50M embedding the whole internet, become successful and then depend on OpenAI’s API to run your search engine!