Most embedding models are "English only" - not many are labeled as multilingual. It's really annoying how much (free/academic) NLP work is focused on English.
At least on Hugging Face, the models you mention are also labeled as "English only". The training datasets used for these embedding models might contain other languages too, but it's not clear how well the semantics of the various languages are aligned, or whether non-English text has been filtered out entirely.
ada-002 is very good with multiple languages and also very good with long paragraphs (its relatively high dimensionality helps here). Your "Probably not" may hold for English-only and simple use cases, but I'm not so sure about more complex searches. If your embeddings don't work really well, you need a larger prompt for the final question answering with GPT. That can quickly become the opposite of "saving money"...
This is true. I've had good results with instructor-xl and Spanish, but I haven't tried other languages. Like the post says, it's worth trying these first, and only committing to ada-002 once you know they don't work for your use case.
This is very true. I have experimented with both OpenAI and Instructor Embeddings XL for querying questions on medical records. The instructor embeddings produced much longer responses with more detail. Instructor seems to create embeddings that are also task-specific, which could account for its outperformance. Thanks for sharing!
Great article! One can't overlook the importance of open-source solutions in this context. Open-source models often offer better performance, more flexibility with multi-language support, and, most importantly, they eliminate the worry of vendor lock-in!
If vendor lock-in is really a concern, one should consider platforms like embaas.io. They provide open-source embeddings as a service, giving users more control over their solutions, although as a hosted service they don't help with the reliability issues.
It's twice as expensive as Ada!
I'm trying to use local iOS and macOS capabilities to generate embeddings and later do vector search on it. Do you think it'll be good enough? I don't know how to measure the quality difference.
https://developer.apple.com/documentation/naturallanguage/nlembedding
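One way to measure the quality difference is to build a small labeled set of (query, relevant document) pairs from your own data and compare retrieval metrics like recall@k across backends. Here is a minimal sketch of that harness; the `toy_embed` character-frequency encoder is only a placeholder so the example runs, and you would swap in the vectors from NLEmbedding and from whatever other model you're comparing against:

```python
# Sketch: compare embedding backends on your own data with recall@k.
# `toy_embed` is a placeholder encoder -- substitute real model outputs
# (e.g. exported NLEmbedding vectors vs. an open-source model's vectors).
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def recall_at_k(embed, queries, docs, relevant, k=1):
    """Fraction of queries whose relevant doc appears in the top-k results."""
    doc_vecs = [embed(d) for d in docs]
    hits = 0
    for q, rel_idx in zip(queries, relevant):
        qv = embed(q)
        ranked = sorted(range(len(docs)), key=lambda i: -cosine(qv, doc_vecs[i]))
        if rel_idx in ranked[:k]:
            hits += 1
    return hits / len(queries)

# Toy stand-in so the sketch runs end to end: character-frequency vector.
def toy_embed(text):
    v = np.zeros(26)
    for ch in text.lower():
        if ch.isalpha():
            v[ord(ch) - ord('a')] += 1
    return v

docs = ["the cat sat on the mat", "stock prices fell sharply", "how to bake bread"]
queries = ["a cat on a mat", "baking bread at home"]
relevant = [0, 2]  # index of the correct doc for each query
print(recall_at_k(toy_embed, queries, docs, relevant, k=1))
```

Run both embedders through the same `recall_at_k` on the same labeled pairs and the higher score wins for your data, which sidesteps guessing from benchmark leaderboards.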
Although I am not new to coding, I am new to LLMs. Thank you for your post. I have one question:
Are embeddings from one model guaranteed to be compatible with a different LLM?
No, you should expect them to be completely incompatible. The embedding space of a given model is completely arbitrary and has no reason to resemble that of another, even if they both had the same number of dimensions.
I found that OpenAI ada is not good at handling "not" in sentence semantic comparison.
E.g., when comparing the following two short sentence pairs:
1) "ugly" vs "not beautiful"
2) "ugly" vs "not ugly"
OpenAI says the sentences in 2) are more similar to each other than those in 1).
Do any embedding models work better in this situation? Thanks in advance.
Hey Tao! This is literally what I have been experimenting with over the last few days. I found USE significantly better than OpenAI Embeddings. This is also the reason why I am on this page :)
Thanks for writing this Diego! Clean and helpful
What does USE stand for?
https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/46808.pdf
Universal Sentence Encoder by Google, my bad
Check out the embeddings from the alternative https://text-generator.io - they download linked/image content and analyse it with neural networks/OCR, so they are multimodal.
There's still a similar trust issue: if you change embedding providers you'll have to re-embed everything. But that's the same if you use your own model and later decide to change it, and they are much cheaper than OpenAI too.