13 Comments

Most embedding models are "English only" - not many are labeled as multilingual. It's a really annoying problem how much (free/academic) NLP work is focused on English.

At least on Hugging Face, the models you mention are also labeled as "English only". The training datasets used for these embedding models might also contain other languages, but it's not clear how well the semantics of the various languages are aligned, or whether non-English text has been filtered out entirely.

ada-002 is very good with multiple languages and also very good with long paragraphs (its relatively high dimensionality helps here). Your "Probably not" may hold for English-only and simple use cases; I'm not so sure about more complex searches. If your embeddings don't work really well, you need more prompt budget for the final question answering with GPT. That can turn into the opposite of "saving money" very fast...


This is very true. I have experimented with both OpenAI and Instructor Embeddings XL to query questions on medical records. The Instructor embeddings produce much longer responses with more details. Instructor embeddings also seem to be task-specific, which could account for their outperformance. Thanks for sharing!


Great article! One can't overlook the importance of open-source solutions in this context. Open-source models often offer better performance, more flexibility with multi-language support, and, most importantly, they eliminate the worry of vendor lock-in!

If vendor lock-in is really a concern, one should consider platforms like embaas.io. They provide open-source embeddings as a service, giving users more control over their solutions, even though they don't help with the reliability issues.


I'm trying to use local iOS and macOS capabilities to generate embeddings and later run vector search over them. Do you think it'll be good enough? I don't know how to measure the quality difference.

https://developer.apple.com/documentation/naturallanguage/nlembedding
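One way to measure the quality difference between two embedding backends is to score each on a small labeled retrieval set: embed your queries and documents with each backend, rank documents by similarity, and compare recall@k. A minimal sketch of the scoring half (the toy vectors stand in for real embeddings and are invented for illustration):

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def recall_at_k(query_vec, doc_vecs, relevant_ids, k):
    """Fraction of relevant documents appearing in the top-k by cosine similarity."""
    ranked = sorted(range(len(doc_vecs)),
                    key=lambda i: cosine(query_vec, doc_vecs[i]),
                    reverse=True)
    hits = sum(1 for i in ranked[:k] if i in relevant_ids)
    return hits / len(relevant_ids)

# Toy example: document 0 is relevant and closest to the query.
query = [1.0, 0.0]
docs = [[0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
print(recall_at_k(query, docs, relevant_ids={0}, k=1))  # 1.0
```

Running the same labeled set through NLEmbedding and an alternative model, then comparing the averaged recall@k, gives a concrete number rather than a gut feeling.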


Although I am not new to coding, I am new to LLMs. Thank you for your post. I have one question:

Are embeddings from one model guaranteed to be compatible with a different LLM?


I found that OpenAI ada is not good at handling "not" in sentence-level semantic comparison.

E.g., when comparing the following two short sentence pairs:

1) “ugly” vs “not beautiful”

2) “ugly” vs “not ugly”

OpenAI's model says the sentences in 2) are more similar to each other than those in 1).

Do any embedding models work better in this situation? Thanks in advance.
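A quick way to check this yourself is to embed each phrase and compare cosine similarities. A minimal sketch, using made-up 3-d vectors in place of real API output, illustrating the failure mode described above (if negation barely shifts the vector, "ugly" and "not ugly" stay closer together than "ugly" and "not beautiful"):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical vectors: negation shifts the embedding only slightly.
ugly          = [0.9, 0.1, 0.0]
not_ugly      = [0.8, 0.2, 0.1]
not_beautiful = [0.3, 0.7, 0.2]

print(cosine_similarity(ugly, not_ugly)
      > cosine_similarity(ugly, not_beautiful))  # True
```

Swapping in real vectors from any embedding API makes this an easy sanity test for negation handling before committing to a model.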

Mar 30, 2023·edited Mar 30, 2023

Check out the embeddings from the alternative https://text-generator.io: they download link/image content and analyse it with neural networks/OCR, so they are multimodal.

There's still a similar trust issue: if you change embedding providers you'll have to re-embed everything. But that's equally true if you use your own model and later decide to change it. They are also much cheaper than OpenAI.
