Discussion about this post

André Pankraz:

Most embedding models are English-only; very few are labeled as multilingual. It's a real annoyance how much free/academic NLP work is focused on English.

At least on Hugging Face, the models you mention are also labeled English-only. The training datasets used for these embedding models might contain other languages, but it's not clear how well the semantics of different languages are aligned, or whether non-English text has been filtered out entirely.
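For what it's worth, those labels can be checked programmatically. A minimal sketch, assuming the huggingface_hub package; the model IDs below are illustrative, not necessarily the ones from the post:

```python
# Sketch: inspect the language tags on a model's Hub card.
# Assumes the huggingface_hub package; model IDs are illustrative.
from huggingface_hub import model_info

for repo_id in ["sentence-transformers/all-MiniLM-L6-v2", "hkunlp/instructor-xl"]:
    info = model_info(repo_id)
    # Language tags are ISO codes like "en"; a two-letter filter is a rough heuristic.
    langs = [t for t in info.tags if len(t) == 2]
    print(repo_id, "->", langs or "no language tags")
```

Of course, a missing tag only tells you the card is sparse, not what's in the training data.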

ada-002 is very good with multiple languages and also handles long paragraphs well (its relatively high dimensionality helps here). Your "Probably not" may hold for English-only and simple use cases; I'm not so sure about more complex searches. If your embeddings don't retrieve well, you need a larger prompt for the final question answering with GPT. That can quickly become the opposite of "saving money"...
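A quick sanity check before trusting any model cross-lingually is to embed a translation pair and see whether the vectors land close together. A rough sketch, assuming the openai Python package (v1 client) and an API key in the environment; the sentence pair is made up:

```python
# Sketch: embed an EN/DE translation pair with ada-002 and compare vectors.
# Assumes the openai package (v1 client) and OPENAI_API_KEY in the environment.
import math
from openai import OpenAI

client = OpenAI()

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

resp = client.embeddings.create(
    model="text-embedding-ada-002",
    input=["The patient complained of chest pain.",
           "Der Patient klagte über Brustschmerzen."],
)
en, de = (d.embedding for d in resp.data)
print(f"EN/DE cosine similarity: {cosine(en, de):.3f}")
```

If the pair scores barely above unrelated sentences, the model's language spaces aren't aligned, and no amount of prompt stuffing downstream will fix the retrieval.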

Layla:

This is very true. I have experimented with both OpenAI and Instructor Embeddings XL for querying questions on medical records. The Instructor embeddings retrieve much longer responses with more detail. Instructor embeddings also seem to be task-specific, which could account for their outperformance. Thanks for sharing!
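For readers who haven't used them: with Instructor models, each text is paired with a natural-language instruction describing the task, which is where the task-specificity comes from. A minimal sketch, assuming the InstructorEmbedding package; the instruction strings and texts here are made up for illustration:

```python
# Sketch: Instructor embeddings condition each text on a task instruction.
# Assumes the InstructorEmbedding package and the hkunlp/instructor-xl checkpoint.
from InstructorEmbedding import INSTRUCTOR

model = INSTRUCTOR("hkunlp/instructor-xl")

# The same record gets a different vector depending on the instruction,
# so the index can be tailored to the retrieval task (here: medical QA).
pairs = [
    ["Represent the Medicine document for retrieval:",
     "Patient presents with acute chest pain and shortness of breath."],
    ["Represent the Medicine question for retrieving supporting documents:",
     "What symptoms did the patient present with?"],
]
embeddings = model.encode(pairs)
print(embeddings.shape)  # one vector per (instruction, text) pair
```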

