Follow the full discussion on Reddit.
I’m working on comparing different vector databases for similarity search. I’ve looked at several large-scale datasets such as BIGANN-1B, but one issue I have with them is that the embeddings are compressed to a smaller dimension, such as 128 in the case of BIGANN. For my use case I want to use gte-small as the embedding model, which produces 384-dimensional embeddings, so benchmarks on those datasets would not be representative.

Are there any datasets that contain both the ground-truth labels and the raw data? For example, a text dataset should include the raw text so that I can convert it to embeddings using my own embedding model.
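To make the requirement concrete, here is a minimal sketch of what "raw text plus ground truth" would allow: embed the raw text, compute exact nearest neighbours by brute force, and score an approximate index against them with recall@k. Everything in it is an assumption for illustration. `placeholder_embed` is a hypothetical stand-in (a token-hashing trick, not gte-small) so the example runs with only the standard library, and the tiny `corpus` is made up; a real run would swap in the actual model and dataset.

```python
import hashlib
import math

DIM = 384  # embedding size of gte-small, as stated in the post


def placeholder_embed(text, dim=DIM):
    """Hypothetical stand-in for a real embedding model.

    Hashes each whitespace token into one of `dim` buckets and
    L2-normalises the result. It is NOT gte-small; replace it with
    the real model when benchmarking.
    """
    vec = [0.0] * dim
    for token in text.lower().split():
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        vec[int.from_bytes(digest[:4], "big") % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def exact_top_k(query_vec, corpus_vecs, k):
    """Brute-force ground truth: indices of the k highest dot products.

    With unit-length vectors the dot product equals cosine similarity.
    Ties are broken by the lower index so results are deterministic.
    """
    scored = [
        (sum(q * c for q, c in zip(query_vec, vec)), idx)
        for idx, vec in enumerate(corpus_vecs)
    ]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for _, idx in scored[:k]]


def recall_at_k(approx_ids, exact_ids):
    """Fraction of the exact neighbours that the approximate result found."""
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)


if __name__ == "__main__":
    # Made-up raw text standing in for a real dataset.
    corpus = [
        "vector database benchmark",
        "cooking pasta at home",
        "similarity search over embeddings",
    ]
    corpus_vecs = [placeholder_embed(doc) for doc in corpus]
    query_vec = placeholder_embed("benchmark for a vector database")
    ground_truth = exact_top_k(query_vec, corpus_vecs, k=2)
    print("exact neighbours:", ground_truth)
```

Brute force costs one pass over the whole corpus for every query, so at the scale of BIGANN-1B it would need to be batched or run on a sample. The ids returned by each vector database would be passed to `recall_at_k` alongside the exact ids.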