Large-Scale Similarity Search Datasets with Raw Data

Follow the full discussion on Reddit.
I’m working on comparing different vector databases for similarity search. I’ve looked at several large-scale datasets, such as BIGANN-1B, but one issue I have with them is that the embeddings are provided at a smaller dimension (128 in the case of BIGANN). For my use case I want to use gte-small as the embedding model, which produces 384-dimensional embeddings, so benchmarks on those datasets would not be representative. Are there any datasets that contain both the ground-truth labels and the raw data? For example, a text dataset should contain the raw text so that I can convert it to embeddings using my own embedding model.
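To make the requirement concrete: if a dataset ships raw text but its neighbor labels were computed with a different model, ground truth for a custom model can be recomputed by exact brute-force search over the new embeddings. The following is only a minimal sketch under stated assumptions. The random arrays stand in for gte-small embeddings of the raw text, cosine similarity is assumed as the metric, and the helper names `exact_top_k` and `recall_at_k` are hypothetical, not part of any library.

```python
import numpy as np


def normalize(vectors):
    """Scale each row to unit length so a dot product equals cosine similarity."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.clip(norms, 1e-12, None)


def exact_top_k(queries, corpus, k):
    """Brute-force ground truth: ids of the k most similar corpus rows per query."""
    sims = normalize(queries) @ normalize(corpus).T
    # Sort by descending similarity and keep the first k columns.
    return np.argsort(-sims, axis=1, kind="stable")[:, :k]


def recall_at_k(approx_ids, exact_ids):
    """Fraction of true neighbors that an approximate index returned."""
    approx_ids = np.asarray(approx_ids)
    exact_ids = np.asarray(exact_ids)
    hits = sum(
        len(set(a) & set(e))
        for a, e in zip(approx_ids.tolist(), exact_ids.tolist())
    )
    return hits / exact_ids.size


# Placeholder data (assumption): in practice these would be the 384-dimensional
# embeddings produced by running the chosen model over the raw text.
rng = np.random.default_rng(0)
corpus = rng.standard_normal((1000, 384)).astype(np.float32)
queries = rng.standard_normal((10, 384)).astype(np.float32)

ground_truth = exact_top_k(queries, corpus, k=10)
```

The ids returned by each vector database for the same queries could then be passed to `recall_at_k` together with `ground_truth`. Brute force scales with the product of corpus size and query count, so at very large scale this step would likely need batching or a GPU.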
