I’m comparing different vector databases for similarity search. I’ve looked at several large-scale datasets such as BIGANN-1B, but one issue I have with them is that the embeddings are compressed to a smaller dimension (128 in the case of BIGANN), whereas for my use case I want to use gte-small as the embedding model, which produces 384-dimensional embeddings, so benchmarks on those datasets would not be representative. Are there any datasets that contain both the ground truth labels and the raw data? For example, a text dataset should include the raw text so that I can convert it to embeddings with my own embedding model.
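
To be concrete about what I mean by ground truth: for each query, the exact top-k nearest neighbours in the embedding space of my model, which I could then compare against what each vector database returns. Below is a minimal sketch of that idea, under a few assumptions: random 384-dimensional vectors stand in for real gte-small embeddings of raw text, cosine similarity is the metric, and the helper names (`exact_top_k`, `recall_at_k`) are just hypothetical names for illustration. Brute force like this only works for small corpora, which is part of why I’m hoping a dataset already ships the raw text and labels.

```python
import math
import random


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def exact_top_k(query, corpus, k):
    """Brute-force exact nearest neighbours: indices of the k most similar corpus vectors."""
    scored = [(cosine_similarity(query, vec), idx) for idx, vec in enumerate(corpus)]
    # Sort by similarity descending; break ties by index for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for _, idx in scored[:k]]


def recall_at_k(approx_ids, exact_ids):
    """Fraction of the exact neighbours that the approximate result recovered."""
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)


DIM = 384  # embedding dimension of gte-small
rng = random.Random(0)

# Assumption: random vectors are placeholders for embeddings that would
# come from running the raw text through the embedding model.
corpus = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(200)]
queries = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(5)]

# One list of exact neighbour ids per query; this is the "ground truth".
ground_truth = [exact_top_k(q, corpus, k=10) for q in queries]
```

With this in hand, `recall_at_k(db_result_ids, ground_truth[i])` would score whatever ids a given database returns for query `i`.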


