What are good models for multimodal object detection when both modalities are image-based, or object detection models that support ensembling out of the box, like YOLOv5?

I have a dataset of top-down images of vehicles in both RGB and IR. Which models can I use for unimodal and multimodal object detection so that I can compare their performance? Links to GitHub repos would be helpful. Thanks.
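To make "ensembling" concrete, here is a rough sketch of the simplest late-fusion baseline I have in mind: run one unimodal detector per modality, pool the boxes, and apply non-maximum suppression. It assumes the RGB and IR frames are pixel-aligned so boxes share one coordinate system, which may not hold for every dataset. The names `Box`, `iou`, and `late_fusion` are hypothetical helpers for illustration only, not part of YOLOv5 or any other library.

```python
from typing import List, Tuple

# Hypothetical box format: (x1, y1, x2, y2, score) in shared pixel coordinates.
Box = Tuple[float, float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def late_fusion(rgb_boxes: List[Box], ir_boxes: List[Box],
                iou_thresh: float = 0.5) -> List[Box]:
    """Pool detections from both modalities, then greedy NMS by score."""
    pooled = sorted(list(rgb_boxes) + list(ir_boxes),
                    key=lambda b: b[4], reverse=True)
    kept: List[Box] = []
    for box in pooled:
        # Keep a box only if it does not overlap a higher-scoring kept box.
        if all(iou(box, k) < iou_thresh for k in kept):
            kept.append(box)
    return kept


if __name__ == "__main__":
    rgb = [(0.0, 0.0, 10.0, 10.0, 0.9)]
    ir = [(1.0, 1.0, 10.0, 10.0, 0.8), (50.0, 50.0, 60.0, 60.0, 0.7)]
    print(late_fusion(rgb, ir))
```

A single-class sketch like this ignores class labels and score calibration between the two detectors; I am mainly looking for models that handle the fusion in a more principled way.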


