Collaborative dense SLAM is fundamental for outdoor multi-robot teams to achieve scalable and consistent 3D perception across large environments. Existing systems typically rely on depth sensors, imposing significant payload penalties and prohibitive power and calibration costs. Monocular RGB cameras offer an appealing lightweight alternative; however, collaborative monocular dense SLAM remains challenging due to inherent scale ambiguity and unreliable inter-agent data association. These difficulties are further exacerbated in outdoor scenes by low overlap, extreme viewpoint variations, and repetitive structures that undermine sparse feature matching.
We propose CoMo3R-SLAM, the first collaborative monocular dense RGB SLAM system that places learned feed-forward 3D reconstruction priors at its core for outdoor multi-agent mapping. Each agent runs a prior-guided front-end for real-time tracking and local dense fusion via ray-range residuals on a generic central-camera model, while a coordinator retrieves candidates, verifies them with dense pointmap matching, synchronizes gauges via closed-form Sim(3) alignment, and refines the map with GPU-accelerated global bundle adjustment and segment-level depth optimization. By leveraging dense learned geometry instead of sparse features and requiring no depth sensors or parametric intrinsics, our system produces robust cross-agent constraints and globally consistent metric reconstructions.
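To make the closed-form Sim(3) alignment step concrete, the following is a minimal sketch and not the system's actual implementation. It assumes the standard Umeyama least-squares solution and assumes that matched 3D points from two agents' maps are already available as corresponding rows. The function name `umeyama_sim3` and its argument names are hypothetical and chosen only for illustration.

```python
import numpy as np


def umeyama_sim3(src, dst):
    """Closed-form similarity alignment (Umeyama-style sketch).

    Returns (s, R, t) minimizing sum_i ||dst_i - (s * R @ src_i + t)||^2
    for corresponding rows of src and dst, each of shape (N, 3).
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    n = src.shape[0]

    # Center both point sets on their means.
    mu_src = src.mean(axis=0)
    mu_dst = dst.mean(axis=0)
    src_c = src - mu_src
    dst_c = dst - mu_dst

    # 3x3 cross-covariance between the centered sets.
    cov = dst_c.T @ src_c / n
    U, D, Vt = np.linalg.svd(cov)

    # Guard against a reflection so that R is a proper rotation (det = +1).
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0
    R = U @ S @ Vt

    # Scale from the singular values and the variance of the source set.
    var_src = (src_c ** 2).sum() / n
    s = np.trace(np.diag(D) @ S) / var_src

    # Translation that maps the scaled, rotated source mean onto the target mean.
    t = mu_dst - s * R @ mu_src
    return s, R, t
```

Given matched points `p_a` in agent A's frame and `p_b` in agent B's frame, `s, R, t = umeyama_sim3(p_a, p_b)` yields a transform that maps A's map into B's gauge via `s * p_a @ R.T + t`. In practice such a solver would typically sit inside an outlier-rejection loop, since dense matches across agents are rarely all correct.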
On the Tanks and Temples (T&T) and Waymo benchmarks, CoMo3R-SLAM achieves the best absolute trajectory error (ATE) on three of four T&T scenes and competitive accuracy on Waymo driving sequences, matching or exceeding state-of-the-art RGB-D collaborative SLAM methods while running in real time at 8 FPS.
Two-agent collaborative reconstructions.
Collaborative reconstructions with three or four agents.
Our method generalizes across illumination conditions: the outdoor scene is reconstructed both in daylight and at night, without assuming a fixed or parametric camera model.