CoViS-Net: A Cooperative Visual Spatial Foundation Model for Multi-Robot Applications

Department of Computer Science and Technology, University of Cambridge
CoRL 2024

TL;DR: CoViS-Net is a decentralized, real-time, multi-robot visual spatial model that learns spatial priors from data to provide relative pose estimates and bird's-eye-view representations. We demonstrate its effectiveness in real-world multi-robot formation control tasks.

Our model can be used to control the relative poses of multiple follower robots (yellow and magenta) with respect to a leader robot (blue) that follows a reference trajectory, using only visual cues from the environment (each robot's field of view is depicted by a cone).

Abstract

Autonomous robot operation in unstructured environments is often underpinned by spatial understanding through vision. Systems composed of multiple concurrently operating robots additionally require access to frequent, accurate and reliable pose estimates. In this work, we propose CoViS-Net, a decentralized visual spatial foundation model that learns spatial priors from data, enabling pose estimation as well as spatial comprehension. Our model is fully decentralized, platform-agnostic, executable in real-time using onboard compute, and does not require existing networking infrastructure. CoViS-Net provides relative pose estimates and a local bird's-eye-view (BEV) representation, even without camera overlap between robots (in contrast to classical methods), and can predict BEV representations of unseen regions. We demonstrate its use in a multi-robot formation control task across various real-world settings.

Model architecture

Model

CoViS-Net consists of four primary components: an image encoder $f_\mathrm{enc}$, a pairwise pose encoder $f_\mathrm{pose}$, a multi-node aggregator $f_\mathrm{agg}$, and a BEV predictor $f_\mathrm{BEV}$. The image encoder uses a pre-trained DinoV2 backbone with additional layers to generate an embedding $\mathbf{E}_i$ from image $I_i$; these embeddings are what the robots communicate to each other. The pairwise pose encoder takes two embeddings $\mathbf{E}_i$ and $\mathbf{E}_j$ as input and predicts the relative pose between the two viewpoints together with an uncertainty estimate. The multi-node aggregator combines the estimated poses and image embeddings from multiple robots into a common representation, from which the BEV predictor generates a local bird's-eye-view representation.
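To make the data flow concrete, the following is a minimal PyTorch-style sketch of how the four components could fit together. The module internals, embedding size, BEV resolution, and class names are illustrative assumptions for exposition, not the released implementation (the actual modules, built on a pre-trained DinoV2 backbone, are far more expressive); the sketch only shows the interfaces and the direction of data flow.

    # Illustrative sketch only: stand-in modules mirroring f_enc, f_pose, f_agg and f_BEV.
    import torch
    import torch.nn as nn

    EMB_DIM = 256   # assumed size of the embedding communicated between robots
    BEV_RES = 32    # assumed BEV grid resolution

    class ImageEncoder(nn.Module):            # f_enc: image I_i -> embedding E_i
        def __init__(self):
            super().__init__()
            self.backbone = nn.Sequential(    # stand-in for DinoV2 plus additional layers
                nn.Conv2d(3, 32, 7, stride=4), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, EMB_DIM))
        def forward(self, img):
            return self.backbone(img)

    class PairwisePoseEncoder(nn.Module):     # f_pose: (E_i, E_j) -> relative pose + variance
        def __init__(self):
            super().__init__()
            self.mlp = nn.Sequential(nn.Linear(2 * EMB_DIM, 256), nn.ReLU(),
                                     nn.Linear(256, 3 + 3))   # (x, y, yaw) and log-variances
        def forward(self, e_i, e_j):
            out = self.mlp(torch.cat([e_i, e_j], dim=-1))
            return out[..., :3], out[..., 3:].exp()

    class MultiNodeAggregator(nn.Module):     # f_agg: fuse embeddings and poses from all nodes
        def __init__(self):
            super().__init__()
            self.proj = nn.Linear(EMB_DIM + 3, EMB_DIM)
        def forward(self, embeddings, poses):  # (N, EMB_DIM), (N, 3) in the ego frame
            tokens = self.proj(torch.cat([embeddings, poses], dim=-1))
            return tokens.mean(dim=0)          # permutation-invariant aggregation

    class BEVPredictor(nn.Module):            # f_BEV: aggregated feature -> BEV grid
        def __init__(self):
            super().__init__()
            self.head = nn.Linear(EMB_DIM, BEV_RES * BEV_RES)
        def forward(self, agg):
            return self.head(agg).view(BEV_RES, BEV_RES)

    # Example: robot i fuses its own image with an embedding received from robot j.
    f_enc, f_pose, f_agg, f_bev = ImageEncoder(), PairwisePoseEncoder(), MultiNodeAggregator(), BEVPredictor()
    e_i, e_j = f_enc(torch.rand(1, 3, 224, 224)), f_enc(torch.rand(1, 3, 224, 224))
    pose_ij, var_ij = f_pose(e_i, e_j)
    bev = f_bev(f_agg(torch.cat([e_i, e_j]),
                      torch.cat([torch.zeros(1, 3), pose_ij.detach()])))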

Training

We train CoViS-Net with supervised learning on data generated in the Habitat simulator using the HM3D dataset, which provides a diverse range of photorealistic indoor environments. The loss combines terms for pose estimation, uncertainty prediction, and BEV representation accuracy. For uncertainty, CoViS-Net uses the Gaussian negative log-likelihood (GNLL) loss: the model predicts both a pose estimate $\hat{\mu}$ and an aleatoric variance $\hat{\sigma}^2$, and is trained to maximize the likelihood of the ground-truth pose under $\mathcal{N}(\hat{\mu}, \hat{\sigma}^2)$. These uncertainty estimates are crucial for downstream robotic applications, allowing the system to make more informed decisions in challenging scenarios.
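As a concrete reference, the sketch below shows how such a GNLL term can be computed with PyTorch's built-in torch.nn.GaussianNLLLoss; the tensor shapes and the softplus positivity transform are illustrative assumptions rather than our exact training code.

    # Gaussian NLL on a batch of predicted relative poses (x, y, yaw); sketch only.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    gnll = nn.GaussianNLLLoss()                 # -log N(target | mean, var), up to constants

    pred_pose = torch.randn(8, 3, requires_grad=True)   # predicted mean \hat{mu}
    raw_var   = torch.randn(8, 3, requires_grad=True)   # unconstrained network output
    pred_var  = F.softplus(raw_var) + 1e-6              # \hat{sigma}^2 must be positive
    gt_pose   = torch.randn(8, 3)                       # ground-truth relative pose

    loss = gnll(pred_pose, gt_pose, pred_var)   # large variance down-weights the residual
    loss.backward()                             # but is itself penalized by the log term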


Video Presentation

Heterogeneous deployment

To demonstrate CoViS-Net's flexibility and platform-agnostic nature, we conducted heterogeneous multi-robot experiments. The system was deployed on a diverse team of robots, including wheeled platforms and a quadruped (Unitree Go1). Despite the significant differences in robot morphology, dynamics, and camera placement, CoViS-Net maintained consistent performance across all platforms. This experiment showcased the model's ability to generalize across different robot types without requiring platform-specific training or adjustments, highlighting its potential for diverse multi-robot applications and easy integration into existing robotic systems.

Indoor Experiments

We extensively tested CoViS-Net in various indoor environments, including corridors, offices, and rooms with challenging lighting. Using up to four follower robots maintaining fixed relative poses to a remote-controlled leader, we evaluated performance across different scenarios. Results showed consistent accuracy, with median localization errors as low as 22 cm and 5.2° for views with direct visual overlap. Even without visual overlap, the system achieved median position errors of 77-146 cm. These tests demonstrated CoViS-Net's robustness to real-world variability in lighting, layout, and obstacles.
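The formation-control loop itself can be kept very simple once relative poses and uncertainties are available on board. The following is a hedged sketch of a single control step for a follower; the proportional gains, the uncertainty gate, and the thresholds are illustrative assumptions, not the controller used in our experiments.

    # Sketch: one control step of a follower robot holding a fixed offset to the leader.
    import numpy as np

    def formation_step(rel_pose, rel_var, goal_offset,
                       k_lin=0.8, k_ang=1.5, var_thresh=0.5, deadband=0.05):
        """rel_pose:    predicted leader pose (x, y, yaw) in the follower's frame
           rel_var:     predicted variance of that estimate
           goal_offset: desired leader pose (x, y, yaw) in the follower's frame
           Returns (v, w): forward and angular velocity commands."""
        if float(np.max(rel_var)) > var_thresh:   # too uncertain: stop rather than drift
            return 0.0, 0.0
        dx = rel_pose[0] - goal_offset[0]         # displacement needed, follower frame
        dy = rel_pose[1] - goal_offset[1]
        dist = float(np.hypot(dx, dy))
        if dist < deadband:                       # close enough: hold position
            return 0.0, 0.0
        heading = float(np.arctan2(dy, dx))       # steer toward the displacement vector
        return k_lin * dist, k_ang * heading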

Outdoor Experiments

Despite being trained solely on indoor data, CoViS-Net showed impressive generalization to outdoor settings. Experiments in streets and open areas yielded median pose estimation errors of 134 cm and 9.5° for non-overlapping views, and 49 cm and 6.9° for overlapping views. While slightly less accurate than indoor performance, these results remain valuable for many multi-robot applications. The system's adaptability to outdoor environments is attributed to the robust spatial understanding of the DinoV2 encoder, opening possibilities for urban exploration, search and rescue, and autonomous navigation in unstructured outdoor spaces.

Homing

Lastly, we show a single-robot application of CoViS-Net in a homing task. We remotely control the robot through an occluded area. The robot memorizes the path by storing keypoints, adding a new one based on the estimated distance and uncertainty relative to the previous keypoint. Once the path has been taught, the robot can replay it by navigating from one keypoint to the next.
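A minimal sketch of this teach-and-repeat logic is given below, assuming access to the model's pairwise pose and uncertainty estimates; the thresholds, the reversed replay order, and the estimate_pose/drive_toward callbacks are hypothetical placeholders, not our implementation.

    # Teaching: store a new keypoint (image embedding) when the robot has moved far enough
    # from the last one, or when the pose estimate to it has become too uncertain.
    import numpy as np

    def should_store_keypoint(pose_to_last, var_to_last, dist_thresh=1.0, var_thresh=0.3):
        dist = float(np.hypot(pose_to_last[0], pose_to_last[1]))
        return dist > dist_thresh or float(np.max(var_to_last)) > var_thresh

    # Replay: servo toward each stored keypoint until the predicted relative pose to it is
    # small, then switch to the next one (reversed order assumed, i.e. homing to the start).
    def replay(keypoints, estimate_pose, drive_toward, done_dist=0.2):
        for kp_embedding in reversed(keypoints):
            while True:
                pose, var = estimate_pose(kp_embedding)   # f_pose(current image, keypoint)
                if float(np.hypot(pose[0], pose[1])) < done_dist:
                    break
                drive_toward(pose, var)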

Real-World Dataset

We show qualitative evaluations of the BEV representation prediction on scenes from our real-world dataset. The top row shows the image captured by each node, the middle row the ground-truth poses in the coordinate frame of each node, and the bottom row the pose predictions and their uncertainties, with the predicted BEV representation in the background.

BibTeX


    @inproceedings{blumenkamp2024covisnet,
      title     = {CoViS-Net: A Cooperative Visual Spatial Foundation Model for Multi-Robot Applications},
      author    = {Blumenkamp, Jan and Morad, Steven and Gielis, Jennifer and Prorok, Amanda},
      booktitle = {Conference on Robot Learning},
      year      = {2024}
    }