Combining foundation models with clustering for zero-shot terrain identification.
Continuous Multi-Modal Sensing
The vehicle records camera images, vehicle state, and control inputs at every timestep, building a rich dataset during real-world operation.
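As an illustration, a per-timestep record might look like the minimal Python sketch below; the field names and shapes are assumptions, not the system's actual logging interface.

```python
from dataclasses import dataclass, field
from typing import List
import numpy as np

@dataclass
class TimestepRecord:
    # Illustrative fields; the exact logged signals are assumptions.
    timestamp: float
    image: np.ndarray    # H x W x 3 camera frame
    state: np.ndarray    # e.g., position, velocity, yaw rate
    control: np.ndarray  # e.g., steering angle, throttle, brake

@dataclass
class DriveLog:
    """Accumulates synchronized multi-modal data during operation."""
    records: List[TimestepRecord] = field(default_factory=list)

    def append(self, timestamp, image, state, control):
        self.records.append(TimestepRecord(timestamp, image, state, control))
```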
Extracting Meaningful Semantic Features
A vision-language model (e.g., CLIP) translates each image into a semantic latent vector by scoring it against text-based terrain queries. These vectors are augmented with basic visual features (such as brightness and color) for additional context.
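A minimal sketch of this feature extraction, assuming the Hugging Face transformers CLIP implementation and an illustrative set of terrain prompts (the actual checkpoint and query texts are assumptions):

```python
import numpy as np
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Illustrative terrain queries; the real prompt set is an assumption.
PROMPTS = ["a photo of asphalt", "a photo of gravel", "a photo of mud",
           "a photo of grass", "a photo of snow", "a photo of sand"]

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def semantic_features(image: Image.Image) -> np.ndarray:
    """Similarity of the image to each text query, as a semantic latent vector."""
    inputs = processor(text=PROMPTS, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image  # shape (1, len(PROMPTS))
    return logits.softmax(dim=-1).squeeze(0).numpy()

def visual_features(image: Image.Image) -> np.ndarray:
    """Basic appearance cues: mean brightness plus mean RGB color."""
    arr = np.asarray(image.convert("RGB"), dtype=np.float32) / 255.0
    return np.concatenate([[arr.mean()], arr.mean(axis=(0, 1))])

def latent_vector(image: Image.Image) -> np.ndarray:
    """Concatenate semantic and basic visual features into one latent vector."""
    return np.concatenate([semantic_features(image), visual_features(image)])
```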
Unsupervised Clustering for Terrain Discovery
Latent features are automatically grouped into clusters, each representing a unique, discovered terrain type, without manual labeling or prior knowledge.
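One way to realize this step, assuming k-means as the clustering algorithm and a hand-picked cluster count (the source specifies neither):

```python
import numpy as np
from sklearn.cluster import KMeans

def discover_terrains(latents: np.ndarray, n_clusters: int = 5) -> np.ndarray:
    """Group per-frame latent vectors into discovered terrain clusters.

    latents: (N, D) array of latent feature vectors.
    Returns an (N,) array of cluster labels, each label standing for a
    terrain type found without manual annotation.
    """
    return KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(latents)
```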
Physics-Informed Optimization for Actionable Parameters
For each terrain cluster, a gradient-based optimizer fine-tunes friction and related parameters, using a differentiable vehicle dynamics model to backpropagate losses from observed driving behavior.
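A toy sketch of the idea, assuming a simplified longitudinal dynamics model, illustrative constants, and an Adam optimizer; the actual differentiable vehicle model and parameter set are more detailed than this:

```python
import torch

def predict_next_velocity(v, throttle, mu, dt=0.1, mass=1500.0, g=9.81, gain=4000.0):
    """Toy differentiable longitudinal model: drive force capped by friction-limited traction.
    The model structure and constants are illustrative assumptions."""
    drive_force = gain * throttle
    traction_limit = mu * mass * g
    force = torch.minimum(drive_force, traction_limit)
    return v + (force / mass) * dt

def fit_friction(v_obs, throttle, v_next_obs, steps=200, lr=0.05):
    """Gradient-based fit of a per-cluster friction coefficient from logged driving data.
    Inputs are 1-D tensors of velocities, throttle commands, and next-step velocities."""
    mu = torch.tensor(0.5, requires_grad=True)         # initial friction guess
    opt = torch.optim.Adam([mu], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        v_pred = predict_next_velocity(v_obs, throttle, mu)
        loss = torch.mean((v_pred - v_next_obs) ** 2)  # mismatch with observed behavior
        loss.backward()                                # backpropagate through the dynamics
        opt.step()
        with torch.no_grad():
            mu.clamp_(0.05, 1.5)                       # keep friction physically plausible
    return mu.item()

# Usage: for each terrain cluster, pass the data logged on that terrain to fit_friction().
```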
Takeaway:
Our method demonstrates that vision-language foundation models, combined with physics-informed optimization, enable autonomous vehicles to adapt to unseen terrain in real time, without human supervision or advance mapping, opening new possibilities for robust off-road autonomy.