Connecting Figure 01 to a large pretrained multimodal model gives it several new capabilities. It can:
- Describe its surroundings.
- Use common-sense reasoning when making decisions.
- Translate ambiguous, high-level requests into context-appropriate behavior.
- Describe *why* it executed a particular action in plain English.