There’s no shortage of rumors about Apple’s plans to release camera-equipped wearables. And while it’s easy to get fatigued by yet another wave of upcoming AI-powered hardware, one powerful use case often gets lost in the shuffle: accessibility.
SceneScout, a new research prototype from Apple and Columbia University, isn’t a wearable. Yet. But it hints at what AI could eventually unlock for blind and low-vision users. As the researchers explain it:
People who are blind or have low vision (BLV) may hesitate to travel independently in unfamiliar environments due to uncertainty about the physical landscape. While most tools focus on in-situ navigation, those exploring pre-travel assistance typically provide only landmarks and turn-by-turn instructions, lacking detailed visual context. Street view imagery, which contains rich visual information and has the potential to reveal numerous environmental details, remains inaccessible to BLV people.
To close this gap, the researchers built SceneScout, which combines Apple Maps APIs with a multimodal large language model to provide interactive, AI-generated descriptions of street view imagery.
Instead of relying only on turn-by-turn directions or landmarks, users can preview an entire route or virtually explore a neighborhood block by block, with street-level descriptions tailored to their specific needs and preferences.
The system supports two main modes:
Route Preview, which lets users get a sense of what they’ll encounter along a specific path. That means sidewalk quality, intersections, visual landmarks, what a bus stop looks like, etc.
Virtual Exploration, which is more open-ended. Users describe what they’re searching for (like a quiet residential area with access to parks), and the AI helps them navigate intersections and explore in any direction based on that intent.
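The paper doesn’t publish the interface behind these two modes, but as a rough illustration, here is a minimal sketch of how the two request types might be modeled. All class, field, and example names are hypothetical, not taken from SceneScout itself.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional

class ExplorationMode(Enum):
    ROUTE_PREVIEW = auto()        # follow a fixed path from origin to destination
    VIRTUAL_EXPLORATION = auto()  # open-ended, intent-driven wandering

@dataclass
class SceneRequest:
    mode: ExplorationMode
    origin: str                        # e.g. a street address or place name
    destination: Optional[str] = None  # used for ROUTE_PREVIEW
    intent: Optional[str] = None       # used for VIRTUAL_EXPLORATION

# Route Preview: describe what lies along a known path.
preview = SceneRequest(
    mode=ExplorationMode.ROUTE_PREVIEW,
    origin="Home",
    destination="Bus stop on Main St",
)

# Virtual Exploration: let the agent pick directions that match the user's stated intent.
explore = SceneRequest(
    mode=ExplorationMode.VIRTUAL_EXPLORATION,
    origin="Downtown",
    intent="quiet residential area with access to parks",
)
```

In the first case the system walks a fixed route; in the second, the user’s intent steers which direction the agent explores at each intersection.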
Behind the scenes, SceneScout grounds a GPT-4o-based agent within real-world map data and panoramic images from Apple Maps.
It simulates a pedestrian’s view, interprets what’s visible, and outputs structured text, broken into short, medium, or long descriptions. The web interface, designed with screen readers in mind, presents all of this in a fully accessible format.
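The researchers haven’t released implementation code, but the general pattern of grounding a multimodal model in a panorama plus map context looks roughly like the sketch below. It uses the public OpenAI Python SDK, a local image file, and hand-written map context purely for illustration; the function name, prompt wording, and context format are assumptions, and SceneScout itself pulls its imagery and metadata from Apple Maps APIs rather than local files.

```python
import base64
from openai import OpenAI  # assumes the official openai-python package

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def describe_panorama(image_path: str, map_context: str, length: str = "short") -> str:
    """Ask a multimodal model to describe a street-view panorama.

    `map_context` stands in for whatever structured map data (street names,
    nearby points of interest) the agent is grounded in; `length` selects a
    short, medium, or long description, mirroring the tiers mentioned above.
    """
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    prompt = (
        f"You are describing a pedestrian's view for a blind or low-vision user. "
        f"Known map context: {map_context}. "
        f"Give a {length} description focused on sidewalks, crossings, and landmarks."
    )

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content

# Example call with a locally saved panorama and hand-written map context:
# print(describe_panorama(
#     "corner_of_main_and_5th.jpg",
#     "NE corner of Main St and 5th Ave, bus stop roughly 20 m east",
#     length="medium",
# ))
```

The key idea is the grounding step: the model never sees the panorama in isolation, but alongside the map data that tells it where the pedestrian is standing, which is what lets the output stay anchored to real streets and landmarks.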