
Borrowing From the Real World: Designing for Gesture and Vision AI

Apple's working on an AI pin. OpenAI's building hardware. Google's pushing TPUs. The AI device race is officially heating up, and everyone's talking about the tech. But here's what keeps me up at night: we're about to design interfaces for gesture and vision-based interactions, and we're basically starting from scratch.


No established patterns. No muscle memory. No "swipe right to like" equivalent.


This is both terrifying and exciting. Because unlike buttons, swipes, and clicks—interactions we've spent decades refining—physical AI devices need us to figure out entirely new ways for people to communicate with technology.


The good news? We're not actually starting from nothing. The real world is full of interaction patterns we've been using our entire lives. We just need to figure out which ones translate to AI devices and which ones... really don't.


The Object Manipulation Playbook

Think about how you interact with physical objects. You don't press a button on a book to turn the page—you grab the corner and flip it. You don't swipe a door handle—you twist it. These interactions are so ingrained that we don't even think about them anymore.


This is where gesture-based AI interfaces get interesting. What if pinching and pulling in the air mimicked manipulating physical objects? Expanding photos by "stretching" them with your hands feels natural because you're borrowing from how you'd unfold a map. Rotating objects by circling your finger in the air works because it's how you'd spin a globe.
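
To make that concrete, here's a rough sketch of the math behind a "stretch to zoom" and "spin to rotate" mapping, assuming a hand tracker that gives you 2D pinch and fingertip positions. The coordinates and function names are invented for illustration, not any particular device's API.

```python
import math

def pinch_scale(start_left, start_right, now_left, now_right):
    """Scale factor for a "stretch" gesture: how far the two pinch points
    have spread apart relative to where they started."""
    def dist(a, b):
        return math.hypot(a[0] - b[0], a[1] - b[1])
    start = dist(start_left, start_right)
    return dist(now_left, now_right) / start if start else 1.0

def circle_rotation(pivot, previous_tip, current_tip):
    """Rotation delta (radians) for a "spin the globe" gesture: the angle the
    fingertip has swept around a pivot point since the last frame."""
    prev = math.atan2(previous_tip[1] - pivot[1], previous_tip[0] - pivot[0])
    curr = math.atan2(current_tip[1] - pivot[1], current_tip[0] - pivot[0])
    return curr - prev

# Hands started 0.2 apart (normalized screen coords) and are now 0.3 apart,
# so the photo renders at 1.5x the size it had when the gesture began.
print(pinch_scale((0.4, 0.5), (0.6, 0.5), (0.35, 0.5), (0.65, 0.5)))
```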


The challenge is figuring out which physical metaphors work when there's no actual object to touch. You can't feel resistance. There's no weight. No texture feedback. So some gestures that work beautifully with real objects fall apart in mid-air.


We Already Speak Gesture

Here's the thing about gestures: we're already using them constantly. When you wave hello, point at something across the room, or give a thumbs up, you're communicating without words. These are universal (well, mostly universal) gestures that don't require explanation.


Vision-based AI interfaces can tap into this existing vocabulary. A wave to dismiss. A point to select. A hand up to pause. These gestures already mean something to us, so the learning curve is minimal.


But—and this is a big but—context matters. A thumbs up in a video call is friendly. A thumbs up while driving past someone could mean something very different. AI devices need to be smart enough to understand contextual intent, not just recognize the gesture itself.
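
A sketch of what that could look like: the same recognized gesture resolves to different intents depending on context, and to nothing at all when the context gives it no safe meaning. The gesture labels, context names, and confidence threshold below are assumptions for illustration.

```python
GESTURE_INTENTS = {
    ("thumbs_up", "video_call"): "send_reaction",
    ("thumbs_up", "document_review"): "approve",
    ("wave", "notification_visible"): "dismiss_notification",
    ("open_palm", "media_playing"): "pause_playback",
}

def resolve_intent(gesture: str, context: str, confidence: float):
    """Map a recognized gesture to an action only when the context gives it
    a clear meaning; otherwise do nothing rather than guess."""
    if confidence < 0.8:          # low-confidence recognition: stay quiet
        return None
    return GESTURE_INTENTS.get((gesture, context))

print(resolve_intent("thumbs_up", "video_call", 0.93))  # -> send_reaction
print(resolve_intent("thumbs_up", "driving", 0.93))     # -> None: no safe mapping
```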


Spatial Computing Isn't New

We navigate physical space every day without thinking about it. You know how far away the door is. You can reach for your coffee without looking. You understand "in front of," "behind," "above," and "below" intuitively.


This spatial awareness is gold for vision-based interfaces. Ambient-aware tech—devices that understand their physical context—can use your natural spatial navigation to create interfaces that feel intuitive. Want to "place" a virtual sticky note on your desk? Point at the desk. Need to pull up information about something across the room? Look at it.
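
Geometrically, "point at the desk" is usually a ray cast from the hand intersected with a surface the device has already mapped. Here's a minimal sketch of that intersection, assuming the device hands you a pointing ray and a detected plane in the same coordinate frame (the example vectors are made up).

```python
import numpy as np

def point_on_surface(ray_origin, ray_direction, plane_point, plane_normal):
    """Where a pointing ray (say, wrist toward fingertip) hits a detected flat
    surface like a desk. Returns None if the ray runs parallel to the surface
    or the surface sits behind the hand."""
    o = np.asarray(ray_origin, dtype=float)
    d = np.asarray(ray_direction, dtype=float)
    p = np.asarray(plane_point, dtype=float)
    n = np.asarray(plane_normal, dtype=float)
    denom = d.dot(n)
    if abs(denom) < 1e-6:
        return None                      # pointing along the surface
    t = (p - o).dot(n) / denom
    if t < 0:
        return None                      # surface is behind the pointing hand
    return o + t * d

# Hand at 1.4 m, pointing forward and slightly down, desk surface at 0.75 m:
# the sticky note lands on the desk roughly 2.2 m in front of the user.
print(point_on_surface([0, 1.4, 0], [0, -0.3, 1], [0, 0.75, 1.5], [0, 1, 0]))
```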


The challenge isn't teaching people these interactions—they already know them. The challenge is making the technology responsive and accurate enough that it doesn't break the illusion. If there's lag, if the device misunderstands where you're pointing, the magic disappears instantly.
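
Part of keeping that illusion alive is filtering the tracked pointer: smooth too much and you add lag, smooth too little and the cursor jitters. A common compromise (the idea behind filters like the 1€ filter) is to ease off the smoothing as the hand speeds up. A toy version, with parameters tuned only for this example:

```python
class PointerSmoother:
    """Adaptive exponential smoothing for a tracked pointer coordinate:
    heavy smoothing while the hand is nearly still (hides sensor jitter),
    almost none during fast moves (avoids the lag that breaks the illusion)."""

    def __init__(self, base_alpha=0.2, speed_gain=10.0):
        self.base_alpha = base_alpha   # blend factor when the hand is still
        self.speed_gain = speed_gain   # how quickly motion turns smoothing off
        self.value = None

    def update(self, x: float) -> float:
        if self.value is None:
            self.value = x
            return x
        speed = abs(x - self.value)                      # per-frame movement
        alpha = min(1.0, self.base_alpha + self.speed_gain * speed)
        self.value += alpha * (x - self.value)
        return self.value

smoother = PointerSmoother()
for raw in [0.50, 0.51, 0.49, 0.50, 0.80]:    # small jitter, then a real move
    print(round(smoother.update(raw), 3))      # jitter is damped, the jump isn't
```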


What Doesn't Translate (Yet)

Not every real-world pattern works for AI devices. Precision is the first casualty. In the real world, you can thread a needle or sign your name with millimeter accuracy. Try doing that with gesture controls in mid-air. It's... frustrating.


Then there are sustained actions. You can hold a door open for five minutes if you need to. Try holding your arm up in the air for five minutes while gesture-controlling something. Your arm will hate you (interaction designers have a name for this: "gorilla arm").


And haptic feedback—the real MVP of physical interaction—is mostly missing. When you press a button on your phone, it clicks (or vibrates). When you gesture in the air, there's nothing. No confirmation. No "yes, I understood that." You're left wondering if the device registered your input or if you need to do it again.
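
One common workaround is to replace the missing click with an immediate acknowledgment in another channel, a sound, a glow, a HUD flash, before anything irreversible happens. A rough sketch of that pattern; the callback names and threshold are placeholders, not a real SDK:

```python
from dataclasses import dataclass

@dataclass
class GestureEvent:
    name: str
    confidence: float

def handle(event: GestureEvent, show_cue, perform):
    """Acknowledge every input before acting on it, so the user never has to
    wonder whether the device registered the gesture."""
    if event.confidence < 0.8:
        show_cue(f"Didn't catch that ({event.name}?)")   # ask, don't guess
        return
    show_cue(f"Got it: {event.name}")                    # stand-in for the click
    perform(event.name)

handle(GestureEvent("wave_dismiss", 0.92),
       show_cue=print,
       perform=lambda action: print("->", action))
```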


The question we need to ask: Which real-world patterns make sense when you remove the physical feedback? And which patterns need to be completely reimagined for gesture and vision interfaces?


The Fun Part

Here's what makes this exciting: we get to experiment. We get to try things that might fail spectacularly. We get to watch people interact with AI devices and learn what feels natural versus what requires training.


Maybe conducting gestures—like an orchestra conductor—become the way we orchestrate complex AI actions. Maybe we borrow from sports: a basketball shooting motion to "send" something, a catching motion to "receive."


Or maybe we discover entirely new interaction patterns that have no real-world equivalent. Something that only makes sense when you're interacting with an intelligent device that can see and understand context.


The hardware race is fun to watch, but the real innovation happens in the design. How do we make alien technology feel natural? How do we borrow from the real world without being constrained by it?


That's the challenge. And honestly? That's the best part.


