AI Video to 3D: CVPR 2025 Breakthrough for AR & Robots

Insights | 22-08-2025 | By Robin Mitchell

Key Things to Know:

  • Smartphone video → 3D twins: Cornell’s DRAWER turns ordinary clips into interactive, photorealistic rooms with movable doors, drawers and objects.
  • Low-friction capture: no special sensors or manual mapping; a short, casual video supplies geometry, textures, and articulation (hinges, slides) in real time.
  • Beyond demos: Teams trained a robot in the digital kitchen and transferred skills to the real world, pointing to faster, safer automation workflows.
  • What’s next: Support for soft/reflective materials and larger indoor–outdoor scenes is planned; performance, reliability and edge-case handling remain active challenges.

Despite years of bold promises and heavy investment, augmented reality (AR) remains a technology struggling to bridge the gap between imagination and implementation. From clunky hardware to complex software challenges, the dream of seamlessly blending the digital and physical worlds has been harder to realise than many expected.

Now, a research team at Cornell University, together with collaborators at the University of Illinois and University of Washington, may have taken a significant step forward with a new AI system that creates interactive 3D environments from nothing more than ordinary smartphone video. The breakthrough, formally presented at CVPR 2025 and reported by the Cornell Chronicle, is attracting attention across both the gaming and robotics communities.

Why has AR fallen short so far, what makes this breakthrough different, and how could this shift the way robots interact with the real world?

The Challenge of Augmented Worlds

Augmented reality has been hailed as the future for at least a decade, yet it still feels more like a tech demo than a serious tool. Despite billions in investment and plenty of flashy marketing, AR has struggled to gain serious traction among consumers and developers alike. But it's not just a matter of hype failing to meet expectations; the underlying challenges are both technical and practical, and they run deeper than most people realise.

Starting with the obvious, the technology might simply not be ready yet. Of course, we have smartphones that can place digital furniture in our living rooms and filters that turn our faces into something humorous, but that is far from the seamless, immersive augmentation of reality that science fiction has promised. Put simply, the current generation of hardware and software isn't up to the job.

It's also possible that AR is simply ahead of its time. Some technologies need the world to catch up before they make sense, much as mobile computing and electric vehicles took years to become mainstream. With AR, we're still figuring out where the actual value lies beyond novelty. Training, navigation, and industrial repair seem like useful applications, but social media goggles and digital billboards floating in your eyeline are not exactly killer apps.

Then there's the usability problem: AR is notoriously awkward to use in public. Headsets are bulky, holding a phone up to "look" at things through a screen is tedious, battery life drains quickly, and social context matters just as much as hardware. We're still a long way from making AR feel natural, intuitive, and socially acceptable.

But zoom in further, and you hit the real brick wall: the engineering. Building augmented worlds that are accurate, interactive, and fast enough for real-time use is technically brutal. One of the hardest problems is mapping the real world into the virtual one, all while keeping it consistent.

LiDAR (Light Detection and Ranging) is often used for this purpose, and it's a capable technology, but it isn't cheap, power-efficient, or compact enough to squeeze into every consumer-grade device. Even where it does fit, resolution, speed, and environmental conditions still impose limits.

Environments themselves also don't play nice, as real-world spaces are complex, dynamic, and full of edge cases. Reflective surfaces, moving people, low-light conditions, and messy rooms all interfere with spatial tracking and scene understanding.

Finally, there's the elephant in the room: processing power. Creating a convincing AR experience in real time means simultaneously scanning the environment, building a 3D map, recognising objects, predicting motion, rendering graphics, and maintaining frame rates, all on a mobile chip that is often passively cooled. That's a lot to ask, and without dedicated hardware acceleration, performance quickly becomes the bottleneck, and with it, the user experience.
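A quick back-of-envelope calculation makes the problem concrete. The per-stage timings below are illustrative guesses rather than measured figures, but they show how fast a 60 fps frame budget disappears:

```python
# Back-of-envelope frame budget for mobile AR at a 60 fps target.
# Stage timings are illustrative assumptions, not measurements.

TARGET_FPS = 60
frame_budget_ms = 1000 / TARGET_FPS  # ~16.7 ms per frame

stage_cost_ms = {
    "tracking (visual-inertial odometry)": 5.0,
    "scene mapping (depth/mesh updates)":  4.0,
    "object recognition":                  6.0,
    "rendering and compositing":           4.0,
}

total = sum(stage_cost_ms.values())
print(f"Budget: {frame_budget_ms:.1f} ms | pipeline: {total:.1f} ms")
if total > frame_budget_ms:
    print("Over budget: frames drop unless work is offloaded or amortised")
```

Even with these optimistic numbers, the pipeline overshoots the budget, which is why real systems amortise heavy stages across frames or push them onto dedicated silicon.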

Researchers Create AI That Builds Augmented Realities From Vision

A new AI system developed by researchers at Cornell University is transforming everyday smartphone videos into interactive, photorealistic 3D worlds, opening exciting possibilities for gaming, robotics, and digital design. The technology, called DRAWER, can take a simple video of a room, such as a kitchen or office, and automatically build a detailed digital twin in which objects like drawers and cabinet doors can be virtually opened and moved, creating a truly immersive experience.

Unlike previous methods that required specialised equipment or manual mapping, DRAWER only needs a short, casual video captured on an ordinary smartphone. Hongchi Xia, a Ph.D. student at Cornell and one of the lead developers, explained that the system analyses the visual data from the video to reconstruct the room's shapes, textures, and dimensions. This enables the AI to create a 3D model that looks lifelike and also responds naturally to user interaction, a significant advance over existing technologies that often produce static or non-interactive models.
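At a high level, the flow the researchers describe resembles the sketch below. Every function is a toy stub standing in for a substantial subsystem (frame sampling, camera-pose estimation, surface and texture reconstruction), and all the names are placeholders of our own, not DRAWER's actual code:

```python
# Toy sketch of the video-to-digital-twin flow described in the article.
# Each stub stands in for a substantial subsystem; names are placeholders,
# not DRAWER's actual API.

def extract_frames(video_path: str) -> list[str]:
    # Sample the casual video (here: every 10th of 300 frames).
    return [f"{video_path}#frame{i}" for i in range(0, 300, 10)]

def estimate_camera_poses(frames: list[str]) -> list[tuple]:
    # Stand-in for structure-from-motion / SLAM camera tracking.
    return [(0.01 * i, 0.0, 0.0) for i in range(len(frames))]

def reconstruct_scene(frames: list[str], poses: list[tuple]) -> dict:
    # Stand-in for recovering shapes, textures, and metric dimensions.
    return {"geometry": "room mesh", "textures": "atlas", "scale": "metric"}

frames = extract_frames("kitchen.mp4")
poses = estimate_camera_poses(frames)
twin = reconstruct_scene(frames, poses)
print(f"Built a twin from {len(frames)} sampled frames: {sorted(twin)}")
```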

Assistant Professor Wei-Chiu Ma, who leads the project, highlighted that while many current approaches can synthesise scenes from various camera angles, they often lack true interactivity. DRAWER breaks new ground by combining fine geometric detail with an understanding of which parts of the environment can move and how they should move. The system's articulation module detects elements such as drawers that slide horizontally or doors that swing outward, and a perception module identifies hinge locations and articulation types. It even uses generative AI to imagine hidden areas, like the inside of cabinets, enhancing realism.
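The article's description of the articulation and perception modules suggests a data model along these lines. This is a minimal sketch under our own assumptions; the class names, fields, and threshold are invented for illustration and are not the paper's API:

```python
from dataclasses import dataclass
from enum import Enum

# Illustrative model of articulated parts, mirroring the article's account
# of DRAWER's articulation and perception modules. All names here are
# assumptions made for this sketch.

class JointType(Enum):
    PRISMATIC = "slide"   # drawers slide along a linear axis
    REVOLUTE = "hinge"    # cabinet doors swing about a hinge axis

@dataclass
class ArticulatedPart:
    name: str
    joint: JointType
    axis: tuple[float, float, float]   # direction of slide or hinge axis
    pivot: tuple[float, float, float]  # hinge location found by perception
    limits: tuple[float, float]        # min/max travel (metres or radians)

    def is_open(self, state: float, threshold: float = 0.1) -> bool:
        """Treat any displacement past the threshold as 'open'."""
        return abs(state - self.limits[0]) > threshold

# A drawer that slides 45 cm along the x-axis:
drawer = ArticulatedPart("kitchen_drawer", JointType.PRISMATIC,
                         axis=(1, 0, 0), pivot=(0, 0, 0), limits=(0.0, 0.45))
print(drawer.is_open(0.3))  # True
```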

Watch DRAWER in Action

To fully appreciate DRAWER’s capabilities, it helps to see how an ordinary kitchen video can be transformed into a responsive, interactive digital twin. The following demonstrations, provided by the research team, highlight the system’s ability to reconstruct environments with movable objects in real time.

  • Original smartphone video: a static recording of a real kitchen scene.
  • Interactive 3D reconstruction: drawers, doors, and objects can now be virtually opened and moved.

Integrating these capabilities into a seamless, functional framework took months of development. Once the system was operational, Xia demonstrated DRAWER's versatility by recreating various rooms, including his own office and bathroom. In one demonstration, the digital kitchen was imported into Unreal Engine as a simple game environment, letting users throw virtual balls to knock over objects and showing that the reconstructed scene responds naturally to physical interaction.

Beyond gaming, DRAWER offers serious potential for robotics. Robots can be trained within these digital replicas through a process called real-to-sim-to-real transfer, where tasks are learned virtually before being applied in the real world. Ma's team successfully trained a robotic arm in the digital kitchen to put objects away in drawers, and the robot performed the same tasks accurately in the physical space. This method greatly reduces the costs and risks of physical robot training, making automation safer and more accessible.
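In code, the real-to-sim-to-real idea reduces to a simple loop: practise cheaply and safely in the digital twin, keep the best-performing policy, then deploy it on hardware. The sketch below is deliberately a toy; the simulator, the one-parameter "policy", and the reward function are all stand-ins we have invented for illustration:

```python
import random

# Toy real-to-sim-to-real loop: many cheap virtual rollouts in the digital
# twin, then a single deployment of the best policy to the real robot.
# Everything here is a stand-in for a real simulator and policy.

def simulate_episode(policy_gain: float) -> float:
    """Stand-in for running 'put object in drawer' in the digital twin."""
    noise = random.gauss(0, 0.1)
    return -abs(policy_gain - 1.0) + noise  # reward peaks near gain = 1.0

def train_in_sim(trials: int = 1000) -> float:
    best_gain, best_reward = 0.0, float("-inf")
    for _ in range(trials):
        gain = random.uniform(0.0, 2.0)   # sample a candidate policy
        reward = simulate_episode(gain)   # cheap, safe virtual rollout
        if reward > best_reward:
            best_gain, best_reward = gain, reward
    return best_gain

policy = train_in_sim()
print(f"Deploying policy with gain {policy:.2f} to the real robot")
```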

Looking ahead, the researchers plan to expand DRAWER's capabilities beyond rigid objects to include soft or deformable materials such as fabrics, as well as reflective and dynamic surfaces like mirrors and windows. They are also working on scaling the technology to model entire buildings and outdoor environments, envisioning applications in urban planning, smart farming, and disaster response, where virtual testing in accurate digital twins could improve decision-making.

How Could Such Technologies Change the Future of Robotics?

The ability to create fully interactive, photorealistic environments from nothing more than smartphone video is more than a novelty; it could represent a major shift in how robots perceive and operate in the real world. Because systems like DRAWER don't just capture geometry but extract meaning, machines can build a working mental model of their environment, complete with object relationships, constraints, and potential interactions.

This represents a significant shift in how robots perceive and interact with their environments. Traditionally, robots "see" through sensors like cameras or LiDAR, but what they actually understand is minimal, often limited to bounding boxes or depth maps. With AI-driven scene reconstruction, that changes: a robot could recognise that a drawer is not just an obstacle, but something that opens, holds objects, and follows specific rules of movement.

This unlocks serious new possibilities in robot planning: a robot that generates a digital twin of its surroundings could simulate intended actions before committing to them. For example, it could virtually test whether it has clearance to open a door or whether its gripper will fit into a narrow cabinet. This kind of real-time, self-generated foresight brings robotics closer to something resembling human forethought: not just reacting to the world, but predicting it.
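As a concrete illustration of "simulate before you act", consider the drawer-clearance example. The sketch below flattens the geometry to 1-D intervals so it stays self-contained; a real planner would run full 3-D collision checks inside the digital twin, and every name here is our own invention:

```python
# Toy feasibility check: sweep a drawer front through its travel and test
# for collisions with the robot's planned standing position. Geometry is
# reduced to 1-D intervals to keep the sketch self-contained.

def intervals_overlap(a: tuple, b: tuple) -> bool:
    return a[0] < b[1] and b[0] < a[1]

def drawer_sweep_clear(drawer_front: float, travel: float,
                       robot_zone: tuple, steps: int = 20) -> bool:
    """Virtually open the drawer in small steps and check each pose."""
    for i in range(steps + 1):
        front = drawer_front + travel * i / steps
        drawer_zone = (drawer_front, front)   # space swept so far
        if intervals_overlap(drawer_zone, robot_zone):
            return False                      # predicted collision
    return True

# Robot stands 0.3 m from the cabinet; the drawer needs 0.45 m of travel.
print(drawer_sweep_clear(0.0, 0.45, robot_zone=(0.3, 0.8)))  # False: replan
```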

The applications go well beyond industrial arms or smart homes; autonomous vehicles are a prime example. Self-driving cars could benefit from richer environmental models: instead of merely identifying lanes and cars, a vehicle could interpret subtler cues, like the intent of a person reaching for a car door or the motion path of a partially obscured cyclist, and make better decisions as a result.

Of course, this level of sophistication doesn't come for free. The technology is still evolving, and many challenges remain, especially around processing power, reliability, and edge-case behaviour. However, the direction is clear: tools like DRAWER point toward a future where robots don't just see the world; they understand it. And that understanding could be the key to making intelligent machines genuinely useful, not just impressive.


By Robin Mitchell

Robin Mitchell is an electronic engineer who has been involved in electronics since the age of 13. After completing a BEng at the University of Warwick, Robin moved into the field of online content creation, developing articles, news pieces, and projects aimed at professionals and makers alike. Currently, Robin runs a small electronics business, MitchElectronics, which produces educational kits and resources.