WildDet3D: Scaling Promptable 3D Detection in the Wild

Weikai Huang1,2 , Jieyu Zhang1,2
Sijun Li2, Taoyang Jia2, Jiafei Duan1,2, Yunqian Cheng1, Jaemin Cho1,2, Matthew Wallingford1, Rustin Soraki1,2, Chris Dongjoo Kim1, Shuo Liu1,2, Donovan Clay1,2, Taira Anderson1, Winson Han1
Ali Farhadi1,2, Bharath Hariharan3, Zhongzheng Ren1,2,4 , Ranjay Krishna1,2
1Allen Institute for AI    2University of Washington    3Cornell University    4UNC-Chapel Hill
denotes core contributors.

Abstract

Understanding objects in 3D from a single image is a cornerstone of spatial intelligence. A key step toward this goal is monocular 3D object detection—recovering the extent, location, and orientation of objects from an input RGB image. To be practical in the open world, such a detector must generalize beyond closed-set categories, support diverse prompt modalities, and leverage geometric cues when available. Progress is hampered by two bottlenecks: (1) existing methods are designed for a single prompt type and lack a mechanism to incorporate additional geometric cues; (2) current 3D datasets cover only narrow categories captured in controlled environments, limiting open-world transfer.

In this work we address both gaps. First, we introduce WildDet3D, a unified geometry-aware architecture that natively accepts text, point, and box prompts and can incorporate auxiliary depth signals at inference time. Second, we present WildDet3D-Data, the largest open 3D detection dataset to date, constructed by generating candidate 3D boxes from existing 2D annotations and retaining only human-verified ones, yielding over 1M images across 13.5K categories in diverse real-world scenes.

WildDet3D establishes a new state-of-the-art across multiple benchmarks and settings. In the open-world setting, it achieves 22.6/24.8 AP3D-dist on our newly introduced WildDet3D-Bench with text and box prompts, respectively. On Omni3D, it reaches 34.2/36.4 AP3D with text and box prompts, respectively. In zero-shot evaluation, it achieves 40.3/48.9 ODS on Argoverse 2 and ScanNet. Notably, incorporating depth cues at inference time yields substantial additional gains (+20.7 AP on average across settings).

WildDet3D overview: single RGB image with optional depth, flexible prompts, and 3D bounding box predictions across diverse scenes
Overview of WildDet3D. Given a single RGB image and an optional depth map, WildDet3D performs open-vocabulary monocular 3D object detection by accepting flexible prompt modalities—text queries, 2D point clicks, or 2D bounding boxes—and predicting full 3D bounding boxes for the specified objects. This unified framework enables interactive, open-world 3D perception across diverse scenes and thousands of object categories.

Video Demo

iPhone App Demo

Interactive Visualizations

Explore WildDet3D-Data and model predictions interactively through our visualization servers.

Dataset Viewer

Browse WildDet3D-Data interactively—explore 3D bounding box annotations across 1M+ images and 13.5K categories in diverse scenes.

Model Comparison Visualizer

Compare WildDet3D predictions against baselines on the full WildDet3D-Bench with side-by-side 3D box visualizations.

Model Architecture

WildDet3D uses a unified geometry-aware architecture with dual-vision encoders for RGB and optional RGBD input. A depth fusion module integrates geometric cues when available, while a promptable detector unifies text, point, and box prompts. Cascaded 2D and 3D detection heads predict full 3D bounding boxes with metric depth, dimensions, and 6-DoF orientation. The model gracefully degrades to monocular mode when depth is unavailable.

WildDet3D model architecture diagram
WildDet3D architecture. Dual-vision encoders process the image and optional depth. A ControlNet-style depth fusion module injects geometric features, and a promptable detector supports text, point, and box prompts. The 3D detection head produces metric 3D bounding boxes with depth, dimensions, and rotation.
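The data flow described above can be sketched as a toy forward pass. This is not the WildDet3D implementation: the encoders and heads below are random projections, and only the control flow mirrors the description (RGB encoding, optional depth fusion, prompt conditioning, then a 3D head regressing a metric box).

```python
import numpy as np

class WildDet3DSketch:
    """Toy sketch of the forward pass described above.

    Encoders and heads are random linear projections, not the real
    model; only the control flow mirrors the architecture text.
    """

    def __init__(self, dim=16, seed=0):
        rng = np.random.default_rng(seed)
        self.rgb_proj = rng.normal(size=(3, dim))    # stand-in RGB encoder
        self.depth_proj = rng.normal(size=(1, dim))  # stand-in depth encoder
        self.head = rng.normal(size=(dim, 7))        # (x, y, z, w, h, l, yaw)

    def __call__(self, rgb, prompt, depth=None):
        # Dual-vision encoders: pool the image, project to a feature.
        feat = rgb.reshape(-1, 3).mean(axis=0) @ self.rgb_proj
        if depth is not None:
            # Depth fusion: inject geometric cues when available; without
            # depth, the model gracefully degrades to monocular mode.
            feat = feat + depth.reshape(-1, 1).mean(axis=0) @ self.depth_proj
        # Promptable detector: all prompt modalities condition the query.
        assert prompt["type"] in {"text", "point", "box"}
        # 3D head: regress a metric box (center, dimensions, yaw).
        x, y, z, w, h, l, yaw = feat @ self.head
        return {"center": (x, y, z), "dims": (abs(w), abs(h), abs(l)), "yaw": yaw}

rgb = np.random.rand(64, 64, 3)
box = WildDet3DSketch()(rgb, {"type": "text"})
```

The same call works with or without a depth map, mirroring the "graceful degradation" behavior the figure describes.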

WildDet3D-Data

We introduce WildDet3D-Data, the largest open 3D detection dataset to date. It is constructed by generating candidate 3D boxes from existing 2D annotations across COCO, LVIS, Objects365, and V3Det, then filtering with geometric/semantic checks and retaining only human-verified annotations. The result is a diverse, large-scale dataset spanning indoor, outdoor, and nature scenes.
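The lift-then-filter recipe can be illustrated with a minimal sketch. The back-projection below uses a standard pinhole camera model with the median depth inside each 2D box; the depth source, the thresholds, and the depth-extent heuristic are all illustrative assumptions, not the authors' actual pipeline, and the human-verification stage is omitted.

```python
import numpy as np

def lift_2d_to_3d(box2d, depth_map, fx=500.0, fy=500.0, cx=320.0, cy=240.0):
    """Back-project a 2D box (x0, y0, x1, y1) into a coarse 3D box using
    a pinhole camera model and the median depth inside the box.
    The depth extent (last dim) is a heuristic guess."""
    x0, y0, x1, y1 = box2d
    patch = depth_map[int(y0):int(y1), int(x0):int(x1)]
    z = float(np.median(patch))
    # Box center in camera coordinates.
    u, v = (x0 + x1) / 2, (y0 + y1) / 2
    X, Y = (u - cx) * z / fx, (v - cy) * z / fy
    # Metric width/height from the 2D extent at depth z.
    w = (x1 - x0) * z / fx
    h = (y1 - y0) * z / fy
    return {"center": (X, Y, z), "dims": (w, h, min(w, h))}

def geometric_check(box3d, max_depth=80.0, max_dim=20.0):
    """Reject implausible candidates before human verification
    (thresholds here are made up for illustration)."""
    z = box3d["center"][2]
    return 0.1 < z < max_depth and all(0 < d < max_dim for d in box3d["dims"])
```

Candidates surviving such automatic checks would then go to human verification, which is the stage that gives the dataset its quality guarantee.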

1M+ images · 13.5K categories · 3.7M 3D annotations · 22 scene types · 138× the categories of Omni3D
Scene category distribution of WildDet3D-Data
Scene category distribution. WildDet3D-Data spans 22 scene categories across indoor (52%), urban (32%), and nature (15%) environments, providing broad coverage of real-world settings.
Qualitative examples from WildDet3D-Data
Qualitative examples from WildDet3D-Data. Each pair shows 3D bounding box annotations overlaid on the input image with category labels (left) and the corresponding 3D bounding boxes rendered in the reconstructed point cloud (right). The dataset covers diverse settings including indoor scenes, outdoor environments, and animals in the wild.

Qualitative Results

Text-Prompted Detection

Text-prompted 3D detection results across diverse scenes
Text-prompted comparison. WildDet3D detects objects specified by open-vocabulary text categories, producing accurate 3D bounding boxes across indoor and outdoor scenes.

Box-Prompted Detection

Box-prompted 3D detection results
Box-prompted comparison. Given 2D bounding boxes as prompts, WildDet3D lifts them into accurate 3D bounding boxes with metric depth, dimensions, and orientation.

Applications

Beyond benchmark evaluation, WildDet3D is deployed across a range of real-world platforms, demonstrating its versatility as a general-purpose 3D perception module.

WildDet3D web demo

Web Demo

Interactive demo on Hugging Face Spaces. Upload any image, provide text or box prompts, and visualize 3D bounding box predictions in real time.

WildDet3D iPhone app

iPhone App

On-device 3D detection via ARKit with LiDAR depth, supporting open-vocabulary text queries and 2D box prompts with AR overlays anchored to the physical scene.

WildDet3D VLM agent

VLM Agent

Paired with vision-language models for referring expression localization: the VLM reasons over the image and produces a 2D box, which WildDet3D then lifts to a full 3D bounding box.
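The glue between the two models is thin, which is what makes the agent pattern attractive. A hedged sketch of that handoff, where `vlm_ground` and `detect_3d` are hypothetical stand-in interfaces (the real VLM and detector calls are not shown in this page):

```python
def locate_in_3d(image, expression, vlm_ground, detect_3d):
    """Referring expression -> 3D box, via a 2D box prompt.

    `vlm_ground` and `detect_3d` are injected callables standing in for
    the VLM grounding step and the WildDet3D box-prompted detector."""
    box2d = vlm_ground(image, expression)  # e.g. (x0, y0, x1, y1)
    return detect_3d(image, prompt={"type": "box", "box": box2d})

# Toy stand-ins to exercise the control flow:
fake_vlm = lambda img, expr: (10, 20, 110, 220)
fake_det = lambda img, prompt: {"prompt_box": prompt["box"],
                                "center": (0.0, 0.5, 3.2)}
result = locate_in_3d(None, "the red mug", fake_vlm, fake_det)
```

Because the detector only needs a 2D box prompt, any grounding-capable VLM can be swapped in without retraining either component.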

WildDet3D zero-shot tracking

Zero-Shot Tracking

Track objects in video sequences with zero-shot 3D detection, combining per-frame predictions with temporal consistency.
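One simple way to combine per-frame detections into tracks is greedy nearest-centroid association between frames. The page does not specify the actual tracker, so the sketch below is an illustrative baseline, not the method used in the demo:

```python
import numpy as np

def associate(prev_tracks, detections, max_dist=1.0):
    """Greedily match existing tracks to per-frame 3D detections by
    centroid distance (a stand-in for the temporal-consistency step).

    prev_tracks: {track_id: last 3D centroid}
    detections:  list of 3D centroids for the current frame
    Returns (matches {track_id: detection_index}, unmatched indices)."""
    matches, unmatched = {}, list(range(len(detections)))
    for tid, c_prev in prev_tracks.items():
        if not unmatched:
            break
        dists = [np.linalg.norm(np.asarray(detections[j]) - c_prev)
                 for j in unmatched]
        j_best = unmatched[int(np.argmin(dists))]
        if min(dists) < max_dist:      # gate: reject far-away matches
            matches[tid] = j_best
            unmatched.remove(j_best)
    return matches, unmatched
```

Unmatched detections would spawn new tracks; a production tracker would use optimal assignment and motion models, but the gating idea is the same.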

WildDet3D on Meta Quest 3

Augmented Reality (Meta Quest 3)

Passthrough AR with 3D bounding boxes rendered in real time. Users can query objects by category and see metric 3D boxes anchored in physical space.

WildDet3D for robotic manipulation

Robotics

Open-vocabulary 3D detection for Franka Emika Panda manipulation. Predicted 3D boxes are transformed to the robot's frame for zero-shot grasp pose generation.
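The camera-to-robot handoff mentioned above amounts to applying the camera extrinsics to the predicted box. A minimal sketch, with a made-up 4×4 transform and yaw omitted for brevity:

```python
import numpy as np

def box_corners(center, dims):
    """8 corners of an axis-aligned 3D box in the camera frame
    (yaw omitted in this sketch)."""
    c, d = np.asarray(center, dtype=float), np.asarray(dims, dtype=float) / 2
    signs = np.array([[sx, sy, sz]
                      for sx in (-1, 1) for sy in (-1, 1) for sz in (-1, 1)])
    return c + signs * d

def to_robot_frame(points_cam, T_base_cam):
    """Apply a 4x4 camera->base homogeneous transform to Nx3 points."""
    homog = np.hstack([points_cam, np.ones((len(points_cam), 1))])
    return (T_base_cam @ homog.T).T[:, :3]

# Example extrinsics (made up): camera mounted 0.5 m above the robot base,
# axes aligned with the base frame.
T = np.eye(4)
T[:3, 3] = [0.0, 0.0, 0.5]
corners_base = to_robot_frame(box_corners((0.0, 0.0, 1.0), (0.1, 0.1, 0.2)), T)
```

With the corners in the robot base frame, a grasp planner can consume them directly; the real system would use calibrated extrinsics rather than this identity-rotation example.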

More Examples

Citation

Paper and BibTeX coming soon.