Group3D: MLLM-Driven Semantic Grouping
for Open-Vocabulary 3D Object Detection

Youbin Kim1, Jinho Park1, Hogun Park1, Eunbyung Park2

1Sungkyunkwan University 2Yonsei University

Paper · Code
- 🔤 **Open-vocabulary**: beyond a fixed category taxonomy
- 🖼️ **Multi-view RGB**: no point cloud or depth required
- 📷 **Pose-free**: works without camera poses
- 🎯 **Zero-shot**: no 3D supervision required

Abstract

Open-vocabulary 3D object detection aims to localize and recognize objects beyond a fixed training taxonomy. In multi-view RGB settings, recent approaches often decouple geometry-based instance construction from semantic labeling, generating class-agnostic fragments and assigning open-vocabulary categories post hoc. While flexible, such decoupling leaves instance construction governed primarily by geometric consistency, without semantic constraints during merging. When geometric evidence is view-dependent and incomplete, this geometry-only merging can lead to irreversible association errors, including over-merging of distinct objects or fragmentation of a single instance. We propose Group3D, a multi-view open-vocabulary 3D detection framework that integrates semantic constraints directly into the instance construction process. Group3D maintains a scene-adaptive vocabulary derived from a multimodal large language model (MLLM) and organizes it into semantic compatibility groups that encode plausible cross-view category equivalence. These groups act as merge-time constraints: 3D fragments are associated only when they satisfy both semantic compatibility and geometric consistency. This semantically gated merging mitigates geometry-driven over-merging while absorbing multi-view category variability. Group3D supports both pose-known and pose-free settings, relying only on RGB observations. Experiments on ScanNet and ARKitScenes demonstrate that Group3D achieves state-of-the-art performance in multi-view open-vocabulary 3D detection, while exhibiting strong generalization in zero-shot scenarios.

Method

Group3D pipeline
Given multi-view RGB images, an MLLM predicts object categories across views, which are aggregated into a Scene Vocabulary Memory. Category-aware masks are lifted into 3D to build a 3D Fragment Memory. The MLLM then organizes the vocabulary into semantic compatibility groups, which, together with geometric consistency, gate fragment merging into open-vocabulary 3D object instances. Finally, multi-view evidence is accumulated to determine each instance's open-vocabulary category and 3D bounding box.
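The merging step above can be sketched in code. This is a minimal illustration, not the authors' implementation: `Fragment`, `compatibility_groups`, and `geometric_overlap` are hypothetical names, point sets stand in for lifted 3D masks, and a set-overlap ratio stands in for a real 3D IoU test. A merge happens only when both the semantic gate (shared compatibility group) and the geometric gate (overlap above a threshold) pass.

```python
# Illustrative sketch of semantically gated fragment merging.
# All names and thresholds are assumptions, not Group3D's actual API.
from dataclasses import dataclass, field
from collections import Counter


@dataclass
class Fragment:
    points: list                                      # lifted 3D points (placeholder)
    votes: Counter = field(default_factory=Counter)   # per-view category votes


def group_id(category, compatibility_groups):
    """Map a category to its MLLM-derived semantic compatibility group."""
    for gid, members in enumerate(compatibility_groups):
        if category in members:
            return gid
    return None


def compatible(a, b, groups):
    """Semantic gate: fragments share at least one compatibility group."""
    ga = {group_id(c, groups) for c in a.votes}
    gb = {group_id(c, groups) for c in b.votes}
    return bool(ga & gb)


def geometric_overlap(a, b):
    """Geometric gate (stand-in for a real 3D IoU / point-overlap test)."""
    inter = len(set(a.points) & set(b.points))
    union = len(set(a.points) | set(b.points))
    return inter / union if union else 0.0


def merge_fragments(fragments, groups, tau=0.25):
    """Greedily fuse fragment pairs that pass BOTH gates."""
    frags = list(fragments)
    merged = True
    while merged:
        merged = False
        for i in range(len(frags)):
            for j in range(i + 1, len(frags)):
                a, b = frags[i], frags[j]
                if compatible(a, b, groups) and geometric_overlap(a, b) >= tau:
                    a.points = list(set(a.points) | set(b.points))
                    a.votes += b.votes   # accumulate multi-view evidence
                    frags.pop(j)
                    merged = True
                    break
            if merged:
                break
    return frags
```

A fragment's final open-vocabulary category can then be read off by majority vote over views, e.g. `frag.votes.most_common(1)[0][0]`; the semantic gate is what prevents geometrically overlapping but categorically incompatible fragments (say, a table fragment and a sofa fragment) from being fused.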

Quantitative Results

| Method | ScanNet20 mAP25 | ScanNet20 mAP50 | ScanNet60 mAP25 | ScanNet60 mAP50 |
|---|---|---|---|---|
| **Point cloud-based** | | | | |
| Det-PointCLIPv2 (ICCV'23) | 0.2 | – | – | – |
| 3D-CLIP (ICML'21) | 4.0 | – | – | – |
| OV-3DET (CVPR'23) | 18.0 | – | – | – |
| CoDa (NeurIPS'23) | 19.3 | 9.0 | – | – |
| INHA (ECCV'24) | 10.7 | – | – | – |
| GLIS (ECCV'24) | 20.8 | – | – | – |
| ImOV3D (NeurIPS'24) | 21.5 | – | – | – |
| OV-Uni3DETR (ECCV'24) | 25.3 | 19.4 | – | – |
| Zoo3D0 (CVPR'26) | 34.7 | 23.9 | 27.1 | 18.7 |
| Zoo3D1 (CVPR'26) | 37.2 | 26.3 | 32.0 | 20.8 |
| **Multi-view image-based (posed)** | | | | |
| OV-Uni3DETR (ECCV'24) | 11.2 | – | – | – |
| OpenM3D (ICCV'25) | 19.8 | 7.3 | – | – |
| Zoo3D0 (CVPR'26) | 30.5 | 17.3 | 22.0 | 10.4 |
| Zoo3D1 (CVPR'26) | 32.8 | 15.5 | 23.9 | 10.8 |
| **Group3D (Ours)** | 51.1 | 27.4 | 29.1 | 13.9 |
| **Multi-view image-based (pose-free)** | | | | |
| Zoo3D0 (CVPR'26) | 24.2 | 8.8 | 13.3 | 4.1 |
| Zoo3D1 (CVPR'26) | 27.9 | 10.4 | 15.3 | 5.6 |
| **Group3D (Ours)** | 41.2 | 18.5 | 22.3 | 8.5 |

Zoo3D0 and Zoo3D1 denote the zero-shot and self-supervised variants of Zoo3D, respectively.

We compare Group3D with prior open-vocabulary 3D detection approaches under two regimes, pose-known and pose-free. We focus exclusively on the multi-view RGB setting and report comparisons to both point cloud-based open-vocabulary detectors and multi-view image-based pipelines.

| Method | ScanNet200 mAP25 | ScanNet200 mAP50 | ARKitScenes mAP25 | ARKitScenes mAP50 |
|---|---|---|---|---|
| **Posed images** | | | | |
| OpenM3D (ICCV'25) | 4.2 | – | – | – |
| Zoo3D0 (CVPR'26) | 14.3 | 6.2 | – | – |
| Zoo3D1 (CVPR'26) | 16.5 | 6.3 | – | – |
| **Group3D (Ours)** | 17.9 | 8.7 | 20.5 | 5.9 |
| **Unposed images** | | | | |
| Zoo3D0 (CVPR'26) | 8.3 | 2.9 | 13.0 | 2.6 |
| Zoo3D1 (CVPR'26) | 10.7 | 3.8 | 16.1 | 3.5 |
| **Group3D (Ours)** | 12.6 | 5.7 | 18.4 | 4.5 |


Group3D consistently outperforms prior methods on ScanNet200 (200-category) and ARKitScenes in both pose-known and pose-free settings, demonstrating strong generalization to larger open-vocabulary taxonomies and cross-domain 3D scenes.

Qualitative Results

[Interactive 3D visualizations: five example scenes, each shown as Ground Truth alongside Group3D's pose-known and pose-free predictions.]

BibTeX

citation.bib
@misc{kim2026group3dmllmdrivensemanticgrouping,
      title={Group3D: MLLM-Driven Semantic Grouping for Open-Vocabulary 3D Object Detection}, 
      author={Youbin Kim and Jinho Park and Hogun Park and Eunbyung Park},
      year={2026},
      eprint={2603.21944},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2603.21944}, 
}