Group3D: MLLM-Driven Semantic Grouping
for Open-Vocabulary 3D Object Detection

Youbin Kim1, Jinho Park1, Hogun Park1, Eunbyung Park2

1Sungkyunkwan University 2Yonsei University

Paper · Code
- 🔤 **Open-vocabulary**: beyond a fixed category taxonomy
- 🖼️ **Multi-view RGB**: no point cloud or depth required
- 📷 **Pose-free**: works without camera poses
- 🎯 **Zero-shot**: no 3D supervision required

Abstract

Open-vocabulary 3D object detection aims to localize and recognize objects beyond a fixed training taxonomy. In multi-view RGB settings, recent approaches often decouple geometry-based instance construction from semantic labeling, generating class-agnostic fragments and assigning open-vocabulary categories post hoc. While flexible, such decoupling leaves instance construction governed primarily by geometric consistency, without semantic constraints during merging. When geometric evidence is view-dependent and incomplete, this geometry-only merging can lead to irreversible association errors, including over-merging of distinct objects or fragmentation of a single instance. We propose Group3D, a multi-view open-vocabulary 3D detection framework that integrates semantic constraints directly into the instance construction process. Group3D maintains a scene-adaptive vocabulary derived from a multimodal large language model (MLLM) and organizes it into semantic compatibility groups that encode plausible cross-view category equivalence. These groups act as merge-time constraints: 3D fragments are associated only when they satisfy both semantic compatibility and geometric consistency. This semantically gated merging mitigates geometry-driven over-merging while absorbing multi-view category variability. Group3D supports both pose-known and pose-free settings, relying only on RGB observations. Experiments on ScanNet and ARKitScenes demonstrate that Group3D achieves state-of-the-art performance in multi-view open-vocabulary 3D detection, while exhibiting strong generalization in zero-shot scenarios.

Method

Group3D pipeline
Given multi-view RGB images, an MLLM predicts object categories across views, which are aggregated into a Scene Vocabulary Memory. Category-aware masks are lifted into 3D to build a 3D Fragment Memory. The MLLM then organizes the vocabulary into semantic compatibility groups, which, together with geometric consistency, gate fragment merging into open-vocabulary 3D object instances. Finally, multi-view evidence is accumulated to determine each instance's open-vocabulary category and 3D bounding box.
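The merging step above can be sketched in code. This is a minimal illustration, not the authors' implementation: `Fragment`, `compatibility_groups`, and `geometric_overlap` are hypothetical names, point sets stand in for lifted 3D masks, and a set-overlap ratio stands in for a real 3D IoU test. A merge happens only when both the semantic gate (shared compatibility group) and the geometric gate (overlap above a threshold) pass.

```python
# Illustrative sketch of semantically gated fragment merging.
# All names and thresholds are assumptions, not Group3D's actual API.
from dataclasses import dataclass, field
from collections import Counter


@dataclass
class Fragment:
    points: list                                      # lifted 3D points (placeholder)
    votes: Counter = field(default_factory=Counter)   # per-view category votes


def group_id(category, compatibility_groups):
    """Map a category to its MLLM-derived semantic compatibility group."""
    for gid, members in enumerate(compatibility_groups):
        if category in members:
            return gid
    return None


def compatible(a, b, groups):
    """Semantic gate: fragments share at least one compatibility group."""
    ga = {group_id(c, groups) for c in a.votes}
    gb = {group_id(c, groups) for c in b.votes}
    return bool(ga & gb)


def geometric_overlap(a, b):
    """Geometric gate (stand-in for a real 3D IoU / point-overlap test)."""
    inter = len(set(a.points) & set(b.points))
    union = len(set(a.points) | set(b.points))
    return inter / union if union else 0.0


def merge_fragments(fragments, groups, tau=0.25):
    """Greedily fuse fragment pairs that pass BOTH gates."""
    frags = list(fragments)
    merged = True
    while merged:
        merged = False
        for i in range(len(frags)):
            for j in range(i + 1, len(frags)):
                a, b = frags[i], frags[j]
                if compatible(a, b, groups) and geometric_overlap(a, b) >= tau:
                    a.points = list(set(a.points) | set(b.points))
                    a.votes += b.votes   # accumulate multi-view evidence
                    frags.pop(j)
                    merged = True
                    break
            if merged:
                break
    return frags
```

A fragment's final open-vocabulary category can then be read off by majority vote over views, e.g. `frag.votes.most_common(1)[0][0]`; the semantic gate is what prevents geometrically overlapping but categorically incompatible fragments (say, a table fragment and a sofa fragment) from being fused.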

Quantitative Results

| Method | ScanNet20 mAP25 | ScanNet20 mAP50 | ScanNet60 mAP25 | ScanNet60 mAP50 |
|---|---|---|---|---|
| **Point cloud-based** | | | | |
| Det-PointCLIPv2 (ICCV'23) | 0.2 | – | – | – |
| 3D-CLIP (ICML'21) | 4.0 | – | – | – |
| OV-3DET (CVPR'23) | 18.0 | – | – | – |
| CoDa (NeurIPS'23) | 19.3 | 9.0 | – | – |
| INHA (ECCV'24) | 10.7 | – | – | – |
| GLIS (ECCV'24) | 20.8 | – | – | – |
| ImOV3D (NeurIPS'24) | 21.5 | – | – | – |
| OV-Uni3DETR (ECCV'24) | 25.3 | 19.4 | – | – |
| Zoo3D0 (CVPR'26) | 34.7 | 23.9 | 27.1 | 18.7 |
| Zoo3D1 (CVPR'26) | 37.2 | 26.3 | 32.0 | 20.8 |
| **Multi-view image-based (posed)** | | | | |
| OV-Uni3DETR (ECCV'24) | 11.2 | – | – | – |
| OpenM3D (ICCV'25) | 19.8 | 7.3 | – | – |
| Zoo3D0 (CVPR'26) | 30.5 | 17.3 | 22.0 | 10.4 |
| Zoo3D1 (CVPR'26) | 32.8 | 15.5 | 23.9 | 10.8 |
| **Group3D (Ours)** | 51.1 | 27.4 | 29.1 | 13.9 |
| **Multi-view image-based (pose-free)** | | | | |
| Zoo3D0 (CVPR'26) | 24.2 | 8.8 | 13.3 | 4.1 |
| Zoo3D1 (CVPR'26) | 27.9 | 10.4 | 15.3 | 5.6 |
| **Group3D (Ours)** | 41.2 | 18.5 | 22.3 | 8.5 |

Zoo3D0 and Zoo3D1 denote the zero-shot and self-supervised variants of Zoo3D, respectively.

We compare Group3D with prior open-vocabulary 3D detection approaches under two regimes, pose-known and pose-free. We focus exclusively on the multi-view RGB setting and report comparisons to both point cloud-based open-vocabulary detectors and multi-view image-based pipelines.

| Method | ScanNet200 mAP25 | ScanNet200 mAP50 | ARKitScenes mAP25 | ARKitScenes mAP50 |
|---|---|---|---|---|
| **Posed images** | | | | |
| OpenM3D (ICCV'25) | 4.2 | – | – | – |
| Zoo3D0 (CVPR'26) | 14.3 | 6.2 | – | – |
| Zoo3D1 (CVPR'26) | 16.5 | 6.3 | – | – |
| **Group3D (Ours)** | 17.9 | 8.7 | 20.5 | 5.9 |
| **Unposed images** | | | | |
| Zoo3D0 (CVPR'26) | 8.3 | 2.9 | 13.0 | 2.6 |
| Zoo3D1 (CVPR'26) | 10.7 | 3.8 | 16.1 | 3.5 |
| **Group3D (Ours)** | 12.6 | 5.7 | 18.4 | 4.5 |


Group3D consistently outperforms prior methods on ScanNet200 (200-category) and ARKitScenes in both pose-known and pose-free settings, demonstrating strong generalization to larger open-vocabulary taxonomies and cross-domain 3D scenes.

Qualitative Results

[Interactive 3D visualizations: five example scenes, each shown as Ground Truth alongside Group3D's pose-known and pose-free predictions.]

BibTeX

citation.bib
@misc{kim2026group3dmllmdrivensemanticgrouping,
      title={Group3D: MLLM-Driven Semantic Grouping for Open-Vocabulary 3D Object Detection}, 
      author={Youbin Kim and Jinho Park and Hogun Park and Eunbyung Park},
      year={2026},
      eprint={2603.21944},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2603.21944}, 
}