Long-range 3D object detection is critical for safe autonomous driving at highway speeds, yet existing radar-camera fusion methods face notable limitations at extended ranges. BEV-based approaches effectively encode scene-level context but incur rapidly growing computational cost and struggle to preserve fine-grained object-level detail, while query-based methods provide efficient object-centric encoding but lack sufficient scene-level context. In this paper, we propose Horizon3D, a sparse radar-camera fusion framework for long-range 3D object detection that jointly captures object-level detail and scene-level context in both spatial and temporal dimensions through a hybrid representation that combines Gaussian primitives with sparse BEV features. Horizon3D first employs Keypoint-Guided Gaussian Initialization (KGGI) to initialize Gaussian primitives at object keypoints estimated from radar and camera features. Object-Centric Sparse Fusion (OCSF) aggregates cross-modal features around these primitives and splats the refined Gaussians onto the BEV plane, where they are fused with sparse radar BEV features. Finally, Dual-Path Temporal Fusion (DPTF) aggregates temporal cues through a BEV path for multi-frame feature accumulation and a Gaussian path for propagating primitives across frames to encode per-object motion. Extensive evaluations on TruckScenes demonstrate that Horizon3D achieves state-of-the-art performance, outperforming the previous best method by +3.0 NDS and +1.6 mAP while maintaining competitive inference speed.
Overall architecture of Horizon3D. Multi-view camera images and 4D radar points are encoded by their respective backbones. KGGI initializes sparse Gaussian primitives at estimated object keypoints. OCSF aggregates cross-modal features and produces a sparse BEV representation. DPTF fuses temporal information through Gaussian and BEV paths.
Keypoint-Guided Gaussian Initialization estimates object keypoints from both radar and camera features and initializes a compact set of Gaussian primitives at these locations, enabling efficient object-centric encoding at detection ranges of up to 150 m.
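As a rough illustration of keypoint-guided initialization, the sketch below places one Gaussian at each of the top-k cells of a keypoint heatmap. It is not the paper's implementation: the heatmap input, the `Gaussian2D` fields, the `init_gaussians` name, and the fixed initial `init_sigma` are all hypothetical choices made for this example.

```python
from dataclasses import dataclass


@dataclass
class Gaussian2D:
    x: float      # BEV mean along the forward axis, in metres
    y: float      # BEV mean along the lateral axis, in metres
    sigma: float  # isotropic standard deviation, in metres (assumed)
    score: float  # keypoint confidence the primitive was created from


def init_gaussians(heatmap, cell_size, x_min, y_min, k, init_sigma=1.0):
    """Place one Gaussian at each of the k highest-scoring heatmap cells."""
    cells = [
        (score, row, col)
        for row, line in enumerate(heatmap)
        for col, score in enumerate(line)
    ]
    cells.sort(key=lambda c: c[0], reverse=True)

    gaussians = []
    for score, row, col in cells[:k]:
        # Convert the cell index to the metric position of the cell centre.
        x = x_min + (row + 0.5) * cell_size
        y = y_min + (col + 0.5) * cell_size
        gaussians.append(Gaussian2D(x, y, init_sigma, score))
    return gaussians


# Toy 2x2 heatmap with 2 m cells; keep the two strongest keypoints.
toy_heatmap = [[0.1, 0.9], [0.8, 0.2]]
primitives = init_gaussians(toy_heatmap, cell_size=2.0, x_min=0.0, y_min=-2.0, k=2)
```

Because the number of primitives is fixed by `k` rather than by the grid size, the cost of this step in the sketch does not grow with the detection range beyond the heatmap scan itself.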
Object-Centric Sparse Fusion aggregates cross-modal features around Gaussian primitives via deformable cross-attention, iteratively refines their parameters, and splats them onto the BEV plane to produce a sparse representation fused with radar BEV features.
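The splatting step can be pictured with a minimal pure-Python sketch. It is only an illustration under stated assumptions: isotropic Gaussians, a 3-sigma cutoff, a max-merge rule for overlapping primitives, and the name `splat_to_bev` are choices made here, not details taken from the paper.

```python
import math


def splat_to_bev(gaussians, grid_h, grid_w, cell_size, x_min, y_min, cutoff=3.0):
    """Splat isotropic 2D Gaussians onto a sparse BEV grid.

    gaussians: iterable of (x, y, sigma, weight) tuples in metric BEV coordinates.
    Returns a dict mapping (row, col) to a value, so untouched cells cost nothing.
    """
    bev = {}
    for x, y, sigma, weight in gaussians:
        radius = cutoff * sigma
        # Only visit cells whose index range overlaps the cutoff radius.
        r0 = max(0, int((x - radius - x_min) // cell_size))
        r1 = min(grid_h - 1, int((x + radius - x_min) // cell_size))
        c0 = max(0, int((y - radius - y_min) // cell_size))
        c1 = min(grid_w - 1, int((y + radius - y_min) // cell_size))
        for row in range(r0, r1 + 1):
            for col in range(c0, c1 + 1):
                cx = x_min + (row + 0.5) * cell_size
                cy = y_min + (col + 0.5) * cell_size
                d2 = (cx - x) ** 2 + (cy - y) ** 2
                if d2 > radius * radius:
                    continue
                value = weight * math.exp(-0.5 * d2 / (sigma * sigma))
                # Overlapping Gaussians keep the strongest response (assumed rule).
                key = (row, col)
                bev[key] = max(bev.get(key, 0.0), value)
    return bev


# One primitive on a 4x4 grid with 2 m cells.
sparse_bev = splat_to_bev([(1.0, 1.0, 1.0, 1.0)], 4, 4, 2.0, 0.0, 0.0)
```

Storing the result as a dictionary keyed by cell index keeps the representation sparse: only cells near a primitive are ever written.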
Dual-Path Temporal Fusion integrates temporal cues through two complementary paths: a BEV path that accumulates foreground features over multiple frames, and a Gaussian path that propagates primitives across frames with velocity-guided motion compensation.
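To make velocity-guided motion compensation concrete, here is a minimal 2D sketch of moving a past primitive into the current ego frame. It assumes a constant-velocity model and a planar rigid ego transform; the function name `propagate_gaussian` and its parameters are hypothetical and not taken from the paper.

```python
import math


def propagate_gaussian(x, y, vx, vy, dt, ego_dx, ego_dy, ego_yaw):
    """Move a past Gaussian mean into the current ego frame.

    (x, y) and (vx, vy) are the mean and ground velocity in the previous ego frame.
    (ego_dx, ego_dy, ego_yaw) is the current ego pose expressed in the previous frame.
    """
    # 1) Velocity compensation: advance the object under constant velocity.
    px = x + vx * dt
    py = y + vy * dt
    # 2) Ego-motion compensation: apply the inverse of the ego pose change.
    tx = px - ego_dx
    ty = py - ego_dy
    c, s = math.cos(ego_yaw), math.sin(ego_yaw)
    return (c * tx + s * ty, -s * tx + c * ty)


# Object 50 m ahead moving at 20 m/s; ego advances 10 m in 0.5 s.
with_velocity = propagate_gaussian(50.0, 0.0, 20.0, 0.0, 0.5, 10.0, 0.0, 0.0)
without_velocity = propagate_gaussian(50.0, 0.0, 0.0, 0.0, 0.5, 10.0, 0.0, 0.0)
print(with_velocity)     # (50.0, 0.0)
print(without_velocity)  # (40.0, 0.0)
```

In this toy case, dropping the velocity term leaves the propagated primitive 10 m behind the object, which is the kind of lag that ego-motion compensation alone cannot remove.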
Object-Centric Sparse Fusion (OCSF). Refined Gaussian primitives are splatted to form a class-agnostic BEV mask supervised against the ground-truth BEV mask, while sampled active Gaussians aggregate cross-modal features around each object to produce object-centric BEV features.
Dual-Path Temporal Fusion (DPTF). The Gaussian path propagates a velocity-matched history Gaussian set through a FIFO memory, while the BEV path accumulates filtered sparse BEV features in a memory bank. Together, the two paths encode per-object motion and multi-frame scene context.
OCSF pipeline walkthrough: from detected objects to Gaussian primitives to BEV splatting.
Detected objects on the BEV plane (forward view). Cyan = GT, red dashed = enlarged GT box (BEV-mask target), orange = predictions.
Real inference on a TruckScenes highway scene (forward view). Top row, Gaussian path: propagated past primitives are velocity-compensated (right); without this compensation (left) they lag behind moving objects. Bottom row, BEV path: both panels are ego-motion compensated, but without the velocity head (left) the BEV features of moving objects smear along their direction of motion, whereas with it (right) they stay sharp.
| Method | Input | Backbone | Split | NDS | mAP | mATE | mASE | mAOE | mAVE | mAAE |
|---|---|---|---|---|---|---|---|---|---|---|
| CenterPoint-V | L | Voxel | val | 35.3 | 22.6 | 0.461 | 0.405 | 0.468 | 3.028 | 0.261 |
| RCTrans | C+R | R50 | val | 22.9 | 12.6 | 1.025 | 0.507 | 0.608 | 0.894 | 0.331 |
| CRT-Fusion | C+R | R50 | val | 28.8 | 16.9 | 0.833 | 0.512 | 0.444 | 0.852 | 0.322 |
| BEVFusion | C+R | R50 | val | 30.4 | 18.2 | 0.941 | 0.442 | 0.389 | 0.892 | 0.208 |
| Horizon3D (Ours) | C+R | R50 | val | 37.4 | 23.6 | 0.876 | 0.410 | 0.331 | 0.625 | 0.198 |
| Far3D | C | V2-99 | val | 21.4 | 10.7 | 0.883 | 0.507 | 0.671 | 1.352 | 0.338 |
| SpaRC | C+R | V2-99 | val | 35.4 | 22.5 | 0.798 | 0.449 | 0.476 | 0.613 | 0.248 |
| Horizon3D (Ours) | C+R | V2-99 | val | 38.4 | 24.1 | 0.833 | 0.404 | 0.355 | 0.561 | 0.208 |
| CenterPoint-V | L | Voxel | test | 41.0 | 26.7 | 0.409 | 0.352 | 0.277 | 2.730 | 0.201 |
| HyDRa | C+R | V2-99 | test | 22.4 | 12.8 | 0.725 | 0.544 | 0.744 | 1.180 | 0.388 |
| SpaRC | C+R | V2-99 | test | 37.4 | 27.2 | 0.759 | 0.413 | 0.411 | 0.814 | 0.227 |
| Horizon3D (Ours) | C+R | V2-99 | test | 41.8 | 27.7 | 0.779 | 0.354 | 0.269 | 0.637 | 0.167 |
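L = LiDAR, C = Camera, R = Radar. NDS and mAP are percentages (higher is better); mATE, mASE, mAOE, mAVE, and mAAE are error metrics (lower is better). All metrics follow the official TruckScenes evaluation protocol.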
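The NDS column can be cross-checked from the other columns. The sketch below assumes the nuScenes-style definition of NDS (five parts mAP plus the five true-positive errors, each clipped to 1 and inverted); the rows in the table above are consistent with that definition to within rounding, but the authoritative formula is the one in the official evaluation protocol. The helper name `nds` is hypothetical.

```python
def nds(map_pct, mate, mase, maoe, mave, maae):
    """nuScenes-style detection score in percent (assumed definition)."""
    tp_errors = [mate, mase, maoe, mave, maae]
    # Each error is clipped to 1, so it contributes a score between 0 and 1.
    tp_scores = [1.0 - min(1.0, err) for err in tp_errors]
    return 10.0 * (5.0 * map_pct / 100.0 + sum(tp_scores))


# Horizon3D, V2-99 backbone, val split (values copied from the table).
print(round(nds(24.1, 0.833, 0.404, 0.355, 0.561, 0.208), 1))  # 38.4
```

Under this definition, the clipping explains why the LiDAR baseline scores well on NDS despite an mAVE above 1: the velocity term contributes nothing, but it cannot go negative.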
BEV detections (left) alongside the four surround-view cameras (right) across diverse weather and lighting conditions. On the BEV, GT boxes (green dashed) are shown vs predictions (orange solid); cameras show predictions only. Faster objects have darker fill.
BibTeX will be available once the paper is released.