ECCV 2026

Horizon3D: Sparse Radar-Camera Fusion for Long-Range 3D Perception in Autonomous Driving

Geonho Bang, Geunju Baek, Dongyoung Lee, Wonjun Jeong, Jun Won Choi

Seoul National University

Abstract

Long-range 3D object detection is critical for safe autonomous driving at highway speeds, yet existing radar-camera fusion methods face notable limitations at extended ranges. BEV-based approaches effectively encode scene-level context but incur rapidly growing computational cost and struggle to preserve fine-grained object-level detail, while query-based methods provide efficient object-centric encoding but lack sufficient scene-level context. In this paper, we propose Horizon3D, a sparse radar-camera fusion framework for long-range 3D object detection that jointly captures object-level detail and scene-level context in both spatial and temporal dimensions through a hybrid representation that combines Gaussian primitives with sparse BEV features. Horizon3D first employs Keypoint-Guided Gaussian Initialization (KGGI) to initialize Gaussian primitives at object keypoints estimated from radar and camera features. Object-Centric Sparse Fusion (OCSF) aggregates cross-modal features around these primitives and splats the refined Gaussians onto the BEV plane, where they are fused with sparse radar BEV features. Finally, Dual-Path Temporal Fusion (DPTF) aggregates temporal cues through a BEV path for multi-frame feature accumulation and a Gaussian path for propagating primitives across frames to encode per-object motion. Extensive evaluations on TruckScenes demonstrate that Horizon3D achieves state-of-the-art performance, outperforming the previous best method by +3.0 NDS and +1.6 mAP while maintaining competitive inference speed.

Comparison of radar-camera fusion paradigms. (a) Dense BEV fusion encodes scene-level context but incurs growing cost with range. (b) Sparse query-based fusion achieves efficient object-centric encoding but lacks scene-level context. (c) Our hybrid BEV fusion employs Gaussian primitives and sparse BEV features for object- and scene-level encoding, and uses dual-path temporal fusion.

Method

Overall architecture of Horizon3D. Multi-view camera images and 4D radar points are encoded by their respective backbones. KGGI initializes sparse Gaussian primitives at estimated object keypoints. OCSF aggregates cross-modal features and produces a sparse BEV representation. DPTF fuses temporal information through Gaussian and BEV paths.

KGGI

Keypoint-Guided Gaussian Initialization estimates object keypoints from both radar and camera features and initializes a compact set of Gaussian primitives at these locations, enabling efficient object-centric encoding at detection ranges of up to 150 m.
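To make the initialization step concrete, here is a minimal Python sketch. It assumes the keypoints arrive as a single fused BEV heatmap and that one isotropic Gaussian is placed on each of the strongest local maxima. The names `Gaussian2D` and `init_gaussians`, the heatmap layout (rows = forward, columns = lateral), the 3x3 peak test, and all default values are illustrative assumptions, not details taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Gaussian2D:
    x: float      # BEV position along the forward axis, metres
    y: float      # BEV position along the lateral axis, metres
    sigma: float  # isotropic standard deviation, metres (assumed default)
    score: float  # keypoint confidence copied from the heatmap


def init_gaussians(heatmap, cell_size, origin, top_k=3, threshold=0.3, sigma=1.0):
    """Place one Gaussian on each of the top_k local maxima of a keypoint heatmap.

    heatmap: list of rows of floats (hypothetical fused radar-camera keypoint scores).
    origin:  metric (x, y) of the corner of cell (0, 0).
    """
    rows, cols = len(heatmap), len(heatmap[0])
    peaks = []
    for r in range(rows):
        for c in range(cols):
            v = heatmap[r][c]
            if v < threshold:
                continue
            # 3x3 local-maximum test; ties between neighbours are both kept.
            neighbours = [
                heatmap[rr][cc]
                for rr in range(max(0, r - 1), min(rows, r + 2))
                for cc in range(max(0, c - 1), min(cols, c + 2))
                if (rr, cc) != (r, c)
            ]
            if all(v >= n for n in neighbours):
                peaks.append((v, r, c))
    peaks.sort(reverse=True)  # strongest keypoints first
    ox, oy = origin
    return [
        Gaussian2D(
            x=ox + (r + 0.5) * cell_size,  # cell centre in metres
            y=oy + (c + 0.5) * cell_size,
            sigma=sigma,
            score=v,
        )
        for v, r, c in peaks[:top_k]
    ]


heatmap = [
    [0.0, 0.1, 0.0, 0.0],
    [0.1, 0.9, 0.1, 0.0],
    [0.0, 0.1, 0.0, 0.6],
    [0.0, 0.0, 0.0, 0.0],
]
gaussians = init_gaussians(heatmap, cell_size=10.0, origin=(0.0, -20.0))
```

In this sketch the number of primitives depends on the number of detected keypoints rather than on the size of the grid, which is the property that keeps an object-centric encoding compact as the range grows.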

OCSF

Object-Centric Sparse Fusion aggregates cross-modal features around Gaussian primitives via deformable cross-attention, iteratively refines their parameters, and splats them onto the BEV plane to produce a sparse representation fused with radar BEV features.
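The splatting step can be illustrated with a short Python sketch. It covers only the projection of already refined primitives onto a sparse BEV grid; the deformable cross-attention and iterative refinement are not modelled. It assumes isotropic 2D Gaussians, a 3-sigma cutoff, and a dictionary keyed by cell index as the sparse BEV container. The function name `splat_to_sparse_bev` and its parameters are hypothetical.

```python
import math


def splat_to_sparse_bev(gaussians, cell_size=1.0, radius_sigmas=3.0):
    """Splat Gaussians onto a sparse BEV grid.

    gaussians: list of (x, y, sigma, feature), with feature a list of floats.
    Returns {(ix, iy): feature list} containing only the cells a Gaussian reaches.
    """
    bev = {}
    for x, y, sigma, feat in gaussians:
        reach = radius_sigmas * sigma
        ix_lo = math.floor((x - reach) / cell_size)
        ix_hi = math.floor((x + reach) / cell_size)
        iy_lo = math.floor((y - reach) / cell_size)
        iy_hi = math.floor((y + reach) / cell_size)
        for ix in range(ix_lo, ix_hi + 1):
            for iy in range(iy_lo, iy_hi + 1):
                # Distance from the Gaussian mean to the cell centre.
                cx = (ix + 0.5) * cell_size
                cy = (iy + 0.5) * cell_size
                d2 = (cx - x) ** 2 + (cy - y) ** 2
                if d2 > reach ** 2:
                    continue  # outside the cutoff: the cell stays empty
                w = math.exp(-0.5 * d2 / sigma ** 2)
                cell = bev.setdefault((ix, iy), [0.0] * len(feat))
                for k, f in enumerate(feat):
                    cell[k] += w * f  # overlapping Gaussians accumulate
    return bev


# One primitive about 100 m ahead with a 1-channel feature.
bev = splat_to_sparse_bev([(100.5, 0.5, 1.0, [2.0])])
```

Only cells within the cutoff radius are ever allocated, so the memory footprint follows the number of primitives instead of the area of the BEV plane.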

DPTF

Dual-Path Temporal Fusion integrates temporal cues through two complementary paths: a BEV path that accumulates foreground features over multiple frames, and a Gaussian path that propagates primitives across frames with velocity-guided motion compensation.
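As a rough illustration of velocity-guided motion compensation on the Gaussian path, the Python sketch below moves a primitive's centre by its estimated velocity and then re-expresses it in the current ego frame. The constant-velocity model, the planar (x, y, yaw) ego motion, and the function name `propagate_center` are assumptions made for this example, not the paper's formulation.

```python
import math


def propagate_center(xy, velocity, dt, ego_shift, ego_yaw):
    """Propagate a Gaussian centre from the previous frame to the current one.

    xy, velocity: position (m) and velocity (m/s) in the previous ego frame.
    ego_shift:    current ego position expressed in the previous ego frame (m).
    ego_yaw:      ego heading change between the frames (rad, counter-clockwise).
    """
    # 1) Velocity-guided step: constant-velocity motion of the object itself.
    px = xy[0] + velocity[0] * dt
    py = xy[1] + velocity[1] * dt
    # 2) Ego-motion step: express the point in the current ego frame.
    dx, dy = px - ego_shift[0], py - ego_shift[1]
    c, s = math.cos(ego_yaw), math.sin(ego_yaw)
    return (c * dx + s * dy, -s * dx + c * dy)


# Lead vehicle 100 m ahead at 20 m/s; ego advances 10 m in 0.5 s.
with_vel = propagate_center((100.0, 0.0), (20.0, 0.0), 0.5, (10.0, 0.0), 0.0)
without_vel = propagate_center((100.0, 0.0), (0.0, 0.0), 0.5, (10.0, 0.0), 0.0)
```

With these numbers the compensated centre stays at 100 m, while the uncompensated one lands at 90 m, that is, 10 m behind the moving object.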

Object-Centric Sparse Fusion (OCSF). Refined Gaussian primitives are splatted to form a class-agnostic BEV mask supervised against the BEV mask GT, while sampled active Gaussians aggregate cross-modal features around each object to produce object-centric BEV features.

Dual-Path Temporal Fusion (DPTF). The Gaussian path propagates a velocity-matched history Gaussian set through a FIFO memory, while the BEV path accumulates filtered sparse BEV features in a memory bank. Together, the two paths encode per-object motion and multi-frame scene context.
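A fixed-length FIFO memory of this kind can be sketched in Python with `collections.deque`. The example assumes that each frame is a sparse BEV map stored as a dictionary, that "filtered" means keeping cells above a foreground score threshold, and that accumulation is a per-cell sum over frames already aligned to the current ego frame. The class name `SparseBevMemory` and its parameters are hypothetical.

```python
from collections import deque


class SparseBevMemory:
    """Fixed-length FIFO of sparse BEV maps ({(ix, iy): score})."""

    def __init__(self, max_frames=4, min_score=0.5):
        # deque(maxlen=...) drops the oldest frame automatically when full.
        self.frames = deque(maxlen=max_frames)
        self.min_score = min_score

    def push(self, bev):
        """Store a frame, keeping only cells that pass the foreground filter."""
        self.frames.append(
            {cell: s for cell, s in bev.items() if s >= self.min_score}
        )

    def accumulate(self):
        """Sum per-cell scores over all stored frames (assumed already aligned)."""
        out = {}
        for frame in self.frames:
            for cell, s in frame.items():
                out[cell] = out.get(cell, 0.0) + s
        return out


memory = SparseBevMemory(max_frames=2)
memory.push({(0, 0): 0.9, (1, 0): 0.2})  # (1, 0) is filtered out as background
memory.push({(0, 0): 0.8})
memory.push({(2, 0): 0.7})               # evicts the first frame
accumulated = memory.accumulate()
```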

Demo

Object-Centric Sparse Fusion (OCSF)

The interactive demo steps through the OCSF pipeline: from detected objects to Gaussian primitives to BEV splatting.

Detected objects on the BEV plane (forward view). Legend: cyan = GT, red dashed = enlarged GT box (BEV-mask target), orange = predictions. Also shown: keypoints / Gaussians; glow = splatted BEV feature.

Dual-Path Temporal Fusion (DPTF)

Real inference on a TruckScenes highway scene (forward view). Top row (Gaussian path): primitives propagated from past frames lag behind moving objects without velocity compensation (left) and stay aligned with them once it is applied (right). Bottom row (BEV path): both panels are ego-motion compensated, but without the velocity head (left) the BEV features of moving objects smear along their direction of motion; with it (right) they stay sharp.

Gaussian path, 2 tracked dynamic objects: why propagation needs velocity compensation. Front cameras (LEFT_FRONT + RIGHT_FRONT); panels: before vs. after velocity compensation. Legend: tracked objects' Gaussians · velocity vector · dashed line = drift from the object when velocity compensation is off.

BEV path: velocity compensation of accumulated BEV features. Front cameras (LEFT_FRONT + RIGHT_FRONT); panels: without vs. with velocity compensation. Legend: GT · prediction · heatmap = accumulated BEV features (both panels ego-motion compensated).

Main Results on TruckScenes

Method            Input  Backbone  Split  NDS   mAP   mATE   mASE   mAOE   mAVE   mAAE
CenterPoint-V     L      Voxel     val    35.3  22.6  0.461  0.405  0.468  3.028  0.261
RCTrans           C+R    R50       val    22.9  12.6  1.025  0.507  0.608  0.894  0.331
CRT-Fusion        C+R    R50       val    28.8  16.9  0.833  0.512  0.444  0.852  0.322
BEVFusion         C+R    R50       val    30.4  18.2  0.941  0.442  0.389  0.892  0.208
Horizon3D (Ours)  C+R    R50       val    37.4  23.6  0.876  0.410  0.331  0.625  0.198
Far3D             C      V2-99     val    21.4  10.7  0.883  0.507  0.671  1.352  0.338
SpaRC             C+R    V2-99     val    35.4  22.5  0.798  0.449  0.476  0.613  0.248
Horizon3D (Ours)  C+R    V2-99     val    38.4  24.1  0.833  0.404  0.355  0.561  0.208
CenterPoint-V     L      Voxel     test   41.0  26.7  0.409  0.352  0.277  2.730  0.201
HyDRa             C+R    V2-99     test   22.4  12.8  0.725  0.544  0.744  1.180  0.388
SpaRC             C+R    V2-99     test   37.4  27.2  0.759  0.413  0.411  0.814  0.227
Horizon3D (Ours)  C+R    V2-99     test   41.8  27.7  0.779  0.354  0.269  0.637  0.167

L = LiDAR, C = Camera, R = Radar. All metrics follow the official TruckScenes evaluation protocol.
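For readers who want to relate the columns to each other, the following Python sketch computes NDS from mAP and the five true-positive error metrics. It assumes that TruckScenes uses the nuScenes-style definition, NDS = (5 · mAP + Σ (1 − min(1, mTP))) / 10; this assumption reproduces the Horizon3D rows of the table above. The function name `nds` is illustrative.

```python
def nds(map_score, tp_errors):
    """nuScenes-style detection score.

    map_score: mAP in [0, 1].
    tp_errors: (mATE, mASE, mAOE, mAVE, mAAE); each is clipped at 1 before scoring.
    """
    tp_scores = [1.0 - min(1.0, e) for e in tp_errors]
    return (5.0 * map_score + sum(tp_scores)) / 10.0


# Horizon3D, V2-99 backbone, val and test splits (values from the table).
val_nds = round(100 * nds(0.241, (0.833, 0.404, 0.355, 0.561, 0.208)), 1)
test_nds = round(100 * nds(0.277, (0.779, 0.354, 0.269, 0.637, 0.167)), 1)
print(val_nds, test_nds)  # 38.4 41.8
```

Because each error is clipped at 1, a velocity error above 1 m/s, as in the LiDAR baseline rows, contributes nothing to NDS but is not penalized further.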

Qualitative Results

BEV detections (left) alongside the four surround-view cameras (right) across diverse weather and lighting conditions. On the BEV, GT boxes (green dashed) are shown against predictions (orange solid); the camera views show predictions only. Faster objects have a darker fill.

Citation

BibTeX will be available once the paper is released.