Horizon3D: Sparse Radar-Camera Fusion for Long-Range 3D Perception

Abstract

Long-range 3D object detection is critical for safe autonomous driving at highway speeds, yet existing radar-camera fusion methods face notable limitations at extended ranges. BEV-based approaches effectively encode scene-level context but incur rapidly growing computational cost and struggle to preserve fine-grained object-level detail, while query-based methods provide efficient object-centric encoding but lack sufficient scene-level context. In this paper, we propose Horizon3D, a sparse radar-camera fusion framework for long-range 3D object detection that jointly captures object-level detail and scene-level context in both spatial and temporal dimensions through a hybrid representation that combines Gaussian primitives with sparse BEV features. Horizon3D first employs Keypoint-Guided Gaussian Initialization (KGGI) to initialize Gaussian primitives at object keypoints estimated from radar and camera features. Object-Centric Sparse Fusion (OCSF) then aggregates cross-modal features around these primitives and splats the refined Gaussians onto the BEV plane, where they are fused with sparse radar BEV features. Finally, Dual-Path Temporal Fusion (DPTF) aggregates temporal cues through a BEV path for multi-frame feature accumulation and a Gaussian path for propagating primitives across frames to encode per-object motion. Extensive evaluations on TruckScenes demonstrate that Horizon3D achieves state-of-the-art performance, outperforming the previous best method (SpaRC) by 3.0 NDS and 1.6 mAP on the validation split while maintaining competitive inference speed.

Comparison of radar-camera fusion paradigms. (a) Dense BEV fusion encodes scene-level context but incurs growing cost with range. (b) Sparse query-based fusion achieves efficient object-centric encoding but lacks scene-level context. (c) Our hybrid BEV fusion employs Gaussian primitives and sparse BEV features for object- and scene-level encoding, and uses dual-path temporal fusion.

Method

Overall architecture of Horizon3D. Multi-view camera images and 4D radar points are encoded by their respective backbones. KGGI initializes sparse Gaussian primitives at estimated object keypoints. OCSF aggregates cross-modal features and produces a sparse BEV representation. DPTF fuses temporal information through Gaussian and BEV paths.

KGGI

Keypoint-Guided Gaussian Initialization estimates object keypoints from both radar and camera features and initializes a compact set of Gaussian primitives at these locations, enabling efficient object-centric encoding even at extended detection ranges of up to 150 m.
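As a rough illustration of keypoint-guided initialization, the sketch below picks local maxima from a small BEV keypoint heatmap and places one 2D Gaussian primitive at each peak. The heatmap itself, the score threshold, the fixed initial scale, and the names Gaussian and init_gaussians are assumptions made for illustration; the description above does not specify how keypoints are scored or how the primitives are parameterized.

```python
from dataclasses import dataclass


@dataclass
class Gaussian:
    x: float      # BEV centre along the forward axis, metres
    y: float      # BEV centre along the lateral axis, metres
    sx: float     # standard deviation along x, metres
    sy: float     # standard deviation along y, metres
    score: float  # keypoint confidence in [0, 1]


def init_gaussians(heatmap, cell_size, origin, top_k=3, threshold=0.3, init_scale=1.0):
    """Place one Gaussian at each of the top_k local maxima of a keypoint heatmap.

    heatmap: list of rows; row index maps to x, column index maps to y.
    origin:  (x, y) position of the grid corner, metres.
    """
    rows, cols = len(heatmap), len(heatmap[0])
    peaks = []
    for i in range(rows):
        for j in range(cols):
            s = heatmap[i][j]
            if s < threshold:
                continue
            # Keep the cell only if it is a maximum of its 3x3 neighbourhood.
            neighbours = [heatmap[a][b]
                          for a in range(max(0, i - 1), min(rows, i + 2))
                          for b in range(max(0, j - 1), min(cols, j + 2))
                          if (a, b) != (i, j)]
            if all(s >= n for n in neighbours):
                peaks.append((s, i, j))
    peaks.sort(reverse=True)  # strongest keypoints first
    return [Gaussian(x=origin[0] + (i + 0.5) * cell_size,
                     y=origin[1] + (j + 0.5) * cell_size,
                     sx=init_scale, sy=init_scale, score=s)
            for s, i, j in peaks[:top_k]]


# Toy 4x4 heatmap with two clear peaks (0.9 and 0.6).
heatmap = [
    [0.0, 0.1, 0.0, 0.0],
    [0.1, 0.9, 0.2, 0.0],
    [0.0, 0.2, 0.1, 0.6],
    [0.0, 0.0, 0.2, 0.1],
]
gaussians = init_gaussians(heatmap, cell_size=2.0, origin=(0.0, -4.0))
```

The number of primitives follows the number of detected keypoints rather than the size of the perception range, which is the property that keeps an object-centric encoding compact at long range.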

OCSF

Object-Centric Sparse Fusion aggregates cross-modal features around Gaussian primitives via deformable cross-attention, iteratively refines their parameters, and splats them onto the BEV plane to produce a sparse representation fused with radar BEV features.
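The splatting step can be pictured with the following minimal sketch, which writes axis-aligned 2D Gaussians into a sparse BEV grid stored as a dictionary of touched cells. It is an illustration under stated assumptions, not the authors' implementation: the scalar feature per Gaussian, the 2-sigma footprint, the weight cutoff, and the name splat_to_sparse_bev are all hypothetical, and the deformable cross-attention and iterative refinement are omitted.

```python
import math


def splat_to_sparse_bev(gaussians, cell_size, radius_sigmas=2.0, min_weight=0.05):
    """Splat axis-aligned 2D Gaussians onto a sparse BEV grid.

    gaussians: list of (x, y, sx, sy, feature) tuples, positions in metres.
    Returns {(i, j): accumulated_feature} holding only the touched cells.
    """
    bev = {}
    for x, y, sx, sy, feat in gaussians:
        # Restrict the loop to the Gaussian's footprint instead of the full grid.
        i_min = math.floor((x - radius_sigmas * sx) / cell_size)
        i_max = math.floor((x + radius_sigmas * sx) / cell_size)
        j_min = math.floor((y - radius_sigmas * sy) / cell_size)
        j_max = math.floor((y + radius_sigmas * sy) / cell_size)
        for i in range(i_min, i_max + 1):
            for j in range(j_min, j_max + 1):
                cx = (i + 0.5) * cell_size  # cell centre, metres
                cy = (j + 0.5) * cell_size
                w = math.exp(-0.5 * (((cx - x) / sx) ** 2 + ((cy - y) / sy) ** 2))
                if w >= min_weight:
                    bev[(i, j)] = bev.get((i, j), 0.0) + w * feat
    return bev


# One object roughly 100 m ahead of the ego vehicle, 1 m grid cells.
bev = splat_to_sparse_bev([(100.5, 0.5, 1.0, 1.0, 2.0)], cell_size=1.0)
```

In this toy case only 21 cells around the object are populated, regardless of how far the grid extends, so the cost scales with the number of objects rather than with the detection range.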

DPTF

Dual-Path Temporal Fusion integrates temporal cues through two complementary paths: a BEV path that accumulates foreground features over multiple frames, and a Gaussian path that propagates primitives across frames with velocity-guided motion compensation.
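To make velocity-guided motion compensation concrete, the sketch below moves a Gaussian centre from the previous frame into the current ego frame. It assumes a planar motion model in which the predicted velocity is ground-relative and expressed in the previous ego frame, and in which ego motion is given as a translation plus a yaw change; the function name propagate_center and its parameters are hypothetical.

```python
import math


def propagate_center(x, y, vx, vy, dt, ego_dx, ego_dy, ego_dyaw):
    """Move a Gaussian centre from the previous ego frame into the current one.

    (x, y), (vx, vy): position (m) and ground-relative velocity (m/s),
                      both expressed in the previous ego frame.
    (ego_dx, ego_dy, ego_dyaw): ego translation (m) and yaw change (rad)
                      between the two frames, in the previous ego frame.
    """
    # Velocity compensation: advance the object along its predicted velocity.
    px = x + vx * dt
    py = y + vy * dt
    # Ego-motion compensation: shift to the new origin, then rotate by -dyaw.
    tx, ty = px - ego_dx, py - ego_dy
    c, s = math.cos(ego_dyaw), math.sin(ego_dyaw)
    return (c * tx + s * ty, -s * tx + c * ty)


# Object 100 m ahead driving at 20 m/s; ego drives straight at 25 m/s; dt = 0.5 s.
with_comp = propagate_center(100.0, 0.0, 20.0, 0.0, 0.5, 12.5, 0.0, 0.0)
without_comp = propagate_center(100.0, 0.0, 0.0, 0.0, 0.5, 12.5, 0.0, 0.0)
lag = with_comp[0] - without_comp[0]
```

With ego-motion compensation alone, the propagated primitive ends up 10 m behind the object's true position in this example, which is the kind of lag that velocity compensation is meant to remove.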

Object-Centric Sparse Fusion (OCSF). Refined Gaussian primitives are splatted to form a class-agnostic BEV mask supervised against the BEV mask GT, while sampled active Gaussians aggregate cross-modal features around each object to produce object-centric BEV features.

Dual-Path Temporal Fusion (DPTF). The Gaussian path propagates a velocity-matched history Gaussian set through a FIFO memory, while the BEV path accumulates filtered sparse BEV features in a memory bank — jointly encoding per-object motion and multi-frame scene context.
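A FIFO memory of the kind mentioned in the caption can be sketched with a bounded deque. The class name GaussianMemory, the window of three frames, and the string placeholders standing in for Gaussian primitives are assumptions for illustration; the velocity-based matching of history Gaussians to current objects is not shown.

```python
from collections import deque


class GaussianMemory:
    """Fixed-length FIFO of per-frame Gaussian sets (hypothetical helper)."""

    def __init__(self, max_frames=3):
        # A deque with maxlen evicts the oldest frame automatically on overflow.
        self.frames = deque(maxlen=max_frames)

    def push(self, gaussians):
        """Store the Gaussian set of the newest frame."""
        self.frames.append(list(gaussians))

    def history(self):
        """Return all stored Gaussians, oldest frame first."""
        return [g for frame in self.frames for g in frame]


memory = GaussianMemory(max_frames=3)
for t in range(5):
    memory.push([f"obj_a@t{t}", f"obj_b@t{t}"])
stored = memory.history()
```

After five pushes only frames 2 to 4 remain, so memory use stays bounded no matter how long the sequence runs.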

Demo

Object-Centric Sparse Fusion (OCSF)

Step through the OCSF pipeline: from detected objects to Gaussian primitives to BEV splatting.

Detected objects on the BEV plane (forward view). Cyan = GT, red dashed = enlarged GT box (BEV-mask target), orange = predictions.

■ GT · □ Enlarged GT · ■ Prediction · ● Keypoint / Gaussian · Glow = Splatted BEV feature

Dual-Path Temporal Fusion (DPTF)

Real inference on a TruckScenes highway scene (forward view). Top row, Gaussian path: primitives propagated from past frames are velocity-compensated (right); without velocity compensation (left) they lag behind moving objects. Bottom row, BEV path: both panels are ego-motion compensated, but without the velocity head (left) the BEV features of moving objects smear along their direction of motion, whereas with it (right) they stay sharp.

Front cameras (LEFT_FRONT + RIGHT_FRONT)

Gaussian Path — 2 tracked dynamic objects: why propagation needs velocity compensation

Before Vel. Comp.

After Vel. Comp.

●● tracked objects' Gaussians · ➜ Velocity vector · dashed line = drift from the object when velocity compensation is off

BEV Path — velocity compensation of accumulated BEV features

Without Velocity Comp.

With Velocity Comp.

■ GT · ■ Prediction · Heatmap = accumulated BEV features (both ego-motion compensated)

Main Results on TruckScenes

Method	Input	Backbone	Split	NDS	mAP	mATE	mASE	mAOE	mAVE	mAAE
CenterPoint-V	L	Voxel	val	35.3	22.6	0.461	0.405	0.468	3.028	0.261
RCTrans	C+R	R50	val	22.9	12.6	1.025	0.507	0.608	0.894	0.331
CRT-Fusion	C+R	R50	val	28.8	16.9	0.833	0.512	0.444	0.852	0.322
BEVFusion	C+R	R50	val	30.4	18.2	0.941	0.442	0.389	0.892	0.208
Horizon3D (Ours)	C+R	R50	val	37.4	23.6	0.876	0.410	0.331	0.625	0.198

Far3D	C	V2-99	val	21.4	10.7	0.883	0.507	0.671	1.352	0.338
SpaRC	C+R	V2-99	val	35.4	22.5	0.798	0.449	0.476	0.613	0.248
Horizon3D (Ours)	C+R	V2-99	val	38.4	24.1	0.833	0.404	0.355	0.561	0.208

CenterPoint-V	L	Voxel	test	41.0	26.7	0.409	0.352	0.277	2.730	0.201
HyDRa	C+R	V2-99	test	22.4	12.8	0.725	0.544	0.744	1.180	0.388
SpaRC	C+R	V2-99	test	37.4	27.2	0.759	0.413	0.411	0.814	0.227
Horizon3D (Ours)	C+R	V2-99	test	41.8	27.7	0.779	0.354	0.269	0.637	0.167

L = LiDAR, C = Camera, R = Radar. All metrics follow the official TruckScenes evaluation protocol.
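The NDS column can be related to the other columns with the short sketch below. It assumes that the TruckScenes protocol uses the nuScenes-style detection score, NDS = (1/10) * [5 * mAP + sum over the five error metrics of (1 - min(1, error))]; this assumption is consistent with the Horizon3D rows of the table, which it reproduces to within rounding. The function name nds is hypothetical.

```python
def nds(map_pct, mate, mase, maoe, mave, maae):
    """nuScenes-style detection score on a 0-100 scale (assumed formula).

    map_pct: mAP in percent, as listed in the table.
    The five error metrics are clipped at 1 and converted to scores.
    """
    tp_scores = [1.0 - min(1.0, err) for err in (mate, mase, maoe, mave, maae)]
    return 10.0 * (5.0 * map_pct / 100.0 + sum(tp_scores))


# Horizon3D rows from the table above.
r50_val = nds(23.6, 0.876, 0.410, 0.331, 0.625, 0.198)     # reported NDS: 37.4
v299_val = nds(24.1, 0.833, 0.404, 0.355, 0.561, 0.208)    # reported NDS: 38.4
v299_test = nds(27.7, 0.779, 0.354, 0.269, 0.637, 0.167)   # reported NDS: 41.8
```

Because each error is clipped at 1, a velocity error above 1 m/s (as in the CenterPoint-V rows) contributes nothing to the score, so the lower mAVE of the radar-based methods translates directly into NDS.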
