
Glasses-free 3D display with ultrawide viewing range using deep learning


SBP-utilization analysis

Owing to inherent SBP scarcity, existing 3D display approaches have been forced into static compromises, each emphasizing specific aspects at the expense of others in their display outcomes (see Supplementary Table 1 and Supplementary Information for more analysis and comparison details). Holographic displays, for instance, preserve complete 3D reconstruction by compressing the displayed light field to centimetre-scale regions (about 1–2 cm²), ensuring wide-angle, high-quality optical content but remaining practically unscalable30. By contrast, automultiscopic displays maintain common display sizes (about 0.1–0.2 m²) more suitable for natural viewing scenarios but must limit their effective viewing angles. Within this category, view-dense solutions use multilayer architectures to provide continuous and realistic optical generation at the cost of highly restricted viewing zones. Alternatively, view-segmented solutions achieve broad horizontal viewing angles using single-panel optics21,23,51,52 to discretely spread out the available SBP, sacrificing stereo parallax across the vertical and radial dimensions as well as focal parallax; this loss of full parallax inevitably compromises immersion and visual comfort37,40.

Fundamentally, the limited practicality of these existing approaches arises from their passive use of scarce SBP, attempting to statically accommodate various viewing scenarios simultaneously. These static approximations inherently conflict with the extreme scarcity of SBP itself, and this conflict remains unaltered even with AI enhancement (Supplementary Table 2). Recognizing this constraint, it becomes clear that a proactive, dynamic use of limited SBP is necessary, that is, deploying optical resources precisely where they are most needed at each moment. In practice, this means reconstructing accurate binocular light fields around the target eye positions, as binocular parallax is the essential basis for human depth perception. Notably, this dynamic model does not rely on eye tracking to synthesize virtual disparities, as conventional eye-tracked systems do; such systems respond only to instantaneous viewpoint positions, and their responses typically exhibit significant errors owing to tracking noise and random eye movements. Instead, the rational and effective solution requires the accurate and consistent generation of real physical light fields for both binocular viewpoints and their neighbourhoods, with eye tracking serving primarily to guide directional delivery rather than to generate virtual content that depends heavily on tracking precision. Although SBP, in principle, supports this localized generation, it remains challenging to precisely adapt the optical output to arbitrary and extensive views within the neighbourhood of the eyes. To address this, we develop a physically accurate binocular geometric model and a deep-learning-based mathematical model that enable real-time computation of light-field outputs. With these, EyeReal precisely adapts its optical output to arbitrary binocular positions within an extensive viewing range, validated by a light-field delivery setup featuring large-scale imaging, wide-angle viewing and full-parallax attributes. This dynamic SBP-utilization strategy thereby makes the long-desired glasses-free 3D display achievable.

Eye camera modelling and calibration

Given an ocular position in the light-field coordinate system, we use the pinhole camera model (Supplementary Fig. 3) to simulate the retinal imaging process of the light field. In general, we align the centre of the screen with the centre of the light field where the object is located, and by default the eye is directed towards the centre of the light field, which is the origin of the coordinate system. For standardization, we define the z-axis of the camera model to be opposite to the direction of sight. Moreover, to simulate normal viewing conditions, we stipulate that the x-axis of the camera is parallel to the ground on which the object is situated, consistent with the relative pose of the observer and the object in the real world. Consequently, the y-axis of the eye camera is the normal to the plane formed by the z- and x-axes.
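This axis construction amounts to a few cross products. Below is a minimal sketch in Python/NumPy under the assumption that the ground normal coincides with the +y direction of the light-field coordinate system (a convention we choose for illustration; it is not stated explicitly in the text), and with function and variable names that are ours, not the paper's.

```python
import numpy as np

def eye_camera_pose(eye_pos, up=np.array([0.0, 1.0, 0.0])):
    """Sketch of the eye-camera frame described in the text.

    eye_pos : (3,) eye position in the light-field coordinate system,
              whose origin is the centre of the light field.
    up      : assumed ground normal of the light-field frame (+y here;
              this is an assumption, not specified in the text).

    Returns R (3x3), whose columns are the camera x-, y- and z-axes
    expressed in light-field coordinates, and the camera centre.
    """
    eye_pos = np.asarray(eye_pos, dtype=float)

    # The sight direction points from the eye to the light-field origin;
    # the camera z-axis is defined to be opposite to it, i.e. from the
    # origin towards the eye.
    z_axis = eye_pos / np.linalg.norm(eye_pos)

    # x-axis parallel to the ground: orthogonal to both the ground normal
    # and the z-axis. (Degenerate if the eye lies exactly along `up`.)
    x_axis = np.cross(up, z_axis)
    x_axis /= np.linalg.norm(x_axis)

    # y-axis is the normal to the plane formed by the z- and x-axes,
    # completing a right-handed frame.
    y_axis = np.cross(z_axis, x_axis)

    R = np.stack([x_axis, y_axis, z_axis], axis=1)
    return R, eye_pos
```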

We begin with the relative ocular positions captured by the RGB-D camera. To transfer the eye positions into the light-field coordinate system, we first obtain their two-dimensional (2D) pixel coordinates using a lightweight face detector. Combining the camera's intrinsic parameters with the detected pixel-level depth information, we then obtain the 3D coordinates of the eyes in the camera coordinate system. For one eye, this process can be formulated as

$$\left[\begin{array}{c}{x}_{{\rm{c}}}\\ {y}_{{\rm{c}}}\\ {z}_{{\rm{c}}}\end{array}\right]={z}_{{\rm{c}}}{\left[\begin{array}{ccc}{f}_{x} & 0 & {c}_{x}\\ 0 & {f}_{y} & {c}_{y}\\ 0 & 0 & 1\end{array}\right]}^{-1}\left[\begin{array}{c}{u}_{{\rm{e}}}\\ {v}_{{\rm{e}}}\\ 1\end{array}\right]$$ (4)

where \({u}_{{\rm{e}}}\) and \({v}_{{\rm{e}}}\) are the pixel-wise positions of the eye; \(({c}_{x},{c}_{y})\) is the optical centre of the image, which represents the projection coordinates of the image-plane centre in the camera coordinate system; \({f}_{x}\) and \({f}_{y}\) are the focal lengths of the camera along the x-axis and y-axis directions; and \({x}_{{\rm{c}}}\), \({y}_{{\rm{c}}}\) and \({z}_{{\rm{c}}}\) represent the transformed camera coordinates.
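Equation (4) is a standard pinhole back-projection of the detected eye pixel using its measured depth. A minimal sketch of that step, assuming the RGB-D camera's intrinsic parameters are already known (the function name and interface are illustrative):

```python
import numpy as np

def backproject_eye(u_e, v_e, z_c, fx, fy, cx, cy):
    """Back-project a detected eye pixel (u_e, v_e) with depth z_c into
    the camera coordinate system, following equation (4)."""
    K = np.array([[fx, 0.0, cx],
                  [0.0, fy, cy],
                  [0.0, 0.0, 1.0]])
    pixel_h = np.array([u_e, v_e, 1.0])      # homogeneous pixel coordinates
    return z_c * np.linalg.inv(K) @ pixel_h  # -> [x_c, y_c, z_c]
```

Equivalently, in closed form, \({x}_{{\rm{c}}}={z}_{{\rm{c}}}({u}_{{\rm{e}}}-{c}_{x})/{f}_{x}\) and \({y}_{{\rm{c}}}={z}_{{\rm{c}}}({v}_{{\rm{e}}}-{c}_{y})/{f}_{y}\).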

Then comes the alignment from the real-world eye coordinates to the digital light-field world. Given the fixed spatial configuration between the camera and the display setup, this alignment reduces to estimating a projection matrix \({M}_{{\rm{c}}}=[{A}_{{\rm{c}}}\,|\,{t}_{{\rm{c}}}]\in {{\mathbb{R}}}^{3\times 4}\), which transforms coordinates from the camera to the light field. Exploiting the reversibility of light paths in an autostereoscopic setup, we design a simple and convenient calibration method (Extended Data Fig. 1). We select K calibration points in the light-field coordinate system that also lie within the visual field of the RGB-D camera. We replace the light-field images corresponding to the viewpoints with calibration marks (Supplementary Fig. 4) and provide them as input to the neural network to generate the corresponding layered patterns. Because the patterns form the best stereo effect only at the input viewpoint, when the viewer, looking with one eye from a certain angle, sees the completely overlapping rectangle on the screen of the hardware device (the superposed colour then also being at its deepest), the current 3D eye-camera coordinates \({c}_{i}\in {{\mathbb{R}}}^{3}\) captured by the camera and the world coordinates \({w}_{i}\in {{\mathbb{R}}}^{3}\) of the calibration points form a one-to-one correspondence. We solve for \({M}_{{\rm{c}}}\) using least-squares regression (Supplementary Fig. 5) based on the K pairs of corresponding calibration points

$${A}_{{\rm{c}}},{t}_{{\rm{c}}}=\mathop{\arg \min }\limits_{{A}_{{\rm{c}}}\in {{\mathbb{R}}}^{3\times 3},\,{t}_{{\rm{c}}}\in {{\mathbb{R}}}^{3}}\,\mathop{\sum }\limits_{i=1}^{K}{\parallel {A}_{{\rm{c}}}{c}_{i}^{{\rm{T}}}+{t}_{{\rm{c}}}-{w}_{i}^{{\rm{T}}}\parallel }_{2}^{2}$$ (5)
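The text specifies least-squares regression but not a particular solver; one common way to solve equation (5) in closed form is to stack the K correspondences into a homogeneous linear system and apply an ordinary least-squares solve. A minimal sketch under that assumption (names are illustrative):

```python
import numpy as np

def solve_affine_calibration(C, W):
    """Least-squares estimate of M_c = [A_c | t_c] from K point pairs,
    in the sense of equation (5).

    C : (K, 3) eye positions c_i in the RGB-D camera coordinate system
    W : (K, 3) matching calibration points w_i in the light-field
        coordinate system

    Returns A_c (3x3) and t_c (3,) such that w ~= A_c @ c + t_c.
    """
    C = np.asarray(C, dtype=float)
    W = np.asarray(W, dtype=float)

    # Homogeneous design matrix: each row is [c_x, c_y, c_z, 1].
    X = np.hstack([C, np.ones((C.shape[0], 1))])

    # Solve X @ M^T ~= W in the least-squares sense; M_c is 3x4.
    M_T, *_ = np.linalg.lstsq(X, W, rcond=None)
    M_c = M_T.T

    A_c, t_c = M_c[:, :3], M_c[:, 3]
    return A_c, t_c
```

Because the 12 unknowns of \({M}_{{\rm{c}}}\) receive three equations per correspondence, a unique solution needs at least four non-coplanar calibration points; additional points help average out detection and depth noise.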
