
Outplaying elite table tennis players with an autonomous robot

Why This Matters

This article highlights a significant advance in autonomous robotics: a robot that competes with elite table tennis players using sophisticated perception and tracking systems. Such systems could inform sports training, entertainment and robotics more broadly, pushing the limits of machine learning and real-time processing. For consumers, it points to a future in which intelligent machines enhance recreational activities and provide personalized coaching.

Key Takeaways

Coordinate system

We use a right-handed coordinate system with its origin at the centre of the playing surface of the table; the x-axis points towards the human player's side of the table and the z-axis points upwards.
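The convention above can be sketched in a few lines of numpy. The table dimensions are the standard 2.74 m × 1.525 m, which the text does not state and are assumptions here:

```python
import numpy as np

# Assumed standard table dimensions (not given in the text).
TABLE_LENGTH = 2.74   # along x, metres
TABLE_WIDTH = 1.525   # along y, metres

x_axis = np.array([1.0, 0.0, 0.0])  # towards the human player
z_axis = np.array([0.0, 0.0, 1.0])  # upwards
y_axis = np.cross(z_axis, x_axis)   # completes the right-handed frame

# Right-handedness check: x cross y must give z.
assert np.allclose(np.cross(x_axis, y_axis), z_axis)

# End lines of the table in this frame:
human_edge_x = TABLE_LENGTH / 2     # +1.37 m
robot_edge_x = -TABLE_LENGTH / 2    # -1.37 m
```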

Perception

Ball triangulation

We use nine cameras, synchronized with the actuators of the robot by a 200 Hz trigger signal, to accurately locate the ball within the volume of the Olympic-sized court. At each trigger event, the cameras capture 1,440 × 1,080 pixel Bayer8 colour images. To reduce data transfer and improve scalability (more cameras can increase robustness and accuracy44), each camera is equipped with a hardware-accelerated field programmable gate array (FPGA) that performs two-dimensional (2D) ball detection. The FPGAs process the images through a segmentation pipeline to produce a compressed 2D detection mask, which is streamed to a central server through an embedded CPU. The server verifies the shape of the ball and triangulates its 3D position using pre-calibrated camera parameters. The entire process is completed within 10.2 ms.
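The text does not give the server's triangulation method beyond "pre-calibrated camera parameters"; a standard choice for fusing 2D detections from calibrated views is linear (DLT) least-squares triangulation. The sketch below is a minimal illustration of that technique, with a toy two-camera rig as an assumed example:

```python
import numpy as np

def triangulate(projections, pixels):
    """Linear least-squares (DLT) triangulation of one 3D point.

    projections: list of 3x4 camera projection matrices P_i
    pixels: list of (u, v) 2D ball-centre detections, one per camera
    """
    rows = []
    for P, (u, v) in zip(projections, pixels):
        # Each view contributes two linear constraints on the homogeneous X.
        rows.append(u * P[2] - P[0])
        rows.append(v * P[2] - P[1])
    A = np.stack(rows)
    # The solution is the right singular vector with smallest singular value.
    X = np.linalg.svd(A)[2][-1]
    return X[:3] / X[3]

# Toy rig: a reference camera and a second one translated 1 m along x.
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])
X_true = np.array([0.2, 0.1, 2.0])
pixels = []
for P in (P1, P2):
    x = P @ np.append(X_true, 1.0)
    pixels.append(x[:2] / x[2])
X_est = triangulate([P1, P2], pixels)
```

With more than two views (nine here), the same least-squares system simply gains rows, which is why extra cameras improve robustness and accuracy.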

Camera placement is optimized using a custom covariance matrix adaptation evolution strategy (CMA-ES) algorithm45. The optimizer determines the lens selection, mounting height and orientation for each camera, subject to constraints such as the number of towers, desired coverage volume and a minimum projected 2D ball radius (5 pixels).
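The paper does not spell out the objective the CMA-ES optimizer minimizes; the sketch below illustrates the kind of penalized coverage cost described (coverage volume plus the 5-pixel minimum projected ball radius). For self-containment, plain random search stands in for CMA-ES, and the focal length, tower positions and sample volume are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

BALL_RADIUS = 0.02          # metres (standard 40 mm ball)
MIN_PROJ_RADIUS_PX = 5.0    # constraint stated in the text
FOCAL_PX = 1200.0           # assumed pinhole focal length in pixels

# Sample points filling an assumed desired coverage volume around the table.
volume = rng.uniform([-2.0, -1.5, 0.0], [2.0, 1.5, 1.5], size=(500, 3))

def coverage_cost(cam_positions):
    """Fraction of sample points not seen with a large-enough ball image."""
    ok = np.zeros(len(volume), dtype=bool)
    for cam in cam_positions:
        depth = np.linalg.norm(volume - cam, axis=1)
        # Pinhole approximation: projected radius ~ f * r / depth.
        radius_px = FOCAL_PX * BALL_RADIUS / depth
        ok |= radius_px >= MIN_PROJ_RADIUS_PX
    return 1.0 - ok.mean()

# Random search standing in for CMA-ES over mounting heights on fixed towers.
towers = np.array([[-3, -2], [-3, 2], [3, -2], [3, 2]], dtype=float)
best_cost, best_heights = np.inf, None
for _ in range(200):
    heights = rng.uniform(1.0, 4.0, size=len(towers))
    cams = np.column_stack([towers, heights])
    cost = coverage_cost(cams)
    if cost < best_cost:
        best_cost, best_heights = cost, heights
```

A real CMA-ES run would adapt a full search distribution over lens choice, height and orientation jointly rather than sampling uniformly.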

Spin estimation

The angular velocity of the ball is estimated by observing the movement of the logo printed on the surface of the official ball. To accurately capture the logo as it moves and rotates at high speed, we develop a mirror-based event vision tracking system called the gaze control system (GCS). The GCS comprises three components: (1) an event camera4 for low-latency, low-motion-blur imaging; (2) a telephoto, electrically tunable lens to magnify the ball and keep it in focus; and (3) a set of rotatable mirrors to track the ball smoothly (Fig. 2d). Given the 3D triangulation results, the mirrors and lens are controlled to track and focus on the ball, with the system delay compensated by predicting the ball trajectory from a ball-aerodynamics model. While the ball is tracked, its contour on the event camera frame is first detected by a CNN46. The events on the ball are then processed by two spin estimators: a low-latency estimator based on another CNN33 and a higher-accuracy but slower estimator based on CMax34. The CNN estimates the angular velocities with heteroscedastic uncertainties from accumulated events and is trained on pseudo-ground-truth data obtained by CMax using heteroscedastic regression47.
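Heteroscedastic regression, as used to train the spin CNN, typically means minimizing a Gaussian negative log-likelihood in which the network predicts a per-sample variance alongside the estimate. A minimal numpy sketch of that loss, with all values illustrative:

```python
import numpy as np

def heteroscedastic_nll(pred_mean, pred_log_var, target):
    """Gaussian negative log-likelihood with a predicted per-sample variance.

    Samples where the network reports high variance are down-weighted in the
    squared-error term, while the log-variance term penalizes the network
    for inflating every uncertainty.
    """
    inv_var = np.exp(-pred_log_var)
    return float(np.mean(0.5 * inv_var * (target - pred_mean) ** 2
                         + 0.5 * pred_log_var))

# A badly wrong estimate costs less if the network admits its uncertainty.
loss_confident = heteroscedastic_nll(np.array([10.0]), np.array([0.0]),
                                     np.array([0.0]))
loss_uncertain = heteroscedastic_nll(np.array([10.0]), np.array([4.0]),
                                     np.array([0.0]))
```

This is what lets the agent trust or discount an estimate: the same network output carries both the angular velocity and a calibrated measure of how reliable it is.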

Events are aggregated into a polarity-separated surface of active events48 over a 15 ms accumulation window, within which timestamps are min-max normalized to the range [0, 1]. We use a centred 320 × 320 pixel hardware crop of the original 1,280 × 720 pixel frame.
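The surface-of-active-events representation can be sketched directly from that description: keep the trailing 15 ms of events, store the latest timestamp per pixel and polarity, and min-max normalize. The event layout (a dict of `t`, `x`, `y`, `p` arrays) is an assumption for illustration:

```python
import numpy as np

def surface_of_active_events(events, height=320, width=320, window=0.015):
    """Polarity-separated surface of active events (SAE).

    events: dict with arrays "t" (seconds), "x", "y", "p" (polarity, 0 or 1).
    Keeps events inside the trailing accumulation window, stores the latest
    timestamp per pixel and polarity, then min-max normalizes to [0, 1].
    """
    t = np.asarray(events["t"], dtype=float)
    x, y, p = (np.asarray(events[k]) for k in ("x", "y", "p"))
    keep = t >= t.max() - window
    t, x, y, p = t[keep], x[keep], y[keep], p[keep]
    order = np.argsort(t)                    # later events overwrite earlier
    sae = np.zeros((2, height, width))
    sae[p[order], y[order], x[order]] = t[order]
    active = sae > 0
    lo, hi = sae[active].min(), sae[active].max()
    sae[active] = (sae[active] - lo) / (hi - lo) if hi > lo else 1.0
    return sae

events = {"t": [0.001, 0.010, 0.020], "x": [5, 5, 6],
          "y": [7, 7, 8], "p": [1, 0, 1]}
sae = surface_of_active_events(events)   # first event falls outside the window
```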

The angular velocities estimated by the CNN are refined asynchronously by CMax. To achieve both low latency and high accuracy, the robot agent Ace uses the angular velocities obtained by the CNN at the beginning of the trajectory and switches to those obtained by CMax as soon as they become available with low uncertainty. Because the spin estimation uncertainty increases when the logo is not visible, we place three GCSs to track the ball from multiple perspectives, as shown in Fig. 2a, and combine the multi-view measurements according to their respective uncertainties.
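The text does not specify how the multi-view measurements are combined; the standard uncertainty-based fusion is an inverse-variance weighted average, sketched below. The per-view estimates and variances are made-up numbers for illustration:

```python
import numpy as np

def fuse_spin_estimates(omegas, variances):
    """Inverse-variance-weighted fusion of per-view spin estimates.

    omegas: (n_views, 3) angular-velocity vectors in rad/s
    variances: (n_views,) scalar uncertainty per view; a view that cannot
    see the logo reports a large variance and is effectively ignored.
    """
    omegas = np.asarray(omegas, dtype=float)
    inv_var = 1.0 / np.asarray(variances, dtype=float)
    weights = inv_var / inv_var.sum()
    fused = weights @ omegas
    fused_var = 1.0 / inv_var.sum()    # fused estimate beats any single view
    return fused, fused_var

# Two confident views agree; the third (logo hidden) barely contributes.
fused, fused_var = fuse_spin_estimates(
    [[100.0, 0.0, 5.0], [102.0, 0.0, 5.0], [0.0, 0.0, 0.0]],
    [1.0, 1.0, 1e6],
)
```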
