Skeleton tracking for serious games and real-time medical diagnosis

Physical rehabilitation of people with reduced mobility requires monitoring the patients' movements during rehabilitation sessions, so that the therapy can be individualized for each patient. A serious-games company, NaturalPad (NP), would like to develop a cheap, real-time, markerless skeleton tracking device that assists the diagnosis of neuromuscular and articular pathologies among persons with reduced mobility, such as elderly people, post-stroke patients and persons affected by disability. The goal of this device is to precisely assess 3D body joint coordinates in real time; these coordinates will be used to compute accurate indicators of the patient's articular capacities during a physiotherapy session. These indicators, such as the Range of Motion (ROM) of each articulation, will be displayed on a Graphical User Interface (GUI), so that the physiotherapist can monitor the evolution of the patient's pathologies. After giving details about related studies, we will make explicit the technological requirements and project constraints. Lastly, we will define a benchmark process for existing skeleton tracking algorithms and cheap motion capture devices. The results will allow us to evaluate whether there is a sufficiently accurate camera/algorithm combination to address our needs.

How to cite this book chapter: Adjel, M., Seilles, A., Mottet, D. and Tallon, G. 2020. Skeleton tracking for serious games and real-time medical diagnosis. In: Loizides, F., Winckler, M., Chatterjee, U., Abdelnour-Nocera, J. and Parmaxi, A. (eds.) Human Computer Interaction and Emerging Technologies: Adjunct Proceedings from the INTERACT 2019 Workshops. Pp. 361–368. Cardiff: Cardiff University Press. DOI: https://doi.org/10.18573/book3.au. License: CC-BY 4.0.

Keywords
Real-time skeleton tracking · Medical diagnosis · Joint angles estimation

1 Related works and technological requirements

Device and algorithm requirements
The Kinect v2 and Unity3D are interesting tools for developing real-time interaction games for physical rehabilitation [1][2]. Currently, many movements are supported by NP's serious games platform, such as steps, chest inclination, hand movements and squatting series. The skeleton data are precise enough to improve functional autonomy among older adults [2][3]. Nevertheless, due to the imprecision of the Kinect, we are unable to correctly recognize head inclinations and ankle/chest rotations, or to approximate the center of mass. Moreover, the Kinect skeleton tracking algorithm does not take into account the osseous and articular constraints of the human body, so it is not precise enough for detailed analysis of articular angles [2]. Yet our device must respect skeletal constraints if it is to be used for joint angle assessment.
Thus, the device needs to fulfill three technological constraints: 1) handle RGB and/or depth data for real-time interaction in games; 2) extract 3D coordinates of the patient's articulations that are accurate enough to estimate articulation angles for diagnosis purposes; and 3) be sensor agnostic. The sensor/algorithm combination and the hardware configuration have to be as cheap as possible because of commercial constraints. The final device must be easy to set up, so that the physiotherapist does not have to reconfigure and recalibrate it between two sessions.
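As an illustration of the third constraint, the following is a minimal sketch of what a sensor-agnostic layer could look like. All names here (`SkeletonFrame`, `SkeletonSource`, `ReplaySource`, `count_frames_with`) are hypothetical and do not come from NP's platform or from any sensor SDK; the idea is only that game and diagnosis code depend on one small interface, while each sensor/algorithm pair is wrapped behind it.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator, Protocol, Tuple

Joint3D = Tuple[float, float, float]  # (x, y, z), assumed to be in meters


@dataclass(frozen=True)
class SkeletonFrame:
    """One skeleton sample: a timestamp and named 3D joint positions."""
    timestamp_s: float
    joints: Dict[str, Joint3D]


class SkeletonSource(Protocol):
    """Hypothetical interface that every sensor/algorithm wrapper would implement."""

    def frames(self) -> Iterator[SkeletonFrame]:
        ...


class ReplaySource:
    """Stand-in backend that replays pre-recorded frames (no real sensor involved)."""

    def __init__(self, recorded: Iterable[SkeletonFrame]) -> None:
        self._recorded = list(recorded)

    def frames(self) -> Iterator[SkeletonFrame]:
        yield from self._recorded


def count_frames_with(source: SkeletonSource, joint_name: str) -> int:
    """Count frames in which a given joint was tracked, whatever the backend is."""
    return sum(1 for frame in source.frames() if joint_name in frame.joints)
```

With such a layer, a wrapper around any of the candidate SDKs would only need to provide `frames()`, and the downstream code would not change when the sensor is swapped.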

Clinical uses of depth cameras
Many studies [5][6][7][8][9][10][11][12] have examined the Kinect for assessment and balance control. They indicate that, for relatively slow movements, the Kinect can provide sufficiently accurate skeleton data to perform dynamic tests such as functional reach, sit to stand and timed up and go. Encouraging results have also shown that the Kinect sensor can be useful for the medical diagnosis and monitoring of patients suffering from Parkinson's disease, scoliosis and post-stroke conditions [13][14][15]. However, as we want to estimate articular capacities, we are not sure that the Kinect is suitable, given the inaccuracy associated with some of the variables extracted by the sensor [5]. This is particularly true for turning movements, as the Kinect cannot accurately record postural movement while the patient performs the turn. Very recent studies [16][17][18][19] have already addressed medical diagnosis with depth sensors. In [16], a serious games platform is designed for home-based rehabilitation after the hospitalisation period, with automated evaluation of the patient during training. Clinical indicators are extracted, such as body areas neglected during the session or errors in limb trajectories. In [16][17][18], it is shown that depth sensors can be useful for post-stroke rehabilitation serious games and for the diagnosis of motor function diseases among the elderly. The study in [19] also demonstrates very encouraging results in clinical data assessment using Intel RealSense depth cameras.

Joint angles estimation
As a reminder, the device we want to develop should be able to precisely extract the 3D coordinates of skeleton joints in real time and accurately estimate joint angles during a physiotherapy session and, ideally, during a serious-game session. Several studies [20][21][22][23] used markerless motion capture systems to estimate joint angles and compared these values with ground truth to evaluate the accuracy of the depth sensor for this task. Studies [20][21][22] assessed the joint angle estimation accuracy of the Kinect for clinical uses. Marker-based motion capture systems were used as ground truth in [20][21], while [22] used an IMU device. [20] concluded that the Kinect system is not yet suitable for clinical assessments, while [21] concluded the opposite. This contrast is explained by methodological differences: [20] uses a VICON system as ground truth, yet studies [24][25] have shown that interference between the VICON and the Kinect can bias the joint coordinates assessed by the Kinect. [21] used a jig as the test subject instead of real humans, which can distort the results. [22] demonstrates that the Kinect is effective for knee joint angle estimation, which is not sufficient since we want accuracy on all body joints. As far as we know, there is no markerless device that aims to accurately estimate joint angles in real time in order to deduce the articular capacities of the patient in the context of medical diagnosis support.

Real-time human pose estimation
Several papers [26][27][28][29][30][31][43] have tackled the problem of real-time 3D human pose estimation. Even though [27] estimates only 2D joint coordinates, we will keep it for our benchmark for several reasons: 1) it works in real time on cheap hardware; 2) 3D coordinates can be deduced from 2D coordinates [32]; 3) it uses a monocular RGB camera, which is cheap; 4) we want to verify whether it is accurate enough for joint angle estimation. We have already eliminated [26][28][29][30][31] because of license requirements (cost or lack of documentation). The implementations remaining for the benchmark are [27], Nuitrack, Kinect SDK v2 and Orbbec SDK. First of all, we aim to determine the level of accuracy we can obtain with state-of-the-art real-time skeleton tracking algorithms and a single sensor. We will benchmark different algorithm/sensor combinations and select the best one, using the constraints listed in the device and algorithm requirements above as criteria. Then, physiotherapists will assess the clinical relevance of the best combination in selected use cases, for diagnosis and physical rehabilitation.

Benchmarking process
For this benchmark, we will test the skeleton tracking algorithms mentioned above and the following markerless motion capture devices: Microsoft Kinect v2, Orbbec Astra, Intel RealSense D435i and a regular webcam (for [27]). We will assess the pose estimation error of each sensor/algorithm combination. To calculate the estimation error of each device, we will use a state-of-the-art motion capture system (VICON, Oxford Metrics) as the reference system. We will record the movements with both systems and compare the 3D coordinates given by the VICON with the 3D coordinates given by the tested device/algorithm combination. As mentioned above, the infrared light emitted by the VICON system can interfere with markerless sensors. A protocol is defined in [24] to minimize this noise, so we will reduce the number of markers and the distance between the Kinect and the volunteer. Then, we will implement the following steps: 1) collect skeleton data from the VICON and the tested camera; 2) synchronize the data in time, as the VICON and the cameras have different sampling frequencies; 3) compute each joint angle using the law of cosines and calculate the angle assessment error of each camera/algorithm combination; 4) select the optimal combination. The last step will consist in comparing the joint angle assessment accuracy of the optimal solution with the accuracy needed by physiotherapists.
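Steps 2) and 3) can be sketched as follows. This is a minimal illustration under our own assumptions, not the benchmark implementation: joints are plain (x, y, z) tuples, synchronization is done by linear interpolation onto the reference timestamps, and the error metric is a root-mean-square error (RMSE), which the protocol above does not prescribe. The function names are hypothetical.

```python
import math
from bisect import bisect_right


def joint_angle_deg(a, b, c):
    """Angle at joint b (in degrees) between segments b-a and b-c, via the law of cosines."""
    ab = math.dist(a, b)
    bc = math.dist(b, c)
    ac = math.dist(a, c)
    if ab == 0.0 or bc == 0.0:
        raise ValueError("degenerate segment: two joints coincide")
    # Law of cosines: ac^2 = ab^2 + bc^2 - 2 * ab * bc * cos(angle at b)
    cos_b = (ab ** 2 + bc ** 2 - ac ** 2) / (2.0 * ab * bc)
    cos_b = max(-1.0, min(1.0, cos_b))  # guard against rounding slightly outside [-1, 1]
    return math.degrees(math.acos(cos_b))


def resample_linear(times, values, query_times):
    """Linearly interpolate a scalar series onto query_times.

    `times` must be strictly increasing; queries outside the range are clamped
    to the first or last value (an assumption made for this sketch).
    """
    out = []
    for t in query_times:
        if t <= times[0]:
            out.append(values[0])
        elif t >= times[-1]:
            out.append(values[-1])
        else:
            i = bisect_right(times, t)  # times[i - 1] <= t < times[i]
            w = (t - times[i - 1]) / (times[i] - times[i - 1])
            out.append(values[i - 1] + w * (values[i] - values[i - 1]))
    return out


def rmse(reference, estimate):
    """Root-mean-square error between two equally long angle series."""
    if not reference or len(reference) != len(estimate):
        raise ValueError("series must be non-empty and of equal length")
    squared = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    return math.sqrt(squared / len(reference))
```

In use, a joint angle series would be computed frame by frame for both systems with `joint_angle_deg` (for example hip, knee and ankle for the knee angle), the camera series would be resampled onto the VICON timestamps with `resample_linear`, and `rmse` would then give one error value per joint and per camera/algorithm combination.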

Further research
Currently, Convolutional Neural Networks (CNNs) are widely used for monocular 3D human pose estimation and show the most accurate results [46]. Nevertheless, neither top-down nor bottom-up approaches take human skeleton constraints into account. Our approach will consist in using the Denavit-Hartenberg (D-H) convention to build a geometric and kinematic model of the skeleton [45]. Coupled with a CNN algorithm, this should in theory give us an accurate and constrained skeleton. [43] shows that joint coordinate estimation can be enhanced by integrating skeleton constraints during the training process. Moreover, without going into algorithmic details, we can add human skeleton constraints in two other ways:
- Refine the identified human silhouettes with D-H before recognizing joints on the silhouette (works for top-down approaches)
- Refine the pose with D-H after joint coordinates assessment (works for both bottom-up and top-down approaches)
We think both of these refinement steps would enhance the accuracy and realism of the extracted skeleton coordinates. However, this will have an impact on computational speed, which we will have to keep reasonable. Depending on the results obtained, we will relax some cost constraints in order to run a heavier algorithm on more costly hardware.
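To make the D-H idea more concrete, here is a minimal sketch using the classic D-H homogeneous transform. It models a hypothetical planar two-segment limb (for instance upper arm and forearm) and clamps each joint angle to an assumed anatomical range before computing joint positions. The segment lengths, the joint limits and the function names are illustrative assumptions only; they are not taken from [45] and do not describe the refinement method we will eventually implement.

```python
import math


def dh_transform(theta, d, a, alpha):
    """Classic D-H transform: Rot_z(theta) * Trans_z(d) * Trans_x(a) * Rot_x(alpha)."""
    ct, st = math.cos(theta), math.sin(theta)
    ca, sa = math.cos(alpha), math.sin(alpha)
    return [
        [ct, -st * ca, st * sa, a * ct],
        [st, ct * ca, -ct * sa, a * st],
        [0.0, sa, ca, d],
        [0.0, 0.0, 0.0, 1.0],
    ]


def matmul4(left, right):
    """Product of two 4x4 matrices stored as nested lists."""
    return [
        [sum(left[i][k] * right[k][j] for k in range(4)) for j in range(4)]
        for i in range(4)
    ]


def clamp(value, low, high):
    """Restrict a joint angle to its allowed range."""
    return max(low, min(high, value))


def chain_positions(joint_angles, segment_lengths, joint_limits):
    """Positions of the distal end of each segment of a planar chain.

    Each joint angle (radians) is first clamped to its (low, high) limit, so the
    resulting pose always respects the assumed skeletal constraints.
    """
    transform = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
    positions = []
    for theta, length, (low, high) in zip(joint_angles, segment_lengths, joint_limits):
        theta = clamp(theta, low, high)
        transform = matmul4(transform, dh_transform(theta, 0.0, length, 0.0))
        positions.append((transform[0][3], transform[1][3], transform[2][3]))
    return positions
```

For example, an estimated elbow angle that falls outside its allowed range is pulled back to the nearest limit before the forearm position is recomputed, which is the kind of post-hoc pose refinement described in the second option above.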