Hand pose recognition in advanced Human-Computer Interaction (HCI) systems is becoming more feasible thanks to affordable multi-modal RGB-Depth cameras. The depth data generated by these sensors provide valuable input information, although the design of 3D descriptors remains a critical step toward obtaining robust object representations. This paper presents an overview of different multi-modal descriptors and provides a comparative study of two feature descriptors, Multi-modal Hand Shape (MHS) and Fourier-based Hand Shape (FHS), which compute local and global 2D-3D hand shape statistics to robustly describe hand poses. A new dataset of 38K hand poses, covering five hand shape categories recorded from eight users, has been created for real-time hand pose and gesture recognition. Experimental results show good performance of the fused MHS and FHS descriptors, improving recognition accuracy while ensuring real-time computation in HCI scenarios.
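The abstract does not detail how the Fourier-based Hand Shape (FHS) descriptor is computed, but a generic Fourier shape descriptor of a closed hand contour illustrates the idea: the boundary is treated as a complex signal and a few normalized DFT magnitudes are kept as a compact, invariant shape signature. The function name, coefficient count, and normalization below are illustrative assumptions, not the paper's exact method.

```python
import numpy as np

def fourier_shape_descriptor(contour, n_coeffs=10):
    """Generic Fourier descriptor sketch (assumed form, not the paper's FHS).

    contour: (N, 2) array of (x, y) boundary points ordered along the contour.
    Returns n_coeffs coefficient magnitudes that are invariant to translation
    (DC term removed), scale (normalized by the first harmonic), and
    rotation/starting point (phase discarded).
    """
    # Treat the 2D boundary as a 1D complex signal z[k] = x[k] + i*y[k]
    z = contour[:, 0] + 1j * contour[:, 1]
    F = np.fft.fft(z)
    F[0] = 0.0                      # drop DC component -> translation invariance
    mags = np.abs(F)                # discard phase -> rotation invariance
    mags = mags / (mags[1] + 1e-12) # normalize -> scale invariance
    return mags[1:1 + n_coeffs]

# A circular contour concentrates all energy in the first harmonic,
# so its descriptor is ~[1, 0, 0, ...].
t = np.linspace(0.0, 2.0 * np.pi, 64, endpoint=False)
circle = np.stack([np.cos(t), np.sin(t)], axis=1)
desc = fourier_shape_descriptor(circle, n_coeffs=5)
```

In a recognition pipeline, such a descriptor vector would typically be concatenated or fused with depth-based statistics before classification.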