Feature representation is of vital importance for human action recognition. In recent years, the application of deep learning to action recognition has become popular. However, for action recognition in videos, the advantage of a single convolutional feature over traditional methods is not evident. In this paper, a novel feature representation that combines spatial and temporal features with global motion information is proposed. Specifically, spatial and temporal features are extracted from RGB images by a convolutional neural network (CNN) followed by a long short-term memory (LSTM) network. In addition, global motion information is extracted from motion difference images by a separate CNN. Here, the motion difference images are obtained by applying an exclusive-or (XOR) operation to binarized video frames. Finally, a support vector machine (SVM) is adopted as the classifier. Experimental results on the YouTube Action and UCF-50 datasets demonstrate the superiority of the proposed method.
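
As a minimal sketch of how the motion difference images described above could be computed, the snippet below binarizes pairs of grayscale frames and XORs them so that changed (moving) pixels are highlighted. The binarization threshold and the temporal stride are illustrative assumptions, not values taken from the paper.

```python
import numpy as np


def motion_difference_images(frames, threshold=128, stride=1):
    """Compute motion difference images by XOR-ing binarized video frames.

    frames: sequence of grayscale frames, each a (H, W) uint8 array.
    threshold: assumed binarization threshold (not specified in the abstract).
    stride: assumed temporal gap between the two frames being compared.
    """
    diffs = []
    for t in range(len(frames) - stride):
        # Binarize both frames: 1 where the pixel exceeds the threshold, else 0.
        a = (frames[t] > threshold).astype(np.uint8)
        b = (frames[t + stride] > threshold).astype(np.uint8)
        # XOR keeps only pixels whose binary value changed between the two
        # frames, i.e. regions affected by motion; scale to 0/255 for viewing.
        diffs.append(np.bitwise_xor(a, b) * 255)
    return diffs


# Example usage on synthetic data: 10 random 64x64 grayscale frames.
frames = [np.random.randint(0, 256, (64, 64), dtype=np.uint8) for _ in range(10)]
motion_maps = motion_difference_images(frames)
print(len(motion_maps), motion_maps[0].shape)  # 9 difference images of shape (64, 64)
```

In this sketch the resulting difference images would then serve as input to the separate CNN that extracts global motion information, while the raw RGB frames feed the CNN-LSTM branch.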