This paper proposes a Generative Adversarial Network (GAN) based architecture called Deep Future Gaze (DFG) for the task of gaze anticipation in egocentric videos. DFG takes a single frame as input, generates multiple future frames, and anticipates the gaze locations in those generated frames. Like other GANs, DFG consists of two networks: a Generator (GN) and a Discriminator (D). Here, GN is a two-stream 3D-CNN architecture that disentangles the foreground and background to generate the future frames, while D differentiates the synthetic frames generated by GN from real frames, thereby helping to improve GN. This enables DFG to outperform the other state-of-the-art techniques.
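The adversarial setup described above can be sketched numerically. The sketch below computes the standard (non-saturating) GAN losses from the discriminator's outputs; the function name and the exact loss form are illustrative assumptions, since DFG's full objective may include additional terms.

```python
import numpy as np

def gan_losses(d_real, d_fake, eps=1e-8):
    """Standard (non-saturating) GAN losses from discriminator outputs.

    d_real: D's probability that real frames are real.
    d_fake: D's probability that generated (synthetic) frames are real.
    Illustrative sketch only, not the paper's exact objective.
    """
    d_real = np.asarray(d_real, dtype=float)
    d_fake = np.asarray(d_fake, dtype=float)
    # Discriminator: push d_real toward 1 and d_fake toward 0.
    d_loss = -np.mean(np.log(d_real + eps) + np.log(1.0 - d_fake + eps))
    # Generator (non-saturating form): push d_fake toward 1.
    g_loss = -np.mean(np.log(d_fake + eps))
    return d_loss, g_loss

d_loss, g_loss = gan_losses([0.9], [0.1])
```

A confident discriminator (here 0.9 on real, 0.1 on fake) gets a low loss, while the generator's loss is high, which is the pressure that drives GN to produce more realistic future frames.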
This paper proposes a three-stream convolutional neural network architecture for the task of action recognition in first-person videos. The three streams are the spatial, temporal, and Ego streams. The Ego stream is itself a two-stream architecture consisting of a 2D and a 3D CNN; it takes hand masks, head motion, and saliency maps as input for generating the class scores. When combined with the spatial and temporal streams, the Ego stream achieves a 10% gain in action recognition accuracy.
This paper proposes a two-stream convolutional neural network architecture for the task of action recognition in a video. Of the two streams, the spatial stream takes individual frames from a video and learns the spatial information, whereas the temporal stream takes a stack of optical flow images and learns the temporal information. The predictions from both networks are combined to produce the final output. The authors analyze various inputs to the temporal network and use multitask learning to improve the performance of the architecture.
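The final combination step above is a late fusion of the two streams' per-class scores. The sketch below shows one common variant, averaging the softmax outputs; the weighting and the function names are illustrative assumptions, not the paper's exact fusion scheme.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def late_fusion(spatial_logits, temporal_logits, w=0.5):
    """Fuse per-class scores from the two streams by weighted averaging.

    Illustrative sketch: the weight w and averaging choice are assumptions.
    Returns the predicted class index and the fused probabilities.
    """
    p = (w * softmax(np.asarray(spatial_logits, dtype=float))
         + (1 - w) * softmax(np.asarray(temporal_logits, dtype=float)))
    return int(np.argmax(p)), p

# The temporal stream is confident about class 1, overriding the spatial stream.
cls, p = late_fusion([2.0, 1.0, 0.0], [0.0, 3.0, 1.0])
```

Averaging after the softmax (rather than averaging raw logits) keeps each stream's contribution on a comparable probability scale.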
This paper proposes an end-to-end neural network architecture capable of jointly estimating 3D hand and object poses while simultaneously predicting the object and action categories from an RGB image in a single pass. The authors propose a novel representation for jointly learning the 3D hand and object poses together with the object and action categories. They also propose an interaction RNN for modeling the interaction between the 3D hand and object along the temporal dimension.
This paper aims at improving action recognition accuracy in egocentric videos by using a two-stream Convolutional Neural Network (CNN) architecture. Here, one stream learns the appearance information, whereas the other learns the motion information. The proposed two-stream CNN is able to capture object attributes and hand-object configurations.
A linear programming problem can be defined as the task of maximizing or minimizing a linear function subject to some linear constraints. The constraints can be equalities or inequalities.
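A small worked instance makes the definition concrete. The sketch below maximizes a linear objective over linear inequality constraints by exploiting a basic LP fact: an optimum lies at a vertex of the feasible region, so for a toy 2-variable problem we can simply enumerate the intersections of constraint boundaries. The specific objective and constraints are made up for illustration.

```python
import itertools
import numpy as np

# Toy LP: maximize 3x + 2y  subject to  x + y <= 4,  x <= 2,  x >= 0,  y >= 0.
# All constraints are written in the form A @ v <= b.
A = np.array([[1.0, 1.0],    # x + y <= 4
              [1.0, 0.0],    # x <= 2
              [-1.0, 0.0],   # -x <= 0  (i.e., x >= 0)
              [0.0, -1.0]])  # -y <= 0  (i.e., y >= 0)
b = np.array([4.0, 2.0, 0.0, 0.0])
c = np.array([3.0, 2.0])     # objective coefficients

best_val, best_pt = -np.inf, None
for i, j in itertools.combinations(range(len(A)), 2):
    M = A[[i, j]]
    if abs(np.linalg.det(M)) < 1e-12:
        continue                        # parallel boundaries: no vertex
    v = np.linalg.solve(M, b[[i, j]])   # intersection of two boundaries
    if np.all(A @ v <= b + 1e-9):       # keep only feasible vertices
        val = c @ v
        if val > best_val:
            best_val, best_pt = val, v
```

For this instance the optimum is 10, attained at the vertex (2, 2). Real solvers (e.g., the simplex method) search the vertices far more efficiently than this brute-force enumeration.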
Interpreting the output probability matrix of a Convolutional Recurrent Neural Network (CRNN) is an essential step in obtaining the final output from the trained network. Various decoding techniques prove useful for this task; in this article, we'll discuss two of those methods.
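The simplest of these decoding techniques, best-path (greedy) decoding, can be sketched in a few lines: take the most likely class at every time step, collapse consecutive repeats, and drop blanks. The function name and the convention that the blank symbol has index 0 are assumptions here.

```python
import numpy as np

def ctc_greedy_decode(probs, blank=0):
    """Best-path CTC decoding of a (T, C) probability matrix.

    Takes the argmax class at each of the T time steps, collapses
    consecutive repeats, then removes blanks. Assumes blank index 0.
    """
    path = np.argmax(probs, axis=1)
    decoded, prev = [], None
    for k in path:
        if k != prev and k != blank:   # collapse repeats, skip blanks
            decoded.append(int(k))
        prev = k
    return decoded

# Argmax path over 6 time steps is [1, 1, 0, 1, 2, 2]; the blank at t=2
# separates the two 1s, so they are not collapsed into one.
probs = np.array([[0.1, 0.8, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.9, 0.05, 0.05],
                  [0.1, 0.8, 0.1],
                  [0.1, 0.1, 0.8],
                  [0.1, 0.1, 0.8]])
labels = ctc_greedy_decode(probs)  # [1, 1, 2]
```

Greedy decoding is fast but only considers one path per time step; beam search decoding, by contrast, sums probability over many alignments and usually yields better transcriptions.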
Connectionist Temporal Classification (CTC) is a type of neural network output layer and loss function helpful in tackling sequence problems like handwriting and speech recognition, where the timing varies. Using CTC ensures that one does not need a frame-aligned dataset, which makes the training process more straightforward.