Siddhant’s Scratch Book

DBpedia GSoC 2022 (Week 10-11): Website with UPLOAD functionality

2022-08-12T06:30:00+00:00

This article summarises my progress over weeks ten and eleven of the GSoC coding period.

During these weeks, I further updated the UI of the web interface. Instead of taking an image index as input, the webpage now has an upload button, so the user can upload any image to the portal. The uploaded image is processed to generate its embedding, which is then used to query the pre-saved dataset.

Here is the link to the code for creating the webpage: GitHub.

DBpedia GSoC 2022 (Week 12-13): Winding down the code, documentation, and API

2022-08-12T06:30:00+00:00

This article summarises my progress over weeks twelve and thirteen of the GSoC coding period.

Using Django, I created an API that anyone can call from their own webpage to use the pipeline developed during the GSoC coding period. I also updated the documentation of the code to make it easier to follow and use.

The following link contains all the code written during the GSoC period along with the API and documentation: GitHub.

DBpedia GSoC 2022 (Week 9): Website-based demo for the framework + Mid-eval

2022-07-29T06:30:00+00:00

This article summarises my progress over week nine of the GSoC coding period.

Implementing the complete framework

This week I was able to finish all the parts of the framework and create a functional pipeline for the process. The steps involved:

Get an image from the user.
Pass the image through a pre-trained ResNet-50 to generate the embeddings.
Load the ResNet-50 embeddings of images in the dataset created earlier.
Compute similarity between the query image and images in the dataset.
Create a ranked list of dataset images in decreasing order of similarity scores.
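The last two steps above can be sketched in a few lines of plain Python. This is only an illustration: the post does not say which similarity measure is used, so cosine similarity is an assumption here, and the file names and three-dimensional vectors are toy stand-ins for real ResNet-50 embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero embedding as dissimilar to everything
    return dot / (norm_a * norm_b)


def rank_dataset(query_embedding, dataset_embeddings):
    """Return (image_id, score) pairs in decreasing order of similarity."""
    scored = [
        (image_id, cosine_similarity(query_embedding, embedding))
        for image_id, embedding in dataset_embeddings.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy data (hypothetical): real embeddings would come from ResNet-50.
dataset = {
    "bird.jpg": [1.0, 0.0, 0.0],
    "lizard.jpg": [0.0, 1.0, 0.0],
    "rifle.jpg": [0.7, 0.7, 0.0],
}
query = [0.9, 0.1, 0.0]
ranking = rank_dataset(query, dataset)
```

In this toy example the query is closest to bird.jpg, so that image comes first in the ranked list.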

Demo of the framework

Based on the pipeline described above, I created a webpage for the mid-term evaluation. It takes the index of an image as input, uses that image as the query, and searches over the rest of the images in the dataset. Here is the link to the code: GitHub.

Here is an image from the webpage:

DBpedia GSoC 2022 (Week 8): Creating the dataset

2022-07-22T06:30:00+00:00

This article summarises my progress over week eight of the GSoC coding period.

Using SPARQL to query the DBpedia Knowledge Graph

To create a proof of concept of the system envisioned the previous week, we decided to build a small dataset of images from various categories and treat it as part of the knowledge graph. Given a query image from the user, we can then search this dataset.

Before creating the dataset, I explored ways to query DBpedia. One of them is SPARQL. I used DBpedia’s SPARQL interface to familiarise myself with the query language. Link to the interface: link.

Later, I wrote code that uses Python and SPARQL to query the DBpedia knowledge graph. Here is the link to the code: GitHub.

The dataset

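Using queries of the following form (shown here for the Weapons class)

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dbo: <http://dbpedia.org/ontology/>

SELECT ?e ?image WHERE {
  ?e rdf:type dbo:Weapon .
  ?e dbo:thumbnail ?image .
}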

I was able to create a dataset with the following classes and stats:

Class	No. of Images
Birds	3883
Historic Places	7987
Politicians	1539
Reptiles	2337
Weapons	1954

In addition to the images, I also saved a mapping between the image’s path and the article’s corresponding URI. This will help in retrieving articles faster.
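As a rough sketch of how such a query could be issued from Python and turned into the path-to-URI mapping, the snippet below builds a request URL for the public DBpedia endpoint and parses a response in the standard SPARQL JSON results format. The helper names, the local file-naming scheme, and the sample response are hypothetical; the actual code linked above may be organised differently.

```python
import json
from urllib.parse import urlencode

ENDPOINT = "https://dbpedia.org/sparql"  # public DBpedia SPARQL endpoint


def build_query(dbo_class):
    """SPARQL query for entities of one DBpedia ontology class and their thumbnails."""
    return (
        "PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>\n"
        "PREFIX dbo: <http://dbpedia.org/ontology/>\n"
        "SELECT ?e ?image WHERE {\n"
        f"  ?e rdf:type dbo:{dbo_class} .\n"
        "  ?e dbo:thumbnail ?image .\n"
        "}"
    )


def build_request_url(dbo_class):
    """URL that asks the endpoint for JSON-formatted results."""
    params = {
        "query": build_query(dbo_class),
        "format": "application/sparql-results+json",
    }
    return ENDPOINT + "?" + urlencode(params)


def parse_bindings(response_text, dbo_class):
    """Map a local image path to the article URI and thumbnail URL."""
    data = json.loads(response_text)
    mapping = {}
    for i, row in enumerate(data["results"]["bindings"]):
        local_path = f"{dbo_class}/{i:05d}.jpg"  # hypothetical naming scheme
        mapping[local_path] = {
            "uri": row["e"]["value"],
            "thumbnail": row["image"]["value"],
        }
    return mapping


# Made-up response in the SPARQL JSON results layout, used instead of a live call.
sample_response = json.dumps({
    "results": {"bindings": [{
        "e": {"type": "uri", "value": "http://dbpedia.org/resource/Example_Entity"},
        "image": {"type": "uri", "value": "http://example.org/thumb.jpg"},
    }]}
})
mapping = parse_bindings(sample_response, "Weapon")
url = build_request_url("Weapon")
```

Fetching the URL (for example with urllib.request.urlopen) and passing the response body to parse_bindings would then give one mapping per class.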

DBpedia GSoC 2022 (Week 7): Using the embeddings to query the Knowledge Graph

2022-07-15T06:30:00+00:00

This article summarises my progress over week seven of the GSoC coding period.

How to use the embeddings generated for querying the DBpedia Knowledge Graph

Figure 1. Once we have the image-based KG, we aim to use it for image query search in DBpedia. As seen in the figure, the user provides a query image. It is first converted to an embedding using ResNet-50. Then a similarity score between the embedding and each node of the KG is computed. Using the similarity scores, a ranked list of images (and the corresponding articles) is created.

The code related to the above figure is available here: GitHub.

DBpedia GSoC 2022 (Week 3-4): Project Summary and Begin Coding

2022-07-08T06:30:00+00:00

This article summarises my progress over the two weeks after the community bonding period ended.

I was working on fine-tuning the problem statement and began coding various parts of the proposed framework.

Problem Statement in Detail

How does DBpedia currently work?

Currently, DBpedia uses text as an input to search through the entities present in the Knowledge Graph.

What is the issue with this approach?

Imagine a situation where you only have a visual depiction of an object, such as a photograph. How can we query DBpedia in such a case? With only an image of something, we cannot currently query the DBpedia Knowledge Graph.

What do we want to achieve?

Given an image, we want to search for articles related to it.

Figure 1. A lot of us might not know the name of this animal (it is sometimes called a "walking fish", but it is actually a salamander). How do we use DBpedia to find its name? Current methods to query the DBpedia KG cannot take this image as an input and return its name. To overcome this limitation, in this project, we propose to create a KG using the images from DBpedia articles that will complement DBpedia’s existing KG and improve its functionality. By the way, this animal is the axolotl.

How to use the images?

Figure 2. We want to generate a KG consisting of information from the images in DBpedia articles. We represent each node as an image embedding and connect the nodes using the images’ semantic similarity or existing DB-KG links.
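To make the idea in Figure 2 concrete, here is a minimal sketch of a graph whose nodes are image embeddings and whose edges connect semantically similar images. Cosine similarity, the 0.8 threshold, and the toy two-dimensional embeddings are all assumptions made for illustration; links taken from the existing DB-KG are not modelled here.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def build_similarity_graph(embeddings, threshold=0.8):
    """Adjacency sets: two image nodes are linked when their similarity >= threshold."""
    graph = {node: set() for node in embeddings}
    for u, v in combinations(embeddings, 2):
        if cosine_similarity(embeddings[u], embeddings[v]) >= threshold:
            graph[u].add(v)
            graph[v].add(u)
    return graph


# Toy embeddings (hypothetical); real nodes would hold ResNet-50 embeddings.
embeddings = {
    "img_a": [1.0, 0.0],
    "img_b": [0.9, 0.1],
    "img_c": [0.0, 1.0],
}
graph = build_similarity_graph(embeddings)
```

Here img_a and img_b end up connected, while img_c stays isolated because it is not similar enough to either of them.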

Code written for this purpose

I wrote the modules required to generate an image embedding for a given image. Link: GitHub

DBpedia GSoC 2022 (Week 5-6): Visiting CVPR

2022-07-08T06:30:00+00:00

This article summarises my progress over weeks five and six of the GSoC coding period.

During these weeks, with permission from my mentors, I visited New Orleans to attend the Computer Vision and Pattern Recognition (CVPR) conference. There, along with the other authors, I presented our work, Ego4D: Around the World in 3,000 Hours of Egocentric Videos.

DBpedia GSoC 2022 (Week 1-2): Community Bonding

2022-06-01T06:30:00+00:00

I am glad to share that my application has been selected for GSoC 2022! I will be contributing to the DBpedia Association. Under the mentorship of Edgard Marx, Ashutosh Kumar, and Nausheen Fatma, I will be working on bridging the gap between computer vision and knowledge graphs!

Problem Statement

Currently, users can query DBpedia using text. Although text as an input is an efficient approach to query the graph, there are cases where we do not know what we are seeing. How does one search the knowledge graph (KG) in such cases? Imagine being able to query the DBpedia Knowledge Graph (DB-KG) using images! The idea here is to create a framework that can combine existing computer vision techniques with knowledge graphs. Doing this will enable us to query the existing knowledge graphs using multiple modalities: images and text. To this end, in this proposal, we examine and explore two aspects of DB-KG: (a) A framework to create an image-based KG out of existing DBpedia entries; (b) Using the graph created to perform tasks like image querying, text + image search, and using relevant input images to add more images to existing articles.

Stay tuned for further updates on the project!

Deep Future Gaze: Gaze Anticipation on Egocentric Videos Using Adversarial Networks

2020-09-01T23:30:00+00:00

This paper proposes a Generative Adversarial Network (GAN) based architecture called Deep Future Gaze (DFG) for the task of gaze anticipation in egocentric videos. DFG takes in a single frame, generates multiple future frames, and anticipates the gaze locations in those generated frames. As with other GANs, DFG consists of two networks: a Generator (GN) and a Discriminator (D). Here, GN is a two-stream architecture (using 3D-CNNs) that attempts to untangle the foreground and background to generate the future frames, whereas D distinguishes the synthetic frames generated by GN from real frames, thereby helping to improve GN. This enables DFG to outperform prior state-of-the-art techniques.

Architecture Overview

The following figure gives an overview of DFG. It consists of two networks: Generator Network (GN) and the Discriminator Network (D). GN is further divided into: Future Frame Generation Module (G) and Temporal Saliency Prediction Module (GP).

Source

The Generator Network

To differentiate between foreground motion (hands and objects) and background motion (complex head motion), the authors propose a two-stream architecture for GN. Each stream consists of a 3D Convolutional Neural Network (CNN).

The input frame is first passed through a 2D-CNN to generate a latent representation. As shown in the figure above, this representation is then provided as input to both streams of the two-stream architecture, which consists of a foreground generation model and a background generation model. Both streams generate N future frames. In addition to the foreground frames, the foreground model also generates a spatio-temporal mask with pixel values in the range [0, 1], where 1 indicates foreground and 0 indicates background. The background generation model has its own independent 3D-CNN, which, as the name suggests, generates the background frames. To preserve the spatial and temporal information, the authors add up-sampling layers after the convolution layers.
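This summary does not spell out how the foreground frames, background frames, and mask are merged into a final frame. A common choice in two-stream video generators is a per-pixel blend, mask * foreground + (1 - mask) * background, and the sketch below assumes that rule. The function name and the tiny one-row "frames" are purely illustrative.

```python
def compose_frame(foreground, background, mask):
    """Blend two frames per pixel: mask * foreground + (1 - mask) * background.

    All three arguments are 2-D lists of floats with the same shape, and
    mask values lie in [0, 1] (1 = foreground, 0 = background).
    """
    return [
        [m * f + (1.0 - m) * b for f, b, m in zip(f_row, b_row, m_row)]
        for f_row, b_row, m_row in zip(foreground, background, mask)
    ]


# Toy single-row frames (hypothetical values).
foreground = [[1.0, 1.0]]
background = [[0.0, 0.0]]
mask = [[1.0, 0.25]]
frame = compose_frame(foreground, background, mask)
```

With this rule, a mask value of 1 copies the foreground pixel and a value of 0 copies the background pixel; intermediate values mix the two.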

To look real, the synthetic frames have to satisfy two criteria:

Coherent semantics across the frames (e.g. no table surface is inside the refrigerator);
Consistent motion across time (e.g. hand motions should be smooth across the frames).

The authors use GP, which takes the foreground, background, and mask frames as input, to anticipate the gaze location (shown as a red dot in the figure above).

The Discriminator Network

D aims to distinguish the synthetic examples from the real ones. It follows the same architecture as G; however, the up-sampling layers in G are replaced by convolutional layers. The output of D is a binary label indicating whether the input is real or not.

Training Details

Following the concept of a Generative Adversarial Network (GAN), the authors make G and D play against each other. The task of G is to generate future frames that can fool D, while the task of D is to identify the real frames. The two networks are trained alternately. The objective function of D is based on binary cross-entropy loss. G, however, has to satisfy two requirements: (a) its output should look real enough to fool D; (b) its first generated frame should be visually consistent with the input frame. For that, a combination of binary cross-entropy loss and L1 loss is used (mean square error loss results in over-smoothing of the first generated frame). GP is trained in a supervised manner using Kullback-Leibler divergence (KLD) loss.
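The loss terms named above can be written down compactly. The sketch below works on plain Python numbers and lists rather than tensors, and it is not the authors' implementation: the function names, the l1_weight parameter, and the way the terms are summed are assumptions made for illustration.

```python
import math


def binary_cross_entropy(p, target, eps=1e-12):
    """BCE for one predicted probability p and a 0/1 target."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))


def l1_loss(a, b):
    """Mean absolute difference between two flattened frames."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def kl_divergence(target, predicted, eps=1e-12):
    """KL(target || predicted) for two discrete distributions (e.g. gaze maps)."""
    return sum(
        t * math.log((t + eps) / (q + eps))
        for t, q in zip(target, predicted)
        if t > 0
    )


def discriminator_loss(d_real, d_fake):
    """D should output 1 for real frames and 0 for synthetic ones."""
    return binary_cross_entropy(d_real, 1.0) + binary_cross_entropy(d_fake, 0.0)


def generator_loss(d_fake, first_frame, input_frame, l1_weight=1.0):
    """G tries to fool D and keep its first frame close to the input frame.

    l1_weight is a hypothetical balancing factor, not a value from the paper.
    """
    adversarial = binary_cross_entropy(d_fake, 1.0)
    return adversarial + l1_weight * l1_loss(first_frame, input_frame)
```

For example, generator_loss(0.5, [0.0, 1.0], [0.0, 1.0]) reduces to the adversarial term alone, because the first generated frame matches the input frame exactly.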

Conclusions

The authors test DFG on the GTEA, GTEAPlus, and OST datasets. For analysis, the authors use the Area Under the Curve (AUC) and Average Angular Error (AAE) metrics.

Some of the important observations are:

Source

The images in the figure above show that DFG was able to untangle foreground and background motions. In the foreground, both hands and objects are highlighted, whereas the background stays uniform throughout, and the mask highlights the point of highest activation;
Without using egocentric cues such as hands and objects of interest, DFG works better than state-of-the-art methods, which, in many cases, use egocentric cues;
Using a two-stream architecture for learning foreground and background information improves the gaze anticipation accuracy;
GP trained only on real frames does not perform well;
Gaze movement on individual frames is dependent on their previous states;
DFG is successful in learning egocentric cues in the spatial domain and motion dynamics in the temporal domain.

References

Deep Future Gaze: Gaze Anticipation on Egocentric Videos Using Adversarial Networks. [Paper]
GTEA. [Link]
OST. [Link]

First Person Action Recognition Using Deep Learned Descriptors

2020-08-19T23:30:00+00:00

This paper proposes a three-stream convolutional neural network architecture for the task of action recognition in first-person videos. The three streams are the spatial, temporal, and Ego streams. The Ego stream is itself a two-stream architecture consisting of a 2D and a 3D CNN; it takes in the hand mask, head motion, and saliency map to generate the class scores. The Ego stream, when combined with the spatial and temporal streams, achieves a 10% gain in action recognition accuracy.

Ego ConvNet

The following diagram shows the two-stream Ego ConvNet architecture for learning features specific to egocentric videos.

Source

The authors use three different input modalities for the Ego ConvNet:

Hand Mask;
Head Motion;
Saliency map.

Hand Mask

To generate the hand mask, the authors model local appearance and global illumination. Local textures under different illumination conditions are captured using 48 Gabor filters. To learn the global features, the authors perform k-means clustering on the histograms of the images in HSV color space. Then, for each cluster, the authors train a random tree regressor. The figure below shows some of the generated masks.

Source
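For readers unfamiliar with Gabor filters, the sketch below builds a bank of 48 real-valued Gabor kernels using the standard formula (a Gaussian envelope multiplied by a cosine carrier). The split into 8 orientations and 6 wavelengths, the kernel size, and the other parameter values are assumptions chosen only so that the bank has 48 filters; the summary does not state the parameters actually used.

```python
import math


def gabor_kernel(size, wavelength, theta, sigma, gamma=0.5, psi=0.0):
    """Real part of a Gabor filter as a size x size list of lists (size must be odd)."""
    half = size // 2
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            # Rotate the coordinates by the filter orientation theta.
            xr = x * math.cos(theta) + y * math.sin(theta)
            yr = -x * math.sin(theta) + y * math.cos(theta)
            envelope = math.exp(-(xr * xr + (gamma * yr) ** 2) / (2.0 * sigma * sigma))
            carrier = math.cos(2.0 * math.pi * xr / wavelength + psi)
            row.append(envelope * carrier)
        kernel.append(row)
    return kernel


def gabor_bank(size=9, orientations=8, wavelengths=(3.0, 4.0, 6.0, 8.0, 12.0, 16.0)):
    """Bank of orientations * len(wavelengths) kernels (8 * 6 = 48 by default)."""
    bank = []
    for k in range(orientations):
        theta = k * math.pi / orientations
        for wavelength in wavelengths:
            # Tying sigma to the wavelength is a common heuristic, assumed here.
            bank.append(gabor_kernel(size, wavelength, theta, sigma=0.5 * wavelength))
    return bank


bank = gabor_bank()
```

Convolving an image with each kernel in the bank gives 48 texture responses per pixel, which can then serve as local appearance features.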

Head Motion

The authors attempt to create a framework that relies on only one sensor: the camera. To avoid using an eye tracker for the gaze information, the authors assume that, if the camera wearer is looking straight ahead, the head motion approximates the gaze. This motivates them to capture the head motion using a 2D homography transformation of the image.

Saliency Map

If the camera is static, one can easily find the object being handled by the camera wearer using optical flow. In egocentric videos, however, the camera is not static. To compensate for that, and to avoid using additional sensors, the authors use a 2D homography to cancel the head motion in the image. Doing this highlights the dominant motion in the scene, which in most cases is the object being manipulated by the hands.
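The idea of cancelling head motion with a homography can be illustrated as follows: the homography predicts where a pixel would move if only the camera had moved, and subtracting that displacement from the observed optical flow leaves the motion of independently moving objects. The function names and toy numbers below are hypothetical, and estimating the homography itself is outside the scope of this sketch.

```python
def apply_homography(H, point):
    """Map a 2-D point through a 3x3 homography given as nested lists."""
    x, y = point
    xh = H[0][0] * x + H[0][1] * y + H[0][2]
    yh = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return (xh / w, yh / w)


def residual_flow(H, point, observed_flow):
    """Observed optical flow minus the displacement explained by camera motion."""
    px, py = apply_homography(H, point)
    camera_dx, camera_dy = px - point[0], py - point[1]
    return (observed_flow[0] - camera_dx, observed_flow[1] - camera_dy)


# Toy example: the head motion shifts the whole image 5 pixels to the right.
H = [[1.0, 0.0, 5.0],
     [0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0]]
background_residual = residual_flow(H, (10.0, 10.0), (5.0, 0.0))
object_residual = residual_flow(H, (20.0, 20.0), (8.0, 2.0))
```

The background pixel ends up with zero residual flow, while the pixel on the manipulated object keeps a non-zero residual, which is what makes it stand out in a saliency map.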

Architecture

The first figure in this article shows the architecture of the Ego ConvNet. It consists of 2D and 3D convolutional networks. The authors use this architecture to learn the coordination between hands, head motion, and saliency maps. The inputs to this network are:

Hand mask: as a binary image;
Camera motion: in both the x and y directions, as grayscale images;
Saliency map: as a grayscale image.

The authors use infogain multinomial logistic loss for training the network.

Three-stream Architecture

The authors experiment by combining the learned Ego ConvNet with the spatial and temporal streams, as shown in the figure below. With this architecture, they improve the action recognition accuracy by 10%.

Source

Conclusions

The authors achieve state-of-the-art results on GTEA, CMU Kitchens, ADL, and UTE. As GTEA is a small dataset, the authors also train the network on the Interactive Museums dataset.

Some of the important observations are:

The 2D and 3D networks, each used alone as the Ego ConvNet, show similar performance;
Fusing the 2D and 3D networks as a two-stream architecture results in performance gains;
By further fusing them with the spatial and temporal streams, the authors improve action recognition accuracy by 10%;
The proposed network improves upon the state-of-the-art on all four datasets.

Here are some correctly classified examples:

Source

References

First Person Action Recognition Using Deep Learned Descriptors. [Paper]
GTEA. [Link]
CMU Kitchens. [Link]
ADL. [Link]
UTE. [Link]