Sound2Sight: Generating Visual Dynamics from Sound and Context

Learning associations across modalities is critical for robust multimodal reasoning, especially when a modality may be missing during inference. In this paper, we study this problem in the context of audio-conditioned visual synthesis – a task that is important, for example, in occlusion reasoning. Specifically, our goal is to generate future video frames and their motion dynamics conditioned on audio and a few past frames.

Read More “Sound2Sight: Generating Visual Dynamics from Sound and Context”

Remove to Improve

The workhorses of CNNs are its filters, located at different layers and tuned to different features. Their responses are combined using weights obtained via network training. Training is aimed at optimal results for the entire training data, e.g., highest average classification accuracy. In this paper, we are interested in extending the current understanding of the roles played by the filters, their mutual interactions, and their relationship to classification accuracy.

Read More “Remove to Improve”

Visual Scene Graphs for Audio Source Separation

State-of-the-art approaches for visually-guided audio source separation typically assume sources that have characteristic sounds, such as musical instruments. These approaches often ignore the visual context of these sound sources or avoid modeling object interactions that may be useful to characterize the sources better, especially when the same object class may produce varied sounds from distinct interactions.
Read More “Visual Scene Graphs for Audio Source Separation”

A Hierarchical Variational Neural Uncertainty Model for Stochastic Video Prediction

Predicting the future frames of a video is a challenging task, in part due to the underlying stochastic real-world phenomena. Prior approaches to solve this task typically estimate a latent prior characterizing this stochasticity, however do not account for the predictive uncertainty of the (deep learning) model. Such approaches often derive the training signal from the mean-squared error (MSE) between the generated frame and the ground truth, which can lead to sub-optimal training, especially when the predictive uncertainty is high.

Read More “A Hierarchical Variational Neural Uncertainty Model for Stochastic Video Prediction”

Texture Recognition

Given an arbitrary image, our goal is to segment all distinct texture subimages. This is done by discovering distinct, cohesive groups of spatially repeating patterns, called texels, in the image, where each group defines the corresponding texture. Texels occupy image regions, whose photometric, geometric, structural, and spatial-layout properties are samples from an unknown pdf.

Read More “Texture Recognition”

Connected Segmentation Tree For Object Modeling

We propose a new object representation, called connected segmentation tree (CST), which captures canonical characteristics of the object in terms of the photometric, geometric, and spatial adjacency and containment properties of its constituent image regions. CST is obtained by augmenting the object’s segmentation tree (ST) with inter-region neighbor links, in addition to their recursive embedding structure already present in ST.

Read More “Connected Segmentation Tree For Object Modeling”

Object Category Recognition

Low level segmentation based image features are used for the problem of object categorization. In general, object categorization comprises two main research areas: (1) classification or clustering of images containing objects belonging to an object category, and (2) detection, localization, and segmentation of individual object-category instances in images. The first thrust of research is typically concerned with exemplar based methods, where the main focus is to develop an efficient distance measure between two images.

Read More “Object Category Recognition”

Segmentation Based Object Discovery

Given a set of images, possibly containing objects from an unknown category, determine if a category is present. If a category is present, learn spatial and photometric model of the category. Given an unseen image, segment all occurrences of the category.

Read More “Segmentation Based Object Discovery”

Face Detection

Faces and Gestures

The aforementioned work on representation and learning has contributed to two types of human computer interfaces we have developed. First, learning and classification techniques, including usual statistical classifiers, neural networks, support vector machines and artificial intelligence approaches, have been used to develop new methods for human face detection and hand gesture recognition.

Read More “Face Detection”

Face Recognition

Faces and Gestures

The aforementioned work on representation and learning has contributed to two types of human computer interfaces we have developed. First, learning and classification techniques, including usual statistical classifiers, neural networks, support vector machines and artificial intelligence approaches, have been used to develop new methods for human face detection and hand gesture recognition.

Read More “Face Recognition”

Learning to Recognize 3D Objects

3D Object Recognition

Recognition is achieved either by explicitly coding the recognition criteria in terms of low level structure, or through learning from examples. Learning algorithms incorporate subspace projections of higher dimensional data symbolically or using neural approaches.

Learning to Recognize 3D Objects

A learning account for the problem of object recognition is developed within the PAC (Probably Approximately Correct) model of learnability.

Read More “Learning to Recognize 3D Objects”

Learning for Object Recognition

3D Object Recognition

Recognition is achieved either by explicitly coding the recognition criteria in terms of low level structure, or through learning from examples. Learning algorithms incorporate subspace projections of higher dimensional data symbolically or using neural approaches.

Learning for Object Recognition

A learning algorithm accounting for the problem of object recognition is developed within the PAC (Probably Approximately Correct) model of learnability.

Read More “Learning for Object Recognition”

Learning of Low-level Spatiotemporal Structural Patterns

3D Object Recognition

Recognition is achieved either by explicitly coding the recognition criteria in terms of low level structure, or through learning from examples. Learning algorithms incorporate subspace projections of higher dimensional data symbolically or using neural approaches.

Learning of Low-level Spatiotemporal Structural Patterns

Given an image or a video sequence, a prespecified set of low level, spatial and/or temporal descriptors of the image/video structure, and a higher level interpretation of the structure, use computational learning methods to derive a succinct relationship between the interpretation and the low level structural description.

Read More “Learning of Low-level Spatiotemporal Structural Patterns”

Gesture Recognition

Faces and Gestures

The aforementioned work on representation and learning has contributed to two types of human computer interfaces we have developed. First, learning and classification techniques, including usual statistical classifiers, neural networks, support vector machines and artificial intelligence approaches, have been used to develop new methods for human face detection and hand gesture recognition.

Read More “Gesture Recognition”