Given the explosion of images on the web, image search engines typically respond to a query by clustering together images that have similar tags or textual descriptions. For example, on receiving a query about tall buildings, the Google image search engine finds several images that possibly contain tall buildings and groups them by attributes such as evening, looking-up, etc. However, relying on text makes such engines susceptible to missing visual clues that are not described in the text. For example, it is hard for the engine to determine which tall buildings have a curved and glassy appearance in the evening. In principle, we could search for curved tall buildings, but the resulting images might again exhibit varying degrees of glassy reflections. Searching instead for curved glassy tall buildings might seem plausible; however, the complexity of the search increases exponentially with the number of attributes in the query, and thus quickly becomes prohibitive.
We address this issue by leveraging visual clues to enhance search results in a weakly supervised setting: we train our model on images that are each labeled with only one attribute, yet the model learns to predict multiple attributes in any given image. To make this possible, we propose a new procedure for training Convolutional Neural Networks (CNNs) that we call Deep Carving. Our models consistently achieve state-of-the-art results with respect to the precision of attribute prediction.
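To make the weakly supervised setting concrete, the following minimal sketch contrasts the single-attribute training labels with the multi-attribute output expected at test time. It does not implement Deep Carving. The attribute vocabulary, the helper names `one_hot_target` and `predict_attributes`, the 0.5 threshold, and the example scores are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List

# Hypothetical attribute vocabulary, used only for this illustration.
ATTRIBUTES = ["curved", "glassy", "evening"]


def one_hot_target(label: str) -> List[int]:
    """Training target: each image carries exactly one attribute label."""
    return [1 if attribute == label else 0 for attribute in ATTRIBUTES]


def predict_attributes(scores: Dict[str, float], threshold: float = 0.5) -> List[str]:
    """Test time: keep every attribute whose score clears the threshold."""
    return [a for a in ATTRIBUTES if scores.get(a, 0.0) >= threshold]


# A training image labeled only "glassy", even if it is also curved.
train_target = one_hot_target("glassy")

# Made-up per-attribute scores standing in for the output of a trained model.
predicted = predict_attributes({"curved": 0.81, "glassy": 0.92, "evening": 0.12})
```

Here each attribute is scored independently, so a single image can be assigned several attributes even though every training image contributed only one.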