Natural language can be used to construct rich, compositional descriptions of the world, highlighting for example entities (nouns), events (verbs), and the interactions between them (simple sentences). In this talk, I show how compositional structure around verbs and nouns can be repurposed to build computer vision systems that scale to recognize hundreds of thousands of visual concepts in images. I introduce the task of situation recognition, where the goal is to map an image to a language-inspired structured representation of the main activity it depicts. The problem is challenging because it requires recognition systems to identify not only what entities are present, but also how they are participating in an event (e.g., not only that there are scissors, but that they are being used to cut). I also describe new deep learning models that better capture compositionality in situation recognition and leverage the close connection to language ‘to know what we don’t know’ and cheaply mine new training data. Although these methods work well, I show that they have a tendency to amplify underlying societal biases in the training data (including over-predicting stereotypical activities based on gender), and introduce a new dual decomposition method that significantly reduces this amplification without sacrificing classification accuracy. Finally, I propose new directions for expanding what visual recognition systems can see and ways to minimize the encoding of negative social biases in our learned models.