Individual Submission Summary

Joint Image-Text Classification Using an Attention-Based LSTM Architecture

Thu, September 10, 12:00 to 1:30pm MDT, TBA

Abstract

The use of social media data in political science is now commonplace. Social media posts such as Tweets are usually multimodal, comprising, for example, both text and images. For instance, recent work in election forensics uses Twitter data to capture people's reports of their personal experiences ("incidents") during the 2016 U.S. presidential election (Mebane et al. 2018). That work relies on automated text-based classification, but the classifiers do not use all available content; in particular, they ignore images. Yet images can provide important context for the text: some Tweets feature pictures of long lines or of smiling voters wearing "I Voted" stickers. Human coders use the text and images jointly when determining whether a post reports an election incident, but the computer does not use the images.

Two-stage ensemble classifiers have been developed to classify text and images together: probabilities that a post is an observation of interest are generated separately for the text and for the image, and an overall probability is then computed with some predefined function (see, e.g., Zhang & Pan 2019). However, many posts that contain both images and text are recognizable as observations of interest only when the text and images are considered simultaneously and synergistically.
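To make the contrast concrete, the following is a minimal sketch of such a two-stage ensemble, assuming the per-modality probabilities come from separately trained text and image classifiers; the convex combining function, weight, and 0.5 threshold are illustrative and not taken from Zhang & Pan (2019).

```python
# Minimal sketch of a two-stage ensemble: each modality is scored separately,
# then a predefined function combines the two probabilities. The convex weight
# and the 0.5 decision threshold are illustrative assumptions.

def combine(p_text: float, p_image: float, w: float = 0.5) -> float:
    """Combine per-modality probabilities with a fixed convex weight w."""
    return w * p_text + (1.0 - w) * p_image


# A post whose text and image are each only weakly suggestive (0.40 apiece)
# stays below the threshold under any convex weight, even if reading the two
# together would clearly indicate an election incident.
print(combine(p_text=0.40, p_image=0.40) >= 0.5)  # False
```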

We propose a joint image-text classifier using an attention-based long short-term memory (LSTM) architecture, loosely inspired by the encoder-decoder architectures used in image captioning (Xu et al. 2016). We first extract image features with a pretrained backbone network, such as Inception v3 (Szegedy et al. 2015) or MobileNet v2 (Sandler et al. 2016), a form of transfer learning. These image features are then fed into a multi-layer LSTM architecture as the initial input. At each LSTM timestep, a single word (as an embedding) is fed in along with the previous hidden state, the previous cell state, and the image features; the image features, the word input, and the previous hidden state are linearly combined to obtain the activation vector, which is then used to compute the next hidden state and cell state (see, e.g., Hochreiter & Schmidhuber 1997 and Chang & Masterson 2019). The weights on the image features are the attention weights; attention ensures that the image is considered alongside each word of the Tweet rather than only serving as the LSTM's initial input. After the last timestep, an affine function transforms the final hidden state into scores for the classes of interest. We compare this approach with one that relies entirely on convolutional neural networks, which have proven successful in automated image analysis (Krizhevsky et al. 2012).
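A minimal PyTorch sketch of this architecture appears below. The backbone choice, dimensions, single-layer LSTM cell (used here for brevity in place of the multi-layer version), and the sigmoid gate standing in for the attention weights are illustrative assumptions rather than the exact implementation.

```python
# Sketch: image features from a frozen pretrained backbone are combined with
# each word embedding at every LSTM timestep; a gating layer plays the role of
# the attention weights, and a final affine layer produces class scores.
import torch
import torch.nn as nn
from torchvision import models


class JointImageTextClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim=300, hidden_dim=512,
                 img_dim=1280, num_classes=2):
        super().__init__()
        # Pretrained backbone (transfer learning); MobileNet v2 features here.
        backbone = models.mobilenet_v2(weights="IMAGENET1K_V1")
        self.backbone = nn.Sequential(backbone.features,
                                      nn.AdaptiveAvgPool2d(1), nn.Flatten())
        for p in self.backbone.parameters():      # freeze the backbone
            p.requires_grad = False

        self.embed = nn.Embedding(vocab_size, embed_dim)
        # LSTM cell whose input is [word embedding ; attended image features].
        self.lstm = nn.LSTMCell(embed_dim + img_dim, hidden_dim)
        # Attention weights: gate the image features given the previous state.
        self.attn = nn.Linear(hidden_dim + img_dim, img_dim)
        self.init_h = nn.Linear(img_dim, hidden_dim)   # image features as
        self.init_c = nn.Linear(img_dim, hidden_dim)   # the initial input
        self.classifier = nn.Linear(hidden_dim, num_classes)  # final affine map

    def forward(self, images, token_ids):
        # images: (B, 3, H, W); token_ids: (B, T)
        img_feats = self.backbone(images)              # (B, img_dim)
        h = torch.tanh(self.init_h(img_feats))
        c = torch.tanh(self.init_c(img_feats))
        words = self.embed(token_ids)                  # (B, T, embed_dim)
        for t in range(words.size(1)):
            # Re-weight the image features at every timestep so the image is
            # considered alongside each word, not only at initialization.
            alpha = torch.sigmoid(self.attn(torch.cat([h, img_feats], dim=1)))
            attended = alpha * img_feats               # (B, img_dim)
            h, c = self.lstm(torch.cat([words[:, t], attended], dim=1), (h, c))
        return self.classifier(h)                      # (B, num_classes) scores
```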

We apply this joint image-text classifier to two empirical data collections. The first application is to a set of Tweets, hand-labeled as observations of incidents in the 2016 U.S. general election, of which approximately 10,000 have both text and an image while about another 12,000 have only text. We start with a prototype of the model described above without the attention mechanism and without pretrained word embeddings, then try to improve performance by removing uncommon words, using pretrained word embeddings, and implementing the attention mechanism. Our second planned application is to Tweets about Black Lives Matter. Using data gathered from 2014 to 2015 by Freelon et al. (2016), we code a training set of Tweets for whether they support the Black Lives Matter movement, then train a classifier on text and image features using the approaches described above. We also compare our results to those from two-stage ensemble classifiers and from simply concatenating word embeddings and image features.
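For reference, a minimal sketch of the concatenation baseline follows; the mean-pooled embedding summary, dimensions, and single linear classifier head are illustrative assumptions rather than a fixed specification.

```python
# Sketch of the concatenation baseline: summarize the Tweet's word embeddings,
# concatenate the summary with the image features, and classify.
import torch
import torch.nn as nn


class ConcatBaseline(nn.Module):
    def __init__(self, embed_dim=300, img_dim=1280, num_classes=2):
        super().__init__()
        self.classifier = nn.Linear(embed_dim + img_dim, num_classes)

    def forward(self, word_embeddings, img_feats):
        # word_embeddings: (B, T, embed_dim) pretrained embeddings per token
        # img_feats:       (B, img_dim) backbone features for the Tweet's image
        text_vec = word_embeddings.mean(dim=1)   # order-insensitive text summary
        return self.classifier(torch.cat([text_vec, img_feats], dim=1))
```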
