April 1, 2015

Methodology: How Crimson Hexagon Works

To arrive at the results regarding the tone or frame of discussion on social media, and specifically Twitter, Pew Research often uses computer coding software provided by Crimson Hexagon. That software is able to analyze the textual content from all publicly available posts on Twitter. Crimson Hexagon (CH) classifies online content by identifying statistical patterns in words.

Use of Crimson Hexagon’s Technology

The technology is rooted in an algorithm created by Gary King, a professor at Harvard University’s Institute for Quantitative Social Science. (Click here to view the study explaining the algorithm.)

The purpose of computer coding in general, and Crimson Hexagon specifically, is to “take as data a potentially large set of text documents, of which a small subset is hand coded into an investigator-chosen set of mutually exclusive and exhaustive categories. As output, the methods give approximately unbiased and statistically consistent estimates of the proportion of all documents in each category.”
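
The idea in the quoted passage, estimating the share of documents in each category rather than labeling every document, can be illustrated with a deliberately tiny case. The sketch below assumes only two categories and a single word, and assumes that the word is used at the same rate within each category in the hand-coded subset as in the full collection. It is not the CH implementation; the function name and the numbers are hypothetical.

```python
def estimate_share(word_rate_all, word_rate_in_a, word_rate_in_b):
    """Estimate the share p of documents in category A.

    Solves  word_rate_all = p * word_rate_in_a + (1 - p) * word_rate_in_b
    for p, where the two within-category rates come from hand-coded documents
    and word_rate_all comes from the full, uncoded collection.
    """
    if word_rate_in_a == word_rate_in_b:
        raise ValueError("the word must be used at different rates in the two categories")
    p = (word_rate_all - word_rate_in_b) / (word_rate_in_a - word_rate_in_b)
    # Clamp to [0, 1] because noisy rates can push the estimate out of range.
    return min(1.0, max(0.0, p))


# Hypothetical numbers: the word appears in 80% of hand-coded category-A posts,
# in 10% of hand-coded category-B posts, and in 31% of all posts.
share_a = estimate_share(0.31, 0.80, 0.10)
print(round(share_a, 3))
```

With many words and more than two categories, the same relationship becomes a system of equations that has to be solved jointly, which this sketch does not attempt.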

Universe

Crimson Hexagon software examines all publicly available tweets and has access to Twitter’s entire “firehose.” Unless otherwise noted, retweets are included in the analysis.

Monitor Creation and Training

Each individual study or query related to a set of variables is referred to as a “monitor.”

The process of creating a new monitor consists of four steps. (See below for an example of these steps in action.)

First, Pew Research staff decides what timeframe and universe of content to examine. Pew Research generally uses Crimson Hexagon for the study of Twitter. Unless otherwise noted, Pew Research only includes English-language content.

Second, the researchers enter key terms using Boolean search logic so the software can identify the universe of posts to analyze.
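
As an illustration of what Boolean search logic means in this step, the sketch below filters posts with a simple AND / OR / NOT keyword query. The function names, the query terms, and the sample posts are hypothetical; the actual query syntax of the CH platform is not described here.

```python
import re


def tokenize(text):
    """Lowercase a post and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def matches_query(text, all_of=(), any_of=(), none_of=()):
    """Return True if the post satisfies a simple AND / OR / NOT keyword query."""
    words = tokenize(text)
    if any(term not in words for term in all_of):      # AND: every term required
        return False
    if any_of and not any(term in words for term in any_of):  # OR: at least one
        return False
    if any(term in words for term in none_of):         # NOT: none may appear
        return False
    return True


posts = [
    "Bin Laden killed in raid on compound in Pakistan",
    "Obama to address the nation tonight",
    "Laden with groceries, heading home",
    "Osama bin Laden movie trailer released",
]

# Hypothetical query: laden AND (bin OR osama) NOT movie
selected = [
    p for p in posts
    if matches_query(p, all_of=("laden",), any_of=("bin", "osama"), none_of=("movie",))
]
print(selected)
```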

Third, researchers define categories appropriate to the parameters of the study. If a monitor is measuring the tone of coverage for a specific politician, for example, there would be four categories: positive, neutral, negative, and irrelevant for posts that are off-topic in some way.

If a monitor is measuring media framing or storyline, the categories would be more extensive. For example, a monitor studying the framing of coverage about the death of Osama bin Laden might include nine categories: details of the raid, global reaction, political impact, impact on terrorism, role of Pakistan, straight account of events, impact on U.S. policy, the life of bin Laden, and a category for off-topic posts.

Fourth, researchers “train” the CH platform to analyze content according to specific parameters they want to study. The researchers in this role have gone through in-depth training at two different levels. They are professional content analysts fully versed in Pew Research’s existing content analysis operation and methodology. They then undergo specific training on the CH platform including multiple rounds of reliability testing.

The monitor training itself is done with a random selection of posts collected by the technology. One at a time, the software displays posts and a human coder determines which category each example best fits into. In categorizing the content, Pew Research staff follows coding rules created over the many years that the Center has been conducting content analysis of news media. If an example does not fit easily into a category, that specific post is skipped. The goal of this training is to feed the software with clear examples for every category.

For each new monitor, human coders categorize at least 250 distinct posts. Typically, each individual category includes 20 or more posts before the training is complete. To validate the training, Pew Research has conducted numerous intercoder reliability tests (see below) and the training of every monitor is examined by a second coder in order to discover errors.
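
The thresholds in this paragraph can be expressed as a simple check on a training set. This is a minimal sketch that treats the typical figure of 20 posts per category as a hard minimum, which is an assumption; the function name and data layout are hypothetical.

```python
from collections import Counter


def training_status(coded_posts, categories, min_total=250, min_per_category=20):
    """Check a training set against the thresholds described above.

    coded_posts: list of (post_text, category) pairs from a human coder.
    Returns (ready, number_of_distinct_posts, categories_below_minimum).
    """
    distinct = dict(coded_posts)            # one label per distinct post text
    counts = Counter(distinct.values())     # posts per category
    thin = [c for c in categories if counts[c] < min_per_category]
    ready = len(distinct) >= min_total and not thin
    return ready, len(distinct), thin


categories = ["positive", "neutral", "negative", "irrelevant"]

# Hypothetical training set: 260 distinct posts spread evenly over four categories.
coded = [(f"post {i}", categories[i % 4]) for i in range(260)]
print(training_status(coded, categories))
```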

Once the training is complete, the software culls through and classifies the entirety of the identified online content according to the statistical word patterns derived during the training.

How the Algorithm Works

To understand how the software recognizes and uses patterns of words to interpret texts, consider a simplified example. Imagine the study examining coverage regarding the death of Osama bin Laden that utilizes the nine categories listed above. As a result of the example tweets categorized by a human coder during the training, the CH monitor might recognize that portions of a tweet with the words “Obama,” “poll” and “increase” near each other are likely about the political ramifications. However, a tweet that includes the words “Obama,” “compound” and “Navy” is likely to be about the details of the raid itself.
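
A toy version of this word-pattern idea is sketched below. It counts which words appear in the hand-coded examples for each category and then picks the category whose words overlap most with a new tweet. This is only an illustration: it ignores word proximity, it labels individual posts (which CH does not do), and the training tweets and function names are hypothetical.

```python
import re
from collections import Counter, defaultdict


def words(text):
    """Lowercase a tweet and split it into words."""
    return re.findall(r"[a-z']+", text.lower())


def learn_word_counts(training):
    """For each category, count how many training tweets contain each word."""
    counts = defaultdict(Counter)
    for text, category in training:
        counts[category].update(set(words(text)))
    return counts


def best_category(text, counts):
    """Pick the category whose training words overlap most with the tweet."""
    tweet_words = set(words(text))
    scores = {cat: sum(c[w] for w in tweet_words) for cat, c in counts.items()}
    return max(scores, key=scores.get)


# Hypothetical hand-coded training tweets.
training = [
    ("Obama poll numbers increase after raid", "political impact"),
    ("New poll shows increase in Obama approval", "political impact"),
    ("Navy SEALs stormed the compound, Obama says", "details of the raid"),
    ("Helicopters landed Navy team inside the compound", "details of the raid"),
]
counts = learn_word_counts(training)

print(best_category("Obama sees poll increase", counts))             # political impact
print(best_category("Obama praises Navy team at compound", counts))  # details of the raid
```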

Unlike most human coding, CH monitors do not measure each story as a unit, but examine the entire discussion in the aggregate. To do that, the algorithm breaks up all relevant texts into subsections. Rather than dividing the text by tweet or sentence, CH treats the “assertion” as the unit of measurement. Thus, posts are divided up by the computer algorithm. If 40% of a story fits into one category, and 60% fits into another, the software will divide the text accordingly. Consequently, the results are not expressed in percent of newshole or percent of posts. Instead, the results are the percent of assertions out of the entire body of posts identified by the original Boolean search terms. We refer to the entire collection of assertions as the “conversation.”
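
The arithmetic behind reporting a share of the conversation rather than a share of posts can be sketched as follows. The example assumes each post carries equal weight and that its split across categories is already known; how CH actually identifies and weights assertions is not specified here, and the function name and data are hypothetical.

```python
from collections import defaultdict


def conversation_shares(posts):
    """Aggregate per-post category splits into shares of the whole conversation.

    posts: list of dicts mapping category -> fraction of that post's assertions.
    """
    totals = defaultdict(float)
    for split in posts:
        for category, fraction in split.items():
            totals[category] += fraction
    grand_total = sum(totals.values())
    return {category: value / grand_total for category, value in totals.items()}


# Hypothetical conversation: the first post is split 40% / 60% as in the text.
posts = [
    {"political impact": 0.4, "details of the raid": 0.6},
    {"details of the raid": 1.0},
    {"political impact": 1.0},
]
shares = conversation_shares(posts)
for category, share in shares.items():
    print(f"{category}: {share:.1%}")
```

Here the split post contributes to both categories, so the totals describe the conversation as a whole and cannot be read as a count of posts.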

Testing and Validity

Pew Research spent more than 12 months testing CH. The Center’s own tests, which compared coding by humans with coding by the software, determined that the software met its high standards for accuracy and repeatability.

In addition to validity tests of the platform itself, the Center conducted separate examinations of human intercoder reliability to show that the training process for complex concepts is replicable. The first test had five researchers each code the same 30 posts, which resulted in an agreement of 85%.
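
One common way to summarize agreement among several coders is the average pairwise percent agreement, sketched below. The text does not say which agreement statistic the Center used, so this choice is an assumption, and the labels are hypothetical.

```python
from itertools import combinations


def average_pairwise_agreement(codings):
    """Average share of posts on which each pair of coders chose the same category.

    codings: list of per-coder label lists, all covering the same posts in order.
    """
    pair_scores = []
    for a, b in combinations(codings, 2):
        matches = sum(x == y for x, y in zip(a, b))
        pair_scores.append(matches / len(a))
    return sum(pair_scores) / len(pair_scores)


# Hypothetical example: three coders, four posts.
coders = [
    ["pos", "neg", "neu", "neg"],
    ["pos", "neg", "neu", "pos"],
    ["pos", "neg", "neg", "neg"],
]
agreement = average_pairwise_agreement(coders)
print(f"{agreement:.1%}")
```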

A second test had each of the five researchers build their own separate monitors to see how the results compared. This test involved not only testing coder agreement, but also how the algorithm handles various examinations of the same content when different human trainers are working on the same subject. The five separate monitors came up with results that were within 85% of each other.

Unlike polling data, the results from the CH tool do not have a sampling margin of error since there is no sampling involved.

An Example

Since the use of computer-aided coding is a relatively new phenomenon, it will be helpful to demonstrate how the above procedure works by following a specific example.

In January 2015, Pew Research created a monitor to measure the tone of the conversation on Twitter about New Jersey Governor Chris Christie. First, we created a monitor with the following guidelines:

  1. Source: “Twitter” only
  2. Original date range: November 4 to December 31, 2014
  3. English-language content only
  4. Keyword: Christie
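
The four guidelines above can be written down as a small configuration object with a membership test. This is a sketch only; the class, the field names, and the post schema are hypothetical and do not reflect how the CH platform stores monitors.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class MonitorConfig:
    source: str
    start: date
    end: date
    language: str
    keyword: str

    def includes(self, post):
        """Return True if a post falls inside this monitor's universe.

        post: dict with 'source', 'date', 'language' and 'text' keys
        (a hypothetical schema used only for this illustration).
        """
        return (
            post["source"] == self.source
            and self.start <= post["date"] <= self.end
            and post["language"] == self.language
            and self.keyword.lower() in post["text"].lower()
        )


christie = MonitorConfig(
    source="Twitter",
    start=date(2014, 11, 4),
    end=date(2014, 12, 31),
    language="en",
    keyword="Christie",
)

sample = {"source": "Twitter", "date": date(2014, 11, 10),
          "language": "en", "text": "Christie speaks on the economy"}
print(christie.includes(sample))
```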

We then created the four categories that are used for measuring tone:

  1. Positive
  2. Neutral
  3. Negative
  4. Off-topic/Irrelevant

Next, we trained the monitor by classifying documents. CH randomly selected entire posts from the time period specified, and displayed them one by one. A researcher decided whether each post was a clear example of one of the four categories and, if so, assigned that post to the appropriate category. If an example post could fit into more than one category, or was not clear in its meaning, the coder skipped the post. Since the goal is to find the clearest cases possible, coders often skip many posts until they find good examples.

A tweet about a poll showing Chris Christie ahead of the Republican field, with his lead growing, would be a good example to put in the “positive” category. A different story that is entirely about Christie’s record in New Jersey and how some conservative voters are opposed to him would be put in the “negative” category. A post that is strictly factual, such as a story about a speech Christie gave on the economy that does not include evaluative assessments, would be put in the “neutral” category. And a post that includes the word “Christie” but is not about the candidate at all, such as a story about a different person with the same last name, would go in the “off-topic” category.

The coder trained the monitor on more than 250 documents. Each of the four categories contained more than 20 posts.

At that point, the initial training was finished. For the sake of validity, Pew Research has another coder check over all of our training and look for tweets that they would have categorized differently. Those tweets are removed from the training sample because the disagreement between coders shows that they are not clear, precise examples. In the case of the Christie monitor, there were four documents that were removed for this reason.
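
The second-coder check described here amounts to dropping any training example on which the two coders disagree. A minimal sketch, with hypothetical post IDs, labels, and function name:

```python
def remove_disputed(training, second_opinions):
    """Drop training examples that a second coder would have categorized differently.

    training: dict mapping post_id -> category assigned by the first coder.
    second_opinions: dict mapping post_id -> category chosen by the second coder.
    Posts the second coder did not flag are treated as agreed.
    Returns (kept_examples, removed_post_ids).
    """
    kept = {
        post_id: category
        for post_id, category in training.items()
        if second_opinions.get(post_id, category) == category
    }
    removed = sorted(set(training) - set(kept))
    return kept, removed


# Hypothetical training sample and second-coder review.
training = {"t1": "positive", "t2": "negative", "t3": "neutral", "t4": "positive"}
second_opinions = {"t1": "positive", "t2": "neutral", "t4": "positive"}

kept, removed = remove_disputed(training, second_opinions)
print(kept)
print(removed)
```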

Finally, we “ran” the monitor. This means that the algorithm examined the word patterns derived from the monitor training, and applied those patterns to every tweet that was captured using the initial guidelines. Since the software studies the conversation in the aggregate as opposed to individual posts, the algorithm divided up the overall conversation into percentages that fit into the four categories.