January 29, 2018

Sources Shared on Twitter: A Case Study on Immigration


This study examines the different types of sources linked to on Twitter about a widely discussed news topic: U.S. immigration. To do this, Pew Research Center researchers analyzed the sites linked to in tweets about U.S. immigration policy and news events that were posted during the first month of the Trump administration (Jan. 20-Feb. 20, 2017). During this time period several high-profile events occurred, most notably Trump’s signing of several immigration-related executive orders, including the order that restricted entry to the U.S. by people from certain countries; the resulting protests at airports nationwide; and several court rulings that delayed the administration’s ability to enforce that order.

Data collection

This study analyzed the 1,030 sites linked to by any of the 9.7 million immigration-related tweets examined here. These tweets come from an original set of 11.5 million immigration-related tweets with external links; only sites linked to in at least 750 of those tweets were included. Once collected, these sites were coded for their type. Two categories of sites – News Organizations and Other Information Providers – were also coded for age, self-described ideology and the use of anti-establishment language in their “about” section or connected social media pages (see Content Analysis for more details).

Researchers collected all tweets matching certain immigration-related keywords for analysis using the Gnip Historical PowerTrack API (Gnip API), a searchable archive of all publicly available tweets. There are four types of tweets: a basic tweet, in which a user posts some kind of content; a retweet, in which the user reposts a message posted by another user; a retweet with an additional comment, also known as a quoted tweet; and a reply, in which a user replies to another tweet. All four types of tweets can include a link that directs the user to a site (i.e., a website or social media page/channel) outside of Twitter. This analysis includes tweets from each of the four tweet types that included a link.

Selecting tweets about immigration

Tweets were collected from the Gnip API using a set of keywords related to the immigration debate in the United States. This included general terms related to immigration as well as specific issues such as the proposed border wall, sanctuary cities and the so-called “Dreamers” (immigrants brought to the United States as children). The final set of keywords was developed iteratively, with analysts testing different combinations of keywords to include as much relevant content as possible while minimizing the inclusion of irrelevant content. The analysis included tweets matching any of the five keyword parameters below:

  • Contains at least one of the following: “Muslim ban,” “travel ban,” “CBP,” “border patrol,” “#NoBan,” “#NoBanNoWall,” “#NoWall,” “sanctuary city,” “sanctuary cities,” “sanctuary church” or “Day Without Immigrants”
  • Contains either “fence” or “wall”; and at least one of the following: “Trump,” “Mexico,” “border” or “Mexican”; but does not contain any of the following: “Wall Street,” “on the fence” or “Israel”
  • Contains at least one of the following: “detain,” “detained” or “questioned”; and at least one of the following: “airport,” “airports,” “customs,” or “border”
  • Contains at least one of the following: “non-citizen,” “noncitizen,” “green card,” “asylum,” “permanent resident,” “H-1B,” “undocumented,” “-deport-,” “-migra-,” “-criminal alien-” or “-dream-”; and at least one of the following: “Trump,” “-presiden-,” “USA,” “US,” “U.S.,” “U.S.A.,” “-America-,” “Washington,” “White House,” “-Mexic-,” “-protest-,” “-demonstra-,” “order,” “ban,” “judge,” “court” or “ruling”
  • Contains at least one of the following: “non-citizen,” “noncitizen,” “green card,” “asylum,” “permanent resident,” “H-1B,” “undocumented,” “-deport-,” “-migra-,” “-criminal alien-,” or “-dream-”; and at least one of the following: “circuit,” “emergency stay,” “-appeal-,” “EO,” “pause,” “moratorium,” “90 day,” “90-day,” “-vett-,” “vote,” “-voter-,” “voting,” “sanctuary,” “National Guard,” “ICE,” “DHS” or “DOJ”
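The boolean logic of these parameters can be sketched in code. The example below implements the second parameter as a Python function; the actual queries were issued in Gnip's own rule syntax, so this is only an illustration, and the function name is hypothetical.

```python
# Illustrative sketch of the second keyword parameter above. The real
# queries used Gnip's own rule syntax; matching here is case-insensitive
# substring matching, as a rough stand-in.

def matches_wall_rule(text):
    """("fence" or "wall") plus a Trump/Mexico/border context term,
    excluding the "Wall Street" / "on the fence" / "Israel" phrases."""
    t = text.lower()
    has_wall = "fence" in t or "wall" in t
    has_context = any(k in t for k in ("trump", "mexico", "border", "mexican"))
    excluded = any(k in t for k in ("wall street", "on the fence", "israel"))
    return has_wall and has_context and not excluded
```

A tweet such as "Trump says the wall will be built" would match, while "Wall Street reacts to Trump" would be excluded despite containing both "wall" and "Trump."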

The Gnip API supports partial keyword matching (i.e., matching just a portion of a word). For example, “-migra-” matches “migration,” “migrate” and “immigration.” The Gnip API is also case-insensitive, so the keyword phrase “White House” matches both “White House” and “white house.” The API can also ignore punctuation, which is helpful when dealing with Twitter’s hashtags. For example, the keyword “Trump” will match tweets that contain the exact keyword, along with tweets that contain “#Trump,” “Trump!” and any other punctuation or symbols (but not numbers or letters) preceding or following the keyword.
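This matching behavior can be approximated with regular expressions: a whole-word keyword tolerates adjoining punctuation but not letters or digits, while a hyphen-wrapped keyword like “-migra-” is a plain substring match. The helper names below are hypothetical, and Gnip's exact rules may differ.

```python
import re

def keyword_pattern(keyword):
    """Rough approximation of Gnip's whole-keyword matching described
    above: case-insensitive, tolerant of surrounding punctuation and
    symbols, but not of adjoining letters or digits."""
    return re.compile(
        r"(?<![A-Za-z0-9])" + re.escape(keyword) + r"(?![A-Za-z0-9])",
        re.IGNORECASE,
    )

def partial_pattern(keyword):
    """Keywords written with hyphens, like "-migra-", match anywhere
    inside a word (plain substring matching)."""
    return re.compile(re.escape(keyword.strip("-")), re.IGNORECASE)
```

With these patterns, `keyword_pattern("Trump")` matches “#Trump!” but not “Trumpism,” and `partial_pattern("-migra-")` matches “Immigration.”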

The Gnip API returned over 20 million tweets matching the keyword parameters listed above.

Even with these search parameters in place, off-topic tweets still made their way into the dataset, prompting researchers to apply a series of additional parameters to remove these irrelevant tweets. The following rules were added to do so:

  • Removed any tweets in which the only mention of “dream” is in a user name (preceded by an @ symbol) that is contained within the tweet text
  • Removed any tweets containing any of the following: “Chasing Your Dream Radio,” “CIA Memorial,” “American Horror Story,” “Wall St-” or “on the fence”
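The first cleanup rule can be sketched as a small filter: a tweet is dropped only when every occurrence of “dream” sits inside an @username in the tweet text. This is an illustrative reconstruction, not the study's actual script.

```python
import re

# Sketch of the first cleanup rule above: drop a tweet when the only
# mention of "dream" appears inside an @username in the tweet text.
MENTION = re.compile(r"@\w+")

def dream_only_in_mentions(text):
    t = text.lower()
    if "dream" not in t:
        return False  # the rule does not apply to this tweet
    without_mentions = MENTION.sub("", t)
    return "dream" not in without_mentions  # True -> remove the tweet
```

A tweet like “@dreamhost is down again” would be removed, while “@dreamhost supports Dreamers” would be kept because “Dreamers” appears outside the username.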

Finally, because this study analyzed the external sites linked to in tweets, researchers removed any tweets that did not contain any links to sites outside of Twitter.

After applying these rules, the dataset included 11.5 million tweets.

Extracting links

The Gnip API stores external links in different ways depending on the type of tweet that contained the link. To extract these links from the data for analysis, researchers developed and ran a Python script that searched the Gnip data for any links in all tweets.

For a variety of reasons, it is common for Twitter users to use link shorteners. Common link shorteners include bit.ly, ow.ly and an assortment of site-specific shorteners (for instance, Pew Research Center uses the shortener pewrsr.ch).

The Gnip API stores both the shortened and expanded link for basic tweets, while it only stores the shortened link for most other tweet types. Once researchers extracted all links from each tweet, they then used a script to follow all of these links (the equivalent of clicking on a link) in order to identify the final link. If both the shortened and expanded links routed to the same webpage, the expanded link was saved and the shortened link was discarded. In some cases, the original link could not be determined from the shortened link because the shortened link had expired or it was otherwise unclear where the shortened link originally pointed. These are captured in the content delivery tools grouping under the Other Sites category.
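The link-expansion step can be sketched as following a chain of redirects until a final URL is reached. In the sketch below the redirect graph is simulated with a dictionary; a real implementation would issue HTTP requests. The function and the mapping format are illustrative assumptions.

```python
def expand_link(url, redirects, max_hops=10):
    """Follow a chain of redirects to the final link. `redirects` maps a
    URL to its redirect target, or to None when a shortener has expired
    (a real implementation would issue HTTP requests instead). Returns
    None when the final link cannot be determined, mirroring the
    "expired or unresolvable shortener" case described above."""
    seen = set()
    while url in redirects:
        if url in seen or len(seen) >= max_hops:
            return None  # redirect loop or excessively long chain
        seen.add(url)
        url = redirects[url]
        if url is None:
            return None  # expired shortener: target unknown
    return url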

Researchers then extracted the domains from all collected links. For example, “nytimes.com” is the extracted domain from the link www.nytimes.com/2017/01/26/us/politics/mexico-wall-tax-trump.html. A single tweet can contain multiple links and, therefore, be counted more than once if those links point to different domains. If a tweet contained multiple links to an individual domain, researchers counted the tweet/domain pair only once.
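The domain-extraction and deduplication logic described above can be sketched as follows. Stripping a leading “www.” is a simplification, and the function names are hypothetical; cases like bbc.co.uk vs. bbc.com were handled in a later consolidation step.

```python
from urllib.parse import urlparse

def extract_domain(url):
    """Pull the domain out of a link, dropping a leading "www."."""
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host

def count_tweet_domain_pairs(tweets):
    """tweets: iterable of (tweet_id, list_of_links). Each tweet/domain
    pair is counted at most once, even when a tweet links to the same
    domain several times, as described above."""
    pairs = {(tweet_id, extract_domain(link))
             for tweet_id, links in tweets
             for link in links}
    counts = {}
    for _, domain in pairs:
        counts[domain] = counts.get(domain, 0) + 1
    return counts
```

A tweet with two nytimes.com links and one foxnews.com link contributes one count to each domain.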

After these verification and link extraction steps, the dataset included 11.5 million tweets, 54,320 domains and 55,462 identifiable social media pages/channels or discussion forum groups.

Determining which sites to include in this study

Because many sites had just a few tweets linking to them, and therefore likely did not play a large role in the Twitter conversation, researchers only included sites that were linked to in at least 750 tweets. This resulted in 1,030 sites (website domains and social media pages/channels), which were included in 9.7 million tweets or about 85% of all tweets in the dataset.1

Even after limiting the dataset to sites linked to in at least 750 tweets, several additional validation steps were required to reach the final dataset.

First, researchers consolidated any related subdomains (such as edition.cnn.com) or site-specific link shorteners with at least 750 tweets into a single site. For instance, the dataset included both bbc.co.uk and bbc.com, which were consolidated under bbc.com with the sum of the links to both sites. The same process was also applied to several site-specific shorteners (such as pewrsr.ch, which is a shortener for pewresearch.org) that had not redirected in previous link expansion steps. For instance, if 1,000 tweets linked to pewrsr.ch, this link shortener was removed from the dataset and the number of tweets that linked to pewresearch.org was increased by 1,000. This validation step affected 96 sites.
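This consolidation step amounts to folding counts for aliases into a canonical site. The sketch below assumes a precomputed alias map (the entries shown are illustrative examples from the text, not the study's full mapping).

```python
def consolidate(tweet_counts, aliases):
    """Fold tweet counts for subdomains and site-specific link
    shorteners into a single canonical site. `aliases` maps an alias
    (e.g. "bbc.co.uk" or "pewrsr.ch") to its canonical domain."""
    merged = {}
    for site, n in tweet_counts.items():
        canonical = aliases.get(site, site)
        merged[canonical] = merged.get(canonical, 0) + n
    return merged
```

For example, `consolidate({"bbc.com": 5000, "bbc.co.uk": 1200}, {"bbc.co.uk": "bbc.com"})` yields a single bbc.com entry with 6,200 tweets.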

Second, researchers analyzed the links to social media, discussion forums and other platforms for user-generated content to distinguish between the platforms’ different pages. If a social media page, such as a YouTube channel, Facebook page or WordPress blog, was linked to in at least 750 tweets, researchers included that page in the analysis. Because Reddit is focused on discussions between users rather than the posts of an individual user, analysis of that site was conducted at the subreddit level. All other pages on these platforms were excluded from the dataset.

Any social media pages connected with a site already captured were removed from the dataset and their tweets were associated with the original site, as was done for link shorteners. For example, tweets that linked to Pew Research Center’s Facebook page were combined with tweets that linked to pewresearch.org. However, this combination only occurred if the social media page was linked to by at least 750 tweets (i.e., if Pew Research Center’s Facebook page was only linked to in 100 tweets, this step was not taken).

The social media platforms and discussion forums that included one or more pages linked to in at least 750 tweets were:

  • youtube.com (37 channels)
  • facebook.com (30 pages)
  • wordpress.com (6 blogs)
  • reddit.com (5 subreddits)
  • medium.com (4 publishers)
  • instagram.com (3 profiles)
  • blogspot.com (2 blogs)
  • linkedin.com (1 profile)

There were several social media platforms on which no individual page was linked to in at least 750 tweets. For some platforms (gab.ai, periscope.tv, pinterest.com, vimeo.com and vine.co), researchers could not identify the social media account from the link, either because the link had expired or because the account could not be identified via automated means. Finally, one site under tumblr.com was associated with a previously captured site.

After these validation steps, there were 1,030 sites, including social media pages, in the dataset.

Content analysis

After collecting and validating these Twitter data, researchers conducted an additional content analysis. This analysis was performed by a team of two coders who were trained specifically for this project.

The 1,030 sites in the dataset were coded according to several variables:

  • Broad category and specific grouping refers to the different kinds of sites that are linked to in these 9.7 million tweets. For every site, researchers visited the homepage of the site itself as well as its “about” page and any connected social media profiles. There was a total of 14 different site groupings, which are organized below into three broad categories used throughout the report:
    • News Organizations – Legacy news organizations, digital-native news organizations
    • Other Information Providers – Digital-native commentary/blogs, digital-native aggregators, nonprofit/advocacy organizations, government institutions/public officials, academic/polling
    • Other Sites – Consumer products and internet services; foreign/non-English; spam; discontinued; content delivery tools; celebrity, sports and parody/satire; other sites

The following variables were only used to analyze sites in the News Organizations and Other Information Providers categories:

  • Age refers to the date the site began posting content. Sites were coded as having been established either before or after Jan. 1, 2015. To code this variable, researchers evaluated any of the following: the site’s “about” page, its WHOIS information (which provides information on the individual or organization that registered the domain), the date of the first post on the site or news articles about the site’s launch.
  • Ideology refers to a site’s self-described ideology or partisanship as stated on its “about” page, associated social media profiles or interviews with its founders, based on the following categories:
    • Liberal, including Democrats, progressives and left-leaning
    • Conservative, including Republicans and right-leaning
    • No self-described ideology
  • Establishment orientation refers to a site’s self-described orientation toward the media or political establishment, as stated on its “about” page or associated social media pages. Sites that say, for example, they are “exposing the lies of the media” or “taking the fight to the political establishment” were categorized as anti-establishment. All other sites were categorized as not having a self-identified anti-establishment leaning.

Coders were given multiple sets of sites to evaluate during the training period. Once internal agreement on how to code the variables was established, coding of the content for the study began. The Krippendorff’s Alpha estimate for each variable is below. For each variable, this estimate is based on a minimum of 139 sites and a maximum of 241 sites (for site category/grouping).

  • Site category/grouping: 0.69
  • Age: 0.82
  • Ideology: 0.67
  • Establishment orientation: 0.72
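For reference, Krippendorff's alpha for nominal data can be computed from a coincidence matrix of coder label pairs. The sketch below handles complete codings by two or more coders; it is illustrative and not necessarily the exact computation used in the study.

```python
from collections import Counter
from itertools import permutations

def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal data with complete codings by
    two or more coders. `units` holds one tuple of coder labels per
    coded item (here, per site). Illustrative sketch only."""
    # Build the coincidence matrix: each ordered pair of labels from
    # different coders within a unit contributes weight 1/(m - 1).
    coincidences = Counter()
    for labels in units:
        m = len(labels)
        for a, b in permutations(labels, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)
    marginals = Counter()
    for (a, _), w in coincidences.items():
        marginals[a] += w
    n = sum(marginals.values())
    # Nominal distance is 0 for identical labels, 1 otherwise.
    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(marginals[a] * marginals[b]
                   for a, b in permutations(marginals, 2))
    if expected == 0:
        return 1.0  # only one category ever used: no possible disagreement
    return 1.0 - (n - 1) * observed / expected
```

Perfect agreement across two categories yields an alpha of 1.0, and disagreement on some units pulls the estimate down toward (and potentially below) zero.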

Throughout the coding process, staff discussed questions as they arose and arrived at decisions under supervision of the content analysis team leader. In addition, the master coder checked coders’ accuracy throughout the process.

External fake news list analysis: Lists from three organizations – Politifact, BuzzFeed and FactCheck.org – were combined to create a single list of “fake news” websites. These lists were selected because they met the criteria of having staff from these organizations directly evaluate the content of each website included rather than compiling them from other existing lists. Additionally, these organizations were cited in media reports as reputable sources of information about fake news or were part of fact-checking initiatives, such as the Facebook fake news initiative.

Each of the lists is publicly available: BuzzFeed’s Fake News Sites and Ad Networks list (updated in December 2017), FactCheck.org’s list of websites that post fake and satirical stories (updated in October 2017) and Politifact’s Fake News Almanac (updated in November 2017). As of January 2018, Politifact’s list, with 330 websites, was the longest, followed by BuzzFeed (167 websites) and FactCheck.org (62 websites).

Pew Research Center analysts downloaded each list in November 2017 and BuzzFeed’s updated list in December 2017. After accounting for sites appearing on more than one list, the combined lists included 468 unique websites.
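Combining the lists and removing overlaps amounts to a deduplicating set union. The sketch below normalizes case and a leading “www.” so the same domain is not double-counted; the site names in the example are placeholders, not entries from the actual lists.

```python
def combine_fake_news_lists(*site_lists):
    """Merge several site lists into one deduplicated set of domains,
    normalizing case and a leading "www." so the same site appearing on
    more than one list is counted only once."""
    combined = set()
    for sites in site_lists:
        for site in sites:
            host = site.strip().lower()
            if host.startswith("www."):
                host = host[4:]
            combined.add(host)
    return combined
```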

Analyzing how many tweets contained links to each site

This study also included a secondary analysis that looked at the number of tweets that contained links to each of the 1,030 sites included in the first analysis. In this dataset, 9.7 million tweets contained links to these sites.

A tweet may contain multiple links, which can point to multiple sites or the same site multiple times. For example, a tweet could link to both The New York Times and Fox News, and both of these sites would be captured for this analysis. Accordingly, across the 9.7 million tweets, there are 12.2 million instances in which a tweet included at least one link to one of the 1,030 sites. The study simply reports the percentage of tweets that point to each site or site grouping.

  1. In this study, pewresearch.org met the threshold of being linked to in at least 750 tweets and was included in the nonprofit/advocacy organization grouping within the Other Information Providers category. It was not linked to in enough tweets to be broken out separately as one of the 15 most shared sites.