July 3, 2015

Methodology: As Greeks head to the polls, the Twitter conversation differs by language

This analysis of the Twitter discussions surrounding the 2015 Greek referendum employed media research methods that combined Pew Research’s content analysis rules with computer coding software developed by Crimson Hexagon (CH). This report is based on an examination of about 2.5 million Twitter statements that were identified as being about the Greek developments in light of the July 5 referendum. An additional analysis of about 300,000 tweets examined sentiment toward Greek Prime Minister Alexis Tsipras following the referendum. The primary searches were conducted in English and Greek. The data were gathered and analyzed by Michael Barthel and Katerina Eva Matsa.

Crimson Hexagon is a software platform that identifies statistical patterns in words used in online texts. Researchers enter key terms using Boolean search logic so the software can identify relevant material to analyze. Pew Research draws its analysis sample from all public Twitter posts. Then a researcher trains the software to classify documents using examples from those collected posts. Finally, the software classifies the rest of the online content according to the patterns derived during the training. While automated sentiment analysis is not perfect, the Center has conducted numerous tests and determined that Crimson Hexagon’s method of analysis is among the most accurate tools available. Multiple tests suggest that human coders and Crimson Hexagon’s results are in agreement between 75% and 83% of the time. (For a more in-depth explanation on how Crimson Hexagon’s technology works click here.)
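The agreement figures above can be illustrated with a short sketch. It assumes agreement is measured as simple percent agreement, meaning the share of items that a human coder and the software place in the same category. The source does not state which agreement statistic was used, and the function name and sample labels below are hypothetical.

```python
def percent_agreement(human_labels, software_labels):
    """Return the percentage of items on which the two label lists match.

    Assumption: simple percent agreement. The methodology does not say
    which agreement statistic was actually used.
    """
    if len(human_labels) != len(software_labels):
        raise ValueError("label lists must be the same length")
    if not human_labels:
        raise ValueError("label lists must not be empty")
    matches = sum(h == s for h, s in zip(human_labels, software_labels))
    return 100.0 * matches / len(human_labels)


# Hypothetical labels, invented for illustration only.
human = ["positive", "negative", "neutral", "off topic", "negative"]
software = ["positive", "negative", "negative", "off topic", "negative"]

agreement = percent_agreement(human, software)  # 4 of 5 labels match
```

Simple percent agreement does not correct for matches that would occur by chance, so it is only one of several ways such a comparison could be made.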

Greek is considered an “unsupported language” in Buzz monitors, which were not used in this analysis. Opinion monitors, which were used, are trained by a researcher fluent in the language.

This study contains an analysis of the sentiment or tone of the conversation on Twitter.

All tweets analyzed in this report were collected in two periods: from 12 am EDT, June 26, 2015, to 11:59 pm EDT, July 1, 2015, and from July 6, 2015, to 11:59 pm EDT, July 12, 2015.
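A minimal sketch of how the two collection periods could be applied to timestamped posts follows. It rests on several assumptions: the second period is taken to begin at 12 am EDT on July 6 (the start time is not given), each period is taken to run through the end of its final minute, and EDT is modeled as a fixed UTC-4 offset. The names `WINDOWS` and `in_collection_window` are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# EDT modeled as a fixed UTC-4 offset (valid for these summer 2015 dates).
EDT = timezone(timedelta(hours=-4), "EDT")

# Assumption: the second window starts at 12 am EDT on July 6, and each
# window runs through the last second of its closing minute (11:59:59 pm).
WINDOWS = [
    (datetime(2015, 6, 26, 0, 0, 0, tzinfo=EDT),
     datetime(2015, 7, 1, 23, 59, 59, tzinfo=EDT)),
    (datetime(2015, 7, 6, 0, 0, 0, tzinfo=EDT),
     datetime(2015, 7, 12, 23, 59, 59, tzinfo=EDT)),
]


def in_collection_window(posted_at):
    """Return True if a timezone-aware timestamp falls in either window."""
    if posted_at.tzinfo is None:
        raise ValueError("posted_at must be timezone-aware")
    return any(start <= posted_at <= end for start, end in WINDOWS)


# A post stamped in UTC is compared correctly against the EDT boundaries.
late_july_1 = datetime(2015, 7, 2, 3, 30, tzinfo=timezone.utc)    # 11:30 pm EDT, July 1
early_july_2 = datetime(2015, 7, 2, 4, 30, tzinfo=timezone.utc)   # 12:30 am EDT, July 2
referendum_day = datetime(2015, 7, 5, 12, 0, tzinfo=EDT)          # between the two windows
```

Requiring timezone-aware timestamps avoids silently treating UTC post times as Eastern time, which would shift posts near the window boundaries.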

Each Boolean search used keywords in English and Greek only.

The Boolean searches used for each monitor included a variety of terms relevant to the subject being examined.

For example, the search used to identify tweets in English about the recent developments in Greece was: (Crisis AND Greece) OR (Greece AND EU) OR (Greece AND European) OR (Greek AND crisis) OR (Greek AND European) OR (Greek AND EU) OR (Greek AND IMF) OR (Greece AND IMF) OR Greece OR debt OR bankruptcy OR default OR bankrupted OR referendum OR grexit OR greferendum OR Eurogroup OR Tsipras OR Syriza OR (Greece AND Europe) OR (Greek AND Europe) OR yeseurope OR oxi

And the search used to identify tweets in Greek about the recent developments in Greece was: Κρίση OR κριση OR Ευρώπη OR Ευρωπη OR ΕΕ OR “Ευρωπαϊκή Ένωση” OR “Ευρωπαικη Ενωση” OR Ελλάδα OR δημοψήφισμα OR δημοψηφισμα OR κρίση OR κριση OR χρέος OR χρεος OR Σύριζα OR Συριζα OR κυβέρνηση OR κυβερνηση OR αντιπολίτευση OR αντιπολιτευση OR Τσίπρας OR Τσιπρας OR θεσμοί OR θεσμών OR θεσμοι OR θεσμων OR Τρόϊκα OR Τροικα
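To show how searches of this form select tweets, the sketch below evaluates a small subset of the terms as a list of OR clauses, each of which is a set of words that must all be present (AND). It assumes case-insensitive, whole-word matching; the source does not describe how Crimson Hexagon tokenizes or matches text, and the names `QUERY`, `tokenize` and `matches` are hypothetical.

```python
import re

# A subset of the search terms, written as OR-of-AND clauses.
# Each inner set is one clause; every word in it must appear in the tweet.
QUERY = [
    {"crisis", "greece"},   # (Crisis AND Greece)
    {"greek", "imf"},       # (Greek AND IMF)
    {"referendum"},
    {"grexit"},
    {"tsipras"},
    {"δημοψήφισμα"},        # accented form, as listed in the Greek search
    {"δημοψηφισμα"},        # unaccented form, listed separately
]


def tokenize(text):
    """Lowercase the text and split it into a set of word tokens.

    Assumption: whole-word, case-insensitive matching. Accents are left
    untouched, so accented and unaccented spellings are distinct tokens.
    """
    return set(re.findall(r"\w+", text.lower()))


def matches(text, query=QUERY):
    """Return True if any clause has all of its words in the text."""
    tokens = tokenize(text)
    return any(clause <= tokens for clause in query)
```

For example, under these assumptions `matches("Greek talks with the IMF stall")` is true because both words of one clause are present, while `matches("The Greek islands")` is false because no clause in this subset is fully satisfied.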

All tweets were put into one of four categories: positive, neutral, negative or off topic. Depending on the search, a tweet was considered positive if it clearly supported the Yes in relation to the referendum, and considered negative if it clearly supported No.

References to debt or the default were only included in the study if the tweet was clearly focused on Greece. The algorithm was trained to treat references to other debts or defaults, such as Puerto Rico’s, as off topic, and those references were excluded from the study.

CH monitors examine the entire Twitter discussion in the aggregate. To do that, the algorithm breaks up all relevant texts into subsections. Rather than dividing each story, paragraph, sentence or word, CH treats the “assertion” as the unit of measurement. Thus, posts are divided up by the computer algorithm. Consequently, the results are not expressed in percent of newshole or percent of stories. Instead, the results are the percent of assertions out of the entire body of stories identified by the original Boolean search terms. We refer to the entire collection of assertions as the “conversation.”
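As a rough illustration of how results expressed as a percent of assertions could be tallied, the sketch below computes each category's share of the conversation from a list of category labels. It assumes that assertions coded as off topic are dropped before the shares are calculated, which the text suggests but does not state outright. The function name and the sample labels are hypothetical.

```python
from collections import Counter


def conversation_shares(labels, excluded=("off topic",)):
    """Return each category's percent share of the on-topic assertions.

    Assumption: categories named in `excluded` are removed from the base
    before percentages are computed.
    """
    kept = [label for label in labels if label not in excluded]
    if not kept:
        raise ValueError("no on-topic assertions to summarize")
    counts = Counter(kept)
    return {label: 100.0 * n / len(kept) for label, n in counts.items()}


# Hypothetical assertion labels, invented for illustration only.
labels = (
    ["positive"] * 3
    + ["neutral"] * 5
    + ["negative"] * 2
    + ["off topic"] * 4
)

shares = conversation_shares(labels)
```

With these invented labels, the ten on-topic assertions split into 30% positive, 50% neutral and 20% negative.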