The era of "fake news" is upon us. Browsing social media is a constant exercise in judgment, but data science can help distinguish real trending topics from made-up ones. In EPJ Data Science, Emilio Ferrara and his team set out to determine, from the outset, whether information spreads organically or artificially on social media.
Guest post by Emilio Ferrara
Every day, billions of people participate in online social media platforms. These digital ecosystems expose their users to tailored information based on individual interests, friendship networks, and offline world news. Each "story," which together with related ones forms a "meme" or information campaign, can emerge organically from grassroots activity or, in some cases, be underpinned by advertising or other coordinated efforts.
Most information campaigns are genuine and benign; however, recently we have seen the emergence of "bad actors" who exploit social media to sway public opinion, with the intent to deceive or simply create chaos. For example, our research showed that prior to the 2016 US presidential election, fake news had become the vehicle for spreading disinformation, attacking candidates, and causing confusion online. Similarly, we have demonstrated how ISIS and other extremist groups have exploited Twitter for terrorist propaganda and recruitment purposes.
It is therefore of paramount importance to be able to detect, in their initial stage, memes and information campaigns that are artificially supported and separate them from organic ones. This issue has important societal implications and poses numerous technical challenges, in part due to the paucity of large-scale annotated datasets with examples of both types of information campaigns.
In EPJ Data Science, we make progress toward discriminating between trending memes that are organic and those promoted through advertisements. This classification proves very challenging: ads usually cause bursts of collective attention that can easily be mistaken for those produced by organic trends. Luckily, we can rely on Twitter for labeled examples: when a hashtag is promoted by an advertiser, Twitter clearly states it. This feature allowed us to collect a dataset of millions of tweets belonging to promoted information campaigns, as well as millions of tweets belonging to organic trends.
We propose a machine learning framework and new techniques for classifying such memes. Our algorithm leverages hundreds of time-varying features to capture changing networks and syndication patterns, content and sentiment information, timing signals, and user metadata.
We conceptualize two different forecasting problems. Early detection, classifying a campaign at the very moment it starts trending, poses significant challenges because minimal activity data is available before the trend. Tracking a campaign after it trends is easier, thanks to the high volume of activity data generated by the many users who join the conversation.
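To make the two settings concrete, here is a minimal sketch, not the paper's actual pipeline: a random-forest classifier from scikit-learn stands in for the framework, trained on synthetic per-campaign feature vectors. The feature counts, the class-separation parameter, and the idea that early detection sees fewer and noisier features than post-trend tracking are all illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

def synthetic_campaigns(n, n_features, promoted_shift):
    """Toy per-campaign feature vectors (think: tweet volume, mean
    sentiment, account age) with a small class-dependent mean shift."""
    X_organic = rng.normal(0.0, 1.0, size=(n, n_features))
    X_promoted = rng.normal(promoted_shift, 1.0, size=(n, n_features))
    X = np.vstack([X_organic, X_promoted])
    y = np.array([0] * n + [1] * n)  # 0 = organic, 1 = promoted
    return X, y

# Early detection: little pre-trend activity -> few, weakly informative features.
X_early, y_early = synthetic_campaigns(n=300, n_features=5, promoted_shift=0.5)
# Post-trend tracking: abundant activity -> many informative features.
X_late, y_late = synthetic_campaigns(n=300, n_features=50, promoted_shift=0.5)

def held_out_accuracy(X, y):
    """Train on 70% of campaigns, report accuracy on the held-out 30%."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
    clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
    return clf.score(X_te, y_te)

acc_early = held_out_accuracy(X_early, y_early)
acc_late = held_out_accuracy(X_late, y_late)
print(f"early-detection accuracy: {acc_early:.2f}")
print(f"post-trend accuracy:      {acc_late:.2f}")
```

Even with the same per-feature signal strength, the richer post-trend representation separates the classes far better, which mirrors why the after-trend problem is the easier one.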
Our framework achieves 75% accuracy for early detection, increasing to over 95% after trending. We evaluate the robustness of the algorithm by introducing different factors, such as random time shifts on trending time series, to reproduce situations that can occur in the real world. Finally, we explore which features best predict promoted campaigns: content features provide consistently useful signals; user features are more informative for early detection, while network and timing features are more useful once more data is available.
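The random-time-shift perturbation can be illustrated with a toy example (the feature definitions below are assumptions, not the paper's): shifting an activity time series leaves aggregate features such as total and peak volume untouched, while timing features such as the position of the peak can move, which is exactly what makes the perturbation a useful stress test.

```python
import numpy as np

rng = np.random.default_rng(0)

def volume_features(series):
    """Toy features from a tweet-volume time series:
    total volume, peak volume, and index of the peak."""
    return np.array([series.sum(), series.max(), series.argmax()])

def shifted(series, max_shift):
    """Apply a random circular time shift of up to max_shift steps,
    simulating uncertainty about when the trend actually starts."""
    return np.roll(series, rng.integers(-max_shift, max_shift + 1))

# Integer toy series: flat baseline, a ramp to a peak, then a plateau.
series = np.array([1] * 20 + list(range(1, 31, 3)) + [5] * 10)
orig = volume_features(series)
pert = volume_features(shifted(series, max_shift=3))

# Total and peak volume are invariant under the shift; peak index is not.
print("original features: ", orig)
print("perturbed features:", pert)
```

A classifier that leans on shift-invariant features should keep its accuracy under this perturbation, while one that depends on exact peak timing may degrade.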
In the future, we will extend this framework to monitor social media to detect coordinated information efforts such as fake news, conspiracy theories, anti-vaccination campaigns, etc.
Read the full article here.
Emilio Ferrara is Research Assistant Professor of Computer Science at the University of Southern California, Research Leader at the USC Information Sciences Institute, and Principal Investigator of the USC Machine Intelligence and Data Science (MINDS) group. His research focuses on the study of techno-social systems and on the design of machine learning frameworks to model and predict individual behavior, characterize the diffusion of online information, and predict crimes and abuses.