This is a limited proof of concept to search for research data, not a production system.


Title: TweetsCOV19 - A Semantically Annotated Corpus of Tweets About the COVID-19 Pandemic (Part 1, October 2019 - April 2020)

Type: Dataset

Erdal Baran, Dimitar Dimitrov (2020): TweetsCOV19 - A Semantically Annotated Corpus of Tweets About the COVID-19 Pandemic (Part 1, October 2019 - April 2020). Zenodo. Dataset. https://zenodo.org/record/3871753

Authors: Erdal Baran; Dimitar Dimitrov


Summary

TweetsCOV19 is a semantically annotated corpus of Tweets about the COVID-19 pandemic. It is a subset of TweetsKB and aims at capturing online discourse about various aspects of the pandemic and its societal impact. Metadata information about the tweets as well as extracted entities, sentiments, hashtags, user mentions, and resolved URLs are exposed in RDF using established RDF/S vocabularies*.

We also provide a tab-separated values (TSV) version of the dataset. Each line contains the features of one tweet instance, separated by a tab character ("\t"). The following list gives the features in index order:

  • Tweet Id: Long.
  • Username: String. Encrypted for privacy reasons*.
  • Timestamp: Format ("EEE MMM dd HH:mm:ss Z yyyy").
  • #Followers: Integer.
  • #Friends: Integer.
  • #Retweets: Integer.
  • #Favorites: Integer.
  • Entities: String. For each entity, we aggregated the original text, the annotated entity, and the score produced by the FEL library. Entities are separated from one another by the char ";", and the parts of each entity are separated by the char ":", so that each entity is stored as "original_text:annotated_entity:score;". If FEL did not find any entities, we stored "null;".
  • Sentiment: String. SentiStrength produces a score for positive (1 to 5) and negative (-1 to -5) sentiment. We separated these two numbers with a whitespace char " ". Positive sentiment is stored first, followed by negative sentiment (e.g. "2 -1").
  • Mentions: String. If the tweet contains mentions, we remove the char "@" and concatenate the mentions with a whitespace char " ". If no mentions appear, we stored "null;".
  • Hashtags: String. If the tweet contains hashtags, we remove the char "#" and concatenate the hashtags with a whitespace char " ". If no hashtags appear, we stored "null;".
  • URLs: String. If the tweet contains URLs, we concatenate the URLs using ":-: ". If no URLs appear, we stored "null;".
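The field layout above can be read with a short parser. The following is a minimal Python sketch, not code shipped with the dataset: the names `Tweet`, `parse_line`, and `parse_entities` are hypothetical. It assumes that a colon inside an entity appears only in the original text (so the last two ":" delimit the annotated entity and the score), that the FEL score parses as a float, and that URLs can be split on ":-:" with surrounding whitespace trimmed. The Java-style timestamp pattern is mapped to the Python `strptime` pattern "%a %b %d %H:%M:%S %z %Y", which expects English day and month names.

```python
from datetime import datetime
from typing import List, NamedTuple, Tuple

NULL = "null;"  # placeholder the dataset description uses for empty fields
TIMESTAMP_FORMAT = "%a %b %d %H:%M:%S %z %Y"  # Python form of "EEE MMM dd HH:mm:ss Z yyyy"


class Tweet(NamedTuple):
    """One TSV row; the class and field names are hypothetical."""
    tweet_id: int
    username: str
    timestamp: datetime
    followers: int
    friends: int
    retweets: int
    favorites: int
    entities: List[Tuple[str, str, float]]  # (original_text, annotated_entity, score)
    sentiment: Tuple[int, int]              # (positive, negative)
    mentions: List[str]
    hashtags: List[str]
    urls: List[str]


def _split(value: str, sep: str) -> List[str]:
    """Split a multi-valued field, treating "null;" as an empty list."""
    if value.strip() == NULL:
        return []
    return [part.strip() for part in value.split(sep) if part.strip()]


def parse_entities(value: str) -> List[Tuple[str, str, float]]:
    """Parse "original_text:annotated_entity:score;" items."""
    entities = []
    for item in _split(value, ";"):
        # Assumption: split from the right so colons in the original text survive.
        parts = item.rsplit(":", 2)
        if len(parts) == 3:
            text, entity, score = parts
            entities.append((text, entity, float(score)))
    return entities


def parse_line(line: str) -> Tweet:
    """Parse one tab-separated line into a Tweet record."""
    cols = line.rstrip("\r\n").split("\t")
    if len(cols) != 12:
        raise ValueError(f"expected 12 tab-separated fields, got {len(cols)}")
    positive, negative = cols[8].split()
    return Tweet(
        tweet_id=int(cols[0]),
        username=cols[1],
        timestamp=datetime.strptime(cols[2], TIMESTAMP_FORMAT),
        followers=int(cols[3]),
        friends=int(cols[4]),
        retweets=int(cols[5]),
        favorites=int(cols[6]),
        entities=parse_entities(cols[7]),
        sentiment=(int(positive), int(negative)),
        mentions=_split(cols[9], " "),
        hashtags=_split(cols[10], " "),
        urls=_split(cols[11], ":-:"),
    )
```

Calling `parse_line` on each line of the TSV file yields one record per tweet. Rows that do not have exactly 12 fields raise a `ValueError` rather than being silently skipped, which makes format surprises visible early.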

This dataset consists of 8,151,524 tweets in total, posted by 3,664,518 users, and reflects the societal discourse about COVID-19 on Twitter from October 2019 to April 2020.

To extract the dataset from TweetsKB, we compiled a seed list of 268 COVID-19-related keywords.

* For the sake of privacy, we anonymize user IDs and we do not provide the text of the tweets.

 

More information

  • DOI: 10.5281/zenodo.3871753
  • Language: en

Subjects

  • twitter, tweets, linked data, microblogging, RDF, csv, covid-19, coronavirus

Dates

  • Publication date: 2020
  • Issued: June 04, 2020

Rights



Format

electronic resource

Related items

  • IsDocumentedBy: https://data.gesis.org/tweetscov19/
  • IsVersionOf: https://doi.org/10.5281/zenodo.3871752
  • IsPartOf: https://zenodo.org/communities/covid-19
  • IsPartOf: https://zenodo.org/communities/twitter-datasets
  • IsPartOf: https://zenodo.org/communities/zenodo