News articles and front pages from 19 Swedish news sites during the covid-19/corona pandemic 2020–2021
Creator
Dahlgren, Peter M. (Department of Journalism, Media and Communication (JMG), University of Gothenburg)
Study number / PID
2021-256-1-1 (SND)
https://doi.org/10.5878/d18f-q220 (DOI)
Data access
Open
Series
Not available
Abstract
This dataset contains news articles from Swedish news sites during the COVID-19 (corona) pandemic 2020–2021. The purpose was to develop and test new methods for collecting and analysing large news corpora by computational means. In total, 677,151 articles were collected from 19 news sites between 2020-01-01 and 2021-04-26. The articles were collected by scraping all links on the homepages and main sections of each site every two hours, day and night.
The dataset also includes about 45 million timestamps at which the articles were present on the front pages (homepages and main sections of each news site, such as domestic news, sports, editorials, etc.). This allows for detailed analysis of which articles a reader was likely exposed to when visiting a news site. The time resolution is two hours, meaning that changes in which articles appeared on the front pages can be detected at two-hour intervals.
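As an illustration, a minimal Python/pandas sketch of how the front-page timestamps could be summarised into presence intervals per article. The file name and column names (article_id, section, timestamp) are assumptions for illustration only, not the documented layout:

```python
# Minimal sketch: summarise front-page observations into presence intervals.
# Assumed (hypothetical) columns: article_id, section, timestamp.
import pandas as pd

timestamps = pd.read_csv("article_frontpage_timestamps.csv", parse_dates=["timestamp"])

presence = (
    timestamps.groupby(["article_id", "section"])["timestamp"]
    .agg(first_seen="min", last_seen="max", observations="count")
    .reset_index()
)

# Approximate time on the front page, given the two-hour sampling resolution.
presence["approx_hours_on_front_page"] = presence["observations"] * 2
print(presence.head())
```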
The 19 news sites are aftonbladet.se, arbetet.se, da.se, di.se, dn.se, etc.se, expressen.se, feministisktperspektiv.se, friatider.se, gp.se, nyatider.se, nyheteridag.se, samnytt.se, samtiden.nu, svd.se, sverigesradio.se, svt.se, sydsvenskan.se and vlt.se.
Due to copyright, the full text is not available; instead, it has been transformed into a document-term matrix (in long format) that contains the frequency of every word in each article (in total, 80 million words). Each article also includes extensive metadata that was extracted from the articles themselves (URL, document title, article heading, author, publish date, edit date, language, section, tags, category) and metadata that was inferred by simple heuristic algorithms (page type, article genre, paywall).
The dataset consists of the following:
article_metadata.csv (53 MB): The file contains information about each news article, one article per row. In total, there are 677,151 observations and 17 variables.
article_text.csv (236 MB): The file contains the id of each news article and how many times...
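To illustrate how the long-format document-term matrix in article_text.csv could be used, the following Python sketch builds a sparse article-by-word matrix from it. The column names (article_id, word, frequency) are assumptions and may differ from the actual file:

```python
# Minimal sketch: pivot the long-format word counts into a sparse document-term matrix.
import pandas as pd
from scipy.sparse import csr_matrix

long_dtm = pd.read_csv("article_text.csv")  # assumed columns: article_id, word, frequency

articles = long_dtm["article_id"].astype("category")
words = long_dtm["word"].astype("category")

dtm = csr_matrix(
    (long_dtm["frequency"], (articles.cat.codes, words.cat.codes)),
    shape=(articles.cat.categories.size, words.cat.categories.size),
)
# Rows are articles, columns are words; rows can be joined back to
# article_metadata.csv via article_id for per-article analyses.
```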
Terminology used is generally based on DDI controlled vocabularies: Time Method, Analysis Unit, Sampling Procedure and Mode of Collection, available at CESSDA Vocabulary Service.
Methodology
Data collection period
2020 – 2021
Country
Sweden
Time dimension
Longitudinal
Analysis unit
Media unit: Text
Universe
News articles
Sampling procedure
An open source web scraper scraped news articles from 19 Swedish news sites every two hours. Code in Python for the web scraper is available at: https://github.com/peterdalle/mechanicalnews
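For illustration, a minimal sketch of the collection approach described above (front-page links gathered every two hours). This is not the mechanicalnews implementation; the site subset and parsing details are assumptions:

```python
# Illustrative sketch only: fetch front pages, collect article links, repeat every two hours.
import time
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

FRONT_PAGES = ["https://www.svt.se/", "https://www.dn.se/"]  # subset of the 19 sites

def collect_links(url):
    """Return the set of absolute links found on a front page."""
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    return {urljoin(url, a["href"]) for a in soup.find_all("a", href=True)}

while True:
    for page in FRONT_PAGES:
        links = collect_links(page)
        print(f"{page}: {len(links)} links found")
        # In a full pipeline, each new link would be fetched, parsed,
        # and stored together with the observation timestamp.
    time.sleep(2 * 60 * 60)  # two-hour collection interval, as in the study
```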
Total universe/Complete enumeration
Kind of data
Not available
Data collection mode
Other
Access
Publisher
Swedish National Data Service
Publication year
2021
Terms of data access
Access to data is provided through SND. Data are freely accessible.