Working with Large Online Datasets for Social Science Research (using R)

Instructor: Lukasz Walasek

Modality: In person

Week 1: 10-14 August 2026

 

Workshop contents and objectives

The rapid growth of digital communication, online social platforms, and publicly available web data sources has created new opportunities for social scientists to study behavior, attitudes, and social phenomena on an unprecedented scale. Online data, whether accessed through APIs, publicly available datasets, social media archives, repositories, or web-scraping tools, can provide unique insights into complex topics such as wellbeing, risk perception, health communication, consumer behavior, and misinformation.

This course introduces participants to the foundational concepts, tools, and research practices needed to design and conduct social science projects using large online datasets. It prioritizes practical knowledge, with an emphasis on research design, ethical considerations, discovering online data sources, and introductory methods for accessing, wrangling, and analyzing rich quantitative and qualitative datasets.

By the end of this course, participants will be able to:

  1. Critically evaluate how online data sources can be applied effectively in social science research.
  2. Identify, access, and extract valuable data from various online sources.
  3. Implement data access using APIs and web scraping tools, focusing on ethical, robust, and sustainable data analysis pipelines.
  4. Apply best practices for data wrangling, documentation, and reproducible research workflows.

 

Workshop design

Each day of the course will consist of two parts:

  • The morning session is dedicated to foundational concepts and approaches. We'll cover research design, data sources, and ethics, complemented by live demonstrations in R.
  • The afternoon session will feature guided exercises and individual project work. Every participant will have the opportunity to apply their new knowledge to construct their own research pipeline using large online datasets.

During the course, participants are also welcome to work on their own projects. The instructor will happily assist with any individual project that requires skills and knowledge covered during the course.

Materials (lecture slides, sample datasets, handouts, exercises with solutions, annotated R scripts) will be made openly available via an online repository to all participants.

 

Detailed lecture plan (daily schedule)

Day 1 – Opportunities and Challenges of Large Online Data in Social Science

Morning:

  • What is “large online data”? (textual records, metadata, social media, digital trace data, APIs, web archives, open datasets)
  • Developing research questions with large online datasets
  • Case studies from political science, psychology, economics, public health
  • Ethical considerations: digital footprint data, consent, terms of service, reproducibility, anonymity

Afternoon:

  • Exploring existing online datasets using R
  • Introduction to common data formats (CSV, JSON, XML)
  • Exercise: locating and evaluating online datasets for your research question

Day 2 – Obtaining Data Using APIs

Morning:

  • Understanding APIs
  • Basic structure of an API request: endpoints, parameters, authentication

Afternoon:

  • Accessing simple public APIs in R
  • Working with JSON data
  • Exercise: retrieving, parsing, and visualising API-based datasets

Day 3 – Working with Large Online Data: Wrangling, Cleaning, and Preparation

Morning:

  • Typical challenges: duplicates, unstructured text, nested JSON, missing values
  • Tidy and reproducible data preparation principles for online datasets
  • Merging text and numeric data in R

Afternoon:

  • Data wrangling workshop: developing robust functions to explore and process large online datasets
  • Basic text pre-processing pipelines
  • Mini-challenge exercise: clean a messy dataset about human behaviour and gain insights from it

Day 4 – When APIs Aren’t Enough: Web Scraping

Morning:

  • When scraping is appropriate and when it is not
  • HTML/CSS structure, selectors, and the process of web scraping
  • Polite scraping: rate limiting, data protection, legal considerations

Afternoon:

  • Using rvest in R for simple scraping tasks
  • Extracting structured content from static pages
  • Practical challenge exercise: crawling through web store data and building a new database

Day 5 – Bringing It All Together: Complete Analysis Pipeline with Large Online Data

Morning:

  • Designing a robust data workflow
  • Strategies for documentation, transparency, and reproducibility
  • Introduction to exploratory analysis and visualisation

Afternoon:

  • Designing your own data workflow and project pipeline
  • Mini project presentations

 

Class materials

All materials will be provided online.

 

Prerequisites

The course is suitable for beginners wanting to explore online data in an applied, conceptual, and practical way.

Participants are expected to have basic computer and statistical analysis skills. Basic familiarity with R is necessary to participate in practical exercises and activities. 

Recommended readings or preliminary material

  • Altman, S., Behrman, B., & Wickham, H. (2021). Data Wrangling. https://dcl-wrangle.stanford.edu/
  • Bradley, A., & James, R. J. E. (2019). Web scraping using R. Advances in Methods and Practices in Psychological Science, 2(3), 264-270.
  • Wickham, H. (2022). rvest: Easily Harvest (Scrape) Web Pages. https://rvest.tidyverse.org/

Lukasz Walasek

University of Warwick, UK

Dr Lukasz Walasek is an associate professor at the Department of Psychology, University of Warwick, UK. He completed a PhD and an MSc in Psychology at the University of Essex and a BSc in Psychosocial Sciences at the University of East Anglia. Dr Walasek teaches “Behavioural Change: Nudging and Persuasion” on the MSc in Behavioural Economic Science and the MSc in Behavioural Data Science.

In his research, Dr Walasek applies insights from data science to study how people make everyday decisions and judgments. His most recent work uses data mining and natural language processing to study topics such as implicit bias, self-control, gambling-related harm, food choice, the effects of inequality on consumption, and the dynamics of political polarization.
