This repository is a collection of teaching materials from a workshop on how to gather data from the internet using the programming language R. Specifically, it covers two main skills: pulling tweets and other information directly from Twitter, and extracting data from websites. The repo contains sample code and slides designed to introduce people to web scraping, the practice of automatically collecting information from online sources rather than manually copying and pasting.

The code examples are intentionally simple ("toy examples") so they're easier to follow if you're new to this kind of work. They demonstrate both how to connect to Twitter and pull tweets down, and how to grab structured data like tables from web pages, as well as messier, semi-organized content embedded in HTML code.

This would be useful for researchers, journalists, or analysts who need to collect large amounts of data from the web for analysis. For instance, a political researcher might want to scrape thousands of tweets about an election, or a journalist might need to extract financial data from multiple news websites. Instead of doing this by hand, these techniques let you write a script that does it automatically, saving hours of tedious work.

The workshop was put together for the NYU Politics Data Lab back in 2013, so the materials are oriented toward people in academic or political research contexts. Keep in mind that web scraping tools and APIs change frequently, and Twitter's policies around data collection have shifted significantly since then, so some of the specific code may need updating if you're trying to use it today.
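To make the difference between structured tables and semi-organized HTML content concrete, here is a minimal sketch in R. It is not taken from the workshop materials: it assumes the `rvest` package (version 1.0 or later) is installed, which may differ from whatever packages the 2013 code uses, and the HTML snippet, the `headlines` class, and the variable names are all hypothetical. The page is inlined as a string so the sketch runs without network access.

```r
# Assumption: rvest (>= 1.0) is installed. The workshop's own code may rely
# on different packages; this is only an illustration of the idea.
library(rvest)

# Hypothetical page content, inlined so no network request is needed.
html <- '<html><body>
<table>
  <tr><th>candidate</th><th>votes</th></tr>
  <tr><td>A</td><td>120</td></tr>
  <tr><td>B</td><td>95</td></tr>
</table>
<ul class="headlines">
  <li>First story</li>
  <li>Second story</li>
</ul>
</body></html>'

page <- read_html(html)

# Structured data: convert the first <table> element into a data frame.
results <- html_table(html_element(page, "table"))

# Semi-structured data: select elements with a CSS selector and keep their text.
headlines <- html_text2(html_elements(page, "ul.headlines li"))

print(results)
print(headlines)
```

For a live site, `read_html()` can also take a URL in place of the inline string, although the selectors would need to match that page's actual markup.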
Verify against the repo before relying on details.