GHTorrent tutorial
by Georgios Gousios and Diomidis Spinellis
This is the web page for the GHTorrent tutorial given as an ICSE 2017 technical briefing.
Contents
We plan to cover at least the following aspects:
- GitHub data collection strategies, including querying the API and using online services such as GitHub Archive and GHTorrent.
- Using GHTorrent to sample appropriate repositories for various types of research questions.
- Writing, managing, and optimizing complex and expensive relational queries on GHTorrent relational data.
- Using GHTorrent effectively: understanding the data collection challenges and avoiding common pitfalls.
- Copyright and privacy issues when using the GitHub data.
Downloads
Our tutorial will be based on the following data sources:
- SQLite3 database. You can use this directly with a command line or a graphical SQLite3 database editor.
- MongoDB DB dump. To restore this, you will need a running MongoDB.
To create them, we ran the default GHTorrent data collection process on the ReactiveX/rxjs project. The data are current as of Feb 14.
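As a quick sanity check after downloading the SQLite3 database, you may want to see which tables it contains and how many rows each one holds. The following is a minimal sketch using Python's standard `sqlite3` module. The function name `table_row_counts` and the file name in the trailing comment are hypothetical placeholders, not part of the tutorial material; substitute the path of the file you actually downloaded.

```python
import sqlite3


def table_row_counts(db_path):
    """Return a {table_name: row_count} dict for every user table in a SQLite file.

    Note: sqlite3.connect() creates an empty database if db_path does not
    exist, so an empty result may simply mean the path is wrong.
    """
    conn = sqlite3.connect(db_path)
    try:
        # sqlite_master lists all schema objects; keep only ordinary tables
        # and skip SQLite's internal bookkeeping tables.
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
        names = [name for (name,) in rows if not name.startswith("sqlite_")]

        counts = {}
        for name in names:
            # Quote the identifier so unusual table names do not break the query.
            quoted = '"' + name.replace('"', '""') + '"'
            counts[name] = conn.execute(
                f"SELECT COUNT(*) FROM {quoted}"
            ).fetchone()[0]
        return counts
    finally:
        conn.close()


# Example call (the file name is an assumption):
# for table, n in table_row_counts("rxjs-ghtorrent.db").items():
#     print(f"{table}: {n}")
```

The same information is available from the `sqlite3` command-line shell via `.tables` followed by `SELECT COUNT(*)` queries; the script form is just convenient when you want the counts for all tables at once.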
Before you come to the Technical Briefing
We plan to make the technical briefing an interactive session where we will work together on common GHTorrent repository mining tasks. To make the session as productive as possible, please do the following beforehand:
- Bring your (macOS/Linux/Unix/Bash on Ubuntu on Windows) laptop!
- Install SQLite3 and download our SQLite3 database
- Clone simple-rolap and rdbunit
- Install rdbunit
- Install Git, make, Python, and GraphViz
Have your say!
What would you like to see covered in the technical briefing?