Replication package for our work on "Taxing Collaborative Software Engineering"
This replication package requires Python 3.10 or higher. Install the dependencies via:
python3 -m pip install -r requirements.txt
For faster loading, we recommend optionally installing orjson via pip:
python3 -m pip install orjson
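Because orjson is optional, code that reads the collected JSON has to work with or without it. The following is a minimal sketch of one common way to do that, assuming a fallback to the standard library's json module; the helper name load_json is hypothetical, and the package's own loading code may differ.

```python
import json

try:
    import orjson  # optional dependency: faster JSON parsing
except ImportError:
    orjson = None  # fall back to the standard library below


def load_json(raw: bytes):
    """Parse JSON bytes with orjson if it is installed, else with json."""
    if orjson is not None:
        return orjson.loads(raw)
    return json.loads(raw)


# Hypothetical payload, only to show the call.
print(load_json(b'{"event": "commented", "id": 1}'))
```

Both parsers accept bytes and return the same Python objects for this input, so the rest of the code does not need to know which one was used.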
First, we collect all timelines from all pull requests at a GitHub instance. crawl.py requires an <api_token> for your GitHub instance and an <out_dir> where the results are stored:
python3 crawl.py <api_token> <out_dir>
crawl.py also provides the following optional command line arguments:
- --api_url for the GitHub instance URL (default: https://api.github.com)
- --disable_cache for disabling caching (not recommended for larger instances)
- --num_workers for the number of parallel processes (default: 1)
- --organization for limiting the crawl to one organization (helpful for organizations hosted on github.com)
To list all options in detail, run:
python3 crawl.py --h
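When the crawler is started from a script rather than typed by hand, the command line can be assembled programmatically. The sketch below only uses the arguments named above; the helper build_crawl_command is hypothetical. It assumes that --disable_cache is a switch without a value and that the other options each take one value, so check the help output before relying on it.

```python
import shlex


def build_crawl_command(api_token, out_dir, api_url=None, organization=None,
                        num_workers=None, disable_cache=False):
    """Assemble the argument list for crawl.py (hypothetical helper)."""
    cmd = ["python3", "crawl.py", api_token, out_dir]
    if api_url is not None:
        cmd += ["--api_url", api_url]
    if organization is not None:
        cmd += ["--organization", organization]
    if num_workers is not None:
        cmd += ["--num_workers", str(num_workers)]
    if disable_cache:
        # Assumption: a flag without a value.
        cmd.append("--disable_cache")
    return cmd


# Placeholder token, directory, and organization name.
cmd = build_crawl_command("<api_token>", "out", organization="my-org", num_workers=4)
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) to start the crawl.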
For this step, you will need:
- The directory of the previously collected data; and,
- A mapping of users to countries. This can be either a dict for a static mapping (does not capture changes in the users' location over time) or a monthly sampled data frame for a time-dependent mapping (captures changes in the users' location over time).
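The two mapping variants described above could look as follows. This is only an illustrative sketch: the user names, the country codes, the data frame layout (months as index, users as columns), and the helper country_of are all assumptions, and the notebook's inline comments remain the reference for the expected format.

```python
import pandas as pd

# Static mapping: one country per user, constant over time (hypothetical users).
static_mapping = {"alice": "DE", "bob": "SE"}

# Time-dependent mapping: one row per month, one column per user (assumed layout).
months = pd.period_range("2023-01", "2023-03", freq="M")
time_dependent_mapping = pd.DataFrame(
    {"alice": ["DE", "DE", "SE"], "bob": ["SE", "SE", "SE"]},
    index=months,
)


def country_of(user, month, mapping):
    """Look up a user's country for a given month in either mapping type."""
    if isinstance(mapping, dict):
        return mapping[user]  # static: the month is ignored
    return mapping.loc[pd.Period(month, freq="M"), user]


print(country_of("alice", "2023-03", static_mapping))
print(country_of("alice", "2023-03", time_dependent_mapping))
```

In this example, the static mapping always reports the same country for alice, while the time-dependent mapping reflects her relocation in March 2023.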
Run notebook.ipynb. Look out for the instructions given as inline comments.
Copyright © 2023 Michael Dorner.
This work is licensed under the MIT license.