Overview
The InterPlanetary File System (IPFS) is a peer-to-peer (P2P) data storage and retrieval network. Structurally, it exhibits features of both structured and unstructured networks: a Kademlia-based distributed hash table (DHT) stores lists of providing peers for each data item, but requests for data are first flooded to already-connected peers in an unstructured fashion; only if that fails is the content located through the DHT in a more precisely scoped lookup.
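A minimal sketch of this two-step retrieval is shown below. The peer and DHT objects and their methods (peer.want, dht.find_providers) are hypothetical stand-ins for the actual Bitswap and DHT machinery, not the real IPFS API.

    # Sketch of the retrieval flow described above; all objects and method
    # names are hypothetical stand-ins, not the real IPFS interfaces.
    def retrieve(cid, connected_peers, dht):
        # Step 1: opportunistically ask peers we are already connected to
        # (unstructured "flooding" of the request).
        for peer in connected_peers:
            block = peer.want(cid)
            if block is not None:
                return block
        # Step 2: on failure, resolve provider records via the Kademlia DHT
        # and fetch the data from one of the returned providers.
        for provider in dht.find_providers(cid):
            block = provider.want(cid)
            if block is not None:
                return block
        return None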
We have multiple years of recorded requests from the IPFS network, including information about the clients and the geolocation of the request origins. The resulting dataset is large and still needs to be analyzed. Furthermore, data collection is ongoing and can be augmented if necessary.
Goal
Derive useful information from the dataset.
Tasks
- Understand the dataset and how we capture it. Understand its limitations.
- Come up with interesting queries about the dataset.
- Engineer solutions that incrementally (i.e., by batch-processing newly arriving data) derive answers to these queries from the dataset (see the sketch after this list).
- Implement said solutions to automatically(!) derive insights from the dataset and visualize them.
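As an illustration of the incremental batch processing mentioned above, the following sketch folds newly arrived log files into a running per-day request count. The file locations, the line-delimited JSON log format, and the timestamp field are assumptions made for the example, not properties of the actual dataset.

    # Incremental batch job (hypothetical paths and log format): each run
    # processes only request-log files not seen in previous runs and merges
    # them into a running per-day request count.
    import json
    from pathlib import Path

    LOG_DIR = Path("/data/ipfs-requests")      # hypothetical location of raw logs
    STATE_FILE = Path("processed_files.txt")   # remembers which files were seen
    COUNTS_FILE = Path("daily_counts.json")    # running aggregate

    def load_state():
        return set(STATE_FILE.read_text().split()) if STATE_FILE.exists() else set()

    def run_batch():
        seen = load_state()
        counts = json.loads(COUNTS_FILE.read_text()) if COUNTS_FILE.exists() else {}
        new_files = [p for p in sorted(LOG_DIR.glob("*.json")) if p.name not in seen]
        for path in new_files:
            with path.open() as f:
                for line in f:                      # one JSON request record per line (assumed)
                    record = json.loads(line)
                    day = record["timestamp"][:10]  # assumed ISO-8601 timestamp field
                    counts[day] = counts.get(day, 0) + 1
            seen.add(path.name)
        COUNTS_FILE.write_text(json.dumps(counts, indent=2))
        STATE_FILE.write_text("\n".join(sorted(seen)))

    if __name__ == "__main__":
        run_batch()

The same pattern extends to other queries: keep a small state file recording which raw inputs have been processed, and update the aggregates rather than recomputing them from scratch.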
Requirements
- Proficiency with Linux and familiarity with the command line. Data analysis will almost exclusively run on remote servers due to the volume of the data.
- Motivation for the topic. Don't pick this topic if all you need is a thesis project.
- Knowledge of common data formats such as JSON and CSV.
- Knowledge of Python (and R) for data processing and visualization. The final visualizations etc. should be produced with R, but it is possible to learn enough R within half a year to make this work.
Literature
- https://arxiv.org/abs/2104.09202
- https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=9142764