Automated Crawling and Visualization of Libp2p Kademlia DHTs

Supervisor: Leonhard Balduf
KOM-ID: KOM-B-0718
Link to the topic announcement

Problem Description

We maintain a crawler for libp2p-based Kademlia DHTs, which we use to crawl IPFS.
We are able to enumerate all DHT servers, i.e., all nodes participating as server nodes in the DHT.
We can then visualize data about these nodes, such as agent versions, churn, geolocation, etc.

The crawler, however, is generic over all DHTs implemented using go-libp2p-kad-dht.
There are many projects using this implementation, which we could potentially crawl.
They differ in a) their bootstrap nodes, and b) their protocol identifiers.
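As a rough illustration (not the crawler's actual code), the following Go sketch shows how a go-libp2p-kad-dht client could be pointed at a different network by overriding exactly these two parameters. The protocol prefix and the bootstrap multiaddress are made-up placeholders and would have to be replaced with the target network's real values.

    package main

    import (
        "context"
        "log"

        "github.com/libp2p/go-libp2p"
        dht "github.com/libp2p/go-libp2p-kad-dht"
        "github.com/libp2p/go-libp2p/core/peer"
    )

    func main() {
        ctx := context.Background()

        // Plain libp2p host; the crawler talks to peers through hosts like this.
        h, err := libp2p.New()
        if err != nil {
            log.Fatal(err)
        }
        defer h.Close()

        // Placeholder bootstrap node -- replace with a real multiaddress of the target network.
        ai, err := peer.AddrInfoFromString("/dns4/bootstrap.example.org/tcp/4001/p2p/<peer-id>")
        if err != nil {
            log.Fatal(err)
        }

        // The two per-network knobs: protocol identifier prefix and bootstrap peers.
        d, err := dht.New(ctx, h,
            dht.Mode(dht.ModeClient),
            dht.ProtocolPrefix("/examplenet"), // placeholder protocol prefix
            dht.BootstrapPeers(*ai),
        )
        if err != nil {
            log.Fatal(err)
        }

        // Connect to the bootstrap peers and fill the routing table.
        if err := d.Bootstrap(ctx); err != nil {
            log.Fatal(err)
        }
    }

Adding another network then amounts to supplying a new pair of bootstrap addresses and protocol prefix rather than new crawler code.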

Main Objective

On well-connected Linux servers, set up multiple instances of the crawler to enumerate the peers of as many networks as possible.
Set up automated data wrangling and visualization pipelines that display the results on a statically-generated website.
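As one hedged example of what a wrangling stage could look like (the existing evaluation scripts are written in R; this Go sketch merely assumes a hypothetical crawl dump with one JSON object per line containing an agent_version field), a single step might aggregate agent versions into a CSV that the website's plots can consume:

    package main

    import (
        "bufio"
        "encoding/csv"
        "encoding/json"
        "log"
        "os"
        "strconv"
    )

    // peerRecord mirrors an assumed line format of the crawl dump; the real
    // crawler output may look different.
    type peerRecord struct {
        AgentVersion string `json:"agent_version"`
    }

    func main() {
        counts := make(map[string]int)

        // Read newline-delimited JSON from stdin and count peers per agent version.
        sc := bufio.NewScanner(os.Stdin)
        for sc.Scan() {
            var rec peerRecord
            if err := json.Unmarshal(sc.Bytes(), &rec); err != nil {
                continue // skip malformed lines
            }
            counts[rec.AgentVersion]++
        }
        if err := sc.Err(); err != nil {
            log.Fatal(err)
        }

        // Emit a CSV that a plotting script or static site generator can pick up.
        w := csv.NewWriter(os.Stdout)
        w.Write([]string{"agent_version", "peers"})
        for version, n := range counts {
            w.Write([]string{version, strconv.Itoa(n)})
        }
        w.Flush()
        if err := w.Error(); err != nil {
            log.Fatal(err)
        }
    }

Every crawled network would feed through the same stages, so the pipeline, once automated, scales with the number of networks without per-network code.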

Optional Bonus Goals

  • Extend the visualization and data science scripts.
  • Realize the setup in a containerized environment, in a way that is easily extensible with new networks.
  • Save the results of the crawls into a PostgreSQL database for compact storage and easier querying (see the sketch after this list).
  • Your own ideas.
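For the PostgreSQL goal, a minimal sketch might look like the following; the table layout, column names, connection string, and example values are assumptions for illustration, not a prescribed schema:

    package main

    import (
        "database/sql"
        "log"
        "time"

        _ "github.com/lib/pq" // PostgreSQL driver
    )

    // Assumed schema for illustration only -- one row per peer and crawl.
    const schema = `
    CREATE TABLE IF NOT EXISTS crawled_peers (
        id            BIGSERIAL   PRIMARY KEY,
        crawl_started TIMESTAMPTZ NOT NULL,
        network       TEXT        NOT NULL,
        peer_id       TEXT        NOT NULL,
        agent_version TEXT,
        dialable      BOOLEAN     NOT NULL
    );`

    func main() {
        // Placeholder connection string.
        db, err := sql.Open("postgres", "postgres://crawler@localhost/crawls?sslmode=disable")
        if err != nil {
            log.Fatal(err)
        }
        defer db.Close()

        if _, err := db.Exec(schema); err != nil {
            log.Fatal(err)
        }

        // Hypothetical insert of a single crawled peer (placeholder values).
        _, err = db.Exec(
            `INSERT INTO crawled_peers (crawl_started, network, peer_id, agent_version, dialable)
             VALUES ($1, $2, $3, $4, $5)`,
            time.Now(), "ipfs", "12D3KooW...", "kubo/0.29.0", true,
        )
        if err != nil {
            log.Fatal(err)
        }
    }

Storing one row per peer and crawl in this style is what would make the "easier queries" part possible with plain SQL.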

Prerequisites

You must

  • Have intrinsic motivation to look at P2P networks.
  • Be able to fluently work on remote Linux machines via SSH.
  • Understand Go code.
  • Be able to administer Linux machines and document the setup.

It would probably be helpful if

  • You understood how DHT crawlers work in general.
  • You knew Chinese, since many of the projects have Chinese source code comments or documentation.
  • You knew R, since the evaluation scripts are in R.
  • You were well-versed in SQL and design of database schemas.

Bachelor/Master?

If you pursue this as a Bachelor's thesis, the main objective and any number of the optional goals apply.

If you pursue this as a Master's thesis, you'll have to significantly extend the data processing and visualization frameworks to derive new, interesting results from the datasets. Additionally, you'll of course have to implement the setup cleanly, document it well, and realize more of the optional goals. I'd also expect you to come up with your own optional goals.