Introduction

The growing complexity of cancer diagnosis and treatment requires data sets larger than those available in any single hospital, or even in cancer registries. However, sharing patient data is difficult because of patient privacy and data protection requirements. Privacy-preserving distributed learning technology has the potential to overcome these limitations.

The general idea behind distributed learning is that sites share a (statistical) model and model parameters instead of sharing sensitive data: each site runs computations on a local data store to generate these aggregated statistics. In this setting, organisations can collaborate by exchanging aggregated data/statistics while keeping the underlying data safely on site and undisclosed.
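As a toy illustration of this idea (plain Python, independent of VANTAGE6, with made-up numbers), the sketch below shows per-site aggregation and central combination of a simple statistic, the mean:

```python
# Hypothetical per-site patient records; these never leave the site.
site_records = {
    "site_a": [61, 58, 70, 66],  # e.g. ages of patients at site A
    "site_b": [55, 63, 59],      # ages of patients at site B
}

def local_statistics(records):
    """Runs at a site: returns only aggregates, never raw records."""
    return {"n": len(records), "sum": sum(records)}

# Only these small dictionaries of aggregates are shared.
shared = {site: local_statistics(r) for site, r in site_records.items()}

# A coordinator combines the aggregates into a global statistic.
total_n = sum(s["n"] for s in shared.values())
global_mean = sum(s["sum"] for s in shared.values()) / total_n
print(f"global mean age: {global_mean:.1f} over {total_n} patients")
```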

The infrastructure

Collaboration through distributed learning requires an infrastructure. The open source software VANTAGE6 provides this infrastructure. Conceptually, it consists of the following parts:

  1. A central server that coordinates communication with the nodes;
  2. One or more nodes that execute algorithms (encapsulated in Docker images) and return the output;
  3. Organisations that are interested in collaborating with each other;
  4. Collaborations between organisations;
  5. Users (i.e. researchers) that instruct the nodes which algorithms to execute and the parameters to use;
  6. A Docker registry that functions as a database of algorithms.

(Figure: infrastructure overview)

The central server

The central server handles administrative tasks such as authentication and authorization. It keeps track of tasks and communicates new tasks and results to the nodes and users. It does so through a RESTful API, with some help from WebSockets.
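As an illustration, a user might talk to the server with plain HTTP requests along the following lines. Note that the server URL, endpoint paths and payloads below are assumptions for the sake of the example, not the definitive API:

```python
import requests

SERVER = "https://server.example.com"  # placeholder URL, not a real server

# 1. Authenticate to obtain a token (authentication and authorization
#    are handled centrally by the server).
resp = requests.post(
    f"{SERVER}/api/token/user",  # assumed endpoint path, for illustration
    json={"username": "researcher", "password": "secret"},
)
token = resp.json()["access_token"]
headers = {"Authorization": f"Bearer {token}"}

# 2. Ask the server which tasks it is keeping track of.
tasks = requests.get(f"{SERVER}/api/task", headers=headers).json()
print(f"the server knows about {len(tasks)} task(s)")
```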

Nodes & users

Users and nodes are organised in organisations and collaborations: for organisations to work together, they organise themselves into a collaboration. Within the collaboration, the parties agree on which data to use, how the data should be formatted, and which algorithms can and will be applied.

Within a collaboration, each participating site (i.e. each organisation) runs a node that can access a local data store (e.g. a database or a CSV-file) and connects to the central server.

Researchers (i.e. users) can instruct the nodes within a collaboration to execute an algorithm (in the broadest sense of the word) by uploading “tasks” to the central server. These tasks are picked up by the nodes, executed and the results are returned to the user (via the central server). In order to maintain privacy, the results should only consist of aggregated statistics.

When the researcher has obtained the results from all participating sites, they can, for example, combine the statistics from all of the sites into a single result.
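For instance, a researcher might pool per-site estimates with inverse-variance weighting, as in a fixed-effect meta-analysis. The sketch below uses made-up site names and numbers:

```python
import math

# Made-up per-site results: one estimate and its standard error each.
site_results = [
    {"site": "hospital_1", "estimate": 0.42, "se": 0.10},
    {"site": "hospital_2", "estimate": 0.35, "se": 0.08},
    {"site": "hospital_3", "estimate": 0.50, "se": 0.15},
]

# Inverse-variance (fixed-effect) pooling of the site estimates.
weights = [1 / r["se"] ** 2 for r in site_results]
pooled = sum(w * r["estimate"] for w, r in zip(weights, site_results)) / sum(weights)
pooled_se = math.sqrt(1 / sum(weights))
print(f"pooled estimate: {pooled:.3f} (SE {pooled_se:.3f})")
```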

Note that, although the infrastructure itself doesn’t make any assumptions about the way the nodes store the data, the current implementation of the distributed Cox proportional hazards algorithm (github.com) requires all sites/nodes to agree on the exact same data format and only supports CSV.

Docker & Docker registry

In order to support different working environments and provide researchers with as much flexibility as possible with respect to the tools and algorithms they can use, the infrastructure makes use of Docker images to encapsulate the computation requests.

This means researchers are free to use any application/script by embedding it in a Docker image. Docker images are pushed (i.e. uploaded) to a (private) Docker registry that makes them available to the nodes. As such, a task is essentially defined by the following properties:

  • A name
  • The collaboration it applies to
  • The Docker image to execute
  • Input parameters to provide to the Docker image

When executing a task, nodes pull (i.e. download) the image from the registry and run it in a container that has access to the data store.
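To give an impression of what such an encapsulated algorithm might look like, the sketch below reads a mounted CSV data store and writes back only an aggregate. The environment variable names and paths are assumptions for illustration; the actual contract between node and container is defined by the infrastructure:

```python
import csv
import json
import os

# Assumed contract, for illustration only: the node mounts the local
# data store into the container and points the algorithm at it.
data_path = os.environ.get("DATABASE_URI", "/mnt/data/database.csv")
output_path = os.environ.get("OUTPUT_FILE", "/mnt/data/output.json")

# Compute an aggregate locally; only this aggregate leaves the site.
with open(data_path, newline="") as f:
    rows = list(csv.DictReader(f))

result = {"n_records": len(rows)}

with open(output_path, "w") as f:
    json.dump(result, f)
```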

At the moment, the system is based on trust between the parties within a collaboration: the system only places limits on who can create tasks. Additionally, read/write access to the (private) registry at https://docker-registry.distributedlearning.ai has been locked down. However, in the (near) future, security on the node can be strengthened by methods such as: 1) limiting the list of images the node will execute; 2) using Docker Notary to digitally sign images that all parties trust; or 3) using Harbor.

Using the infrastructure

The general process flow, illustrated by the image below, is as follows:

(Figure: process flow)

  1. The user uploads a task to the server, specifying:
    • the Docker image to run
    • the input parameters for the algorithm
  2. The node(s) …
    • retrieve the task from the server
    • pull the corresponding Docker image from the (private) registry
    • execute the Docker image/container while providing it access to the data
    • return the results (aggregated statistics) to the server
  3. The user collects the results and processes them if desired.

By repeating the above steps, it is possible to construct complicated algorithms and iterative processes.
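A hedged sketch of this loop from the researcher's side, using the Python client shipped with the VANTAGE6 package, might look as follows. Method names and signatures are indicative and may differ between versions; the server URL and image name are placeholders:

```python
from vantage6.client import Client

client = Client("https://server.example.com", 443, "/api")  # placeholder server
client.authenticate("researcher", "secret")

# 1. Upload a task: the Docker image to run plus its input parameters.
task = client.task.create(
    collaboration=1,                 # the collaboration it applies to
    organizations=[1, 2],            # the organisations whose nodes should run it
    name="average-age",              # a name
    image="docker-registry.distributedlearning.ai/average",  # placeholder image
    input_={"method": "average", "args": ["age"]},
)

# 2. The nodes retrieve the task, pull the image, run it against their
#    local data stores and post the aggregated results back.
# 3. Collect the results and combine them as desired (the exact
#    retrieval call and any polling depend on the client version).
results = client.result.get(task["id"])
```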

For a more detailed example, see the GitHub repository for the distributed Cox Proportional Hazards algorithm.

Hardware and software requirements

Server

This machine, hosted by IKNL (the Netherlands Comprehensive Cancer Organisation), runs a VANTAGE6 server. Should you be interested in using it (non-commercially), feel free to contact us to discuss the possibilities.

Alternatively, you can always run your own server. Running the central server requires a (virtual) machine that:

  • is accessible from the internet
  • has Python (≥ 3.6) and the VANTAGE6 package installed

For information on how to install a server, please have a look at the docs.

Node

Running a node/site requires a (virtual) machine that has:

  • Python (≥3.6) and the VANTAGE6 package installed
  • Docker CE installed (the user running the node software needs to have the proper permissions to perform docker commands)
  • Access to a local data store
  • Access to the internet and/or central server

Client(s)

In order for researchers to interact with the infrastructure's REST API, they need to issue HTTP requests. This interaction usually takes place within a statistical/machine learning environment, such as R or Python, so to support end users, clients are available that provide access to the most common parts of the API. The VANTAGE6 package contains a client that can be used in Python. For R, a client is available at https://github.com/IKNL/ptmclient. Clients for SAS and STATA are under consideration.

More information

Please see the following sites for more information on distributed learning: