A distributed Supercomputing Project for Early Research

Superclustr is a wide-area distributed cluster independently designed by a group of internet researchers and engineers.
Superclustr will be used for research on parallel and distributed matrix multiplication, with clusters located in:

  • Amsterdam, Netherlands (AMS1)
  • Falkenstein, Germany (control plane)
  • Stuttgart, Germany (STG1)

Superclustr consists of three clusters, all located within the European Union. The first cluster (in Amsterdam) contains 3 Always-On Nodes and 1 On-Demand Node
that can be booted up to scale with workload demand; the control plane cluster (in Falkenstein) contains no worker nodes; the remaining cluster (in Stuttgart) has 1 On-Demand Node (5 worker nodes in total).
The system was built in-house and runs a RAM-only flavor of the Rocky Linux (Enterprise Linux) operating system with custom proprietary software accelerators.

The Falkenstein cluster, known as the control plane cluster, is designed to provide centralized control and management. As the heart of Superclustr,
it manages the orchestration of containers across the entire cluster network. Overseeing all 5 worker nodes, it handles all communications within the cluster system, monitoring the health of each
node and automating the deployment, scaling, and management of applications. Additionally, leveraging a distributed Ceph storage infrastructure, the cluster stores and serves all necessary images
and model checkpoints to all nodes over a highly redundant, high-speed, low-latency connection.
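
As an illustration of the node-health view this enables, the sketch below queries the Ready condition of every node with the official Kubernetes Python client. This is not the project's own tooling, and a valid kubeconfig for the control plane is assumed:

    # Sketch: list the Ready status of all nodes via the Kubernetes API.
    # Assumes a kubeconfig with access to the Superclustr control plane.
    from kubernetes import client, config

    config.load_kube_config()      # load credentials from ~/.kube/config
    v1 = client.CoreV1Api()

    for node in v1.list_node().items:
        # Each node reports a list of conditions; "Ready" is the health signal.
        ready = next(c.status for c in node.status.conditions if c.type == "Ready")
        print(f"{node.metadata.name}: Ready={ready}")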

Superclustr is built using Kubeflow, a machine learning toolkit for Kubernetes, making it easy to deploy and manage scalable machine learning workflows across the three clusters.
Kubeflow ensures that all components are smoothly integrated and running as expected, providing developers and data scientists with a user-friendly interface to deploy and monitor
their machine learning models.
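
As a rough illustration, a minimal pipeline could be defined and submitted with the Kubeflow Pipelines SDK along the following lines (assuming the v1 kfp SDK; the container image and endpoint are hypothetical placeholders, not the project's actual values):

    # Sketch: define and submit a one-step pipeline with the kfp v1 SDK.
    import kfp
    from kfp import dsl

    @dsl.pipeline(name="matmul-benchmark",
                  description="Distributed matrix multiplication benchmark")
    def matmul_pipeline(matrix_size: int = 4096):
        dsl.ContainerOp(
            name="matmul",
            image="registry.example.org/superclustr/matmul:latest",  # hypothetical
            arguments=["--size", matrix_size],
        )

    if __name__ == "__main__":
        client = kfp.Client(host="https://kubeflow.example.org")  # hypothetical
        client.create_run_from_pipeline_func(matmul_pipeline,
                                             arguments={"matrix_size": 4096})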

Each of these clusters is designed to interact seamlessly with the others, working together to handle diverse tasks within the system.
Our system is able to withstand unexpected fatal errors during active computation and resume operation without data loss. This design ensures that the overall system
remains operational even when individual nodes or clusters encounter issues, enhancing the robustness and reliability of Superclustr.
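
The page does not describe the exact recovery mechanism, but the general pattern behind resume-without-data-loss is periodic checkpointing to shared storage with atomic writes, so a restarted node continues from the last saved step. The path below is illustrative only:

    # Sketch: resume-after-failure via periodic, atomically written checkpoints.
    import os
    import pickle

    CHECKPOINT = "/mnt/ceph/checkpoints/job42.pkl"  # hypothetical shared-storage path

    def load_state():
        if os.path.exists(CHECKPOINT):
            with open(CHECKPOINT, "rb") as f:
                return pickle.load(f)           # resume from the last saved step
        return {"step": 0, "partial": 0.0}      # fresh start

    def save_state(state):
        tmp = CHECKPOINT + ".tmp"
        with open(tmp, "wb") as f:
            pickle.dump(state, f)
        os.replace(tmp, CHECKPOINT)             # atomic rename: no torn checkpoints

    state = load_state()
    for step in range(state["step"], 1000):
        state["partial"] += step                # stand-in for the real computation
        state["step"] = step + 1
        if step % 100 == 0:
            save_state(state)                   # periodic checkpoint
    save_state(state)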

The Superclustr system is fully independently funded by the technical staff and the participating researchers.

Access to the Superclustr system is exclusively reserved for researchers and individuals affiliated with non-profit organizations or not-for-profit entities recognized
within the European Union, as well as unaffiliated individuals. Superclustr nodes are operated only by Superclustr staff; third parties do not have access to the system.
Please note that the indicated rates apply only to these individuals and entities. For-profit entities are not eligible for access to the system.

Power Consumption

While using Superclustr you only pay for the energy consumed, in kWh. Superclustr relies almost entirely on renewable energy,
so consumption is dynamically priced. With a flexible price limit you can run your operations on Superclustr cost-effectively.

One node in the cluster can draw a maximum of 3.2 kW under full load (i.e., consume up to 3.2 kWh per hour). During underutilization or periods of undesirably high energy prices,
Superclustr's flexible price limit will pause your queued tasks ahead of time and put idle nodes into hibernation mode.
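
A rough sketch of the price-limit logic is given below; the price feed and node-control calls are hypothetical stand-ins, not Superclustr's actual interfaces:

    # Sketch: pause queued work and hibernate idle nodes above a price limit.
    import time

    PRICE_LIMIT_EUR_PER_KWH = 0.30           # user-chosen limit (illustrative)

    def current_price_eur_per_kwh() -> float:
        return 0.25                          # placeholder; would query the price feed

    def set_queue_paused(paused: bool) -> None:
        print("queue paused:", paused)       # placeholder for the real scheduler call

    def hibernate_idle_nodes() -> None:
        print("hibernating idle nodes")      # placeholder for the real node control

    while True:
        over_limit = current_price_eur_per_kwh() > PRICE_LIMIT_EUR_PER_KWH
        set_queue_paused(over_limit)         # pause queued tasks ahead of time
        if over_limit:
            hibernate_idle_nodes()           # idle nodes stop drawing power
        time.sleep(300)                      # re-evaluate every five minutes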

Compare Superclustr to Google Cloud with the Google Cloud Pricing Calculator.

Technical Specification

Always-On Servers:

On-Demand Servers:

Each cluster is interconnected via a secure WireGuard tunnel forwarded over the public internet. In addition, Gigabit Ethernet is used as the OS network (file transport and high-speed
interconnect between the nodes). All clusters are connected through a unicast control plane server in Falkenstein, so the entire system can be used as a 5-node wide-area distributed
cluster.
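
For illustration, each gateway's tunnel could be described by a WireGuard configuration along the following lines, rendered here with a small Python helper; every key, address, and endpoint is a placeholder, not taken from the actual deployment:

    # Sketch: render a per-gateway WireGuard config pointing at the control plane.
    import textwrap

    WG_TEMPLATE = textwrap.dedent("""\
        [Interface]
        PrivateKey = {private_key}
        Address = {tunnel_ip}/24

        [Peer]
        # Falkenstein control plane gateway (placeholder endpoint)
        PublicKey = {control_plane_pubkey}
        Endpoint = cp.superclustr.example:51820
        AllowedIPs = 10.42.0.0/16
        PersistentKeepalive = 25
        """)

    def render_wg_config(private_key: str, tunnel_ip: str,
                         control_plane_pubkey: str) -> str:
        # Fill in the template for one cluster gateway.
        return WG_TEMPLATE.format(private_key=private_key,
                                  tunnel_ip=tunnel_ip,
                                  control_plane_pubkey=control_plane_pubkey)

    print(render_wg_config("<node-private-key>", "10.42.1.1", "<cp-public-key>"))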

Superclustr is a wide-area distributed cluster built by a group of internet researchers and engineers, who developed this system in their personal capacity, outside of
their official roles, to perform research on parallel and distributed matrix multiplication as well as optimization.
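
In the spirit of that research, here is a minimal sketch of row-block distributed matrix multiplication over MPI (using numpy and mpi4py; this is a generic textbook scheme, not Superclustr's actual code):

    # Sketch: row-block distributed matmul; run e.g. `mpirun -n 5 python matmul.py`.
    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()

    n = 1200                            # matrix dimension (illustrative)
    assert n % size == 0
    rows = n // size                    # rows of A handled by each process

    A = B = None
    if rank == 0:
        A = np.random.rand(n, n)
        B = np.random.rand(n, n)

    local_A = np.empty((rows, n))
    comm.Scatter(A, local_A, root=0)    # each process gets a block of rows of A
    B = comm.bcast(B, root=0)           # every process needs all of B

    local_C = local_A @ B               # local partial product

    C = np.empty((n, n)) if rank == 0 else None
    comm.Gather(local_C, C, root=0)     # reassemble the full result on rank 0
    if rank == 0:
        print("done, C[0, 0] =", C[0, 0])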

In addition to the primary research, the system has been built to analyze large amounts of internet measurement data openly shared by the RIPE Atlas project of the
RIPE NCC. Superclustr uses this data to better understand the changing nature of internet infrastructure, which aids networking research and enables deeper insights
into global internet measurement data.

The three clusters were built in June 2023. Below are several pictures of the original 4-node cluster in Amsterdam (installed on 5 June 2023). The pictures illustrate the
integration of the different components. The original 4-node cluster consisted of a single shelf with three 16-inch-high racks, loaded with one On-Demand Server and three
Always-On Servers, in a 36-inch-deep server rack.

Each cluster has a separate file server and gateway server. These machines are regular tower PCs. The file server contains a 960 GB SATA3 data-center SSD. The gateways will be
used to interconnect the three clusters over the public internet with encryption, and to provide a local edge PXE server that caches the latest operating system image from the control plane cluster, as
well as a network cache for larger model checkpoints downloaded from the control plane cluster, which are saved into non-persistent RAM-disk storage on the nodes.
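
To illustrate the staging path, a node-side fetch of a checkpoint through the gateway cache into tmpfs might look like the following; the gateway URL and file names are hypothetical:

    # Sketch: stage a model checkpoint via the edge cache into RAM-backed storage.
    import os
    import shutil
    import urllib.request

    GATEWAY_CACHE = "http://gateway.local:8080/checkpoints/"  # hypothetical cache
    RAM_DISK = "/dev/shm/checkpoints/"                        # non-persistent tmpfs

    def fetch_checkpoint(name: str) -> str:
        os.makedirs(RAM_DISK, exist_ok=True)
        dest = RAM_DISK + name
        with urllib.request.urlopen(GATEWAY_CACHE + name) as resp, \
             open(dest, "wb") as out:
            shutil.copyfileobj(resp, out)   # stream into RAM-disk storage
        return dest

    print("checkpoint staged at", fetch_checkpoint("model-epoch-12.pt"))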

All connectors (on the motherboard and PCI-e cards) are oriented vertically inside the machine to allow horizontal mounting of GPUs. Each node has a Gigabit Ethernet card
connected with a CAT6 cable (capable of 10 Gbps). Each node also has serial, keyboard, and VGA connectors exposed on the I/O panel, so it is easy to attach a keyboard, serial cable,
and monitor to any node.

Each cabinet contains a single unmanaged Gigabit Ethernet switch (the blue box in the space below the first rack from the bottom). All nodes in a cabinet (at most three) are
connected through this single switch.

Each computing node is packaged as an unenclosed metallic framework containing the motherboard (EATX form factor) with six PCI-e slots (one of which is used for the Gigabit Ethernet
adaptor). The motherboard has 4x RJ45 LAN ports, serial ports, USB interfaces, 1 dedicated RJ45 IPMI LAN port, and a VGA connector for integrated graphics via the ASPEED AST2400 BMC.
Additionally, specially dedicated large box fans supply cold air to the cooling channels of each machine in the cabinet.
Below is a picture of one node (without the Gigabit Network card):

Larger pictures are here:

Slides of a talk about Superclustr by Robin Röper can be found here.

Documentation for users of Superclustr can be found here.

Research projects using Superclustr

No research project published yet.

This page is maintained by Robin Röper.
Last modified: 10 July 2023.