A Distributed Supercomputing Project for Early Research
Superclustr is a wide-area distributed cluster independently designed by a group of internet researchers and engineers. Superclustr will be used for research on parallel and distributed matrix multiplication, with clusters located in:
Amsterdam, Netherlands (AMS1)
Stuttgart, Germany (STG1)
Falkenstein, Germany (control plane)
Superclustr consists of three clusters, located within the European Union. The first cluster (in Amsterdam) contains 3 Always-On Nodes and 1 On-Demand Node that can be booted up to scale with workload demand; the control plane cluster (in Falkenstein) contains no worker nodes; and the Stuttgart cluster contains 1 On-Demand Node (5 nodes in total). The system was built in-house and runs a RAM-only flavor of the Rocky Linux enterprise operating system with custom proprietary software accelerators.
The Falkenstein cluster, known as the control plane cluster, is designed to provide centralized control and management. As the heart of Superclustr, it manages the orchestration of containers across the entire cluster network. It handles all communications among the system's 5 nodes, monitoring the health of each node and automating the deployment, scaling, and management of applications. Additionally, leveraging a distributed Ceph Storage infrastructure, the cluster stores and serves all necessary images and model checkpoints to all nodes over a highly redundant, high-speed/low-latency connection.
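As an illustration of how a node might pull a model checkpoint from the control plane's Ceph storage, here is a minimal sketch using the librados Python bindings. The Superclustr deployment details are not published, so the configuration path, pool name, object name, and RAM-disk path below are hypothetical placeholders.

    # Minimal sketch: fetch a model checkpoint object from a Ceph pool
    # using the librados Python bindings (python3-rados).
    # The conffile path, pool name, and object name are hypothetical.
    import rados

    cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
    cluster.connect()
    try:
        ioctx = cluster.open_ioctx("checkpoints")            # hypothetical pool
        size, _ = ioctx.stat("model-latest.ckpt")            # hypothetical object
        data = ioctx.read("model-latest.ckpt", length=size)
        with open("/dev/shm/model-latest.ckpt", "wb") as f:  # RAM-backed scratch space
            f.write(data)
        ioctx.close()
    finally:
        cluster.shutdown()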
Superclustr is built using Kubeflow, a machine learning toolkit for Kubernetes, making it easy to deploy and manage scalable machine learning workflows across the three clusters. Kubeflow ensures that all components are smoothly integrated and running as expected, providing developers and data scientists with a user-friendly interface to deploy and monitor their machine learning models.
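To give a sense of the developer workflow, here is a minimal sketch of how a workflow might be defined with the Kubeflow Pipelines SDK (kfp v2). The component, pipeline name, and workload are illustrative placeholders rather than anything actually deployed on Superclustr.

    # Minimal sketch of a Kubeflow Pipelines (kfp v2) workflow definition.
    # The component logic and names are placeholders, not Superclustr's
    # actual training pipeline.
    from kfp import dsl, compiler

    @dsl.component(base_image="python:3.11")
    def multiply_blocks(size: int) -> float:
        # Placeholder workload: multiply two random blocks and return a checksum.
        import random
        a = [[random.random() for _ in range(size)] for _ in range(size)]
        b = [[random.random() for _ in range(size)] for _ in range(size)]
        c = [[sum(a[i][k] * b[k][j] for k in range(size)) for j in range(size)]
             for i in range(size)]
        return sum(sum(row) for row in c)

    @dsl.pipeline(name="matmul-demo")
    def matmul_pipeline(size: int = 64):
        multiply_blocks(size=size)

    if __name__ == "__main__":
        # Compile to a package that a Kubeflow Pipelines instance can run.
        compiler.Compiler().compile(matmul_pipeline, "matmul_pipeline.yaml")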
The clusters are designed to interact seamlessly with one another, working together to handle diverse tasks within the system. Our system is able to withstand unexpected fatal errors during active computation and resume operation without data loss. This design ensures that the overall system remains operational even when individual nodes or clusters encounter issues, enhancing the robustness and reliability of Superclustr.
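The exact checkpointing mechanism is not described here, but the recovery behaviour can be pictured with a small sketch: long-running work periodically persists its partial state, and on restart it resumes from the last completed chunk rather than starting over. The file name and the workload below are hypothetical.

    # Minimal sketch of resumable computation: partial results are persisted
    # after every chunk, so a node crash loses at most the chunk in flight.
    # The state file path and the workload itself are hypothetical.
    import json, os

    STATE_FILE = "checkpoint.json"

    def load_state():
        if os.path.exists(STATE_FILE):
            with open(STATE_FILE) as f:
                return json.load(f)
        return {"next_chunk": 0, "partial_sum": 0.0}

    def save_state(state):
        tmp = STATE_FILE + ".tmp"
        with open(tmp, "w") as f:
            json.dump(state, f)
        os.replace(tmp, STATE_FILE)   # atomic rename: never a half-written file

    def run(total_chunks=100):
        state = load_state()
        for chunk in range(state["next_chunk"], total_chunks):
            state["partial_sum"] += sum(i * i for i in range(chunk * 1000, (chunk + 1) * 1000))
            state["next_chunk"] = chunk + 1
            save_state(state)         # persist after every completed chunk
        return state["partial_sum"]

    if __name__ == "__main__":
        print(run())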
The Superclustr system is fully independently funded by the technical staff and the participating researchers.
Access to the Superclustr system is reserved exclusively for researchers and individuals affiliated with non-profit organizations or not-for-profit entities recognized within the European Union, as well as unaffiliated individuals. Superclustr nodes are operated only by Superclustr staff; third parties do not have access to the system. Please note that the indicated rates apply only to these individuals and entities. For-profit entities are not eligible for access to the system.
Power Consumption
While using Superclustr, you only pay for the energy consumed in kWh. Superclustr relies almost entirely on renewable energy; therefore, consumption is dynamically priced. With a flexible price limit you can cost-effectively run your operations on Superclustr.
One node in the cluster draws up to 3.2 kW (3.2 kWh per hour) under maximum load. During underutilization or periods of undesirably high energy prices, Superclustr's flexible price limit pauses your queued tasks ahead of time and puts idle nodes into hibernation mode.
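The flexible price limit can be pictured as a simple control loop: compare the current dynamic price against the user-defined limit, then either dispatch queued work or hibernate idle nodes. The sketch below assumes a hypothetical price feed and node-control interface; neither is part of a published Superclustr API.

    # Minimal sketch of a flexible price limit: run queued tasks only while
    # the dynamic energy price is at or below the user's limit, otherwise
    # pause the queue and hibernate idle nodes. The price feed, limit value,
    # and NodePool interface are hypothetical placeholders.
    import time
    from collections import deque

    PRICE_LIMIT_EUR_PER_KWH = 0.30        # user-defined limit (hypothetical value)

    def get_current_price_eur_per_kwh() -> float:
        # Placeholder for a dynamic-pricing feed.
        return 0.25

    class NodePool:
        # Placeholder for whatever actually wakes and hibernates nodes.
        def wake(self):       print("nodes: waking up")
        def hibernate(self):  print("nodes: hibernating")
        def run(self, task):  print(f"running {task}")

    def control_loop(queue: deque, nodes: NodePool, poll_seconds: int = 300):
        while queue:
            if get_current_price_eur_per_kwh() <= PRICE_LIMIT_EUR_PER_KWH:
                nodes.wake()
                nodes.run(queue.popleft())
            else:
                nodes.hibernate()          # pause ahead of time, keep tasks queued
                time.sleep(poll_seconds)

    if __name__ == "__main__":
        control_loop(deque(["job-a", "job-b"]), NodePool(), poll_seconds=1)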
1x 10 Gbps Dual-Port RJ45 Ethernet Card with Intel® X550 Chipset
8x PCI-e 3.0 riser cables
2x 2000W (max. 1800W) 80PLUS Platinum Power Supply
Each cluster is interconnected via a secure WireGuard tunnel forwarded over the public internet. In addition, Gigabit Ethernet is used as the OS network (file transport and high-speed interconnect between the nodes). All clusters are connected through a unicast control plane server in Falkenstein, so the entire system can be used as a 5-node wide-area distributed cluster.
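To make the hub-and-spoke topology concrete, the sketch below renders wg-quick style peer configuration for cluster gateways that all tunnel to the Falkenstein control plane. Every key, IP address, and endpoint shown is a hypothetical placeholder, not the real Superclustr addressing plan.

    # Minimal sketch of the hub-and-spoke WireGuard topology: each cluster
    # gateway keeps a single tunnel to the control plane server.
    # All keys, addresses, and endpoints are hypothetical placeholders.
    HUB = {"name": "falkenstein", "endpoint": "hub.example.net:51820",
           "public_key": "<HUB_PUBLIC_KEY>", "tunnel_net": "10.10.0.0/24"}

    SPOKES = [
        {"name": "amsterdam", "tunnel_ip": "10.10.0.2"},
        {"name": "stuttgart", "tunnel_ip": "10.10.0.3"},
    ]

    def spoke_config(spoke: dict) -> str:
        """Render a wg-quick configuration for one cluster gateway."""
        return "\n".join([
            "[Interface]",
            f"Address = {spoke['tunnel_ip']}/24",
            "PrivateKey = <SPOKE_PRIVATE_KEY>",
            "",
            "[Peer]",
            f"PublicKey = {HUB['public_key']}",
            f"Endpoint = {HUB['endpoint']}",
            f"AllowedIPs = {HUB['tunnel_net']}",   # route all tunnel traffic via the hub
            "PersistentKeepalive = 25",            # keep NAT mappings alive
        ])

    if __name__ == "__main__":
        for spoke in SPOKES:
            print(f"# {spoke['name']} gateway\n{spoke_config(spoke)}\n")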
Superclustr is a wide-area distributed cluster built by a group of internet researchers and engineers, who developed this system in their personal capacity outside of their official roles to perform research on parallel and distributed matrix multiplication as well as optimization.
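As a toy illustration of the research topic, the sketch below splits a matrix product across worker processes by row blocks. It uses NumPy and a local process pool as stand-ins for the cluster's actual scheduling and interconnect, which are not described here.

    # Toy sketch of block-parallel matrix multiplication: the rows of A are
    # split into blocks and multiplied against B in separate worker processes,
    # standing in for distribution across cluster nodes.
    import numpy as np
    from concurrent.futures import ProcessPoolExecutor

    def multiply_block(args):
        a_block, b = args
        return a_block @ b

    def parallel_matmul(a: np.ndarray, b: np.ndarray, workers: int = 4) -> np.ndarray:
        blocks = np.array_split(a, workers, axis=0)        # row-wise partition of A
        with ProcessPoolExecutor(max_workers=workers) as pool:
            results = list(pool.map(multiply_block, [(blk, b) for blk in blocks]))
        return np.vstack(results)                          # reassemble C = A @ B

    if __name__ == "__main__":
        a = np.random.rand(512, 512)
        b = np.random.rand(512, 512)
        c = parallel_matmul(a, b)
        assert np.allclose(c, a @ b)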
In addition to the primary research, the system has been built to analyze large amounts of internet measurement data openly shared by the RIPE Atlas Project of the RIPE NCC. Superclustr uses this data to better understand the changing nature of internet infrastructure. This aids networking research and allows deeper insights into global internet measurement data.
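As an example of working with that data, the snippet below pulls the results of a single RIPE Atlas measurement from the public API and tallies median round-trip times per probe. The measurement ID is an arbitrary placeholder, not one actually analyzed on Superclustr.

    # Minimal sketch: download results of one RIPE Atlas measurement from the
    # public API and summarise round-trip times. The measurement ID is an
    # arbitrary placeholder.
    import statistics
    import requests

    MEASUREMENT_ID = 1234567   # placeholder

    def fetch_results(measurement_id: int) -> list:
        url = f"https://atlas.ripe.net/api/v2/measurements/{measurement_id}/results/"
        response = requests.get(url, params={"format": "json"}, timeout=30)
        response.raise_for_status()
        return response.json()

    def median_rtts(results: list) -> list:
        rtts = []
        for entry in results:
            samples = [r["rtt"] for r in entry.get("result", []) if "rtt" in r]
            if samples:
                rtts.append(statistics.median(samples))
        return rtts

    if __name__ == "__main__":
        print(median_rtts(fetch_results(MEASUREMENT_ID)))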
The three clusters were built in June 2023. Below are several pictures of the original 4-node cluster in Amsterdam (installed on 5 June 2023). The pictures illustrate the integration of the different components. The original 4-node cluster consisted of a single shelf with three 16-inch-high racks, loaded with one On-Demand Server and three Always-On Servers in a 36-inch-deep Server Rack.
Each cluster has a separate file server and gateway server. These machines are regular tower PCs. The file server contains a 960 GByte SATA3 DataCenter SSD. The gateways will be used to interconnect the three clusters over encrypted tunnels across the public internet, and to provide a local edge PXE server that caches the latest operating system image from the control plane cluster, as well as a network cache for larger model checkpoints downloaded from the control plane cluster, which are saved into non-persistent RAM disk storage on the nodes.
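The caching behaviour of the gateways can be sketched as follows: a node asks its local gateway cache first and only falls back to the control plane when the checkpoint is missing, keeping the result on a RAM disk. The host names, URLs, and paths below are hypothetical placeholders.

    # Minimal sketch of the gateway caching path: try the local edge cache
    # first, fall back to the control plane, and keep the checkpoint on a
    # RAM disk (tmpfs). Host names, URLs, and paths are hypothetical.
    import pathlib
    import requests

    EDGE_CACHE = "http://gateway.local:8080"          # hypothetical local gateway
    CONTROL_PLANE = "https://control-plane.example"   # hypothetical control plane
    RAM_DISK = pathlib.Path("/dev/shm/checkpoints")   # non-persistent storage

    def fetch_checkpoint(name: str) -> pathlib.Path:
        RAM_DISK.mkdir(parents=True, exist_ok=True)
        target = RAM_DISK / name
        for base in (EDGE_CACHE, CONTROL_PLANE):      # nearest source first
            try:
                response = requests.get(f"{base}/checkpoints/{name}", timeout=60)
                if response.status_code == 200:
                    target.write_bytes(response.content)
                    return target
            except requests.RequestException:
                continue                              # try the next source
        raise FileNotFoundError(f"checkpoint {name} not available from any source")

    if __name__ == "__main__":
        print(fetch_checkpoint("model-latest.ckpt"))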
All connectors (of the motherboard and PCI-e cards) are located vertically inside the machine to allow horizontal mounting of GPUs. Each node has a CAT6 cable (capable of 10 Gbps) connected to its Ethernet card. Each node also exposes serial, keyboard, and VGA connectors on the I/O panel, so it's easy to attach a keyboard, serial cable, and monitor to any node.
Each cabinet contains a single unmanaged Gigabit Ethernet Switch (the blue box in the space below the first rack from the bottom). All nodes in a cabinet (at most three) are connected through this single switch.
Each computing node is packaged as an unenclosed metallic framework containing the motherboard (EATX form factor) with six PCI-e slots (one of which is used for the Ethernet adapter). The motherboard has 4x RJ45 LAN ports, serial ports, USB interfaces, 1 dedicated RJ45 IPMI LAN port, and a VGA connector for integrated graphics provided by the Aspeed AST2400 BMC. Additionally, specially dedicated large box fans supply cold air to the cooling channels of each machine in the cabinet. Below is a picture of one node (without the network card):