Machine Learning Hackathon: Multi-Node GPU Challenge

Who should join?

If you have some experience with deep learning frameworks (for instance PyTorch or TensorFlow), you should join our challenge.

What’s the goal?

The aim of the challenge is to code a multi-node/multi-GPU training template, preferably in one of the leading frameworks (PyTorch or TensorFlow), or alternatively in another established framework (for instance Caffe).

Who wins?

The winner is the multi-node/multi-GPU template that best combines linear scalability (priority 1) and execution time (priority 2). Datasets can be chosen freely (Pascal VOC, MS COCO, etc.), although standard datasets are considered more valuable because they make results easier to compare.

When do we start?

Registrations are open until 16th April. The Hackathon kick-off call is scheduled for 19th April. The challenge will last about 6 weeks, until 31st May. The winner will be informed shortly after the end of the challenge.

Why join?

Besides building fun multi-node/multi-GPU templates, a major benefit of participating in the Hackathon is the winners’ prize: a FUJITSU Workstation CELSIUS M7010X with an NVIDIA® GeForce RTX™ 2080 Super 8 GB and a powerful 12-core Intel® CPU (4.80 GHz), which is ideal for high-end CAD, M&E, rendering, and simulation applications.


Hackathon Details

The goal of this hackathon is to create a template based on TensorFlow or PyTorch that enables multi-node/multi-GPU training.

"Template" means that the focus is on the "how to train", i.e. multi-node/multi-GPU execution, to ensure highly efficient scale-out. The template should form a kind of framework in which the user concentrates fully on the implementation of their own data and model and does not have to think about how to train optimally.
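To make the idea concrete, here is a minimal sketch of such a template in PyTorch, using torch.distributed with DistributedDataParallel and the NCCL backend. This is an illustrative outline under our own assumptions, not a reference solution: the linear model, the random tensors, and all hyperparameters are placeholders that a participant would replace with their own data pipeline and network, and the launch command assumes the standard torchrun utility.

# minimal_ddp_template.py -- illustrative multi-node/multi-GPU skeleton.
# Launch one process per GPU on every node with torchrun, e.g. for
# 2 nodes x 2 GPUs:
#   torchrun --nnodes=2 --nproc_per_node=2 --node_rank=<0|1> \
#            --master_addr=<node0-host> --master_port=29500 minimal_ddp_template.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

def main():
    # torchrun exports RANK, WORLD_SIZE, LOCAL_RANK, MASTER_ADDR etc.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Placeholder model and data -- the template user plugs in their own here.
    model = DDP(torch.nn.Linear(32, 10).cuda(local_rank), device_ids=[local_rank])
    dataset = TensorDataset(torch.randn(1024, 32), torch.randint(0, 10, (1024,)))

    # DistributedSampler shards the dataset so each rank sees a distinct slice.
    sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset, batch_size=64, sampler=sampler)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = torch.nn.CrossEntropyLoss()

    for epoch in range(3):
        sampler.set_epoch(epoch)  # reshuffle consistently across ranks
        for x, y in loader:
            x, y = x.cuda(local_rank), y.cuda(local_rank)
            optimizer.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()   # gradients are all-reduced across all ranks
            optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()

The key design point is that nothing in the training loop depends on the node count: the same script runs on a single GPU or on 2 nodes with 2 GPUs each, and only the launch command changes.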

Background: In many projects, scalability in the sense of true scale-out is neglected. This is aggravated by the fact that the scale-out capabilities of the available frameworks are either not optimal in this respect or still relatively new, so solution approaches often continue to concentrate on scale-up. Development systems, which are frequently single-node with one or a few GPUs, complicate software development further, since adapting code to multiple nodes and multiple GPUs involves additional effort and cost.

Even if multi-node training often does not seem necessary, another point of view helps here: a service provider, for example, usually has an interest in processing AI workloads in an extremely short time in order to serve many AI requests with the available infrastructure while meeting the required SLAs.

It is therefore much less about "what is trained": the selection of data and desired results is secondary. What matters is scaling as linearly as possible across multiple nodes.
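To illustrate how "as linear as possible" can be quantified, the following is a hedged sketch of a simple scaling-efficiency measure. The timings in the example are invented placeholders, not measurements from the hackathon, and this formula is our own assumption rather than the official evaluation criterion.

# scaling.py -- one simple way to quantify linear scalability.
def scaling_efficiency(t_single_gpu, t_scaled, n_gpus):
    """Speedup relative to a single GPU, divided by the GPU count.

    1.0 means perfectly linear scaling; lower values indicate
    communication or load-balancing overhead in the scale-out.
    """
    speedup = t_single_gpu / t_scaled
    return speedup / n_gpus

# Hypothetical example: one epoch takes 400 s on 1 GPU and 110 s on
# 2 nodes x 2 GPUs = 4 GPUs in total.
print(scaling_efficiency(400.0, 110.0, 4))  # ~0.91, i.e. 91% efficiency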

The result is then tested on the data-driven transformation platform, which comprises a DGX as well as two nodes with 2 x V100 GPUs each. The DGX is used to test scale-up capabilities (single node, multiple GPUs), while the two nodes validate the actual approach of the hackathon: scale-out.

On request, the system can be made available in specific timeslots, ideally for final testing and evaluation.

Registrations for this challenge are now closed.