Wednesday, May 10, 2023

decentralized learning

Some early papers are emerging that switch from federated learning (FL), with a dedicated model-parameter aggregation server, to decentralized learning (DL), with P2P distribution of model parameters. Both suffer from n-1 message-scaling challenges: at each step of training, every node sends its model parameters to update everyone else, then everyone moves on to the next iteration on their local data.
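
To make that n-1 message cost concrete, here is a minimal sketch of one synchronous DL round, with made-up names (Node, local_step and so on) standing in for real networking and real SGD:

```python
# Illustrative only: every node messages every other node each round,
# i.e. n*(n-1) messages per training step.
import random

class Node:
    def __init__(self, node_id, dim):
        self.node_id = node_id
        self.params = [0.0] * dim
        self.inbox = []          # parameter vectors received this round

    def local_step(self):
        # stand-in for one SGD step on local data: just jitter the parameters
        self.params = [p + random.gauss(0, 0.01) for p in self.params]

    def send_to_all(self, peers):
        # the message implosion: one copy of the parameters to each other peer
        for peer in peers:
            if peer is not self:
                peer.inbox.append(list(self.params))

    def aggregate(self):
        # average own params with everything received, then clear the inbox
        vectors = [self.params] + self.inbox
        self.params = [sum(col) / len(vectors) for col in zip(*vectors)]
        self.inbox = []

nodes = [Node(i, dim=4) for i in range(8)]
for round_no in range(3):
    for n in nodes: n.local_step()
    for n in nodes: n.send_to_all(nodes)
    for n in nodes: n.aggregate()
```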

This has multiple downsides:

1. From a performance perspective, the implosion of messages creates a severe bottleneck. This can be alleviated through sparsification of the model-parameter update cycle: sending updates to a (say) random subset of other nodes (DL), or from a random subset of nodes to the server (FL), or thinning the parameter list itself (both). A more complex distributed approach is a hierarchy of aggregation points - in FL this just means having clusters of multiple specialised servers, each with a subset of FL clients, which then coordinate amongst themselves to forward partial aggregates to each other. That already smacks partly of DL, so why not just cluster a set of DL peers and elect a "supernode" (anyone remember the original Skype architecture, which did exactly this?) based on CPU and bandwidth, have it act as cluster head/aggregation point, and then coordinate with other supernodes (possibly recursively)? This could also work wide area, taking account of local versus long-haul bandwidth - see the first sketch after this list.

2. Distributed iterative synchronous algorithms have the straggler problem. There are lots of straggler-elimination schemes, and the clustering in point 1 will also help, but even within a cluster we should have mechanisms to time out, or even shoot down, nodes that are too slow. Note, too, that we can simply run asynchronously - many ML schemes are stochastic and will work well if run asynchronously. Or we could run a mix: synchronous for similar-speed nodes, and asynchronous updates for slower nodes (or nodes on slower links) - see the second sketch after this list.

3. Nodes taking on specialised roles may crash, so we need to run replicas. There are a ton of replication schemes (e.g. Raft), and again we don't even need perfect consensus for training to converge - we just want to increase availability. Indeed, one could build a virtual Clos topology out of supernodes for DL, and get redundant servers autoconfigured for free (a third sketch follows below)...
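
First sketch, for point 1: a two-level round where each cluster elects the best-provisioned peer as supernode, members send their update only to that head, and the supernodes then exchange partial aggregates amongst themselves (here as one flat exchange, though it could recurse). The Node fields (cpu, bandwidth, params) are assumed, not from any particular library:

```python
def elect_supernode(cluster):
    # pick the best-provisioned peer as cluster head, as the original Skype did
    return max(cluster, key=lambda n: n.cpu * n.bandwidth)

def average(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def hierarchical_round(clusters):
    supernodes, partials = [], []
    for cluster in clusters:
        head = elect_supernode(cluster)
        supernodes.append(head)
        # members -> supernode: each member sends one message to its cluster head
        partials.append(average([n.params for n in cluster]))
    # supernodes exchange partial aggregates amongst themselves (could recurse);
    # unweighted average here, i.e. assuming similarly sized clusters
    global_params = average(partials)
    for cluster in clusters:
        for n in cluster:
            n.params = list(global_params)
    return supernodes
```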
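
Second sketch, for point 2: a bounded round that waits a fixed deadline for updates, averages whatever arrived in time, and reports the stragglers so they can be shot down or handled asynchronously. fetch_update is a hypothetical blocking call, not a real API:

```python
from concurrent.futures import ThreadPoolExecutor, wait

def bounded_round(nodes, fetch_update, deadline_s=2.0):
    pool = ThreadPoolExecutor(max_workers=len(nodes))
    futures = {pool.submit(fetch_update, n): n for n in nodes}
    done, late = wait(futures, timeout=deadline_s)
    pool.shutdown(wait=False)              # don't block on the stragglers
    on_time = [f.result() for f in done if f.exception() is None]
    stragglers = [futures[f] for f in late]
    if not on_time:                        # everyone missed the deadline
        return None, stragglers
    # average whatever arrived in time; stochastic training tolerates the rest,
    # or the late updates can be folded in asynchronously next round
    aggregate = [sum(col) / len(on_time) for col in zip(*on_time)]
    return aggregate, stragglers
```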
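
Third sketch, for point 3: replication for availability rather than strict consensus. A client pushes its update to up to k replica supernodes and accepts the first aggregate that comes back; send and receive_aggregate are hypothetical transport calls:

```python
def push_update(update, replicas, send, receive_aggregate, k=3):
    # fire the local update at up to k replica supernodes, skipping dead ones
    live = []
    for r in replicas[:k]:
        try:
            send(r, update)
            live.append(r)
        except ConnectionError:
            continue
    # accept the first aggregate that comes back; replicas may disagree slightly,
    # but stochastic training converges anyway, so availability beats consensus
    for r in live:
        result = receive_aggregate(r, timeout=2.0)
        if result is not None:
            return result
    raise RuntimeError("no replica supernode answered")
```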


Side note - supernodes also deal with NAT traversal, so nodes learning in homes (typically behind NAT and firewall) can still find other nodes. If we don't care for fixed supernodes, we can also use BitTorrent-like schemes for dynamically switching, based on load, between a small set of supernodes, and, indeed, use BitTorrent's decentralised tracker (i.e. Kademlia) for discovery too.
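
A tiny sketch of that dynamic switching, assuming a hypothetical load probe and a candidate set that would really come from a Kademlia-style DHT rather than a static list:

```python
def pick_supernode(current, candidates, load, switch_margin=0.25):
    # hysteresis: only switch when the best candidate is clearly less loaded,
    # so the swarm doesn't thrash between supernodes
    best = min(candidates, key=load)
    if current is not None and load(best) >= load(current) * (1 - switch_margin):
        return current
    return best
```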

We have all the pieces. We need to build it, and they will come.
