We've looked at this problem from several sides now: solving "incast", doing aggregation for map/reduce or any federated learning platform, and aggregating acknowledgements for PGM.
When we say "in-network", we're talking about in-switch processing: borrowing resources from the poor P4 switch to store and process multiple application-layer packets' worth of data, so that only one actual packet (or at least far fewer) needs to be sent on its way.
So how about we compare this with multicast (in-network copying) and its replacement (largely) by CDNs/overlays?
The key point is the branches in the net. This is where the "implosion" (for incast) or "explosion" (for multicast) happens.
So do we have a server nearby? Or can we just put one there (or just connect one there)?
The answer (for multicast) is yes:
Netflix/PoPs in the wide area: use a distribution tree to all the PoPs, and caches
So in the data center:
use servers, not switches, and build a forest of sink trees
Clos system: connect servers to the local switch, the top-of-rack switch, and the spine switch/server. Then, for the servers at a given level, use a node at the next level up as the aggregation server. (Note that Clos even has redundancy, so this will survive edge/switch outages.)
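To make the sink-tree idea concrete, here is a minimal sketch in Python. It is only an illustration under assumptions: the node names (`w0`, `rack0`, `spine0`, ...), the `candidates` map and the two helper functions are all hypothetical, and the "aggregation" is just a sum. Each node picks the first live aggregation server one level up, and each server forwards a single combined value, so the sink sees one message per child rather than one per worker.

```python
from collections import defaultdict

def choose_parents(candidates, failed=frozenset()):
    """For each live node, pick the first live aggregation server one level up."""
    parent = {}
    for node, options in candidates.items():
        if node in failed:
            continue
        alive = [p for p in options if p not in failed]
        if alive:
            parent[node] = alive[0]
    return parent

def aggregate(sink, parent, values):
    """Sum values up the sink tree; return (total, messages arriving at each node)."""
    children = defaultdict(list)
    for child, par in parent.items():
        children[par].append(child)
    inbound = {}

    def visit(node):
        total = values.get(node, 0)
        for child in children[node]:
            total += visit(child)  # one combined message per child link
        inbound[node] = len(children[node])
        return total

    return visit(sink), inbound

# Hypothetical two-level layout: four workers, two rack-level servers, two spine-level servers.
candidates = {
    "w0": ["rack0"], "w1": ["rack0"], "w2": ["rack1"], "w3": ["rack1"],
    "rack0": ["spine0", "spine1"], "rack1": ["spine0", "spine1"],
}
values = {"w0": 1, "w1": 2, "w2": 3, "w3": 4}

total, inbound = aggregate("spine0", choose_parents(candidates), values)

# Assumed failure of spine0: the rack-level servers fall back to spine1.
backup_parents = choose_parents(candidates, failed={"spine0"})
backup_total, backup_inbound = aggregate("spine1", backup_parents, values)
```

In this toy layout the sink receives two messages instead of four, and the fallback to `spine1` is a stand-in for the redundancy a Clos topology offers. A real deployment would need timeouts, partial aggregates and a way to agree on the sink, none of which is modelled here.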