channels: add `Distribute` pact #493

petrosagg · 2023-01-05T17:54:32Z

This PR adds the `Distribute` pact that aims to evenly distribute data among all workers by routing each container to a randomly selected worker.

Traditionally, this "defensive distribution" could be implemented using an `Exchange` pact whose key function round-robined the records or distributed them randomly in some other way. While this works, it has a couple of downsides:

* The key function is calculated once per record, instead of once per container
* Each record must be copied out of the original container into a per-worker container depending on the key function

The `Distribute` pact streamlines this pattern: it avoids copying each record to a separate container and instead immediately pushes each whole container to a random worker.
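To make the difference concrete, here is a minimal std-only sketch that models both routing styles with plain `Vec`s. It does not use timely's actual pact or pusher types; the function names `route_per_record` and `route_per_container` are hypothetical and exist only for this illustration. Both functions assume `workers > 0`.

```rust
/// Per-record routing, in the style of an `Exchange` pact: the key function
/// runs once per record, and every record is moved into a per-worker buffer.
/// Returns the per-worker buffers and the number of key function calls.
fn route_per_record<T>(
    containers: Vec<Vec<T>>,
    workers: usize,
    mut key: impl FnMut(&T) -> u64,
) -> (Vec<Vec<T>>, usize) {
    let mut buffers: Vec<Vec<T>> = (0..workers).map(|_| Vec::new()).collect();
    let mut key_calls = 0;
    for container in containers {
        for record in container {
            key_calls += 1;
            let target = (key(&record) % workers as u64) as usize;
            buffers[target].push(record);
        }
    }
    (buffers, key_calls)
}

/// Per-container routing, as described for `Distribute`: one routing decision
/// per container, and the container is handed over whole.
/// Returns the containers received by each worker and the number of decisions.
fn route_per_container<T>(
    containers: Vec<Vec<T>>,
    workers: usize,
    mut pick: impl FnMut() -> usize,
) -> (Vec<Vec<Vec<T>>>, usize) {
    let mut inboxes: Vec<Vec<Vec<T>>> = (0..workers).map(|_| Vec::new()).collect();
    let mut decisions = 0;
    for container in containers {
        decisions += 1;
        let target = pick() % workers;
        inboxes[target].push(container);
    }
    (inboxes, decisions)
}
```

With two containers of three records each, the per-record variant makes six routing decisions and rebuilds the buffers, while the per-container variant makes two decisions and forwards the original containers untouched.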

Future work

A potential future improvement is to circulate in-band statistics over the channel about how many messages each worker has seen. This would allow each worker to estimate the current skew and only leap into action once things are bad enough.
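As a rough illustration of what such a skew estimate could look like, the sketch below works on a slice of per-worker message counts. Everything here is an assumption made for illustration: the max-over-mean skew measure, the threshold policy, and the names `skew_ratio`, `should_rebalance`, and `least_loaded` are not part of this PR.

```rust
/// Ratio between the busiest worker's message count and the mean count.
/// Returns 1.0 (perfectly balanced) when there are no workers or no messages.
fn skew_ratio(counts: &[u64]) -> f64 {
    let total: u64 = counts.iter().sum();
    if counts.is_empty() || total == 0 {
        return 1.0;
    }
    let max = counts.iter().copied().max().unwrap_or(0) as f64;
    let mean = total as f64 / counts.len() as f64;
    max / mean
}

/// Hypothetical policy: keep behaving like a pipeline edge until the
/// observed skew exceeds `threshold`.
fn should_rebalance(counts: &[u64], threshold: f64) -> bool {
    skew_ratio(counts) > threshold
}

/// Hypothetical target choice once rebalancing is active: the worker that
/// has seen the fewest messages (the first such worker on ties).
fn least_loaded(counts: &[u64]) -> Option<usize> {
    counts
        .iter()
        .enumerate()
        .min_by_key(|(_, count)| **count)
        .map(|(index, _)| index)
}
```

Under this policy a worker would pay no redistribution cost while the counts stay close to the mean, and would start redirecting containers only after the ratio crosses the chosen threshold.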

Signed-off-by: Petros Angelatos <[email protected]>

antiguru

This looks useful in some situations. What about adding a closure that can decide on a target for each container instead of hardcoding a random function?

antiguru · 2023-01-05T18:41:01Z

timely/src/dataflow/channels/pact.rs

+    fn push(&mut self, message: &mut Option<BundleCore<T, C>>) {
+        let mut state: fnv::FnvHasher = Default::default();
+        std::time::Instant::now().hash(&mut state);
+        let worker_idx = (state.finish() as usize) % self.pushers.len();
+        self.pushers[worker_idx].push(message);


It seems we could make this more generic by accepting a closure that takes a message and returns a `usize` to route the whole container. Perhaps also rename the struct to `ContainerExchange` or something similar?
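A minimal sketch of that idea, assuming a local stand-in `Push` trait with the same shape as the `push` method quoted above. The names `ContainerExchangePusher` and `Collect` are hypothetical, and broadcasting `None` to every pusher is an assumption of this sketch rather than behaviour taken from the PR.

```rust
/// Local stand-in for a push endpoint, mirroring the shape of the quoted
/// `push` method. Not timely's real trait.
trait Push<M> {
    fn push(&mut self, message: &mut Option<M>);
}

/// Hypothetical pusher that routes each whole message with a user closure.
struct ContainerExchangePusher<P, F> {
    pushers: Vec<P>,
    route: F,
}

impl<P, F> ContainerExchangePusher<P, F> {
    /// Allocates a new pusher from a set of pushers and a routing closure.
    fn new(pushers: Vec<P>, route: F) -> Self {
        assert!(!pushers.is_empty(), "need at least one pusher");
        Self { pushers, route }
    }
}

impl<M, P: Push<M>, F: FnMut(&M) -> usize> Push<M> for ContainerExchangePusher<P, F> {
    fn push(&mut self, message: &mut Option<M>) {
        let workers = self.pushers.len();
        // One routing decision per message, never per record.
        let target = message.as_ref().map(|m| (self.route)(m) % workers);
        match target {
            Some(index) => self.pushers[index].push(message),
            // Assumption: `None` carries no data and is forwarded to everyone.
            None => {
                for pusher in self.pushers.iter_mut() {
                    pusher.push(&mut None);
                }
            }
        }
    }
}

/// Sink used only to demonstrate the routing: records what it receives.
struct Collect<M>(Vec<M>);

impl<M> Push<M> for Collect<M> {
    fn push(&mut self, message: &mut Option<M>) {
        if let Some(m) = message.take() {
            self.0.push(m);
        }
    }
}
```

A random or round-robin policy then becomes just one possible closure, for example one that ignores its argument and returns a counter or a hash of the current time.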

antiguru · 2023-01-05T18:41:10Z

timely/src/dataflow/channels/pact.rs

+impl<P>  DistributePusher<P> {
+    /// Allocates a new `DistributePusher` from a supplied set of pushers
+    pub fn new(pushers: Vec<P>) -> DistributePusher<P> {
+        DistributePusher {


Suggested change

-        DistributePusher {
+        Self {

antiguru · 2023-01-05T18:45:40Z

An alternative could be to implement `PushPartitioned` for a wrapper type, but then one has to deal with non-vector streams, which can be difficult.

petrosagg · 2023-01-05T19:01:22Z

> This looks useful in some situations. What about adding a closure that can decide on a target for each container instead of hardcoding a random function?

The idea for this pact came from the office hours, where we discussed the fact that when an operator introduces skew to its output, the skew persists along `Pipeline` edges. It would be nice to have an edge that behaves like `Pipeline` but can also detect skew and do something about it.

For this reason I think we shouldn't expose any hook for users to plug in their own logic, and should instead state that the intent of this pact is to do this "smart routing". The current implementation is not particularly sophisticated, but the idea is that users use it as-is for the stated benefit and we can then improve the implementation independently.

I initially set out to integrate a work-stealing queue by adding a new `Allocate` method, but I think I want to first explore the statistics-based approach, which could potentially work even with TCP channels.

The ideal case would be for this pact's performance to be almost identical to `Pipeline`'s under balanced loads. In that case we could switch all the timely/dd operators that don't care about distribution (`map`, `filter`, etc.) from `Pipeline` to `Distribute`, which would make them more robust to skewed workloads.
