Skip to content

Conversation

@gjoseph92
Copy link
Collaborator

@gjoseph92 gjoseph92 commented Jan 27, 2022

This makes the version-compatibility check no longer actually import the packages being checked.

It also removes support for custom version-getting functions, and only uses importlib.metadata.version (EDIT: or pkg_resources on 3.7) to look up the version number registered for a package.

I assumed the purpose of custom version-getting functions was to handle discrepancies in how modules expose their version as an attribute, like how most use .__version__, but tornado uses .version, msgpack uses .version but it's a tuple of ints instead of a string, etc. That becomes irrelevant with importlib.metadata.version, which will always give you the version number of any reasonably-formed package as a string.

You could use this (modname, version_getter_func) form as an argument to Client.get_versions, so not supporting the tuple form in theory is a breaking change to a public API. However, that form isn't documented, so I don't know if that counts.

If we want to maintain support for those custom functions, I will, it just makes the code messier.

@gjoseph92 gjoseph92 requested a review from crusaderky January 27, 2022 23:58
@gjoseph92 gjoseph92 self-assigned this Jan 27, 2022
@gjoseph92
Copy link
Collaborator Author

gjoseph92 commented Jan 28, 2022

importlib.metadata is 3.8 only 🤦‍♂️

Do we want to use the backport? https://github.com/python/importlib_metadata

@crusaderky
Copy link
Collaborator

importlib.metadata is 3.8 only man_facepalming

Do we want to use the backport? https://github.com/python/importlib_metadata

pkg_resources.get_distribution("numpy").version

@gjoseph92
Copy link
Collaborator Author

The Python docs seem to suggest importlib.metadata over pkg_resources.

Along with importlib.resources in Python 3.7 and newer (backported as importlib_resources for older versions of Python), this can eliminate the need to use the older and less efficient pkg_resources package.
https://docs.python.org/3/library/importlib.metadata.html

The pkg_resources docs say the same:

Use of pkg_resources is discouraged in favor of importlib.resources, importlib.metadata, and their backports (resources, metadata). Please consider using those libraries instead of pkg_resources.
https://setuptools.pypa.io/en/latest/pkg_resources.html

I guess instead of requiring the backport, I could use pkg_resources on 3.7 and importlib.metadata when it's available? Seems like a fine compromise.

gjoseph92 added a commit to gjoseph92/distributed that referenced this pull request Jan 28, 2022
@gjoseph92 gjoseph92 requested a review from crusaderky January 28, 2022 01:39
@gjoseph92
Copy link
Collaborator Author

The number of flaky-seeming test failures here is concerning me. Could it be that not importing pandas on the workers is allowing them to start faster, and making race conditions even more likely to happen? Or something like that?

@crusaderky
Copy link
Collaborator

I don't get it - I thought that in order to not import pandas we need both this and #5695?
Rerunning tests to see if it's a temporary VM issue.

@crusaderky crusaderky closed this Jan 28, 2022
@crusaderky crusaderky reopened this Jan 28, 2022
@crusaderky
Copy link
Collaborator

Looking at the runtimes of the heavily flaky test suites, I'm seeing 1h+ runtime instead of the usual 30min. Which is something I've seen in the past on main. So I suspect it may be github's fault.

@gjoseph92
Copy link
Collaborator Author

I don't get it - I thought that in order to not import pandas we need both this and #5695?

Totally correct—just forgot, sorry.

@crusaderky
Copy link
Collaborator

crusaderky commented Jan 28, 2022

Merged CI test of #5695 + #5724 running on crusaderky#2 and things don't look very good so far. I can see much more flakiness than main and much slower end-to-end test runtime.

@crusaderky
Copy link
Collaborator

CI is back on track; I've rerun all tests.
I can see a gazillion of failed tests.
The good news is, several of these are known flaky tests which seem to have become a lot more flaky in this PR.
This may be a good workbench to finally iron them out.
CC @fjetter

Failed tests (with number of failures):

      7 distributed/deploy/tests/test_adaptive.py::test_adapt_quickly
      6 distributed/deploy/tests/test_local.py::test_adapt_then_manual
      5 distributed/tests/test_worker.py::test_worker_reconnects_mid_compute_multiple_states_on_scheduler
      5 distributed/tests/test_worker.py::test_worker_reconnects_mid_compute
      5 distributed/tests/test_client.py::test_dump_cluster_state
      3 distributed/tests/test_worker.py::test_scheduler_delay
      3 distributed/tests/test_worker.py::test_lifetime
      3 distributed/tests/test_worker.py::test_gather_many_small
      3 distributed/tests/test_worker.py::test_avoid_oversubscription
      3 distributed/tests/test_steal.py::test_steal_twice
      3 distributed/tests/test_scheduler.py::test_worker_breaks_and_returns
      3 distributed/tests/test_scheduler.py::test_balance_many_workers_2
      3 distributed/tests/test_scheduler.py::test_balance_many_workers
      3 distributed/tests/test_client.py::test_run_rpc
      3 distributed/tests/test_client.py::test_get_versions_rpc
      3 distributed/deploy/tests/test_adaptive.py::test_target_duration
      3 distributed/deploy/tests/test_adaptive.py::test_adaptive_local_cluster_multi_workers
      2 distributed/tests/test_scheduler.py::test_profile_metadata
      2 distributed/tests/test_scheduler.py::test_idle_timeout 
      2 distributed/tests/test_multi_locks.py::test_timeout_wake_waiter
      1 distributed/tests/test_worker.py::test_statistical_profiling_cycle
      1 distributed/tests/test_worker.py::test_remove_replicas_while_computing
      1 distributed/tests/test_utils_test.py::test_assert_worker_story_malformed_story[story4]
      1 distributed/tests/test_steal.py::test_worksteal_many_thieves
      1 distributed/tests/test_steal.py::test_steal_when_more_tasks
      1 distributed/tests/test_steal.py::test_steal_more_attractive_tasks
      1 distributed/tests/test_steal.py::test_dont_steal_few_saturated_tasks_many_workers
      1 distributed/tests/test_steal.py::test_dont_steal_fast_tasks_compute_time
      1 distributed/tests/test_scheduler.py::test_worker_reconnect_task_memory 
      1 distributed/tests/test_scheduler.py::test_worker_heartbeat_after_cancel
      1 distributed/tests/test_scheduler.py::test_profile_metadata_timeout
      1 distributed/tests/test_scheduler.py::test_profile_metadata_keys
      1 distributed/tests/test_failed_workers.py::test_restart_fast

@crusaderky
Copy link
Collaborator

@gjoseph92 please merge from main

@crusaderky
Copy link
Collaborator

please

  • un-xfail pandas on test_no_unnecessary_imports_on_worker
  • merge from main to remove python 3.7
  • accept suggestions and lint

@crusaderky
Copy link
Collaborator

Latest run vs main: https://github.com/crusaderky/distributed/actions/runs/1875755877

Failures:

distributed/dashboard/tests/test_scheduler_bokeh.py::test_compute_per_key
distributed/deploy/tests/test_adaptive.py::??
distributed/deploy/tests/test_adaptive.py::test_adaptive_local_cluster_multi_workers 
distributed/deploy/tests/test_adaptive.py::test_adapt_quickly
distributed/deploy/tests/test_adaptive.py::test_adapt_quickly
distributed/deploy/tests/test_adaptive.py::test_adapt_quickly 
distributed/deploy/tests/test_adaptive.py::test_adapt_quickly 
distributed/deploy/tests/test_adaptive.py::test_target_duration
distributed/deploy/tests/test_adaptive.py::test_target_duration
distributed/deploy/tests/test_local.py::test_adapt_then_manual
distributed/deploy/tests/test_local.py::test_adapt_then_manual
distributed/deploy/tests/test_local.py::test_adapt_then_manual
distributed/deploy/tests/test_local.py::test_adapt_then_manual
distributed/deploy/tests/test_local.py::test_asynchronous_property
distributed/deploy/tests/test_local.py::test_asynchronous_property
distributed/deploy/tests/test_local.py::test_asynchronous_property
distributed/deploy/tests/test_local.py::test_asynchronous_property
distributed/tests/test_client.py::test_dump_cluster_state_error
distributed/tests/test_client.py::test_dump_cluster_state_error
distributed/tests/test_client.py::test_dump_cluster_state_error
distributed/tests/test_client.py::test_dump_cluster_state_error
distributed/tests/test_client.py::test_dump_cluster_state_error
distributed/tests/test_client.py::test_dump_cluster_state_error
distributed/tests/test_client.py::test_get_versions_rpc_error
distributed/tests/test_client.py::test_get_versions_rpc_error
distributed/tests/test_client.py::test_get_versions_rpc_error
distributed/tests/test_client.py::test_get_versions_rpc_error
distributed/tests/test_client.py::test_get_versions_rpc_error
distributed/tests/test_client.py::test_reconnect_timeout
distributed/tests/test_client.py::test_reconnect_timeout
distributed/tests/test_client.py::test_run_rpc_error
distributed/tests/test_client.py::test_run_rpc_error
distributed/tests/test_client.py::test_run_rpc_error
distributed/tests/test_client.py::test_run_rpc_error
distributed/tests/test_failed_workers.py::test_failing_worker_with_additional_replicas_on_cluster
distributed/tests/test_failed_workers.py::test_failing_worker_with_additional_replicas_on_cluster
distributed/tests/test_multi_locks.py::test_timeout_wake_waiter
distributed/tests/test_multi_locks.py::test_timeout_wake_waiter
distributed/tests/test_scheduler.py::test_balance_many_workers
distributed/tests/test_scheduler.py::test_balance_many_workers
distributed/tests/test_scheduler.py::test_balance_many_workers
distributed/tests/test_scheduler.py::test_balance_many_workers
distributed/tests/test_scheduler.py::test_balance_many_workers_2
distributed/tests/test_scheduler.py::test_balance_many_workers_2
distributed/tests/test_scheduler.py::test_balance_many_workers_2
distributed/tests/test_scheduler.py::test_balance_many_workers_2
distributed/tests/test_scheduler.py::test_computations
distributed/tests/test_scheduler.py::test_profile_metadata
distributed/tests/test_scheduler.py::test_profile_metadata
distributed/tests/test_scheduler.py::test_profile_metadata_keys
distributed/tests/test_scheduler.py::test_profile_metadata_keys
distributed/tests/test_scheduler.py::test_profile_metadata_timeout
distributed/tests/test_scheduler.py::test_worker_breaks_and_returns
distributed/tests/test_scheduler.py::test_worker_breaks_and_returns
distributed/tests/test_scheduler.py::test_worker_breaks_and_returns
distributed/tests/test_scheduler.py::test_worker_breaks_and_returns
distributed/tests/test_scheduler.py::test_worker_heartbeat_after_cancel
distributed/tests/test_scheduler.py::test_worker_heartbeat_after_cancel
distributed/tests/test_steal.py::test_dont_steal_fast_tasks_compute_time
distributed/tests/test_steal.py::test_dont_steal_fast_tasks_compute_time
distributed/tests/test_steal.py::test_dont_steal_few_saturated_tasks_many_workers
distributed/tests/test_steal.py::test_dont_steal_few_saturated_tasks_many_workers
distributed/tests/test_steal.py::test_steal_more_attractive_tasks
distributed/tests/test_steal.py::test_steal_more_attractive_tasks
distributed/tests/test_steal.py::test_steal_twice
distributed/tests/test_steal.py::test_steal_twice
distributed/tests/test_steal.py::test_steal_twice
distributed/tests/test_steal.py::test_steal_twice
distributed/tests/test_steal.py::test_steal_when_more_tasks
distributed/tests/test_steal.py::test_steal_when_more_tasks
distributed/tests/test_steal.py::test_worksteal_many_thieves
distributed/tests/test_steal.py::test_worksteal_many_thieves
distributed/tests/test_utils_test.py::test_assert_worker_story_malformed_story[story4]
distributed/tests/test_utils_test.py::test_assert_worker_story_malformed_story[story4]
distributed/tests/test_utils_test.py::test_dump_cluster_unresponsive_remote_worker
distributed/tests/test_utils_test.py::test_dump_cluster_unresponsive_remote_worker
distributed/tests/test_variable.py::test_race
distributed/tests/test_variable.py::test_race
distributed/tests/test_worker.py::test_avoid_oversubscription
distributed/tests/test_worker.py::test_avoid_oversubscription
distributed/tests/test_worker.py::test_avoid_oversubscription
distributed/tests/test_worker.py::test_avoid_oversubscription
distributed/tests/test_worker.py::test_gather_many_small
distributed/tests/test_worker.py::test_gather_many_small
distributed/tests/test_worker.py::test_gather_many_small
distributed/tests/test_worker.py::test_gather_many_small
distributed/tests/test_worker.py::test_lifetime
distributed/tests/test_worker.py::test_lifetime
distributed/tests/test_worker.py::test_lifetime
distributed/tests/test_worker.py::test_lifetime
distributed/tests/test_worker.py::test_pause_executor
distributed/tests/test_worker.py::test_scheduler_delay
distributed/tests/test_worker.py::test_scheduler_delay
distributed/tests/test_worker.py::test_scheduler_delay
distributed/tests/test_worker.py::test_scheduler_delay
distributed/tests/test_worker.py::test_scheduler_delay
distributed/tests/test_worker.py::test_statistical_profiling_cycle
distributed/tests/test_worker.py::test_worker_reconnects_mid_compute
distributed/tests/test_worker.py::test_worker_reconnects_mid_compute
distributed/tests/test_worker.py::test_worker_reconnects_mid_compute
distributed/tests/test_worker.py::test_worker_reconnects_mid_compute
distributed/tests/test_worker.py::test_worker_reconnects_mid_compute
distributed/tests/test_worker.py::test_worker_reconnects_mid_compute_multiple_states_on_scheduler
distributed/tests/test_worker.py::test_worker_reconnects_mid_compute_multiple_states_on_scheduler
distributed/tests/test_worker.py::test_worker_reconnects_mid_compute_multiple_states_on_scheduler
distributed/tests/test_worker.py::test_worker_reconnects_mid_compute_multiple_states_on_scheduler
distributed/tests/test_worker.py::test_worker_reconnects_mid_compute_multiple_states_on_scheduler

@gjoseph92
Copy link
Collaborator Author

@crusaderky comments addressed

crusaderky and others added 2 commits February 23, 2022 13:29
Co-authored-by: Thomas Grainger <[email protected]>
Co-authored-by: Thomas Grainger <[email protected]>
@gjoseph92 gjoseph92 assigned graingert and unassigned gjoseph92 Mar 1, 2022
@crusaderky crusaderky assigned gjoseph92 and unassigned graingert Mar 1, 2022
@graingert
Copy link
Member

So it turns out importlib.metadata.version is very slow, adding a functools.cache around it appears to help https://github.com/graingert/distributed/actions/runs/1937969735

main

In [1]: from distributed.versions import get_versions

In [2]: %timeit get_versions()
148 µs ± 567 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

importlib

In [1]: from distributed.versions import get_versions

In [2]: %timeit get_versions()
12.3 ms ± 151 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

@mrocklin
Copy link
Member

I've handled merge conflicts and pushed to your branch. I hope that you don't mind @gjoseph92

@mrocklin
Copy link
Member

Looking at the runtimes of the heavily flaky test suites, I'm seeing 1h+ runtime instead of the usual 30min. Which is something I've seen in the past on main. So I suspect it may be github's fault.

This is still happening. I propose that instead maybe this change adds a non-trivial and somewhat stochastic cost to startup time, which has genuinely increased the duration of the test suite.

The good news is, several of these are known flaky tests which seem to have become a lot more flaky in this PR.
This may be a good workbench to finally iron them out.

Yeah, I like this way of thinking. Rather than try to hide the problem, let's expose it more clearly.

However, looking through the test failures that I'm seeing here (just anecdotally skimming) I'm seeing that half of them are things that I often see in practice, and half of them are, I think, unrelated.

At first I was very excited about this PR because it shined a bright light on all of the bad tests. However now I'm concerned that it might not be doing a good job of prioritizing them for us. I think that I still value just going through existing PRs and picking off failing tests from them and seeing if we can solve them. That's less clear signal, but it's also less biased signal.

@github-actions
Copy link
Contributor

github-actions bot commented Jul 5, 2022

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

       14 files   -        1         14 suites   - 1   15h 6m 29s ⏱️ + 8h 59m 24s
  2 910 tests +       1    2 794 ✔️  -      29    80 💤 ±  0    35 +  30  1 🔥 ±0 
20 024 runs   - 1 517  18 976 ✔️  - 1 623  919 💤  - 16  127 +121  2 🔥 +1 

For more details on these failures and errors, see this check.

Results for commit c72f970. ± Comparison against base commit 7b24c94.

@crusaderky
Copy link
Collaborator

This is still very red after merging from main.
The very good news is that I'm seeing many recurring offenders among the failed tests.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Don't import packages just for version compatibility check

5 participants