Defer pandas import on worker in P2P shuffle #5695
Conversation
In dask#5688, we discovered that dask#5520 was increasing worker unmanaged memory usage by ~10MiB at startup (on my mac). I suspect that this came from importing pandas.

The shuffle extension already worked if pandas wasn't available: it just wouldn't install itself on the worker. However, to check if pandas was available, it was importing pandas, so if pandas _was_ available, every worker would have to spend the time and memory to import it at startup, even if it wasn't used at all.

With this PR, all use of pandas is deferred. There's also now an explicit test that importing the shuffle does not import pandas.
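For context, a common way to check whether an optional dependency is available without paying its import cost is `importlib.util.find_spec`, which searches for the module without executing it. A minimal sketch of the two approaches (illustrative only; not necessarily the exact mechanism this PR uses):

```python
import importlib.util

# Eager check (the old behavior described above): importing pandas just to
# see whether it exists pays the full import cost (time and ~10MiB of
# memory) on every worker, even if the shuffle extension is never used.
try:
    import pandas  # noqa: F401
    HAS_PANDAS_EAGER = True
except ImportError:
    HAS_PANDAS_EAGER = False

# Deferred-friendly check: find_spec only locates the module, it does not
# execute it, so startup stays cheap. pandas itself is then imported inside
# the functions that actually need it.
HAS_PANDAS = importlib.util.find_spec("pandas") is not None
```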
```python
    column: str,
    npartitions: int | None = None,
):
    from dask.dataframe import DataFrame
```
I don't love how `dask.dataframe.DataFrame` is also imported if `TYPE_CHECKING`. It means I could remove this line here, and the code would still pass mypy and flake8, even though it would fail at runtime. Tests would still fail, but I could see making a similar mistake in the future in some place that didn't have test coverage, and unwittingly introducing a runtime error.
However, because I want to use `DataFrame` as a type annotation, it does need to be imported for type checkers.
I don't really think there's a way around this. Just curious if anyone had ideas, particularly @crusaderky or @ian-r-rose
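For reference, the pattern under discussion looks roughly like this (the function name and body here are illustrative, not the PR's exact code):

```python
from __future__ import annotations

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Seen only by type checkers; never executed at runtime, so importing
    # a pandas-dependent module here is free.
    from dask.dataframe import DataFrame


def rearrange_by_column_p2p(
    df: DataFrame,
    column: str,
    npartitions: int | None = None,
):
    # The runtime import. Deleting this line would still pass mypy and
    # flake8, because the TYPE_CHECKING import above defines the name for
    # them, but the isinstance check below would then raise NameError.
    from dask.dataframe import DataFrame

    assert isinstance(df, DataFrame)
    ...
```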
Yep, this issue is well-known upstream: PyCQA/pyflakes#530
The solution is remembering that linters are a nice and fast way to detect issues but cannot replace unit tests :)
```python
import subprocess
import sys

import_check_code = """
```
I tried a couple of other options to avoid this kind of nasty "code in a string" pattern:

- Mucking with `sys.modules` (setting `pandas` to `None`, removing `dask.dataframe` and `distributed.shuffle`, then re-importing `distributed.shuffle`). This worked, but felt brittle, because you basically had to list all the places pandas might be transitively imported from and remove them from `sys.modules` (a rough sketch follows below).
- Using `multiprocessing` in `spawn` mode to start a fresh interpreter and run this code. Because `spawn` mode works by sending a pickled function, the function must be importable. And un-pickling the function means importing `distributed.shuffle.tests.test_no_pandas_import.func_name_whatever`. Therefore, `distributed.shuffle` gets imported before your function can even execute! So though you can assert that pandas hasn't been imported, if it has been imported, you can't tell from where, which makes the test very annoying to debug.
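For concreteness, a rough sketch of that first rejected approach (hypothetical test, not code from this PR), using pytest's `monkeypatch` so that `sys.modules` is restored afterwards:

```python
import sys


def test_import_no_pandas_via_sys_modules(monkeypatch):
    # The brittle part: this prefix list must name every module that might
    # have transitively imported pandas, or the test silently checks nothing.
    for mod in list(sys.modules):
        if mod.startswith(("pandas", "dask.dataframe", "distributed.shuffle")):
            monkeypatch.delitem(sys.modules, mod)

    # Setting the cache entry to None makes any later `import pandas` raise
    # ModuleNotFoundError (per the import-system docs on the module cache).
    monkeypatch.setitem(sys.modules, "pandas", None)

    # Re-import: this must succeed even though pandas is now un-importable.
    import distributed.shuffle  # noqa: F401
```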
In the end, the "code in a string" was both much easier to read and more robust than anything else I could come up with.
(Also, if distributed had a mindeps test, we might not need this at all. But it's nice to have anyway for local development.)
```python
import subprocess
import sys

import_check_code = """
import sys

current_pandas_modules = [m for m in sys.modules if m.startswith("pandas")]
assert (
    not current_pandas_modules
), "pandas is already imported at startup: " + "\\n".join(current_pandas_modules)

# Make pandas un-importable
sys.modules["pandas"] = None
# "if the value is None, then a ModuleNotFoundError is raised"
# https://docs.python.org/3.6/reference/import.html#the-module-cache

import distributed.shuffle  # noqa: F401
"""


def test_import_no_pandas():
    p = subprocess.run(
        [sys.executable],
        input=import_check_code,
        text=True,
        timeout=10,
    )
    assert (
        p.returncode == 0
    ), "Importing the shuffle extension without pandas failed. See logs for details."
```
Suggested change (replacing the subprocess-based test above):

```python
import sys

from distributed.nanny import Nanny
from distributed.utils_test import gen_cluster, inc


@gen_cluster(
    client=True,
    Worker=Nanny,
    nthreads=[("", 1)],
    config={"distributed.worker.multiprocessing-method": "spawn"},
)
async def test_import_no_pandas(c, s, a, b):
    """Test that pandas is not imported when a worker is spawned,
    unless the shuffle extension is actually used
    """

    def assert_no_pandas(dask_worker):
        assert dask_worker.extensions["shuffle"]
        assert "pandas" not in sys.modules

    assert await c.submit(inc, 1) == 2
    await c.run(assert_no_pandas)
```
This is a neat way of doing it. The one thing is that it won't point you to where pandas is being imported, just that it is (somewhere). The reason I went for running the script in a subprocess was that you'd get an ImportError, with a traceback pointing to exactly where the import was happening, which made fixing the failure much easier.
If you'd prefer giving up the debuggability for this more conventional pattern, I'm happy to make the change.
I think that git bisect is enough to offer debuggability?
Even still, you have to figure out what line of code is causing the import (and know how to use git bisect). Getting the traceback to the exact line of code just feels more friendly to other contributors.
I counter-argue that your version would not notice if there were an `import pandas` inside `ShuffleWorkerExtension.__init__`. IMHO tests should be as black-box as possible.
That's a good point.
Unfortunately, the gen_cluster approach doesn't work: pandas is already imported even if the shuffle extension is not imported on the worker. A good example of how tough debugging is when there's no traceback showing where the import happens!
My guess is that the problem is that nannies pre-import pandas (and other modules) automatically into the worker process. The global `mp_context` used by nannies to create worker processes is initialized at import time:
`distributed/distributed/utils.py`, lines 69 to 93 at 338d0be:

```python
def _initialize_mp_context():
    if WINDOWS or PYPY:
        return multiprocessing
    else:
        method = dask.config.get("distributed.worker.multiprocessing-method")
        ctx = multiprocessing.get_context(method)
        # Makes the test suite much faster
        preload = ["distributed"]
        if "pkg_resources" in sys.modules:
            preload.append("pkg_resources")

        from .versions import optional_packages, required_packages

        for pkg, _ in required_packages + optional_packages:
            try:
                importlib.import_module(pkg)
            except ImportError:
                pass
            else:
                preload.append(pkg)
        ctx.set_forkserver_preload(preload)
        return ctx


mp_context = _initialize_mp_context()
```
In particular, notice how all required and optional packages that are importable are added to `set_forkserver_preload`.
`pandas` is in the list of optional packages (`distributed/distributed/versions.py`, lines 24 to 29 at 338d0be):

```python
optional_packages = [
    ("numpy", lambda p: p.__version__),
    ("pandas", lambda p: p.__version__),
    ("lz4", lambda p: p.__version__),
    ("blosc", lambda p: p.__version__),
]
```
(Sidenote: we're using `spawn` by default; I confirmed that's what's used when I run the test. Reading the multiprocessing code, I don't understand why `set_forkserver_preload` is affecting non-forkserver contexts. But if I comment out the pandas line from `optional_packages`, the test passes, so I'm not going to look into it much further.)
For posterity: @crusaderky and I discovered that `set_forkserver_preload` has no effect when using `spawn`, as expected. Instead, the reason pandas was always imported on Nanny workers (but not imported if it was commented out from the list in `versions.py`) is that the `get_versions` check on client connect imports all of the listed packages in order to get their versions. This could be avoided.
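A minimal stand-alone script (not from this PR; assumes a POSIX system with pandas installed) that demonstrates this: `set_forkserver_preload` is only honored by forkserver contexts, so spawned children start without pandas.

```python
import multiprocessing
import sys


def child_check(queue):
    # Report whether pandas was already in sys.modules when the child started
    queue.put("pandas" in sys.modules)


if __name__ == "__main__":
    for method in ("spawn", "forkserver"):
        ctx = multiprocessing.get_context(method)
        # set_forkserver_preload is just a hint; only forkserver honors it
        ctx.set_forkserver_preload(["pandas"])
        queue = ctx.Queue()
        proc = ctx.Process(target=child_check, args=(queue,))
        proc.start()
        print(f"{method}: pandas preloaded in child = {queue.get()}")
        proc.join()
    # Expected: "spawn: ... False" but "forkserver: ... True"
```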
xfail'd until dask#5724
...Great, now test_memory is heavily flaky -__-

Merged CI test of #5695 + #5724 running on crusaderky#2 and things don't look very good so far. I can see much more flakiness than main and a much slower end-to-end test runtime.

It seems like CI is back on track. I reran all unit tests.

@gjoseph92 please merge from main; you should get all green now (well, no more red than usual at least 😆)

Force-pushed from a83b5fb to a591f9a

Thanks @crusaderky for all your work smoothing out the flakes!

Just on one windows box! Looking very good.

Thank you!
@crusaderky @gjoseph92 - Just a note that this PR seems to have broken Dask's `test_scheduler_highlevel_graph_unpack_import` tests. It doesn't look like this PR does anything dangerous, since pandas does not need to be available. However, the Dask tests currently assume that the scheduler will not import pandas/numpy. cc @jrbourbeau - In case you have thoughts on