
Commit 25f02ba

DOC: an editing pass at the prose
1 parent a3134f6 commit 25f02ba

File tree: 1 file changed (+67, -47 lines)


Diff for: docs/source/design.rst

@@ -2,8 +2,10 @@
 Design
 ========
 
-When a Matplotlib :obj:`~matplotlib.artist.Artist` object in rendered via the `~matplotlib.artist.Artist.draw` method the following
-steps happen (in spirit but maybe not exactly in code):
+
+When a Matplotlib :obj:`~matplotlib.artist.Artist` object is rendered via the
+`~matplotlib.artist.Artist.draw` method the following steps happen (in spirit
+but maybe not exactly in code):
 
 1. get the data
 2. convert from unit-full to unit-less data
@@ -29,22 +31,23 @@ steps happen (in spirit but maybe not exactly in code):
    target.
 
 However, this clear structure is frequently elided and obscured in the
-Matplotlib code base: Step 3 is only present for *x* and *y* like data (encoded
-in the `~matplotlib.transforms.TransformNode` objects) and color mapped data
-(implemented in the `.matplotlib.colors.ScalarMappable` family of classes); the
-application of Step 2 is inconsistent (both in actual application and when it
-is applied) between artists; each ``Artist`` stores it's data in its own way
-(typically as numpy arrays).
+Matplotlib code base: Step 3 is only present for *x* and *y* like data
+(encapsulated in the `~matplotlib.transforms.TransformNode` objects) and color
+mapped data (encapsulated in the `.matplotlib.colors.ScalarMappable` family of
+classes); the application of Step 2 is inconsistent (both in actual application
+and when it is applied) between artists; each ``Artist`` stores its data in
+its own way (typically as numpy arrays).
 
 With this view, we can understand the `~matplotlib.artist.Artist.draw` methods
-to be very extensively `curried
-<https://en.wikipedia.org/wiki/Currying>`__ version of
-these function chains where the objects allow us to modify the arguments to the
-functions.
+to be very extensively `curried <https://en.wikipedia.org/wiki/Currying>`__
+versions of these function chains where the objects allow us to modify the
+arguments to the functions and then re-run them.
 
-The goal of this work is to bring this structure more the foreground in the internal of
-Matplotlib to make it easier to reason about, easier to extend, and easier to inject
-custom logic at each of the steps
+The goal of this work is to bring this structure more to the foreground in the
+internal structure of Matplotlib. By exposing this inherent structure
+uniformly in the architecture of Matplotlib, the library will be easier to
+reason about and easier to extend by injecting custom logic at each of
+the steps.
 
 A paper with the formal mathematical description of these ideas is in
 preparation.
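
To make the "curried function chain" framing above concrete, here is a small,
purely illustrative Python sketch; the names (``get_data``, ``convert_units``,
``to_screen``, ``draw_pipeline``) are invented for this example and are not
part of Matplotlib's API::

    from functools import partial

    def get_data(source):
        # Step 1: get the data
        return source()

    def convert_units(converter, data):
        # Step 2: convert from unit-full to unit-less data
        return {k: converter(v) for k, v in data.items()}

    def to_screen(transform, data):
        # Step 3 and onward: convert to the coordinate system of the target
        return {k: transform(v) for k, v in data.items()}

    def draw_pipeline(source, converter, transform):
        data = get_data(source)
        data = convert_units(converter, data)
        return to_screen(transform, data)

    # The Artist analogue holds a partially applied ("curried") pipeline whose
    # arguments can be swapped out, and simply re-runs it at draw time.
    render = partial(
        draw_pipeline,
        source=lambda: {"x": [0, 1, 2], "y": [0, 1, 4]},
        converter=lambda v: v,   # unit-full -> unit-less
        transform=lambda v: v,   # data -> screen
    )
    print(render())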
@@ -55,55 +58,66 @@ Data pipeline
 Get the data (Step 1)
 ---------------------
 
-Currently, almost all ``Artist`` class store the data associated with them as
-attributes on the instances as `numpy.array` objectss. On one hand, this can
-be very useful as historically data was frequently already in `numpy.array`
-objects and, if you know the right methods for *this* ``Artist`` you can access
-that state to update or query it. From a certain point of view, this is
-consistent with the scheme laid out above as ``self.x[:]`` is really
-``self.x.__getitem__(slice())`` which is (technically) a function call.
-
-However, this has several drawbacks. In most cases the data attributes on an
-``Artist`` are closely linked -- the *x* and *y* on a
+In this context "data" refers to data after any data-to-data transformations
+or aggregations. There is already extensive tooling and literature around
+that aspect. By completely decoupling the aggregation pipeline from the
+visualization process we are able to both simplify and generalize the problem.
+
+Currently, almost all ``Artist`` classes store the data they are representing
+as attributes on the instances as realized `numpy.array` [#]_ objects. On one
+hand, this can be very useful as historically data was frequently already in
+`numpy.array` objects in the users' namespace. If you know the right methods
+for *this* ``Artist``, you can query or update the data without recreating the
+Artist. This is technically consistent with the scheme laid out above if we
+understand ``self.x[:]`` as ``self.x.__getitem__(slice())`` which is a function
+call.
+
+However, this method of storing the data has several drawbacks. In most cases
+the data attributes on an ``Artist`` are closely linked -- the *x* and *y* on a
 `~matplotlib.lines.Line2D` must be the same length -- and by storing them
-separately it is possible that they will get out of sync in problematic ways.
-Further, because the data is stored as materialized ``numpy`` arrays, there we
-must decide before draw time what the correct sampling of the data is. While
-there are some projects like `grave <https://networkx.org/grave/>`__ that wrap
-richer objects or `mpl-modest-image
+separately it is possible for them to become inconsistent in ways that are
+not noticed until draw time [#]_. Further, because the data is stored as
+materialized ``numpy`` arrays, we must decide before draw time what the
+correct sampling of the data is. While there are some projects like `grave
+<https://networkx.org/grave/>`__ that wrap richer objects or `mpl-modest-image
 <https://github.com/ChrisBeaumont/mpl-modest-image>`__, `datashader
 <https://datashader.org/getting_started/Interactivity.html#native-support-for-matplotlib>`__,
 and `mpl-scatter-density <https://github.com/astrofrog/mpl-scatter-density>`__
-that dynamically re-sample the data these are niche libraries.
+that dynamically re-sample the data, these libraries have had only limited
+adoption.
 
-The first goal of this project is to bring support for draw-time resampleing to
-every Matplotlib ``Artist`` out of the box. The current approach is to move
-all of the data storage off of the ``Artist`` directly and into a (so-called)
-`~data_prototype.containers.DataContainer` instance. The primary method on these objects
-is the `~data_prototype.containers.DataContainer.query` method which has the signature ::
+The first goal of this project is to bring support for draw-time resampling to
+every Matplotlib ``Artist``. The proposed approach is to move the data storage
+off of the ``Artist`` and into a (so-called)
+`~data_prototype.containers.DataContainer` instance, rather than storing the
+data directly on the ``Artist``. The primary method on these objects is the
+`~data_prototype.containers.DataContainer.query` method which has the signature
+::
 
   def query(
       self,
-      transform: _Transform,
+      /,
+      coord_transform: _MatplotlibTransform,
       size: Tuple[int, int],
   ) -> Tuple[Dict[str, Any], Union[str, int]]:
 
 The query is passed in:
 
-- A transform from "Axes" to "data" (using Matplotlib's names for the `various
-  coordinate systems
-  <https://matplotlib.org/stable/tutorials/advanced/transforms_tutorial.html>`__
-- A notion of how big the axes is in "pixels" to provide guidance on what the correct number
-  of samples to return is.
+- A *coord_transform* from "Axes fraction" to "data" (using Matplotlib's names
+  for the `coordinate systems
+  <https://matplotlib.org/stable/tutorials/advanced/transforms_tutorial.html>`__)
+- A notion of how big the axes is in "pixels" to provide guidance on what the
+  correct number of samples to return is. For raster outputs this is literal
+  pixels but for vector backends it will have to be an effective resolution.
 
 It will return:
 
-- A mapping of strings to things that is coercible (with the help of the
+- A mapping of strings to things that are coercible (with the help of the
   functions in steps 2 and 3) to a numpy array or types understandable by the
   backends.
 - A key that can be used for caching
 
-This function will be called at draw time by the ``Aritist`` to get the data to
+This function will be called at draw time by the ``Artist`` to get the data to
 be drawn. In the simplest cases
 (e.g. `~data_prototype.containers.ArrayContainer` and
 `~data_prototype.containers.DataFrameContainer`) the ``query`` method ignores
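
To make the ``query`` contract above concrete, here is a rough sketch of a
container that samples a function at a resolution derived from *size*. It is
only an illustration of the signature described here, not the actual
`~data_prototype.containers` implementation, and it assumes *coord_transform*
behaves like a Matplotlib transform object::

    import hashlib
    from typing import Any, Dict, Tuple, Union

    import numpy as np

    class FuncContainer:
        """Sketch: re-sample f(x) at draw time based on the screen size."""

        def __init__(self, func):
            self._func = func

        def query(
            self,
            /,
            coord_transform,
            size: Tuple[int, int],
        ) -> Tuple[Dict[str, Any], Union[str, int]]:
            npts = max(int(size[0]), 2)  # roughly one sample per horizontal "pixel"
            # Map the visible Axes range ((0, 0)-(1, 1) in Axes fraction) into
            # data space to find the x range to sample over.
            (x0, _), (x1, _) = coord_transform.transform([(0, 0), (1, 1)])
            x = np.linspace(x0, x1, npts)
            y = self._func(x)
            # The cache key must change whenever the returned data changes.
            cache_key = hashlib.sha256(np.concatenate([x, y]).tobytes()).hexdigest()
            return {"x": x, "y": y}, cache_key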
@@ -124,15 +138,15 @@ visualization. This also opens up several interesting possibilities:
 
 By accessing all of the data that is needed in draw in a single function call
 the ``DataContainer`` instances can ensure that the data is coherent and
-consistent. This is important for applications like steaming where different
+consistent. This is important for applications like streaming where different
 parts of the data may be arriving at different rates and it would thus be the
 ``DataContainer``'s responsibility to settle any race conditions and always
 return aligned data to the ``Artist``.
 
 
 There is still some ambiguity as to what should be put in the data. For
 example with `~matplotlib.lines.Line2D` it is clear that the *x* and *y* data
-should be pulled from the ``DataConatiner``, but things like *color* and
+should be pulled from the ``DataContainer``, but things like *color* and
 *linewidth* are ambiguous. A later section will make the case that it should be
 possible, but maybe not required, that these values be accessible in the data
 context.
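
One way a streaming container might uphold that guarantee -- a sketch only,
not necessarily how ``data_prototype`` will do it -- is to snapshot all of its
columns under a single lock::

    import threading
    from typing import Any, Dict, Tuple

    import numpy as np

    class StreamingContainer:
        """Sketch: always hand back x/y arrays of matching length."""

        def __init__(self):
            self._lock = threading.Lock()
            self._x = np.empty(0)
            self._y = np.empty(0)
            self._version = 0

        def append(self, x, y):
            # Writers update both columns atomically.
            with self._lock:
                self._x = np.append(self._x, x)
                self._y = np.append(self._y, y)
                self._version += 1

        def query(self, /, coord_transform, size) -> Tuple[Dict[str, Any], int]:
            # Readers always see a consistent snapshot, even while new data
            # is arriving on another thread.
            with self._lock:
                return {"x": self._x.copy(), "y": self._y.copy()}, self._version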
@@ -224,7 +238,7 @@ returns a cache key that it generates to the caller. The exact details of how
 to generate that key are left to the ``DataContainer`` implementation, but if
 the returned data changed, then the cache key must change. The cache key
 should be computed from a combination of the ``DataContainer``'s internal state,
-the transform and size passed in.
+the coordinate transformation and size passed in.
 
 The choice to return the data and cache key in one step, rather than as a
 two-step process, is driven by simplicity and because the cache key is computed
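
For illustration, one hypothetical way to combine those three ingredients
(container state, coordinate transformation, and size) into a key -- not a
prescribed implementation -- might be::

    import hashlib

    import numpy as np

    def make_cache_key(internal_state_token, coord_transform, size):
        # ``internal_state_token`` is assumed to be something the container
        # already tracks, e.g. a counter bumped whenever its data changes.
        hasher = hashlib.sha256()
        hasher.update(repr(internal_state_token).encode())
        # Summarize the transform by where it sends a few reference points;
        # if the mapping changes, so does the key.
        probe = coord_transform.transform([(0, 0), (0.5, 0.5), (1, 1)])
        hasher.update(np.asarray(probe).tobytes())
        hasher.update(repr(tuple(size)).encode())
        return hasher.hexdigest()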
@@ -239,3 +253,9 @@ management at the ``Artist`` layer. We also need to determine how many cache
 layers to keep. Currently only the results of Step 3 are cached, but we may
 want to additionally cache intermediate results after Step 2. The caching from
 Step 1 is likely best left to the ``DataContainer`` instances.
+
+.. [#] Not strictly true; in some cases we also store the data in the
+       container it came in with, which may not be a `numpy.array`.
+.. [#] For example `matplotlib.lines.Line2D.set_xdata` and
+       `matplotlib.lines.Line2D.set_ydata` do not check the lengths of the
+       input at call time.
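
The behavior described in the second footnote can be seen with a short snippet
using Matplotlib's public API (the exact exception and message will vary by
version)::

    import matplotlib
    matplotlib.use("Agg")  # non-interactive backend, just for the example
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    (ln,) = ax.plot([0, 1, 2], [0, 1, 4])

    # Neither setter validates against the other, so the Line2D is now
    # internally inconsistent ...
    ln.set_xdata([0, 1, 2, 3])
    ln.set_ydata([0, 1])

    # ... and the mismatch only surfaces when the figure is rendered.
    try:
        fig.canvas.draw()
    except Exception as err:
        print(type(err).__name__, err)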
