Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Various Cube merge/concatenate issues #5375

Open
acchamber opened this issue Jul 11, 2023 · 4 comments
Open

Various Cube merge/concatenate issues #5375

acchamber opened this issue Jul 11, 2023 · 4 comments
Assignees
Labels
Dragon 🐉 https://github.com/orgs/SciTools/projects/19?pane=info Feature: Merge/Concatenate

Comments

@acchamber
Copy link
Contributor

acchamber commented Jul 11, 2023

Background

Following the Dragon Taming session offline and #4446 #3234, I thought it best to try and capture what users want from improvements to merge/concatenate and what pain points there are, so we can establish what if any can be fixed.

So the user story is as follows - I've loaded in some data as iris cubes, and put them in my cubelist. I want to do this to perform some kind of analysis on the resulting cubes, producing plots or datasets, not cubes to further share on, so destruction or ignorance of metadata is fine. I've done equalise_attributes and unify_time_units. Yet Iris refuses to join my cubes together, when I know the data makes sense together. Why?

Exploration of Problem

Here's a few examples I've found from talking to users / yammer/AVD knowledgebase

  1. Dim Coord bounds not being identical (off by 0.00001 in on the edge for one example)
  2. One cube having cell methods and the other not, or mismatches in cell methods
  3. Auxiliary coordinate bounds not matching dim coords bounds
  4. Cube units not matching
  5. Cubes having different auxiliary or derived coordinates
  6. Dim Coords having the same values, but different Dtypes
  7. Cubes having overlapping times

I think you can put these in three categories - things that are bugs to fix, things that a "force" keyword or another util function could fix, and things that should be errors.

  • Bugs to fix

Dim coords having same values, but different Dtypes (see #5372 for it already being addressed for time, but I think a broader check may be useful)

  • Things that the user should be able to force

Remove all cell methods from the cubes before merge/concat
Remove all auxiliary coordinates from the cubes before merge/concat
Remove all derived coordinates from the cubes before merge/concat
Remove scalar coordinates before concatenate (not merge)
Remove bounds from dimensional coordinates and only compare points (maybe guess_bounds afterwards to resstablish?)
Guess a order for dimensional coordinates for cases where dim coords are exactly equal (maybe, might leave this one as error)

  • Things that should still error

Cubes having overlapping times.
Cubes having different units (or names, for either coords or the cube itself)

Proposed solution

In this process, the hardest to diagnose errors were often ones relating to Dim coords not matching. Adding a function to iris similar to @rcomer's coord_diffs (http://fcm9/projects/utils/browser/hadru-python/trunk/iris_wrappers.py?marks=19-52#L19) to allow easier or automatic comparison of coords when there is an error would also improve the user experience.

So, I propose two things. A coord_comparison function as a coord method that gives more detail to the user when a concatenate or merge fails due to dim coords not matching.

And a Force keyword for concatenate/merge cube that does some or all of the above listed fixes to allow users to automatically do the steps in iris they are already manually doing to their cubes. It should also automatically call equalise_attributes and unify_time_units

Thoughts?

@rcomer
Copy link
Member

rcomer commented Jul 11, 2023

For anyone outside the Met Office who is interested, I also have the coord_diffs function as a gist. I'm afraid it has no tests.

@bjlittle bjlittle added the Dragon 🐉 https://github.com/orgs/SciTools/projects/19?pane=info label Aug 2, 2023
@mo-matthewfry
Copy link

Great issue, thanks for articulating those problems.

To add another one to the list, when trying to concatenate ensemble runs the realization coordinate often overlaps, e.g., different MOGREPS-G runs all have the control member numbered 0.

Appreciate sometimes you’d want this information to be preserved but most of the time I don’t really care which member is which!

@scitools-ci scitools-ci bot removed this from 🚴 Peloton Dec 15, 2023
@trexfeathers trexfeathers moved this to 📌 Prioritised in 🐉 Dragon Taming Feb 23, 2024
@trexfeathers trexfeathers moved this from 📌 Prioritised to 🛡 Championed in 🐉 Dragon Taming Feb 23, 2024
@pp-mo
Copy link
Member

pp-mo commented Nov 14, 2024

Taming Design Notes

Similarity of proposed "operations"

All of the proposed "operations" follow a pattern -
(which matches iris.util.equalise_attributes or iris.util.unify_time_coords)

  • they take a list of cubes + return a list of corresponding modified cubes
  • equalise or remove various metadata elements to make all the cubes mergeable/concatenable
  • return the same cubes, if operation is not needed (or not possible)
  • won't make further changes, if re-applied
  • can work out whether changes are needed (e.g. remove elements only if they disagree)
  • equalise all the cubes they are given,
    intending that they will all (potentially) combine into a single result cube

Proposed new API form

We could write all these as additional iris.util routines, in the style of equalise_attributes/unify_time_coords
- but we think it is probably better to have a single function
-- say, "equalise_cubes" --
with keyword controls for the various different operations

  • NOTE: assumes ideal = all ops work independently, and independent of order of application

Possible additional functions

We can think of some extra possibly useful operations, too:

  • unify_time_units
  • equalise_attributes
    • (or possibly more lenient-style approach = where some cube don't have it, retain on all)
  • unify_units
    • seems possible where units are compatible (but must preclude degC/K conversion due to relative/absolute confusion)

Grouping of input cubes ?

Optionally, the top-level "equalise_cubes" function could identify the "groups" of cubes (which would naturally be chosen by merge/concat),
so as to allow the operation to be applied to (for instance) all cubes from a load.
HOWEVER there are remaining questions about this :

  • the individual operations assume that the input group is identified as a candidate for merge/concatenate
    • thus, they can't be used on cube groups which "should" be kept separate -- e.g. different phenomena
    • the grouping could be performed within the "combine_cubes", or the "equalise_cubes" operation
  • It's not (yet) clear whether all operations are completely independently applicable, i.e.
    • will a number of operation produce equivalent results, regardless of the order they are applied in
    • can they all be applied, just once, in advance of any/all merge/concatenate operations
  • most of the suggested operations seem to require no additional context or arguments,
    • so you either run that operation, or you don't
    • but SOME it seems might need additional per-operation difference
      = e.g. "Remove scalar coordinates" (effectively using "new_axis"??)
      • might need to specify which
      • and, as for the time-dependent-orography problem, possibly which other aux-coords need to map the new dim)
  • actual grouping is determined by merge/concat implementations
    -- but we suspect that the definition of this may be subtly different for merge + concatenate ???

Embedding in extended "combine_cubes" operation

We also anticipate that "equalise_cubes" can become an additional step in an extended "combine_cubes" operation
(and so will mean extending "CombineProfile" (=="LoadProfile" as-is) to accommodate)
( See pp-mo#82
But note : this also awaiting #6199 before it makes sense as PR to the main Iris fork
)
But this means:

  • we probably do it only once, before any merge/concatenate steps (and repeats)
  • all controls need to have identifiable independent keywords which can appear as keywords of "combine_cubes"
  • combine_cubes does allow cubes to be mixed across multiple "output groups"
    -- ( grouping discussed above )
    N.B. various API need to shake down before "combine_cubes" becomes a reality.
    So we can think forward to this, but should definitely wait
    (Aside : future of "combine_cubes"
  • expect LoadPolicy to be replaced by "combine_policy"
  • expect "merge_unique" to be added as a control
    )

@pp-mo
Copy link
Member

pp-mo commented Nov 14, 2024

Proposed Development Plan

=Tamed ?

  1. craft the initial operation, implementing only certain operations
    = why not equalise_attributes/unify_time_coords for now, as these already exist ?
  2. choose (and implement) a list of further "equalisation operations"
  3. decide on integration with combine_cubes (though this API has not settled yet)

Notes on process

for taming meeting:

  • address the tech worries:
    • group identification
    • apply-once-in-any-order ?
  • need to reconsider from the user perspective
    • are we addressing the right needs
    • is the interface usable
  • could go in experimental ?
  • (OR) could keep initial func private, until it becomes useful (+ API settles a bit?)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Dragon 🐉 https://github.com/orgs/SciTools/projects/19?pane=info Feature: Merge/Concatenate
Projects
Status: 🛡 Championed
Status: No status
Development

No branches or pull requests

7 participants