

Added support for ancillary files #190

Open · wants to merge 6 commits into base: main
Conversation

@atteggiani (Collaborator) commented Feb 12, 2025

Fixes #101.
Fixes #8.

  • Removed NotSupportedError for ancillary files.
  • Added a fix_iris_calendar function to better manage fixing of the iris calendar, including handling of the proleptic Gregorian calendar. Also added another fix to iris, which was throwing an error when opening ancillary files with the calendar set to cf_units.CALENDAR_PROLEPTIC_GREGORIAN.
  • Added an is_ancil function to check whether a mule file is an ancillary file.
  • Excluded some tasks from being performed when the input fields file is an ancillary file.
    Based on the comments in Support conversion of ancillary files #101, the following tasks are skipped when the file is an ancillary file:
    • cubes masking
    • coordinate fixing (lat/lon and lev)
    • pressure levels fixing
    • forecast reference time fixing
  • Updated unit tests.

The unit tests pass and I am able to successfully convert all the ancillary files that I tested.

It would be good to add some regression tests for ancillary files too. More details in #189.
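The is_ancil check described in the summary could be sketched roughly as below. This is an illustrative guess, not the PR's actual code: it assumes mule exposes a fixed_length_header.dataset_type attribute, and it relies on the UM convention that a dataset_type of 4 marks an ancillary file (3 is a fields file, 1/2 are dumps).

```python
from types import SimpleNamespace

# UM fixed-length-header dataset_type values, by UM convention:
# 1/2 = dump, 3 = fields file, 4 = ancillary file.
UM_ANCIL_DATASET_TYPE = 4

def is_ancil(mule_file):
    """Return True if a mule-like file object looks like a UM ancillary file."""
    return mule_file.fixed_length_header.dataset_type == UM_ANCIL_DATASET_TYPE

# Minimal stand-ins for mule file objects, for illustration only:
ancil = SimpleNamespace(fixed_length_header=SimpleNamespace(dataset_type=4))
fields = SimpleNamespace(fixed_length_header=SimpleNamespace(dataset_type=3))
```

With a real mule file one would pass the object returned by mule's file loader instead of the stand-ins.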

atteggiani and others added 4 commits February 3, 2025 12:21

  • Change get_z_sea_constants to return [None, None] if the file is an ancillary file.
  • …e iris calendar to handle proleptic gregorian calendar.
    Added another fix to iris, which was throwing an error when opening ancillary files with the calendar set to cf_units.CALENDAR_PROLEPTIC_GREGORIAN.
  • …file.
  • Updated tests for the get_z_sea_constants function.
@atteggiani (Collaborator, PR author) commented Feb 13, 2025

A few more details on the need for the new_epoch_date_hours_internals function to fix the iris calendar:

The patch to the iris calendar (made by patching the calendar property) is needed until #3561 gets fixed (hopefully soon).
The problem with this patch, however, is that some internal iris functions perform checks on the calendar type, for example the _epoch_date_hours_internals function.
This function currently doesn't account for the cf_units.CALENDAR_PROLEPTIC_GREGORIAN calendar type, so without further fixing, opening some files with iris results in ValueError: unrecognised calendar : cf_units.CALENDAR_PROLEPTIC_GREGORIAN.

You can test this, for example, by commenting out this line in this head branch and running:

um2nc /g/data/access/TIDS/UM/ancil/atmos/n48e/orography/globe30/v2/qrparm.orog test_nc

It should be the same for most (if not all) ancillary files, or any UM file without a time specification.

Thank you @blimlim for asking about this. It made me realise an issue with fix_iris_calendar, which I have now corrected!
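The shape of this workaround can be illustrated with a small, self-contained sketch. All names below are hypothetical stand-ins (it does not touch iris itself); the real patch targets iris's _epoch_date_hours_internals via fix_iris_calendar. The idea is simply to wrap the calendar-dispatching routine so that a proleptic Gregorian calendar is accepted instead of raising ValueError:

```python
# Hypothetical stand-ins for iris internals, for illustration only.
CALENDAR_STANDARD = "standard"
CALENDAR_PROLEPTIC_GREGORIAN = "proleptic_gregorian"

def epoch_date_hours(calendar, date):
    """Stand-in for an iris-internal routine that dispatches on calendar type."""
    if calendar == CALENDAR_STANDARD:
        return f"hours since epoch for {date} ({calendar})"
    raise ValueError(f"unrecognised calendar : {calendar}")

def epoch_date_hours_patched(calendar, date):
    """Wrapper that also accepts the proleptic Gregorian calendar.

    For dates from 1582 onwards the proleptic Gregorian and the standard
    (mixed Julian/Gregorian) calendars agree, so delegating to the
    standard-calendar path is safe for typical UM reference dates.
    """
    if calendar == CALENDAR_PROLEPTIC_GREGORIAN:
        calendar = CALENDAR_STANDARD
    return epoch_date_hours(calendar, date)
```

The unpatched stand-in reproduces the "unrecognised calendar" failure mode described above; the wrapper normalises the calendar before dispatch.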

@blimlim (Collaborator) left a comment:

Thanks @atteggiani for putting this PR together. I think it's looking good.

I had a look back over the diff between ancil2netcdf.py and um2netcdf4.py here and noticed some small differences between what the old script and this new version do. I suspect these differences aren't highly important, but I have noted them down in the comments and would be keen to hear from @MartinDix if there are important reasons behind the changes in ancil2netcdf.py.


    # TODO: flag warnings as an error for the driver script?
    if not args.use64bit:
        convert_32_bit(c)

    fill_value = fix_fill_value(c)
    if not is_ancillary:
        fix_forecast_reference_time(c)

Collaborator:

Tiny suggestion – I think it should be possible to move the fix_forecast_reference_time call further up to ~line 654, into the first if not is_ancillary: block.

@atteggiani (Collaborator, PR author):

Good point!
I tried not to restructure things too much (that might happen in a later PR), but this seems pretty sensible to include here.

Actually, on this point, I now think that fix_forecast_reference_time should still be performed for ancillary files, because the function already does nothing when the fieldsfile doesn't have a forecast_reference_time coordinate (as ancillary files don't). However:

  • I am not 100% sure that no ancillary file exists with a forecast_reference_time coordinate.
  • Dump files should also not have the forecast_reference_time coordinate, but the function is called anyway for that type of fieldsfile, so I would do the same for ancillary files.

What do you think?

Collaborator:

I've had a dig through the ESM1.5 ancillary files and found two examples that contain forecast_reference_time data:

  1. /g/data/vk83/configurations/inputs/access-esm1p5/modern/unused/atmosphere/climatology/global.N96/2020.05.19/qrclim.slt

This file isn't actually being used by ESM1.5 – it's just sitting there in the directory.

It contains forecast_reference_time as the main time coordinate, and time as an auxiliary coordinate:

The points differ between the two coordinates:

>>> print(test2[0].coord("forecast_reference_time")[0])
DimCoord :  forecast_reference_time / (hours since 1970-01-01 00:00:00, standard calendar)
    points: [1996-01-31 23:59:00]

>>> print(test2[0].coord("time")[0])
AuxCoord :  time / (hours since 1970-01-01 00:00:00, standard calendar)
    points: [1987-07-17 23:59:30]

If we run fix_forecast_reference_time() on this file, it will delete the forecast_... coordinates. I don't know if there would be any issues with this, as I'm not sure which of the time coordinates are most relevant for this file, or whether the file is just structured strangely.

Since time is not a dimension coordinate in this file, we end up with a new dimension dim0 in place of forecast_reference_time:

dimensions:
	dim0 = 13 ;
	latitude = 145 ;
	longitude = 192 ;
	bnds = 2 ;
variables:
	float tsl(dim0, latitude, longitude) ;
		tsl:_FillValue = 1.e+20f ;
		tsl:standard_name = "soil_temperature" ;
...
...
	double time(dim0) ;
		time:bounds = "time_bnds" ;
		time:units = "days since 1970-01-01 00:00" ;
		time:standard_name = "time" ;
		time:calendar = "proleptic_gregorian" ;
  2. /g/data/vk83/configurations/inputs/access-esm1p5/modern/amip/atmosphere/boundary_conditions/global.N96/2021.07.08/amip_sst_n96_greg.pp

This SST file is used for ESM1.5 AMIP simulations. It contains time, forecast_reference_time, and forecast_period coordinates, but time is the main record coordinate.

In this case, running fix_forecast_reference_time() deletes the forecast_... coordinates and correctly keeps the time dimension as the record dimension:

time = UNLIMITED ; // (1764 currently)
	latitude = 145 ;
	longitude = 192 ;
	bnds = 2 ;
variables:
	float ts(time, latitude, longitude) ;
		ts:_FillValue = 1.e+20f ;
...

I don't think there's a problem with deleting the forecast_... coordinates for this file. It's handled by the model in a similar way to the sea ice file (/g/data/vk83/configurations/inputs/access-esm1p5/modern/amip/atmosphere/boundary_conditions/global.N96/2021.07.08/amip_seaice_n96_greg.pp), which has no forecast_... coordinates.

@atteggiani (Collaborator, PR author) commented Feb 17, 2025:

Thank you @blimlim for the information provided!

I think this relates to how iris converts fieldsfiles to cubes, which I believe sometimes messes things up.
I think the general behaviour of fix_forecast_reference_time() should NOT vary based on the fieldsfile type; it should be based only on what we want the output NetCDF variable coordinates to look like.

For example, in your example 1, I think:

  • time should be considered a dimension coordinate (as it should be in all fieldsfiles)
  • bnds should not be present as a dimension, but only as a variable related to latitude and longitude (and maybe time and lev --> related comment).

Therefore the output NetCDF variable should have the following dimension coordinates:

time = 13
latitude = 145
longitude = 192

For auxiliary coordinates, I am not sure if forecast_reference_time and forecast_period, when present, can be reliable (probably not).
Hence, it might be good to remove them (as it currently happens) to avoid confusion and inaccuracies.

I believe this should apply to all output NetCDF files, regardless of the type of fields file they were converted from.

Also:

  • if time = 1 (there is no actual time dimension and the data is instantaneous), the time dimension can be dropped, making the output variable in the NetCDF 2D (lat/lon)
  • if level > 1 (the field is multilevel), the output variable in the NetCDF should also include a level = N_LEV dimension coordinate, making it 3D (or 4D if time is also > 1)

Collaborator:

Thanks @atteggiani! I think that sounds good, and I agree with your point that consistency in behaviour across all fields file types would be the best way to go.

@atteggiani (Collaborator, PR author):

I was looking at fix_forecast_reference_time and I see a few issues with the function.
I think these need to be addressed in a separate PR, so I am going to:

  • write an issue about it
  • address the issue and open a separate PR

Then we can rebase this head branch so the changes are included.

For the fix_forecast_reference_time changes related to this PR, following the discussion above, I am going to run the function even if the file is an ancillary file (because the function will return the cube as is).

Does that sound good?

    if require_heaviside_uv(c.item_code) and heaviside_uv:
        apply_mask(c, heaviside_uv, args.hcrit)
    if not is_ancillary:
        fix_latlon_coords(c, mv.grid_type, mv.d_lat, mv.d_lon)
Collaborator:

I noticed that the original ancil2netcdf.py script keeps some of the functionality from fix_latlon_coords. It keeps the parts performed by add_latlon_coord_bounds, but not the parts done by fix_lat_coord_name. That is, it will still add bounds for the latitude and longitude coordinates.

Is it important for the new version to also add bounds to the lat/lon coordinates for ancillary files? We could split fix_latlon_coords into two separate parts if this is the case – one for adding bounds, and one for renaming.

@atteggiani (Collaborator, PR author):

I agree.
The coordinate bounds should be added in all cases, even for ancillary files. I would still exclude the fixing of coordinate names, though.
I will split fix_latlon_coords into add_coordinates_bounds and fix_coordinates_names (also better for testing) and exclude only fix_coordinates_names for ancillary files.

Note:
Why are the boundaries only being produced for lat-lon? What about time? What about level?
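The proposed split could look roughly like the sketch below. The function names add_coordinates_bounds and fix_coordinates_names come from the comment above, but this is not the merged implementation: the cube and coordinate classes here are minimal duck-typed stand-ins for iris objects, so the sketch stays self-contained.

```python
class StubCoord:
    """Minimal stand-in for an iris coordinate (illustration only)."""
    def __init__(self, name):
        self.var_name = name
        self.bounds = None
    def has_bounds(self):
        return self.bounds is not None
    def guess_bounds(self):
        self.bounds = "guessed"  # iris would compute real cell bounds here

class StubCube:
    """Minimal stand-in for an iris cube with lat/lon coordinates."""
    def __init__(self):
        self._coords = {n: StubCoord(n) for n in ("latitude", "longitude")}
    def coord(self, name):
        return self._coords[name]

def add_coordinates_bounds(cube):
    """Add lat/lon bounds; applied to every file type, ancillaries included."""
    for name in ("latitude", "longitude"):
        coord = cube.coord(name)
        if not coord.has_bounds():
            coord.guess_bounds()

def fix_coordinates_names(cube):
    """Rename latitude/longitude to lat/lon; skipped for ancillary files."""
    for old, new in (("latitude", "lat"), ("longitude", "lon")):
        cube.coord(old).var_name = new

def fix_latlon_coords(cube, is_ancillary):
    add_coordinates_bounds(cube)
    if not is_ancillary:
        fix_coordinates_names(cube)
```

Splitting this way also lets each behaviour be unit-tested in isolation, which is the testing benefit mentioned above.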

Collaborator:

I think that sounds good!

Why are the boundaries only being produced for lat-lon? What about time? What about level?

Good question, I'm not too sure about this one.

@atteggiani (Collaborator, PR author):

@MartinDix do we know anything about this?

@MartinDix (Contributor):

Ideally there would be boundaries for everything. It's easy to know what the boundaries are for lat/lon but trickier for time because it depends on whether fields are time averages or instantaneous. Also a bit uncertain for levels.

Names were changed from latitude/longitude to lat/lon to match the CMIP6 specification. Ancillary files should do this too for consistency.

@atteggiani (Collaborator, PR author):

@MartinDix

Ideally there would be boundaries for everything. It's easy to know what the boundaries are for lat/lon but trickier for time because it depends on whether fields are time averages or instantaneous. Also a bit uncertain for levels.

I see.
For time bounds, I see that the convert_proleptic and fix_forecast_reference_time functions fiddle with time.bounds (for specific questions/issues about this I will open a separate issue).
If we find time bounds and manage to modify them correctly, wouldn't it make sense to add them as a variable together with the lat/lon ones?

Names were changed from latitude/longitude to lat/lon to match the CMIP6 specification. Ancillary files should do this too for consistency.

Is the CMIP6 specification for coordinate naming widely used outside CMIP6 as well?

I would usually avoid doing processing that the user cannot control if it is not strictly needed for the main purpose of the function (converting a file to NetCDF).
I see producing NetCDFs that match the CMIP6 specification as a "specific" application of um2nc, which should have its own optional parameter rather than be the "default" behaviour.

So, in this case, I think the "default" option should be to leave the coordinate names as they are (for any type of file, including output files), and the "fixing" of the coordinate names should be performed via an optional parameter (for example --fix-coords).
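Wiring up such an opt-in flag is straightforward with argparse. Note that --fix-coords is a proposal from this thread, not an existing um2nc option, and the argument names below are illustrative:

```python
import argparse

# Hypothetical sketch: coordinate renaming becomes opt-in, off by default.
parser = argparse.ArgumentParser(prog="um2nc")
parser.add_argument("infile")
parser.add_argument("outfile")
parser.add_argument(
    "--fix-coords",
    action="store_true",
    help="rename latitude/longitude to lat/lon (CMIP6-style naming)",
)

args = parser.parse_args(["qrparm.orog", "test.nc", "--fix-coords"])
# args.fix_coords is True only when the user explicitly asks for renaming
```

With store_true the default is False, so existing invocations keep their current coordinate names unchanged.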

@blimlim (Collaborator) commented Feb 14, 2025:

I think there are also some small differences in the behaviour here compared to the ancil2netcdf.py script:

fix_var_name(c, st.uniquename, args.simple)
fix_standard_name(c, st.standard_name, args.verbose)
fix_long_name(c, st.long_name)
fix_units(c, st.units, args.verbose)

  • fix_var_name(): the ancil2netcdf.py script removes the section which appends _max or _min to variables. This difference might not matter, as I'm wondering whether ancillary files would ever have "maximum" or "minimum" in their variables' cell methods?

  • fix_standard_name(): ancil2netcdf.py cuts out the part which changes the standard names x_wind -> eastward_wind and y_wind -> northward_wind.

  • fix_long_name(): ancil2netcdf.py cuts out the part which clips the long_name to 110 characters. I suspect this difference wouldn't be very important though?

I don't suggest these are things that necessarily need to be changed in your PR! I just thought I'd note them down for discussion in case there are any reasons behind the differences in the ancil2netcdf.py script that we need to be aware of.

@atteggiani (Collaborator, PR author):

Thank you for your precious comments @blimlim!

A general idea for the following:

fix_var_name(): the ancil2netcdf.py script removes the section which appends _max or _min to any variables. This difference might not matter as I'm wondering whether ancillary files ever would have "maximum" or "minimum" in the variables' cell methods?

fix_standard_name(): ancil2netcdf.py cuts out the part which changes the standard names x_wind -> eastward_wind and y_wind -> northward_wind.

fix_long_name(): ancil2netcdf.py cuts out the part which clips the long_name to 110 characters. I suspect this difference wouldn't be very important though?

I think these steps all perform tasks related to the naming of STASH variables (independently of whether they come from ancillary files, dump files, or output files).
Therefore, I think that if these steps are deemed important and should be kept, for consistency they should be kept for every fieldsfile type.

@blimlim (Collaborator) commented Feb 14, 2025:

I think that if these steps are deemed important and should therefore be kept, for consistency they should be kept for any fieldsfile type.

Good point, I agree it makes sense to be consistent with these.

@MartinDix (Contributor):

(Quoting @blimlim's comment above about fix_var_name(), fix_standard_name(), and fix_long_name().)

I don't think it would ever be necessary to have the max/min suffix for ancillary files. Nor would we ever have ancillary files with winds. There's no problem with not having this processing, it's just a question of whether it's simpler to just include it or to exclude it.

fix_long_name is something that was just a work-around for xconv. It was only some strange variables that ever got close to the limit and I'm pretty sure no ancillary variables would trigger it. We could probably just remove it in general.

@blimlim (Collaborator) commented Feb 18, 2025:

Thanks @MartinDix for clarifying!

There's no problem with not having this processing, it's just a question of whether it's simpler to just include it or to exclude it.

In that case I suspect it's easiest to include the same processing across all file types.

Successfully merging this pull request may close these issues:

  • Support conversion of ancillary files
  • Feature Requirement: calendar fixes

3 participants