

Added support for ancillary files #190

Open · wants to merge 6 commits into base: main
Conversation

@atteggiani (Collaborator) commented Feb 12, 2025

Fixes #101.
Fixes #8.

  • Removed NotSupportedError for ancillary files.
  • Added a fix_iris_calendar function to better manage fixing of the iris calendar, including handling of the proleptic Gregorian calendar. Also added another fix to iris, which was throwing an error when opening ancillary files with the calendar set to cf_units.CALENDAR_PROLEPTIC_GREGORIAN.
  • Added an is_ancil function to check whether a mule file is an ancillary file.
  • Excluded some tasks from being performed when the input fields file is an ancillary file.
    Based on the comments in Support conversion of ancillary files #101, the following tasks are skipped when the file is an ancillary file:
    • cubes masking
    • coordinate fixing (lat/lon and lev)
    • pressure levels fixing
    • forecast reference time fixing
  • Updated unit tests.

The unit tests pass and I am able to successfully convert all the ancillary files that I tested.

It would be good to add some regression tests for ancillary files too. More details in #189.
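The is_ancil check described in the summary could be sketched roughly as below. This is an illustrative guess, not the PR's actual code: it assumes mule exposes a fixed_length_header.dataset_type attribute, and it relies on the UM convention that a dataset_type of 4 marks an ancillary file (3 is a fields file, 1/2 are dumps).

```python
from types import SimpleNamespace

# UM fixed-length-header dataset_type values, by UM convention:
# 1/2 = dump, 3 = fields file, 4 = ancillary file.
UM_ANCIL_DATASET_TYPE = 4

def is_ancil(mule_file):
    """Return True if a mule-like file object looks like a UM ancillary file."""
    return mule_file.fixed_length_header.dataset_type == UM_ANCIL_DATASET_TYPE

# Minimal stand-ins for mule file objects, for illustration only:
ancil = SimpleNamespace(fixed_length_header=SimpleNamespace(dataset_type=4))
fields = SimpleNamespace(fixed_length_header=SimpleNamespace(dataset_type=3))
```

With a real mule file one would pass the object returned by mule's file loader instead of the stand-ins.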

atteggiani and others added 4 commits February 3, 2025 12:21

  • Change get_z_sea_constants to return [None, None] if the file is an ancillary file.
  • …e iris calendar to handle proleptic gregorian calendar.
    Added another fix to iris, which was throwing an error when opening ancillary files with the calendar set to cf_units.CALENDAR_PROLEPTIC_GREGORIAN.
  • …file.
  • Updated tests for the get_z_sea_constants function.
@atteggiani (Collaborator, PR author) commented Feb 13, 2025

A few more details on the need for the new_epoch_date_hours_internals function to fix the iris calendar:

The patch to the iris calendar (made by patching the calendar property) is needed until #3561 gets fixed (hopefully soon).
The problem with this patch, however, is that some internal iris functions perform checks on the calendar type, for example the _epoch_date_hours_internals function.
This function currently doesn't account for the cf_units.CALENDAR_PROLEPTIC_GREGORIAN calendar type, so without further fixing, opening some files with iris results in ValueError: unrecognised calendar : cf_units.CALENDAR_PROLEPTIC_GREGORIAN.

You can test this, for example, by commenting out this line in this head branch and running:

um2nc /g/data/access/TIDS/UM/ancil/atmos/n48e/orography/globe30/v2/qrparm.orog test_nc

It should be the same for most (if not all) ancillary files, or any UM file without a time specification.

Thank you @blimlim for asking about this. It made me realise an issue with fix_iris_calendar, which I have now corrected!
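The shape of this workaround can be illustrated with a small, self-contained sketch. All names below are hypothetical stand-ins (it does not touch iris itself); the real patch targets iris's _epoch_date_hours_internals via fix_iris_calendar. The idea is simply to wrap the calendar-dispatching routine so that a proleptic Gregorian calendar is accepted instead of raising ValueError:

```python
# Hypothetical stand-ins for iris internals, for illustration only.
CALENDAR_STANDARD = "standard"
CALENDAR_PROLEPTIC_GREGORIAN = "proleptic_gregorian"

def epoch_date_hours(calendar, date):
    """Stand-in for an iris-internal routine that dispatches on calendar type."""
    if calendar == CALENDAR_STANDARD:
        return f"hours since epoch for {date} ({calendar})"
    raise ValueError(f"unrecognised calendar : {calendar}")

def epoch_date_hours_patched(calendar, date):
    """Wrapper that also accepts the proleptic Gregorian calendar.

    For dates from 1582 onwards the proleptic Gregorian and the standard
    (mixed Julian/Gregorian) calendars agree, so delegating to the
    standard-calendar path is safe for typical UM reference dates.
    """
    if calendar == CALENDAR_PROLEPTIC_GREGORIAN:
        calendar = CALENDAR_STANDARD
    return epoch_date_hours(calendar, date)
```

The unpatched stand-in reproduces the "unrecognised calendar" failure mode described above; the wrapper normalises the calendar before dispatch.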

@blimlim (Collaborator) left a comment:

Thanks @atteggiani for putting this PR together. I think it's looking good.

I had a look back over the diff between ancil2netcdf.py and um2netcdf4.py here and noticed some small differences between what the old script and this new version do. I suspect these differences aren't highly important, but I have noted them down in the comments and would be keen to hear from @MartinDix if there are important reasons behind the changes in ancil2netcdf.py.


    # TODO: flag warnings as an error for the driver script?
    if not args.use64bit:
        convert_32_bit(c)

    fill_value = fix_fill_value(c)
    if not is_ancillary:
        fix_forecast_reference_time(c)

Collaborator:

Tiny suggestion – I think it should be possible to move the fix_forecast_reference_time call further up to ~line 654, into the first if not is_ancillary: block.

@atteggiani (Collaborator, PR author):

Good point!
I tried not to restructure things too much (that might happen in a later PR), but this seems pretty sensible to include here.

Actually, on this point, I now think that fix_forecast_reference_time should still be performed for ancillary files, because the function already does nothing when the fieldsfile doesn't have a forecast_reference_time coordinate (as ancillary files don't). However:

  • I am not 100% sure that no ancillary file exists with a forecast_reference_time coordinate.
  • Dump files should also not have the forecast_reference_time coordinate, but the function is called anyway for that type of fieldsfile, so I would do the same for ancillary files.

What do you think?

Collaborator:

I've had a dig through the ESM1.5 ancillary files and found two examples that contain forecast_reference_time data:

  1. /g/data/vk83/configurations/inputs/access-esm1p5/modern/unused/atmosphere/climatology/global.N96/2020.05.19/qrclim.slt

This file isn't actually being used by ESM1.5 – it's just sitting there in the directory.

It contains forecast_reference_time as the main time coordinate, and time as an auxiliary coordinate:

The points differ between the two coordinates:

>>> print(test2[0].coord("forecast_reference_time")[0])
DimCoord :  forecast_reference_time / (hours since 1970-01-01 00:00:00, standard calendar)
    points: [1996-01-31 23:59:00]

>>> print(test2[0].coord("time")[0])
AuxCoord :  time / (hours since 1970-01-01 00:00:00, standard calendar)
    points: [1987-07-17 23:59:30]

If we run fix_forecast_reference_time() on this file, it will delete the forecast_... coordinates. I don't know if there would be any issues with this, as I'm not sure which of the time coordinates are most relevant for this file, or whether the file is just structured strangely.

Since time is not a dimension coordinate in this file, we end up with a new dimension dim0 in place of forecast_reference_time:

dimensions:
	dim0 = 13 ;
	latitude = 145 ;
	longitude = 192 ;
	bnds = 2 ;
variables:
	float tsl(dim0, latitude, longitude) ;
		tsl:_FillValue = 1.e+20f ;
		tsl:standard_name = "soil_temperature" ;
...
...
	double time(dim0) ;
		time:bounds = "time_bnds" ;
		time:units = "days since 1970-01-01 00:00" ;
		time:standard_name = "time" ;
		time:calendar = "proleptic_gregorian" ;
  2. /g/data/vk83/configurations/inputs/access-esm1p5/modern/amip/atmosphere/boundary_conditions/global.N96/2021.07.08/amip_sst_n96_greg.pp

This SST file is used for ESM1.5 AMIP simulations. It contains time, forecast_reference_time, and forecast_period coordinates, but time is the main record coordinate.

In this case, running fix_forecast_reference_time() deletes the forecast_... coordinates and correctly keeps the time dimension as the record dimension:

time = UNLIMITED ; // (1764 currently)
	latitude = 145 ;
	longitude = 192 ;
	bnds = 2 ;
variables:
	float ts(time, latitude, longitude) ;
		ts:_FillValue = 1.e+20f ;
...

I don't think there's a problem with deleting the forecast_... coordinates for this file. It's handled by the model in a similar way to the sea ice file (/g/data/vk83/configurations/inputs/access-esm1p5/modern/amip/atmosphere/boundary_conditions/global.N96/2021.07.08/amip_seaice_n96_greg.pp), which has no forecast_... coordinates.

@atteggiani (Collaborator, PR author) commented Feb 17, 2025:

Thank you @blimlim for the information provided!

I think this relates to how iris converts fieldsfiles to cubes, which I believe sometimes messes things up.
I think the general behaviour of fix_forecast_reference_time() should NOT vary based on the fieldsfile type; it should be based only on what we want the output NetCDF variable coordinates to look like.

For example, in your example 1, I think:

  • time should be considered a dimension coordinate (as it should be in all fieldsfiles)
  • bnds should not be present as a dimension, but only as a variable related to latitude and longitude (and maybe time and lev --> related comment).

Therefore the output NetCDF variable should have the following dimension coordinates:

time = 13
latitude = 145
longitude = 192

For auxiliary coordinates, I am not sure if forecast_reference_time and forecast_period, when present, can be reliable (probably not).
Hence, it might be good to remove them (as it currently happens) to avoid confusion and inaccuracies.

I believe this should apply to all output NetCDF files, regardless of the type of fields file they were converted from.

Also:

  • if time = 1 (there is no actual time dimension and the data is instantaneous), the time dimension can be dropped, making the output variable in the NetCDF 2D (lat/lon)
  • if level > 1 (the field is multilevel), the output variable in the NetCDF should also include a level = N_LEV dimension coordinate, making it 3D (or 4D if time is also > 1)

Collaborator:

Thanks @atteggiani! I think that sounds good, and I agree with your point that consistency in behaviour across all fields file types would be the best way to go.

@atteggiani (Collaborator, PR author):

I was looking at fix_forecast_reference_time and I see a few issues with the function.
I think these need to be addressed in a separate PR, so I am going to:

  • write an issue about it
  • address the issue and open a separate PR

Then we can rebase this head branch so the changes are included.

For the fix_forecast_reference_time changes related to this PR, following the discussion above, I am going to run the function even if the file is an ancillary file (because the function will return the cube as is).

Does that sound good?

    if require_heaviside_uv(c.item_code) and heaviside_uv:
        apply_mask(c, heaviside_uv, args.hcrit)
    if not is_ancillary:
        fix_latlon_coords(c, mv.grid_type, mv.d_lat, mv.d_lon)
Collaborator:

I noticed that the original ancil2netcdf.py script keeps some of the functionality from fix_latlon_coords. It keeps the parts performed by add_latlon_coord_bounds, but not the parts done by fix_lat_coord_name. That is, it will still add bounds for the latitude and longitude coordinates.

Is it important for the new version to also add bounds to the lat/lon coordinates for ancillary files? We could split fix_latlon_coords into two separate parts if this is the case – one for adding bounds, and one for renaming.

@atteggiani (Collaborator, PR author):

I agree.
The coordinate bounds should be added in all cases, even for ancillary files. I would still exclude the fixing of coordinate names, though.
I will split fix_latlon_coords into add_coordinates_bounds and fix_coordinates_names (also better for testing) and exclude only fix_coordinates_names for ancillary files.

Note:
Why are the boundaries only being produced for lat-lon? What about time? What about level?
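The proposed split could look roughly like the sketch below. The function names add_coordinates_bounds and fix_coordinates_names come from the comment above, but this is not the merged implementation: the cube and coordinate classes here are minimal duck-typed stand-ins for iris objects, so the sketch stays self-contained.

```python
class StubCoord:
    """Minimal stand-in for an iris coordinate (illustration only)."""
    def __init__(self, name):
        self.var_name = name
        self.bounds = None
    def has_bounds(self):
        return self.bounds is not None
    def guess_bounds(self):
        self.bounds = "guessed"  # iris would compute real cell bounds here

class StubCube:
    """Minimal stand-in for an iris cube with lat/lon coordinates."""
    def __init__(self):
        self._coords = {n: StubCoord(n) for n in ("latitude", "longitude")}
    def coord(self, name):
        return self._coords[name]

def add_coordinates_bounds(cube):
    """Add lat/lon bounds; applied to every file type, ancillaries included."""
    for name in ("latitude", "longitude"):
        coord = cube.coord(name)
        if not coord.has_bounds():
            coord.guess_bounds()

def fix_coordinates_names(cube):
    """Rename latitude/longitude to lat/lon; skipped for ancillary files."""
    for old, new in (("latitude", "lat"), ("longitude", "lon")):
        cube.coord(old).var_name = new

def fix_latlon_coords(cube, is_ancillary):
    add_coordinates_bounds(cube)
    if not is_ancillary:
        fix_coordinates_names(cube)
```

Splitting this way also lets each behaviour be unit-tested in isolation, which is the testing benefit mentioned above.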

Collaborator:

I think that sounds good!

Why are the boundaries only being produced for lat-lon? What about time? What about level?

Good question, I'm not too sure about this one.

@atteggiani (Collaborator, PR author):

@MartinDix do we know anything about this?

@MartinDix (Contributor):

Ideally there would be boundaries for everything. It's easy to know what the boundaries are for lat/lon but trickier for time because it depends on whether fields are time averages or instantaneous. Also a bit uncertain for levels.

Names were changed from latitude/longitude to lat/lon to match the CMIP6 specification. Ancillary files should do this too for consistency.

@atteggiani (Collaborator, PR author):

@MartinDix

Ideally there would be boundaries for everything. It's easy to know what the boundaries are for lat/lon but trickier for time because it depends on whether fields are time averages or instantaneous. Also a bit uncertain for levels.

I see.
For time bounds, I see that the convert_proleptic and fix_forecast_reference_time functions fiddle with time.bounds (for specific questions/issues about this I will open a separate issue).
If we find time bounds and manage to modify them correctly, wouldn't it make sense to add them as a variable together with the lat/lon ones?

Names were changed from latitude/longitude to lat/lon to match the CMIP6 specification. Ancillary files should do this too for consistency.

Is the CMIP6 specification for coordinate naming widely used outside CMIP6 as well?

I would usually avoid doing processing that the user cannot control if it is not strictly needed for the main purpose of the function (converting a file to NetCDF).
I see producing NetCDFs that match the CMIP6 specification as a "specific" application of um2nc, which should have its own optional parameter rather than be the "default" behaviour.

So, in this case, I think the "default" option should be to leave the coordinate names as they are (for any type of file, including output files), and the "fixing" of the coordinate names should be performed via an optional parameter (for example --fix-coords).
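Wiring up such an opt-in flag is straightforward with argparse. Note that --fix-coords is a proposal from this thread, not an existing um2nc option, and the argument names below are illustrative:

```python
import argparse

# Hypothetical sketch: coordinate renaming becomes opt-in, off by default.
parser = argparse.ArgumentParser(prog="um2nc")
parser.add_argument("infile")
parser.add_argument("outfile")
parser.add_argument(
    "--fix-coords",
    action="store_true",
    help="rename latitude/longitude to lat/lon (CMIP6-style naming)",
)

args = parser.parse_args(["qrparm.orog", "test.nc", "--fix-coords"])
# args.fix_coords is True only when the user explicitly asks for renaming
```

With store_true the default is False, so existing invocations keep their current coordinate names unchanged.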

@blimlim (Collaborator) commented Feb 14, 2025:

I think there are also some small differences in the behaviour here compared to the ancil2netcdf.py script:

fix_var_name(c, st.uniquename, args.simple)
fix_standard_name(c, st.standard_name, args.verbose)
fix_long_name(c, st.long_name)
fix_units(c, st.units, args.verbose)

  • fix_var_name(): the ancil2netcdf.py script removes the section which appends _max or _min to variables. This difference might not matter, as I'm wondering whether ancillary files would ever have "maximum" or "minimum" in their variables' cell methods?

  • fix_standard_name(): ancil2netcdf.py cuts out the part which changes the standard names x_wind -> eastward_wind and y_wind -> northward_wind.

  • fix_long_name(): ancil2netcdf.py cuts out the part which clips the long_name to 110 characters. I suspect this difference wouldn't be very important though?

I don't suggest these are things that necessarily need to be changed in your PR! I just thought I'd note them down for discussion in case there are any reasons behind the differences in the ancil2netcdf.py script that we need to be aware of.

@atteggiani (Collaborator, PR author):

Thank you for your precious comments @blimlim!

A general idea for the following:

fix_var_name(): the ancil2netcdf.py script removes the section which appends _max or _min to any variables. This difference might not matter as I'm wondering whether ancillary files ever would have "maximum" or "minimum" in the variables' cell methods?

fix_standard_name(): ancil2netcdf.py cuts out the part which changes the standard names x_wind -> eastward_wind and y_wind -> northward_wind.

fix_long_name(): ancil2netcdf.py cuts out the part which clips the long_name to 110 characters. I suspect this difference wouldn't be very important though?

I think these steps all perform tasks related to the naming of STASH variables (independently of whether they come from ancillary files, dump files, or output files).
Therefore, I think that if these steps are deemed important and should be kept, for consistency they should be kept for every fieldsfile type.

@blimlim (Collaborator) commented Feb 14, 2025:

I think that if these steps are deemed important and should therefore be kept, for consistency they should be kept for any fieldsfile type.

Good point, I agree it makes sense to be consistent with these.

@MartinDix (Contributor):

(Quoting @blimlim's comment above about fix_var_name(), fix_standard_name(), and fix_long_name().)

I don't think it would ever be necessary to have the max/min suffix for ancillary files. Nor would we ever have ancillary files with winds. There's no problem with not having this processing, it's just a question of whether it's simpler to just include it or to exclude it.

fix_long_name is something that was just a work-around for xconv. It was only some strange variables that ever got close to the limit and I'm pretty sure no ancillary variables would trigger it. We could probably just remove it in general.

@blimlim (Collaborator) commented Feb 18, 2025:

Thanks @MartinDix for clarifying!

There's no problem with not having this processing, it's just a question of whether it's simpler to just include it or to exclude it.

In that case I suspect it's easiest to include the same processing across all file types.

Successfully merging this pull request may close these issues:

  • Support conversion of ancillary files
  • Feature Requirement: calendar fixes

3 participants