Added support for ancillary files #190
Conversation
Change `get_z_sea_constants` to return `[None, None]` if the file is an ancillary file.
…e iris calendar to handle the proleptic Gregorian calendar. Added another fix to iris, which was throwing an error when opening ancillary files with the calendar set to `cf_units.CALENDAR_PROLEPTIC_GREGORIAN`.
…file. Updated tests for the `get_z_sea_constants` function.
…an ancillary file. Updated tests.
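As a rough illustration of the `get_z_sea_constants` change in these commits, here is a minimal sketch. The `FieldsFileStub` type and its attribute names are hypothetical; the real function reads the constants from the UM file header.

```python
from dataclasses import dataclass, field

@dataclass
class FieldsFileStub:
    """Hypothetical stand-in for a parsed UM fieldsfile."""
    is_ancillary: bool
    z_sea_theta: list = field(default_factory=list)
    z_sea_rho: list = field(default_factory=list)

def get_z_sea_constants(ff):
    # Ancillary files carry no z_sea level constants, so return [None, None]
    if ff.is_ancillary:
        return [None, None]
    return [ff.z_sea_theta, ff.z_sea_rho]
```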
A few more details on the need for the `new_epoch_date_hours_internals` function to fix … The patch to … You can test this, for example, by commenting out this line in this head branch and running: …
It should be the same for most (if not all) ancillary files, or any UM file without a time specification. Thank you @blimlim for asking about this. It made me realise an issue with the …
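For context on the epoch handling mentioned above, here is a minimal sketch of converting a date to "hours since 1970-01-01" using only the standard library. This is illustrative only; the actual `new_epoch_date_hours_internals` patch also has to deal with non-standard calendars, which `datetime` cannot represent.

```python
from datetime import datetime, timedelta

def hours_since_epoch(date, epoch=datetime(1970, 1, 1)):
    """Hours elapsed between `epoch` and `date` (proleptic Gregorian,
    which is the calendar Python's datetime effectively uses)."""
    return (date - epoch) / timedelta(hours=1)
```

For example, `hours_since_epoch(datetime(1970, 1, 2))` gives `24.0`.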
Thanks @atteggiani for putting this PR together. I think it's looking good.
I had a look back over the diff between `ancil2netcdf.py` and `um2netcdf4.py` here and noticed some small differences between what the old script and this new version are doing. I suspect these differences aren't highly important, but I have noted them down in the comments and would be keen to hear from @MartinDix if there are important reasons behind the changes in `ancil2netcdf.py`.
src/um2nc/um2netcdf.py (outdated)

```diff
  # TODO: flag warnings as an error for the driver script?
  if not args.use64bit:
      convert_32_bit(c)

  fill_value = fix_fill_value(c)
- fix_forecast_reference_time(c)
+ if not is_ancillary:
+     fix_forecast_reference_time(c)
```
Tiny suggestion – I think it should be possible to move the `fix_forecast_reference_time` call further up to ~line 654, into the first `if not is_ancillary:` block.
Good point!
I tried not to restructure things too much (that might happen in a later PR), but this seems pretty sensible to include here.
Actually, on this point, I am now thinking that `fix_forecast_reference_time` should still be performed for ancillary files. This is because the function already does nothing when the fieldsfile doesn't have a `forecast_reference_time` coordinate (as ancillary files don't). However:

- I am not 100% sure that no ancillary file exists with a `forecast_reference_time` coordinate
- Also, dump files should not have the `forecast_reference_time` coordinate, but the function is called anyway for this type of fieldsfile, so I would do the same for ancillary files

What do you think?
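To illustrate why running the function unconditionally is safe, here is a toy sketch of the no-op behaviour. The `CubeStub` class is hypothetical; iris's real `Cube.coords()` similarly returns an empty list when the named coordinate is absent.

```python
class CubeStub:
    """Hypothetical minimal cube holding a set of coordinate names."""
    def __init__(self, coord_names):
        self._coords = set(coord_names)
    def coords(self, name):
        # Like iris, return an empty list when the coordinate is absent
        return [name] if name in self._coords else []
    def remove_coord(self, name):
        self._coords.discard(name)

def fix_forecast_reference_time(cube):
    # Sketch: drop forecast_* coordinates when present; otherwise do nothing
    for name in ("forecast_reference_time", "forecast_period"):
        if cube.coords(name):
            cube.remove_coord(name)
    return cube
```

An ancillary-style cube with no `forecast_*` coordinates passes through unchanged, so calling the function for every file type costs nothing.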
I've had a dig through the ESM1.5 ancillary files and found two examples that contain `forecast_reference_time` data:

`/g/data/vk83/configurations/inputs/access-esm1p5/modern/unused/atmosphere/climatology/global.N96/2020.05.19/qrclim.slt`

This file isn't actually being used by ESM1.5 – it's just sitting there in the directory. It contains `forecast_reference_time` as the main time coordinate, and `time` as an auxiliary coordinate.
The points differ between the two coordinates:

```python
>>> print(test2[0].coord("forecast_reference_time")[0])
DimCoord : forecast_reference_time / (hours since 1970-01-01 00:00:00, standard calendar)
    points: [1996-01-31 23:59:00]
>>> print(test2[0].coord("time")[0])
AuxCoord : time / (hours since 1970-01-01 00:00:00, standard calendar)
    points: [1987-07-17 23:59:30]
```
If we run `fix_forecast_reference_time()` on this file, it will delete the `forecast_...` coordinates. I don't know if there would be any issues with this, as I'm not sure which of the time coordinates is most relevant for this file, or whether the file is just structured strangely.
Since `time` is not a dimension coordinate in this file, we end up with a new dimension `dim0` in place of `forecast_reference_time`:
```
dimensions:
        dim0 = 13 ;
        latitude = 145 ;
        longitude = 192 ;
        bnds = 2 ;
variables:
        float tsl(dim0, latitude, longitude) ;
                tsl:_FillValue = 1.e+20f ;
                tsl:standard_name = "soil_temperature" ;
        ...
        double time(dim0) ;
                time:bounds = "time_bnds" ;
                time:units = "days since 1970-01-01 00:00" ;
                time:standard_name = "time" ;
                time:calendar = "proleptic_gregorian" ;
```
`/g/data/vk83/configurations/inputs/access-esm1p5/modern/amip/atmosphere/boundary_conditions/global.N96/2021.07.08/amip_sst_n96_greg.pp`

This SST file is used for ESM1.5 AMIP simulations. It contains `time`, `forecast_reference_time`, and `forecast_period` coordinates, but `time` is the main record coordinate.
In this case, running `fix_forecast_reference_time()` deletes the `forecast_...` coordinates and correctly keeps the `time` dimension as the record dimension:

```
dimensions:
        time = UNLIMITED ; // (1764 currently)
        latitude = 145 ;
        longitude = 192 ;
        bnds = 2 ;
variables:
        float ts(time, latitude, longitude) ;
                ts:_FillValue = 1.e+20f ;
        ...
```
I don't think there's a problem with deleting the `forecast_...` coordinates for this file. It's handled by the model in a similar way to the sea ice file (`/g/data/vk83/configurations/inputs/access-esm1p5/modern/amip/atmosphere/boundary_conditions/global.N96/2021.07.08/amip_seaice_n96_greg.pp`), which has no `forecast_...` coordinates.
Thank you @blimlim for the information provided!
I think this relates to how `iris` converts fieldsfiles to cubes, and sometimes I think it messes things up.
I think the general behaviour of `fix_forecast_reference_time()` should NOT vary based on the fieldsfile type; it should be based only on what we want the output NetCDF variable coordinates to look like.
For example, in your example 1, I think:
- `time` should be considered a dimension coordinate (as it should be in all fieldsfiles)
- `bnds` should not be present as a dimension, but only as a variable related to `latitude` and `longitude` (and maybe `time` and `lev` --> related comment)
Therefore the output NetCDF variable should have the following dimension coordinates:

```
time = 13
latitude = 145
longitude = 192
```
For auxiliary coordinates, I am not sure whether `forecast_reference_time` and `forecast_period`, when present, can be relied upon (probably not). Hence, it might be good to remove them (as currently happens) to avoid confusion and inaccuracies. I believe this should apply to all output NetCDF files, regardless of the type of fieldsfile they were converted from.
Also:

- if `time = 1` (there is no actual time dimension but the data is instantaneous), the time dimension can be dropped, making the output variable in the NetCDF 2D (lat/lon)
- if `level > 1` (the field is multilevel), then the output variable in the NetCDF should also include a `level = N_LEV` dimension coordinate, making it 3D (or 4D if time is also > 1)
Thanks @atteggiani! I think that sounds good, and agree with your point that consistency in the behaviour across all fields file types would be the best way to go.
I was looking at `fix_forecast_reference_time` and I see a few issues with the function. I think these need to be addressed in a separate PR, so I am going to:

- write an issue about it
- address the issue and open a separate PR

Then we can rebase this head branch so the changes are included.
As for `fix_forecast_reference_time` in this PR, after the discussion above, I am going to run the function even if the file is an ancillary one (because the function would return the cube as is).
Does that sound good?
```python
if require_heaviside_uv(c.item_code) and heaviside_uv:
    apply_mask(c, heaviside_uv, args.hcrit)
if not is_ancillary:
    fix_latlon_coords(c, mv.grid_type, mv.d_lat, mv.d_lon)
```
I noticed that the original `ancil2netcdf.py` script keeps some functionality from `fix_latlon_coords`. It keeps the parts performed by `add_latlon_coord_bounds`, but not the parts done by `fix_lat_coord_name`, i.e. it will still add bounds for the latitude and longitude coordinates.
Is it important for the new version to also add bounds to the lat/lon coordinates for ancillary files? We could split `fix_latlon_coords` into two separate parts if this is the case – one for adding bounds, and one for renaming.
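For reference, the bounds-adding part essentially amounts to midpoint bounds guessing, which can be sketched without iris as follows (a standalone illustration; iris's `guess_bounds()` does the equivalent on a coordinate object):

```python
def guess_bounds(points):
    """Midpoint bounds for a regular, monotonic 1-D coordinate: each cell
    spans from the midpoint with its left neighbour to the midpoint with
    its right neighbour, extrapolating at the two ends."""
    mids = [(a + b) / 2 for a, b in zip(points, points[1:])]
    lower = [2 * points[0] - mids[0]] + mids
    upper = mids + [2 * points[-1] - mids[-1]]
    return list(zip(lower, upper))
```

For example, a latitude coordinate with points `[0, 1, 2]` gets bounds `[(-0.5, 0.5), (0.5, 1.5), (1.5, 2.5)]`.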
I agree.
The coordinate bounds should be added in all cases, even for ancillary files. I would still exclude the fixing of coordinate names, though.
I will split `fix_latlon_coords` into `add_coordinates_bounds` and `fix_coordinates_names` (also better for testing) and only exclude the `fix_coordinates_names` one.
Note:
Why are the boundaries only being produced for lat/lon? What about time? What about level?
I think that sounds good!

> Why are the boundaries only being produced for lat/lon? What about time? What about level?

Good question, I'm not too sure about this one.
@MartinDix do we know anything about this?
Ideally there would be boundaries for everything. It's easy to know what the boundaries are for lat/lon but trickier for time because it depends on whether fields are time averages or instantaneous. Also a bit uncertain for levels.
Names were changed from latitude/longitude to lat/lon to match the CMIP6 specification. Ancillary files should do this too for consistency.
> Ideally there would be boundaries for everything. It's easy to know what the boundaries are for lat/lon but trickier for time because it depends on whether fields are time averages or instantaneous. Also a bit uncertain for levels.

I see.
For time bounds, I see that the `convert_proleptic` and `fix_forecast_reference_time` functions fiddle with `time.bounds` (for specific questions/issues about this I will open a separate issue). If we find time bounds and manage to correctly modify them, wouldn't it make sense to add them as a variable together with the lat/lon ones?

> Names were changed from latitude/longitude to lat/lon to match the CMIP6 specification. Ancillary files should do this too for consistency.

Is the CMIP6 specification on coordinate naming widely used outside CMIP6 as well?
I would usually avoid doing processing that the user cannot control, if it is not strictly needed for the main purpose of the function (converting a file to NetCDF). I see producing NetCDFs matching the CMIP6 specification as a "specific" application of `um2nc` that should have its own optional parameter, rather than being the "default" option.
So, in this case, I think the "default" should be to leave the coordinate names as they are (for any type of file, including output files), and the "fixing" of the coordinate names should be performed via an optional parameter (for example `--fix-coords`).
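A minimal sketch of how such an opt-in flag could be wired up. The `--fix-coords` name follows the suggestion above; everything else here is illustrative rather than um2nc's actual CLI code.

```python
import argparse

parser = argparse.ArgumentParser(prog="um2nc")
# Hypothetical opt-in flag: renaming is off unless the user asks for it
parser.add_argument(
    "--fix-coords",
    action="store_true",
    help="rename latitude/longitude coordinates to lat/lon (CMIP6-style)",
)

args = parser.parse_args(["--fix-coords"])
# args.fix_coords is True only when the flag is passed explicitly
```

With `action="store_true"` the default is `False`, so existing workflows keep the original coordinate names unless they opt in.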
I think there are also some small differences in the behaviour here compared to um2nc-standalone/src/um2nc/um2netcdf.py, lines 633 to 636 in 17c2d57.
I don't suggest these are things that necessarily need to be changed in your PR! I just thought I'd note them down for discussion in case there are any reasons behind the differences.
Thank you for your precious comments @blimlim! A general idea for the following:
I think these steps all perform tasks that are related to the naming of STASH variables (independent of whether they are ancillary files, dump files, or output files).

Good point, I agree it makes sense to be consistent with these.

I don't think it would ever be necessary to have the max/min suffix for ancillary files. Nor would we ever have ancillary files with winds. There's no problem with not having this processing; it's just a question of whether it's simpler to include it or to exclude it. `fix_long_name` is something that was just a work-around for xconv. It was only some strange variables that ever got close to the limit, and I'm pretty sure no ancillary variables would trigger it. We could probably just remove it in general.

Thanks @MartinDix for clarifying!
In that case I suspect it's easiest to include the same processing across all file types.
Fixes #101.
Fixes #8.
Based on the comments in Support conversion of ancillary files #101, I excluded the following tasks when the file is an ancillary file:
The unit tests pass and I am able to successfully convert all the ancillary files that I tested.
It would be good to add some regression tests for ancillary files too. More details in #189.