114 changes: 114 additions & 0 deletions CLOUD_SETUP.md
@@ -0,0 +1,114 @@
# Cloud Setup for ISMIP6 Indexing

This document describes the cloud infrastructure setup required for running the ISMIP6 virtualization pipeline using Lithops on Google Cloud Platform.

## Prerequisites

- Google Cloud SDK (`gcloud`) installed and configured
- Access to a GCP project with billing enabled
- Permissions to create service accounts and IAM policies

## GCP Service Account Setup

Lithops requires a service account with proper permissions to deploy Cloud Functions and access Cloud Storage.

### 1. Set Your Project ID

```bash
export PROJECT_ID=$(gcloud config get-value project)
```

For this project, the project ID is: `ds-englacial`

### 2. Create Service Account

```bash
gcloud iam service-accounts create lithops-executor \
  --display-name="Lithops Executor Service Account" \
  --project=$PROJECT_ID
```

### 3. Grant Required Permissions

See the [Lithops GCP Functions Documentation](https://lithops-cloud.github.io/docs/source/compute_config/gcp_functions.html) for the authoritative list of required permissions; the roles granted below may need to be updated to match it.

Grant Cloud Functions Developer role (to deploy and manage serverless functions):

```bash
gcloud projects add-iam-policy-binding $PROJECT_ID \
  --member="serviceAccount:lithops-executor@${PROJECT_ID}.iam.gserviceaccount.com" \
  --role="roles/cloudfunctions.developer"
```

Grant Storage Object Admin role (to read/write GCS buckets):

```bash
gcloud projects add-iam-policy-binding $PROJECT_ID \
  --member="serviceAccount:lithops-executor@${PROJECT_ID}.iam.gserviceaccount.com" \
  --role="roles/storage.objectAdmin"
```

### 4. Download Service Account Key

```bash
gcloud iam service-accounts keys create ~/lithops-sa-key.json \
  --iam-account=lithops-executor@${PROJECT_ID}.iam.gserviceaccount.com
```

This creates a JSON key file at `~/lithops-sa-key.json`.

## Enable services

See [Lithops GCP Functions Documentation](https://lithops-cloud.github.io/docs/source/compute_config/gcp_functions.html) for a list of Google Cloud services which need to be enabled.

## Lithops Configuration

The `lithops.yaml` configuration file should reference the service account key:

```yaml
lithops:
    backend: gcp_functions
    storage: gcp_storage

gcp:
    region: us-west1
    credentials_path: ~/lithops-sa-key.json

gcp_functions:
    region: us-west1
    runtime: ismip6-icechunk

gcp_storage:
    storage_bucket: ismip6-icechunk
    region: us-west1
```
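
A Python script can also consume this configuration directly when creating a Lithops executor. A minimal sketch (loading the YAML into a config dict is one way to do it; the pipeline may pass its configuration differently):

```python
import lithops
import yaml

# Load the same configuration the CLI consumes via `-c lithops.yaml`.
with open("lithops.yaml") as f:
    config = yaml.safe_load(f)

# Executor backed by GCP Cloud Functions and GCS, as configured above.
fexec = lithops.FunctionExecutor(config=config)
```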

### Important Notes

- **Do NOT use Application Default Credentials**: Lithops requires a service account key file with `client_email` and `token_uri` fields, which ADC doesn't provide
- **Keep the key file secure**: The service account key provides full access to the granted permissions
- **Storage bucket**: The `storage_bucket` parameter specifies where Lithops stores intermediate results and metadata

## GCS Buckets Used

- `gs://ismip6`: Source data bucket (public, read-only)
- `gs://ismip6-icechunk`: Target bucket for Icechunk repositories and failure logs

## Service Account Created

- **Name**: `lithops-executor`
- **Email**: `lithops-executor@ds-englacial.iam.gserviceaccount.com`
- **Roles**:
- `roles/cloudfunctions.developer`
- `roles/storage.objectAdmin`
- Additional roles: TBD
- **Key Location**: `~/lithops-sa-key.json`

## Build the runtime

```bash
# lithops runtime delete ismip6-icechunk -c lithops.yaml
lithops runtime build -f requirements.txt ismip6-icechunk -c lithops.yaml
lithops runtime deploy ismip6-icechunk -c lithops.yaml
```

104 changes: 104 additions & 0 deletions ICECHUNK_STORE.md
@@ -0,0 +1,104 @@
# Icechunk Store Documentation

## Overview

This document describes the Icechunk store created by the `virtualize_with_lithops.py` script for the ISMIP6 dataset. The store contains virtualized references to NetCDF files in Google Cloud Storage, allowing efficient access to the dataset without duplicating the underlying data.

## Store Location

- **Bucket**: `gs://ismip6-icechunk`
- **Source Data**: `gs://ismip6/`

## What is Icechunk and how is it being used

Icechunk is a transactional storage system for chunked array data. Here it is used to store metadata that references chunks in remote object storage (in this case the Google Cloud Storage bucket `gs://ismip6`), enabling efficient access to large scientific datasets without copying the underlying data.

## Store Structure

The Icechunk store is organized hierarchically using the same structure as the `gs://ismip6` source bucket, i.e.:

```
{institution}_{model_name}/{experiment}/{variable}
```

For example:
- `AWI_ISSM/ctrl/lithk`
- `JPL_ISSM/expAE01/ivol`

Each path contains a virtual Zarr dataset with metadata pointing to chunks in the original NetCDF files.

## How virtualize_with_lithops Creates the Store

### Processing Strategy

The script uses a three-step approach to build the Icechunk store:

1. **Build File Index**: Scans the `gs://ismip6/` bucket to identify all available NetCDF files
2. **Group Files**: Groups files by model name and experiment to create logical batches
3. **Process Batches**: For each batch:
- Virtualizes files in parallel using Lithops serverless functions
- Writes successful virtualizations to Icechunk in a single commit

### Virtualization Process

#### Step 1: File Virtualization

For each NetCDF file, the `virtualize_file()` function (a sketch follows this list):

1. Opens the file using `virtualizarr.open_virtual_dataset()`
2. Loads coordinate variables (`time`, `x`, `y`, `lat`, `lon`, etc.) into memory
3. References data variables virtually (no data is copied)
4. Applies ISMIP6-specific fixes:
- `fix_time_encoding()`: Corrects time coordinate encoding
- `correct_grid_coordinates()`: Fixes grid coordinate metadata
5. Returns the virtual dataset with its hierarchical path
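
A minimal sketch of this per-file step, assuming virtualizarr's parser/registry API and obstore's `GCSStore`; the file URL is a hypothetical example, and `fix_time_encoding()` / `correct_grid_coordinates()` are the script's own helpers, shown here only as comments:

```python
from obstore.store import GCSStore
from virtualizarr import open_virtual_dataset
from virtualizarr.parsers import HDFParser
from virtualizarr.registry import ObjectStoreRegistry

# Register an object store for the source bucket so byte ranges can be read.
registry = ObjectStoreRegistry({"gs://ismip6": GCSStore("ismip6")})

url = "gs://ismip6/AWI_ISSM/ctrl/lithk/lithk_example.nc"  # hypothetical example file

# Coordinates are loaded into memory; data variables remain virtual chunk references.
vds = open_virtual_dataset(
    url,
    registry=registry,
    parser=HDFParser(),
    loadable_variables=["time", "x", "y", "lat", "lon"],
)

# ISMIP6-specific fixes applied by the script (see the steps above):
# vds = fix_time_encoding(vds)
# vds = correct_grid_coordinates(vds)
```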

#### Step 2: Batch Writing

The `write_batch_to_icechunk()` function:

1. Opens or creates the Icechunk repository
2. Creates a writable session on the `main` branch
3. Writes all virtual datasets in the batch using `virtual_dataset_to_icechunk()`
4. Commits all changes in a single transaction

This approach minimizes the number of commits and reduces the risk of conflicts.
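
A rough sketch of that flow, assuming Icechunk's GCS storage helper and omitting the virtual chunk container configuration; `batch` stands for the list of (group path, virtual dataset) pairs from the previous step, and `virtual_dataset_to_icechunk()` is the script's own helper with an assumed signature:

```python
import icechunk

# Open the repository in the target bucket (created on the first run).
storage = icechunk.gcs_storage(bucket="ismip6-icechunk", from_env=True)
repo = icechunk.Repository.open_or_create(storage)

# One writable session per batch; all datasets land in a single commit.
session = repo.writable_session("main")
for group_path, vds in batch:
    virtual_dataset_to_icechunk(vds, session.store, group=group_path)  # assumed signature
snapshot_id = session.commit(f"Added {len(batch)} datasets: ...")
```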

#### Step 3: Parallel Processing (lines 104-217)

The `process_all_files()` function (a simplified sketch follows this list):

1. Groups files by model and experiment to create batches
2. Uses Lithops to parallelize virtualization within each batch
3. Processes batches sequentially to avoid Icechunk write conflicts
4. Logs any failures to `gs://ismip6-icechunk/failures/`
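
The fan-out itself follows the standard Lithops map/gather pattern; a simplified sketch in which the batch and helper names are illustrative, not the script's actual ones:

```python
import lithops

fexec = lithops.FunctionExecutor(config=config)  # config loaded from lithops.yaml

for batch_name, urls in batches.items():
    # Virtualize every file in the batch in parallel on Cloud Functions.
    futures = fexec.map(virtualize_file, urls)
    results = fexec.get_result(futures)

    # Successes go to Icechunk in one commit; failures are logged to
    # gs://ismip6-icechunk/failures/ (see Failure Handling below).
    write_batch_to_icechunk(batch_name, [r for r in results if r is not None])
```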

## Accessing the Store

See [notebooks/open_icechunk.ipynb](./notebooks/open_icechunk.ipynb) for a complete example.
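
For a quick look without the notebook, a minimal sketch, assuming the repository sits at the root of `gs://ismip6-icechunk` and using the example group path from above:

```python
import icechunk
import xarray as xr

storage = icechunk.gcs_storage(bucket="ismip6-icechunk", from_env=True)
repo = icechunk.Repository.open(storage)

# Read-only session on the main branch; data chunks are fetched lazily from gs://ismip6.
session = repo.readonly_session("main")
ds = xr.open_zarr(session.store, group="AWI_ISSM/ctrl/lithk", consolidated=False)
print(ds)
```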

## Commit History

Each batch write creates a commit with a message like:
```
Added 5 datasets: AWI_ISSM/ctrl/lithk, AWI_ISSM/ctrl/ivol, AWI_ISSM/ctrl/base, ...
```

The commit ID is logged during processing for traceability.
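
The history can also be inspected directly from the repository; a small sketch, assuming `repo` is opened as in the access example above:

```python
# Walk the main branch from newest to oldest commit.
for snapshot in repo.ancestry(branch="main"):
    print(snapshot.id, snapshot.written_at, snapshot.message)
```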

## Failure Handling

If virtualization or writing fails for any files, failures are logged to:
```
gs://ismip6-icechunk/failures/virtualization_failures_YYYYMMDD_HHMMSS.json
```

Each failure record includes the following fields; a sketch for reading these logs follows the list:
- URL of the failed file
- Error message
- Model name and experiment
- Timestamp
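
A small sketch for listing and inspecting the failure logs with `gcsfs`; the field names follow the list above, and the exact JSON layout is an assumption:

```python
import json

import gcsfs

fs = gcsfs.GCSFileSystem()

# Each batch run writes one timestamped JSON file under failures/.
for path in sorted(fs.glob("ismip6-icechunk/failures/virtualization_failures_*.json")):
    with fs.open(path) as f:
        failures = json.load(f)
    for record in failures:
        print(record["url"], record["error"])  # assumed field names
```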

## Improvements

* Virtualizing the entire archive is still fairly slow (~6 hours) when grouping by model and experiment. Could this be sped up, perhaps by virtualizing all files first and/or grouping only by model?
18 changes: 18 additions & 0 deletions lithops.yaml
@@ -0,0 +1,18 @@
lithops:
    backend: gcp_functions
    storage: gcp_storage
    data_limit: False

gcp:
    region: us-west1
    credentials_path: /Users/aimeebarciauskas/lithops-sa-key.json

gcp_functions:
    region: us-west1
    runtime: ismip6-icechunk
    runtime_memory: 8192
    runtime_timeout: 540
    project_id: ds-englacial

gcp_storage:
    storage_bucket: ismip6-icechunk  # Lithops uses storage_bucket, not bucket
2,424 changes: 2,424 additions & 0 deletions notebooks/open_icechunk.ipynb

Large diffs are not rendered by default.

4 changes: 4 additions & 0 deletions pyproject.toml
@@ -21,25 +21,29 @@ classifiers = [
]

dependencies = [
"coiled>=1.129.3",
"cftime>=1.6.5",
"dask>=2025.11.0",
"fsspec>=2025.9.0",
"gcsfs>=2025.9.0",
"h5netcdf>=1.7.3",
"holoviews>=1.19.0",
"hvplot>=0.10.0",
"icechunk>=1.1.13",
"ipywidgets>=8.1.0",
"jupyter>=1.1.0",
"jupyter-bokeh>=4.0.5",
"matplotlib>=3.10.7",
"netcdf4>=1.7.0",
"numpy>=2.3.4",
"obstore>=0.8.2",
"pandas>=2.3.3",
"panel>=1.5.4",
"pyarrow>=22.0.0",
"pyproj>=3.7.0",
"pyyaml>=6.0.0",
"scipy>=1.16.2",
"virtualizarr>=2.2.1",
"xarray>=2025.10.1",
]

32 changes: 32 additions & 0 deletions requirements.txt
@@ -0,0 +1,32 @@
# Mandatory Lithops packages
google-cloud
google-cloud-storage
google-cloud-pubsub
google-auth
google-api-python-client
numpy
six
requests
redis
pika
scikit-learn
diskcache
cloudpickle
ps-mem
tblib
PyYAML
urllib3
psutil

# other packages
lithops
icechunk
obstore
xarray
virtualizarr
zarr
numpy
pandas
h5netcdf
gcsfs
pyproj