114 changes: 114 additions & 0 deletions CLOUD_SETUP.md
@@ -0,0 +1,114 @@
# Cloud Setup for ISMIP6 Indexing

This document describes the cloud infrastructure setup required for running the ISMIP6 virtualization pipeline using Lithops on Google Cloud Platform.

## Prerequisites

- Google Cloud SDK (`gcloud`) installed and configured
- Access to a GCP project with billing enabled
- Permissions to create service accounts and IAM policies

## GCP Service Account Setup

Lithops requires a service account with proper permissions to deploy Cloud Functions and access Cloud Storage.

### 1. Set Your Project ID

```bash
export PROJECT_ID=$(gcloud config get-value project)
```

For this project, the project ID is: `ds-englacial`

### 2. Create Service Account

```bash
gcloud iam service-accounts create lithops-executor \
  --display-name="Lithops Executor Service Account" \
  --project=$PROJECT_ID
```

### 3. Grant Required Permissions

See the [Lithops GCP Functions Documentation](https://lithops-cloud.github.io/docs/source/compute_config/gcp_functions.html) for the authoritative list of required permissions; the roles granted below may need to be updated to match it.

Grant Cloud Functions Developer role (to deploy and manage serverless functions):

```bash
gcloud projects add-iam-policy-binding $PROJECT_ID \
  --member="serviceAccount:lithops-executor@${PROJECT_ID}.iam.gserviceaccount.com" \
  --role="roles/cloudfunctions.developer"
```

Grant Storage Object Admin role (to read/write GCS buckets):

```bash
gcloud projects add-iam-policy-binding $PROJECT_ID \
  --member="serviceAccount:lithops-executor@${PROJECT_ID}.iam.gserviceaccount.com" \
  --role="roles/storage.objectAdmin"
```

### 4. Download Service Account Key

```bash
gcloud iam service-accounts keys create ~/lithops-sa-key.json \
  --iam-account=lithops-executor@${PROJECT_ID}.iam.gserviceaccount.com
```

This creates a JSON key file at `~/lithops-sa-key.json`.

## Enable services

See [Lithops GCP Functions Documentation](https://lithops-cloud.github.io/docs/source/compute_config/gcp_functions.html) for a list of Google Cloud services which need to be enabled.

## Lithops Configuration

The `lithops.yaml` configuration file should reference the service account key:

```yaml
lithops:
    backend: gcp_functions
    storage: gcp_storage

gcp:
    region: us-west1
    credentials_path: ~/lithops-sa-key.json

gcp_functions:
    region: us-west1
    runtime: ismip6-icechunk

gcp_storage:
    storage_bucket: ismip6-icechunk
    region: us-west1
```
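
A Python script can also consume this configuration directly when creating a Lithops executor. A minimal sketch (loading the YAML into a config dict is one way to do it; the pipeline may pass its configuration differently):

```python
import lithops
import yaml

# Load the same configuration the CLI consumes via `-c lithops.yaml`.
with open("lithops.yaml") as f:
    config = yaml.safe_load(f)

# Executor backed by GCP Cloud Functions and GCS, as configured above.
fexec = lithops.FunctionExecutor(config=config)
```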

### Important Notes

- **Do NOT use Application Default Credentials**: Lithops requires a service account key file with `client_email` and `token_uri` fields, which ADC doesn't provide
- **Keep the key file secure**: The service account key provides full access to the granted permissions
- **Storage bucket**: The `storage_bucket` parameter specifies where Lithops stores intermediate results and metadata

## GCS Buckets Used

- `gs://ismip6`: Source data bucket (public, read-only)
- `gs://ismip6-icechunk`: Target bucket for Icechunk repositories and failure logs

## Service Account Created

- **Name**: `lithops-executor`
- **Email**: `lithops-executor@ds-englacial.iam.gserviceaccount.com`
- **Roles**:
- `roles/cloudfunctions.developer`
- `roles/storage.objectAdmin`
- Additional roles: TBD
- **Key Location**: `~/lithops-sa-key.json`

## Build the runtime

```bash
# lithops runtime delete ismip6-icechunk -c lithops.yaml
lithops runtime build -f requirements.txt ismip6-icechunk -c lithops.yaml
lithops runtime deploy ismip6-icechunk -c lithops.yaml
```

104 changes: 104 additions & 0 deletions ICECHUNK_STORE.md
@@ -0,0 +1,104 @@
# Icechunk Store Documentation

## Overview

This document describes the Icechunk store created by the `virtualize_with_lithops.py` script for the ISMIP6 dataset. The store contains virtualized references to NetCDF files in Google Cloud Storage, allowing efficient access to the dataset without duplicating the underlying data.

## Store Location

- **Bucket**: `gs://ismip6-icechunk`
- **Source Data**: `gs://ismip6/`

## What is Icechunk and how is it being used

Icechunk is a transactional storage system for chunked array data. Here it is used to store metadata that references chunks in remote object storage (in this case the Google Cloud Storage bucket `gs://ismip6`), enabling efficient access to large scientific datasets without copying the underlying data.

## Store Structure

The Icechunk store is organized hierarchically using the same structure as the `gs://ismip6` source bucket, i.e.:

```
{institution}_{model_name}/{experiment}/{variable}
```

For example:
- `AWI_ISSM/ctrl/lithk`
- `JPL_ISSM/expAE01/ivol`

Each path contains a virtual Zarr dataset with metadata pointing to chunks in the original NetCDF files.

## How virtualize_with_lithops Creates the Store

### Processing Strategy

The script uses a three-step approach to build the Icechunk store:

1. **Build File Index**: Scans the `gs://ismip6/` bucket to identify all available NetCDF files
2. **Group Files**: Groups files by model name and experiment to create logical batches
3. **Process Batches**: For each batch:
- Virtualizes files in parallel using Lithops serverless functions
- Writes successful virtualizations to Icechunk in a single commit

### Virtualization Process

#### Step 1: File Virtualization

For each NetCDF file, the `virtualize_file()` function (a sketch follows this list):

1. Opens the file using `virtualizarr.open_virtual_dataset()`
2. Loads coordinate variables (`time`, `x`, `y`, `lat`, `lon`, etc.) into memory
3. References data variables virtually (no data is copied)
4. Applies ISMIP6-specific fixes:
- `fix_time_encoding()`: Corrects time coordinate encoding
- `correct_grid_coordinates()`: Fixes grid coordinate metadata
5. Returns the virtual dataset with its hierarchical path
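
A minimal sketch of this per-file step, assuming virtualizarr's parser/registry API and obstore's `GCSStore`; the file URL is a hypothetical example, and `fix_time_encoding()` / `correct_grid_coordinates()` are the script's own helpers, shown here only as comments:

```python
from obstore.store import GCSStore
from virtualizarr import open_virtual_dataset
from virtualizarr.parsers import HDFParser
from virtualizarr.registry import ObjectStoreRegistry

# Register an object store for the source bucket so byte ranges can be read.
registry = ObjectStoreRegistry({"gs://ismip6": GCSStore("ismip6")})

url = "gs://ismip6/AWI_ISSM/ctrl/lithk/lithk_example.nc"  # hypothetical example file

# Coordinates are loaded into memory; data variables remain virtual chunk references.
vds = open_virtual_dataset(
    url,
    registry=registry,
    parser=HDFParser(),
    loadable_variables=["time", "x", "y", "lat", "lon"],
)

# ISMIP6-specific fixes applied by the script (see the steps above):
# vds = fix_time_encoding(vds)
# vds = correct_grid_coordinates(vds)
```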

#### Step 2: Batch Writing

The `write_batch_to_icechunk()` function:

1. Opens or creates the Icechunk repository
2. Creates a writable session on the `main` branch
3. Writes all virtual datasets in the batch using `virtual_dataset_to_icechunk()`
4. Commits all changes in a single transaction

This approach minimizes the number of commits and reduces the risk of conflicts.
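
A rough sketch of that flow, assuming Icechunk's GCS storage helper and omitting the virtual chunk container configuration; `batch` stands for the list of (group path, virtual dataset) pairs from the previous step, and `virtual_dataset_to_icechunk()` is the script's own helper with an assumed signature:

```python
import icechunk

# Open the repository in the target bucket (created on the first run).
storage = icechunk.gcs_storage(bucket="ismip6-icechunk", from_env=True)
repo = icechunk.Repository.open_or_create(storage)

# One writable session per batch; all datasets land in a single commit.
session = repo.writable_session("main")
for group_path, vds in batch:
    virtual_dataset_to_icechunk(vds, session.store, group=group_path)  # assumed signature
snapshot_id = session.commit(f"Added {len(batch)} datasets: ...")
```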

#### Step 3: Parallel Processing (lines 104-217)

The `process_all_files()` function (a simplified sketch follows this list):

1. Groups files by model and experiment to create batches
2. Uses Lithops to parallelize virtualization within each batch
3. Processes batches sequentially to avoid Icechunk write conflicts
4. Logs any failures to `gs://ismip6-icechunk/failures/`
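
The fan-out itself follows the standard Lithops map/gather pattern; a simplified sketch in which the batch and helper names are illustrative, not the script's actual ones:

```python
import lithops

fexec = lithops.FunctionExecutor(config=config)  # config loaded from lithops.yaml

for batch_name, urls in batches.items():
    # Virtualize every file in the batch in parallel on Cloud Functions.
    futures = fexec.map(virtualize_file, urls)
    results = fexec.get_result(futures)

    # Successes go to Icechunk in one commit; failures are logged to
    # gs://ismip6-icechunk/failures/ (see Failure Handling below).
    write_batch_to_icechunk(batch_name, [r for r in results if r is not None])
```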

## Accessing the Store

See [notebooks/open_icechunk.ipynb](./notebooks/open_icechunk.ipynb) for a complete example.
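
For a quick look without the notebook, a minimal sketch, assuming the repository sits at the root of `gs://ismip6-icechunk` and using the example group path from above:

```python
import icechunk
import xarray as xr

storage = icechunk.gcs_storage(bucket="ismip6-icechunk", from_env=True)
repo = icechunk.Repository.open(storage)

# Read-only session on the main branch; data chunks are fetched lazily from gs://ismip6.
session = repo.readonly_session("main")
ds = xr.open_zarr(session.store, group="AWI_ISSM/ctrl/lithk", consolidated=False)
print(ds)
```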

## Commit History

Each batch write creates a commit with a message like:
```
Added 5 datasets: AWI_ISSM/ctrl/lithk, AWI_ISSM/ctrl/ivol, AWI_ISSM/ctrl/base, ...
```

The commit ID is logged during processing for traceability.
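
The history can also be inspected directly from the repository; a small sketch, assuming `repo` is opened as in the access example above:

```python
# Walk the main branch from newest to oldest commit.
for snapshot in repo.ancestry(branch="main"):
    print(snapshot.id, snapshot.written_at, snapshot.message)
```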

## Failure Handling

If virtualization or writing fails for any files, failures are logged to:
```
gs://ismip6-icechunk/failures/virtualization_failures_YYYYMMDD_HHMMSS.json
```

Each failure record includes the following fields; a sketch for reading these logs follows the list:
- URL of the failed file
- Error message
- Model name and experiment
- Timestamp
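
A small sketch for listing and inspecting the failure logs with `gcsfs`; the field names follow the list above, and the exact JSON layout is an assumption:

```python
import json

import gcsfs

fs = gcsfs.GCSFileSystem()

# Each batch run writes one timestamped JSON file under failures/.
for path in sorted(fs.glob("ismip6-icechunk/failures/virtualization_failures_*.json")):
    with fs.open(path) as f:
        failures = json.load(f)
    for record in failures:
        print(record["url"], record["error"])  # assumed field names
```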

## Improvements

* Virtualizing the entire archive is still fairly slow (~6 hours) when grouping by model and experiment. Could this be sped up, perhaps by virtualizing all files first and/or grouping only by model?
18 changes: 18 additions & 0 deletions lithops.yaml
@@ -0,0 +1,18 @@
lithops:
    backend: gcp_functions
    storage: gcp_storage
    data_limit: False

gcp:
    region: us-west1
    credentials_path: /Users/aimeebarciauskas/lithops-sa-key.json

gcp_functions:
    region: us-west1
    runtime: ismip6-icechunk
    runtime_memory: 8192
    runtime_timeout: 540
    project_id: ds-englacial

gcp_storage:
    storage_bucket: ismip6-icechunk  # Lithops uses storage_bucket, not bucket
2,424 changes: 2,424 additions & 0 deletions notebooks/open_icechunk.ipynb

Large diffs are not rendered by default.

4 changes: 4 additions & 0 deletions pyproject.toml
@@ -21,25 +21,29 @@ classifiers = [
]

dependencies = [
"coiled>=1.129.3",
"cftime>=1.6.5",
"dask>=2025.11.0",
"fsspec>=2025.9.0",
"gcsfs>=2025.9.0",
"h5netcdf>=1.7.3",
"holoviews>=1.19.0",
"hvplot>=0.10.0",
"icechunk>=1.1.13",
"ipywidgets>=8.1.0",
"jupyter>=1.1.0",
"jupyter-bokeh>=4.0.5",
"matplotlib>=3.10.7",
"netcdf4>=1.7.0",
"numpy>=2.3.4",
"obstore>=0.8.2",
"pandas>=2.3.3",
"panel>=1.5.4",
"pyarrow>=22.0.0",
"pyproj>=3.7.0",
"pyyaml>=6.0.0",
"scipy>=1.16.2",
"virtualizarr>=2.2.1",
"xarray>=2025.10.1",
]

32 changes: 32 additions & 0 deletions requirements.txt
@@ -0,0 +1,32 @@
# Mandatory Lithops packages
google-cloud
google-cloud-storage
google-cloud-pubsub
google-auth
google-api-python-client
numpy
six
requests
redis
pika
scikit-learn
diskcache
cloudpickle
ps-mem
tblib
PyYAML
urllib3
psutil

# other packages
lithops
icechunk
obstore
xarray
virtualizarr
zarr
numpy
pandas
h5netcdf
gcsfs
pyproj