Merged
22 commits
110b30c
Average encoder (for FM ims)
vdplasthijs Jan 27, 2026
532ba64
Fix v issue
vdplasthijs Jan 27, 2026
39c1a93
Merge branch 'develop' into feature/eo_encoders
vdplasthijs Jan 27, 2026
25837b5
Reduce asserts
vdplasthijs Jan 27, 2026
4456b24
Merge pull request #32 from WUR-AI/feature/eo_encoders
vdplasthijs Jan 27, 2026
0a2fb90
typos
cn241 Jan 28, 2026
ec4b78c
Merge pull request #34 from cn241/feature/cn-feature
gabrieletijunaityte Jan 28, 2026
a66bd0f
Fix pooch fro s2bms
gabrieletijunaityte Jan 29, 2026
b82037e
resolve test issues first time running
vdplasthijs Jan 29, 2026
7db6994
Update readme
vdplasthijs Jan 29, 2026
31a12ea
pooch setup in butterfly ds
vdplasthijs Jan 29, 2026
8666fd3
undo because this will be dealt with on other branch
vdplasthijs Jan 29, 2026
e634af0
Merge pull request #37 from WUR-AI/feature/run_first_time
gabrieletijunaityte Jan 29, 2026
775dfd8
Fix cache dir
gabrieletijunaityte Jan 29, 2026
a7598af
Pooch for satbird
gabrieletijunaityte Jan 29, 2026
0943e1d
Fix txt extension for metadata
gabrieletijunaityte Jan 29, 2026
0400910
Fix pooch fro s2bms
gabrieletijunaityte Jan 29, 2026
dd86088
Fix cache dir
gabrieletijunaityte Jan 29, 2026
5530ec0
Pooch for satbird
gabrieletijunaityte Jan 29, 2026
6948a3a
Fix txt extension for metadata
gabrieletijunaityte Jan 29, 2026
412b183
Merge branch 'feature/fix_pooch' of github.com:WUR-AI/aether into fea…
gabrieletijunaityte Jan 29, 2026
f1b2c3e
Merge pull request #38 from WUR-AI/feature/fix_pooch
vdplasthijs Jan 29, 2026
7 changes: 4 additions & 3 deletions .env.example
@@ -1,13 +1,14 @@
# Adjust this file for storing private and user specific environment variables, like keys or system paths.
# rename it to ".env" (excluded from version control by default)

PROJECT_ROOT="path/to/aether"
PROJECT_ROOT="path/to/aether/" # path to your local aether repo
TRAINER_PROFILE="gpu" # cpu/gpu/mps/ddp
HF_HOME="/path/to/huggingface/cache" # set or will default to './.cache/huggingface/'
DATA_DIR="../data/" # set orwill default to './data/'

#----------------------------
# OPTIONALS
#----------------------------
HF_HOME="${PROJECT_ROOT}/.cache/huggingface/" # set or will default to './.cache/huggingface/'
DATA_DIR="${PROJECT_ROOT}/data/" # set to your local data folder (for aether), or will default to '${PROJECT_ROOT}/data/'

# Working directories
# STORAGE_MODE=# or "shared"
1 change: 1 addition & 0 deletions .gitignore
@@ -227,3 +227,4 @@ uv.lock
notebooks/01-TvdP-tmp.ipynb
*/source/*
*.tif # for now
..env.swp
32 changes: 27 additions & 5 deletions README.md
@@ -23,20 +23,22 @@ This project develops an EO embedding/language model that can be used for explai

### Virtual environment

First, install dependencies in a venv using [uv](https://docs.astral.sh/uv/getting-started/installation/)
To install the dependencies in a venv using [uv](https://docs.astral.sh/uv/getting-started/installation/), first, clone the repo:

```bash
# clone project
git clone https://github.com/WUR-AI/aether
cd aether
```

Then, create a virtual environment (or alternatively via conda):
```bash
# Create venv
python3 -m venv .venv
source .venv/bin/activate
```

Then, install `uv` and use this to install all packages.
```bash
# install uv manager
pip install uv
@@ -52,9 +54,16 @@ Note, running `uv sync` in the venv will always update the package to the most u

### Set paths

Next, create a file in your local repo parent folder `aether/` called `.env`. Copy the contents of `aether/env.example` and adjust the paths to your local system. **Important**: `DATA_DIR` should either point to `aether/data/` OR if it points to another folder (e.g., `my/local/data/`) then copy the contents of `aether/data/` to `my/local/data/` to ensure the butterfly use case runs using the provided example data. Other data will automatically be downloaded and organised by `pooch` if possible, or should be copied manually.
Next, create a file in your local repo parent folder `aether/` called `.env` and copy the contents of `aether/.env.example`:

Data folders should follow the following directory structure:
```bash
cp .env.example .env
```
Adjust the paths in `.env` to your local system. **At a minimum, you should set PROJECT_ROOT!**.

**Important**: `DATA_DIR` should either point to `aether/data/` (default setting) OR if it points to another folder (e.g., `my/local/data/`) then copy the contents of the `aether/data/` folder to `my/local/data/` to ensure the butterfly use case runs using the provided example data. Other data will automatically be downloaded and organised by `pooch` if possible into `DATA_DIR`, or should be copied manually.

Data folders should follow the following directory structure within `DATA_DIR`:

```
├── registry.txt <- Pooch config file, don't change.
@@ -73,7 +82,18 @@ Data folders should follow the following directory structure:
├── other_dataset/
```

### Training
### Verify installation:

To verify whether the installation was successful, run the tests in `aether/` using:
```bash
pytest --use-mock -m "not slow"
```
which should pass all tests.


## Training

Currently, we have implemented 2 models: a prediction model (that predicts target variables from EO data) and an alignment model (that aligns EO embeddings with text embeddings).

Experiment configurations (such as choosing data, encoders, hyperparameters etc.) are managed through [Hydra](https://hydra.cc/) configurations. Define your experiment configurations in `configs/experiments/experiment_name.yaml`, for example to train predictive model with GeoCLIP coordinate encoder for the Butterfly data using `configs/experiments/prediction.yaml` (copied below)

@@ -112,6 +132,8 @@ To execute this experiment run (inside your venv):
python train.py experiment=prediction
```

Please see the [Hydra](https://hydra.cc/) and [Hydra-Lightning template](https://github.com/ashleve/lightning-hydra-template) documentation for further examples of how to configure training runs.

## Directory structure

We follow the directory structure from the [Hydra-Lightning template](https://github.com/ashleve/lightning-hydra-template), which looks like:
@@ -136,7 +158,7 @@ We follow the directory structure from the [Hydra-Lightning template](https://gi
│ ├── eval.yaml <- Main config for evaluation
│ └── train.yaml <- Main config for training
├── data <- Project data
├── data <- Project data (for aether, this can also be elsewhere, see environment paths).
├── logs <- Logs generated by hydra and lightning loggers
2 changes: 1 addition & 1 deletion data/registry.txt
@@ -1,5 +1,5 @@
# S2BMS dataset (butterfly, ecology UC)
S2BMS.zip md5:af98bf3d1d0c4645c3c5787d49f59a70 doi:10.5281/zenodo.15198883
S2BMS.zip md5:af98bf3d1d0c4645c3c5787d49f59a70 https://zenodo.org/records/15198884/files/S2BMS.zip?download=1

# Satbird (birds, ecology UC)
Kenya.zip None https://drive.google.com/uc?id=19PSNaKQn1papoT-jN5FkzTp7Xf4M1juD
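
Note on the registry entries above: each line follows the `filename hash url` layout, and the hash may carry an algorithm prefix such as `md5:`. The sketch below shows how such a line could be split and a downloaded file checked against it. It is only an illustration: `parse_registry_line` and `file_matches` are hypothetical helper names, treating the literal `None` as "no checksum" is an assumption, and pooch performs its own verification internally.

```python
import hashlib
from pathlib import Path


def parse_registry_line(line: str):
    """Split a 'filename hash [url]' line into its parts (hypothetical helper)."""
    parts = line.split()
    fname, known_hash = parts[0], parts[1]
    url = parts[2] if len(parts) > 2 else None
    # Assumption: the literal "None" means that no checksum is available.
    if known_hash == "None":
        known_hash = None
    return fname, known_hash, url


def file_matches(path: Path, known_hash) -> bool:
    """Check a file against an 'algorithm:hexdigest' string (prefix required)."""
    if known_hash is None:
        return True  # nothing to verify against
    algorithm, _, expected = known_hash.partition(":")
    digest = hashlib.new(algorithm, path.read_bytes()).hexdigest()
    return digest == expected.lower()
```

With the S2BMS line from this hunk, `parse_registry_line` would return the file name, the string `md5:af98bf3d1d0c4645c3c5787d49f59a70`, and the Zenodo download URL.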
15 changes: 15 additions & 0 deletions data/s2bms/caption_templates/v1.json
@@ -0,0 +1,15 @@
[
"Location with <aux_corine_frac_top_1>, <aux_corine_frac_top_2> and <aux_corine_frac_top_3>.",
"Area with <aux_corine_frac_243> and <aux_corine_frac_322>.",
"Site with <aux_corine_frac_top_1> and <aux_corine_frac_top_2>, with <aux_bioclim_01> and <aux_bioclim_12>.",
"Location with <aux_corine_frac_top_1> and <aux_corine_frac_top_2>, with <aux_bioclim_05> and <aux_bioclim_06>.",
"Area with <aux_corine_frac_top_1> and <aux_corine_frac_top_2>, with <aux_bioclim_01> and <aux_bioclim_07>.",
"Site with <aux_corine_frac_top_1> and <aux_corine_frac_top_2>, with <aux_bioclim_13> and <aux_bioclim_14>.",
"Location with <aux_corine_frac_top_1> and <aux_corine_frac_top_2>, with <aux_bioclim_01>, <aux_bioclim_13> and <aux_bioclim_14>.",
"Area with <aux_corine_frac_top_1>, with <aux_bioclim_01> and <aux_bioclim_12>.",
"Site with <aux_corine_frac_top_1>, with <aux_bioclim_05> and <aux_bioclim_06>.",
"Location with <aux_corine_frac_top_1>, with <aux_bioclim_01> and <aux_bioclim_07>.",
"Area with <aux_corine_frac_top_1>, with <aux_bioclim_13> and <aux_bioclim_14>.",
"Site with <aux_corine_frac_top_1>, with <aux_bioclim_01>, <aux_bioclim_13> and <aux_bioclim_14>."

]
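
Note on the templates above: each caption contains `<aux_...>` placeholders that presumably get replaced with text derived from the auxiliary data. The diff does not show how the repository fills them, so the following is a sketch under the assumption of plain string substitution. `fill_caption` is a hypothetical helper, and the values in the usage example are made up for illustration.

```python
import re

# Matches placeholders such as <aux_corine_frac_top_1> or <aux_bioclim_01>.
PLACEHOLDER = re.compile(r"<(aux_[a-z0-9_]+)>")


def fill_caption(template: str, values: dict) -> str:
    """Replace every <aux_...> placeholder with its value (hypothetical helper)."""

    def substitute(match):
        key = match.group(1)
        if key not in values:
            # Fail loudly instead of leaving a raw placeholder in the caption.
            raise KeyError(f"no value for placeholder <{key}>")
        return str(values[key])

    return PLACEHOLDER.sub(substitute, template)


template = "Area with <aux_corine_frac_top_1>, with <aux_bioclim_01> and <aux_bioclim_12>."
# Made-up example values, not taken from the dataset.
values = {
    "aux_corine_frac_top_1": "mostly pastures",
    "aux_bioclim_01": "a mild annual mean temperature",
    "aux_bioclim_12": "high annual precipitation",
}
caption = fill_caption(template, values)
```

Raising on a missing key is a design choice in this sketch: it surfaces templates such as the second one, which reference fixed columns (`aux_corine_frac_243`, `aux_corine_frac_322`) that may not be present for every record.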
1 change: 1 addition & 0 deletions pyproject.toml
@@ -28,6 +28,7 @@ dependencies = [
"pre-commit>=4.5.1",
"pooch>=1.8.2",
"torchinfo>=1.8.0",
"transformers==4.57",
"gdown>=5.2.1",
]

10 changes: 8 additions & 2 deletions src/data/base_dataset.py
@@ -78,6 +78,7 @@ def __init__(
self.use_target_data: bool = use_target_data
self.use_aux_data: bool = use_aux_data
self.records: dict[str, Any] = self.get_records()
self.pooch_cli = None

@final
def get_records(self) -> dict[str, Any]:
@@ -93,7 +94,12 @@ def get_records(self) -> dict[str, Any]:
columns.extend(["lat", "lon"])
else:
# Add paths
self.add_modality_paths_to_df(modality, params["format"])
self.add_modality_paths_to_df(
modality,
params.get(
"format", KeyError(f"{modality} modality is missing format parameter")
),
)
columns.append(f"{modality}_path")

# Include targets
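
Note on the `params.get(...)` call in this hunk: `dict.get` returns its default argument rather than raising it, so when `"format"` is missing the `KeyError` instance itself would be passed on to `add_modality_paths_to_df`. If the intent is to fail with a descriptive message, something along these lines might be closer. This is a sketch only; `require_format` is a hypothetical helper, not part of the PR.

```python
def require_format(modality: str, params: dict) -> str:
    """Return params['format'] or raise a descriptive KeyError (hypothetical helper)."""
    try:
        return params["format"]
    except KeyError:
        raise KeyError(f"{modality} modality is missing format parameter") from None


# dict.get hands back the default object unchanged; it never raises it.
default = {}.get("format", KeyError("missing"))
```

Here `default` ends up being an exception object, which is why the lookup in the hunk would not stop execution on its own.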
@@ -218,7 +224,7 @@ def pooch_setup(self) -> None:

# Initialise pooch client
self.pooch_cli = pooch.create(
path=os.path.join(self.cache_dir, self.data_dir),
path=self.cache_dir,
base_url="",
registry=None,
)
8 changes: 5 additions & 3 deletions src/data/butterfly_dataset.py
@@ -2,6 +2,7 @@
from typing import Any, Dict, override

import numpy as np
import pooch
import torch

import src.data_preprocessing.data_utils as du
@@ -53,7 +54,7 @@ def setup(self):
return
elif mod == "s2":
self.setup_s2bms()
if self.modalities["s2"].get("preprocessing", "") == "zcored":
if self.modalities["s2"].get("preprocessing", "") == "zscored":
self.init_norm_stats()
elif mod == "tessera":
self.setup_tessera()
@@ -69,7 +70,8 @@ def setup_s2bms(self) -> None:

# If data does not exist or is empty → full download
if not os.path.exists(dst_dir) or len(os.listdir(dst_dir)) == 0:
import pooch
if self.pooch_cli is None:
self.pooch_setup()

os.makedirs(dst_dir, exist_ok=True)
fnames = self.pooch_cli.fetch("S2BMS.zip", processor=pooch.Unzip())
@@ -81,7 +83,7 @@ def setup_s2bms(self) -> None:
# Move files to data dir
rename_s2bms(dst_dir, fnames)

with open(os.path.join(dst_dir, "meta.tx"), "w") as f:
with open(os.path.join(dst_dir, "meta.txt"), "w") as f:
f.writelines("Data from S2BMS study\n")
f.writelines("Containing 4 channel S2 256x256px imagery.\n")
# TODO: add more
12 changes: 6 additions & 6 deletions src/data/satbird_dataset.py
@@ -32,8 +32,9 @@ def __init__(
:param study_site: study site name [Kenya, USA_summer, USA_winter]
:param mock: whether to mock csv file
"""
# assert study_site in ["Kenya", "USA_summer", "USA_winter"]
assert study_site in ["Kenya"]
assert study_site in ["Kenya", "USA-summer", "USA-winter"]
# assert study_site in ["Kenya"]
self.study_site = study_site

super().__init__(
data_dir=data_dir,
@@ -47,14 +48,11 @@ def __init__(
mock=mock,
)

self.study_site = study_site

@override
def setup(self):
"""Setups the whole dataset, makes available data of requested modalities."""

# Set up each requested modality

for mod in self.modalities.keys():
if mod == "coords" and len(self.modalities.keys()) == 1:
return
@@ -95,8 +93,10 @@ def __getitem__(self, idx):
if modality in ["coords"]:
formatted_row["eo"][modality] = torch.tensor([row["lat"], row["lon"]])
elif modality in ["s2", "s2rgb"]:
formatted_row["eo"][modality] = self.load_s2(row[f"{modality}_path"])
s2 = self.load_s2(row[f"{modality}_path"])
# TODO: augmentations
s2 = v2.CenterCrop(self.modalities[modality].get("size", 256))(s2)
formatted_row["eo"][modality] = s2
elif modality == "tessera":
formatted_row["eo"][modality] = self.load_npy(row["tessera_path"])
# TODO any normalisation needed
18 changes: 13 additions & 5 deletions src/data_preprocessing/pooch_helpers.py
@@ -1,10 +1,18 @@
import os
import sys

import gdown


def drive_downloader(url, output_file, pooch_obj):
if os.path.exists(output_file):
print(f"{output_file} already exists, skipping.")
return
gdown.download(url, str(output_file), quiet=False)
"""Downloader callback for pooch that uses gdown to fetch files from Google Drive.

Uses fuzzy=True to handle Google Drive's virus scanning page and use_cookies=True to handle
access restrictions.
"""
gdown.download(
url,
str(output_file),
quiet=False,
fuzzy=True,
use_cookies=True,
)
23 changes: 17 additions & 6 deletions src/data_preprocessing/satbird.py
@@ -73,8 +73,8 @@ def pooch_satbird_downloader(

conf = {
"Kenya": ("Kenya.zip", pooch.Unzip),
"USA_summer": ("USA_summer.tar.gz", pooch.Untar),
"USA_winter": ("USA_winter.tar.gz", pooch.Untar),
"USA-summer": ("USA_summer.tar.gz", pooch.Untar),
"USA-winter": ("USA_winter.tar.gz", pooch.Untar),
}

fnames = pooch_cli.fetch(
@@ -86,7 +86,7 @@ def pooch_satbird_downloader(
extract_satbird_data(data_dir, fnames, study_site)

# Delete the unzipped dir at the end
if False:
if True:
unzip_dir = os.path.join(cache_dir, f"{study_site}.zip.unzip")
for name in os.listdir(unzip_dir):
path = os.path.join(unzip_dir, name)
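
Note on the cleanup path in this hunk: the directory is built as `f"{study_site}.zip.unzip"`, while the `conf` mapping above pairs the `USA-summer` and `USA-winter` keys with underscore-named `.tar.gz` archives handled by `pooch.Untar`. By default pooch's `Unzip` and `Untar` processors extract into a folder named after the archive plus `.unzip` or `.untar`, so the path may only line up for `Kenya`. A sketch of deriving the folder from the archive name instead, assuming those defaults are in use (`extraction_dir` and `ARCHIVES` are hypothetical names):

```python
import os

# Archive names copied from the conf mapping in this diff; the suffixes are the
# default extraction-folder suffixes of pooch.Unzip and pooch.Untar.
ARCHIVES = {
    "Kenya": ("Kenya.zip", ".unzip"),
    "USA-summer": ("USA_summer.tar.gz", ".untar"),
    "USA-winter": ("USA_winter.tar.gz", ".untar"),
}


def extraction_dir(cache_dir: str, study_site: str) -> str:
    """Return the folder the archive for study_site is extracted into (hypothetical helper)."""
    archive, suffix = ARCHIVES[study_site]
    return os.path.join(cache_dir, archive + suffix)
```

Checking `os.path.isdir(...)` on the result before calling `os.listdir` would also avoid an error when the folder has already been removed.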
@@ -123,7 +123,6 @@ def extract_satbird_data(data_dir: str, fnames: list[str], study_site: str) -> N
# Iterate through all file names from pooch

for fname in fnames:

# get the base name
base = os.path.basename(fname)
dst = None
@@ -137,15 +136,15 @@ def extract_satbird_data(data_dir: str, fnames: list[str], study_site: str) -> N
elif "environmental" in fname:
dst = os.path.join(env_dir, f"environmental_{base}")
elif "images_visual" in fname:
base = base.replace("_visual", " ")
base = base.replace("_visual", "")
dst = os.path.join(s2rgb_dir, f"s2rgb_{base}")
elif "images" in fname:
dst = os.path.join(s2_dir, f"s2_{base}")
elif "splits_final.csv" in fname:
splits_file.append(fname)

if dst is not None and not os.path.exists(dst):
shutil.copy(fname, dst)
shutil.move(fname, dst)
print(f"Moving {base} to {dst}")

# Compile model ready csv and split file
@@ -232,3 +231,15 @@ def make_model_ready_csv(
df_joined.rename(columns=rename_col, inplace=True)
df_joined.to_csv(model_ready_csv_path, index=False)
print(f"Model ready csv saved {model_ready_csv_path}")


if __name__ == "__main__":
print(os.getcwd())
study_site = "USA-winter"

setup_satbird_from_pooch(
f"data/satbird-{study_site}/",
cache_dir="data/cache",
study_site=study_site,
registry_file="data/registry.txt",
)