Supported Models (Advanced Reference)
This page preserves the detailed model comparison matrices and preprocessing notes.
If you are choosing a model for the first time, start with:
If you are authoring a new per-model doc page, use:
This page is most useful after you have already narrowed down candidate models and want to compare:
- preprocessing assumptions
- temporal packaging
- side-input requirements
- environment-variable tuning knobs
How To Use This Page
Quick chooser by goal
| Goal | Start with | Why |
|---|---|---|
| Fast baseline / simple pipeline | `tessera`, `gse`, `copernicus` | Precomputed embeddings, fewer runtime dependencies |
| General on-the-fly RGB experiments | `remoteclip`, `satmae`, `scalemae`, `dynamicvis` | Simple S2 RGB input paths |
| Time-series modeling | `agrifm`, `anysat`, `galileo` | Native multi-frame temporal packaging |
| Multispectral / strict spectral semantics | `dofa`, `terramind`, `thor`, `satvision` | Strong channel/schema assumptions |
| S1/S2 modality experiments | `terrafm` | Supports S2 or S1 paths (per call) |
Readability tips
- Start with Quick Comparison if you are deciding between models
- Read Temporal Handling and Multi-frame Semantics before comparing temporal models
- Read Modality and Extra Inputs Matrix if you need fair cross-model benchmarking
- Read Environment Variables... only when tuning preprocessing or reproducing training pipelines
Precomputed Embeddings
| Model | ID | Output | Resolution | Dim | Time Coverage | Notes |
|---|---|---|---|---|---|---|
| Tessera | `tessera` | pooled / grid | 10 m | 128 | 2017–2025 | GeoTessera global tile embeddings |
| Google Satellite Embedding (Alpha Earth) | `gse` | pooled / grid | 10 m | 64 | 2017–2024 | Annual embeddings via GEE |
| Copernicus Embed | `copernicus` | pooled / grid | 0.25° | 768 | 2021 | Official Copernicus embeddings |
On-the-fly Foundation Models
Source of truth:
- `src/rs_embed/embedders/catalog.py`
- `src/rs_embed/embedders/onthefly_*.py`
- `src/rs_embed/embedders/_vit_mae_utils.py`
- `src/rs_embed/embedders/runtime_utils.py`
Registered on-the-fly IDs:
- `remoteclip`
- `satmae`
- `scalemae`
- `anysat`
- `dynamicvis`
- `galileo`
- `wildsat`
- `prithvi`
- `terrafm`
- `terramind`
- `dofa`
- `fomo`
- `thor`
- `agrifm`
- `satvision`
Quick Comparison
| Model ID | Architecture / Backbone | Input | Default Preprocessing | Resize / Crop / Pad | Output Structure | Training Alignment |
|---|---|---|---|---|---|---|
| `remoteclip` | rshf.remoteclip.RemoteCLIP (open_clip style CLIP ViT) | S2 RGB (B4,B3,B2) | raw SR 0..10000 -> /10000 -> RGB uint8; then model transform if available, else CLIP norm | image size 224; fallback path uses Resize + CenterCrop; no pad | pooled vector or ViT token grid | Medium (high if wrapper transform matches training; fallback is generic CLIP pipeline) |
| `satmae` | rshf.satmae.SatMAE | S2 RGB (B4,B3,B2) | raw SR -> /10000 -> RGB uint8; prefer model transform, else CLIP norm | default 224; CLIP fallback has Resize + CenterCrop; no pad | token sequence -> pooled or patch-token grid | Medium |
| `scalemae` | rshf.scalemae.ScaleMAE (ViT style) | S2 RGB (B4,B3,B2) + `input_res_m` | raw SR -> /10000 -> RGB uint8; CLIP norm tensor; pass `input_res_m` | default 224; CLIP path has Resize + CenterCrop; no pad | token sequence or pooled vector depending on wrapper output | Medium |
| `anysat` | AnySat from upstream hubconf.py (AnySat) | S2 10-band TCHW (or CHW auto-expanded) | clip to 0..10000; normalize mode default `per_tile_zscore`; builds per-frame `s2_dates` | resize TCHW to default 24; no crop, no pad | patch output [D,H,W], pooled by spatial mean/max | Medium |
| `dynamicvis` | DynamicVisBackbone (official repo) | S2 RGB (B4,B3,B2) | raw SR -> /10000 -> RGB uint8 -> ImageNet mean/std | default 512; resize to square; no pad | last feature map [D,H,W], pooled by spatial mean/max | Medium |
| `galileo` | Encoder from official single_file_galileo.py | S2 10-band TCHW (or CHW auto-expanded) | clip to 0..10000; normalize mode default `unit_scale`; constructs Galileo tensors with configurable T + per-frame months, optional NDVI channel | default 64 with patch 8; bilinear resize; no pad | pooled token vector and S2-group token grid | Medium |
| `wildsat` | WildSAT backbone + optional image head from checkpoint | S2 RGB CHW | clip to 0..10000 then /10000; default normalization `minmax`; convert to uint8 then unit tensor | default 224; resize RGB; no pad | pooled branch output and optional grid (token or feature path) | Medium-Low |
| `prithvi` | TerraTorch BACKBONE_REGISTRY Prithvi backbone | S2 6-band (BLUE,GREEN,RED,NIR_NARROW,SWIR_1,SWIR_2) | raw SR -> /10000 -> clamp [0,1]; prep mode from env | default mode resize to 224; optional pad to patch multiple (legacy) | token sequence -> pooled or patch-token grid | Medium |
| `terrafm` | TerraFM-B from HF code/weights | S2 12-band or S1 VV/VH | S2: /10000 to [0,1]; S1: log1p + p99 scaling to [0,1] | resize to 224; no pad | pooled embedding, optional feature-map grid | Medium |
| `terramind` | TerraTorch BACKBONE_REGISTRY TerraMind backbone | S2 SR 12-band | raw 0..10000; resize 224; z-score with TerraMind v1/v01 pretrained mean/std | fixed 224; no pad | token sequence -> pooled or patch-token grid | High |
| `dofa` | TorchGeo DOFA (dofa_base_patch16_224 / dofa_large_patch16_224) | multi-band SR CHW + wavelengths | raw SR -> /10000 to [0,1]; provide/infer wavelengths | bilinear resize to 224; explicitly no crop/pad | pooled vector or token grid (usually 14x14) | Medium-High |
| `fomo` | FoMo MultiSpectralViT (FoMo-Bench) | S2 SR 12-band | clip 0..10000; default `unit_scale` (optional `minmax`/`none`) | default 64; bilinear resize; no pad | token sequence pooled; grid as spectral-mean patch-token map | Medium |
| `thor` | THOR via TerraTorch + thor_terratorch_ext | S2 SR 10-band | clip 0..10000; default `thor_stats` z-score after reflectance scaling | default 288; bilinear resize; no pad | pooled tokens and grouped token grid | Medium-High |
| `agrifm` | AgriFM PretrainingSwinTransformer3DEncoder | S2 10-band time series [T,C,H,W] | clip 0..10000; default `agrifm_stats` z-score using official config stats | default 224; TCHW resize; no pad | feature map grid [D,H,W], pooled by spatial mean/max | High |
| `satvision` | timm SwinTransformerV2 (SatVision-TOA checkpoints) | TOA 14 channels in strict order | channel-aware normalization to [0,1] (auto/raw/unit, reflectance + emissive calibration) | default 128; bilinear resize; no pad | model output as pooled or grid depending on tensor shape | High (if band order and calibration match checkpoint) |
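Two scaling steps recur in the preprocessing column above: surface reflectance divided by 10000 and converted to uint8, and the S1 `log1p` + p99 scaling listed for `terrafm`. The sketch below illustrates both on plain Python lists standing in for image arrays. The function names, the rounding rule, and the nearest-rank percentile are assumptions made for illustration; the adapters themselves may differ in these details.

```python
from __future__ import annotations

import math


def sr_to_uint8(sr_value: float) -> int:
    """Clip raw surface reflectance to 0..10000, scale to [0, 1], map to 0..255.

    Rounding to the nearest integer is an assumption of this sketch.
    """
    unit = min(max(sr_value, 0.0), 10000.0) / 10000.0
    return int(round(unit * 255.0))


def s1_log_p99_scale(values: list[float]) -> list[float]:
    """log1p followed by division by the 99th percentile, clipped to [0, 1].

    Uses a nearest-rank percentile (an assumption); negative inputs are
    treated as 0 before the log.
    """
    if not values:
        return []
    logged = [math.log1p(max(v, 0.0)) for v in values]
    ordered = sorted(logged)
    rank = max(1, math.ceil(0.99 * len(ordered)))  # nearest-rank, 1-based
    p99 = ordered[rank - 1]
    if p99 <= 0.0:
        return [0.0] * len(logged)
    return [min(x / p99, 1.0) for x in logged]
```

Values outside 0..10000 saturate at 0 or 255, which matches the "clip to 0..10000" wording used by several rows in the table.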
Temporal Handling
- For most on-the-fly adapters, `TemporalSpec.range(start, end)` means: filter imagery in `[start, end)`, then build one composite patch for model input (`median` by default, or `mosaic` if configured via `SensorSpec.composite`).
- In these adapters, `meta.input_time` is typically the midpoint of the temporal window and is mainly metadata (or an auxiliary time signal for models that require it), not a guaranteed single-scene acquisition date.
- Multi-frame adapters: `agrifm`, `anysat`, and `galileo` fetch TCHW sequences by splitting the requested range into sub-windows and compositing each sub-window into one frame.
- Current single-composite adapters include: `remoteclip`, `satmae`, `scalemae`, `dynamicvis`, `wildsat`, `prithvi`, `terrafm`, `terramind`, `dofa`, `fomo`, `thor`, and `satvision`.
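The end-exclusive window and the midpoint used for `meta.input_time` can be illustrated with the standard library alone. This is a minimal sketch: `window_midpoint` and `in_window` are hypothetical helper names, and rounding the midpoint down to a whole day is an assumption rather than documented behavior.

```python
from datetime import date


def window_midpoint(start: date, end: date) -> date:
    """Midpoint of the half-open window [start, end), rounded down to a day."""
    if end <= start:
        raise ValueError("end must be after start")
    # date + timedelta ignores the sub-day part, so odd spans round down.
    return start + (end - start) // 2


def in_window(day: date, start: date, end: date) -> bool:
    """End-exclusive membership test: start <= day < end."""
    return start <= day < end
```

For a window from 2022-06-01 to 2022-09-01, a scene dated 2022-09-01 is excluded and the midpoint falls on 2022-07-17.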
Multi-frame Semantics
Shared behavior for current multi-frame adapters (agrifm, anysat, galileo):
- Frame construction: split `TemporalSpec.range(start, end)` into `T` equal sub-windows (end-exclusive), then composite each sub-window into one frame.
- Missing-observation fallback: if a sub-window has no valid image, the provider path reuses a fallback composite so the frame count remains stable.
- Fixed frame count: the runtime always ensures exactly `T` frames for model input. For user-provided `input_chw`, `CHW` is repeated to `T`, and `TCHW` is padded/truncated to `T`.
- Sensor compositing policy: frame composite mode follows `SensorSpec.composite` (`median` default, `mosaic` optional).
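The frame construction and fixed-frame-count rules above can be sketched as follows. This is an illustration under stated assumptions, not the adapter code: `split_range` and `fix_frame_count` are hypothetical names, frames are treated as opaque objects in a list, and padding by repeating the last frame is an assumed strategy (the padding method is not specified here).

```python
from __future__ import annotations

from datetime import datetime


def split_range(start: datetime, end: datetime, t: int) -> list[tuple[datetime, datetime]]:
    """Split [start, end) into t equal, end-exclusive sub-windows."""
    if t < 1:
        raise ValueError("t must be >= 1")
    if end <= start:
        raise ValueError("end must be after start")
    step = (end - start) / t
    # The final edge is pinned to `end` so the bins cover the window exactly.
    edges = [start + i * step for i in range(t)] + [end]
    return list(zip(edges[:-1], edges[1:]))


def fix_frame_count(frames: list, t: int) -> list:
    """Return exactly t frames: truncate extras, or pad by repeating the last frame."""
    if not frames:
        raise ValueError("at least one frame is required")
    if len(frames) >= t:
        return frames[:t]
    return frames + [frames[-1]] * (t - len(frames))
```

A single CHW input corresponds to `fix_frame_count([chw], t)`, which repeats that one frame `t` times.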
Per-model temporal packaging:
| Model ID | Frame count env (default) | Temporal side input | Notes |
|---|---|---|---|
| `agrifm` | `RS_EMBED_AGRIFM_FRAMES` (8) | none (uses TCHW directly) | Temporal information is encoded only in the frame stack. |
| `anysat` | `RS_EMBED_ANYSAT_FRAMES` (8) | `s2_dates` (per-frame DOY, 0..364) | DOY values are derived from each frame bin midpoint date. |
| `galileo` | `RS_EMBED_GALILEO_FRAMES` (8) | `months` (per-frame month, 1..12) | By default from frame bin midpoints; `RS_EMBED_GALILEO_MONTH` can force a constant month for all frames. |
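To make the side inputs in the table concrete, the sketch below derives a zero-based day of year and a month from each frame bin midpoint. The helper names are hypothetical. Clamping to 364 so that 31 December of a leap year stays inside the documented 0..364 range is an assumption, as is passing the forced month as a function argument instead of reading `RS_EMBED_GALILEO_MONTH` directly.

```python
from __future__ import annotations

from datetime import datetime


def bin_midpoint(start: datetime, end: datetime) -> datetime:
    """Midpoint of one frame bin [start, end)."""
    return start + (end - start) / 2


def day_of_year_zero_based(moment: datetime) -> int:
    """Zero-based day of year, clamped to 0..364 (clamp is an assumption)."""
    return min(moment.timetuple().tm_yday - 1, 364)


def frame_doys(midpoints: list[datetime]) -> list[int]:
    """Per-frame DOY values in the style of `s2_dates`."""
    return [day_of_year_zero_based(m) for m in midpoints]


def frame_months(midpoints: list[datetime], forced_month: int | None = None) -> list[int]:
    """Per-frame months (1..12); a forced month overrides every frame."""
    if forced_month is not None:
        if not 1 <= forced_month <= 12:
            raise ValueError("forced_month must be in 1..12")
        return [forced_month] * len(midpoints)
    return [m.month for m in midpoints]
```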
Modality and Extra Inputs Matrix
Interpretation:
- "Backbone multimodal" means the upstream foundation model family supports multiple modalities.
- "Current rs-embed path" means what this implementation currently feeds in practice.
- "Requires extra metadata" means additional non-image inputs required by the forward path (hard requirement).
| Model ID | Backbone multimodal? | Current rs-embed path uses multiple modalities? | Multi-input forward (beyond image tensor)? | Requires extra metadata? |
|---|---|---|---|---|
| `remoteclip` | No | No | No | No |
| `satmae` | No | No | No | No |
| `scalemae` | No | No | Yes (`input_res_m`) | Yes: scale/resolution (`sensor.scale_m`) |
| `anysat` | Yes | Partially (S2-only imagery, plus temporal date tokens) | Yes (`s2`, `s2_dates`) | Yes: day-of-year/date signal (derived from temporal range) |
| `dynamicvis` | No | No | No | No |
| `galileo` | Yes | Mostly S2 path in current adapter + temporal month tokens | Yes (multiple tensors + masks + months) | Yes: month/time signal (derived from temporal range) |
| `wildsat` | No | No | No | No |
| `prithvi` | No (this adapter path) | No | Yes (`x`, `temporal_coords`, `location_coords`) | Yes: location + time are required |
| `terrafm` | Yes (S1/S2) | Yes (select one modality per call: `s1` or `s2`) | No | No hard extra metadata (optional S1 options: orbit, linear/DB path) |
| `terramind` | Yes | Usually single selected modality (S2L2A default) | No (single selected modality tensor in this adapter) | No hard extra metadata |
| `dofa` | Yes (spectral generalization) | Yes (multi-band spectral input) | Yes (image + wavelength list) | Yes: per-band wavelengths (explicit or inferable from bands) |
| `fomo` | No | No | No | No |
| `thor` | No (this adapter path) | No | No | No |
| `agrifm` | No (this adapter path) | No | No extra side tensor, but temporal stack [T,C,H,W] required | Temporal coverage is important (no separate metadata tensor) |
| `satvision` | No (this adapter path) | No | No separate side tensor | Yes: strict 14-channel order/calibration schema (band semantics) |
Practically multi-input models:
- `prithvi`: image + temporal coords + location coords
- `anysat`: image/time-series + date tokens (`s2_dates`)
- `galileo`: image-derived tensors + masks + per-frame month tokens (`months`)
- `dofa`: image + wavelength vector
- `scalemae`: image + `input_res_m`
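When benchmarking across models, it can help to check up front which side inputs each model needs. The lookup below is a hypothetical sketch built from the list above. The model IDs come from this page, but the labels `masks` and `wavelengths` and the helper `missing_extra_inputs` are illustrative and are not confirmed rs-embed argument names.

```python
from __future__ import annotations

# Illustrative mapping from model ID to the side inputs listed above.
EXTRA_INPUTS: dict[str, set[str]] = {
    "prithvi": {"temporal_coords", "location_coords"},
    "anysat": {"s2_dates"},
    "galileo": {"masks", "months"},
    "dofa": {"wavelengths"},
    "scalemae": {"input_res_m"},
}


def missing_extra_inputs(model_id: str, provided: set[str]) -> set[str]:
    """Return the side inputs a model needs that the caller has not supplied."""
    required = EXTRA_INPUTS.get(model_id, set())
    return required - set(provided)
```

Models absent from the mapping, such as `remoteclip`, return an empty set because they take only the image tensor.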
Environment Variables That Directly Change Preprocessing/Temporal Packaging
| Model ID | Main preprocessing env keys |
|---|---|
| `remoteclip` | fixed image_size=224 in code path; no per-model preprocess env switch |
| `satmae` | `RS_EMBED_SATMAE_IMG` |
| `scalemae` | `RS_EMBED_SCALEMAE_IMG` |
| `anysat` | `RS_EMBED_ANYSAT_IMG`, `RS_EMBED_ANYSAT_NORM`, `RS_EMBED_ANYSAT_FRAMES` |
| `dynamicvis` | `RS_EMBED_DYNAMICVIS_IMG` |
| `galileo` | `RS_EMBED_GALILEO_IMG`, `RS_EMBED_GALILEO_PATCH`, `RS_EMBED_GALILEO_NORM`, `RS_EMBED_GALILEO_INCLUDE_NDVI`, `RS_EMBED_GALILEO_FRAMES`, `RS_EMBED_GALILEO_MONTH` |
| `wildsat` | `RS_EMBED_WILDSAT_IMG`, `RS_EMBED_WILDSAT_NORM` |
| `prithvi` | `RS_EMBED_PRITHVI_PREP`, `RS_EMBED_PRITHVI_IMG`, `RS_EMBED_PRITHVI_PATCH_MULT` |
| `terrafm` | modality and sensor-side options (`s2`/`s1`); image size fixed to 224 in implementation |
| `terramind` | `RS_EMBED_TERRAMIND_NORMALIZE` (default z-score stats), image size fixed 224 |
| `dofa` | image size fixed 224; provider/tensor channels and wavelengths drive preprocessing |
| `fomo` | `RS_EMBED_FOMO_IMG`, `RS_EMBED_FOMO_NORM` |
| `thor` | `RS_EMBED_THOR_IMG`, `RS_EMBED_THOR_NORMALIZE` |
| `agrifm` | `RS_EMBED_AGRIFM_IMG`, `RS_EMBED_AGRIFM_NORM`, `RS_EMBED_AGRIFM_FRAMES` |
| `satvision` | `RS_EMBED_SATVISION_TOA_IMG`, `RS_EMBED_SATVISION_TOA_NORM`, channel-index and calibration env keys |
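The keys above are ordinary process environment variables. The sketch below shows one common way to read an integer-valued key with a default. `env_int` is a hypothetical helper, and falling back to the default on an empty or non-integer value is an assumption; the library's own parsing may behave differently.

```python
import os


def env_int(name: str, default: int) -> int:
    """Read an integer environment variable, falling back to `default`.

    Empty or non-integer values fall back to the default (an assumption
    of this sketch, not documented rs-embed behavior).
    """
    raw = os.environ.get(name, "").strip()
    if not raw:
        return default
    try:
        return int(raw)
    except ValueError:
        return default


# Example: request 4 frames for AnySat instead of the documented default of 8.
os.environ["RS_EMBED_ANYSAT_FRAMES"] = "4"
frames = env_int("RS_EMBED_ANYSAT_FRAMES", 8)
```

Whether these variables are read at import time or on each call is not stated on this page, so setting them before the embedder is created is the safer order.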
Practical Guidance
- For highest reproducibility, keep each model's default normalization mode unless you can match the original training pipeline exactly.
- For strict-schema models (`satvision`, `terramind`, `thor`, `agrifm`), do not change channel order unless checkpoint metadata explicitly allows it.
- If comparing embeddings across models, standardize ROI and temporal compositing first; model preprocessing differences are substantial.