Supported Models (Advanced Reference)

This page is for cross-model comparison after you already have a shortlist.

If you are choosing a model for the first time, start with:

If you need the exact contract for one specific model, use the per-model detail pages in Reference -> Model Details.

If you are authoring a new per-model doc page, use:


Use this page to compare:

  1. preprocessing assumptions
  2. temporal packaging
  3. side-input requirements
  4. environment-variable tuning knobs

How To Use This Page

Reading tips

  • Start with Quick Comparison if you are deciding between models
  • Read Temporal Handling and Multi-frame Semantics before comparing temporal models
  • Read Modality and Extra Inputs Matrix if you need fair cross-model benchmarking
  • Read Environment Variables... only when tuning preprocessing or reproducing training pipelines

Canonical model IDs on this page use the short public names from MODEL_SPECS, such as remoteclip, prithvi, terrafm, and thor. Some linked detail-page filenames still use older names for compatibility.

Precomputed Embeddings

| Model | ID | Output | Resolution | Dim | Time Coverage | Notes |
|---|---|---|---|---|---|---|
| Tessera | tessera | pooled / grid | 10 m | 128 | 2017–2025 | GeoTessera global tile embeddings |
| Google Satellite Embedding (Alpha Earth) | gse | pooled / grid | 10 m | 64 | 2017–2024 | Annual embeddings via GEE |
| Copernicus Embed | copernicus | pooled / grid | 0.25° | 768 | 2021 | Official Copernicus embeddings |

On-the-fly Foundation Models

Source of truth:

  • src/rs_embed/embedders/catalog.py
  • src/rs_embed/embedders/onthefly_*.py
  • src/rs_embed/embedders/_vit_mae_utils.py
  • src/rs_embed/embedders/runtime_utils.py

Quick Comparison

Use this table for a first-pass side-by-side comparison of input assumptions and preprocessing behavior.

| Model ID | Architecture / Backbone | Default Fetch Resolution | Input | Default Preprocessing | Resize / Crop / Pad | Output Structure | Training Alignment |
|---|---|---|---|---|---|---|---|
| remoteclip | rshf.remoteclip.RemoteCLIP (open_clip style CLIP ViT) | 10m | S2 RGB (B4,B3,B2) | raw SR 0..10000 -> /10000 -> RGB uint8; then model transform if available, else CLIP norm | image size 224; fallback path uses Resize + CenterCrop; no pad | pooled vector or ViT token grid | Medium (high if wrapper transform matches training; fallback is generic CLIP pipeline) |
| satmae | rshf.satmae.SatMAE | 10m | S2 RGB (B4,B3,B2) | raw SR -> /10000 -> RGB uint8; prefer model transform, else CLIP norm | default 224; CLIP fallback has Resize + CenterCrop; no pad | token sequence -> pooled or patch-token grid | Medium |
| satmaepp | rshf.satmaepp.SatMAEPP | 10m | S2 RGB (B4,B3,B2) | raw SR -> /10000 -> RGB uint8; SatMAE++ fMoW eval preprocessing (Normalize + Resize(short side) + CenterCrop), default channel order bgr | default 224; source-aligned short-side resize + center crop; no pad | token sequence -> pooled or patch-token grid | High |
| satmaepp_s2_10b | SatMAE++ grouped-channel source branch (models_mae_group_channels.py, base / large runtime families) | 10m | S2 SR 10-band (B2,B3,B4,B5,B6,B7,B8,B8A,B11,B12) | clip 0..10000; source Sentinel min/max mapping to uint8; ToTensor + Resize(short side) + CenterCrop | default 96 with patch size 8; source-style resize/crop; no pad | grouped token sequence -> pooled or group-reduced spatial token grid | High |
| scalemae | rshf.scalemae.ScaleMAE (ViT style) | 10m | S2 RGB (B4,B3,B2) + input_res_m | raw SR -> /10000 -> RGB uint8; CLIP norm tensor; pass input_res_m | default 224; CLIP path has Resize + CenterCrop; no pad | token sequence or pooled vector depending on wrapper output | Medium |
| anysat | AnySat from upstream hubconf.py (AnySat, tiny / small / base) | 10m | S2 10-band TCHW (or CHW auto-expanded) | clip to 0..10000; normalize mode default per_tile_zscore; builds per-frame s2_dates | resize TCHW to default 24; no crop, no pad | patch output [D,H,W], pooled by spatial mean/max | Medium |
| galileo | Encoder from official single_file_galileo.py | 10m | S2 10-band TCHW (or CHW auto-expanded) | clip to 0..10000; normalize mode default unit_scale; constructs Galileo tensors with configurable T + per-frame months, optional NDVI channel | default 64 with patch 8; bilinear resize; no pad | pooled token vector and S2-group token grid | Medium |
| wildsat | WildSAT backbone + optional image head from checkpoint | 10m | S2 RGB CHW | clip to 0..10000 then /10000; default normalization minmax; convert to uint8 then unit tensor | default 224; resize RGB; no pad | pooled branch output and optional grid (token or feature path) | Medium-Low |
| prithvi | Vendored PrithviMAE runtime with HF checkpoints | 30m | S2 6-band (BLUE,GREEN,RED,NIR_NARROW,SWIR_1,SWIR_2) | raw SR -> /10000 -> clamp [0,1]; prep mode from env | default mode resize to 224; optional pad to patch multiple (legacy) | token sequence -> pooled or patch-token grid | Medium |
| terrafm | TerraFM-B from vendored runtime + HF weights | 10m | S2 12-band or S1 VV/VH | S2: /10000 to [0,1]; S1: log1p + p99 scaling to [0,1] | resize to 224; no pad | pooled embedding, optional feature-map grid | Medium |
| terramind | TerraTorch BACKBONE_REGISTRY TerraMind backbone | 10m | S2 SR 12-band | raw 0..10000; resize 224; z-score with TerraMind v1/v01 pretrained mean/std | fixed 224; no pad | token sequence -> pooled or patch-token grid | High |
| dofa | DOFA ViT (base / large, official checkpoints) | 10m | multi-band SR CHW + wavelengths | raw SR -> /10000 to [0,1]; provide/infer wavelengths | bilinear resize to 224; explicitly no crop/pad | pooled vector or token grid (usually 14x14) | Medium-High |
| fomo | FoMo MultiSpectralViT (FoMo-Bench) | 10m | S2 SR 12-band | clip 0..10000; default unit_scale (optional minmax/none) | default 64; bilinear resize; no pad | token sequence pooled; grid as spectral-mean patch-token map | Medium |
| thor | Fully vendored THOR runtime (tiny / small / base / large) | 10m | S2 SR 10-band | clip 0..10000; default thor_stats z-score after reflectance scaling | default 288; bilinear resize; no pad | pooled tokens and grouped token grid | Medium-High |
| agrifm | AgriFM PretrainingSwinTransformer3DEncoder | 10m | S2 10-band time series [T,C,H,W] | clip 0..10000; default agrifm_stats z-score using official config stats | default 224; TCHW resize; no pad | feature map grid [D,H,W], pooled by spatial mean/max | High |
| satvision | timm SwinTransformerV2 (SatVision-TOA checkpoints) | 1000m | TOA 14 channels in strict order | channel-aware normalization to [0,1] (auto/raw/unit, reflectance + emissive calibration) | default 128; bilinear resize; no pad | model output as pooled or grid depending on tensor shape | High (if band order and calibration match checkpoint) |

Note:

  • "Default Fetch Resolution" refers to the default source/provider-side resolution used when fetching raw inputs.
  • It does not mean the final spatial size of the tensor after model-specific resize/crop/pad.
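
Several rows in the Quick Comparison table share the step "raw SR -> /10000 -> RGB uint8". The following is a minimal NumPy sketch of that step, not the rs-embed implementation. The function name sr_to_uint8_rgb is hypothetical, and both the clip to 0..10000 before scaling and the round-to-nearest conversion are assumptions made for illustration.

```python
import numpy as np


def sr_to_uint8_rgb(sr_chw: np.ndarray) -> np.ndarray:
    """Map a [3, H, W] surface-reflectance array (nominally 0..10000)
    to a uint8 RGB image of shape [H, W, 3].

    Hypothetical helper: clipping and rounding policy are assumptions.
    """
    if sr_chw.ndim != 3 or sr_chw.shape[0] != 3:
        raise ValueError("expected a [3, H, W] array")
    # Clip to the nominal reflectance range, then scale to [0, 1].
    unit = np.clip(sr_chw.astype(np.float32), 0.0, 10000.0) / 10000.0
    # Scale to 0..255 and round to the nearest integer.
    img = np.rint(unit * 255.0).astype(np.uint8)
    # CHW -> HWC, the layout most image transforms expect.
    return np.transpose(img, (1, 2, 0))
```

Models that skip the uint8 step (for example the rows that scale to [0,1] or apply a z-score) would stop after the division or substitute their own statistics, so this sketch applies only to the RGB-style rows.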

Temporal Handling

Read this section before comparing any model that accepts TemporalSpec.range(...).

  • For most on-the-fly adapters, TemporalSpec.range(start, end) means: filter imagery in [start, end), then build one composite patch for model input (median by default, or mosaic if configured via SensorSpec.composite).
  • In these adapters, meta.input_time is typically the midpoint of the temporal window and is mainly metadata (or an auxiliary time signal for models that require it), not a guaranteed single-scene acquisition date.
  • Multi-frame adapters: agrifm, anysat, and galileo fetch TCHW sequences by splitting the requested range into sub-windows and compositing each sub-window into one frame.
  • Current single-composite adapters include: remoteclip, satmae, satmaepp, satmaepp_s2_10b, scalemae, wildsat, prithvi, terrafm, terramind, dofa, fomo, thor, and satvision.
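
The end-exclusive window and the midpoint used for meta.input_time can be illustrated with the standard library. This is a sketch under stated assumptions: the helper names in_window and window_midpoint are hypothetical, and truncating the midpoint to a whole day is an assumption, since this page does not say how rs-embed rounds it.

```python
from datetime import date


def in_window(d: date, start: date, end: date) -> bool:
    """True if d falls in the half-open window [start, end)."""
    return start <= d < end


def window_midpoint(start: date, end: date) -> date:
    """Midpoint of [start, end), truncated to a whole day (assumption)."""
    if end <= start:
        raise ValueError("end must be after start")
    # Adding a timedelta to a date ignores the sub-day part.
    return start + (end - start) / 2


start, end = date(2022, 6, 1), date(2022, 9, 1)
print(in_window(date(2022, 9, 1), start, end))  # False: end is excluded
print(window_midpoint(start, end))              # 2022-07-17
```

The midpoint here is a label for the composite, not the acquisition date of any one scene, which matches the caveat above.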

Multi-frame Semantics

This section only matters for adapters that construct multi-frame inputs from one requested time window.

Shared behavior for current multi-frame adapters (agrifm, anysat, galileo):

  • Frame construction: split TemporalSpec.range(start, end) into T equal sub-windows (end-exclusive), then composite each sub-window into one frame.
  • Missing-observation fallback: if a sub-window has no valid image, the provider path reuses a fallback composite so the frame count remains stable.
  • Fixed frame count: the runtime always supplies exactly T frames for model input. For user-provided input_chw, a CHW input is repeated to T frames, and a TCHW input is padded or truncated to T.
  • Sensor compositing policy: frame composite mode follows SensorSpec.composite (median default, mosaic optional).
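
The frame-construction rule above (T equal, end-exclusive sub-windows) can be sketched as follows. The function name split_range is hypothetical, and dividing the window by exact duration is an assumption; the actual adapters may snap bin edges to day boundaries.

```python
from datetime import datetime


def split_range(start: datetime, end: datetime, t: int) -> list:
    """Split [start, end) into t contiguous half-open sub-windows.

    Returns a list of (bin_start, bin_end) tuples. Hypothetical helper,
    not the rs-embed implementation.
    """
    if t < 1:
        raise ValueError("t must be >= 1")
    if end <= start:
        raise ValueError("end must be after start")
    step = (end - start) / t
    # Use the exact end for the last edge to avoid rounding drift.
    edges = [start + i * step for i in range(t)] + [end]
    return list(zip(edges[:-1], edges[1:]))


bins = split_range(datetime(2022, 1, 1), datetime(2022, 1, 9), 8)
for lo, hi in bins[:2]:
    print(lo.date(), "->", hi.date())
```

Each sub-window would then be composited into one frame according to SensorSpec.composite; the fallback used for empty sub-windows is not specified on this page and is not modeled here.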

Per-model temporal packaging:

| Model ID | Frame count env (default) | Temporal side input | Notes |
|---|---|---|---|
| agrifm | RS_EMBED_AGRIFM_FRAMES (8) | none (uses TCHW directly) | Temporal information is encoded only in the frame stack. |
| anysat | RS_EMBED_ANYSAT_FRAMES (8) | s2_dates (per-frame DOY, 0..364) | DOY values are derived from each frame bin midpoint date. |
| galileo | RS_EMBED_GALILEO_FRAMES (8) | months (per-frame month, 1..12) | By default from frame bin midpoints; RS_EMBED_GALILEO_MONTH can force a constant month for all frames. |
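
To make the per-frame side inputs concrete, here is a sketch of deriving day-of-year values (0..364) and months (1..12) from frame bin midpoints. The function name frame_side_inputs and the force_month parameter are hypothetical stand-ins (the latter for the effect described for RS_EMBED_GALILEO_MONTH). Clamping day 365 of a leap year to 364 is an assumption made only to respect the stated range.

```python
from datetime import datetime
from typing import List, Optional, Tuple


def frame_side_inputs(
    bins: List[Tuple[datetime, datetime]],
    force_month: Optional[int] = None,
) -> Tuple[List[int], List[int]]:
    """Return (doy, months) for each (bin_start, bin_end) pair.

    doy is zero-based and clamped to 0..364 (assumption for leap years).
    months is 1..12, or a constant if force_month is given.
    """
    mids = [lo + (hi - lo) / 2 for lo, hi in bins]
    doy = [min(m.timetuple().tm_yday - 1, 364) for m in mids]
    if force_month is not None:
        months = [force_month] * len(mids)
    else:
        months = [m.month for m in mids]
    return doy, months


bins = [
    (datetime(2022, 1, 1), datetime(2022, 1, 3)),
    (datetime(2022, 12, 30), datetime(2023, 1, 1)),
]
print(frame_side_inputs(bins))  # ([1, 364], [1, 12])
```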

Modality and Extra Inputs Matrix

Use this table to avoid unfair comparisons between plain image encoders and adapters that require side inputs.

Interpretation:

  • "Backbone multimodal" means the upstream foundation model family supports multiple modalities.
  • "Current rs-embed path" means what this implementation currently feeds in practice.
  • "Requires extra metadata" means additional non-image inputs required by the forward path (hard requirement).

| Model ID | Backbone multimodal? | Current rs-embed path uses multiple modalities? | Multi-input forward (beyond image tensor)? | Requires extra metadata? |
|---|---|---|---|---|
| remoteclip | No | No | No | No |
| satmae | No | No | No | No |
| satmaepp | No | No | No | No |
| satmaepp_s2_10b | No (this adapter path) | No | No | No (but strict 10-band order is required) |
| scalemae | No | No | Yes (input_res_m) | Yes: scale/resolution (sensor.scale_m) |
| anysat | Yes | Partially (S2-only imagery, plus temporal date tokens) | Yes (s2, s2_dates) | Yes: day-of-year/date signal (derived from temporal range) |
| galileo | Yes | Mostly S2 path in current adapter + temporal month tokens | Yes (multiple tensors + masks + months) | Yes: month/time signal (derived from temporal range) |
| wildsat | No | No | No | No |
| prithvi | No (this adapter path) | No | Yes (x, temporal_coords, location_coords) | Yes: location + time are required |
| terrafm | Yes (S1/S2) | Yes (select one modality per call: s1 or s2) | No | No hard extra metadata (optional S1 options: orbit, linear/DB path) |
| terramind | Yes | Usually single selected modality (S2L2A default) | No (single selected modality tensor in this adapter) | No hard extra metadata |
| dofa | Yes (spectral generalization) | Yes (multi-band spectral input) | Yes (image + wavelength list) | Yes: per-band wavelengths (explicit or inferable from bands) |
| fomo | No | No | No | No |
| thor | No (this adapter path) | No | No | No |
| agrifm | No (this adapter path) | No | No extra side tensor, but temporal stack [T,C,H,W] required | Temporal coverage is important (no separate metadata tensor) |
| satvision | No (this adapter path) | No | No separate side tensor | Yes: strict 14-channel order/calibration schema (band semantics) |

Practically multi-input models:

  • prithvi: image + temporal coords + location coords
  • anysat: image/time-series + date tokens (s2_dates)
  • galileo: image-derived tensors + masks + per-frame month tokens (months)
  • dofa: image + wavelength vector
  • scalemae: image + input_res_m

Preprocessing and Temporal Env Vars

This table only lists env vars that materially change model input construction or temporal packaging.

| Model ID | Main preprocessing env keys |
|---|---|
| remoteclip | fixed image_size=224 in code path; no per-model preprocess env switch |
| satmae | RS_EMBED_SATMAE_IMG |
| satmaepp | RS_EMBED_SATMAEPP_ID, RS_EMBED_SATMAEPP_IMG, RS_EMBED_SATMAEPP_CHANNEL_ORDER, RS_EMBED_SATMAEPP_BGR |
| satmaepp_s2_10b | RS_EMBED_SATMAEPP_S2_CKPT_REPO, RS_EMBED_SATMAEPP_S2_CKPT_FILE, RS_EMBED_SATMAEPP_S2_MODEL_FN, RS_EMBED_SATMAEPP_S2_IMG, RS_EMBED_SATMAEPP_S2_PATCH, RS_EMBED_SATMAEPP_S2_GRID_REDUCE, RS_EMBED_SATMAEPP_S2_WEIGHTS_ONLY |
| scalemae | RS_EMBED_SCALEMAE_IMG |
| anysat | RS_EMBED_ANYSAT_IMG, RS_EMBED_ANYSAT_NORM, RS_EMBED_ANYSAT_FRAMES |
| galileo | RS_EMBED_GALILEO_IMG, RS_EMBED_GALILEO_PATCH, RS_EMBED_GALILEO_NORM, RS_EMBED_GALILEO_INCLUDE_NDVI, RS_EMBED_GALILEO_FRAMES, RS_EMBED_GALILEO_MONTH |
| wildsat | RS_EMBED_WILDSAT_IMG, RS_EMBED_WILDSAT_NORM |
| prithvi | RS_EMBED_PRITHVI_PREP, RS_EMBED_PRITHVI_IMG, RS_EMBED_PRITHVI_PATCH_MULT |
| terrafm | modality and sensor-side options (s2/s1); image size fixed to 224 in implementation |
| terramind | RS_EMBED_TERRAMIND_NORMALIZE (default z-score stats), image size fixed 224 |
| dofa | image size fixed 224; provider/tensor channels and wavelengths drive preprocessing |
| fomo | RS_EMBED_FOMO_IMG, RS_EMBED_FOMO_NORM |
| thor | RS_EMBED_THOR_IMG, RS_EMBED_THOR_NORMALIZE |
| agrifm | RS_EMBED_AGRIFM_IMG, RS_EMBED_AGRIFM_NORM, RS_EMBED_AGRIFM_FRAMES |
| satvision | RS_EMBED_SATVISION_TOA_IMG, RS_EMBED_SATVISION_TOA_NORM, channel-index and calibration env keys |
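
As a rough illustration of how an integer tuning knob such as RS_EMBED_GALILEO_FRAMES might be set and read, here is a standard-library sketch. The helper env_int is hypothetical and is not part of rs-embed. The sketch also assumes the adapters read these variables from the process environment when they build their inputs, so the variable is set before any embedding call; this page does not confirm when they are read. The default of 8 is taken from the Multi-frame Semantics table.

```python
import os


def env_int(name: str, default: int) -> int:
    """Read an integer env var, falling back to default if unset or blank.

    Hypothetical helper for illustration only.
    """
    raw = os.environ.get(name)
    if raw is None or raw.strip() == "":
        return default
    return int(raw)


# Override the documented default of 8 frames for this process.
os.environ["RS_EMBED_GALILEO_FRAMES"] = "4"
print(env_int("RS_EMBED_GALILEO_FRAMES", 8))  # 4
print(env_int("RS_EMBED_AGRIFM_FRAMES", 8))   # 8 unless set elsewhere
```

When reproducing a run, recording the full set of RS_EMBED_* variables alongside the embeddings makes the preprocessing configuration recoverable later.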

Practical Guidance

  • For highest reproducibility, keep each model's default normalization mode unless you can match the original training pipeline exactly.
  • For strict-schema models (satvision, terramind, thor, agrifm), do not change channel order unless checkpoint metadata explicitly allows it.
  • If comparing embeddings across models, standardize ROI and temporal compositing first; model preprocessing differences are substantial.