
RemoteCLIP (remoteclip)

Quick Facts

| Field | Value |
| --- | --- |
| Model ID | remoteclip |
| Aliases | remoteclip_s2rgb |
| Family / Backbone | RemoteCLIP (CLIP-style ViT via rshf.remoteclip.RemoteCLIP) |
| Adapter type | on-the-fly |
| Training alignment | Medium (higher if the wrapper's model.transform(...) matches the training pipeline; the fallback is a generic CLIP preprocess) |

RemoteCLIP In 30 Seconds

RemoteCLIP is a CLIP-style vision-language ViT continually fine-tuned on remote-sensing image-text pairs, so its embeddings live in a shared image/text space that supports caption-based retrieval. In rs-embed you get the visual side of that shared space, computed from a 3-band RGB Sentinel-2 input.
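Because the image and text towers share one space, retrieval reduces to cosine similarity between L2-normalized embedding vectors. A minimal numpy sketch of the idea (the shapes and random vectors here are illustrative stand-ins, not rs-embed output types):

```python
import numpy as np

def cosine_similarity(queries: np.ndarray, gallery: np.ndarray) -> np.ndarray:
    """Cosine similarity between row vectors: (Q, D) x (G, D) -> (Q, G)."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    return q @ g.T

# Toy example: one query embedding against a 3-item gallery.
rng = np.random.default_rng(0)
gallery = rng.normal(size=(3, 512))
query = gallery[1:2] + 0.01 * rng.normal(size=(1, 512))  # near gallery item 1
best = int(np.argmax(cosine_similarity(query, gallery), axis=1)[0])
print(best)  # -> 1, the nearest gallery item
```

With RemoteCLIP, the gallery rows would be image embeddings and the query either another image embedding or (outside rs-embed) a text embedding from the matching checkpoint.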

In rs-embed, its most important characteristics are:

  • RGB-only (B4,B3,B2) with a fixed 224×224 preprocessing path: see Input Contract
  • checkpoint override goes through sensor.collection="hf:<repo>" rather than an environment variable: see Environment Variables / Tuning Knobs
  • preprocessing prefers the wrapper model.transform(...) but falls back to a generic CLIP pipeline — these paths are not identical and should be logged: see Preprocessing Pipeline

Input Contract

| Field | Value |
| --- | --- |
| Backend | provider (auto recommended) |
| TemporalSpec | required; TemporalSpec.range(start, end), treated as a filter-and-composite window |
| Default collection | COPERNICUS/S2_SR_HARMONIZED |
| Default bands (order) | B4, B3, B2 |
| Default fetch | scale_m=10, cloudy_pct=30, composite="median" |
| input_chw | CHW, C=3 in (B4, B3, B2) order |
| Side inputs | none |

Checkpoint override via sensor.collection

Use sensor.collection="hf:<repo_or_path>" (e.g. hf:MVRL/remote-clip-vit-base-patch32) to swap in a different RemoteCLIP checkpoint — the hf: prefix is how this adapter distinguishes checkpoint overrides from regular provider collections.


Preprocessing Pipeline

Resize is the default — tiling is also available

The pipeline below shows the default input_prep="resize" path. For large ROIs, use input_prep="tile" to split the input into tiles and preserve spatial detail. See Choosing Settings.
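Conceptually, the tile path cuts a large input into fixed-size windows instead of squashing it all down to one 224×224 image. A minimal numpy sketch of non-overlapping tiling (the adapter's actual tile size, overlap, and edge handling may differ):

```python
import numpy as np

def tile_chw(image: np.ndarray, tile: int = 224) -> list[np.ndarray]:
    """Split a CHW image into non-overlapping tile x tile crops,
    dropping any ragged remainder at the right/bottom edges."""
    _, h, w = image.shape
    return [
        image[:, y : y + tile, x : x + tile]
        for y in range(0, h - tile + 1, tile)
        for x in range(0, w - tile + 1, tile)
    ]

img = np.zeros((3, 500, 700), dtype=np.uint8)  # e.g. a large S2 RGB chip
tiles = tile_chw(img)
print(len(tiles), tiles[0].shape)  # -> 6 (3, 224, 224): a 2x3 grid of tiles
```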

flowchart LR
    INPUT["S2 RGB"] --> PREP["Normalize → uint8\n→ model.transform or CLIP fallback"]
    PREP --> FWD["CLIP ViT forward"]
    FWD --> POOL["pooled: token mean/max"]
    FWD --> GRID["grid: patch-token (D,H,W)"]

Current adapter image size

The image size is fixed at 224 in this adapter path.
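The generic CLIP fallback follows the standard recipe for that image size: scale pixel values to [0, 1], then normalize with the published OpenAI CLIP channel statistics. A simplified numpy sketch of just the normalization step (real CLIP preprocessing also does a bicubic resize and center-crop, omitted here):

```python
import numpy as np

# Published OpenAI CLIP normalization statistics (RGB order).
CLIP_MEAN = np.array([0.48145466, 0.4578275, 0.40821073]).reshape(3, 1, 1)
CLIP_STD = np.array([0.26862954, 0.26130258, 0.27577711]).reshape(3, 1, 1)

def clip_normalize(rgb_uint8: np.ndarray) -> np.ndarray:
    """Map a (3, 224, 224) uint8 RGB array to a normalized float32 array."""
    x = rgb_uint8.astype(np.float32) / 255.0
    return ((x - CLIP_MEAN) / CLIP_STD).astype(np.float32)

chip = np.full((3, 224, 224), 128, dtype=np.uint8)  # mid-gray test chip
out = clip_normalize(chip)
print(out.shape, out.dtype)  # -> (3, 224, 224) float32
```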


Architecture Concept

flowchart LR
    subgraph Input
        RGB["S2 RGB\n(B4,B3,B2)"]
    end
    subgraph "CLIP ViT"
        RGB --> PRE["Preprocess\n(model.transform\nor CLIP fallback)"]
        PRE --> FWD["CLIP ViT\nforward"]
    end
    subgraph "Shared Image ↔ Text Space"
        FWD --> EMB["Embeddings support\ncaption-based\nsimilarity & retrieval"]
        EMB --> POOL["pooled:\ntoken mean/max"]
        EMB --> GRID["grid:\npatch-token (D,H,W)"]
    end

Environment Variables / Tuning Knobs

| Env var | Default | Effect |
| --- | --- | --- |
| RS_EMBED_REMOTECLIP_FETCH_WORKERS | 8 | Provider prefetch worker count for batch APIs |
| RS_EMBED_REMOTECLIP_BATCH_SIZE | CPU: 8, CUDA: 64 | Inference batch size for batch APIs |
| HUGGINGFACE_HUB_CACHE / HF_HOME / HUGGINGFACE_HOME | unset | Controls the Hugging Face cache path used for model snapshot downloads |
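For example, to shrink the inference batch on a memory-constrained GPU, reduce prefetch concurrency, and redirect the Hugging Face cache (the values here are illustrative):

```shell
# Lower the batch size used by the batch APIs.
export RS_EMBED_REMOTECLIP_BATCH_SIZE=16
# Fewer provider prefetch workers.
export RS_EMBED_REMOTECLIP_FETCH_WORKERS=4
# Point model snapshot downloads at a custom cache directory.
export HF_HOME="$HOME/.cache/hf-rs-embed"
```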

Checkpoint override

Set sensor.collection="hf:<repo_or_local_path>" (not env-based in this adapter).


Examples

Minimal example

from rs_embed import get_embedding, PointBuffer, TemporalSpec, OutputSpec

emb = get_embedding(
    "remoteclip",
    spatial=PointBuffer(lon=121.5, lat=31.2, buffer_m=2048),
    temporal=TemporalSpec.range("2022-06-01", "2022-09-01"),
    output=OutputSpec.pooled(),
    backend="auto",
)

Custom checkpoint via sensor.collection="hf:..."

from rs_embed import get_embedding, PointBuffer, TemporalSpec, OutputSpec, SensorSpec

emb = get_embedding(
    "remoteclip",
    spatial=PointBuffer(lon=121.5, lat=31.2, buffer_m=2048),
    temporal=TemporalSpec.range("2022-06-01", "2022-09-01"),
    sensor=SensorSpec(
        collection="hf:MVRL/remote-clip-vit-base-patch32",
        bands=("B4", "B3", "B2"),
        scale_m=10,
        cloudy_pct=30,
        composite="median",
    ),
    output=OutputSpec.grid(),
    backend="auto",
)


Reference

  • Provider-only — backend="tensor" is not supported.
  • The adapter prefers model.transform when available; otherwise falls back to CLIP-style preprocessing — the two paths may produce slightly different embeddings.
  • Grid output depends on the wrapper exposing a token sequence; some RemoteCLIP wrappers only return pooled vectors.
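Because some wrappers only return pooled vectors, a defensive caller can request the grid output first and fall back to pooled on failure. A generic sketch of that pattern (the helper name and the exception types a grid-less wrapper raises are assumptions; check what your adapter actually raises):

```python
def embed_with_fallback(fetch_grid, fetch_pooled,
                        fallback_errors=(NotImplementedError, ValueError)):
    """Try the grid (patch-token) output first; fall back to pooled on failure.

    fetch_grid / fetch_pooled are zero-argument callables, e.g. closures over
    get_embedding(..., output=OutputSpec.grid()) and OutputSpec.pooled().
    """
    try:
        return fetch_grid(), "grid"
    except fallback_errors:
        return fetch_pooled(), "pooled"

# Stand-in callable simulating a wrapper without patch tokens.
def no_grid():
    raise NotImplementedError("wrapper returns pooled vectors only")

emb, kind = embed_with_fallback(no_grid, lambda: [0.1, 0.2])
print(kind)  # -> pooled
```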