RemoteCLIP (remoteclip)¶
Quick Facts¶
| Field | Value |
|---|---|
| Model ID | remoteclip |
| Aliases | remoteclip_s2rgb |
| Family / Backbone | RemoteCLIP (CLIP-style ViT via rshf.remoteclip.RemoteCLIP) |
| Adapter type | on-the-fly |
| Training alignment | Medium (higher if wrapper model.transform(...) matches training pipeline; fallback is generic CLIP preprocess) |
RemoteCLIP In 30 Seconds
RemoteCLIP is a CLIP-style vision-language ViT continually fine-tuned on remote-sensing image-text pairs, so its embeddings live in a shared image/text space that supports caption-based retrieval — in rs-embed you are getting the visual side of that shared space from a 3-band RGB Sentinel-2 input.
In rs-embed, its most important characteristics are:
- RGB-only (`B4,B3,B2`) with a fixed `224×224` preprocessing path: see Input Contract
- checkpoint override goes through `sensor.collection="hf:<repo>"` rather than an environment variable: see Environment Variables / Tuning Knobs
- preprocessing prefers the wrapper `model.transform(...)` but falls back to a generic CLIP pipeline — these paths are not identical and should be logged: see Preprocessing Pipeline
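Because the embeddings live in a shared image/text space, the nearest caption can be found with plain cosine similarity. The sketch below is a toy illustration only: the vectors are random stand-ins, since real retrieval would pair rs-embed's visual embedding with vectors from RemoteCLIP's text encoder, which this adapter does not expose.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two 1-D embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
image_emb = rng.normal(size=512)            # stand-in for a RemoteCLIP image embedding
caption_embs = {                            # stand-ins for text-encoder outputs
    "a dense residential area": rng.normal(size=512),
    "a harbor with boats": rng.normal(size=512),
    "farmland with rectangular fields": rng.normal(size=512),
}

# Caption-based retrieval: pick the caption whose embedding is closest.
best = max(caption_embs, key=lambda c: cosine(image_emb, caption_embs[c]))
print(best)
```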
Input Contract¶
| Field | Value |
|---|---|
| Backend | provider (`auto` recommended) |
| TemporalSpec | required `TemporalSpec.range(start, end)` — treated as filter-and-composite window |
| Default collection | `COPERNICUS/S2_SR_HARMONIZED` |
| Default bands (order) | `B4`, `B3`, `B2` |
| Default fetch | `scale_m=10`, `cloudy_pct=30`, `composite="median"` |
| input_chw | CHW, `C=3` in (`B4`,`B3`,`B2`) order |
| Side inputs | none |
Checkpoint override via sensor.collection
Use sensor.collection="hf:<repo_or_path>" (e.g. hf:MVRL/remote-clip-vit-base-patch32) to swap in a different RemoteCLIP checkpoint — the hf: prefix is how this adapter distinguishes checkpoint overrides from regular provider collections.
Preprocessing Pipeline¶
Resize is the default — tiling is also available
The pipeline below shows the default input_prep="resize" path. For large ROIs, use input_prep="tile" to split the input into tiles and preserve spatial detail. See Choosing Settings.
```mermaid
flowchart LR
    INPUT["S2 RGB"] --> PREP["Normalize → uint8\n→ model.transform or CLIP fallback"]
    PREP --> FWD["CLIP ViT forward"]
    FWD --> POOL["pooled: token mean/max"]
    FWD --> GRID["grid: patch-token (D,H,W)"]
```
Current adapter image size
The image size is fixed at 224 in this adapter path.
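The generic CLIP fallback path can be sketched as follows. This is a hypothetical reconstruction, not rs-embed's source: the percentile stretch and nearest-neighbour resize are assumptions, while the normalization constants are CLIP's published mean/std.

```python
import numpy as np

# CLIP's standard image-normalization constants.
CLIP_MEAN = np.array([0.48145466, 0.4578275, 0.40821073], dtype=np.float32)
CLIP_STD = np.array([0.26862954, 0.26130258, 0.27577711], dtype=np.float32)

def clip_fallback_preprocess(chw: np.ndarray, size: int = 224) -> np.ndarray:
    """CHW reflectance -> normalized (3, size, size) float32 array."""
    # Assumed robust stretch to [0, 1] (stand-in for the real uint8 conversion).
    lo, hi = np.percentile(chw, [2, 98])
    img = np.clip((chw - lo) / max(hi - lo, 1e-6), 0.0, 1.0)
    # Nearest-neighbour resize to size x size (placeholder for the real resize).
    c, h, w = img.shape
    ys = np.arange(size) * h // size
    xs = np.arange(size) * w // size
    img = img[:, ys][:, :, xs]
    # CLIP normalization, broadcast over the channel axis.
    return ((img - CLIP_MEAN[:, None, None]) / CLIP_STD[:, None, None]).astype(np.float32)

x = np.random.rand(3, 300, 300).astype(np.float32)  # fake S2 RGB patch
out = clip_fallback_preprocess(x)
print(out.shape)  # (3, 224, 224)
```

The wrapper's `model.transform(...)` path may differ in interpolation and scaling details, which is why the two paths can yield slightly different embeddings.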
Architecture Concept¶
```mermaid
flowchart LR
    subgraph Input
    RGB["S2 RGB\n(B4,B3,B2)"]
    end
    subgraph "CLIP ViT"
    RGB --> PRE["Preprocess\n(model.transform\nor CLIP fallback)"]
    PRE --> FWD["CLIP ViT\nforward"]
    end
    subgraph "Shared Image ↔ Text Space"
    FWD --> EMB["Embeddings support\ncaption-based\nsimilarity & retrieval"]
    EMB --> POOL["pooled:\ntoken mean/max"]
    EMB --> GRID["grid:\npatch-token (D,H,W)"]
    end
```
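How the pooled and grid outputs relate to the ViT's patch tokens can be sketched with assumed shapes: a ViT-B/32 at 224×224 yields a 7×7 patch grid, and the pooled vector is a reduction over those tokens while the grid output is the same tokens reshaped to (D, H, W).

```python
import numpy as np

def tokens_to_outputs(tokens: np.ndarray, h: int, w: int):
    """Derive pooled and grid outputs from a (N, D) patch-token sequence, N = h*w."""
    pooled_mean = tokens.mean(axis=0)                    # (D,) pooled: token mean
    pooled_max = tokens.max(axis=0)                      # (D,) pooled: token max
    grid = tokens.reshape(h, w, -1).transpose(2, 0, 1)   # (D, H, W) patch-token grid
    return pooled_mean, pooled_max, grid

# 49 patch tokens = 7x7 grid for a 224px input with 32px patches (assumed D=512).
tokens = np.random.rand(49, 512).astype(np.float32)
pm, px, grid = tokens_to_outputs(tokens, 7, 7)
print(pm.shape, grid.shape)  # (512,) (512, 7, 7)
```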
Environment Variables / Tuning Knobs¶
| Env var | Default | Effect |
|---|---|---|
| `RS_EMBED_REMOTECLIP_FETCH_WORKERS` | `8` | Provider prefetch worker count for batch APIs |
| `RS_EMBED_REMOTECLIP_BATCH_SIZE` | CPU: `8`, CUDA: `64` | Inference batch size for batch APIs |
| `HUGGINGFACE_HUB_CACHE` / `HF_HOME` / `HUGGINGFACE_HOME` | unset | Controls HF cache path used for model snapshot downloads |
Checkpoint override
Set sensor.collection="hf:<repo_or_local_path>" (not env-based in this adapter).
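The environment variables above can be set in the shell or, as sketched below, from Python before importing rs_embed. The specific values and cache path are illustrative examples, not recommendations.

```python
import os

# Example tuning (values are illustrative, not recommendations):
os.environ["RS_EMBED_REMOTECLIP_FETCH_WORKERS"] = "16"  # more prefetch workers
os.environ["RS_EMBED_REMOTECLIP_BATCH_SIZE"] = "32"     # override inference batch size
os.environ["HF_HOME"] = "/data/hf-cache"                # redirect HF snapshot cache
```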
Examples¶
Minimal example¶
```python
from rs_embed import get_embedding, PointBuffer, TemporalSpec, OutputSpec

emb = get_embedding(
    "remoteclip",
    spatial=PointBuffer(lon=121.5, lat=31.2, buffer_m=2048),
    temporal=TemporalSpec.range("2022-06-01", "2022-09-01"),
    output=OutputSpec.pooled(),
    backend="auto",
)
```
Custom checkpoint via sensor.collection="hf:..."¶
```python
from rs_embed import get_embedding, PointBuffer, TemporalSpec, OutputSpec, SensorSpec

emb = get_embedding(
    "remoteclip",
    spatial=PointBuffer(lon=121.5, lat=31.2, buffer_m=2048),
    temporal=TemporalSpec.range("2022-06-01", "2022-09-01"),
    sensor=SensorSpec(
        collection="hf:MVRL/remote-clip-vit-base-patch32",
        bands=("B4", "B3", "B2"),
        scale_m=10,
        cloudy_pct=30,
        composite="median",
    ),
    output=OutputSpec.grid(),
    backend="auto",
)
```
Paper & Links¶
- Publication: TGRS 2024
- Code: ChenDelong1999/RemoteCLIP
Reference¶
- Provider-only — `backend="tensor"` is not supported.
- The adapter prefers `model.transform` when available; otherwise falls back to CLIP-style preprocessing — the two paths may produce slightly different embeddings.
- Grid output depends on the wrapper exposing a token sequence; some RemoteCLIP wrappers only return pooled vectors.
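Since some wrappers only return pooled vectors, code consuming grid output may want a shape guard. This is a hypothetical helper, not part of rs-embed's API:

```python
import numpy as np

def ensure_grid(emb: np.ndarray) -> np.ndarray:
    """Raise if the wrapper returned a pooled (D,) vector instead of a (D, H, W) grid."""
    if emb.ndim != 3:
        raise ValueError(
            f"expected (D, H, W) grid, got shape {emb.shape}; "
            "this wrapper may only expose pooled vectors"
        )
    return emb

grid = ensure_grid(np.zeros((512, 7, 7), dtype=np.float32))
print(grid.shape)  # (512, 7, 7)
```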