API: Export¶
This page documents dataset export APIs and export-specific runtime behavior.
export_batch (primary / recommended)¶
```python
export_batch(
    *,
    spatials: List[SpatialSpec],
    temporal: Optional[TemporalSpec],
    models: List[str],
    out: Optional[str] = None,
    layout: Optional[str] = None,
    out_dir: Optional[str] = None,
    out_path: Optional[str] = None,
    names: Optional[List[str]] = None,
    backend: str = "gee",
    device: str = "auto",
    output: OutputSpec = OutputSpec.pooled(),
    sensor: Optional[SensorSpec] = None,
    per_model_sensors: Optional[Dict[str, SensorSpec]] = None,
    format: str = "npz",
    save_inputs: bool = True,
    save_embeddings: bool = True,
    save_manifest: bool = True,
    fail_on_bad_input: bool = False,
    chunk_size: int = 16,
    num_workers: int = 8,
    continue_on_error: bool = False,
    max_retries: int = 0,
    retry_backoff_s: float = 0.0,
    async_write: bool = True,
    writer_workers: int = 2,
    resume: bool = False,
    show_progress: bool = True,
    input_prep: InputPrepSpec | str = "resize",
) -> Any
```
Recommended export entry point: exports inputs, embeddings, and manifests for one or more ROIs × one or more models in a single call.
For new code, prefer `export_batch(...)` and use `out` + `layout` to choose the output organization.
`out_dir` / `out_path` remain supported as backward-compatible aliases for the same two layouts.
- Decoupled output target API:
    - `out` + `layout` (recommended for new code)
    - `out_dir` mode: one file per point (legacy-compatible parameter style; good for massive numbers of points)
    - `out_path` mode: merge into a single output file (legacy-compatible parameter style; good for fewer points and portability)
Parameters

- `spatials`: non-empty list
- `temporal`: can be `None` (some models don't require time)
- `models`: non-empty list of model IDs
- Output target:
    - recommended API: provide both `out` and `layout` (`"per_item"` or `"combined"`)
    - legacy-compatible API: choose one of `out_dir` / `out_path`
    - do not mix `out` + `layout` with `out_dir` / `out_path`
- `names`: used only in `out_dir` mode, for output filenames (length must equal `spatials`)
- `backend`: recommended to pass `backend="auto"` unless you need an explicit provider override (for example `"gee"`)
- `sensor`: a shared `SensorSpec` for all models (if models are on-the-fly)
- `per_model_sensors`: override `SensorSpec` per model; keys are model strings
- `format`: `"npz"` or `"netcdf"`
- `save_inputs`: whether to save model input patches (CHW numpy)
- `save_embeddings`: whether to save embedding arrays
- `save_manifest`: whether to save JSON manifests (each export artifact will have an accompanying `.json`)
- `fail_on_bad_input`: whether to raise immediately if input checks fail
- `chunk_size`: process points in chunks (controls memory/throughput)
    - also used as the default inference batch size when batched inference is enabled
    - in `per_item` mode with GEE prefetch enabled, rs-embed uses a one-slot prefetch pipeline (double buffering), so input-cache peak memory can be roughly up to 2 chunks (to overlap `prefetch(chunk k+1)` with `infer/write(chunk k)`)
- `num_workers`: concurrency for GEE patch prefetching (ThreadPool)
- `continue_on_error`: keep exporting remaining points/models even if one item fails
- `max_retries`: retry count for provider fetch/write operations
- `retry_backoff_s`: sleep seconds between retries
- `async_write`: write output files asynchronously in `out_dir` mode
- `writer_workers`: writer thread count when `async_write=True`
- `resume`: skip already-exported outputs and continue from remaining items
- `show_progress`: show progress during batch export (overall progress + per-model inference progress)
- `input_prep`: large-ROI input policy (`"resize"` default, `"tile"`, `"auto"`, or `InputPrepSpec(...)`)
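The chunking arithmetic behind `chunk_size` can be sketched in plain Python. This is a hypothetical illustration of how N points are partitioned into work units, not rs-embed's actual internals:

```python
def iter_chunks(items, chunk_size):
    """Yield consecutive slices of at most chunk_size items each."""
    for i in range(0, len(items), chunk_size):
        yield items[i : i + chunk_size]

# 40 points with chunk_size=16 -> chunks of sizes [16, 16, 8]
points = list(range(40))
chunk_sizes = [len(c) for c in iter_chunks(points, 16)]
```

Each chunk is fetched, run through the model, and written before (or while) the next chunk is prepared, which bounds peak memory regardless of the total number of points.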
Automatic inference behavior
- In per-item output mode (`out_dir` / `layout="per_item"`):
    - `device="cpu"` (or auto-resolved CPU): defaults to per-item inference
    - `device="cuda"` / `mps` / other accelerators (or auto-resolved GPU): prefers batched inference when the embedder implements batch APIs
- In combined output mode (`out_path` / `layout="combined"`), rs-embed keeps the historical behavior of attempting batched model APIs (with fallback to single-item inference if batched execution fails)
- Model-level scheduling remains serial (one model at a time)
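The selection rule above can be summarized as a small dispatch function. This is an illustrative sketch of the described behavior only; the actual rs-embed selection logic may differ in detail:

```python
def choose_inference_mode(layout: str, device: str, has_batch_api: bool) -> str:
    """Sketch of the documented dispatch rule (illustrative, not rs-embed code)."""
    if layout == "combined":
        # combined mode: try batched APIs first, fall back to per-item
        return "batched" if has_batch_api else "per_item"
    if device == "cpu":
        return "per_item"  # per-item mode on CPU stays per-item
    # per-item mode on GPU/accelerators prefers batched when available
    return "batched" if has_batch_api else "per_item"
```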
Returns
- `layout="per_item"` / `out_dir` mode: `List[dict]` (a manifest for each point)
- `layout="combined"` / `out_path` mode: `dict` (combined manifest)
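Because the two layouts return different shapes, calling code that handles both can normalize them. A minimal sketch (hypothetical helper, not part of the rs-embed API):

```python
def manifest_list(result):
    """Normalize either return shape to a list of manifest dicts.

    per_item mode returns List[dict]; combined mode returns a single dict.
    """
    return result if isinstance(result, list) else [result]
```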
Example: recommended (out + layout)
```python
from rs_embed import export_batch, PointBuffer, TemporalSpec

export_batch(
    spatials=[PointBuffer(121.5, 31.2, 2048)],
    temporal=TemporalSpec.range("2022-06-01", "2022-09-01"),
    models=["remoteclip"],
    out="exports/combined_run",
    layout="combined",  # writes exports/combined_run.npz
)
```
Example: legacy-compatible out_dir (per-item files)
```python
from rs_embed import export_batch, PointBuffer, TemporalSpec

spatials = [
    PointBuffer(121.5, 31.2, 2048),
    PointBuffer(120.5, 30.2, 2048),
]

export_batch(
    spatials=spatials,
    temporal=TemporalSpec.range("2022-06-01", "2022-09-01"),
    models=["remoteclip", "prithvi"],
    out_dir="exports",
    names=["p1", "p2"],
    input_prep="tile",  # optional: API-side tiled inference for large ROIs
    save_inputs=True,
    save_embeddings=True,
    chunk_size=32,
    num_workers=8,
)
```
Example: legacy-compatible out_path (single merged file)
```python
from rs_embed import export_batch, PointBuffer, TemporalSpec

export_batch(
    spatials=[PointBuffer(121.5, 31.2, 2048)],
    temporal=TemporalSpec.range("2022-06-01", "2022-09-01"),
    models=["remoteclip"],
    out_path="combined.npz",
)
```
Key performance feature: avoid duplicate downloads
When backend resolves to a provider path (for example backend="gee", or backend="auto" on provider-backed models),
and save_inputs=True and save_embeddings=True, export_batch prefetches the raw patch once,
and passes that same patch into the embedder via input_chw to compute embeddings—avoiding the pattern of “download once to save inputs + download again for embeddings”.
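The "fetch once, reuse" pattern can be illustrated with stand-in functions. `fetch_patch` and `embed_from_input` below are hypothetical names; the real provider and embedder calls differ:

```python
# Stand-ins for the GEE download and the embedder (hypothetical names).
calls = {"fetch": 0}

def fetch_patch(spatial):
    calls["fetch"] += 1              # count downloads
    return [[1, 2], [3, 4]]          # fake CHW patch

def embed_from_input(input_chw):     # stands in for the input_chw path
    return [sum(map(sum, input_chw))]

patch = fetch_patch("roi-1")         # downloaded exactly once
saved_input = patch                  # save_inputs=True reuses the same array
embedding = embed_from_input(patch)  # save_embeddings=True reuses it too
assert calls["fetch"] == 1           # no second download
```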
About parallelism
export_batch currently has two levels of execution behavior:
- IO level: GEE prefetching is parallelized (ThreadPool, controlled by `num_workers`).
    - In per-item mode, rs-embed uses a one-slot double buffer: while chunk k is running inference / writing outputs, chunk k+1 can be prefetched in the background (after the first chunk).
    - This improves throughput when fetch and inference times are comparable, at the cost of a higher input-cache peak (roughly up to 2 chunks).
    - In combined mode, prefetch still runs as a distinct stage before model execution (to keep checkpoint/resume semantics simpler and memory behavior predictable).
- Inference level:
    - model level: models are executed serially (one model at a time), to keep runtime/GPU behavior stable.
    - batch level: for a single model, rs-embed can run batched inference when the embedder implements batch APIs (for example `get_embeddings_batch` / `get_embeddings_batch_from_inputs`). This is used in combined mode and, on GPU/accelerators, also in per-item mode.
In short: rs-embed supports batch-level inference acceleration, while keeping model-level scheduling serial by design.
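The one-slot double buffer described above can be sketched with a single background worker. This is an illustrative pipeline, not rs-embed's internal implementation:

```python
from concurrent.futures import ThreadPoolExecutor

def run_pipeline(chunks, prefetch, process):
    """Process chunks while the next chunk is prefetched in the background.

    Sketch of one-slot double buffering: while chunk k is in process()
    (inference/writing), chunk k+1 is fetched by the single worker thread.
    """
    results = []
    if not chunks:
        return results
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(prefetch, chunks[0])      # warm the first slot
        for nxt in list(chunks[1:]) + [None]:
            data = future.result()                     # wait for chunk k
            if nxt is not None:
                future = pool.submit(prefetch, nxt)    # overlap chunk k+1
            results.append(process(data))              # infer/write chunk k
    return results
```

With this structure the input cache holds at most two prefetched chunks at any moment (the one being processed plus the one in flight), matching the "roughly up to 2 chunks" peak noted above.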
export_npz (compatibility / convenience wrapper)¶
```python
export_npz(
    *,
    spatial: SpatialSpec,
    temporal: Optional[TemporalSpec],
    models: List[str],
    out_path: str,
    backend: str = "gee",
    device: str = "auto",
    output: OutputSpec = OutputSpec.pooled(),
    sensor: Optional[SensorSpec] = None,
    per_model_sensors: Optional[Dict[str, SensorSpec]] = None,
    save_inputs: bool = True,
    save_embeddings: bool = True,
    save_manifest: bool = True,
    fail_on_bad_input: bool = False,
    continue_on_error: bool = False,
    max_retries: int = 0,
    retry_backoff_s: float = 0.0,
    input_prep: InputPrepSpec | str = "resize",
) -> Dict[str, Any]
```
Convenience wrapper around export_batch(...) for a single spatial query that always writes a single .npz file.
New code should usually prefer export_batch(...) so you only need to learn one export API.
- Creates parent directory if needed
- Appends `.npz` suffix if missing
- Delegates to `export_batch(..., out_path=..., format="npz")`

Use export_batch(...) directly when you need:

- multiple spatials
- non-`npz` formats (for example `netcdf`)
- output layout control (`out` + `layout` / `out_dir` / `out_path`)
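The wrapper's path handling can be sketched as follows. `normalize_out_path` is a hypothetical helper mirroring the bullets above, not the actual implementation:

```python
from pathlib import Path

def normalize_out_path(out_path: str) -> Path:
    """Hypothetical sketch of export_npz's path handling."""
    p = Path(out_path)
    if p.suffix != ".npz":
        p = p.with_name(p.name + ".npz")          # append .npz if missing
    p.parent.mkdir(parents=True, exist_ok=True)   # create parent directory
    return p
```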