OGDC Recipes

An OGDC recipe is a directory containing a meta.yaml file and other associated recipe-specific configuration files that define a data transformation pipeline.

The QGreenland-Net team maintains the ogdc-recipes repository, which contains operational data transformation recipes that can be used as examples.

Recipe Configuration

All of the configuration options for recipes are modeled by Pydantic. See the ogdc_runner.models.recipe_config documentation for complete information on configuration options.

meta.yaml

The meta.yaml file provides key metadata that drives the OGDC recipe’s execution. It is defined by the ogdc_runner.models.recipe_config.RecipeMeta Pydantic model.

An example recipe meta.yaml is shown below:

Key configuration options are:

name

Each recipe must have a name: a human-readable string that identifies the recipe.

Example: "Water measurements from seal tag data"

workflow

Configures the type of workflow this recipe uses, along with any workflow-specific options. See Workflow types below for more information about the different workflow types.

See the ogdc_runner.models.recipe_config.Workflow class for details.

input

The input data source. See the ogdc_runner.models.recipe_config.RecipeInput class for details.

URL Inputs

Provide a direct URL to your data:

input:
  params:
    - type: "url"
      value: "https://example.com/data.zip"

DataONE Inputs

If your data is in a DataONE repository, you can fetch it using the dataset identifier:

input:
  params:
    - type: "dataone"
      value: "resource_map_doi:10.18739/A29G5GD39"

Need specific files from the dataset? Use wildcard patterns:

input:
  params:
    - type: "dataone"
      value: "resource_map_doi:10.18739/A29G5GD39"
      filename: "*.nc" # Fetch only NetCDF files

Wildcard patterns work like you’d expect:

  • * matches anything (e.g., data_*.nc gets data_1.nc, data_2.nc, etc.)

  • ? matches one character (e.g., data_?.nc gets data_1.nc but not data_10.nc)
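The two wildcards above can be tried out with Python's standard fnmatch module. This is only an illustrative sketch: it assumes the runner's filename matching behaves like shell-style globbing, and the select_files helper and the file list are hypothetical, not part of ogdc-runner.

```python
from fnmatch import fnmatchcase


def select_files(filenames, pattern):
    """Return the filenames that match a shell-style wildcard pattern.

    Hypothetical helper: the actual matching code in ogdc-runner may differ.
    """
    return [name for name in filenames if fnmatchcase(name, pattern)]


# Made-up file listing for a dataset.
files = ["data_1.nc", "data_2.nc", "data_10.nc", "readme.txt"]

# "*" matches any run of characters, so data_10.nc is included.
star_matches = select_files(files, "data_*.nc")

# "?" matches exactly one character, so data_10.nc is excluded.
question_matches = select_files(files, "data_?.nc")

print(star_matches)
print(question_matches)
```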

Check out the seal-tags and greenland-ice-sheet recipes to see DataONE inputs in practice.

output

ogdc_runner.models.recipe_config.RecipeOutput is the base class representing configuration for OGDC recipe outputs. Child classes define the output-type specific configuration required to publish final outputs of a recipe.

PVC Output

This is the default when no output configuration is supplied. Recipe outputs will be stored on the qgnet-ogdc-workflow-pvc PVC in Kubernetes under a directory named after the recipe_id.

See ogdc_runner.models.recipe_config.PvcRecipeOutput for details.

Temporary output

When the output type is set to temporary, recipe outputs will be stored temporarily (for 7 days). After successful workflow completion, users can retrieve this output as a .zip file via the ogdc-runner get-output command.

See ogdc_runner.models.recipe_config.TemporaryRecipeOutput for details.

DataONE output

Warning

Although dataone_id is a documented output type, it is currently unused.

See ogdc_runner.models.recipe_config.DataOneRecipeOutput for details.

Workflow types

There are multiple types of OGDC workflow. Which one an author should use depends on the data processing use case.

Shell Workflow

shell is a workflow type that executes a series of sh commands in sequence, much like a shell script. This workflow type is best suited for relatively simple transformations on small to medium-sized data.

See ogdc_runner.models.recipe_config.ShellWorkflow for details on configuration options.

In addition to meta.yaml, shell workflows expect a recipe.sh file that defines the series of commands to be run against the input data.

Most of the commands in recipe.sh are expected to be gdal or ogr2ogr commands that perform operations such as reprojection or subsetting.

An example of a recipe.sh file is shown below:

Warning

Although the recipe.sh file should contain valid sh commands such as ogr2ogr, it is not expected to be executable as a standalone shell script (without ogdc-runner), because it must follow the specific expectations outlined below:

  • Each command in recipe.sh is expected to write its data to /output_dir/.

  • The input data for each step is always assumed to be in /input_dir/. The previous step’s /output_dir/ becomes the next step’s /input_dir/. The first step’s /input_dir/ contains the data specified in the meta.yaml’s input.

  • Multi-line constructs are not allowed. It is assumed that each line not prefixed by # is a command that will be executed via sh -c {line}.

  • Each command is executed in isolation. Do not expect environment variables (e.g., export ENVVAR=foo) to persist between lines.
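The line-by-line rules above can be illustrated with a short Python sketch. It is not the ogdc-runner implementation: parse_commands and run_isolated are hypothetical helpers, the recipe text is made up, and skipping blank lines is an assumption. It only shows why a variable exported on one line is not visible on the next when each line runs in its own sh -c process.

```python
import subprocess


def parse_commands(recipe_text):
    """Return the command lines of a recipe.sh, skipping '#' comment lines.

    Assumption: blank lines are ignored as well.
    """
    commands = []
    for line in recipe_text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue
        commands.append(stripped)
    return commands


def run_isolated(commands):
    """Run each command in its own `sh -c` process and collect its stdout."""
    outputs = []
    for command in commands:
        result = subprocess.run(
            ["sh", "-c", command], capture_output=True, text=True, check=True
        )
        outputs.append(result.stdout.strip())
    return outputs


# Hypothetical recipe text, used only to show that state does not persist.
recipe_text = """\
# Set a variable in the first step
export OGDC_DEMO_VAR=hello
echo "OGDC_DEMO_VAR is ${OGDC_DEMO_VAR:-unset}"
"""

commands = parse_commands(recipe_text)
outputs = run_isolated(commands)
print(commands)
print(outputs)
```

Because the export and the echo run in separate processes, the second command sees the variable as unset. Any value a later step needs should therefore be written inline on the line that uses it.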

For an example of a recipe using the shell workflow, we recommend taking a look at the ogdc-recipes seal-tags recipe example.

Visualization Workflow

The visualization workflow takes a geospatial data file as input and produces 3D web-tiles of the data for visualization in a web-map.

See ogdc_runner.models.recipe_config.VizWorkflow for details on configuration options.

Warning

This section of the documentation is incomplete! TODO: more detail / link to viz workflow documentation.

For an example of a recipe using the visualization workflow, we recommend taking a look at the ogdc-recipes viz-workflow recipe example.

Parallel Execution

Currently, only shell workflows support parallel execution, which processes multiple input files concurrently. Parallel execution distributes work across multiple Argo workflow tasks, enabling efficient processing of large datasets.

Configuration

Parallel execution is configured via the parallel field within the workflow configuration. See ogdc_runner.models.recipe_config.ParallelConfig for complete configuration options.

workflow:
  type: "shell"
  parallel:
    enabled: true
    partition_strategy: "files"
    partition_size: 2

enabled

Boolean flag to enable parallel execution. When false (default), the workflow executes sequentially.

partition_strategy

The partitioning strategy. Currently, only the "files" strategy is supported; it groups input files into partitions for parallel processing.

partition_size

Number of files per partition. The orchestrator divides input files into chunks of this size, creating one parallel task per partition. For example, with 5 input files and partition_size: 2, three partitions are created: two with 2 files and one with 1 file.

Note

Partitions may have different numbers of files. If the total number of input files is not evenly divisible by partition_size, the last partition will contain the remainder. For instance, 7 files with partition_size: 3 creates partitions of [3, 3, 1] files.
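The partition arithmetic described above can be checked with a small Python sketch. The partition_files function is a hypothetical stand-in for the orchestrator's logic, and it assumes files are grouped in the order they are listed.

```python
def partition_files(files, partition_size):
    """Split a list of files into consecutive chunks of at most partition_size.

    Hypothetical helper: assumes files are chunked in listing order.
    """
    if partition_size < 1:
        raise ValueError("partition_size must be at least 1")
    return [
        files[start:start + partition_size]
        for start in range(0, len(files), partition_size)
    ]


# Made-up input file names.
five_files = [f"file_{n}.nc" for n in range(1, 6)]
seven_files = [f"file_{n}.nc" for n in range(1, 8)]

sizes_five = [len(p) for p in partition_files(five_files, 2)]
sizes_seven = [len(p) for p in partition_files(seven_files, 3)]

print(sizes_five)   # [2, 2, 1]
print(sizes_seven)  # [3, 3, 1]
```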

Execution Model

Parallel execution uses Argo’s DAG (Directed Acyclic Graph) to create independent tasks that can run concurrently. The maximum parallelism is controlled at the workflow level, allowing Argo to automatically schedule tasks as cluster resources become available.

Each parallel task:

  • Receives a partition of input files via workflow parameters

  • Executes the same command independently for each file in its partition

  • Writes outputs to isolated directories (one per partition)

  • Runs in a separate container with its own resource allocation

Important

File-level execution: Each command in the recipe is executed once per file in the partition. The runner sets environment variables ($INPUT_FILE and $OUTPUT_FILE) for each file, and your command processes them one at a time within the partition. You don’t need to handle the partition splitting yourself; the orchestrator does this automatically.
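As a rough illustration of file-level execution, the sketch below builds one (command, environment) pair per file in a partition. Only the $INPUT_FILE and $OUTPUT_FILE variable names come from the description above. The plan_partition helper, the partition-N directory layout, the example ogr2ogr command, and the file names are all assumptions made for illustration, not the runner's actual behavior.

```python
import posixpath


def plan_partition(command, partition_index, input_files, output_root="/output_dir"):
    """Build one (command, environment) pair per file in a partition.

    Hypothetical sketch: the real runner may derive OUTPUT_FILE and the
    per-partition directory name differently.
    """
    # Assumed layout: one isolated output directory per partition.
    partition_dir = posixpath.join(output_root, f"partition-{partition_index}")
    plan = []
    for input_file in input_files:
        env = {
            "INPUT_FILE": input_file,
            "OUTPUT_FILE": posixpath.join(partition_dir, posixpath.basename(input_file)),
        }
        plan.append((command, env))
    return plan


# Example command as it might appear on one line of a recipe.sh (illustrative).
command = 'ogr2ogr -t_srs EPSG:3413 "$OUTPUT_FILE" "$INPUT_FILE"'

plan = plan_partition(command, 0, ["/input_dir/a.gpkg", "/input_dir/b.gpkg"])
for planned_command, env in plan:
    print(env["INPUT_FILE"], "->", env["OUTPUT_FILE"])
```

The same command string is reused for every file; only the two environment variables change between runs.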