JaneliaSciComp/spark_start

subworkflow

Starts Spark processing, either by spinning up a Spark cluster or by setting up the variables needed to run the processing locally as individual jobs.

Keywords: spark, bigdata, infrastructure

Module Information

Repository: https://github.com/JaneliaSciComp/nextflow-modules/tree/main/subworkflows/janelia/spark_start
Source: Janelia
Organization: JaneliaSciComp
Authors: @krokicki, @cgoina

Inputs

ch_meta (tuple): Channel of tuples containing a meta map and a list of data paths. Structure: [ val(meta), [path(data_paths)] ]
config (map): Additional Spark configuration properties.
spark_cluster (boolean): Whether or not to spin up a Spark cluster.
working_dir (path): Path shared by the workers, used for logging and jar distribution.
spark_workers (integer): Number of workers in the cluster.
min_workers (integer): Minimum number of Spark workers that must be available in the cluster.
spark_worker_cpus (integer): Number of CPUs per Spark worker.
spark_executor_cpus (integer): Number of CPUs per Spark executor.
spark_executor_mem_gb (integer): Memory in GB allocated to each Spark executor.
spark_executor_overhead_mem_gb (integer): Memory overhead in GB for each Spark executor.
spark_driver_cpus (integer): Number of CPUs for the Spark driver.
spark_driver_mem_gb (integer): Memory in GB for the Spark driver.
spark_gb_per_core (integer): Memory in GB per worker core.
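
The inputs are positional, so a call lists them in the order shown above. A minimal call sketch, assuming the subworkflow's take order matches this table; the numeric values and params.spark_work_dir are illustrative placeholders, not module defaults:

SPARK_START(
    ch_meta,                  // [ val(meta), [ path(data_paths) ] ]
    [:],                      // config: no additional Spark properties
    true,                     // spark_cluster: spin up a cluster
    params.spark_work_dir,    // working_dir shared for logging and jar distribution
    4,                        // spark_workers
    1,                        // min_workers
    4,                        // spark_worker_cpus
    4,                        // spark_executor_cpus
    12,                       // spark_executor_mem_gb
    1,                        // spark_executor_overhead_mem_gb
    1,                        // spark_driver_cpus
    2,                        // spark_driver_mem_gb
    4                         // spark_gb_per_core
)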

Outputs

spark_context (tuple): The tuple from the ch_meta input with the spark_context map appended. Structure: [ val(meta), val(spark_context) ]
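
Downstream Spark-aware steps receive the connection details through this channel. A small sketch of inspecting it, assuming the meta map carries an id field (the keys inside the spark_context map are defined by the subworkflow and are not listed here):

SPARK_START.out.spark_context.view { meta, spark ->
    "Spark context ready for ${meta.id}: ${spark}"
}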

Quick Start

Include this subworkflow in your Nextflow pipeline. Nextflow cannot include code directly from a URL, so first copy or install the subworkflow from the repository above into your project, then include it by its local path, for example:

include { SPARK_START } from './subworkflows/janelia/spark_start/main'
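
Each element of ch_meta pairs a meta map with a list of data paths. A minimal sketch of building that channel, where the id key and the data path are placeholders rather than requirements of the subworkflow:

ch_meta = Channel.of(
    [ [id: 'sample_1'], [ file('/path/to/dataset') ] ]
)

This channel is then passed as the first input to SPARK_START, as in the call sketch under Inputs.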