# JaneliaSciComp/spark_start
Subworkflow that starts Spark processing either by spinning up a Spark cluster or by setting up variables so that the processing can run locally as individual jobs.
Keywords: `spark`, `bigdata`, `infrastructure`
## Module Information
### Inputs
| Name | Type | Description |
|---|---|---|
| ch_meta | tuple | Channel of tuples containing a meta map and a list of data paths. Structure: `[ val(meta), [path(data_paths)] ]` |
| config | map | Additional Spark configuration |
| spark_cluster | boolean | Whether or not to spin up a Spark cluster |
| working_dir | path | Path shared by workers for logging and jar distribution |
| spark_workers | integer | Number of workers in the cluster |
| min_workers | integer | Minimum number of Spark workers that must be available on the Spark cluster |
| spark_worker_cpus | integer | Number of CPUs per Spark worker |
| spark_executor_cpus | integer | Number of CPUs for a Spark executor |
| spark_executor_mem_gb | integer | Memory in GB allocated for a Spark executor |
| spark_executor_overhead_mem_gb | integer | Memory overhead in GB for a Spark executor |
| spark_driver_cpus | integer | Number of CPUs for the Spark driver |
| spark_driver_mem_gb | integer | Memory in GB for the Spark driver |
| spark_gb_per_core | integer | Memory in GB per worker core |
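
A minimal invocation sketch for use inside a workflow that has included `SPARK_START`: the positional arguments follow the order of the table above (verify against the subworkflow's `take:` block), and every value other than `ch_meta` is an illustrative placeholder.

```nextflow
// Sketch only: argument order follows the Inputs table; check it against the
// subworkflow's take: block. All scalar values are placeholders.
SPARK_START(
    ch_meta,               // [ val(meta), [path(data_paths)] ]
    [:],                   // config: additional Spark configuration, empty map for defaults
    true,                  // spark_cluster: spin up a distributed cluster
    '/shared/spark_work',  // working_dir: path visible to all workers
    4,                     // spark_workers
    1,                     // min_workers
    8,                     // spark_worker_cpus
    4,                     // spark_executor_cpus
    16,                    // spark_executor_mem_gb
    2,                     // spark_executor_overhead_mem_gb
    1,                     // spark_driver_cpus
    4,                     // spark_driver_mem_gb
    4                      // spark_gb_per_core
)
```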
### Outputs
| Name | Type | Description |
|---|---|---|
| spark_context | tuple | The tuple from input `ch_meta` with the `spark_context` map appended. Structure: `[ val(meta), val(spark_context) ]` |
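
A sketch of consuming the result, assuming the output channel is named `spark_context` as in the table above and that the meta map carries an `id` field (both assumptions should be confirmed against the subworkflow and your pipeline):

```nextflow
// Each emitted item pairs the original meta map with its spark_context map.
SPARK_START.out.spark_context
    .map { meta, spark -> "Spark context for ${meta.id}: ${spark}" }  // meta.id is an assumed field
    .view()
```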
## Quick Start
Include this subworkflow in your Nextflow pipeline. The `include` statement takes a local path, so first copy the subworkflow from https://github.com/JaneliaSciComp/nextflow-modules/tree/main/subworkflows/janelia/spark_start into your pipeline source tree, then include it, for example:

include { SPARK_START } from './subworkflows/janelia/spark_start/main'
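
A minimal end-to-end sketch, assuming the subworkflow has been copied to `./subworkflows/janelia/spark_start` and that the `params.*` names below are defined in your `nextflow.config` (all names and values here are illustrative, not prescribed by the subworkflow):

```nextflow
include { SPARK_START } from './subworkflows/janelia/spark_start/main'

workflow {
    // One meta map plus the list of data paths the Spark job will read,
    // matching the expected [ val(meta), [path(data_paths)] ] structure.
    ch_meta = Channel.of(
        [ [id: 'sample1'], [ file('/data/sample1') ] ]
    )

    SPARK_START(
        ch_meta,
        [:],                                     // additional Spark configuration
        params.spark_cluster,                    // true for a cluster, false for local jobs
        params.spark_work_dir,
        params.spark_workers,
        params.min_spark_workers,
        params.spark_worker_cpus,
        params.spark_executor_cpus,
        params.spark_executor_mem_gb,
        params.spark_executor_overhead_mem_gb,
        params.spark_driver_cpus,
        params.spark_driver_mem_gb,
        params.spark_gb_per_core
    )

    // Inspect the resulting spark_context for each dataset
    SPARK_START.out.spark_context.view()
}
```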