study-dealer-nextflow

A Nextflow pipeline that processes biological study data for the VEuPathDB project. The pipeline "deals studies to the wrangler" by taking various genomics datasets and processing them through standardized wrangling workflows.

Overview

This pipeline automates the processing of diverse genomics datasets into standardized formats for the VEuPathDB ecosystem. It supports multiple data types including RNA sequencing, phenotype data, antibody arrays, cellular localization, and RFLP datasets.

Key Features

Mode-based architecture: Process different data types (RNA-seq, phenotype, etc.) through specialized workflows
Multi-dataset study support: Automatically groups related datasets into unified studies
Flexible configuration: JSON-based study mapping and glob pattern file discovery
Containerized execution: Docker-based processing with reproducible environments
Validation & quality control: Built-in validation against baseline and EDA profiles
Smart processing: Avoids reprocessing already-wrangled studies

Prerequisites

Nextflow (version 20.01.0 or later)
Docker (recommended) or Singularity/Apptainer
Access to VEuPathDB data directories
GUS home directory with configuration file at {gusHomeDir}/config/gus.config

Quick Start

Basic Usage

# Run RNA-seq mode (default)
nextflow run main.nf

# Run phenotype mode
nextflow run main.nf --mode phenotype

# Process a specific dataset
nextflow run main.nf --mode phenotype --datasetName "PlasmoDB_Rod_Mal_Phenotype_RSRC"

# Custom output directory
nextflow run main.nf --outputDir /path/to/results

Configuration

Ensure your GUS home directory is set and contains config/gus.config:

nextflow run main.nf --gusHomeDir /path/to/gusHome

The pipeline expects the GUS config file at {gusHomeDir}/config/gus.config.

Processing Modes

RNA-seq Mode (`--mode rnaseq`)

Processes RNA sequencing studies with support for:

EBI RNA-seq count data
Standard RNA-seq count matrices
WGCNA eigengene data
AI-generated sample metadata
Multi-organism studies

Features:

Automatic strandedness detection (sense/antisense/unstranded)
Multi-dataset study grouping via JSON configuration
Database validation to prevent reprocessing

Example:

nextflow run main.nf --mode rnaseq

Phenotype Mode (`--mode phenotype`)

Processes phenotype datasets using custom R wrangling scripts.

Example:

nextflow run main.nf --mode phenotype --datasetName "PlasmoDB_Rod_Mal_Phenotype_RSRC"

Other Modes

Antibody Array (`--mode antibodyArray`)
Cellular Localization (`--mode cellularLocalization`)
RFLP (`--mode rflp`)

All three follow the same pattern as phenotype mode: a custom R script at lib/R/{mode}/{DatasetName}.R

Configuration Parameters

| Parameter | Description | Default |
| --- | --- | --- |
| `mode` | Processing mode (rnaseq, phenotype, etc.) | `rnaseq` |
| `gusHomeDir` | Path to GUS environment root (must contain config/gus.config) | `/path/to/gusHome` |
| `workflowDataDir` | Base directory for input data | `$baseDir/data` |
| `outputDir` | Results output directory | `$launchDir/results` |
| `datasetName` | Filter to specific dataset | `""` |
| `multiDatasetStudies` | JSON file for study mapping | `$baseDir/data/rnaseq_sample_reannotation/multiDatasetStudy.json` |
| `studyWranglerTag` | Docker image version | `1.0.27` |
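
Each parameter above is passed on the command line as a `--name value` pair, so a run can also be assembled from a script. The following Python sketch is only an illustration under that assumption; the helper `build_command` is hypothetical and not part of the pipeline.

```python
import shlex


def build_command(params, script="main.nf"):
    """Build a `nextflow run` argument list from pipeline parameters.

    `build_command` is a hypothetical helper, not part of the pipeline.
    """
    args = ["nextflow", "run", script]
    for name, value in params.items():
        # Pipeline parameters take a double dash, e.g. --mode phenotype
        args += [f"--{name}", str(value)]
    return args


command = build_command({
    "mode": "phenotype",
    "datasetName": "PlasmoDB_Rod_Mal_Phenotype_RSRC",
    "outputDir": "/path/to/results",
})
# shlex.join gives a quote-safe string for display or logging.
print(shlex.join(command))
```

Keeping the command as a list makes it safe to hand to `subprocess.run` without worrying about shell quoting.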

Input Data Organization

The pipeline expects data organized in a specific directory structure:

data/
└── {projectName}/          # e.g., PlasmoDB, HostDB
    └── {organismAbbrev}/   # e.g., pfal3D7, hsapREF
        └── {mode}/         # e.g., rnaseq, phenotype
            └── {datasetName}/
                └── {data files}

Example:

data/PlasmoDB/pfal3D7/rnaseq/pfal3D7_Lee_Gambian_ebi_rnaSeq_RSRC/analysis_output/countsForEda_firststrand.txt
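
To make the layout concrete, the Python sketch below splits a path into the four documented levels. It is an illustration only: `DatasetPath` and `parse_dataset_path` are hypothetical names, and the pipeline performs its own metadata extraction.

```python
from pathlib import PurePosixPath
from typing import NamedTuple


class DatasetPath(NamedTuple):
    project_name: str
    organism_abbrev: str
    mode: str
    dataset_name: str


def parse_dataset_path(path, data_root="data"):
    """Split a path under data_root into the four documented levels."""
    parts = PurePosixPath(path).parts
    # Find the data root, then read the four directory levels below it.
    if data_root not in parts:
        raise ValueError(f"{path!r} is not under {data_root!r}")
    start = parts.index(data_root) + 1
    levels = parts[start:start + 4]
    if len(levels) < 4:
        raise ValueError(f"{path!r} has fewer than four levels under {data_root!r}")
    return DatasetPath(*levels)


example = (
    "data/PlasmoDB/pfal3D7/rnaseq/"
    "pfal3D7_Lee_Gambian_ebi_rnaSeq_RSRC/analysis_output/countsForEda_firststrand.txt"
)
parsed = parse_dataset_path(example)
print(parsed)
```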

Multi-Dataset Studies (RNA-seq)

For RNA-seq studies spanning multiple organisms, define study groupings in data/rnaseq_sample_reannotation/multiDatasetStudy.json:

[
  {
    "study": "HPI_Lee_Gambian",
    "datasets": [
      "hsapREF_Lee_Gambian_ebi_rnaSeq_RSRC",
      "pfal3D7_Lee_Gambian_ebi_rnaSeq_RSRC"
    ]
  }
]

Datasets not in the JSON are processed as individual studies.
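
The grouping rule can be sketched in Python as follows. This is not the pipeline's code: `group_datasets` is a hypothetical helper, the third dataset name is invented for illustration, and naming a standalone study after its dataset is an assumption.

```python
import json

MAPPING_JSON = """
[
  {
    "study": "HPI_Lee_Gambian",
    "datasets": [
      "hsapREF_Lee_Gambian_ebi_rnaSeq_RSRC",
      "pfal3D7_Lee_Gambian_ebi_rnaSeq_RSRC"
    ]
  }
]
"""


def group_datasets(dataset_names, mapping_json):
    """Return {study_name: [dataset, ...]} for the discovered datasets."""
    mapping = json.loads(mapping_json)
    # Invert the mapping: dataset name -> study name.
    dataset_to_study = {
        dataset: entry["study"]
        for entry in mapping
        for dataset in entry["datasets"]
    }
    studies = {}
    for name in dataset_names:
        # Assumption: a dataset absent from the JSON forms its own study,
        # named after the dataset.
        study = dataset_to_study.get(name, name)
        studies.setdefault(study, []).append(name)
    return studies


discovered = [
    "hsapREF_Lee_Gambian_ebi_rnaSeq_RSRC",
    "pfal3D7_Lee_Gambian_ebi_rnaSeq_RSRC",
    "example_Standalone_rnaSeq_RSRC",  # hypothetical dataset not in the JSON
]
groups = group_datasets(discovered, MAPPING_JSON)
print(groups)
```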

Output Structure

Results are published to {outputDir}/{study_name}/:

results/{study_name}/
├── install.json              # Study definition (VDI format)
├── study.cache              # Serialized study object
├── entitytypegraph.cache    # Entity relationships
└── *.cache                  # Additional cache files
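
A quick way to check a finished run is to confirm that the named files are present. The Python sketch below assumes the three files listed above are always produced; `missing_outputs` is a hypothetical helper, not something the pipeline provides.

```python
from pathlib import Path

# The files named in the output tree; additional *.cache files are not checked.
EXPECTED_FILES = ("install.json", "study.cache", "entitytypegraph.cache")


def missing_outputs(study_dir):
    """Return the expected file names that are absent from study_dir."""
    study_dir = Path(study_dir)
    return [name for name in EXPECTED_FILES if not (study_dir / name).is_file()]


# Example: report what is missing for one study (path is a placeholder).
missing = missing_outputs("results/HPI_Lee_Gambian")
if missing:
    print("Missing outputs:", ", ".join(missing))
else:
    print("All expected outputs are present")
```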

Adding New Datasets

For Phenotype-like Modes

Create wrangling script: lib/R/{mode}/{DatasetName}.R

library(study.wrangler)

wrangle <- function() {
  # Read data
  entity <- entity_from_file("data.txt")

  # Configure entity
  entity <- entity %>%
    set_entity_metadata(name = "samples", display_name = "Samples") %>%
    redetect_column_as_id("sample_id") %>%
    set_variable_metadata("phenotype", display_name = "Phenotype")

  # Create and validate study
  study <- study("MyStudy", entity)
  if(!validate(study, profiles=c("baseline", "eda"))) {
    stop("Study invalid")
  }

  # Export
  export_to_vdi(study, getwd())
  return(study)
}

Run pipeline:

nextflow run main.nf --mode {mode} --datasetName {DatasetName}

For RNA-seq Mode

Ensure data files match expected patterns:

Count files: countsForEda_*.txt
Eigengenes: *eigengenes*.txt
Sample metadata: entity-sample.{tsv,yaml}

For multi-organism studies, add entry to multiDatasetStudy.json.
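
Before running the pipeline, file names can be checked against the patterns above. The Python sketch below matches on the file name only, which is an assumption; the actual patterns in nextflow.config may include directory components. `classify` is a hypothetical helper.

```python
from fnmatch import fnmatchcase

PATTERNS = {
    "counts": ["countsForEda_*.txt"],
    "eigengenes": ["*eigengenes*.txt"],
    # fnmatch has no brace expansion, so {tsv,yaml} is written out by hand.
    "sample_metadata": ["entity-sample.tsv", "entity-sample.yaml"],
}


def classify(filename):
    """Return the kinds of input a file name matches (possibly none)."""
    return [
        kind
        for kind, globs in PATTERNS.items()
        if any(fnmatchcase(filename, glob) for glob in globs)
    ]


for name in ["countsForEda_firststrand.txt", "entity-sample.yaml", "notes.md"]:
    print(name, "->", classify(name) or "no match")
```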

Docker Configuration

The pipeline uses Docker by default with the following settings:

User mapping: Runs containers as the current user so that output files have the correct ownership
Network: Host network access enabled
Environment variables:
- MY_R_LIB: Points to custom R scripts (lib/R/)
- GUS_HOME: Database configuration location

Containers Used

veupathdb/study-wrangler:1.0.27: Main processing with R and study.wrangler library
veupathdb/alpine_bash:latest: Utility operations
veupathdb/vdi-plugin-wrangler: Database name extraction

Using Singularity

To use Singularity instead of Docker, edit nextflow.config:

includeConfig 'conf/singularity.config'  // Instead of docker.config

Troubleshooting

Common Issues

Pipeline skips my study:

A study is processed only when its database name exists in the external databases but NOT yet in EDA; check both conditions
Review logs for filtering messages: log.info "Skipping study..."
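
The skip rule amounts to a membership test on two sets of names. The Python sketch below restates it under that reading; `should_process` and the database names are hypothetical, and the pipeline's own filter may apply further conditions.

```python
def should_process(database_name, external_db_names, eda_db_names):
    """Process a study only if it is known externally and not yet in EDA."""
    return database_name in external_db_names and database_name not in eda_db_names


# Hypothetical database names, for illustration only.
external = {"study_A_RSRC", "study_B_RSRC"}
eda = {"study_B_RSRC"}

for name in ["study_A_RSRC", "study_B_RSRC", "study_C_RSRC"]:
    action = "process" if should_process(name, external, eda) else "skip"
    print(f"{name}: {action}")
```

Here `study_B_RSRC` is skipped because it is already in EDA, and `study_C_RSRC` is skipped because it is missing from the external databases.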

Validation errors:

Ensure wrangle() function returns valid study object
Check that all required metadata is set
Verify variable data shapes match actual data

File not found errors:

Verify data directory structure matches expected pattern
Check file patterns in nextflow.config
Ensure workflowDataDir parameter points to correct location

GUS config errors:

Verify GUS config file exists at {gusHomeDir}/config/gus.config
Check --gusHomeDir parameter points to correct location

Debug Mode

Run with Nextflow's tracing and reporting options to produce a trace file, an execution report, and a workflow diagram:

nextflow run main.nf -with-trace -with-report -with-dag flowchart.html

Performance

Concurrency: Each process runs at most 2 parallel instances (maxForks = 2) to prevent resource exhaustion
Benchmarking: RNA-seq mode includes timing utilities for performance analysis
Database validation: Skips studies that have already been wrangled, preventing redundant reprocessing

Architecture

Components

workflows/: High-level orchestration (RNA-seq multi-dataset handling)
subworkflows/: Reusable components (single study processing)
modules/: Atomic processes (wrangling, filtering, validation)
bin/: Executable R scripts (wrangleRNASeq.R, singleStudyWrangle.R)
lib/R/: Custom dataset-specific wrangling scripts

Processing Flow (RNA-seq)

Discover files via glob patterns
Extract metadata from directory paths
Mix counts, WGCNA, and sample metadata
Add organism prefixes and filter system rows
Group datasets by study
Validate against database names
Wrangle each study group
Validate output
Export to VDI format

Contributing

When adding new functionality:

Follow the mode-based dispatcher pattern in main.nf
Use study.wrangler library for consistency
Implement validation with "baseline" and "eda" profiles
Add file patterns to nextflow.config
Document custom wrangling scripts

License

VEuPathDB Project

Contact

For issues or questions, please contact the VEuPathDB development team.
