-
Notifications
You must be signed in to change notification settings - Fork 1k
BiG Slice modules for nf-core #9564
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,8 @@ | ||
| channels: | ||
| - conda-forge | ||
| - bioconda | ||
| - defaults | ||
| dependencies: | ||
| - bioconda::bigslice=2.0.0 | ||
| - conda-forge::biopython=1.81 | ||
| - conda-forge::pandas=2.0.3 |
| Original file line number | Diff line number | Diff line change | ||||
|---|---|---|---|---|---|---|
| @@ -0,0 +1,84 @@ | ||||||
| /* | ||||||
| * BIGSLICE_PREP_INPUT: Prepares antiSMASH outputs for BiG-SLiCE analysis | ||||||
| * | ||||||
| * This process transforms antiSMASH output directories into the specific | ||||||
| * directory structure and file format that BiG-SLiCE expects for clustering | ||||||
| * biosynthetic gene clusters (BGCs). | ||||||
| * | ||||||
| * BiG-SLiCE input structure: | ||||||
| * input/ | ||||||
| * ├── datasets.tsv # dataset configuration file | ||||||
| * ├── <dataset_name>/ # GBK files organized by sample (each GBK contains a BGC region) | ||||||
| * │ ├── sample1/ | ||||||
| * │ │ ├── region001.gbk # BGC region 1 in GenBank format | ||||||
| * │ │ └── region002.gbk # BGC region 2 in GenBank format | ||||||
| * │ └── sample2/ | ||||||
| * │ └── region001.gbk # BGC region 1 in GenBank format | ||||||
| * └── taxonomy/ | ||||||
| * └── dataset_taxonomy.tsv # taxonomic information (9-column format) | ||||||
| */ | ||||||
|
Comment on lines
+1
to
+19
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This info should go in the meta.yml |
||||||
| process BIGSLICE_PREP_INPUT { | ||||||
| label 'bigslice' | ||||||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Suggested change
|
||||||
| tag "dataset=${params.bgc_bigslice_dataset_name}" | ||||||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Suggested change
|
||||||
|
|
||||||
| conda "${moduleDir}/environment.yml" | ||||||
| container "${ workflow.containerEngine == 'singularity' && !task. ext.singularity_pull_docker_container ? | ||||||
| 'https://depot.galaxyproject.org/singularity/bigslice:2.0.2--pyh8ed023e_0': | ||||||
| 'quay.io/biocontainers/bigslice:2.0.2--pyh8ed023e_0' }" | ||||||
|
|
||||||
| input: | ||||||
| // list of antiSMASH output directories (one per sample) | ||||||
| path antismash_dirs | ||||||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I would add a meta map here
Suggested change
|
||||||
|
|
||||||
| output: | ||||||
| // complete "input" folder structure for BiG-SLiCE (contains dataset/, taxonomy/, datasets.tsv) | ||||||
| path "input", emit: input_dir | ||||||
| path "versions.yml", emit: versions | ||||||
|
Comment on lines
+34
to
+36
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. emit meta maps! and |
||||||
|
|
||||||
| script: | ||||||
| // prepare quoted directory list for bash for-loop processing | ||||||
| def quoted = antismash_dirs.collect { "\"${it}\"" }.join(' ') | ||||||
| def VERSION = '2.0.2' | ||||||
| """ | ||||||
| set -euo pipefail | ||||||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This is already set by the |
||||||
|
|
||||||
| ROOT="input" # BiG-SLiCE input root directory | ||||||
| DS="${params.bgc_bigslice_dataset_name}" # dataset name (e.g., 'antismash') | ||||||
| OUT="\$ROOT/\$DS" # dataset-specific output directory | ||||||
| TAXROOT="\$ROOT/taxonomy" # taxonomy directory | ||||||
|
|
||||||
| rm -rf "\$ROOT" | ||||||
| mkdir -p "\$OUT" "\$TAXROOT" | ||||||
|
Comment on lines
+45
to
+51
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Why is there an |
||||||
|
|
||||||
| for d in ${quoted}; do | ||||||
| [ -d "\$d" ] || continue # skip if directory doesn't exist | ||||||
| sample=\$(basename "\$d") # extract sample name from directory path | ||||||
| mkdir -p "\$OUT/\$sample" # create sample-specific subdirectory | ||||||
|
|
||||||
| find -L "\$d" -type f \\( -name "*.region*.gbk" -o -name "*.gbk" \\) -print0 \ | ||||||
| | xargs -0 -I{} cp -f "{}" "\$OUT/\$sample/" | ||||||
| done | ||||||
|
|
||||||
| if [ -n "${params.bgc_bigslice_taxonomy ?: ''}" ]; then | ||||||
| cp "${params.bgc_bigslice_taxonomy}" "\$TAXROOT/dataset_taxonomy.tsv" | ||||||
| else | ||||||
| printf "accession\\ttaxdomain\\tphylum\\tclass\\torder\\tfamily\\tgenus\\tspecies\\torganism\\n" > "\$TAXROOT/dataset_taxonomy.tsv" | ||||||
|
|
||||||
| for d in "\$OUT"/*/; do | ||||||
| [ -d "\$d" ] || continue | ||||||
| acc=\$(basename "\$d")/ # sample accession (with trailing slash as per BiG-SLiCE format) | ||||||
| printf "%s\\tUnknown\\tUnknown\\tUnknown\\tUnknown\\tUnknown\\tUnknown\\tUnknown\\tUnknown\\n" "\$acc" >> "\$TAXROOT/dataset_taxonomy.tsv" | ||||||
| done | ||||||
| fi | ||||||
|
|
||||||
| { | ||||||
| echo "# dataset_name\\tdataset_path\\ttaxonomy_path\\tdescription" | ||||||
| printf "%s\\t%s\\t%s\\t%s\\n" "\$DS" "\$DS" "taxonomy/dataset_taxonomy.tsv" "antiSMASH \$DS" | ||||||
| } > "\$ROOT/datasets.tsv" | ||||||
|
|
||||||
| cat <<-END_VERSIONS > versions.yml | ||||||
| "${task.process}": | ||||||
| bigslice: \$(bigslice --version 2>&1 | grep -oP 'BiG-SLiCE \\K[0-9.]+' || echo "${VERSION}") | ||||||
| END_VERSIONS | ||||||
| """ | ||||||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Please add a stub test aswell :) |
||||||
| } | ||||||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think the meta map is not completely valid :) |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,37 @@ | ||
| name: bigslice_prep_input | ||
| description: Prepares antiSMASH outputs for BiG-SLiCE clustering analysis by organizing BGC GenBank files into the required directory structure | ||
| keywords: | ||
| - biosynthetic gene cluster | ||
| - BGC | ||
| - antismash | ||
| - bigslice | ||
| - data preparation | ||
| - genome mining | ||
|
|
||
| tools: | ||
| - bigslice: | ||
| description: A highly scalable, user-interactive tool for the large scale analysis of Biosynthetic Gene Clusters data | ||
| homepage: https://github.com/medema-group/BiG-SLiCE | ||
| documentation: https://github.com/medema-group/BiG-SLiCE/wiki | ||
| tool_dev_url: https://github.com/medema-group/BiG-SLiCE | ||
| doi: "10.1038/s41467-021-21428-1" | ||
| licence: ["AGPL v3"] | ||
|
|
||
| input: | ||
| - antismash_dirs: | ||
| type: directory | ||
| description: | | ||
| List of antiSMASH output directories, one per sample. | ||
| Each directory should contain *.region*.gbk files representing | ||
| detected biosynthetic gene clusters. | ||
| pattern: "*/" | ||
|
|
||
| output: | ||
| - input_dir: | ||
| type: directory | ||
| description: | | ||
| Structured input directory for BiG-SLiCE containing: | ||
| - datasets.tsv: dataset configuration file | ||
| - <dataset_name>/: organized BGC GenBank files by sample | ||
| - taxonomy/: taxonomic information in GTDB format | ||
| pattern: "input/" |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,12 @@ | ||
| channels: | ||
| - conda-forge | ||
| - bioconda | ||
| - defaults | ||
| dependencies: | ||
| - bioconda::bigslice=2.0.0 | ||
| - conda-forge::scikit-learn=1.3.0 | ||
| - conda-forge::numpy=1.24.3 | ||
| - conda-forge::pandas=2.0.3 | ||
| - conda-forge::biopython=1.81 | ||
| - conda-forge::matplotlib-base=3.7.2 | ||
| - conda-forge::flask=2.3.3 |
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. see comments on other module, they apply here aswell |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,59 @@ | ||
| /* | ||
| * BIGSLICE_RUN: Executes BiG-SLiCE clustering analysis on prepared BGC data | ||
| * | ||
| * BiG-SLiCE (Biosynthetic Gene cluster - Super Linear Clustering Engine) is a tool | ||
| * for rapid clustering and analysis of biosynthetic gene clusters from genomic data. | ||
| * | ||
| * This process: | ||
| * 1. Takes the structured input directory created by BIGSLICE_PREP_INPUT | ||
| * 2. Uses pre-trained machine learning models to analyze BGC features | ||
| * 3. Performs hierarchical clustering of BGCs based on genetic similarity | ||
| * 4. Generates interactive web application and detailed clustering results | ||
| * | ||
| * Output structure: | ||
| * output/ | ||
| * ├── app/ # interactive web application for visualization | ||
| * ├── result/ # clustering results and analysis data | ||
| * ├── LICENSE.txt # BiG-SLiCE license information | ||
| * ├── requirements.txt # Python dependencies for the web app | ||
| * └── start_server.sh # script to launch the interactive web server | ||
| */ | ||
| process BIGSLICE_RUN { | ||
| label 'bigslice' | ||
|
|
||
| conda "${moduleDir}/enviroment.yml" | ||
| container "${ workflow.containerEngine == 'singularity' && !task. ext.singularity_pull_docker_container ? | ||
| 'https://depot.galaxyproject.org/singularity/bigslice:2.0.2--pyh8ed023e_0': | ||
| 'quay.io/biocontainers/bigslice:2.0.2--pyh8ed023e_0' }" | ||
|
|
||
| input: | ||
| path input_dir // structured input directory from BIGSLICE_PREP_INPUT (contains datasets.tsv, BGC files, taxonomy) | ||
| path models_dir // pre-trained BiG-SLiCE models directory (e.g., bigslice-models.2022-11-30) | ||
|
|
||
| output: | ||
| path "output", emit: outdir // complete BiG-SLiCE output directory with clustering results and reports | ||
| path "versions.yml", emit: versions | ||
|
|
||
| script: | ||
| def VERSION = '2.0.2' | ||
| """ | ||
| set -euo pipefail | ||
|
|
||
| # clean any existing output directory to ensure fresh results | ||
| rm -rf output 2>/dev/null || true | ||
|
|
||
| # execute BiG-SLiCE clustering analysis | ||
| # -i: input directory containing prepared BGC data and configuration | ||
| # --program_db_folder: directory with pre-trained machine learning models | ||
| # output: destination directory for all results and reports | ||
| bigslice \ | ||
| -i "${input_dir}" \ | ||
| --program_db_folder "${models_dir}" \ | ||
| output | ||
|
|
||
| cat <<-END_VERSIONS > versions.yml | ||
| "${task.process}": | ||
| bigslice: \$(bigslice --version 2>&1 | grep -oP 'BiG-SLiCE \\K[0-9.]+' || echo "${VERSION}") | ||
| END_VERSIONS | ||
| """ | ||
| } |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,50 @@ | ||
| name: bigslice_run | ||
| description: Executes BiG-SLiCE clustering analysis on prepared BGC data using pre-trained machine learning models | ||
| keywords: | ||
| - biosynthetic gene cluster | ||
| - BGC | ||
| - clustering | ||
| - machine learning | ||
| - comparative genomics | ||
| - secondary metabolites | ||
| - bigslice | ||
|
|
||
| tools: | ||
| - bigslice: | ||
| description: A highly scalable, user-interactive tool for the large scale analysis of Biosynthetic Gene Clusters data | ||
| homepage: https://github.com/medema-group/BiG-SLiCE | ||
| documentation: https://github.com/medema-group/BiG-SLiCE/wiki | ||
| tool_dev_url: https://github.com/medema-group/BiG-SLiCE | ||
| doi: "10.1038/s41467-021-21428-1" | ||
| licence: ["AGPL v3"] | ||
|
|
||
| input: | ||
| - input_dir: | ||
| type: directory | ||
| description: | | ||
| Structured input directory created by BIGSLICE_PREP_INPUT containing: | ||
| - datasets.tsv: dataset configuration | ||
| - BGC GenBank files organized by sample | ||
| - taxonomy information | ||
| pattern: "input/" | ||
| - models_dir: | ||
| type: directory | ||
| description: | | ||
| BiG-SLiCE pre-trained models directory. | ||
| Download from: https://github.com/medema-group/BiG-SLiCE/releases | ||
| Example: bigslice-models.2022-11-30.tar.gz | ||
| pattern: "*/" | ||
|
|
||
| output: | ||
| - outdir: | ||
| type: directory | ||
| description: | | ||
| Complete BiG-SLiCE output directory containing: | ||
| - result/: clustering results and analysis data | ||
| - app/: interactive web application for visualization | ||
| - start_server.sh: script to launch web interface | ||
| pattern: "output/" | ||
| - versions: | ||
| type: file | ||
| description: File containing software versions | ||
| pattern: "versions.yml" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
According to the module specifications we want the module name to be reflected in the process name. In this case BIGSLICE_PREPINPUT