Add Feature_extraction.py for Batch & Parallel MS1/MS2 Feature Extraction #36

dev-coder2 · 2025-03-31T08:19:42Z

Hi @singjc,
This pull request introduces a new script, Feature_extraction.py, which converts our existing feature extraction notebook into a standalone script.

Key updates include:

Batch and Parallel Processing: The script processes DIA-NN parquet files in batches to extract MS1 and MS2 features, using cuDF for GPU-accelerated data processing.
Configuration Updates: dquartic_train_config.json now includes the file paths and settings the pipeline requires. The MS1 and MS2 paths currently point to the main DIA-NN outputs. Once this script has generated the combined batch features for MS1 and MS2, the paths will be updated to point to the final CSV outputs.

We can now extract MS1 and MS2 features by running:
"python Feature_extraction.py --config dquartic_train_config.json"

This enhancement streamlines our workflow and improves processing efficiency for the D4 diffusion model pipeline. Please review the changes and merge them into the main branch.
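
For readers following along, here is a minimal sketch of how an entry point could read the `--config` flag from the command above and load the JSON file. Only the flag name comes from this PR; the function names `parse_args`, `load_config`, and `main` are hypothetical, and the actual script may be organized differently.

```python
import argparse
import json


def parse_args(argv=None):
    """Parse the command line; --config is the only flag named in this PR."""
    parser = argparse.ArgumentParser(description="Extract MS1/MS2 features")
    parser.add_argument("--config", required=True, help="Path to a JSON config file")
    return parser.parse_args(argv)


def load_config(path):
    """Read the JSON config into a dict; the keys depend on the real config file."""
    with open(path, "r", encoding="utf-8") as handle:
        return json.load(handle)


def main(argv=None):
    """Hypothetical entry point: parse flags, load the config, return it."""
    args = parse_args(argv)
    config = load_config(args.config)
    # The real script would hand `config` to the extraction pipeline here.
    return config
```

Calling `main(["--config", "dquartic_train_config.json"])` mirrors the command line shown in the description.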


singjc

Great work! Thanks for implementing this. I added a few comments and suggestions.

dquartic_train_config.json

dquartic/dia-nn to massdash feature extraction/Feature_extraction.py

singjc · 2025-04-04T15:20:21Z

dquartic/dia-nn to massdash feature extraction/Feature_extraction.py

+            if ms1_list:
+                batch_ms1_df = pd.concat(ms1_list, ignore_index=True)
+                batch_ms1_filename = os.path.join(self.ms1_dir, f"ms1_features_batch_{batch_idx}.csv")
+                batch_ms1_df.to_csv(batch_ms1_filename, index=False)
+                ms1_all_batches.append(batch_ms1_df)
+            if ms2_list:
+                batch_ms2_df = pd.concat(ms2_list, ignore_index=True)
+                batch_ms2_filename = os.path.join(self.ms2_dir, f"ms2_features_batch_{batch_idx}.csv")
+                batch_ms2_df.to_csv(batch_ms2_filename, index=False)
+                ms2_all_batches.append(batch_ms2_df)


Should we also check that there is matching data for both MS1 and MS2, and append only in that case? That would avoid cases where MS1 data was extracted but MS2 data was not. I'm not sure whether this actually happens.

dev-coder2 · 2025-04-10

I believe this scenario is unlikely. Our data was first processed with DIA-NN and then re-extracted with MassDash, and both tools are designed to produce consistent MS1 and MS2 data.
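
If the check is wanted anyway, one way to express it is sketched below. The variable names `ms1_list`, `ms2_list`, `ms1_all_batches`, and `ms2_all_batches` come from the diff above; the helper `append_if_paired` is hypothetical, and the per-batch CSV writing is left out for brevity.

```python
import pandas as pd


def append_if_paired(ms1_list, ms2_list, ms1_all_batches, ms2_all_batches):
    """Append a batch only when both MS1 and MS2 features were extracted.

    Returns True if the batch was kept and False if it was skipped.
    """
    if not ms1_list or not ms2_list:
        # One side is missing, so skip the whole batch to keep the two in step.
        return False
    ms1_all_batches.append(pd.concat(ms1_list, ignore_index=True))
    ms2_all_batches.append(pd.concat(ms2_list, ignore_index=True))
    return True
```

Returning a flag rather than raising lets the caller decide whether a skipped batch deserves a warning or a hard failure.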


dev-coder2 · 2025-04-04T17:04:33Z

Hi @singjc,

I have read your feedback and am working through it point by point.
Just one request: can I send my GSoC proposal to you as a personal message on Discord for feedback?

singjc · 2025-04-04T17:08:01Z

> Just one request: can I send my GSoC proposal to you as a personal message on Discord for feedback?

Sure, I will take a look

dev-coder2 · 2025-04-05T15:43:21Z

> Sure, I will take a look

Hi @singjc, I've sent you my proposal. Could you please check it?

dev-coder2 · 2025-04-10T09:31:03Z

Hi @singjc,

I have updated this PR to address all your feedback. Here's a summary of the changes:

Structure and Integration

Renamed Feature_extraction.py to feature_extraction.py and moved it to the data_preprocessing module
Created a dedicated feature_extraction_config.json file for better separation of concerns
Integrated the feature extraction functionality into the main CLI with a new extract-features command
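
As an illustration of the CLI integration, the sketch below wires an `extract-features` subcommand into a parser. This thread does not show which CLI framework the project uses, so `argparse` is an assumption here, as are the program name `dquartic`, the `--config` option on the subcommand, and the `dispatch` helper.

```python
import argparse


def build_parser():
    """Build a parser with an extract-features subcommand (layout is assumed)."""
    parser = argparse.ArgumentParser(prog="dquartic")  # program name is an assumption
    subparsers = parser.add_subparsers(dest="command", required=True)
    extract = subparsers.add_parser(
        "extract-features", help="Extract MS1/MS2 features from a config file"
    )
    # Assumed option; the default points at the config file named in this PR.
    extract.add_argument("--config", default="feature_extraction_config.json")
    return parser


def dispatch(argv):
    """Hypothetical dispatcher: report which command and config were selected."""
    args = build_parser().parse_args(argv)
    if args.command == "extract-features":
        # The real CLI would call into the data_preprocessing module here.
        return (args.command, args.config)
    raise SystemExit(f"Unknown command: {args.command}")
```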

GPU Acceleration and Fallback

Implemented cuDF's pandas accelerator mode that automatically falls back to standard pandas on systems without a GPU
Added proper detection and logging of GPU availability
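
A minimal sketch of the fallback pattern, assuming the accelerator is enabled programmatically with `cudf.pandas.install()` before pandas is imported. The logger name and the `enable_gpu_pandas` helper are hypothetical, and the PR's actual detection logic may differ.

```python
import logging

logger = logging.getLogger("feature_extraction")  # logger name is an assumption


def enable_gpu_pandas():
    """Try to switch pandas to cuDF's accelerator mode; return True on success."""
    try:
        import cudf.pandas  # needs a RAPIDS install and a CUDA-capable GPU

        cudf.pandas.install()  # must run before pandas is imported
    except Exception as exc:  # ImportError without cuDF, runtime errors without a GPU
        logger.info("GPU acceleration unavailable (%s); using standard pandas", exc)
        return False
    logger.info("cuDF pandas accelerator enabled")
    return True


gpu_enabled = enable_gpu_pandas()

import pandas as pd  # imported after the accelerator so it can be proxied
```

Code written against `pd` stays the same on both paths, which is what lets a single script run on GPU and CPU machines.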

Performance Optimization

Replaced the 3-level nested loop structure with joblib parallelization for the innermost loop
Made effective use of the threads parameter to process runs concurrently
Added proper vectorization of operations where possible
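
To make the loop replacement concrete, here is a sketch of parallelizing an innermost loop with joblib. `extract_one` is a hypothetical stand-in for the real per-item extraction, and the thread-based backend is an assumption chosen to keep the example light.

```python
from joblib import Parallel, delayed


def extract_one(run, precursor):
    """Hypothetical stand-in for the per-precursor extraction step."""
    return f"{run}:{precursor}"


def extract_run(run, precursors, threads=2):
    """Run the innermost loop in parallel instead of one item at a time."""
    # joblib returns results in the same order as the inputs.
    return Parallel(n_jobs=threads, prefer="threads")(
        delayed(extract_one)(run, precursor) for precursor in precursors
    )
```

The sequential equivalent would be `[extract_one(run, p) for p in precursors]`, so the change stays local to one loop.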

Error Handling and Validation

Added explicit checks for empty feature lists with descriptive error messages
Implemented validation to ensure data consistency between MS1 and MS2 results
Improved error reporting throughout the pipeline
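
The kind of checks described here might look like the sketch below. Calling `pd.concat` on an empty list raises a generic error, so an explicit check gives a clearer message. The function names and the `precursor_id` key column are hypothetical; the real validation may compare different fields.

```python
import pandas as pd


def combine_batches(batches, label):
    """Concatenate per-batch frames, failing early with a descriptive message."""
    if not batches:
        raise ValueError(
            f"No {label} feature batches were produced; "
            "check the input files and the paths in the config."
        )
    return pd.concat(batches, ignore_index=True)


def check_consistency(ms1_df, ms2_df, key="precursor_id"):
    """Raise if the two tables do not cover the same set of key values."""
    ms1_keys, ms2_keys = set(ms1_df[key]), set(ms2_df[key])
    mismatched = ms1_keys ^ ms2_keys  # keys present on only one side
    if mismatched:
        raise ValueError(
            f"MS1 and MS2 results disagree on {len(mismatched)} value(s) "
            f"of '{key}': {sorted(mismatched)[:5]}"
        )
```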

Documentation and Code Cleanup

Updated the docstring with detailed information about output files, columns, and typical file sizes
Removed unnecessary return statements from methods where return values weren't used
Removed confusing comments about import adjustments
Ensured consistent code style throughout

These improvements make the feature extraction process more robust, better integrated with the main project, and more efficient on both GPU and CPU systems while maintaining the same functionality.

add batch & parallel MS1/MS2 feature extraction script with config up…

e0e2674

…dates

singjc requested changes Apr 4, 2025


Improve feature extraction implementation

dd6e247

dev-coder2 force-pushed the feature_extraction_script branch from b33ddd1 to dd6e247 Compare April 10, 2025 10:04

dev-coder2 requested a review from singjc April 10, 2025 10:54
