@dev-coder2
Contributor

Hi @singjc
This pull request introduces a new script, Feature_extraction.py, which converts our existing feature extraction notebook into a standalone script.

Key updates include:

  • Batch and Parallel Processing: The script processes DIA-NN parquet files to extract MS1 and MS2 features in batches, utilizing GPU acceleration with cuDF for efficient data processing.

  • Configuration Updates: dquartic_train_config.json now includes the file paths and settings the pipeline requires. For now, the MS1 and MS2 paths point to the main DIA-NN outputs. Once this script has generated the combined batch features for MS1 and MS2, these paths will be updated to point to the final CSV outputs.

  • Usage: MS1 and MS2 features can now be extracted by running:
    python Feature_extraction.py --config dquartic_train_config.json
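
For readers who have not seen the script, here is a minimal sketch of how a `--config` entry point like the one above might read its JSON file. The key names `ms1_data_path` and `ms2_data_path` are hypothetical placeholders, not the actual keys in dquartic_train_config.json, and the helper names are illustrative only.

```python
import argparse
import json
from pathlib import Path

# Hypothetical key names; the real config file may use different ones.
REQUIRED_KEYS = ("ms1_data_path", "ms2_data_path")


def parse_args(argv=None):
    """Parse the command line; argv=None falls back to sys.argv."""
    parser = argparse.ArgumentParser(description="Extract MS1 and MS2 features")
    parser.add_argument("--config", required=True, help="Path to the JSON config file")
    return parser.parse_args(argv)


def load_config(path):
    """Read the JSON config and fail early if a required key is missing."""
    config = json.loads(Path(path).read_text())
    missing = [key for key in REQUIRED_KEYS if key not in config]
    if missing:
        raise KeyError(f"{path} is missing required keys: {', '.join(missing)}")
    return config
```

Failing at load time keeps a misconfigured path from surfacing only after a long extraction run has started.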

This enhancement streamlines our workflow and improves processing efficiency for the D4 diffusion model pipeline. Please review the changes and merge them into the main branch.

Collaborator

@singjc singjc left a comment

Great work! Thanks for implementing this. I added a few comments and suggestions.

Comment on lines 261 to 270
    if ms1_list:
        batch_ms1_df = pd.concat(ms1_list, ignore_index=True)
        batch_ms1_filename = os.path.join(self.ms1_dir, f"ms1_features_batch_{batch_idx}.csv")
        batch_ms1_df.to_csv(batch_ms1_filename, index=False)
        ms1_all_batches.append(batch_ms1_df)
    if ms2_list:
        batch_ms2_df = pd.concat(ms2_list, ignore_index=True)
        batch_ms2_filename = os.path.join(self.ms2_dir, f"ms2_features_batch_{batch_idx}.csv")
        batch_ms2_df.to_csv(batch_ms2_filename, index=False)
        ms2_all_batches.append(batch_ms2_df)
Collaborator

Should we also check that matching data exists for both MS1 and MS2, and append only in that case? That would avoid cases where MS1 data was extracted but MS2 data was not. I'm not sure whether this actually happens.
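
One way such a guard could look, as a rough sketch rather than the code in this PR: a helper that appends to both lists only when both extractions produced data. The function name `append_if_paired` and its arguments are hypothetical.

```python
def append_if_paired(ms1_features, ms2_features, ms1_list, ms2_list):
    """Append both feature sets only when neither is missing or empty.

    Works for any sized container, including pandas DataFrames,
    because it relies only on len().
    """
    if ms1_features is None or ms2_features is None:
        return False
    if len(ms1_features) == 0 or len(ms2_features) == 0:
        return False
    ms1_list.append(ms1_features)
    ms2_list.append(ms2_features)
    return True
```

Returning a boolean lets the caller log or count the skipped cases, which would also show whether the mismatch ever occurs in practice.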

Contributor Author

I believe this scenario is unlikely. Our data was initially processed with DIA-NN and subsequently re-extracted with MassDash. These tools are designed to ensure consistent and proper MS1 and MS2 data.

@dev-coder2
Contributor Author

dev-coder2 commented Apr 4, 2025

Hi @singjc,

Thanks for the feedback. I am working through it point by point.
Just one request: can I send my GSoC proposal to you as a personal message on Discord for feedback?

@singjc
Collaborator

singjc commented Apr 4, 2025

> Just one request: can I send my GSoC proposal to you as a personal message on Discord for feedback?

Sure, I will take a look

@dev-coder2
Contributor Author

dev-coder2 commented Apr 5, 2025

> Sure, I will take a look

Hi @singjc, I've sent you my proposal. Could you please check it?

@dev-coder2
Contributor Author

Hi @singjc,

I have updated this PR to address all your feedback. Here's a summary of the changes:

Structure and Integration

  • Renamed Feature_extraction.py to feature_extraction.py and moved it to the data_preprocessing module
  • Created a dedicated feature_extraction_config.json file for better separation of concerns
  • Integrated the feature extraction functionality into the main CLI with a new extract-features command
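
As an illustration of the CLI integration described above, a subcommand can be wired up roughly as follows. This thread does not show which CLI library the project uses, so `argparse` is an assumption here, as are the program name `dquartic` and the default config path.

```python
import argparse


def build_parser():
    """Build a top-level parser with an extract-features subcommand."""
    parser = argparse.ArgumentParser(prog="dquartic")  # program name is assumed
    subcommands = parser.add_subparsers(dest="command", required=True)

    extract = subcommands.add_parser(
        "extract-features", help="Extract MS1 and MS2 features"
    )
    extract.add_argument(
        "--config",
        default="feature_extraction_config.json",  # assumed default location
        help="Path to the feature extraction config",
    )
    return parser


# Passing an explicit argument list makes the parser easy to exercise in tests.
args = build_parser().parse_args(["extract-features", "--config", "my_config.json"])
```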

GPU Acceleration and Fallback

  • Implemented cuDF's pandas accelerator mode that automatically falls back to standard pandas on systems without a GPU
  • Added proper detection and logging of GPU availability
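
A minimal sketch of the detection and fallback idea, not the code from this PR: try to enable `cudf.pandas` and log which path was taken. The function name `enable_gpu_pandas` and the logger name are hypothetical. Note that `cudf.pandas.install()` has to run before pandas is imported for the acceleration to take effect.

```python
import importlib
import logging

logger = logging.getLogger("feature_extraction")  # hypothetical logger name


def enable_gpu_pandas():
    """Try to enable the cudf.pandas accelerator; return True on success."""
    try:
        cudf_pandas = importlib.import_module("cudf.pandas")
        cudf_pandas.install()
    except Exception as exc:  # cuDF not installed, or CUDA failed to initialise
        logger.info("cudf.pandas unavailable (%s); using standard pandas", exc)
        return False
    logger.info("cudf.pandas accelerator enabled")
    return True


GPU_ENABLED = enable_gpu_pandas()
# Import pandas only after this point so the accelerator can intercept it:
# import pandas as pd
```

The same effect is available without code changes by launching a script with `python -m cudf.pandas script.py`. Once enabled, `cudf.pandas` also falls back to CPU pandas for individual operations it does not support.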

Performance Optimization

  • Replaced the 3-level nested loop structure with joblib parallelization for the innermost loop
  • Made effective use of the threads parameter to process runs concurrently
  • Added proper vectorization of operations where possible
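
To make the parallelization point concrete, here is a small sketch of fanning the innermost loop out over a worker pool. It uses `concurrent.futures` from the standard library as a stand-in for joblib so that it runs without extra dependencies; `extract_features` is a hypothetical placeholder for the real per-precursor extraction.

```python
from concurrent.futures import ThreadPoolExecutor


def extract_features(precursor_id):
    """Hypothetical stand-in for the real per-precursor extraction."""
    return {"precursor_id": precursor_id, "id_length": len(precursor_id)}


def extract_all(precursor_ids, threads=4):
    """Run the innermost loop concurrently; map() keeps the input order."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return list(pool.map(extract_features, precursor_ids))


results = extract_all(["PEPTIDEK2", "ELVISR3"], threads=2)
```

With joblib, the body of `extract_all` would be roughly `Parallel(n_jobs=threads)(delayed(extract_features)(p) for p in precursor_ids)`, which also returns results in input order.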

Error Handling and Validation

  • Added explicit checks for empty feature lists with descriptive error messages
  • Implemented validation to ensure data consistency between MS1 and MS2 results
  • Improved error reporting throughout the pipeline
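
The checks described above could take a shape like the following. This is an assumed sketch, not the validation in this PR: the function name, the use of `ValueError`, and the count-based consistency test are all illustrative choices.

```python
def validate_features(ms1_list, ms2_list, batch_idx):
    """Raise a descriptive error if a batch has empty or mismatched results."""
    if not ms1_list:
        raise ValueError(f"Batch {batch_idx}: no MS1 features were extracted")
    if not ms2_list:
        raise ValueError(f"Batch {batch_idx}: no MS2 features were extracted")
    if len(ms1_list) != len(ms2_list):
        raise ValueError(
            f"Batch {batch_idx}: {len(ms1_list)} MS1 results "
            f"but {len(ms2_list)} MS2 results"
        )
```

Including the batch index in every message makes it possible to trace a failure back to the input files that produced it.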

Documentation and Code Cleanup

  • Updated the docstring with detailed information about output files, columns, and typical file sizes
  • Removed unnecessary return statements from methods where return values weren't used
  • Removed confusing comments about import adjustments
  • Ensured consistent code style throughout

These improvements make the feature extraction process more robust, better integrated with the main project, and more efficient on both GPU and CPU systems while maintaining the same functionality.

@dev-coder2 dev-coder2 force-pushed the feature_extraction_script branch from b33ddd1 to dd6e247 Compare April 10, 2025 10:04
@dev-coder2 dev-coder2 requested a review from singjc April 10, 2025 10:54
2 participants