Add Feature_extraction.py for Batch & Parallel MS1/MS2 Feature Extraction #36
Conversation
singjc left a comment:
Great work! Thanks for implementing this. I added a few comments and suggestions.
```python
if ms1_list:
    batch_ms1_df = pd.concat(ms1_list, ignore_index=True)
    batch_ms1_filename = os.path.join(self.ms1_dir, f"ms1_features_batch_{batch_idx}.csv")
    batch_ms1_df.to_csv(batch_ms1_filename, index=False)
    ms1_all_batches.append(batch_ms1_df)
if ms2_list:
    batch_ms2_df = pd.concat(ms2_list, ignore_index=True)
    batch_ms2_filename = os.path.join(self.ms2_dir, f"ms2_features_batch_{batch_idx}.csv")
    batch_ms2_df.to_csv(batch_ms2_filename, index=False)
    ms2_all_batches.append(batch_ms2_df)
```
Should we also check that there is matching data for both MS1 and MS2, and only append when that's the case? This would avoid situations where MS1 data was extracted but MS2 data was not. I'm not sure whether this actually happens.
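For illustration, one way to implement such a guard might look like the following (a hypothetical sketch, not code from the PR; the helper name and the logging call are assumptions):

```python
import logging

import pandas as pd

def append_matched_batches(ms1_list, ms2_list, ms1_all_batches, ms2_all_batches, batch_idx):
    """Append a batch only when both MS1 and MS2 frames were extracted for it."""
    if ms1_list and ms2_list:
        ms1_all_batches.append(pd.concat(ms1_list, ignore_index=True))
        ms2_all_batches.append(pd.concat(ms2_list, ignore_index=True))
    else:
        # Keep the two accumulators aligned by skipping unmatched batches.
        logging.warning("Batch %s: MS1 or MS2 data missing, batch skipped", batch_idx)
```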
I believe this scenario is unlikely. Our data was initially processed with DIA-NN and subsequently re-extracted with MassDash. These tools are designed to ensure consistent and proper MS1 and MS2 data.
Hi @singjc, I have taken your feedback and am going through it point by point.

Sure, I will take a look.

Hi @singjc, I've sent you my proposal. Could you please check it?
Hi @singjc, I have updated this PR to address all your feedback. Here's a summary of the changes:

- Structure and Integration
- GPU Acceleration and Fallback (a brief sketch of this pattern follows at the end of this comment)
- Performance Optimization
- Error Handling and Validation
- Documentation and Code Cleanup

These improvements make the feature extraction process more robust, better integrated with the main project, and more efficient on both GPU and CPU systems, while maintaining the same functionality.
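For reference, a minimal sketch of the GPU-with-CPU-fallback idea mentioned under "GPU Acceleration and Fallback" (illustrative only; the actual implementation in Feature_extraction.py may differ):

```python
# Minimal illustration of a cuDF fallback pattern; not the PR's actual code.
try:
    import cudf as xdf           # GPU-accelerated, pandas-like API
    GPU_AVAILABLE = True
except ImportError:
    import pandas as xdf         # CPU fallback that keeps the same call sites working
    GPU_AVAILABLE = False

def read_features(parquet_path):
    """Read a DIA-NN parquet file with whichever backend is available."""
    df = xdf.read_parquet(parquet_path)
    # Downstream pandas-based code expects a pandas DataFrame, so convert when on GPU.
    return df.to_pandas() if GPU_AVAILABLE else df
```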
Hi @singjc
This pull request introduces a new script, Feature_extraction.py, which converts our existing feature extraction notebook into a standalone script.
Key updates include:
- Batch and Parallel Processing: The script processes DIA-NN parquet files to extract MS1 and MS2 features in batches, using GPU acceleration with cuDF for efficient data processing (a rough sketch of this layout follows this list).
- Configuration Updates: Minor updates have been made to dquartic_train_config.json to include the required file paths and settings for the pipeline. Currently, the MS1 and MS2 paths point to the main outputs from DIA-NN; once the combined batch features for MS1 and MS2 are generated by this script, these paths will be updated to point to the final CSV outputs.
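For illustration, the batch-and-parallel layout could be sketched roughly as below; the names extract_features_from_file, batch_size, and workers are hypothetical and not taken from the PR:

```python
import pandas as pd
from concurrent.futures import ProcessPoolExecutor

def extract_features_from_file(parquet_path):
    """Placeholder per-file extraction; the real MS1/MS2 logic lives in Feature_extraction.py."""
    return pd.read_parquet(parquet_path)

def process_in_batches(parquet_files, batch_size=8, workers=4):
    """Split the input files into batches and extract each batch in parallel."""
    for start in range(0, len(parquet_files), batch_size):
        batch = parquet_files[start:start + batch_size]
        with ProcessPoolExecutor(max_workers=workers) as pool:
            frames = list(pool.map(extract_features_from_file, batch))
        yield start // batch_size, pd.concat(frames, ignore_index=True)
```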
We can now extract MS1 and MS2 features by simply running:
`python Feature_extraction.py --config dquartic_train_config.json`
This enhancement streamlines our workflow and improves processing efficiency for the D4 diffusion model pipeline. Please review the changes and merge them into the main branch.