A transformer-based model for genomic selection.
- preprocessing/split.py - Get the data information: split, env, and hybrid
  - input
    - data/raw/Training_Data/1_Training_Trait_Data_2014_2021.csv
    - data/raw/Testing_Data/1_Submission_Template_2022.csv
  - output
    - data/splits.csv
- preprocessing/vcf2num.sh - Convert the VCF to a numeric matrix
  - input
    - data/raw/Training_Data/5_Genotype_Data_All_Years.vcf
  - output
    - data/plink/geno.raw
- preprocessing/hybrid2p.py - Infer parental genotypes from the hybrid info provided by G2F
  - input
    - data/plink/geno.raw
  - output
    - data/g_parents.csv
- preprocessing/synthesize_f1.py - Synthesize the F1 genotypes back from the parental genotypes
  - input
    - data/g_parents.csv
  - output
    - data/g_f1.csv
- make_images.py - Combine genotype and EC data into feature images
  - input
    - data/g_parents.csv
    - data/splits.csv
    - data/raw/Training_Data/6_Training_EC_Data_2014_2021.csv
    - data/raw/Testing_Data/6_Testing_EC_Data_2022.csv
  - output
    - data/images/<split>/%id.png - feature images: 384 x 1152 x 3
    - data/images/<split>/annotation.txt - labels (yield)
- gsformer.py - GSformer PyTorch module
- train.py - Train the model
  - input
    - data/images/train/
    - data/images/val/
  - output
    - out/gsformer.pt - trained model weights
- inference.py - Make predictions on the test set
  - input
    - data/images/test/
    - out/gsformer.pt
  - output
    - out/pred.csv - raw predicted values
- submission.py - Format the submission file
  - input
    - out/pred.csv
    - data/splits.csv
    - data/raw/Testing_Data/1_Submission_Template_2022.csv
  - output
    - out/submission.csv - formatted submission file

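
The F1 synthesis step (preprocessing/synthesize_f1.py) can be illustrated with a minimal sketch. It assumes genotypes are coded as allele dosages (0/1/2, the additive coding PLINK writes to .raw files) and that each parent passes on one allele per marker, so the expected hybrid dosage is the mean of the two parental dosages. The function name and the toy vectors are hypothetical; the actual script may handle missing or heterozygous calls differently.

```python
import numpy as np


def synthesize_f1(p1, p2):
    """Expected F1 dosage from two parental dosage vectors (0/1/2 coding).

    Hypothetical helper for illustration only, not the repository's code.
    """
    p1 = np.asarray(p1, dtype=float)
    p2 = np.asarray(p2, dtype=float)
    if p1.shape != p2.shape:
        raise ValueError("parents must have the same number of markers")
    # Each parent contributes one allele per marker, i.e. half of its dosage.
    return (p1 + p2) / 2.0


# Toy example with four markers and two fully homozygous parents.
female = np.array([0, 2, 2, 0])
male = np.array([0, 2, 0, 2])
f1 = synthesize_f1(female, male)
print(f1)  # [0. 2. 1. 1.]
```

Markers where the parents carry different alleles come out heterozygous (dosage 1) in the hybrid, while markers where they agree keep the parental value.
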
- 26,213 SNP markers
- 567 (9 * 63) sequential EC variables (1-9 soil layers)
- 144 non-sequential EC variables
- data/images/<split>/annotation.txt - 1 label (yield)

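
The feature counts listed above can be tallied with a few lines of arithmetic. The variable names are made up for this sketch, and reading 9 * 63 as 9 soil layers with 63 variables each is an assumption. How the values are laid out inside the 384 x 1152 x 3 image is not described here, so the sketch only compares totals.

```python
# Counts taken from the feature list above; names are illustrative only.
N_SNPS = 26_213
N_SOIL_LAYERS = 9          # assumption: the "9" in 9 * 63 is the layer count
N_VARS_PER_LAYER = 63      # assumption: 63 sequential variables per layer
N_SEQ_EC = N_SOIL_LAYERS * N_VARS_PER_LAYER
N_NONSEQ_EC = 144
IMAGE_SHAPE = (384, 1152, 3)

n_features = N_SNPS + N_SEQ_EC + N_NONSEQ_EC
image_values = IMAGE_SHAPE[0] * IMAGE_SHAPE[1] * IMAGE_SHAPE[2]

print(N_SEQ_EC)      # 567
print(n_features)    # 26924
print(image_values)  # 1327104
```
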
- data/ - data files
  - images/ - feature images
    - train/ - training set
    - test/ - testing set
    - val/ - validation set
    - <split>/%id.png - feature images: 384 x 1152 x 3
    - <split>/annotation.txt - labels (yield)
  - plink/ - genotype data (PLINK)
    - geno.raw - genotype data (FID, IID, PAT, MAT, SEX, PHENOTYPE, 26213 SNPs)
  - raw/ - original dataset provided by G2F
    - Training_Data/ - training datasets
    - Testing_Data/ - testing datasets
  - g_f1.csv - F1 genotypes (synthesized)
  - g_parents.csv - parental genotypes (inferred)
  - splits.csv - envs and lines of each data split
- out/ - project outputs
  - gsformer.pt - trained model weights
  - pred.csv - raw predicted values
  - submission.csv - formatted submission file
- preprocessing/ - scripts for data preprocessing
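
The column layout listed for geno.raw (six sample columns followed by the SNP columns) suggests a simple way to load it. The following is a sketch, not code from this repository: it assumes the file is the usual whitespace-delimited PLINK .raw text with a header row, and the helper name and the two-marker toy table are hypothetical.

```python
import io

import pandas as pd

# The six leading sample columns of a PLINK .raw file.
META_COLS = ["FID", "IID", "PAT", "MAT", "SEX", "PHENOTYPE"]


def read_plink_raw(path_or_buffer):
    """Split a PLINK .raw table into sample metadata and an SNP dosage matrix."""
    df = pd.read_csv(path_or_buffer, sep=r"\s+")
    missing = [c for c in META_COLS if c not in df.columns]
    if missing:
        raise ValueError(f"not a PLINK .raw table, missing columns: {missing}")
    meta = df[META_COLS]
    snps = df.drop(columns=META_COLS)  # one column per SNP, values 0/1/2 or NaN
    return meta, snps


# Toy stand-in for data/plink/geno.raw with two samples and two markers.
example = io.StringIO(
    "FID IID PAT MAT SEX PHENOTYPE snp1_A snp2_G\n"
    "0 lineA 0 0 0 -9 0 2\n"
    "0 lineB 0 0 0 -9 2 NA\n"
)
meta, snps = read_plink_raw(example)
print(snps.shape)  # (2, 2)
```

Passing the path data/plink/geno.raw instead of the in-memory example should yield an SNP table with 26213 columns, given the layout described above. Missing calls written as NA are read as NaN by pandas.
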
