-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathreadme.txt
450 lines (225 loc) · 32.3 KB
/
readme.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
# latent_phenotype_project
NOTE: File names are listed and described in the order that they are supposed to be run.
NOTE: Any file name containing "step0" is not meant to be run directly. Rather, some other file in the sequence calls that file to run when necessary.
NOTE: Descriptions for files that are called by other files but do not have "step0" in the name start with "DO NOT RUN DIRECTLY"
NOTE: Descriptions starting with "IMPORTANT; run in sections, not all at once" have sections that need to be completed manually in the code. You may run the code in subsections that do not overlap the sections to be completed manually.
NOTE: Directories are ordered from top to bottom as the sequence in which they should be run.
## Directory: step1_get_phenotypes_complex
- `step1a_get_UKB_phenotypes.py`: Imports and processes UK biobank (UKB) data fields. Handles multiple measurements, modifies smoking pack years, calculates annual alcohol consumption, and binarizes categorical variables.
- `step1a_library.py`: Functions used by `step1a_get_UKB_phenotypes.py`.
## Directory: step1_get_phenotypes_simple
- `step1a_import_HF_ICD_codes_and_covariates.py`: Imports additional UKB data fields related to heart failure; binarizes and renames as needed.
- `step1b_process_HF_ICD_codes_and_covariates.py`: Binarizes ICD codes, creates the all cause heart failure binary phenotype ('AHF' in text, 'any_HF' in code), and removes ICD codes not significantly correlated to AHF.
## Directory: step2_get_UKB_samples
- `step2a_make_UKB_sample_getters.py`: Generates bash scripts for importing individual' non-imputed genotype data.
- `step2b_get_UKB_samples.sh`: Executes bash scripts from the previous step to collect genotype samples.
## Directory: step3_merge_chr_and_remove_quitters
- `step3a_get_people_who_quit.py`: generates a list of individuals who opted out of UKB data use (from other manually typed lists).
- `step3b_merge_datasets.sh`: Merges plink files from Step 2; removes opt-out individuals.
- `step3c_find_SNP_cutoffs.py`: Plots SNPs' minor allele frequencies, missingness, and HWE p-values (exploratory only).
- `step3d_remove_bad_SNPs.sh`: Removes SNPs exceeding thresholds for aforementioned quantities.
- `step3e_find_sample_cutoffs.py`: Plots individuals' heterozygosity, average SNP missingness, and X chromosome heterozygosity scores (exploratory only).
- `step3f_remove_bad_samples.sh`: Removes individuals exceeding thresholds for aforementioned quantities.
## Directory: step4_remove_relatives
- `step4a_divide_eids_into_subsets_prep.py`: Splits remaining eids from Step 3 into 10 equal subsets.
- `step4b_divide_eids_into_subsets.sh`: Partitions plink files based on the 10 eid subsets; removes remainder eids.
- `step4c_make_king_subjobs.py`: Generates a bash script that runs KING for each SNP subset and each pair of SNP subsets.
- `step4d_run_king_subjobs.sh`: Runs KING through the previously generated bash scripts. King outputs all pairs of related individuals with 3rd degree relatedness or more.
- `step4e_get_unrelated_eids.py`: Prunes dataset to keep only 4th degree or less related individuals. The number of all-cause heart failure cases is prioritized over the total sample size. Both are maximized with that in mind.
- `step4f_remove_relatives.sh`: Excludes relatives based on prior steps. From these unrelated individuals, creates a pruned SNP dataset such that all SNPs are in low LD for genetic principle component computation. Also keeps the non-pruned dataset.
## Directory: step5_verify_sample_unrelatedness
- `step5a_divide_eids_into_subsets_prep.py`: Splits eids from Step 3 into 10 equal subsets.
- `step5b_divide_eids_into_subsets.sh`: Partitions plink files into 10 subsets corresponding to the eid subsets.
- `step5c_make_king_subjobs.py`: Generates a bash script to run KING for each SNP subset and each pair of SNP subsets.
- `step5d_run_king_subjobs.sh`: Executes the KING subjob scripts.
- `step5e_confirm_unrelatedness.py`: Confirms that no related individuals remain post-Step 4. This validates that KING was used correctly.
## Directory: step6_PCA
- `check_LD.sh`: Computes LD between SNP pairs in pruned SNP set from Step 4.
- `check_LD.py`: Validates that plink was used correctly, so no SNP pairs exceed the LD R^2 threshold from Step 4.
- `step0_run_PCA.par`: Specifies parameters for Eigensoft's PCA computation.
- `step6a_make_PCA_documents.py`: Prepares input files for Eigensoft's PCA.
- `step6b_run_PCA.sh`: Executes PCA computation using Eigensoft.
## Directory: step7_adjust_HF_for_covariates_logistic_PCA
- `step7a_get_HF_ICD_codes_unrelated.py`: Imports phenotypes and relevant UK Biobank fields for unrelated individuals from Step 4.
- `step7b_logPCA_transform_the_data.py`: DO NOT RUN DIRECTLY. Applies logistic PCA to 311 ICD10 codes and heart failure. Creates latent phenotypes (k specified via argparse) and adjusts them using PCs from Step 6.
- `step7b_logPCA_transform_the_data.sh`: Executes the above for k = 1 to 20; paper uses k=15.
- `step7c_impute_missing_values.py`: DO NOT RUN DIRECTLY. Applies MICE imputation to environmental factors with missing values. Correlation threshold for feature selection specified via "nn".
- `step7c_impute_missing_values.sh`: Runs the above for various nn values; paper uses nn=0.05.
- `step7d_get_imputed_values_and_trasformed_variables.py`: Similar to above, but sets nn at 0.05 and does not simulate missingness. Outputs imputed environmental factors.
- `step7d_get_imputed_values_and_trasformed_variables.sh`: Executes the above, recommended for job submission due to long runtime.
- `step7e_get_PCs_effs.py`: Outputs logistic regression beta coefficients for genetic PCs vs AHF, used to correct the standard logistic regression GWAS against AHF.
## Directory: step7_adjust_HF_for_covariates_NN
- `step7a_create_AE_phenotypes.py`: DO NOT RUN DIRECTLY. Computes autoencoder test accuracy based on layer nodes, cross-validation folds, and dropout rate. Generates latent phenotypes.
- `step7a_create_AE_phenotypes.sh`: Runs the above for specified nodes, folds, and dropout rates; paper uses 0.3 dropout.
- `step7b_create_best_phenotypes_normal_AE_0.3dop.py`: Trains final autoencoder model twice, using the first run's weights as a starting point for the second.
- `step7b_create_best_phenotypes_normal_AE_0.3dop.sh`: Executes the above model training.
- `step7c_impute_missing_values.py`: DO NOT RUN DIRECTLY. Uses MICE to impute missing environmental data, selecting features based on correlation "nn".
- `step7c_impute_missing_values.sh`: Executes imputation for various "nn"; paper uses nn=0.05.
- `step7d_get_imputed_values_and_transformed_variables.py`: Similar to above, fixes "nn" at 0.05 and outputs imputed factors.
- `step7d_get_imputed_values_and_transformed_variables.sh`: Executes the above, recommended for job submission.
- `step7e_compute_network_shapley_values.py`: Calculates Shapley values for each latent phenotype.
- `step7e_compute_network_shapley_values.sh`: Applies the above to all latent phenotypes.
- `step7f_analyze_shapley_values.py`: Produces dendrograms based on correlations of high-impact or high-correlation Shapley values.
## Directory: step7_adjust_HF_for_covariates_PCA
- `step0_compute_SV_p_values.py`: Identifies key ICD10 codes affecting SNP-latent phenotype correlations. Used by `step7g_sub_phenotype_analysis.sh`.
- `step7a_get_HF_ICD_codes_unrelated.py`: Fetches unrelated individuals' phenotypes and UK Biobank fields from Step 4.
- `step7b_PCA_transform_the_data.py`: Performs PCA on 311 ICD10 codes and all-cause heart failure, adjusting latent phenotypes with PCs from Step 6.
- `step7c_impute_missing_values.py`: Applies MICE imputation to environmental factors, using a fixed "nn" of 0.05. Confirmed to outperform mean imputation.
- `step7d_get_imputed_values_and_transformed_variables.py`: Similar to `step7c`, but with "nn" fixed at 0.05 and no missingness simulation. Outputs imputed factors.
- `step7e_compute_network_shapley_values.py`: Calculates Shapley values for ICD10 codes and heart failure in PCA-based latent phenotypes.
- `step7f_analyze_shapley_values.py`: Generates dendrograms based on Shapley value correlations.
- `step7g_sub_phenotype_analysis.sh`: Executes `step0_compute_SV_p_values.py` for varying numbers of top-contributing ICD10 codes.
## Directory: step8_get_imputed_ukb_samples
- `step8.1_get_imputed_data_setup.py`: Generates shell scripts for importing GWAS-targeted, imputed SNPs per chromosome.
- `step8.2_get_rsID_positions.py`: Lists SNPs to remove per chromosome based on low MAF or insufficient information.
- `step8.3_get_eids.py`: copies a file identifying which unrelated individuals to retain in the analysis.
- `step8.4_get_imputed_data.sh`: Executes shell scripts from `step8.1`.
- `step8.5_make_bims_tab_spaced.py`: Converts BIM files to tab-spaced format.
## Directory: step9_regress_phenotypes_against_SNPs_logistic_PCA
- `step0_binary_HF_QTL_getter.py`: Performs logistic regression on a subset of SNPs against all-cause heart failure, corrected by genetic PCs.
- `step0_QQ_plot_getter.py`: Generates a QQ plot for all SNPs' real vs expected p values with respect to one latent phenotype.
- `step0_significant_QTL_getter.py`: Regresses SNPs against a genetic PC-corrected latent phenotype using linear regression, EDGE, and target encoding. Manually set "env_name" on line 87 for GxE effects. Examined Env_name values include 'pack-years', 'annual-consumption', ['874-average', '894-average', '914-average'] (a python list), and '22001-0.0'
- `step9a_get_genotype_metadata.sh`: gets SNPs' minor allele frequencies. Having these pre-computed makes the GWAS faster.
- `step9b_get_significant_QTLs_setup.py`: Creates bash scripts for each (latent phenotype, chromosome) pair. Remember to update line 87 in `step0_binary_HF_QTL_getter.py` to change the environmental factor.
- `step9c_get_significant_QTLs.sh`: Executes scripts generated by the previous step.
- `step9d_get_QQ_plots.sh`: Applies `step0_QQ_plot_getter.py` to all latent phenotypes.
- `step9e_get_binary_HF_QTLs.sh`: creates bash scripts to run `step0_binary_HF_QTL_getter.py` on all SNP subsets in parallel.
- `step9f_get_QQ_plots_normal_GWAS.sh`: applies `step0_QQ_plot_getter.py` to all cause heart failure.
## Directory: step9_regress_phenotypes_against_SNPs_NN
- `step0_QQ_plot_getter.py`: Generates a QQ plot for all SNPs' real vs expected p values with respect to one latent phenotype.
- `step0_significant_QTL_getter.py`: Regresses SNPs against a genetic PC-corrected latent phenotype using linear regression, EDGE, and target encoding. Manually set "env_name" on line 87 for GxE effects. Examined Env_name values include 'pack-years', 'annual-consumption', ['874-average', '894-average', '914-average'] (a python list), and '22001-0.0'
- `step9a_get_genotype_metadata.sh`: gets SNPs' minor allele frequencies. Having these pre-computed makes the GWAS faster.
- `step9b_get_significant_QTLs_setup.py`: Creates bash scripts for each (latent phenotype, chromosome) pair. Remember to update line 87 in `step0_binary_HF_QTL_getter.py` to change the environmental factor.
- `step9c_get_significant_QTLs.sh`: Executes scripts generated by the previous step.
- `step9d_get_QQ_plots.sh`: Applies `step0_QQ_plot_getter.py` to all latent phenotypes.
## Directory: step9_regress_phenotypes_against_SNPs_PCA
- `step0_QQ_plot_getter.py`: Generates a QQ plot for all SNPs' real vs expected p values with respect to one latent phenotype.
- `step0_significant_QTL_getter.py`: Regresses SNPs against a genetic PC-corrected latent phenotype using linear regression, EDGE, and target encoding. Manually set "env_name" on line 87 for GxE effects. Examined Env_name values include 'pack-years', 'annual-consumption', ['874-average', '894-average', '914-average'] (a python list), and '22001-0.0'
- `step9a_get_genotype_metadata.sh`: gets SNPs' minor allele frequencies. Having these pre-computed makes the GWAS faster.
- `step9b_get_significant_QTLs_setup.py`: Creates bash scripts for each (latent phenotype, chromosome) pair. Remember to update line 87 in `step0_binary_HF_QTL_getter.py` to change the environmental factor.
- `step9c_get_significant_QTLs.sh`: Executes scripts generated by the previous step.
- `step9d_get_QQ_plots.sh`: Applies `step0_QQ_plot_getter.py` to all latent phenotypes.
## Directory: step10_get_significant_SNPs_logistic_PCA
- `step0_compute_GxE_p_values.py`: given an rsID, chromosome, latent phenotype, and environmental factor as input, computes a permutation test p value for the pure GxE effect.
- `step0_filter_significant_SNPs_and_get_GxE_effects.py`: For each chromosome and environmental factor, for each latent phenotype, segments SNP hits into intervals, and selects independently nominally significant SNPs. For each independent SNP hit, prepares a bash file to run `step0_compute_GxE_p_values.py`
- `step0_filter_significant_SNPs_and_get_GxE_effects_LAST_PART_ONLY.py`: If independent SNP hits files are available, prepares bash files to run `step0_compute_GxE_p_values.py as described previously.
- `step10a_get_significant_rsIDs.py`: Retreives all rsIDS corresponding to SNP hits with a nominal TRACE p value < 5E-8/16 (16 is a bonferroni correction for the number of latent phenotypes)
- `step10b_get_significant_SNPs.sh`: Retreives plink files for all SNPs from the previous step
- `step10c_filter_significant_SNPs_and_get_GxE_effects.sh`: applies `step0_filter_significant_SNPs_and_get_GxE_effects.py` to all chromosomes and environmental factors
- `step10c_filter_significant_SNPs_and_get_GxE_effects_LAST_PART_ONLY.sh`: applies `step0_filter_significant_SNPs_and_get_GxE_effects_LAST_PART_ONLY.py` to all chromosomes and environmental factors
- `step10d_get_significant_GxE_p_values.sh`: applies `step0_compute_GxE_p_values.py` to all independently nominally significant rsIDs.
- `step10e_access_common_SNPs.py`: Generates lists of main and interaction effects for various environmental factors with logistic PCA latent phenotypes, counts of each effect, and data for table 1a (step10e_p_val_analysis.txt). Machine learning is deferred to step10g.
- `step10f_get_CV_folds.py`: Generates 30 outer training/validation index sets for nested cross-validation.
- `step10g_get_CV_testing_accuracy.py`: For 1 of the 30 outer index sets, conducts 10-fold cross-validation on the training set. Reports optimal model parameters and validation accuracy.
- `step10g_get_CV_testing_accuracy.sh`: applies `step10g_get_CV_testing_accuracy.py` to all 30 outer training/validation index sets.
## Directory: step10_get_significant_SNPs_NN
- `step0_compute_GxE_p_values.py`: given an rsID, chromosome, latent phenotype, and environmental factor as input, computes a permutation test p value for the pure GxE effect.
- `step0_filter_significant_SNPs_and_get_GxE_effects.py`: For each chromosome and environmental factor, for each latent phenotype, segments SNP hits into intervals, and selects independently nominally significant SNPs. For each independent SNP hit, prepares a bash file to run `step0_compute_GxE_p_values.py`
- `step0_filter_significant_SNPs_and_get_GxE_effects_LAST_PART_ONLY.py`: If independent SNP hits files are available, prepares bash files to run `step0_compute_GxE_p_values.py as described previously.
- `step10a_get_significant_rsIDs.py`: Retreives all rsIDS corresponding to SNP hits with a nominal TRACE p value < 5E-8/16 (16 is a bonferroni correction for the number of latent phenotypes)
- `step10b_get_significant_SNPs.sh`: Retreives plink files for all SNPs from the previous step
- `step10c_filter_significant_SNPs_and_get_GxE_effects.sh`: applies `step0_filter_significant_SNPs_and_get_GxE_effects.py` to all chromosomes and environmental factors
- `step10c_filter_significant_SNPs_and_get_GxE_effects_LAST_PART_ONLY.sh`: applies `step0_filter_significant_SNPs_and_get_GxE_effects_LAST_PART_ONLY.py` to all chromosomes and environmental factors
- `step10d_get_significant_GxE_p_values.sh`: applies `step0_compute_GxE_p_values.py` to all independently nominally significant rsIDs.
- `step10e_access_common_SNPs.py`: Generates lists of main and interaction effects for various environmental factors with NN latent phenotypes, counts of each effect, and data for table 1a (step10e_p_val_analysis.txt). Machine learning is deferred to step10g.
- `step10f_get_CV_folds.py`: Generates 30 outer training/validation index sets for nested cross-validation.
- `step10g_get_CV_testing_accuracy.py`: For 1 of the 30 outer index sets, conducts 10-fold cross-validation on the training set. Reports optimal model parameters and validation accuracy.
- `step10g_get_CV_testing_accuracy.sh`: applies `step10g_get_CV_testing_accuracy.py` to all 30 outer training/validation index sets.
## Directory: step10_get_significant_SNPs_PCA
- `step0_compute_GxE_p_values.py`: given an rsID, chromosome, latent phenotype, and environmental factor as input, computes a permutation test p value for the pure GxE effect.
- `step0_filter_significant_SNPs_and_get_GxE_effects.py`: For each chromosome and environmental factor, for each latent phenotype, segments SNP hits into intervals, and selects independently nominally significant SNPs. For each independent SNP hit, prepares a bash file to run `step0_compute_GxE_p_values.py`
- `step0_filter_significant_SNPs_and_get_GxE_effects_LAST_PART_ONLY.py`: If independent SNP hits files are available, prepares bash files to run `step0_compute_GxE_p_values.py as described previously.
- `step10a_get_significant_rsIDs.py`: Retreives all rsIDS corresponding to SNP hits with a nominal TRACE p value < 5E-8/16 (16 is a bonferroni correction for the number of latent phenotypes).
- `step10b_get_significant_SNPs.sh`: Retreives plink files for all SNPs from the previous step
- `step10c_filter_significant_SNPs_and_get_GxE_effects.sh`: applies `step0_filter_significant_SNPs_and_get_GxE_effects.py` to all chromosomes and environmental factors
- `step10c_filter_significant_SNPs_and_get_GxE_effects_LAST_PART_ONLY.sh`: applies `step0_filter_significant_SNPs_and_get_GxE_effects_LAST_PART_ONLY.py` to all chromosomes and environmental factors
- `step10d_get_significant_GxE_p_values.sh`: applies `step0_compute_GxE_p_values.py` to all independently nominally significant rsIDs.
- `step10e_access_common_SNPs.py`: Generates lists of main and interaction effects for various environmental factors with PCA latent phenotypes, counts of each effect, and data for table 1 (step10e_p_val_analysis.txt). Machine learning is deferred to step10g.
- `step10f_get_CV_folds.py`: Generates 30 outer training/validation index sets for nested cross-validation.
- `step10g_get_CV_testing_accuracy.py`: For 1 of the 30 outer index sets, conducts 10-fold cross-validation on the training set. Reports optimal model parameters and validation accuracy.
- `step10g_get_CV_testing_accuracy.sh`: applies `step10g_get_CV_testing_accuracy.py` to all 30 outer training/validation index sets.
## Directory: step11_analyze_complete_dataset
### Code Files
- `step11a_merge_rsID_output.py`: Generates Table 1b and effect-specific rsID file pairs (rsIDs_*_effects.txt and rsIDs_*_effects_pvals.txt, where * is the effect type) for FUMA input.
- `Important Note on FUMA details`: For each rsID file pair, Inputs both files into FUMA with a placeholder total sample size of 380000. It is merely a required input that does not effect the distances between SNP hits and genes, thereby making it irrelevant. Initially uses a 500KB gene-SNP distance, later refined to 300KB for precision with small gene count reduction. Skips gene by exercise interactions due to limited hits. Renames FUMA output files as annov_*.txt. for clarity, where * = (main, smoking, gender, alcohol).
- `step11b_get_enrichment_all.R`: IMPORTANT; run in sections, not all at once. Intermediate manual steps generate miEAA.csv, which lists KEGG-enriched miRNA pathways. Uses FUMA outputs and rsIDs_*_effects.txt as input. Generates Tables 2 and S6.
- `step11c_get_chr_seperated_lists.py`: makes one file per SNP hit for input into the LDlink tool.
- `step11d_ldlink.sh`: Sends files from previous step to LDlink, fetching data on prior GWAS hits in LD with each SNP hit.
- `step11e_make_SNP_tables_LD_pruning.py`: Creates Table 1a, counting independent GWAS SNP hits that output from step11d_ldlink.sh finds are related to AHF. The term list "possible_AHF_terms" was manually curated and validated through substring matching of terms related to cardiovascular dysfunction in the output from step11d_ldlink.sh.
- `step11f_miRNA_enrichment_analysis.R`: IMPORTANT; run in sections, not all at once. Computes p-values for enrichment of genic SNP hits in genes that enrich miRNA. Produces Tables 3 and S7. Manual steps are clearly outlined for gene-disease associations. Also uses data from Genevestigator ("see genevestigator_hits_methods folder").
- `step11g_make_heritability_figure.py`: Generates Table S5b and Figure 3a using outputs from step10g_get_CV_testing_accuracy.sh across PCA, logistic_PCA, and NN directories.
- `step11h_confirm_model_consistency.py`: Generates Tables 1c and 1d as per manuscript methods.
- `step11i_get_independent_normal_GWAS_SNPs.py`: Identifies SNP hits from standard logistic GWAS that are independent of six default AHF-defining SNPs.
- `step11j_make_figure3b.sh`: Prepares files for Figure 3b, including six default SNPs and one additional independent SNP (rs73188900) found in the previous step.
- `step11k_make_table1.R`: Creates Figure 3b and Table 1 using the output from the previous step. Also generates figure 1.
- `step11l_make_supp_figs.py`: Produces Tables S1-S4 and Figures S1, S2, and partial input for Figure S3a.
- `step11m_finish_fig_S3a.R`: Completes Figure S3a.
### Intermediate Files
- `annov_alcohol.txt`: relevant FUMA output when using `rsIDs_GxAlcohol_effects.txt` and `rsIDs_GxAlcohol_effects_pvals.txt` as input. Originally named "annov.txt".
- `annov_gender.txt`: relevant FUMA output when using `rsIDs_GxGender_effects.txt` and `rsIDs_GxGender_effects_pvals.txt` as input. Originally named "annov.txt".
- `annov_main.txt`: relevant FUMA output when using `rsIDs_main_effects.txt` and `rsIDs_main_effects_pvals.txt` as input. Originally named "annov.txt".
- `annov_smoking.txt`: relevant FUMA output when using `rsIDs_GxSmoking_effects.txt` and `rsIDs_GxSmoking_effects_pvals.txt` as input. Originally named "annov.txt".
- `disease_gene_associations_logistic_PCA_smoking.tsv`: Output from entering "CASZ1::AKR7A3::PPIE::HIVEP3::SSBP3::ST6GALNAC3::DPYD::NOS1AP::DDR2::SLC9A2::THSD7B::RAPGEF4::GPR155::ALS2::CHL1::IL5RA::THRB::CACNA2D3::CADPS::PRICKLE2::ROBO1::GAP43::LSAMP::AGTR1::MED12L::SERPINI1::MECOM::PARL::ST6GAL1::MUC4::EVC2::NPNT::HPGD::TENM3::LPCAT1::DROSHA::PRLR::GHR::EDIL3::CHSY3::FAM53C::EXOC2::ATXN1::BTBD9::TFEB::RUNX2::ESR1::TMEM242::PDE10A::COX19::ICA1::CREB5::PKD1L1::CALN1::CALCR::RINT1::ATXN7L1::DGKI::DPP6::TNFRSF10B::ADRA1A::ASPH::HNF4G::UBR5::TBC1D31::TMEM65::SH3GL2::FBP1::PALM2-AKAP2::GARNL3::FAM107B::CXCL12::MYPN::CDH23::CCDC147::SORCS3::RBM20::GRK5::ATE1::SYT9::SBF2::SLC22A8::TENM4::MAML2::NTM::TMTC1::ANO4::ATP12A::LRCH1::HTR2A::KLHL1::GPC6::FNTB::GALNT16::SMOC1::MAP3K9::KCNK13::C14orf159::ITPK1::UNC79::EML1::RYR3::WDR72::UNC13C::ARNT2::KIAA1199::PDE8A::SLCO3A1::CDIP1::NXN::PIK3R5::CA10::GAA::PIEZO2::BCL2::CLEC4M::ZNF627::SLC1A6::KDELR1::MACROD2::PCSK2::STX16-NPEPL1::CDH4::RIPK4::POF1B" into https://www.disgenet.org/search. Output file was originally named "54897__22977__10450__59269__23648__256435__1806__9722__4921__6549__80731__11069__151556__57679__10752__3568__7068__55799__8618__166336__6091__2596__4045__185__116931__5274__2122__55486__6480__4585__132884__2557.tsv" and was downloaded on 9/12/2023.
- `disease_gene_associations_NN_smoking.tsv`: Output from entering "EPHA8::PGM1::SCCPDH::CNTNAP5::PLEKHM3::CNTN3::MAP1B::GRK6::GLP1R::COL19A1::SCARA5::ANP32B::NAV2::KSR2::TMEM132B::XYLT1::AP2B1::DMD" into https://www.disgenet.org/search. Output file was originally named "2046__5236__129684__389072__5067__4131__2870__2740__1310__286133__10541__89797__283455__114795__64131__163__1756_gene_gda_summary.tsv" and was downloaded on 9/12/2023.
- `disease_gene_associations_PCA_smoking.tsv`: Output from entering "CAMTA1::ECE1::HSPG2::CSMD2::NFIA::LPAR3::CHIA::KCNN3::PAPPA2::SOX13::PTPN14::TP53BP2::DNAH14::PCNXL2::GPR137B::RYR2::OR2L13::GRHL1::ALK::NDUFAF7::GALM::SLC8A1::PRKCE::STON1-GTF2A1L::GPR75-ASB3::EML6::ARHGAP25::LRRTM4::CTNNA2::ACVR1::STK39::CCDC173::PDE11A::CCDC141::PAX3::SRGAP3::DYNC1LI1::ROBO2::GUCA1C::IGSF11::ADCY5::NEK11::MLF1::PHC3::TNIK::KCNMB2::EIF2B5::EIF2B5::LPP::DLG1::CLNK::KCNIP4::GPR125::GABRB1::FIP1L1::EPHA5::DCLK2::KIAA0922::RAPGEF2::TRIO::MYO10::EGFLAM::PDE4D::ARHGEF28::ARSB::EFNA5::GRAMD3::UBE2D2::STK32A::COL23A1::F13A1::DEK::CDKAL1::OPN5::EYS::EPHA7::PREP::CLVS2::TRDN::PERP::HIVEP2::SYNE1::SMOC2::GET4::TNRC18::HDAC9::JAZF1::AMPH::GLI3::WBSCR17::GTPBP10::COL26A1::TBXAS1::CNTNAP2::CSMD1::SGCZ::CSGALNACT1::PIWIL2::DPYSL2::ELP3::ZMAT4::XKR4::CNGB3::SDC2::NCALD::MTSS1::ASAP1::KCNQ3::SLA::ZFAT::KANK1::AK3::OSTF1::PCSK5::GNAQ::TLE1::STX17::ZNF462::SUSD1::COL27A1::AKNA::BRINP1::DENND1A::CAMK1D::C1QL3::RSU1::PARD3::TSPAN14::HPS1::CTBP2::C10orf90::TMEM41B::INSC::NELL1::SLC17A6::SHANK2::DLG2::RAB38::OPCML::RAD52::SLC2A14::GRIN2B::TMCC3::ANKS1B::NUP37::BRAP::DTX1::CCDC92::TMEM132D::CRYL1::ATP8A2::FREM2::FAM124A::PCDH9::ABCC4::NALCN::COL4A1::DCUN1D2::NPAS3::EGLN3::RGS6::CDC42BPB::ATP10A::GABRB3::GABRG3::MEIS2::FBN1::CGNL1::RORA::VPS13C::ZNF609::KIAA1024::PDE8A::MFGE8::ABHD2::CERS3::ADAMTS18::WWOX::CDH13::STX8::PRKCA::SLC39A11::RIT2::ST8SIA5::DCC::MBP::ATP9B::ATCAY::TICAM1::C19orf47::PSG8::PLAUR::ZNF227::VASP::SLC17A7::ZNF175::ZBTB45::PTPRT::CDH22::RUNX1::YBEY::CECR2::CABIN1::SGSM1::LARGE::MPPED1::PJA1" into https://www.disgenet.org/search. Output file was originally named "23261__1889__3339__114784__4774__23566__27159__3782__60676__9580__5784__7159__127602__7107__6262__284521__29841__238__55471__130589__6546__5581__286749__100302652__400954__9938__80059__1496__90__27347__50940__2.tsv" and was downloaded on 9/12/2023.
- `miEAA.csv`: Output from entering `MIRS_smoking.txt`into https://ccb-compute2.cs.uni-saarland.de/mieaa2/user_input/ by following manual steps 1 and 2 in `step11b_get_enrichment_all.R`. Was originally named "miEAA - miRNA Enrichment and Annotation -- Analysis results.csv" and was downloaded on 5/29/2023.
- `rsIDs_GxAlcohol_effects.txt`: FUMA input. Refer to `Important Note on FUMA details`.
- `rsIDs_GxAlcohol_effects_pvals.txt`: FUMA input. Refer to `Important Note on FUMA details`.
- `rsIDs_GxExercise_effects.txt`: Not used as FUMA input due to there being only one SNP hit.
- `rsIDs_GxExercise_effects_pvals.txt`: Not used as FUMA input due to there being only one SNP hit.
- `rsIDs_GxGender_effects.txt`: FUMA input. Refer to `Important Note on FUMA details`.
- `rsIDs_GxGender_effects_pvals.txt`: FUMA input. Refer to `Important Note on FUMA details`.
- `rsIDs_GxSmoking_effects.txt`: FUMA input. Refer to `Important Note on FUMA details`.
- `rsIDs_GxSmoking_effects_pvals.txt`: FUMA input. Refer to `Important Note on FUMA details`.
- `rsIDs_main_effects.txt`: FUMA input. Refer to `Important Note on FUMA details`.
- `rsIDs_main_effects_pvals.txt`: FUMA input. Refer to `Important Note on FUMA details`.
- `SNP_MAFs_rsIDs_GxAlcohol_effects.txt`: for possible future use
- `SNP_MAFs_rsIDs_GxExercise_effects.txt`: for possible future use
- `SNP_MAFs_rsIDs_GxGender_effects.txt`: for possible future use
- `SNP_MAFs_rsIDs_GxSmoking_effects.txt`: used by `step11f_miRNA_enrichment_analysis.R` and for possible future use
- `SNP_MAFs_rsIDs_main_effects.txt`: for possible future use
- `step11e_logistic_PCA_GxAlcohol_rsIDs_known.txt`: Shows SNP hit subsets used by `step10e_access_common_SNPs.py` to make table 1a.
- `step11e_logistic_PCA_GxAlcohol_rsIDs_novel.txt`: Shows SNP hit subsets used by `step10e_access_common_SNPs.py` to make table 1a.
- `step11e_logistic_PCA_GxGender_rsIDs_known.txt`: Shows SNP hit subsets used by `step10e_access_common_SNPs.py` to make table 1a.
- `step11e_logistic_PCA_GxGender_rsIDs_novel.txt`: Shows SNP hit subsets used by `step10e_access_common_SNPs.py` to make table 1a.
- `step11e_logistic_PCA_GxSmoking_rsIDs_known.txt`: Shows SNP hit subsets used by `step10e_access_common_SNPs.py` to make table 1a. Used by `step11f_miRNA_enrichment_analysis.R` to make input for https://www.disgenet.org/search.
- `step11e_logistic_PCA_GxSmoking_rsIDs_novel.txt`: Shows SNP hit subsets used by `step10e_access_common_SNPs.py` to make table 1a. Used by `step11f_miRNA_enrichment_analysis.R` to make input for https://www.disgenet.org/search.
- `step11e_logistic_PCA_main_rsIDs_known.txt`: Shows SNP hit subsets used by `step10e_access_common_SNPs.py` to make table 1a.
- `step11e_logistic_PCA_main_rsIDs_novel.txt`: Shows SNP hit subsets used by `step10e_access_common_SNPs.py` to make table 1a.
- `step11e_NN_GxAlcohol_rsIDs_known.txt`: Shows SNP hit subsets used by `step10e_access_common_SNPs.py` to make table 1a.
- `step11e_NN_GxAlcohol_rsIDs_novel.txt`: Shows SNP hit subsets used by `step10e_access_common_SNPs.py` to make table 1a.
- `step11e_NN_GxGender_rsIDs_known.txt`: Shows SNP hit subsets used by `step10e_access_common_SNPs.py` to make table 1a.
- `step11e_NN_GxGender_rsIDs_novel.txt`: Shows SNP hit subsets used by `step10e_access_common_SNPs.py` to make table 1a.
- `step11e_NN_GxSmoking_rsIDs_known.txt`: Shows SNP hit subsets used by `step10e_access_common_SNPs.py` to make table 1a. Used by `step11f_miRNA_enrichment_analysis.R` to make input for https://www.disgenet.org/search.
- `step11e_NN_GxSmoking_rsIDs_novel.txt`: Shows SNP hit subsets used by `step10e_access_common_SNPs.py` to make table 1a. Used by `step11f_miRNA_enrichment_analysis.R` to make input for https://www.disgenet.org/search.
- `step11e_NN_main_rsIDs_known.txt`: Shows SNP hit subsets used by `step10e_access_common_SNPs.py` to make table 1a.
- `step11e_NN_main_rsIDs_novel.txt`: Shows SNP hit subsets used by `step10e_access_common_SNPs.py` to make table 1a.
- `step11e_PCA_GxAlcohol_rsIDs_known.txt`: Shows SNP hit subsets used by `step10e_access_common_SNPs.py` to make table 1a.
- `step11e_PCA_GxAlcohol_rsIDs_novel.txt`: Shows SNP hit subsets used by `step10e_access_common_SNPs.py` to make table 1a.
- `step11e_PCA_GxGender_rsIDs_known.txt`: Shows SNP hit subsets used by `step10e_access_common_SNPs.py` to make table 1a.
- `step11e_PCA_GxGender_rsIDs_novel.txt`: Shows SNP hit subsets used by `step10e_access_common_SNPs.py` to make table 1a.
- `step11e_PCA_GxSmoking_rsIDs_known.txt`: Shows SNP hit subsets used by `step10e_access_common_SNPs.py` to make table 1a. Used by `step11f_miRNA_enrichment_analysis.R` to make input for https://www.disgenet.org/search.
- `step11e_PCA_GxSmoking_rsIDs_novel.txt`: Shows SNP hit subsets used by `step10e_access_common_SNPs.py` to make table 1a. Used by `step11f_miRNA_enrichment_analysis.R` to make input for https://www.disgenet.org/search.
- `step11e_PCA_main_rsIDs_known.txt`: Shows SNP hit subsets used by `step10e_access_common_SNPs.py` to make table 1a.
- `step11e_PCA_main_rsIDs_novel.txt`: Shows SNP hit subsets used by `step10e_access_common_SNPs.py` to make table 1a.
### Output Files
- `step11f_logistic_PCA_miRNA_associated_genic_SNP_enrichment.txt`: counts and p-value for enrichment of logistic PCA SNP hits inside of genes linked to miRNA
- `step11f_NN_miRNA_associated_genic_SNP_enrichment.txt`: counts and p-value for enrichment of NN SNP hits inside of genes linked to miRNA
- `step11f_PCA_miRNA_associated_genic_SNP_enrichment.txt`: counts and p-value for enrichment of PCA SNP hits inside of genes linked to miRNA
- `table1a.txt`: refer to manuscript
- `table1b.txt`: refer to manuscript
- `table1c.txt`: refer to manuscript
- `table1d.txt`: refer to manuscript
- `table2a.txt`: refer to manuscript
- `table2b.txt`: refer to manuscript
- `table2c.txt`: refer to manuscript
- `table3a.txt`: refer to manuscript
- `table3b.txt`: refer to manuscript
- `table3c.txt`: refer to manuscript
- `table3d.txt`: refer to manuscript
### Resource Folders
- `all_sig_rsIDs_logistic_PCA`: all statistically significant SNP hits for logistic PCA latent phenotypes
- `all_sig_rsIDs_NN`: all statistically significant SNP hits for NN latent phenotypes
- `all_sig_rsIDs_PCA`: all statistically significant SNP hits for PCA latent phenotypes
- `genevestigator_hits`: all studies with differentially expressed genes
- `genevestigator_hits_methods`: contains methodological details regarding `genevestigator_hits`