-
Notifications
You must be signed in to change notification settings - Fork 4
/
Copy pathSummaryStatsGeneticData.R
127 lines (90 loc) · 3.4 KB
/
SummaryStatsGeneticData.R
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
'''
We calculated the Allelic richness and expected heterozygosity scores
for the 'BIG DATA' file of 153 populations.
This is ALOT of numbers!! So I want to come up with some basic summary
statistics to allow us to make some generalization in our data analysis.
We can later compare between these values and the Newfoundland only populations using
basic statistics such as chi-square tests to see if the scores differ significantly
'''
# First I need the data set. I'll put it into another object called GenDat
# (I'm working inside our Salmon-Project directory, so all the docs are already here, no need for a file path)
GenDat <- read.csv("HWEdat.csv")
#I'll calculate the mean and the range (min and max) for each score
#I'll save each in an object to put them all together later for further analysis
#Allelic Richness
ARMean <- mean(GenDat$Allelic.Richness)
ARMin <- min(GenDat$Allelic.Richness)
ARMax <- max(GenDat$Allelic.Richness)
ARRange <-ARMax - ARMin
#Expected hetero
EHMean <- mean(GenDat$Expected.Het)
EHMin <- min(GenDat$Expected.Het)
EHMax <- max(GenDat$Expected.Het)
EHRange <-EHMax - EHMin
'''
Now we do the same for the Newfoundland file
'''
# First I need the data set. I'll put it into another object called GenDat
# (I'm working inside our Salmon-Project directory, so all the docs are already here, no need for a file path)
NFGenDat <- read.csv("HWEdatNFL.csv")
#I'll calculate the mean and the range (min and max) for each score
#I'll save each in an object to put them all together later for further analysis
#Allelic Richness
ARMeanNF <- mean(NFGenDat$Allelic.Richness)
ARMinNF <- min(NFGenDat$Allelic.Richness)
ARMaxNF <- max(NFGenDat$Allelic.Richness)
ARRangeNF <-ARMaxNF - ARMinNF
#Expected hetero
EHMeanNF <- mean(NFGenDat$Expected.Het)
EHMinNF <- min(NFGenDat$Expected.Het)
EHMaxNF <- max(NFGenDat$Expected.Het)
EHRangeNF <-EHMaxNF - EHMinNF
# Make a simple data frame for allelic richness and expected heterozygosity
#Allelic Richness
Mean1 <- c(ARMean, ARMeanNF)
Min1 <- c(ARMin, ARMinNF)
Max1 <- c(ARMax, ARMaxNF)
Range1 <- c(ARRange, ARRangeNF)
#Expected Hetero
Mean2 <- c(EHMean, EHMeanNF)
Min2 <- c(EHMin, EHMinNF)
Max2 <- c(EHMax, EHMinNF)
Range2 <- c(EHRange, EHRangeNF)
#Making the frame
ARFrame <- data.frame(Mean1, Min1, Max1, Range1)
EHFrame <- data.frame(Mean2, Min2, Max2, Range2)
#Adding row names
rownames(ARFrame) <- c("BigFile", "Newfoundland")
rownames(EHFrame) <- c("BigFile", "Newfoundland")
'''
To compare if the mean allelic richness and expected heterozygosity differ significantly
between the two files, we can use a t-test.
'''
t.test(GenDat$Allelic.Richness, NFGenDat$Allelic.Richness)
'''
Result
Welch Two Sample t-test
data: GenDat$Allelic.Richness and NFGenDat$Allelic.Richness
t = -26.267, df = 86.898, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-4.686750 -4.027356
sample estimates:
mean of x mean of y
8.012222 12.369275
Significantly higher allelic richness in Newfoundland
'''
t.test(GenDat$Expected.Het, NFGenDat$Expected.Het)
'''
Result
Welch Two Sample t-test
data: GenDat$Expected.Het and NFGenDat$Expected.Het
t = -1.199, df = 142.94, p-value = 0.2325
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.01346548 0.00329782
sample estimates:
mean of x mean of y
0.8347712 0.8398551
No significant difference in expected heterozygosity
'''