Statistics in Research Articles: Univariate Analysis [科研文章中的统计-单因素分析]

Why I wrote this

  • I have recently been learning R, and this post is a small showcase of that learning. This is the first part: univariate analysis.
  • The statistical methods in a research paper matter a great deal, since they determine how trustworthy the results are; yet in many papers published in China the statistics do not hold up.
  • Using R as the entry point and the data from my own paper as the example, this post briefly walks through the rationale and workflow of univariate analysis.
  • For an introduction to R and how to install it, see the official R website. A minimal sketch of installing the packages used in this post follows this list.
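Below is a minimal setup sketch (an addition of mine, not part of the original post) listing the CRAN packages used in the analyses that follow; install.packages and library are the standard way to obtain and load them.

# Sketch: install and load the CRAN packages used in this post (grid ships with base R).
# Note: PMCMR has since been superseded by PMCMRplus; the original package is assumed here.
pkgs <- c("car", "reshape2", "PMCMR", "dplyr", "plotrix", "ggplot2", "multcomp")
install.packages(pkgs)                                    # install once
invisible(lapply(pkgs, library, character.only = TRUE))   # load for the session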

Basic workflow of an analysis in R

Data import: in R, read a data frame from file with read.table, then display it.

F2TTC <- read.table("file/F2TTC.csv", header=TRUE, sep=",", na.strings="NA", dec=".", strip.white=TRUE, check.names = FALSE)
F2TTC
#+RESULTS:
   Sham THP 40  + Sham     IR THP 10  + IR THP 20  + IR THP 40  + IR
1     0              0 52.860       44.503       39.304       31.750
2     0              0 46.746       26.012       26.299       33.871
3     0              0 54.279       36.196       35.175       32.801
4     0              0 46.767       43.956       38.761       34.568
5     0              0 47.651       42.660       38.844       32.157
6     0              0 55.392       36.196       41.210       34.302
7     0              0 51.055       43.956       19.734       23.853
8    NA             NA 45.789       42.660       23.413       16.602
9    NA             NA 28.584           NA           NA       18.508
10   NA             NA 36.514           NA           NA       13.746
11   NA             NA 38.700           NA           NA       15.606
12   NA             NA 47.441           NA           NA        0.000
13   NA             NA 40.071           NA           NA        0.000

Checking for outliers and assessing normality

  • Reshape the data into long format with reshape2
library(reshape2)
F2TTCR <- melt(F2TTC)
names(F2TTCR)[1:2] <- c("group", "value")
F2TTCR
#+RESULTS:
	    group  value
1            Sham  0.000
2            Sham  0.000
3            Sham  0.000
4            Sham  0.000
5            Sham  0.000
6            Sham  0.000
7            Sham  0.000
8            Sham     NA
9            Sham     NA
10           Sham     NA
11           Sham     NA
12           Sham     NA
13           Sham     NA
14 THP 40  + Sham  0.000
15 THP 40  + Sham  0.000
16 THP 40  + Sham  0.000
17 THP 40  + Sham  0.000
18 THP 40  + Sham  0.000
19 THP 40  + Sham  0.000
20 THP 40  + Sham  0.000
21 THP 40  + Sham     NA
22 THP 40  + Sham     NA
23 THP 40  + Sham     NA
24 THP 40  + Sham     NA
25 THP 40  + Sham     NA
26 THP 40  + Sham     NA
27             IR 52.860
28             IR 46.746
29             IR 54.279
30             IR 46.767
31             IR 47.651
32             IR 55.392
33             IR 51.055
34             IR 45.789
35             IR 28.584
36             IR 36.514
37             IR 38.700
38             IR 47.441
39             IR 40.071
40   THP 10  + IR 44.503
41   THP 10  + IR 26.012
42   THP 10  + IR 36.196
43   THP 10  + IR 43.956
44   THP 10  + IR 42.660
45   THP 10  + IR 36.196
46   THP 10  + IR 43.956
47   THP 10  + IR 42.660
48   THP 10  + IR     NA
49   THP 10  + IR     NA
50   THP 10  + IR     NA
51   THP 10  + IR     NA
52   THP 10  + IR     NA
53   THP 20  + IR 39.304
54   THP 20  + IR 26.299
55   THP 20  + IR 35.175
56   THP 20  + IR 38.761
57   THP 20  + IR 38.844
58   THP 20  + IR 41.210
59   THP 20  + IR 19.734
60   THP 20  + IR 23.413
61   THP 20  + IR     NA
62   THP 20  + IR     NA
63   THP 20  + IR     NA
64   THP 20  + IR     NA
65   THP 20  + IR     NA
66   THP 40  + IR 31.750
67   THP 40  + IR 33.871
68   THP 40  + IR 32.801
69   THP 40  + IR 34.568
70   THP 40  + IR 32.157
71   THP 40  + IR 34.302
72   THP 40  + IR 23.853
73   THP 40  + IR 16.602
74   THP 40  + IR 18.508
75   THP 40  + IR 13.746
76   THP 40  + IR 15.606
77   THP 40  + IR  0.000
78   THP 40  + IR  0.000
  • Check for outliers with outlierTest from the car package. In the output below the Bonferroni P > 0.05, so no observation is flagged as an outlier.
library(car)
F2TTCout <- outlierTest(lm(value ~ group, data = F2TTCR))
F2TTCout
#+RESULTS:
No Studentized residuals with Bonferonni p < 0.05
Largest |rstudent|:
    rstudent unadjusted p-value Bonferonni p
77 -3.020835          0.0039982       0.2239
  • Use a Q-Q plot to check the normality assumption; here the data are non-normal (a per-group numeric check with shapiro.test is sketched after the plotting code below).
tiff(filename = "qqpF2T.tif",width = 15,height = 18,units ="cm",compression="lzw",bg="white",res=600)
qqPlot(lm(value ~ group, data=F2TTCR), simulate=TRUE, main="Q-Q Plot", labels=TRUE)
dev.off()
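As a complementary numeric check (a sketch I am adding, not part of the original analysis), a per-group Shapiro-Wilk test can back up the Q-Q plot: p < 0.05 suggests departure from normality. The all-zero Sham groups have no variance and are skipped, because shapiro.test requires non-constant values.

library(dplyr)
F2TTCR %>%
  filter(!is.na(value)) %>%
  group_by(group) %>%
  filter(sd(value) > 0) %>%                        # drop constant (all-zero) groups
  summarise(shapiro.p = shapiro.test(value)$p.value)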

blogqqpF2T.png

Figure 1: Q-Q plot

The next step is to test for homogeneity of variance; the two commonly used tests are leveneTest and bartlett.test.

bartlett.test(value ~ group, data = F2TTCR)
leveneTest(F2TTCR$value, F2TTCR$group)
#+RESULTS:
	Bartlett test of homogeneity of variances
data:  value by group
Bartlett's K-squared = Inf, df = 5, p-value < 2.2e-16

Levene's Test for Homogeneity of Variance (center = median)
      Df F value    Pr(>F)
group  5  5.5917 0.0003619
      50

Since the data above are non-normal, I use nonparametric methods to test for group differences, in two steps: first, an overall test across groups with kruskal.test; second, pairwise comparisons with posthoc.kruskal.nemenyi.test. From the results below, IR versus Sham is statistically significant (P < 0.01), marked with "**"; THP 10 + IR versus Sham and THP 40 + IR versus IR are significant (P < 0.05), marked with "*". A sketch of turning the pairwise p-value matrix into these marks follows the output.

kruskal.test(value ~ group, data = F2TTCR)
library(PMCMR)
F2TTCSEM <- posthoc.kruskal.nemenyi.test(value ~ group, data=F2TTCR, dist="Chisquare")
F2TTCSEM
#+RESULTS:
Kruskal-Wallis rank sum test

data:  value by group
Kruskal-Wallis chi-squared = 44.519, df = 5, p-value = 1.817e-08

Pairwise comparisons using Nemenyi-test with Chi-squared        
		       approximation for independent samples 
data:  value by group 

	       Sham   THP 40  + Sham IR     THP 10  + IR THP 20  + IR
THP 40  + Sham 1.0000 -              -      -            -           
IR             0.0001 0.0001         -      -            -           
THP 10  + IR   0.0154 0.0154         0.9673 -            -           
THP 20  + IR   0.1416 0.1416         0.5666 0.9764       -           
THP 40  + IR   0.6422 0.6422         0.0109 0.3315       0.8559      
P value adjustment method: none
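For completeness, here is a hedged sketch of deriving the "*" / "**" marks from the table above with base R's symnum; it assumes the PMCMR result stores its pairwise p-values in the $p.value slot (as in PMCMR 4.x).

pmat <- F2TTCSEM$p.value                          # pairwise p-value matrix (assumed slot)
symnum(pmat, corr = FALSE, na = FALSE,
       cutpoints = c(0, 0.01, 0.05, 1),
       symbols = c("**", "*", " "))               # ** for p < 0.01, * for p < 0.05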

Plotting

  • Data preparation for plotting: aggregate the data with the dplyr package (std.error for the SEM comes from plotrix)
library(plotrix)
library(dplyr)
F2T <- summarise(group_by(F2TTCR, group), mean=mean(value, na.rm = TRUE), sem=std.error(value, na.rm=TRUE))
F2T <- mutate(F2T, sign1=c(NA,NA,"**","*",NA,NA), sign2=c(NA,NA,NA,NA,NA,"*"))
F2T
#+RESULTS:
Source: local data frame [6 x 5]

	   group     mean      sem sign1 sign2
	  (fctr)    (dbl)    (dbl) (chr) (chr)
1           Sham  0.00000 0.000000    NA    NA
2 THP 40  + Sham  0.00000 0.000000    NA    NA
3             IR 45.52685 2.139659    **    NA
4   THP 10  + IR 39.51738 2.269635     *    NA
5   THP 20  + IR 32.84250 2.963525    NA    NA
6   THP 40  + IR 22.13569 3.477622    NA     *
  • Draw the plot (an alternative way to save the figure with ggsave is sketched after this code block)
library(ggplot2)
library(grid)
levels(F2T$group) <- gsub("  ", "\n", levels(F2T$group))
tiff(filename = "F2TTC.tif",width = 16,height = 10,units ="cm",compression="lzw",bg="white",res=1200)
ggplot(data=F2T,aes(x=group,y=mean,fill=group, group=1)) +
  geom_errorbar(aes(ymin=mean,ymax=mean+sem, width = 0.2),size=0.8) +
  geom_bar(position="stack",stat = "identity",width=0.6,size=0.8, colour="black") +
  scale_fill_manual(values=c("grey100", "grey75","grey0", "grey90","grey50", "grey30")) +
  theme_classic(base_family="Times New Roman") + scale_x_discrete("") +
  scale_y_continuous("Infarct Size (%)", expand=c(0,0),limits = c(0, 70.2), breaks=seq(0,70, by=10)) +
  theme(axis.text.x = element_text(family="Times New Roman",face="bold", size=12), axis.text.y = element_text(family="Times New Roman",face="bold", size=12),axis.title.y = element_text(family="Times New Roman", face="bold", size=15, margin=margin(0,10,0,0)),axis.line = element_line(size = 0.8),axis.ticks = element_line(size = 0.8),axis.ticks.length=unit(0.2,"cm"), plot.margin=unit(c(5,5,0,5),"mm"), legend.position="none") +
  geom_path(x=c(1,1,1,3,3,3),y=c(50,52,52,52,52,50),size = 0.6) +
  geom_path(x=c(1,1,1,4,4,4),y=c(54,56,56,56,56,54),size = 0.6) +
  geom_path(x=c(3,3,3,6,6,6),y=c(60,62,62,62,62,60),size = 0.6) +
  annotate("text",family="Times New Roman", x = 2, y = 53, label = "**",size=5) +
  annotate("text",family="Times New Roman", x = 2.5, y=57, label = "*",size=5) +
  annotate("text",family="Times New Roman", x = 4.5, y = 63, label = "*",size=5)
dev.off()
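As a usage note (an assumed-equivalent alternative, not the original workflow), ggplot2's ggsave can replace the tiff()/dev.off() pair; extra arguments such as compression are forwarded to the tiff device.

ggsave("F2TTC.tif", width = 16, height = 10, units = "cm",
       dpi = 1200, device = "tiff", compression = "lzw")   # saves the last plot drawn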

blogF2TTC.png

Figure 2: TTC infarct size plot

Two more examples

Nonparametric test, plot with multiple groups

F2N3 <- read.table("F2N3.csv", header=TRUE, sep=",", na.strings="NA", dec=".", strip.white=TRUE, check.names = FALSE)
F2N24 <- read.table("F2N24.csv", header=TRUE, sep=",", na.strings="NA", dec=".", strip.white=TRUE, check.names = FALSE)
F2N3R <- melt(F2N3)
names(F2N3R)[1:2] <- c("group", "value")
F2N24R <- melt(F2N24)
names(F2N24R)[1:2] <- c("group", "value")
F2N3R$time <- rep("3 h",96)
F2N24R$time <- rep("24 h",96)
F2NR <- rbind(F2N3R, F2N24R)
outlierTest(lm(value ~ group, data = F2NR))
tiff(filename = "qqpF2N.tif",width = 15,height = 18,units ="cm",compression="lzw",bg="white",res=600)
qqPlot(lm(value ~ group, data=F2NR), simulate=TRUE, main="Q-Q Plot", labels=TRUE)
dev.off()
bartlett.test(value ~ group, data = F2NR)
leveneTest(F2NR$value, F2NR$group)
kruskal.test(value ~ group, data = F2NR)
F2N3SEM <- posthoc.kruskal.nemenyi.test(value ~ group, data=F2N3R, dist="Chisquare")
F2N24SEM <- posthoc.kruskal.nemenyi.test(value ~ group, data=F2N24R, dist="Chisquare")
F2N <- summarise(group_by(F2NR, group, time), mean=mean(value, na.rm = TRUE), sem=std.error(value, na.rm=TRUE))
F2N <- F2N[order(desc(F2N$time)), ]
F2N$time <- factor(F2N$time,levels=unique(F2N$time))
tiff(filename = "F2NSS.tif",width = 20, height = 10,units ="cm",compression="lzw",bg="white",res=1200)
ggplot(data=F2N,aes(x=time,y=mean,fill=group, width=0.7)) +
  geom_bar(position=position_dodge(width=0.8),stat = "identity",width=0.6,size=0.8) +
  geom_errorbar(aes(ymin=mean,ymax=mean+sem, width = 0.2),position=position_dodge(width=0.8),size=0.8) +
  geom_bar(position=position_dodge(width=0.8), colour="black",stat ="identity",width=0.6,size=0.8) +
  scale_fill_manual(values=c("grey100", "grey75","grey0", "grey90","grey50", "grey30")) +
  theme_classic(base_family="Times New Roman") + scale_x_discrete("") +
  scale_y_continuous("Neurological Score", expand=c(0,0),limits = c(0, 20.05), breaks=seq(0,20, by=5)) +
  theme(axis.text.x = element_text(family="Times New Roman",face="bold", size=12), axis.text.y = element_text(family="Times New Roman",face="bold", size=12),axis.title.y = element_text(family="Times New Roman", face="bold", size=15, margin=margin(0,10,0,0)),axis.line = element_line(size = 0.8),axis.ticks = element_line(size = 0.8),axis.ticks.length=unit(0.2,"cm"), plot.margin=unit(c(5,5,0,5),"mm"), legend.title=element_blank(),legend.text = element_text(size = 12, face = "bold"),legend.key.width = unit(0.8, "cm"),legend.key.height = unit(0.5, "cm"),legend.position="right") +
  geom_segment(aes(x=0.67, y=16, xend=0.67, yend=16.5)) +
  geom_segment(aes(x=0.67, y=16.5, xend=1.33, yend=16.5)) +
  annotate("text",family="Times New Roman", x=0.93, y=17, label="*", size=5) +
  annotate("text",family="Times New Roman", x=1.07, y=17, label="**", size=5) +
  annotate("text",family="Times New Roman", x=1.2, y=17, label="**", size=5) +
  annotate("text",family="Times New Roman", x=1.33, y=17, label="*", size=5) +
  geom_segment(aes(x=1.67, y=16, xend=1.67, yend=16.5)) +
  geom_segment(aes(x=1.67, y=16.5, xend=2.20, yend=16.5)) +
  annotate("text",family="Times New Roman", x=1.93, y=17, label="**", size=5) +
  annotate("text",family="Times New Roman", x=2.07, y=17, label="**", size=5) +
  annotate("text",family="Times New Roman", x=2.2, y=17, label="**", size=5)
dev.off()

blogF2NSS.png

Figure 3: Neurological score plot

Parametric test, single-panel plot (a base-R alternative for the post-hoc step is sketched after the code)

F3Ev <- read.table("F3Ev.csv", header=FALSE, sep=",", na.strings="NA", dec=".", strip.white=TRUE)
F3Ev <- as.data.frame(t(F3Ev))
names(F3Ev)[1:4] <- c("Sham", "THP 40  + Sham", "IR", "THP 40  + IR")
library(reshape2)
F3EvR <- melt(F3Ev)
names(F3EvR)[1:2] <- c("group", "value")
library(car)
outlierTest(lm(value ~ group, data = F3EvR))
tiff(filename = "qqpF3E.tif",width = 15,height = 18,units ="cm",compression="lzw",bg="white",res=600)
qqPlot(lm(value ~ group, data=F3EvR), simulate=TRUE, main="Q-Q Plot", labels=TRUE)
dev.off()
leveneTest(F3EvR$value, F3EvR$group)
bartlett.test(value ~ group, data = F3EvR)
library(multcomp)
F3EvOA <- aov(value ~ group, data=F3EvR)
summary(F3EvOA)
F3EvSEM <- summary(glht(F3EvOA, linfct = mcp(group = "Tukey")))
library(plotrix)
library(dplyr)
F3E <- summarise(group_by(F3EvR, group), mean=mean(value, na.rm = TRUE), sem=std.error(value, na.rm=TRUE))
library(ggplot2)
library(grid)
levels(F3E$group) <- gsub("  ", "\n", levels(F3E$group))
tiff(filename = "F3Evans Blue.tif",width = 11,height = 10,units ="cm",compression="lzw",bg="white",res=1200)
ggplot(data=F3E,aes(x=group,y=mean,fill=group, group=1)) +
  geom_errorbar(aes(ymin=mean,ymax=mean+sem, width = 0.2),size=0.8) +
  geom_bar(position="stack",stat = "identity",width=0.6,size=0.8, colour="black") +
  scale_fill_manual(values=c("grey100", "grey75","grey0", "grey30")) +
  theme_classic(base_family="Times New Roman") + scale_x_discrete("") +
  scale_y_continuous("Evans Blue (ug/g)", expand=c(0,0),limits = c(0, 20.05), breaks=seq(0,20, by=5)) +
  theme(axis.text.x = element_text(family="Times New Roman",face="bold", size=12), axis.text.y = element_text(family="Times New Roman",face="bold", size=12),axis.title.y = element_text(family="Times New Roman", face="bold", size=15, margin=margin(0,10,0,0)),axis.line = element_line(size = 0.8),axis.ticks = element_line(size = 0.8),axis.ticks.length=unit(0.2,"cm"), plot.margin=unit(c(5,5,0,5),"mm"), legend.position="none") +
  geom_path(x=c(1,1,3,3),y=c(16, 16.5, 16.5, 16),size = 0.6) +
  geom_path(x=c(3,3,4,4),y=c(17.5,18,18,17.5),size = 0.6) +
  annotate("text",family="Times New Roman", x = 2, y = 17, label = "**",size=5) +
  annotate("text",family="Times New Roman", x = 3.5, y=18.5, label = "**",size=5)
dev.off()
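A hedged note on the post-hoc step: base R's TukeyHSD applied to the same aov fit is a common alternative to multcomp::glht and should give closely matching pairwise comparisons.

TukeyHSD(F3EvOA)                                   # base-R Tukey HSD on the aov fit above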

blogF3Evans Blue.png

Figure 4: Evans Blue plot

Notes

Creative Commons licensing

TITLE: Statistics in research article-Univariate analysis [科研文章中的统计-单因素分析]
AUTHOR: lengyueyang
DATE: 2016-04-23 19:26:52 UTC+08:00
UPDATED:
LICENSE: This blog is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License; commercial use is not allowed, and any reprint must credit the author and indicate the original address.
