Calculating sample size estimates for RNA sequencing data

Bibliographic Details
Published in Journal of computational biology Vol. 20; no. 12; p. 970
Main Authors Hart, Steven N, Therneau, Terry M, Zhang, Yuji, Poland, Gregory A, Kocher, Jean-Pierre
Format Journal Article
Language English
Published United States 01.12.2013
Subjects
ISSN 1557-8666
DOI 10.1089/cmb.2012.0283

Abstract Given its high technical reproducibility and orders of magnitude greater resolution than microarrays, next-generation sequencing of mRNA (RNA-Seq) is quickly becoming the de facto standard for measuring levels of gene expression in biological experiments. Two important questions must be taken into consideration when designing a particular experiment, namely, 1) how deep does one need to sequence? and 2) how many biological replicates are necessary to observe a significant change in expression? Based on the gene expression distributions from 127 RNA-Seq experiments, we find evidence that 91% ± 4% of all annotated genes are sequenced at a frequency of 0.1 times per million bases mapped, regardless of sample source. Combining this observation with other parameters, such as the biological and technical variation that we empirically estimate from our large datasets, we developed a model to estimate the statistical power needed to identify differentially expressed genes from RNA-Seq experiments. Our results provide a needed reference for ensuring that RNA-Seq gene expression studies are conducted with the optimal sample size, power, and sequencing depth. We also make available both R code and an Excel worksheet for investigators to perform these calculations for their own experiments.
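The kind of calculation the abstract describes can be illustrated with a short sketch. The formula below is an assumption, not taken from this record: it is a commonly used normal-approximation form for a two-group comparison, n = 2 (z_{1-alpha/2} + z_power)^2 (1/depth + cv^2) / (ln effect)^2. The function name `samples_per_group` and its parameter names are hypothetical, and the authors' R code and Excel worksheet may differ in detail.

```python
import math
from statistics import NormalDist


def samples_per_group(depth, cv, effect, alpha=0.05, power=0.8):
    """Approximate biological replicates needed per group.

    Assumed model (not stated in this record):
        n = 2 * (z_{1-alpha/2} + z_power)^2 * (1/depth + cv^2) / ln(effect)^2

    depth  -- average read coverage of the gene (technical variation ~ 1/depth)
    cv     -- biological coefficient of variation within a group
    effect -- fold change to detect (e.g. 2 for a doubling)
    """
    if depth <= 0 or effect <= 0 or effect == 1:
        raise ValueError("depth must be > 0 and effect must be > 0 and != 1")
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)  # two-sided type I error
    z_power = std_normal.inv_cdf(power)          # desired power
    variance = 1 / depth + cv ** 2               # technical + biological terms
    return 2 * (z_alpha + z_power) ** 2 * variance / math.log(effect) ** 2


if __name__ == "__main__":
    # Illustrative inputs only: 20x coverage, CV of 0.4, two-fold change.
    n = samples_per_group(depth=20, cv=0.4, effect=2)
    print(f"raw estimate: {n:.2f}, rounded up: {math.ceil(n)} per group")
```

Under this assumed model, deeper sequencing only shrinks the 1/depth term, so once depth is moderate the biological variation dominates and adding replicates helps more than adding reads.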
Author Kocher, Jean-Pierre
Zhang, Yuji
Hart, Steven N
Poland, Gregory A
Therneau, Terry M
Author_xml – sequence: 1
  givenname: Steven N
  surname: Hart
  fullname: Hart, Steven N
  organization: 1 Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic , Rochester, Minnesota
– sequence: 2
  givenname: Terry M
  surname: Therneau
  fullname: Therneau, Terry M
– sequence: 3
  givenname: Yuji
  surname: Zhang
  fullname: Zhang, Yuji
– sequence: 4
  givenname: Gregory A
  surname: Poland
  fullname: Poland, Gregory A
– sequence: 5
  givenname: Jean-Pierre
  surname: Kocher
  fullname: Kocher, Jean-Pierre
BackLink https://www.ncbi.nlm.nih.gov/pubmed/23961961
ContentType Journal Article
DBID CGR
CUY
CVF
ECM
EIF
NPM
DOI 10.1089/cmb.2012.0283
DatabaseName Medline
MEDLINE
MEDLINE (Ovid)
MEDLINE
MEDLINE
PubMed
DatabaseTitle MEDLINE
Medline Complete
MEDLINE with Full Text
PubMed
MEDLINE (Ovid)
DatabaseTitleList MEDLINE
Database_xml – sequence: 1
  dbid: NPM
  name: PubMed
  url: https://proxy.k.utb.cz/login?url=http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=PubMed
  sourceTypes: Index Database
– sequence: 2
  dbid: EIF
  name: MEDLINE
  url: https://proxy.k.utb.cz/login?url=https://www.webofscience.com/wos/medline/basic-search
  sourceTypes: Index Database
DeliveryMethod no_fulltext_linktorsrc
Discipline Biology
Mathematics
EISSN 1557-8666
ExternalDocumentID 23961961
Genre Research Support, Non-U.S. Gov't
Journal Article
Research Support, N.I.H., Extramural
GrantInformation_xml – fundername: NIAID NIH HHS
  grantid: U01 AI089859
IngestDate Thu Apr 03 07:08:14 EDT 2025
IsDoiOpenAccess false
IsOpenAccess true
IsPeerReviewed true
IsScholarly true
Issue 12
Language English
LinkModel OpenURL
OpenAccessLink https://www.ncbi.nlm.nih.gov/pmc/articles/3842884
PMID 23961961
ParticipantIDs pubmed_primary_23961961
PublicationCentury 2000
PublicationDate 2013-12-01
PublicationDateYYYYMMDD 2013-12-01
PublicationDate_xml – month: 12
  year: 2013
  text: 2013-12-01
  day: 01
PublicationDecade 2010
PublicationPlace United States
PublicationPlace_xml – name: United States
PublicationTitle Journal of computational biology
PublicationTitleAlternate J Comput Biol
PublicationYear 2013
SSID ssj0013607
Score 2.4996135
SourceID pubmed
SourceType Index Database
StartPage 970
SubjectTerms Algorithms
Animals
Gene Expression Profiling - methods
High-Throughput Nucleotide Sequencing - methods
Humans
Models, Biological
RNA, Messenger - genetics
Sample Size
Title Calculating sample size estimates for RNA sequencing data
URI https://www.ncbi.nlm.nih.gov/pubmed/23961961
Volume 20
linkProvider National Library of Medicine