Calculating sample size estimates for RNA sequencing data

Given the high technical reproducibility and orders of magnitude greater resolution than microarrays, next-generation sequencing of mRNA (RNA-Seq) is quickly becoming the de facto standard for measuring levels of gene expression in biological experiments. Two important questions must be taken into c...

Bibliographic Details
Published in	Journal of computational biology Vol. 20; no. 12; p. 970
Main Authors	Hart, Steven N; Therneau, Terry M; Zhang, Yuji; Poland, Gregory A; Kocher, Jean-Pierre
Format	Journal Article
Language	English
Published	United States 01.12.2013
Subjects	Algorithms; Animals; Gene Expression Profiling - methods; High-Throughput Nucleotide Sequencing - methods; Humans; Models, Biological; RNA, Messenger - genetics; Sample Size
ISSN	1557-8666
DOI	10.1089/cmb.2012.0283

Abstract	Given the high technical reproducibility and orders of magnitude greater resolution than microarrays, next-generation sequencing of mRNA (RNA-Seq) is quickly becoming the de facto standard for measuring levels of gene expression in biological experiments. Two important questions must be taken into consideration when designing a particular experiment: (1) how deep does one need to sequence, and (2) how many biological replicates are necessary to observe a significant change in expression? Based on the gene expression distributions from 127 RNA-Seq experiments, we find evidence that 91% ± 4% of all annotated genes are sequenced at a frequency of 0.1 times per million bases mapped, regardless of sample source. Combining this observation with other parameters, such as the biological and technical variation that we empirically estimate from our large datasets, we developed a model to estimate the statistical power needed to identify differentially expressed genes from RNA-Seq experiments. Our results provide a needed reference for ensuring that RNA-Seq gene expression studies are conducted with the optimal sample size, power, and sequencing depth. We also make available both R code and an Excel worksheet for investigators to calculate these values for their own experiments.
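The abstract describes a power model but does not state its formula. As a rough illustration only, the sketch below assumes the closed form commonly associated with this kind of model, n = 2 (z_{1-alpha/2} + z_power)^2 (1/depth + cv^2) / (ln effect)^2, where depth is the average read count for a gene, cv is the biological coefficient of variation, and effect is the fold change to detect. The function name and parameter names are hypothetical, and this is not the authors' R code or Excel worksheet.

```python
import math
from statistics import NormalDist


def samples_per_group(depth, cv, effect, alpha=0.05, power=0.8):
    """Approximate number of biological replicates per group.

    Assumed formula (not quoted from the abstract):
        n = 2 * (z_{1-alpha/2} + z_power)^2 * (1/depth + cv^2) / ln(effect)^2

    depth  : average number of reads covering the gene (assumed > 0)
    cv     : biological coefficient of variation within a group
    effect : fold change to detect (e.g. 2 for a doubling; must not be 1)
    """
    if depth <= 0 or effect <= 0 or effect == 1:
        raise ValueError("depth must be > 0 and effect must be > 0 and != 1")
    std_normal = NormalDist()  # standard normal, mean 0, sd 1
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)  # two-sided significance level
    z_power = std_normal.inv_cdf(power)          # quantile for the target power
    variance_term = 1 / depth + cv ** 2          # counting noise + biological noise
    return 2 * (z_alpha + z_power) ** 2 * variance_term / math.log(effect) ** 2


if __name__ == "__main__":
    n = samples_per_group(depth=20, cv=0.4, effect=2)
    # Round up, since a fractional replicate cannot be sequenced.
    print(math.ceil(n))
```

Under this assumed form, extra sequencing depth only shrinks the 1/depth term, so once depth is large the biological variation (cv) dominates and adding replicates helps more than adding reads.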
Author Affiliation	Hart, Steven N: Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, Rochester, Minnesota
PubMed Record	https://www.ncbi.nlm.nih.gov/pubmed/23961961
Discipline	Biology Mathematics
Genre	Journal Article; Research Support, N.I.H., Extramural; Research Support, Non-U.S. Gov't
Grant Information	NIAID NIH HHS, grant U01 AI089859
Open Access	Yes
Peer Reviewed	Yes
Open Access Link	https://www.ncbi.nlm.nih.gov/pmc/articles/3842884
PMID	23961961