Calculating sample size estimates for RNA sequencing data
Published in | Journal of computational biology Vol. 20; no. 12; p. 970 |
---|---|
Main Authors | Hart, Steven N; Therneau, Terry M; Zhang, Yuji; Poland, Gregory A; Kocher, Jean-Pierre |
Format | Journal Article |
Language | English |
Published | United States, 01.12.2013 |
ISSN | 1557-8666 |
DOI | 10.1089/cmb.2012.0283 |
Abstract | Given the high technical reproducibility and orders of magnitude greater resolution than microarrays, next-generation sequencing of mRNA (RNA-Seq) is quickly becoming the de facto standard for measuring levels of gene expression in biological experiments. Two important questions must be taken into consideration when designing a particular experiment, namely, 1) how deep does one need to sequence? and, 2) how many biological replicates are necessary to observe a significant change in expression?
Based on the gene expression distributions from 127 RNA-Seq experiments, we find evidence that 91% ± 4% of all annotated genes are sequenced at a frequency of 0.1 times per million bases mapped, regardless of sample source. Based on this observation, and combining this information with other parameters such as biological variation and technical variation that we empirically estimate from our large datasets, we developed a model to estimate the statistical power needed to identify differentially expressed genes from RNA-Seq experiments.
Our results provide a needed reference for ensuring RNA-Seq gene expression studies are conducted with the optimal sample size, power, and sequencing depth. We also make available both R code and an Excel worksheet for investigators to calculate these values for their own experiments. |
---|---|
Author | Kocher, Jean-Pierre; Zhang, Yuji; Hart, Steven N; Poland, Gregory A; Therneau, Terry M |
Author_xml | 1. Hart, Steven N (Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, Rochester, Minnesota); 2. Therneau, Terry M; 3. Zhang, Yuji; 4. Poland, Gregory A; 5. Kocher, Jean-Pierre |
BackLink | https://www.ncbi.nlm.nih.gov/pubmed/23961961$$D View this record in MEDLINE/PubMed |
ContentType | Journal Article |
Discipline | Biology; Mathematics |
EISSN | 1557-8666 |
ExternalDocumentID | 23961961 |
Genre | Journal Article; Research Support, N.I.H., Extramural; Research Support, Non-U.S. Gov't |
GrantInformation_xml | – fundername: NIAID NIH HHS grantid: U01 AI089859 |
IsDoiOpenAccess | false |
IsOpenAccess | true |
IsPeerReviewed | true |
IsScholarly | true |
Issue | 12 |
OpenAccessLink | https://www.ncbi.nlm.nih.gov/pmc/articles/3842884 |
PMID | 23961961 |
ParticipantIDs | pubmed_primary_23961961 |
PublicationCentury | 2000 |
PublicationDate | 2013-12-01 |
PublicationPlace | United States |
PublicationPlace_xml | – name: United States |
PublicationTitle | Journal of computational biology |
PublicationTitleAlternate | J Comput Biol |
PublicationYear | 2013 |
StartPage | 970 |
SubjectTerms | Algorithms; Animals; Gene Expression Profiling - methods; High-Throughput Nucleotide Sequencing - methods; Humans; Models, Biological; RNA, Messenger - genetics; Sample Size |
Title | Calculating sample size estimates for RNA sequencing data |
URI | https://www.ncbi.nlm.nih.gov/pubmed/23961961 |
Volume | 20 |
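The abstract describes a power model that combines sequencing depth with biological variation, but this record does not reproduce the model itself. The following is only a minimal Python sketch of how such a calculation might look, assuming a normal-approximation formula of the form n = 2 (z_{1-α/2} + z_β)² (1/μ + σ²) / (ln Δ)², where μ is the average read depth for a gene, σ is the biological coefficient of variation, and Δ is the fold change to detect. The formula, the function name `samples_per_group`, and all parameter names are assumptions made for illustration; this is not the authors' R code or Excel worksheet.

```python
from math import ceil, log
from statistics import NormalDist


def samples_per_group(depth, cv, effect, alpha=0.05, power=0.9):
    """Approximate replicates per group for a two-group comparison.

    Assumed formula (not taken from this record):
        n = 2 * (z_{1-alpha/2} + z_power)^2 * (1/depth + cv^2) / ln(effect)^2

    depth  -- average number of reads covering the gene (hypothetical name)
    cv     -- biological coefficient of variation within a group
    effect -- fold change to detect, e.g. 2 for a doubling
    """
    if depth <= 0 or cv < 0 or effect <= 0 or effect == 1:
        raise ValueError("depth must be > 0, cv >= 0, effect > 0 and != 1")
    if not (0 < alpha < 1 and 0 < power < 1):
        raise ValueError("alpha and power must lie strictly between 0 and 1")

    normal = NormalDist()
    z_alpha = normal.inv_cdf(1 - alpha / 2)  # two-sided significance threshold
    z_power = normal.inv_cdf(power)          # quantile for the target power

    # 1/depth stands in for counting (technical) noise, cv^2 for biological noise.
    variance_term = 1 / depth + cv ** 2
    return 2 * (z_alpha + z_power) ** 2 * variance_term / log(effect) ** 2


if __name__ == "__main__":
    n = samples_per_group(depth=20, cv=0.4, effect=2)
    print(f"estimated replicates per group: {n:.2f} (round up to {ceil(n)})")
```

Under these assumptions, sequencing deeper shrinks only the 1/depth term, so once depth is large the biological variation dominates and adding replicates is the only way to gain power.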