SOTorrent: Reconstructing and Analyzing the Evolution of Stack Overflow Posts
Published in | 2018 IEEE/ACM 15th International Conference on Mining Software Repositories (MSR) pp. 319 - 330 |
---|---|
Main Authors | Baltes, Sebastian; Dumani, Lorik; Treude, Christoph; Diehl, Stephan |
Format | Conference Proceeding |
Language | English |
Published | New York, NY, USA: ACM, 28.05.2018 |
Series | ACM Conferences |
Subjects | stack overflow; open dataset; code snippets; software evolution |
ISBN | 9781450357166 1450357164 |
ISSN | 2574-3864 |
DOI | 10.1145/3196398.3196430 |
Abstract | Stack Overflow (SO) is the most popular question-and-answer website for software developers, providing a large amount of code snippets and free-form text on a wide variety of topics. Like other software artifacts, questions and answers on SO evolve over time, for example when bugs in code snippets are fixed, code is updated to work with a more recent library version, or text surrounding a code snippet is edited for clarity. To be able to analyze how content on SO evolves, we built SOTorrent, an open dataset based on the official SO data dump. SOTorrent provides access to the version history of SO content at the level of whole posts and individual text or code blocks. It connects SO posts to other platforms by aggregating URLs from text blocks and by collecting references from GitHub files to SO posts. In this paper, we describe how we built SOTorrent, and in particular how we evaluated 134 different string similarity metrics regarding their applicability for reconstructing the version history of text and code blocks. Based on a first analysis using the dataset, we present insights into the evolution of SO posts, e.g., that post edits are usually small, happen soon after the initial creation of the post, and that code is rarely changed without also updating the surrounding text. Further, our analysis revealed a close relationship between post edits and comments. Our vision is that researchers will use SOTorrent to investigate and understand the evolution of SO posts and their relation to other platforms such as GitHub. |
---|---|
Author | Dumani, Lorik; Diehl, Stephan; Treude, Christoph; Baltes, Sebastian |
Author_xml | – sequence: 1 givenname: Sebastian surname: Baltes fullname: Baltes, Sebastian email: research@sbaltes.com organization: University of Trier, Germany – sequence: 2 givenname: Lorik surname: Dumani fullname: Dumani, Lorik email: dumani@uni-trier.de organization: University of Trier, Germany – sequence: 3 givenname: Christoph surname: Treude fullname: Treude, Christoph email: christoph.treude@adelaide.edu.au organization: University of Adelaide, Australia – sequence: 4 givenname: Stephan surname: Diehl fullname: Diehl, Stephan email: diehl@uni-trier.de organization: University of Trier, Germany |
CODEN | IEEPAD |
ContentType | Conference Proceeding |
Copyright | 2018 ACM |
Discipline | Computer Science |
EISBN | 9781450357166 1450357164 |
EISSN | 2574-3864 |
EndPage | 330 |
ExternalDocumentID | 8595215 |
Genre | orig-research |
IsPeerReviewed | false |
IsScholarly | true |
Keywords | stack overflow; open dataset; code snippets; software evolution |
Language | English |
License | Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from Permissions@acm.org. |
MeetingName | ICSE '18: 40th International Conference on Software Engineering |
PageCount | 12 |
PublicationCentury | 2000 |
PublicationDate | 20180528 2018-May |
PublicationDateYYYYMMDD | 2018-05-28 2018-05-01 |
PublicationDecade | 2010 |
PublicationPlace | New York, NY, USA |
PublicationSeriesTitle | ACM Conferences |
PublicationTitle | 2018 IEEE/ACM 15th International Conference on Mining Software Repositories (MSR) |
PublicationTitleAbbrev | MSR |
PublicationYear | 2018 |
Publisher | ACM |
StartPage | 319 |
SubjectTerms | code snippets; Computer bugs; Data mining; History; Indexes; Measurement; open dataset; Software; Software and its engineering -- Software creation and management -- Software post-development issues -- Software evolution; software evolution; stack overflow |
Subtitle | reconstructing and analyzing the evolution of stack overflow posts |
Title | SOTorrent |
URI | https://ieeexplore.ieee.org/document/8595215 |
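
The abstract mentions reconstructing the version history of individual text and code blocks by means of string similarity metrics. As a rough illustration of that idea only, the sketch below links each block of a newer post version to its most similar unmatched block of the same type in the older version. The choice of metric (Dice coefficient over character bigrams), the threshold of 0.6, the greedy matching strategy, and all function names are assumptions made for this example; the record does not state which of the 134 evaluated metrics the authors selected or how they applied it.

```python
from typing import List, Optional, Set, Tuple

Block = Tuple[str, str]  # (block type: "text" or "code", block content)


def ngrams(text: str, n: int = 2) -> Set[str]:
    """Return the set of character n-grams after collapsing whitespace."""
    normalized = " ".join(text.split())
    if len(normalized) < n:
        return {normalized} if normalized else set()
    return {normalized[i:i + n] for i in range(len(normalized) - n + 1)}


def dice_similarity(a: str, b: str, n: int = 2) -> float:
    """Dice coefficient over n-gram sets (an assumed metric, not the paper's)."""
    grams_a, grams_b = ngrams(a, n), ngrams(b, n)
    if not grams_a and not grams_b:
        return 1.0  # two empty blocks are treated as identical
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))


def match_blocks(
    old: List[Block], new: List[Block], threshold: float = 0.6
) -> List[Tuple[Optional[int], int]]:
    """Greedily pair each new block with a predecessor in the old version.

    Returns (old_index, new_index) pairs; old_index is None when no unmatched
    old block of the same type reaches the (hypothetical) threshold.
    """
    matches: List[Tuple[Optional[int], int]] = []
    used: Set[int] = set()
    for j, (new_type, new_text) in enumerate(new):
        best_index: Optional[int] = None
        best_score = 0.0
        for i, (old_type, old_text) in enumerate(old):
            if i in used or old_type != new_type:
                continue  # only compare text with text and code with code
            score = dice_similarity(old_text, new_text)
            if score > best_score:
                best_index, best_score = i, score
        if best_index is not None and best_score >= threshold:
            used.add(best_index)
            matches.append((best_index, j))
        else:
            matches.append((None, j))  # block has no predecessor (newly added)
    return matches
```

Calling `match_blocks(old_version, new_version)` with two lists of `(type, content)` tuples yields one pair per block of the newer version. A greedy pass is used here for brevity; it can mislink blocks when several candidates are nearly identical, which is one reason a careful evaluation of metrics and thresholds, as the abstract describes, matters.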