SOTorrent: Reconstructing and Analyzing the Evolution of Stack Overflow Posts

Bibliographic Details
Published in 2018 IEEE/ACM 15th International Conference on Mining Software Repositories (MSR), pp. 319-330
Main Authors Baltes, Sebastian; Dumani, Lorik; Treude, Christoph; Diehl, Stephan
Format Conference Proceeding
Language English
Published New York, NY, USA: ACM, 28.05.2018
Series ACM Conferences
ISBN 9781450357166, 1450357164
ISSN 2574-3864
DOI 10.1145/3196398.3196430

Abstract Stack Overflow (SO) is the most popular question-and-answer website for software developers, providing a large amount of code snippets and free-form text on a wide variety of topics. Like other software artifacts, questions and answers on SO evolve over time, for example when bugs in code snippets are fixed, code is updated to work with a more recent library version, or text surrounding a code snippet is edited for clarity. To be able to analyze how content on SO evolves, we built SOTorrent, an open dataset based on the official SO data dump. SOTorrent provides access to the version history of SO content at the level of whole posts and individual text or code blocks. It connects SO posts to other platforms by aggregating URLs from text blocks and by collecting references from GitHub files to SO posts. In this paper, we describe how we built SOTorrent, and in particular how we evaluated 134 different string similarity metrics regarding their applicability for reconstructing the version history of text and code blocks. Based on a first analysis using the dataset, we present insights into the evolution of SO posts, e.g., that post edits are usually small, happen soon after the initial creation of the post, and that code is rarely changed without also updating the surrounding text. Further, our analysis revealed a close relationship between post edits and comments. Our vision is that researchers will use SOTorrent to investigate and understand the evolution of SO posts and their relation to other platforms such as GitHub.
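
To illustrate the idea of reconstructing block-level version history with a string similarity metric, here is a minimal sketch. It is not the SOTorrent implementation: the abstract does not say which of the 134 evaluated metrics was selected, so Python's `difflib.SequenceMatcher` ratio serves as a stand-in, and the names `similarity`, `match_blocks`, the `(kind, content)` block representation, and the 0.6 threshold are all hypothetical choices made for this example.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]; difflib's ratio is a stand-in metric (assumption)."""
    return SequenceMatcher(None, a, b).ratio()


def match_blocks(old_blocks, new_blocks, threshold=0.6):
    """Greedily map each block of the new version to its most similar,
    not yet matched, block of the same kind in the old version.

    Blocks are (kind, content) tuples, where kind is "text" or "code".
    Returns {new_index: old_index or None}; None marks a block that has
    no predecessor in the old version.
    """
    matches = {}
    used = set()
    for i, (kind_new, content_new) in enumerate(new_blocks):
        best_j, best_score = None, threshold
        for j, (kind_old, content_old) in enumerate(old_blocks):
            # Only compare text with text and code with code.
            if j in used or kind_old != kind_new:
                continue
            score = similarity(content_old, content_new)
            if score >= best_score:
                best_j, best_score = j, score
        if best_j is not None:
            used.add(best_j)
        matches[i] = best_j
    return matches


# Two hypothetical versions of the same post.
version_1 = [
    ("text", "You can reverse a list like this:"),
    ("code", "for i in range(len(xs)):\n    print(xs[-i])"),
]
version_2 = [
    ("text", "You can reverse a list like this (fixed off-by-one):"),
    ("code", "for i in range(len(xs)):\n    print(xs[-i - 1])"),
    ("text", "Alternatively, use reversed(xs)."),
]

predecessors = match_blocks(version_1, version_2)
print(predecessors)
```

In this toy example the edited text and code blocks are linked to their earlier versions, while the newly added text block has no predecessor. A greedy, threshold-based matching keeps the sketch short; how the dataset itself resolves ambiguous matches is described in the paper, not here.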
Author Details
1. Baltes, Sebastian (University of Trier, Germany), research@sbaltes.com
2. Dumani, Lorik (University of Trier, Germany), dumani@uni-trier.de
3. Treude, Christoph (University of Adelaide, Australia), christoph.treude@adelaide.edu.au
4. Diehl, Stephan (University of Trier, Germany), diehl@uni-trier.de
CODEN IEEPAD
Copyright 2018 ACM
Discipline Computer Science
Genre orig-research
IsPeerReviewed false
IsScholarly true
Keywords stack overflow; open dataset; code snippets; software evolution
License Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from Permissions@acm.org.
MeetingName ICSE '18: 40th International Conference on Software Engineering
PageCount 12
SubjectTerms code snippets; Computer bugs; Data mining; History; Indexes; Measurement; open dataset; Software; software evolution; stack overflow
Software and its engineering -- Software creation and management -- Software post-development issues -- Software evolution
URI https://ieeexplore.ieee.org/document/8595215