Convex space learning for tabular synthetic data generation

Bibliographic Details
Main Authors Mahendra, Manjunath, Umesh, Chaithra, Bej, Saptarshi, Schultz, Kristian, Wolkenhauer, Olaf
Format Journal Article
Language English
Published 13.07.2024
Abstract Generating synthetic samples from the convex space of the minority class is a popular oversampling approach for imbalanced classification problems. Recently, deep-learning approaches have been successfully applied to modeling the convex space of minority samples. Beyond oversampling, however, learning the convex space of neighborhoods in training data has not been used to generate entire tabular datasets. In this paper, we introduce a deep-learning architecture (NextConvGeN) with a generator and a discriminator component that generates synthetic samples by learning to model the convex space of tabular data. The generator takes data neighborhoods as input and creates synthetic samples within the convex space of each neighborhood. The discriminator then tries to classify these synthetic samples against a randomly sampled batch of data from the rest of the data space. We compared our proposed model with five state-of-the-art tabular generative models across ten publicly available datasets from the biomedical domain. Our analysis reveals that synthetic samples generated by NextConvGeN preserve classification and clustering performance across real and synthetic data better than those of the other synthetic data generation models. Synthetic data generation by deep learning of the convex space produces high scores for popular utility measures. We further compared how diverse synthetic data generation strategies perform in the privacy-utility spectrum and present critical arguments on the necessity of high-utility models. Our research on deep learning of the convex space of tabular data opens up opportunities in clinical research, machine learning model development, decision support systems, and clinical data sharing.
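The idea of sampling "within the convex space of a neighborhood" can be illustrated with a minimal sketch. This is not the NextConvGeN generator: the abstract describes a learned generator trained against a discriminator, whereas the sketch below simply draws random convex weights from a Dirichlet distribution. The function name `convex_neighborhood_samples`, its parameters, and the choice of Euclidean nearest neighbors are assumptions made for illustration, and only numeric features are handled.

```python
import numpy as np


def convex_neighborhood_samples(X, k=5, n_samples=10, seed=0):
    """Draw synthetic rows as random convex combinations of k-neighborhoods.

    Hypothetical helper for illustration only; a learned generator would
    produce the convex weights instead of sampling them at random.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    n_rows, n_features = X.shape
    k = min(k, n_rows)  # a neighborhood cannot exceed the dataset size

    samples = np.empty((n_samples, n_features))
    for i in range(n_samples):
        # Pick a random anchor row and find its k nearest rows
        # (Euclidean distance; the anchor itself is included).
        anchor = rng.integers(n_rows)
        dists = np.linalg.norm(X - X[anchor], axis=1)
        neighborhood = X[np.argsort(dists)[:k]]

        # Dirichlet weights are non-negative and sum to 1, so the
        # weighted sum lies in the convex hull of the neighborhood.
        weights = rng.dirichlet(np.ones(k))
        samples[i] = weights @ neighborhood
    return samples


if __name__ == "__main__":
    toy = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [5.0, 5.0]])
    print(convex_neighborhood_samples(toy, k=3, n_samples=4))
```

Because every synthetic row is a convex combination of real rows, each feature value stays within the range spanned by its neighborhood, which is the property the abstract relies on.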
Copyright http://creativecommons.org/licenses/by-sa/4.0
DOI 10.48550/arxiv.2407.09789
IsDoiOpenAccess true
IsOpenAccess true
IsPeerReviewed false
IsScholarly false
OpenAccessLink https://arxiv.org/abs/2407.09789
SecondaryResourceType preprint
SubjectTerms Computer Science - Learning