Towards classification of email through selection of informative features

Bibliographic Details
Published in 2020 First International Conference on Power, Control and Computing Technologies (ICPC2T), pp. 316-320
Main Authors Sharaff, Aakanksha; Srinivasarao, Ulligaddala
Format Conference Proceeding
Language English
Published IEEE 01.01.2020
DOI 10.1109/ICPC2T48082.2020.9071488

Abstract Spammers' evasion of spam filters has become a serious problem in classifying emails as spam or ham. A major characteristic of email classification is the extremely high dimensionality of the emails sent and received. Training a classifier on such data without degrading its predictive capability is therefore important in email classification. Feature selection plays an important role in identifying the features most relevant to classifying emails, which motivates new selection techniques. Here, features are extracted based on the relationships established between the words of the different classes, which increases the probability that the selected words represent informative features. A word can be either positive or negative in nature. In this paper, the relationships among the words in the subject and content of emails are used to determine the nature of each word, and the most related words are then selected from the set of all words to form informative features. These words are then used to generate N-grams. Four classifiers, namely Decision Tree, Multinomial Naive Bayes, Random Forest, and Linear Support Vector Machine, are used to evaluate the performance of the selected N-gram features. The experimental analysis is performed on the Ling-Spam email dataset.
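
The pipeline summarized in the abstract (score words by their relationship to the classes, keep the most related words, then build N-grams from them) can be sketched roughly as follows. This is an illustration only, not the authors' implementation: the abstract does not name the selection statistic, so the chi-square test is assumed here because it appears among the record's subject terms. The function names, the toy emails, and the choice of bigrams are all hypothetical, and the classifier stage is left out.

```python
from collections import Counter

def chi_square_scores(emails, labels):
    """Map each word to (chi-square score, sign) for a spam/ham split.

    Assumption: chi-square on document frequencies stands in for the
    paper's word-relationship measure. Sign is +1 for words leaning
    towards spam, -1 for words leaning towards ham, 0 for neither.
    """
    n = len(emails)
    n_spam = sum(1 for y in labels if y == "spam")
    n_ham = n - n_spam
    spam_df, ham_df = Counter(), Counter()
    for tokens, y in zip(emails, labels):
        # Count each word once per email (document frequency).
        (spam_df if y == "spam" else ham_df).update(set(tokens))
    scores = {}
    for word in set(spam_df) | set(ham_df):
        a, b = spam_df[word], ham_df[word]   # spam / ham emails with the word
        c, d = n_spam - a, n_ham - b         # spam / ham emails without it
        denom = (a + b) * (c + d) * (a + c) * (b + d)
        chi2 = n * (a * d - b * c) ** 2 / denom if denom else 0.0
        sign = (a * d > b * c) - (a * d < b * c)
        scores[word] = (chi2, sign)
    return scores

def select_top_k(scores, k):
    """Keep the k highest-scoring words; ties are broken alphabetically."""
    ranked = sorted(scores, key=lambda w: (-scores[w][0], w))
    return set(ranked[:k])

def ngrams(tokens, keep, n=2):
    """Build n-grams from the selected words only, preserving word order."""
    kept = [t for t in tokens if t in keep]
    return [tuple(kept[i:i + n]) for i in range(len(kept) - n + 1)]

# Hypothetical toy corpus: each email is a list of tokens from subject and body.
emails = [
    ["free", "money", "offer", "now"],
    ["free", "offer", "click", "now"],
    ["meeting", "schedule", "project", "now"],
    ["project", "report", "meeting", "notes"],
]
labels = ["spam", "spam", "ham", "ham"]

scores = chi_square_scores(emails, labels)
selected = select_top_k(scores, k=4)
features = [ngrams(tokens, selected) for tokens in emails]
```

In a fuller experiment, the resulting N-gram features would be vectorized and passed to the four classifiers named in the abstract; the value of k and the N-gram length are tuning choices that the abstract does not specify.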
Author Affiliations Sharaff, Aakanksha: National Institute of Technology, Department of Computer Science and Engineering, Raipur, India
Srinivasarao, Ulligaddala: National Institute of Technology, Department of Computer Science and Engineering, Raipur, India
EISBN 1728149975; 9781728149974
Subjects CHI Square; Classifier; Decision trees; Feature extraction; Informative features; N-grams; Naive Bayes methods; Random forests; Relevant words; Support vector machines; Unsolicited e-mail