Towards classification of email through selection of informative features

Bibliographic Details
Published in	2020 First International Conference on Power, Control and Computing Technologies (ICPC2T) pp. 316 - 320
Main Authors	Sharaff, Aakanksha; Srinivasarao, Ulligaddala
Format	Conference Proceeding
Language	English
Published	IEEE 01.01.2020
Subjects	CHI Square; Classifier; Decision trees; Feature extraction; Informative features; N-grams; Naive Bayes methods; Random forests; Relevant words; Support vector machines; Unsolicited e-mail
DOI	10.1109/ICPC2T48082.2020.9071488

Abstract	Spammers' evasion of spam filters has become a serious problem in classifying emails as spam or ham. The extremely high dimensionality of sent and received email is one of the major characteristics of email classification, so it is important to train a classifier without reducing its predictive capability. Feature selection methods, including newly devised techniques, play an important role in identifying the most relevant features for classifying emails. Features are extracted based on the relationship established between the words of the different classes, which increases the probability that the chosen words represent informative features. Words can be either positive or negative in nature. In this paper, the relationship among the words in the subject and content of emails is used to determine the nature of each word, and the most related words are then selected from the set of all words to form the informative features. These words are then used to generate N-grams. Four classifiers, namely Decision Tree, Multinomial Naive Bayes, Random Forest, and Linear Support Vector Machine, are used to evaluate the performance of the selected N-gram features. The experimental analysis is performed on the Ling-Spam email dataset.
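The pipeline summarized in the abstract (score words by their relationship to the spam and ham classes, keep the most informative ones, then build N-grams from them) can be illustrated with a minimal sketch. The abstract does not say how the word-class relationship is computed; the sketch below assumes a chi-square statistic on word presence, which the subject terms hint at but the record does not confirm. The function names, the toy emails, and the cutoff `k` are all hypothetical and are not taken from the paper.

```python
from collections import Counter


def tokenize(text):
    # Lowercase and split on whitespace; a fuller pipeline would also
    # strip punctuation and stop words.
    return text.lower().split()


def chi_square_scores(docs, labels):
    """Return {word: chi-square statistic} for word presence vs. spam/ham label."""
    n = len(docs)
    n_spam = sum(1 for y in labels if y == "spam")
    n_ham = n - n_spam
    spam_df, ham_df = Counter(), Counter()
    for text, y in zip(docs, labels):
        for word in set(tokenize(text)):
            (spam_df if y == "spam" else ham_df)[word] += 1
    scores = {}
    for word in set(spam_df) | set(ham_df):
        a = spam_df[word]   # spam emails containing the word
        b = ham_df[word]    # ham emails containing the word
        c = n_spam - a      # spam emails without the word
        d = n_ham - b       # ham emails without the word
        denom = (a + b) * (c + d) * (a + c) * (b + d)
        scores[word] = n * (a * d - b * c) ** 2 / denom if denom else 0.0
    return scores


def select_words(scores, k):
    # Keep the k highest-scoring words; ties are broken alphabetically.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return {word for word, _ in ranked[:k]}


def ngrams_from_selected(text, selected, n=2):
    # Build N-grams only from the words that survived selection.
    kept = [w for w in tokenize(text) if w in selected]
    return [" ".join(kept[i:i + n]) for i in range(len(kept) - n + 1)]


# Hypothetical toy data standing in for Ling-Spam messages.
docs = [
    "win free money now",
    "free money offer",
    "meeting agenda attached",
    "project meeting notes",
]
labels = ["spam", "spam", "ham", "ham"]

scores = chi_square_scores(docs, labels)
selected = select_words(scores, k=3)
bigrams = ngrams_from_selected(docs[0], selected, n=2)
```

On this toy data, words that occur in every message of one class and none of the other score highest, and the bigrams are formed only from those retained words. In a full experiment the resulting N-gram counts would be passed to the four classifiers named in the abstract; that step is omitted here.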
Author Affiliations	Sharaff, Aakanksha: National Institute of Technology, Department of Computer Science and Engineering, Raipur, India; Srinivasarao, Ulligaddala: National Institute of Technology, Department of Computer Science and Engineering, Raipur, India
EISBN	1728149975 9781728149974
PageCount	5
URI	https://ieeexplore.ieee.org/document/9071488