SPWalk: Similar Property Oriented Feature Learning for Phishing Detection

Bibliographic Details
Published in	IEEE Access, Vol. 8, pp. 87031-87045
Main Authors	Liu, Xiuwen; Fu, Jianming
Format	Journal Article
Language	English
Published	Piscataway IEEE 2020 The Institute of Electrical and Electronics Engineers, Inc. (IEEE)
Subjects	Algorithms; Classification; Feature extraction; Feature learning; Hypertext systems; Machine learning; Network embedding; Nodes; Phishing; Phishing detection; Random walk; Robustness; Robustness (mathematics); Search engines; Similar property; Uniform resource locators; Visual aspects; Visualization; Websites

Abstract	Detecting phishing webpages is an essential task that protects legitimate websites and their users from various malicious activities. Classifying a suspect webpage as phishing or legitimate requires robust and effective features. However, recent phishing attacks usually make phishing webpages resemble legitimate webpages in both visual and functional aspects, which makes feature extraction more difficult. We propose SPWalk, an unsupervised feature learning algorithm for phishing detection. In SPWalk, similar property nodes refer to a collection of phishing webpages or a collection of legitimate webpages. We first construct a weblink network with nodes representing webpages. The edges between nodes represent the reference relationships that connect webpages through hyperlinks or similar textual content. SPWalk then applies a network embedding technique to map nodes into a low-dimensional vector space. A biased random walk procedure efficiently integrates both the structural information between nodes and the URL information of each node. The effectiveness and robustness of SPWalk come from three points: (1) phishing attackers do not have full control over reference relationships; (2) the structural regularities generated by diverse reference relationships can be exploited to discriminate between phishing and legitimate webpages; (3) node URL information makes the learned node representations better suited to phishing detection. Using node representations as numeric features, we conduct experiments to classify webpages as legitimate or phishing. We demonstrate the superiority of SPWalk over state-of-the-art techniques for phishing detection, especially in terms of precision (over 95%). Even when phishing webpages are well camouflaged by attackers to evade detection, SPWalk consistently exhibits better classification efficacy.
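The biased random walk described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the weighting rule (a constant structural term plus a URL-similarity bonus), the `bias` parameter, the string-similarity measure, and the toy pages and URLs are all assumptions chosen only to show how link structure and URL information might be combined in a single walk.

```python
import random
from difflib import SequenceMatcher


def url_similarity(url_a, url_b):
    """Crude string similarity in [0, 1]; a stand-in for any URL-based measure."""
    return SequenceMatcher(None, url_a, url_b).ratio()


def biased_walk(graph, urls, start, length, bias=1.0, rng=None):
    """Walk up to `length` nodes, favoring neighbors whose URLs resemble the current one."""
    rng = rng or random.Random()
    walk = [start]
    while len(walk) < length:
        current = walk[-1]
        neighbors = sorted(graph.get(current, ()))
        if not neighbors:
            break  # dead end: no reference relationships to follow
        # Weight is always positive: 1 for structure plus a URL-similarity bonus.
        weights = [1.0 + bias * url_similarity(urls[current], urls[n]) for n in neighbors]
        walk.append(rng.choices(neighbors, weights=weights, k=1)[0])
    return walk


# Toy weblink network (hypothetical pages and URLs, not data from the paper).
GRAPH = {
    "bank": {"bank_login", "news"},
    "bank_login": {"bank"},
    "news": {"bank", "fake_bank"},
    "fake_bank": {"news"},
}
URLS = {
    "bank": "https://bank.example/",
    "bank_login": "https://bank.example/login",
    "news": "https://news.example/",
    "fake_bank": "http://bank-example.test/login",
}

# One walk per start node; each walk is a sequence of node names.
walks = [biased_walk(GRAPH, URLS, node, length=5, rng=random.Random(0)) for node in sorted(GRAPH)]
```

In a pipeline of this general kind, the resulting walks would typically be treated as sentences and passed to a skip-gram model to obtain node vectors, which then serve as classifier features. Whether SPWalk follows exactly that recipe is not stated in this record.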
Author	Fu, Jianming Liu, Xiuwen
Author_xml	– sequence: 1 givenname: Xiuwen orcidid: 0000-0002-6202-1937 surname: Liu fullname: Liu, Xiuwen organization: Key Laboratory of Aerospace Information Security and Trusted Computing of Ministry of Education, School of Cyber Science and Engineering, Wuhan University, Wuhan, China – sequence: 2 givenname: Jianming orcidid: 0000-0002-4639-5824 surname: Fu fullname: Fu, Jianming email: jmfu@whu.edu.cn organization: Key Laboratory of Aerospace Information Security and Trusted Computing of Ministry of Education, School of Cyber Science and Engineering, Wuhan University, Wuhan, China
CODEN	IAECCG
CitedBy_id	crossref_primary_10_1177_18724981251321395 crossref_primary_10_3233_JIFS_223569 crossref_primary_10_1016_j_comnet_2024_110398 crossref_primary_10_1016_j_eswa_2023_119723 crossref_primary_10_1109_ACCESS_2022_3166474 crossref_primary_10_1016_j_jksuci_2023_01_004
Cites_doi	10.3758/BF03193020 10.1016/j.eswa.2018.09.040 10.1007/978-3-319-66402-6_22 10.1145/1314389.1314391 10.1145/1557019.1557153 10.1145/2939672.2939754 10.1007/s11280-013-0250-4 10.1145/2736277.2741093 10.1145/2806416.2806512 10.1109/INFCOM.2011.5934995 10.1016/j.eswa.2016.01.028 10.1109/INFCOM.2010.5462216 10.1145/2623330.2623732 10.1109/ACCESS.2019.2893980 10.1016/j.future.2009.07.012 10.1145/2976749.2978387 10.1145/1553374.1553462 10.24963/ijcai.2017/544 10.1016/j.eswa.2010.04.044 10.1016/j.dss.2016.05.005 10.1016/j.cose.2015.07.006 10.18653/v1/P17-1158 10.1109/TDSC.2006.50 10.1145/1242572.1242659 10.3115/v1/D14-1162 10.1109/TNN.2011.2161999 10.1109/SP.2011.25
ContentType	Journal Article
Copyright	Copyright The Institute of Electrical and Electronics Engineers, Inc. (IEEE) 2020
Copyright_xml	– notice: Copyright The Institute of Electrical and Electronics Engineers, Inc. (IEEE) 2020
DBID	97E ESBDL RIA RIE AAYXX CITATION 7SC 7SP 7SR 8BQ 8FD JG9 JQ2 L7M L~C L~D DOA
DOI	10.1109/ACCESS.2020.2992381
DatabaseName	IEEE All-Society Periodicals Package (ASPP) 2005–Present IEEE Xplore Open Access Journals IEEE All-Society Periodicals Package (ASPP) 1998–Present IEEE Electronic Library (IEL) CrossRef Computer and Information Systems Abstracts Electronics & Communications Abstracts Engineered Materials Abstracts METADEX Technology Research Database Materials Research Database ProQuest Computer Science Collection Advanced Technologies Database with Aerospace Computer and Information Systems Abstracts Academic Computer and Information Systems Abstracts Professional Directory of Open Access Journals - May need to register for free articles
DatabaseTitle	CrossRef Materials Research Database Engineered Materials Abstracts Technology Research Database Computer and Information Systems Abstracts – Academic Electronics & Communications Abstracts ProQuest Computer Science Collection Computer and Information Systems Abstracts Advanced Technologies Database with Aerospace METADEX Computer and Information Systems Abstracts Professional
DatabaseTitleList	Materials Research Database
Database_xml	– sequence: 1 dbid: DOA name: DOAJ Directory of Open Access Journals url: https://www.doaj.org/ sourceTypes: Open Website – sequence: 2 dbid: RIE name: IEEE Electronic Library (IEL) url: https://proxy.k.utb.cz/login?url=https://ieeexplore.ieee.org/ sourceTypes: Publisher
DeliveryMethod	fulltext_linktorsrc
Discipline	Engineering
EISSN	2169-3536
EndPage	87045
ExternalDocumentID	oai_doaj_org_article_4cdb1ab42a4f408bbab2f99e428ee221 10_1109_ACCESS_2020_2992381 9088949
Genre	orig-research
GrantInformation_xml	– fundername: National Science Foundation of China grantid: 61972297; U1636107 funderid: 10.13039/501100001809
GroupedDBID	0R~ 4.4 5VS 6IK 97E AAJGR ABAZT ABVLG ACGFS ADBBV AGSQL ALMA_UNASSIGNED_HOLDINGS BCNDV BEFXN BFFAM BGNUA BKEBE BPEOZ EBS EJD ESBDL GROUPED_DOAJ IPLJI JAVBF KQ8 M43 M~E O9- OCL OK1 RIA RIE RNS AAYXX CITATION RIG 7SC 7SP 7SR 8BQ 8FD JG9 JQ2 L7M L~C L~D
ID	FETCH-LOGICAL-c408t-dd40eeb0544c2a1bdfa64a6e08adefd66cdd77b5917611f2d8a5369d479c06a63
IEDL.DBID	DOA
ISSN	2169-3536
IngestDate	Wed Aug 27 01:29:17 EDT 2025 Mon Jun 30 06:19:28 EDT 2025 Tue Jul 01 01:22:33 EDT 2025 Thu Apr 24 22:51:56 EDT 2025 Wed Aug 27 02:41:42 EDT 2025
IsDoiOpenAccess	true
IsOpenAccess	true
IsPeerReviewed	true
IsScholarly	true
Language	English
License	https://creativecommons.org/licenses/by/4.0/legalcode
LinkModel	DirectLink
MergedId	FETCHMERGED-LOGICAL-c408t-dd40eeb0544c2a1bdfa64a6e08adefd66cdd77b5917611f2d8a5369d479c06a63
Notes	ObjectType-Article-1 SourceType-Scholarly Journals-1 ObjectType-Feature-2 content type line 14
ORCID	0000-0002-4639-5824 0000-0002-6202-1937
OpenAccessLink	https://doaj.org/article/4cdb1ab42a4f408bbab2f99e428ee221
PQID	2454101978
PQPubID	4845423
PageCount	15
ParticipantIDs	crossref_primary_10_1109_ACCESS_2020_2992381 proquest_journals_2454101978 ieee_primary_9088949 doaj_primary_oai_doaj_org_article_4cdb1ab42a4f408bbab2f99e428ee221 crossref_citationtrail_10_1109_ACCESS_2020_2992381
ProviderPackageCode	CITATION AAYXX
PublicationCentury	2000
PublicationDate	20200000 2020-00-00 20200101 2020-01-01
PublicationDateYYYYMMDD	2020-01-01
PublicationDate_xml	– year: 2020 text: 20200000
PublicationDecade	2020
PublicationPlace	Piscataway
PublicationPlace_xml	– name: Piscataway
PublicationTitle	IEEE access
PublicationTitleAbbrev	Access
PublicationYear	2020
Publisher	IEEE The Institute of Electrical and Electronics Engineers, Inc. (IEEE)
Publisher_xml	– name: IEEE – name: The Institute of Electrical and Electronics Engineers, Inc. (IEEE)
SSID	ssj0000816957
Score	2.2620046
SourceID	doaj proquest crossref ieee
SourceType	Open Website Aggregation Database Enrichment Source Index Database Publisher
StartPage	87031
Title	SPWalk: Similar Property Oriented Feature Learning for Phishing Detection
URI	https://ieeexplore.ieee.org/document/9088949 https://www.proquest.com/docview/2454101978 https://doaj.org/article/4cdb1ab42a4f408bbab2f99e428ee221
Volume	8
linkProvider	Directory of Open Access Journals