Dealing with the Data Imbalance Problem in Pulsar Candidate Sifting Based on Feature Selection

Bibliographic Details
Published in	Research in astronomy and astrophysics Vol. 24; no. 2; pp. 25010 - 137
Main Authors	Lin, Haitao, Li, Xiangru
Format	Journal Article
Language	English
Published	Beijing National Astronomical Observatories, CAS and IOP Publishing 01.02.2024 IOP Publishing School of Mathematics and Statistics, Hanshan Normal University, Chaozhou 521000, China; School of Computer Science, South China Normal University, Guangzhou 510631, China
Subjects	(stars:) pulsars: general; Algorithms; Astronomical models; Astronomy; Candidates; Feature selection; Greedy algorithms; Machine learning; methods: data analysis; methods: statistical; Performance enhancement; Pulsars; Radio astronomy

More Information
Summary:	Pulsar detection has recently become an active research topic in radio astronomy. One of the essential procedures for pulsar detection is pulsar candidate sifting (PCS), which identifies potential pulsar signals in a survey. However, pulsar candidates are always class-imbalanced: most candidates are non-pulsars such as radio frequency interference (RFI), and only a tiny fraction come from real pulsars. Class imbalance can greatly degrade the performance of machine learning (ML) models, at a heavy cost when real pulsars are misclassified. To deal with this problem, this work focuses on feature selection, the process of selecting a subset of the most relevant features from a feature pool. Features that distinguish pulsars from non-pulsars can significantly improve classifier performance even when the data are highly imbalanced. An algorithm for feature selection called the K-fold Relief-Greedy (KFRG) algorithm is designed. KFRG is a two-stage algorithm: in the first stage, it filters out irrelevant features according to their K-fold Relief scores; in the second stage, it removes redundant features and selects the most relevant ones by a forward greedy search strategy. Experiments on the data set of the High Time Resolution Universe survey verified that ML models based on KFRG are capable of PCS, correctly separating pulsars from non-pulsars even when the candidates are highly class-imbalanced.
Bibliography:	RAA-2023-0240.R1
ISSN:	1674-4527 2397-6209
DOI:	10.1088/1674-4527/ad0c26
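
The two-stage idea described in the Summary (a Relief-based filter followed by a forward greedy search) can be illustrated with a minimal sketch. This is not the authors' implementation: the record gives no algorithmic details, so the basic two-class Relief variant, the reading of "K-fold Relief" as an average over stratified folds, the `keep` cutoff, the nearest-centroid classifier, the F1 criterion, and the synthetic data are all assumptions chosen for illustration.

```python
import random

def relief_scores(X, y):
    """Basic two-class Relief (assumed variant): a feature scores high when it
    differs on a sample's nearest miss and agrees on its nearest hit."""
    n, d = len(X), len(X[0])
    span = [(max(r[j] for r in X) - min(r[j] for r in X)) or 1.0 for j in range(d)]
    diff = lambda a, b, j: abs(a[j] - b[j]) / span[j]      # range-normalized difference
    dist = lambda a, b: sum(diff(a, b, j) for j in range(d))
    w = [0.0] * d
    for i in range(n):
        hits = [k for k in range(n) if k != i and y[k] == y[i]]
        misses = [k for k in range(n) if y[k] != y[i]]
        if not hits or not misses:
            continue  # no usable neighbour of one kind for this sample
        h = min(hits, key=lambda k: dist(X[i], X[k]))
        m = min(misses, key=lambda k: dist(X[i], X[k]))
        for j in range(d):
            w[j] += (diff(X[i], X[m], j) - diff(X[i], X[h], j)) / n
    return w

def kfold_relief(X, y, k=5):
    """Average Relief scores over k stratified folds (an assumed reading of
    'K-fold Relief'; the original definition may differ)."""
    counts, folds = {}, [[] for _ in range(k)]
    for i, label in enumerate(y):
        folds[counts.get(label, 0) % k].append(i)   # round-robin within each class
        counts[label] = counts.get(label, 0) + 1
    total = [0.0] * len(X[0])
    for idx in folds:
        if not idx:
            continue
        w = relief_scores([X[i] for i in idx], [y[i] for i in idx])
        total = [t + s / k for t, s in zip(total, w)]
    return total

def f1_nearest_centroid(X, y, feats):
    """F1 of class 1 for a nearest-centroid classifier restricted to `feats`.
    Scored on the training data for brevity (an illustrative shortcut)."""
    cent = {}
    for c in (0, 1):
        rows = [x for x, t in zip(X, y) if t == c]
        cent[c] = [sum(r[j] for r in rows) / len(rows) for j in feats]
    tp = fp = fn = 0
    for x, t in zip(X, y):
        d0 = sum((x[j] - cj) ** 2 for j, cj in zip(feats, cent[0]))
        d1 = sum((x[j] - cj) ** 2 for j, cj in zip(feats, cent[1]))
        pred = 1 if d1 < d0 else 0
        if pred == 1 and t == 1:
            tp += 1
        elif pred == 1 and t == 0:
            fp += 1
        elif pred == 0 and t == 1:
            fn += 1
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def kfrg_sketch(X, y, keep=2, k=5):
    """Stage 1: keep the `keep` features with the highest averaged Relief score.
    Stage 2: forward greedy search, adding a feature only if F1 strictly improves."""
    scores = kfold_relief(X, y, k)
    kept = sorted(range(len(scores)), key=lambda j: -scores[j])[:keep]
    selected, best = [], 0.0
    while True:
        trials = [(f1_nearest_centroid(X, y, selected + [j]), j)
                  for j in kept if j not in selected]
        if not trials:
            break
        top, j = max(trials, key=lambda t: t[0])
        if top <= best:
            break  # no remaining feature improves the score
        selected.append(j)
        best = top
    return kept, selected

# Synthetic, imbalanced toy data (hypothetical): 20 "pulsars" among 220 candidates.
# Feature 0 is informative, feature 1 is a rescaled copy of it, features 2-3 are noise.
random.seed(0)
X, y = [], []
for i in range(220):
    label = 1 if i % 11 == 0 else 0
    base = random.uniform(9, 10) if label else random.uniform(0, 1)
    X.append([base, 2 * base + 1, random.random(), random.random()])
    y.append(label)

kept, selected = kfrg_sketch(X, y)
print("kept after stage 1:", kept)
print("selected after stage 2:", selected)
```

On this toy data the filter stage keeps the informative feature and its rescaled copy, and the greedy stage then retains only one of the two, because adding the redundant copy does not improve F1. A real pipeline would score candidate subsets on held-out data rather than on the training set.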