Flexible Techniques to Detect Typical Hidden Errors in Large Longitudinal Datasets

Bibliographic Details
Published in	Symmetry (Basel) Vol. 16; no. 5; p. 529
Main Authors	Bruni, Renato; Daraio, Cinzia; Di Leo, Simone
Format	Journal Article
Language	English
Published	Basel MDPI AG 01.05.2024
Subjects	Analysis; Artificial intelligence; big data; data quality; Datasets; Education parks; Education, Higher; Empirical analysis; Error correction; Error correction & detection; Higher education institutions; information processing; information reconstruction; longitudinal data sequences; Longitudinal studies; Methods; School facilities; Time series; Germany
Summary:	The increasing availability of longitudinal data (repeated numerical observations of the same units at different times) requires the development of flexible techniques to automatically detect errors in such data. Besides standard types of errors, which can be treated with generic error correction techniques, large longitudinal datasets may present specific problems not easily traceable by the generic techniques. In particular, after applying those generic techniques, time series in the data may contain trends, natural fluctuations and possible surviving errors. To study the data evolution, one main issue is distinguishing those elusive errors from the rest, which should be kept as they are and not flattened or altered. This work responds to this need by identifying some types of elusive errors and by proposing a statistical-mathematical approach to capture their complexity that can be applied after the above generic techniques. The proposed approach is based on a system of indicators and works at the formal level by studying the differences between consecutive values of data series and the symmetries and asymmetries of these differences. It operates regardless of the specific meaning of the data and is thus applicable in a variety of contexts. We implement this approach in a relevant database of European Higher Education institutions (ETER) by analyzing two key variables: “Total academic staff” and “Total number of enrolled students”, which are two of the most important variables, often used in empirical analysis as a proxy for size, and are considered by policymakers at the European level. The results are very promising.
ISSN:	2073-8994
DOI:	10.3390/sym16050529