Data Schema to Formalize Education Research & Development Using Natural Language Processing

Bibliographic Details
Published in	2021 Systems and Information Engineering Design Symposium (SIEDS), pp. 1-6
Main Authors	Frederick, Hannah; Hong, Haizhu; Williams, Margaret; West, Amanda; Wright, Brian
Format	Conference Proceeding
Language	English
Published	IEEE 30.04.2021
Subjects	Coherence; Data collection; Data Schema; Dictionaries; Education; Education Research; Natural language processing; Semantics; Vocabulary
Summary:	Our work aims to aid the development of an open source data schema for educational interventions by applying natural language processing (NLP) techniques to publications within the What Works Clearinghouse (WWC) and the Education Resources Information Center (ERIC). A data schema describes the relationships between individual elements of interest (in this case, research in education) and collectively documents those elements in a data dictionary. To facilitate the creation of this educational data schema, we first run a two-topic latent Dirichlet allocation (LDA) model on the titles and abstracts of papers that met WWC standards without reservation and of papers that did not, separated by math and reading subdomains. We find that the distributions of allocation to these two topics suggest structural differences between WWC and non-WWC literature. We then apply Term Frequency-Inverse Document Frequency (TF-IDF) scoring to the vocabulary of WWC titles and abstracts to determine the most relevant unigrams and bigrams currently present in WWC. Finally, we use an LDA model again to cluster WWC titles and abstracts into topics, or sets of words grouped by underlying semantic similarity. We find that the optimal number of topics in WWC is 11: 39 of 50 models returned 11 as the optimal number, with an average coherence score of 0.4096 among those 39 models. Based on the TF-IDF and LDA methods presented, we can begin to identify core themes of high-quality literature that will better inform the creation of a universal data schema within education research.
DOI:	10.1109/SIEDS52267.2021.9483781
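
Illustration (not part of the record): the TF-IDF step named in the summary can be sketched in a few lines of standard-library Python. This is a minimal sketch under stated assumptions, not the authors' implementation. The record does not specify tokenization, weighting variant, or ranking rule, so the textbook form tf × log(N/df), the letters-only tokenizer, the ranking by mean score, and the helper names `ngrams`, `tfidf`, and `top_terms` are all assumptions, and the sample abstracts are invented.

```python
import math
import re
from collections import Counter


def ngrams(text, n_max=2):
    """Lowercase, keep runs of letters, and return all 1..n_max-grams."""
    tokens = re.findall(r"[a-z]+", text.lower())
    grams = []
    for n in range(1, n_max + 1):
        grams += [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return grams


def tfidf(docs, n_max=2):
    """Return one {term: score} dict per document.

    tf  = term count / number of terms in the document
    idf = log(N / df), df = number of documents containing the term
    (an assumed textbook variant; other libraries smooth or normalize)
    """
    doc_terms = [Counter(ngrams(d, n_max)) for d in docs]
    df = Counter()
    for counts in doc_terms:
        df.update(counts.keys())  # each document counts once per term
    n_docs = len(docs)
    scores = []
    for counts in doc_terms:
        total = sum(counts.values())
        scores.append(
            {t: (c / total) * math.log(n_docs / df[t]) for t, c in counts.items()}
        )
    return scores


def top_terms(docs, k=5, n_max=2):
    """Rank unigrams and bigrams by mean TF-IDF score across the corpus."""
    totals = Counter()
    for doc_scores in tfidf(docs, n_max):
        for term, score in doc_scores.items():
            totals[term] += score
    return [(t, s / len(docs)) for t, s in totals.most_common(k)]


# Hypothetical toy abstracts, not taken from WWC or ERIC.
abstracts = [
    "Peer tutoring improved reading comprehension in third grade students.",
    "A randomized trial of peer tutoring on math achievement.",
    "Reading comprehension gains from a summer reading program.",
]
for term, score in top_terms(abstracts):
    print(f"{term}: {score:.4f}")
```

With this weighting, a term that appears in every document scores zero, so the ranking favors vocabulary that distinguishes some titles and abstracts from others. A real corpus would likely also need stop-word removal, which this sketch omits.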