Unstructured Data Processing using Spark for Topics Modelling

Bibliographic Details
Published in International journal of engineering and advanced technology Vol. 9; no. 5; pp. 1060-1063
Main Authors Sokegbe, Adjovi Irène; Nainwal, Ayushi
Format Journal Article
Language English
Published 30.06.2020
ISSN 2249-8958
DOI 10.35940/ijeat.E9992.069520

Abstract The Information Technology domain changes day by day. The volume of data keeps growing, and so does the demand to process it. Data is either structured or unstructured. Because data now comes from multiple sources and in many varieties, the term “Big data” is used instead of simply “data”. It is reported that 80% of an enterprise’s data is unstructured [1]. However, the procedures for handling unstructured data are more complex than those for structured data. It is therefore necessary to understand this type of data clearly and to know how to extract useful information from it. In this paper we study how to retrieve useful information from unstructured data in the e-commerce domain using a data analysis tool, Spark. First, an overview of structured and unstructured data and of data analysis is provided. Then an information retrieval algorithm is implemented with the Spark MLlib tool to determine, for a set of negative or positive reviews, which subjects customers discuss most. Such a study is needed to improve business on the basis of customer satisfaction reviews. The model is the unsupervised machine learning algorithm Latent Dirichlet Allocation (LDA). Finally, the model is evaluated on the basis of some parameters.
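
The abstract names LDA over customer reviews but this record does not include the paper's code. As a rough illustration of the idea only, the sketch below fits a two-topic LDA model to a handful of made-up reviews. It is not the authors' implementation and does not use Spark: it swaps in a small collapsed Gibbs sampler written in plain Python, whereas Spark MLlib's LDA uses EM or online variational optimizers. The review texts, the stop-word list, the hyperparameters `alpha` and `beta`, and all function names are assumptions made for this example.

```python
import random
from collections import Counter

# Hypothetical toy reviews; the paper's actual data set is not shown in this record.
REVIEWS = [
    "fast delivery great packaging",
    "the delivery was late and packaging damaged",
    "battery life is great and screen bright",
    "screen cracked battery weak",
]
STOP_WORDS = {"the", "a", "and", "is", "was"}  # assumed, deliberately minimal


def tokenize(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]


def lda_gibbs(docs, k, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA.

    Returns (topic_word, doc_topic): per-topic word counts and
    per-document topic counts after the final sweep.
    """
    rng = random.Random(seed)
    vocab_size = len({w for d in docs for w in d})
    doc_topic = [[0] * k for _ in docs]         # tokens of doc d assigned to topic t
    topic_word = [Counter() for _ in range(k)]  # occurrences of word w in topic t
    topic_total = [0] * k                       # tokens assigned to topic t
    assign = []
    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(k)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assign.append(zs)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                z = assign[d][i]
                # Take the token out of the counts before resampling it.
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Unnormalized conditional probability of each topic.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(k)
                ]
                z = rng.choices(range(k), weights=weights)[0]
                assign[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    return topic_word, doc_topic


def top_terms(topic_word, n=3):
    """Up to n most frequent terms per topic, by sampled counts."""
    return [[w for w, c in tw.most_common(n) if c > 0] for tw in topic_word]


docs = [tokenize(r) for r in REVIEWS]
topic_word, doc_topic = lda_gibbs(docs, k=2)
topics = top_terms(topic_word)
for t, terms in enumerate(topics):
    print(f"topic {t}: {terms}")
```

The per-topic term lists play the role of the "subjects most discussed by customers" in the abstract. On a corpus this small the topics are not meaningful; the sketch only shows the shape of the computation.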
Author Affiliation Computer Science and Engineering, Alakh Prakash Goyal Shimla University, Shimla, India