Urdu Word Segmentation using Machine Learning Approaches

Bibliographic Details
Published in International Journal of Advanced Computer Science & Applications, Vol. 9, no. 6
Main Authors Khan, Sadiq Nawaz; Khan, Khairullah; Khan, Wahab; Khan, Asfandyar; Subhan, Fazali; Ullah, Aman; Ullah, Burhan
Format Journal Article
Language English
Published West Yorkshire: Science and Information (SAI) Organization Limited, 2018
Abstract Word segmentation is a basic NLP task that plays a significant role in many NLP areas. Areas that benefit from word segmentation include information retrieval (IR), part-of-speech (POS) tagging, named entity recognition (NER), and sentiment analysis. Urdu word segmentation is challenging for several reasons, chiefly the space insertion problem and the space omission problem. Compared with Urdu, the tools and resources developed for word segmentation of English and similar Western languages achieve record-setting performance. Some languages, such as English, indicate word boundaries clearly with spaces or with capitalization of a word's first character. Many other languages, e.g. Thai, Lao, and Urdu, lack proper delimiters between words. The objective of this research is to present a machine-learning-based approach to Urdu word segmentation, for which we adopted conditional random fields (CRF). Compound words and reduplicated words pose further challenges in Urdu text. In this paper, we try to overcome these challenges with a machine learning methodology.
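
The abstract does not state how the segmentation task is encoded for the CRF. A common framing, assumed here purely for illustration, treats segmentation as character-level sequence labeling: each character is tagged B (begins a word) or I (inside a word), and a sequence model such as a CRF predicts the tags from local context features. The sketch below shows only that encoding; the function names, the B/I tag set, and the feature template are hypothetical and not taken from the paper, and no CRF training is performed.

```python
from typing import Dict, List, Tuple


def words_to_tags(words: List[str]) -> Tuple[str, List[str]]:
    """Turn gold-segmented words into an unspaced string plus one B/I tag per character.

    Dropping the spaces mimics text where word boundaries are not marked.
    """
    words = [w for w in words if w]  # ignore empty tokens so tags stay aligned
    chars = "".join(words)
    tags: List[str] = []
    for word in words:
        tags.extend(["B"] + ["I"] * (len(word) - 1))
    return chars, tags


def tags_to_words(chars: str, tags: List[str]) -> List[str]:
    """Rebuild words from characters and predicted tags (inverse of words_to_tags)."""
    words: List[str] = []
    for ch, tag in zip(chars, tags):
        if tag == "B" or not words:
            words.append(ch)  # start a new word
        else:
            words[-1] += ch   # extend the current word
    return words


def char_features(chars: str, i: int) -> Dict[str, str]:
    """Hypothetical feature template: current character and its two neighbours."""
    return {
        "cur": chars[i],
        "prev": chars[i - 1] if i > 0 else "<s>",
        "next": chars[i + 1] if i + 1 < len(chars) else "</s>",
    }


if __name__ == "__main__":
    # Romanized placeholder tokens; Urdu strings are handled the same way,
    # since Python str is indexed by Unicode code point.
    text, labels = words_to_tags(["ab", "cde"])
    print(text, labels)
    print(tags_to_words(text, labels))
```

In a real pipeline, the per-character feature dictionaries and tag sequences would be passed to a CRF toolkit for training; which toolkit and features the authors used is not stated in this record.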
Copyright 2018. This work is licensed under https://creativecommons.org/licenses/by/4.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.
DOI 10.14569/IJACSA.2018.090628
Discipline Computer Science
EISSN 2156-5570
ISSN 2158-107X
Subjects Conditional random fields; Data mining; Effectiveness; Languages; Machine learning; Segmentation; Sentiment analysis
URI https://www.proquest.com/docview/2656413552