Crawling social media to create morphological resource of under-resourced language: Melanau language

Suhaila Saee, Ranaivo Malancon Bali, Lay Ki Soon, Tek Yong Lim

Research output: Chapter in Book/Report/Conference proceeding > Conference Paper > Research


Abstract

Building a morphological analyser for an under-resourced language requires a morphological resource. Because few such resources exist in digital format, they are usually created through digitisation, which is time-consuming and tedious. The objective of this work is to develop new steps for creating morphological resources from social media. The first step is crawling blogs and tweets; a limited list of words of the under-resourced language was used to reduce the number of crawled web pages. Next, the crawled pages and tweets were normalised: this step cleaned the informal, noisy crawled data and transformed it into a clean wordlist for the following step, dictionary lookup validation. Lastly, the wordlist was validated, because language mixing caused uncertainty about the spelling standard. At this stage, an edit distance algorithm, namely Jaro-Winkler, was applied to determine the accuracy of the spelling standard by comparing the wordlist with the dictionary. The findings suggest that the availability of a large number of dictionary word entries could improve the poor accuracy results. The developed steps can assist other researchers in creating validated morphological resources, or even language resources, for under-resourced languages.
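The normalisation and validation steps described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names (`normalise`, `jaro_winkler`, `validate`), the cleaning rules, the 0.9 threshold, and the sample strings are all assumptions made for the example. Only the use of Jaro-Winkler similarity against a dictionary comes from the abstract.

```python
import re


def normalise(text):
    """Turn noisy crawled text into a list of unique lowercase words.

    Assumed cleaning rules: drop URLs, @mentions and #hashtags, then
    keep alphabetic tokens (with inner hyphens or apostrophes).
    """
    text = re.sub(r"https?://\S+|[@#]\w+", " ", text.lower())
    tokens = re.findall(r"[a-z]+(?:[-'][a-z]+)*", text)
    seen, words = set(), []
    for token in tokens:
        if token not in seen:
            seen.add(token)
            words.append(token)
    return words


def jaro(s1, s2):
    """Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1, matched2 = [False] * len1, [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order // 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1, s2, p=0.1, max_prefix=4):
    """Jaro similarity boosted by the length of the common prefix."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return j + prefix * p * (1 - j)


def validate(words, dictionary, threshold=0.9):
    """Map each word to (closest dictionary entry, score, accepted?)."""
    results = {}
    for word in words:
        if not dictionary:
            results[word] = (None, 0.0, False)
            continue
        best = max(dictionary, key=lambda entry: jaro_winkler(word, entry))
        score = jaro_winkler(word, best)
        results[word] = (best, score, score >= threshold)
    return results


if __name__ == "__main__":
    # Placeholder strings, not Melanau data.
    wordlist = normalise("Check http://example.com @user MARHTA marhta!")
    for word, outcome in validate(wordlist, ["martha", "check"]).items():
        print(word, outcome)
```

The threshold controls how strictly a crawled spelling must agree with a dictionary entry. With a small dictionary, many valid words have no close entry and fall below it, which is consistent with the abstract's observation that more dictionary entries could improve accuracy.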

Original language: English
Title of host publication: 2017 2nd International Conference on Information in Business and Technology Management (I2BM)
Editors: Zuwairie Ibrahim
Place of Publication: USA
Publisher: American Scientific Publishers
Pages: 11503-11507
Number of pages: 5
Volume: 23
Edition: 11
Publication status: Published - 2017
Externally published: Yes
Event: International Conference on Information in Business and Technology Management 2017 - Penang, Malaysia
Duration: 18 Apr 2017 to 20 Apr 2017
Conference number: 2nd
https://i2bm.wordpress.com

Publication series

Name: Advanced Science Letters
Publisher: American Scientific Publishers
ISSN (Print): 1936-6612
ISSN (Electronic): 1936-7317

Conference

Conference: International Conference on Information in Business and Technology Management 2017
Abbreviated title: I2BM
Country/Territory: Malaysia
City: Penang
Period: 18/04/17 to 20/04/17

Keywords

  • Morphological resource
  • Social media
  • Under-resourced language
