Abstract
Building a morphological analyser for an under-resourced language requires the creation of a morphological resource. Because few morphological resources exist in digital format, they are usually created through digitisation, which is a time-consuming and tedious task. The objective of this work is to develop new steps for creating morphological resources from social media. The first step is crawling blogs and tweets; a limited list of words in the under-resourced language was used to reduce the number of crawled web pages. The crawled pages and tweets were then normalised: this step cleaned the informal, noisy crawled data and transformed it into a clean wordlist for the next process, dictionary lookup validation. Lastly, the wordlist was validated, because language mixing caused uncertainty about the spelling standard. At this stage, an edit distance algorithm, namely Jaro-Winkler, was applied to determine the accuracy of the spelling standard by comparison with the dictionary. The findings suggest that the availability of a large number of dictionary word entries could improve the accuracy of the poor results. The developed steps are recommended to assist other researchers in creating validated morphological resources, or even language resources, for under-resourced languages.
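The normalisation and dictionary lookup validation steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`normalise`, `jaro_winkler`, `validate`), the cleaning rules, the 0.9 acceptance threshold, and the sample words are all assumptions made for illustration. The prefix scale of 0.1 and the four-character prefix cap are the values conventionally used with Jaro-Winkler.

```python
import re


def normalise(text):
    """Turn noisy social media text into a list of lowercase word tokens.
    The cleaning rules here are illustrative assumptions."""
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # drop URLs
    text = re.sub(r"[@#]\w+", " ", text)                # drop mentions and hashtags
    return re.findall(r"[^\W\d_]+", text)               # keep runs of letters only


def jaro(s1, s2):
    """Jaro similarity in the range [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1, matched2 = [False] * len1, [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order // 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1, s2, scale=0.1):
    """Jaro-Winkler: Jaro similarity boosted by a shared prefix (max 4 chars)."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * scale * (1 - base)


def validate(words, dictionary, threshold=0.9):
    """Map each word to its closest dictionary entry and score,
    keeping only words that reach the (assumed) threshold."""
    entries = sorted(dictionary)
    accepted = {}
    for word in words:
        best = max(entries, key=lambda entry: jaro_winkler(word, entry))
        score = jaro_winkler(word, best)
        if score >= threshold:
            accepted[word] = (best, score)
    return accepted


# Hypothetical sample data, not taken from the paper.
sample_dictionary = {"makan", "nasi"}
sample_words = normalise("Makann NASI!! http://example.com @user xyz")
validated = validate(sample_words, sample_dictionary)
```

In this sketch a crawled token is accepted when its closest dictionary entry scores at or above the threshold, so a small dictionary leaves many valid words unmatched, which is consistent with the abstract's observation that more dictionary entries could improve accuracy.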
Original language | English |
---|---|
Title of host publication | 2017 2nd International Conference on Information in Business and Technology Management (I2BM) |
Editors | Zuwairie Ibrahim |
Place of Publication | USA |
Publisher | American Scientific Publishers |
Pages | 11503-11507 |
Number of pages | 5 |
Volume | 23 |
Edition | 11 |
DOIs | |
Publication status | Published - 2017 |
Externally published | Yes |
Event | International Conference on Information in Business and Technology Management 2017, Penang, Malaysia; Duration: 18 Apr 2017 → 20 Apr 2017; Conference number: 2nd; https://i2bm.wordpress.com |
Publication series
Name | Advanced Science Letters |
---|---|
Publisher | American Scientific Publishers |
ISSN (Print) | 1936-6612 |
ISSN (Electronic) | 1936-7317 |
Conference
Conference | International Conference on Information in Business and Technology Management 2017 |
---|---|
Abbreviated title | I2BM |
Country/Territory | Malaysia |
City | Penang |
Period | 18/04/17 → 20/04/17 |
Internet address |
Keywords
- Morphological resource
- Social media
- Under-resourced language