Academia versus social media: a psycho-linguistic analysis

Thin Nguyen, Svetha Venkatesh, Dinh Phung

Research output: Contribution to journal › Article › Research › peer-review

1 Citation (Scopus)


Publication pressure has influenced the way scientists report their experimental results. It has recently been found that scientific outcomes have been exaggerated or distorted (spin) in the hope of publication. Apart from examining the content for spin, language style has proven to be a good trace of it. For example, the use of words from emotion lexicons has been used to interpret exaggeration and overstatement in academia. This work adopts a data-driven approach to explore a comprehensive set of psycho-linguistic features for a large corpus of PubMed papers published over the last four decades. The language features for other media, namely an online encyclopedia (Wikipedia), online diaries (web-logs), online forums (Reddit), and micro-blogs (Twitter), are also extracted. Several binary classifications are employed to discover linguistic predictors of scientific abstracts versus other media, as well as strong predictors of scientific articles in different cohorts of impact factors and author affiliations. Trends in the language styles of scientific articles over the course of 40 years have also been discovered, showing how academic writing evolved over that period. The study demonstrates advances in lightning-fast cluster computing for dealing with large-scale data: 5.8 terabytes comprising 3.6 billion records from all the media. The good performance of the advanced cluster computing framework suggests the potential of pattern recognition in data at scale.

Original language: English
Pages (from-to): 228-237
Number of pages: 10
Journal: Journal of Computational Science
Publication status: Published - Mar 2018
Externally published: Yes


  • Academia
  • Computer-mediated communication
  • Large-scale computing
  • Psycho-linguistics
  • Social media analytics
