Performing analytics on SNOMED CT coded database - Serdang Hospital use-case

Khalil Bouzekri, Md Khadzir Sheik Ahmad, Wael Hamdan, Dickson Lukose, Nur Shaema Darus, Syirahaniza Mohd Salleh

Research output: Contribution to conference (abstract, peer-reviewed)


SNOMED CT is a globally recognized healthcare terminology whose main purpose is to standardize the definition and usage of clinical terms. It is represented as an ontology: a set of concepts structured in a hierarchy of specialization (e.g., Angina is a specialization of Ischaemic Heart Disease), together with other meaningful relationships between concepts (e.g., Ischaemic Heart Disease has finding site Heart Structure). In a SNOMED CT coded database, the content has been standardized by adding explicit “linkages” to SNOMED CT concepts. These linkages improve the accuracy of analytics in two ways: (1) data gathering moves from searching for occurrences of pieces of text (e.g., Ischaemic Heart Disease) to searching for standardized forms, i.e., SNOMED CT concepts (e.g., 414545008 for Ischaemic Heart Disease); (2) the hierarchy of SNOMED CT concepts can be leveraged to discover more results (e.g., 194828000, the SNOMED CT concept for Angina, is a specialization of 414545008). To validate the gain in accuracy, we used an excerpt of the Serdang Hospital database (DB) containing 100 encounters with clinical notes, and its SNOMED CT coded counterpart (SCTDB), which was manually validated by domain experts. We created two simple queries, Q1 and Q2, to select all cases of Ischaemic Heart Disease (a case being defined as an encounter with Ischaemic Heart Disease as a clinical finding). Q1 uses the phrase Ischaemic Heart Disease and is sent to DB; Q2 uses the SNOMED CT concept 414545008 and is sent to a web service that pre-processes queries containing SNOMED CT concepts and retrieves the results from SCTDB. The difference is striking: Q1 returns only 1 encounter, whereas Q2 returns 51.
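The contrast between Q1 and Q2 can be illustrated with a minimal sketch. It is not the authors' web service: the toy hierarchy, the encounter records, and the function names (`descendants_or_self`, `q1_text`, `q2_concept`) are assumptions made for illustration. Only the two concept identifiers come from the abstract.

```python
from collections import deque

# Concept identifiers as given in the abstract.
IHD = "414545008"     # Ischaemic Heart Disease
ANGINA = "194828000"  # Angina

# Hypothetical toy excerpt of the "is a" hierarchy: child -> list of parents.
IS_A = {ANGINA: [IHD]}

# Invented toy encounters; "codes" stands for the SNOMED CT linkages.
ENCOUNTERS = [
    {"id": 1, "note": "Ischaemic Heart Disease, stable", "codes": {IHD}},
    {"id": 2, "note": "angina on exertion", "codes": {ANGINA}},
    {"id": 3, "note": "IHD f/u", "codes": {IHD}},
    {"id": 4, "note": "fractured wrist", "codes": set()},
]


def descendants_or_self(concept, is_a):
    """Return the concept plus every concept that specializes it."""
    # Invert child -> parents into parent -> children.
    children = {}
    for child, parents in is_a.items():
        for parent in parents:
            children.setdefault(parent, []).append(child)
    # Breadth-first walk down the hierarchy.
    seen = {concept}
    queue = deque([concept])
    while queue:
        current = queue.popleft()
        for child in children.get(current, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


def q1_text(encounters, phrase):
    """Q1 style: case-insensitive search for a phrase in the free text."""
    return [e["id"] for e in encounters if phrase.lower() in e["note"].lower()]


def q2_concept(encounters, concept, is_a):
    """Q2 style: match the concept or any of its specializations."""
    wanted = descendants_or_self(concept, is_a)
    return [e["id"] for e in encounters if e["codes"] & wanted]


print(q1_text(ENCOUNTERS, "Ischaemic Heart Disease"))
print(q2_concept(ENCOUNTERS, IHD, IS_A))
```

On this toy data the text search misses the abbreviated note and the angina case, while the concept query finds both through the linkages and the hierarchy.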
From the results, we identified the main causes of Q1's inaccuracy: (1) it does not discover more specialized terms (e.g., angina, myocardial infarction); (2) doctors write informally, using short forms, abbreviations and analogous terms (e.g., N-STEMI, CCS I); and (3) misspellings. However, even with the features described for processing queries with SNOMED CT, an important question must be raised: what happens when the version of SNOMED CT in use is replaced with a newer one, for example when a concept becomes inactive in the new version and a new active concept is created in its place? Preliminary investigation suggests that the links between inactive concepts and their new active concepts should be used when pre-processing the query, so that using different versions of SNOMED CT at different points in time does not affect the overall analytics. In the near future, we will conduct experiments on a larger scale: as part of the Malaysian Health Data Warehouse project, we are currently integrating our systems to create a SNOMED CT coded data warehouse of millions of records and to perform analytics over it. To scale to this volume of data, we are also identifying the performance-critical parts of our engines that could run in parallel using GPU-accelerated computing.
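The version-change concern raised above can be sketched as a query pre-processing step. This is an illustrative assumption, not the authors' implementation: the mapping `REPLACED_BY`, the placeholder identifiers, and the function names are hypothetical, and a real system would take such links from the terminology release itself.

```python
# Hypothetical links from inactive concepts to their replacements.
# The identifiers are placeholders, not real SNOMED CT concept IDs.
REPLACED_BY = {
    "OLD-A": "MID-A",  # inactivated in one release...
    "MID-A": "NEW-A",  # ...and its replacement inactivated in a later one
    "OLD-B": "NEW-B",
}


def resolve_active(concept, replaced_by):
    """Follow replacement links until reaching a concept with no replacement."""
    visited = set()
    while concept in replaced_by and concept not in visited:
        visited.add(concept)  # guards against accidental cycles in the links
        concept = replaced_by[concept]
    return concept


def expand_across_versions(concepts, replaced_by):
    """Expand query concepts so that records coded under any version match."""
    # Map every query concept to its currently active form.
    active = {resolve_active(c, replaced_by) for c in concepts}
    # Add every inactive concept that resolves to one of those active forms.
    legacy = {
        old for old in replaced_by
        if resolve_active(old, replaced_by) in active
    }
    return active | legacy


print(sorted(expand_across_versions({"OLD-A"}, REPLACED_BY)))
print(sorted(expand_across_versions({"NEW-B"}, REPLACED_BY)))
```

Expanding in both directions means a query written against either version still retrieves records coded under the other, which is the property the abstract asks for.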
Original language: English
Number of pages: 1
Publication status: Published - 2015
Externally published: Yes
Event: SNOMED CT Expo 2015 - Montevideo, Uruguay
Duration: 29 Oct 2015 to 30 Oct 2015


