Building Embeddings for SNOMED CT, Part I

A Practical Introduction to Semantic Modeling for Medical Terminologies

This is the first part of a series on building embeddings for SNOMED CT. This installment focuses on data extraction and preparation for graph-based and text-based embeddings. The platform is Oracle Database 23ai running on Oracle Cloud Infrastructure.

Understanding Embeddings

An embedding is an n-dimensional numerical vector that represents an entity. These vectors form the basis of vector search: similarity between entities is computed by measuring distance in vector space.

By representing medical concepts as vectors, we can quantify their semantic proximity with cosine similarity or the dot product. Expressing conceptual similarity numerically is what makes embeddings essential for semantic search, clustering, and AI-assisted reasoning.
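The two measures mentioned above can be sketched in a few lines of plain Python. The three vectors are toy values invented for illustration; they are not real SNOMED CT embeddings, and the variable names are only labels.

```python
import math

def dot(u, v):
    """Dot product of two vectors of equal dimension."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    return sum(a * b for a, b in zip(u, v))

def cosine_similarity(u, v):
    """Cosine of the angle between u and v (1.0 means same direction)."""
    norm_u = math.sqrt(dot(u, u))
    norm_v = math.sqrt(dot(v, v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot(u, v) / (norm_u * norm_v)

# Toy 3-dimensional vectors, invented for this example only.
spondylitis = [0.9, 0.1, 0.3]
arthritis = [0.8, 0.2, 0.4]
fracture = [0.1, 0.9, 0.2]

print(cosine_similarity(spondylitis, arthritis))  # closer pair, higher score
print(cosine_similarity(spondylitis, fracture))   # more distant pair, lower score
```

Cosine similarity ignores vector length and compares direction only, while the dot product also grows with magnitude. For vectors normalized to unit length the two give the same ranking.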

The main challenge is ensuring embeddings capture all meaningful relationships among SNOMED CT entities, including structural, semantic, and lexical relations that define the ontology's conceptual framework.

Key Terminology

Concept: a unique clinical entity identified by its conceptId.
Description: a textual representation of a concept, for example a preferred term or a synonym.
Relationship: a logical or hierarchical connection between two concepts, for example is a, finding site, or associated morphology.

When used generally, relationship refers to any meaningful association between two entities, not only those recorded in the Relationship table.

Example: Semantic Equivalence

Consider the expressions "ankylosing spondylitis" and "Bechterev's disease". Both refer to the same clinical condition and share the same conceptId in SNOMED CT. The following SQL query retrieves descriptions that share that conceptId.

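-- d1 locates the concept through one active term;
-- d2 returns every active description of that concept.
SELECT d2.conceptId, d2.term
FROM description d1
JOIN description d2 ON d1.conceptId = d2.conceptId
WHERE UPPER(d1.term) = UPPER('ankylosing spondylitis')
  AND d1.active = 1
  AND d2.active = 1;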

Sample output might look like this:

CONCEPTID   TERM
9631008     Ankylosing spondylitis
9631008     Marie-Strumpell spondylitis
9631008     Bechterev's disease
9631008     Rheumatoid arthritis of spine

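All entries share the same conceptId, illustrating how SNOMED CT handles synonymy: each term is a separate description attached to a single concept. The same mechanism keeps translations consistent, since descriptions in other languages attach to the same conceptId.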
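The same lookup can be sketched in Python once descriptions are held in memory. The rows below are the four entries from the sample output; the active flag of 1 on each row is an assumption, and the function names are hypothetical.

```python
from collections import defaultdict

# (conceptId, term, active) rows copied from the sample output above.
# The active flag of 1 is assumed here for illustration.
descriptions = [
    (9631008, "Ankylosing spondylitis", 1),
    (9631008, "Marie-Strumpell spondylitis", 1),
    (9631008, "Bechterev's disease", 1),
    (9631008, "Rheumatoid arthritis of spine", 1),
]

def build_indexes(rows):
    """Index active descriptions by concept and by case-folded term."""
    terms_by_concept = defaultdict(list)
    concepts_by_term = defaultdict(set)
    for concept_id, term, active in rows:
        if active != 1:
            continue  # skip retired descriptions
        terms_by_concept[concept_id].append(term)
        concepts_by_term[term.casefold()].add(concept_id)
    return terms_by_concept, concepts_by_term

def synonyms(term, terms_by_concept, concepts_by_term):
    """All active terms of every concept that the given term describes."""
    found = []
    for concept_id in sorted(concepts_by_term.get(term.casefold(), ())):
        found.extend(terms_by_concept[concept_id])
    return found

def same_concept(term_a, term_b, concepts_by_term):
    """True if both terms describe at least one common concept."""
    a = concepts_by_term.get(term_a.casefold(), set())
    b = concepts_by_term.get(term_b.casefold(), set())
    return bool(a & b)

terms_by_concept, concepts_by_term = build_indexes(descriptions)
print(synonyms("ankylosing spondylitis", terms_by_concept, concepts_by_term))
print(same_concept("Ankylosing spondylitis", "Bechterev's disease", concepts_by_term))
```

Case folding plays the role of UPPER() in the SQL query, so the match does not depend on capitalization.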

Decomposing SNOMED CT for Embedding Construction

To construct effective embeddings, identify which SNOMED CT components most strongly contribute to semantic similarity. The following items form the core input data.

Description terms: from description.term, provide lexical similarity and form the basis for text-derived embeddings.
Concept hierarchies: is a relationships define taxonomic structure and semantic distance.
Attribute relationships: from the relationship table, such as finding site or morphology, add domain-specific context.
Language reference sets: enable multilingual alignment across equivalent concepts.
Reference set memberships: group concepts by clinical or functional domains.
Metadata and status attributes: moduleId and definitionStatusId reflect maturity and reliability and can influence embedding weighting.
External cross maps: mappings to ICD, LOINC, or RxNorm enrich the embedding space and enable ontology bridging.
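To make the idea of semantic distance in the hierarchy concrete, the sketch below counts the is a edges on the shortest path between two concepts with a breadth-first search. The hierarchy is a toy invented for this example and is not an extract of SNOMED CT; path length is only one of several possible distance measures.

```python
from collections import deque

# Invented toy hierarchy as (child, parent) pairs. NOT real SNOMED CT content.
IS_A_EDGES = [
    ("ankylosing spondylitis", "spondylitis"),
    ("spondylitis", "disorder of spine"),
    ("scoliosis", "disorder of spine"),
    ("disorder of spine", "disorder"),
    ("asthma", "disorder"),
]

def build_neighbors(edges):
    """Treat each is a edge as undirected so paths can go up and down."""
    neighbors = {}
    for child, parent in edges:
        neighbors.setdefault(child, set()).add(parent)
        neighbors.setdefault(parent, set()).add(child)
    return neighbors

def path_length(neighbors, start, goal):
    """Number of edges on the shortest path, or None if no path exists."""
    if start not in neighbors or goal not in neighbors:
        return None
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        for nxt in neighbors[node]:
            if nxt == goal:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None

neighbors = build_neighbors(IS_A_EDGES)
print(path_length(neighbors, "ankylosing spondylitis", "scoliosis"))
print(path_length(neighbors, "ankylosing spondylitis", "asthma"))
```

In this toy data the two spine disorders are closer to each other than either is to the unrelated concept, which is the kind of structure a graph embedding is expected to preserve.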

Stage One: A Practical Starting Point

The first implementation phase focuses on three manageable components: description terms, concept hierarchies, and attribute relationships. The International Release of SNOMED CT is used, so multilingual alignment is not required at this stage.

This pipeline runs on Oracle Cloud Infrastructure with Oracle Database 23ai SE on Oracle Linux 9. The prototype uses a single CPU core with no GPU acceleration to demonstrate feasibility under constrained hardware.

Implementation Plan

The objective is to generate embeddings for SNOMED CT concepts and store them in Oracle Database 23ai, using its native vector search features. The relational schema follows the distribution by West Coast Informatics, Inc. For OCI-adapted scripts and integration support, contact arachnet at maserna dot org.

Data Export

Three CSV files are exported using SQLcl. Files are written to the export directory in UTF-8 encoding, without headers.

concept.csv containing conceptId, definitionStatusId, moduleId, effectiveTime
description.csv containing descriptionId, conceptId, term, typeId, languageCode, moduleId, effectiveTime
relationship.csv containing relationshipId, sourceId, destinationId, typeId, characteristicTypeId, moduleId, active, effectiveTime

SQL export script for concept data

WHENEVER SQLERROR EXIT -1
SET HEADING OFF COLSEP ',' SQLFORMAT csv FEEDBACK OFF UNDERLINE OFF
SET PAGESIZE 0 LINESIZE 32767 TRIMSPOOL ON TERMOUT OFF VERIFY OFF

ALTER SESSION SET NLS_LENGTH_SEMANTICS='CHAR';
SPOOL /tmp/snomed_exports/concept.csv

SELECT id, definitionStatusId, moduleId, TO_CHAR(effectiveTime, 'YYYYMMDD')
FROM concept
WHERE active = 1
ORDER BY id;

SPOOL OFF

SQL export script for description data

WHENEVER SQLERROR EXIT -1
SET HEADING OFF COLSEP ',' SQLFORMAT csv FEEDBACK OFF UNDERLINE OFF
SET PAGESIZE 0 LINESIZE 32767 TRIMSPOOL ON TERMOUT OFF VERIFY OFF

ALTER SESSION SET NLS_LENGTH_SEMANTICS='CHAR';
SPOOL /tmp/snomed_exports/description.csv

SELECT id, conceptId, term, typeId, languageCode, moduleId, TO_CHAR(effectiveTime, 'YYYYMMDD')
FROM description
WHERE active = 1
ORDER BY conceptId, id;

SPOOL OFF

SQL export script for relationship data

WHENEVER SQLERROR EXIT -1
SET HEADING OFF COLSEP ',' SQLFORMAT csv FEEDBACK OFF UNDERLINE OFF
SET PAGESIZE 0 LINESIZE 32767 TRIMSPOOL ON TERMOUT OFF VERIFY OFF

ALTER SESSION SET NLS_LENGTH_SEMANTICS='CHAR';
SPOOL /tmp/snomed_exports/relationship.csv

SELECT id AS relationshipId, sourceId, destinationId, typeId, characteristicTypeId, moduleId, active, TO_CHAR(effectiveTime, 'YYYYMMDD')
FROM relationship
WHERE active = 1
ORDER BY sourceId, destinationId;

SPOOL OFF

Please ensure the export directory exists and is writable before running these scripts.

Export Procedure

Execute the following commands on the OCI host as a user with appropriate privileges. Adjust paths and user credentials to your environment.

sudo mkdir -p /tmp/snomed_exports
sudo chmod 777 /tmp/snomed_exports
cd arachnet/snomed/embeddings/
/home/opc/sqlcl/bin/sql <user>/<password>@ARADB

SQL> @concept_export.sql
SQL> @description_export.sql
SQL> @relationship_export.sql

After completion, verify the exported files exist:

/tmp/snomed_exports/concept.csv
/tmp/snomed_exports/description.csv
/tmp/snomed_exports/relationship.csv
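Beyond checking that the files exist, a quick structural check can catch truncated or misformatted exports. The sketch below is a minimal example using only the Python standard library; the expected column counts are taken from the SELECT lists of the three export scripts, and the function name check_exports is hypothetical.

```python
import csv
from pathlib import Path

# Column counts per file, derived from the SELECT lists in the export scripts.
EXPECTED_COLUMNS = {
    "concept.csv": 4,
    "description.csv": 7,
    "relationship.csv": 8,
}

def check_exports(export_dir, expected=EXPECTED_COLUMNS):
    """Return {filename: row_count}; raise if a file is missing or malformed."""
    counts = {}
    for name, n_cols in expected.items():
        path = Path(export_dir) / name
        if not path.is_file():
            raise FileNotFoundError(f"missing export: {path}")
        rows = 0
        with path.open(newline="", encoding="utf-8") as handle:
            for line_no, row in enumerate(csv.reader(handle), start=1):
                if not any(field.strip() for field in row):
                    continue  # tolerate blank lines left by spooling
                if len(row) != n_cols:
                    raise ValueError(
                        f"{path}:{line_no}: expected {n_cols} fields, got {len(row)}"
                    )
                rows += 1
        if rows == 0:
            raise ValueError(f"{path} contains no data rows")
        counts[name] = rows
    return counts
```

Calling check_exports("/tmp/snomed_exports") after the export returns the number of rows per file. The csv module handles quoted terms that contain commas, which a naive split on "," would miscount.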

Observed execution times in our environment were short and did not significantly load the server.

Next Steps

The next stage will be implemented in Python and will include the following steps:

Load SNOMED CT tables using Pandas or Polars.
Build the concept graph, representing both is a and attribute-based relationships.
Compute graph embeddings with scalable algorithms such as Node2Vec or DeepWalk, applying relationship weighting as needed.
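As a preview of the graph steps listed above, here is a minimal standard-library sketch of graph construction and weighted random-walk generation in the spirit of DeepWalk. It is an illustration under stated assumptions, not the pipeline the next part will use: the two edge weights are invented, the identifiers C_PARENT, C_SITE and C_OTHER are placeholders, Node2Vec's return and in-out biases are not implemented, and the skip-gram training that turns walks into vectors is omitted. The typeId 116680003 is the SNOMED CT identifier of the is a relationship and 363698007 that of finding site.

```python
import random
from collections import defaultdict

IS_A = "116680003"  # SNOMED CT typeId of the "is a" relationship

def build_graph(relationship_rows, is_a_weight=1.0, attribute_weight=0.5):
    """Build an undirected weighted adjacency map.

    Each row is (sourceId, destinationId, typeId) as strings.
    Both weights are illustrative assumptions.
    """
    graph = defaultdict(dict)
    for source, destination, type_id in relationship_rows:
        weight = is_a_weight if type_id == IS_A else attribute_weight
        # Keep the strongest weight if a pair is linked more than once.
        if weight > graph[source].get(destination, 0.0):
            graph[source][destination] = weight
            graph[destination][source] = weight
    return graph

def random_walks(graph, walks_per_node=10, walk_length=20, seed=0):
    """Generate weighted first-order random walks starting at every node."""
    rng = random.Random(seed)
    nodes = sorted(graph)
    walks = []
    for _ in range(walks_per_node):
        rng.shuffle(nodes)  # visit start nodes in a new order each pass
        for start in nodes:
            walk = [start]
            while len(walk) < walk_length:
                neighbors = graph[walk[-1]]
                if not neighbors:
                    break
                candidates = list(neighbors)
                weights = [neighbors[n] for n in candidates]
                walk.append(rng.choices(candidates, weights=weights, k=1)[0])
            walks.append(walk)
    return walks

# Placeholder rows: only 9631008 and the two typeIds are real identifiers.
example_rows = [
    ("9631008", "C_PARENT", IS_A),
    ("9631008", "C_SITE", "363698007"),
    ("C_OTHER", "C_PARENT", IS_A),
]
graph = build_graph(example_rows)
walks = random_walks(graph, walks_per_node=2, walk_length=5)
print(walks[0])
```

Each walk is a sequence of concept identifiers that can be fed to a skip-gram model as if it were a sentence. Giving is a edges a higher weight than attribute edges makes walks follow the taxonomy more often, which is one simple way to apply relationship weighting.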

These embeddings will support semantic search, AI-assisted diagnosis, and vector-based retrieval inside Oracle Database 23ai.

Downloads and Contact

Download the export scripts and related tools from:

Download export scripts and tools

For collaboration, integration support, or OCI-adapted scripts, contact arachnet at maserna dot org.