Building Embeddings for SNOMED CT, Part I
A Practical Introduction to Semantic Modeling for Medical Terminologies
This is the first part of a series on building embeddings for SNOMED CT. This installment focuses on data extraction and preparation for graph-based and text-based embeddings. The platform used is Oracle Database 23ai running on Oracle Cloud Infrastructure.
Understanding Embeddings
An embedding is an n-dimensional numerical vector that represents an entity. These vectors form the basis of vector search: the similarity between two entities is computed by measuring the distance between their vectors.
By representing medical concepts as vectors, we can quantify their semantic proximity using cosine similarity or dot product. This numerical expression of conceptual similarity makes embeddings essential for semantic search, clustering, and AI-assisted reasoning.
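To make the two similarity measures concrete, here is a minimal sketch in plain Python. The vectors are toy values invented for illustration and are not real SNOMED CT embeddings; the function names are likewise hypothetical.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    return sum(a * b for a, b in zip(u, v))

def cosine_similarity(u, v):
    """Cosine of the angle between u and v (close to 1.0 means same direction)."""
    norm_u = math.sqrt(dot(u, u))
    norm_v = math.sqrt(dot(v, v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot(u, v) / (norm_u * norm_v)

# Toy 3-dimensional vectors, invented for illustration only.
concept_a = [1.0, 2.0, 0.0]
concept_b = [2.0, 4.0, 0.0]   # same direction as concept_a, twice the length
concept_c = [0.0, 0.0, 3.0]   # orthogonal to concept_a

print(cosine_similarity(concept_a, concept_b))  # close to 1: same direction
print(cosine_similarity(concept_a, concept_c))  # 0: no shared direction
print(dot(concept_a, concept_b))                # grows with vector length
```

Cosine similarity ignores vector length, while the dot product does not. The two agree in ranking only when the vectors are normalized to unit length.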
The main challenge is ensuring embeddings capture all meaningful relationships among SNOMED CT entities, including structural, semantic, and lexical relations that define the ontology's conceptual framework.
Key Terminology
- Concept: a unique clinical entity identified by its conceptId.
- Description: a textual representation of a concept, for example a preferred term or a synonym.
- Relationship: a logical or hierarchical connection between two concepts, for example is-a, finding site, or associated morphology.
When used generally, relationship refers to any meaningful association between two entities, not only those recorded in the Relationship table.
Example: Semantic Equivalence
Consider the expressions "ankylosing spondylitis" and "Bechterev's disease". Both refer to the same clinical condition and share the same conceptId in SNOMED CT. The following SQL query looks up the concept by one of its terms and returns every active description attached to it.
SELECT d2.conceptId, d2.term
FROM description d1
JOIN description d2 ON d1.conceptId = d2.conceptId
WHERE UPPER(d1.term) = UPPER('ankylosing spondylitis')
AND d1.active = 1
AND d2.active = 1;
Sample output might look like this:
CONCEPTID TERM
9631008 Ankylosing spondylitis
9631008 Marie-Strumpell spondylitis
9631008 Bechterev's disease
9631008 Rheumatoid arthritis of spine
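All entries share the same conceptId, illustrating how SNOMED CT handles synonymy: many descriptions point to one concept. The same mechanism keeps translated editions consistent, because terms in other languages attach to the same conceptId.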
Decomposing SNOMED CT for Embedding Construction
To construct effective embeddings, identify which SNOMED CT components most strongly contribute to semantic similarity. The following items form the core input data.
- Description terms: from description.term, provide lexical similarity and form the basis for text-derived embeddings.
- Concept hierarchies: is-a relationships define taxonomic structure and semantic distance.
- Attribute relationships: from the relationship table, such as finding site or morphology, add domain-specific context.
- Language reference sets: enable multilingual alignment across equivalent concepts.
- Reference set memberships: group concepts by clinical or functional domains.
- Metadata and status attributes: moduleId and definitionStatusId reflect maturity and reliability and can influence embedding weighting.
- External cross maps: mappings to ICD, LOINC, or RxNorm enrich the embedding space and enable ontology bridging.
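As an illustration of how description terms could be prepared for a text-based embedding model, the sketch below collapses the descriptions of each concept into a single string. The function name, the "; " separator, and the choice to put the fully specified name first are assumptions made for this example, not the method used later in the series. The two typeId constants are the SNOMED CT identifiers for fully specified names and synonyms; the sample row for the fully specified name and the typeId assigned to each sample row are assumed for illustration.

```python
from collections import defaultdict

# Description type identifiers from the SNOMED CT metadata hierarchy.
FSN_TYPE_ID = "900000000000003001"      # Fully specified name
SYNONYM_TYPE_ID = "900000000000013009"  # Synonym

def build_concept_texts(description_rows):
    """Map conceptId -> one text string, fully specified name first.

    description_rows: iterable of (conceptId, term, typeId) tuples.
    Rows with any other typeId (for example text definitions) are ignored.
    """
    fsn = {}
    synonyms = defaultdict(list)
    for concept_id, term, type_id in description_rows:
        if type_id == FSN_TYPE_ID:
            fsn.setdefault(concept_id, term)
        elif type_id == SYNONYM_TYPE_ID and term not in synonyms[concept_id]:
            synonyms[concept_id].append(term)
    texts = {}
    for concept_id in sorted(set(fsn) | set(synonyms)):
        parts = ([fsn[concept_id]] if concept_id in fsn else []) + synonyms[concept_id]
        texts[concept_id] = "; ".join(parts)
    return texts

# Sample rows; the typeId values here are assumed for illustration.
rows = [
    ("9631008", "Ankylosing spondylitis (disorder)", FSN_TYPE_ID),
    ("9631008", "Ankylosing spondylitis", SYNONYM_TYPE_ID),
    ("9631008", "Marie-Strumpell spondylitis", SYNONYM_TYPE_ID),
    ("9631008", "Bechterev's disease", SYNONYM_TYPE_ID),
]
texts = build_concept_texts(rows)
print(texts["9631008"])
```

Concatenation is only one option. Embedding each term separately and pooling the vectors per concept is an equally plausible design, and which works better would need to be measured.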
Stage One: Practical Starting Point
The first implementation phase focuses on three manageable components: description terms, concept hierarchies, and attribute relationships. The International Release of SNOMED CT is used, so multilingual alignment is not required at this stage.
This pipeline runs on Oracle Cloud Infrastructure with Oracle Database 23ai SE on Oracle Linux 9. The prototype uses a single CPU core with no GPU acceleration to demonstrate feasibility under constrained hardware.
Implementation Plan
The objective is to generate embeddings for SNOMED CT concepts and store them within Oracle Database 23ai, using its native vector search features. The relational schema follows the distribution by West Coast Informatics, Inc. For OCI-adapted scripts and integration support, contact arachnet at maserna dot org.
Data Export
Three CSV files are exported using SQLcl. Files are written to the export directory /tmp/snomed_exports in UTF-8 encoding, without headers. Only active rows are exported.
- concept.csv containing conceptId, definitionStatusId, moduleId, effectiveTime
- description.csv containing descriptionId, conceptId, term, typeId, languageCode, moduleId, effectiveTime
- relationship.csv containing relationshipId, sourceId, destinationId, typeId, characteristicTypeId, moduleId, active, effectiveTime
SQL export script for concept data
WHENEVER SQLERROR EXIT -1
SET HEADING OFF COLSEP ',' SQLFORMAT csv FEEDBACK OFF UNDERLINE OFF
SET PAGESIZE 0 LINESIZE 32767 TRIMSPOOL ON TERMOUT OFF VERIFY OFF
ALTER SESSION SET NLS_LENGTH_SEMANTICS='CHAR';
SPOOL /tmp/snomed_exports/concept.csv
SELECT id, definitionStatusId, moduleId, TO_CHAR(effectiveTime, 'YYYYMMDD')
FROM concept
WHERE active = 1
ORDER BY id;
SPOOL OFF
SQL export script for description data
WHENEVER SQLERROR EXIT -1
SET HEADING OFF COLSEP ',' SQLFORMAT csv FEEDBACK OFF UNDERLINE OFF
SET PAGESIZE 0 LINESIZE 32767 TRIMSPOOL ON TERMOUT OFF VERIFY OFF
ALTER SESSION SET NLS_LENGTH_SEMANTICS='CHAR';
SPOOL /tmp/snomed_exports/description.csv
SELECT id, conceptId, term, typeId, languageCode, moduleId, TO_CHAR(effectiveTime, 'YYYYMMDD')
FROM description
WHERE active = 1
ORDER BY conceptId, id;
SPOOL OFF
SQL export script for relationship data
WHENEVER SQLERROR EXIT -1
SET HEADING OFF COLSEP ',' SQLFORMAT csv FEEDBACK OFF UNDERLINE OFF
SET PAGESIZE 0 LINESIZE 32767 TRIMSPOOL ON TERMOUT OFF VERIFY OFF
ALTER SESSION SET NLS_LENGTH_SEMANTICS='CHAR';
SPOOL /tmp/snomed_exports/relationship.csv
SELECT id AS relationshipId, sourceId, destinationId, typeId, characteristicTypeId, moduleId, active, TO_CHAR(effectiveTime, 'YYYYMMDD')
FROM relationship
WHERE active = 1
ORDER BY sourceId, destinationId;
SPOOL OFF
Please ensure the export directory exists and is writable before running these scripts.
Export Procedure
Execute the following commands on the OCI host as a user with appropriate privileges. Adjust paths and user credentials to your environment.
sudo mkdir -p /tmp/snomed_exports
sudo chmod 777 /tmp/snomed_exports
cd arachnet/snomed/embeddings/
/home/opc/sqlcl/bin/sql <user>/<password>@ARADB
SQL> @concept_export.sql
SQL> @description_export.sql
SQL> @relationship_export.sql
After completion, verify the exported files exist:
/tmp/snomed_exports/concept.csv
/tmp/snomed_exports/description.csv
/tmp/snomed_exports/relationship.csv
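The verification can also be scripted. The following is a minimal sketch, assuming the three files are plain comma-separated text in UTF-8; the function name check_exports is hypothetical, and the expected column counts are taken from the SELECT lists in the export scripts above (4, 7, and 8 columns).

```python
import csv
from pathlib import Path

# Expected column counts, taken from the SELECT lists in the export scripts.
EXPECTED_COLUMNS = {
    "concept.csv": 4,
    "description.csv": 7,
    "relationship.csv": 8,
}

def check_exports(export_dir):
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    for name, expected in EXPECTED_COLUMNS.items():
        path = Path(export_dir) / name
        if not path.is_file():
            problems.append(f"{name}: missing")
            continue
        with path.open(newline="", encoding="utf-8") as handle:
            # Take the first non-blank row; csv.reader handles quoted commas.
            first_row = next((row for row in csv.reader(handle) if row), None)
        if first_row is None:
            problems.append(f"{name}: empty")
        elif len(first_row) != expected:
            problems.append(
                f"{name}: expected {expected} columns, found {len(first_row)}"
            )
    return problems

if __name__ == "__main__":
    for problem in check_exports("/tmp/snomed_exports"):
        print(problem)
```

Only the first data row of each file is inspected, so this is a quick sanity check rather than a full validation of the export.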
Observed execution times in our environment were short and did not significantly load the server.
Next Steps
The next stage will be implemented in Python and will include the following steps:
- Load SNOMED CT tables using Pandas or Polars.
- Build the concept graph, representing both is-a and attribute-based relationships.
- Compute graph embeddings with scalable algorithms such as Node2Vec or DeepWalk, applying relationship weighting as needed.
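As a preview of the graph step, the sketch below builds a weighted, undirected concept graph from relationship rows and generates DeepWalk-style random walks over it. It is an illustration under stated assumptions, not the implementation planned for the next installment: the weights, the undirected treatment of edges, the function names, and the placeholder concept identifiers (C0, C1, C2, S1) are all hypothetical. The two typeId constants are the SNOMED CT identifiers for "Is a" and "Finding site". Node2Vec would additionally bias each step with its return and in-out parameters, which this sketch does not do.

```python
import random
from collections import defaultdict

IS_A = "116680003"          # SNOMED CT "Is a" relationship type
FINDING_SITE = "363698007"  # SNOMED CT "Finding site" attribute

# Hypothetical weights: hierarchy edges are followed more often than attribute edges.
TYPE_WEIGHTS = {IS_A: 1.0, FINDING_SITE: 0.5}
DEFAULT_WEIGHT = 0.25

def build_graph(relationship_rows):
    """Build an undirected weighted adjacency map from (sourceId, destinationId, typeId) rows."""
    graph = defaultdict(dict)
    for source, destination, type_id in relationship_rows:
        weight = TYPE_WEIGHTS.get(type_id, DEFAULT_WEIGHT)
        # Keep the largest weight if two concepts are linked more than once.
        weight = max(weight, graph[source].get(destination, 0.0))
        graph[source][destination] = weight
        graph[destination][source] = weight
    return graph

def random_walk(graph, start, length, rng):
    """One weighted random walk of at most `length` nodes starting at `start`."""
    walk = [start]
    while len(walk) < length:
        neighbours = graph.get(walk[-1])
        if not neighbours:
            break
        nodes = list(neighbours)
        weights = [neighbours[n] for n in nodes]
        walk.append(rng.choices(nodes, weights=weights, k=1)[0])
    return walk

def generate_walks(graph, walks_per_node, length, seed=0):
    """Start `walks_per_node` walks from every node, in shuffled order."""
    rng = random.Random(seed)
    walks = []
    for _ in range(walks_per_node):
        nodes = sorted(graph)
        rng.shuffle(nodes)
        walks.extend(random_walk(graph, node, length, rng) for node in nodes)
    return walks

# Placeholder identifiers, not real SNOMED CT concepts.
rows = [
    ("C1", "C0", IS_A),
    ("C2", "C0", IS_A),
    ("C1", "S1", FINDING_SITE),
]
graph = build_graph(rows)
walks = generate_walks(graph, walks_per_node=2, length=4)
print(len(walks), walks[0])
```

In DeepWalk, walks like these are treated as sentences and passed to a skip-gram model, which learns one vector per node. On the full relationship export the adjacency map would hold every active relationship, so memory use should be checked on the single-core prototype host.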
These embeddings will support semantic search, AI-assisted diagnosis, and vector-based retrieval inside Oracle Database 23ai.
Downloads and Contact
Download the export scripts and related tools from:
Download export scripts and tools
For collaboration, integration support, or OCI-adapted scripts, contact arachnet at maserna dot org.