General Structure and Use of Vocabularies

OMOP Implementation Specification

Standard Vocabularies in Observational Data Analysis (October 2013)

3. General Structure and Use of Vocabularies
OMOP defines Standard Vocabularies as terminologies and classification systems. As a general rule, vocabularies are imported from existing national or international standards and are created by OMOP only if no suitable standard is available. Each domain consists of one or more vocabularies, organized as shown in Figure 1.

Figure 1: General Organization of Vocabulary Domains

Each vocabulary belongs only to one domain, with the exception of SNOMED-CT, which contains terms spanning a variety of domains. Vocabularies can be of two types:

  • Above the horizontal line (Figure 1) are Standard Vocabularies, which are used within OMOP to define concepts. These concepts represent the meaning of the observational data in the Common Data Model (CDM). Standard Vocabularies can be terminologies (controlled lists of concepts), classifications (higher-level concepts), or both, in which case no distinction is made between low-level concepts and higher classes. For example, drug products form terminologies (lists), whereas pharmacological classes or indications are hierarchical concepts. Some vocabularies are strictly organized into fixed hierarchical layers (or levels), while others, such as SNOMED-CT, organize their content such that each concept can be related to any other concept without fixed hierarchical levels.
  • Below the line (figure 1) are Source Vocabularies. These vocabularies contain codes of the same domain as the Standard Vocabularies, but are not used as concepts and are therefore not used for representing the data in the CDM. Source codes are used in the source data and need to be
    translated to the concepts of the Standard Vocabularies during the transformation process to the CDM. For example, NDCs are used as codes in many data sources and are mapped to the concepts of the Standard Drug Vocabulary based on RxNorm (see below).

In the CDM, the various entities are stored as follows (also see "CDM Version 4.0 Specification document"):
  • Concepts are stored in the CONCEPT table. Data tables in the CDM contain fields whose names end in "concept_id". These are the key content fields for each record. For convenience, many tables also contain the original source codes (ending in "source_code") that were translated to the concepts, but
    these are not intended to be used in standardized analytics across networks of observational databases.
    Concepts have levels and classes. Concept_level 0 indicates a concept that is not part of the Standard Vocabulary (but loaded for convenience as it is part of the original vocabulary imported from the vendor organization). Concept_level 1 is the lowest level in the hierarchy (leaf concepts). All higher concept_levels are assigned differently in the various vocabulary domains. Concept classes are also domain-specific and organize concepts within a domain.
    All concepts have a time period in which they are defined as valid. The default valid_start_date is 1-Jan-1980, and the default valid_end_date is 31-Dec-2099. If a concept is no longer used by the original standard, the valid_end_date is set to the release date after it became obsolete. Invalid concepts might still be used in older data. If the vocabulary vendor provides replacement terms for obsolete terms, concept relationships between the obsolete and the replacement concept indicate this fact.
  • Direct relationships between concepts are stored in the CONCEPT_RELATIONSHIP table. This table contains the ID of concept 1, concept 2 and the relationship_id defining the nature of the relationship and referring to a record in the RELATIONSHIP table (see below). All relationships exist bi-directionally with concept_id_1 and concept_id_2 reversed. For example, a higher-level concept A might have a hierarchical relationship to the lower-level concept B. In this case, two records exist in the CONCEPT_RELATIONSHIP table: (i) concept A in concept_id_1, concept B in concept_id_2 and
    relationship_id 10 ("Subsumes") and (ii) concept A in concept_id_2, concept B in concept_id_1 and relationship_id 144 ("Is a").
    Relationships also have periods in which they are valid between valid_start_date and valid_end_date. Generally, relationships do not automatically become obsolete when one or both participating concepts are obsolete.
  • Relationships between concepts for which a chain of concepts and direct hierarchical relationships can be traced are stored in the CONCEPT_ANCESTOR table. This is used to link higher-level concepts (classes) to lower level concepts. For example, a CONCEPT_ANCESTOR record might link a drug class concept to all its drug product concepts, irrespective of how many intermediate concepts convey this relationship. Not all relationship types can participate in these chains of concepts and relationships. The field "defines_ancestry" in the relationship table indicates which relationships are used for ancestry construction.
    Note that all concepts are also ancestors of themselves (records exist in the table with identical ancestor_concept_id and descendant_concept_id) if there is at least one non-self ancestry record for the concept. This is done for the convenience of using this table in queries that are collecting the
    entire semantic space of a concept.
  • Maps between Source and Standard Vocabularies are stored in the SOURCE_TO_CONCEPT_MAP table. This table is used to provide for each source code of a Source Vocabulary a single Target Concept of a Standard Vocabulary (one-to-one mapping). However, a number of restrictions must be observed to obtain an unambiguous map:
    • One source code might have a valid target concept in different domains. For example, a
      procedure code that represents the administration of a drug or vaccine might have a target concept in the Procedure domain and Drug domain. To distinguish the two, the field mapping_type contains the target domain.
    • Mapping records, like concepts, have a validity period. Maps should be selected for the period in which the code was used in the source data. For example, an NDC code that was actively used on 1-Jan-2005 should be translated using a SOURCE_TO_CONCEPT_MAP record valid at that time.
    • For some source codes, multiple mappings exist in parallel, which are all valid. This is true for only a few domains (such as Condition). The field primary_map indicates which of the alternatives should be used for data transformation. The other maps are used for special analysis cases where alternative maps are explicitly needed.
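
The three restrictions above can be combined into a single lookup. The following is a minimal sketch using Python's sqlite3 module, not the official ETL logic. The table is reduced to the columns needed here; the column names source_code, source_vocabulary_id, valid_start_date and valid_end_date, the 'Y'/'N' encoding of primary_map, the mapping_type values, and all codes and target concept IDs in the sample rows are assumptions made for illustration.

```python
import sqlite3


def find_target_concept(conn, source_code, source_vocabulary_id, mapping_type, used_on):
    """Return the primary target concept_id valid on used_on, or None if no map matches."""
    row = conn.execute(
        """
        SELECT target_concept_id
        FROM source_to_concept_map
        WHERE source_code = ?
          AND source_vocabulary_id = ?
          AND mapping_type = ?          -- restriction 1: target domain
          AND ? BETWEEN valid_start_date AND valid_end_date  -- restriction 2: validity
          AND primary_map = 'Y'         -- restriction 3: primary map (assumed 'Y'/'N')
        """,
        (source_code, source_vocabulary_id, mapping_type, used_on),
    ).fetchone()
    return row[0] if row else None


conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE source_to_concept_map (
           source_code TEXT, source_vocabulary_id INTEGER, mapping_type TEXT,
           primary_map TEXT, target_concept_id INTEGER,
           valid_start_date TEXT, valid_end_date TEXT)"""
)
# Made-up sample rows; dates are ISO strings so that text comparison orders them correctly.
conn.executemany(
    "INSERT INTO source_to_concept_map VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        ("X100", 5, "PROCEDURE", "Y", 1001, "1980-01-01", "2099-12-31"),
        ("X100", 5, "DRUG", "Y", 2001, "1980-01-01", "2004-06-30"),
        ("X100", 5, "DRUG", "Y", 2002, "2004-07-01", "2099-12-31"),
        ("C200", 2, "CONDITION", "Y", 3001, "1980-01-01", "2099-12-31"),
        ("C200", 2, "CONDITION", "N", 3002, "1980-01-01", "2099-12-31"),
    ],
)

drug_concept = find_target_concept(conn, "X100", 5, "DRUG", "2005-01-01")
procedure_concept = find_target_concept(conn, "X100", 5, "PROCEDURE", "2005-01-01")
```

In this sketch the same hypothetical source code resolves to a different concept depending on the requested domain and on the date the code was used, and the non-primary alternative for C200 is never returned.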

Not all source codes have a mapping to a concept in the Standard Vocabulary. This is usually the case if the Standard Vocabulary truly lacks a concept or if the map hasn't been established. This situation is handled in the SOURCE_TO_CONCEPT_MAP table in two different ways:
  1. No record exists for this source code (in the context of the mapping type). This is typically the case when only a small subset of the source codes have a mapping, as with Procedure Drugs (see below).
  2. A record exists for this source code with target_concept_id=0, target_vocabulary_id=0 and mapping_type="" (null). Such records were established for every source code of a Source Vocabulary that is available as a comprehensive list.
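
Both situations can be treated uniformly during data transformation. The sketch below assumes that the ETL process records concept 0 for any source code without a usable map, which the text does not state explicitly. The reduced table layout, the column names source_code and source_vocabulary_id, the helper names, and the sample codes and IDs are illustrative assumptions. The lookup behaves the same whether the empty mapping_type is stored as NULL or as an empty string.

```python
import sqlite3

NO_MATCHING_CONCEPT = 0  # target_concept_id the text uses for unmapped source codes


def resolve_concept_id(conn, source_code, source_vocabulary_id, mapping_type):
    """Return the mapped concept_id, or 0 when the code has no map for this mapping_type."""
    row = conn.execute(
        "SELECT target_concept_id FROM source_to_concept_map "
        "WHERE source_code = ? AND source_vocabulary_id = ? AND mapping_type = ?",
        (source_code, source_vocabulary_id, mapping_type),
    ).fetchone()
    # Case 1 (no record) and case 2 (record with an empty mapping_type) both end up here.
    return row[0] if row else NO_MATCHING_CONCEPT


def has_explicit_unmapped_record(conn, source_code, source_vocabulary_id):
    """True only for case 2: the code is listed but explicitly mapped to concept 0."""
    row = conn.execute(
        "SELECT 1 FROM source_to_concept_map "
        "WHERE source_code = ? AND source_vocabulary_id = ? AND target_concept_id = 0 "
        "LIMIT 1",
        (source_code, source_vocabulary_id),
    ).fetchone()
    return row is not None


conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE source_to_concept_map (
           source_code TEXT, source_vocabulary_id INTEGER, mapping_type TEXT,
           target_concept_id INTEGER, target_vocabulary_id INTEGER)"""
)
# Made-up sample rows: "A1" is mapped, "B2" is listed but unmapped, "Z9" has no record.
conn.executemany(
    "INSERT INTO source_to_concept_map VALUES (?, ?, ?, ?, ?)",
    [("A1", 9, "DRUG", 5001, 8), ("B2", 9, None, 0, 0)],
)
```

Keeping the second helper separate lets an ETL report distinguish codes that are known to lack a target concept from codes that were never part of the loaded source vocabulary.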

  • Vocabulary names are stored in the VOCABULARY table. It contains all the vocabularies discussed in this document.
  • Relationship types are stored in the RELATIONSHIP table (see above). Relationship_ids are backward compatible with previous versions, but relationship_names have been updated and now contain the principal source of a relationship in parentheses.
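
To make the link between CONCEPT_RELATIONSHIP and CONCEPT_ANCESTOR described above more concrete, the following sketch stores each hierarchical link in both directions and then derives ancestor/descendant pairs by following chains of "Subsumes" records. It is an illustration of the idea, not the procedure OMOP uses to build the table. The relationship_ids 10 and 144 come from the text; treating relationship 10 as the only one with defines_ancestry set, the function names, and the small concept IDs are assumptions.

```python
from collections import defaultdict

SUBSUMES, IS_A = 10, 144  # relationship_ids quoted in the text


def with_reverse(parent_child_pairs):
    """Expand (parent, child) pairs into the two records stored per relationship."""
    rows = []
    for parent, child in parent_child_pairs:
        rows.append((parent, child, SUBSUMES))  # concept_id_1 subsumes concept_id_2
        rows.append((child, parent, IS_A))      # reversed record
    return rows


def build_ancestry(relationship_rows, ancestry_relationship_ids=(SUBSUMES,)):
    """Return the set of (ancestor_concept_id, descendant_concept_id) pairs."""
    children = defaultdict(set)
    for concept_id_1, concept_id_2, relationship_id in relationship_rows:
        if relationship_id in ancestry_relationship_ids:
            children[concept_id_1].add(concept_id_2)

    ancestry = set()
    for ancestor in list(children):
        stack = list(children[ancestor])
        seen = set()
        while stack:  # depth-first walk over all chains below this ancestor
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            ancestry.add((ancestor, node))
            stack.extend(children.get(node, ()))

    # Every concept with at least one non-self ancestry record is its own ancestor.
    involved = {concept for pair in ancestry for concept in pair}
    ancestry |= {(concept, concept) for concept in involved}
    return ancestry


# Hypothetical chain: class 1 -> ingredient 2 -> product 3.
rows = with_reverse([(1, 2), (2, 3)])
ancestry = build_ancestry(rows)
```

Here the class concept 1 is linked directly to the product concept 3 even though the only stored relationships are 1 to 2 and 2 to 3, which is the convenience the CONCEPT_ANCESTOR table provides.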


3.1. Using the Vocabulary
The Standard Vocabulary was compiled from a large array of source vocabularies with different structures, design principles and formats, with the intention that it can be used in a uniform way. In other words, it should be possible to carry out data manipulation and information retrieval of observational data in a standard fashion, irrespective of the domain and the origin of the vocabularies used. Standardized concepts make it possible to interpret the records in observational data and to search for such records by description (concept_name), vocabulary used (vocabulary_id), class (concept_class) and hierarchical level (concept_level). For example, observational data can be queried for exposure to a drug with a certain description, say, Acetaminophen 500 mg tablets, no matter what the original data called this product, by looking up concept_id 19020053 "Acetaminophen 500 MG Oral Tablet" and then querying for that concept_id in the data. The concept_class, concept_level and vocabulary_id fields allow these queries to be specified further.
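
The acetaminophen lookup can be sketched as a join between the CONCEPT table and a data table. This is a simplified illustration: both tables are reduced to a few columns, the person_id values and the second concept row are made up, and the function name is hypothetical. Only concept_id 19020053 with its name and the RxNorm vocabulary_id 8 are taken from this document.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE concept (
        concept_id INTEGER PRIMARY KEY, concept_name TEXT,
        vocabulary_id INTEGER, concept_level INTEGER);
    CREATE TABLE drug_exposure (person_id INTEGER, drug_concept_id INTEGER);

    INSERT INTO concept VALUES (19020053, 'Acetaminophen 500 MG Oral Tablet', 8, 1);
    INSERT INTO concept VALUES (99999901, 'Hypothetical Other Drug', 8, 1);  -- made up
    INSERT INTO drug_exposure VALUES (1, 19020053), (2, 99999901), (3, 19020053);
    """
)


def persons_exposed_to(conn, concept_name, vocabulary_id):
    """Return sorted person_ids with an exposure to the concept of the given name."""
    rows = conn.execute(
        """
        SELECT DISTINCT de.person_id
        FROM drug_exposure de
        JOIN concept c ON c.concept_id = de.drug_concept_id
        WHERE c.concept_name = ? AND c.vocabulary_id = ?
        ORDER BY de.person_id
        """,
        (concept_name, vocabulary_id),
    ).fetchall()
    return [row[0] for row in rows]


exposed = persons_exposed_to(conn, "Acetaminophen 500 MG Oral Tablet", 8)
```

Because the data table holds only the standard concept_id, the query does not need to know which source code each database originally used for the product.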

The availability of standardized classes and a CONCEPT_ANCESTOR table for multi-level hierarchical relationships allows for aggregate queries. For example, after identifying the VA Class concept for ACE inhibitors as 4279041 "ACE INHIBITORS", one could retrieve all drug products in that class by looking up in the CONCEPT_ANCESTOR table the concepts that have a concept_level of 1 (drug product) and are descendants of that class concept_id 4279041. The same query for concept_level 2 would yield the ingredients that are marketed to inhibit the angiotensin-converting enzyme (ACE). Likewise, each drug can be interrogated for its membership in various classes, such as indications, contraindications or mechanisms of action.
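
A class-level query of this kind might look as follows. The sketch assumes reduced CONCEPT and CONCEPT_ANCESTOR tables; the class concept 4279041 is taken from the text, while the descendant concept IDs, the concept_level assigned to the class, and the function name are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE concept (
        concept_id INTEGER PRIMARY KEY, concept_name TEXT,
        vocabulary_id INTEGER, concept_level INTEGER);
    CREATE TABLE concept_ancestor (
        ancestor_concept_id INTEGER, descendant_concept_id INTEGER);

    -- Only 4279041 is quoted in the text; the other IDs and the class level are made up.
    INSERT INTO concept VALUES (4279041, 'ACE INHIBITORS', 32, 3);
    INSERT INTO concept VALUES (900001, 'Lisinopril', 8, 2);
    INSERT INTO concept VALUES (900002, 'Lisinopril 10 MG Oral Tablet', 8, 1);
    INSERT INTO concept VALUES (900003, 'Lisinopril 20 MG Oral Tablet', 8, 1);
    INSERT INTO concept VALUES (19020053, 'Acetaminophen 500 MG Oral Tablet', 8, 1);

    INSERT INTO concept_ancestor VALUES
        (4279041, 4279041), (4279041, 900001), (4279041, 900002), (4279041, 900003),
        (900001, 900001), (900001, 900002), (900001, 900003),
        (900002, 900002), (900003, 900003);
    """
)


def descendants_at_level(conn, ancestor_concept_id, concept_level):
    """Return (concept_id, concept_name) of descendants at the given concept_level."""
    return conn.execute(
        """
        SELECT c.concept_id, c.concept_name
        FROM concept_ancestor ca
        JOIN concept c ON c.concept_id = ca.descendant_concept_id
        WHERE ca.ancestor_concept_id = ? AND c.concept_level = ?
        ORDER BY c.concept_id
        """,
        (ancestor_concept_id, concept_level),
    ).fetchall()


products = descendants_at_level(conn, 4279041, 1)     # drug products in the class
ingredients = descendants_at_level(conn, 4279041, 2)  # ingredients in the class
```

Changing only the concept_level argument switches the result from drug products to ingredients, without any knowledge of the intermediate concepts that connect them to the class.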

To enable these queries, all source data have to be converted to the concepts of the Standard Vocabularies (only these are referenced in the CONCEPT_ANCESTOR and CONCEPT_RELATIONSHIP tables) during the Extract, Transform and Load (ETL) process that brings the observational data into the CDM. This is done by lookup in the SOURCE_TO_CONCEPT_MAP table, which provides the equivalent concept for each mapped source code. In cases where more than one mapping is available, one mapping is designated as the primary mapping in the primary_map field.


3.2. Domain Concepts
Domain concepts are those that represent the semantic content of the data. With the exception of the Visit and Cohort concepts they are all based on external vocabularies obtained from specialized organizations. As described above, these vocabularies can be terminologies, classifications or both, and they can be used as Standard Concepts or Source Codes. Special analysis concepts are used to define cohorts. For a detailed list of vocabularies, classes and counts in the CONCEPT table see Appendix A.
Table 9: Standard Vocabularies and Vocabulary Domains for OMOP CDM

| Vocabulary Domain | Vocabulary Name | Vocabulary Type | Vocabulary ID | Used As | Used in CDM Table |
|---|---|---|---|---|---|
| Any | No matching concept | Any | 0 | | Any |
| Condition, Observation, Procedure | SNOMED-CT | Terminology, Classification | 1 | Standard | CONDITION_OCCURRENCE, CONDITION_ERA, PROCEDURE_OCCURRENCE, OBSERVATION |
| Drug | NDF-RT | Classification | 7 | Standard | |
| Drug | RxNorm | Terminology, Classification | 8 | Standard | DRUG_EXPOSURE, DRUG_ERA |
| Drug | NDC | Terminology | 9 | Source | SOURCE_TO_CONCEPT_MAP |
| Drug | GPI | Terminology | 10 | Source | SOURCE_TO_CONCEPT_MAP |
| Drug | Multum | Terminology | 16 | Source | SOURCE_TO_CONCEPT_MAP |
| Drug | FDB Indication/Contraindication | Terminology | 19 | Standard | |
| Drug | FDB ETC | Classification | 20 | Standard | |
| Drug | WHO ATC | Classification | 21 | Standard | |
| Drug | Multilex | Terminology | 22 | Source | SOURCE_TO_CONCEPT_MAP |
| Drug | VA Product | Terminology | 28 | Source | SOURCE_TO_CONCEPT_MAP |
| Drug | VA Class | Classification | 32 | Standard | |
| Drug | NLM MeSH | Terminology | 46 | Source | SOURCE_TO_CONCEPT_MAP |
| Drug | FDA SPL | Terminology | 50 | Source | SOURCE_TO_CONCEPT_MAP |
| Drug | FDB Genseqno | Terminology | 53 | Source | SOURCE_TO_CONCEPT_MAP |
| Condition | ICD-9-CM | Terminology | 2 | Source | SOURCE_TO_CONCEPT_MAP |
| Condition | MedDRA | Terminology, Classification | 15 | | |
| Condition | Read | Terminology | 17 | Source | SOURCE_TO_CONCEPT_MAP |
| Condition | OXMIS | Terminology | 18 | Source | SOURCE_TO_CONCEPT_MAP |
| Condition | ICD-10-CM | Terminology, Classification | 34 | Source | CONDITION_OCCURRENCE, CONDITION_ERA |
| Procedure | ICD-9-Procedure | Terminology | 3 | Standard | PROCEDURE_OCCURRENCE |
| Procedure | CPT-4 | Terminology | 4 | Standard | PROCEDURE_OCCURRENCE |
| Procedure | HCPCS | Terminology | 5 | Standard | PROCEDURE_OCCURRENCE |
| Procedure | ICD-10-PCS | Terminology | 35 | Source | PROCEDURE_OCCURRENCE |
| Provider | NUCC | Terminology | 47 | Standard | PROVIDER |
| Provider | CMS Specialty | Terminology | 48 | Standard | PROVIDER |
| Demographic | HL7 Administrative Sex | Terminology | 12 | Standard | PERSON |
| Demographic | CDC Race | Terminology | 13 | Standard | PERSON |
| Demographic | Ethnicity | Terminology | 44 | Standard | PERSON |
| Observation | LOINC | Terminology | 6 | Standard | OBSERVATION |
| Observation | UCUM | Terminology | 11 | Standard | OBSERVATION |
| Observation | LOINC Multidimensional Classification | Classification | 49 | Standard | OBSERVATION |
| Visit | CMS Place of Service | Terminology | 14 | Standard | VISIT_OCCURRENCE |
| Visit | OMOP Visit | Terminology | 24 | Standard | VISIT_OCCURRENCE |
| Cohort | SMQ | Terminology, Classification | 31 | Analysis | COHORT |
| Cohort | Cohort | Terminology, Classification | 33 | Analysis | COHORT |
| Cost | DRG | Terminology | 40 | Standard | PROCEDURE_COST |
| Cost | MDC | Classification | 41 | Standard | PROCEDURE_COST |
| Cost | APC | Terminology | 42 | Standard | PROCEDURE_COST |
| Cost | Revenue Code | Terminology | 43 | Standard | PROCEDURE_COST |

* Some vocabularies are also available as concepts though they are not used as Standard. In these cases, the concept_level is set to 0, indicating their non-standard nature.
** VA product, HCPCS, LOINC and ICD-9-Procedure are concepts but also represented as source codes in the source_to_concept_map.


3.3. Type Concepts
Type Concepts are special concepts defined by OMOP. These are metadata concepts about the origin of the data in the data source. They are only used in fields ending in "type_concept_id". Type concepts generally have no relationships or ancestor relationships.
Table 10: Type Concepts

| Vocabulary Domain | Vocabulary Name | Vocabulary ID | Used in CDM Table |
|---|---|---|---|
| Drug | OMOP Drug Exposure Type | 36 | DRUG_EXPOSURE, DRUG_ERA |
| Condition | OMOP Condition Occurrence Type | 37 | CONDITION_OCCURRENCE, CONDITION_ERA |
| Procedure | OMOP Procedure Occurrence Type | 38 | PROCEDURE_OCCURRENCE |
| Observation | OMOP Observation Type | 39 | OBSERVATION |
| Death | OMOP Death Type | 45 | DEATH |