Simulated Observational Data - OSIM2 Available for CDM V2 only

This web page presents the Observational Medical Dataset Simulator (OSIM) Version 2
(updated April 11, 2012).

The initial Observational Medical Dataset Simulator was released in 2009 and used to generate datasets with millions of hypothetical patients with drug exposures, background conditions, and known adverse events for the purpose of benchmarking method performance. OSIM has provided large-scale datasets to methodologists and facilitated the establishment of the OMOP Cup Competition. It also advanced the OMOP Research Team's insights into the complex interdependencies between clinical observations in real data, and into how those relationships may influence a method's ability to identify true associations and distinguish them from false-positive findings.

Based on these insights, continued research has resulted in the development of a second-generation simulated dataset procedure, known as OSIM2. OSIM2 represents an alternative design that accommodates additional complexities observed in real-world data, including advanced modeling of the correlations between drugs and conditions. OSIM2 allows for more direct comparisons between simulated data and real observational databases, and should enable more thorough evaluation of methods by allowing assessment of how they accommodate these complex interrelationships. At OMOP, OSIM2 is used to benchmark the performance of methods that estimate the strength of association between drug treatment and outcome.

OSIM2 source code, documentation, and databases are available for download:

  • OSIM2 Introduction (without audio)
  • OSIM2 Introduction (with audio narration)
  • OSIM2 Architecture and Execution
  • OSIM2 Source Code and Documentation
  • OSIM2 validation dashboard procedures
  • Download of OSIM2 Datasets
    We have generated 16 OSIM2 datasets that are now available for download. Each is a 10-million-person dataset modeled after the Thomson Reuters MarketScan® Lab Database (MSLR). One dataset has no signals injected; the other 15 have injected signals of different sizes and types (relative risk: 1.25, 1.5, 2, 4, 10; risk type: acute onset (equals 'any exposure' events occurring within 30 days of exposure start), insidious, and accumulative). MSLR, covering 2003 – 2009, represents a privately insured population, with administrative claims from inpatient, outpatient, and pharmacy services supplemented by laboratory results.

    The datasets listed below are freely available for download through OMOP’s anonymous FTP server. For example, you can download OSIM2_10M_MSLR_MEDDRA_6, which has a set of signals injected at RR=1.50 with insidious onset (during exposure or 30 days afterwards).

    OSIM2 Dataset               Injected Relative Risk   Risk Type      Size
    OSIM2_10M_MSLR_MEDDRA_0     None                     None           3.5GB
    OSIM2_10M_MSLR_MEDDRA_3     1.25                     Insidious      3.5GB
    OSIM2_10M_MSLR_MEDDRA_6     1.5                      Insidious      3.5GB
    OSIM2_10M_MSLR_MEDDRA_9     2                        Insidious      3.5GB
    OSIM2_10M_MSLR_MEDDRA_12    4                        Insidious      3.5GB
    OSIM2_10M_MSLR_MEDDRA_15    10                       Insidious      3.8GB
    OSIM2_10M_MSLR_MEDDRA_2     1.25                     Any Exposure   3.5GB
    OSIM2_10M_MSLR_MEDDRA_5     1.5                      Any Exposure   3.5GB
    OSIM2_10M_MSLR_MEDDRA_8     2                        Any Exposure   3.5GB
    OSIM2_10M_MSLR_MEDDRA_11    4                        Any Exposure   3.5GB
    OSIM2_10M_MSLR_MEDDRA_14    10                       Any Exposure   3.6GB
    OSIM2_10M_MSLR_MEDDRA_1     1.25                     Accumulative   3.5GB
    OSIM2_10M_MSLR_MEDDRA_4     1.5                      Accumulative   3.5GB
    OSIM2_10M_MSLR_MEDDRA_7     2                        Accumulative   3.5GB
    OSIM2_10M_MSLR_MEDDRA_10    4                        Accumulative   3.5GB
    OSIM2_10M_MSLR_MEDDRA_13    10                       Accumulative   3.7GB
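
    The numeric suffix of each dataset name in the table above appears to follow a regular pattern: suffix 0 has no injected signals, and suffixes 1 to 15 step through the three risk types within each relative risk level. The following Python sketch decodes a dataset name on that assumption. The function name and constants are hypothetical and are not part of the OSIM2 distribution; the mapping is inferred only from the table, so check it against the OSIM2 documentation before relying on it.

```python
# Hypothetical helper (not part of OSIM2): decode a dataset name into its
# injected relative risk and risk type, using the pattern seen in the table.
PREFIX = "OSIM2_10M_MSLR_MEDDRA_"
RELATIVE_RISKS = [1.25, 1.5, 2, 4, 10]                      # one level per 3 suffixes
RISK_TYPES = ["Accumulative", "Any Exposure", "Insidious"]  # cycle within a level


def describe_dataset(name):
    """Return (relative_risk, risk_type); (None, None) for the unsignaled set."""
    if not name.startswith(PREFIX):
        raise ValueError("unexpected dataset name: " + name)
    index = int(name[len(PREFIX):])
    if not 0 <= index <= 15:
        raise ValueError("dataset suffix must be between 0 and 15")
    if index == 0:
        return (None, None)  # no signals injected
    relative_risk = RELATIVE_RISKS[(index - 1) // 3]
    risk_type = RISK_TYPES[(index - 1) % 3]
    return (relative_risk, risk_type)
```

    For instance, describe_dataset("OSIM2_10M_MSLR_MEDDRA_6") returns (1.5, "Insidious"), which matches the example dataset mentioned above.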

    Please note that these are very large files. We have tested the OSIM2 dataset downloads using FileZilla and WS-FTP. FileZilla is free, open-source client software that can be downloaded from: http://filezilla-project.org/download.php

    To log in to the anonymous FTP server use the following credentials:

    Login: anonymous
    Password: blank
    Our FTP server supports the SFTP protocol (port 22).

    On the server, there are two main folders:
    ● MedDRA: All data in this folder use MedDRA-based condition concepts.
        ○ Transition Matrices. Currently there are transition matrices available for the following databases: GE, MDCD, MDCR, MSLR
        ○ OSIM2 datasets. All 16 OSIM2 datasets are available in individual directories. These folders contain simulated data in Common Data Model Version 2. OSIM2 is not available in CDM V3, only in V2 format.

    ● SNOMED: All data in this folder use SNOMED-CT-based condition concepts.
        ○ Transition Matrices. Currently there are transition matrices available for the following databases: CCAE, MDCD, MDCR, MSLR
        ○ IN THE FUTURE: OSIM2 data will be available in SNOMED format.

    Please contact OMOP to share your experience with the OSIM2 datasets.