Simulated Data

Methodological research typically requires some benchmark or ‘gold standard’ against which to measure performance. In this context, a desired gold standard would be a true causal relationship between a drug and a health outcome. Unfortunately, most observational data sources are poorly characterized, clinical observations may be insufficiently recorded or poorly validated, and actual ‘truth’ may not be absolutely determined. True relationships between drugs and outcomes may be difficult to ascertain as these ‘known associations’ may be affected by issues including sample size, adequacy of data capture, and confounding.

Because of these issues and the desire for a common, accepted test set, OMOP designed and developed an automated procedure to construct simulated datasets to supplement the methods evaluation. The simulated datasets, produced by the Observational Medical Dataset Simulator (OSIM), are modeled after real observational data sources. They comprise hypothetical persons with fictional drug exposures and health outcome occurrences, yet are representative of the types of relationships expected in real observational data sources. Because the simulated data represent hypothetical patients and fictional drug classes and outcome types, no clinical interpretations can be drawn from the data.

The simulated datasets will be used only to perform statistical evaluations of the analytical methods proposed for identifying drug-outcome associations. The known characteristics of the data enable each drug-outcome relationship to be classified as ‘true’ or ‘false’, and the methods will be executed to classify each drug-outcome pair as ‘positive’ or ‘negative’. The performance characteristics of the analytical methods (sensitivity, specificity, positive and negative predictive value) can then be measured empirically against these known characteristics.
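As an illustration of the four performance characteristics named above, the following minimal Python sketch cross-tabulates known truth against a method's classifications and derives each measure from the resulting counts. The function names, the dictionary/set input layout, and the drug and outcome labels are assumptions made for this example; they are not part of OSIM or of any OMOP tooling.

```python
from collections import namedtuple

# Counts from cross-tabulating simulated truth against a method's output.
ConfusionCounts = namedtuple(
    "ConfusionCounts",
    "true_positive false_positive true_negative false_negative",
)


def tally(truth, flagged):
    """Count agreement between known truth and a method's classifications.

    truth   -- dict mapping (drug, outcome) -> True if the association is 'true'
    flagged -- set of (drug, outcome) pairs the method classified as 'positive'
    (Both input layouts are hypothetical, chosen for this sketch.)
    """
    tp = fp = tn = fn = 0
    for pair, is_true in truth.items():
        is_positive = pair in flagged
        if is_true and is_positive:
            tp += 1
        elif is_true:
            fn += 1
        elif is_positive:
            fp += 1
        else:
            tn += 1
    return ConfusionCounts(tp, fp, tn, fn)


def _ratio(numerator, denominator):
    # An undefined ratio (empty denominator) is reported as NaN instead of raising.
    return numerator / denominator if denominator else float("nan")


def performance(c):
    """Return sensitivity, specificity, PPV and NPV for the given counts."""
    return {
        "sensitivity": _ratio(c.true_positive, c.true_positive + c.false_negative),
        "specificity": _ratio(c.true_negative, c.true_negative + c.false_positive),
        "ppv": _ratio(c.true_positive, c.true_positive + c.false_positive),
        "npv": _ratio(c.true_negative, c.true_negative + c.false_negative),
    }


if __name__ == "__main__":
    # Fictional pairs, in the spirit of the simulated data.
    truth = {
        ("drug_A", "outcome_1"): True,
        ("drug_A", "outcome_2"): False,
        ("drug_B", "outcome_1"): True,
        ("drug_B", "outcome_2"): False,
    }
    flagged = {("drug_A", "outcome_1"), ("drug_A", "outcome_2")}
    print(performance(tally(truth, flagged)))
```

Because truth is known for every simulated pair, all four cells of the table can be filled in, which is generally not possible with real observational data.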

OSIM Publications
Murray RE, Ryan PB, & Reisinger SJ. (2011). Design and Validation of a Data Simulation Model for Longitudinal Healthcare Data. AMIA Annu Symp Proc., USA, 2011: 1176–1185.


Simulated Observational Data - OSIM2 Available for CDM V2 only

This web page presents the Observational Medical Dataset Simulator (OSIM) Version 2 (updated April 11, 2012).

The initial Observational Medical Dataset Simulator was released in 2009 and used to generate datasets with millions of hypothetical patients with drug exposure, background conditions, and known adverse events for the purpose of benchmarking methods performance. OSIM has provided large-scale datasets to methodologists and facilitated the establishment of the OMOP Cup Competition. It also advanced the OMOP Research Team's insights into the complex interdependencies between clinical observations in real data, and into how those relationships may influence a method's ability to identify true associations and distinguish them from false positive findings.

Based on these insights, continued research has resulted in the development of a second-generation simulated dataset procedure, known as OSIM2. OSIM2 represents an alternative design to accommodate additional complexities observed in real-world data, including advanced modeling of the correlations between drugs and conditions. OSIM2 allows for more direct comparisons between simulated data and real observational databases, and should enable more thorough methods evaluation by allowing assessment of how methods accommodate these complex interrelationships. At OMOP, OSIM2 is used to benchmark the performance of methods that estimate the strength of association between drug treatment and outcome.

OSIM2 source code, documentation, and databases are available for download:

  • OSIM2 Introduction (without audio)
  • OSIM2 Introduction (with audio narration)
  • OSIM2 Architecture and Execution
  • OSIM2 Source Code and Documentation
  • OSIM2 validation dashboard procedures
  • Download of OSIM2 Datasets
    We have generated 16 OSIM2 datasets, all now available for download. Each contains 10 million persons and is modeled after the Thomson Reuters MarketScan® Lab Database (MSLR). One dataset has no signals injected; the other 15 have injected signals that differ in size (relative risk: 1.25, 1.5, 2, 4, 10) and in risk type: acute onset (listed as 'any exposure'; events occurring within 30d of exposure start), insidious, and accumulative. MSLR covers 2003 – 2009 and represents a privately insured population, with administrative claims from inpatient, outpatient, and pharmacy services supplemented by laboratory results.

    The datasets listed below are freely available for download through OMOP’s anonymous FTP server. For example, you can download OSIM2_10M_MSLR_MEDDRA_6, which has a set of signals injected at RR=1.50 with insidious onset (during exposure or 30d afterwards).

    OSIM2 Datasets              Injected Signals at Relative Risk Equals   Risk Type      Size
    OSIM2_10M_MSLR_MEDDRA_0     None                                       None           3.5GB
    OSIM2_10M_MSLR_MEDDRA_3     1.25                                       Insidious      3.5GB
    OSIM2_10M_MSLR_MEDDRA_6     1.5                                        Insidious      3.5GB
    OSIM2_10M_MSLR_MEDDRA_9     2                                          Insidious      3.5GB
    OSIM2_10M_MSLR_MEDDRA_12    4                                          Insidious      3.5GB
    OSIM2_10M_MSLR_MEDDRA_15    10                                         Insidious      3.8GB
    OSIM2_10M_MSLR_MEDDRA_2     1.25                                       Any Exposure   3.5GB
    OSIM2_10M_MSLR_MEDDRA_5     1.5                                        Any Exposure   3.5GB
    OSIM2_10M_MSLR_MEDDRA_8     2                                          Any Exposure   3.5GB
    OSIM2_10M_MSLR_MEDDRA_11    4                                          Any Exposure   3.5GB
    OSIM2_10M_MSLR_MEDDRA_14    10                                         Any Exposure   3.6GB
    OSIM2_10M_MSLR_MEDDRA_1     1.25                                       Accumulative   3.5GB
    OSIM2_10M_MSLR_MEDDRA_4     1.5                                        Accumulative   3.5GB
    OSIM2_10M_MSLR_MEDDRA_7     2                                          Accumulative   3.5GB
    OSIM2_10M_MSLR_MEDDRA_10    4                                          Accumulative   3.5GB
    OSIM2_10M_MSLR_MEDDRA_13    10                                         Accumulative   3.7GB
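
The numeric suffixes in the table above appear to follow a regular pattern: suffix 0 is the dataset without signals, and suffixes 1 to 15 step through the risk types (Accumulative, Any Exposure, Insidious) within each relative risk level. The Python sketch below decodes a dataset name on that basis. The pattern is inferred from the table only and is not a documented naming rule, and the function name `describe_dataset` is hypothetical.

```python
import re

# Levels and ordering as read off the dataset table; the mapping is an
# inference from that table, not a documented OMOP convention.
RELATIVE_RISKS = (1.25, 1.5, 2.0, 4.0, 10.0)
RISK_TYPES = ("Accumulative", "Any Exposure", "Insidious")
NAME_PATTERN = re.compile(r"^OSIM2_10M_MSLR_MEDDRA_(\d+)$")


def describe_dataset(name):
    """Return (relative_risk, risk_type) for a dataset name from the table.

    The dataset without injected signals (suffix 0) yields (None, None).
    """
    match = NAME_PATTERN.match(name)
    if match is None:
        raise ValueError(f"unrecognized dataset name: {name!r}")
    index = int(match.group(1))
    if index == 0:
        return (None, None)
    if index > 15:
        raise ValueError(f"dataset index out of range: {index}")
    # Suffixes 1-3 share the first relative risk, 4-6 the second, and so on.
    level, offset = divmod(index - 1, 3)
    return (RELATIVE_RISKS[level], RISK_TYPES[offset])


if __name__ == "__main__":
    for suffix in (0, 6, 14):
        name = f"OSIM2_10M_MSLR_MEDDRA_{suffix}"
        print(name, describe_dataset(name))
```

A lookup like this can help when scripting an evaluation across all 16 datasets, but the table itself remains the authoritative listing.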

    Please note that these are very large files. We have tested the OSIM2 dataset downloads using FileZilla and WS-FTP. FileZilla is a free, open-source client.

    To log in to the anonymous FTP server use the following credentials:

    Login: anonymous
    Password: blank
    Our FTP server supports the SFTP protocol (port 22).

    On the server, there are two main folders:
    ● MedDRA: All data in this folder use MedDRA based condition concepts.
        ○ Transition Matrices. Currently there are transition matrices available for the following databases: GE, MDCD, MDCR, MSLR
        ○ OSIM2 dataset. All 16 OSIM2 datasets are available in individual directories. These folders contain simulated data in Common Data Model Version 2. OSIM2 is not available in CDM V3, only in V2 format.

    ● SNOMED: All data in this folder use SNOMED-CT based condition concepts.
        ○ Transition Matrices. Currently there are transition matrices available for the following databases: CCAE, MDCD, MDCR, MSLR
        ○ IN THE FUTURE: OSIM2 data will be available in SNOMED format.

    Please contact OMOP to share with us your experience with OSIM2 datasets.


    The Observational Medical Dataset Simulator (OSIM) Generation 1 is an open-source software application, written in R, that allows users to create simulated datasets that conform to the OMOP Common Data Model.  The simulation creates hypothetical persons with fictitious drug exposure and condition occurrence, with known characteristics that represent the types of scenarios expected in real observational sources.  The procedure is being used to create simulated datasets to support OMOP's central methods development activities, as well as to facilitate the OMOP Cup methods competition.  OMOP hopes OSIM will provide the broader research community a valuable tool to support the implementation and evaluation of alternative approaches for observational analyses.  

    Click on the document titles below to obtain the following: