
Medicare Claims Data


Key web links 

Home page

Data file descriptions:

Dataset Summary

Medicare provides claims data (i.e., data generated by billing) for all Medicare patients across a wide variety of care settings including outpatient, inpatient, skilled nursing facility, hospice, home health agency, and more.  Recently, data from Medicare Part D (prescription drugs) has become available as well.  Within each care setting, 3 types of files are generally available:  (1) files with data that can allow individual patients to be identified (“RIF” files); (2) limited dataset files, which contain patient-level data but with identifying characteristics stripped from the data (“LDS” files); and (3) non-identifiable data files, which contain aggregate data without any patient- or provider-level data.   Most Medicare claims data is complex and requires extensive training and support to use, but provides a valuable venue for assessing health care utilization and outcomes.  Of note, data is generally available about the provision of a service rather than the outcome of that service (for example, that a lab test or surgical procedure occurred, without directly knowing the actual lab value or outcome of the procedure).  In addition, Medicare data can be linked to a variety of other datasets using unique patient identifier numbers.  Data is available with an application process; the complexity of the application process and the extent of fees charged vary by the type of data requested.

Expert comments  
The core strengths of Medicare data are its generalizability and its potential for enormous power. For example, in one proposal we were able to say that we would be able to examine 42% of all prostatectomies performed in the United States.  Once Medicare Part D data is available (the first investigators are getting this data in spring 2009), the range of research topics that can be addressed will grow enormously.  Many analyses have been performed with these data, and algorithms for a number of disease entities are available from CMS or from published papers.  However, these algorithms and templates must be used with care, for several reasons.  For example, a common problem is that codes for a clinical entity or disease are used on patients where the clinician has not truly diagnosed the condition, but rather is “ruling out” the condition or trying to get a test paid for (especially, but not only, tests that have restrictions on coverage).  A second common problem is finding incident disease.  For example, it can be easy to figure out that a patient has breast cancer (i.e., prevalent disease), but more challenging to determine when the condition was first diagnosed and treated (i.e., incident disease).  Some of the strongest research designs using Medicare data focus on procedures.  For example, such designs use a procedure that is performed for only one disease as a way to identify “incident” disease (e.g., surgically treated prostate cancer), or examine costs and hard outcomes such as mortality after procedures.  The Research Data Assistance Center (ResDAC) can provide frequency tables for requested codes (within limits), and some are available on the web.  Mortality information with approximate date of death is very reliable, but this information does not include cause of death.
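As a rough illustration of the procedure-based design described above, the sketch below takes the earliest claim carrying a given procedure code as each patient's “incident” date.  It is only a minimal sketch: the record layout, the beneficiary IDs, and the codes PROC_X and PROC_Y are hypothetical placeholders, not actual Medicare file fields or billing codes.

```python
from datetime import date

# Hypothetical claim records: (beneficiary_id, procedure_code, service_date).
# Field names and codes are illustrative only, not an actual Medicare file layout.
claims = [
    ("A001", "PROC_X", date(2006, 3, 14)),
    ("A001", "PROC_Y", date(2006, 1, 2)),
    ("A002", "PROC_X", date(2007, 9, 5)),
    ("A002", "PROC_X", date(2007, 8, 30)),  # same procedure code billed twice
    ("A003", "PROC_Y", date(2005, 5, 20)),
]


def first_procedure_dates(claims, index_code):
    """Return {beneficiary_id: earliest service date} for claims with index_code."""
    first = {}
    for bene_id, code, service_date in claims:
        if code != index_code:
            continue  # ignore claims for other procedures
        if bene_id not in first or service_date < first[bene_id]:
            first[bene_id] = service_date
    return first


# Patients with at least one PROC_X claim, keyed to their first such claim.
incident = first_procedure_dates(claims, "PROC_X")
```

Patients who never have the index procedure (A003 here) simply do not appear in the result.  A real analysis would also need to confirm that each patient was enrolled and observable before the index date, which this sketch does not attempt.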

Beginning work with the datasets can be daunting, both because of the computing power needed and because the data look unfamiliar.  ResDAC offers 2- to 3-day introductory seminars, which can be helpful, but a programmer with experience in claims data is often necessary as well.  As shown in the database description, data can be obtained as limited datasets or research-identifiable files, and care should be taken in deciding which to use.  Within each of these, there are usually several common options, such as the whole United States, a 5% sample of the U.S., or a single-state sample.  The application process for the RIF files is fairly involved and can take months, but these files offer some distinct advantages.  For example, linkages can be made with other datasets (e.g., the American Hospital Association).

Dataset Details

Dataset owner / manager  

Centers for Medicare and Medicaid Services (CMS)
Study and sample characteristics

Ongoing data collection for all services billed for patients participating in the Medicare program, which includes persons age 65 years and older, persons with end-stage renal disease or amyotrophic lateral sclerosis (regardless of age), and some persons with disability (regardless of age).  This includes services in the inpatient setting, in outpatient settings, in skilled nursing facilities, hospices, and home care agencies; charges for durable medical equipment; and, most recently, data on drugs purchased under the Medicare Part D prescription drug benefit.

For each care setting, three general types of data are available:

Research Identifiable Files (RIFs) – data 1991 to present
RIFs include data that allows individual patients and providers to be identified, for example by name, date of birth, Unique Physician Identification Number, and so forth.  They are the most tightly restricted of the files.

Limited Data Set Files (LDS) – data 2000 to present
LDS files contain the same information as RIF files but with all personally identifiable information removed or encrypted.  In most cases, these files are available as a 100% national sample, a 5% national sample (i.e., a quasi-randomly selected 5%), or state-specific data.  The advantages of the smaller data files (e.g., the 5% file) are smaller file size and lower cost.

Non-Identifiable Files
These files contain aggregate data with no physician- or patient-level data.   Some of these files contain summaries of patient-level data from RIF or LDS files; others contain unique data, for example facility-specific information.

Major foci  

Standard analytic files (SAFs) provide data on individual claims submitted by institutions (e.g., hospitals or home care agencies) and non-institutional providers (e.g., physicians).   These data generally include claim-level information on diagnoses, procedures, Diagnosis Related Groups (DRGs), dates of service, reimbursement amounts, providers, and patient demographic information.

The following SAFs are available:
•    Inpatient
•    Skilled Nursing Facility
•    Outpatient – data from institutional outpatient providers (e.g., hospital outpatient departments, outpatient rehabilitation facilities, and so forth)
•    Home Health Agency
•    Hospice
•    Carrier – data from non-institutional outpatient providers (e.g., physicians, social workers, independent clinical laboratories, and so forth)
•    Durable Medical Equipment

MedPAR files provide an alternate view of inpatient and skilled nursing facility data insofar as they contain “final action stay” data on institutional stays.  This “final action stay” includes items such as discharge diagnoses, discharge status, and other data that summarize a patient’s institutional stay (rather than the claim-by-claim data format of the SAFs).

Denominator files provide demographic and enrollment information about Medicare beneficiaries.

The Vital Status file includes information on whether patients are alive or dead.

Medicare Part D (prescription drug) files provide information about patient enrollment in Medicare Part D (the prescription drug benefit program) and information on dispensed drugs (including drug identifier, amount dispensed, costs, and so forth).

Most of the files listed above are available as “Research Identifiable Files” and “Limited Dataset Files.”  In addition, a list of “Non-identifiable Files” (i.e., summative files) can be found at the web links below.

For descriptions of each file type, see:

Special supplements and resources

Links to other datasets

Data from Medicare claims files can be linked to other Medicare datasets that use the same unique identifier numbers for patients, providers, and institutions, for example the Medicare Current Beneficiary Survey, the Long Term Care Minimum Data Set, the American Hospital Association Annual Survey, and so forth.
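The kind of linkage described above amounts to matching records on a shared identifier.  The sketch below joins claim rows to institution-level survey rows by provider identifier and sets aside claims with no match.  All field names, identifiers, and values are hypothetical and do not reflect the actual layout of any Medicare or American Hospital Association file.

```python
# Hypothetical claim rows; "provider_id" stands in for a shared institution identifier.
claim_rows = [
    {"bene_id": "A001", "provider_id": "H10", "drg": "DRG_A"},
    {"bene_id": "A002", "provider_id": "H20", "drg": "DRG_B"},
    {"bene_id": "A003", "provider_id": "H99", "drg": "DRG_A"},  # no survey record
]

# Hypothetical institution-level survey data keyed by the same identifier.
hospital_rows = {
    "H10": {"beds": 250, "teaching": True},
    "H20": {"beds": 80, "teaching": False},
}


def link_claims_to_hospitals(claim_rows, hospital_rows):
    """Attach hospital attributes to each claim; return (linked, unmatched)."""
    linked, unmatched = [], []
    for claim in claim_rows:
        info = hospital_rows.get(claim["provider_id"])
        if info is None:
            unmatched.append(claim)  # keep track of claims that fail to link
        else:
            linked.append({**claim, **info})
    return linked, unmatched


linked, unmatched = link_claims_to_hospitals(claim_rows, hospital_rows)
```

Keeping the unmatched claims in a separate list, rather than silently dropping them, makes it easy to report how many records failed to link.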

Papers published  

Click here for a PubMed search for articles using this dataset.

Examples of papers published using Medicare claims data include:

Long-term outcomes and costs of ventricular assist devices among Medicare beneficiaries.
Hernandez AF, Shea AM, Milano CA, Rogers JG, Hammill BG, O'Connor CM, Schulman KA, Peterson ED, Curtis LH.
JAMA. 2008 Nov 26;300(20):2398-406.

Frequency of stress testing to document ischemia prior to elective percutaneous coronary intervention.
Lin GA, Dudley RA, Lucas FL, Malenka DJ, Vittinghoff E, Redberg RF.
JAMA. 2008 Oct 15;300(15):1765-73.

Association between the Medicare Modernization Act of 2003 and patient wait times and travel distance for chemotherapy.
Shea AM, Curtis LH, Hammill BG, DiMartino LD, Abernethy AP, Schulman KA.
JAMA. 2008 Jul 9;300(2):189-96.

Osteoporosis medication use in nursing home patients with fractures in 1 US state.
Parikh S, Mogun H, Avorn J, Solomon DH.
Arch Intern Med. 2008 May 26;168(10):1111-5.

Exploring the surgeon volume outcome relationship among women with breast cancer.
Nattinger AB, Laud PW, Sparapani RA, Zhang X, Neuner JM, Gilligan MA.
Arch Intern Med. 2007 Oct 8;167(18):1958-63.

Dataset accessibility and cost  

Data is available via an application request process through ResDAC; the extent of the application process varies according to the types of data requested.

The cost of data files ranges from several hundred dollars to more than ten thousand dollars, depending on the request.  The cost of Limited Data Set and Non-identifiable Files can be found under the heading “Files for Order.”  To obtain cost estimates for Research Identifiable data, contact the ResDAC assistance desk using the contact information below.

In general, ResDAC can help researchers determine what files are needed and methods for extracting data.

Help Desk  

Research Data Assistance Center (ResDAC)