COMBINING DATASETS USING STATA 10.x

 

 

Merge / Append / Reclink

 

 

If you do not see the menu on the left click here to see it (look for ‘merge’ in the menu)

 

 

Combining datasets using Stata is a frequent task in data analysis

 

 

MERGE

 

 

You merge when you want to add more variables to an existing dataset (type help merge in the command window for more details)

 

What you need:

 

·       Both files should be in Stata format

·       Both files should have at least one variable in common (id)

 

Step1. You need to sort the data (both datasets) by the id or ids common to the files you want to merge and save the files. Type (for each dataset in turn)

 

sort [id1] [id2] …

 

save [file name], replace

 

Open one dataset (considered the “master” file) and type:

 

merge [id1] [id2] … using [path and/or name of the other dataset]

 

For example (using a hypothetical data1.dta as “master” file, already open):

 

merge id using c:\mydata\data2.dta

 

or

 

merge lastname firstname using c:\mydata\data2.dta

 

 

 

To verify the merge type

 

tab _merge

 

From Stata help, here are the codes for _merge:

 

_merge==1    obs. from master data                           

_merge==2    obs. from only one using dataset                

_merge==3    obs. from at least two datasets, master or using

 

If you want to keep the observations common to both datasets you can drop the rest by typing:

 

drop if _merge!=3     /*This will drop observations where _merge is not equal to 3 */

 

 

 

APPEND

 

You append when you want to add more cases (type help append for more details).

 

append using [path and/or name of the other dataset]

 

For a PDF updated version see here:

 

http://dss.princeton.edu/training/Merge101.pdf

 

 

 

RECLINK

 

Matching fuzzy text

 

 

Reclink stands for ‘record linkage’. It is a program written by Michael Blasnik to merge imperfect string variables. For example

 

In dataset 1 you may have

 

Princeton University

 

In dataset 2 you may have:

 

Princeton U”

 

Reclink helps you to merge the two databases by using a matching algorithm for these types of variables.

 

Since it is a user created program, you may need to install it by typing

 

ssc install reclink

 

Once installed you can type help reclink for details

 

Data preparation:

 

·       As in merge, the merging variables must have the same name: state, university, city, name, etc.

·       Both the master and the using files should have an id variable identifying each observation. Note: the name of ids must be different, for example id1 (id master) and id2 (id using).

·       Sort both files by the matching (merging) variables

 

The basic sytax is:

 

reclink var1 var2 var3 … using myusingdata, gen(myscore) idm(id1) idu(id2)

 

 

The variable myscore indicates the strength of the match; a perfect match will have a score of 1.

 

Description (from reclink help pages):

 

   reclink uses record linkage methods to match observations between two datasets where no perfect key fields exist -- essentially a fuzzy merge.  reclink allows for user-defined matching and non-matching weights for each variable and employs a bigram string comparator to assess imperfect string matches.

 

    The master and using datasets must each have a variable that uniquely identifies observations.  Two new variables are created, one to hold the matching score (scaled 0-1) and one for the merge variable.  In addition, all of the matching variables from the using dataset are brought into the master dataset (with newly prefixed names) to allow for manual review of matches.”

 

 

 

The following links may be of interest for now:

 

http://www.ats.ucla.edu/stat/stata/modules/combine.htm

 

http://www.ats.ucla.edu/STAT/stata/faq/multmerge.htm

 

For a list of frequently used Stata commands see the following

 

http://www.ats.ucla.edu/stat/stata/notes2/commands.htm