COMBINING DATASETS USING STATA 10.x
If you
do not see the menu on the left click here to see it (look for
‘merge’ in the menu)
Combining
datasets using Stata is a frequent task in data analysis
You
merge when you want to add more variables to an existing dataset
(type help
merge in
the command window for more details)
What
you need:
·
Both
files should be in Stata format
·
Both
files should have at least one variable in common (id)
Step1.
You need to sort the data (both datasets) by the id or ids common to the files
you want to merge and save the files. Type (for each dataset in turn)
sort [id1] [id2] …
save [file name], replace
Open one dataset (considered the
“master” file) and type:
merge [id1] [id2] … using [path
and/or name of the other dataset]
For
example (using a hypothetical data1.dta as “master” file, already open):
merge id using
c:\mydata\data2.dta
or
merge lastname
firstname using c:\mydata\data2.dta
To
verify the merge type
tab _merge
From
Stata help, here are the codes for _merge:
_merge==1
obs. from master data
_merge==2
obs. from only one using dataset
_merge==3
obs. from at least two datasets, master or using
If
you want to keep the observations common to both datasets you can drop the rest
by typing:
drop if _merge!=3 /*This will drop observations where _merge
is not equal to 3 */
You
append when you want to add more cases (type help append for more details).
append using [path and/or name of
the other dataset]
For
a PDF updated version see here:
http://dss.princeton.edu/training/Merge101.pdf
Matching
fuzzy text
Reclink stands for ‘record linkage’. It is a program
written by Michael Blasnik to merge imperfect string
variables. For example
In
dataset 1 you may have
“
In
dataset 2 you may have:
“
Reclink helps you to merge the two databases by
using a matching algorithm for these types of variables.
Since
it is a user created program, you may need to install it by typing
ssc install reclink
Once
installed you can type help
reclink for details
Data
preparation:
·
As
in merge, the merging variables must have the same name: state, university,
city, name, etc.
·
Both
the master and the using files should have an id variable identifying each
observation. Note: the name of ids
must be different, for example id1 (id master) and id2 (id using).
·
Sort
both files by the matching (merging) variables
The
basic sytax is:
reclink var1 var2 var3 … using myusingdata, gen(myscore) idm(id1) idu(id2)
The
variable myscore indicates the strength of the match; a perfect match
will have a score of 1.
Description
(from reclink help pages):
“reclink uses
record linkage methods to match observations between two datasets where no
perfect key fields exist -- essentially a fuzzy merge. reclink allows for
user-defined matching and non-matching weights for each variable and employs a
bigram string comparator to assess imperfect string matches.
The master and using datasets must each
have a variable that uniquely identifies observations. Two new variables are created, one to hold
the matching score (scaled 0-1) and one for the merge variable. In addition, all of the matching variables
from the using dataset are brought into the master dataset (with newly prefixed
names) to allow for manual review of matches.”
The
following links may be of interest for now:
http://www.ats.ucla.edu/stat/stata/modules/combine.htm
http://www.ats.ucla.edu/STAT/stata/faq/multmerge.htm
For
a list of frequently used Stata commands see the following
http://www.ats.ucla.edu/stat/stata/notes2/commands.htm