If you do not see the menu on the left click here to see it



Before we introduce you to programming in Stata we need to make sure you know how to enter data into Stata and learn some basic commands along the way[1]. We will download real data in their original formats and we will proceed from there. Once you are familiar with the basic principles of Stata we will move to learn something about the famous ‘do-files”. If you have some experience with Stata you can go straight to the programming part of the course.


Stata is a multipurpose statistical package to help you perform data analysis, data manipulation and graphics.


There are several ways to enter data into Stata, in this course we will learn four different ways:


·         The ‘infile way from ASCII files using dictionary (*.dat, *.txt)

·         The ‘use’ way to open Stata files (*.dta, and exporting from other stats packages like SPSS)

·         The ‘quickie’ way from excel (copy-and-paste)

·         The ‘insheet way from CSV files (*.csv)

·         The ‘infix’ way for ASCII data (you need the layout or codebook to read columns)


Depending on your experience with Stata you should go ahead and study the following sections at your own pace.


To start using Stata you need two things: a properly licensed version of the software and data.


Let’s take a brief moment to introduce Stata to those who have never used it before. Stata users can skip this part.


When you first run Stata you will see this screen


Description: Description: Description: Description: Stata01


With some brief comments.


Description: Description: Description: Description: image007


Stata has four windows.


·         The review window (upper left) will show the history of the commands you submit (“_rc” means error code)

·         The results window (upper right) will show the output of every procedure you run

·         The variables window (lower left) will show the variables of your dataset

·         The command window (lower right) is where you type your Stata code and also DOS commands.


You can always use the “point-and-click” method by using the menu. We recommend however, for most of the procedures, to use the command line.


When you work with Stata there are three basic procedures you may want to do first: create a log file, set your working directory, and set the correct memory allocation for your data.


The log file first. Go to File - Log - Begin


Description: Description: Description: Description: Stata02



Stata will ask you where to save the log file (you have the option of appending new output to an existing log), choose the directory where you want to save the log. For this course we will use the following directory in your H: drive




I will call the log file as Log1 and save it as a log file (which can be read by any word processor).


Click on “Save as type:” right below ‘File name:” and select Log (*.log). This will create the file called Log1.log which can be read by any word processor or by Stata (go to File – Log – View). If you save it as *.smcl (Formatted Log) only Stata can read it. It is recommended to save the log file as *.log



Description: Description: Description: Description: Stata03


When you save it, the Stata results window you will see something like this:




In the review screen you will see the list of the commands you have typed so far and in the results screen you will see the output of that command (this is useful if you need to re-type a previous command just click on it in the review window and it will appear in the command window)


An alternative is to type the following code in the command window without the ‘point-and-click’ method (it is always recommended to work with a log file open):


log using "H:\statadata\Log1.log", append


The second thing to do is to check is your working directory. In the command window type the following




pwd’ stands for ‘print working directory’. This will show you your working directory, which right now, in this example is H:\statadata.




You can also type in the command window




This will show you what is in that directory (good old DOS command).


o change directory type in the command window


cd H:\statadata\



You can see your current directory by looking at the lower left of the Stata screen.


Description: Description: Description: Description: StataPWD



The third step is to set the necessary memory allocation.


In the picture above you can see in green letters after “Notes:” that the memory allocation is 10 mb. This will be enough for a medium size database but sometimes you may need more memory space to store your dataset.


To determine the size of your dataset follow the formula:


Size (in bytes) = (8*Number of cases or rows*(Number of variables + 8))


Depending on your Stata version and computer power, you can allocate up to around 2 gigabytes. To allocate 1 g you can type:


set mem 1g



Click here for a video demonstration of these first steps.



Dealing with public opinion data


Please go to the Roper Center website:




Go to the drop-down menu, select “Dataset Collection” - “Recent Acquisitions”


Description: Description: Description: Description: Roper01



Find the following study (if you can’t find it try “search for datasets” in the same dropdown menu):


Los Angeles Times/Bloomberg News Poll # 2007-544: 2008 Democratic and Republican Presidential Primaries/Iraq (United States)

Conducted by Los Angeles Times, field dates: April 5-9, 2007, sample: National adult [USLAT2007-544]


You can search by using the study # USLAT2007-544.


Click on the link for the study description. You will see links to the documentation (codebook) and the data.


Description: Description: Description: Description: LAT544


Codebook is offered in two flavors: pdf and word, download one or the two of them if you want.


Data is in two formats: ASCII (*.dat, sometimes is *.txt or *.data)  and SPSS portable (which you can open using SPSS and later save it – file-save as—as Stata –this only work for SPSS v15 or later).


Download both files. You will be required to enter your Princeton email (no password required as long as you access it from the Princeton network)


Description: Description: Description: Description: Roper02


Please save your files in your “H:\” drive. For the course purposes we will be working in the following directory:





Reading ASCII data using Stata


Most data files are in ASCII format (*.dat, *.txt, *.data, fixed-format. It means “American Standard Code for Information Interchange”).


When dealing with large datasets it is useful to create a dictionary file (*.dct)[2] using the “data locations” in the codebook (you can find this section either the *.pdf or *.doc).


A dictionary file is useful when dealing with free format ASCII data.


The dictionary will help you to let Stata know how many columns it has to read to process each variable (it is like a map of your data)


To create a dictionary file you can use notepad, wordpad or the do-file editor in Stata (the latter highly recommended)


By “creating” I mean open the word processor, typing the code and saving the file in a proper format (with extension *.dct)


Let’s say we are interested in the following variables (see the codebook for the lat544 database):


[NOTE: The following is an extract from a codebook. Codebooks differ on how they present the layout of the data, you need to look for: variable name, start column, end column or length, and format of the variable (whether is numeric –and how many decimals if any- or string)]


Data Locations


Variable          Rec   Start     End  Format


WTVAR               1       1       7  F7.2

GWBUSHJO            1      24      25  A2

GWBECON             1      26      27  A2

ECOSTATU            1      32      33  A2

DEMORREP            1      44      45  A2


·         First column tells you the variable name,

·         Second column in which record is located (for more on this visit http://www.columbia.edu/acis/eds/stat_pak/stata/stata-write.html, or  http://www.stata.com/support/faqs/data/dict.html),

·         Third column indicates where the variable starts,

·         Fourth column “End” shows where the variable ends and the last column shows the format of the variable (numeric or alphanumeric). For example, variable WTVAR is in record 1, starts at column 1 end column 7, is numeric (letter “F”) and has 2 decimal points.


For the lat544 data example there are two ways we can do this. One is using infix and the other is using infile.


The easiest way to extract ASCII data and put it into Stata is to type directly in the command window the layout of the variables you want by using infix. Type


infix varname [start column-end column] using mydatafile.*


For example:


infix wtvar 1-7 gwbushjo 24-25 gwecon 26-27  ecostatu 32-33 str demorrep 44-45 using lat544.dat



NOTE: If a variable is a string character you should add ‘str’ before the variable name (not after) so Stata reads it as string.


This is what you will see in the output window.



And this is what you will see in the variables window (using either method).




You can also open a do-file and use it to run the code (this may be useful when dealing with a lot of variables)


infix ///

          wtvar 1-7 ///

  str     gwbushjo 24-25 ///

  str     gwecon 26-27 ///

  str     ecostatu 32-33 ///

  str     demorrep 44-45 ///

using lat544.dat


NOTE: The “///” tells Stata that all lines are part of one command (or one line)


You can also use infile to read fixed-format data (the dictionary file is a bit more complex, type help infile for further details).


Using notepad or the do-file editor the dictionary TYPE the following (do not copy-and-paste, if you do so apostrophes need to be re-typed):


dictionary using lat544.dat {

       _column(1)  wtvar         %7.2f     “Weight”

       _column(24) gwbushjo      %2s       “GW Bush Job Rating”

       _column(26) gwecon        %2s       “GW Bush and the economy”

       _column(32) ecostatu       %2s       “Status of Nat Econ”

       _column(44) demorrep      %2s       “Vote demo or rep”





·         “dictionary using” is the standard code to define the file as a Stata dictionary (this is “dictionary using [name of the file with the ASCCI data] {“). Also the curly brackets should always be in the position you see them.

·         [IMPORTANT: Always a hard return for every line]

·         “lat544.dat” is the name of the ASCII data.

·         _column(*) Indicates the position where the variable starts

·         Next write the name of the variable

·         “%7.2f” Indicates the format of the variable (the number after % shows the number of columns -7 in this case- and the decimal points -2 in this example-, the letter “f” refers to a numeric fix format –an “s” will indicate a string variable –type help format in the Stata command window for detail information)

·         After the format you have the option to add a variable name in quotations.


Save the file as “lat544.dct”. When saving, make sure to select as “All files” in ‘Save as type:’ (right below of ‘file name’). This is how you create a dictionary file. To run this code you need to open Stata.



Running a dictionary file


Using notepad or the do-file editor we wrote the following dictionary:


dictionary using lat544.dat {

       _column(1)  wtvar         %7.2f     “Weight”

       _column(24) gwbushjo      %2f       “GW Bush Job Rating”

       _column(26) gwecon        %2f       “GW Bush and the economy”

       _column(32) ecostatu      %2f       “Status of Nat Econ”

       _column(44) demorrep      %2s       “Vote demo or rep”




Save it as lat544.dct


To read data using the dictionary we need to import the data by using the command infile. If you want to use the menu go to File à Import à “ASCII data in fixed format with a data dictionary”.


With infile we run the dictionary in the following way (for a details type help infile in the command line in Stata, also you may want to check out infix and outfile):


infile using lat544


Description: Description: Description: Description: Stata07



In the “results” window you will see your dictionary and the number of observations for your data (at the end, press the ‘space-bar’ to finish the program). On the lower left side in the variables window you will see the selected variables (red arrow). In the output (results) window you will get a message that one variable has a string character (“&”). Stata will ignore this character and convert it to missing value.


To practice try to import some other variables. This skill may be useful later on during this training.


If you type browse (in the command line), another window will open showing your data in a spreadsheet format


Another way to use infile is by typing directly into the command window (you can use this method If your data is separated by comma, tabs or space)


infile [list of variables] using [name of the datafile including the extension]


This command works fine when you have few variables. If one of them is string you need to specify for example


infile string7 v1 v2 v3 using mydata.txt


In this case v1 is a string up to 7 characters long, the other variable v2 and v3 are numeric which is the default.


Sometimes your data has a more complicated structured where cases are in more than one row:



Example of a dictionary file when you have more than one record:


dictionary using tree.dat  {



             _column(1)         idnum       %4f

             _column(5)         treetype    %2f


             _column(5)         soilphn     %3.2f   "Soil PH - North Side"

             _column(8)         soilphe     %3.2f   "Soil PH - East Side"

             _column(11)        soilphs     %3.2f   "Soil PH - South Side"

             _column(14)        soilphw     %3.2f   "Soil PH - West Side"


             _column(5)         height      %5.1f

             _column(10)        circ        %5.1f


Source: http://www.columbia.edu/acis/eds/stat_pak/stata/stata-write.html


Save the dictionary with extension *.dct [for example mydictionary.dct]


Run it by typing:


infile using mydictionary




Reading ASCII data using SPSS


If you get your data in ASCII format and a setup file for SPSS (with extension *.sps) you need run SPSS, go to file à open à syntax, find and open the *.sps. A SPSS text editor will open (it is called ‘syntax’, work the same way as ‘do-files’ for Stata), you will see something like this:


DATA LIST FILE="path to data" /                                      

   wtwar   1-7

   gwbushjo 24-25

   gwecon   26-27

   ecostatu  32-33

   demorrep  44-45 (A).



In ‘DATA LIST FILE’ write the path to you data with the full name of the dataset (including the extension)


DATA LIST FILE="C:\mydata\lat544.dat" /                                      

   wtwar   1-7

   gwbushjo 24-25

   gwecon   26-27

   ecostatu  32-33

   demorrep  44-45 (A).




NOTE: The “(A)” means that ‘demorrep’ is a string variable. Alto notice the dots at the end of each command.


Select all and click on Description: Description: Description: Description: arrow to run it.


Description: Description: Description: Description: syntax



In the data window you will see your data.



Data from portable SPSS


Problems? Click here


If the data is already available in a statistical package format other than Stata, it is easier to use that format (providing you have the software) and save it or export it to Stata. This has the advantage of including the variable labels and, in some cases, the value labels of the data. You can also use DBMS/Copy (click here to learn how to use it)


Assuming that you have SPSS, go ahead and open the file by double clicking on lat544.por or by opening SPSS and using the menu:


Description: Description: Description: Description: SPSS01


Change the ‘file of type’ to “SPSS Portable (*.por)”, select the file and click “Open”



Description: Description: Description: Description: SPSS02


Once in SPSS you can save the data as Stata format. In SPSS go to File à Save As, the following screen will comes up:



Description: Description: Description: Description: SPSS03


Following the red arrow select from the list the latest version of Stata (or the version you are working with). As an option, you can select the variables you need by clicking in “Variables” (green arrow).  For now let’s get the whole dataset.


Save the data as lat544 and click “Save”. The data will be exported as lat544.dta


Going back to Stata.


A note on the log file. you can close the log file and continue using it later on.


To close the log, type


log close


To continue working on the same log file type the following


log using "H:\statadata\Log1.log", append


The option append will add new output to your existing log file.


If you still have your previous data from the dictionary, type




This will clear the data in memory so you can start with a new dataset


To read a Stata file, type


use lat544




use “C:\myfolder\mydata\lat544.dta”



You can also use the menu to read a Stata file, go to File - Open.


Description: Description: Description: Description: Stata08



The variable window will be populated with all the variables in the dataset (with the variable name, label, type and format)


In the command line type browse if you want to check the data.


A new spreadsheet-like window will come up. 


Close the window to go back to the command line in Stata (it is important to know that when you browse or edit your data you cannot use any of the other four windows until you close the data editor)


Once you have the data in Stata you can explore it by running the commands: describe, list, summarize and codebook


In the command window type




Description: Description: Description: Description: Stata14



The describe command will provide you info for the active dataset (see white arrow) and the format of the variables (“display format”). [Hit enter or spacebar to see the rest of the list]. Type help describe for further details (if the “--more-- ” message bugs you, type set more off)


The list command will list the data in a table format. Since we have many variables it will be hard to read (try it, to stop the process type the letter “q” or click on the red dot with a white “x” in the icon row below the menu). However you can list some variables as follows:


list  ecostatu ownfinan gender




The summarize command provides you with more information about your data. Type:




Description: Description: Description: Description: Stata16


Summarize table tells you the number of cases, their mean, sd and min and max values. Notice the ‘0” for most of the variables. This means that those variables are in text (or string) format not numeric. From the codebook we know they are supposed to be numeric.


To get percentiles and other statistics you can type the following[3]



Codebook is another useful Stata command to explore your data, type


Description: Description: Description: Description: Stata26


Type help describe, help list, help summarize and help codebook in the command window for further details (check also help inspect)


Numeric-string to numeric-numeric


To convert numeric variables with a string (numbers in red) format into numeric we use the command destring


destring, replace


Description: Description: Description: Description: Stata17


As you can see some variables were replaced by a numeric format but others did not (see along the red line, the screen may differ a bit from what you see) because they contain some string characters.


Let’s run a frequency of one of these variables to see what is going on. To do this we use the command tab


Description: Description: Description: Description: Stata18


Along the red line note the string character “&” but all the rest are numbers. We can still convert this by using the option ‘ignore’. Type the following:


destring, replace ignore(&)


Now all variables with the string character “&” will be converted to numeric and “&” set to missing. You can do this with any character (a, b, x, y, *, etc).




NOTE: If destring still does not work, here are some special cases you may want to check:


1)    Commas:








            In this case type destring Var1, replace ignore(,)


                              WARNING: Sometimes decimals are separated by comas, when using this make sure commas indicate thousands not decimals. If you have decimals separated by comma type: destring Var1, replace dpcomma



2)    Spaces




            1 345


             5 677


            In this case type destring Var1, replace ignore( )


3)    Dots




            1 345


             5 677


            In this case type destring Var1, replace ignore(..) or sometimes there is a space, type: destring Var1, replace ignore(.. )







You can also destring an individual variable, type


destring [variable(s)], replace



To save the data go to File - Save As or type


save lat544, replace.


Continuing with the command tab (tabulate). Go ahead and type the following:


tab ecostatu


In the results window you will se a frequency distribution of the variable ‘ecostatu’.


Description: Description: Description: Description: Stata09


If you tab two variables you will get a crosstabulation of those variables (not two different frequencies)


tab ecostatu gender


Description: Description: Description: Description: Stata10


Please note the frequency distribution and compare it with the data in the codebook. For the variable ecostatu in the codebook



The difference between the codebook frequencies and the Stata are the weights. The codebook presents weighted data. So we need to weight the data to get the right numbers. The variable “wtvar” in the dataset (first variable in the list) contains the weights. Type the following command and see the difference with the previous one:


tab  ecostatu [aw=wtvar]


Description: Description: Description: Description: Stata13


Where “aw” means ‘analytic weights’ (type help weight in the Stata command line for more details)


If you want to generate frequencies for more than one variable you use tab1 instead of tab[4]:


tab1 gwbushjo gwbecon ecostatu


Description: Description: Description: Description: Stata11


Tab is a powerful command (type help tab in the command window for more details). For example, if you want to test the hypothesis that two variables are independent and you want to have row and column percentages you can use the tab command with the following options:


tab  ecostatu ownfinan, row col chi2


Description: Description: Description: Description: Stata12



Adding value labels


By now you might be wondering what 1, 2, 3, etc. mean. When working with public opinion data you work mostly with categorical variables whose values most of the time need to be labeled[5]


Stata creates an ‘alternative’ database for labels and you will need two commands to label the values of the variables: label define and label values


Label define assigns a label to each category and label values assigns specific labels to a variable


In the case of gender, according to the codebook ‘1’ is for ‘male’ and ‘2’ is for ‘female’. Se we create a label called ‘sex’ as follows. Type:


label define sex 1 male 2 female


And we assign it to the variable gender:


label values gender sex


Type tab gender and you will see the frequency distribution with the labels


The same thing with ‘ecostatu’. According to the codebook these are the codes for the values


   1 VERY WELL                                                                 

   2 FAIRLY WELL                                                               

   3 FAIRLY BADLY                                                               

   4 VERY BADLY                                                                

   5 NOT SURE                                                                  

   6 REFUSED                                                                    


Lets create a value label called ‘scale1’.


label define scale1 1 “Very well” 2 “Fairly well” 3 “Fairly badly” 4 “Very badly” 5 “Not sure” 6 “Refused”


And apply it to variable “ecostatu


label values ecostatu scale1


Type tab ecostatu to see it with the labels.


If you do not want to see the labels, type:


tab ecostatu, nolabel


Description: Description: Description: Description: Stata19



Converting to/from missing values


--- From value to missing you type


For one value

mvdecode [name of the variable (type ‘_all’ if using all variables)], mv([# to missing])


For more than one

mvdecode [name of the variable (type _all if using all variables)], mv([# to missing] [# to missing] …)




mvdecode [name of the variable (type _all if using all variables)], mv([# to missing]=. \ [# to missing]=.a \ [# to missing]=.b)


--- From missing to value


For one value

mvencode [name of the variable (type ‘_all’ if using all variables)], mv([# assigned to missing])


For more than one

mvencode [name of the variable (type _all if using all variables)], mv(.=[# assigned to missing]\ .a=[# assigned to missing] \ .b=[# assigned to missing] )


Type help mvdecode or help mvencode for more details.




Using the previous example let’s say you want to convert the option 5 (“Not sure”) and 6 (“Refused”) to missing. Type


mvdecode  ecostatu, mv(5=. \ 6=.a)




You get this



To do the reverse type:


mvencode  ecostatu, mv(.=5 \ .a=6)




Converting text to date


Let’s say you have a date variable in either one of the following formats: “date1” and “date2” are strings (text) and “date3” is a plain number. This data goes from Jan 1st, 1995 to Jan 14, 2008. Neither of them is formally in date format. We will deal with each in turn.





For ‘date1’ use à

STATA 10 - gen datevar1 = date(date1,"DMY", 2008)

STATA 9.2 - gen datevar1 = date(date1,"dmy", 2008)


For ‘date2’ use à

STATA 10 - gen datevar2 = date(date2,"MDY", 2008)

STATA 9.2 - gen datevar2 = date(date2,"mdy", 2008)


In both cases after creating the date variable you need to format it as follows:


format  datevar1 %td[6]




Converting “date1


The structure of “date1” is day-month-two digit year. For this we will use the function date() to convert string variables into date variables. Type




gen datevar1 = date(date1,"DMY", 2008)[7]




gen datevar1 = date(date1,"dmy", 2008)[8]


Note, 2008 indicates that the date variable changes from 1999 to 2000. In this case it points to the last year in the series.


then type:


format  datevar1 %td[9]


“2008” indicates the last year of the series. Date() function only recognizes year as four digits for one century, adding 2008 forces Stata to consider the change in centuries. For more details type help date.



“Datevar1” should have the same structure as “date1”. Check the change in century.




If you need quarterly data you can transform that variable using the following:


gen quarters=qofd(datevar)


format  quarters %tq




Converting “date2


The structure of “date2” is month-day-four digit year. For this we will also use the function date() to convert string variables into date variables. Type



gen datevar2 = date(date2,"MDY", 2008)



gen datevar2 = date(date2,"mdy", 2008)


Note, 2008 indicates that the date variable changes from 1999 to 2000. In this case it points to the last year in the series.


then type:


format  datevar2 %td


Notice that we do not have to specify 2008 as the last year since year has four digits.







If you need quarterly data you can transform that variable using the following:


gen quarters=qofd(datevar)


format  quarters %tq





Converting “date3


“Date3” has the following structure: year(four digits)-month-day. It is numeric with different lengths.


We need first to separate its date components.


We will generate a string variable “date3a”.Type


gen date3a= string(date3,"%11.0g")




In “date3a” year has always the first four characters, we can extract this by using the substr() function:


gen year3=substr(date3a,1,4)




We cannot distinguish between months and days since the rest of the characters in “date3a” have different lengths. So we extract the rest after year. Type:


gen monthday3=substr(date3a, 5,.)




The maximum length of “monthday3” is 4: two-digit months and two-digit days. We will extract these first.


gen month3=substr(monthday3,1,2) if length(monthday3)==4

gen day3=substr(monthday3,3,2) if length(monthday3)==4



When “monthday3” length is 2 we could be sure the first digit represents the firs nine months and the second digit the first nine days of the month. Se can extract these in the same way but this time we will replace the missing:


replace month3=substr(monthday3,1,1) if length(monthday3)==2

replace day3=substr(monthday3,2,2) if length(monthday3)==2



You should have the following…



f you scroll down you will notice…



We will convert these partial dates to a date variable. First we need to format “year3”, “month3” and “day3” as numbers:


destring year3 month3 day3, replace



Now we generate the new date variable using the mdy() function and format as %td:


gen datevar3=mdy( month3, day3, year3)

format datevar3 %td




“Datevar3” is now a partial date variable.




Notice that one thing is the date format and another is the actual date variable. As the following table shows, dates are a special case of a numeric variable, where numbers are codes for dates in consecutive order. We will fill the missing dates by simply filling in the consecutive numbers in the series.


This is what you see

This is what the computer "sees"














































To find the computer date codes you can use the display()and the td() functions. For example:


display td(1jan1995)



The computer code for Jan 1st., 1995 is 12784. If, for example, you had the first date as missing, this is how you would replace it (this is just an example):


replace datevar3=12784 in 1





Going back to the data. Let’s make a copy of “datevar3a”


gen datevar3a=datevar3




Now we fill in the time series:


replace datevar3a= datevar3a[_n-1]+1 if datevar3a==.






Format “datevar3a”


format datevar3a %td




Deconstructing date variables


Let’s say you already have a date variable (“01jan1995”) and you need to extract days, months and years. Using date functions type:


gen day=day(datevar1)

gen month=month(datevar1)

gen year=year(datevar1)

order date1 date2 date3 datevar1 day month year






Date variables with day of the week


If your date variable looks like this



Here is a do-file to create a date variable


/*Generating date1 */

gen date1=ltrim(subinword(date,"Monday,"," ",.))

replace date1=ltrim(subinword(date1,"Tuesday,"," ",.))

replace date1=ltrim(subinword(date1,"Wednesday,"," ",.))

replace date1=ltrim(subinword(date1,"Thursday,"," ",.))

replace date1=ltrim(subinword(date1,"Friday,"," ",.))

replace date1=ltrim(subinword(date1,"Saturday,"," ",.))

replace date1=ltrim(subinword(date1,"Sunday,"," ",.))


/*Generating date2 */

gen date2=subinstr(date1,"er","er,",.)

replace date2=subinstr(date2,"y","y,",.)

replace date2=subinstr(date2,"April","April,",.)

replace date2=subinstr(date2,"March","March,",.)

replace date2=subinstr(date2,"August","August,",.)

replace date2=subinstr(date2,"June","June,",.)


/*Generating datevar */

gen datevar=date(date2,"MDY",2009)

format datevar %td



You should get the following



For further details and other formats type[10]


help date



Time variable


For details type help mf_date


Run this do-file and see what happen.


set obs 100                   /* Set the number of rows to 100 */

use "http://www.princeton.edu/~otorres/Stata/date.dta", clear

drop date1 date3

set seed 12345

gen hr=0+int((23-0+1)*uniform())     /*Generating a random variable with numbers between 0 and 24 to represent hours*/

gen min=0+int((59-0+1)*uniform())    /*Generating a random variable with numbers between 0 and 50 to represent minutes*/

gen sec=0+int((59-0+1)*uniform())    /*Generating a random variable with numbers between 0 and 60 to represent seconds*/

tostring  hr min sec, replace        /*Convert numbers to strings*/

replace sec="0" + sec if length(sec)==1 /*Adding "0" to single digits*/

replace min="0" + min if length(min)==1 /*Adding "0" to single digits*/

gen time= hr+":"+ min+":"+ sec       /*Creating ‘time’ variable as string*/

destring  hr min sec, replace        /*Convert strings to numbers*/

gen double time1=hms( hr, min, sec)         /*Generating a time variable using function hms()*/

format time1 %tcHH:MM:SS                  /*Formating the time variable as time, ignore 01jan1960*/

gen elapse= time1- time1[_n-1]       /*Creating a elapse variable in machine code*/

gen elapsehr=hours( elapse)          /*Converting elapse into hours*/

gen elapsemin=minutes( elapse)       /*Converting elapse into minutes*/

You should have something like this… well, not exactly since you are generating random numbers.


If you look at the variable “elapsehr” you will notice in this example going from the first row to the second took about 4 hrs or 237.1333 minutes.





Let’s say you have “time” in the form of “hh:mm:ss” (red column above).


To separate it into hrs, minutes and seconds as numbers (not time variables) you can use the substr function:


gen hour=substr(time,1,2)

gen min=substr(time,3,3)

gen sec=substr(time,6,3)



To create a time variable from a string variable you can use the function clock:


generate double time1 = clock(time, "hms")


Then format it as


format time2 %tcHH:MM:SS




format time2 %tcHh:MM:SSam



Combining date and time  (click to see the menu on the left)

Using the previous example


gen datevar=date(date2,"MDY", 2012) /*Date2 is a string date variable*/

format datevar %td

gen month = month(datevar) /*Extracting month from datevar*/

gen day=day(datevar) /*Extracting day from datevar*/

gen year=year(datevar) /*Extracting year from datevar*/

*Option 1: date variable and time components as numbers

gen double datetime1 = dhms(datevar,hr,min,sec)

format datetime1 %tc

*Option 2: all date and time elements as numbers

gen double datetime2 = mdyhms(month,day,year,hr,min,sec)

format datetime2 %tc

*Option 3: date and time together as string

gen daytime = date2 + " " + time /*Creating a date/time variable as a string*/

gen double datetime3 = clock(daytime,"MDY hms") /*Generate day/time value from the string version*/

format datetime3 %tc /*Format daytime as MDYhms*/





MOVING AVERAGE FOR PANEL DATA (click to see the menu on the left)



Source: http://www.stata.com/support/faqs/stat/moving.html


Use the command egenmore, you may have to install it first by typing


ssc install egenmore


For the lags to work you may need to xtset your data by typing


xtset [name of panel variable]  [time variable]


For example:


xtset country year


NOTE: If you get an error message after xtset, click here (page 5).


Example, for a four year moving average type




ssc install egenmore    /*If not already installed*/


use http://dss.princeton.edu/training/Panel101.dta


xtset country year


egen moveave_x1 = filter(x1), lags(0/3) normalise


browse country year x1 moveave_x1




Where x1 is the variable of interest. Replace x1 with your own variable.


Type help egenmore for more details.





To remove or replace strings from var1 below use the following command (in which we are replacing all string and special characters with nothing)


gen var2=regexr(var1,"[.\}\)\*a-zA-Z]+","")


destring var2, replace




To extract strings from a combination of strings and numbers


gen var2=regexr(var1,"[.0-9]+","")






More info see:  http://www.ats.ucla.edu/stat/stata/faq/regex.htm



Operators for Stata’s regular expression (regexr)

Source: http://www.stata.com/support/faqs/data/regex.html



Some ids or codes are made up from several individual codes and you may want to separate them out. Let’s assume that ‘id’ below is composed of three different ids and we want to create three different ids.



We will use the function ‘substr to separate the numbers. Steps:


1.    Convert the variable from numeric to string, type


a. tostring id, gen(newid)


2.    Use substr to extract each component


a. gen newid1=substr(newid,1,1)

b. gen newid2=substr(newid,2,1)

c. gen newid3=substr(newid,3,1)


3.    Convert the individual components back to numeric


a. destring newid1-newid3, replace


You should have something like:





Dropping observations based on string variables.


In the example below we want to drop observations that contain “E.S.”, we use the string function regexm, type


drop if regexm( var1,"E.S.") > 0








Dealing with zip files within Stata


You can use unzipfile command to extract compressed data, for example


unzipfile mydata.zip


You can also zip file using zipfile


zipfile myzip.zip mydata.dta


For some extra info please check:



To extract compressed files you can also use 7-zip (freeware) available at




Or if installed in you computer, use Winzip.


In either case, right-click on the compressed (zip) file, select ‘extract to here’.


Files will be extracted next to the zip files.



Counting groups (panel data) (click to see the menu on the left)


egen count = group(panel)




egen’ = Stata command to create special variables (type help egen for more details)


count’ = Name of the new variable (you can change it to something else)


group’ = Part of ‘egen’, a function use to create ids.


panel’ = The variable in your dataset for the panels (i.e. country, states, companies, etc.)



Then type



summarize count






Std. Dev.












In this example, the maximum number is 7 which equals the total number of panels in the dataset



Sort variables in alphabetical order in the variables window (click to see the menu on the left)


Type either






order _all, alphabetic


For more details type


help order


To sort cases see here





Checking for outliers (click to see the menu on the left)



Outliers can change the direction of the predicted line; we need to examine the residuals.


To identify regression residuals look at: studentized residuals, hat values and Cook's distance


·         Cook's distance measures how much an observation influences the overall model or predicted values

·         Studentized residuals are the residuals divided by their estimated standard deviation (values >2 may be problematic).

·         Hat-points identify influential observations on all fitted values. Go from 1/n to 1 (these are relative to the mean hat value in the data hat-value = (k+1)/n, k = # of predictors excluding constant)


use http://www.ats.ucla.edu/stat/stata/examples/ara/Prestige, clear

rename educat education

rename percwomn women

rename occ_code census

recode occ_type(2=1 "bc")(4=2 "wc")(3=3 "prof")(else=.), gen(type) label(type)

regress prestige education income i.type

predict yhat1                   /* Predicting y*/

predict hat1, hat             /*Leverage: measures the potential leverage of Yi on all the fitted values. Pull the line towards them*/

predict res1, resid          /* Getting the residuals*/

predict stud1, rstudent  /* Studentized residuals, values larger than 2 in absolute value may be problematic*/

predict cook1, cooksd   /* Cook’s distance, refers to values  influencing the overall model */


sum hat1

local lowx = r(mean)-r(sd)

local hix = r(mean)+r(sd)

twoway scatter  stud1 hat1 [aw= cook1], msymbol(oh) yline(2) yline(-2) xline(`lowx') xline(`hix') || scatter  stud1 hat1 if stud1>2 | hat1>`hix', mlabels( occtitle) msymbol(i) title("Studentized residuals, Hat values and Cook's Distance")


In the graph below we find no significant outliers



You can see that if you run the regression again without, the possible outliers: MEDICAL_TECHNICIANS, ELECTRONIC_WORKERS and GENERAL_MANAGERS and plot the fitted lines.


regress prestige education income i.type if occtitle!="MEDICAL_TECHNICIANS" & occtitle!="ELECTRONIC_WORKERS" & occtitle!="GENERAL_MANAGERS"

predict yhat1a

twoway lfit  prestige yhat1 ||  lfit  prestige yhat1a





Testing interactions using Stata 11/12 (click to see the menu on the left)


Example 1

sysuse auto   /* Loading a dataset that comes with Stata*/

reg price c.mpg##i.rep78

test 2.rep78#c.mpg

test 2.rep78#c.mpg=3.rep78#c.mpg



sysuse auto

reg price foreign##i.rep78

test 1.foreign#3.rep78

test 1.foreign#3.rep78 1.foreign#4.rep78

test 1.foreign#3.rep78=1.foreign#4.rep78



Getting percentiles (click to see the menu on the left)



********Using –summarize--


sysuse auto                 /* Loading a dataset that comes with Stata*/

sum price, detail

return list                 /* See saved results*/

gen pctile90=r(p90)         /* Generate a variable with the 90th percentile*/



*********Using _pctile


sysuse auto                        /* Loading a dataset that comes with Stata*/

_pctile price, p(25 75)            /*Estimating 25 and 75 percentiles*/

return list                        /*See the saved results

gen pctile25=r(r1)                 /*Creating a variable with the 25th percentile*/

gen pctile75=r(r2)                 /*Creating a variable with the 75th percentile*/

sum price, detail                  /*Checking it works*/

browse price pctile25 pctile75     /*Checking the variables*/




























[1] A parallel goal of this course is to introduce you to some useful data resources and show you how to deal with some downloading issues and transformations.

[2] ACII files come in two flavors: stream mode (where no specific structure is defined) and record mode (when each line is define in certain way). Type help infile for further details.

[3] If it does not work, you may have to type destring income before summarize

[4] [NOTE: tab1 does not accept weights]. To simplify things we will work with unweighted data.

[5] For a general guide on how to deal with categorical data see http://www.ats.ucla.edu/stat/Stata/examples/icda/icdast2.htm

[6] Here are other useful date functions:

mdy(month,day,year)for daily data

yw(year, week)for weekly data

ym(year,month)for monthly data

yq(year,quarter)for quarterly data

yh(year,half-year)for half-yearly data

[7] Here are other functions:


[8] Here are other functions:


[9] Here are other useful date functions:

mdy(month,day,year)for daily data

yw(year, week)for weekly data

ym(year,month)for monthly data

yq(year,quarter)for quarterly data

yh(year,half-year)for half-yearly data

[10] To specify a range of dates (or integers in general) you can use the tin() and twithin() functions. Tin() includes the first and last date, twithin() does not.

tab a if tin(01jan1995,01jun1995)
tab b if twithin(01jan2000,01jan2001)