We put together a set of DFM programs to carry out the calculations reported in our paper "Factor Models and Structural Vector Autoregressions in Macroeconomics" (Stock and Watson, 2016).  Paul Ho helped in the development of these programs.  

The programs can be modified for use in other DFM, SDFM or FAVAR applications.

We describe the programs below.

Many of the programs rely on common inputs (data, parameters, etc.). You'll need to change these for your application.  We discuss them below.

The notes use the following notation:  "T" denotes the time series sample size. "N" denotes the cross-section sample size.

Data:
The programs use a Matlab structure array called "datain".  For our project the elements of datain are computed in the programs datain_all.m (which reads in our complete dataset) and datain_real.m (which reads in the "real" dataset). You may read in your data in any way you see fit, but the programs require that datain contain the key components listed below.

Components of datain:

	Key Components of datain:

		(1) datain.bpdata:  This is a "T x N" matrix of the data that will be used in the analysis.  The data should have been pre-processed by taking logs, differences, etc., as appropriate for your application.

		(2) datain.bpnamevec:  This is an "N x 1" string vector with the names of the series in bpdata.  This is used to label output etc.  You can use names like 'Series1', 'Series2', etc., but names like 'GDP', 'FedFunds' make the output easier to interpret.

		(3) datain.bplabvec_short:  This is an "N x 1" string vector that contains a "short" description for each series. This is used in the various output files.  This too could be 'Series1', 'Series2', etc., but descriptions like 'Real GDP', 'Federal Funds Rate' make the output easier to interpret.

		(4) datain.bplabvec_long:  This is an "N x 1" string vector that contains a "long" description for each series. We use this to produce the data description tables in the appendix, but you might use it for other purposes.  This too could be 'Series1', 'Series2', etc., but descriptions like 'Real GDP, NIPA Table xx', 'Federal Funds Rate, Board of Governors of the Federal Reserve System' are better.

		(5) datain.bpinclcode:  This is an 'N x 1' vector with a code for each series that indicates whether the series is used to estimate the factors. This is described in Section 6.1.2 of the paper.
			bpinclcode(i) = 1: Use the i'th series to estimate the factors, and also estimate factor loadings, and other statistics for this series
			bpinclcode(i) = 2: Do not use the i'th series to estimate the factors, but estimate factor loadings, and other statistics for this series

		(6) datain.bptcodevec:  This is an "N x 1" vector that shows the "transformation" code for each data series.  Even though the data in "bpdata" have been pre-processed, this code is used in the programs because IRFs and VDs are reported for the levels of the series.  Thus, if the i'th series in "bpdata" is a first difference, then cumulative IRFs and VDs are reported in the output.  Note: the program does NOT carry out the transformations. These should be done before the data are placed in the matrix bpdata in (1); this code merely says what transformation was carried out.  Here are the tcode values that the program looks for:
			bptcodevec(i) = 1; no transformation  (IRFs and VDs are reported)
			bptcodevec(i) = 2; first difference transformation  (cumulative IRFs and VDs are reported)
			bptcodevec(i) = 4; log transformation  (IRFs and VDs are reported)
			bptcodevec(i) = 5; first difference of logarithm transformation  (cumulative IRFs and VDs are reported)
			Thus, the program treats the values 1 and 4 the same way, and the values 2 and 5 the same way.
			Also, the program allows for values of 3 or 6, which correspond to second differences.  In this case, cumulative IRFs and VDs are reported (so these are IRFs for the first differences and not levels).
			You can modify any of this in the program 'units_to_levels.m'
		
		(7) datain.dnobs:  This is a scalar, which is the number of time series observations, "T" in the notation used in these notes.

		(8) datain.calds:  This is a "T x 2" calendar matrix.  The first column is the year and the second column is the period (month, quarter, etc.).  The t'th row of this matrix contains entries like '2013 11', denoting the 11th month of 2013.

		(9) datain.calvec:  This is a "T x 1" calendar vector, used for plotting etc.  The t'th element for the 11th month of 2013 would be 2013.833 (= 2013 + (11-1)/12).

		Note: we use the program "calendar_make.m" (available in the replication materials) to form these calendar arrays, but you can do it any way you want.
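
As a rough illustration of the key components (1)-(9), the following sketch builds a minimal datain by hand for a hypothetical quarterly panel of three series. The series names, labels, sample dates and random data are invented for illustration, and cell arrays are assumed for the string vectors; this is not the code in datain_all.m or calendar_make.m. The last two lines show how the inclusion and transformation codes might be read, assuming the rules stated above.

```matlab
% Hypothetical example: 3 quarterly series, 1959:Q1 through 2014:Q4
nper  = 4;                        % periods per year
first = [1959 1];                 % first year and period
last  = [2014 4];                 % last year and period
T     = nper*(last(1)-first(1)) + (last(2)-first(2)) + 1;

% Calendar arrays, items (8) and (9)
idx    = (0:T-1)' + (first(2)-1);
calds  = [first(1) + floor(idx/nper), mod(idx,nper) + 1];  % T x 2: year, period
calvec = calds(:,1) + (calds(:,2)-1)/nper;                 % T x 1: 2013.75 for 2013:Q4

datain.bpdata         = randn(T,3);   % placeholder for already-transformed data
datain.bpnamevec      = {'GDP'; 'FedFunds'; 'OilPrice'};
datain.bplabvec_short = {'Real GDP'; 'Federal Funds Rate'; 'Oil Price'};
datain.bplabvec_long  = datain.bplabvec_short;
datain.bpinclcode     = [1; 2; 1];    % FedFunds is not used to estimate the factors
datain.bptcodevec     = [5; 1; 5];    % log first differences, level, log first differences
datain.dnobs          = T;
datain.calds          = calds;
datain.calvec         = calvec;

% Series used to estimate the factors, and series whose IRFs are cumulated
est_namevec = datain.bpnamevec(datain.bpinclcode == 1);
cumulate    = ismember(datain.bptcodevec, [2 3 5 6]);
```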
		
	Other components of datain:
		
		In our application we wanted to have access to a few other data-related objects, so we also kept these in datain.  These are not critical for the other programs, but FYI, here is what they are:
		
		(10) datain.bpdata_raw:  These were the raw data before any transformations
		(11) datain.bpdata_noa:  These were the transformed data, but before adjustment for outliers.
		(12) datain.bpdata_trend:  These are the "trends" in the series
		(13) datain.bpdata_unfiltered: These were the data before detrending
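		(14) datain.bptcodevec:  This was a "category" code for each series (NIPA, Wages, Prices, etc.) that we used to organize the output.  (As written, this field name duplicates the transformation code in (6); see datain_all.m for the field name actually used.)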
		
Estimation Parameters:
	There are various estimation parameters used in the various programs.  Some common ones are listed here, and more are discussed in the context of particular programs and tasks below.
	
	Many of the estimation parameters are stored in the structure array called "est_par".
	
	Elements of est_par:
		est_par.smpl_par:  This contains information about the sample period
			smpl_par.nfirst: a "1 x 2" vector with the beginning year and period for estimation.  
				Example: smpl_par.nfirst = [1959 3]  (start in 1959 period 3)
			smpl_par.nlast: a "1 x 2" vector with the end year and period for estimation.  
				Example: smpl_par.nlast = [2014 4]  (end in 2014 period 4)
			smpl_par.calvec:  This is the "calvec" vector discussed above in "datain"
			smpl_par.nper:  This is the number of periods per year 
		
		est_par.fac_par:  These are parameters for estimating the factors
			fac_par.nt_min: This is the minimum number of time series observations that a series must have for inclusion in the least squares estimation of the factors.  (In our calculations we set nt_min = 20.)
			est_par.lambda.nt_min: This is the minimum number of observations for any series used to estimate lambda, IRFs, etc. (In our calculations we set nt_min = 40.)
			fac_par.tol: The least squares minimization is carried out iteratively.  This is a parameter that governs "convergence" of the iterations.  (In our calculations we set fac_par.tol = 10^-8).
			est_par.fac_par.lambda_constraints_est: These are parameters that describe the constraints imposed on lambda (the factor loadings) for the least squares estimates of the factors. These are described below in the particular applications. 
			est_par.fac_par.lambda_constraints_full:  These are parameters that describe the constraints imposed on lambda (the factor loadings) after estimation of the factors. These are described below in the particular applications.
			
		est_par.var_par:  Parameters for Factor-VAR 
			est_par.var_par.nlag: This is the number of lags in the VAR model for the factors.
			est_par.var_par.iconst:  1 if constant to be included in VAR, 0 otherwise.
			est_par.var_par.icomp:  1 if companion matrix is computed by VAR, 0 otherwise.  (This is necessary for IRFs and Variance Decomps. See below)
			
		est_par.n_uarlag: number of lags in the univariate AR for the uniquenesses (the series-specific error terms). (This is necessary for Variance Decomps). 
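
The settings above can be collected in a few lines. The sketch below is only an illustration: the sample dates, nt_min values and tolerance are the ones quoted in these notes, while the quarterly frequency, the lag lengths and the placeholder calendar vector are assumptions made for the example, not necessarily the values used in the paper.

```matlab
% Placeholder calendar vector; in practice this would be datain.calvec
calvec = (1959:0.25:2014.75)';

% Sample period
est_par.smpl_par.nfirst = [1959 3];    % start in 1959 period 3
est_par.smpl_par.nlast  = [2014 4];    % end in 2014 period 4
est_par.smpl_par.calvec = calvec;
est_par.smpl_par.nper   = 4;           % assumed: quarterly data

% Factor estimation
est_par.fac_par.nt_min  = 20;          % min obs for a series to enter factor estimation
est_par.lambda.nt_min   = 40;          % min obs to estimate lambda, IRFs, etc.
est_par.fac_par.tol     = 10^-8;       % convergence tolerance for the iterations

% Factor VAR and univariate ARs
est_par.var_par.nlag    = 4;           % assumed value, for illustration only
est_par.var_par.iconst  = 1;           % include a constant in the VAR
est_par.var_par.icomp   = 1;           % compute the companion matrix
est_par.n_uarlag        = 4;           % assumed value, for illustration only
```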
			
			
Tasks and Programs:
Here we briefly summarize the tasks we carried out in the paper and the programs used for these tasks.

Determining the number of factors: (As an example, see Table 2 of the paper)
	The program "hom_descriptive_statistics_all_variables.m" computes a variety of statistics that are helpful for determining the number of factors.
	
	Additional parameters for this program:
		nfac_max:  The maximum number of factors
		
	After reading in the data, the program calls "est_nfac.m" where the work is done.  The output for this program is in the structure "nfac_out".  The entries are
	
		nfac_out.st.bn:  Bai-Ng ICP2 values for the number of static factors (computed in bai_ng.m)
		nfac_out.st.ssr: The sum of squared residuals (over T and N) for each choice of the number of factors (computed in factor_estimation_ls.m)
		nfac_out.st.r2: The (trace or average) R-squared for each choice of the number of factors (computed in factor_estimation_ls.m)
		nfac_out.st.tss: The total sum of squares (over T and N) for each choice of the number of factors (computed in factor_estimation_ls.m)
		nfac_out.st.number: The number of observations ( T*N in a balanced panel) for each choice of the number of factors (computed in factor_estimation_ls.m)
		nfac_out.st.nt: Number of time series observation (T).
		nfac_out.dy.aw: This is the Bai-Ng ICP2 criterion applied to the residuals, used for the number of dynamic factors (see Section 2.4.2 of the paper).  This is computed in "amengual_watson.m".
		nfac_out.dy.ssr: These are the sums of squared residuals from the Amengual-Watson procedure.
		nfac_out.dy.r2: These are the R-squared values from the Amengual-Watson procedure.
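
To make the static-factor entries concrete, here is a minimal sketch of how the sum of squared residuals and the Bai-Ng ICP2 value can be computed for each candidate number of factors. It assumes a balanced, standardized panel and uses principal components via the SVD on simulated data. It is not the code in est_nfac.m, bai_ng.m or factor_estimation_ls.m, which also handle missing observations; all variable names here are hypothetical.

```matlab
% Simulated balanced panel with two true factors (illustration only)
rng(0);
T = 200; N = 50; r_true = 2;
X = randn(T, r_true)*randn(r_true, N) + 0.5*randn(T, N);

% Standardize each series to mean zero and unit variance
X = bsxfun(@minus, X, mean(X));
X = bsxfun(@rdivide, X, std(X));

% Principal components through the singular value decomposition
[U, S, V] = svd(X, 'econ');

nfac_max = 6;
ssr  = zeros(nfac_max, 1);
icp2 = zeros(nfac_max, 1);
for r = 1:nfac_max
    Xhat    = U(:,1:r)*S(1:r,1:r)*V(:,1:r)';    % common component with r factors
    ssr(r)  = sum(sum((X - Xhat).^2));          % summed over T and N
    % ICP2 (Bai and Ng, 2002): log of average SSR plus a penalty in r
    icp2(r) = log(ssr(r)/(N*T)) + r*((N+T)/(N*T))*log(min(N,T));
end

tss = sum(sum(X.^2));
r2  = 1 - ssr/tss;              % trace R-squared for each number of factors
[~, nfac_hat] = min(icp2);      % number of factors minimizing ICP2
```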
		
	Computing the R-squared (fraction of variance explained) by the static factors for each of the variables in the data set.  (As an example, see columns A of Table 3)
	
		These are computed in "Tabulate_rsquared_static_factors.m"
			This program computes an "N x nfac_max" matrix, where the (i,j) element is the R2 for series i in a model with j static factors.
			
		Additional Parameters required in Tabulate_rsquared_static_factors.m
			nfac_max: The maximum number of static factors
		
		See "tabulate_rsquared_static_factors.m" for the settings of the "est_par" parameters.
		
	
	Computing the Variance Decomps for the Static Factors (Descriptive Statistics, not structural analysis). (As an example, see columns B of Table 2).
		These are computed in "Tabulate_variance_decomps.m"
			The key output of this program is an "N x numfactors x nhorizons" matrix whose (i,j,k) element is the fraction of the variability of series "i" at horizon "k" explained by the first "j" dynamic factors.
			This matrix is "irf_vdecomp_out.vfrac_y_fac_mat"
			
			Additional parameters required for Tabulate_variance_decomps.m
				decomp_par:  These are the variance decomp parameters
					decomp_par.hor:  The maximum horizon.
					decomp_par.varcum: Set this equal to 1.  (This parameter is 1 if the program computes cumulative variance decomps for dynamic factors 1 through j, which is what is needed here.)
					decomp_par.cancor: Set this equal to 1.  (This orders the shocks so they correspond to the dynamic factors, where the order is determined by the canonical correlation of the static factor VAR residuals and the residuals from the regressions of the series onto lags of the factors.)  
					
	Computing a Structural DFM with a single identified shock. (The oil-price-exogenous SDFM is an example of this.)  
		The program "sdfm_price_exogenous.m" provides an example of how to do this.	
		
		The program computes IRFs and VarDecomps for the identified shock.  SEs are computed via parametric bootstrap simulations.  Output can be found in the structures "irf_vdecomp_out"	and "se_irf_vdecomp_out" which are described below.
		
		Additional Input Parameters:
		
			Imposing the named factor constraint: 
				As described in the paper, identification is achieved by the "named factor" and "unit effect" normalizations.  The "named factor" constraints are set up in the program as follows.  Let lam(i) denote a column vector whose transpose is the i'th row of lambda.  Thus, the common component for series x(i)_t is lam(i)'F_t.  The value of lam(i) for the "named factor" must be constrained.  This is accomplished using constraints of the form R*lam(i) = r, where R is a matrix and r is a vector. 
			
				These constraints are specified in a matrix "lambda_constraints".  lambda_constraints is an "n_constraints x (2+n_factors)" matrix, where n_constraints denotes the total number of linear constraints. Each row corresponds to a single constraint, specified as follows:
			
					lambda_constraints(.,1): The series/row of lambda that is constrained ("i" in the example)
					lambda_constraints(.,2:n_factors+1):  The j'th row of R corresponding to the constraint
					lambda_constraints(.,2+n_factors): The j'th value of r for the constraint.
				
				As an example, suppose there are 3 static factors, the first static factor is "named" by the 14th variable in the dataset, and this is the only named-factor restriction.  We then want lam(14) to be constrained to take on the value [1 0 0]'.  In this case, lambda_constraints needs three rows:
					lambda_constraints(1,:) = [14 1 0 0 1];
					lambda_constraints(2,:) = [14 0 1 0 0];
					lambda_constraints(3,:) = [14 0 0 1 0];
				
				We use the program "lambda_construct" to form the elements of lambda_constraints. This program takes as inputs the names of the series associated with the constraint and the values of R and r.
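
The layout of the constraint matrix can be checked with a few lines. This sketch reproduces the three-factor example above for series 14 by stacking the series index, R and r side by side. It only illustrates the row format; it is not the lambda_construct program, which works from series names, and the variable names i_series and lam_i are hypothetical.

```matlab
n_factors = 3;
i_series  = 14;                 % row of lambda that is constrained
R = eye(n_factors);             % one row of R per constraint
r = [1; zeros(n_factors-1,1)];  % right-hand side: lam(14) = [1 0 0]'

% Each row: [series index, row of R, element of r]
lambda_constraints = [i_series*ones(size(R,1),1), R, r];

% A loading vector satisfying the constraints R*lam(i) = r
lam_i = [1; 0; 0];
constraint_ok = all(abs(R*lam_i - r) < 1e-12);
```

With these inputs, lambda_constraints has 3 rows and 2+n_factors = 5 columns, matching the three rows listed in the example.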
				
				In the sdfm_price_exogenous.m example, we have 4 different oil prices that name a single factor.  Thus we need to impose the same constraints on lambda for each of these series.  We accomplish this with code that has the form:
				
					str_var = {'MCOILBRENTEU', 'WPU0561', 'MCOILWTICO', 'RAC_IMP'};  % four oil prices with constraint
					R = eye(nfac.total);
					r = [1; zeros(nfac.total-1,1)];
					lambda_constraints = lambda_construct(str_var,namevec, R,r); % Here, 'namevec' is the string array containing the names of all of the series.
          
				There is one more complication: We need to impose the constraint twice: once when the factors are estimated using a subset of the dataset and once when the factor loadings are estimated using the full data. Thus we need two versions of lambda_constraints.  These are saved as 
					est_par.fac_par.lambda_constraints_est: The constraints for the factor-estimation sample
					est_par.fac_par.lambda_constraints_full: The constraints for the full sample.
				
				Thus, the code in the program looks like this:
        
					% -- Impose constraints on lambda to identify this structural shock
					str_var = {'MCOILBRENTEU', 'WPU0561', 'MCOILWTICO', 'RAC_IMP'};
					R = eye(nfac.total);
					r = [1; zeros(nfac.total-1,1)];
					lambda_constraints_estdata = lambda_construct(str_var, est_namevec, R,r);
					lambda_constraints_bpdata = lambda_construct(str_var, datain.bpnamevec, R,r); % other oil variables that load on only first factor, but without unit-scale restriction
					est_par.fac_par.lambda_constraints_est = lambda_constraints_estdata;
					est_par.fac_par.lambda_constraints_full = lambda_constraints_bpdata;
					  
			Other parameters:
				decomp_par:  These are the variance decomp parameters described above
					decomp_par.hor:  The maximum horizon.
					decomp_par.varcum: Set this equal to 0.  
					decomp_par.cancor: Set this equal to 0.
			
				n_rep:  This is the number of bootstrap replications.  We used 500.  You should set this equal to a small value (say n_rep = 5) until you are sure your program is working, as the replications take some time.    
			
				nfac.unobserved: This is the total number of factors in an SDFM (= 8 in our example).
				nfac.observed: This is = 0 in an SDFM.
				
				n_ident_shocks:  The number of identified shocks.  This is used for printing output (as the only output required is for the identified shocks).
			
			The output:
				Point estimates are contained in the structure irf_vdecomp_out and the results from bootstrap simulations are in se_irf_vdecomp_out.  Here is how these structures are organized.
				
				irf_vdecomp_out:
					irf_vdecomp_out.eps_structural: 'T x n_fac' matrix of structural shocks (corresponding to the G matrix) 
					irf_vdecomp_out.imp_y_fac_mat:  'N x n_fac x n_horizon' matrix of impulse responses using unit standard deviation normalization
					irf_vdecomp_out.imp_y_fac_mat_scl:  'N x n_fac x n_horizon' matrix of impulse responses using unit effect normalization
					irf_vdecomp_out.vcomp_y_fac_mat: 'N x n_fac x n_horizon' matrix with variance of each variable associated with each structural shock. (Amount, not a fraction).
					irf_vdecomp_out.vcomp_y_u_mat: 'N x n_horizon' matrix with variance of each variable associated with its idiosyncratic shock. (Amount, not a fraction).
					irf_vdecomp_out.vtotal_y_fac_mat:  'N x n_horizon' matrix with variance of each variable associated with the common shocks (total over all common shocks). (Amount, not a fraction).
					irf_vdecomp_out.vfrac_y_fac_mat: 'N x n_horizon' matrix with fraction of variance of each variable associated with the common shocks (total over all common shocks).
					irf_vdecomp_out.vfrac_y_comp_mat: 'N x n_fac x n_horizon' matrix with fraction of variance of each variable associated with each of the common shocks.
				
				se_irf_vdecomp_out
					se_irf_vdecomp_out.mean_imp_y_fac_mat_scl: 'N x n_fac x n_horizon' matrix with the mean of the simulations of the impulse responses using unit effect normalization
					se_irf_vdecomp_out.mean_vfrac_y_comp_mat: 'N x n_fac x n_horizon' matrix with the mean of the fraction of variance of each variable associated with each of the common shocks.
					se_irf_vdecomp_out.se_imp_y_fac_mat_scl: 'N x n_fac x n_horizon' matrix with the standard deviation of the simulations of the impulse responses using unit effect normalization
					se_irf_vdecomp_out.se_vfrac_y_comp_mat: 'N x n_fac x n_horizon' matrix with the standard deviation of the fraction of variance of each variable associated with each of the common shocks.
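
The relationship between the "amount" and "fraction" arrays listed above can be sketched as follows. The sketch assumes that the total variance of each series at each horizon is the sum of the common-shock variances and the idiosyncratic variance; the programs may compute these objects differently, and the random inputs are placeholders rather than output from the programs.

```matlab
N = 2; n_fac = 3; n_horizon = 4;

% Placeholder variance amounts (in practice these come from irf_vdecomp_out)
vcomp_y_fac_mat = rand(N, n_fac, n_horizon);   % by shock
vcomp_y_u_mat   = rand(N, n_horizon);          % idiosyncratic

% Total variance due to all common shocks: sum over the shock dimension
vtotal_y_fac_mat = reshape(sum(vcomp_y_fac_mat, 2), N, n_horizon);

% Assumed total variance of each series: common plus idiosyncratic
vtotal_y = vtotal_y_fac_mat + vcomp_y_u_mat;

% Fractions: all common shocks together, and shock by shock
vfrac_y_fac_mat  = vtotal_y_fac_mat ./ vtotal_y;
vfrac_y_comp_mat = bsxfun(@rdivide, vcomp_y_fac_mat, ...
                          reshape(vtotal_y, N, 1, n_horizon));
```

Under this assumption, summing vfrac_y_comp_mat over the shock dimension gives back the total common fraction for each series and horizon.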


Computing a FAVAR with a single identified shock. (The oil-price-exogenous FAVAR is an example of this.)  
		The program "FAVAR_price_exogenous.m" provides an example of how to do this.
		
		The structure is nearly identical to the SDFM program.  Here are a few differences:
		
		 nfac.observed: Number of observed factors.  This is = 1 in this example
		 nfac.unobserved: Number of unobserved factors (7 in our example).
		 
		 In the FAVAR, you have to supply the data on the observed factor. This is saved in
		 
		 est_par.fac_par.w:  "T x n_observed" matrix that contains the observed factors.  Note:  There can be no missing data over the estimation sample period. 
		 
		 In our example we used the following code to input these variables:
		 
		 		% Number of unobserved factors
				nfac.unobserved = 7;
				% .. Observed Factor ..
				str_var = {'WPU0561'};
				observed_factor = observed_factor_setup(str_var,datain);
				est_par.fac_par.w = observed_factor.fac;         % This contains data on the observed factors
				nfac.observed = size(observed_factor.fac,2);
				
		 where the program "observed_factor_setup" pulls the observed factor data from the dataset.
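
For readers who want to supply the observed factor without that helper, a name lookup along the following lines would produce a "T x n_observed" matrix. This is a hypothetical stand-in, not the code in observed_factor_setup; the toy datain is invented, and the missing-data check here covers all rows rather than only the estimation sample.

```matlab
% Toy dataset with three series (illustration only)
datain.bpnamevec = {'GDP'; 'WPU0561'; 'FedFunds'};
datain.bpdata    = [1 10 0.5; 2 11 0.4; 3 12 0.6; 4 13 0.7];

str_var = {'WPU0561'};                          % series used as observed factor
[found, loc] = ismember(str_var, datain.bpnamevec);
if ~all(found)
    error('Observed factor series not found in datain.bpnamevec.');
end

w = datain.bpdata(:, loc);                      % T x n_observed
if any(isnan(w(:)))
    error('Observed factors contain missing values.');
end

est_par.fac_par.w = w;
nfac.observed     = size(w, 2);
```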
		 

SDFMs, FAVARs, and hybrid models with multiple shocks:  The programs "sdfm_kilian.m" and "favar_kilian.m" contain examples where multiple shocks are identified.  These programs have the same structure as the single-shock models.  If you look at the code AFTER successfully estimating a single-identified-shock model you should be able to understand what's going on.

SVARS:  We have estimated SVARs as a special case of the FAVAR model, but with no unobserved factors. An advantage of this formulation is that the setup computes coherent IRFs and Var Decomps for ALL of the variables in the dataset, not just the series included in the SVAR.  The programs SVAR_oil_price_exogenous.m and SVAR_kilian.m are two examples.

     	
Finally, there are two programs that are used in all of the programs above that are worth mentioning:

	factor_estimation_ls.m:
		This program carries out least squares estimation of the factors. 
		If you look at the bottom of the program you can see the variables that it computes and saves.
		
	factor_estimation_ls_full.m
		This program calls factor_estimation_ls.m to estimate the factors. It then
			(a) Computes a VAR for the factors
			(b) Computes the regression of all the variables on the factors to estimate lambda
			(c) Estimates a univariate autoregression for each of the series-specific error terms (e(t) in equation (6) in the paper).
		Again, if you look at the bottom of the program you can see the variables that it computes and saves.  
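
Steps (a) to (c) can be sketched with ordinary least squares on a balanced panel. The example below uses one lag in the VAR and in the univariate autoregressions for brevity (the programs take these from est_par.var_par.nlag and est_par.n_uarlag), uses simulated stand-ins for the factors and data, and ignores missing observations and the lambda constraints. It is an illustration of the three steps, not the code in factor_estimation_ls_full.m.

```matlab
rng(1);
T = 200; N = 5; nfac = 2;
F = randn(T, nfac);                          % stand-in for the estimated factors
X = F*randn(nfac, N) + 0.5*randn(T, N);      % stand-in for the data

% (a) VAR(1) for the factors, with a constant, estimated by OLS
Z = [ones(T-1,1), F(1:end-1,:)];
B = Z \ F(2:end,:);                          % (1+nfac) x nfac coefficients

% (b) Regress every series on the factors to estimate lambda
W      = [ones(T,1), F];
coef   = W \ X;                              % (1+nfac) x N
lambda = coef(2:end,:)';                     % N x nfac factor loadings
e      = X - W*coef;                         % series-specific error terms

% (c) Univariate AR(1) for each series-specific error term
rho = zeros(N,1);
for i = 1:N
    rho(i) = e(1:end-1,i) \ e(2:end,i);
end
```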