Setup


Preliminaries

After downloading the code from the GitHub repository, you should first run the setup_library.m script.

The script starts by setting up the path. It prompts the user to select the folder with the unzipped files from the Assay Toolkit, and stores that folder in a directory character array field of a structure called Params.
clear
clc
 
% Set up the path to include the directory with the MATLAB package
directoryPrompt = 'Select folder where setup_library.m is located:';
Params.directory = [uigetdir([], directoryPrompt), filesep];
It then restores the default MATLAB path and adds the Toolkit with its subfolders to it:
restoredefaultpath;
addpath(genpath(Params.directory));
cd(Params.directory);
warning('off','all')
Next, the script asks the user to input their WRDS username and password, and stores those as username and pass character array fields in the Params structure:
Params.username = usernameUI();
Params.pass = passwordUI();
Next, the user needs to set up several additional parameters. The first two are the sample start and end dates, which are stored as SAMPLE_START and SAMPLE_END numeric fields in the Params structure. If the project doesn’t require the full history, we recommend starting the sample as late as possible to reduce memory usage down the road.
Params.SAMPLE_START = 1925;
Params.SAMPLE_END = 2021;
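To illustrate why a later start date saves memory, the sketch below filters a toy vector of monthly dates to a sample window. The variable names (sampleStart, dates, keep) and the YYYYMM date convention are assumptions made for this example, not the Toolkit's internal code.

```matlab
% Toy illustration (assumed names; not the Toolkit's actual code)
sampleStart = 1963;                      % hypothetical later start year
sampleEnd   = 2021;

% Monthly dates in YYYYMM format for 1925-2021 (97 years)
years  = repelem((1925:2021)', 12);      % each year repeated 12 times
months = repmat((1:12)', 97, 1);         % months 1..12 for every year
dates  = 100*years + months;

% Keep only the months that fall inside the sample window
keep  = floor(dates/100) >= sampleStart & floor(dates/100) <= sampleEnd;
dates = dates(keep);

fprintf('Kept %d of %d months.\n', numel(dates), numel(keep));
```

Since most matrices are nMonths x nStocks, dropping rows this way shrinks every stored variable proportionally.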
Next, the user needs to specify a flag that indicates whether to keep only domestic common equity (that is, share codes 10 or 11) from CRSP. The flag is stored as a domComEqFlag logical field in the Params structure. The default value is true (1), which keeps only domestic common equity and is the standard practice following Fama and French (1992). A value of false (0) would include all shares on CRSP (that is, including ADRs, REITs, etc.).
Params.domComEqFlag = true;
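A minimal sketch of what such a filter might look like on long-format data follows. The toy values and the variable names (shrcd, permno, keep) are made up for illustration; the Toolkit's own filtering code may be organized differently.

```matlab
% Toy long-format data (made-up values): one row per observation
shrcd  = [10; 11; 12; 31; 11; 73];
permno = [10001; 10002; 10003; 10004; 10005; 10006];

domComEqFlag = true;

if domComEqFlag
    keep = ismember(shrcd, [10 11]);     % domestic common equity only
else
    keep = true(size(shrcd));            % keep every share code
end

permno = permno(keep);
shrcd  = shrcd(keep);
fprintf('Removed %d observations.\n', sum(~keep));
```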
Next, the user needs to indicate the COMPUSTAT variables to be downloaded in a COMPVarNames character array field in the Params structure. These can be either all variables (“All”) or those specified by a .csv file containing two columns: one that lists the annually updated COMPUSTAT variables and one that lists the quarterly updated COMPUSTAT variables.
The file, if used, should be on the MATLAB path. The GitHub repo includes a file, “COMPUSTAT Variable Names.csv”, which lists 76 annually updated and 35 quarterly updated COMPUSTAT variables and is the default value of this parameter.
Here is a snippet of what the default file looks like:
We recommend adding variables of interest to that file. Downloading all variables is very time- and memory-consuming. The toolkit also contains functions for downloading additional COMPUSTAT variables following the initial setup, which are described at the end of this page.
Params.COMPVarNames = 'COMPUSTAT Variable Names.csv';
The final parameter specifies the types of transaction costs to be used in a tcostsType character array field in the Params structure. The three options here are:
  1. 'gibbs' – the Gibbs effective spread estimate from Hasbrouck (2009).
  2. 'lf_combo' (default) – the low-frequency combination effective spread measure from Chen and Velikov (2022).
  3. 'full' – the low- and high-frequency combination effective spread measure from Chen and Velikov (2022).
Note that all of these options require code to be run in advance.
The Gibbs effective spread estimate is used in all of these and Hasbrouck’s (2009) SAS code has to be run annually before the start of the Assay Toolkit setup. The code is included in /Library Update/Inputs/Gibbs/. The output file from that code that includes the Gibbs effective spreads up to 2021 is also included in that folder and will be updated annually. Thus, if your SAMPLE_END parameter is 2021 or earlier, you don’t need to run Hasbrouck’s (2009) SAS code.
For the high-frequency combination measure, one has to have access to WRDS cloud, ISSM, Daily and Monthly TAQ, and WRDS TAQ IID. The code to produce the high-frequency measure comes from Chen and Velikov (2022) and is also included in /Library Update/High-frequency effective spreads/.
Note that, as Chen and Velikov (2022) demonstrate, the low-frequency effective spread measures are severely biased post-decimalization. Here is their Figure 2:
Thus, if you are interested in the more recent performance of anomalies and have access to the high-frequency (TAQ, ISSM) databases, we strongly recommend you use the high-frequency effective spread measure.
Params.tcostsType = 'full';
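Because a mistyped option would only surface much later in the setup, it can help to check the value up front. The snippet below is a hypothetical guard, not part of the Toolkit; it uses the three option names listed above and the 2021 cutoff for the bundled Gibbs output file.

```matlab
% Hypothetical sanity check (not part of the Toolkit)
validTypes = {'gibbs', 'lf_combo', 'full'};
tcostsType = 'full';
sampleEnd  = 2021;

if ~ismember(tcostsType, validTypes)
    error('tcostsType must be one of: %s', strjoin(validTypes, ', '));
end

% The bundled Gibbs output covers data up to 2021, so a later sample
% end would require rerunning Hasbrouck's (2009) SAS code first
needsGibbsRun = sampleEnd > 2021;
fprintf('Need to rerun Gibbs SAS code: %d\n', needsGibbsRun);
```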
Here is what the Params structure looks like ultimately:
Params
Params = struct with fields:
directory: 'D:\MATLAB 2022 Update\'
username: 'mihailv'
pass: '*************'
SAMPLE_START: 1925
SAMPLE_END: 2021
domComEqFlag: 1
COMPVarNames: 'COMPUSTAT Variable Names.csv'
tcostsType: 'full'
After the initial setup, the next line calls a function (checkJavaHeapMemory) that checks the Java heap memory setting and ensures that it is maxed out. However, this is only done on Windows machines; on other systems, the function just prints a reminder to check the setting manually.
% Check Java Heap Memory
checkJavaHeapMemory();
The next line calls a function (startLogFile) that toggles on a log file in the current folder. The log file stores timekeeping updates that the setup code prints every time a major step is done.
% Start a log file
startLogFile(Params.directory, 'library_setup');
Run of library_setup started on 23-Jan-2023 16:32:19 by user mjv5465.

WRDS connection

The next line of code calls a function (setupWRDSConn) that sets up and tests the WRDS PostgreSQL JDBC connection using the username and password entered above and stored in the Params structure. This is what allows us to download the data from WRDS programmatically through MATLAB. You can read more about accessing WRDS data through MATLAB here.
% Set up the WRDS PostgreSQL JDBC connection
setupWRDSConn(Params);
Setting up WRDS connection. Setup started at 23-Jan-2023 16:32:19.
Connection to WRDS is successful.
WRDS connection setup ended at 23-Jan-2023 16:32:22.

Monthly CRSP data

Download raw data

Finally, we get to downloading data. The next line calls a function (getCRSPData) which creates a /Data/CRSP/ subfolder in the main directory (Params.directory), downloads the following CRSP datasets using the getWRDSTable() function, and stores them as .csv files in /Data/CRSP/:
  • MSF
  • MSFHDR
  • MSEDELIST
  • MSEEXCHDATES
  • CCMXPF_LNKHIST
  • STOCKNAMES.
The MSF dataset is the main dataset from the CRSP monthly data. The MSEDELIST dataset has delisting returns. The rest are used for identifying information and merges with COMPUSTAT.
For each dataset, getWRDSTable() also prints out when the download starts, when it ends, and the number of rows and columns in the dataset.
% Download & store all the CRSP data we’ll need
getCRSPData(Params);
Now working on downloading the raw CRSP. Run started at 23-Jan-2023 16:32:22.
Downloading the WRDS table CRSP.MSFHDR. Download started at 23-Jan-2023 16:32:23.
CRSP.MSFHDR download ended at 23-Jan-2023 16:32:26.
CRSP.MSFHDR has 36408 rows and 46 columns.
Now exporting the WRDS table CRSP.MSFHDR into .csv. Export started at 23-Jan-2023 16:32:26.
CRSP.MSFHDR export ended at 23-Jan-2023 16:32:29.
Downloading the WRDS table CRSP.MSF. Download started at 23-Jan-2023 16:32:31.
CRSP.MSF download ended at 23-Jan-2023 16:36:11.
CRSP.MSF has 4834507 rows and 21 columns.
Now exporting the WRDS table CRSP.MSF into .csv. Export started at 23-Jan-2023 16:36:11.
CRSP.MSF export ended at 23-Jan-2023 16:37:42.
Downloading the WRDS table CRSP.MSEDELIST. Download started at 23-Jan-2023 16:37:44.
CRSP.MSEDELIST download ended at 23-Jan-2023 16:37:44.
CRSP.MSEDELIST has 36408 rows and 19 columns.
Now exporting the WRDS table CRSP.MSEDELIST into .csv. Export started at 23-Jan-2023 16:37:44.
CRSP.MSEDELIST export ended at 23-Jan-2023 16:37:45.
Downloading the WRDS table CRSP.MSEEXCHDATES. Download started at 23-Jan-2023 16:37:45.
CRSP.MSEEXCHDATES download ended at 23-Jan-2023 16:37:54.
CRSP.MSEEXCHDATES has 110198 rows and 25 columns.
Now exporting the WRDS table CRSP.MSEEXCHDATES into .csv. Export started at 23-Jan-2023 16:37:54.
CRSP.MSEEXCHDATES export ended at 23-Jan-2023 16:37:58.
Downloading the WRDS table CRSP.CCMXPF_LNKHIST. Download started at 23-Jan-2023 16:38:00.
CRSP.CCMXPF_LNKHIST download ended at 23-Jan-2023 16:38:01.
CRSP.CCMXPF_LNKHIST has 110923 rows and 8 columns.
Now exporting the WRDS table CRSP.CCMXPF_LNKHIST into .csv. Export started at 23-Jan-2023 16:38:01.
CRSP.CCMXPF_LNKHIST export ended at 23-Jan-2023 16:38:02.
Downloading the WRDS table CRSP.STOCKNAMES. Download started at 23-Jan-2023 16:38:02.
CRSP.STOCKNAMES download ended at 23-Jan-2023 16:38:09.
CRSP.STOCKNAMES has 77779 rows and 16 columns.
Now exporting the WRDS table CRSP.STOCKNAMES into .csv. Export started at 23-Jan-2023 16:38:09.
CRSP.STOCKNAMES export ended at 23-Jan-2023 16:38:11.
CRSP raw data download ended at 23-Jan-2023 16:38:11.

Organize and store

The next line calls a function (makeCRSPMonthlyData) which reads in and stores the raw CRSP data and creates the matrices that we’ll use for asset pricing later. Most of our variables of interest will be stored as matrices with the same dimensions: number of dates (nMonths or nDays) x number of stocks (nStocks). The dimensions will be determined by the number of unique permnos in the CRSP MSF and dates in the MSF/DSF files after filtering based on sample start and end dates and the flag for domestic common equity. The function creates and stores the dates (nMonths x 1) and CRSP’s permno identifier (nStocks x 1) vectors which contain the unique months and permnos, as well as the following matrices (all nMonths x nStocks) with raw CRSP data:
  • shrcd – share code
  • exchcd – exchange code
  • siccd – SIC industrial classification code
  • prc – price (or the negative of the bid/ask midpoint when stock not traded)
  • bid – closing bid
  • ask – closing ask
  • bidlo – closing bid or low price
  • askhi – closing ask or high price
  • vol – monthly share volume (in hundreds)
  • ret_x_dl – holding period return without adjusting for delisting
  • shrout – shares outstanding (in thousands)
  • cfacpr – cumulative factor to adjust price
  • cfacshr – cumulative factor to adjust shares
  • spread – realized closing bid/ask spread
  • retx – holding period return without dividends and without adjusting for delisting
% Make CRSP data
makeCRSPMonthlyData(Params);
Now working on making variables from CRSP. Run started at 23-Jan-2023 16:38:11.
CRSP_MSF file loaded. It contains 4834507 rows and 21 columns.
Removed 1045635 observations from CRSP_MSF which didn’t have share codes 10 or 11.
Removed 13277 observations from CRSP_MSF that were before the start date or after the end date specified in Params.
Now working on variable shrcd, which is 1 out of 15.
Done with shrcd.
Now working on variable exchcd, which is 2 out of 15.
Done with exchcd.
Now working on variable siccd, which is 3 out of 15.
Done with siccd.
Now working on variable prc, which is 4 out of 15.
Done with prc.
Now working on variable bid, which is 5 out of 15.
Done with bid.
Now working on variable ask, which is 6 out of 15.
Done with ask.
Now working on variable bidlo, which is 7 out of 15.
Done with bidlo.
Now working on variable askhi, which is 8 out of 15.
Done with askhi.
Now working on variable vol_x_adj, which is 9 out of 15.
Done with vol_x_adj.
Now working on variable ret_x_dl, which is 10 out of 15.
Done with ret_x_dl.
Now working on variable shrout, which is 11 out of 15.
Done with shrout.
Now working on variable cfacpr, which is 12 out of 15.
Done with cfacpr.
Now working on variable cfacshr, which is 13 out of 15.
Done with cfacshr.
Now working on variable spread, which is 14 out of 15.
Done with spread.
Now working on variable retx, which is 15 out of 15.
Done with retx.
CRSP monthly variables run ended at 23-Jan-2023 16:40:31.
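The (nMonths x nStocks) layout described in this subsection can be sketched with a toy example that pivots long-format rows into a matrix. The data values and variable names (permnoLong, dateLong, prcLong) are made up, and the use of unique with sub2ind is one possible approach rather than a description of how makeCRSPMonthlyData() is implemented.

```matlab
% Toy long-format data (made-up values): one row per permno-month
permnoLong = [10001; 10001; 10002; 10002; 10002];
dateLong   = [202001; 202002; 202001; 202002; 202003];
prcLong    = [10.5; 11.0; -20.25; 21.0; 22.5];

% Unique identifiers define the matrix dimensions
[permno, ~, iStock] = unique(permnoLong);   % nStocks x 1
[dates,  ~, iMonth] = unique(dateLong);     % nMonths x 1
nStocks = numel(permno);
nMonths = numel(dates);

% Fill an nMonths x nStocks matrix; missing permno-months stay NaN
prc = nan(nMonths, nStocks);
prc(sub2ind([nMonths nStocks], iMonth, iStock)) = prcLong;

disp(prc)
```

With every variable stored on the same grid, row t and column j always refer to the same month and the same permno across matrices, which makes element-wise operations between variables straightforward.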

Make derived variables

The next line calls a function (makeCRSPDerivedVariables) which creates variables that are derived from the raw CRSP variables and stored in the /Data/ subfolder in the main directory (Params.directory). These include:
  • Return adjusted for delisting – ret (nMonths x nStocks). The delisting adjustment adds the delisting return for each permno in the month following the last month with return data. The resulting matrix, ret, is the main return matrix used for asset pricing research.
  • Market capitalization matrix – me (nMonths x nStocks)
  • NYSE indicator matrix – NYSE (nMonths x nStocks)
  • Fama-French factors – the makeCRSPDerivedVariables() function calls another function, getFFFactors(), which programmatically downloads the Fama-French factors from Ken French’s website, reshapes them as vectors with the same size as our dates vector (nMonths x 1), and stores them in ff.mat.
  • Industry classifications – the makeCRSPDerivedVariables() function calls another function, makeIndustryClassifications(), which creates indicator matrices (nMonths x nStocks) for the SIC industrial classification and Fama-French 10-, 17-, and 49-industry classifications.
  • Industry returns – the makeCRSPDerivedVariables() function calls another function, makeIndustryReturns(), which creates a matrix (nMonths x nInds) with the value-weighted industry returns for the Fama-French 49 industries.
  • Universes – the makeCRSPDerivedVariables() function calls another function, makeUniverses(), which creates a structure with several stock universe designations (Russell and Fama-French).
  • Share issuance variables – ashrout and dashrout
  • Past performance variables – the makeCRSPDerivedVariables() function calls another function, makePastPerformance(), which creates several past performance (i.e., momentum and reversal) variables – R (classic 12-1 momentum), R62 (recent 6-1 momentum), R127 (intermediate horizon 12-6 momentum), R3613 (long-run reversals).
Finally, the makeCRSPDerivedVariables() function calls another function, testCRSPData(), which tests whether the download and formatting were done correctly by running regressions between the market and momentum factors constructed from the just-downloaded variables and the Fama-French market and momentum factors. We should see very high t-statistics on the slope coefficients here.
% Make additional CRSP variables
makeCRSPDerivedVariables(Params);
Now working on making variables from CRSP. Run started at 23-Jan-2023 16:40:31.
Adjusting for delisting complete. Kodak’s delisting return was -0.5381 in 201201
Fama-French factors creation complete @ 23-Jan-2023 16:40:46.
Now working on industry classifications at 23-Jan-2023 16:40:46.
Universe creation complete.
Now working on testing the CRSP data at 23-Jan-2023 16:42:16.
Regress the Fama-French MKT on our MKT:
Ordinary Least-squares Estimates
R-squared = 1.0000
Rbar-squared = 1.0000
sigma^2 = 0.0000
Durbin-Watson = 1.9086
Nobs, Nvars = 1146, 2
***************************************************************
Variable Coefficient t-statistic t-probability
variable 1 -0.002241 -4.729722 0.000003
variable 2 1.001186 11350.235970 0.000000
Regress the Fama-French MKT on our MKT, constructed using returns that are not adjusted for delisting:
Ordinary Least-squares Estimates
R-squared = 1.0000
Rbar-squared = 1.0000
sigma^2 = 0.0000
Durbin-Watson = 1.9604
Nobs, Nvars = 1146, 2
***************************************************************
Variable Coefficient t-statistic t-probability
variable 1 -0.000062 -0.502679 0.615287
variable 2 1.000046 43883.935259 0.000000
The correlation between UMD from Ken French and replicated UMD is 99.61%.
Compare the average return UMD from Ken French and replicated UMD:
Ordinary Least-squares Estimates
R-squared = 0.0000
Rbar-squared = 0.0000
sigma^2 = 0.0022
Durbin-Watson = 1.8492
Nobs, Nvars = 1140, 1
***************************************************************
Variable Coefficient t-statistic t-probability
variable 1 0.638500 4.579802 0.000005
Ordinary Least-squares Estimates
R-squared = 0.0000
Rbar-squared = 0.0000
sigma^2 = 0.0022
Durbin-Watson = 1.8225
Nobs, Nvars = 1140, 1
***************************************************************
Variable Coefficient t-statistic t-probability
variable 1 0.639785 4.630946 0.000004
Regress the two on each other:
Ordinary Least-squares Estimates
R-squared = 0.9922
Rbar-squared = 0.9922
sigma^2 = 0.0000
Durbin-Watson = 2.2448
Nobs, Nvars = 1140, 2
***************************************************************
Variable Coefficient t-statistic t-probability
variable 1 -0.004605 -0.370182 0.711316
variable 2 1.005189 380.286469 0.000000
Ordinary Least-squares Estimates
R-squared = 0.9922
Rbar-squared = 0.9922
sigma^2 = 0.0000
Durbin-Watson = 2.2181
Nobs, Nvars = 1140, 2
***************************************************************
Variable Coefficient t-statistic t-probability
variable 1 0.009541 0.774107 0.439028
variable 2 0.987070 380.286469 0.000000
CRSP monthly derived variables run ended at 23-Jan-2023 16:42:26.
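Two of the derived variables described in this subsection, the delisting-adjusted return and market capitalization, can be sketched on toy data as follows. All values are made up, the per-stock dlret vector is an assumed input, and the market capitalization formula (absolute price times shares outstanding, scaled to millions) is a common convention assumed here rather than a confirmed description of makeCRSPDerivedVariables().

```matlab
% Toy inputs (made-up values), nMonths x nStocks = 4 x 2
ret_x_dl = [0.01  0.02;
            0.03 -0.01;
            NaN   0.04;
            NaN   0.00];
dlret  = [-0.50; NaN];                 % delisting return per stock (NaN if none)
prc    = [10 -20; 11 21; NaN 22; NaN 22];            % negative = bid/ask midpoint
shrout = [1000 500; 1000 500; NaN 500; NaN 500];     % in thousands

% Delisting adjustment: place the delisting return in the month
% after the last month with a valid return
ret = ret_x_dl;
[nMonths, nStocks] = size(ret);
for j = 1:nStocks
    lastObs = find(~isnan(ret_x_dl(:, j)), 1, 'last');
    if ~isnan(dlret(j)) && ~isempty(lastObs) && lastObs < nMonths
        ret(lastObs + 1, j) = dlret(j);
    end
end

% Market capitalization in $ millions (assumed convention):
% |price| x shares outstanding (thousands) / 1000
me = abs(prc) .* shrout / 1000;
```

In this toy example the first stock stops trading after month 2, so its delisting return lands in month 3, while the second stock is left unchanged.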

COMPUSTAT

Download raw data

The next line calls a function, getCOMPUSTATData(), which creates a /Data/COMPUSTAT/ subfolder in the main directory (Params.directory), downloads the annual and quarterly COMPUSTAT variables indicated by Params.COMPVarNames using the getWRDSTable() function, and stores them in /Data/COMPUSTAT/. To download the specific variables indicated by the Params.COMPVarNames file, getCOMPUSTATData() calls another function, getCOMPUSTATQuery(), which creates a specific query that is then passed on to getWRDSTable(). The default Params.COMPVarNames file (“COMPUSTAT Variable Names.csv”) includes 76 annual and 35 quarterly COMPUSTAT variables.
% Download & store all the COMPUSTAT data we’ll need (annual and quarterly)
getCOMPUSTATData(Params);
Now working on downloading the raw COMPUSTAT. Run started at 23-Jan-2023 16:42:26.
The following variables are not on COMPUSTAT, so they will not be created: div, xssd,
Downloading the WRDS table COMP.FUNDA. Download started at 23-Jan-2023 16:42:27.
COMP.FUNDA download ended at 23-Jan-2023 16:42:58.
COMP.FUNDA has 552606 rows and 77 columns.
Now exporting the WRDS table COMP.FUNDA into .csv. Export started at 23-Jan-2023 16:42:58.
COMP.FUNDA export ended at 23-Jan-2023 16:43:22.
Downloading the WRDS table COMP.FUNDQ. Download started at 23-Jan-2023 16:43:22.
COMP.FUNDQ download ended at 23-Jan-2023 16:44:26.
COMP.FUNDQ has 1935665 rows and 36 columns.
Now exporting the WRDS table COMP.FUNDQ into .csv. Export started at 23-Jan-2023 16:44:26.
COMP.FUNDQ export ended at 23-Jan-2023 16:45:11.
COMPUSTAT raw data download ended at 23-Jan-2023 16:45:12.
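To give a sense of what building a variable-specific query involves, here is a rough sketch that assembles a SELECT statement from a short list of names. The three-variable list, the identifier columns (gvkey, datadate), and the exact SQL text are assumptions for illustration; the query actually produced by getCOMPUSTATQuery() may look different.

```matlab
% Hypothetical variable list (in the Toolkit this comes from the .csv file)
varNames = {'AT', 'CEQ', 'SALE'};
idCols   = {'gvkey', 'datadate'};          % assumed identifier columns

% Build a SELECT statement for the annual fundamentals table
cols  = strjoin(lower([idCols, varNames]), ', ');
query = sprintf('SELECT %s FROM comp.funda', cols);
disp(query)
```

Selecting only the listed columns is what keeps the download much smaller than requesting every COMPUSTAT variable.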

Merge, organize, and store

The next line calls a function, mergeCRSPCOMP(), which reads in and stores the raw annual and quarterly COMPUSTAT data and creates (nMonths x nStocks) matrices for all of the downloaded variables. It uses the CCMXPF_LNKHIST table to merge CRSP and COMPUSTAT based on its linkage of COMPUSTAT gvkeys to CRSP permnos. The merge protocol is as follows:
  • Only link types LC or LU are used
  • The COMPUSTAT datadate variable must fall between LINKDT and LINKENDDT
  • For the annual COMPUSTAT variables:
      • Annual COMPUSTAT variables are assigned to June of the year following the datadate year
      • When a company changes its fiscal year end within a year, only the variable values for the later fiscal year-end date are used
  • For the quarterly COMPUSTAT variables:
      • Quarterly COMPUSTAT variables are assigned to the month corresponding to the RDQ variable
      • When there are permno-RDQ pairs with multiple fiscal quarter ends, only the variable values for the fiscal quarter end closest to the RDQ date are used
% Merge CRSP & COMPUSTAT, store all variables
mergeCRSPCOMP(Params);
Now working on merging CRSP and annual COMPUSTAT. Run started at 23-Jan-2023 16:45:12.
There were 674 cases of permno-years in which companies moved their fiscal year end.
Now working on COMPUSTAT variable ACT, which is 1/76.
Now working on COMPUSTAT variable AOLOCH, which is 2/76.
Now working on COMPUSTAT variable AP, which is 3/76.
Now working on COMPUSTAT variable APALCH, which is 4/76.
Now working on COMPUSTAT variable AT, which is 5/76.
Now working on COMPUSTAT variable BAST, which is 6/76.
Now working on COMPUSTAT variable CAPX, which is 7/76.
Now working on COMPUSTAT variable CEQ, which is 8/76.
Now working on COMPUSTAT variable CHE, which is 9/76.
Now working on COMPUSTAT variable CHECH, which is 10/76.
Now working on COMPUSTAT variable COGS, which is 11/76.
Now working on COMPUSTAT variable DD1, which is 12/76.
Now working on COMPUSTAT variable DD2, which is 13/76.
Now working on COMPUSTAT variable DD3, which is 14/76.
Now working on COMPUSTAT variable DD4, which is 15/76.
Now working on COMPUSTAT variable DD5, which is 16/76.
Now working on COMPUSTAT variable DLC, which is 17/76.
Now working on COMPUSTAT variable DLTT, which is 18/76.
Now working on COMPUSTAT variable DP, which is 19/76.
Now working on COMPUSTAT variable DVC, which is 20/76.
Now working on COMPUSTAT variable EBIT, which is 21/76.
Now working on COMPUSTAT variable EBITDA, which is 22/76.
Now working on COMPUSTAT variable EMP, which is 23/76.
Now working on COMPUSTAT variable ESUBC, which is 24/76.
Now working on COMPUSTAT variable FCA, which is 25/76.
Now working on COMPUSTAT variable FOPO, which is 26/76.
Now working on COMPUSTAT variable FOPT, which is 27/76.
Now working on COMPUSTAT variable GP, which is 28/76.
Now working on COMPUSTAT variable IB, which is 29/76.
Now working on COMPUSTAT variable INVCH, which is 30/76.
Now working on COMPUSTAT variable INVFG, which is 31/76.
Now working on COMPUSTAT variable INVT, which is 32/76.
Now working on COMPUSTAT variable ITCB, which is 33/76.
Now working on COMPUSTAT variable LCT, which is 34/76.
Now working on COMPUSTAT variable LT, which is 35/76.
Now working on COMPUSTAT variable NI, which is 36/76.
Now working on COMPUSTAT variable NP, which is 37/76.
Now working on COMPUSTAT variable OANCF, which is 38/76.
Now working on COMPUSTAT variable OIADP, which is 39/76.
Now working on COMPUSTAT variable OIBDP, which is 40/76.
Now working on COMPUSTAT variable PI, which is 41/76.
Now working on COMPUSTAT variable PPEGT, which is 42/76.
Now working on COMPUSTAT variable PPENT, which is 43/76.
Now working on COMPUSTAT variable PRSTKC, which is 44/76.
Now working on COMPUSTAT variable PRSTKCC, which is 45/76.
Now working on COMPUSTAT variable PRSTKPC, which is 46/76.
Now working on COMPUSTAT variable PSTK, which is 47/76.
Now working on COMPUSTAT variable PSTKL, which is 48/76.
Now working on COMPUSTAT variable PSTKRV, which is 49/76.
Now working on COMPUSTAT variable RECCH, which is 50/76.
Now working on COMPUSTAT variable RECT, which is 51/76.
Now working on COMPUSTAT variable REVT, which is 52/76.
Now working on COMPUSTAT variable SALE, which is 53/76.
Now working on COMPUSTAT variable SCSTKC, which is 54/76.
Now working on COMPUSTAT variable SEQ, which is 55/76.
Now working on COMPUSTAT variable SICH, which is 56/76.
Now working on COMPUSTAT variable SPPIV, which is 57/76.
Now working on COMPUSTAT variable TXACH, which is 58/76.
Now working on COMPUSTAT variable TXDB, which is 59/76.
Now working on COMPUSTAT variable TXDC, which is 60/76.
Now working on COMPUSTAT variable TXDITC, which is 61/76.
Now working on COMPUSTAT variable TXP, which is 62/76.
Now working on COMPUSTAT variable WCAP, which is 63/76.
Now working on COMPUSTAT variable WCAPCH, which is 64/76.
Now working on COMPUSTAT variable XAD, which is 65/76.
Now working on COMPUSTAT variable XIDO, which is 66/76.
Now working on COMPUSTAT variable XINT, which is 67/76.
Now working on COMPUSTAT variable XLR, which is 68/76.
Now working on COMPUSTAT variable XRD, which is 69/76.
Now working on COMPUSTAT variable XSGA, which is 70/76.
Now working on COMPUSTAT variable XPP, which is 71/76.
Now working on COMPUSTAT variable DRC, which is 72/76.
Now working on COMPUSTAT variable DRLT, which is 73/76.
Now working on COMPUSTAT variable XACC, which is 74/76.
Now working on COMPUSTAT variable DVP, which is 75/76.
Now working on COMPUSTAT variable FYE, which is 76/76.
CRSP and Annual COMPUSTAT merge ended at 23-Jan-2023 17:13:03.
Now working on merging CRSP and quarterly COMPUSTAT. Run started at 23-Jan-2023 17:13:03.
There were 2753 cases of permno-RDQ months associated with multiple quarters.
Now working on COMPUSTAT variable ACTQ, which is 1/35.
Now working on COMPUSTAT variable ATQ, which is 2/35.
Now working on COMPUSTAT variable CEQQ, which is 3/35.
Now working on COMPUSTAT variable CHEQ, which is 4/35.
Now working on COMPUSTAT variable COGSQ, which is 5/35.
Now working on COMPUSTAT variable CSHOQ, which is 6/35.
Now working on COMPUSTAT variable DLCQ, which is 7/35.
Now working on COMPUSTAT variable DLTTQ, which is 8/35.
Now working on COMPUSTAT variable DPQ, which is 9/35.
Now working on COMPUSTAT variable EPSPXQ, which is 10/35.
Now working on COMPUSTAT variable IBQ, which is 11/35.
Now working on COMPUSTAT variable INVTQ, which is 12/35.
Now working on COMPUSTAT variable LCTQ, which is 13/35.
Now working on COMPUSTAT variable LTQ, which is 14/35.
Now working on COMPUSTAT variable NIQ, which is 15/35.
Now working on COMPUSTAT variable OIADPQ, which is 16/35.
Now working on COMPUSTAT variable OIBDPQ, which is 17/35.
Now working on COMPUSTAT variable PIQ, which is 18/35.
Now working on COMPUSTAT variable PPEGTQ, which is 19/35.
Now working on COMPUSTAT variable PPENTQ, which is 20/35.
Now working on COMPUSTAT variable PSTKNQ, which is 21/35.
Now working on COMPUSTAT variable PSTKQ, which is 22/35.
Now working on COMPUSTAT variable PSTKRQ, which is 23/35.
Now working on COMPUSTAT variable RECTQ, which is 24/35.
Now working on COMPUSTAT variable REVTQ, which is 25/35.
Now working on COMPUSTAT variable SALEQ, which is 26/35.
Now working on COMPUSTAT variable SEQQ, which is 27/35.
Now working on COMPUSTAT variable TXDBQ, which is 28/35.
Now working on COMPUSTAT variable TXDITCQ, which is 29/35.
Now working on COMPUSTAT variable XINTQ, which is 30/35.
Now working on COMPUSTAT variable XSGAQ, which is 31/35.
Now working on COMPUSTAT variable ADJEX, which is 32/35.
Now working on COMPUSTAT variable WCAPQ, which is 33/35.
Now working on COMPUSTAT variable RDQ, which is 34/35.
Now working on COMPUSTAT variable FQTR, which is 35/35.
CRSP and Quarterly COMPUSTAT merge ended at 23-Jan-2023 17:29:08.
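The timing rules in the merge protocol can be illustrated with a few lines of date arithmetic. The sketch assumes dates are stored as YYYYMMDD numbers and months as YYYYMM numbers; the variable names and sample dates are made up and are not taken from mergeCRSPCOMP().

```matlab
% Annual: datadate (YYYYMMDD) -> June of the following calendar year
datadate    = [20201231; 20200630; 20190930];
annualMonth = (floor(datadate/10000) + 1)*100 + 6;   % YYYYMM

% Quarterly: RDQ (YYYYMMDD) -> month of the earnings announcement
rdq            = [20210428; 20210203];
quarterlyMonth = floor(rdq/100);                     % YYYYMM

disp([datadate annualMonth])
disp([rdq quarterlyMonth])
```

Note that the first two datadate values fall in the same calendar year and therefore map to the same June. If both belonged to one firm, this would be the fiscal-year-change case in which only the later datadate is kept.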

Make derived variables

The next line calls a function (makeCOMPUSTATDerivedVariables) which creates variables that are derived from the merged raw COMPUSTAT variables and stored in the /Data/ subfolder in the main directory (Params.directory). There are too many of these variables to list here, but they include book equity (including the historical book equity values from Davis, Fama, and French, 2000), book-to-market, cash flow from operations, and earnings surprises, among others.
% Make additional COMPUSTAT variables
makeCOMPUSTATDerivedVariables(Params);
Now working on making variables derived from COMPUSTAT. Run started at 23-Jan-2023 17:29:09.
Now let’s compare our HML with the HML from Ken French’s website:
The correlation between HML from Ken French and replicated HML is 99.11%.
Compare the average return HML from Ken French and replicated HML:
Ordinary Least-squares Estimates
R-squared = 0.0000
Rbar-squared = 0.0000
sigma^2 = 0.0012
Durbin-Watson = 1.5847
Nobs, Nvars = 1146, 1
***************************************************************
Variable Coefficient t-statistic t-probability
variable 1 0.338682 3.244532 0.001210
Ordinary Least-squares Estimates
R-squared = 0.0000
Rbar-squared = 0.0000
sigma^2 = 0.0012
Durbin-Watson = 1.5894
Nobs, Nvars = 1146, 1
***************************************************************
Variable Coefficient t-statistic t-probability
variable 1 0.323876 3.116776 0.001874
Regress the two on each other:
Ordinary Least-squares Estimates
R-squared = 0.9822
Rbar-squared = 0.9822
sigma^2 = 0.0000
Durbin-Watson = 2.1236
Nobs, Nvars = 1146, 2
***************************************************************
Variable Coefficient t-statistic t-probability
variable 1 0.016237 1.161937 0.245503
variable 2 0.995584 251.572073 0.000000
Ordinary Least-squares Estimates
R-squared = 0.9822
Rbar-squared = 0.9822
sigma^2 = 0.0000
Durbin-Watson = 2.1283
Nobs, Nvars = 1146, 2
***************************************************************
Variable Coefficient t-statistic t-probability
variable 1 -0.010269 -0.737944 0.460700
variable 2 0.986602 251.572073 0.000000
COMPUSTAT derived variables run ended at 23-Jan-2023 17:33:34.

Daily CRSP

Download raw data

The next line calls a function (getCRSPDailyData) which creates a /Data/CRSP/daily/ subfolder in the main directory (Params.directory), downloads subsamples from the CRSP DSF dataset using the getWRDSTable() function, and stores them in /Data/CRSP/daily/. The breakdown of the DSF dataset is due to potential memory limitations and issues with internet connectivity. It is done based on the following breakpoints (with the start date moved to Params.SAMPLE_START if not equal to 1925):
  • 1925
  • 1950
  • 1975
  • 1985
  • 5-year increments (1990, 1995, etc.) up to Params.SAMPLE_END
The function also downloads the DSEDELIST dataset from CRSP.
% Download & store all the daily CRSP data we’ll need
getCRSPDailyData(Params);
Now working on downloading the raw daily CRSP. Run started at 23-Jan-2023 17:33:34.
Now working on 1924-1950 daily stock file.
Downloading the WRDS table CRSP.DSF1. Download started at 23-Jan-2023 17:33:35.
CRSP.DSF1 download ended at 23-Jan-2023 17:34:50.
CRSP.DSF1 has 5737990 rows and 14 columns.
Now exporting the WRDS table CRSP.DSF1 into .csv. Export started at 23-Jan-2023 17:34:50.
CRSP.DSF1 export ended at 23-Jan-2023 17:36:09.
Now working on 1950-1975 daily stock file.
Downloading the WRDS table CRSP.DSF2. Download started at 23-Jan-2023 17:36:10.
CRSP.DSF2 download ended at 23-Jan-2023 17:41:38.
CRSP.DSF2 has 13274552 rows and 14 columns.
Now exporting the WRDS table CRSP.DSF2 into .csv. Export started at 23-Jan-2023 17:41:38.
CRSP.DSF2 export ended at 23-Jan-2023 17:44:28.
Now working on 1975-1985 daily stock file.
Downloading the WRDS table CRSP.DSF3. Download started at 23-Jan-2023 17:44:30.
CRSP.DSF3 download ended at 23-Jan-2023 17:50:10.
CRSP.DSF3 has 13983674 rows and 14 columns.
Now exporting the WRDS table CRSP.DSF3 into .csv. Export started at 23-Jan-2023 17:50:10.
CRSP.DSF3 export ended at 23-Jan-2023 17:52:59.
Now working on 1985-1990 daily stock file.
Downloading the WRDS table CRSP.DSF4. Download started at 23-Jan-2023 17:53:02.
CRSP.DSF4 download ended at 23-Jan-2023 17:55:17.
CRSP.DSF4 has 8784736 rows and 14 columns.
Now exporting the WRDS table CRSP.DSF4 into .csv. Export started at 23-Jan-2023 17:55:17.
CRSP.DSF4 export ended at 23-Jan-2023 17:57:08.
Now working on 1990-1995 daily stock file.
Downloading the WRDS table CRSP.DSF5. Download started at 23-Jan-2023 17:57:09.
CRSP.DSF5 download ended at 23-Jan-2023 17:59:32.
CRSP.DSF5 has 9538594 rows and 14 columns.
Now exporting the WRDS table CRSP.DSF5 into .csv. Export started at 23-Jan-2023 17:59:32.
CRSP.DSF5 export ended at 23-Jan-2023 18:01:37.
Now working on 1995-2000 daily stock file.
Downloading the WRDS table CRSP.DSF6. Download started at 23-Jan-2023 18:01:38.
CRSP.DSF6 download ended at 23-Jan-2023 18:04:12.
CRSP.DSF6 has 11169752 rows and 14 columns.
Now exporting the WRDS table CRSP.DSF6 into .csv. Export started at 23-Jan-2023 18:04:12.
CRSP.DSF6 export ended at 23-Jan-2023 18:06:43.
Now working on 2000-2005 daily stock file.
Downloading the WRDS table CRSP.DSF7. Download started at 23-Jan-2023 18:06:45.
CRSP.DSF7 download ended at 23-Jan-2023 18:09:21.
CRSP.DSF7 has 9071529 rows and 14 columns.
Now exporting the WRDS table CRSP.DSF7 into .csv. Export started at 23-Jan-2023 18:09:21.
CRSP.DSF7 export ended at 23-Jan-2023 18:11:44.
Now working on 2005-2010 daily stock file.
Downloading the WRDS table CRSP.DSF8. Download started at 23-Jan-2023 18:11:46.
CRSP.DSF8 download ended at 23-Jan-2023 18:14:16.
CRSP.DSF8 has 8668041 rows and 14 columns.
Now exporting the WRDS table CRSP.DSF8 into .csv. Export started at 23-Jan-2023 18:14:16.
CRSP.DSF8 export ended at 23-Jan-2023 18:16:44.
Now working on 2010-2015 daily stock file.
Downloading the WRDS table CRSP.DSF9. Download started at 23-Jan-2023 18:16:46.
CRSP.DSF9 download ended at 23-Jan-2023 18:19:33.
CRSP.DSF9 has 8698332 rows and 14 columns.
Now exporting the WRDS table CRSP.DSF9 into .csv. Export started at 23-Jan-2023 18:19:33.
CRSP.DSF9 export ended at 23-Jan-2023 18:21:57.
Now working on 2015-2020 daily stock file.
Downloading the WRDS table CRSP.DSF10. Download started at 23-Jan-2023 18:21:58.
CRSP.DSF10 download ended at 23-Jan-2023 18:24:59.
CRSP.DSF10 has 9375566 rows and 14 columns.
Now exporting the WRDS table CRSP.DSF10 into .csv. Export started at 23-Jan-2023 18:24:59.
CRSP.DSF10 export ended at 23-Jan-2023 18:27:13.
Now working on 2020-2021 daily stock file.
Downloading the WRDS table CRSP.DSF11. Download started at 23-Jan-2023 18:27:15.
CRSP.DSF11 download ended at 23-Jan-2023 18:27:40.
CRSP.DSF11 has 2184272 rows and 14 columns.
Now exporting the WRDS table CRSP.DSF11 into .csv. Export started at 23-Jan-2023 18:27:40.
CRSP.DSF11 export ended at 23-Jan-2023 18:28:11.
Downloading the WRDS table CRSP.DSEDELIST. Download started at 23-Jan-2023 18:28:11.
CRSP.DSEDELIST download ended at 23-Jan-2023 18:28:12.
CRSP.DSEDELIST has 36437 rows and 19 columns.
Now exporting the WRDS table CRSP.DSEDELIST into .csv. Export started at 23-Jan-2023 18:28:12.
CRSP.DSEDELIST export ended at 23-Jan-2023 18:28:12.
Daily CRSP raw data download ended at 23-Jan-2023 18:28:13.

Organize and store

The next line calls a function (makeCRSPDailyData) that reads in and concatenates the raw daily CRSP data and creates the daily matrices. The function first creates and stores the ddates (nDays x 1) vector, which identifies the unique days for which we have daily CRSP data. It also creates the following matrices (all nDays x nStocks):
  • dprc – price (or the negative of the bid/ask midpoint when stock not traded)
  • dbid – closing bid
  • dask – closing ask
  • dbidlo – closing bid or low price
  • daskhi – closing ask or high price
  • dvol – daily share volume (in shares)
  • dret_x_dl – holding period return without adjusting for delisting
  • dshrout – shares outstanding (in thousands)
  • dcfacpr – cumulative factor to adjust price
  • dcfacshr – cumulative factor to adjust shares
  • dopen – open price
  • dnumtrd – number of trades
% Construct the raw variables from the CRSP daily data
makeCRSPDailyData(Params);
Now working on making variables from daily CRSP. Let’s read the files in first. Run started at 23-Jan-2023 18:28:13.
Now working on assigning the data to our familiar daily matrices. Run started at 23-Jan-2023 18:43:14. This step takes a couple of hours.
CRSP daily variables assigned at 23-Jan-2023 19:34:23. Now storing them.
CRSP daily variables run ended at 23-Jan-2023 19:38:05.
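To illustrate the layout of the daily matrices, here is a minimal sketch of how long-format records (one row per permno-date pair) can be pivoted into an nDays x nStocks matrix. The toy data and the names permno, date, prc, and permnos are assumptions made for illustration; this is not the code inside makeCRSPDailyData.

```matlab
% Toy long-format records: one row per (permno, date) pair
permno = [10001; 10001; 10002; 10002; 10003];
date   = [20200102; 20200103; 20200102; 20200106; 20200103];
prc    = [10.5; 10.7; -20.25; 20.5; 5.0];  % negative = bid/ask midpoint

% The unique days and stocks define the two axes of the matrix
[ddates, ~, dayIdx]    = unique(date);
[permnos, ~, stockIdx] = unique(permno);

% Pre-fill with NaN so that stock-days without a record stay missing
dprc = nan(numel(ddates), numel(permnos));
dprc(sub2ind(size(dprc), dayIdx, stockIdx)) = prc;
```

Rows of dprc line up with ddates and columns line up with permnos, so dprc(t, j) is the price of stock j on day t, or NaN if that stock has no record on that day.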

Make derived variables

The next line calls a function (makeCRSPDailyDerivedVariables) which creates variables derived from the raw daily CRSP variables and stores them in the /Data/ subfolder of the main directory (Params.directory). These include:
  • Daily return adjusted for delisting. The delisting adjustment adds the delisting return for each permno on the day following the last day with return data. The resulting return matrix (dret) has dimensions nDays x nStocks and is the main matrix used for asset pricing research.
  • Daily market capitalization matrix – dme (nDays x nStocks)
  • Daily Fama-French factors – the makeCRSPDailyDerivedVariables() function calls another function (getFFDailyFactors) which programmatically downloads the Fama-French daily factors from Ken French’s website, reshapes them as vectors with the same size as our ddates vector (nDays x 1), and stores them in dff.mat
  • Amihud illiquidity measure at the monthly level – amihud (nMonths x nStocks)
  • Realized volatility measures at the monthly level – RVOL1, RVOL3, RVOL6, RVOL12, RVOL36, RVOL60 (all nMonths x nStocks)
  • Max/min daily return at the monthly level – dretmax, dretmin (both nMonths x nStocks)
  • Idiosyncratic volatility measures at the monthly level – IVOL, IVOL3, IffVOL, IffVOL3 (all nMonths x nStocks)
  • Cumulative abnormal returns around earnings announcements at the monthly level – CAR3 (nMonths x nStocks)
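The delisting adjustment described in the first item can be sketched on toy data as follows. The delisting-return vector dlret and its one-value-per-stock layout are assumptions made for illustration; the Toolkit's own implementation may differ.

```matlab
% Toy returns without the delisting adjustment (4 days x 2 stocks)
dret_x_dl = [ 0.01  0.02;
             -0.03  0.01;
               NaN  0.00;
               NaN   NaN];
dlret = [-0.30, NaN];  % hypothetical delisting return per stock (NaN = none)

dret = dret_x_dl;
[nDays, nStocks] = size(dret);
for j = 1:nStocks
    % Last day on which stock j has a return observation
    lastObs = find(isfinite(dret_x_dl(:, j)), 1, 'last');
    % Place the delisting return on the following day, if there is one
    if ~isempty(lastObs) && lastObs < nDays && isfinite(dlret(j))
        dret(lastObs + 1, j) = dlret(j);
    end
end
```

In this sketch the first stock's last return is on day 2, so its delisting return lands on day 3; the second stock has no delisting return and is left unchanged.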
% Make additional variables that use CRSP daily
makeCRSPDailyDerivedVariables(Params);
Now working on creating some variables derived from daily CRSP. Run started at 23-Jan-2023 19:38:12.
Daily Fama-French factors creation complete @ 23-Jan-2023 19:40:48.
CRSP daily derived variables run ended at 23-Jan-2023 20:17:32.
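Several of the derived variables above aggregate daily data to the monthly level. A minimal sketch of that aggregation, assuming dates stored as yyyymmdd numbers and a one-month realized volatility defined as the standard deviation of daily returns within the month (both are assumptions, not confirmed details of the Toolkit), could look like this:

```matlab
% Toy daily data: 5 days spanning two months, 2 stocks
ddates = [20200130; 20200131; 20200203; 20200204; 20200205];
dret   = [ 0.01 -0.02;
           0.03  0.00;
          -0.01  0.02;
           0.02   NaN;
          -0.04  0.01];

% Map each day to its month (yyyymm) and index the unique months
[months, ~, monthIdx] = unique(floor(ddates / 100));
nMonths = numel(months);
nStocks = size(dret, 2);

dretmax = nan(nMonths, nStocks);
dretmin = nan(nMonths, nStocks);
rvol    = nan(nMonths, nStocks);
for m = 1:nMonths
    block = dret(monthIdx == m, :);          % all days in month m
    dretmax(m, :) = max(block, [], 1);       % max and min skip NaN by default
    dretmin(m, :) = min(block, [], 1);
    rvol(m, :)    = std(block, 0, 1, 'omitnan');
end
```

Each output has one row per month and one column per stock, matching the nMonths x nStocks dimensions listed above.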

Trading costs

The next line calls a function (makeTradingCosts) which creates the trading cost measure based on the trading cost type specified in the Params structure. The three types are as follows:
  1. ‘gibbs’ – the Gibbs effective spread estimate from Hasbrouck (2009).
  2. ‘lf_combo’ (default) – the low-frequency combination effective spread measure from Chen and Velikov (2022).
  3. ‘full’ – the low- and high-frequency combination effective spread measure from Chen and Velikov (2022).
The function stores the following variables:
  • A structure with the raw effective spreads for all individual measures included in the chosen trading cost type – effSpreadStrut
  • A trading cost matrix before filling in the missing observations through closest match – tcosts_raw (nMonths x nStocks)
  • A trading cost matrix after filling in the missing observations through closest match – tcosts (nMonths x nStocks)
  • Trading cost type character array – tcostsType
  • Fama-French factors trading costs .mat file that includes vectors with trading costs for all factors (nMonths x 1) – ff_tc.mat
The tcosts matrix is the one that will be used for measuring trading costs on cross-sectional equity trading strategies.
% Make transaction costs
makeTradingCosts(Params)
Now working on creating the transaction costs. Run started at 23-Jan-2023 20:17:33.
Now working on Hasbrouck’s (2009) Gibbs construction. Run started at 23-Jan-2023 20:17:33.
Now working on Corwin and Schultz HL effective spread construction. Run started at 23-Jan-2023 20:17:45.
Now working on Abdi and Ranaldo CHL effective spread construction. Run started at 23-Jan-2023 20:20:27.
Now working on Kyle and Obizhaeva’s (2016) volume-over-volatility effective spread construction. Run started at 23-Jan-2023 20:23:55.
Now working on TAQ+ISSM effective spread construction. Run started at 23-Jan-2023 20:25:09.
Now working on the FF factors trading costs construction. Run started at 23-Jan-2023 20:35:45.
Trading costs construction run ended at 23-Jan-2023 20:36:36.
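The step from tcosts_raw to tcosts fills missing observations through a closest match. The sketch below shows one way such a fill could work for a single month, matching each stock with a missing cost to the donor stock nearest in size and idiosyncratic volatility ranks. The matching characteristics, the squared-distance metric, and the variable names me, ivol, rankMe, and rankIvol are assumptions made for illustration, not a description of makeTradingCosts.

```matlab
% One month of toy data for five stocks
tcosts_raw = [0.010,  NaN, 0.004,   NaN, 0.020];
me         = [   50,   60,   900,  1000,    10];  % market capitalization
ivol       = [ 0.03, 0.035, 0.01, 0.012,  0.06];  % idiosyncratic volatility

% Convert both characteristics to ranks so they share a common scale
rankMe   = zeros(size(me));
rankIvol = zeros(size(ivol));
[~, order] = sort(me);   rankMe(order)   = 1:numel(me);
[~, order] = sort(ivol); rankIvol(order) = 1:numel(ivol);

tcosts  = tcosts_raw;
hasCost = isfinite(tcosts_raw);
donors  = find(hasCost);
for i = find(~hasCost)
    % Squared rank distance from stock i to every stock with a known cost
    dist = (rankMe(donors) - rankMe(i)).^2 + (rankIvol(donors) - rankIvol(i)).^2;
    [~, k] = min(dist);
    tcosts(i) = tcosts_raw(donors(k));  % borrow the closest donor's cost
end
```

Here the second stock borrows the cost of the first (both small and volatile), and the fourth borrows the cost of the third (both large and stable).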

Anomalies from Novy-Marx and Velikov (2016)

The next line calls a function (makeNovyMarxVelikovAnomalies) which creates a table with data for the 23 anomalies from Novy-Marx and Velikov (2016), and stores that table as a .csv file in /Data/Anomalies/.
% Make anomalies
makeNovyMarxVelikovAnomalies(Params)
Now working on making anomaly signals from Novy-Marx and Velikov (RFS, 2016). Run started at 23-Jan-2023 20:36:37.
Anomaly signal run ended, data exported at 23-Jan-2023 20:39:21.

Betas from Novy-Marx and Velikov (2022)

The next line calls a function (makeBetas) which creates multiple types of betas and stores those individually and in /Data/betas.mat. See Novy-Marx and Velikov (2022) for more details on the beta estimations.
% Make betas
makeBetas(Params);
Now working on making the betas. Run started at 23-Jan-2023 20:39:21.
Making Frazzini-Pedersen (2014) betas first.
Making the rest of the betas next.
Beta construction run ended at 23-Jan-2023 21:27:26.
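As a rough illustration of the Frazzini-Pedersen (2014) approach, their beta is the product of the stock-market correlation and the ratio of stock to market volatility. The sketch below uses a single toy window for both pieces, which is a simplification: the original paper uses different estimation windows for correlations and volatilities, plus shrinkage, and the Toolkit's exact settings are not shown here. All variable names are hypothetical.

```matlab
% Toy daily returns for the market and one stock
mkt = [0.010; -0.020; 0.015; 0.005; -0.010; 0.020];
ret = [0.025; -0.030; 0.020; 0.015; -0.025; 0.050];

% Beta as correlation times the ratio of volatilities
C    = corrcoef(mkt, ret);
rho  = C(1, 2);
beta = rho * std(ret) / std(mkt);

% For comparison: the OLS slope over the same window, cov / var
Cm      = cov(mkt, ret);
betaOLS = Cm(1, 2) / Cm(1, 1);
```

When the correlation and the volatilities come from the same window, as here, the two numbers coincide. They differ once the pieces are estimated over different horizons.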
The last line of the code toggles the log file off.
% End the log file
diary off

Additional COMPUSTAT variables

If you need additional COMPUSTAT variables (for example, the arbitrarily chosen ones below), you can run the following from the main Toolkit folder:
% Annual data, single variable
getCOMPUSTATAdditionalData(Params.username, Params.pass, {'RDIP'}, 'annual');
There were 674 cases of permno-years in which companies moved their fiscal year end.
Now working on COMPUSTAT variable RDIP, which is 1/1.
 
% Annual data, multiple variables
getCOMPUSTATAdditionalData(Params.username, Params.pass, {'RDIP', 'RCP'}, 'annual');
There were 674 cases of permno-years in which companies moved their fiscal year end.
Now working on COMPUSTAT variable RDIP, which is 1/2.
Now working on COMPUSTAT variable RCP, which is 2/2.
 
% Quarterly data, single variable
getCOMPUSTATAdditionalData(Params.username, Params.pass, {'RDIPQ'}, 'quarterly');
There were 2746 cases of permno-RDQ months associated with multiple quarters.
Now working on COMPUSTAT variable RDIPQ, which is 1/1.
 
% Quarterly data, multiple variables
getCOMPUSTATAdditionalData(Params.username, Params.pass, {'RDIPQ', 'RCPQ'}, 'quarterly');
There were 2746 cases of permno-RDQ months associated with multiple quarters.
Now working on COMPUSTAT variable RDIPQ, which is 1/2.
Now working on COMPUSTAT variable RCPQ, which is 2/2.