Winsorizing in R to match SAS

Here is a SAS macro for winsorizing data that seems to be commonly used in accounting and finance research. This code comes from replication materials provided by Fang, Huang and Karpoff (“FHK”) here. Below I provide an R function that produces equivalent results.

%MACRO winsorize(var, var_out, by_var, left, right, input, output);
	%LET var_number = 1; 
	%LET var&var_number = %QSCAN(&var, &var_number, %STR( ));
	%DO %WHILE (&&var&var_number NE);
		%LET var_number = %EVAL(&var_number + 1);
		%LET var&var_number = %QSCAN(&var, &var_number, %STR( ));
	%LET var_number = %EVAL(&var_number - 1);

  %IF &left = AND &right = %THEN %DO;
    %PUT ERROR: Both top and bottom percentiles to winsorize are missing, no winsorize is needed!!!;
    %GOTO exit;

	%IF &var_out = AUTONAME %THEN %DO;
		%DO i = 1 %TO &var_number;
			%LET &&var_out_&i = &&var&i.._w_&left._&right;

	%ELSE %DO; 
		%LET var_out_number = 1;
		%LET var_out&var_out_number = %QSCAN(&var_out, &var_out_number, %STR( ));
	  	%DO %WHILE (&&var_out&var_out_number NE);
			%LET var_out_number = %EVAL(&var_out_number + 1);
	 		%LET var_out&var_out_number = %QSCAN(&var_out, &var_out_number, %STR( ));
		%LET var_out_number = %EVAL(&var_out_number - 1);
		%IF &var_number ^= &var_out_number %THEN %DO;
			%PUT ERROR: The Number of variables to winsorize does not equal the number of output variables!!!;
			%PUT variables to be winsorize:;
			%DO i = 1 %TO &var_number;	
				%PUT    &i: &&var&i;		
			%PUT variable names after winsorize:;
			%DO i = 1 %TO &var_out_number; 
				%PUT    &i: &&var_out&i; 
				%GOTO exit;

	%IF &by_var =  %THEN %DO;
		%LET by_var = just_for_merge;
		DATA _winsorize_temp;
			SET &input;
			just_for_merge = 1;
		DATA _winsorize_temp;
			SET &input;
	PROC SORT data = _winsorize_temp;
		BY &by_var;
		BY &by_var;
		VAR %DO i = 1 %TO &var_number; &&var&i %END;;
		OUTPUT OUT = winsorized PCTLPTS = %IF &left ^= %THEN %DO; &left %END; %IF &right ^= %THEN %DO; &right %END;
			PCTLPRE = %DO i=1 %TO &var_number; var_&i %END;
			PCTLNAME = %IF &left ^= %THEN %DO;  _pct_&left %END; %IF &right ^= %THEN %DO;  _pct_&right %END;;

	DATA &output;
		MERGE _winsorize_temp winsorized;
		BY &by_var;
		%DO i=1 %TO &var_number;
			IF &&var&i ^= . THEN DO;
				%IF &left ^= AND &right^= %THEN %DO;
					&&var_out&i = MIN(var_&i._pct_&right, MAX(var_&i._pct_&left, &&var&i));
					DROP var_&i._pct_&left var_&i._pct_&right;
		        %ELSE %IF &left ^= %THEN %DO;
	  	        			&&var_out&i = MAX(var_&i._pct_&left, &&var&i);
							DROP var_&i._pct_&left;
				%ELSE %IF &right ^= %THEN %DO;
  	        		&&var_out&i = MIN(var_&i._pct_&right, &&var&i);
					DROP var_&i._pct_&right;
		%IF &by_var = just_for_merge %THEN DROP just_for_merge;;
%MEND winsorize;

		var = at mtob leverage roa da1_adj da2_adj da3_adj da4_adj da_lr1_adj da_lr2_adj da_lr3_adj da_lr4_adj,
		var_out = at mtob leverage roa da1_adj da2_adj da3_adj da4_adj da_lr1_adj da_lr2_adj da_lr3_adj da_lr4_adj,
		by_var = fyear, 
		left = 1, 
		right = 99, 
		input = sho_accruals2,
		output = sho_accruals2);

I ran the SAS code from FHK and dumped the pre- and post-winsorization data into Stata files.

Below is code that takes the pre-winsorization data, winsorizes it, then compares the result with the post-winsorization data from SAS. Here I have hard-coded the quantiles at which the data are winsorized; it would be easy to tweak the function to make this variable. For sure, the R function is much easier to follow. A critical element to getting the same results is the argument type = 2 (get the details here).

library(dplyr, warn.conflicts = FALSE)

sho_accruals2_post_wins <- read_dta("data/sho_accruals2_post_wins.dta")
sho_accruals2_pre_wins <- read_dta("data/sho_accruals2_pre_wins.dta")

win01 <- function(x) {
    cuts <- quantile(x, probs = c(0.01, 0.99), 
                     type = 2, na.rm = TRUE)
    x[x < cuts[1]] <- cuts[1]
    x[x > cuts[2]] <- cuts[2]

win_vars <- c("at", "mtob", "leverage", "roa", "da1_adj", "da2_adj", 
                 "da3_adj", "da4_adj", "da_lr1_adj", "da_lr2_adj", 
                 "da_lr3_adj", "da_lr4_adj")

sho_accruals2_new <-
    sho_accruals2_pre_wins %>%
    group_by(fyear) %>%
    mutate_at(all_of(win_vars), win01) %>%

merged <-
    sho_accruals2_new %>%
               by = c("gvkey", "fyear"),
               suffix = c("_r", "_sas"))

tol <- 1e-8

merged %>% 
    mutate(diff = abs(da_lr1_adj_r - da_lr1_adj_sas)) %>%
    filter(diff > tol) %>% 
    select(gvkey, fyear, da_lr1_adj_r, da_lr1_adj_sas, diff) %>%
#> # A tibble: 0 x 5
#> # … with 5 variables: gvkey <chr>, fyear <dbl>, da_lr1_adj_r <dbl>,
#> #   da_lr1_adj_sas <dbl>, diff <dbl>

Created on 2020-10-09 by the reprex package (v0.3.0)

view raw hosted with ❤ by GitHub
Posted in Uncategorized | Leave a comment

Getting Fama-French industry data (2020 version)

In a post nearly 10 years ago, I put some code to get Fama-French industry data into R. I now have newer code here.

Posted in Uncategorized | Leave a comment

WRDS now has a PostgreSQL database

See here for an illustration using Python to access the data.

Posted in Uncategorized | Leave a comment

Incorporating SQL in Stata code

Here is a snippet of code showing how one can include SQL in Stata code (I assume that the ODBC connection has been set up).

#delimit ;

local sql "
 FROM director_bio.directorship_results
 WHERE non_match AND start_date < date_filed
 ORDER BY random()
 LIMIT 50";

odbc load, exec("`sql'") dsn("iangow") clear;
Posted in PostgreSQL, Stata, Uncategorized | Tagged , , | Leave a comment

Playing with times in Python

from datetime import datetime
from pytz import timezone, utc

pacific = timezone('US/Pacific')

gemma_bd = datetime(2007, 8, 27, 7, 45)
gemma_bd = pacific.localize(gemma_bd, is_dst=True)

print(utc.localize(datetime.utcnow()) - gemma_bd)
Posted in Uncategorized | Leave a comment

Installing PostgreSQL 9.6 using MacPorts

In another post, I describe how to install PostgreSQL with PL/R. The purpose of this post is to provide a streamlined version without PL/R.

1. Install PostgreSQL using MacPorts..
Of course, I assume you’ve got MacPorts up and running on your system.

sudo port install postgresql96 +perl +python27
sudo port install postgresql96-server

2. Set up PostgreSQL

I first need to initialize the database cluster and then get the server running. The following comes straight from the on-screen instructions supplied with the MacPorts port postgresql96-server. I wonder if it would work if I used the existing (and hidden) _postgres user. I also wonder if I should use the --encoding=UTF8 --locale=en_US options I used in my earlier installation.

sudo mkdir -p /opt/local/var/db/postgresql96/defaultdb
sudo chown postgres:postgres /opt/local/var/db/postgresql96/defaultdb
sudo su postgres -c '/opt/local/lib/postgresql96/bin/initdb -D /opt/local/var/db/postgresql96/defaultdb'

Sometimes permissions are messed up (and PostgreSQL will complain). To fix these:

sudo chmod -R 755 /opt/local/var/db/

Note that MacPorts creates a launch daemon. To load it now and to make sure it launches on system start, do:

sudo defaults write /Library/LaunchDaemons/org.macports.postgresql96-server.plist Disabled -bool false
sudo launchctl load /Library/LaunchDaemons/org.macports.postgresql96-server.plist

Then do some set-up to get the database going.

sudo su - postgres
/opt/local/lib/postgresql96/bin/psql -U postgres -d template1

I tend to do my work using a database named “crsp” (originally I just had CRSP data in there) under my own user name. So I set stuff up accordingly. Within psql:


4. Figure out how to get data into the database.

Posted in Uncategorized | 2 Comments

Executing SQL in a file against an ODBC source from Stata

I recently discovered that Stata has good support for OBDC. While the built-in functionality includes odbc sqlfile, a “batch job alternative to odbc exec,” there seems to be no equivalent alternative for odbc load (even though odbc load is surely the more useful function).

But this is easily addressed using an .ado program I adapted from that discussed in an earlier post.

I created a file (loadsql2.ado) containing the following:

program define loadsql2
*! Load the output of an SQL file into Stata, version 1.6 (
version 13.1
syntax using/, DSN(string)
tempname sqlfile exec line;
file open `sqlfile' using `"`using'"', read text;
file read `sqlfile' `line';
while r(eof)==0 {;
    local `exec' = "``exec''
    file read `sqlfile' `line';
file close `sqlfile';
* display "``exec''";
odbc load , exec("``exec''") dsn("`dsn'") clear;

I placed this file in ~/Library/Application\ Support/Stata/ado/personal.

Since I have a ODBC DSN iangow set up, I can now issue the following (see my earlier post for details on ~/some_sql.sql):

loadsql2 using "~/some_sql.sql", dsn("iangow") 

and then I have data in Stata:

. list in 1/5

     |  gvkey    datadate       at       ni |
  1. | 001073   31dec1985    3.095     .541 |
  2. | 001073   31dec1986    4.422     .584 |
  3. | 001073   31dec1987      3.9    -.359 |
  4. | 001074   31dec1970   34.856    1.716 |
  5. | 001073   31dec1988    2.067   -1.191 |
Posted in Mac, PostgreSQL, Stata | Leave a comment

Another way to access PostgreSQL from Stata

In an earlier post, I discussed a way of accessing PostgreSQL from Stata. One downside is that this approach would not work for Windows users without significant modification. I asked a question on StackOverflow to see if anyone could identify the necessary modifications.

While one user could answer my StackOverflow question, it seems the easier way to access PostgreSQL from Stata on Windows is via ODBC. It turns out to be fairly straightforward to get ODBC working on Mac OS X (see here).

Posted in Uncategorized | Leave a comment

GitHub code repository

For anyone reading this blog, I also have a repository on GitHub dedicated to code for compiling data from standard sources (e.g., WRDS). I’m still in the process of documenting this, but the documentation for some parts (e.g., Perl scripts for getting data from WRDS into PostgreSQL) is fairly complete.

Posted in Uncategorized | Tagged , | Leave a comment

Installing R on Mac OS from the command line

I just updated R remotely. I wanted to avoid the hassle of logging in via VNC (“Screen Sharing”), so I did this:

sudo installer -pkg R-3.2.3.pkg -target /
rm R-3.2.3.pkg

Now, to get RStudio from the command-line:

sudo hdiutil attach RStudio-0.99.489.dmg
sudo cp -R /Volumes/RStudio-0.99.489/ /Applications/
sudo hdiutil detach /Volumes/RStudio-0.99.489
rm RStudio-0.99.489.dmg
Posted in Mac, R | Tagged , | Leave a comment