
SAS Guide for Stata Users

This document describes the basic features of using SAS to work with and manipulate datasets, and to perform essential statistical functions. It is written from the point of view of Stata users; that is, it assumes that you are familiar with using Stata to perform these operations.

The advantages of SAS lie in its flexibility and superior performance when working with very large datasets. Even with the computing power and memory of today's computers, Stata often becomes unwieldy and inefficient when working with large datasets (which can be hundreds of megabytes in size), because Stata loads the entire dataset into memory even though you are often working with just a few variables at a time.

SAS, by contrast, allows you to work with datasets of almost unlimited size. This is related to a second disadvantage of Stata: there can only be one dataset in memory at a time, so working with multiple datasets (for example, multiple merges) requires you to explicitly modify and save each dataset, and then open the next one. SAS can deal with multiple datasets concurrently. SAS is also well-suited to the UNIX environment of the Social Sciences Computing Cluster.

A key feature of SAS is that virtually every statement or operation will be part of one of just two types: a DATA routine or a PROC routine. The DATA routines open, close, create, merge and manipulate datasets. The PROC routines perform pre-written procedures and operations on existing datasets.

An important distinction between SAS and Stata is that Stata applies each command to the entire dataset (or a specified subset) before executing the next command. SAS, however, reads in each observation from the dataset, applies either a single statement or a set of statements to that observation (such as multiple statements within a DATA step), and then reads in the next observation.

With SAS, at any one time only one observation is 'in memory', while in the case of Stata, the entire dataset is in memory and can be accessed at any time. This is why memory requirements and constraints in Stata can be quite binding. It is also the reason that older versions of Stata (before Stata 12) require you to pre-specify memory requirements before running a large-memory program, and to modify these specifications for different datasets. SAS does have memory specification requirements, but typically only for very special jobs.
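The observation-by-observation processing described above can be seen in a small sketch. The dataset name demo and the variables x and y are made up for illustration and are not part of the Patents data; _N_ is the automatic variable that counts iterations of the DATA step.

```sas
/* Hypothetical toy data: demo, x and y are illustrative names only */
data demo;
    input x;            /* read one observation from the lines below */
    y = 2 * x;          /* computed for the current observation only */
    put _N_= x= y=;     /* write the iteration number and values to the log */
    datalines;
1
2
3
;
run;
```

The log should contain one line per observation, written as each observation is read, rather than one result for the dataset as a whole.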

The best way to see the difference between SAS and Stata is to compare programs written in each language to perform a simple task. As an example, I will use a dataset on US patents that is available from the National Bureau of Economic Research (NBER). The dataset consists of the following variables:

Compare the following two programs to read the dataset into memory, sort it, calculate summary statistics for the entire dataset, calculate summary statistics on a subset of variables and observations, and finally calculate summary statistics for a subset of variables and observations, for each value of a particular variable.

The SAS example below also demonstrates the use of permanent versus temporary SAS datasets. Permanent datasets in SAS are associated with a particular libref, in this case called datlib. The libref is assigned by the libname statement, which specifies the physical location of the files. Temporary datasets, on the other hand, are only usable while the program is running and are deleted as soon as the program has finished.

The example below assumes that the Patents dataset exists in both SAS and Stata formats in the same subdirectory and the SAS program first reads the permanent SAS dataset into a temporary dataset before making any changes.
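As a minimal sketch of the permanent/temporary distinction, the steps below create a temporary subset and then save a permanent copy of it. The dataset name patents76 is hypothetical; the libref datlib and its path are the ones used in this guide.

```sas
libname datlib '~/thesis/data';

/* Temporary: a one-level name is stored in the WORK library
   and is deleted when the SAS session ends */
data patents76;
    set datlib.patents;
    where year = 1976;
run;

/* Permanent: the two-level name libref.dataset writes a file
   into the directory that datlib points to */
data datlib.patents76;
    set patents76;
run;
```

The only difference between the two DATA statements is the libref prefix on the output dataset name.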


Stata

 use "~/thesis/data/patents.dta"
 sort year country
 sum
 sum cmade creceive if year==1976 & country=="USA"
 sort cat
 by cat: sum cmade creceive if year==1976 & country=="USA"

SAS

 libname datlib '~/thesis/data';
 data patents;
     set datlib.patents;
 run;
 proc sort data=patents;
     by year country;
 run;
 proc means data=patents;
 run;
 proc means data=patents;
     where (year=1976) and (country="USA");
     var cmade creceive;
 run;
 proc sort data=patents;
     by cat;
 run;
 proc means data=patents;
     where (year=1976) and (country="USA");
     by cat;
     var cmade creceive;
 run;

It may appear that the SAS program is longer and takes more time to write. This is because each action needs its own PROC or DATA step. However, the increase in program-writing time is usually compensated for by the speed and flexibility that SAS allows, especially when the datasets become very large.

Note also that some of the statements in this SAS program could be removed without affecting the results; for example, every run statement except the last can be removed and the program will run as before.

Similarly it is usually unnecessary to specify the name of the dataset in various PROC and DATA steps. However including these statements is good programming practice and will certainly simplify matters if this code is altered in the future.
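The program above can also be shortened in another way. As a sketch, the final PROC SORT and BY-group PROC MEANS could be replaced by a single PROC MEANS with a CLASS statement, which does not require the data to be sorted first.

```sas
proc means data=patents;
    where (year=1976) and (country="USA");
    class cat;              /* grouping variable; no prior PROC SORT needed */
    var cmade creceive;
run;
```

The results are laid out differently (one table with a row block per value of cat, rather than a separate table per BY group), and by default observations with a missing value of the CLASS variable are excluded, so this is an alternative to consider rather than an exact drop-in replacement.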

Specific differences in syntax and command structure between Stata and SAS can be found in A SAS User's Guide to Stata and, in particular, How to "Do" in Stata What You Know How to "Program" in SAS. These are links to a guide for SAS users to learn how to use Stata, but the comparisons between the two programs apply to anyone learning either program.

Merging and Appending Datasets

The examples below show you the Stata and SAS programs to merge two datasets according to a key value. For expositional purposes, assume that we already have two datasets, in each of the two formats (SAS and Stata). The first dataset is the one described above. The second dataset contains the patent id variable as well as the variable 'yearapp' which contains the year of the patent application (which is usually well before the date that it was granted).

We need to merge these two datasets according to the patent variable.


Stata

 use patents.dta
 merge 1:1 patent using appyear.dta

SAS

 proc sort data=appyear;
     by patent;
 run;
 proc sort data=patents;
     by patent;
 run;
 data patents;
     merge patents appyear;
     by patent;
 run;
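Stata's merge records where each observation came from in the _merge variable. SAS has no automatic equivalent, but the in= dataset option can be used for the same purpose. The sketch below assumes both datasets are already sorted by patent, as in the program above; the names merged, inpat and inapp are hypothetical.

```sas
data merged;
    /* inpat and inapp are temporary 0/1 flags marking which
       input dataset contributed to the current observation */
    merge patents(in=inpat) appyear(in=inapp);
    by patent;
    /* keep only patents found in both datasets,
       similar to keeping _merge==3 in Stata */
    if inpat and inapp;
run;
```

Without the subsetting if statement, the merged dataset keeps every observation from either input, with missing values where one side has no match.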

The examples below show you the Stata and SAS programs to append two datasets. For expositional purposes, assume that we already have two datasets, in each of the two formats (SAS and Stata). The first dataset is the one described above, where the patent grant year is between 1975 and 2000. The second dataset contains older patents: those from 1967 to 1974 with exactly the same variables.

We need to create a new dataset where the second dataset is appended to the first.


Stata

 use patents.dta
 append using patentsold.dta
 save combined.dta

SAS

 data combined;
     set patents patentsold;
 run;
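When the first dataset is large, an alternative worth knowing is PROC APPEND, sketched below. Unlike the DATA step above, it does not create a new dataset called combined; it adds the observations of patentsold to the end of patents itself.

```sas
/* Appends patentsold to patents in place, without re-reading
   the observations already in the base dataset */
proc append base=patents data=patentsold;
run;
```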

Keeping and Dropping Variables and Observations

The examples below show you the Stata and SAS programs to drop a subset of variables and/or observations from the Patents dataset described above. Each program will drop two variables: subcat and creceive. It will also drop all observations from years 1995 and later.


Stata

 use patents.dta
 drop subcat creceive
 drop if year>1994

SAS

 data patents(drop=subcat creceive);
     set patents;
     if year>1994 then delete;
 run;

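Note that the drop (and keep) command has the same name in Stata regardless of whether you need to drop variables or observations; Stata determines from the syntax (a variable list versus an if condition) what the command applies to. SAS, by contrast, uses 'drop' to remove variables and 'delete' to remove observations.

Note too that numeric missing values compare as larger than any number in Stata and as smaller than any number in SAS, so pay careful attention to missing values in comparisons. In the programs above, the Stata condition year>1994 is true for an observation with a missing year, so Stata drops it, whereas the same condition is false in SAS, so SAS keeps it.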
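To make the SAS program drop observations with a missing year as well, matching what the Stata program does, the missing case can be tested explicitly. This is a sketch of one way to do it, using the missing() function.

```sas
data patents;
    set patents;
    /* year > 1994 is false when year is missing in SAS,
       so the missing case must be tested separately */
    if missing(year) or year > 1994 then delete;
run;
```

Going the other way, the Stata condition can be written as year>1994 & !missing(year) if observations with a missing year should be kept.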

Converting Datasets from Stata to SAS with Stat/Transfer

If you already have a dataset in Stata format, you can use Stat/Transfer to convert it to SAS. In fact, Stat/Transfer can be used to convert between any two of the following types of file formats: SAS, Stata, Matlab, ASCII, Excel, Access and others.

If you have a Windows version of Stat/Transfer, it is straightforward to convert file types using the drop down menus. Simply specify the location and filetype of the existing dataset and the location and filetype of the desired dataset. Stat/Transfer will convert the datasets keeping label names and variable names intact.

If you do not have access to a Windows version of Stat/Transfer, it is equally easy to use the UNIX version that is installed on the SSCC. For financial reasons, Stat/Transfer is only available on the host seldon.it.northwestern.edu. You cannot use it on any of the other SSCC hosts.

The example below shows you how to create a SAS dataset from an existing Stata dataset.

To start Stat/Transfer, first login to seldon.it.northwestern.edu, and then go to the subdirectory where the data files to be transferred are located.

At the command prompt simply type "st file1.ex1 file2.ex2", where ex1 and ex2 are proper extensions for the filetypes. For example, if you have a Stata dataset called patents.dta which you need copied to SAS version 7 (or later) format, type:

[ach131@seldon thesis]$ st patents.dta patents.sas7bdat

No messages were printed, so the transfer into a SAS dataset worked.
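As a quick check on the result, the converted file can be inspected from SAS. The sketch below assumes SAS is started in the same directory as patents.sas7bdat; the libref name here is hypothetical.

```sas
/* '.' points the hypothetical libref "here" at the current directory */
libname here '.';

/* List the variables, types and labels of the converted dataset */
proc contents data=here.patents;
run;
```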

Last Updated: 24 January 2017
