SAS Guide for Stata Users

This document describes the basic features of SAS for working with and manipulating datasets and for performing essential statistical functions. It is written from the point of view of Stata users; that is, it assumes you are familiar with performing these operations in Stata.

The advantages of SAS lie in its flexibility and superior performance when working with very large datasets. Even with the computing power and memory of today's computers, Stata often becomes unwieldy and inefficient with large datasets (which can be hundreds of megabytes in size), because Stata loads the entire dataset into memory even though you are often working with just a few variables at a time.

SAS, by contrast, allows you to work with datasets of almost unlimited size. This is related to a second disadvantage of Stata: only one dataset can be in memory at a time, so working with multiple datasets (for example, performing multiple merges) requires you to explicitly modify and save each dataset and then open the next one. SAS can work with multiple datasets concurrently. SAS is also well-suited to the UNIX environment of the Social Sciences Computing Cluster.

A key feature of SAS is that virtually every statement or operation belongs to one of just two types of routine: a DATA routine or a PROC routine (usually called DATA steps and PROC steps). The DATA routines open, close, create, merge and manipulate datasets. The PROC routines run pre-written procedures and operations on existing datasets.
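As a minimal sketch of the two routine types, the program below uses one DATA step to create a small dataset and one PROC step to summarize it. The dataset name scores and its variables are hypothetical and are not part of the patents example used later.

```sas
/* DATA step: create a small temporary dataset from inline records.
   The dataset "scores" and its variables are hypothetical. */
data scores;
    input name $ score;
    datalines;
Ann 90
Bob 75
Cal 82
;
run;

/* PROC step: run a pre-written procedure on the existing dataset. */
proc means data=scores;
    var score;
run;
```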

An important distinction between SAS and Stata is that Stata applies each statement to the entire dataset (or a specified subset) before executing the next statement. SAS, however, reads in one observation from the dataset, applies a single statement or a set of statements to that observation (such as the multiple statements within a DATA step), and then reads in the next observation.

With SAS, only one observation is 'in memory' at any one time, whereas in Stata the entire dataset is in memory and can be accessed at any time. This is why memory requirements and constraints in Stata can be quite binding. It is also the reason you need to pre-specify memory requirements in Stata before running a large-memory program (and to modify these specifications for different datasets). SAS does have memory specification requirements, but typically only for very specialized jobs.
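The observation-by-observation processing described above can be sketched with a running total. The dataset names cites and running and the sample values are hypothetical; _N_ is the automatic SAS variable that counts iterations of the DATA step.

```sas
/* Hypothetical sample data: three patents and the citations each made. */
data cites;
    input patent cmade;
    datalines;
1001 3
1002 5
1003 2
;
run;

data running;
    set cites;                /* reads one observation per pass through the step */
    total + cmade;            /* sum statement: total starts at 0 and is retained */
    put _N_= patent= total=;  /* writes the current observation's values to the log */
run;
```

Each pass through the DATA step sees only the current observation plus any retained values, so the log shows total growing to 3, 8 and 10. In Stata, the same column would be produced in a single whole-dataset statement such as gen total = sum(cmade).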

The best way to see the difference between SAS and Stata is to compare programs written in each language to perform a simple task. As an example, I will use a dataset on US patents that is available from the National Bureau of Economic Research (NBER). The dataset consists of the following variables:
  • patent: The ID number of the patent
  • year: The year the patent was granted
  • country: A string variable containing the nationality of the inventor
  • cat: The patent category (an integer from 1 to 5)
  • subcat: The patent subcategory (an integer from 1 to 100)
  • cmade: The number of citations made by this patent to other patents
  • creceive: The number of citations received by this patent from other patents

Compare the following two programs, each of which reads the dataset into memory, sorts it, calculates summary statistics for the entire dataset, calculates summary statistics for a subset of variables and observations, and finally calculates those same statistics separately for each value of a particular variable.

The SAS example below also demonstrates the use of permanent versus temporary SAS datasets. Permanent datasets in SAS are associated with a particular libref, in this case called datlib. The libref is assigned by the libname statement, which specifies the physical location of the file. Temporary datasets, on the other hand, are usable only while the program is running and are deleted as soon as it has finished.
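As a brief sketch of how the two kinds of dataset are named, assume the libref datlib points to the directory used throughout this guide. The dataset name patents76 is hypothetical: a one-level name creates a temporary dataset, while a two-level name (libref.dataset) creates a permanent one.

```sas
libname datlib '~/thesis/data';

/* Temporary: a one-level name is stored in the WORK library
   and disappears when the SAS session ends. */
data patents76;
    set datlib.patents;
    where year = 1976;
run;

/* Permanent: a two-level name writes the dataset to the
   directory that the libref datlib points to. */
data datlib.patents76;
    set patents76;
run;
```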

The example below assumes that the patents dataset exists in both SAS and Stata formats in the same subdirectory. The SAS program first reads the permanent SAS dataset into a temporary dataset before making any changes.

STATA

        set memory 550m
        use "~/thesis/data/patents.dta"
        sort year country
        sum
        sum cmade creceive if year==1976 & country=="USA"
        sort cat
        by cat: sum cmade creceive if year==1976 & country=="USA"


SAS

        libname datlib '~/thesis/data';
        data patents;
        set datlib.patents;
        run;
		
        proc sort data=patents;
        by year country;
        run;
		
        proc means data=patents;
        run;
		
        proc means data=patents;
        where (year=1976) and (country="USA");
        var cmade creceive;
        run;
		
        proc sort data=patents;
        by cat;
        run;
		
        proc means data=patents;
        where (year=1976) and (country="USA");
        by cat;
        var cmade creceive;
        run;

It may appear that the SAS program is longer and takes more time to write. This is because each action must be set up as its own PROC or DATA step. However, the extra program-writing time is usually offset by the speed and flexibility that SAS allows, especially when the datasets become very large.

Note also that some of the statements in this SAS program could be removed without affecting the results. For example, every run statement except the last can be removed and the program will run as before, because the start of the next DATA or PROC step implicitly ends the previous one.

Similarly, it is usually unnecessary to specify the name of the dataset in the various PROC and DATA steps, since by default SAS uses the most recently created dataset. However, naming the dataset explicitly is good programming practice and will certainly simplify matters if the code is altered in the future.
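To illustrate these two points, here is a sketch of the first part of the program with the optional statements left out. It relies on SAS defaulting to the most recently created dataset, and it is shown only for comparison; the explicit version above remains the better practice.

```sas
libname datlib '~/thesis/data';

data patents;
    set datlib.patents;

/* No data= option: PROC SORT uses the most recently created
   dataset (work.patents) and sorts it in place. */
proc sort;
    by year country;

/* No data= option: PROC MEANS also uses work.patents. */
proc means;
run;    /* only the final run statement is kept */
```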

Specific differences in syntax and command structure between Stata and SAS can be found in A SAS User's Guide to Stata and, in particular, How to "Do" in Stata What You Know How to "Program" in SAS. Both guides are written for SAS users learning Stata, but the comparisons between the two programs apply to anyone learning either one.

Last Updated: 20 January 2009

© 2009 Northwestern University