PROC COMPARE In SAS [Complete Guide With 10+ Examples]

The COMPARE procedure compares the contents of two SAS data sets, selected variables in different data sets, or variables within the same data set.

PROC COMPARE compares two data sets: the base data set and the comparison data set. The procedure determines matching variables and matching observations.

Matching variables are variables with the same name or variables that you pair by using the VAR and WITH statements. Matching variables must be of the same type.

There are three main levels that can impact differences generated by proc compare as:

  1. Difference in number of observations
  2. Difference in variables
  3. Difference in actual values

These are the reasons why PROC COMPARE is the most preferred first step when it comes to comparing the datasets in SAS. It produces the listings and reports with all the necessary information highlighting the key differences with some statistics.

A Simple Definition: 

Proc Compare is a procedure that allows two datasets to be compared for properties, number of observations, number of variables, and properties of the datasets.

Syntax:

				/* proc compare syntax */

PROC COMPARE < BASE=SAS-data-set COMPARE=SAS-data-set <OPTIONS> >;
	BY < variable-1 variable-2 ...>;
	ID < variable-1 variable-2 ...>;
	VAR variable(s);
	WITH variable(s);
RUN;
			

Explanation:

  • BASE=<SAS-data-set> – specifies the data set to use as the base data set.
  • COMPARE=<SAS-data-set> – specifies the data set to use as the comparison data set.
  • BY – Produce a separate comparison for each BY group
  • ID – Identify variables to use to match observations
  • VAR – Restrict the comparison to values of specific variables
  • WITH – Compare variables of different names OR Compare two variables in the same data set
  • OUT=<SAS-data-set> – it contains the differences between matching variables.

The following two datasets will be created to show different examples with use cases of the proc compare procedure in SAS.

				/* create sample datasets */
data data_one(label='First Data Set');
	input student year $ state $ gr1 gr2;
	label year='Year of Birth';
	datalines;
2000 2020 NC 85 87
2042 2025 MS 91 92
2095 2018 TN 78 92
2187 2020 NY 87 95
2500 2025 NC 96 98
;
run;

data data_two(label='Second Data Set');
	input student  year $ state $ gr1 gr2 major $;
	label state='Home State';
	datalines;
2000 2020 NC 85 87 Math
2042 2025 MS 90 92 AR/VR
2095 2018 TN 78 92 History
2187 2020 MA 87 94 Music
2204 2025 NC 82 96 Robotics
2500 2030 NY 96 98 AI/ML
;
run;

/* view dataset*/
proc print data=data_one; title 'wotk.data.one';
proc print data=data_two; title 'wotk.data.two';
run;
			
Sample datasets for proc compare procedure in sas

How To Compare Two Datasets With PROC COMPARE

Now you have two different datasets data_one and data_two. The simplest way to execute the compare procedure is by considering one of the dataset as BASE dataset and the other one is COMPARE dataset without using any additional optional statements.

Here is a simple example of that.

				/* compare two datasets in sas */

proc compare base=data_one compare=data_two ;
   title 'Comparing Two Data Sets: Full Report';
run;

			

The output of the above query is a bit long but below is the partial output snapshot which is more important to know at the beginning stage.

proc compare dataset summary
proc compare observation summary

PROC COMPARE With PRINTALL Option

The PRINTALL option in the proc compare procedure invokes the following options: ALLVARS, ALLOBS, ALLSTATS, LISTALL, and WARNING.

  • ALLVARS – It includes in the report of value comparison results the values and, for numeric variables, the differences for all pairs of matching variables, even if they are judged equal.
  • ALLOBS – It includes in the report of value comparison results the values and, for numeric variables, the differences for all matching observations, even if they are judged equal.
  • ALLSTATS – It prints a table of summary statistics for all pairs of matching variables.
  • LISTALL – It lists all variables and observations that are found in only one data set.
  • WARNING – It displays a warning message in the SAS log when differences are found.
				/* compare two datasets in sas using printall */

proc compare base=data_one compare=data_two printall;
   title 'Comparing Two Data Sets: Full Report with PRINTALL Option';
run;
			

Firstly, the above query prints warning as differences in the SAS log and continues with comparison. Here is how sas log and output looks like: 

proc compare warning with printall statement
Proc compare PRINTALL option
Partial Output of Proc Compare with PRINTALL Option

Comparison with an ID Variable

In a simple comparison, PROC COMPARE uses the observation number to determine which observations to compare. When you use an ID variable, PROC COMPARE uses the values of the ID variable to determine which observations to compare. ID variables should have unique values and must have the same type.

ID Variable must follow these rules: 

  • ID variables must be of the same type in both data sets.
  • You should sort both data sets by the common ID variables (within the BY variables, if any) unless you specify the NOTSORTED option.

If PROC COMPARE finds two successive observations with the same ID values in a data set, then it prints the warning Duplicate Observations for the first occurrence for that data set

				/* sort datasets */
proc sort data=data_one; 
	by student; 
run;
proc sort data=data_two; 
	by student; 
run;

/* comparison based on matching values for the ID variable "student" */
proc compare
    base=data_one
    compare=data_two;
    id student;
run;
			
PROC COMPARE With an ID Variable
Partial Output of PROC COMPARE With an ID Variable

Comparison With Grouped Data On BY Variable

The BY variable specifies the variable that the procedure uses to form BY groups. You can specify more than one variable.

If you do not use the NOTSORTED option in the BY statement, then the observations in the data set must be sorted by all the variables that you specify.

In the below example comparison has been done on grouped data by variable “state”.

				/* sort datasets by state*/
proc sort data=data_one; 
	by state; 
run;
proc sort data=data_two; 
	by state; 
run;

/* comparison of grouped data on BY variable */
proc compare
    base=data_one
    compare=data_two;
    by state;
run;
			
Compare datasets with group by variables
Compare datasets with group by variables 2
Partial Output of Grouped BY Variable "state"

Compare Datasets For Selected Variable

By default the proc compare does comparison across all the variables but you can do the comparison for specific variables using the VAR statement.

You can list down the variables after the VAR keyword in the proc compare procedure.

				/*compare the differences between the datasets only for 'state' variable */
proc compare
    base=data_one
    compare=data_two ;
    var state; /* select variable to compare */
run;
			
PROC Compare in SAS with VAR statement

PROC COMPARE With OUTPUT Dataset (OUT=Option)

By default, the OUT= data set contains an observation for each pair of matching observations. The OUT= data set contains the following variables from the data sets you are comparing:

  • all variables named in the BY statement
  • all variables named in the ID statement
  • all matching variables or, if you use the VAR statement, all variables listed in the VAR statement

In the below example we are using some options along with OUT= to have max information stored in the output dataset.

NOPRINT suppresses the printing of the procedure output. 

OUTNOEQUAL includes only observations that are judged unequal.

OUTBASE writes an observation to the output data set for each observation in the base data set. 

OUTCOMP writes an observation to the output data set for each observation in the comparison data set. 

OUTDIF writes an observation to the output data set that contains the differences between the two observations.

				/* sort datasets */
proc sort data=data_one; 
	by student; 
run;
proc sort data=data_two; 
	by student; 
run;

/* proc compare with output dataset (OUT=SAS-data-set) */
proc compare
    base=data_one
    compare=data_two
    out=output_ds outnoequal outbase outcomp outdif noprint;
    id student;
run;

/* view output dataset*/
proc print data=output_ds; run;
			
proc compare out= dataset in sas

The differences for character variables are noted with an X or a period (.).

An X shows that the characters do not match. A period shows that the characters do match. For numeric variables, an E means that there is no difference.

Create Output Data Set of Statistics With Proc Compare

In the proc compare procedure you can compare numeric variables and create summary statistics by using the OUTSTAT statement. By default it generates statistics for all the numeric variables but you can select the numeric variables in the VAR statement.

You can list the numeric variables after the VAR statement and the proc compare procedure only creates summary statistics for those variables ONLY.

The following statistics will be calculated:

  • N
  • MEAN
  • STD
  • MAX
  • MIN
  • STDERR
  • T
  • PROBT
  • NDIF
  • DIFMEANS
  • R, RSQ

In the below example, the OUTSTAT statement creates the summary statistics output dataset “work.output_stat” only for two numeric variables which are mentioned after the VAR statement: student and gr1 .

				/* proc compare outstat output statistics dataset */
proc compare
    base=data_one
    compare=data_two
    outstat=output_stat noprint;
    var student gr1; /* select numeric variables for outstat */
run;

/* view OUTSTAT data-set output */
proc print data=output_stat; run;
			
Proc compare in sas with OUTSTAT statistics in dataset

Compare Datasets With TRANSPOSE Option

The TRANSPOSE option prints the reports of value differences by observation instead of by variable.

				/* proc compare with transpose option */

proc compare
    base=data_one
    compare=data_two
    transpose ;
 id student;
run;
			
Transpose in PROc Compare procedure
Partial Output of Proc Compare with Transpose Option

PROC COMPARE Limitations

PROC COMPARE does not produce information about values that are the same in each comparison data set. It produces information about values that are different, not the same.

PROC COMPARE does not produce a data set that contains observations that are in one of the comparison data sets but not in the other, or that are in both comparison data sets.

The options for the COMPARE statement can produce much of this information, but they do not produce a data set. If you want to produce a data set that contains this information, use a DATA step that contains a MERGE statement.

Example: The following example demonstrates how to use data merge technique to create comparison summary datasets by comparing observations and variables. 

For more details check out this article: How To Compare Datasets In SAS

				/* sort datasets by student */
proc sort data=data_one; 
	by student; 
run;
proc sort data=data_two; 
	by student; 
run;

data in_data_one in_data_two in_both;
   merge data_one (in=in_a) data_two(in=in_b);
   by student;
      if in_a and not in_b then output in_data_one; /* left join */
      if in_b and not in_a then output in_data_two; /* right join */
      if in_a and in_b then output in_both; /* inner join */
run;

/* view output datasets */
proc print data=in_data_one; title 'work.in_data_one'; run;
proc print data=in_data_two; title 'work.in_data_two'; run;
proc print data=in_both; title 'work.in_both'; run;
			
proc compare limitations and data merge solution in sas

FAQ – PROC COMPARE In SAS

What is PROC COMPARE in SAS?

PROC COMPARE is a SAS procedure that compares two datasets and reports differences between them.

What are the benefits of using PROC COMPARE? 

PROC COMPARE is a quick and easy way to identify similarities and differences between two datasets. It can be used to compare datasets with different structures, variable names, and formats.

How do I use PROC COMPARE in SAS? 

To use PROC COMPARE, you need to specify the base dataset and the comparison dataset. You can also specify options to control the output of the procedure.

What are some common options used with PROC COMPARE?

Some common options used with PROC COMPARE include OUT=, OUTBASE, OUTCOMP, OUTDIF, and OUTSTATS. These options control the output of the procedure and allow you to create custom reports.

What are are the alternatives for PROC COMPARE?

You can use DATA MERGE technique to compare the datasets and create output datasets easily. This method help produce information about values that are the same in each comparison data set. It produces information about values that are different, not the same.