This project explores the empirical relationship between educational attainment and earned income using a detailed dataset of demographic, socioeconomic, and regional variables. The core objective is to determine how different education levels and demographic factors (e.g., age, race, health, citizenship, marital status) influence annual earnings. The analysis uses SAS to generate insights through data visualization, statistical modeling, and regression analysis.
DATA WORK.projdata1;
set e625data.cps_raw_sample;
IF pehspnon eq 1 THEN hispanic = 1;
ELSE IF pehspnon eq 2 THEN hispanic = 0;
white = 0;
black = 0;
asian = 0;
other = 0;
IF hispanic EQ 0 AND prdtrace EQ 1 THEN white = 1;
IF hispanic EQ 0 AND prdtrace EQ 2 THEN black = 1;
IF hispanic EQ 0 AND prdtrace EQ 3 THEN other = 1;
IF hispanic EQ 0 AND prdtrace EQ 4 THEN asian = 1;
IF hispanic EQ 0 AND prdtrace EQ 5 THEN other = 1;
IF hispanic EQ 0 AND prdtrace EQ 6 THEN black = 1;
IF hispanic EQ 0 AND prdtrace EQ 7 THEN other = 1;
IF hispanic EQ 0 AND prdtrace EQ 8 THEN asian = 1;
IF hispanic EQ 0 AND prdtrace EQ 9 THEN other = 1;
IF hispanic EQ 0 AND prdtrace EQ 10 THEN other = 1;
IF hispanic EQ 0 AND prdtrace EQ 11 THEN asian = 1;
IF hispanic EQ 0 AND prdtrace EQ 12 THEN other = 1;
IF hispanic EQ 0 AND prdtrace EQ 13 THEN other = 1;
IF hispanic EQ 0 AND prdtrace EQ 14 THEN other = 1;
IF hispanic EQ 0 AND prdtrace EQ 15 THEN other = 1;
* Add labels to variables;
label White = "White = 1 if the person is white and is not Hispanic, and = 0 otherwise";
label Black = "Black = 1 if the person is black and is not Hispanic, and = 0 otherwise";
label Asian = "Asian = 1 if the person is Asian and is not Hispanic, and = 0 otherwise";
label Other = "Other = 1 if the person is of another race and is not Hispanic";
label Hispanic = "hispanic = 1 if the person is Hispanic regardless of race";
RUN;
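One way to sanity-check the recode above is to confirm that every person falls into exactly one of the five groups. The sketch below assumes the step above has run and created `WORK.projdata1`; the dataset name `race_check` and the variable `n_groups` are hypothetical names introduced only for this check.

```sas
/* Count how many of the five indicators are set for each person.
   SUM() ignores missing values, so a missing hispanic flag counts as 0. */
DATA race_check;                      /* hypothetical scratch dataset */
    SET WORK.projdata1;
    n_groups = SUM(white, black, asian, other, hispanic);
RUN;

/* Every row should show n_groups = 1; a value of 0 or 2+ points to a recode problem. */
PROC FREQ DATA=race_check;
    TABLES n_groups / MISSING;
    TITLE "Number of race/ethnicity indicators set per person";
RUN;
```

A nonzero count at `n_groups = 0` would suggest that some `prdtrace` or `pehspnon` codes are not covered by the IF statements.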
Create Citizenship & Marital Status Variables
```sas
citizen = 0;
if PRCITSHP ge 1 and PRCITSHP le 4 then citizen = 1;

* Initialize the marital-status indicators so each is 0 rather than missing;
married = 0;
widowed = 0;
divorced = 0;
separated = 0;
never_married = 0;
if A_MARITL in (1, 2, 3) then married = 1;
else if A_MARITL = 4 then widowed = 1;
else if A_MARITL = 5 then divorced = 1;
else if A_MARITL = 6 then separated = 1;
else if A_MARITL = 7 then never_married = 1;
```
*Used the CPS variable A_SEX to create a variable called female = 1 for females and = 0 for males*;
* Create female variable*;
*Female =1 Male =0*;
Female = 0;
If A_Sex = 2 Then Female = 1;
label Female = 'female = 1 for females and = 0 for males';
RUN;
Earned_Income = PEARNVAL;
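Annual_Hours = HRSWK * WKSWORK;
* Leave Hourly_Wage missing when no hours were worked, which avoids division by zero;
IF Annual_Hours > 0 THEN Hourly_Wage = Earned_Income / Annual_Hours;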
label Earned_Income = 'Income earned in a year (equal to PEARNVAL)';
label Annual_Hours = 'Usual hours per week x number of weeks worked';
label Hourly_Wage = 'Earned_Income / Annual_Hours';
RUN;
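Because `Hourly_Wage` is a ratio, a few very small values of `Annual_Hours` can produce implausibly large wages. A minimal sketch for inspecting the tails, assuming the three variables above are stored in `WORK.projdata1` (the snippet does not say which dataset holds them):

```sas
/* Summarize the earnings variables, including the 1st and 99th percentiles,
   to spot extreme or missing values before modeling. */
PROC MEANS DATA=WORK.projdata1 N NMISS MIN P1 MEDIAN P99 MAX MAXDEC=2;
    VAR Earned_Income Annual_Hours Hourly_Wage;
    TITLE "Distribution check for earnings, hours, and hourly wage";
RUN;
```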
educ_lt_hs = 0;
educ_eq_hs = 0;
educ_some_college = 0;
educ_college = 0;
educ_ma = 0;
educ_prof_phd = 0;
if A_HGA <= 38 then educ_lt_hs = 1;
else if A_HGA = 39 then educ_eq_hs = 1;
* Associate degrees (A_HGA 41, 42) are below a bachelors degree (A_HGA 43);
else if A_HGA in (40, 41, 42) then educ_some_college = 1;
else if A_HGA = 43 then educ_college = 1;
else if A_HGA = 44 then educ_ma = 1;
else if A_HGA in (45, 46) then educ_prof_phd = 1;
* Add labels to variables;
label educ_lt_hs = "= 1 if education less than a high school degree; = 0 otherwise";
label educ_eq_hs = "= 1 if education = high school degree; = 0 otherwise";
label educ_some_college = "= 1 if education beyond high school but less than bachelors degree; = 0 otherwise";
label educ_college = "= 1 if college graduate; = 0 otherwise";
label educ_ma = "= 1 if masters degree; = 0 otherwise";
label educ_prof_phd = "= 1 if professional or doctoral degree; = 0 otherwise";
*Categorize New Education Variables*;
educ_cat = 0;
IF educ_lt_hs = 1 THEN educ_cat = 1;
IF educ_eq_hs = 1 THEN educ_cat = 2;
IF educ_some_college = 1 THEN educ_cat = 3;
IF educ_college = 1 THEN educ_cat = 4;
IF educ_ma = 1 THEN educ_cat = 5;
IF educ_prof_phd = 1 THEN educ_cat = 6;
RUN;
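A quick way to confirm that each `A_HGA` code lands in the intended category is to cross-tabulate the raw code against `educ_cat`. This sketch assumes both variables are kept in `WORK.projdata1`:

```sas
/* Each A_HGA row should have counts in exactly one educ_cat column.
   Rows mapped to educ_cat = 0 are codes not covered by the IF/ELSE chain. */
PROC FREQ DATA=WORK.projdata1;
    TABLES A_HGA*educ_cat / NOROW NOCOL NOPERCENT MISSING;
    TITLE "Raw education code (A_HGA) by derived education category";
RUN;
```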
poor_health = 0;
IF HEA = 4 THEN poor_health = 1;
IF HEA = 5 THEN poor_health = 1;
label poor_health ='= 1 if health status is fair or poor and is = 0 otherwise';
miss_educ = 0;
IF AXHGA ne 0 THEN miss_educ =1;
LABEL miss_educ = "=1 if AXHGA is not = 0; 0 otherwise";
miss_earn = 0;
IF I_HRSWK ne 0 OR I_WKSWK ne 0 THEN miss_earn =1;
LABEL miss_earn = "=1 if missing earnings information";
miss_demo = 0;
IF AXAGE ne 0 OR
AXHGA ne 0 OR
I_HEA ne 0 OR
PXHSPNON ne 0 OR
PXMARITL not in (-1,0) OR
PXRACE1 ne 0 THEN miss_demo = 1;
LABEL miss_demo = "=1 if missing demographic information";
miss_demo_earn = 0;
IF miss_demo eq 1 OR miss_earn eq 1 THEN miss_demo_earn = 1;
LABEL miss_demo_earn = "=1 if miss_earn or miss_demo =1";
RUN;
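To see how much of the sample each flag affects, the flags can be tabulated directly. A minimal sketch, assuming the flags are stored in `WORK.projdata1`:

```sas
/* One-way frequencies for each allocation/imputation flag. */
PROC FREQ DATA=WORK.projdata1;
    TABLES miss_educ miss_earn miss_demo miss_demo_earn / MISSING;
    TITLE "Counts of allocated/imputed records by flag";
RUN;

/* The mean of a 0/1 flag is the share of records that are flagged. */
PROC MEANS DATA=WORK.projdata1 N MEAN MAXDEC=3;
    VAR miss_educ miss_earn miss_demo miss_demo_earn;
    TITLE "Share of records flagged as allocated/imputed";
RUN;
```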
Once the data was cleaned and transformed, a series of SAS procedures were used to examine variable distributions, join datasets, and prepare a finalized dataset for regression analysis.
`PROC MEANS` was used to generate descriptive statistics for `projdata1`, `cps_hh_file`, `TEMP`, `projdata3`, and `e625proj.analysis_data`. Where statistics were requested explicitly, the calls report `N` and `NMISS` and round the output with `MAXDEC=3`.
PROC MEANS DATA=WORK.projdata1;
TITLE "DESCRIPTIVE STATISTICS FOR VARIABLES WITHIN WORK.projdata1";
RUN;
PROC MEANS DATA=e625data.cps_hh_file n nmiss mean min max maxdec=3;
TITLE "DESCRIPTIVE STATISTICS FOR VARIABLES WITHIN E625DATA.CPS_HH_FILE";
PROC MEANS DATA=TEMP;
TITLE "DESCRIPTIVE STATISTICS FOR VARIABLES WITHIN TEMP";
PROC MEANS DATA= work.projdata3;
TITLE "DESCRIPTIVE STATISTICS FOR VARIABLES WITHIN work.projdata3";
RUN;
Datasets were merged on shared identifiers (`hhid`, `PERIDNUM`) to enrich the analysis dataset. `hhfile` and `unit6` were merged by `hhid` to create `TEMP`, which included geographic regional identifiers. Regional indicators (`south`, `northeast`, `midwest`, `west`) were derived from the `GEREG` variable. Median earned income was computed by state (`gestfips`) using `PROC MEANS` and merged into the dataset as `median_income`.
PROC SORT DATA=e625data.cps_hh_file out=hhfile;
BY hhid;
PROC SORT DATA=e625u6.unit6 out=unit6;
BY hhid;
RUN;
DATA temp;
MERGE hhfile unit6;
BY hhid;
south = 0;
northeast = 0;
midwest = 0;
west = 0;
IF GEREG = 1 THEN northeast = 1;
IF GEREG = 2 THEN midwest = 1;
IF GEREG = 3 THEN south = 1;
IF GEREG = 4 THEN west = 1;
LABEL south = 'South Region';
LABEL northeast = 'Northeast Region';
LABEL midwest = 'Midwest Region';
LABEL west = 'West Region';
PROC SORT DATA= temp;
BY gestfips;
* Compute the state-level median first so that medianinc exists before it is merged;
PROC MEANS DATA = temp NOPRINT;
BY gestfips;
OUTPUT OUT = medianinc MEDIAN(earned_income) = median_income;
TITLE "DESCRIPTIVE STATISTICS FOR VARIABLES WITHIN TEMP by gestfips";
RUN;
DATA WORK.projdata2;
MERGE temp medianinc;
BY gestfips;
LABEL median_income = "State Median Income";
RUN;
PROC SORT DATA=work.projdata2 out=projd2;
BY PERIDNUM;
PROC SORT DATA=e625data.cps_additional_variables out=add_vars;
BY PERIDNUM;
PROC SORT DATA=work.projdata1 out=projd1;
BY PERIDNUM;
RUN;
DATA WORK.projdata3;
MERGE projd1 projd2 add_vars;
BY PERIDNUM;
RUN;
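A three-way MERGE silently keeps records that appear in only some of the inputs, so it can help to check how well the files line up. The sketch below uses the `IN=` dataset option for that purpose; the dataset `merge_check` and the `from_*` variables are hypothetical names introduced for illustration.

```sas
/* Flag which input datasets contributed to each PERIDNUM. */
DATA merge_check;
    MERGE projd1   (IN=in_person)
          projd2   (IN=in_hh)
          add_vars (IN=in_add);
    BY PERIDNUM;
    /* IN= variables are temporary, so copy them to keep them in the output. */
    from_person = in_person;
    from_hh     = in_hh;
    from_add    = in_add;
    KEEP PERIDNUM from_person from_hh from_add;
RUN;

/* Rows other than 1/1/1 are records missing from at least one source. */
PROC FREQ DATA=merge_check;
    TABLES from_person*from_hh*from_add / LIST MISSING;
    TITLE "Match pattern across the three merged datasets";
RUN;
```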
Additional variables from `cps_additional_variables` were merged with individual-level data (`projdata1`) and household-level data (`projdata2`) to create a consolidated dataset, `projdata3`.
The final dataset, `e625proj.analysis_data`, keeps the modeling variables together with the missing-data flags (`miss_demo`, `miss_earn`, `miss_demo_earn`) and variables such as `educyrs12` and `median_income`.
DATA e625proj.analysis_data;
SET work.projdata3;
KEEP earned_income educ_eq_hs educ_some_college educ_college educ_ma educ_prof_phd
age18 poor_health female black hispanic asian other citizen married
divorced separated widowed median_income northeast midwest west educyrs educ_cat miss_demo educyrs12 miss_earn miss_demo_earn;
RUN;
`PROC CONTENTS` and `PROC MEANS` were then run to check that the dataset was correctly structured and to review its summary statistics before modeling.
PROC MEANS DATA= e625proj.analysis_data;
TITLE "DESCRIPTIVE STATISTICS FOR FINAL VARIABLES";
PROC CONTENTS DATA= e625proj.analysis_data;
RUN;
These checks were the last step before regression analysis: they confirm which variables the final dataset contains and show the summary statistics for each.
This project estimated multiple regression models to analyze how education and demographic factors affect earned income. The regressions were carefully designed to test for consistency between respondents with and without imputed or allocated data.
Before modeling, a frequency cross-tabulation (`PROC FREQ`) of `miss_demo` and `miss_earn` was produced:
PROC FORMAT;
VALUE missinfo
0 = "Not Allocated/Imputed"
1 = "Allocated/Imputed";
PROC FREQ DATA=e625proj.analysis_data NOPRINT;
TABLE miss_demo*miss_earn / NOCOL NOFREQ NOPERCENT OUT=MissingDataCrossTabulations;
WHERE miss_demo = 1 AND miss_earn = 1;
FORMAT miss_demo missinfo. miss_earn missinfo.;
PROC PRINT DATA=MissingDataCrossTabulations;
TITLE "Table of Cross Tabulations of miss_demo and miss_earn";
RUN;
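The step above restricts the table to records where both flags equal 1. For a view of all four combinations, a variant without the WHERE statement can be sketched as follows (this is an illustrative alternative, not the table used in the original analysis):

```sas
/* Redefine the format so the block is self-contained. */
PROC FORMAT;
    VALUE missinfo
        0 = "Not Allocated/Imputed"
        1 = "Allocated/Imputed";
RUN;

/* Full 2x2 table: counts and overall percentages for every flag combination. */
PROC FREQ DATA=e625proj.analysis_data;
    TABLES miss_demo*miss_earn / NOROW NOCOL MISSING;
    FORMAT miss_demo missinfo. miss_earn missinfo.;
    TITLE "Full cross-tabulation of miss_demo and miss_earn";
RUN;
```

The cell where both flags are 0 gives the size of the primary regression sample, and the cell where both are 1 gives the size of the robustness sample.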
This allowed us to assess how many observations were affected by allocated or imputed demographic and earnings data. The primary regression includes only individuals without any imputed data (miss_demo = 0, miss_earn = 0), so its estimates do not depend on imputed values.
PROC REG DATA= e625proj.analysis_data PLOTS=NONE;
TITLE "Multiple Regression Without Imputed/Allocated Demographic or Earnings Data";
WHERE miss_demo = 0 & miss_earn = 0;
MODEL earned_income = educ_eq_hs educ_some_college educ_college educ_ma educ_prof_phd
age18 poor_health female black hispanic asian other citizen married divorced
separated widowed median_income northeast midwest west;
RUN;
This regression estimates the effect of education (entered as a series of dummy variables, with less than high school as the omitted reference category) and other demographic controls on earned income.
To determine whether all education dummies jointly contribute to the model, a TEST statement was used:
PROC REG DATA= e625proj.analysis_data PLOTS=NONE;
TITLE "Multiple Regression Without Imputed/Allocated Demographic or Earnings Data (T)";
WHERE miss_demo = 0 & miss_earn = 0;
MODEL earned_income = educ_eq_hs educ_some_college educ_college educ_ma educ_prof_phd
age18 poor_health female black hispanic asian other citizen married divorced
separated widowed median_income northeast midwest west;
TEST educ_eq_hs = 0, educ_some_college = 0, educ_college = 0, educ_ma = 0, educ_prof_phd = 0;
RUN;
This F-test evaluates whether the education variables collectively improve the model’s explanatory power. The result was statistically significant (p < 0.0001), confirming that education is a key predictor of income.
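For reference, the joint test requested by the TEST statement is the standard F-test for a set of linear restrictions. A sketch of the statistic, using notation introduced here rather than taken from the SAS output:

$$
F = \frac{(SSE_r - SSE_u)/q}{SSE_u/(n - k - 1)}
$$

Here $SSE_u$ is the error sum of squares from the full model, $SSE_r$ is the error sum of squares from the model with the five education dummies dropped, $q = 5$ is the number of restrictions, $k = 21$ is the number of regressors in the full model, and $n$ is the number of observations used. Under the null hypothesis that all five education coefficients are zero (and the usual classical regression assumptions), $F$ follows an $F(q,\, n - k - 1)$ distribution, so a large value leads to rejecting the null.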
For robustness, the same regression model was re-estimated using only respondents whose demographic and earnings data were both imputed (miss_demo = 1 & miss_earn = 1). Comparing the two sets of estimates helps assess how the treatment of missing data may affect the conclusions.
PROC REG DATA= e625proj.analysis_data PLOTS=NONE;
TITLE "Multiple Regression on Imputed/Allocated Demographic and Earnings Data";
WHERE miss_demo = 1 & miss_earn = 1;
MODEL earned_income = educ_eq_hs educ_some_college educ_college educ_ma educ_prof_phd
age18 poor_health female black hispanic asian other citizen married divorced
separated widowed median_income northeast midwest west;
RUN;
The direction and magnitude of most coefficients remained similar, with small variations in effect sizes, especially for gender, marital status, and education. These differences were documented in the final report.
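One way to line up the two sets of estimates side by side is to capture each regression's parameter table with ODS and merge the results. This is a sketch under stated assumptions: the macro variable `xvars`, the datasets `est_clean`, `est_imputed`, and `coef_compare`, and the renamed columns are hypothetical names introduced here.

```sas
/* Regressor list shared by both models (same variables as above). */
%LET xvars = educ_eq_hs educ_some_college educ_college educ_ma educ_prof_phd
             age18 poor_health female black hispanic asian other citizen married divorced
             separated widowed median_income northeast midwest west;

/* Capture parameter estimates for the sample without imputed data. */
ODS OUTPUT ParameterEstimates=est_clean;
PROC REG DATA=e625proj.analysis_data PLOTS=NONE;
    WHERE miss_demo = 0 & miss_earn = 0;
    MODEL earned_income = &xvars;
RUN;
QUIT;

/* Capture parameter estimates for the sample with imputed data. */
ODS OUTPUT ParameterEstimates=est_imputed;
PROC REG DATA=e625proj.analysis_data PLOTS=NONE;
    WHERE miss_demo = 1 & miss_earn = 1;
    MODEL earned_income = &xvars;
RUN;
QUIT;

PROC SORT DATA=est_clean;   BY Variable; RUN;
PROC SORT DATA=est_imputed; BY Variable; RUN;

/* Merge by regressor name and compute the difference in coefficients. */
DATA coef_compare;
    MERGE est_clean   (KEEP=Variable Estimate StdErr
                       RENAME=(Estimate=b_clean StdErr=se_clean))
          est_imputed (KEEP=Variable Estimate StdErr
                       RENAME=(Estimate=b_imputed StdErr=se_imputed));
    BY Variable;
    diff = b_imputed - b_clean;
RUN;

PROC PRINT DATA=coef_compare NOOBS;
    TITLE "Coefficient comparison: non-imputed vs. imputed samples";
RUN;
```

The `diff` column only describes how far apart the point estimates are; judging whether a difference is statistically meaningful would need a formal test, such as a pooled model with interaction terms.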
The results of this analysis reveal several important insights into the relationship between education, demographics, and earned income:
This analysis confirms the powerful economic impact of educational attainment and highlights how other demographic and regional factors contribute to income inequality in the U.S. labor market. These insights can inform policy discussions around education access, wage equity, and economic opportunity.