Khalil Sakho

View the Project on GitHub khalilsakho/ePortfolio

SAS Project 1 - Impact of Demographic Factors on Earned Income

🔍 Objective

This project explores the empirical relationship between educational attainment and earned income using a detailed dataset of demographic, socioeconomic, and regional variables. The core objective is to determine how different education levels and demographic factors (e.g., age, race, health, citizenship, marital status) influence annual earnings. The analysis uses SAS to generate insights through data visualization, statistical modeling, and regression analysis.

📎 Tools Used

🔧 Data Manipulation and Cleaning

Once the data was cleaned and transformed, a series of SAS procedures was used to examine variable distributions, join datasets, and prepare a finalized dataset for regression analysis.


📑 Summary Statistics

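/* Default statistics (N, mean, std dev, min, max) for the first project dataset */
PROC MEANS DATA=WORK.projdata1;
TITLE "DESCRIPTIVE STATISTICS FOR VARIABLES WITHIN WORK.projdata1";
RUN;

/* Household file: also report missing-value counts, rounded to 3 decimals */
PROC MEANS DATA=e625data.cps_hh_file n nmiss mean min max maxdec=3;
TITLE "DESCRIPTIVE STATISTICS FOR VARIABLES WITHIN E625DATA.CPS_HH_FILE";
RUN;

/* Merged household/unit6 dataset */
PROC MEANS DATA=TEMP;
TITLE "DESCRIPTIVE STATISTICS FOR VARIABLES WITHIN TEMP";
RUN;

/* Final merged dataset */
PROC MEANS DATA= work.projdata3;
TITLE "DESCRIPTIVE STATISTICS FOR VARIABLES WITHIN work.projdata3";
RUN;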

🧩 Dataset Merging and Enhancements

/* Sort both inputs by household ID so they can be merged with a BY statement */
PROC SORT DATA=e625data.cps_hh_file out=hhfile; 
BY hhid; 
RUN;

PROC SORT DATA=e625u6.unit6 out=unit6; 
BY hhid; 
RUN;

/* Merge on household ID and build 0/1 region dummies from GEREG */
DATA temp;
MERGE hhfile unit6;
BY hhid;

	/* Start every dummy at 0, then switch on the one matching GEREG */
	south = 0;
	northeast = 0; 
	midwest = 0; 
	west = 0;
	
	IF GEREG = 1 THEN northeast = 1;
	IF GEREG = 2 THEN midwest = 1;
	IF GEREG = 3 THEN south = 1;
	IF GEREG = 4 THEN west = 1;

	LABEL south = 'South Region';
	LABEL northeast = 'Northeast Region'; 
	LABEL midwest = 'Midwest Region'; 
	LABEL west = 'West Region';
RUN;

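/* Sort by state FIPS code for the BY-group steps below */
PROC SORT DATA= temp;
BY gestfips;
RUN;

/* Compute state median earned income first: medianinc must exist before it is merged */
PROC MEANS DATA = temp NOPRINT;
BY gestfips;
OUTPUT OUT = medianinc MEDIAN(earned_income) = median_income;
TITLE "DESCRIPTIVE STATISTICS FOR VARIABLES WITHIN TEMP by gestfips";
RUN;

/* Attach the state median to every record in that state */
DATA WORK.projdata2;
MERGE temp medianinc;
BY gestfips;
LABEL median_income = "State Median Income";
RUN;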
	
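/* Sort all three inputs by person ID before the final merge */
PROC SORT DATA=work.projdata2 out=projd2; 
BY PERIDNUM; 
RUN;
  
PROC SORT DATA=e625data.cps_additional_variables out=add_vars; 
BY PERIDNUM;
RUN;
  
PROC SORT DATA=work.projdata1 out=projd1; 
BY PERIDNUM;
RUN;

/* For variables shared across inputs, the rightmost dataset's value is kept */
DATA WORK.projdata3;
MERGE projd1 projd2 add_vars;
BY PERIDNUM;
RUN; 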
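
A match-merge silently keeps records that appear in only one input, so it can help to check how many records matched. The following is a minimal sketch for the first merge, not part of the original program; the names merge_check, in_hh, in_u6, and match_status are hypothetical and introduced only for illustration.

```sas
/* Sketch: tag each merged record by which input(s) contributed to it. */
/* Assumes hhfile and unit6 are already sorted by hhid, as above.      */
DATA merge_check;
    MERGE hhfile (IN=in_hh) unit6 (IN=in_u6);
    BY hhid;
    LENGTH match_status $12;
    IF in_hh AND in_u6 THEN match_status = 'both';
    ELSE IF in_hh THEN match_status = 'hhfile only';
    ELSE match_status = 'unit6 only';
RUN;

/* Count records in each match category */
PROC FREQ DATA=merge_check;
    TABLES match_status;
    TITLE "Match Status for the hhfile/unit6 Merge";
RUN;
```

A large count outside the 'both' category would suggest the two files cover different households and that the unmatched records carry missing values into later steps.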

🧮 Final Dataset Construction

This process produced a merged dataset (WORK.projdata3) that combines household, person-level, and state median income variables, and that was summarized with PROC MEANS before regression analysis.

🧪 Regression Models

This project estimated multiple regression models to analyze how education and demographic factors affect earned income. The regressions were carefully designed to test for consistency between respondents with and without imputed or allocated data.


📊 Cross-Tabulation of Missing Data

Before modeling, PROC FREQ was used to cross-tabulate the two allocation flags, miss_demo and miss_earn:

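/* Readable labels for the 0/1 allocation flags */
PROC FORMAT;
VALUE missinfo
	0 = "Not Allocated/Imputed"
	1 = "Allocated/Imputed";
RUN;

/* NOPRINT suppresses the displayed table; counts and percents go to the OUT= dataset */
PROC FREQ DATA=e625proj.analysis_data NOPRINT;
TABLE miss_demo*miss_earn / NOCOL NOFREQ NOPERCENT OUT=MissingDataCrossTabulations;
/* This WHERE keeps only records flagged on both variables, so the output holds a single cell */
WHERE miss_demo = 1 AND miss_earn = 1;
FORMAT miss_demo missinfo. miss_earn missinfo.;  
RUN;

PROC PRINT DATA=MissingDataCrossTabulations;
TITLE "Table of Cross Tabulations of miss_demo and miss_earn";
RUN;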

This shows how many observations had both demographic and earnings data allocated or imputed. Only individuals without any imputed data (miss_demo = 0, miss_earn = 0) were included in the primary regression, so that imputed values do not drive the estimates.
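
To see all four combinations of the two flags rather than a single cell, one option is to run the cross-tabulation without the WHERE restriction. This is a minimal sketch, assuming the same e625proj.analysis_data dataset and the same missinfo format; the title text is illustrative.

```sas
/* Labels for the 0/1 allocation flags (repeated so the sketch is self-contained) */
PROC FORMAT;
    VALUE missinfo
        0 = "Not Allocated/Imputed"
        1 = "Allocated/Imputed";
RUN;

/* Full 2x2 table of counts: no WHERE filter, so every combination appears */
PROC FREQ DATA=e625proj.analysis_data;
    TABLES miss_demo*miss_earn / NOROW NOCOL NOPERCENT;
    FORMAT miss_demo missinfo. miss_earn missinfo.;
    TITLE "Full Cross Tabulation of miss_demo and miss_earn";
RUN;
```

The cell where both flags are 0 gives the sample size of the primary regression, and the cell where both are 1 gives the sample size of the comparison model.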

📈 Primary Regression Model (Without Imputed Data)

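/* Restrict to respondents with no allocated/imputed demographic or earnings data */
PROC REG DATA= e625proj.analysis_data PLOTS=NONE;
TITLE "Multiple Regression Without Imputed/Allocated Demographic or Earnings Data";
WHERE miss_demo = 0 & miss_earn = 0;
/* south is left out of the model, so it is the reference region */
MODEL earned_income = educ_eq_hs educ_some_college educ_college educ_ma educ_prof_phd 
		  age18 poor_health female black hispanic asian other citizen married divorced 
		  separated widowed median_income northeast midwest west;
RUN;
QUIT; /* PROC REG is interactive; QUIT ends the procedure */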

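This regression estimates the effect of education (as a series of dummy variables) and other demographic controls on earned income. Each dummy coefficient is measured relative to its omitted category: the South for the region dummies, and the education level not listed in the model for the education dummies.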
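
Written out, the specification has roughly the following form. This is a sketch; the symbols are notation introduced here, not taken from the original report. $E_{ji}$ stands for the five education dummies and $X_{ki}$ for the remaining sixteen regressors (demographic controls, state median income, and the three region dummies).

```latex
\text{earned\_income}_i
  = \beta_0
  + \sum_{j=1}^{5} \beta_j \, E_{ji}
  + \sum_{k=1}^{16} \gamma_k \, X_{ki}
  + \varepsilon_i
```

Under this notation, each $\beta_j$ is the expected difference in earned income between education level $j$ and the omitted education category, holding the other regressors fixed.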

🧪 Joint Hypothesis Test (Education Variables)

To determine whether all education dummies jointly contribute to the model, a TEST statement was used:

PROC REG DATA= e625proj.analysis_data PLOTS=NONE;
TITLE "Multiple Regression Without Imputed/Allocated Demographic or Earnings Data (T)";
WHERE miss_demo = 0 & miss_earn = 0;
MODEL earned_income = educ_eq_hs educ_some_college educ_college educ_ma educ_prof_phd 
		  age18 poor_health female black hispanic asian other citizen married divorced 
		  separated widowed median_income northeast midwest west;
/* Joint null hypothesis: all five education coefficients equal zero */
TEST educ_eq_hs = 0, educ_some_college = 0, educ_college = 0, educ_ma = 0, educ_prof_phd = 0;
RUN;
QUIT; /* PROC REG is interactive; QUIT ends the procedure */

This F-test evaluates whether the education variables collectively improve the model's explanatory power. The result was statistically significant (p < 0.0001), so the null hypothesis that all five education coefficients are zero is rejected, which supports education as a key predictor of income.
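
For reference, the joint test can be written as a comparison of the restricted model (education dummies dropped) with the unrestricted model. This is the standard textbook form of the F-statistic, shown as a sketch; the symbols $SSE_R$, $SSE_U$, and $n$ are notation introduced here. The model has 21 regressors plus an intercept, so the denominator degrees of freedom are $n - 22$.

```latex
H_0:\ \beta_1 = \beta_2 = \beta_3 = \beta_4 = \beta_5 = 0

F = \frac{(SSE_R - SSE_U)/5}{SSE_U/(n - 22)}
```

Here $SSE_R$ and $SSE_U$ are the error sums of squares of the restricted and unrestricted models and $n$ is the number of observations used. Under $H_0$ and the classical regression assumptions, $F$ follows an $F(5,\, n - 22)$ distribution.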

🧪 Comparison Model (With Imputed Data)

For robustness, the same regression model was re-estimated using only respondents with imputed demographic and earnings data (miss_demo = 1 & miss_earn = 1). This helps assess how missing data treatment may affect conclusions.

/* Same specification, restricted to respondents flagged on both variables */
PROC REG DATA= e625proj.analysis_data PLOTS=NONE;
TITLE "Multiple Regression on Imputed/Allocated Demographic or Earnings Data";
WHERE miss_demo = 1 & miss_earn = 1;
MODEL earned_income = educ_eq_hs educ_some_college educ_college educ_ma educ_prof_phd 
		  age18 poor_health female black hispanic asian other citizen married divorced 
		  separated widowed median_income northeast midwest west;
RUN;
QUIT; /* PROC REG is interactive; QUIT ends the procedure */

The direction and magnitude of most coefficients remained similar, but effect sizes varied slightly, especially for gender, marital status, and education. These differences were documented in the final report.
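
One way to line up the two sets of estimates is to fit the model by group and save the coefficients to a dataset. The sketch below is not the original program: the names reg_sample, imputed, and coef_by_group are hypothetical, and it assumes the same variables as the regressions above.

```sas
/* Keep the two subsamples and flag which one each record belongs to */
DATA reg_sample;
    SET e625proj.analysis_data;
    WHERE (miss_demo = 0 AND miss_earn = 0)
       OR (miss_demo = 1 AND miss_earn = 1);
    imputed = (miss_demo = 1);   /* 1 = both flags set, 0 = neither set */
RUN;

PROC SORT DATA=reg_sample;
    BY imputed;
RUN;

/* Fit the same model in each group; OUTEST= saves one row of estimates per group */
PROC REG DATA=reg_sample OUTEST=coef_by_group NOPRINT;
    BY imputed;
    MODEL earned_income = educ_eq_hs educ_some_college educ_college educ_ma educ_prof_phd
              age18 poor_health female black hispanic asian other citizen married divorced
              separated widowed median_income northeast midwest west;
RUN;
QUIT;

PROC PRINT DATA=coef_by_group;
    TITLE "Coefficient Estimates by Imputation Group";
RUN;
```

Printing the two rows together makes differences in sign or size easier to spot, although it does not by itself test whether those differences are statistically significant.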

📈 Key Findings

The results of this analysis reveal several important insights into the relationship between education, demographics, and earned income:


🎓 Education and Income


👥 Demographic Effects


💡 Other Factors


🧪 Missing Data Sensitivity


📊 Model Performance


📌 Conclusion

This analysis confirms the powerful economic impact of educational attainment and highlights how other demographic and regional factors contribute to income inequality in the U.S. labor market. These insights can inform policy discussions around education access, wage equity, and economic opportunity.