.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "packages/statistics/auto_examples/plot_wage_education_gender.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end `
        to download the full example code.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_packages_statistics_auto_examples_plot_wage_education_gender.py:


Test for an education/gender interaction in wages
==================================================

Wages depend mostly on education. Here we investigate how this
dependence is related to gender: not only does gender create an offset
in wages, it also seems that wages increase more with education for
males than for females.

Does our data support this last hypothesis? We will test this using
statsmodels' formulas
(http://statsmodels.sourceforge.net/stable/example_formulas.html).

.. GENERATED FROM PYTHON SOURCE LINES 17-18

Load and massage the data

.. GENERATED FROM PYTHON SOURCE LINES 18-53

.. code-block:: Python

    import os
    import urllib.request

    import numpy as np
    import pandas

    if not os.path.exists("wages.txt"):
        # Download the file if it is not present
        url = "http://lib.stat.cmu.edu/datasets/CPS_85_Wages"
        with urllib.request.urlopen(url) as r, open("wages.txt", "wb") as f:
            f.write(r.read())

    # EDUCATION: Number of years of education
    # SEX: 1=Female, 0=Male
    # WAGE: Wage (dollars per hour)
    data = pandas.read_csv(
        "wages.txt",
        skiprows=27,
        skipfooter=6,
        sep=None,
        header=None,
        names=["education", "gender", "wage"],
        usecols=[0, 2, 5],
        # skipfooter is only supported by the 'python' engine; requesting it
        # explicitly avoids the ParserWarning about falling back from 'c'
        engine="python",
    )

    # Convert genders to strings (this is particularly useful so that the
    # statsmodels formulas detect that gender is a categorical variable)
    data["gender"] = np.choose(data.gender, ["male", "female"])

    # Log-transform the wages, because wages typically change by
    # multiplicative factors
    data["wage"] = np.log10(data["wage"])

.. GENERATED FROM PYTHON SOURCE LINES 54-55

Simple plotting

.. GENERATED FROM PYTHON SOURCE LINES 55-61

.. code-block:: Python

    import seaborn

    # Plot 2 linear fits, one for males and one for females.
    seaborn.lmplot(y="wage", x="education", hue="gender", data=data)

.. image-sg:: /packages/statistics/auto_examples/images/sphx_glr_plot_wage_education_gender_001.png
   :alt: plot wage education gender
   :srcset: /packages/statistics/auto_examples/images/sphx_glr_plot_wage_education_gender_001.png
   :class: sphx-glr-single-img

.. GENERATED FROM PYTHON SOURCE LINES 62-63

Statistical analysis

.. GENERATED FROM PYTHON SOURCE LINES 63-72

.. code-block:: Python

    import statsmodels.formula.api as sm

    # Note that this model is not the one plotted above: it is a single
    # joint model for males and females, not separate models for each
    # gender. A single model enables statistical testing.
    result = sm.ols(formula="wage ~ education + gender", data=data).fit()
    print(result.summary())

.. rst-class:: sphx-glr-script-out

.. code-block:: none

                                OLS Regression Results                            
    ==============================================================================
    Dep. Variable:                   wage   R-squared:                       0.193
    Model:                            OLS   Adj. R-squared:                  0.190
    Method:                 Least Squares   F-statistic:                     63.42
    Date:                Mon, 17 Nov 2025   Prob (F-statistic):           2.01e-25
    Time:                        00:19:38   Log-Likelihood:                 86.654
    No. Observations:                 534   AIC:                            -167.3
    Df Residuals:                     531   BIC:                            -154.5
    Df Model:                           2                                         
    Covariance Type:            nonrobust                                         
    ==================================================================================
                         coef    std err          t      P>|t|      [0.025      0.975]
    ----------------------------------------------------------------------------------
    Intercept          0.4053      0.046      8.732      0.000       0.314       0.496
    gender[T.male]     0.1008      0.018      5.625      0.000       0.066       0.136
    education          0.0334      0.003      9.768      0.000       0.027       0.040
    ==============================================================================
    Omnibus:                        4.675   Durbin-Watson:                   1.792
    Prob(Omnibus):                  0.097   Jarque-Bera (JB):                4.876
    Skew:                          -0.147   Prob(JB):                       0.0873
    Kurtosis:                       3.365   Cond. No.                         69.7
    ==============================================================================

    Notes:
    [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

.. GENERATED FROM PYTHON SOURCE LINES 73-77

The plot above suggests that there is not only a different offset in
wage but also a different slope.

We need to model this using an interaction.

.. GENERATED FROM PYTHON SOURCE LINES 77-83

.. code-block:: Python

    # In the formula language, "education * gender" already expands to
    # "education + gender + education:gender", so listing the main effects
    # explicitly is redundant but harmless.
    result = sm.ols(
        formula="wage ~ education + gender + education * gender", data=data
    ).fit()
    print(result.summary())

.. rst-class:: sphx-glr-script-out

.. code-block:: none

                                OLS Regression Results                            
    ==============================================================================
    Dep. Variable:                   wage   R-squared:                       0.198
    Model:                            OLS   Adj. R-squared:                  0.194
    Method:                 Least Squares   F-statistic:                     43.72
    Date:                Mon, 17 Nov 2025   Prob (F-statistic):           2.94e-25
    Time:                        00:19:38   Log-Likelihood:                 88.503
    No. Observations:                 534   AIC:                            -169.0
    Df Residuals:                     530   BIC:                            -151.9
    Df Model:                           3                                         
    Covariance Type:            nonrobust                                         
    ============================================================================================
                                   coef    std err          t      P>|t|      [0.025      0.975]
    --------------------------------------------------------------------------------------------
    Intercept                    0.2998      0.072      4.173      0.000       0.159       0.441
    gender[T.male]               0.2750      0.093      2.972      0.003       0.093       0.457
    education                    0.0415      0.005      7.647      0.000       0.031       0.052
    education:gender[T.male]    -0.0134      0.007     -1.919      0.056      -0.027       0.000
    ==============================================================================
    Omnibus:                        4.838   Durbin-Watson:                   1.825
    Prob(Omnibus):                  0.089   Jarque-Bera (JB):                5.000
    Skew:                          -0.156   Prob(JB):                       0.0821
    Kurtosis:                       3.356   Cond. No.                         194.
    ==============================================================================

    Notes:
    [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

.. GENERATED FROM PYTHON SOURCE LINES 84-87

Looking at the p-value of the interaction of gender and education, the
data does not support the hypothesis that education benefits males more
than females: the interaction term is not significant at the 5% level
(p = 0.056 > 0.05), and its estimated coefficient (-0.0134) is negative.

.. GENERATED FROM PYTHON SOURCE LINES 87-92

.. code-block:: Python

    import matplotlib.pyplot as plt

    plt.show()

.. rst-class:: sphx-glr-timing

   **Total running time of the script:** (0 minutes 0.320 seconds)


.. _sphx_glr_download_packages_statistics_auto_examples_plot_wage_education_gender.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: plot_wage_education_gender.ipynb `

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: plot_wage_education_gender.py `

    .. container:: sphx-glr-download sphx-glr-download-zip

      :download:`Download zipped: plot_wage_education_gender.zip `


.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery `_
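As a supplementary sketch (not part of the original example), the interaction model used in the statistical analysis can be unpacked by hand to see what the ``education:gender[T.male]`` coefficient measures. The code below uses synthetic stand-in data with made-up coefficients, not the CPS wages, and plain NumPy least squares instead of statsmodels. The variable names (``male``, ``log_wage``, ``X``, ``beta``) are illustrative assumptions. It builds a design matrix whose four columns mirror the four terms of the fitted model and compares the result with two separate per-group line fits.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data (NOT the CPS wages): years of education,
# a 0/1 male indicator, and a noisy log-wage with made-up coefficients.
n = 200
education = rng.integers(8, 19, size=n).astype(float)
male = rng.integers(0, 2, size=n).astype(float)
log_wage = (
    0.3
    + 0.25 * male
    + 0.04 * education
    - 0.01 * education * male
    + rng.normal(0, 0.1, size=n)
)

# Columns mirror the terms of the interaction model:
# Intercept, gender[T.male], education, education:gender[T.male]
X = np.column_stack([np.ones(n), male, education, education * male])
beta, *_ = np.linalg.lstsq(X, log_wage, rcond=None)

# Separate straight-line fits per group; with deg=1, np.polyfit
# returns (slope, intercept).
is_male = male == 1
slope_f, intercept_f = np.polyfit(education[~is_male], log_wage[~is_male], 1)
slope_m, intercept_m = np.polyfit(education[is_male], log_wage[is_male], 1)

print("interaction coefficient:         ", beta[3])
print("slope difference (male - female):", slope_m - slope_f)
```

Because the joint model has a separate intercept and slope for each group, its least-squares solution coincides with the two separate fits: the interaction coefficient equals the difference between the male and female slopes. Testing whether that single coefficient is zero is therefore a test of whether the slopes differ, which is what the p-value of the interaction term reports.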