.. DO NOT EDIT. .. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY. .. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE: .. "packages/statistics/auto_examples/plot_airfare.py" .. LINE NUMBERS ARE GIVEN BELOW. .. only:: html .. note:: :class: sphx-glr-download-link-note :ref:`Go to the end ` to download the full example code. .. rst-class:: sphx-glr-example-title .. _sphx_glr_packages_statistics_auto_examples_plot_airfare.py: Air fares before and after 9/11 ===================================== This is a business-intelligence (BI) like application. What is interesting here is that we may want to study fares as a function of the year, paired accordingly to the trips, or forgetting the year, only as a function of the trip endpoints. Using statsmodels' linear models, we find that both with an OLS (ordinary least square) and a robust fit, the intercept and the slope are significantly non-zero: the air fares have decreased between 2000 and 2001, and their dependence on distance travelled has also decreased .. GENERATED FROM PYTHON SOURCE LINES 17-21 .. code-block:: Python # Standard library imports import os .. GENERATED FROM PYTHON SOURCE LINES 22-23 Load the data .. GENERATED FROM PYTHON SOURCE LINES 23-61 .. code-block:: Python import pandas import requests if not os.path.exists("airfares.txt"): # Download the file if it is not present r = requests.get( "https://users.stat.ufl.edu/~winner/data/airq4.dat", verify=False, # Wouldn't normally do this, but this site's certificate # is not yet distributed ) with open("airfares.txt", "wb") as f: f.write(r.content) # As a separator, ' +' is a regular expression that means 'one of more # space' data = pandas.read_csv( "airfares.txt", delim_whitespace=True, header=0, names=[ "city1", "city2", "pop1", "pop2", "dist", "fare_2000", "nb_passengers_2000", "fare_2001", "nb_passengers_2001", ], ) # we log-transform the number of passengers import numpy as np data["nb_passengers_2000"] = np.log10(data["nb_passengers_2000"]) data["nb_passengers_2001"] = np.log10(data["nb_passengers_2001"]) .. rst-class:: sphx-glr-script-out .. code-block:: none /home/docs/checkouts/readthedocs.org/user_builds/scientific-python-lectures-ja/checkouts/latest/scientific-python-lectures/packages/statistics/examples/plot_airfare.py:38: FutureWarning: The 'delim_whitespace' keyword in pd.read_csv is deprecated and will be removed in a future version. Use ``sep='\s+'`` instead data = pandas.read_csv( .. GENERATED FROM PYTHON SOURCE LINES 62-63 Make a dataframe with the year as an attribute, instead of separate columns .. GENERATED FROM PYTHON SOURCE LINES 63-93 .. code-block:: Python # This involves a small danse in which we separate the dataframes in 2, # one for year 2000, and one for 2001, before concatenating again. # Make an index of each flight data_flat = data.reset_index() data_2000 = data_flat[ ["city1", "city2", "pop1", "pop2", "dist", "fare_2000", "nb_passengers_2000"] ] # Rename the columns data_2000.columns = pandas.Index( ["city1", "city2", "pop1", "pop2", "dist", "fare", "nb_passengers"] ) # Add a column with the year data_2000.insert(0, "year", 2000) data_2001 = data_flat[ ["city1", "city2", "pop1", "pop2", "dist", "fare_2001", "nb_passengers_2001"] ] # Rename the columns data_2001.columns = pandas.Index( ["city1", "city2", "pop1", "pop2", "dist", "fare", "nb_passengers"] ) # Add a column with the year data_2001.insert(0, "year", 2001) data_flat = pandas.concat([data_2000, data_2001]) .. GENERATED FROM PYTHON SOURCE LINES 94-95 Plot scatter matrices highlighting different aspects .. GENERATED FROM PYTHON SOURCE LINES 95-112 .. code-block:: Python import seaborn seaborn.pairplot( data_flat, vars=["fare", "dist", "nb_passengers"], kind="reg", markers="." ) # A second plot, to show the effect of the year (ie the 9/11 effect) seaborn.pairplot( data_flat, vars=["fare", "dist", "nb_passengers"], kind="reg", hue="year", markers=".", ) .. rst-class:: sphx-glr-horizontal * .. image-sg:: /packages/statistics/auto_examples/images/sphx_glr_plot_airfare_001.png :alt: plot airfare :srcset: /packages/statistics/auto_examples/images/sphx_glr_plot_airfare_001.png :class: sphx-glr-multi-img * .. image-sg:: /packages/statistics/auto_examples/images/sphx_glr_plot_airfare_002.png :alt: plot airfare :srcset: /packages/statistics/auto_examples/images/sphx_glr_plot_airfare_002.png :class: sphx-glr-multi-img .. rst-class:: sphx-glr-script-out .. code-block:: none .. GENERATED FROM PYTHON SOURCE LINES 113-114 Plot the difference in fare .. GENERATED FROM PYTHON SOURCE LINES 114-128 .. code-block:: Python import matplotlib.pyplot as plt plt.figure(figsize=(5, 2)) seaborn.boxplot(data.fare_2001 - data.fare_2000) plt.title("Fare: 2001 - 2000") plt.subplots_adjust() plt.figure(figsize=(5, 2)) seaborn.boxplot(data.nb_passengers_2001 - data.nb_passengers_2000) plt.title("NB passengers: 2001 - 2000") plt.subplots_adjust() .. rst-class:: sphx-glr-horizontal * .. image-sg:: /packages/statistics/auto_examples/images/sphx_glr_plot_airfare_003.png :alt: Fare: 2001 - 2000 :srcset: /packages/statistics/auto_examples/images/sphx_glr_plot_airfare_003.png :class: sphx-glr-multi-img * .. image-sg:: /packages/statistics/auto_examples/images/sphx_glr_plot_airfare_004.png :alt: NB passengers: 2001 - 2000 :srcset: /packages/statistics/auto_examples/images/sphx_glr_plot_airfare_004.png :class: sphx-glr-multi-img .. GENERATED FROM PYTHON SOURCE LINES 129-131 Statistical testing: dependence of fare on distance and number of passengers .. GENERATED FROM PYTHON SOURCE LINES 131-141 .. code-block:: Python import statsmodels.formula.api as sm result = sm.ols(formula="fare ~ 1 + dist + nb_passengers", data=data_flat).fit() print(result.summary()) # Using a robust fit result = sm.rlm(formula="fare ~ 1 + dist + nb_passengers", data=data_flat).fit() print(result.summary()) .. rst-class:: sphx-glr-script-out .. code-block:: none OLS Regression Results ============================================================================== Dep. Variable: fare R-squared: 0.275 Model: OLS Adj. R-squared: 0.275 Method: Least Squares F-statistic: 1585. Date: Mon, 17 Nov 2025 Prob (F-statistic): 0.00 Time: 00:19:49 Log-Likelihood: -45532. No. Observations: 8352 AIC: 9.107e+04 Df Residuals: 8349 BIC: 9.109e+04 Df Model: 2 Covariance Type: nonrobust ================================================================================= coef std err t P>|t| [0.025 0.975] --------------------------------------------------------------------------------- Intercept 211.2428 2.466 85.669 0.000 206.409 216.076 dist 0.0484 0.001 48.149 0.000 0.046 0.050 nb_passengers -32.8925 1.127 -29.191 0.000 -35.101 -30.684 ============================================================================== Omnibus: 604.051 Durbin-Watson: 1.446 Prob(Omnibus): 0.000 Jarque-Bera (JB): 740.733 Skew: 0.710 Prob(JB): 1.42e-161 Kurtosis: 3.338 Cond. No. 5.23e+03 ============================================================================== Notes: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified. [2] The condition number is large, 5.23e+03. This might indicate that there are strong multicollinearity or other numerical problems. Robust linear Model Regression Results ============================================================================== Dep. Variable: fare No. Observations: 8352 Model: RLM Df Residuals: 8349 Method: IRLS Df Model: 2 Norm: HuberT Scale Est.: mad Cov Type: H1 Date: Mon, 17 Nov 2025 Time: 00:19:49 No. Iterations: 12 ================================================================================= coef std err z P>|z| [0.025 0.975] --------------------------------------------------------------------------------- Intercept 215.0848 2.448 87.856 0.000 210.287 219.883 dist 0.0460 0.001 46.166 0.000 0.044 0.048 nb_passengers -35.2686 1.119 -31.526 0.000 -37.461 -33.076 ================================================================================= If the model instance has been used for another fit with different fit parameters, then the fit options might not be the correct ones anymore . .. GENERATED FROM PYTHON SOURCE LINES 142-143 Statistical testing: regression of fare on distance: 2001/2000 difference .. GENERATED FROM PYTHON SOURCE LINES 143-152 .. code-block:: Python result = sm.ols(formula="fare_2001 - fare_2000 ~ 1 + dist", data=data).fit() print(result.summary()) # Plot the corresponding regression data["fare_difference"] = data["fare_2001"] - data["fare_2000"] seaborn.lmplot(x="dist", y="fare_difference", data=data) plt.show() .. image-sg:: /packages/statistics/auto_examples/images/sphx_glr_plot_airfare_005.png :alt: plot airfare :srcset: /packages/statistics/auto_examples/images/sphx_glr_plot_airfare_005.png :class: sphx-glr-single-img .. rst-class:: sphx-glr-script-out .. code-block:: none OLS Regression Results ============================================================================== Dep. Variable: fare_2001 R-squared: 0.159 Model: OLS Adj. R-squared: 0.159 Method: Least Squares F-statistic: 791.7 Date: Mon, 17 Nov 2025 Prob (F-statistic): 1.20e-159 Time: 00:19:49 Log-Likelihood: -22640. No. Observations: 4176 AIC: 4.528e+04 Df Residuals: 4174 BIC: 4.530e+04 Df Model: 1 Covariance Type: nonrobust ============================================================================== coef std err t P>|t| [0.025 0.975] ------------------------------------------------------------------------------ Intercept 148.0279 1.673 88.480 0.000 144.748 151.308 dist 0.0388 0.001 28.136 0.000 0.036 0.041 ============================================================================== Omnibus: 136.558 Durbin-Watson: 1.544 Prob(Omnibus): 0.000 Jarque-Bera (JB): 149.624 Skew: 0.462 Prob(JB): 3.23e-33 Kurtosis: 2.920 Cond. No. 2.40e+03 ============================================================================== Notes: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified. [2] The condition number is large, 2.4e+03. This might indicate that there are strong multicollinearity or other numerical problems. .. rst-class:: sphx-glr-timing **Total running time of the script:** (0 minutes 5.108 seconds) .. _sphx_glr_download_packages_statistics_auto_examples_plot_airfare.py: .. only:: html .. container:: sphx-glr-footer sphx-glr-footer-example .. container:: sphx-glr-download sphx-glr-download-jupyter :download:`Download Jupyter notebook: plot_airfare.ipynb ` .. container:: sphx-glr-download sphx-glr-download-python :download:`Download Python source code: plot_airfare.py ` .. container:: sphx-glr-download sphx-glr-download-zip :download:`Download zipped: plot_airfare.zip ` .. only:: html .. rst-class:: sphx-glr-signature `Gallery generated by Sphinx-Gallery `_