Commit 09aa6f9

bpo-38490: statistics: Add covariance, Pearson's correlation, and simple linear regression (#16813)
Co-authored-by: Tymoteusz Wołodźko <twolodzko+gitkraken@gmail.com>
1 parent 172c0f2 commit 09aa6f9

File tree

6 files changed: +326 -1 lines changed

Doc/library/statistics.rst

Lines changed: 103 additions & 0 deletions
@@ -68,6 +68,17 @@ tends to deviate from the typical or average values.
:func:`variance`         Sample variance of data.
=======================  =============================================

Statistics for relations between two inputs
-------------------------------------------

These functions calculate statistics regarding relations between two inputs.

=========================  =====================================================
:func:`covariance`         Sample covariance for two variables.
:func:`correlation`        Pearson's correlation coefficient for two variables.
:func:`linear_regression`  Intercept and slope for simple linear regression.
=========================  =====================================================


Function details
----------------
@@ -566,6 +577,98 @@ However, for reading convenience, most of the examples show sorted sequences.

.. versionadded:: 3.8

.. function:: covariance(x, y, /)

   Return the sample covariance of two inputs *x* and *y*.  Covariance
   is a measure of the joint variability of two inputs.

   Both inputs must be of the same length (no less than two), otherwise
   :exc:`StatisticsError` is raised.

   Examples:

   .. doctest::

      >>> x = [1, 2, 3, 4, 5, 6, 7, 8, 9]
      >>> y = [1, 2, 3, 1, 2, 3, 1, 2, 3]
      >>> covariance(x, y)
      0.75
      >>> z = [9, 8, 7, 6, 5, 4, 3, 2, 1]
      >>> covariance(x, z)
      -7.5
      >>> covariance(z, x)
      -7.5

   .. versionadded:: 3.10

.. function:: correlation(x, y, /)

   Return the `Pearson's correlation coefficient
   <https://en.wikipedia.org/wiki/Pearson_correlation_coefficient>`_
   for two inputs.  Pearson's correlation coefficient *r* takes values
   between -1 and +1.  It measures the strength and direction of the
   linear relationship, where +1 means a very strong, positive linear
   relationship, -1 a very strong, negative linear relationship, and 0
   no linear relationship.

   Both inputs must be of the same length (no less than two), and
   neither can be constant, otherwise :exc:`StatisticsError` is raised.

   Examples:

   .. doctest::

      >>> x = [1, 2, 3, 4, 5, 6, 7, 8, 9]
      >>> y = [9, 8, 7, 6, 5, 4, 3, 2, 1]
      >>> correlation(x, x)
      1.0
      >>> correlation(x, y)
      -1.0

   .. versionadded:: 3.10

.. function:: linear_regression(regressor, dependent_variable, /)

   Return the intercept and slope of `simple linear regression
   <https://en.wikipedia.org/wiki/Simple_linear_regression>`_
   parameters estimated using ordinary least squares.  Simple linear
   regression describes the relationship between *regressor* and
   *dependent_variable* in terms of a linear function:

   *dependent_variable = intercept + slope \* regressor + noise*

   where ``intercept`` and ``slope`` are the regression parameters that
   are estimated, and the noise term is an unobserved random variable
   accounting for the variability in the data not explained by the
   linear regression (it is equal to the difference between the
   predicted and the actual values of the dependent variable).

   Both inputs must be of the same length (no less than two), and the
   *regressor* cannot be constant, otherwise :exc:`StatisticsError` is
   raised.

   For example, if we took data on the `release dates of the Monty
   Python films <https://en.wikipedia.org/wiki/Monty_Python#Films>`_ and
   used it to predict the cumulative number of Monty Python films
   produced, we could predict the number of films they would have made
   by the year 2019, assuming they had kept up the pace.

   .. doctest::

      >>> year = [1971, 1975, 1979, 1982, 1983]
      >>> films_total = [1, 2, 3, 4, 5]
      >>> intercept, slope = linear_regression(year, films_total)
      >>> round(intercept + slope * 2019)
      16

   We could also use it to "predict" how many Monty Python films existed
   when Brian Cohen was born.

   .. doctest::

      >>> round(intercept + slope * 1)
      -610

   .. versionadded:: 3.10

Exceptions
----------

Doc/whatsnew/3.10.rst

Lines changed: 8 additions & 0 deletions
@@ -1051,6 +1051,14 @@ The :mod:`shelve` module now uses :data:`pickle.DEFAULT_PROTOCOL` by default
instead of :mod:`pickle` protocol ``3`` when creating shelves.
(Contributed by Zackery Spytz in :issue:`34204`.)

statistics
----------

Added :func:`~statistics.covariance`, Pearson's
:func:`~statistics.correlation`, and simple
:func:`~statistics.linear_regression` functions.
(Contributed by Tymoteusz Wołodźko in :issue:`38490`.)

site
----

Lib/statistics.py

Lines changed: 135 additions & 1 deletion
@@ -73,6 +73,30 @@
2.5

Statistics for relations between two inputs
-------------------------------------------

==================  ====================================================
Function            Description
==================  ====================================================
covariance          Sample covariance for two variables.
correlation         Pearson's correlation coefficient for two variables.
linear_regression   Intercept and slope for simple linear regression.
==================  ====================================================

Calculate covariance, Pearson's correlation, and simple linear regression
for two inputs:

>>> x = [1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> y = [1, 2, 3, 1, 2, 3, 1, 2, 3]
>>> covariance(x, y)
0.75
>>> correlation(x, y)  #doctest: +ELLIPSIS
0.31622776601...
>>> linear_regression(x, y)
LinearRegression(intercept=1.5, slope=0.1)

Exceptions
----------
@@ -98,6 +122,9 @@
    'quantiles',
    'stdev',
    'variance',
    'correlation',
    'covariance',
    'linear_regression',
]

import math
@@ -110,7 +137,7 @@
from bisect import bisect_left, bisect_right
from math import hypot, sqrt, fabs, exp, erf, tau, log, fsum
from operator import itemgetter
-from collections import Counter
+from collections import Counter, namedtuple

# === Exceptions ===

@@ -826,6 +853,113 @@ def pstdev(data, mu=None):
    return math.sqrt(var)


# === Statistics for relations between two inputs ===

# See https://en.wikipedia.org/wiki/Covariance
#     https://en.wikipedia.org/wiki/Pearson_correlation_coefficient
#     https://en.wikipedia.org/wiki/Simple_linear_regression


def covariance(x, y, /):
    """Covariance

    Return the sample covariance of two inputs *x* and *y*. Covariance
    is a measure of the joint variability of two inputs.

    >>> x = [1, 2, 3, 4, 5, 6, 7, 8, 9]
    >>> y = [1, 2, 3, 1, 2, 3, 1, 2, 3]
    >>> covariance(x, y)
    0.75
    >>> z = [9, 8, 7, 6, 5, 4, 3, 2, 1]
    >>> covariance(x, z)
    -7.5
    >>> covariance(z, x)
    -7.5

    """
    n = len(x)
    if len(y) != n:
        raise StatisticsError('covariance requires that both inputs have same number of data points')
    if n < 2:
        raise StatisticsError('covariance requires at least two data points')
    xbar = mean(x)
    ybar = mean(y)
    total = fsum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    return total / (n - 1)


def correlation(x, y, /):
    """Pearson's correlation coefficient

    Return the Pearson's correlation coefficient for two inputs. Pearson's
    correlation coefficient *r* takes values between -1 and +1. It measures
    the strength and direction of the linear relationship, where +1 means a
    very strong, positive linear relationship, -1 a very strong, negative
    linear relationship, and 0 no linear relationship.

    >>> x = [1, 2, 3, 4, 5, 6, 7, 8, 9]
    >>> y = [9, 8, 7, 6, 5, 4, 3, 2, 1]
    >>> correlation(x, x)
    1.0
    >>> correlation(x, y)
    -1.0

    """
    n = len(x)
    if len(y) != n:
        raise StatisticsError('correlation requires that both inputs have same number of data points')
    if n < 2:
        raise StatisticsError('correlation requires at least two data points')
    cov = covariance(x, y)
    stdx = stdev(x)
    stdy = stdev(y)
    try:
        return cov / (stdx * stdy)
    except ZeroDivisionError:
        raise StatisticsError('at least one of the inputs is constant')


LinearRegression = namedtuple('LinearRegression', ['intercept', 'slope'])


def linear_regression(regressor, dependent_variable, /):
    """Intercept and slope for simple linear regression

    Return the intercept and slope of simple linear regression
    parameters estimated using ordinary least squares. Simple linear
    regression describes the relationship between *regressor* and
    *dependent_variable* in terms of a linear function::

        dependent_variable = intercept + slope * regressor + noise

    where ``intercept`` and ``slope`` are the regression parameters that
    are estimated, and the noise term is an unobserved random variable
    accounting for the variability in the data not explained by the
    linear regression (it is equal to the difference between the
    predicted and the actual values of the dependent variable).

    The parameters are returned as a named tuple.

    >>> regressor = [1, 2, 3, 4, 5]
    >>> noise = NormalDist().samples(5, seed=42)
    >>> dependent_variable = [2 + 3 * regressor[i] + noise[i] for i in range(5)]
    >>> linear_regression(regressor, dependent_variable)  #doctest: +ELLIPSIS
    LinearRegression(intercept=1.75684970486..., slope=3.09078914170...)

    """
    n = len(regressor)
    if len(dependent_variable) != n:
        raise StatisticsError('linear regression requires that both inputs have same number of data points')
    if n < 2:
        raise StatisticsError('linear regression requires at least two data points')
    try:
        slope = covariance(regressor, dependent_variable) / variance(regressor)
    except ZeroDivisionError:
        raise StatisticsError('regressor is constant')
    intercept = mean(dependent_variable) - slope * mean(regressor)
    return LinearRegression(intercept=intercept, slope=slope)

## Normal Distribution #####################################################

Lib/test/test_statistics.py

Lines changed: 78 additions & 0 deletions
@@ -2407,6 +2407,84 @@ def test_error_cases(self):
        quantiles([10, None, 30], n=4)  # data is non-numeric


class TestBivariateStatistics(unittest.TestCase):

    def test_unequal_size_error(self):
        for x, y in [
            ([1, 2, 3], [1, 2]),
            ([1, 2], [1, 2, 3]),
        ]:
            with self.assertRaises(statistics.StatisticsError):
                statistics.covariance(x, y)
            with self.assertRaises(statistics.StatisticsError):
                statistics.correlation(x, y)
            with self.assertRaises(statistics.StatisticsError):
                statistics.linear_regression(x, y)

    def test_small_sample_error(self):
        for x, y in [
            ([], []),
            ([], [1, 2,]),
            ([1, 2,], []),
            ([1,], [1,]),
            ([1,], [1, 2,]),
            ([1, 2,], [1,]),
        ]:
            with self.assertRaises(statistics.StatisticsError):
                statistics.covariance(x, y)
            with self.assertRaises(statistics.StatisticsError):
                statistics.correlation(x, y)
            with self.assertRaises(statistics.StatisticsError):
                statistics.linear_regression(x, y)


class TestCorrelationAndCovariance(unittest.TestCase):

    def test_results(self):
        for x, y, result in [
            ([1, 2, 3], [1, 2, 3], 1),
            ([1, 2, 3], [-1, -2, -3], -1),
            ([1, 2, 3], [3, 2, 1], -1),
            ([1, 2, 3], [1, 2, 1], 0),
            ([1, 2, 3], [1, 3, 2], 0.5),
        ]:
            self.assertAlmostEqual(statistics.correlation(x, y), result)
            self.assertAlmostEqual(statistics.covariance(x, y), result)

    def test_different_scales(self):
        x = [1, 2, 3]
        y = [10, 30, 20]
        self.assertAlmostEqual(statistics.correlation(x, y), 0.5)
        self.assertAlmostEqual(statistics.covariance(x, y), 5)

        y = [.1, .2, .3]
        self.assertAlmostEqual(statistics.correlation(x, y), 1)
        self.assertAlmostEqual(statistics.covariance(x, y), 0.1)


class TestLinearRegression(unittest.TestCase):

    def test_constant_input_error(self):
        x = [1, 1, 1,]
        y = [1, 2, 3,]
        with self.assertRaises(statistics.StatisticsError):
            statistics.linear_regression(x, y)

    def test_results(self):
        for x, y, true_intercept, true_slope in [
            ([1, 2, 3], [0, 0, 0], 0, 0),
            ([1, 2, 3], [1, 2, 3], 0, 1),
            ([1, 2, 3], [100, 100, 100], 100, 0),
            ([1, 2, 3], [12, 14, 16], 10, 2),
            ([1, 2, 3], [-1, -2, -3], 0, -1),
            ([1, 2, 3], [21, 22, 23], 20, 1),
            ([1, 2, 3], [5.1, 5.2, 5.3], 5, 0.1),
        ]:
            intercept, slope = statistics.linear_regression(x, y)
            self.assertAlmostEqual(intercept, true_intercept)
            self.assertAlmostEqual(slope, true_slope)


class TestNormalDist:

    # General note on precision: The pdf(), cdf(), and overlap() methods

Misc/ACKS

Lines changed: 1 addition & 0 deletions
@@ -1927,6 +1927,7 @@ David Wolever
Klaus-Juergen Wolf
Dan Wolfe
Richard Wolff
+Tymoteusz Wołodźko
Adam Woodbeck
William Woodruff
Steven Work
Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
Covariance, Pearson's correlation, and simple linear regression functionality was added to the statistics module. Patch by Tymoteusz Wołodźko.
