B. Linear Regressions
So you want to fit a straight line to a set of measurements... First,
make sure you really want to do this. That is, see if you can convince
yourself that a plot of your data, a series of $(x_i, y_i)$ pairs, is
compatible with a linear model
\begin{displaymath}
y = mx + b
\end{displaymath}
(83)
where $m$ is the slope, and $b$ is the intercept.
Fig. 11 shows a plot of a set of data that does not
appear to be linear. We could apply a linear fit to this data, but it
is unclear what the results would mean.
Figure 11:
An example of a set of data which does
not appear to be compatible with a linear model. Instead, the slope
of the graph appears to increase with $x$.
A sample set of data compatible with a linear model is given in
Table 1 and plotted in Fig. 12. Note that
none of the data points fall along the straight line shown in the
figure. Measurements compatible with a linear model generally do not
all fall exactly on a single straight line, because measurements
include uncertainties. Although the data do not fall on a single line,
the diamonds in Fig. 12 appear to be scattered randomly
about a common line.
Figure 12:
The data (diamonds) and the best linear fit
(dashed line).
Table 1:
Data which appear to be compatible with a
linear model.
| x values | y values |
| --- | --- |
| 0 | 7.28 |
| 1 | 7.51 |
| 2 | 7.16 |
| 3 | 7.96 |
| 4 | 8.64 |
| 5 | 9.46 |
| 6 | 8.39 |
| 7 | 9.14 |
Once you are convinced that your data are compatible with a linear
model, it is reasonable to apply the method of least squares,
also called a linear regression, to your data. This is a
method of finding the slope and intercept, with associated
uncertainties, of the line giving the "best fit" to your data. The
method is implemented as a function in the Excel
spreadsheet and as an analysis tool in Logger Pro. Detailed
instructions for using them are given below.
Figure 13:
The solid vertical bars are the
differences $y_i - (mx_i + b)$
between the data and the linear
fit. The sum of the squares of these differences is minimized by the
fitting procedure.
A derivation of the method of least squares is beyond the scope of
this course,1 and we won't need to use the associated
equations, since the method is automated in the software tools we
use. Instead of working through a derivation, we will consider
the general idea behind the process. In Fig. 13, the
solid bars show the differences
\begin{displaymath}
y_i - (mx_i + b)
\end{displaymath}
(84)
between the data and the fit function (Eq. 83). The method
of least squares gives the line which minimizes the sum of the squares
of these differences,
\begin{displaymath}
\sum_i [y_i - (mx_i + b)]^2
\end{displaymath}
(85)
These are the "least squares" referred to in the name of the
method. Try sketching a different line on Fig. 13,
and add your own "difference bars," and you should be able to
convince yourself that they would give a larger value of the sum in
Eq. 85.
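The sketching exercise can also be tried numerically. The following is a minimal sketch, not part of the course software: it evaluates the sum in Eq. 85 for the Table 1 data, first for the least squares line and then for a few nearby trial lines. The closed-form expressions for the slope and intercept are the standard textbook ones, and the trial offsets are arbitrary values chosen for illustration.

```python
# Table 1 data from the text.
x = [0, 1, 2, 3, 4, 5, 6, 7]
y = [7.28, 7.51, 7.16, 7.96, 8.64, 9.46, 8.39, 9.14]


def sum_sq(m, b):
    """Sum of squared differences between the data and y = m*x + b (Eq. 85)."""
    return sum((yi - (m * xi + b)) ** 2 for xi, yi in zip(x, y))


# Least squares slope and intercept from the standard closed-form expressions.
n = len(x)
x_mean = sum(x) / n
y_mean = sum(y) / n
sxx = sum((xi - x_mean) ** 2 for xi in x)
sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
m_fit = sxy / sxx
b_fit = y_mean - m_fit * x_mean
best = sum_sq(m_fit, b_fit)

# Arbitrary trial lines near the best fit (offsets are illustrative only).
trials = [
    (m_fit + dm, b_fit + db)
    for dm in (-0.1, 0.0, 0.1)
    for db in (-0.3, 0.0, 0.3)
    if (dm, db) != (0.0, 0.0)
]

print(f"best fit: m = {m_fit:.3f}, b = {b_fit:.3f}, sum = {best:.3f}")
for m, b in trials:
    print(f"trial:    m = {m:.3f}, b = {b:.3f}, sum = {sum_sq(m, b):.3f}")
```

Every trial line should give a larger sum than the best fit line, which is the defining property of the least squares fit.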
The method of least squares is analogous to calculating the mean and
the associated standard deviation of a set of measurements of a single
quantity. The uncertainties in the measurements are assumed to be
random, and the resulting slope and intercept are estimates of the
most probable true values. The dashed line shown in
Fig. 12 is the result of a least squares fit to the
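data in Table 1. The fit results are the quantities
$m \pm \sigma_m$, $b \pm \sigma_b$, $\sigma_y$, and $r$,
where $m$ is the slope, $b$ is the intercept, $\sigma_y$ is an
estimate of the uncertainty of individual $y$ measurements (the
square root of the mean square error, also called the standard
deviation of the y estimate), and $r$ is the correlation coefficient.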
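For readers who want to see where such fit results come from, here is a minimal sketch that computes them for the Table 1 data in plain Python. It assumes the standard closed-form least squares expressions, which the text does not derive. The symbol names `sigma_m` and `sigma_b` for the slope and intercept uncertainties are assumptions, chosen by analogy with $\sigma_y$ in Eq. 86.

```python
import math

# Table 1 data from the text.
x = [0, 1, 2, 3, 4, 5, 6, 7]
y = [7.28, 7.51, 7.16, 7.96, 8.64, 9.46, 8.39, 9.14]

n = len(x)
x_mean = sum(x) / n
y_mean = sum(y) / n
sxx = sum((xi - x_mean) ** 2 for xi in x)
syy = sum((yi - y_mean) ** 2 for yi in y)
sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))

# Slope and intercept of the least squares line.
m = sxy / sxx
b = y_mean - m * x_mean

# Standard deviation of the y estimate (Eq. 86, N - 2 degrees of freedom).
ss_res = sum((yi - (m * xi + b)) ** 2 for xi, yi in zip(x, y))
sigma_y = math.sqrt(ss_res / (n - 2))

# Standard deviations of the slope and intercept (standard formulas).
sigma_m = sigma_y / math.sqrt(sxx)
sigma_b = sigma_y * math.sqrt(sum(xi ** 2 for xi in x) / (n * sxx))

# Correlation coefficient.
r = sxy / math.sqrt(sxx * syy)

print(f"m = {m:.3f} +/- {sigma_m:.3f}")
print(f"b = {b:.3f} +/- {sigma_b:.3f}")
print(f"sigma_y = {sigma_y:.3f}")
print(f"r = {r:.3f}")
```

The number of digits printed here is for checking the arithmetic only. When reporting results, round the uncertainties to one or two significant figures and the fit values to match.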
Figure 14:
Error bars on the individual
measurements are $\pm\sigma_y$. The shaded region
reflects the uncertainty ranges $m \pm \sigma_m$
and $b \pm \sigma_b$ of the
best fit slope and intercept.
- $\sigma_m$ and $\sigma_b$:
The uncertainties $\sigma_m$ and $\sigma_b$ in the slope and intercept
are standard deviations. Hence, with a large number of measurements,
the fit values of the slope and intercept fall within $\pm\sigma_m$
and $\pm\sigma_b$ of the true values with a probability of 68%, and it is
95% probable that the true values fall within $m \pm 2\sigma_m$
and $b \pm 2\sigma_b$. The shaded region in Fig. 14 reflects the
uncertainty ranges $m \pm \sigma_m$ and $b \pm \sigma_b$.
- $\sigma_y$:
The mean square error is calculated via
\begin{displaymath}
\sigma_y^2 = \frac{1}{N-2} \sum_i [y_i - (mx_i + b)]^2
\end{displaymath}
(86)
and is analogous to the standard deviation (squared) of a distribution
of repeated measurements. Error bars equal to $\pm\sigma_y$ are shown in
Fig. 14. Note that not all of the measurements agree with
the best fit line within uncertainty. By the definition of the
standard deviation, we only expect 68% of the measurements to come
within $\sigma_y$ of the best fit line.
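The 68% expectation can be checked against the Table 1 data. This minimal sketch applies Eq. 86 and then counts how many points lie within $\sigma_y$ of the fit line. The slope and intercept come from the standard closed-form least squares expressions, which are assumed here rather than taken from the text.

```python
import math

# Table 1 data from the text.
x = [0, 1, 2, 3, 4, 5, 6, 7]
y = [7.28, 7.51, 7.16, 7.96, 8.64, 9.46, 8.39, 9.14]

n = len(x)
x_mean = sum(x) / n
y_mean = sum(y) / n
sxx = sum((xi - x_mean) ** 2 for xi in x)
sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
m = sxy / sxx
b = y_mean - m * x_mean

# Differences between the data and the fit line (Eq. 84).
residuals = [yi - (m * xi + b) for xi, yi in zip(x, y)]

# Eq. 86: mean square error with N - 2 in the denominator.
sigma_y = math.sqrt(sum(d ** 2 for d in residuals) / (n - 2))

# Count the points whose error bar of +/- sigma_y reaches the fit line.
within = sum(1 for d in residuals if abs(d) <= sigma_y)

print(f"sigma_y = {sigma_y:.2f}")
print(f"{within} of {n} points lie within sigma_y of the fit line")
```

For this data set the count is 5 of 8 points, or about 63%. That is reasonably close to 68% for so small a sample.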
Finally, the correlation coefficient $r$ is a measure of the degree of
correlation between $x$ and $y$ values. Typically, we are interested
in $r^2$, called the coefficient of determination. It falls
in the range $0 \le r^2 \le 1$, and describes the fraction of the
variation in $y$ values explained by the linear model. It follows
that the quantity $1 - r^2$ is the fraction of the variation we can
attribute to the uncertainties in the measurements, provided the
differences between the data and the linear model are purely random.
In the example given here, about 72% of the variation in the data is
explained by the linear model and about 28% of the variation is random.
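The 72% and 28% figures can be reproduced from Table 1. The sketch below assumes the usual decomposition of the variation in $y$ into a part left over after the fit (`ss_res`) and the total variation about the mean (`ss_tot`). These variable names are illustrative and do not appear in the text.

```python
# Table 1 data from the text.
x = [0, 1, 2, 3, 4, 5, 6, 7]
y = [7.28, 7.51, 7.16, 7.96, 8.64, 9.46, 8.39, 9.14]

n = len(x)
x_mean = sum(x) / n
y_mean = sum(y) / n
sxx = sum((xi - x_mean) ** 2 for xi in x)
sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
m = sxy / sxx
b = y_mean - m * x_mean

# Total variation of y about its mean, and the part left over after the fit.
ss_tot = sum((yi - y_mean) ** 2 for yi in y)
ss_res = sum((yi - (m * xi + b)) ** 2 for xi, yi in zip(x, y))

r_squared = 1 - ss_res / ss_tot

print(f"explained by the linear model: {100 * r_squared:.0f}%")
print(f"left as random variation:      {100 * (1 - r_squared):.0f}%")
```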
If the data fall precisely on the fit line, $\sigma_y = 0$, and
$r^2$ is exactly 1. If the data are completely uncorrelated, $r^2$ is close
to zero. There will generally be a clear correlation between
measurements and models in the laboratory work for this course, so we
will not be particularly concerned with $r^2$ values. If you are given $r$
instead of $r^2$, the range of possible values is
$-1 \le r \le 1$. Negative $r$
values correspond to negative correlations (negative
slopes).
Caution! The correlation coefficient will not help you to
identify data that are incompatible with a linear model. In fact,
the least squares fit to the data shown in Fig. 11 yields a value of
$r^2$ which indicates a stronger correlation than that
of the linear data of Table 1 and Fig. 12.
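The caution can be illustrated with a hypothetical example. The curved data set below ($y = x^2$) is invented for this sketch and is not the data plotted in Fig. 11. A straight-line fit to it still gives a larger $r^2$ than the Table 1 data, while the signs of the differences between data and fit show a systematic pattern rather than random scatter.

```python
def linear_fit(x, y):
    """Return slope, intercept, and r^2 of the least squares line."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    syy = sum((yi - y_mean) ** 2 for yi in y)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    m = sxy / sxx
    b = y_mean - m * x_mean
    return m, b, sxy ** 2 / (sxx * syy)


x = [0, 1, 2, 3, 4, 5, 6, 7]

# Table 1 data from the text (compatible with a linear model).
y_linear = [7.28, 7.51, 7.16, 7.96, 8.64, 9.46, 8.39, 9.14]

# Hypothetical curved data, y = x^2 (NOT the data of Fig. 11).
y_curved = [xi ** 2 for xi in x]

_, _, r2_linear = linear_fit(x, y_linear)
m_c, b_c, r2_curved = linear_fit(x, y_curved)

# Differences between the curved data and its straight-line fit.
residuals_curved = [yi - (m_c * xi + b_c) for xi, yi in zip(x, y_curved)]
signs = "".join("+" if d > 0 else "-" for d in residuals_curved)

print(f"r^2 (Table 1 data): {r2_linear:.2f}")
print(f"r^2 (curved data):  {r2_curved:.2f}")
print(f"signs of the differences for the curved data: {signs}")
```

The differences for the curved data are positive at both ends and negative in the middle. A plot of the data or of these differences reveals that pattern, but $r^2$ alone does not.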
Note: If we were working with actual data, the slope and
intercept would have physical meaning, and we would report
$m \pm \sigma_m$ and $b \pm \sigma_b$ with appropriate units. We would also
report the error estimate for individual $y$
measurements, $\sigma_y$, with units.
In some cases, you may have a reason (a theoretical model, e.g.) to
test whether your data are consistent with a linear model with a $y$
intercept $b$ of zero. If a linear fit to your data yields a standard
deviation of the $y$ intercept, $\sigma_b$, greater than the magnitude
of $b$ itself, then
you may conclude that your data are consistent with a zero $y$
intercept.
If your theoretical model predicts a zero $y$ intercept,
and you find that a linear fit yields an intercept consistent
within uncertainty with zero, you may want to perform a fit in which
you fix the value of $b$ at zero in order to find a best value of the
slope compatible with your model. Instructions for performing linear
fits with $b$ fixed at zero with Logger Pro and
Excel are included below.
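As a rough illustration of this two-step procedure, the sketch below uses a small hypothetical data set, invented for this example. It first checks whether the fitted intercept is consistent with zero, then refits with $b$ fixed at zero. It assumes the standard result that the least squares slope of a line through the origin is $\sum_i x_i y_i / \sum_i x_i^2$.

```python
import math

# Hypothetical measurements, invented for illustration only.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(x)
x_mean = sum(x) / n
y_mean = sum(y) / n
sxx = sum((xi - x_mean) ** 2 for xi in x)
sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))

# Step 1: ordinary linear fit, y = m*x + b, with intercept uncertainty.
m = sxy / sxx
b = y_mean - m * x_mean
ss_res = sum((yi - (m * xi + b)) ** 2 for xi, yi in zip(x, y))
sigma_y = math.sqrt(ss_res / (n - 2))
sigma_b = sigma_y * math.sqrt(sum(xi ** 2 for xi in x) / (n * sxx))

consistent_with_zero = sigma_b > abs(b)
print(f"b = {b:.2f} +/- {sigma_b:.2f}")
print(f"consistent with zero intercept: {consistent_with_zero}")

# Step 2: if so, refit with the intercept fixed at zero, y = m0*x.
if consistent_with_zero:
    m0 = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi ** 2 for xi in x)
    print(f"slope with b fixed at zero: m0 = {m0:.3f}")
```

The second step is the same kind of fit as the Proportional form in Logger Pro, or linest() with its first logical argument set to false.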
- The data you wish to fit must be in the Logger Pro data
table and displayed in the graph window.
- If you want to fit a subset of the data shown on the graph,
select a rectangular area in the graph window containing the points
you want to fit by dragging the mouse.
- Click on Analyze -> Automatic Curve Fit ..., and select
the Linear form ($y = mx + b$). Make sure the Perform Fit
On: box is set to the desired quantity. Then, click on Try Fit.
- Logger Pro gives its own labels to the fit results, which
correspond to the quantities in our notation.
Note that you are given $r$ instead of $r^2$.
- To fix the intercept ($b$) to zero, after you click on
Analyze -> Automatic Curve Fit ..., choose the
Proportional form ($y = Ax$) instead of the Linear form
($y = mx + b$).
- First, enter your data into the Excel spreadsheet. The data should be
arranged in two columns, $x$ values in one column and $y$ values in
another.
- Select a group of 6 cells, two columns wide by three rows high,
for the fit results.
- In the formula bar type
=linest(<y-cells>, <x-cells>, true, true)
where <y-cells> is the range of cells containing your y
values and <x-cells> is the range of cells containing your x
values. The two "true" entries tell the linest function to
allow the intercept $b$ to be non-zero and to report fit statistics
$\sigma_m$, $\sigma_b$, $r^2$, and $\sigma_y$.
- Hit <F2> followed by <CTRL> <SHIFT> <ENTER>,
and fit results will appear in the six cells as follows
| $m$ | $b$ |
| $\sigma_m$ | $\sigma_b$ |
| $r^2$ | $\sigma_y$ |
- To fix the intercept ($b$) to zero, set the first logical
argument of the linest() function to false. That is,
in the formula bar, type
=linest(<y-cells>, <x-cells>, false, true)
Copyright © 2003-2007, Lewis A. Riley
Updated Tue Nov 30 13:48:34 2004
This work is licensed under a Creative Commons License.