 Princeton University
Physics Department
PHY 209 main page

## Overview

This week, you will write functions that find and plot the best-fit line to a set of data having measurement uncertainties. The data you will look at are the distances to, and recession velocities of, distant galaxies. The slope of the linear relationship is known as Hubble's constant, H0.

The goals today are:

• Read text data from a file into Python numpy arrays. Plot the various Hubble data.
• Look at two data sets (nearby galaxies and cosmologically distant galaxies). Find the best-fit line for a set of data, and estimate the uncertainties on the best slope and intercept.
• Compute and understand the chi-squared statistic for a fit. Compute the chi-squared probability and make a judgment about the goodness of fit (i.e., whether the model of a line agrees with the data).

## Prelab

### To do before Lab on October 12

See just the pre-lab section for handy printing.

1. Read Lupton pages 50-52 (handed out in class). It's about one philosophical justification for using maximum-likelihood estimators. To wit: that within Bayesian statistics, maximizing the likelihood of seeing our data is equivalent to finding the most likely model.
2. Numerical Recipes, sections 15.0 (ps), 15.1 (ps), and 15.2 (ps).
3. Read the material in Physics/math background.
4. Review of Particle Properties: Big-Bang Cosmology, sections 19.0-19.2. This is fairly technical, so just skim through and don't worry if you don't follow everything, but take note of Figure 19.1, which shows the modern Hubble's Law data from supernovae.

On Python, I suggest that you read or skim Chapter 9 sections 1-2 and Chapter 27 (or Chapter 32 in the 4th edition) in Learning Python, 3rd Edition. They are "Tuples, Files and Everything Else" and "Exception Basics", respectively. Remember that we'll be reading files and testing for exceptions this week. (Link requires an institutional subscription. That is, you must either be on campus or set up a VPN to campus.)

#### Problems

Use the line-fitting formulas from seminar to try a simple data set:
{x} = {0, 1, 2, 3, 4}
{y} = {4, 2, 2, 2, 0}
{δy} = {1, 1, 1, 1, 1} (i.e., all σ = 1)
1. Making a plot (by hand or with Python), it should be clear roughly what line is the best fit. Do the sums and find the maximum-likelihood values of (α,β). Are they as expected? Find the uncertainty in each.
2. Find chi-squared. How many degrees of freedom are there? Is your chi-squared consistent with being equal to the number of degrees of freedom, plus or minus sqrt(2 DOF)?
3. Find chi-squared again, but this time don't fit a line. Instead, assume the simpler model that the data all have the same, unknown value plus errors. That is, ignore {x}. What is chi-squared for the model "all data are expected to equal 2"? You'll find that this model is a little unlikely but not overwhelmingly so.
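If you want to check your hand arithmetic with Python, a direct transcription of the standard weighted-sums formulas looks like this (here `m` is the slope and `b` the intercept; your seminar notes may assign (α,β) in either order):

```python
import numpy

x = numpy.array([0., 1., 2., 3., 4.])
y = numpy.array([4., 2., 2., 2., 0.])
dy = numpy.ones(5)                    # all errors are 1

w = 1.0 / dy**2                       # weights
S, Sx, Sy = w.sum(), (w * x).sum(), (w * y).sum()
Sxx, Sxy = (w * x * x).sum(), (w * x * y).sum()
Delta = S * Sxx - Sx**2

m = (S * Sxy - Sx * Sy) / Delta       # best-fit slope
b = (Sxx * Sy - Sx * Sxy) / Delta     # best-fit intercept
dm = numpy.sqrt(S / Delta)            # uncertainty on the slope
db = numpy.sqrt(Sxx / Delta)          # uncertainty on the intercept
```

For this data set the sums give m = -0.8 and b = 3.6; you should recover the same numbers by hand.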

## Project

#### Overview

An overview of the lab's main projects this week:
1. Read data from a file and plot them.
2. Find the best-fit line. Estimate the uncertainties on slope and intercept.
3. Use the chi-squared test for whether a fit is good.
4. (optional) Repeat the fitting many times with fake (random) data to see how the fit results vary.

#### Reference

Here are several of the functions I used in doing this week's project. I list them here so you don't have to waste the entire session trying to find the function you need. Also see the PHY 209 references.
```
numpy.array           convert list or tuple to a vector of data
numpy.sum             sum of a vector
numpy.average         mean of a vector
numpy.fabs            absolute value
numpy.mean            mean of a vector
numpy.std             standard deviation
pylab.plot            2-d scatter or line plots
pylab.errorbar        2-d scatter plots with error bars
pylab.subplot         multiple plots on one window
numpy.random.normal   array of random values with normal (Gaussian) distribution
scipy.stats.chisqprob Chi-squared probability to exceed (the scipy module should be installed on the lab machines and feynman)
```

#### Projects

Try to do at least the first 3 of these 4 projects. Here they are in much more detail.

1. Read text data from a file and plot them.
The first step in the present work is not directly related to the statistics problem at hand, but it is quite important: you have to learn how to read in a text file. There are two files you can use: hubble_low_z.dat and hubble.dat. The files are lists of supernovae with 3 facts about each: the redshift z, which for our purposes is the recession velocity divided by c (this is true only to leading order in v/c); the distance d in Mpc; and the measurement error on distance, delta-d (also in Mpc). The Mpc, or megaparsec, is the typical distance unit in cosmology, equal to 3,261,560 light years.

1. Reading text files in Python is pretty simple.
• "Open" the file for reading.
• Go through all the lines, splitting them up into their columns.
• Convert the results into numpy arrays.
Here is most of my program that reads these into arrays named `z`, `d`, and `derr`. Notice that we have to build the data as lists and convert to arrays later, because lists have an `append` method but arrays don't. (Lists can grow element by element, but a numpy array has a fixed size once created.)
```
from numpy import *

datafile = open('hubble_low_z.dat', 'r')  # 2nd argument says open for reading
for line in datafile:                # Iterating over a file goes line-by-line
    (a, b, c) = line.split()         # split() breaks text at whitespace

    # Somehow have to keep a record of a,b,c???
    # Arrays can't be built element-by-element.
    # But lists can...

datafile.close()                     # Close the data file
```
Try to add the missing pieces so that you end up with numpy arrays named z, d, and derr containing the data. If you are stuck, refer to the Python reading.

2. Before you go on, use your solution to the previous part to create a function `read_hubble_data(filename)`. Have it return a "tuple" of the arrays. (That is, use `return (z, d, derr)` as the last line of the function. The parentheses are optional in that statement.) Tuples are like lists, but they are both more efficient and more annoying: you can't modify them, you may only replace them with others (again, this is what Python calls an "immutable" object). Your function should accept an argument `filename` that tells it what data file to read. Save this function in a file (using emacs), to which we will soon add the other features of this project. By doing this as a function, you have encapsulated the things you've solved so far and made them easier to use as a component in a complete solution to the best-fit problem. You can check how Joe did it.
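For comparison, here is one possible shape for such a function (a sketch, not Joe's version; your structure may reasonably differ):

```python
import numpy

def read_hubble_data(filename):
    """Read a 3-column text file: redshift, distance (Mpc), distance error.
    Returns a tuple of numpy arrays (z, d, derr)."""
    z_list, d_list, derr_list = [], [], []
    datafile = open(filename, 'r')
    for line in datafile:
        (a, b, c) = line.split()      # three whitespace-separated columns
        z_list.append(float(a))       # split() yields strings; convert
        d_list.append(float(b))
        derr_list.append(float(c))
    datafile.close()
    return (numpy.array(z_list), numpy.array(d_list), numpy.array(derr_list))
```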

3. What if you gave it a filename that doesn't exist? Modify your function so that it gracefully handles this error. It might return without loading anything, or perhaps it asks the user for a filename using `file=raw_input('What file contains the data? ')` or perhaps it uses a default filename. In any of these choices, use the exception handling pair `try: ... except:`. Here's an unrelated example of the syntax:
```
list = [1, 3, 5, 7]
n = int(raw_input('What list item do you want? '))
# Now do something that might produce an error.  Don't be afraid!
try:
    print 'list[', n, ']=', list[n]
except IndexError:
    print 'list has no element n=', n
```
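Applied to the file-reading problem, one option looks like this sketch; the filenames are placeholders, and `IOError` is what `open` raises for a missing file (in Python 3 it is an alias of `OSError`):

```python
# Sketch: fall back to a default file if the requested one is missing.
# The filenames here are only placeholders for illustration.
def open_data_file(filename, default='hubble_low_z.dat'):
    try:
        return open(filename, 'r')
    except IOError:
        print('Could not open %r; falling back to %r' % (filename, default))
        return open(default, 'r')
```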

4. Once this works, plot distance as a function of redshift. Reviewing the use of our `pylab` plotting package (grab sample_plot.py):
```
# Notice three new features this week
# 1. plotting errorbars instead of just points or lines
# 2. plotting a function (any valid python or numpy expression depending on x)
# 3. drawing to a file.
# -Joe  Oct 8, 2007

from pylab import *
errorbar(z, d, derr, ecolor='r', fmt='o')  # Error bars. Circles at points. No line
xlabel('Redshift z')                   # Axis title and overall title
ylabel('Distance (Mpc)')
title('Supernova Hubble Diagram')
plot(z, distance(z), color='green')    # Overplot a theory curve (distance(z) here
                                       # stands for whatever model function you define)
savefig('sample_hubble.png')
```
Does it look like real experimental data with uncertainties but clustered near a line? Let's hope so. If so, you are in good shape. Notice how the uncertainties in distance increase for more distant galaxies. Good thing then that we are using line-fit results that permit different uncertainties on each point!

2. Find the best-fit line for a set of data, and estimate the uncertainties on the best slope and intercept.
The math is laid out in the seminar summary sheet. At this point, it would pay to sketch out a program design. For this problem, it can be quite short, possibly a brief list of functions along with a quick description of what they will do. Do you want to find the slope and intercept with a single function, or do you want to have a separate one first to do all the sums over the data? Do you pass in 3 arrays `x, y, dy` as the data? Do you want the program to allow `dy` to be a single number (implying that all data have the same absolute error)? Today, this is just an exercise. But for bigger projects, the design phase is critical (so it's a good habit to build). There is no right answer, or rather there are many, many right answers (and only practice will help you to learn which right answers are most efficient and helpful to you).

All designed? Good. Now try writing the function(s) you have planned. You can jump right in and test it on real data. In my experience, physicists usually do this. But a wise and patient person will run the brand-new program on some well-understood data first, data whose best fit is totally unambiguous. That is, we need to do some tests. You should think about what tests would convince you that your fitter works. Just a few is appropriate for this kind of program. After you ponder this and write down notes on your ideas, you can look at my ideas for tests (and note on introducing classes).

Write some tests as separate functions `test1()`, `test2()` and so forth. Have each test make a report of the sums over the data, the best-fit intercept and slope and their errors. In fact, it would be quite useful to have a function `fit_report()` that prints such things in a standard format. Did you include a fit-reporter in your original design? If not, add one now, and run your tests. Did they work as expected?
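To make the design questions concrete, here is one possible shape (a sketch using the standard weighted least-squares sums, not the only right answer):

```python
import numpy

def fit_line(x, y, dy):
    """Weighted least-squares fit of y = m*x + b.
    Returns (m, b, dm, db): slope, intercept, and their uncertainties.
    dy may be an array or a single number."""
    x = numpy.asarray(x, dtype=float)
    y = numpy.asarray(y, dtype=float)
    w = numpy.ones_like(x) / numpy.asarray(dy, dtype=float)**2  # weights 1/sigma^2
    S, Sx, Sy = w.sum(), (w * x).sum(), (w * y).sum()
    Sxx, Sxy = (w * x * x).sum(), (w * x * y).sum()
    Delta = S * Sxx - Sx**2
    m = (S * Sxy - Sx * Sy) / Delta
    b = (Sxx * Sy - Sx * Sxy) / Delta
    return (m, b, numpy.sqrt(S / Delta), numpy.sqrt(Sxx / Delta))

def fit_report(m, b, dm, db):
    """Print a fit in a standard format."""
    print('slope     = %10.4g +- %.2g' % (m, dm))
    print('intercept = %10.4g +- %.2g' % (b, db))
```

Note that passing `dy` as a single number works automatically here, because numpy broadcasts the scalar across the array.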

First fit the `hubble_low_z.dat` file. This contains 18 nearby supernovae (only out to redshift z ~ 0.1), where the Hubble relation is found to work well. Though the Hubble law should have zero intercept, we will fit a line

d(z) = a z + b
as a test of how well the law works. Just using your judgment, how good is the fit? (We'll save the chi-squared test for the next section.) What is the Hubble constant? Note that the Hubble constant is computed as c/a, where c is the speed of light (299,800 km/s) and a is the slope you fit for. Why did we fit for the inverse of the Hubble constant? Because the distance data are far more uncertain than the velocity data! This makes fitting d(z) more straightforward than z(d). The generally accepted value for the Hubble constant is approximately 70 km per second per Mpc. Is your intercept b consistent with zero, as expected? Use the formula for the uncertainty in b to check. What is your error on the Hubble constant? Use the formula for the uncertainty in a, and keep in mind that a and H0 = c/a have the same fractional uncertainty.
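As a numerical sanity check on the conversion (the slope and its error below are hypothetical stand-ins, not the answer to the fit):

```python
c = 299800.0         # speed of light, km/s
a = 4283.0           # hypothetical fitted slope of d(z) = a*z + b, in Mpc
da = 110.0           # hypothetical uncertainty on that slope, in Mpc

H0 = c / a           # Hubble constant in km/s per Mpc; about 70 here
dH0 = H0 * (da / a)  # H0 inherits the slope's fractional uncertainty
```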

Make a plot of your raw data (as points with errorbars) and your best-fit line. Use the `pylab.savefig('filename.png')` function for your plot to save it as a PNG. You can similarly save as a PostScript file if you give it a name like `savefig('filename.ps')`. In principle, `pylab` can do EPS, JPEG, PDF, PNG, PS, and SVG, though you might need to "switch your pylab backend" to do some of these.

Let's not fit to the higher-z supernovae yet. The idea is that they are not going to fit well to a line, so we need first to study the chi-squared statistic for goodness-of-fit.

3. Compute and understand the chi-squared of a fit.

Compute the chi-squared for your fit to the low-z Hubble data. How large is it? There are (18-2) degrees of freedom: 18 data values, minus two degrees of freedom that were "spent" on fitting the line (since the line has two free parameters). So we expect chisq to be roughly 16 ± 6. (The chi-squared distribution has expected value of r and variance 2r when there are r degrees of freedom.) Do you get a value less than around 20 to 25? That is, is this a good fit?
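Before trusting your chi-squared code on real data, it is worth checking it against the prelab numbers (the prelab data set has best-fit slope -0.8 and intercept 3.6):

```python
import numpy

x = numpy.array([0., 1., 2., 3., 4.])      # the prelab data set again
y = numpy.array([4., 2., 2., 2., 0.])
dy = numpy.ones(5)
m, b = -0.8, 3.6                           # its best-fit slope and intercept
chisq = numpy.sum(((y - (m * x + b)) / dy) ** 2)
# chisq = 1.6, with 5 - 2 = 3 degrees of freedom
```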

Now fit `hubble.dat`, the full set, which goes out to z ~ 0.8. How large is chi-squared? Compare this to the 60-2 degrees of freedom. It is illustrative to plot the fit you got from the low-z data alone over the full set. Astronomers expected a deviation from the Hubble law, but not in this direction! It appears that the expansion of the universe was slower in the past; that is, the expansion is accelerating.

For both data sets, you can compute the "probability to exceed" for the chi-squared you found (given the number of degrees of freedom). To be sure you are using it right, check that you find a probability of 65.94% for the chisq=1.6, df=3 case that corresponds to the prelab work. The scipy module has a function that computes the chi-squared probability to exceed. Use `from scipy.stats import chisqprob; help(chisqprob)`. For previous years when scipy wasn't available, Joe wrote a program that does the same thing; see chisqprob.py (colorized or plain). However, you shouldn't need to use this code. You recognize a really bad fit by the fact that the data would be wildly improbable if the fit were right. A probability to exceed above roughly 1% to 5% usually suggests a good fit; one far below that, a bad one.
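If `chisqprob` is not available (newer scipy releases removed it; there `scipy.stats.chi2.sf(chisq, dof)` computes the same quantity), a probability-to-exceed function in the spirit of Joe's chisqprob.py can be sketched from the regularized incomplete gamma function, since PTE(chisq, dof) = Q(dof/2, chisq/2). This follows the standard series / continued-fraction recipe and is not Joe's code:

```python
import math

def _gamma_series(a, x, itmax=200, eps=3e-9):
    """Lower regularized incomplete gamma P(a, x), by series expansion."""
    ap = a
    term = total = 1.0 / a
    for _ in range(itmax):
        ap += 1.0
        term *= x / ap
        total += term
        if abs(term) < abs(total) * eps:
            break
    return total * math.exp(-x + a * math.log(x) - math.lgamma(a))

def _gamma_cf(a, x, itmax=200, eps=3e-9):
    """Upper regularized incomplete gamma Q(a, x), by continued fraction."""
    tiny = 1e-300
    b = x + 1.0 - a
    c = 1.0 / tiny
    d = 1.0 / b
    h = d
    for i in range(1, itmax + 1):
        an = -i * (i - a)
        b += 2.0
        d = an * d + b
        if abs(d) < tiny:
            d = tiny
        c = b + an / c
        if abs(c) < tiny:
            c = tiny
        d = 1.0 / d
        delta = d * c
        h *= delta
        if abs(delta - 1.0) < eps:
            break
    return h * math.exp(-x + a * math.log(x) - math.lgamma(a))

def chisq_prob(chisq, dof):
    """Probability that chi-squared with dof degrees of freedom exceeds chisq."""
    a, x = 0.5 * dof, 0.5 * chisq
    if x <= 0.0:
        return 1.0
    if x < a + 1.0:
        return 1.0 - _gamma_series(a, x)   # series converges quickly here
    return _gamma_cf(a, x)                 # continued fraction is better here
```

As a check, `chisq_prob(1.6, 3)` reproduces the 65.94% figure above.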

Aside on the cosmology: We have made a few oversimplifications for these data. First, astronomers correct for a variation in the supernova intrinsic brightness of order 30%, based on the "light curve", that is, how long the supernova explosion lasts. For reasons not well understood, we know that brighter supernovae last longer (it has to do with variations in their nickel content, but this itself is not understood). Second, we have converted the raw data (apparent brightness) into an estimate of distance, assuming identical intrinsic brightness for all Type Ia supernovae. Finally, we've ignored the question of whether we should expect a straight line for d(z) or d(v), given that v = cz is only true for small z. The point here is that we can detect curvature in the relationship, using statistical methods.

4. (Optional extension) Repeat it all with fake data to explore the fit uncertainties.
In the handout, we give exact expressions for the uncertainties in the slope and intercept of a linear fit, meaning the standard deviation you'd expect if you did many, many fits to equivalent data sets. We also discussed the chi-squared distribution. By repeating your fit many, many times on simulated data sets ("fake data"), you can test the accuracy of these expressions in this case.

Generate fake Hubble data N times and fit each data set. (N can probably be around 10,000 or more, depending on your computer and your patience.) To make the fake data, use the given list of independent variables (redshifts or velocities). At each redshift, use some standard line (e.g. the one with Hubble constant H0 = 70 km/s per Mpc) to make a theoretical value of the distance. To this idealized value, add a Gaussian (normal) deviate with zero mean and the appropriate standard deviation (i.e. matching the stated error on that data point). Check out `help(numpy.random.normal)`. Keep track of the slope and intercept from each fit (e.g., build a Python list).

Check whether the mean and standard deviation of the slope and intercept agree with the computed values. What does a plot of a versus b look like? Are the values anticorrelated? What are the means and standard deviations of the two? What did you expect? Make a histogram of their values (see pylab's `hist` function).

Also, for each fit, compute chi-squared. Make a histogram of its values. Does it have the expected mean and standard deviation? (They should be d and sqrt(2d) for d degrees of freedom.) Compute the probability to exceed, and histogram its values. What should that distribution look like?
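A compressed sketch of the whole exercise, for the slope only (the redshifts and errors below are made-up stand-ins for the real `z` and `derr`, and the weighted-fit sums are inlined rather than calling your own fitter):

```python
import numpy

numpy.random.seed(0)                       # reproducible demo
z = numpy.linspace(0.01, 0.10, 18)         # stand-in redshifts
derr = 3.0 + 30.0 * z                      # stand-in distance errors (Mpc)
a_true = 299800.0 / 70.0                   # slope for H0 = 70 km/s per Mpc

w = 1.0 / derr**2                          # weights don't change between trials
S, Sx, Sxx = w.sum(), (w * z).sum(), (w * z * z).sum()
Delta = S * Sxx - Sx**2
da_formula = numpy.sqrt(S / Delta)         # predicted slope uncertainty

slopes = []
for trial in range(2000):
    d_fake = a_true * z + numpy.random.normal(0.0, derr)   # model + noise
    a_fit = (S * (w * z * d_fake).sum() - Sx * (w * d_fake).sum()) / Delta
    slopes.append(a_fit)
slopes = numpy.array(slopes)
```

With a few thousand trials, `slopes.mean()` should land near `a_true`, and `slopes.std()` should agree with `da_formula` to within a few percent.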

Contact instructors: Cristiano Galbiati or Shawn Westerdale
 Python source modified: Thu Sep 11 11:39:51 2014. Page generated: Mon Nov 17 20:57:10 2014.