## Monday, May 28, 2012

### Wrapping things up

We're closing in on the end of the term. Here's what's due and when:

• CA5 due Monday, June 4 (Dropbox)
• Final projects due Friday, June 8 (Dropbox)

Add this to the end of CA5:

8) Simulate a 5-pixel data set with flux per pixel described by F(z) = a*(z-1)^2 + b*(z-1) + c, with a=3, b=0.2, c=4. Then use an MCMC routine to produce the two-dimensional PDFs for each pair of parameters in {a,b,c}, as well as the individual one-dimensional PDFs for a, b, and c.
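One way to set up the simulated data set is sketched below. The pixel coordinates (z = 0 through 4) and the Gaussian noise level (sigma = 0.5) are my assumptions, since the problem doesn't specify them:

```python
import numpy as np

# True parameters from the problem statement
a_true, b_true, c_true = 3.0, 0.2, 4.0

# Five pixels; unit-spaced coordinates z = 0..4 and Gaussian noise with
# sigma = 0.5 are assumptions -- the problem doesn't specify either
z = np.arange(5.0)
sigma = 0.5

def model(z, a, b, c):
    """Quadratic flux model F(z) = a*(z-1)^2 + b*(z-1) + c."""
    return a*(z - 1)**2 + b*(z - 1) + c

rng = np.random.default_rng(42)
flux = model(z, a_true, b_true, c_true) + rng.normal(0.0, sigma, z.size)
print(flux)
```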

The basic algorithm for MCMC is as follows:

1. Evaluate the likelihood of the data at an initial, guessed set of parameters {a,b,c}. Call this L0.
2. Evaluate a new likelihood L1 at a new set of parameters, say {a+delta,b,c}, where delta is a small perturbation drawn from a "jump distribution." This jump is usually just a call to a normally-distributed random number generator centered on the current value. The parameter to perturb is also drawn at random (there are other ways to do this, but this is the easiest in my experience).
3. If L1/L0 > 1, add the new set of parameters to the chain (the array of parameter samples). If L1/L0 < 1, add the new parameters only if a uniform random number drawn from [0,1] is less than L1/L0; otherwise, add the previous set of parameters to the chain again. (In practice it's safer to work with log-likelihoods and accept when log(u) < log L1 - log L0, which avoids numerical underflow.)
4. Go to step 2 until the desired number of chain links is reached (usually 1e5 to 1e7 links).

For three parameters, the array of chains will be a 3xN array, where N is the number of links in the chain. The histogram of each parameter's chain turns out to be the marginalized posterior PDF of that parameter, which is pretty cool.
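The steps above can be sketched as a minimal Metropolis sampler for the problem-8 model. The Gaussian likelihood, noise level, jump width, starting guess, and burn-in fraction are all my assumptions:

```python
import numpy as np

# Synthetic 5-pixel data set from problem 8.
# Pixel coordinates z = 0..4 and noise sigma = 0.5 are assumptions.
rng = np.random.default_rng(0)
z = np.arange(5.0)
sigma = 0.5
a_true, b_true, c_true = 3.0, 0.2, 4.0

def model(p):
    a, b, c = p
    return a*(z - 1)**2 + b*(z - 1) + c

flux = model((a_true, b_true, c_true)) + rng.normal(0.0, sigma, z.size)

def log_like(p):
    # Gaussian log-likelihood (up to a constant); the noise model is assumed
    return -0.5*np.sum((flux - model(p))**2)/sigma**2

nlinks = 100000          # number of chain links
jump = 0.1               # width of the Gaussian jump distribution (tunable)
chain = np.empty((3, nlinks))

p = np.array([1.0, 0.0, 1.0])   # step 1: initial guess for {a, b, c}
logL0 = log_like(p)
for i in range(nlinks):
    trial = p.copy()
    k = rng.integers(3)                  # pick one parameter at random
    trial[k] += rng.normal(0.0, jump)    # step 2: perturb it
    logL1 = log_like(trial)
    # step 3, done in log space to avoid numerical trouble:
    if np.log(rng.uniform()) < logL1 - logL0:
        p, logL0 = trial, logL1          # accept the jump
    chain[:, i] = p                      # new or repeated link

# step 4 is the loop above; now summarize each row (after discarding
# some burn-in) -- histogram each row to get the 1-D marginalized PDFs
burn = nlinks // 10
for name, row in zip('abc', chain):
    print(name, row[burn:].mean(), row[burn:].std())
```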

For more info, check out Wikipedia, or Appendix A of this classic astronomy article:

http://arxiv.org/pdf/astro-ph/0310723v2.pdf

I will be available during office hours Wednesday (10am-11am) to go over MCMC in greater detail.

## Thursday, May 17, 2012

### Correction on my last post

Both the title and body of my previous post referenced Class Activity 2, when I meant to refer to Class Activity 4 (CA4). Since you already turned in CA2, I hope this wasn't too confusing!

## Wednesday, May 16, 2012

### Class Activity 4

For those of you who have completed CA4 through problem 2, you can consider yourselves complete on that class activity and you may turn in what you have. For those of you who completed through problem 4, kudos to you and you should feel good about the extra skills you picked up as a result. The time invested will pay off in your research career, I can assure you.

For now, I'd like all of you to focus on the reading assignment (Chapter 4.1 in Sivia) described in a previous post so you can get the most out of CA5 this Friday. I'll be back in town and I'll give a short lecture before we dive into the next activity.

### 2D posterior plots made easy (CA3, problem 4)

Debugging is annoying, and we frequently run into problems when plotting 2D posterior contours. Believe me, you are not alone. Here is some simple code I used for CA3, problem 4. It is not optimized, but it shows some standard steps for making 2D posterior plots that I find useful.

# step 1: read your data and do some basic checks (visualization is important)
# step 2: prepare a 2D parameter mesh grid, which can be done with the meshgrid function in numpy
# step 3: given the 2D parameter grid, calculate the probability of each datum at every grid coordinate, i.e., a 2D probability for each data point. It is usually convenient to do this in log scale.
# step 4: sum the log probabilities of all the data points; this gives the joint 2D log probability for the whole data set.
# step 5: find the maximum joint log probability and subtract it. We do this to avoid numerical underflow, since the unnormalized joint probability is usually a very small number.
# step 6: exp(log prob) and plot a contour

Here is the example code for CA3, problem 4:

```python
from numpy import *
import matplotlib.pyplot as plt
import asciitable

# read the data table first (e.g. the CA3 data file from the May 2 post);
# the column names here are assumed to match that table
data = asciitable.read('fv05_vf05_data.txt')

detection = data['comp']        # detection or not, 1/0
feh       = data['metallicity'] # metallicity

# check the data
n = detection.size            # the total number of data points
print feh.min(), feh.max()    # metallicity range in the data

# choose a proper range of alpha; f > 0 --> alpha > 0
alpha_grid = arange(0.01, 0.1, 0.001)
beta_grid  = arange(0.0, 3.0, 0.01)

# generate a 2D meshgrid for alpha and beta
(alpha, beta) = meshgrid(alpha_grid, beta_grid)
logpdf = zeros((beta_grid.size, alpha_grid.size))

# loop through each data point
for i in range(n):
    f = alpha*10**(beta*feh[i])

    # set f to NaN if f > 1; here we don't need to check f > 0
    # because we already set alpha > 0
    f[(f > 1)] = None

    # sum the log(prob) over all the data
    logpdf += log(detection[i]*f + (1 - detection[i])*(1. - f))

# set the log(prob) at forbidden (alpha, beta) values to -999
whereAreNaNs = isnan(logpdf)
logpdf[whereAreNaNs] = -999

# find the max logpdf
print logpdf.max()

# find the peak location
ind = where(logpdf == logpdf.max())
print 'the peak is at alpha=%.4f, beta=%.4f' % (alpha_grid[ind[1]], beta_grid[ind[0]])

logpdf = logpdf - logpdf.max()

# pdf is the (unnormalized) probability, scaled so the peak is 1
pdf = exp(logpdf)

# 2D contour plot
plt.figure()
CS = plt.contour(alpha, beta, pdf)
plt.show()
```
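The max-subtraction trick in step 5 is worth seeing in isolation: without it, exponentiating large negative log probabilities underflows to zero and the contour plot comes out empty. A tiny sketch with made-up numbers:

```python
import numpy as np

# typical joint log-probabilities after summing many data points
logpdf = np.array([-1205.0, -1203.0, -1200.0])

print(np.exp(logpdf))                 # underflows: every entry is 0.0
print(np.exp(logpdf - logpdf.max()))  # the relative shape survives
```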

## Sunday, May 13, 2012

The reading assignment for this week is Sivia chapter 4.1 and the first subsection of 4.2 (up to 4.2.1).

Also, check out this post from last year on the Jeffreys prior for a scale parameter, such as the slope of a linear fit. If you're feeling super inspired, check out this dense yet handy Wikipedia entry.

## Saturday, May 12, 2012

### Avoiding Loops: a homework example

In a previous post, Tim showed us that using arrays is much more efficient than using loops. Here I'm going to give a real example of how to use NumPy arrays to solve one of our homework problems.

The example is CA4, problem 3. We have a data set of the masses of known transiting planets and want to find the power-law index of the mass distribution function.

```python
from numpy import *
import asciitable
import pylab as py

# load the transit planet table first; the file name here is a
# placeholder for wherever you saved the data set
all_transits = asciitable.read('transit_data.txt')
transit_mass = all_transits['MSINI']

# select the planets with masses between 1 and 20 (inclusive)
ind = ((transit_mass >= 1) & (transit_mass <= 20)).nonzero()[0]
mass_data = transit_mass[ind]

# the total number of planets in the given mass range
n = mass_data.size

# build an array for the possible alpha range
alpha_grid = arange(1.1, 3.5, 0.001)

# here we use numpy arrays rather than a loop:
# (1) build 2D meshgrid arrays for alpha and mass.
# NumPy provides a convenient function for generating meshgrid arrays,
# called, appropriately, meshgrid.
(alpha, mass) = meshgrid(alpha_grid, mass_data)

# (2) calculate the probability at each alpha and m.
# Attention: array math operates on an element-by-element basis,
# i.e., unlike matrix math.
mass_max = mass_data.max()
mass_min = mass_data.min()
const = 1./(1 - alpha)*(mass_max**(1. - alpha) - mass_min**(1. - alpha))
log_pdf = log(1./const*mass**(-1.*alpha))

# (3) sum the log_pdf along the mass dimension
log_pdf_alpha = log_pdf.sum(axis=0)
py.plot(alpha_grid, exp(log_pdf_alpha - log_pdf_alpha.max()))
py.show()
```
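As a sanity check that the meshgrid version computes the same thing as an explicit loop over data points, here is a tiny self-contained comparison (the masses and the coarse alpha grid are made up for illustration):

```python
import numpy as np

# toy data standing in for the transit masses (values are made up)
mass_data = np.array([1.5, 2.0, 3.7, 8.2, 12.0])
alpha_grid = np.arange(1.1, 3.5, 0.1)

m_min, m_max = mass_data.min(), mass_data.max()

def log_pdf_sum_loop():
    """Explicit loop over data points."""
    total = np.zeros(alpha_grid.size)
    for m in mass_data:
        const = (m_max**(1 - alpha_grid) - m_min**(1 - alpha_grid)) / (1 - alpha_grid)
        total += np.log(m**(-alpha_grid) / const)
    return total

def log_pdf_sum_vector():
    """Same computation with meshgrid arrays, as in the example above."""
    alpha, mass = np.meshgrid(alpha_grid, mass_data)
    const = (m_max**(1 - alpha) - m_min**(1 - alpha)) / (1 - alpha)
    return np.log(mass**(-alpha) / const).sum(axis=0)

print(np.allclose(log_pdf_sum_loop(), log_pdf_sum_vector()))  # True
```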

The example code and transit planet data can be downloaded from here:

I find this introduction to NumPy calculations very useful. It is totally worth your time.

## Wednesday, May 2, 2012

### CA3 data

Part 6 of CA3 instructs you to construct a data table based on FV05. In the interest of time and uniformity, you should just use this table instead:

http://www.astro.caltech.edu/%7Ejohnjohn/astrostats/data/fv05_vf05_data.txt

The middle column indicates 0 if the star doesn't have a planet, and 1 if it does.

Parts 1-8 will be due Monday. Continue on to problem 9 if you feel particularly inspired to learn how to deal with measurement uncertainties when doing this sort of analysis.

Part 10 instructs you to write up your results. This is left over from last year. This year I've already encouraged you to make your results clear, either in an iPython workbook or better yet a LaTeX document. You may turn in your completed activity the same way you did with CA1 and CA2.