Caltech Astrostats: October 2010

Thursday, October 28, 2010

HW 5 Clarifications: What does the evidence ratio mean? (and revisions to the assignment)

Questions 4 and 5 on the HW are a bit confusing. Here's how to think of it. [Skip to the bottom if you're confused by the following ramblings about the role of Monte Carlo simulations in this kind of analysis.]

An evidence ratio (or "Bayes factor") has a specific meaning: that is, it is the odds ratio in favor of one model over another. If Ev1/Ev2 = 9, then your data give 9:1 odds that Model 1 is right; in other words, you have 90% confidence (9/(9+1)) in Model 1.

Why then, you might (rightly) ask, are we asking you do to Monte Carlo simulations? And what then is the point, you might (astutely) inquire, of Section 3.2 of our paper?

Well, to be honest, the more I think about it the more unnecessary I believe Section 3.2 is. Except for two important purposes: a sanity check and a convincing demonstration. Monte Carlo simulations allow you test the frequentist performance of your Bayesian statistic; this is what we do in Section 3.2, and what we are asking you to do in question 4. In Section 3.2 it was relatively straightforward to do Monte Carlo simulations to do this, because the two models I was comparing have no free parameters. And we find that the simulation-based confidence of 94% is (almost) exactly what you get by just calculating the confidence directly from the R value. HOWEVER, it is no longer clear to me how to do these simulations, or exactly what they mean, in cases where you have free parameters. What I did in this case (Section 4.2/4.3) was to simulate data with various values of the parameters and weight the resulting R distribution by the values of the parameter that the data favored (i.e. marginalized over the parameter). This ends up with a result that does in fact confirm the confidence that the Bayes factor suggests (R = 1.16 gives 94% confidence analytically), but this is a different procedure from what we have asked you to do in Question 4, where we tell you to simulate data just using the best fit parameters.

In addition to this, Question 5 is somewhat unclear.

John's out of town right now, but in my role as temporary deputy of this class, I ask you to combine the tasks that the current Questions 4 and 5 are asking you to do by answering the following two-part question:

How likely are you to be able to confidently (>95%) detect the presence of two lighthouses, as a function of their separation? And how likely are you to confidently select a one-LH model over a two-LH model, if the data really do come from just one LH?

This will still require you to do MC simulations, but this time not to try to confirm the frequentist performance of your statistic, but as a way to help you understand the limits and capabilities of your experiment.

To turn in Monday:

Example 1-d and 2-d posterior plots for each data set (i.e. the one generated with 1 LH and the one generated w/ 2 LHs), with the true values of alpha marked. I would like the 2-d posteriors to have 1-, 2-, and 3-sigma contours, though if you don't get around to this yet, not a huge deal.
A plot of "probability of confidently detecting 2 LHs" vs. LH separation (with a caption explaining what the plot means that someone not taking this class could understand)
A histogram of R values resulting from a large number of data sets simulated using one LH, with a caption giving your interpretation of what this means.

Wednesday, October 20, 2010

HW4: My Solutions

Hey class- Here are the results I get from the two exercises. My prior confidence in myself doing things like this correctly on the first try is generally something around 93%. So if you get something different, check yourself first, but if your result is persistently different from mine, let me know!

Question 1: What kind of coin?

Question 2: Fair flipper or cheater?

Remember to write captions (short but to the point) for each figure: be very clear in your language about what you're calculating! This is particularly important because most people think Bayesian methods are scary and magical; your job is to show that what you do is clear and logical!

HW4, Question 2 clarification

Hey folks- I apologize for the incomplete and inexact wording of question two on the latest worksheet.

Here is the story. A Type I flipper always tells the truth. A Type II flipper, with some unknown probability f (f less than 0.5), will decide before a flip that, no matter what you guess, he will make you wrong. That is, no matter what the coin actually is, if you guess 'heads' he says 'tails,' and vice versa.

Model Selection in the Literature

Tim and I just published a paper that uses the Bayesian model selection concept we are learning about this week. The paper can be found here.

Also, here's a link to Tim's newly minted home page.

Monday, October 18, 2010

Homework 3: What is due Wednesday

For Wednesday, you can skip Problem 4. I think it is too time-intensive for the lesson I wanted you to learn. The punchline is shown in the figure below, which was generated by Jon Swift (of Swift & Beaumont 2009).

The point is that the least-squares approach underestimates the power-law index for all values of N. So the next time you see a power-law fit in the literature, be suspicious!

For Wednesday, do Problem 3. In addition to what is written, estimate the power-law index for the exoplanet masses using MLE, and compare your answer to the value you find from a least-squares fit as discussed in Problem 4. For the least-squares fit, you can ignore mass bins containing zero planets.

Turn in a figure showing the pdf for alpha, and a figure illustrating the best-fitting powerlaws from both methods compared to a histogram of the planet masses. For the two figures, write captions describing what is plotted.

ETC: less than 2 hours

Reading assignment due Wed is still Sivia Chapter 4.1

Are your likelihoods slowing you down?

Sorry about the formatting here; looks like the Google docs embed thing isn't quite compatible with the blog formatting....Hope you can still read what you need here.

Friday, October 15, 2010

Programming Investment: Making Figures

If you continue on in astronomy, you'll find (if you haven't already) that most of the actual time you spend doing research is, in fact, programming. Not just programming to actually do whatever analysis you're doing, but often a comparable amount of time putting together nifty professional-looking, presentable figures to communicate your results.

With this in mind, you should be thinking about building an accessible toolkit of useful routines that perform tasks that you expect to repeat often. A first example of something like this would be the function we had you write a few weeks ago that calculates a 1-D confidence interval, given vectors of x and p(x). You should consider writing similar routines for common plotting tasks.

For example, I've written a function called "plot_posterior" that takes as input x and p(x) (or L(x)--not necessarily normalized), just like your confidence interval function, and makes a pretty plot, as follows:

>>> x = arange(-2,6,0.001) #make x vector

>>> px = normal(x,2,1) #evaluate N(2,1) on x

>>> plot_posterior(x,px,name='x',labelpos=(0.65,0.8)) #make the plot

(Note the default thicker-than-normal axes and lines, which is the quickest way to make your plots look more presentable and should become a habit.)

The function also has other optional parameters, such as plotting different confidence intervals, marking the median value instead of the mode as the representative value, using a symmetric interval instead of the shortest interval, turn off shading, turn off labeling, change the number of decimal places to which the result and errors are quoted, etc. This way, whenever I want to make a nice plot of a posterior distribution with regions marked and labeled and all, I don't have to start from scratch.

I've also written a similar function ('plot_posterior2d') to summarize the results of a two-dimensional parameter estimation, as you are currently working on with the Fischer & Valenti data. This produces the following, which is a nice concise way to illustrate the results:

>>> plot_posterior2d(a,b,L,confs=[0.68,0.95,0.99],'a','b',decplaces1=3)

As you can see, this function uses the 1D version of 'plot_posterior' to make the marginalized pdfs (once called with the 'horizontal' option). (It also draws only selected confidence contours on the contour plot, which is something we have not yet asked you to do, but you should be thinking about.) And now, with a modest front-end time investment, I can instantly make a snappy publication-quality plot of any two-dimensional likelihood surface I wish, potentially saving lots of time down the road.

NOTE: I'm doing all my work here using python, which I heartily recommend to all of you to experiment with at some point, if solely because of this amazing collection of example plots that I visit anytime I ever want to make any kind of fancy-looking plot. The language is quite similar in many ways to both Matlab and IDL, and has the benefits that it's both much more portable and much more free than either. I've just picked it up in the last year or so, and I'd be happy to help guide anyone who's interested.

Thursday, October 14, 2010

How to present a result?

When using Bayesian inference for parameter estimation, one has several choices as to how to present your results. In question #4 on HW2, you are asked to calculate the posterior pdf for f, the binary fraction of O stars, given 3 companion detections in 21 systems. If you were presenting this result in a paper, how would you characterize it? What is the answer?

Well, in Bayesian parameter estimation, the answer is the entire posterior pdf:

But of course, you need to quote a result somewhere in the text, and perhaps in a table, in order that someone else can read your paper and quickly take your result and use it in their work. So what do you do? Do you quote the median with a symmetric 68% confidence interval?

This is the easiest one to calculate, but with an asymmetric pdf like this it doesn't actually look that great, in my opinion. So then do you quote the most likely value, with a different confidence interval? (I believe this one is the "shortest confidence interval.")

Maybe looks a bit nicer, but really, who's to judge? Anyway, the point is that exactly the same data and analysis gives slightly different "results," depending on what you choose to quote.

This is funny, because by summarizing "results" like this you're actually encouraging others to play the "assume a normal distribution with a particular sigma" game. And I fully admit to having done this myself: taking quoted results that I know don't come from a normal posterior distribution and approximating them as normal in my analysis. What can I say, sometimes we're just lazy...who wants to track down a whole bunch of exact posterior distributions from a dozen different authors? But at least this way I can comfort myself by knowing that while I'm taking shortcuts, I know that I'm taking them....

Wednesday, October 13, 2010

Homework 2: Example TeX file

\begin{document}
\title{Your Title}

\author{Joan T. Studint\altaffilmark{1}}
\email{studint@caltech.edu}

\altaffiltext{1}{Department of Astrophysics, California Institute of Technology, MC 249-17, Pasadena, CA 91125}

\begin{abstract}
Background and motivation. Methodology. Results. Conclusion.
\end{abstract}

\keywords{methods: statistical}

\section{Data}

Where did you get the data? Cite your sources. How were the data measured?

\section{Analysis}

Describe your analysis. Include equations

\section{Results}

What did you find? Show figures. How did it compare to previous studies?

\end{document}

Correction to HW 2 data set

There was an error in the cross-listed data set. The error has been fixed and checked against exoplanets.org.

The updated data set can be found here:

http://www.astro.caltech.edu/~johnjohn/astrostats/data/fv05_vf05_data.txt

Correction to HW 2

Problem 9 refers to "HW3." This is a mistake. It is referring to the analysis in problem 8.

Office Hours 8-10pm, Thursdays

We will be holding weekly "homework lab"-type office hours from 8pm to 10pm on Thursdays, in the classroom (Cahill 219). Please come by if you have questions or if you're stuck with anything on the assignments; either Professor Johnson or Tim (or both) will be there to help.

Data for Class Activity 2

The cross-listed data can be found here:

http://www.astro.caltech.edu/~johnjohn/astrostats/data/fv05_vf05_data.txt

Tuesday, October 12, 2010

Class Activity 2.1: Time per problem

General comment on CA2:

1) Problems 9 and 10 (the writeup) will be due Monday. For problem 9, compare the marginalized pdfs of alpha and beta for the case with and without consideration of sigma_[Fe/H]. When you get here Wednesday or thereafter, do not do the integrals analytically! Do your integration numerically.

2) The primary goals of this assignment are

to understand how to calculate the occurrence rate as a function of object properties (planet occurrence as a function of stellar metallicity).
present your results clearly and professionally
gain science writing experience

Keep these goals in mind as you work on the problem set.

3) Time spent debugging should decline monotonically throughout the semester. If you get discouraged after spending an hour chasing down a MATLAB or LaTeX error, take heart in that this is one fewer bug to chase down later as you pursue your research.

4) Ask for help, especially if you exceed any specific ETC below. Talk in person and via email to your classmates, Tim, or myself.

Comments and ETCs for homework problems:

Several students have requested an estimate for the amount of time one should spend on each problem. For Wednesday, problems 1-7 are "due," which means that you should be prepared to work on problems 8-10 in class with your partner. It looked like most teams finished through 4 in class, so here are my estimated times to completion (ETC) and notes for problems 5-7:

CA5) You may ignore my specific suggestion of assigning 60% completeness to half of the stars. The key to this problem is to investigate how detection sensitivity (completeness) factors into your estimate of f. You may assign Si to the stars in your sample as you see fit. Query your intuition beforehand; how do you expect S to affect your estimate of f? Then do the test.

This should take no longer than it took you to complete problem 4: ETC 20 minutes

CA6) I will send out the cross-listed data set thanks to Mike B. and Matt.

You should still look up the FV05 paper on NASA ADS, and read the sections that describe their data and analysis so you have the necessary background for this problem (the intro would also be useful for context). ETC 30-40 minutes for reading the relevant sections

CA7) ETC 10 minutes of thought after examining Fig 5 in FV05.

Total ETC <= 1.2 hours