Friday, November 5, 2010

On Jeffreys priors: scale vs. location

I have to admit that even after reviewing many of your write-ups, I remained a bit uncertain about when I should and should not use a Jeffreys prior for a parameter (i.e. p(x) ~ 1/x). It is clear to me that describing e.g. a normal distribution involves a scale parameter sigma and a location parameter mu. The uninformed prior for the scale parameter is p(sigma) ~ 1/sigma, and for the location parameter it is the uniform prior p(mu) ~ 1.

But what happens when mu itself is parametrized, such as mu(x; A,B) = Ax + B? Is A a scale parameter while B is a location parameter? Before class Wednesday, I would have answered yes. After class, I was seriously doubting myself.

Unfortunately, after several hours of paging through books and searching the web I couldn't find a definitive answer to this question. The difficulty is that all of the examples of Jeffreys priors I found use a normal distribution as an example, or some other simple distribution (Poisson or binomial). But I could find precisely nothing about the similarly simple and far more ubiquitous problem of an uninformative prior on the slope of a linear fit! What's up with that? So I decided to do what I should have done at first: query my intuition rather than statistics papers.

What is it that we're trying to do with a Jeffreys prior? We are simply trying to properly encapsulate the notion that we have no prior knowledge of a parameter. Let's work with the example of a time series of flux values with normal errors, for which we suspect the flux level is a linear function of time:

F(t; A, B) = A + Bt.

Is a uniform prior appropriate for the slope B? A uniform prior would say that the slope is just as likely to lie in the interval from 1 to 2 as in the interval from 1,000,000 to 1,000,001. Hmmm, this doesn't feel right. The former increase in the slope is a 100% change, while the latter is an increase of only 1 part in a million. By making the statement p(B) ~ const, we'd be saying that a slope between 1,000,001 and 1,000,003 is twice as likely as a slope between 1 and 2! If we were to make this statement, then we'd clearly have some pretty strong prior information. We'd be far from uninformed.
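To put rough numbers on this, here is a minimal sketch of how a uniform prior spreads its probability across decades of slope. The bounds B_MIN and B_MAX are assumptions made only so the prior can be normalized; they are not part of the argument above.

```python
# Hypothetical bounds, chosen only so that the uniform prior is normalizable.
B_MIN, B_MAX = 1.0, 1.0e6

def uniform_mass(a, b, b_min=B_MIN, b_max=B_MAX):
    """Probability that B lies in [a, b] under p(B) ~ const on [b_min, b_max]."""
    return (b - a) / (b_max - b_min)

# The six decades between 1 and 1,000,000.
decades = [(10.0**k, 10.0**(k + 1)) for k in range(6)]
masses = [uniform_mass(lo, hi) for lo, hi in decades]

for (lo, hi), mass in zip(decades, masses):
    print(f"{lo:>9.0f} to {hi:>9.0f}: {mass:.6f}")
```

With these assumed bounds, about 90% of the prior probability sits in the top decade alone, which is a lot to claim about the order of magnitude of a slope before seeing any data.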

So in the case of a slope, I'd argue that we have a clear-cut reason to rely on a Jeffreys prior of the form p(B) ~ 1/B, which is the same as saying p(lnB) ~ const., or every log interval is equally likely. This feels much better.
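The equivalence is just a change of variables: p(ln B) = p(B) |dB/d ln B| = (1/B) B = const. Here is a minimal sketch that checks this by sampling. It assumes B > 0 and uses hypothetical bounds B_MIN and B_MAX, since p(B) ~ 1/B cannot be normalized over all positive B; the sample size and seed are likewise arbitrary.

```python
import math
import random

# Hypothetical bounds: p(B) ~ 1/B is improper unless truncated.
B_MIN, B_MAX = 1.0, 1.0e6
N_SAMPLES = 60_000

rng = random.Random(42)  # fixed seed so the run is repeatable

# Drawing ln B uniformly is the same as drawing B from p(B) ~ 1/B.
samples = [
    math.exp(rng.uniform(math.log(B_MIN), math.log(B_MAX)))
    for _ in range(N_SAMPLES)
]

# Count how many samples land in each of the six decades.
counts = [0] * 6
for b in samples:
    decade = min(int(math.log10(b)), 5)  # clamp the upper edge into the last bin
    counts[decade] += 1

for k, n in enumerate(counts):
    print(f"1e{k} to 1e{k + 1}: {n / N_SAMPLES:.3f}")
```

Each decade should come out with roughly one sixth of the samples, which is what "every log interval is equally likely" means in practice.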

Keep in mind that when we talk about prior information, we are talking about knowledge we had about the problem at hand before we were handed the data. For the case of a flux level that is linear in time, what do we know about the offset A? Well, we certainly do not know its numerical value without peeking at the data. But we do know that the shape of the function A + Bt is going to look the same for any choice of the offset, A. The offset can be 1 or 1000 and it won't change the fact that we'll end up with B more flux in each unit time interval.

Thus, a uniform prior is appropriate for the offset parameter A: the offset interval 1 to 2 is just as likely as the interval 1,000,000 to 1,000,001. The offset is a location parameter, and location parameters call for a uniform prior.

Another way to think about all this is to consider two choices of units. If I change the y-axis from ergs/s/cm^2 to horsepower per acre, the offset just translates from one numerical value to another. However, for a fixed offset A, the units are crucial for a slope value B. If I tell you that the slope is B=1, it matters a lot whether we are talking about deaths per year or deaths per second! As a result, before we have the data sitting on our hard drive, we certainly don't want to pick a prior for the slope (rate) that expresses a preference for one choice of units over another.
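One way to check the units argument numerically: the unnormalized weight that p(B) ~ 1/B assigns to an interval is the integral of dB/B, which depends only on the ratio of the endpoints, so it does not change when the slope is re-expressed in other units. The uniform weight, the integral of dB, picks up the conversion factor. The sketch below uses the per-year versus per-second example; the helper names and the choice of a 365.25-day year are assumptions for illustration.

```python
import math

SECONDS_PER_YEAR = 365.25 * 24 * 3600  # assumed year length, illustration only

def jeffreys_weight(a, b):
    """Unnormalized weight of [a, b] under p(B) ~ 1/B: integral of dB/B."""
    return math.log(b / a)

def uniform_weight(a, b):
    """Unnormalized weight of [a, b] under p(B) ~ const: integral of dB."""
    return b - a

# A slope interval of 1 to 2 events per year...
a_yr, b_yr = 1.0, 2.0
# ...and the same physical interval expressed in events per second.
c = 1.0 / SECONDS_PER_YEAR
a_s, b_s = c * a_yr, c * b_yr

print("1/B weight, per year:   ", jeffreys_weight(a_yr, b_yr))
print("1/B weight, per second: ", jeffreys_weight(a_s, b_s))
print("uniform weight, per year:   ", uniform_weight(a_yr, b_yr))
print("uniform weight, per second: ", uniform_weight(a_s, b_s))
```

The two 1/B weights agree, while the uniform weights differ by the conversion factor c. Once a uniform prior is normalized over a fixed numerical range, that factor shows up as a preference for one set of units.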

Using this reasoning, p(A) ~ const. and p(B) ~ 1/B feel about right to me for the simple case of a linear fit with normally-distributed errors, or for any other model parametrization for that matter (see Gregory 2005 for the case of a Keplerian orbit, in which the Doppler amplitude K is given a Jeffreys prior while the phase is given a uniform prior). I'm not much interested in the square root of the determinant of the Fisher information. My intuition tells me that location parameters get uniform priors, and scale parameters (in which the units on the x or y axis matter) get p(B) ~ 1/B priors.

But don't take my word for it, query your own intuition.
