CosmoMC Readme
This version May 2003. Check the web
page for the latest version.
NEW (May 03): GetDist support for 'triangle' plots. Minor bug fixes.
Introduction
CosmoMC is a Markov-Chain Monte-Carlo engine for exploring cosmological
parameter space, together with code for analysing Monte-Carlo samples.
The supplied code and results use brute-force theoretical calculations with
CAMB. See the paper
for an introduction, descriptions, and typical results from some pre-WMAP data.
On a multi-processor machine you can start to get good results in a few hours. On single processors
you'll need to set aside a day or so. It gets several times slower if you include tensors, the matter power spectrum
and massive neutrinos. Also check our chains
page to see if we have suitable runs available that you can importance
sample from very quickly (typically just seconds to re-compute a few thousand
likelihoods).
We use strict MCMC, with a proposal density that randomly changes only
a subset of the parameters. This gives a high (about 50%) acceptance rate,
and allows the calls to CAMB to be reduced since it does not need to be
called when only varying the 'fast parameters'. The program takes as inputs
the proposal density widths of the various parameters or a covariance matrix.
Use a covariance matrix if possible as it will significantly improve performance.
It is assumed these are known in advance (e.g. from the curvature of the Fisher
matrix), though they could be adjusted dynamically. Different chains are computed by running separate instances
of the program - there is no cross-talk between chains, though this could
be done using MPI or some more primitive means. Since chains are independent
they can be run on different machines, and since the sampling is strictly Markovian,
computer crashes are harmless: we can just read in the last set of parameters
and re-start.
There are two program files supplied "cosmomc" and "getdist". The first
does the actual Monte-Carlo and produces sets of .txt and .data output
files (the binary .data files include the theoretical CMB power spectra
etc.). The "getdist" program analyses the .txt files calculating statistics
and outputs files for the requested 1D and 2D plots (and could be used independently of the main cosmomc program).
The "cosmomc" program
also does post processing on .data files, for example doing importance
sampling with new data.
Running
- Get the download and unzip and untar it (run "gunzip cosmomc.tar.gz", then "tar -xf cosmomc.tar")
- Uncomment the relevant top parts of the Makefiles in the camb and source directories, depending on what system you are using
- Run "make all" in the camb subdirectory
- Download the WMAP likelihood code from here. CosmoMC comes with its own customized WMAP likelihood code, based on that by Licia Verde and Hiranya Peiris. Cite the Verde, Kogut and Hinshaw papers if you use WMAP.
- Unzip and untar the file likelihood.tar.gz into the cosmomc WMAP directory (so the .dat files end up in cosmomc/WMAP)
- Run "make all" in the source subdirectory
- Run "cosmomc params.ini" from the main directory to run the program (a condensed transcript of these steps is sketched below)
If you get segmentation faults this is most likely due to SGI or Intel compiler bugs. Using a different compiler version, or compiling without -openmp, can fix it. Please let me know if you have specific fixes. You can download the Intel f90 compiler for Linux here.
Note that to compile cosmomc you need to link to LAPACK
(for doing matrix diagonalization, etc.) - you may need to edit the
Makefile to specify where this is on your system.
Edit the params.ini file to change the default parameters for the run.
To run multiple instances run "cosmomc params.ini 1", then "cosmomc params.ini
2", etc.; the corresponding output file names will have "_1","_2", etc.,
appended.
To change the l_max used for generating Cls you'll need to
edit the value in cmbtypes.f90, then run "make clean" and "make" to
rebuild everything.
The default code includes polarization. You can edit the num_cls parameter in cmbtypes.f90 to include just temperature (num_cls=1), TT, TE and EE (num_cls=3), or TT, TE, EE and BB (num_cls=4). You will need the last option if you are including tensors and using polarized data. You can use temperature-only datasets with num_cls 3 or 4, the only disadvantages being that it will be marginally slower and the .data files will be substantially larger. For WMAP data you need num_cls = 3 or 4.
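As a rough sketch, the relevant declarations in cmbtypes.f90 look something like this (the names and default values here are illustrative - check your copy of the file):

    ! compile-time settings in cmbtypes.f90 (illustrative values)
    integer, parameter :: lmax = 1300   ! maximum l used when generating Cls
    integer, parameter :: num_cls = 3   ! 1: TT; 3: TT,TE,EE; 4: TT,TE,EE,BB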
Input parameters for Monte-Carlo
See the supplied params.ini file for a fairly self-explanatory list of
input parameters and options. The file_root entry gives the root
name for all files produced. The samples will be in file_root.txt, etc.
The CMB datasets that are used for computing the likelihood are given in
*.dataset files in the data directory. These are in my standard .ini format,
and contain the data points and errors, data name, calibration and beam
uncertainties, and window file directory. The num_threads parameter
will determine how fast it runs - scaling is linear up to about 8 then
falls off slowly. It is probably best to run several chains on 8 or fewer
processors. You can terminate a run before it finishes by creating a file
called file_root.read containing "exit =1" - it will read it in and stop.
The .read file can also have "num_threads =xx" to change the number of
threads dynamically.
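For example, with a hypothetical file_root of chains/test, creating a file chains/test.read containing just

    num_threads = 4

changes the number of threads on the fly, while adding

    exit = 1

makes the run stop cleanly once the file has been read.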
The parameter limits, distribution widths and starting points are listed
as the paramxxx variables. The proposal density cycles between parameters,
using a Gaussian with the given standard deviation. If you specify a propose_matrix
(which is assumed to be the covariance matrix for the parameters), the
parameter distribution widths are determined from its eigenvalues, and
the proposal density makes changes along the parameter eigenvectors. The covariance
matrix can be computed using "getdist" once you have done one run. (See
also the chains page for some you could use.)
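In params.ini this is a single entry pointing at a covariance matrix file, e.g. (the file name here is hypothetical - use a matrix computed by getdist from an earlier run, or one from the chains page):

    propose_matrix = chains/previous_run.covmat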
Each chain continues until there have been samples acceptances, where
samples is set in the input file (this is not at all the same as the number
of independent samples, which depends strongly on the number of dimensions, proposal density, etc.).
Since consecutive points are correlated and the output files can get quite large, you may want to thin the chains automatically: set the indep_sample parameter to determine at what frequency full information is dumped to a binary .data file (which includes the Cls, matter power spectrum, parameters, etc). You only need to keep nearly uncorrelated samples for later importance sampling.
You can specify a burn_in, though you may prefer to set this
to zero and remove entries when you come to process the output samples.
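A hypothetical params.ini fragment tying these together - stop after 20000 acceptances, dump full model information at every 10th accepted sample, and leave burn-in removal to the post-processing stage (values purely illustrative):

    samples = 20000
    indep_sample = 10
    burn_in = 0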
The action variable determines what to do. Use action=1 to process
a .data file produced by a previous MCMC run - this is used for importance
sampling with new data, correcting results generated by approximate means,
or re-calculating the theoretical predictions in more detail. If action=1,
set the redo_xxx parameters to determine what to do. You should include all the data you want used for the final result:
importance sampling knows nothing about the data used to generate the original samples (e.g. if you generated the original samples with CMB data, also include CMB data when you importance sample - it will not be weighted twice).
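As a sketch, an importance-sampling run might use entries along these lines; redo_likelihoods is shown on the assumption that your version supports it (the redo_xxx names are listed in the supplied params.ini), and the output root is hypothetical:

    action = 1
    redo_likelihoods = T
    redo_outroot = wmap_added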
The temperature setting allows you to sample from P^(1/T) rather
than P - this is good for exploring the tails of distributions, discovering
other local minima, and for getting more robust high-confidence error bars.
Output files
The program produces a file_root.txt file listing each accepted set of
parameters; the first column gives the number of iterations staying at
that parameter set (more generally, the sample weight). If indep_sample
is non-zero, a file_root.data file is produced containing full computed
model information at the independent sample points. A file_root.log file
contains some info which may be useful to assess performance.
If action=1, the program reads in an existing .data file and processes
it according to the redo_xxx parameters. At this point the acceptance multiplicities
are non-integer, and the output is already thinned by whatever the original
indep_sample parameter was. The post-processed files are output to files with root redo_outroot.
Analysing samples and plotting
The getdist program analyses text files produced by the MCMC or
post-processing. These are in the format
weight like param1 param2 param3 ...
The weight gives the number of samples (or importance weight)
with these parameters. like gives -log(likelihood). The getdist
program could be used completely independently of the cosmomc program.
Run "getdist distparams.ini" to process the chains specified in the
parameter input file distparams.ini. This should be fairly self-explanatory,
in the same format as the cosmomc parameter input file.
It processes the file_root.txt file (or, if there are multiple chains,
set the chain_num parameter) and outputs statistics, marginalized
plots, sample plots, and performs PCA (principal component analysis).
Set thin_factor to produce a file_root_thin.txt file containing
every thin_factorth sample. Set adjust_priors to adjust the
sample multiplicities for a new prior (write the corresponding code in
GetDist.f90). Set
ignore_rows to adjust the number of outputs that
are discarded as burn-in. The .ini file comments should explain the other
options.
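A hypothetical distparams.ini fragment: process the four chains file_root_1.txt ... file_root_4.txt, discard the first 50 rows of each as burn-in, and produce a thinned file keeping every 20th sample (values purely illustrative):

    file_root = chains/test
    chain_num = 4
    ignore_rows = 50
    thin_factor = 20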
If your variable has a prior which cuts off where the posterior is non-negligible
you need to set the limitsxx variables to the corresponding limits
- for example for the neutrino mass, where one assumes m_v > 0. Otherwise
limits are computed automatically. DO NOT use limitsxx to change
the bounds on other plots - the program uses the limits information when
it is estimating the posteriors from the samples. When limitsxx are used the .margestats file
contains one-tail statistics. If only one end is fixed you can use N for the floating end, e.g. "limitsxx = 0 N" for the tensor amplitude, which has a greater-than-zero prior.
The program produces MatLab '.m' and SuperMongo '.sm' files to do 1D plots. Run "sm
< file_root.sm" to produce the plot file file_root.ps containing the
1D marginalized posteriors, or type "file_root" into a MatLab window set to the correct directory.
Labels are set in distparams.ini - if any are
blank the parameter is ignored. It also produces a MatLab file_root_2D.m
file for plotting the 2D marginalized posteriors, and file_root_3D.m for
plotting colored 2D samples plots (like the ones on the home
page). The data files used by
SuperMongo and MatLab
are output to the plot_data directory. If parameters are not specified for
the 2D plots or the color of the 3D plots getdist automatically works out
the most correlated variables and uses them.
Set the PCA_num parameter to perform PCA for a subset of the
parameters, set in PCA_params. You can specify the mapping
used in PCA_func, for example 'L' to use logs (what you usually
want for positive definite parameters). The output file file_root.PCA contains
the analysis including the correlation matrix, eigenvectors and eigenvalues,
as well as the well constrained parameter combinations and their marginalized
errors. This is how you automatically work out constraints of the form
param1^a param2^b param3^c ... = x \pm y.
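For example, a PCA over three parameters using log mappings might look like the following; it is assumed here that PCA_func takes one mapping character per parameter, with 'L' for log (the parameter indices are purely illustrative):

    PCA_num = 3
    PCA_params = 2 3 5
    PCA_func = LLL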
Alternatively you can make plots in MatLab using the scripts provided in the mscripts directory. Read and run example.m for more information.
Convergence diagnostics
The getdist program will also output convergence diagnostics.
- For multiple chains the code computes the Gelman and Rubin "variance of means"/"mean of variances" statistic for each parameter. The .converge file produced contains the numbers for each parameter individually. The program writes out the value for the worst eigenvalue of the covariance of the means, which should be a worst case, also catching poor chain coverage in directions not aligned with the base parameters. If the numbers are much less than one then the second half of each chain probably covers the space well and provides an accurate estimate of the means and variances. If the distribution is bimodal and no chains find the second mode, low values can be misleading. Typically you want the value to be less than 0.2 (a schematic form of the statistic is given after this list).
- For individual chains (before importance sampling) getdist computes the Raftery and Lewis convergence diagnostics. This uses a binary chain derived from each parameter, depending on whether the parameter is above or below a given percentile of its distribution. The code works out how much the binary chain needs to be thinned to approximate a Markov process better than a second-order process, and then uses analytical results for the convergence of binary Markov chains to assess the burn-in period. It also assesses the thin factor needed for the binary chain to approximate an independence chain, which gives an idea of how much you need to thin the chain to obtain independent samples (i.e. how much you can get away with thinning it for importance sampling, though thinning isn't entirely lossless). The .converge file contains the numbers for each chain used; getdist writes out the worst case.
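Schematically, the multi-chain statistic reported for each parameter is just the ratio described above (a sketch of the idea, not the exact implementation), computed from the later part of each chain:

    D = variance of the chain means / mean of the chain variances

Values much less than one (typically below 0.2) mean the scatter between chains is small compared to the width of the posterior.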
Programming
The most likely need to modify the code is to change l_max or num_cls, both specified in cmbtypes.f90. To change the numbers of parameters you'll need to change the constants in settings.f90. Run "make clean" after changing settings before re-compiling.
You are encouraged to examine what the code is doing and consider carefully
changes you may wish to make. For example, the results can depend on the
parameterization. You may also want to use different CAMB modules, e.g.
slow-roll parameters for inflation, or use a fast approximator. The main
source code files you may want to modify are
- params_H.f90
This defines what the input variables mean. Change this to use different
variables, etc. You can change which parameterization file
to use in the Makefile.
- cmbtypes.f90
You need to change this file to specify the l_max used. Chains can
be generated at low l_max, then post-processed with a compile using a higher
l_max. You can also change num_cls, the number of (temperature plus polarization) Cls to compute and store.
- settings.f90
This defines the number of parameters and their types. You will need
to change this if you use more parameters.
- cmbdata.f90
This reads in the CMB .dataset information and computes likelihoods.
You may wish to edit this, for example to use likelihood distributions
for the band powers, or to compute the likelihood from actual polarized data. This version assumes polarized data points are an arbitrary combination of the raw TT, TE, EE and BB Cls, as specified in the window files in data/windows. WMAP data is handled as a special case.
- propose.f90
This is the proposal density. The efficiency
of MCMC is quite dependent on the proposal. By default it changes a subset of 1 to 3 parameters with a Gaussian probability given by the input parameter widths. Fast parameters are varied separately. If there is a covariance matrix it changes the diagonalized variables.
- CMB_Cls_simple.f90
Routines for generating Cls, matter power spectra and sigma8 from CAMB.
Replace this file to use other generators, e.g. a fast approximator like
CMBfit or DASH.
- twodf.f90
Routine for computing the likelihood from the 2dFGRS data.
- supernovae.f90
We don't understand how to use the full covariance matrix, so if you
do you may wish to modify this to include it (please tell us!).
- postprocess.f90
Reads in .data files and re-calculates likelihoods or theory predictions.
- calclike.f90
Add in calls to other likelihood calculators, etc., here.
- driver.f90
Main program that reads in parameters and calls MCMC or post-processing.
- GetDist.f90
The "getdist" program for analysing chains. Write your own importance
weighting function or parameter mapping.
Version History
- May 2003 Added support for 'triangle' plots to GetDist (example: set triangle_plot=T in the .ini file). If truncation is required, the covariance matrix for CMB datasets is now truncated (rather than truncating the inverse covariance). Fixed a CAMB bug with non-flat models, and a problem setting CAMB parameters when run separately from CosmoMC.
- March 4 2003 Fixed bug in GetDist - the .margestats file produced contained incorrect limits (the mean and stddev were OK)
- Feb 2003 Support for WMAP data (customized code fixes TE and amplitude bugs). CMB computation now uses C_l transfer functions - a complete split is possible between the transfer functions and the initial power spectrum, giving improved efficiency handling fast parameters. Bug fixes and tidying of the proposal function. The initial power spectrum is no longer assumed smooth for P_k. GetDist limitsxx variables can be N to auto-size one end (margestats are still one-tail). Support for IBM XL Fortran (workarounds for a bug on Seaborg). GetDist will automatically compute some chain statistics to help diagnose convergence and accuracy. CAMB updated, including more accurate and faster handling of tight coupling. Option to generate chains including the CMB lensing effect. Various other changes.
- Nov 2002 Added support for polarization, and improved compatibility with different compilers and systems.
Reference links
- Probabilistic Inference Using Markov Chain Monte Carlo Methods
- Information Theory, Inference and Learning Algorithms
- MCMC Preprint Service
- Raftery and Lewis convergence diagnostics
Antony Lewis.