Process Mappers in batch

We usually want to run Mapper with multiple configurations and compare the final results. The tool implemented here is a helper for running a batch of Mappers on a dataset.

These are the main parts of processing one Mapper for one data item (subject):

  1. Load a data item from a cohort file into a matrix (see Individual Steps).

  2. Preprocess the matrix (see the run_preprocess function).

  3. Run a Mapper configuration on the matrix.

  4. Run all analysis steps (see the run_analysis function).

The run_main() script runs the steps above in parallel for all data items (subjects) and all Mapper configurations.
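The four steps and the batch loop around them can be sketched in Python as follows. This is only an illustration of the control flow: the functions load_item, preprocess, run_mapper, and run_analyses are hypothetical stand-ins, not DeMapper functions, and the thread pool merely plays the role that the MATLAB worker pool plays in run_main.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


# Hypothetical stand-ins for the four steps; none of these are DeMapper functions.
def load_item(item):
    """Step 1: load one data item into a matrix (dummy data here)."""
    return [[1.0, 2.0], [3.0, 4.0]]


def preprocess(matrix):
    """Step 2: preprocess the matrix (identity in this sketch)."""
    return matrix


def run_mapper(matrix, config):
    """Step 3: run one Mapper configuration on the matrix."""
    return {"config": config, "n_rows": len(matrix)}


def run_analyses(result):
    """Step 4: run all analysis steps on the Mapper result."""
    return {"status": "done", **result}


def process_one(item, config):
    """Run steps 1-4 for a single (item, configuration) pair."""
    matrix = preprocess(load_item(item))
    return item, config, run_analyses(run_mapper(matrix, config))


def run_batch(items, configs, poolsize=2):
    """Process every item with every configuration, poolsize jobs at a time."""
    jobs = list(product(items, configs))
    with ThreadPoolExecutor(max_workers=poolsize) as pool:
        # pool.map returns results in the same order as the submitted jobs.
        return list(pool.map(lambda job: process_one(*job), jobs))
```

Calling run_batch with two items and three configurations yields six results, one per pair.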

Usage Example

Outside of tests and small experiments, the process is run from the command line, where the inputs are set as variables:

Example command line call to run_main
matlab -r "cohort_csv='...'; config_path='...'; run('/.../demapper/code/analysis/run_main.m'); "

Below, I present a more detailed example with all the required arguments set:

Example script (BASH / SBATCH)
PROJECT_ROOT=/Users/daniel/project
CONFNAME=mappers_v0.1.json

MATLAB_ARGS=""
MATLAB_ARGS="${MATLAB_ARGS} poolsize=8;"
MATLAB_ARGS="${MATLAB_ARGS} cohort_csv='${PROJECT_ROOT}/cohort_mapper.csv';"
MATLAB_ARGS="${MATLAB_ARGS} config_path='${PROJECT_ROOT}/${CONFNAME}';"
MATLAB_ARGS="${MATLAB_ARGS} data_root='${PROJECT_ROOT}/data/';"
MATLAB_ARGS="${MATLAB_ARGS} output_dir='${PROJECT_ROOT}/results/${CONFNAME}';"

DEMAPPER_MAIN="${PROJECT_ROOT}/demapper/code/analysis/run_main.m"

# write command, submit, wait
CMD="matlab -r \"${MATLAB_ARGS} run('${DEMAPPER_MAIN}')\""
echo "$CMD"
eval "$CMD"
wait

where we have the following files:

cohort_mapper.csv
id0,id1,id2,path,TR
SBJ99,,,SBJ99_BOLD.npy,1.5

mappers_v0.1.json
{
  "preprocess": [
    { "type": "zscore" }
  ],
  "mappers": [{
    "type": "BDLMapper",
    "k": 32,
    "resolution": [20, 40],
    "gain": 60
  },
  {
    "type": "NeuMapper",
    "k": 16,
    "resolution": 100,
    "gain": [70, 80, 90]
  }],
  "analyses": [
    { "type": "plot_graph" },
    { "type": "compute_stats","args": { "HRF_threshold": 11 } },
    { "type": "compute_temp" }
  ]
}

The JSON configuration above generates five Mapper results for each item in the cohort file: two BDLMapper results, one per resolution (20 and 40), named BDLMapper_32_20_60 and BDLMapper_32_40_60, and three NeuMapper results, one per gain value (70, 80, and 90), named similarly.
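To make the expansion concrete, here is a minimal Python sketch of how list-valued parameters could be expanded into individual Mapper runs. The function expand_mappers is hypothetical and not part of the tool. The naming pattern type_k_resolution_gain is inferred from the two BDLMapper names above, and the NeuMapper names it produces are an assumption based on "similarly named".

```python
from itertools import product

# The "mappers" section of the example configuration above.
MAPPERS = [
    {"type": "BDLMapper", "k": 32, "resolution": [20, 40], "gain": 60},
    {"type": "NeuMapper", "k": 16, "resolution": 100, "gain": [70, 80, 90]},
]


def expand_mappers(mappers):
    """Expand list-valued parameters into one name per parameter combination.

    Assumption: names follow type_k_resolution_gain, as the BDLMapper
    names in the text suggest.
    """
    names = []
    for mapper in mappers:
        # Wrap scalars so that every parameter is a list of candidate values.
        candidates = [
            value if isinstance(value, list) else [value]
            for value in (mapper["k"], mapper["resolution"], mapper["gain"])
        ]
        for k, resolution, gain in product(*candidates):
            names.append(f"{mapper['type']}_{k}_{resolution}_{gain}")
    return names


for name in expand_mappers(MAPPERS):
    print(name)
```

The loop prints five names, two for BDLMapper and three for NeuMapper, matching the count stated above.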

This configuration runs one preprocessing step (zscore) and three analysis steps: plot_graph, compute_stats, and compute_temp. See the run_preprocess and run_analysis functions for detailed descriptions of the available preprocessing and analysis steps.

Usage

code.analysis.run_main()

RUN_MAIN runs mapper on a large cohort of items based on the config. The mapper config is defined at config_path.

The items (cohort) to be processed are defined by the lines of the CSV file at the path cohort_csv. Each item’s path is relative to the path data_root.

Finally, the output is saved at the path output_dir.

Parameters:
  • cohort_csv – the path to a CSV file with the header id0,id1,id2,path. Each id can be any string, and path is the location of the 1D file used for processing Mapper. The path is relative to data_root.

  • data_root – the absolute root directory to the paths defined in the CSV file: cohort_csv

  • config_path – the path to the config file that defines which mappers and analyses to use. This configuration also defines the preprocessing steps. Some example configurations are at tests/fixtures/config*.json or in the usage example above.

  • output_dir – the absolute root directory of where to save the results of the mappers and the analysis.

  • poolsize – (optional) number of threads to use for parallelizing the processing and analysis. This is highly recommended for big jobs. Usually it is set to the number of CPU cores.

  • rerun_analysis – (optional) if this is set, the process reuses pre-computed mapper results from a previous run and reruns only the specified analysis. Possible values are “plot_graph”, “plot_task”, “compute_stats”, and “compute_temp”; check the run_analysis function for the latest list.

  • rerun_uncomputed – (optional) if this boolean option is set, the process reruns all items of the cohort that, according to the status.csv file, either failed or did not finish.
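As an illustration of how cohort_csv and data_root relate, the following Python sketch parses the example cohort file from the usage section and resolves each item's path against data_root. The helper read_cohort and the full_path key are hypothetical names introduced for this example. The tool itself does this in MATLAB, possibly differently.

```python
import csv
import io
import posixpath

# Contents of the example cohort_mapper.csv from the usage section.
COHORT_CSV = "id0,id1,id2,path,TR\nSBJ99,,,SBJ99_BOLD.npy,1.5\n"


def read_cohort(csv_text, data_root):
    """Parse cohort rows and attach the data file path resolved against data_root."""
    items = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # "path" in the CSV is relative to data_root.
        row["full_path"] = posixpath.join(data_root, row["path"])
        items.append(row)
    return items


items = read_cohort(COHORT_CSV, "/Users/daniel/project/data/")
print(items[0]["id0"], items[0]["full_path"])
```

Empty id columns, such as id1 and id2 in the example row, come back as empty strings.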

Output structure:

The result of an individual item will be saved at: id0/id1/id2/MapperID/, where:

  • Each id (id0, id1, id2) is taken from the cohort_csv for the item

  • The id MapperID is generated from the specific mapper from config_path

Moreover, a status file is saved at config_name/status.csv; it records the status of the process running all mappers.
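The layout id0/id1/id2/MapperID/ can be sketched as a small path-building function in Python. The function result_dir is hypothetical. The sketch assumes that ids left empty in the cohort file add no directory level, which the text above does not confirm.

```python
import posixpath


def result_dir(output_dir, id0, id1, id2, mapper_id):
    """Build the result directory for one item and one Mapper configuration.

    Assumption (not confirmed by the documentation): empty ids are skipped
    rather than producing empty path components.
    """
    ids = [part for part in (id0, id1, id2) if part]
    return posixpath.join(output_dir, *ids, mapper_id)


# The example item SBJ99 has empty id1 and id2.
print(result_dir("/Users/daniel/project/results", "SBJ99", "", "", "BDLMapper_32_20_60"))
```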

Individual Steps