.. _PIDCalibrationWeights:

PID Calibration Weights
=======================

The PID calibration weights were introduced in `BELLE2-NOTE-TE-2021-27 `_ as a novel means for improving PID performance without using low-level variables. The gist of the method is to perform a weighted sum of detector log-likelihoods instead of a straight sum, with the weights trained using standard machine learning techniques. To facilitate this, we provide a script for training these weights, as well as a number of methods for preparing data samples for training and for analyzing the resulting performance.

Preparing Data Samples for Training
-----------------------------------

In order to train weights on a data sample, we first need to prepare the data by saving out only the information necessary for training. This is done with several methods provided in the ``pidDataUtils`` module.

1. Read your data into a DataFrame. We provide a ``read_root()`` method to do this (it is just a wrapper around ``uproot.concatenate()``, which can read several files and automatically concatenate them together). You can use whatever method you like, though. All that is required is that your DataFrame **must** contain momentum ('p'), cosine-theta ('cosTheta'), phi ('phi'), and detector log-likelihood data for any particles of interest.

2. For each particle type of interest, identify the tag that serves as a prefix in the column names, and use the ``make_h5()`` method to make a *slim* H5 file containing only the track and log-likelihood information for this particle type. For example, in a D* analysis, the tag for kaons is 'DST_D0_K', and for pions the tags are 'DST_D0_pi' and 'DST_pi'. There are two nuances to beware of here, however.

   1. If your DataFrame is from simulation and contains the 'mcPDG' column, you can provide the argument ``pdg=None``, and the 'pdg' column in the HDF5 file will be filled with the 'mcPDG' data. However, if you'd prefer to use the kinematic tags *or* you don't have 'mcPDG' data, provide the PDG code for the particle type you're extracting (e.g. ``pdg=321`` for kaons, ``pdg=211`` for pions, etc.).

   2. This method assumes that the detector log-likelihood columns are formatted as f"pidLogLikelyhoodOf{pdg}From{detector}". If the log-likelihood columns in your dataset are named differently, please provide a function to the ``column`` argument to yield these column names. The function should take two arguments: the particle name ('e', 'mu', 'pi', 'K', 'p', 'd') and the detector name ('SVD', 'CDC', 'TOP', 'ARICH', 'ECL', 'KLM'). It should return the corresponding detector log-likelihood column name as a string. A minimal sketch of such a function is given just after this list.

3. Once slim H5 files are made for each particle type, merge them together using the ``merge_h5s()`` method. This method has a ``pdgs`` argument. If ``pdgs=None``, the method will simply use the values from the 'pdg' columns in the given H5 files. One can also give a list of PDG values, one per filename, to be used instead. (If you set the PDG values in the last step, this shouldn't be necessary. If you filled the columns with the mcPDG values, however, and at *this* stage would prefer to use kinematic tags, this is how you could achieve it.)

4. Use the ``split_h5()`` method to finalize preparation by splitting the slim H5 file into train, validation, and test sets (or only train and validation sets, if ``test_size=0``). Note that the arguments ``train_size``, ``val_size``, and ``test_size`` need not sum to 1; the sizes will be renormalized if they do not.
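As mentioned in step 2 above, a custom ``column`` function only needs to map a particle name and a detector name to the corresponding column in your DataFrame. Below is a minimal sketch, assuming a purely hypothetical naming scheme f"ll_{particle}_{detector}"; substitute whatever format your ntuple actually uses.

.. code-block:: python

    import pidDataUtils as pdu

    df = pdu.read_root(['dstar_1.root', 'dstar_2.root'])  # DataFrame from step 1

    def my_column(particle, detector):
        """Return the detector log-likelihood column name for the given particle.

        Assumes (hypothetically) columns named like 'll_K_TOP'; adapt to your data.
        """
        return f"ll_{particle}_{detector}"

    # The function is handed to make_h5() via the ``column`` keyword argument.
    pdu.make_h5(df, 'DST_D0_K', 'slim_dstar_K.h5', pdg=321, column=my_column)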
Here is an example code snippet which we used to do this for a D* dataset.

.. code-block:: python

    import pidDataUtils as pdu

    df = pdu.read_root(['dstar_1.root', 'dstar_2.root'])
    pdu.make_h5(df, 'DST_D0_K', 'slim_dstar_K.h5', pdg=321)
    pdu.make_h5(df, ['DST_D0_pi', 'DST_pi'], 'slim_dstar_pi.h5', pdg=211)
    pdu.merge_h5s(['slim_dstar_K.h5', 'slim_dstar_pi.h5'], 'slim_dstar.h5')
    pdu.split_h5('slim_dstar.h5', 'slim_dstar/')

Training the Weights
--------------------

To train the weights, we provide a script: ``pidTrainWeights.py``. It can be run using

.. code-block:: bash

    python3 pidTrainWeights.py data/ models/net.pt -n 100

In this snippet, we assume that ``split_h5()`` has been run and that the output files were written into the folder ``data/``. We then specify that we want our final model to be saved to ``models/net.pt`` and that we want the model to be trained for 100 epochs. Note that the six-by-six array of weights by itself will also be written to ``models/net_wgt.npy``.

.. note::
    The output filename needs to end in ``.pt``, or else the script will fail when attempting to write the weights to a ``.npy`` file.

To train a model on the training dataset in ``data/`` but only over events within a certain interval in momentum and theta, one can specify the limits at the command line. For example, to train only on events with momentum between 0.5 and 1.5 GeV and theta between 15 and 45 degrees, one could add the arguments ``--p_lims 0.5 1.5 --theta_lims 15 45``.

One can also start training from an existing checkpoint with this script by using the ``--resume`` argument. There are three cases to consider with this argument.

1. Omitting ``--resume`` means that a fresh set of weights will be trained and written to the output filepath, overwriting any existing model at that location.
2. Including only ``--resume`` (as a flag) means that the network at the output filepath will be loaded, trained, and saved again, overwriting the model at the output location.
3. Including ``--resume path/to/existing/model.pt`` allows one to load and train the network at the given filepath, but the final model is saved to the output location (so the existing model is not overwritten).

A fresh, new network is initialized with all weights equal to 1 by default. If desired, one can instead start the network with weights sampled from a normal distribution of mean 1 and width 0.5 by using the ``--random`` flag.

Lastly, one can limit the allowed particle types using the ``--only`` flag. For example, if we are studying D* decays and only care about pions and kaons, and therefore want the non-{pi, K} particle types to have zero weight in the PID computations, we can specify ``--only 211 321`` to zero and freeze the weights for the other hypotheses. Not specifying ``--only`` means that all particle types will be used.
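Putting several of these options together: the following hypothetical invocation (the file names are placeholders) trains for 200 epochs on pions and kaons only, restricted to the kinematic region from the example above.

.. code-block:: bash

    python3 pidTrainWeights.py data/ models/net_slice.pt -n 200 \
        --p_lims 0.5 1.5 --theta_lims 15 45 \
        --only 211 321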
Below, you can find the full documentation for this script.

.. argparse::
    :filename: analysis/scripts/pidTrainWeights.py
    :func: get_parser
    :prog: pidTrainWeights.py
    :nodefault:

Applying Weights to Data for Performance Analysis
-------------------------------------------------

We also provide, through ``pidDataUtils``, methods for *applying* a set of trained weights to a data sample for analysis. Here's what that workflow might look like.

1. Read your data into a DataFrame. Much like when preparing data samples, you can use our ``read_root()`` method. We also provide ``read_h5()`` and ``read_npz()``, which read in the slim-format H5 files or the train, validation, and test set ``npz`` files used for training.

2. Use ``prepare_df()`` to prepare the data. This method will apply weights to the detector log-likelihoods and create a number of additional, useful columns in the DataFrame. These columns include likelihood ratios, binary likelihood ratios, single-detector and ablation PID, contribution metrics, and more. It does, however, have several arguments and features to be aware of.

   1. The weights are specified with the ``weights`` keyword argument. This argument expects a six-by-six NumPy array (which could be obtained by using ``np.load()`` on the ``_wgt.npy`` file produced during training) or, if there are individual weights for various momentum and theta bins, a dict. The dict should have int-tuples as keys and six-by-six NumPy arrays as values, where the int-tuple keys are ``(p_bin, theta_bin)``.

   2. Even if you are not using per-bin weights, you should still specify momentum and theta bins for the analysis. These are given as 1D NumPy arrays via the ``p_bins`` and ``theta_bins`` keyword arguments. By default, the bins are the standard Belle II systematics bins. Any events with momentum or theta outside of the bins will have their ``'p_bin'`` or ``'theta_bin'`` value set to -1. By default, these events are then cut from the DataFrame, but this can be disabled by setting ``drop_outside_bins=False``.

   3. If there are certain particle types that you want to exclude entirely from PID computations, you can do so by setting a list of allowed particles with the ``allowed_particles`` keyword argument. This expects a list of particle *names* (not PDG codes).

   4. As with the ``make_h5()`` function, there is again a keyword argument ``column`` for the case where your DataFrame contains detector log-likelihoods that are named in a special way. For this function, it is assumed by default that the detector log-likelihoods are simply named f"{detector}_{particle}", since that is the format used when we read in the slim H5 or npz files. If you're using a ROOT file that was read with ``read_root()`` and has columns in the format f"pidLogLikelyhoodOf{pdg}From{detector}", use ``column=root_column``. Otherwise, make your own function and give it here.

After using ``prepare_df()``, you should have a DataFrame with a number of additional columns, including labels, likelihood ratios, binary likelihood ratios, PID predictions, single-detector and ablation PID predictions, momentum and theta bins, and contribution metrics. Feel free to analyze these however you like; however, there exists a Python package, ``pidplots``, that expects DataFrames with these specific columns and produces a wide range of plots. It can be found `here `_.
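As a short illustration, here is a minimal sketch of this workflow using the single weight matrix produced by training and the default momentum and theta bins. The file paths are placeholders, and the exact behavior of each call (e.g. whether ``prepare_df()`` returns a new DataFrame) should be checked against the ``pidDataUtils`` documentation below.

.. code-block:: python

    import numpy as np
    import pidDataUtils as pdu

    # Load the test split written by split_h5() and the trained weight matrix
    df = pdu.read_npz('slim_dstar/test.npz')
    weights = np.load('models/net_wgt.npy')  # six-by-six array saved by pidTrainWeights.py

    # Apply the weights and build the additional analysis columns
    # (likelihood ratios, PID predictions, bins, contribution metrics, ...)
    df = pdu.prepare_df(df, weights=weights)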
PID calibration weights on the basf2 path
-----------------------------------------

The PID calibration weights can be registered in the database so that they can be used on the basf2 path. The module :b2:mod:`PIDCalibrationWeightCreator` can produce the dbobject ``PIDCalibrationWeight`` with a unique name for the weight matrix. One can find an example of the usage of this module in ``analysis/examples/PIDCalibration/02_SamplePIDAnalysis.py``. By loading the data object, the basf2 variables, such as :b2:var:`weightedElectronID`, provide the weighted PID probability computed from the original likelihoods and the given data object. One can specify the name of the weight matrix in the argument of these variables.

.. code-block:: python

    import basf2 as b2
    import modularAnalysis as ma

    # create path
    my_path = b2.create_path()

    # load the local dbobject
    localDB = 'localdb/database.txt'
    b2.conditions.append_testing_payloads(localDB)
    # or use the central global tag including the dbobject

    ma.fillParticleList('pi+:all', cut='', path=my_path)

    matrixName = "PIDCalibrationWeight_Example"
    ma.variablesToNtuple('pi+:all', ['pionID', 'weightedPionID(' + matrixName + ')'], path=my_path)

pidDataUtils Functions
----------------------

.. automodule:: pidDataUtils
    :members: