Contacts:
- Abraham Tishelman-Charny - [email protected]
- Badder Marzocchi - [email protected]
- Toyoko Orimoto - [email protected]
Presentations:
- 29 June 2020 Overall Analysis Update
- 23 June 2020 Resonant Status
- 19 May 2020 Non-Res Status
- 11 November 2019 Update
- 21 October 2019 Status
Repositories:
These instructions describe how to run flashgg modules specific to the HH->WWgg analysis. The current plugin designed to work with workspaceStd is the HHWWgg Tagger.
The HHWWgg development branch is obtained in a similar fashion to the dev_legacy_runII branch:
export SCRAM_ARCH=slc7_amd64_gcc700
cmsrel CMSSW_10_5_0
cd CMSSW_10_5_0/src
cmsenv
git cms-init
cd $CMSSW_BASE/src
git clone -b HHWWgg_dev https://github.com/atishelmanch/flashgg
source flashgg/setup_flashgg.sh
If everything now looks reasonable, you can build:
cd $CMSSW_BASE/src
scram b -j 4
To access grid files to run the tagger on, you must run the following commands:
cmsenv
voms-proxy-init --voms cms --valid 168:00
After the voms command, you should receive output similar to:
Created proxy in /tmp/x509up_u95168
To set this proxy as your X509_USER_PROXY environment variable for the example above, use the command:
. proxy.sh x509up_u95168
where x509up_u95168 would be replaced by whatever your proxy name is.
The HHWWgg tagger is developed to tag events as coming from the HH->WWgg process. It is compatible with workspaceStd so that the standard systematics workflow can be included and, if desired, other flashgg tags can be evaluated on the same events.
The HHWWgg Tagger can be run locally on signal (with 2017 metaConditions) with:
cmsRun Systematics/test/workspaceStd.py metaConditions=MetaData/data/MetaConditions/Era2017_RR-31Mar2018_v1.json campaign=HHWWgg_v2-6 dataset=ggF_X600_HHWWgg_qqlnu doHHWWggTag=1 HHWWggTagsOnly=1 maxEvents=500 doSystematics=0 dumpWorkspace=0 dumpTrees=1 useAAA=1 doHHWWggTagCutFlow=1 saveHHWWggFinalStateVars=1
and on 2016 data:
cmsRun Systematics/test/workspaceStd.py metaConditions=MetaData/data/MetaConditions/Era2016_RR-17Jul2018_v1.json campaign=Era2016_RR-17Jul2018_v2 dataset=/DoubleEG/spigazzi-Era2016_RR-17Jul2018_v2-legacyRun2FullV1-v0-Run2016B-17Jul2018_ver2-v1-86023db6be00ee64cd62a3172358fb9f/USER doHHWWggTag=1 HHWWggTagsOnly=1 maxEvents=500 doSystematics=0 dumpWorkspace=0 dumpTrees=1 useAAA=1 processId=Data processType=Data doHHWWggTagCutFlow=1 saveHHWWggFinalStateVars=1
and on 2017 data:
cmsRun Systematics/test/workspaceStd.py metaConditions=MetaData/data/MetaConditions/Era2017_RR-31Mar2018_v1.json campaign=Era2017_RR-31Mar2018_v2 dataset=/DoubleEG/spigazzi-Era2017_RR-31Mar2018_v2-legacyRun2FullV1-v0-Run2017B-31Mar2018-v1-d9c0c6cde5cc4a64343ae06f842e5085/USER doHHWWggTag=1 HHWWggTagsOnly=1 maxEvents=500 doSystematics=0 dumpWorkspace=0 dumpTrees=1 useAAA=1 processId=Data processType=Data doHHWWggTagCutFlow=1 saveHHWWggFinalStateVars=1
and on 2018 data:
cmsRun Systematics/test/workspaceStd.py metaConditions=MetaData/data/MetaConditions/Era2018_RR-17Sep2018_v1.json campaign=Era2018_RR-17Sep2018_v2 dataset=/EGamma/spigazzi-Era2018_RR-17Sep2018_v2-legacyRun2FullV2-v0-Run2018A-17Sep2018-v2-dc8e5fb301bfbf2559680ca888829f0c/USER doHHWWggTag=1 HHWWggTagsOnly=1 maxEvents=500 doSystematics=0 dumpWorkspace=0 dumpTrees=1 useAAA=1 processId=Data processType=Data doHHWWggTagCutFlow=1 saveHHWWggFinalStateVars=1
All flags are defined either in MetaData/python/JobConfig.py or in workspaceStd.
An explanation of the flags in this example:
- metaConditions: A json file of tags and conditions defined for each year. In this example, 2017 conditions are specified, so that the correct conditions are used when running on 2017 Data and MC.
- campaign: The flashgg campaign where the files you want to run on are defined.
- dataset: The dataset within the specified campaign where the files you want to run on are defined.
- doHHWWggTag: Setting this flag to 1 tells the workspaceStd flow to evaluate each event with the HHWWgg Tagger.
- HHWWggTagsOnly: This flag removes all taggers other than the HHWWgg Tagger, mainly all of the Higgs->gg taggers.
- doSystematics: In this example set to 0. If set to 1, the workspaceStd systematics flow is included where systematic labels are defined in workspaceStd. For each systematic, the tagger is rerun on the microAOD where the systematic quantity is either varied up or down one sigma. If you run with this flag on, there should be a tree (if running with dumpTrees) or a RooDataHist (if running with dumpWorkspace) for each systematic variation.
- maxEvents: Max events to run over in the specified dataset. Set to -1 to run on all events.
- dumpWorkspace: Save RooWorkspace in output file. Useful for input into fggfinalfit.
- dumpTrees: Save tree(s) in output file. Useful for running ntuple analysis afterwards.
- useAAA: Use prefix: “root://cms-xrd-global.cern.ch/” when looking for files.
- processId / processType: Set to “Data” when running on data.
- doHHWWggTagCutFlow: Categorize all events that pass preselection into HHWWgg categories. Without this flag, events that do not pass all analysis selections are cut.
- saveHHWWggFinalStateVars: Save many final state variables such as kinematics for leptons and jets before and after analysis level selections. Variables are defined in Systematics/python/HHWWggCustomize.py
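Because the key=value flags above make for long command lines, it can help to assemble them from a dictionary. The following is a minimal Python sketch, not part of flashgg: the helper name build_cmsrun_command is hypothetical, and it only builds the command string for the 2017 signal example; it does not run cmsRun.

```python
import shlex


def build_cmsrun_command(config, options):
    """Join a config path and key=value options into one cmsRun command line.

    Hypothetical helper for illustration only; it is not provided by flashgg.
    """
    args = ["cmsRun", config]
    args += ["{}={}".format(key, value) for key, value in options.items()]
    # shlex.quote leaves plain tokens untouched and quotes shell-unsafe ones.
    return " ".join(shlex.quote(arg) for arg in args)


# Flags copied from the 2017 signal example in this README.
signal_options = {
    "metaConditions": "MetaData/data/MetaConditions/Era2017_RR-31Mar2018_v1.json",
    "campaign": "HHWWgg_v2-6",
    "dataset": "ggF_X600_HHWWgg_qqlnu",
    "doHHWWggTag": 1,
    "HHWWggTagsOnly": 1,
    "maxEvents": 500,
    "doSystematics": 0,
    "dumpWorkspace": 0,
    "dumpTrees": 1,
    "useAAA": 1,
    "doHHWWggTagCutFlow": 1,
    "saveHHWWggFinalStateVars": 1,
}

command = build_cmsrun_command("Systematics/test/workspaceStd.py", signal_options)
print(command)
```

Changing a single entry, for example maxEvents to -1, then gives the command for running on all events.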
With the options specified in the example, if this works properly, you should get an output file named: output_numEvent500.root containing a tree for each HHWWggTag.
The customization for the HHWWggTag class is defined in a few places, starting with Systematics/python/HHWWggCustomize.py. In this python module you can specify variables to save, and the number of categories to save HHWWggTag objects in. The selections are located in Taggers/plugins/HHWWggTagProducer.cc. For the moment, a tag object “tag_obj” is created if an event has a diphoton, exactly one good lepton, corresponding to the leptonically decaying W boson, and at least two ‘good’ jets, corresponding to the hadronically decaying W boson. For these objects, ‘good’ is defined by the selections specified in Taggers/python/flashggHHWWggTag_cfi.py. This tag object can be created and placed into one of three categories:
- HHWWggTag_0: Semileptonic electron final state (qqlnugg with l = electron)
- HHWWggTag_1: Semileptonic muon final state (qqlnugg with l = muon)
- HHWWggTag_2: Untagged (if doHHWWggTagCutFlow=1)
Note that the untagged category is only filled if you are running with the flag doHHWWggTagCutFlow=1. To add another category, the number of categories specified in Systematics/python/HHWWggCustomize.py should be changed like so: self.tagList = [ ["HHWWggTag",3] ] -> self.tagList = [ ["HHWWggTag",4] ]. Then, when saving a tag object of the new category, you would do so in Taggers/plugins/HHWWggTagProducer.cc with tag_obj.setCategoryNumber( 3 ) rather than tag_obj.setCategoryNumber( catNum ) where catNum = 0, 1, or 2.
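To make the effect of changing the category count concrete, here is a small Python sketch that expands a tagList-style entry into category labels. The function category_labels is a hypothetical illustration and is not how flashgg itself builds its output names; it only mirrors the HHWWggTag_0, HHWWggTag_1, ... naming used in the list above.

```python
def category_labels(tag_list):
    """Expand [[name, n_categories], ...] into labels name_0 ... name_(n-1).

    Hypothetical illustration of the naming pattern, not flashgg code.
    """
    labels = []
    for name, n_categories in tag_list:
        labels.extend("{}_{}".format(name, i) for i in range(n_categories))
    return labels


current = [["HHWWggTag", 3]]   # the three categories described above
extended = [["HHWWggTag", 4]]  # after adding one more category

print(category_labels(current))
print(category_labels(extended))
```

With four categories the new label is HHWWggTag_3, matching the category number passed to setCategoryNumber( 3 ).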
When running over entire datasets, it’s useful to submit condor jobs instead of running locally. This is done with the script HHWWgg_Run_Jobs.sh.
Note: You must first follow the proxy steps above in order to have access to DAS datasets.
Note: There are two user specific parameters in the script: fggDirec and ntupleDirec, which are by default set to:
fggDirec="/afs/cern.ch/work/a/atishelm/21JuneFlashgg/CMSSW_10_5_0/src/flashgg/" # flashgg directory
ntupleDirec="/eos/user/a/atishelm/ntuples/HHWWgg/" # condor output directory
- fggDirec: Your current working directory where you have flashgg cloned.
- ntupleDirec: The directory where you want your output files to go. Note that this is the directory where a directory will be created for each batch of jobs, so you don’t need to change this for every submission.
There are two submission types currently in HHWWgg_Run_Jobs.sh:
- Trees with many final state variables
- Workspaces with minimal variables
The many final state variables job is useful for studying the kinematics of all final state objects, including leptons and jets before and after selections, as well as the two photons associated with the diphoton candidate. As an example, to run over all events of signal and save trees with final state variables, one should run:
. HHWWgg_Run_Jobs.sh --labelName HHWWgg_v2-6_Trees_X600_Test --nEvents all --json Taggers/test/HHWWgg_v2-6/HHWWgg_v2-6_X600.json --condorQueue longlunch --year 2017 -g -c -v -t
An explanation of the flags:
- labelName: The name used for the output folder placed in ntupleDirec
- nEvents: The max events to run on. To run on all events, specify the flag argument: “all”
- json: The json file to use for fggrunjobs submission. This should contain the datasets to run on, and specify the campaign and the PU target for MC jobs.
- condorQueue: The condor flavour for the condor jobs. Note that this needs to be carefully selected, otherwise jobs may time out and no output will be produced. You may need to try multiple flavours to find the ideal one for your job type.
- year: Specifies the MetaConditions to use. 2016, 2017 or 2018
- g: Use workspaceStd as the cms configuration file
- c: Run HHWWgg cut flow. This means all events that pass preselection will be saved in output nTuples.
- v: Save HHWWgg final state variables. Note that this is currently set up to save MANY variables, so it may take more computing time than normal.
- t: Save trees in output nTuples. Useful for python modules / c++ macros designed for nTuple analysis with TTrees / TBranches.
In this example the HHWWgg_v2-6 json is specified. This is a campaign with three signal mass points: 260, 600, 1000 GeV Radion decaying semileptonically with all lepton decays, including taus. Any json file can be specified as long as it is formatted properly. You should be able to find some examples under Taggers/test/*HHWWgg*. These input json files can also be created from text files of dataset names with SampleTools.py.
Note: In order for flashgg campaigns to be defined and therefore accessed via the fggrunjobs json specified with the --json flag, they must be created with fggManageSamples.py. You can find instructions for performing this here and here.
If your campaign exists in MetaData/data/, specifying the campaign and datasets in the json should be enough for fggrunjobs to find them. Note that HHWWgg_v2-6 should be defined for this state of the cloned repository.
To produce workspaces with minimal variables to be used by fggfinalfit, you can for example run:
. HHWWgg_Run_Jobs.sh --labelName HHWWgg_v2-6_Workspaces_X600_Test --nEvents all --json Taggers/test/HHWWgg_v2-6/HHWWgg_v2-6_X600.json --condorQueue microcentury --year 2017 -g -s -w
Explaining the new flags:
- s: Run flashgg systematics workflow. Required to obtain final results in fggfinalfit with systematic uncertainty. Note that even if you just want a stat only result, it is useful to add systematics as you can just choose not to include them in fggfinalfit.
- w: Save workspaces in output. Used by fggfinalfit.
If this works properly, the output will be files (to be hadded) containing a RooWorkspace with the variables required for fggfinalfit, namely CMS_hgg_mass and dZ (for signal).
To produce workspaces for 2017 data, you would run a similar command but with the 2017 DoubleEG dataset input for the json file:
. HHWWgg_Run_Jobs.sh --labelName HHWWgg_v2-6_2017_Data_Workspaces --nEvents all --json Taggers/test/HHWWgg_2017_Data_All/HHWWgg_Data_All_2017.json --condorQueue longlunch --year 2017 -g -s -w
To produce ntuples for 2017 Data (DoubleEG dataset) and MC, you would run HHWWgg_Run_Jobs with the json files specifying 2017 data and MC, and the flags that save trees with many final state variables for many objects, including leptons and jets before and after any selections are applied. This is useful for MVA studies, where training information can be input with limited selections applied in order to increase statistics.
To use the HHWWgg_Run_Jobs.sh script, make sure to first edit the fggDirec and ntupleDirec variables, as described above in the beginning of the “Running on Condor” section.
In order to submit jobs with 2017 Data, you would run the command:
. HHWWgg_Run_Jobs.sh --labelName HHWWgg_2017_Data_Trees --nEvents all --json Taggers/test/HHWWgg_2017_Data_All/HHWWgg_Data_All_2017.json --condorQueue longlunch --year 2017 -g -c -v -t
Note: In the above example the condor job flavour “longlunch” is specified, giving each job a maximum of two hours to complete. Depending on how long the job takes, it may be necessary to specify the next flavour, “workday”, which sets the maximum running time of each job to 8 hours and makes it more likely that the job completes. However, the job may then take longer overall, as it may have worse priority (I am not 100% sure of all the details of how condor works, hence the vague language).
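As a rough aid for choosing a flavour, the following Python sketch picks the shortest flavour whose time limit covers an expected job duration. Only the longlunch (2 hours) and workday (8 hours) limits come from this README; the other entries are the values commonly quoted for the CERN batch system and are an assumption that should be checked against the current batch documentation. The function shortest_flavour is a hypothetical helper.

```python
# (flavour name, maximum run time in seconds), ordered from shortest to longest.
# longlunch and workday are stated in this README; the rest are assumed values.
JOB_FLAVOURS = [
    ("espresso", 20 * 60),
    ("microcentury", 60 * 60),
    ("longlunch", 2 * 60 * 60),
    ("workday", 8 * 60 * 60),
    ("tomorrow", 24 * 60 * 60),
    ("testmatch", 3 * 24 * 60 * 60),
    ("nextweek", 7 * 24 * 60 * 60),
]


def shortest_flavour(expected_seconds):
    """Return the shortest flavour whose limit covers expected_seconds."""
    for name, limit in JOB_FLAVOURS:
        if expected_seconds <= limit:
            return name
    raise ValueError("no flavour allows a job of %d seconds" % expected_seconds)


print(shortest_flavour(90 * 60))      # a job expected to take 1.5 hours
print(shortest_flavour(5 * 60 * 60))  # a job expected to take 5 hours
```

In practice it is safer to leave some margin on the expected duration, since a job that hits its limit produces no output.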
At the moment for HHWWgg, there is a json file specifying the backgrounds that are relevant for this analysis: Taggers/test/DataMC/Flashgg_bkg.json. To run the same tagger on this json, you would simply submit with the command:
. HHWWgg_Run_Jobs.sh --labelName HHWWgg_2017_FggBackground_Trees --nEvents all --json Taggers/test/DataMC/Flashgg_bkg.json --condorQueue workday --year 2017 -g -c -v -t
For this example, workday may be a better choice of job flavour, as some backgrounds with many events, such as GJet, QCD, Drell Yan and DiPhotonJetsBox, may take a long time to run.
After your condor jobs are complete, you should have a number of output files for each signal point or data taking era. The first check is to make sure the number of output files equals the number of condor jobs. If there are output files missing, the condor .err, .out and .log files may point to the reason why.
Once you have checked that all of the output files are present, the files can be hadded as described in this section.
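The file-count check can be sketched in Python as below. This assumes that output names end in _<job index>.root, as in the example name output_ggF_X600_HHWWgg_qqlnu_6.root quoted in this README, and that job indices start at 0; both are assumptions to verify against your own output. The helper missing_job_indices is hypothetical.

```python
import re


def missing_job_indices(file_names, n_jobs):
    """Return the job indices in range(n_jobs) that have no output file.

    Assumes names end in _<index>.root with indices starting at 0.
    """
    pattern = re.compile(r"_(\d+)\.root$")
    found = set()
    for name in file_names:
        match = pattern.search(name)
        if match:
            found.add(int(match.group(1)))
    return sorted(set(range(n_jobs)) - found)


# Example listing; in practice this could come from os.listdir(output_dir).
files = [
    "output_ggF_X600_HHWWgg_qqlnu_0.root",
    "output_ggF_X600_HHWWgg_qqlnu_1.root",
    "output_ggF_X600_HHWWgg_qqlnu_3.root",
]
print(missing_job_indices(files, 4))
```

Any index reported as missing identifies which condor .err, .out and .log files to inspect first.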
If you ran with trees, these are hadded in the usual way with the hadd command (Documentation Needed).
If you ran with workspaces, you need to hadd the workspaces in order to obtain a root file with a single combined root workspace for each signal point to work with fggfinalfit. This can be done with the script HHWWgg_Process_Files.sh. As with the HHWWgg_Run_Jobs script, you need to first set your user specific variables, namely the nTupleDirec and fggDirec vars. After doing this, to hadd the workspaces from the previous job, assuming they’re in your ntuple directory with the name “HHWWgg_v2-6_Workspaces_X600”, you would run the command:
. HHWWgg_Process_Files.sh --inFolder HHWWgg_v2-6_Workspaces_X600 --outFolder HHWWgg_v2-6_Workspaces_X600_Hadded -s --signalType Res
Explaining each flag:
- inFolder: The directory in nTuplesDirec with files to be hadded
- outFolder: The directory in nTuplesDirec you want the hadded files to go into
- s: Look for file names with the format of signal files
- signalType: Look for file names with the name format of resonant signals. Ex: “output_ggF_X600_HHWWgg_qqlnu_6.root”. It’s important that the file names are of the expected format, as this script and fggfinalfit scripts will use this to obtain quantities like the resonant masses.
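To illustrate why the file name format matters, here is a minimal Python sketch that extracts the resonant mass from a name such as the example above. The regular expression and the function resonant_mass are assumptions made for illustration; the actual parsing in HHWWgg_Process_Files.sh and fggfinalfit may differ.

```python
import re


def resonant_mass(file_name):
    """Extract the mass in GeV from the _X<mass>_ token of a resonant file name.

    Illustrative only; assumes the naming seen in the example file name.
    """
    match = re.search(r"_X(\d+)_", file_name)
    if match is None:
        raise ValueError("no _X<mass>_ token found in %r" % file_name)
    return int(match.group(1))


print(resonant_mass("output_ggF_X600_HHWWgg_qqlnu_6.root"))
```

A file that has been renamed so that the _X<mass>_ token is lost would raise an error here, which is the kind of failure an unexpected name format can cause downstream.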
This tells the script to hadd files in nTuplesDirec/HHWWgg_v2-6_Workspaces_X600 using the flashgg script Systematics/scripts/hadd_all.py, and put the output files in your desired outFolder. Note that this is set up to work for any number of resonant mass points, NMSSM mass pairs or EFT benchmarks located in the --inFolder.
If this works properly for this example, you should have a single hadded file in HHWWgg_v2-6_Workspaces_X600_Hadded for the 600 GeV resonant point. This will be the input signal file for fggfinalfit.
To do the same for data, after running HHWWgg_Run_Jobs on a data json and directing your output files to HHWWgg_v2-6_2017_Data_Workspaces, you would run:
. HHWWgg_Process_Files.sh --inFolder HHWWgg_v2-6_2017_Data_Workspaces --outFolder HHWWgg_v2-6_2017_Data_Workspaces_Hadded -d
Explaining the new flag:
- d: Don’t look for special file name formats.
By default this should hadd by data era. For example for 2017 data, this should result in 5 hadded files in HHWWgg_v2-6_2017_Data_Workspaces_Hadded, one for each Era from B to F which should be named Data_0.root, Data_1.root, … You would then want to hadd these into a single hadded file for all of 2017 data to be used by fggfinalfit. This can be done with the command:
. HHWWgg_Process_Files.sh --inFolder HHWWgg_v2-6_2017_Data_Workspaces_Hadded --outFolder HHWWgg_v2-6_2017_Data_Workspaces_Hadded_Combined -d -c
where the new flag is:
- c: Combine all data eras.
This command will hadd Data_*.root into a single file: HHWWgg_v2-6_2017_Data_Workspaces_Hadded_Combined/allData.root. This contains a single workspace with all the data you ran on, and is used as the input for flashggfinalfit.