Using Scikit-Learn to classify signal using Secondary vertex characteristics

SwastiRathod · 8 August 2024 06:45

I have used the Random Forest Classifier in Scikit-Learn to classify signal and background using secondary vertex characteristics. This study is part of a search for long-lived particles in SUSY (BSM physics).

My data sources are:

Background: Secondary Vertices from Standard Model QCD Process (/QCD_Pt_300to470_TuneCP5_13TeV_pythia8/RunIISummer20UL16MiniAODv2-106X_mcRun2_asymptotic_v17-v1/MINIAODSIM | CERN Open Data Portal)
Signal: Secondary Vertices from Bulk Graviton to Higgs to b meson Channel (/BulkGravTohhTohbbhbb_narrow_M-600_13TeV-madgraph/RunIISummer16MiniAODv2-PUMoriond17_80X_mcRun2_asymptotic_2016_TrancheIV_v6_ext1-v1/MINIAODSIM | CERN Open Data Portal)

This is my feature importance curve.

These are the histograms from both data files.

Is the training correct? I expected svMass and svDxy to be more important features than svZ. I would like to know whether there is an overtraining issue or whether this is how it is supposed to be.

jmhogan · 8 August 2024 13:38

Hi Swasti,

Unfortunately, this isn’t really something the Open Data team can comment on. We are here to help deal with issues accessing the Open Data or understanding how to interpret and use the various CMS data formats.

There’s no real way for us to know whether or not your network is overtrained or giving optimal separation. To my eye, both of those variables look like similar discriminators… You might consider giving the absolute value of the svZ?
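As a rough illustration of the two points above, the sketch below uses |svZ| as the input feature and compares training and held-out performance, which is one common way to look for overtraining. It runs on synthetic stand-in numbers, not CMS data: the distributions, the column name `abs_svZ`, and the forest settings are all assumptions made for the example. Permutation importance on the held-out set is shown as one alternative to the forest's built-in importances.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000

# Synthetic stand-ins for the vertex features (assumed shapes, NOT CMS data).
y = rng.integers(0, 2, size=n)                        # 0 = background, 1 = signal
sv_z = rng.normal(0.0, np.where(y == 1, 3.0, 1.0))    # symmetric about zero
sv_mass = rng.normal(np.where(y == 1, 2.5, 1.5), 1.0)
sv_dxy = rng.exponential(np.where(y == 1, 0.5, 0.2))

# Feed |svZ| instead of the signed value, since the sign carries no information here.
X = np.column_stack([sv_mass, sv_dxy, np.abs(sv_z)])
names = ["svMass", "svDxy", "abs_svZ"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

clf = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=0)
clf.fit(X_train, y_train)

# A large gap between training and held-out AUC is a hint of overtraining.
train_auc = roc_auc_score(y_train, clf.predict_proba(X_train)[:, 1])
test_auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
print(f"train AUC = {train_auc:.3f}, test AUC = {test_auc:.3f}")

# Permutation importance evaluated on the held-out set.
perm = permutation_importance(
    clf, X_test, y_test, n_repeats=5, random_state=0, scoring="roc_auc"
)
for name, mean in zip(names, perm.importances_mean):
    print(f"{name}: {mean:.4f}")
```

If the held-out AUC is much lower than the training AUC, limiting `max_depth` or raising `min_samples_leaf` are typical first things to try.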

If you have a question about whether you’ve accessed a variable in the correct way before feeding it to your network, we can try to advise if given some code snippets.

Regards,
Julie

SwastiRathod · 8 August 2024 14:53

Hello Julie, thank you for the reply.

I extracted these parameters using CMSSW with a MiniAnalyzer. I am using the following dataset as my signal. It would be great if you could confirm whether the dataset is correct.
Signal: /BulkGravTohhTohbbhbb_narrow_M-600_13TeV-madgraph/RunIISummer16MiniAODv2-PUMoriond17_80X_mcRun2_asymptotic_2016_TrancheIV_v6_ext1-v1/MINIAODSIM | CERN Open Data Portal

For background I am using
https://opendata.cern.ch/record/63193
as suggested by you.

Can I use the Physics Object Extractor Tool for 2016 data?

jmhogan · 8 August 2024 15:13

Hi Swasti,

It’s a valid dataset – whether it is “correct” will depend on what you hope it contains. If you are looking for an Rkk Graviton that decays to 2 Higgs bosons, then this should be a good sample.

When I suggested that QCD dataset, I didn’t understand that you needed to use this older Rkk Graviton dataset as your signal. It’s not optimal to mix simulation types within an analysis. So if you MUST use this Rkk dataset as your signal (because you have determined that there are no similar Graviton datasets in the newer 2016 data release), then I would go back and find a QCD background dataset that has the same “RunIISummer16MiniAODv2” label.
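A quick way to spot this kind of mismatch is to compare the campaign label embedded in each dataset name. The sketch below assumes the names follow the /primary/processed/TIER pattern seen in the two datasets quoted earlier in this thread, with the label being the first hyphen-separated token of the middle part. The helpers `campaign_label` and `same_campaign` are hypothetical, written only for this illustration.

```python
def campaign_label(dataset_path: str) -> str:
    """Return the campaign label from a /primary/processed/TIER dataset name.

    Assumption: the label is the first hyphen-separated token of the
    processed-name component, as in the two datasets from this thread.
    """
    parts = dataset_path.strip("/").split("/")
    if len(parts) != 3:
        raise ValueError(f"unexpected dataset name: {dataset_path!r}")
    return parts[1].split("-")[0]


def same_campaign(a: str, b: str) -> bool:
    """True if both dataset names carry the same campaign label."""
    return campaign_label(a) == campaign_label(b)


SIGNAL = (
    "/BulkGravTohhTohbbhbb_narrow_M-600_13TeV-madgraph/"
    "RunIISummer16MiniAODv2-PUMoriond17_80X_mcRun2_asymptotic_2016_TrancheIV_v6_ext1-v1/"
    "MINIAODSIM"
)
BACKGROUND = (
    "/QCD_Pt_300to470_TuneCP5_13TeV_pythia8/"
    "RunIISummer20UL16MiniAODv2-106X_mcRun2_asymptotic_v17-v1/"
    "MINIAODSIM"
)

print(campaign_label(SIGNAL))             # RunIISummer16MiniAODv2
print(campaign_label(BACKGROUND))         # RunIISummer20UL16MiniAODv2
print(same_campaign(SIGNAL, BACKGROUND))  # False
```

Here the two labels differ, which is the mixing of simulation types described above.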

The Physics Object Extractor Tool has not been set up for 2016, since almost everyone will be able to use the official NanoAOD samples. However, you can start from the 2015MiniAOD branch of that tool and update the validated runs file, global tags, samples, etc., and be in a good starting place to run your own EDAnalyzers. Some of the existing EDAnalyzer content might not work for 2016. We have added this to the wishlist for 2016 analysis support.

Julie