Large data set analysis using only Python

jillziegler · 22 July 2024 15:38

I have another question from the Open Data analysis team here at the University of Notre Dame’s QuarkNet Center.

We’re currently working on an analysis in Python of one of the NanoAOD-like versions of the 2010 data. We’re hoping to keep the analysis entirely in Python, but are open to using other tools if it’s the quickest way.

We’re selecting events with exactly two muons, requiring at least one Global muon and that the two muons have opposite charge. Because the Muon_isGlobal and the Muon_isTracker variables are booleans, we had to make those cuts in loops. That’s not the primary issue we have, though I’m open to finding another way to make those cuts (I’m currently exploring the “Getting Started” guide for uproot and saw some things that look promising).

Our primary problem is that we’ve been putting the collected data that passes our requirements into arrays. So far, we’ve needed to make those arrays small compared to the data set to work with our computer memory capacities.

We saw a tutorial from the HEP Software Foundation that uses a fill command from Hist, but installing Hist in the Python container updated NumPy to a version that (I think, from the error messages) no longer works with uproot. We were hoping for some other command that works like the fill command in Hist or in ROOT, so that the loop fills values into a histogram (or a group of histograms, one for each variable) rather than an array, in the hope that it would fit better in our memory.
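For readers with the same question: fill-style behavior can be approximated with plain NumPy by keeping one array of bin counts and adding to it chunk by chunk, so that only the counts, not the selected values, stay in memory. The sketch below is an illustration under assumptions and does not come from the tutorial. The binning is arbitrary, and fake_chunks is a hypothetical stand-in that generates made-up numbers in place of reading the file in pieces. With uproot, the chunks could presumably come from iterating over the tree with a fixed step size, but that is not shown here.

```python
import numpy as np

# Hypothetical binning: 50 equal-width bins from 0 to 100.
bin_edges = np.linspace(0.0, 100.0, 51)
counts = np.zeros(len(bin_edges) - 1, dtype=np.int64)


def fake_chunks(n_chunks=4, chunk_size=1000, seed=0):
    """Stand-in for reading the data piece by piece (made-up values)."""
    rng = np.random.default_rng(seed)
    for _ in range(n_chunks):
        yield rng.uniform(0.0, 100.0, size=chunk_size)


for values in fake_chunks():
    # np.histogram returns (counts, edges); only the counts are kept.
    chunk_counts, _ = np.histogram(values, bins=bin_edges)
    counts += chunk_counts  # running total, one chunk at a time

total = int(counts.sum())
```

Memory use here is set by the chunk size and the number of bins rather than by the size of the data set. One such counts array would be needed per variable.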

Again, we hope to work just in Python. We’re grateful for any assistance anyone can give. Thanks in advance!

~Jill Ziegler

jmhogan · 22 July 2024 16:48

Hi Jill,

How about the histogram technique from this example?
https://cms-opendata-workshop.github.io/workshop2024-lesson-docker/03-docker-for-cms-opendata.html#challenge-2

I’m not sure I understand why booleans require a loop, but that might be better addressed in a Zoom session, or seeing your actual code.
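As a small illustration of that point, boolean arrays can be combined element by element with & and |, with no loop. This is only a sketch using made-up NumPy arrays; the names G1, G2, and pt1 are placeholders for the first-muon and second-muon values in two-muon events, not values read from the file.

```python
import numpy as np

# Made-up flags for five two-muon events (first and second muon).
G1 = np.array([True, True, False, False, True])
G2 = np.array([True, False, True, False, False])

# & and | act element by element on boolean arrays.
both_global = G1 & G2
at_least_one_global = G1 | G2

# A boolean array can be used as an index to keep the passing events.
pt1 = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
selected_pt1 = pt1[at_least_one_global]
```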

Julie

jillziegler · 22 July 2024 23:20

Hi, Julie. Thanks for taking the time to respond.

We did try the technique you linked. It worked fine in the past (I’m getting an error now that tells me I probably need to restart my computer), but doesn’t do what we want. That seems to graph all of the data of a type. We wanted, instead, to:

graph just a portion of the data (cutting on exactly two muons, one of which is Global, where the two muons have opposite charge) and
graph data derived from the data in the TTree (px, py, and pz just as an example).

Our current code looks like this:

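import numpy as np  # used below for np.cos
import uproot

# Open the remote file and get the Events tree
infile_name = 'root://eospublic.cern.ch//eos/opendata/cms/derived-data/NanoAODRun1/01-Jul-22/Run2010B_Mu_merged.root'
infile = uproot.open(infile_name)
events = infile['Events']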

to read the file in. Then we use

muons = events.arrays(entry_stop=100000)

cut = muons["nMuon"] == 2

G1 = muons["Muon_isGlobal", cut, 0]
G2 = muons["Muon_isGlobal", cut, 1]
T1 = muons["Muon_isTracker", cut, 0]
T2 = muons["Muon_isTracker", cut, 1] 
mass1 = muons["Muon_mass", cut, 0]
mass2 = muons["Muon_mass", cut, 1]

etc. to read in all the variables we want. Then we define empty arrays like this:

m1 = []

and loop to get just the kinds of muons we want:

for index in range(0,len(G1)):
    if G1[index]&G2[index]:
        if Q1[index]*Q2[index] == -1:
            m1.append(mass1[index])
            m2.append(mass2[index])
            px1.append(pt1[index]*np.cos(phi1[index]))
    elif G1[index]&T2[index]:
        if Q1[index]*Q2[index] == -1:
            m1.append(mass1[index])
            m2.append(mass2[index])
            px1.append(pt1[index]*np.cos(phi1[index]))
    elif G2[index]&T1[index]:
        if Q1[index]*Q2[index] == -1:
            m1.append(mass1[index])
            m2.append(mass2[index])
            px1.append(pt1[index]*np.cos(phi1[index]))

(I’ve cut most of our calculations, but you should get the idea.)

Thoughts? I tried looking in the uproot “Getting Started” guide (https://uproot.readthedocs.io/en/latest/basic.html), specifically in the “Computing expressions and cuts” section. It looks like something there should help us, but I can’t quite get my head around some of the examples they use and how to structure our logic. Part of the problem seems to be that we’re interested in differentiating the muons from each other in our data, but that could be an illusion from my inexperience.

Sorry for the long message. It’s hard to describe otherwise, and difficult to look up how to do in Python. I feel like this would be a lot easier in ROOT, but we were trying to work up to ROOT through Python first. (This might be a mistake . . .)

Thanks for any help you can give and no need to worry if you can’t assist or it would be too much of a burden! We appreciate all the help we’ve already gotten here!

~Jill

jmhogan · 23 July 2024 00:33

Hi,

You’re right, the exact exercise just opens an array and plots it. But given any array you’re happy with, you should be able to do (following the imports as in the exercise):

plt.hist(myFavoriteArray, …options…)
plt.show()
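One way to build such an array without the loop is to turn the if/elif chain into a single boolean mask. The following is a minimal sketch, assuming the per-muon quantities are flat arrays with one entry per two-muon event, as in the code above. The function name select_px1 and the sample values are made up for illustration, and plain NumPy arrays stand in for the arrays read with uproot, which should accept the same operators.

```python
import numpy as np


def select_px1(G1, G2, T1, T2, Q1, Q2, pt1, phi1):
    """Return px of the first muon for events passing the selection."""
    # Same three cases as the if/elif chain in the loop.
    quality = (G1 & G2) | (G1 & T2) | (G2 & T1)
    # Charges are +1 or -1, so a product of -1 means opposite sign.
    opposite_sign = (Q1 * Q2) == -1
    keep = quality & opposite_sign
    return (pt1 * np.cos(phi1))[keep]


# Made-up values for four two-muon events.
G1 = np.array([True, True, False, False])
G2 = np.array([True, False, True, False])
T1 = np.array([True, False, True, True])
T2 = np.array([True, True, False, True])
Q1 = np.array([1, 1, -1, 1])
Q2 = np.array([-1, 1, 1, -1])
pt1 = np.array([10.0, 20.0, 30.0, 40.0])
phi1 = np.array([0.0, 0.0, np.pi, 0.0])

px1_selected = select_px1(G1, G2, T1, T2, Q1, Q2, pt1, phi1)
```

The result can then be passed to plt.hist in place of myFavoriteArray. In this sample, only the first and third events pass: the second has same-sign muons and the fourth has no Global muon.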

Julie

jillziegler · 23 July 2024 02:46

Julie,

Thanks as always for reading through my questions and answering.

I’ll have to look through the options available for plt.hist to see what we can make work. I have done versions of this strategy before; I just need to check out all of the tutorials to see what options I can use!

Whatever we come up with here, I’ll try to remember to post to this thread, in case anyone else wants to do something similar or I forget by next summer.

Again, huge thanks!

~Jill