Using Jupyter Notebooks Without Docker?

Hi, all!

Jill Ziegler here again from the University of Notre Dame Quark Net Center. It’s that time of year again when I have questions to ask of the community.

We are trying to figure out whether we can use Jupyter Notebooks for Open Data analysis instead of using Docker containers. This is partially for our convenience (every member of our collaboration except me is running on a Windows computer) and partially to see if we can create an even lower barrier to entry for other high school teachers to do analysis.

Based on Matt Bellis' talk/activity on event selection in last year's Open Data Workshop at CERN, I think we should be able to use the pip command to grab the additional libraries we need for analysis in our Notebook (we know to precede a bash command with !). I am going to try this on my personal Windows machine (it should be non-destructive) to see if it works, but are there any reasons we shouldn't do this that we might not be aware of? If it doesn't work, can we just do the WSL2 install to get the commands to work and not worry about the Docker container? Or are we likely to find that Windows computers are just obstinate and we need to install Docker?
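
For concreteness, what I have in mind is a first cell along these lines (the package names here are just the ones I expect we'll need, so treat them as examples rather than a definitive list):

!pip install uproot awkward vector

import uproot
import awkward as ak
import vector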

I know this is a somewhat vague line of questioning and I can report back once I’ve had a chance to try this out on a stock Windows computer, but I also thought I’d post this up in case it was a terrible idea and in case anyone else has reasons to do something similar for really lightweight investigations.

Thanks in advance for any comments!

Hi Jill,

I’m not sure which infrastructure you’re working with — maybe the CMS notebooks? — but if it’s any help, you’ll find that the ATLAS Open Data notebooks can run directly on mybinder or colab, so that you don’t need to use a container locally or pip install anything. I expect it’s possible to do the same with other notebooks, and that’s a super easy way to get people going.

Also, just a shameless promotion for this thing, which is even lower barrier-to-entry:

Best,
Zach

Hey Jill!

I think you’re right. We launched jupyter from within our docker container in the CMS tutorials, but if you have it installed separately I believe pip will do the trick for other packages.

Here's the list for our python container; I would start by trying to pip install them (except for jupyterlab, obviously).

Hope you have a productive summer,
Julie

Zach,

Apologies, I didn’t specify what we are working with!

We’ve been working with CMS data for years. So yes, we’re working with the CMS notebooks.

While the group as a whole is set on working with CMS data, once I finish my current investigation with CMS data I might want to do a similar investigation for ATLAS (I'm checking for cosmic ray muon traces in the CMS detector). I'll check out the ATLAS notebooks regardless. Thanks for the tip!

~Jill

Julie,

Thanks as always for responding. I set up my Windows computer to match those of the rest of our collaboration (Windows, no WSL2, Anaconda Navigator installation for Python) and tried this out.

We have good news! While we haven't gotten to test out using the various libraries, pip installing them from Jupyter Notebooks does seem to work. Anaconda Navigator helpfully includes Jupyter Notebook, so we didn't even have to install it separately.

The list of requirements is super useful as well. A good list not only makes it easier to ensure we have everything installed, but is also a good reference for making sure we include everything we need at the start of our programs.

Thanks again, and I'll keep you updated on how we're doing!

~Jill

We have good news and bad news from the ND QNC.

Our good news is that the pip installs worked great! Thanks @jmhogan for the list.

On to the bad news. We’re having trouble opening data files. I’m going to give a bit of background before I quote code and error messages, though.

We're trying to use Jupyter Notebooks without Docker to try to make things easier in the long run. We figured it was possible (even though the Workshops consistently use Docker) based partially on something from last year's Workshop at CERN. From the "Exploring nanoAOD: Introduction" lesson:

In contrast to AOD and MiniAOD which is stored in CMSSW C++ objects, NanoAOD is stored using ROOT TTree objects. You therefore do not need to use the CMS Virtual Machine or docker container to analyze NanoAOD data. NanoAOD can be analyzed using the ROOT program and/or python libraries capable of interpreting the ROOT’s TTree structure.

So in our stand-alone Notebook we tried the same structure I'd successfully used in Jupyter Notebooks run inside a Docker container:

import uproot

infile_name = 'root://eospublic.cern.ch//eos/opendata/cms/derived-data/NanoAODRun1/01-Jul-22/Run2010B_Mu_merged.root'
infile = uproot.open(infile_name)
events = infile['Events']

This gave us a rather lengthy error message that started:

ModuleNotFoundError                       Traceback (most recent call last)
File ~\AppData\Local\anaconda3\Lib\site-packages\fsspec\registry.py:249, in get_filesystem_class(protocol)
    248 try:
--> 249     register_implementation(protocol, _import_class(bit["class"]))
    250 except ImportError as e:

File ~\AppData\Local\anaconda3\Lib\site-packages\fsspec\registry.py:284, in _import_class(fqp)
    283 is_s3 = mod == "s3fs"
--> 284 mod = importlib.import_module(mod)
    285 if is_s3 and mod.__version__.split(".") < ["0", "5"]:

File ~\AppData\Local\anaconda3\Lib\importlib\__init__.py:88, in import_module(name, package)
     87     level += 1

and eventually got to

ModuleNotFoundError: No module named 'XRootD'

The above exception was the direct cause of the following exception:

ImportError                               Traceback (most recent call last)
Cell In[8], line 3
      1 infile_name = 'root://eospublic.cern.ch//eos/opendata/cms/derived-data/NanoAODRun1/01-Jul-22/Run2010B_Mu_merged.root'
----> 3 infile = uproot.open(infile_name)
      5 events = infile['Events']
      7 events

and then continued a while from there.

I've probably quoted either far too much or far too little to be useful in solving this problem (I can provide screenshots of the whole thing if that would help), but I get the impression that something is either off with XRootD or that it isn't playing nicely with our system somehow. We did pip install it, as per the list.

Is there some way to fix our code or another way to read files to analyze? I remember seeing several other syntaxes and methods for reading in files, and we don't have any particular preference other than that we ideally want to use awkward to make cuts rather than looping.

Thanks for any help here!

~Jill

Hi Jill,

Ok, I should have dug more deeply into the docker container's command script. When you launch the docker container, xrootd is actually installed and built from its GitHub repository.

I’ll work on seeing if we can figure something out. A simple pip install of xrootd fails for me in Anaconda Navigator.

Now, assuming your folks will not need to interact with lots and lots of files, a workaround is to download files locally, and then open them using uproot. Based on the file in your example, if you click on “list files” for either the individual or merged files in this record you can click “download” and have the file available locally.

Regards,
Julie

Julie,

Thanks for working to see if we can figure out how to implement xrootd without the Docker. We are, in fact, using Jupyter Notebooks through Anaconda Navigator, so if it works for you it should work for us as well.

Considering our big goal currently is to figure out applying cuts, downloading files should work for now. I should be able to choose a file or two for us to use in the short term. The rest of the team has already analyzed some of the derived .csv files locally, so they should be able to walk me through how to open downloaded files, but I'll let you know if we have more issues.

As always, many thanks for your help!

~Jill

Hi Jill,

Here’s how I can open a downloaded file in Anaconda Navigator:
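
Something along these lines works for me (a minimal sketch, assuming the downloaded NanoAOD file sits in the notebook's working directory; adjust the filename and path to wherever your download landed):

import uproot

# the file from earlier in this thread, assumed to be in the notebook's working directory
infile_name = 'Run2010B_Mu_merged.root'
infile = uproot.open(infile_name)
events = infile['Events']
print(events.num_entries)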

Likely very similar for you and your users – if there are any differences in the paths you can hopefully find the ROOT files by checking out the directory trees within jupyter.

One other thought: depending on the physics goals of your group, I might consider using the 2012 (or even 2016?) data rather than 2010. Are you using only muons? Or many types of physics objects?

Regards,
Julie

This is probably just a short-term solution but I often use this pattern for accessing files in jupyter notebooks (e.g.):

import os  # for checking whether the file already exists locally

datafile_name = '0AB09F5D-121F-9443-87C8-3B69FAF1D99E.root'

# download only if the file isn't already present
if not os.path.isfile(datafile_name):
    ! curl -O http://opendata.cern.ch//eos/opendata/cms/Run2016H/MuOnia/NANOAOD/UL2016_MiniAODv2_NanoAODv9-v1/100000/0AB09F5D-121F-9443-87C8-3B69FAF1D99E.root

which uses the URL of the file (essentially replacing root://eospublic.cern.ch/ with http://opendata.cern.ch/).

In the longer term I will look closer at what’s up with xrootd in this context…

Hi Jill et al!

Hope this doesn’t muddy the waters too much but just wanted to chime in with some recent experiences.

Jill, I found I was having issues installing xrootd and fsspec-xrootd in a way that worked on Colab notebooks earlier this year. I used to have success with student projects this way, but at some point it stopped working consistently, so I turned back to local installations, which it sounds like you have been doing.

To engage students with the open data, we leaned into making python virtual environments, similar to some options we offered at the workshops. I've had great luck the past year or more with micromamba, a much faster replacement for conda. It works pretty smoothly on Linux and Mac, but they have instructions for how to install it on Windows with PowerShell. YMMV.

Once the students install micromamba, we use this command to create the virtual environment with the necessary libraries and then some:


micromamba create --name pyhep -c conda-forge root matplotlib xrootd awkward uproot numpy jupyter tensorflow vector coffea python=3.11

This creates an environment called pyhep, which we activate with

micromamba activate pyhep

From there, we can launch a jupyter-notebook with

jupyter-notebook

Again, this all works with Mac and Linux so I’d be curious to hear how it works with Windows.

Our notebooks can then read stuff in with

%load_ext autoreload
%autoreload 2

# The classics
import numpy as np
import matplotlib.pylab as plt
import matplotlib # To get the version

import pandas as pd

# The newcomers
import awkward as ak
import uproot

import vector
vector.register_awkward()

And then

filename = 'root://eospublic.cern.ch//eos/opendata/cms/mc/RunIISummer20UL16NanoAODv9/TTToSemiLeptonic_TuneCP5_13TeV-powheg-pythia8/NANOAODSIM/106X_mcRun2_asymptotic_v17-v1/120000/7D120E49-E712-B74B-9E1C-67F2D0057995.root'

print(f"Opening...{filename}")
f = uproot.open(filename)

events = f['Events']

nevents = events.num_entries

print(f"{nevents = }")

If you get an error about xrootd or fsspec-xrootd, then do (either from the notebook, or from PowerShell without the leading !)

!pip install --upgrade fsspec-xrootd

and restart the notebook.

This all worked pretty well the last few weeks with student research, but again, Linux or Mac.

Hope I haven’t added a bunch of confusion but let me know if any of this is helpful!

Matt

Thanks, @jmhogan, @mccauley, and @mattbellis for your replies.

We're going to try the methods Julie and Tom suggested for our current work. I'll probably try Matt's suggestion of micromamba on my other Windows computer before the end of the summer. I think our group would prefer to have as little interaction with any kind of shell as possible, but I'm happy to try it out and report back if it works with Windows.

By the way, Matt, we're working through your data selection tutorial notebook from last year's CMS data workshop. I'll keep you updated on how it goes.

Thanks again, all!

~Jill

Awesome @jillziegler ! Yes, please do and good luck!

Updates for everyone!

We were able to find and download a data file we wanted. We ended up going with one of the Charmonium datasets and pulling the smallest file. That still gave us close to 300k events to work with and should let us try looking at electrons as well as muons if we wish.

Through a lot of trial and error, reading and rereading various tutorials from the CMS Open Data Workshops, and reading through some of the tutorials for Awkward at awkward-array.org, we finally figured out how to get the four-momenta in Cartesian coordinates and then calculate and plot the invariant mass for opposite-charge dimuon pairs quickly enough that we could play with the range and number of bins in real time. Major achievement for us here! We've learned a lot about how Awkward and Vector work (what a revelation Vector is!).
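
In case it helps anyone following along, the pattern we converged on looks roughly like this. It's a stripped-down sketch rather than our actual notebook: the branch names are the standard NanoAOD muon ones, the file name, selection, and histogram range are just illustrative, and it builds the four-vectors from pt/eta/phi/mass rather than Cartesian components.

import awkward as ak
import uproot
import vector
import matplotlib.pyplot as plt

vector.register_awkward()

# a locally downloaded NanoAOD file (illustrative name)
f = uproot.open('Charmonium_NanoAOD.root')
events = f['Events']

# read the muon kinematics and charges into awkward arrays
mu = events.arrays(['Muon_pt', 'Muon_eta', 'Muon_phi', 'Muon_mass', 'Muon_charge'])

# build Lorentz vectors that the vector library understands
muons = ak.zip(
    {
        'pt': mu['Muon_pt'],
        'eta': mu['Muon_eta'],
        'phi': mu['Muon_phi'],
        'mass': mu['Muon_mass'],
    },
    with_name='Momentum4D',
)
charges = mu['Muon_charge']

# keep events with exactly two muons of opposite charge (cuts, not loops)
two_mu_mask = ak.num(muons) == 2
muons2 = muons[two_mu_mask]
charges2 = charges[two_mu_mask]
opposite = ak.sum(charges2, axis=1) == 0

# invariant mass of the dimuon system
dimuon_mass = (muons2[opposite][:, 0] + muons2[opposite][:, 1]).mass

plt.hist(ak.to_numpy(dimuon_mass), bins=100, range=(0, 10))
plt.xlabel('dimuon invariant mass [GeV]')
plt.show()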

I haven’t gotten to try out micromamba yet, but that’s still on my list. I’ll let you know how that goes.

Thanks again for all the help! If you want images of our histogram with the invariant mass, let me know. I know it isn’t much in terms of particle physics, but it’s a big achievement for us at Quark Net.

~Jill

That’s excellent!

We would definitely love to see your work and hear more about the experience. I’ll send you a message off the forum and hopefully we can arrange a chat :smiley:

Regards,
Julie