%% Cell type:markdown id: tags:
---
# Exercises for Particle Physics I
## Exercise 01 - Jupyter Tutorial:
## *pandas* basics for physicists
---
G. Quast, A. Monsch, October 2021
%% Cell type:markdown id: tags:
---
## Jupyter Notebook Fundamentals
---
This file of type `.ipynb` contains a tutorial as a `Jupyter notebook`.
`Jupyter` provides a browser interface with a (simple) development environment for *Python* code
and explanatory texts in intuitive *Markdown* format.
The input of formulas in *LaTeX* format is also supported.
A summary of the most important commands for using *Jupyter* as a working environment can be
found in the notebook
[*JupyterCheatsheet.ipynb*](https://git.scc.kit.edu/yh5078/datenanalyse/-/blob/master/jupyter-examles/JupyterCheatsheet.ipynb)
(German).
Basics for statistical data analysis can be found in the notebooks
[*IntroStatistik.ipynb*](https://git.scc.kit.edu/yh5078/datenanalyse/-/blob/master/jupyter-examles/IntroStatistik.ipynb)
(German) and
[*Fehlerrechnung.ipynb*](https://git.scc.kit.edu/yh5078/datenanalyse/-/blob/master/jupyter-examles/Fehlerrechnung.ipynb) (German).
In *Jupyter*, code and text are entered in individual cells.
Active cells are indicated by a blue bar in the margin.
They can be in two states: in edit mode the input field is white, in command mode it is grayed out.
Clicking in the border area selects the command mode, clicking in the text field of a code cell
switches to edit mode.
The `esc` key can also be used to leave the edit mode.
Pressing `a` in command mode creates a new empty cell above the active cell, `b` creates one below.
Entering `dd` deletes the corresponding cell.
Cells can be either of the type `Markdown` or `Code`.
Entering `m` in command mode sets the type Markdown, entering `y` selects the type Code.
The cell content is processed - i.e. setting text or executing code - by entering `shift+return`,
or `alt+return` if a new, empty cell should also be created.
The settings mentioned here as well as the insertion, deletion or execution of cells can also be
executed via the pull-down menu at the top.
---
%% Cell type:markdown id: tags:
---
## Introduction: *pandas* basics for physicists
---
The [*pandas*](https://pandas.pydata.org/docs/getting_started/overview.html) package
is widely used in scientific data analysis and is well supported by a large community,
aiming at "becoming the most powerful and flexible open source data analysis/manipulation
tool" (quoted from the pandas documentation).
The flexibility of *pandas* in data management and analysis by far exceeds what is possible
with e.g. *numpy* arrays alone. It shows its strength when large data sets with complex
data structures need to be handled. Operations on data in *pandas* are vectorized,
similar to operations on arrays in *numpy*, and therefore efficient data processing
is possible if the vectorization capabilities are used. In some cases, this requires
some re-thinking compared to classical approaches like loops
over data elements. Vectorization techniques optimally utilize the capabilities of
modern CPUs, and make use of parallel execution, which drastically reduces computational time on
large data sets.
Another powerful feature of *pandas* is its advanced indexing capability,
called "multi-index" or "hierarchical index". This can be applied at both the
index and the column level of a *pandas* DataFrame, turning it into a
multi-dimensional data structure.
This tutorial emphasizes use-cases in data analysis in physics, e.g. of measurement
data from laboratory courses or for the computer exercises in particle physics.
A large number of
[tutorials](https://pandas.pydata.org/pandas-docs/stable/getting_started/tutorials.html)
and the very comprehensive [*pandas* documentation](https://pandas.pydata.org/docs/) are
available for a more detailed look into the special features and the overwhelmingly large
functionality of *pandas* – way more than what can be given in this brief and fundamental introduction.
For the typical use cases in physics analysis, however, the following basic introduction
is sufficient as a starting point.
---
%% Cell type:markdown id: tags:
To get started, import the relevant *python* packages by executing the following code cell:
%% Cell type:code id: tags:
``` python
# execute this cell for necessary imports
# - pandas itself
import pandas as pd
# - efficient (vectorized) computations
import numpy as np
# - for nice plotting
import matplotlib.pyplot as plt
# - nice output format
from IPython.display import display, HTML
```
%% Cell type:markdown id: tags:
---
## Revisiting *numpy* arrays
---
Arrays constitute the central data structure in *numpy* and have much in common with *pandas* data
structures. As you are probably already familiar with *numpy* *arrays*, it is worthwhile to revisit
some of the advanced features.
Contrary to the data type *list* in *python*, a numpy array must contain data of only one type.
The array is an indexed structure giving access to individual elements. In computer memory,
it consists of a sequential arrangement of objects each occupying a fixed number of bytes.
The *shape* of the array determines the indexing scheme, i.e. the dimension of the array.
%% Cell type:markdown id: tags:
### *numpy* arrays in memory
---
Consider the following example where we construct a *numpy* array from a list of numbers:
```
a = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])
print(a)
```
Using the *reshape()* method, different ways of indexing the same data can be
defined:
```
b = a.reshape( (2, 6) )
print(b)
c = a.reshape( (2, 2, 3) )
print(c)
```
The same data in memory is now viewed as a one-, two- or three-dimensional object.
Note that during such reshaping operations, no copying of the data occurs,
i.e. the arrangement of data in memory remains the same.
To prove this, we can just print the memory location of the last element:
```
print('location of last element in a: ', hex(id(a[11])))
print('location of last element in b: ', hex(id(b[ 1, 5])))
print('location of last element in c: ', hex(id(c[ 1, 1, 2])))
```
%% Cell type:code id: tags:
``` python
a = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])
print(a)
b = a.reshape( (2, 6) )
print(b)
c = a.reshape( (2, 2, 3) )
print(c)
print('location of last element in a: ', hex(id(a[11])))
print('location of first element in a: ', hex(id(a[0])))
print('location of last element in b: ', hex(id(b[ 1, 5])))
print('location of last element in c: ', hex(id(c[ 1, 1, 2])))
```
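%% Cell type:markdown id: tags:
Note that `id()` of an indexed element is a somewhat fragile check: indexing a *numpy*
array returns a temporary scalar object, so equal ids may merely reflect reuse of the
temporary's memory. A more direct test, as a small sketch, uses `np.shares_memory()`:
```
print(np.shares_memory(a, b))  # -> True, b is a view on the data of a
print(np.shares_memory(a, c))  # -> True, and so is c
```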
%% Cell type:markdown id: tags:
Copying of the data part of an array is also avoided when an array is assigned
to a new name or when "slices" of the data are assigned to a new variable:
```
copy_of_a = a
slice_from_a = a[5:]
print('location of last element in slice: ', hex(id(slice_from_a[6])))
print('location of last element in copy: ', hex(id(copy_of_a[11])))
```
%% Cell type:code id: tags:
``` python
slice_from_a = a[5:]
copy_of_a = a
print('location of last element in a: ', hex(id(a[11])))
print('location of last element in copy: ', hex(id(copy_of_a[11])))
print('location of last element in slice: ', hex(id(slice_from_a[6])))
```
%% Cell type:markdown id: tags:
If a real copy is needed, use `np.copy()` (or the array's `.copy()` method) instead!
Avoiding unnecessary copying of the data parts of arrays is a very important
ingredient when handling large data sets, where the available computer memory may
otherwise become a severe limitation.
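A small sketch of a real, independent copy:
```
real_copy = np.copy(a)    # or a.copy()
real_copy[0] = 999        # modifying the copy ...
print(a[0])               # ... leaves the original unchanged
print(np.shares_memory(a, real_copy))  # -> False
```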
%% Cell type:markdown id: tags:
### Properties of *numpy* arrays
---
Internally, a *numpy* array contains a number of attributes: the *shape* as discussed above;
the *data* buffer, i.e. the contiguous memory where the data are stored;
the data type, *dtype*, of the stored objects; the *size*, i.e. the number of stored
objects; the *itemsize*, i.e. the length of one array element in bytes; and some more.
The following code shows how to access these attributes:
```
print('shapes of a, b, c, slice, copy: ',
      a.shape, b.shape, c.shape, slice_from_a.shape, copy_of_a.shape)
print('buffer of a: ', a.data)
print('data type of a: ', a.dtype)
print('sizes of a, b, c, slice, copy: ',
      a.size, b.size, c.size, slice_from_a.size, copy_of_a.size)
print('size of one array element : ', a.itemsize)
```
As you see, the integers stored in the array *a* take up 8 bytes each.
A different data type can be specified when constructing the array, e.g.
```
a16 = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12], dtype=np.int16)
print('data type of a16: ', a16.dtype)
print('size of a16: ', a16.size)
print('size of one array element: ', a16.itemsize)
```
%% Cell type:code id: tags:
``` python
print('shapes of a, b, c, slice, copy: ',
      a.shape, b.shape, c.shape, slice_from_a.shape, copy_of_a.shape)
print('buffer of a: ', a.data)
print('data type of a: ', a.dtype)
print('sizes of a, b, c, slice, copy: ',
      a.size, b.size, c.size, slice_from_a.size, copy_of_a.size)
print('size of one array element :', a.itemsize)
# constructing an array
a16 = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12], dtype=np.int16)
print('data type of a16: ', a16.dtype)
print('size of a16: ', a16.size)
print('size of one array element: ', a16.itemsize)
```
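%% Cell type:markdown id: tags:
The total memory occupied by the data can be checked via the `.nbytes` attribute,
which is simply *size* times *itemsize*; a short sketch:
```
print('memory used by a : ', a.nbytes, 'bytes')    # 12 elements * 8 bytes
print('memory used by a16: ', a16.nbytes, 'bytes') # 12 elements * 2 bytes
```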
%% Cell type:markdown id: tags:
### Advanced indexing with *numpy* arrays
---
#### Slicing
Selecting sub-ranges of data in *numpy* *arrays* via "slicing" is a very efficient concept to extract
interesting data. The general syntax uses *start:stop:step* as the index. This selects the elements from
*start* to *stop-1*, skipping *step-1* elements in between. If *step* is not provided, a default value of one is assumed.
If *start* or *stop* is missing, the first or last element of the array, respectively, is implicitly selected.
Try the following example:
```
a = np.array(range(10))
a_selected = a[2:7:2]
print(a_selected)
```
%% Cell type:code id: tags:
``` python
a = np.array(range(10))
a_selected = a[2:7:2]
print(a_selected)
```
%% Cell type:markdown id: tags:
#### Selecting elements via a boolean mask
Elements of an array can be selected by giving a boolean array of the same shape as the array as an index.
Only elements with a True value of the mask are selected. This is a very efficient and elegant method
to apply selection criteria to the elements of an array.
Try the following example:
```
a = np.random.rand(10)
# select only elements < 0.5
mask = a<0.5
print(a[mask])
```
A more compact way of writing this would be:
```
print(a[a<0.5])
```
%% Cell type:code id: tags:
``` python
a = np.random.rand(10)
# select only elements < 0.5
mask = a<0.5
print(a[mask])
print(a[a<0.5])
```
%% Cell type:markdown id: tags:
---
### Vectorized calculations in *numpy*
---
The key to understanding why calculations in *numpy* are so fast is the concept of vectorization.
*numpy* provides its own set of mathematical functions which operate on a whole numpy array, i.e.
a "vector" of data, within one *Python* instruction. Modern CPUs also support this type of
vector calculation by executing machine instructions on all data in a "vector register".
It is therefore highly recommended to use only such vectorized operations when dealing with
large data sets. To achieve this, data often needs to be organized appropriately such that
the elements, on which the same operation is to be performed, are stored sequentially in memory.
A very simple example illustrates this:
```
# create two large arrays
N=100000
a = np.random.rand(N)
b = np.random.rand(N)
```
Taking the sum of the two arrays is now done in a classical approach
with a loop:
```
%%time
# classical approach using explicit coding
c = []
for i in range(N):
    c.append(a[i] + b[i])
```
And directly with *numpy* objects:
```
%%time
c=a+b
```
Note that the magic jupyter command `%%time` must be given as the first line in a new cell.
%% Cell type:code id: tags:
``` python
N=100000
a = np.random.rand(N)
b = np.random.rand(N)
```
%% Cell type:code id: tags:
``` python
%%time
c = []
for i in range(N):
    c.append(a[i] + b[i])
```
%% Cell type:code id: tags:
``` python
%%time
c=a+b
```
%% Cell type:markdown id: tags:
As you can see, the calculation with *numpy* is one order of magnitude faster than a *Python* loop in this
simple case!
Exploiting the vectorization capability of *numpy* can really make a huge difference when
complex computations need to be performed on large data sets. Often, the vectorized approach
needs some extra thinking when writing code, but it does pay off, as you just saw.
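A typical example of such re-thinking is replacing an element-wise `if`/`else` in a loop
by the vectorized `np.where()`; a minimal sketch:
```
# loop version: set entries below 0.5 to zero
c = [x if x > 0.5 else 0. for x in a]
# vectorized version: one numpy call on the whole array
c = np.where(a > 0.5, a, 0.)
```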
---
%% Cell type:markdown id: tags:
---
## *pandas*, a Python Data Analysis Library
---
*Pandas* uses `Series` and `DataFrame` as its main data structures.
A Series is a one-dimensional, labeled array capable of holding any
*python* or *numpy* object. A Series has much in common with a *numpy*
*ndarray* and supports the same indexing scheme, slicing and boolean indexing as
well as efficient vectorized operations.
A DataFrame is a two-dimensional labeled data structure with columns
of potentially different types. In typical cases, the column index
is a text label, and the row label is an index, i.e. an integer number.
More than words, an example can illustrate what a *pandas* *DataFrame* is.
Execute the code cell below to get a graphical representation of a *DataFrame*
initialized from a *python* *dictionary*.
%% Cell type:code id: tags:
``` python
l = 8
example_DataFrame = pd.DataFrame( {'n'    : [n for n in range(l)],
                                   'n**2' : [n*n for n in range(l)],
                                   'Sum_n': [n*(n+1)/2 for n in range(l)],
                                   'n**n' : [n**n for n in range(l)]} )
display(example_DataFrame)
```
%% Cell type:markdown id: tags:
## 1. Constructing *pandas* *DataFrames*
---
To explore *pandas* the first step is the construction of a *pandas* data structure.
Most useful are the *pandas* *DataFrames*, which can be extended to store
multi-dimensional data structures.
We start with a simple example, the instantiation of a *pandas* *DataFrame*
from a series of measurements of x-y data and an estimate of the uncertainties.
The following code cell produces some toy data:
%% Cell type:code id: tags:
``` python
# generate some data
from PhyPraKit import generateXYdata
def model(x, a=1, b=0.5):
    return a * x + b

def makeToydata():
    # xm = measured x,   ym = measured y
    # Dx = x uncertainty, Dy = y uncertainty
    xm = np.linspace(0., 1., 11)
    Dx = 0.035 * np.ones(len(xm))
    Dy = 0.055 * np.ones(len(xm))
    # generate toy data: true values xt, yt and smeared measurements ym
    xt, yt, ym = generateXYdata(xm, model, Dx, Dy)
    return xm, Dx, ym, Dy

x, Dx, y, Dy = makeToydata()
```
%% Cell type:markdown id: tags:
The most direct way to create a *DataFrame* from the data is via a dictionary,
which also provides names for the columns of the two-dimensional table.
The DataFrame constructor takes a *python* dictionary as an argument,
as is shown in the following code segment, which also provides a nice print-out:
```
#create DataFrame from dict
df_dict = pd.DataFrame( {'x':x, 'Dx':Dx, 'y':y, 'Dy':Dy})
# print nice title, html-formatted
display(HTML('<h3>DataFrame from python dictionary</h3> '))
# print DataFrame ...
display(df_dict)
# ... and its shape
print('shape:', df_dict.shape)
```
%% Cell type:code id: tags:
``` python
#create DataFrame from dict
df_dict = pd.DataFrame( {'x':x, 'Dx':Dx, 'y':y, 'Dy':Dy})
# print nice title, html-formatted
display(HTML('<h3>DataFrame from python dictionary</h3> '))
# print DataFrame ...
display(df_dict)
# ... and its shape
print('shape:', df_dict.shape)
```
%% Cell type:markdown id: tags:
It is more convenient in many cases to initialize the *DataFrame* directly from a *numpy* array.
Note that *pandas* in this case expects an array of row vectors as input. Therefore it is necessary
to transpose a 2-d *numpy* array created from the x-, Dx-, y- and Dy- arrays,
as is shown in the following code snippet:
```
# create data frame from numpy-arrays; NOTE: pandas wants array of row-vectors
data = np.asarray([x, Dx, y, Dy])
df_numpy = pd.DataFrame( data.T,
                         columns = ['x', 'Dx', 'y', 'Dy'] )
display(HTML('<h3>DataFrame from numpy</h3>'))
display(df_numpy)
```
%% Cell type:code id: tags:
``` python
# create data frame from numpy-arrays; NOTE: pandas wants array of row-vectors
data = np.asarray([x, Dx, y, Dy])
df_numpy = pd.DataFrame( data.T,
                         columns = ['x', 'Dx', 'y', 'Dy'] )
display(HTML('<h3>DataFrame from numpy</h3>'))
display(df_numpy.head(3))
```
%% Cell type:markdown id: tags:
*multi-index* or *hierarchical indexing* can be very helpful in dealing with complex data structures.
Imagine cases where you want to manage several series of measurements, e.g. to compare or combine them.
Another use case are data recorded in particle collisions in high-energy physics experiments containing
many particles with always the same properties.
As a very simple case, we start by creating a two-level column index with name tag "Measurement 1",
under which we store the x-, Dx-, y- and Dy- values. The column index is generated by the *pandas* method
*MultiIndex.from_arrays*, which initializes a proper structure from two arrays with names, one for the
top-level index (level-0) and one for the sub-indices.
```
# create a multi-index DataFrame
df_multi = pd.DataFrame(
    data = data.T,
    columns = pd.MultiIndex.from_arrays([4*['Measurement 1'], ['x', 'Dx', 'y', 'Dy']])
)
display(HTML('<h3>Multi-Index DataFrame</h3>'))
display(df_multi.head(11))
```
%% Cell type:code id: tags:
``` python
# create a multi-index DataFrame
df_multi = pd.DataFrame(
    data = data.T,
    columns = pd.MultiIndex.from_arrays([4*['Measurement 1'], ['x', 'Dx', 'y', 'Dy']])
)
display(HTML('<h3>Multi-Index DataFrame</h3>'))
display(df_multi.head(3))
```
%% Cell type:markdown id: tags:
To appreciate the power of hierarchical indexing, consider
a second set of measurements, which we now include together with
the first set in a new *DataFrame*:
```
# create second set of toy data
data2 = np.asarray(makeToydata())
# create multi-index DataFrame for 2 measurement series
df = pd.DataFrame(
    data = np.concatenate( (data, data2) ).T,
    columns = pd.MultiIndex.from_arrays( [4*['Measurement 1'] + 4*['Measurement 2'],
                                          2*['x', 'Dx', 'y', 'Dy']] )
)
display(HTML('<h3> Multi-Index DataFrame with two level-0 column labels</h3>'))
display(df)
```
%% Cell type:code id: tags:
``` python
# create second set of toy data
data2 = np.asarray(makeToydata())
# create multi-index DataFrame for 2 measurement series
df = pd.DataFrame(
    data = np.concatenate( (data, data2) ).T,
    columns = pd.MultiIndex.from_arrays( [4*['Measurement 1'] + 4*['Measurement 2'],
                                          2*['x', 'Dx', 'y', 'Dy']] )
)
display(HTML('<h3> Multi-Index DataFrame with two level-0 column labels</h3>'))
display(df)
```
%% Cell type:markdown id: tags:
The same dataframe could have been obtained by adding the new columns to
the *DataFrame* *df_multi* that has been created above. The convenient
syntax for this operation is shown in the example here:
```
# same effect by adding individual columns to df_multi
df_multi['Measurement 2','x'] = data2[0]
df_multi['Measurement 2', 'Dx'] = data2[1]
df_multi['Measurement 2', 'y'] = data2[2]
df_multi['Measurement 2', 'Dy'] = data2[3]
display(df_multi)
```
%% Cell type:code id: tags:
``` python
# --> enter code to try it
```
%% Cell type:markdown id: tags:
Handling *missing data* is one of the powerful features of *pandas* that permits
variable data sizes in *DataFrames*. Instead of adding a *numpy* array, the data
should first be transformed to a *pandas.Series*, and then added to the *DataFrame*.
Now, the second set of measurements could have fewer or even more data points than the
already existing one. Here is the code example:
```
# add columns with missing data
df_multi['Measurement 2','x'] = pd.Series(data2[0][:7])
df_multi['Measurement 2', 'Dx'] = pd.Series(data2[1][:7])
df_multi['Measurement 2', 'y'] = pd.Series(data2[2][:7])
df_multi['Measurement 2', 'Dy'] = pd.Series(data2[3][:7])
display(df_multi)
```
The missing elements are filled with *NaN* (Not a Number) and will
be neglected in most *pandas* operations on the data.
Try it out in the following code cell:
%% Cell type:code id: tags:
``` python
# --> enter code here
```
%% Cell type:markdown id: tags:
In addition to the direct methods described so far, the *pandas* package contains
numerous other methods to *read data from files* in various formats, among them
```
read_table() # very general and flexible method to read tables
read_csv() # load data from a CSV (=comma separated values) file
read_json() # load data from JSON file
read_excel() # load data from an MS Excel file
read_html() # read HTML tables to pandas DataFrames
```
As you may expect, the corresponding *write*-methods also exist, allowing you
to export *pandas* *DataFrames* in a variety of file formats. As these human-readable
data formats consume much storage space, *pandas* also accepts data in compressed form;
this feature is enabled automatically if the file extension *.gz* is encountered.
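As an illustration, a minimal round trip through a compressed CSV file (file name
chosen for illustration) could look like this:
```
df_dict.to_csv('measurements.csv.gz', index=False)  # written gzip-compressed
df_read = pd.read_csv('measurements.csv.gz')        # decompressed automatically
```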
%% Cell type:markdown id: tags:
## 2. Accessing data in a *pandas* *DataFrame*
---
%% Cell type:markdown id: tags:
There is a wealth of methods to access data in a *DataFrame* *df*.
Considered as an array, very low-level positional addressing via the *.iloc* method
is possible, e.g.
```
df.iloc[1,2]
```
Slicing like in *numpy* is supported, e.g.
```
df.iloc[:5,2]
```
returns the rows 0 to 4 of the third column.
The *.loc* method provides access by label, e.g. to access the y-value of
Measurement 1 in row three, use:
```
df.loc[3, ('Measurement 1', 'y')]
```
To access rows 1-3, use
```
df.loc[1:3]
```
<div class="alert alert-block alert-warning"><b>Warning:</b>
this is different from indexing in python and the <code>df.iloc</code> method,
what is meant here (<code>df.loc</code> method) is the row label!
</div>
There are two convenient ways to address a column in a dataframe `df` with name 'tag'
and use it in a calculation or assign it to a variable:
```
a = df['tag']
```
or
```
a = df.tag
```
For cases with multi-indexing, i.e. sub-tags, use
```
a = df['tag']['subtag']
```
or
```
a = df['tag', 'subtag']
```
or
```
a = df.tag.subtag
```
For our example here, try
```
type(df['Measurement 1']['y'])
```
As you see, this returns a *pandas* series; to access the data elements, use the
*.values* property:
```
df['Measurement 1']['y'].values
```
Equivalently and more future-proof, use the following code to export values
as a *numpy* array:
```
df['Measurement 1']['y'].to_numpy()
```
The original *numpy* arrays that were used as an input to the *DataFrame*
can be recovered in this way:
```
df['Measurement 1'].to_numpy()
```
This line returns an array of rows in the data frame; for efficient, vectorized
processing of individual columns, e.g. a correlation plot of x vs. y,
transpose the array using the *numpy* *.T* attribute.
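For instance, an error-bar plot of Measurement 1 could be sketched like this:
```
x1, Dx1, y1, Dy1 = df['Measurement 1'].to_numpy().T
plt.errorbar(x1, y1, xerr=Dx1, yerr=Dy1, fmt='o')
plt.show()
```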
As we had already discussed above, adding a new variable, e.g. calculated from
the existing columns with some vectorized *numpy* function (`np.func` as a placeholder), works as follows:
```
df['newtag'] = np.func(df.tag)
```
*pandas*, like *numpy*, supports boolean or masked indexing; to replace, e.g.,
all negative numbers in a column by zeros, one can use:
```
df.loc[df.tag < 0, 'tag'] = 0
```
Of course, any other number or function can be inserted instead of zero; to turn
the column entries into positive numbers, one could use
```
df.loc[df.tag < 0, 'tag'] *= -1
```
<div class="alert alert-block alert-info"><b>Note:</b>
The multiplication is an expensive operation, in this case it is more
efficient to use the numpy function <code>np.abs()</code>.
</div>
With boolean indexing, it is easy to implement filters for the selection of
rows where the column entries fulfill given conditions.
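As a small sketch on the simple `df_dict` from above:
```
# keep only rows with y > 0 and a relative uncertainty below 10 %
selected = df_dict[(df_dict.y > 0) & (df_dict.Dy / df_dict.y < 0.1)]
```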
%% Cell type:code id: tags:
``` python
# --> enter code to try the access methods from above
```
%% Cell type:markdown id: tags:
#### Some words on calculation efficiency
When data sets become large, execution speed matters, and there are huge differences
in some cases depending on whether data is sorted along rows or columns in memory.
In data analysis, most operations work on columns, and therefore
ordering in memory along columns, as used by *pandas*, is usually optimal. Care
must be taken when exporting data to *numpy* arrays, as we saw above. Exporting
to one-dimensional *numpy* arrays, however, is safe and usually the fastest
option when further processing the data with modules and methods from the
*numpy* library.
If in doubt, the *jupyter* command `%%time` or `%%timeit` should be used to find the optimal
solution for a given problem.
`%%time` executes the code once and returns the time it took to execute the cell;
`%%timeit` executes the cell several times (choosing the number of runs automatically) and reports the mean time.
As the system might be doing other tasks in the background, the results of `%%time` can vary, but
they are still a good indicator if the computation times differ by an order of 10 or more.
`%%timeit` will give a more reliable result when comparing functions of similar speed.
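As an illustration (using the `df_numpy` *DataFrame* created above), the line magic
`%timeit` can compare row-wise and column-wise access:
```
# row-wise iteration (slow)
%timeit sum(row['y'] for _, row in df_numpy.iterrows())
# vectorized column access (fast)
%timeit df_numpy['y'].to_numpy().sum()
```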
%% Cell type:markdown id: tags:
#### Plotting data
Data visualization is one of the first and most important aspects of any data analysis. Therefore, *Pandas* offers
simple wrapper methods around the methods of *matplotlib.pyplot*.
`.plot()` displays the data in a *Series*, `.hist()` provides a frequency distribution.
Try the following examples:
```
df['Measurement 1']['y'].plot(linestyle='', marker='x')
plt.suptitle('Series y', size='xx-large')
plt.xlabel('index')
plt.ylabel('y')
plt.show()
df['Measurement 1']['y'].plot.bar()
plt.suptitle('Series y as bar plot', size='xx-large')
plt.xlabel('index')
plt.ylabel('y')
plt.show()
df['Measurement 1']['y'].plot.hist(edgecolor='yellow')
plt.suptitle('Histogram of y', size='xx-large')
plt.xlabel('y')
plt.ylabel('frequency')
plt.show()
df['Measurement 1'].plot.scatter(x='x', y='y')
plt.suptitle('Scatter Plot of x and y', size='xx-large')
plt.xlabel('x')
plt.ylabel('y')
plt.show()
```
More features are documented in the pandas [user guide on plotting](https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html). With basic knowledge of *matplotlib*, plotting with
*pandas* should not pose any problem.
Extracting *numpy* arrays from *DataFrames* and using *matplotlib* directly is also always possible.
%% Cell type:code id: tags:
``` python
# --> try to plot the data
```
%% Cell type:markdown id: tags:
## 3. Filtering data in *pandas*
---
So far, *pandas* *DataFrames* have mainly served as a data store from which
data can be extracted in a very flexible way.
It is now time to discuss some of the most
common operations when trying to understand, clean and select data. This is done very
efficiently, because *pandas*, like *numpy* *arrays*, supports boolean or masked indexing.
If an array of boolean values containing only "True" or "False" is passed as the index,
only rows with True values are kept. Applying the selection criteria described above to
the columns and generating such boolean filter masks thus allows one to efficiently
select data for the final analysis, by applying the logical AND of such criteria as a filter
mask.
As an example, we may wish to exclude data at the lower border of the measurement range
from the above data sets, i.e. all data should fulfill the condition `x > 0.15`. A logical mask is
efficiently generated by the vectorized operation
```
mask = (df['Measurement 1', 'x'] > 0.15) | (df['Measurement 2', 'x'] > 0.15)
```
Note that the logical operations *and* (`&`), *or* (`|`), *xor* (`^`) and *not* (`~`), and of
course less (`<`), less or equal (`<=`), greater (`>`) and greater or equal (`>=`) are
supported in *pandas*.
%% Cell type:code id: tags:
``` python
# --> enter code to remove measurements with x<0.15
```
%% Cell type:markdown id: tags:
With a second mask, we could remove measurements with `x >= 0.85`.
```
mask2 = (df['Measurement 1', 'x'] < 0.85) | (df['Measurement 2', 'x'] < 0.85)
```
The logical *and* of these masks can then be used as a filter mask:
```
df_selected = df[(mask & mask2)]
```
%% Cell type:code id: tags:
``` python
# --> enter code here
```
%% Cell type:markdown id: tags:
Of course, we could have applied the filter in a single line of code:
```
df_sel = df[~( (df['Measurement 1', 'x'] <= 0.15) |
               (df['Measurement 2', 'x'] <= 0.15) |
               (df['Measurement 1', 'x'] >= 0.85) |
               (df['Measurement 2', 'x'] >= 0.85) ) ]
display(df_sel)
```
Note: here we have used a slightly different logic; we first select the entries to be removed and then invert the selection.
%% Cell type:code id: tags:
``` python
# --> please try it
```
%% Cell type:markdown id: tags:
As seen in this simple example, very complex and powerful
filter criteria can be efficiently applied to data with *pandas* *DataFrames*.
%% Cell type:markdown id: tags:
## 4. Filtering and manipulating data in *pandas* using the `.pipe` method
---
Often a single filter is not sufficient. Each applied filter, however, reduces the size of the data set, so that further filtering or the calculation of derived values is sped up. To take advantage of this, and to quickly and easily view the effect of each filter on the data set while keeping a simple, non-nested structure, the `.pipe` method can be used as an alternative: it chains all application steps into one procedural flow. For this purpose, functions are required that take a *pandas* *DataFrame* as the first argument and return a (modified) *pandas* *DataFrame*.
As an example, consider a procedure that includes the simple four steps:
* Filtering the data according to the x value
* Filtering the data according to the ratio Dy/y
* Calculating a value f(x, y)
* Filtering of specific f(x, y) values
For the selection of an x value, wrapping the existing code from above into a function that returns the reduced *pandas* *DataFrame* is sufficient:
```
def select_x_value(df, lower_value=0.15, upper_value=0.85):
    return df[~( (df['Measurement 1', 'x'] <= lower_value) |
                 (df['Measurement 2', 'x'] <= lower_value) |
                 (df['Measurement 1', 'x'] >= upper_value) |
                 (df['Measurement 2', 'x'] >= upper_value) )]
```
%% Cell type:code id: tags:
``` python
# --> your code here
```
%% Cell type:markdown id: tags:
The filter for Dy/y is shown here:
```
def select_y_value(df, threshold_value=0.1):
    return df[((df["Measurement 1"]["Dy"] / df["Measurement 1"]["y"] < threshold_value) &
               (df["Measurement 2"]["Dy"] / df["Measurement 2"]["y"] < threshold_value))]
```
%% Cell type:code id: tags:
``` python
# --> your code here
```
%% Cell type:markdown id: tags:
At the end, the function value f(x, y) is calculated on the (reduced) *pandas* *DataFrame* and a final filter
criterion is applied:
```
def calculate_f_value(df):
    df.loc[:, ("Measurement 1", "f(x, y)")] = df["Measurement 1"]["x"] ** 2 + 2 * df["Measurement 1"]["y"]
    df.loc[:, ("Measurement 2", "f(x, y)")] = df["Measurement 2"]["x"] ** 2 + 2 * df["Measurement 2"]["y"]
    return df

def select_f_value(df, value=3.5):
    return df[((df["Measurement 1", "f(x, y)"] < value) &
               (df["Measurement 2", "f(x, y)"] < value))]
```
%% Cell type:code id: tags:
``` python
# --> your code here
```
%% Cell type:markdown id: tags:
Now the complete procedure is applied using the `.pipe` method:
```
(df.pipe(select_x_value)
   .pipe(select_y_value, threshold_value=0.05)  # changing a criterion here is also possible
   .pipe(calculate_f_value)
   .pipe(select_f_value))
```
%% Cell type:code id: tags:
``` python
# --> test the pipe method
```
%% Cell type:markdown id: tags:
The same procedure flow without using the `.pipe` method would have approximately the following structure,
where `x_value_mask`, `y_value_mask` and `f_value_mask` are boolean masks that accomplish the same filter
result as the mask combinations inside the `select_x_value`, `select_y_value` and `select_f_value` functions:
```
# let x_value_mask be a mask that is created on df
df_sel = df[x_value_mask]
# let y_value_mask be a mask that is created on df_sel
df_sel = df_sel[y_value_mask]
df_sel = calculate_f_value(df_sel)
# or in a single step: calculate_f_value(df[x_value_mask & y_value_mask]), but both masks must be created on df!
# let f_value_mask be a mask that is created on df_sel after application of calculate_f_value
df_sel = df_sel[f_value_mask]
```
%% Cell type:markdown id: tags:
In summary, the `.pipe` method avoids the possibly messy `func2(func1(df[filter1])[filter2])` nesting
that might otherwise be encountered in a complex analysis chain. Which of the methods, boolean filter
masks or the `.pipe` method, is eventually applied depends on the user's choice.
%% Cell type:code id: tags:
``` python
```
%% Cell type:markdown id: tags:
---
# Particle Physics 1
## Exercise 4: Working principle of calorimeters
**Institut für Experimentelle Teilchenphysik** <br>
Lecture: Prof. Dr. T. Ferber <br>
Exercise: Dr. T. Chwalek <br>
Assistance: O. Lavoryk, M. Molch, M. Mormille, R. Quishpe <br>
Hand-In: 23.11.2023
---
%% Cell type:markdown id: tags:
This exercise sheet gives an overview of the working principle of calorimeters.
The [introduction](#Intro) provides an overview of the different types of calorimeters, connects the knowledge from the undergrad courses to the current lecture and sets the environment for the following exercises.
[Exercise 1](#Exercise1) performs a simplified calibration to connect a measured detector readout to the observed deposited energy of photons.
Subsequently, [Exercise 2](#Exercise2) repeats the methods from the first exercise for different particle types.
Finally, [Exercise 3](#Exercise3) discusses and determines the energy resolution in calorimeters.
%% Cell type:markdown id: tags:
---
# Introduction: Calorimeter <a id="Intro"></a>
---
The [fifth lecture](https://ilias.studium.kit.edu/goto.php?target=file_1976263_download&client_id=produktiv) introduces two specialised calorimeter concepts with the common task of measuring the energy of incoming particles.
The distinction between these two types is often made because of the respective challenges they face.
Electromagnetic calorimeters are primarily optimised to measure as precisely as possible the energy of photons and electrons.
In the previous exercise, you were already made familiar with the basic concepts of this type of shower.
To keep things simple, this exercise sheet will build on the learned principles.
The other type of calorimeter, the hadron calorimeter, specializes in dealing with hadronic showers.
The key differences for hadronic showers are:
- The hadronic interaction length is much larger than the radiation length of photons and electrons.
- Hadronic showers consist of about one third neutral pions, which decay into two photons.
- Lastly, hadronic showers contain a large fraction (20-40%) of "invisible" particles which reduces the possible energy resolution in comparison to electromagnetic calorimeters.
<div>
<img src="HadronicShower.png" width="800"/>
</div>
[Source](https://www.physi.uni-heidelberg.de/~sma/teaching/ParticleDetectors2/sma_HadronicCalorimeters.pdf)
Another design choice for calorimeters is the used material(s). A first-order distinction can be made between homogeneous and sampling calorimeters:
- **Sampling calorimeter** <br>
Sampling calorimeters have a layered structure with alternating layers of absorber (passive) and detector (active) material.
On the one hand, the design and material choice are flexible and optimisable to a specific task or financial budget.
On the other hand, some of the energy is deposited in the absorber which limits the energy resolution.
- **Homogeneous calorimeter** <br>
Here, one material is used simultaneously as an absorber and active material.
Consequently, all the energy is deposited in the detector which allows for an optimal energy resolution.
Also, the choice of the material can be specialised to the task, e.g. CsI has a very low light yield but allows for fast readout; CsI(Tl) has high light yield but slow readout.
However, the materials capable of this task (mostly scintillator crystals) are custom made, which often makes them very expensive and very heavy.
The following exercises will focus on a sampling calorimeter structure with lead absorbers and liquid argon as active material.
A key aspect of calorimeters is the readout of the detector which determines the measured energy of the particle.
As discussed in the [third lecture](https://ilias.studium.kit.edu/goto.php?target=file_1966617_download&client_id=produktiv), the total number of particles in an electromagnetic shower is proportional to the initial energy of the particle.
One very common use case in high-energy particle experiments is the use of scintillation materials.
In this case, the electromagnetic shower leads to the production of photons in the visible spectrum which can be measured by photo multipliers, photo diodes, etc.
For example, the voltage measured at a silicon photomultiplier is proportional to the number of incoming photons which in turn is proportional to the initial energy.
The following exercises will discuss a closely related measurement principle: The count of charged particles in the active material.
%% Cell type:markdown id: tags:
---
# Exercise 1: Calorimeter calibration <a id="Exercise1"></a>
---
This exercise presents a very simplified version of the calibration of the photon energy deposition in a calorimeter.
The goal is to determine the relation between the number of charged shower particles (in reality this would be a current) in the detector and the corresponding initial energy.
The setup is inspired by the [sampling electromagnetic calorimeter (ECAL) of ATLAS](https://atlas.cern/Discover/Detector/Calorimeter).
Here, incoming particles can shower in the lead absorbers.
The shower particles then ionize the liquid argon, ions drift to electrodes and the resulting number of charged particles (current) is measured.
%% Cell type:markdown id: tags:
To set up the simulation for this exercise follow the initialization instructions from the previous exercises.
In this and the following exercises, we will use a different physics list than in the previous exercises.
We will use the physics list `MyQGSP_BERT` which includes hadronic interactions as well.
The desired sampling calorimeter can be built by the `samplingcalo(lenAbsorber, lenActive, numLayers)` geometry.
This method produces `numLayers` layers of $(10\times 10\times\text{lenAbsorber/lenActive})\,$cm alternating absorber/active material. <br>
For example,
```python
g4.set_geometry('samplingcalo(2.,1.,15)')
```
will create a calorimeter with lead absorbers of thickness 2 cm and scintillator tiles of thickness 1 cm, and a total of 15 absorber layers.
Initialize the simulation:
- Construct the `ApplicationManager()`,
- build the calorimeter from the example,
- set the correct physics list and
- initialize.
%% Cell type:code id: tags:
``` python
import geant4_simulation as g4sim
import numpy as np
import matplotlib.pyplot as plt
from kafe2 import XYContainer, Fit, Plot
# Initialize the simulation.
# Your code ...
```
%% Cell type:markdown id: tags:
In this exercise, calibrate the given calorimeter for photons, i.e. determine the constant that translates the number $N_{c}$ of charged particles passing the scintillator into the energy $E_{\text{in}}$ of the initial particle.
To compute $N_{c}$, simulate for 20 different energies in the interval of $[1, 15]\,$GeV the interaction of 100 photons with the given calorimeter.
For each simulated photon, read the number of charged particles traversing the scintillator layers.
The number can be obtained using the `calo_readout()` method from the `ApplicationManager` class.
Be aware that this method only works for a single event; therefore, you need to loop over the desired number of photon interaction simulations.
Compute the mean and its uncertainty (standard deviation divided by the square root of the number of samples) for each energy and store the result.
Finally, fit the mean number of charged tracks per event and the simulated energy with an appropriate model.
**Hint**: A smaller sample size while debugging the code can be useful here. <br>
**Hint**: For a fit with `kafe2` you can use the `XYContainer` and `Fit` modules.
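A minimal, illustrative sketch of such a fit (not the full solution; `energies` and
`counts` are placeholder names for the values obtained from your simulation loop):
```python
# assumption: energies holds the simulated photon energies and counts[i]
# is the list of charged-track counts per event for energies[i]
means  = [np.mean(c) for c in counts]
errors = [np.std(c, ddof=1) / np.sqrt(len(c)) for c in counts]

def linear_model(E, a=1.0):
    # assumed calibration model: N_c proportional to E_in
    return a * E

xy_data = XYContainer(x_data=energies, y_data=means)
xy_data.add_error(axis='y', err_val=errors)
fit = Fit(data=xy_data, model_function=linear_model)
fit.do_fit()
Plot(fit).plot()
plt.show()
```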
%% Cell type:code id: tags:
``` python
# Your code ...
```
%% Cell type:markdown id: tags:
---
# Exercise 2: Calorimeter response to different particle types <a id="Exercise2"></a>
---
In this exercise, investigate the response of our sampling calorimeter to different types of particles.
Write a module that repeats the previous exercise as follows:
- Set the particle type.
- Define the desired energies. Here, choose 10 different steps in the interval of $[1, 10]\,$GeV.
- Loop over 100 samples.
- Store the mean number of charged tracks per event and its uncertainty.
- Fit the mean number of charged tracks.
Study the detector response for electrons (PDG code 11), photons (PDG code 22) and charged pions (PDG code 211).
To contain the full pion shower, increase the number of sampling layers to 50.
%% Cell type:markdown id: tags:
<div class="alert alert-info">
<strong>Question 2.1:</strong>
Do you see any differences between electron and photon induced showers? Why?
</div>
%% Cell type:markdown id: tags:
<div class="alert alert-success">
<strong>Answer 2.1:</strong>
*Your answer...*
</div>
%% Cell type:markdown id: tags:
<div class="alert alert-info">
<strong>Question 2.2:</strong>
Do you see any differences between charged pion and photon induced showers? Why?
</div>
%% Cell type:markdown id: tags:
<div class="alert alert-success">
<strong>Answer 2.2:</strong>
*Your answer...*
</div>
%% Cell type:code id: tags:
``` python
# Your code ...
```
%% Cell type:markdown id: tags:
---
# Exercise 3: Calorimeter resolution <a id="Exercise3"></a>
---
In this exercise, we want to determine the stochastic factor $a$ of the energy resolution $\frac{\sigma(E)}{E}$ of the calorimeter for photons as a function of the photon energy $E_{\text{in}}$.
%% Cell type:markdown id: tags:
<div class="alert alert-info">
<strong>Question 3.1:</strong>
* Why do we only care about the stochastic contribution here?
* How does the resolution depend on the energy $E_{\text{in}}$?
</div>
%% Cell type:markdown id: tags:
<div class="alert alert-success">
<strong>Answer 3.1:</strong>
*Your answer...*
</div>
%% Cell type:markdown id: tags:
To compute the energy deviation $\sigma(E)$, simulate for 20 different energies in the interval of $[1, 10]\,$GeV the interaction of 100 photons.
Use a liquid argon sampling calorimeter (15 layers) from the [first exercise](#Exercise1).
For each event, determine the response energy by counting the number of charged tracks $N_c$ and applying the previously computed calibration.
Store the deviation of the response energy $\sigma(E)$ as the standard deviation over all samples.
The uncertainty for $\sigma(E)$ can be calculated as $\frac{\sigma(E)}{\sqrt{N_\text{Sample}-1}}$.
Finally, fit $\frac{\sigma(E)}{E}$ with an appropriate model and determine the stochastic factor $a$.
%% Cell type:markdown id: tags:
<div class="alert alert-info">
<strong>Question 3.2:</strong>
Does the result fulfil the expectation?
</div>
%% Cell type:markdown id: tags:
<div class="alert alert-success">
<strong>Answer 3.2:</strong>
*Your answer...*
</div>
%% Cell type:code id: tags:
``` python
# Your code ...
```
%% Cell type:markdown id: tags:
<div class="alert alert-info">
<strong>Bonus question:</strong>
How does the energy resolution of a sampling calorimeter depend on the thickness of the absorber layers?
</div>
%% Cell type:markdown id: tags:
<div class="alert alert-success">
<strong>Answer Bonus:</strong>
*Your answer...*
</div>