cern-prototype-proposal/computing.tex at master · jrklein/cern-prototype-proposal · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
% moved to main doc \section{Computing requirements, data handling and software}

\subsection{Overview}
The proposed ``Full Scale'' test of a single-phase Liquid Argon TPC at CERN will build upon the technology and expertise developed in the
process of design and operation of its smaller predecessor, the 35t detector at FNAL.
This includes elements of front-end electronics, data acquisition, run controls and related systems. We also expect that for the most part,
Monte Carlo studies necessary to support this program will be conducted utilizing software evolved from tools currently used (2015).

In the test-beam setup, the detector performance will be characterized for different types of particles, e.g. $p$, $\pi^{\pm}$, $\mu^{\pm}$ etc (see~\ref{detbeam_main}).
Current plans call for measurements in pre-defined bins of the incident particle momentum, which will have widths ranging from tens to
hunderds of MeV (see~\ref{bin_table}). The volume of data to be recorded shall be determined by the number of events to be collected in each measurement, such that is
provides adequately low statistical uncertainty of the parameters measured.

In our view, it is optimal to first stage the ``precious'' data collected from the prototype on disk at CERN and then sink it to tape (also at CERN),
while simultaneously performing replication to data centers in the US. For the latter, FNAL is the prime candidate, with additional data centers at Brookhaven National Laboratory
and the NERSC facility as useful additional locations for better redundancy and more efficient access to the data from greater number of locations.

\subsection{Collecting and Storing Raw Data}

\subsubsection{Considerations for Event Size Estimates}

To set the scale, let's consider an approximate \textbf{upper limit} on the number of channels with signals above the threshold
of zero-suppression, for a single track. For this we assume digitization rate fixed at 2MHz, and 4312 samples per drift window. In a single
anode plane assembly (APA), it will be approximately of the order of channel count in the APA, i.e. around 2500, \textit{when the track is parallel to the APA plane}.
Since each sample is 16 bit (or 12 bit in more design), we arrive to the limit of approximately 20MB per single charged track.
For this class of events, the amount of data will scale roughly linearly with the length of the track, i.e. in cases when a track is stopped or leaves the sensitive volume
there will be less data. Further, in most cases the data will be zero-suppressed by the front-end electronics (e.g signals below a certain threshold
will not be included into the outgoing data stream). The exact data reduction factor will depend on a variety of factors (cf. threshold), but as a rule of
thumb it's an order of magnitude. \textit{We conclude therefore the events will typically be a few megabytes in size}. This in fact is supported
be previous Monte Carlo studies performed for ealier versions of LBNE LAr TPC. More detail will be presented below.

At the time of writing, work is being done on the physical design of the Liquid Argon prototype, and the number of the Anode Plane Assemblies (APA)
to be installed in the detector is not yet finalized. It will likely be 2 or 3, however there is a possibility of this number to be as high as 6.  There is therefore a
factor of two or three  uncertainty in the number of readout channels in the detector (e.g. 7680 with 3 APA vs 15360 with 6).

This would affect the amount of data produces by DAQ, although not necessarily by a factor of two since a large part of the raw data will be zero-suppressed
and occupancy in general is expected to be low. With each additional APA, the number of background tracks (or track segments) produced by cosmic ray
muons will scale very approximately at a rate of O(1) per APA. Because of the direction of incidence of these tracks and the fact that in most cases they will
be crossing only part of the active volume, we will account for this by adding data equivalent to approx. 5 extra tracks to each event. This will very roughly correspond
the upper limit of 6 APAs in the apparatus and thus the estimate will be conservative.

In addition to the principal sensitive volume where Liquid Argon will serve as active medium for the TPC, the prototype will also contain a Photon Detector designed
to record light pulses produced in Argon due to scintillation caused by ionizing radiation. In any realistic scenario, the amount of data to be produced by
the Photon Detector will be quite small compared to that of the Liquid Argon TPC. Same goes for other elements of the experimental apparatus (hodoscopes,
trigger systems etc) and as a result, for the purposes of this section, we shall focus only on the Liquid Argon TPC as the critical source of data.


\subsubsection{``Before'' and ``After'' Readout Windows}
\label{readout_windows}

%According to various estimates, given the volume of the LAr prototype, we can expect $O(1)$ cosmic ray tracks to overlay
%the triggered ``beam'' events. This leads to extra data included in each event by a value commensurate with single muons.

As we just mentioned, the detector will be sensitive to background tracks due to cosmic ray particles.  These must be properly identified
and accounted for, in order to ensure high quality of the measurements and subsequent detector characterization. Since overlay of cosmic
ray muons over beam events is stochastic in nature, this can be achieved by recording signals which were produced ``just before'' and ``just after''
the arrival of the test particle from the beamline (this will enables us to reconstruct and account for partial or complete background
tracks present in the ``main'' event).

To ensure complete collection of charge due to such tracks, the additional readout windows before and after
the beam event should equal the nominal total drift time for the collection volume (approx. 2.1ms). This will triple the amount of
data due to cosmic rays, collected from the detector.


\subsubsection{Statistics and the Volume of Data}
\label{bin_table}
Experimental program for the test includes triggering on a few types of particles over a range of momenta (see~\ref{detbeam_main}). We introduce
bins for the particle momenta as shown in the table below. The estimated event sizes listed in the table are based on
Monte Carlo studies performed earlier for the 10kt version of the LBNE Far Detector and must be considered ballpark estimates. Hadronic and electromagnetic
showers were included in the MC samples so their effect is accounted for. As a concrete example, for an incident electron of 4GeV/c momentum
calculations indicate an average event size of $\sim$2MB, after zero-suppression.


\begin{center}
\begin{tabular}[h]{|l|r|r|r|r|}
  \hline
 Particle & Momentum Range& Bin         & Approx. event \\ % & Approx. \#\\
 Type    &  (GeV/c)                 & (MeV/c) & size, MB \\ %& of events, $10^6$\\
  \hline
  $p$ & 0.1-2.0 & 100 & 1 \\ % & 1 \\
  $p$ & 2.0-10.0 & 200 & 5 \\ %& 1 \\
  \hline
   $\mu^{\pm}$ & 0.1-1.0 & 50  & 1 \\ % & 1 \\
   $\mu^{\pm}$ & 1.0-10.0 & 200  & 5 \\ % & 1 \\
%   $\mu^{\pm}$ & $>$100 & ?  & 20 & 1 \\
  \hline
   $e^{\pm}$ & 0.1-2.0 & 100 & 1  \\ % & 1 \\
   $e^{\pm}$ & 2.0-10.0 & 200  & 4 \\ % & 1 \\
  \hline
   $K^{+}$ & 0.1-1.0 & 100 & 1  \\ % & 1 \\
  \hline
   $\gamma(\pi^{0})$ & 0.1-2.0 & 100  & 1 \\ % & 1 \\
   $\gamma(\pi^{0})$ & 2.0-5.0 & 200  & 5 \\ % & 1 \\
  \hline
\end{tabular}
\end{center}

Prelimiary plans call for statistics of the order of $10^4 - 10^5$  events to be collected in each bin.
% so in the table above we marked it as nominal 1M events for each entry. The requirements to the scale of these statistics are currently being refined.
Depending on the assumptions, this translates into $\sim$20 million events total (for all event classes) to ensure enough statistics for subsequent analysis.
Taking into account the cosmic ray overlay and additional readout windows as explained in~\ref{readout_windows}, we arrive to a number
of $\sim$1PB for total storage space necessary to host \textbf{the raw data}. This needs to be looked at as the basis for tape budget. As explained below, this
volume of data needs to be replicated for assured preservation, in at least one more additional facility, hence in effect this number must be doubled when budgeting
tape.

\subsubsection{Summary of the Data Volume Estimates}
The total amount of data to be collected during the prototype operation will be considerable under all assumptions and estimates.
To fulfill the mission of this test beam experiment, we expect that we will need tape storage of $O(PB)$ size,
and a more modest disk space for raw data staging at CERN, for replication purposes. We envisage storing the primary copy of raw data
at CERN, with replicas at additional locations. There will be additional requirements for processed and Monte Carlo data placement.

%\subsubsection{Data Acquisition and Storage}
%\label{daq_and_storage}
%The Data Acquisition System used in this experiment will be derived from the system currently used in the 35t Liquid Argon prototype being commissioned at FNAL. We foresee three Linux %workstations equipped with interface cards to be used for data readout. It will be necessary to provide a stage-out disk space (a few tens of TB in size) to serve as a buffer for the DAQ %system, from which the data will be transmitted to
%\begin{itemize}
%\item Tape storage at CERN.
%\item Storage facility at Fermilab, where the data will be written to tape and also staged in dCache as necessary for express analysis.
%\item \textit{Optional} other locations, such as auxiliary storage at Brookhaven National Laboratory , which is now being planned according to the data volumes
%estimated in this paper. NERSC facility has considerable potential in serving as an additional data hub. Partial copies will be transmitted to other National Laboratories
%and research centers.
%\end{itemize}

%Considerable volume of the raw data makes unlikely that it will be all staged on disk at any single location, in its entirety. We foresee the
% need for ample amount of tape storage, and a smaller disk space ($\sim$0.5-1.0PB) for staging raw data coming out of the detector - in transit to tape
%storage and CERN and the US facilities.

\subsubsection{Raw Data Transmission and Distribution}
As mentioned in~\ref{daq_and_storage}, at least two full replicas will exist for the raw data - at CERN and at FNAL (and likely an additional replica at BNL). Moving data outside CERN
is subject to a number of requirements that include:
\begin{itemize}
\item automation
\item monitoring
\item error checking and recovery (redundant checks to ensure the ``precious'' data was successfully sunk to mass storage at the endpoint)
\item compatibility with lower-level protocols that are widespread, well uncerstood and maintained (cf. gridFTP)
\end{itemize}

There are a number of systems that can satisfy these requirements, and one of them where we possess sufficient expertise and experience is Spade, first used in IceCube~\cite{spade_icecube} and then enhanced and successfully utilized in Daya Bay experiment~\cite{spade_dayabay}.

%Note that at this stage of the lifecycle of the data we foresee transmission between fixed endpoints (with sufficient degree of automation, checking, redundancy etc) but we haven't %discussed yet the topic of how such
%data is made available to researchers who wish process it at their home institution or on the Grid/in the Cloud - this will be addressed in~\ref{wms}.


\subsection{Databases}
A few types of databases will be required:
\begin{itemize}
\item Run Log, Conditions and Slow Controls records
\item Offline Calibrations
\end{itemize}

Databases listed in the former item will need to be local to the experiment in order to reduce latency, improve reliability, reduce downtime dut to
network outages etc.
A replication mechanism will need to be put in place the data is readily available at the US and other sites. The volume of data
stored in these databases will likely to be quite modest.

%As to the offline calibrations, for optimal access, such databases should be located closer to the location where most processing will take place.
%FNAL is a good candidate for that (also from the support point of view), and we shall also consider replication of these database to other
%research institutions.

\subsection{A note on Simulation and Reconstruction Software}
Research effort connected to the ``Full Scale'' prototype at CERN will benefit from utilizing simulation toolkits, and tracking and other reconstruction
algorithms created by communities such as former LBNE, and especially during the 35t test at FNAL. In order to leverage this
software and expertise, appropriate manpower will need to be allocated in order to create and maintain physics analysis tools
necessary to fulfill the research goals of this experiment.

These tools will reply on software components which will need to be portable, well maintained and validated, given
the widely distributed nature of the Collaboration and the need to use geographically dispersed resources. To ensure that this happens,
we plan to establish close cooperation among participating laboratories and other research institutions. The software will also need
to be amenable to running on Grid facilities, and will require Distributed Data capability (see~\ref{wms}).

\subsection{Distributed Computing, Workload and Workflow Management}
\label{wms}
\subsubsection{Scale of the Processed Data}
According to our estimates, the volume of raw data will be in the petabyte range. The offline data can be classified as follows:
\begin{itemize}
\item Monte Carlo data, which will contain multiple event samples to cover various event types and other conditions during the measurements with the prototype detector
\item Data derived from Monte Carlo events, and produced with a variety of tracking and pattern recognition algorithms in order to create a basis for the detector characterization
\item Intermediate calibration files, derived from calibration data
\item Processed experimental data, which will likely exist in a few brunches corresponding to a few reconstruction algorithms being applied, with the purpose of their evaluation
\end{itemize}

In the latter, there will likely be more than one processing step, thus multiplying data volume. There is sometimes a question about how much of the raw data should be preserved in
the processed data streams. Given a relatively large volume of raw data, the answer in this case will likely be ``none'' - for practical reasons, meaning that the derived data will
be just that, and that the size of the processed data will likely by significantly smaller than the input (the raw data). Given consideration presented above, we will plan for
$\sim$1PB of tape storage to keep the processed data. For efficient processing, disk storage will be necessaty to stage a considerable portion of both raw data (inputs) and
one or a few steps in processing (outputs).

Extrapolating from our previous experience running Monte Carlo for the former LBNE Far Detector, we estimate that we'll need a few hundred TB of continuously available
disk space. In summary, we request 2PB of disk storage at FNAL to ensure optimal data availability and processing efficiency. Access to distributed data is discussed below.

\subsubsection{Distributed Data}
We foresee that data analysis (both experimental data and Monte Carlo) will be performed by collaborators residing in many institurions and geographically dispersed. In our
estimated above, we mostly outlined storage space requirements for major data centers like CERN and FNAL. When it comes to making these data available to the researchers,
we will utilize a combination of the following:
\begin{itemize}
\item Managed replication of data in bulk, performed with tools like Spade discussed above. Copies will be made according to wishes and capabilities of participating institutions.
\item Network-centric federated storage, based on XRootD. This allows for agile, just-in-time delivery of data to worker nodes and workstations over the network. This
technology has been evolving rapidly in the past few years, and solutions have been found to mitigate performance penalty due to remote data access, by implementing caching
and other techniques.
\end{itemize}

In order to act on the latter item, we plan to implement a global XRootD redirector, which will make it possible to transparently access data from anywhere.
A concrete technical feature of storage at FNAL is that there is a dCache network running at this facility, with substantial capacity which can be leveraged
for the needs of the CERN prototype analysis. This dCache instance is equipped with a XRootD ``door'' which makes it accessible to outside world, subject
to proper configuration, authentication and authorization.

As already mentioned, we plan to host copies of a significant portion of raw and derived data at Brookhaven National Laboratory, where substantial expertise
exists in the field of data handling and processing at scale, due to the principal role this Laboratory plays in both RHIC (e.g. STAR) and ATLAS experiments.
Initial simple tests of XRootD federation to access data residing at FNAL, from a XRootD instance located at BNL have been successful. At this point in time we
are formulating the hardware requirements that need to be fulfilled in order for BNL to play the role of an additional data center as described here.


\subsubsection{Distributed Processing}
At the time of writing, FNAL provides the bulk of computational power for LBNE (not to mention a few other IF experiments), via Fermigrid and other facilities.
We plan to leverage these resources to process the prototype data. At the same time, we envisage a more distributed computing model where Grid resources
are available transparently and sometimes on the opportunistic basis, using facilities made available by national Grids and in the case of the United States, by
the Open Science Grid Consortium.

There are currently very large uncertainties regarding what scale of CPU power will be required to process the data, given that tracking, reconstruction and
other algorithms are in a fairly early stage of development. The best estimates we have at this point range from 10 to 100 seconds required by a typical
CPU to reconstruct a single event. This means that utilizing a few thousand cores through Grid facilities, it will be possible to ensure timely processing of these data.

We have not chosen yet the type of Workload Management System to be used for the purposes of the CERN prototype test. Currently, many researchers at FNAL
are using the \textit{jobsub} tool, which opens access to Fermigrid and additionally to the Open Science Grid and Cloud resources. There are other capable (and arguably
more sophisticated) system being used, for example by the LHC experiments. We plan to conduct an evaluation of these systems with a view to adopt a different
solution if necessary.


%\end{document}