Commit e710c9dc authored by Uwe Schulzweida's avatar Uwe Schulzweida
Browse files

Merge of branch cdo-pio into trunk cdi

parent ef8aa4eb
......@@ -32,6 +32,25 @@ doc/cdi_fman.pdf -text
doc/coding_standards/README -text
doc/coding_standards/footer.c -text
doc/coding_standards/ref-example.c -text
doc/pio/graphics/communicators.pdf -text
doc/pio/graphics/encodeNBuffer.pdf -text
doc/pio/graphics/legend.pdf -text
doc/pio/graphics/pioBuffer.pdf -text
doc/pio/graphics/pio_fpguard.pdf -text
doc/pio/graphics/pio_mpi.pdf -text
doc/pio/graphics/pio_none.pdf -text
doc/pio/graphics/pio_writer.pdf -text
doc/pio/graphics/serial.pdf -text
doc/pio/graphics/timestep.pdf -text
doc/pio/pio_docu.pdf -text
doc/pio/tex/communicators.tex -text
doc/pio/tex/intro.tex -text
doc/pio/tex/makepdf_f -text
doc/pio/tex/modes.tex -text
doc/pio/tex/pio_docu.tex -text
doc/pio/tex/stages.tex -text
doc/pio/tex/subprograms.tex -text
doc/pio/tex/timestepping.tex -text
doc/tex/FUNCTIONS -text
doc/tex/Modules -text
doc/tex/attribute.tex -text
\section{{\tt MPI} communicators}
\caption {Participation of {\tt MPI} processes in communication
The scalability of Earth System Models (ESMs) is the leading target of the
project, in particular with regard to future computer development. Our work
focuses on overcoming the I/O bottleneck. The
\href{}{Climate Data Interface} ({\CDI}) is a
sophisticated data handling library of
the Max-Planck-Institute for Meteorology with broad acceptance in the
\caption{{{\CDI} {\tt streamWriteVar}}}
Its I/O is carried out synchronously relative to the model calculation.
We decided to parallelize the file writing with the {\CDI} because of the
great benefit for many ESMs.
We analyzed
several HPC systems with respect to the impact of their architecture, filesystem and
MPI implementation on I/O performance and scalability. The investigation
delivered a large spectrum of results. As a consequence,
we introduce different modes of low level file writing. The tasks
of the I/O processes, which are now decoupled from the model calculation, are split:
we distinguish between collecting, encoding, compressing
and buffering the data on the one side and writing the data to file on the other. The extension
of the {\CDI} with parallel writing ({\pio}) provides five different I/O modes
making use of this role allocation. The combination of I/O mode, number and
location of processes that performs best on a
particular machine can be determined in test runs.
On some systems it is impossible to write to one file from different physical
nodes. Taking these architectural constraints into consideration, we made the
distribution of I/O processes and files across physical nodes a key
pillar of the {\pio} design. If the I/O processes are located on different physical nodes,
the {\tt MPI} communicator for the group of I/O processes is split and each
subgroup gets a load-balanced subset of the files to write. On some machines
this approach increases the throughput remarkably.
The application programming interface of the {\CDI} is kept untouched;
we introduce only a few indispensable functions and encapsulate the other
developments inside the library. The models have to eliminate the special
position of the root process with respect to file writing. All model processes
``write'' their chunk of the decomposed data and save the time formerly needed
for gathering and low level writing.
\ No newline at end of file
pdflatex ${FILE}
pdflatex ${FILE}
cat > ${FILE}.ist << 'EOF'
delim_0 "{\\idxdotfill} "
headings_flag 1
heading_prefix "{\\centerline {\\Large \\bf "
%heading_prefix "{\\centerline {\\bfseries "
heading_suffix "}}"
EOF
makeindex -s ${FILE}.ist ${FILE}.idx
pdflatex ${FILE}
thumbpdf ${FILE}
pdflatex ${FILE}
The I/O performance and scalability of a supercomputer depend on the combination
of the hardware architecture, the filesystem and the {\tt MPI} implementation. We
made test runs on several machines invoking {\tt MPI\_File\_iwrite\_shared}, the
obvious way of parallel file writing.
Especially on the IBM Blizzard, naturally our primary focus, the
benchmark programs achieved surprisingly poor results.
Accordingly the {\CDI} also has to provide the possibility to write files in parallel
using {\tt POSIX IO}.
The tasks formerly carried out by the root process are split into the subtasks
of gathering, encoding and compressing on the one side and writing the files on the other. The
variable partition of these subtasks within the group of I/O servers led us to five
modes of low level writing.
{\pio} is backwards compatible due to the differentiated behavior of the {\CDI}
calls {\tt streamClose}, {\tt streamOpenWrite}, {\tt streamDefVlist}
and {\tt streamWriteVar} on the model side, depending on local writing and I/O stage.
The {\tt stream} functions were primarily written for low level file writing;
as a matter of course
the collecting I/O processes also invoke them. This makes the I/O modes
another key to the program flow of the {\CDI} stream calls.
\subfloat[Pseudo code encodeNBuffer]
\caption {Legend and pseudo code encodeNBuffer}
On the I/O processes
the execution of the subprogram {\tt streamWriteVar} is controlled by the I/O
modes. To clarify the functioning we use pseudo code and
flowcharts, as shown in figure~\ref{legend}.
The command
\texttt{encodeNBuffer} abstracts the encoding, compressing and buffering
of the data for a variable. The data in a
{\tt GRIB} file must not be mixed in time, so we need a command \texttt{newTimestep}
to manage the flushing of the output buffers on the I/O server side. To
achieve this, an \texttt{MPI\_Barrier} is used.
I/O modes provided by {\pio}:
\begin{deflist}{\tt PIO\_FPGUARD\ }
\item[{\htmlref{\tt PIO\_NONE}{PIONONE}}]
one process collects, transposes, encodes, compresses, buffers and writes using
{\tt C} {\tt fwrite}.
\item[{\htmlref{\tt PIO\_MPI}{PIOMPI}}]
all processes collect, transpose, encode, compress, buffer and write using
{\tt MPI\_File\_iwrite\_shared}.
\item[{\htmlref{\tt PIO\_ASYNCH}{PIOASYNCH}}]
one process writes the files using low level
{\tt POSIX\_AIO}, the others collect, transpose, encode, compress and buffer.
\item[{\htmlref{\tt PIO\_FPGUARD}{PIOFPGUARD}}]
one process guards the fileoffsets, all others
collect, transpose, encode, compress and write using {\tt C} {\tt fwrite}.
\item[{\htmlref{\tt PIO\_WRITER}{PIOWRITER}}]
one process writes the files using {\tt C} {\tt fwrite},
the others collect, transpose, encode, compress and buffer.
\section{{\tt PIO\_NONE}: $1$ process collects and writes using {\tt POSIX IO}}
\index{PIONONE@{\tt PIO\_NONE}}
The I/O mode {\tt PIO\_NONE} can only be run with one I/O process per physical
node. This
process collects, encodes, compresses, buffers and writes the data to the files
on its node. For low level file writing the {\tt C} standard \texttt{fwrite}
is used. The advantages
in comparison with the former serial writing are that the writing is done
asynchronously with respect to the calculation and that the data is buffered. In
addition it can be executed in parallel spread over physical nodes.
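The low level writing of {\tt PIO\_NONE} reduces to buffered {\tt C}
\texttt{fwrite} calls. The following sketch illustrates this; the names
\texttt{filename}, \texttt{buffer} and \texttt{nBytes} are placeholders, not
{\CDI} identifiers.
\begin{lstlisting}[language=C, backgroundcolor=\color{zebg},
basicstyle=\footnotesize]
/* Sketch: the single I/O process drains a filled output buffer
   with the C standard fwrite; all names are illustrative.     */
FILE *fp = fopen ( filename, "wb" );
size_t written = fwrite ( buffer, 1, nBytes, fp );
if ( written != nBytes )
  perror ( "fwrite" );    /* short write: report and handle    */
fclose ( fp );
\end{lstlisting}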
\subfloat[{\tt PIO\_NONE}]{\includegraphics[scale=0.5]{../graphics/pio_none.pdf}}
\subfloat[{\tt PIO\_MPI}]{\includegraphics[scale=0.5]{../graphics/pio_mpi.pdf}}
\caption {{\tt PIO\_NONE} and {\tt PIO\_MPI}}
\section{{\tt PIO\_MPI}: $n$ processes collect and write using {\tt MPI IO}}
\index{PIOMPI@{\tt PIO\_MPI}}
Data access using {\tt MPI} is the straightforward way to parallel file
manipulation. With \\
{\tt MPI\_File\_iwrite\_shared} the processes have a shared
file pointer available. The function is nonblocking and split collective. Like
{\htmlref{\tt PIO\_NONE}{PIONONE}} the I/O mode {\tt PIO\_MPI} has no division
of tasks within the I/O group; all processes collect, encode, compress, buffer and
write to file. Writing in this I/O mode strongly depends on the {\tt MPI}
implementation; the buffers used internally are of major importance for the
performance of writing.
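A minimal sketch of writing through the shared file pointer, assuming an
{\tt MPI\_File} handle \texttt{fh} already opened with
\texttt{MPI\_File\_open} and a filled buffer \texttt{buf} of \texttt{len}
bytes (placeholder names):
\begin{lstlisting}[language=C, backgroundcolor=\color{zebg},
basicstyle=\footnotesize]
/* Sketch: nonblocking write through the shared file pointer.  */
MPI_Request request;
MPI_Status  status;
MPI_File_iwrite_shared ( fh, buf, len, MPI_BYTE, &request );
/* ... encode and compress the next chunk in the meantime ...  */
MPI_Wait ( &request, &status ); /* complete before reusing buf */
\end{lstlisting}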
\section{{\tt PIO\_WRITER}: $n - 1$ processes collect and $1$ writes using
{\tt POSIX IO}}
\caption{{\tt PIO\_WRITER}}
If the I/O mode {\tt PIO\_WRITER} is chosen, the subtasks of writing are split
between the I/O processes. Just one process per physical node does the
low level writing while the others collect, encode, compress and buffer
the data. The writer is the process with the highest rank within the
I/O group on one physical node. Originating from {\htmlref{\tt pioInit}{pioInit}}
it invokes a
backend server function, which it does not leave until it has received messages from
the collecting I/O processes to finalize. A collector gets data from the calculating
model processes via {\tt MPI} RMA communication, and, after encoding and
compressing it, pushes it to a double buffer. If the buffer is filled,
the contained data is sent via \texttt{MPI\_Isend} to the writer; the collector
then switches to the other buffer and continues its job. Before sending the
data it has to wait for a potentially outstanding
\texttt{MPI\_Request}. This might happen if the writer or the buffers
used by {\tt MPI} are overcommitted, and it indicates that the ratio of
collectors and writers has to be checked. The writer
polls with \texttt{MPI\_Probe} for incoming messages from
the collectors. One message per collecting process is tagged with the finalize
command. All other messages contain a stream identifier, a buffer with
data to be written and a file manipulation instruction. There are three kinds of
these commands:
1. open a file and write the data to it, 2. write the data to an
open file and 3. write the data to an open file and close it afterwards.
For the file writing the {\tt C} standard \texttt{fwrite} is used.
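The writer side of this message protocol can be sketched as follows; the tag
names and the node communicator \texttt{commNode} are illustrative
assumptions, not the actual {\pio} identifiers.
\begin{lstlisting}[language=C, backgroundcolor=\color{zebg},
basicstyle=\footnotesize]
/* Sketch of the writer polling loop; FINALIZE etc. are
   hypothetical tag values.                                    */
int nFinalized = 0;
while ( nFinalized < nCollectors )
  {
    MPI_Status status;
    int        count;
    MPI_Probe ( MPI_ANY_SOURCE, MPI_ANY_TAG, commNode, &status );
    MPI_Get_count ( &status, MPI_BYTE, &count );
    MPI_Recv ( buffer, count, MPI_BYTE, status.MPI_SOURCE,
               status.MPI_TAG, commNode, &status );
    if ( status.MPI_TAG == FINALIZE )
      ++nFinalized;
    else  /* OPENWRITE, WRITE or WRITECLOSE */
      {
        /* decode stream id and command, then fwrite the data  */
      }
  }
\end{lstlisting}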
\section{{\tt PIO\_ASYNCH}: $n - 1$ processes collect and $1$ writes using
{\tt POSIX AIO}}
The I/O mode {\tt PIO\_ASYNCH} is similar to {\tt PIO\_WRITER}; it differs only in
the method used for low level file writing. The asynchronous nonblocking I/O
can be overlapped with processing; write orders are passed to the operating
system.
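The handoff of a write order to the operating system can be sketched with the
{\tt POSIX AIO} interface; \texttt{fd}, \texttt{buf}, \texttt{len} and
\texttt{offset} are placeholder names.
\begin{lstlisting}[language=C, backgroundcolor=\color{zebg},
basicstyle=\footnotesize]
#include <aio.h>
#include <string.h>
/* Sketch: pass a write order to the OS and overlap it with
   further collecting and buffering.                           */
struct aiocb cb;
memset ( &cb, 0, sizeof ( cb ) );
cb.aio_fildes = fd;
cb.aio_buf    = buf;
cb.aio_nbytes = len;
cb.aio_offset = offset;
aio_write ( &cb );                   /* returns immediately    */
/* ... receive and stage the next buffer ...                   */
while ( aio_error ( &cb ) == EINPROGRESS )
  ;                                  /* or use aio_suspend     */
ssize_t written = aio_return ( &cb );
\end{lstlisting}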
\section{{\tt PIO\_FPGUARD}: $n - 1$ processes collect and write using {\tt POSIX IO}}
\caption{{\tt PIO\_FPGUARD}}
Writing a huge amount of data at a fixed file offset is a very fast
way of file writing. In this I/O mode one I/O process per physical
node is dedicated to administrating the file offsets while the others do all the
subtasks defined earlier for the parallel I/O. The functionality of this
collaboration is similar to
{\tt PIO\_WRITER}. Originating from {\htmlref{\tt pioInit}{pioInit}} the
process with the highest rank calls a backend server
function in which it busy waits
for messages. The collecting I/O processes
get data from the calculating model processes via {\tt MPI} RMA communication.
The data is encoded, compressed and buffered.
If the buffer is filled, the collector
sends the count of the contained data to the ``file pointer guard'' and gets a
file offset back. With the received offset the collector writes the data to
file using the {\tt C} standard
\texttt{fwrite} and goes on with its job. One message per collecting process
is tagged with the finalize command. All other messages needed for
the communication between the ``file pointer guard'' and the collectors
contain a stream identifier, a numeric value holding either the amount of buffered
data or the file offset, and a command. There are three kinds of commands:
1. offset wanted for a file that will be newly opened, 2. offset wanted for an
open file and 3. offset wanted for a file that will be closed after writing.
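The collector side of this offset protocol can be sketched as follows; the
guard's rank \texttt{fpguardRank}, the tag and the per-stream file handle are
illustrative assumptions.
\begin{lstlisting}[language=C, backgroundcolor=\color{zebg},
basicstyle=\footnotesize]
/* Sketch: request a file offset from the file pointer guard,
   then write with C standard fwrite at that offset.           */
long size = nBytesBuffered, offset;
MPI_Status status;
MPI_Send ( &size,   1, MPI_LONG, fpguardRank, WRITE, commNode );
MPI_Recv ( &offset, 1, MPI_LONG, fpguardRank, WRITE, commNode,
           &status );
fseek  ( fp, offset, SEEK_SET );     /* fp: per-stream FILE*   */
fwrite ( buffer, 1, size, fp );
\end{lstlisting}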
\ No newline at end of file
\definecolor{zebg}{rgb}{1,1,.8} %elfenbeinfarbig
% pdffitwindow=true,
pdfauthor={Deike Kleberg},
pdftitle={CDI-pio Manual},
pdfcreator={pdflatex + hyperref},
% pdfpagemode=FullScreen,
\newcommand{\CDI}{\bfseries\sffamily CDI}
\newcommand{\pio}{\bfseries\sffamily CDI-pio}
\newcommand{\deflabel}[1]{\bf #1\hfill}
{\settowidth{\labelwidth}{\bf #1}
\newcommand{\idxdotfill}{\ \dotfill \ }
\title{{\pio}, {\CDI} with parallel I/O}
\author{Deike Kleberg, Uwe Schulzweida, Thomas Jahns and Luis Kornblueh\\
Max-Planck-Institute for Meteorology and Deutsches Klima Rechenzentrum\\
Project ScalES funded by BMBF}
\chapter{I/O stages in the program flow}
\chapter{Modes of low level writing}
\chapter{{\pio} modules}
\chapter{Internal concepts}
\ No newline at end of file
With the concept of I/O stages in the model program flow {\pio} meets two of the
main requirements for asynchronous I/O with the {\CDI}: the consistency of the
resources on {\tt MPI} processes in different groups and the minimization of the
communication. The program flow is divided into three stages:
\begin{deflist}{\tt STAGE\_DEFINITION \ }
\item[{\htmlref{\tt STAGE\_DEFINITION}{STAGEDEFINITION}}]
The {\CDI} resources have to be defined.
\item[{\htmlref{\tt STAGE\_TIMELOOP}{STAGETIMELOOP}}]
Data can be moved from the model to the collecting I/O server.
\item[{\htmlref{\tt STAGE\_CLEANUP}{STAGECLEANUP}}]
The {\CDI} resources can be cleaned up.
\end{deflist}
A listing of an example program built up of control (\ref{control}) and
model run (\ref{model}) in chapter \nameref{modules} clarifies the program flow.
\section{Define {\CDI} resources: {\tt STAGE\_DEFINITION}}
{\tt STAGE\_DEFINITION} is the default stage and starts with a call to
{\htmlref{pioInit}{pioInit}}. During this stage, the {\CDI} resources have to
be defined. Trying to write data with {\CDI} {\tt streamWriteVar} will lead to an
error and abort the program. The stage is left by a call to
{\htmlref{pioEndDef}{pioEndDef}}. After leaving, any call to a {\CDI}
subprogram XXX{\tt def}YYY will lead to an error and abort the program.
\section{Writing in parallel: {\tt STAGE\_TIMELOOP}}
{\tt STAGE\_TIMELOOP} starts with a call to {\htmlref{pioEndDef}{pioEndDef}}.
Invocations of {\CDI} {\tt streamClose},\\
{\tt streamOpenWrite},
{\tt streamDefVlist} and {\tt streamWriteVar} affect the local {\CDI} resources
but not the local file system. The
calls are encoded and copied to an {\tt MPI} window buffer.
You can find a flowchart of one timestep in figure~\ref{timestep}.
{\tt streamClose}, {\tt streamOpenWrite} and {\tt streamDefVlist} have
to be called
\begin{itemize}
\item only for an already defined stream/vlist combination,
\item in the suggested order,
\item at most once for each stream during one timestep and
\item before any call to {\tt streamWriteVar} in that timestep.
\end{itemize}
% todo stream calls collective?
All four {\CDI} stream calls require that the model root process participates.
Disregard of these rules will lead to an error and abort the program. The
implication also holds for attempts to define, change or delete {\CDI} resources
during {\tt STAGE\_TIMELOOP}. Therefore it is necessary to switch stages before
cleaning up the resources. A call to
{\htmlref{pioEndTimestepping}{pioEndTimestepping}} closes
{\tt STAGE\_TIMELOOP}.
\section{Cleanup {\CDI} resources: {\tt STAGE\_CLEANUP}}
{\tt STAGE\_CLEANUP} is launched by invoking
{\htmlref{pioEndTimestepping}{pioEndTimestepping}}. In this stage, the {\CDI}
resources can be cleaned up. Trying to write data with {\CDI}
{\tt streamWriteVar} will now lead to an error and abort the program.
\section{The namespace object}
For some models the concept of stages is too narrow. In order to meet this
requirement we introduce namespaces. A namespace
\begin{itemize}
\item has an identifier,
\item is mapped to a {\CDI} resource array,
\item indicates whether the model processes write locally or remotely,
\item has an I/O stage and
\item is the active namespace or not.
\end{itemize}
A call to {\htmlref{pioInit}{pioInit}} initializes the namespace objects with two
of the arguments given by the model processes: the number of namespaces to be
used and an array
indicating whether they obtain local or remote I/O. Invoking
{\htmlref{pioNamespaceSetActive}{pioNamespaceSetActive}} switches the namespace
so that subsequent {\CDI} calls operate on the resource array mapped to the
chosen namespace. The namespaces are destroyed by
{\htmlref{pioFinalize}{pioFinalize}}. If the model uses the {\CDI} serially,
exactly one namespace supporting local writing is used as a matter of course.
To save overhead, it is preferable to work with one namespace.
\section{Initialize parallel I/O: {\tt pioInit}}
The function {\tt pioInit} initializes the parallel I/O with the {\CDI}; it launches
the {\htmlref{\tt STAGE\_DEFINITION}{STAGEDEFINITION}}. {\tt pioInit} defines
a control object for the {\tt MPI} communicators (see
figure~\ref{communicators}) and triggers their
initialization. After starting the I/O server, {\tt pioInit} receives a message
from each I/O process, containing information about its location on a physical node
and its function as a collector of data or a backend server. The first information
is stored in the control object, the latter is used to construct the communicators
for the data transfer. Furthermore, {\tt pioInit} defines and initializes a
control object for the {\htmlref{namespace}{namespace}}s. The call {\tt pioInit} is
collective for all
{\tt MPI} processes using the {\CDI}. If the
model employs the {\CDI} serially, a
call to {\tt pioInit} has no effect.
INTEGER FUNCTION pioInit ( INTEGER commGlob, INTEGER nProcsIO,
                           INTEGER IOMode, INTEGER nNamespaces,
                           INTEGER hasLocalFile ( nNamespaces ));
\begin{deflist}{\tt IN hasLocalFile\ }
\item[{\tt IN commGlob}]
{\tt MPI} communicator (handle).
\item[{\tt IN nProcsIO}]
The number of {\tt MPI} processes that shall be used for I/O.
\item[{\tt IN IOMode}]
The mode for the I/O. Valid I/O modes are {\htmlref{\tt PIO\_NONE}{PIONONE}},
{\htmlref{\tt PIO\_MPI}{PIOMPI}}, {\htmlref{\tt PIO\_WRITER}{PIOWRITER}},
{\htmlref{\tt PIO\_ASYNCH}{PIOASYNCH}} and
{\htmlref{\tt PIO\_FPGUARD}{PIOFPGUARD}}.
\item[{\tt IN nNamespaces}]
The number of used {\htmlref{namespace}{namespace}}s on the model side.
\item[{\tt IN hasLocalFile}]
A logical array with size {\tt nNamespaces} indicating whether the model
processes write locally or let the I/O server write.
Upon successful completion {\tt pioInit} returns a {\tt FORTRAN} handle to an
{\tt MPI} communicator including only the calculating model processes.
If an error occurs, {\tt pioInit} cleans up, finalizes {\tt MPI} and exits the
whole program.
The arguments of {\tt pioInit} are subject to some constraints.
\begin{deflist}{\tt nProcsIO\ }
\item[{\tt commGlob}]
has to be a valid handle to a {\tt MPI} communicator whose group
includes all processes that will work on the {\CDI} resources.
\item[{\tt nProcsIO}]
\item[]$==1$ per physical node, if {\tt IOMode} $==$
{\htmlref{\tt PIO\_NONE}{PIONONE}},
\item[]$<=$ {\tt sizeGlob} $/ 2$ \ otherwise, with {\tt sizeGlob} $=$ number
of processes in {\tt commGlob},
\item[]$>= 2$ per physical node, if {\tt IOMode} $\in \{$
{\htmlref{\tt PIO\_WRITER}{PIOWRITER}},
{\htmlref{\tt PIO\_ASYNCH}{PIOASYNCH}},
{\htmlref{\tt PIO\_FPGUARD}{PIOFPGUARD}}$\}$.
Here is an example using {\htmlref{\tt pioInit}{pioInit}} to start parallel I/O
in {\htmlref{\tt PIO\_NONE}{PIONONE}} mode.
\begin{lstlisting}[language=Fortran, backgroundcolor=\color{zebg},
basicstyle=\footnotesize, label=control]
INCLUDE 'mpif.h'
INTEGER commModel, error
CALL MPI_INIT ( error )
! Initialize asynchronous I/O with CDI
! Definition stage for CDI resources
commModel = pioInit ( MPI_COMM_WORLD, 1, PIO_NONE, 1, (/ 0 /))
CALL MODELRUN ( commModel )
! End cleanup stage for CDI resources
! Finalize asynchronous I/O with CDI
CALL pioFinalize ()
CALL MPI_FINALIZE ( error )
\end{lstlisting}
\section{Finalize parallel I/O: {\tt pioFinalize}}
The function {\tt pioFinalize} finalizes the parallel I/O. It cleans up the
{\htmlref{namespace}{namespace}}s and sends a message to the collector processes
to close down the I/O server. The buffers and windows which were needed for
{\tt MPI} RMA are deallocated. Finally {\tt pioFinalize} frees the {\tt MPI}
communicators (see figure~\ref{communicators})
and destroys the control object. The call {\tt pioFinalize} is collective for all
model processes having invoked {\htmlref{\tt pioInit}{pioInit}}. If the
model employs the {\CDI} serially, a
call to {\tt pioFinalize} has no effect.
SUBROUTINE pioFinalize ();