diff --git a/_quarto.yml b/_quarto.yml index dd0f9f57d281ede6b55cc660c51255816b6ce695..0ed7770ebba494c1c173a33c533a4229b720a7e1 100644 --- a/_quarto.yml +++ b/_quarto.yml @@ -34,7 +34,6 @@ website: - "lectures/data-structures/slides.qmd" - "lectures/complexity/slides.qmd" - "lectures/debugging-strategies/slides.qmd" - # - "lectures/good-scientific-practice/slides.qmd" - "lectures/user-experience/slides.qmd" - "lectures/testing/slides.qmd" - "lectures/git2/slides.qmd" @@ -42,6 +41,7 @@ website: - "lectures/hardware/slides.qmd" - "lectures/file-and-data-systems/slides.qmd" - "lectures/memory-hierarchies/slides.qmd" + - "lectures/good-practice/slides.qmd" # - "lectures/student-talks/slides.qmd" - section: "Exercises" contents: @@ -50,7 +50,6 @@ website: - "exercises/data_structures.qmd" - "exercises/complexity.qmd" - "exercises/debugging-strategies.qmd" - # - "exercises/good_scientific_practice.qmd" - "exercises/user-experience.qmd" - "exercises/testing.qmd" - "exercises/git2/exercise.qmd" diff --git a/lectures/good-practice/slides.qmd b/lectures/good-practice/slides.qmd new file mode 100644 index 0000000000000000000000000000000000000000..8e746474c884e7e073c7ce1a61cdb826624008fe --- /dev/null +++ b/lectures/good-practice/slides.qmd @@ -0,0 +1,453 @@ +--- +title: "Good scientific and coding practice" +author: "Bjorn Stevens and Theresa Mieslinger" +--- + +# Good Scientific Practice +*Building trust in research. And in your own work.* + +## What is it about? +Principles formulated by the research community that define ***proper research behaviour*** with the aim of ensuring high quality, robustness, and reproducibility of results (publications, data, code, software). + +:::notes +* research aims to advance knowledge +* research is carried out collaboratively by many actors and thus needs some rules. +::: + +## How does the topic relate to this lecture series? 
+* guidelines for building software +* using your own and others' software/data +* communicating the usage of software/data + +## The pillars of Good Scientific Practice {.smaller} + +::: {.incremental} +* **Reliability** in ensuring the quality of research, reflected in the design, methodology, analysis, and use of resources. +* **Honesty** in developing, undertaking, reviewing, reporting, and communicating research in a transparent, fair, full, and unbiased way. +* **Respect** for colleagues, research participants, research subjects, society, ecosystems, cultural heritage, and the environment. +* **Accountability** for the research from idea to publication, for its management and organisation, for training, supervision, and mentoring, and for its wider societal impacts. +::: +*copied from [European Code of Conduct for Research Integrity](https://allea.org/wp-content/uploads/2023/06/European-Code-of-Conduct-Revised-Edition-2023.pdf)* + +## Agenda for this lecture + +* **Reliability & Reproducibility** +* (Honesty) +* **Respect & Accountability** + +:::notes +* we'll cover the topics of primary data, authorship, and licenses +::: + +# Reliability & Reproducibility + +:::notes +It's hard to know that you are right, but we'd like to know as quickly as possible when we are wrong. +::: + +## What do we want to reproduce? {.special} + +::: {.fragment} +*the scientific argument* +::: + +::: {.notes} +draw a line plot on the board and ask whether we need to reproduce the underlying data bit-for-bit. +Example: you run ICON to model a future climate scenario for 2050 +::: + +## What do we need to save and how? {.special} + +## Data +:::: {.columns} +::: {.column width="50%"} +**Primary data** + +* observational, experimental data +* code base / software version +* configuration +* input data: initial / boundary conditions +::: + +::: {.column width="50%"} +**Derived data** + +* previously published data +* publicly available data, e.g. 
most satellite data +* model output +::: +:::: + +::: {.notes} +* What is needed to reproduce the argument? +* Primary data is typically published for the first time and cannot be re-generated / measured again. +* Derived data is easy to reproduce from accessible sources. +::: + +## Data Management +* ensure that access to data is as open as possible, as closed as necessary +* data, metadata, protocols, code, software, and other research materials are saved for a reasonable and clearly stated period (typically 10 years) + +## (Meta)data and [FAIR principles](https://www.go-fair.org/fair-principles/) +* **Findable**: unique identifiers, metadata registered in a searchable resource +* **Accessible**: (meta)data retrievable via a standardized communication protocol +* **Interoperable**: compatibility with other data through, e.g. [CF conventions](http://cfconventions.org/), common data formats `netCDF`, `zarr`, `csv`. +* **Reusable**: (meta)data description, attributes, data usage license + +## "FAIR is not fun, but fun is FAIR" + +:::: {.columns} +::: {.column width="50%"} +**Issues with FAIR data** + +* data availability not guaranteed +* access possible only with credentials +* DOIs don't point to data, but only to landing pages +::: + +::: {.column width="50%"} +**Beyond FAIR data** + +* openly accessible +* analysis-ready cloud-optimized data formats ([Abernathey et al., 2021](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9354557)) +::: +:::: + +::: {.notes} +* “Cloud-ready”: datasets can be opened directly from any other server, e.g. via OPeNDAP +* cloud-optimized (zarr, TileDB, …) +::: + +## What should we document? {.special} +:::fragment +*intent and usage* +::: + +:::notes +Next to source code and data, we need documentation to reproduce an argument. 
+::: + +## +### Documentation + +| self-explanatory code | [Commenting Showing Intent (CSI)](https://standards.mousepawmedia.com/en/stable/csi.html) | Docstrings & Manuals | +| -----------| ----------- | ----------- | +| Actual behaviour | Intent and design of code | software / API usage | +| developers / maintainers | developers / maintainers | end-developers and -users | +| **WHAT** does the code do? | **WHY** was the code written? | **HOW** to use it? | + + +:::{.notes} +* self-explanatory code: meaningful variable and function names, modular structure +* CSI: makes code language-agnostic! +* Documentation: includes docstrings +::: + +## +### Commenting Showing Intent (CSI) + +:::leftalign +Bad example stating **WHAT** + + ```cpp + // set box_width to equal the floor of items and 17 +int items_per_box = floor(items/17); +``` +:::fragment +Good example stating **WHY** + +```cpp +/* Divide our items among 17 boxes. + * We'll deal with the leftovers later. */ +int items_per_box = floor(items/17); +``` +::: +::: + +:::smaller +Examples from [MousePaw Media Standards](https://standards.mousepawmedia.com/en/stable/csi.html) +::: + +::: {.notes} +* don't state the obvious +::: + +## +### Docstrings and Manuals + +:::: {.columns} + +::: {.column width="50%" .smaller} + +```python +def sum(a, axis=None, dtype=None, out=None, keepdims=np._NoValue, + initial=np._NoValue, where=np._NoValue): + """ + Sum of array elements over a given axis. + + Parameters + ---------- + a : array_like + Elements to sum. + axis : None or int or tuple of ints, optional + Axis or axes along which a sum is performed. The default, + axis=None, will sum all of the elements of the input array. If + axis is negative it counts from the last to the first axis. + + .. versionadded:: 1.7.0 + + If axis is a tuple of ints, a sum is performed on all of the axes + specified in the tuple instead of a single axis or all the axes as + before. 
+ dtype : dtype, optional + The type of the returned array and of the accumulator in which the + elements are summed. The dtype of `a` is used by default unless `a` + has an integer dtype of less precision than the default platform + integer. In that case, if `a` is signed then the platform integer + is used while if `a` is unsigned then an unsigned integer of the + same precision as the platform integer is used. + out : ndarray, optional + Alternative output array in which to place the result. It must have + the same shape as the expected output, but the type of the output + values will be cast if necessary. + keepdims : bool, optional + If this is set to True, the axes which are reduced are left + in the result as dimensions with size one. With this option, + the result will broadcast correctly against the input array. + + If the default value is passed, then `keepdims` will not be + passed through to the `sum` method of sub-classes of + `ndarray`, however any non-default value will be. If the + sub-class' method does not implement `keepdims` any + exceptions will be raised. + initial : scalar, optional + Starting value for the sum. See `~numpy.ufunc.reduce` for details. + + .. versionadded:: 1.15.0 + + where : array_like of bool, optional + Elements to include in the sum. See `~numpy.ufunc.reduce` for details. + + .. versionadded:: 1.17.0 + + Returns + ------- + sum_along_axis : ndarray + An array with the same shape as `a`, with the specified + axis removed. If `a` is a 0-d array, or if `axis` is None, a scalar + is returned. If an output array is specified, a reference to + `out` is returned. +``` +::: + +::: {.column width="50%"} +-> docstring info is used in the [numpy.sum](https://numpy.org/doc/stable/reference/generated/numpy.sum.html) API + + +:::{.info .smaller} +See also Python docstring conventions [PEP-257](https://peps.python.org/pep-0257/). +::: +::: +:::: + +## Which tools shall we use? 
{.special} + +:::notes +We have data and documentation, so which software tools shall we use? +::: + +## Open Source / Open Development +*Open source is a decentralized software development model that encourages collaboration.* + +> [It] relies on **goal-oriented** yet loosely coordinated participants who **cooperate voluntarily** to create a product (or service) of economic value, which is made **freely available** to contributors and noncontributors alike. [Levine and Prietula, 2013](https://pubsonline.informs.org/doi/10.1287/orsc.2013.0872) + +## Trustworthy sources + +* open source products +* products that are used by a critical mass of users (*"democracy works"*) +* software products under active development and maintenance, e.g. according to Git commit history +* proper documentation + +::: {.notes} +* examples: NCL, pyicon +* you are responsible for the result! +* try to stay out of supply chain bugs +::: + +## Summary on Reliability & Reproducibility {.special} +* save the primary data needed to reproduce the argument of your scientific study +* go beyond FAIR data by creating user-friendly datasets and making them publicly available +* use trustworthy sources +* contribute to open source products whenever suitable + +# Respect & Accountability +*What shall we credit and how?* + +## Authorship + +:::fragment +> Authorship provides credit for a researcher's contributions to a study and carries accountability. [-- nature](https://www.nature.com/nature-portfolio/editorial-policies/authorship) + +> To protect the integrity of authorship, only persons who have significantly contributed to the research and paper preparation should be listed as authors. [-- ACP](https://publications.copernicus.org/for_authors/obligations_for_authors.html) +::: + +## Which contributions qualify for authorship? 
{.special} + +## Authorship +:::leftalign +**Substantial contributions to** + +* the conception or design of the work +* the acquisition, analysis, or interpretation of data for the work + +**Authorship implies responsibility and accountability for the published work.** +::: + +:::notes +* examples: PhD supervisor, technical assistance (setting up Python, redesigning data, data papers?), financial / administrative support, hierarchical / power positions +* approval of the published work +* Agreement to be accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved. +* includes being accountable for data / code! +::: + +## Integrity of Authorship +* Credit for knowledge contribution should be accurately and fairly attributed. +* adding gift or honorary authorship diminishes the actual contributions of others and is thus unfair. + +## Acknowledgements + +:::leftalign +Typically used to acknowledge + +* acquisition of funding +* data sources / providers +* software +* administrative support +* writing or coding assistance +::: +:::notes +* Institute contributions (for instance through IT or administrative services or even technical help) are to be acknowledged through the author’s affiliation. +* The provision of data, model output, or the use of a model that has been published in a previous study does not constitute a basis for authorship. +::: + +## Does ChatGPT qualify for authorship? {.special} + +:::fragment +*No, because it cannot be responsible for the accuracy, integrity, and originality of the work.* +::: + +:::notes +ChatGPT's answer: No. [...] AI systems like ChatGPT can provide immense help in information search, organization, and preliminary drafting, they don't possess the capability to understand, interpret, or conceptualize the results of a research study or to take responsibility for it. 
+::: + +## Using and declaring Artificial Intelligence (AI) +:::leftalign +The usage of AI needs to be declared in the respective section: + +:::smaller +* if AI was used for writing assistance -> acknowledgment section +* if AI was used for data collection, analysis, or figure generation -> methods section +::: +:::fragment +**Authors are still responsible for any published material, even if it includes AI-assisted technology** +::: +::: + +## Intellectual Property (IP) rights +Two main legal concepts relevant for software: + +* **Copyright** gives the creator exclusive rights to use, modify, and share the work. In Germany, copyright arises automatically when the work is created. It cannot be transferred; however, the author can grant others the right to use the work. +* **Copyleft** is a licensing method that encourages the unrestricted sharing, modification, and utilisation of creative works, provided that derivative works are shared under the same terms. + +:::notes +* Any creative work, be it information, literature, art, or software, could be replicated if not protected by intellectual property rights. +* copyright can be granted by public law and depends on the given country/state. In Germany, the German Copyright Act (Urheberrechtsgesetz) regulates copyright. +* because of copyright, we need to think of licenses that regulate usage, modification, and sharing of software +* other types of IP rights: patents, trademarks +::: + +## Types of Software Licenses +* **Public domain licenses** allow others to use, modify, and distribute software without any restrictions +* **Permissive licenses** allow others to use, modify, and distribute the software with minimal requirements (e.g. MIT, Apache 2.0, BSD) +* **Copyleft / "Share-alike" licenses** ensure that any derivative work of a software adopts the same licensing type, typically an open-source license (e.g. GPL, LGPL) +* **Proprietary / non-reuse licenses** restrict users from accessing, modifying, and redistributing the software. 
+ +:::info +[Further reading](https://osssoftware.org/blog/open-source-software-licenses-explained-a-beginners-overview/) +::: + +:::notes +* if you upload code to GitHub without stating a license, in principle, nobody is allowed to use it +* ICON has a BSD license +* GNU General Public License (GPL)-licensed code cannot be integrated with proprietary closed-source code. Mostly used for Linux software +* GPL, AGPL - require sharing of modified source code; LGPL allows linking with proprietary code +* changing licenses can be a huge challenge. +* Unlike proprietary software that places restrictions on usage and distribution, open source software guarantees end users the freedom to use, modify, and share the software. +* more info here: https://osssoftware.org/blog/open-source-software-licenses-explained-a-beginners-overview/ +::: + +## Creative Commons (CC) Licenses +*used e.g. for images, videos, documentation, presentations, and conference posters.* + +[CC licenses](https://creativecommons.org/share-your-work/cclicenses/) provide a way to grant permission to use creative work that is under copyright law. + +* **CC0 Public Domain Dedication**: enables creators to give up their copyright +* **CC BY**: allows others to use, modify, and distribute creative work, as long as attribution is given to the creator. +* Further possible restrictions: **SA** - Share-alike, **NC** - for noncommercial purposes only, **ND** - no derivatives allowed + +## Is AI-generated work protected by copyright? {.special} +:::fragment +*Unclear. Many countries do not put AI-generated work under copyright, but others do (e.g. China).* +::: + +## Summary on Authorship and Credit {.special} +* Authors have made substantial contributions to the published work and agree to be accountable for these. +* Respect the intellectual property of others and communicate a suitable license for your own work. 
+ +# Good Coding Practice + +## Good Coding Practice +* **clean code**: easy to understand for any reviewer, fewer lines of code -> fewer bugs +* **efficient code** + * use math if you can; otherwise, keep the order of complexity of your code in mind and check whether it behaves as you'd expect + * use parallel processing if you can pinpoint the performance bottleneck to a task that can be split + +## Good Coding Practice +* **understandable code**: self-explanatory code, documenting intent and usage +* **trustworthy code**: testing and code review +* **traceable code changes**: version control ensures a traceable record of code changes and serves as a backup. + +*stay up to date with coding trends and libraries* + +:::notes +* be open and continue learning: new technologies typically improve your productivity :) +::: + +# Summary {.special} +*Good Scientific Practice ensures research integrity and the advancement of knowledge.* + +:::leftalign +**Reliability & Reproducibility** + +* save primary data (data, code, ...) that is necessary to reproduce the scientific argument. + +**Respect & Accountability** + +* significant contributions justify authorship; use acknowledgements for further contributions +* respect intellectual property and assign a license to your own work +::: + +:::notes +* Use trustworthy sources +* understand your code +* communicate the license +* give credit to contributors +* respect intellectual property (IP) +::: + +# Disclaimer +*This lecture was designed with the help of the Large Language Model OpenAI GPT-4.* + +# Further Reading +* [European Code of Conduct for Research Integrity](https://allea.org/wp-content/uploads/2023/06/European-Code-of-Conduct-Revised-Edition-2023.pdf) +* [DFG Guidelines for Safeguarding Good Research Practice. 
Code of Conduct](https://zenodo.org/records/6472827) diff --git a/lectures/good-practice/static/numpy_api_example.png b/lectures/good-practice/static/numpy_api_example.png new file mode 100644 index 0000000000000000000000000000000000000000..cf4767be0c663223e8e258e2b354a074261c48fa Binary files /dev/null and b/lectures/good-practice/static/numpy_api_example.png differ