{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Open Science - Software and Data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Elements of an Open-Source Software (OSS) Project\n", "\n", "We will give a brief overview of best practices for open-source software. These ideas are relevant both for posting your own code and for participating in an existing project." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### **We interact with OSS _all_ the time!**\n", "\n", "Open-source software is everywhere! Many of the apps or programming languages you use are open-source software projects, meaning they are free to use and the code is available to the public. These projects are maintained by a community of developers who follow a specific workflow to update and upgrade the code for public use.\n", "\n", "Some examples of OSS projects that you may use are: Linux, Python, GIMP, VLC, WordPress, Firefox, Blender, Thunderbird, GNU, Inkscape, MySQL, Android, TensorFlow, OpenToonz, FileZilla, PHP, and R.\n", "\n", "Some open-source libraries you might use are: the C++ standard library, deal.II, LAPACK, XGBoost, Beautiful Soup, Jinja, NumPy, pandas, PyTorch, SciPy, Bootstrap, React, Ruby on Rails, BLAS, OpenMP, and PETSc.\n", "\n", "### **Key Elements of an OSS Project**\n", "\n", "- Documentation\n", "- Tests\n", "- Standardization (including formatting)\n", "- Interaction (GitHub discussions and issues, Discourse)\n", "\n", "### **Documentation**\n", "\n", "Documentation is the single most important part of a public repository. There are many existing tools that can help you document your project in a familiar, easy-to-read way. 
Some popular documentation tools that serve slightly different purposes are Doxygen, Sphinx, and Jupyter Book.\n", "\n", "The purpose of documentation is to describe the project's:\n", "- goal\n", "- function\n", "- mechanism\n", "- pathway\n", "\n", "_Good documentation encourages usage and participation in your project!_\n", "\n", "There are several levels of documentation, going from high level to low level:\n", "1) Project documentation: README.md\n", "2) File documentation: file headers, Doxygen\n", "3) Function documentation: function descriptions, Sphinx\n", "4) Code documentation: self-describing code, comments\n", "5) Usage documentation: example files, Sphinx or Jupyter Book\n", "\n", "Here are some great examples of how to document an open-source project directly on GitHub or with a documentation \"engine\":\n", "- [TensorFlow](https://github.com/tensorflow/tensorflow)\n", "- [MOLE](https://github.com/csrc-sdsu/mole)\n", "- [PETSc](https://gitlab.com/petsc/petsc)\n", "- [matplotlib.pyplot](https://matplotlib.org/stable/tutorials/pyplot.html)\n", "\n", "### **Tests**\n", "\n", "Tests show that your code works! Tests can be automated with GitHub Actions, which can run them every time you push new code to the repository. In general, tests should verify two things:\n", "- correctness (check output types, values, and sizes)\n", "- error handling (check what happens when the code is used improperly)\n", "\n", "Popular testing frameworks for Python code:\n", "- [pytest](https://docs.pytest.org/en/stable/)\n", "- [unittest](https://docs.python.org/3/library/unittest.html)\n", "\n", "_Golden Rule: if there is a function, there is a test!_  \n", "_Silver Rule: do your best_\n", "\n", "### **Standardization**\n", "\n", "Standardization will help to keep your code readable, usable, and maintainable. 
Because there are many types and styles of programming, standardization of your workflow and your formatting is absolutely necessary for collaboration.\n", "\n", "Pick a standard formatting style for code:\n", "- [PEP 8 - the official style guide for Python](https://peps.python.org/pep-0008/)\n", "- [Google style guide for Python](https://google.github.io/styleguide/)\n", "- [Kernighan and Ritchie style for C](https://www.cas.mcmaster.ca/~carette/SE3M04/2004/slides/CCodingStyle.html)\n", "- [PEP 257 - Docstring conventions](https://peps.python.org/pep-0257/)\n", "\n", "*Linters* are tools that check whether your files meet a particular style and report syntax errors, bad practices, and style violations. *Formatters* go a step further and adjust the spaces, indents, and line breaks when the style is not met. Similar to the way you can automatically run tests on GitHub, you can also automatically run a linter or formatter to check or clean up the code when it gets pushed. There are similar, less comprehensive tools that check only one aspect, such as types, import order, or docstrings. Popular linting and formatting tools are listed below; check out their documentation to see if they might be helpful for your project:\n", "- pylint\n", "- flake8\n", "- mypy\n", "- Black\n", "- autopep8\n", "- isort\n", "- pydocstyle\n", "- bandit\n", "\n", "Pick a standard for adding to the code (usually called a \"developer\" or \"contributing\" guide). Some examples:\n", "- [MOLE/contributing.md](https://github.com/csrc-sdsu/mole/blob/main/CONTRIBUTING.md)\n", "- [matplotlib/contributing](https://matplotlib.org/devdocs/devel/contribute.html)\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Thinking About Data Storage for Open Science" ] }, { "cell_type": "markdown", "metadata": { "vscode": { "languageId": "plaintext" } }, "source": [ "There are many possible ways to store data. If you are able to store your data such that data acquisition integrates with tools like Xarray and Dask, then\n", "your platform will be all the more powerful and accessible. 
There is no single way to do this, and it is important to consider where your users will be doing their analysis,\n", "how much data they will need at a time, and how they will query the main database to find what they want.\n", "\n", "If possible, using a cloud-optimized format is ideal. It will make access to and use of the data more efficient, in some cases by orders of magnitude. Cloud-optimized formats for geospatial data include Cloud-Optimized GeoTIFF (COG), GeoParquet, [Zarr](https://zarr.dev/), FlatGeobuf, Cloud-Optimized Point Clouds (COPC), and more." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[What is Zarr?](https://earthmover.io/blog/what-is-zarr)\n", "\n", "[Zarr and Xarray](https://tutorial.xarray.dev/intermediate/intro-to-zarr.html) \n", "\n", "[Example project: CryoCloud](https://book.cryointhecloud.com/) \n", "\n", "[International Interactive Computing Collaboration](https://2i2c.org/)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "name": "python", "version": "3.12.6" } }, "nbformat": 4, "nbformat_minor": 2 }