Open Science - Software and Data¶
Elements of an Open-Source Software (OSS) Project¶
We will give a brief overview of best practices for open-source software. These ideas are relevant for posting your own code, or for participating in an existing project.
We interact with OSS all the time!¶
Open-source software is everywhere! Many of the apps or programming languages you use are open-source software projects, meaning they are free to use and the code is available to the public. These projects are maintained by a community of developers who follow a specific workflow to update and upgrade the code for public use.
Some examples of OSS projects that you may use are: Linux, Python, GIMP, VLC, WordPress, Firefox, Blender, Thunderbird, GNU, Inkscape, MySQL, Android, TensorFlow, OpenToonz, FileZilla, PHP, and R.
Some open-source libraries you might use are: C++ stdlib, deal.II, LAPACK, XGBoost, Beautiful Soup, Jinja, NumPy, pandas, PyTorch, SciPy, Bootstrap, React, Ruby on Rails, BLAS, OpenMP, and PETSc.
Key Elements of an OSS Project¶
Documentation
Tests
Standardization (including formatting)
Interaction (GitHub discussions and issues, Discourse)
Documentation¶
Documentation is the single most important part of a public repository. There are many existing tools that can help you document your project in a familiar, easy-to-read way. Some popular documentation tools that serve slightly different purposes are Doxygen, Sphinx, and Jupyter Book.
The purpose of documentation is to describe the:
goal
function
mechanism
pathway
Good documentation encourages usage and participation in your project!
There are several levels of documentation, going from high level to low level:
Project documentation: README.md
File documentation: File headers, Doxygen
Function documentation: Function description, Sphinx
Code documentation: self-describing code, comments
Usage documentation: example files, Sphinx or Jupyter Book
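To make the file, function, and code levels concrete, here is a minimal sketch of a single Python file that carries all three. The module contents and the name `celsius_to_fahrenheit` are hypothetical and chosen only for illustration. The docstring uses the NumPy docstring layout, which is one common choice rather than the only one.

```python
"""Unit-conversion helpers (file-level documentation).

This module docstring plays the role of a file header: it says what the
file is for. Everything in this file is a hypothetical example.
"""

# Named constant instead of a bare number: part of self-describing code.
FAHRENHEIT_OFFSET = 32.0


def celsius_to_fahrenheit(temp_c: float) -> float:
    """Convert a temperature from Celsius to Fahrenheit.

    Function-level documentation: docstrings like this one can be
    collected by documentation tools to build an API reference.

    Parameters
    ----------
    temp_c : float
        Temperature in degrees Celsius.

    Returns
    -------
    float
        Temperature in degrees Fahrenheit.

    Examples
    --------
    >>> celsius_to_fahrenheit(100.0)
    212.0
    """
    # Code-level documentation: a comment explains the "why".
    # One Celsius degree spans 9/5 of a Fahrenheit degree.
    return temp_c * 9.0 / 5.0 + FAHRENHEIT_OFFSET
```

The project level (README.md) and the usage level (example files or notebooks) live outside the source file, so they are not shown here.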
Here are some great examples of how to document an open-source project directly on GitHub or with a documentation “engine”:
Tests¶
Tests show that your code works! Tests can be automated with GitHub Actions, which can run them every time you push new code to the repository. In general, tests should verify two things:
correctness (check output types, values, and sizes)
error handling (check what happens when the code is used improperly)
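As a minimal sketch of both kinds of check, the example below tests a small hypothetical function, `mean`, using the `unittest` module from the Python standard library. The function and the test names are assumptions made for illustration. The same pattern carries over to other testing frameworks.

```python
import unittest


def mean(values):
    """Return the arithmetic mean of a non-empty sequence of numbers."""
    if len(values) == 0:
        raise ValueError("mean() requires at least one value")
    return sum(values) / len(values)


class TestMean(unittest.TestCase):
    def test_correct_value_and_type(self):
        # Correctness: check both the output value and its type.
        result = mean([1, 2, 3, 4])
        self.assertIsInstance(result, float)
        self.assertAlmostEqual(result, 2.5)

    def test_empty_input_raises(self):
        # Error handling: improper use should fail loudly and predictably.
        with self.assertRaises(ValueError):
            mean([])
```

If this were saved as a file such as `test_mean.py` (a hypothetical name), running `python -m unittest` in the same directory would discover and run both tests.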
Popular testing frameworks for Python code:
Standardization¶
Standardization helps keep your code readable, usable, and maintainable. Because there are many types and styles of programming, standardizing your workflow and your formatting is essential for collaboration.
Pick a standard formatting style for code:
Linters are tools that check whether your files follow a particular style and flag syntax errors, bad practices, and style violations. Formatters go a step further and automatically adjust spaces, indents, and line breaks. Just as you can run tests automatically on GitHub, you can also run a linter or formatter automatically whenever code is pushed. There are also narrower tools that check only one aspect, such as types or formatting. Popular linting and formatting tools are listed below; check their documentation to see whether they would help your project:
pylint
flake8
mypy
Black
autopep8
isort
pydocstyle
bandit
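To give a feel for what such tools do internally, here is a toy checker written with the standard-library `ast` module. It is only a sketch of the general idea and is not how any of the tools listed above is implemented. The function name `find_style_problems` and the 79-character limit are assumptions made for this example.

```python
import ast

MAX_LINE_LENGTH = 79  # assumed limit for this sketch; projects pick their own


def find_style_problems(source: str) -> list[str]:
    """Return human-readable style problems found in Python source code."""
    problems = []

    # Formatting check: flag lines longer than the agreed limit.
    for line_number, line in enumerate(source.splitlines(), start=1):
        if len(line) > MAX_LINE_LENGTH:
            problems.append(f"line {line_number}: line too long")

    # Documentation check: flag functions that have no docstring.
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            if ast.get_docstring(node) is None:
                problems.append(
                    f"line {node.lineno}: function '{node.name}' has no docstring"
                )

    return problems


example_source = '''
def documented():
    """Say what the function does."""
    return 1

def undocumented():
    return 2
'''

for problem in find_style_problems(example_source):
    print(problem)
# prints: line 6: function 'undocumented' has no docstring
```

Real linters apply hundreds of such rules and are configurable, so in practice it is better to adopt an existing tool than to write your own.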
Pick a standard for adding to the code (usually called a “developer” or “contributing” guide). Examples below:
Thinking About Data Storage for Open Science¶
There are many ways to store data. If you can store your data so that data acquisition integrates with tools like Xarray and Dask, your platform will be all the more powerful and accessible. There is no single right way to do this, and it is important to consider where your users will be doing their analysis, how much data they will need at a time, and how they will query the main database to find what they want.
If possible, use a cloud-optimized format. It makes accessing and using the data more efficient, in some cases by orders of magnitude. Cloud-optimized formats for geospatial data include Cloud-Optimized GeoTIFF (COG), GeoParquet, Zarr, FlatGeobuf, Cloud-Optimized Point Clouds (COPC), and more.
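One reason these formats can be so much more efficient is that they organize data into independently readable chunks, so a reader can fetch only the pieces that overlap the region of interest instead of downloading a whole file. The sketch below is a toy model of that idea in pure Python. The `ChunkedStore` class and its methods are hypothetical and do not reflect the actual on-disk layout or API of Zarr, COG, or any other real format.

```python
class ChunkedStore:
    """Toy 1-D array stored as fixed-size chunks (hypothetical example)."""

    def __init__(self, values, chunk_size):
        self.chunk_size = chunk_size
        self.length = len(values)
        # Each chunk stands in for one separately fetchable object
        # or byte range in cloud storage.
        self._chunks = {
            index: list(values[start:start + chunk_size])
            for index, start in enumerate(range(0, len(values), chunk_size))
        }
        self.chunks_fetched = 0  # counts simulated remote requests

    def _fetch_chunk(self, index):
        self.chunks_fetched += 1
        return self._chunks[index]

    def read(self, start, stop):
        """Return values[start:stop], fetching only overlapping chunks."""
        start = max(0, start)
        stop = min(stop, self.length)
        if start >= stop:
            return []
        first = start // self.chunk_size
        last = (stop - 1) // self.chunk_size
        result = []
        for index in range(first, last + 1):
            chunk = self._fetch_chunk(index)
            chunk_start = index * self.chunk_size
            lo = max(start - chunk_start, 0)
            hi = min(stop - chunk_start, len(chunk))
            result.extend(chunk[lo:hi])
        return result


store = ChunkedStore(list(range(1000)), chunk_size=100)
subset = store.read(250, 260)
print(subset)                # the ten requested values, 250 through 259
print(store.chunks_fetched)  # 1: only one of the ten chunks was needed
```

Choosing the chunk size is a trade-off: chunks that are too large force readers to fetch data they do not need, while chunks that are too small multiply the number of requests.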