Workflows for Earth Science¶

Workflows for Earth and Natural sciences

Collaborating with Git/GitHub (workflows and best practices)
Open Source Software best practices
Using your Python workflow on a remote server
Popular AI Agents for software and their use cases

Collaborating with Git¶

First we will talk about the technical aspects of collaboration with Git including common workflows, updating your branch when the main branch has changed, resolving conflicts and making pull requests. Then we will move to more conceptual aspects of collaboration with Git such as managing your work in ways that facilitate fewer conflicts, opening issues, and reviewing other people’s code. These tools and workflows would apply equally to private or public repos. Finally we will move on to a more general section about Open Source Software.

Team Workflows¶

The most important thing about team workflows is to have one! All collaborators should be aware of the expected workflow. You can even use Rules in your repositories to enforce the desired workflow (discussed below).

The most common and straightforward team flow is the Centralized Workflow (also referred to as the Trunk-Based workflow):

Everyone works in one repository. This repository can be owned by a teammember or by an organization.
There will be one primary, stable version of the code that exists on the main branch (the main branch is sometimes called the trunk).
In order to make changes to the code, collaborators will:
- Create a new branch off of main
- Develop and test their changes on the branch
- Open a Pull Request to merge their changes into main
Feature branches exist for the time it takes to complete the task, and eventually are merged back into main. This is in contrast to other (less popular) workflows where multiple primary branches exist and features that are not in main may be maintained indefinitely.
In this workflow you do not develop directly to the main branch. This keeps the primary branch stable and usable even while new code is under development and testing. It is often possible to merge your feature branch directly into main, without a Pull Request, but this is generally frowned upon in a collaborative setting.

Merging Updates into Your Branch¶

Reminder of how to create a new branch off of main:

[ ]:

git checkout main  # checkout the main branch
git pull  # make sure you are up to date with the latest
git checkout -b feature-branch-1 # create a new branch for your feature, use a descriptive name

After you add some code, you will want to commit it and push your new branch to the remote:

[ ]:

git add <relevant-changes>
git commit -m "<commit-message>"

git push -u origin feature-branch-1  # push the new branch and the changes to the remote repository

You keep working on your branch, let’s say for a week or so, and in the meantime one of your collaborators completes their work and merges it into main. How can you check if there are updates to main?

[ ]:

git fetch  # this will download information about the remote branches, without downloading the new code
git checkout main  # switch back to the main branch
git status # check the status of your local repository, this will tell you if you are behind the remote (i.e., if you need to do a git pull)

Let’s say there are new updates on the main branch. How can you update your feature branch so that your work is applied to the latest version of the main branch? The git merge command will pull changes from another branch into your yours.

[ ]:

git checkout main # if you aren't on main, you should go there
git pull  # download the latest updates
git checkout feature-branch-1  # switch back to your feature branch
git merge main  # merge the latest changes from main into your feature branch

Resolving Conflicts¶

Merging can create conflicts when the same line of code has been changed on two different branches. The output of git merge will notify you of the conflict (and the file it is in) and will not complete the merge until you reconcile the issue. You have 3 options:

Accept the incoming changes (in the example above this would be to take the code from the main branch)
Accept the local or current changes (in the example above this would be to take the code from the feature branch)
Accept a combination of both (in this case you will edit the file directly to tell Git what the reconciled version should look like)

Let’s say we want to overwrite whatever came in on main with our feature branch code. One way to do this is to use git checkout.

[ ]:

git checkout --ours <file name>  # resolve conflicts by keeping your version of the file

Or maybe we didn’t mean to touch that file, and we want to accept whatever is on the main branch:

[ ]:

git checkout --theirs <file name>  # resolve conflicts by keeping the version from the main branch

If you aren’t sure and need to inspect the code, you can do so in any text editor. VSCode has a conflict resolution tool that comes with the Git extension, this will allow you to see the two versions of the file side-by-side and make your decision. If you simply open the file, you will see the conflicting lines of code denoted by >>>>. For example:

<<<<<<< HEAD
fig, ax = plt.subplots(figsize=(10,5))
=======
fig, axes = plt.subplots(figsize(15, 8), nrows=2)
>>>>>>> main

HEAD always refers to the code you have checked out on your local computer. So in this case the incoming change is from main and the current change is from HEAD (our feature branch). You will add/delete whatever lines you would like in order to reconcile the differences in the code. Then you will save the file.

When the conflict resolution is complete you will commit the changes. If you are in the middle of a merge, then you do not need to type -m to add another message. The auto-generated commit message will already say that you are merging main into feature_branch_1:

[ ]:

git commit # finish the merge by committing the changes.
git push # push the resolved changes to the remote repository

Git Exercise 1: Adding collaborators to a repository and creating feature branches¶

Get into groups of 3 or 4
One person should make a new repository, change the default branch name to main, and add a README (we did this on Wednesday).
Add collaborators (only the owner needs to do this, but all group members can figure this out together)
- Go into the repository settings and then to the Collaborators tab.
- Add the other members of your group to the repository using their GitHub user name.
Collaborators should receive invites to join the repo (you will get an email, but you can also see them from your GitHub account).
Each collaborator should clone the repo, make their own feature branch.
Add something to your feature branch (maybe a code file or a new text file). The new file shouldn’t be empty.
Push your feature branch to the remote (follow the steps above).

Opening Pull Requests¶

The appropriate way to get code from your feature branch into main is through a Pull Request. This gives a clear visualization of the changes and provides an opportunity for someone else on the team to give feedback. The easiest way to open a Pull Request is on GitHub.

Insert steps to open PR here

Git Exercise 2: Opening Pull Requests and Assigning Reviewers¶

Each collaborator should open a Pull Request for their feature branch
Explore the PR’s! What kind of information do they show? What did your teammates add?
Let’s look at the PR that is listed first. Whoever opened this PR should add a reviewer (on the right). Pick anyone on your team.

Reviewing Code¶

In general, any code that goes into main should be reviewed by at least one person. Every team should decide how many reviewers a PR needs. Typically, the more complex the changes, the more reviewers you should have. You can assign reviewers to the PR and they will get notified of the request.

Reviewers should look at the changed files. As you scroll through you have the option to add comments. When you add a comment you can either select ‘Add single comment’ or ‘Start a review’. The difference is whether you want to make a single note or if you would like to collect your comments along with an overall summary in a Review. If you have more than one thing to say, the second option is preferred.

Reviews are a way to have a dialog about the changes to the code, prior to merging them into main. Once the comments are all resolved or addressed, the PR can be merged. Convention says that the person who reviewed the code pushes the merge button (i.e. you don’t merge your own code).

NOTE Squashing is almost always the preferred way to merge into main. This keeps the history of the main branch clean and easy to read. Feature branches may end up with hundreds of commits.

NOTE Unless you plan to come continue development on the feature that is being merged into main, you should delete the branch after the PR is closed.

Git Exercise 3: Reviewing Code, Merging into Main, Updating Feature Branches, Resolving Conflicts¶

Let’s go through the whole process. With your team:

The assigned reviewer should review and merge the first PR
Look at one of the other PRs… can it be merged?
Update your feature branches and repeat the PR -> merge process until all PR’s are closed
At least two people should open new feature branches … and this time they should make changes to the same file!
Merge someones branch into main, try to update the other branch, work to resolve the conflicts.
- We suggest looking at both the raw text file (see the way git denotes the conflicting lines) and trying to use the VSCode conflict resolution tool
- If you have the Git extension installed and open the conflicting file in VSCode, it will automatically suggest the conflict resolution tool.

Facilitating Easier Merges (Fewer Conflicts)¶

Lots of small commits, using an agreed upon format specification (or a linter), think about the feature development list and prioritize

Opening Issues¶

Forking a Repo¶

Forks are most useful when you want to adapt an existing project that you are not a part of for your own use.

Example use case: You study Equatorial dynamics, and there is a software project on GitHub where some scientists have already implemented the Matsuno Shallow Water equations in Python. The problem is their assumed domain is too small for your research question. Instead of rewriting the model from scratch (waste of time) or downloading their code and then uploading it as your own new repo (plagiarism), you can create a fork!

Forks allow you to make a new respository based on an existing one. The forked version is still linked to the original, meaning you can pull updates from the original at a later date (handy!). This gives credit to the original authors, allows you to benefit from updated code, and prevents you from reinventing the wheel. It also allows you to save and push your changes to GitHub without disturbing the original repo (phew!).

It is also possible to push new code in a forked repository up to the original repo, although we won’t cover that here. This might be desired if you made changes that the original owners want to incorporate into the primary version of the project. In general, branching is preferred over forking for git collaboration.

The easiest way to fork a repository is through GitHub.

Mini exercise¶

Start by navigating to a repo that you like! Pick anything. It could be a Python package that you use or a project that your friend has started.

Collaborative Git Best Practices¶

Commit often with descriptive commit messages (don’t say things like “bugfix”, “changes”, “commit”)
Update your branch on main often
Use branches to complete new features
Use Pull Requests to merge features into main
Squash merges are always preferred
Use Git issues to track your work (past, current, and future)
Add tests where possible to verify the status of your main branch
Agree upon style and formatting specifications
Use tags to denote stable points between major feature developments
Add Git Rules to your repo to enforce the desired workflow (e.g., protect branches, always squash)
Favor branching over forking for development.
If you want to make changes to a repository on which you aren’t a collaborator, fork it first so that you can save your changes.

Open Source Software Best Practices¶

Thinking About Data Storage for Open Science¶

Cloud-optimized formats for Geospatial data include: Cloud-Optimized GeoTIFF (COG), GeoParquet, Zarr, FlatGeobuf, Cloud-Optimized Point Clouds (COPC), and more

There are many possible ways to do data storage. If you are able to store your data so that data acquisition integrates with tools like Xarray and Dask, then your platform will be all the more powerful and accessible. There is no one way to do this, and it is important to consider where your users will be doing their analysis, how much data they will need at a time, and how they will query the main database to find what they want.

If possible, using a cloud-optimized format is ideal. It will make access to and use of the data more efficient, in some cases by orders of magnitude.

Zarr

What is Zarr?

Zarr and Xarray

Example project: CryoCloud

International Interactive Computing Collaboration

Migrating Your Workflow to a Remote Server¶

We will discuss to popular ways to use Python and Conda on a server:

VSCode Remote-SSH
Jupyter Notebooks with SSH Tunneling
Python scripts run in a detached terminal with screen or tmux

The VSCode Extension Remote-SSH allows you to SSH into a server through the code-editor. You will then be able to navigate your files on the server, load conda environments, and save files with the exact same interface you use on your laptop.

Install the Remote-SSH extension in VSCode
Open the control panel (cmd + shift + p)
Type Remote-SSH and select Remote-SSH: Connect to Host...
Select + Add New SSH Host
Enter the credentials for your server
Enter your password

This will log you into your server and will open a new VSCode window (which will be in the filesystem on the server). The first time you do this, you may have to wait a few minutes after logging in while VSCode installs the necessary tools on the server side. After you have done this the first time, you will be able to select the host name from the list of known hosts (instead of adding a new one) and input your password. You will no longer need to type out ssh username@ip_address.

Log into your server (ssh username@your-server-address)
Run jupyter-lab --no-browser in the folder you want to work from
jupyter lab will start, and the output in the terminal will include a line that looks like Or copy and paste one of these URLs: http://localhost:8888/lab?token=some-long-token
1. The port number (in this case 8888) is the port on the server that the jupyter lab information is piped to
2. We need to link a port on the laptop to the port on the server and then we can open the webpage like we are used to
open a new terminal and run ssh -L 8888:localhost:8888 username@your-server-address
1. This links port localhost port 8888 on our laptop to localhost port 8888 on the server. These numbers don’t have to be the same.
2. If port 8888 on our laptop is taken, we can try another number. ssh -L 8890:localhost:8888 username@your-server-address. The server port needs to match the number in the jupyter lab output.
3. Enter your password
Copy and paste the URL from the jupyter lab output into your web browser (http://localhost:8888/lab?token=some-long-token)

The tmux command allows you to open terminals in which you can start a program that won’t die even when the terminal is closed. This is useful for running on a server because a normal terminal will kill whatever process is running if you close the window, close your computer, or lose your internet connection. That is not a practical way to work if we need to run programs that take a long time.

screen is an alternative to tmux, although it is somewhat outdated.

This workflow would look something like:

Develop your Python script on a small set of data. This could be done in a Jupyter notebook or python script on the server or on your laptop.
Organize the analysis you need to do into a script or set of scripts.
Open a terminal with the appropriate tmux or screen command (see instructions below).
Activate your conda environment.
Run your Python program. Alternatively you could have a bash script that runs a whole list of Python programs.

At this point you can close the terminal and your programs will continue to run on the server.

Git repositories provide an easy way to move work between locations (and to make sure you don’t lose anything because you are switching between platforms). To use Git on a server:

Log into the server
Generate a new SSH key pair (the key pair should end up in your home directory on the server at ~/.ssh/)
Add the public key to your GitHub profile through Settings -> SSH and GPG Keys -> Add SSH key

Once you have done the above, you can use your Git account to track the work you do on the server. Your research project repo can be downloaded on both the server and on your local computer. When you make changes in one place, you can add, commit, and push them. When you return to the other location you can do a git pull to see the updated work.

An example workflow:

Starting a project from scratch: Create a repo on your local computer or in your GitHub account. Clone that repo to the server. Do your work in the repo. Before you log out, add and commit the changes. Return to your local computer and git pull to see the updated project without being logged into the computer. This is a great way to keep figures up to date. You will generate new figures, commit and push both the figures and the code that generated them, and then pull them to your local computer. This negates the need to use scp or rsync to continually move around copies of your figures and prevents confusion about which code generated which version of the figures.
Tracking an existing project: Log onto your server. Navigate into a directory where you have code for your project. Use git init to make the project into a repository (make sure to have a .gitignore file!). Push the repo to your GitHub. Proceed with your work on the server. Add and commit (and push) any changes before you log out. Return to your local computer, use the link on GitHub to clone your repository. Now you can see the work you were doing on the server!

Bash Scripting¶

We should insert a couple quick examples (what to put in a bash script). But we don’t need to get fancy. How to execuate any one specific task is a good ChatGPT question.

AI Agents and Tools for Coding¶

The world of AI tools for coding is evolving rapidly… and everything we talk about in this lesson will likely be out of date soon. This discussion is intended to provide a bit of guidance that will likely remain relevant, even as the agents themselves evolve. We will not discuss an exhaustive list of agents, and instead we will touch on a few that are popular right now (August 2025). Our goal is to shed a bit of light on how AI could augment your workflow, how they are being used to generate new code that you may interact with, and how they can also waste your time or lead you down rabbit holes.

The AI tools we will discuss today are:

ChatGPT (AI chat bot)
GitHub Co-Pilot (AI assistant in your IDE)
Cursor (AI-native IDE)
Claude Code (AI command line tool)

The differences between these tools include the size and scope of their context window in addition to their level of autonomy, interactivity, and control. The more agentic a tool is, the more it is able to reason, debug, execute code, and generate new content on its own.

The above tools are primariliy based on these Large-language models (LLMs), all of which are proprietary:

Anthropic’s Claude models (Claude Opus, Claude Sonnet, and Claude Haiku)
OpenAI’s GPT models (GPT-3.5, GPT-4, GPT4-Turbo)

All of these tools also have tiered access with subscription plans, and in general their usage is capped, even for subscribers.

GitHub public repositories
Stack Overflow
Technical documentation
Programming textbooks and tutorials
Natural language + code mix (e.g., docstrings, commit messages)
Some (like Claude and GPT-4) are multi-modal and can reason about code in text, files, and diagrams.

The links below are intended to provoke discussion, and provide a bit of caution. They do not reflect the opinions of anyone in the software workshop. It is worth thinking about when and where to use AI in your workflow, such that you make your work better not worse.

Does AI Understand Bugs?

Are developers getting less efficient?

Are AI bugs worse than normal bugs?

These models are just statistical tools. They cannot genuinely reason, and they hallucinate all of the time (i.e., they generate false information without indicating that their response is not reliable). In some cases the tool will even double down on its hallucination unless prompted in the right way.

For example, a few weeks ago I asked ChatGPT to generate an Apple Keynote file for me, not thinking about the fact that Keynote is a proprietary format (therefore it cannot do this). It generated several garbage files before I asked whether it was actually capable of completing the task. It responded by telling me that it definitely could and would do it by “handling it outside of ChatGPT”. The rest of our conversation is below:

990d42272a1546ec9e10d697a4da91d0