Tabular Data with Pandas

Pandas is the most popular Python tool for data analysis. It handles tabular data and has many convenient features for manipulating text, grouping data by attributes, formatting dates, and more. Even if you do not work with tabular or CSV data, you will likely end up using Pandas for some of its data-wrangling features. In general, if you have data that could be organized in an Excel spreadsheet or a table, you should use Pandas. This is particularly useful for data downloaded from places like NOAA and USGS, where observations from individual measurement stations are usually concatenated into a table of some kind. Pandas can open data in CSV, HTML, JSON, XLS, Parquet, SQL, GBQ, and other formats.
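All of these readers share the same basic interface: each returns a DataFrame. As a quick illustration, here is a minimal sketch using pd.read_json on a small, made-up in-memory JSON string (hypothetical data, not the earthquake file used below):

```python
import io

import pandas as pd

# A small, hypothetical in-memory JSON string (not the earthquake file):
json_text = '[{"mag": 2.5, "place": "Alaska"}, {"mag": 4.1, "place": "Chile"}]'

# pd.read_json shares the same DataFrame-returning interface as pd.read_csv
df_json = pd.read_json(io.StringIO(json_text))
print(df_json.shape)  # (2, 2)
```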

Earthquake Data Tutorial

For this tutorial we have grabbed the last 30 days of global earthquake activity from USGS (downloaded August 11th, 2025). The data only includes earthquakes of magnitude greater than or equal to 2.5.

[1]:
import pandas as pd #import the pandas library

# open our CSV file; this assumes it is in the top level of your VSCode workspace
df = pd.read_csv('global_30day_earthquakes.csv')

When we open a dataset, Pandas reads the data into an object called a ‘DataFrame’ which looks like a table.

[2]:
# inspect our dataframe
df
[2]:
time latitude longitude depth mag magType nst gap dmin rms ... updated place type horizontalError depthError magError magNst status locationSource magSource
0 2025-08-12T00:30:58.784Z 62.008000 -145.533400 9.800 2.60 ml NaN NaN NaN 0.8700 ... 2025-08-12T00:37:39.629Z 7 km SW of Copperville, Alaska earthquake NaN 0.300 NaN NaN automatic ak ak
1 2025-08-12T00:30:53.834Z 61.740900 -146.097200 42.400 2.60 ml NaN NaN NaN 0.5700 ... 2025-08-12T00:35:35.397Z 38 km S of Tolsona, Alaska earthquake NaN 3.200 NaN NaN automatic ak ak
2 2025-08-11T23:45:41.896Z 51.193700 160.859900 10.000 4.50 mb 25.0 135.00 2.2830 0.4400 ... 2025-08-12T00:10:36.040Z 256 km SE of Vilyuchinsk, Russia earthquake 7.30 1.883 0.069 61.0 reviewed us us
3 2025-08-11T21:24:31.503Z 41.242100 -116.714900 6.300 2.90 ml 14.0 144.57 0.3420 0.1627 ... 2025-08-11T22:14:22.638Z 60 km NE of Valmy, Nevada earthquake NaN 3.300 0.260 12.0 reviewed nn nn
4 2025-08-11T21:02:42.456Z 51.651500 159.778700 35.000 5.00 mb 61.0 106.00 1.5390 0.6000 ... 2025-08-12T00:35:51.040Z 170 km SSE of Vilyuchinsk, Russia earthquake 6.76 1.859 0.043 175.0 reviewed us us
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2658 2025-07-13T03:02:17.889Z -22.593300 -68.014400 153.524 4.40 mb 29.0 89.00 0.3880 0.8700 ... 2025-08-02T22:30:27.040Z 40 km NNE of San Pedro de Atacama, Chile earthquake 7.46 5.557 0.138 15.0 reviewed us us
2659 2025-07-13T02:32:49.570Z 32.791833 -115.829333 4.810 2.62 ml 67.0 28.00 0.1032 0.2600 ... 2025-08-02T22:22:55.040Z 17 km ENE of Ocotillo, CA earthquake 0.20 0.640 0.121 131.0 reviewed ci ci
2660 2025-07-13T00:49:42.340Z -9.885400 116.631100 54.034 4.10 mb 17.0 104.00 1.5440 0.4500 ... 2025-07-25T02:12:15.040Z 128 km S of Taliwang, Indonesia earthquake 8.08 5.198 0.174 9.0 reviewed us us
2661 2025-07-13T00:30:59.994Z 50.336900 -175.400000 10.000 2.80 ml 19.0 235.00 1.6530 0.7300 ... 2025-07-22T21:41:02.040Z 191 km SSE of Adak, Alaska earthquake 3.57 2.019 0.105 12.0 reviewed us us
2662 2025-07-13T00:11:12.440Z -49.427300 -8.066900 10.000 4.50 mb 14.0 106.00 22.4470 0.6600 ... 2025-07-25T01:58:47.040Z southern Mid-Atlantic Ridge earthquake 8.85 1.934 0.145 14.0 reviewed us us

2663 rows × 22 columns

[3]:
# instead of looking at the whole table, let's see an overview of the information we have
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2663 entries, 0 to 2662
Data columns (total 22 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   time             2663 non-null   object
 1   latitude         2663 non-null   float64
 2   longitude        2663 non-null   float64
 3   depth            2663 non-null   float64
 4   mag              2663 non-null   float64
 5   magType          2663 non-null   object
 6   nst              2498 non-null   float64
 7   gap              2496 non-null   float64
 8   dmin             2496 non-null   float64
 9   rms              2662 non-null   float64
 10  net              2663 non-null   object
 11  id               2663 non-null   object
 12  updated          2663 non-null   object
 13  place            2663 non-null   object
 14  type             2663 non-null   object
 15  horizontalError  2481 non-null   float64
 16  depthError       2662 non-null   float64
 17  magError         2470 non-null   float64
 18  magNst           2478 non-null   float64
 19  status           2663 non-null   object
 20  locationSource   2663 non-null   object
 21  magSource        2663 non-null   object
dtypes: float64(12), object(10)
memory usage: 457.8+ KB

We can see that the column ‘time’ has been read in as a generic object instead of a date. Let’s fix that by telling Pandas which column to interpret as dates. The list above tells us that the date is in column 0.

[4]:
df = pd.read_csv('global_30day_earthquakes.csv', parse_dates=[0])

When we run df.info() again, we can see that time is now a datetime64 type. We can also see that the datetime units are nanoseconds, and that it is UTC time. One of the best features of Pandas is the ease with which it can handle and interpret dates.

[5]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2663 entries, 0 to 2662
Data columns (total 22 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   time             2663 non-null   datetime64[ns, UTC]
 1   latitude         2663 non-null   float64
 2   longitude        2663 non-null   float64
 3   depth            2663 non-null   float64
 4   mag              2663 non-null   float64
 5   magType          2663 non-null   object
 6   nst              2498 non-null   float64
 7   gap              2496 non-null   float64
 8   dmin             2496 non-null   float64
 9   rms              2662 non-null   float64
 10  net              2663 non-null   object
 11  id               2663 non-null   object
 12  updated          2663 non-null   object
 13  place            2663 non-null   object
 14  type             2663 non-null   object
 15  horizontalError  2481 non-null   float64
 16  depthError       2662 non-null   float64
 17  magError         2470 non-null   float64
 18  magNst           2478 non-null   float64
 19  status           2663 non-null   object
 20  locationSource   2663 non-null   object
 21  magSource        2663 non-null   object
dtypes: datetime64[ns, UTC](1), float64(12), object(9)
memory usage: 457.8+ KB
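If you have already loaded a DataFrame, you can also convert a column after the fact with pd.to_datetime instead of re-reading the file. A minimal sketch, using hypothetical ISO-8601 timestamps formatted like those in the time column:

```python
import pandas as pd

# Hypothetical ISO-8601 timestamps, formatted like the 'time' column
s = pd.Series(['2025-08-12T00:30:58.784Z', '2025-08-11T23:45:41.896Z'])

# utc=True keeps the timezone, producing a datetime64[ns, UTC] column
parsed = pd.to_datetime(s, utc=True)
print(parsed.dtype)  # datetime64[ns, UTC]
```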

We can inspect the first few entries with df.head().

[6]:
df.head()
[6]:
time latitude longitude depth mag magType nst gap dmin rms ... updated place type horizontalError depthError magError magNst status locationSource magSource
0 2025-08-12 00:30:58.784000+00:00 62.0080 -145.5334 9.8 2.6 ml NaN NaN NaN 0.8700 ... 2025-08-12T00:37:39.629Z 7 km SW of Copperville, Alaska earthquake NaN 0.300 NaN NaN automatic ak ak
1 2025-08-12 00:30:53.834000+00:00 61.7409 -146.0972 42.4 2.6 ml NaN NaN NaN 0.5700 ... 2025-08-12T00:35:35.397Z 38 km S of Tolsona, Alaska earthquake NaN 3.200 NaN NaN automatic ak ak
2 2025-08-11 23:45:41.896000+00:00 51.1937 160.8599 10.0 4.5 mb 25.0 135.00 2.283 0.4400 ... 2025-08-12T00:10:36.040Z 256 km SE of Vilyuchinsk, Russia earthquake 7.30 1.883 0.069 61.0 reviewed us us
3 2025-08-11 21:24:31.503000+00:00 41.2421 -116.7149 6.3 2.9 ml 14.0 144.57 0.342 0.1627 ... 2025-08-11T22:14:22.638Z 60 km NE of Valmy, Nevada earthquake NaN 3.300 0.260 12.0 reviewed nn nn
4 2025-08-11 21:02:42.456000+00:00 51.6515 159.7787 35.0 5.0 mb 61.0 106.00 1.539 0.6000 ... 2025-08-12T00:35:51.040Z 170 km SSE of Vilyuchinsk, Russia earthquake 6.76 1.859 0.043 175.0 reviewed us us

5 rows × 22 columns

We can see above that each entry gets a generic row index of 0, 1, 2,… That isn’t particularly useful for our analysis since the indices don’t mean anything physical. Let’s change that by telling Pandas which column to use as the index. In this case we will use the earthquake ID. These are unique alphanumeric strings assigned by USGS to each earthquake.

[7]:
df = df.set_index('id')
df.head()
[7]:
time latitude longitude depth mag magType nst gap dmin rms ... updated place type horizontalError depthError magError magNst status locationSource magSource
id
ak025aagkam9 2025-08-12 00:30:58.784000+00:00 62.0080 -145.5334 9.8 2.6 ml NaN NaN NaN 0.8700 ... 2025-08-12T00:37:39.629Z 7 km SW of Copperville, Alaska earthquake NaN 0.300 NaN NaN automatic ak ak
ak025aagka93 2025-08-12 00:30:53.834000+00:00 61.7409 -146.0972 42.4 2.6 ml NaN NaN NaN 0.5700 ... 2025-08-12T00:35:35.397Z 38 km S of Tolsona, Alaska earthquake NaN 3.200 NaN NaN automatic ak ak
us6000qzzz 2025-08-11 23:45:41.896000+00:00 51.1937 160.8599 10.0 4.5 mb 25.0 135.00 2.283 0.4400 ... 2025-08-12T00:10:36.040Z 256 km SE of Vilyuchinsk, Russia earthquake 7.30 1.883 0.069 61.0 reviewed us us
nn00902440 2025-08-11 21:24:31.503000+00:00 41.2421 -116.7149 6.3 2.9 ml 14.0 144.57 0.342 0.1627 ... 2025-08-11T22:14:22.638Z 60 km NE of Valmy, Nevada earthquake NaN 3.300 0.260 12.0 reviewed nn nn
us6000qzz1 2025-08-11 21:02:42.456000+00:00 51.6515 159.7787 35.0 5.0 mb 61.0 106.00 1.539 0.6000 ... 2025-08-12T00:35:51.040Z 170 km SSE of Vilyuchinsk, Russia earthquake 6.76 1.859 0.043 175.0 reviewed us us

5 rows × 21 columns

We can investigate specific columns of the first 10 entries with the following syntax:

[8]:
df[['place', 'mag', 'depth']].head(n=10)
[8]:
place mag depth
id
ak025aagkam9 7 km SW of Copperville, Alaska 2.6 9.800
ak025aagka93 38 km S of Tolsona, Alaska 2.6 42.400
us6000qzzz 256 km SE of Vilyuchinsk, Russia 4.5 10.000
nn00902440 60 km NE of Valmy, Nevada 2.9 6.300
us6000qzz1 170 km SSE of Vilyuchinsk, Russia 5.0 35.000
us6000qzyh 195 km SE of Vilyuchinsk, Russia 4.8 10.000
us6000qzvz 94 km NW of Copiapó, Chile 4.8 23.635
us6000qzv1 southeast of the Loyalty Islands 4.8 10.000
us6000qzux 115 km SE of Attu Station, Alaska 4.2 35.000
nn00902450 55 km NE of Valmy, Nevada 2.5 11.300
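With id as the index, individual earthquakes can also be looked up by label using .loc. A small sketch on a toy DataFrame with hypothetical IDs:

```python
import pandas as pd

# A toy DataFrame indexed by (hypothetical) earthquake IDs
quakes = pd.DataFrame(
    {'mag': [2.6, 4.5], 'depth': [9.8, 10.0]},
    index=pd.Index(['ak025aagkam9', 'us6000qzzz'], name='id'),
)

# .loc selects by index label; a (row, column) pair returns a single value
print(quakes.loc['us6000qzzz', 'mag'])  # 4.5
```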

We can grab summary statistics of our entire DataFrame using df.describe().

[9]:
df.describe()

[9]:
latitude longitude depth mag nst gap dmin rms horizontalError depthError magError magNst
count 2663.000000 2663.000000 2663.000000 2663.000000 2498.000000 2496.000000 2496.000000 2662.000000 2481.000000 2662.000000 2470.000000 2478.000000
mean 35.686975 24.620066 44.237501 4.140421 51.969976 138.648778 1.958450 0.735855 7.573879 4.463947 0.093915 86.439871
std 28.067148 142.590874 83.846288 0.893920 39.714453 59.363062 2.873961 0.294644 3.474638 17.009364 0.050587 141.302986
min -63.245700 -179.983200 -2.360000 2.500000 0.000000 12.000000 0.000000 0.050000 0.080000 0.000000 0.001186 2.000000
25% 19.406700 -149.969300 10.000000 3.300000 24.000000 101.000000 0.527750 0.560000 5.440000 1.869000 0.056000 15.000000
50% 51.341200 120.729800 17.413000 4.400000 40.000000 131.000000 1.351000 0.730000 7.970000 1.961000 0.088000 30.000000
75% 52.897450 159.673300 35.000000 4.700000 68.000000 184.000000 2.246000 0.930000 9.850000 6.337750 0.121000 88.000000
max 83.270300 179.909500 638.859000 8.800000 338.000000 352.000000 33.879000 2.340000 22.620000 850.600000 0.371000 1144.000000

Looks like the largest earthquake in the last 30 days was an 8.8 (see column mag, row max), and the average earthquake magnitude was 4.1 with a standard deviation of 0.89. We can use methods like nlargest to quickly sort through our data. If we use nlargest on mag we can find the n largest earthquakes. Let’s do 20.

[10]:
df.mag.nlargest(n=20)
[10]:
id
us6000qw60    8.8
us7000qdyl    7.4
us7000qd1y    7.3
us6000qvsf    7.0
us6000qw6q    6.9
us6000qxt4    6.8
us7000qcik    6.7
us6000qw1a    6.6
us6000quy1    6.6
us7000qdz2    6.6
us7000qdyw    6.6
us7000qdye    6.6
us6000qvpp    6.5
us6000qxsv    6.4
us6000qwiq    6.4
us6000qugg    6.3
us6000qxzw    6.2
us6000qwuh    6.2
us6000qw6s    6.2
us6000qvbw    6.2
Name: mag, dtype: float64
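Under the hood, nlargest is shorthand for sorting in descending order and taking the head. A minimal sketch on a toy Series:

```python
import pandas as pd

mags = pd.Series([2.6, 8.8, 4.5, 5.0], name='mag')

# nlargest(n=2) is equivalent to sorting descending and taking the head
top2 = mags.sort_values(ascending=False).head(2)
print(top2.tolist())  # [8.8, 5.0]
```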

Now let’s look at some of the more interesting details, such as where these earthquakes are happening. We can see there is a column called place.

[11]:
df['place'].head()
[11]:
id
ak025aagkam9       7 km SW of Copperville, Alaska
ak025aagka93           38 km S of Tolsona, Alaska
us6000qzzz       256 km SE of Vilyuchinsk, Russia
nn00902440              60 km NE of Valmy, Nevada
us6000qzz1      170 km SSE of Vilyuchinsk, Russia
Name: place, dtype: object

It looks like the place descriptor has lots of extra text. We want something more concise so that we can categorize the earthquakes by larger areas (e.g., Alaska). Let’s create a new column, called loc, that takes just the last phrase in the place descriptor.

The next step is a little nuanced because it uses a regex (regular expression) pattern to find what we are looking for. Regex is a powerful tool for parsing text and matching strings of various formats. We won’t teach it here, because tools like ChatGPT are very good at generating regex patterns to find exactly what you need. The pattern used here isolates the text after the last comma in the place descriptor. If there isn’t a comma, we keep the whole string.

[12]:
# We'll try to get the last comma-separated part of the `place`, or the whole string if there is no comma
df['loc'] = df['place'].str.extract(r'(?:,\s*)([^,]+)$')

# If there are entries without a comma (like "Fiji region"), fill them with the original value
df['loc'] = df['loc'].fillna(df['place'])

# compare what we had in 'place' with what we have in 'loc'
print(df[['place', 'loc']].head())
                                          place     loc
id
ak025aagkam9     7 km SW of Copperville, Alaska  Alaska
ak025aagka93         38 km S of Tolsona, Alaska  Alaska
us6000qzzz     256 km SE of Vilyuchinsk, Russia  Russia
nn00902440            60 km NE of Valmy, Nevada  Nevada
us6000qzz1    170 km SSE of Vilyuchinsk, Russia  Russia
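If you prefer to avoid regex entirely, plain string splitting gives the same result here. A sketch on two hypothetical place strings, one with a comma and one without:

```python
import pandas as pd

# Two hypothetical place strings, one with a comma and one without
places = pd.Series(['7 km SW of Copperville, Alaska', 'Fiji region'])

# Split on commas, keep the last piece, and strip surrounding whitespace;
# strings without a comma pass through unchanged
loc = places.str.split(',').str[-1].str.strip()
print(loc.tolist())  # ['Alaska', 'Fiji region']
```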

There could be many earthquakes in the same country, and none in another. Let’s look at a list of the unique values in the loc column.

[13]:
df['loc'].unique()
[13]:
array(['Alaska', 'Russia', 'Nevada', 'Chile',
       'southeast of the Loyalty Islands', 'Mexico', 'Turkey',
       'New Zealand', 'Vanuatu', 'Peru', 'CA', 'Tonga',
       'south of the Kermadec Islands', 'Philippines', 'Indonesia',
       'Timor Leste', 'Colombia', 'Sea of Okhotsk', 'Guam', 'Guatemala',
       'Washington', 'south of the Fiji Islands', 'Wyoming',
       'Dominican Republic', 'Taiwan', 'Canada', 'Puerto Rico',
       'western Indian-Antarctic Ridge', 'central Mid-Atlantic Ridge',
       'Burma (Myanmar)', 'Arizona', 'Macquarie Island region',
       'Azerbaijan', 'Indiana', 'Ethiopia', 'Chagos Archipelago region',
       'Argentina', 'Japan', 'Papua New Guinea', 'Tanzania',
       'Japan region', 'Uzbekistan', 'Oregon', 'Hawaii', 'Afghanistan',
       'Fiji', 'Tajikistan', 'California', 'U.S. Virgin Islands', 'Texas',
       'Greece', 'Solomon Islands', 'Reykjanes Ridge', 'West Chile Rise',
       'southern Mid-Atlantic Ridge', 'New Mexico', 'Banda Sea',
       'New Jersey', 'Kansas', 'Kyrgyzstan', 'Saint Eustatius and Saba ',
       'east of the Kuril Islands', 'Iran', 'Bouvet Island region', 'MX',
       'Haiti', 'Fiji region', 'Nicaragua', 'El Salvador', 'Brazil',
       'Pacific-Antarctic Ridge', 'China', 'Pakistan', 'Portugal',
       'Venezuela', 'Costa Rica', 'Arkansas', 'Gulf of Alaska',
       'northern Mid-Atlantic Ridge', 'South Sandwich Islands region',
       'Montana', 'Saudi Arabia', 'Russia Earthquake', 'India',
       'Mariana Islands region', 'Mauritius - Reunion region', 'Idaho',
       'Burundi', 'Ecuador', 'Djibouti', 'Iceland', 'India region',
       'Northern Mariana Islands', 'east of the South Sandwich Islands',
       'Kermadec Islands region', 'Bolivia', 'west of Macquarie Island',
       'Antigua and Barbuda', 'New Caledonia',
       'Azores-Cape St. Vincent Ridge', 'Wallis and Futuna', 'Oklahoma',
       'Scotia Sea', 'Kazakhstan', 'Yemen', 'Madagascar',
       'Easter Island region', 'Falkland Islands region',
       'Balleny Islands region', 'southeast of Easter Island',
       'North Pacific Ocean', 'north of Ascension Island',
       'Iceland region', 'Anguilla', 'Malaysia', 'New Zealand region',
       'northern East Pacific Rise', 'off the coast of Oregon',
       'Owen Fracture Zone region', 'Carlsberg Ridge',
       'east of the North Island of New Zealand', 'south of Tonga',
       'Alaska Earthquake', 'Kuril Islands',
       'near the north coast of Greenland', 'Mauritius', 'Panama',
       'central East Pacific Rise', 'Spain'], dtype=object)
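Two related helpers are handy for summarizing a column of categories like this: nunique counts distinct values, and value_counts tallies how often each one appears. A toy sketch:

```python
import pandas as pd

# A toy column of location labels
locs = pd.Series(['Alaska', 'Russia', 'Alaska', 'Chile'])

print(locs.nunique())                # 3  (number of distinct values)
print(locs.value_counts().idxmax())  # Alaska  (most frequent value)
```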

Maybe we are only interested in larger earthquakes that are more likely to be felt. We can create a new DataFrame that only contains earthquakes of magnitude 4.0 or greater.

[14]:
df_mag4 = df[df['mag'] >= 4.0]
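This boolean-mask pattern extends to multiple conditions, combined with & (and) or | (or), with each condition wrapped in parentheses. A sketch on toy data, e.g. selecting shallow events of magnitude 4.0 or greater:

```python
import pandas as pd

# Toy data: magnitude and depth (km) for three hypothetical events
demo = pd.DataFrame({'mag': [2.6, 4.5, 6.2], 'depth': [9.8, 10.0, 600.0]})

# Each condition goes in parentheses; combine with & (and) or | (or)
shallow_big = demo[(demo['mag'] >= 4.0) & (demo['depth'] < 70.0)]
print(len(shallow_big))  # 1
```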

Now we can start plotting. Pandas and matplotlib work well together to naturally generate plots that have nicely labeled axes. You will have to define the details of the plots, but Pandas makes it very easy to subset/select data by different columns, date ranges, etc.

Plot the Top 5 Locations by Earthquake Count

[15]:
import matplotlib.pyplot as plt

# count how many earthquakes per country
counts = df_mag4['loc'].value_counts()

# take the top 5
top5 = counts.head(n=5)

fig, ax = plt.subplots(figsize=(8,5))
top5.plot(kind='bar', color='skyblue', ax=ax)
ax.set_title('Top 5 locations with most earthquakes')
ax.set_xlabel('Location')
ax.set_ylabel('Number of Earthquakes')
ax.tick_params(axis="x", rotation=45)

plt.tight_layout()
[Figure: bar chart of the top 5 locations with most earthquakes]

Plot a Histogram of Earthquake Magnitudes

[16]:
fig, ax = plt.subplots(figsize=(8,5))
ax.hist(df_mag4['mag'],bins=20,color='orange',edgecolor='black')
ax.set_title('Distribution of earthquakes >= 4.0')
ax.set_xlabel('Magnitude')
ax.set_ylabel('Count')
ax.grid(axis='y',linestyle='--')
[Figure: histogram of magnitudes for earthquakes >= 4.0]

Plot Histograms of Both Datasets with a Log Axis

[17]:
fig, ax = plt.subplots(1,2,figsize=(10,5))

ax[0].hist(df['mag'],bins=20,color='skyblue',edgecolor='black')
ax[0].set_title('Distribution of earthquakes')
ax[0].set_xlabel('Magnitude')
ax[0].set_ylabel('Count')
ax[0].grid(axis='y',linestyle='--')
ax[0].set_yscale('log')

ax[1].hist(df_mag4['mag'],bins=20,color='orange',edgecolor='black')
ax[1].set_title('Distribution of earthquakes >= 4.0')
ax[1].set_xlabel('Magnitude')
ax[1].set_ylabel('Count')
ax[1].grid(axis='y',linestyle='--')
ax[1].set_yscale('log')

plt.tight_layout()
[Figure: side-by-side log-scale histograms of both datasets]

Scatter Plot of Earthquake Locations

[18]:
fig, ax = plt.subplots(figsize=(8,5))

ax.scatter(df['longitude'],df['latitude'],s=10, alpha=0.6,c='red',edgecolors='black')
ax.set_xlabel('longitude')
ax.set_ylabel('latitude')
ax.grid(True)
[Figure: scatter plot of earthquake longitude vs. latitude]

Add a Map to Scatter with Cartopy

[19]:
import cartopy.crs as ccrs
import cartopy.feature as cfeature

# Create a GeoAxes with PlateCarree projection (standard lat/lon)
fig = plt.figure(figsize=(12,10))
ax = plt.axes(projection=ccrs.PlateCarree())

# Add coastlines and land features
ax.coastlines(resolution='110m')
ax.add_feature(cfeature.LAND, facecolor='lightgray')
ax.add_feature(cfeature.OCEAN, facecolor='lightblue')

# Scatter earthquake points
ax.scatter(df['longitude'], df['latitude'], s=10, c='red', alpha=0.6,
           edgecolors='k', linewidth=0.3, transform=ccrs.PlateCarree())

# Add gridlines with labels
gl = ax.gridlines(draw_labels=True, linewidth=0.5, color='gray', alpha=0.5, linestyle='--')
gl.top_labels = False
gl.right_labels = False

ax.set_title('Earthquake Locations with Coastlines')

plt.show()
[Figure: global map of earthquake locations with coastlines]

Exercises

  1. Update the scatter plot, such that the color of each dot reflects the magnitude of the earthquake.

  2. Same idea as option (1), but make the size of each dot reflect the magnitude of the earthquake.

  3. Create a new table that only contains earthquakes from Russia. Investigate the distribution of earthquake magnitudes in Russia in the last 30 days. Plot just the relevant region, and include some coloring or sizing of the scatter to highlight the biggest earthquakes.

Content for this tutorial was inspired by an assignment from the Earth and Environmental Science Databook. The assignment can be found here.