Filling Missing Data in Time Series Using Python and Pandas

Posted on September 17, 2024 by Peter Andersson

Handling missing data in a time series is a common challenge when working with datasets, especially in domains like finance or IoT, where data might be collected at irregular intervals. In this post, I’ll walk through how to use Python and Pandas to load time series data, resample it, and fill in the missing gaps.

What is Missing Data in a Time Series?

Time series data is data collected at specific intervals. Sometimes, due to various factors, some data points might be missing. For example, stock market prices might not be available at every second, or a sensor could temporarily go offline. To perform further analysis, it’s often necessary to fill in these gaps in a reasonable way—either by forward-filling previous values, interpolating between points, or using other methods.

Let’s look at a step-by-step guide to loading, manipulating, and filling missing data in a time series using Pandas.

Loading Time Series Data from SQL

To start, we need to load time series data from a SQL database. In this example, we’ll use the sqlalchemy package to create a connection to a MySQL database and load the data into a Pandas DataFrame.

Here’s how we do it:

import pandas as pd
import sqlalchemy

# Create an engine to connect to your database
engine = sqlalchemy.create_engine('mysql+pymysql://user:password@hostname/database')

# Load the data from SQL for all indices
query = "SELECT dt, tm, name, avg FROM ad_averages ORDER BY dt, tm"
df = pd.read_sql(query, engine)

Here, we’re fetching columns:

dt (date),
tm (time),
name (the index or identifier),
avg (the average value of interest).

Combining Date and Time

Since date and time are in separate columns, the next step is to combine them into a single datetime column. Pandas makes it easy to do this with pd.to_datetime:

# Combine dt and tm into a single datetime column
df['dt_tm'] = pd.to_datetime(df['dt'] + ' ' + df['tm'])

# Set dt_tm as the index, keeping 'name' as a column
df.set_index('dt_tm', inplace=True)

At this point, our DataFrame has the dt_tm column (our datetime index), with name and avg as regular columns.

Optimizing Data Types

To save memory, especially when dealing with large datasets, it’s good practice to optimize the data types of our DataFrame:

# Ensure that the data types are optimized to save memory
df['name'] = df['name'].astype('category')
df['avg'] = pd.to_numeric(df['avg'], downcast='float')

Here, we convert the name column to a categorical data type and downcast avg to a float, reducing the memory footprint.

Resampling and Filling Missing Data

Once our data is loaded and cleaned, we can handle the missing entries. One effective method for filling missing data in time series is to use forward-filling. First, we need to resample the data at regular intervals and then fill the gaps using ffill:

# Group by 'name', resample, and forward fill within each group
df_filled = (
    df.groupby('name', observed=False)
    .apply(lambda group: group.resample('5min').asfreq().ffill())
    .reset_index(level=0, drop=True)
    .reset_index()
)

In this code:

We group the data by name (since each index is handled separately).
For each group, we use resample('5min') to resample the data at 5-minute intervals.
asfreq() ensures that we’re keeping only the new regular time intervals, introducing NaN values where there is no data.
We then forward-fill (ffill) to fill the missing data points based on the most recent value.

A note about observed=False

The observed=False parameter in the groupby() method in Pandas is used when working with categorical data.

In particular, observed=False affects how Pandas handles groupings for categorical columns that have unused categories. Let’s break it down:

Categorical Data: In Pandas, when you have a column with categorical data, it can contain a predefined set of categories, even if some of those categories don’t appear in the actual data. For example, if a name column is defined as a categorical type, it could have categories like ['A', 'B', 'C'], but in the actual data, only A and B might appear.
Groupby and Observed: When grouping by a categorical column:
- If observed=True (the default), Pandas will exclude any categories that don’t have any data (i.e., categories that are not observed in the DataFrame).
- If observed=False, Pandas will include all the categories, even if they don’t appear in the actual data. In other words, it will keep those “unobserved” categories in the resulting DataFrame or group.

You would use observed=False if you need to ensure that all categories are represented in the result, regardless of whether they appear in the data or not. This can be useful for reporting or analysis where you want to track missing categories explicitly.

Dropping Rows with Initial Missing Data

After resampling and filling the data, you might still have NaN values if there was missing data at the very beginning of the series. These can be dropped:

# Optional: Drop any rows where 'avg' is still NaN
df_filled = df_filled.dropna(subset=['avg'])

Updating the Database with the Filled Data

Once we’ve filled the missing data, we can save the results back to the database:

# Update another table with this filled data
df_filled.to_sql('ad_averages_filled', engine, if_exists='replace', index=False)

Here, we’re saving the DataFrame to a new table ad_averages_filled in the database. The if_exists='replace' ensures that any existing table with that name is replaced by the new data.

Conclusion

Handling missing data in time series is a crucial task when working with real-world datasets. Using Pandas, we can easily load, manipulate, and fill missing data with methods like forward-filling. Resampling and grouping by categories (like name) ensure that each time series is handled separately and filled accordingly.

This approach gives us a clean, continuous dataset that can be used for further analysis or machine learning models.

Here’s a quick summary of what we covered:

Load time series data from a SQL database.
Combine date and time into a single datetime column.
Optimize data types to save memory.
Resample the time series at regular intervals and forward-fill missing data.
Save the filled data back into the database.

By following these steps, you can handle missing data effectively and make your time series data ready for further processing.

Development, Python

M	T	W	T	F	S	S
						1
2	3	4	5	6	7	8
9	10	11	12	13	14	15
16	17	18	19	20	21	22
23	24	25	26	27	28	29
30