Filling Missing Data in Time Series Using Python and Pandas
Handling missing data in a time series is a common challenge when working with datasets, especially in domains like finance or IoT, where data might be collected at irregular intervals. In this post, I’ll walk through how to use Python and Pandas to load time series data, resample it, and fill in the missing gaps.
What is Missing Data in a Time Series?
Time series data is data collected at specific intervals. Sometimes, due to various factors, some data points might be missing. For example, stock market prices might not be available at every second, or a sensor could temporarily go offline. To perform further analysis, it’s often necessary to fill in these gaps in a reasonable way—either by forward-filling previous values, interpolating between points, or using other methods.
Let’s look at a step-by-step guide to loading, manipulating, and filling missing data in a time series using Pandas.
Loading Time Series Data from SQL
To start, we need to load time series data from a SQL database. In this example, we’ll use the sqlalchemy
package to create a connection to a MySQL database and load the data into a Pandas DataFrame.
Here’s how we do it:
import pandas as pd
import sqlalchemy
# Create an engine to connect to your database
engine = sqlalchemy.create_engine('mysql+pymysql://user:password@hostname/database')
# Load the data from SQL for all indices
query = "SELECT dt, tm, name, avg FROM ad_averages ORDER BY dt, tm"
df = pd.read_sql(query, engine)
Here, we’re fetching columns:
dt
(date),tm
(time),name
(the index or identifier),avg
(the average value of interest).
Combining Date and Time
Since date and time are in separate columns, the next step is to combine them into a single datetime
column. Pandas makes it easy to do this with pd.to_datetime
:
# Combine dt and tm into a single datetime column
df['dt_tm'] = pd.to_datetime(df['dt'] + ' ' + df['tm'])
# Set dt_tm as the index, keeping 'name' as a column
df.set_index('dt_tm', inplace=True)
At this point, our DataFrame has the dt_tm
column (our datetime index), with name
and avg
as regular columns.
Optimizing Data Types
To save memory, especially when dealing with large datasets, it’s good practice to optimize the data types of our DataFrame:
# Ensure that the data types are optimized to save memory
df['name'] = df['name'].astype('category')
df['avg'] = pd.to_numeric(df['avg'], downcast='float')
Here, we convert the name
column to a categorical data type and downcast avg
to a float, reducing the memory footprint.
Resampling and Filling Missing Data
Once our data is loaded and cleaned, we can handle the missing entries. One effective method for filling missing data in time series is to use forward-filling. First, we need to resample the data at regular intervals and then fill the gaps using ffill
:
# Group by 'name', resample, and forward fill within each group
df_filled = (
df.groupby('name', observed=False)
.apply(lambda group: group.resample('5min').asfreq().ffill())
.reset_index(level=0, drop=True)
.reset_index()
)
In this code:
- We group the data by
name
(since each index is handled separately). - For each group, we use
resample('5min')
to resample the data at 5-minute intervals. asfreq()
ensures that we’re keeping only the new regular time intervals, introducing NaN values where there is no data.- We then forward-fill (
ffill
) to fill the missing data points based on the most recent value.
A note about observed=False
The observed=False
parameter in the groupby()
method in Pandas is used when working with categorical data.
In particular, observed=False
affects how Pandas handles groupings for categorical columns that have unused categories. Let’s break it down:
- Categorical Data: In Pandas, when you have a column with categorical data, it can contain a predefined set of categories, even if some of those categories don’t appear in the actual data. For example, if a
name
column is defined as a categorical type, it could have categories like['A', 'B', 'C']
, but in the actual data, onlyA
andB
might appear. - Groupby and Observed: When grouping by a categorical column:
- If
observed=True
(the default), Pandas will exclude any categories that don’t have any data (i.e., categories that are not observed in the DataFrame). - If
observed=False
, Pandas will include all the categories, even if they don’t appear in the actual data. In other words, it will keep those “unobserved” categories in the resulting DataFrame or group.
- If
You would use
observed=False
if you need to ensure that all categories are represented in the result, regardless of whether they appear in the data or not. This can be useful for reporting or analysis where you want to track missing categories explicitly.
Dropping Rows with Initial Missing Data
After resampling and filling the data, you might still have NaN values if there was missing data at the very beginning of the series. These can be dropped:
# Optional: Drop any rows where 'avg' is still NaN
df_filled = df_filled.dropna(subset=['avg'])
Updating the Database with the Filled Data
Once we’ve filled the missing data, we can save the results back to the database:
# Update another table with this filled data
df_filled.to_sql('ad_averages_filled', engine, if_exists='replace', index=False)
Here, we’re saving the DataFrame to a new table ad_averages_filled
in the database. The if_exists='replace'
ensures that any existing table with that name is replaced by the new data.
Conclusion
Handling missing data in time series is a crucial task when working with real-world datasets. Using Pandas, we can easily load, manipulate, and fill missing data with methods like forward-filling. Resampling and grouping by categories (like name
) ensure that each time series is handled separately and filled accordingly.
This approach gives us a clean, continuous dataset that can be used for further analysis or machine learning models.
Here’s a quick summary of what we covered:
- Load time series data from a SQL database.
- Combine date and time into a single datetime column.
- Optimize data types to save memory.
- Resample the time series at regular intervals and forward-fill missing data.
- Save the filled data back into the database.
By following these steps, you can handle missing data effectively and make your time series data ready for further processing.