11 Feature Engineering
When working with business data, we often have a database that stores transactions. We will have tables that store credit card transactions, tables that store paycheck transactions, key card tables that store build access transactions, and account tables that store logins or purchase transactions. These transaction tables are usually not at the right level of detail (often called ‘unit of analysis’) for any machine learning or statistical model inference.
To build our predictive models, these transaction tables need to be munged into the correct space, time, and variable aggregation. Here are the questions we must work through based on our modeling needs.
- What is my spatial unit of interest for predictions? How can I munge my tables to create the foundation of the analysis table?
- What is my label/target that I will use for prediction?
- What features would help predict my label?
- What is my temporal unit of interest for feature building?
- What tables/keys do I need to map my transactions to that spatial unit?
How would we answer the above questions if we wanted to predict the next US county where a temple for the Church of Jesus Christ of Latter-day Saints would be built? First, what tables do we have?
- patterns: monthly traffic patterns for each place.
- places: Key information about each place.
- censustract_pkmap: The mapping of places keys to census tract based on placekeys in places.
- tract_table: The table that maps tract keys to state and county.
- idaho_temples.csv: A table of temples with a key that maps to tract.
11.1 Spatial Unit of interest derivation
The tract_table
has the county data within it. We can leverage this table to get our unit of analysis table.
11.2 Label creation
The idaho_temples.csv
has our path to a label. We want to create a 0/1 label for has not/has a temple for each county.
11.3 Feature Creation
We want a clean feature that focuses on chapels for the Church of Jesus Christ of Latter-day Saints. How can we get a list of placekeys
that are LDS chapels? Some regular expressions on the name will be helpful. What about day of the week attendance?
What can we do to check the quality of the data.
11.4 Temporal Considerations
This data runs over COVID-19 as we have 2019 through 2022.