Feature engineering is a crucial step in the data science process, often making the difference between a good model and a great one. It involves transforming raw data into meaningful features that can improve the performance of predictive models. For those working in R, the FeatureExtraction package on CRAN offers a powerful and flexible toolset for automating and streamlining this process.
Originally developed as part of the OHDSI (Observational Health Data Sciences and Informatics) ecosystem, FeatureExtraction is particularly well-suited for working with large-scale observational data. In this post, we’ll explore the package in detail, focusing on its core features, practical applications, and a step-by-step example to help you get started.
Key Features and Capabilities
1. Automated Feature Generation
FeatureExtraction excels at automatically generating a wide range of features from raw data. These features include basic demographic variables like age and gender, as well as more complex attributes derived from longitudinal data, such as the frequency of medical visits or the presence of certain conditions over time.
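As a rough sketch of what this looks like in practice, assuming the package is already installed, the snippet below builds two settings objects: the package's default covariate set and a demographics-only set. The variable names are illustrative, and nothing is extracted until these settings are passed to an extraction call.

```r
library(FeatureExtraction)

# The package's default covariate set (demographics plus a broad range of
# condition, drug, procedure, and other covariates)
defaultSettings <- createDefaultCovariateSettings()

# A much narrower set: only gender and age
demographicsOnly <- createCovariateSettings(useDemographicsGender = TRUE,
                                            useDemographicsAge = TRUE)
```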
2. Temporal Features
Temporal data, such as patient histories or time-dependent events, are common in many fields, especially healthcare. FeatureExtraction handles temporal data adeptly, allowing users to define time windows relative to key events (e.g., diagnosis dates). This feature is crucial for creating time-sensitive covariates that capture trends and patterns in data over specified periods.
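The sketch below shows one way such time windows might be configured. The specific covariates and window lengths are illustrative choices, not recommendations: the first object uses a long-term and a short-term window before the index date, and the second requests one covariate per day over the prior year.

```r
library(FeatureExtraction)

# Condition groups and visit counts in two windows before the index date:
# long term (365 days back) and short term (30 days back), both ending at day 0
windowedSettings <- createCovariateSettings(useConditionGroupEraLongTerm = TRUE,
                                            useConditionGroupEraShortTerm = TRUE,
                                            useVisitCountLongTerm = TRUE,
                                            longTermStartDays = -365,
                                            shortTermStartDays = -30,
                                            endDays = 0)

# Time-resolved covariates: one window per day in the year before the index date
temporalSettings <- createTemporalCovariateSettings(useConditionOccurrence = TRUE,
                                                    temporalStartDays = -365:-1,
                                                    temporalEndDays = -365:-1)
```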
3. Custom Feature Extraction
While the package offers extensive automated capabilities, it also allows for custom feature extraction. Users can define custom covariates and specify how these should be generated from the underlying data, incorporating domain-specific knowledge into the feature engineering process.
4. Scalability
Feature engineering can become computationally intensive, particularly with large datasets. FeatureExtraction is designed for scalability: most of the computation is pushed to the database as SQL, and the results are stored in disk-backed Andromeda objects rather than held in memory, so feature extraction remains efficient even with big data.
5. Integration with OHDSI Tools
As part of the OHDSI ecosystem, FeatureExtraction integrates seamlessly with other tools like PatientLevelPrediction and CohortMethod, enabling a smooth workflow from data extraction to model building and analysis.
Getting Started
Installing the FeatureExtraction package is straightforward. Install it from CRAN, then load it:
install.packages("FeatureExtraction")
library(FeatureExtraction)
Practical Example: Creating Covariates Based on Other Cohorts
To illustrate how FeatureExtraction can be applied, let’s walk through an example where we create covariates based on the presence of patients in other cohorts. This is particularly useful in studies where the relationship between different conditions or treatments over time is of interest.
Step 1: Setting Up the Database Connection
First, we need to define the connection to our CDM-compliant database:
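# createConnectionDetails() comes from the DatabaseConnector package
connectionDetails <- DatabaseConnector::createConnectionDetails(
  dbms = "postgresql",
  server = "your_server",  # for PostgreSQL: "host/database"
  user = "your_username",
  password = "your_password"
)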
cdmDatabaseSchema <- "your_cdm_schema"
cohortDatabaseSchema <- "your_cohort_schema"
Step 2: Define the Cohorts of Interest
Assume we have a cohort of patients with diabetes and another cohort with a history of cardiovascular disease. We want to create a feature that indicates whether a patient in the diabetes cohort has a prior history of cardiovascular disease.
# Define cohort IDs (these would be predefined in your database)
diabetesCohortId <- 1
cvdCohortId <- 2
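Before extracting features, it can help to confirm that both cohorts actually contain rows. The helper below is a hypothetical sketch, not part of FeatureExtraction: it assumes a standard OHDSI cohort table with a cohort_definition_id column and uses DatabaseConnector to run a parameterized count query.

```r
library(DatabaseConnector)

# Hypothetical helper: count the rows stored for each cohort ID.
# Assumes a standard OHDSI cohort table (cohort_definition_id, subject_id, ...).
countCohortEntries <- function(connectionDetails,
                               cohortDatabaseSchema,
                               cohortTable = "cohort",
                               cohortIds) {
  connection <- connect(connectionDetails)
  on.exit(disconnect(connection))  # always close the connection

  sql <- "SELECT cohort_definition_id, COUNT(*) AS entry_count
          FROM @cohort_database_schema.@cohort_table
          WHERE cohort_definition_id IN (@cohort_ids)
          GROUP BY cohort_definition_id;"

  # Fills in the @parameters, translates the SQL to the target dialect,
  # and returns the result as a data frame
  renderTranslateQuerySql(connection, sql,
                          cohort_database_schema = cohortDatabaseSchema,
                          cohort_table = cohortTable,
                          cohort_ids = cohortIds,
                          snakeCaseToCamelCase = TRUE)
}
```

With a live connection, a call such as countCohortEntries(connectionDetails, cohortDatabaseSchema, cohortIds = c(diabetesCohortId, cvdCohortId)) should return one row per cohort that has entries; a missing row suggests the cohort has not been generated yet.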
Step 3: Create the Feature Extraction Settings
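Next, we define the feature extraction settings. createCohortBasedCovariateSettings() builds covariates from membership in other cohorts; it takes an analysis ID and a data frame listing the covariate cohorts (cohortId and cohortName). The demographic covariates come from a separate createCovariateSettings() call, and the two settings objects are combined in a list:
covariateCohorts <- data.frame(cohortId = cvdCohortId, cohortName = "Cardiovascular disease")
cvdSettings <- createCohortBasedCovariateSettings(
  analysisId = 999,  # any ID not used by another analysis
  covariateCohorts = covariateCohorts,
  startDay = -365,
  endDay = 0
)
demographicSettings <- createCovariateSettings(useDemographicsGender = TRUE, useDemographicsAge = TRUE)
covariateSettings <- list(demographicSettings, cvdSettings)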
In this example, the startDay and endDay parameters define a time window that runs from 365 days before the cohort’s index date through the index date itself (day 0). The feature therefore reflects whether a patient was in the cardiovascular disease cohort during the year leading up to the index date.
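To make the window arithmetic concrete, here is a small base R sketch. The dates and the overlapsWindow() helper are hypothetical, and the overlap rule is an assumption about how cohort membership is matched to the window; check the package documentation for the exact semantics.

```r
# Hypothetical index date for one patient in the diabetes cohort
indexDate <- as.Date("2021-06-15")
startDay <- -365
endDay <- 0

# The window is expressed in days relative to the index date
windowStart <- indexDate + startDay  # 2020-06-15
windowEnd <- indexDate + endDay      # 2021-06-15

# Hypothetical helper: TRUE if a cohort entry [entryStart, entryEnd]
# overlaps the window (assumed semantics, for illustration only)
overlapsWindow <- function(entryStart, entryEnd) {
  entryStart <= windowEnd & entryEnd >= windowStart
}

inWindow <- overlapsWindow(as.Date("2021-01-10"), as.Date("2021-02-01"))     # TRUE
beforeWindow <- overlapsWindow(as.Date("2019-03-01"), as.Date("2019-04-01")) # FALSE
```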
Step 4: Extract the Features
Now, we extract the features for the diabetes cohort using the settings we defined:
covariateData <- getDbCovariateData(
  connectionDetails = connectionDetails,
  cdmDatabaseSchema = cdmDatabaseSchema,
  cohortDatabaseSchema = cohortDatabaseSchema,
  cohortTable = "cohort",
  cohortId = diabetesCohortId,
  covariateSettings = covariateSettings
)
This function retrieves the covariate data for the specified cohort, based on the feature extraction settings we provided.
Step 5: Use the Extracted Features
The extracted features are now available in the covariateData object, which can be used for further analysis, such as model building or cohort characterization.
# Explore the covariate data
summary(covariateData)
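Beyond summary(), the individual tables inside the object can be queried with dplyr. The helper below is a hypothetical sketch: it assumes per-person (non-aggregated) output, where the covariates table has rowId and covariateId columns and the covariateRef table maps covariateId to covariateName.

```r
library(dplyr)

# Hypothetical helper: list the most common covariates by name.
# Assumes covariateData$covariates has one row per person and covariate,
# and covariateData$covariateRef holds the covariate names.
topCovariates <- function(covariateData, top = 10) {
  covariateData$covariates %>%
    group_by(covariateId) %>%
    summarise(personCount = n()) %>%
    inner_join(covariateData$covariateRef, by = "covariateId") %>%
    select(covariateId, covariateName, personCount) %>%
    arrange(desc(personCount)) %>%
    head(top) %>%
    collect()  # pull the result into an in-memory data frame
}
```

Calling topCovariates(covariateData) after the extraction step should then return the ten most frequent covariates, including the cardiovascular disease cohort covariate if any patients have it.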
This simple example demonstrates how FeatureExtraction can be used to create meaningful features based on different cohorts. The package’s flexibility and scalability make it a powerful tool for a wide range of applications, from small-scale studies to large observational databases.