<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>data science Archives - Ger Inberg</title>
	<atom:link href="https://gerinberg.com/category/datascience/feed/" rel="self" type="application/rss+xml" />
	<link>https://gerinberg.com/category/datascience/</link>
	<description>data science developer</description>
	<lastBuildDate>Tue, 03 Sep 2024 12:40:01 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.8.1</generator>

<image>
	<url>https://gerinberg.com/wp-content/uploads/2017/05/favicon-150x150.jpg</url>
	<title>data science Archives - Ger Inberg</title>
	<link>https://gerinberg.com/category/datascience/</link>
	<width>32</width>
	<height>32</height>
</image> 
	<item>
		<title>FeatureExtraction on CRAN</title>
		<link>https://gerinberg.com/2024/09/03/featureextraction-on-cran/</link>
		
		<dc:creator><![CDATA[Ger]]></dc:creator>
		<pubDate>Tue, 03 Sep 2024 12:40:01 +0000</pubDate>
				<category><![CDATA[data science]]></category>
		<category><![CDATA[software engineering]]></category>
		<guid isPermaLink="false">https://gerinberg.com/?p=1861</guid>

					<description><![CDATA[<p>Feature engineering is a crucial step in the data science process, often making the difference between a good model and a great one. It involves transforming raw data into meaningful features that can improve the performance of predictive models. For those working in R, the [&#8230;]</p>
<p>The post <a href="https://gerinberg.com/2024/09/03/featureextraction-on-cran/">FeatureExtraction on CRAN</a> appeared first on <a href="https://gerinberg.com">Ger Inberg</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>Feature engineering is a crucial step in the data science process, often making the difference between a good model and a great one. It involves transforming raw data into meaningful features that can improve the performance of predictive models. For those working in R, the <strong>FeatureExtraction</strong> package on <a href="https://cran.r-project.org/web/packages/FeatureExtraction/index.html" target="_blank" rel="noopener">CRAN</a> offers a powerful and flexible toolset for automating and streamlining this process.</p>
<p>Originally developed as part of the OHDSI (Observational Health Data Sciences and Informatics) ecosystem, <strong>FeatureExtraction</strong> is particularly well-suited for working with large-scale observational data. In this post, we&#8217;ll explore the package in detail, focusing on its core features, practical applications, and a step-by-step example to help you get started.</p>
<h3>Key Features and Capabilities</h3>
<h4>1. <strong>Automated Feature Generation</strong></h4>
<p><strong>FeatureExtraction</strong> excels at automatically generating a wide range of features from raw data. These features include basic demographic variables like age and gender, as well as more complex attributes derived from longitudinal data, such as the frequency of medical visits or the presence of certain conditions over time.</p>
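<p>To give a flavour of the API: the package ships with a broad default covariate set, and individual groups of covariates can be switched on explicitly. A minimal sketch (the function and argument names follow the package documentation; check the reference manual of your installed version):</p>
<pre># Use the package's broad default set of covariates
covariateSettings &lt;- createDefaultCovariateSettings()

# Or enable specific covariate groups explicitly
covariateSettings &lt;- createCovariateSettings(useDemographicsGender = TRUE,
                                             useDemographicsAge = TRUE,
                                             useConditionOccurrenceAnyTimePrior = TRUE)</pre>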
<h4>2. <strong>Temporal Features</strong></h4>
<p>Temporal data, such as patient histories or time-dependent events, are common in many fields, especially healthcare. <strong>FeatureExtraction</strong> handles temporal data adeptly, allowing users to define time windows relative to key events (e.g., diagnosis dates). This feature is crucial for creating time-sensitive covariates that capture trends and patterns in data over specified periods.</p>
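<p>As an illustration, temporal covariates can be defined over a series of windows relative to the index date. This is a sketch only; the temporal* argument names are taken from the package documentation and may differ per version:</p>
<pre># Condition occurrences counted in twelve 30-day windows before the index date
temporalSettings &lt;- createTemporalCovariateSettings(useConditionOccurrence = TRUE,
                                                    temporalStartDays = seq(-365, -35, by = 30),
                                                    temporalEndDays   = seq(-336, -6,  by = 30))</pre>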
<h4>3. <strong>Custom Feature Extraction</strong></h4>
<p>While the package offers extensive automated capabilities, it also allows for custom feature extraction. Users can define custom covariates and specify how these should be generated from the underlying data, incorporating domain-specific knowledge into the feature engineering process.</p>
<h4>4. <strong>Scalability</strong></h4>
<p>Feature engineering can become computationally intensive, particularly with large datasets. <strong>FeatureExtraction</strong> is designed for scalability, leveraging parallel processing and optimized algorithms to ensure that feature extraction remains efficient even with big data.</p>
<h4>5. <strong>Integration with OHDSI Tools</strong></h4>
<p>As part of the OHDSI ecosystem, <strong>FeatureExtraction</strong> integrates seamlessly with other tools like <strong>PatientLevelPrediction</strong> and <strong>CohortMethod</strong>, enabling a smooth workflow from data extraction to model building and analysis.</p>
<h4>Getting started</h4>
<p>Installing the <strong>FeatureExtraction</strong> package is straightforward. You can install it directly from CRAN using the following command:</p>
<div class="dark bg-gray-950 contain-inline-size rounded-md border-[0.5px] border-token-border-medium">
<div class="flex items-center relative text-token-text-secondary bg-token-main-surface-secondary px-4 py-2 text-xs font-sans justify-between rounded-t-md"><code>install.packages("FeatureExtraction")</code></div>
</div>
<div></div>
<div>Load the package in your R session:</div>
<div></div>
<div><code>library(FeatureExtraction)</code></div>
<div>
<h3>Practical Example: Creating Covariates Based on Other Cohorts</h3>
<p>To illustrate how <strong>FeatureExtraction</strong> can be applied, let&#8217;s walk through an example where we create covariates based on the presence of patients in other cohorts. This is particularly useful in studies where the relationship between different conditions or treatments over time is of interest.</p>
<h4>Step 1: Setting Up the Database Connection</h4>
<p>First, we need to define the connection to our CDM-compliant database:</p>
<pre># createConnectionDetails() comes from the DatabaseConnector package,
# which FeatureExtraction uses for database access
connectionDetails &lt;- createConnectionDetails(dbms = "postgresql",
                                             server = "your_server",
                                             user = "your_username",
                                             password = "your_password")
cdmDatabaseSchema &lt;- "your_cdm_schema"
cohortDatabaseSchema &lt;- "your_cohort_schema"</pre>
<h4>Step 2: Define the Cohorts of Interest</h4>
<p>Assume we have a cohort of patients with diabetes and another cohort with a history of cardiovascular disease. We want to create a feature that indicates whether a patient in the diabetes cohort has a prior history of cardiovascular disease.</p>
<pre># Define cohort IDs (these would be predefined in your database)
diabetesCohortId &lt;- 1
cvdCohortId &lt;- 2</pre>
<h4>Step 3: Create the Feature Extraction Settings</h4>
<p>Next, we define the feature extraction settings, specifying that we want to create covariates based on the presence of patients in the cardiovascular disease cohort:</p>
<pre>covariateSettings &lt;- createCohortBasedCovariateSettings(useDemographicsGender = TRUE,
                                                        useDemographicsAge = TRUE,
                                                        cohortId = cvdCohortId,
                                                        startDay = -365,
                                                        endDay = 0)</pre>
<p>In this example, the <code>startDay</code> and <code>endDay</code> parameters define a time window of one year prior to the cohort&#8217;s index date. This means the feature will reflect whether a patient was in the cardiovascular disease cohort within one year before the index date.</p>
<h4>Step 4: Extract the Features</h4>
<p>Now, we extract the features for the diabetes cohort using the settings we defined:</p>
<pre>covariateData &lt;- getDbCovariateData(connectionDetails = connectionDetails,
                                    cdmDatabaseSchema = cdmDatabaseSchema,
                                    cohortDatabaseSchema = cohortDatabaseSchema,
                                    cohortTable = "cohort",
                                    cohortId = diabetesCohortId,
                                    covariateSettings = covariateSettings)</pre>
<p>This function retrieves the covariate data for the specified cohort, based on the feature extraction settings we provided.</p>
<h4>Step 5: Use the Extracted Features</h4>
<p>The extracted features are now available in the <code>covariateData</code> object, which can be used for further analysis, such as model building or cohort characterization.</p>
<pre># Explore the covariate data
summary(covariateData)</pre>
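<p>In recent versions the covariate data is stored out of memory (as Andromeda/dplyr tables), so exploring and aggregating it could look like the following sketch; exact field and function names may vary per version:</p>
<pre># Peek at the person-level covariates (a lazy, out-of-memory table)
library(dplyr)
covariateData$covariates %&gt;% head() %&gt;% collect()

# Aggregate person-level covariates into summary statistics
aggregatedData &lt;- aggregateCovariates(covariateData)</pre>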
<p>This simple example demonstrates how <strong>FeatureExtraction</strong> can be used to create meaningful features based on different cohorts. The package&#8217;s flexibility and scalability make it a powerful tool for a wide range of applications, from small-scale studies to large observational databases.</p>
<p>The post <a href="https://gerinberg.com/2024/09/03/featureextraction-on-cran/">FeatureExtraction on CRAN</a> appeared first on <a href="https://gerinberg.com">Ger Inberg</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>AI Redaction Application</title>
		<link>https://gerinberg.com/2020/12/15/ai-redaction-application/</link>
		
		<dc:creator><![CDATA[Ger]]></dc:creator>
		<pubDate>Tue, 15 Dec 2020 14:32:00 +0000</pubDate>
				<category><![CDATA[data science]]></category>
		<category><![CDATA[AI]]></category>
		<category><![CDATA[redaction]]></category>
		<guid isPermaLink="false">https://gerinberg.com/?p=1548</guid>

					<description><![CDATA[<p>What is redaction? Redaction is the blacking out or deletion of text in a document. It is intended to allow the selective disclosure of information in a document while keeping other parts of the document secret. It is common within court documents and in the [&#8230;]</p>
<p>The post <a href="https://gerinberg.com/2020/12/15/ai-redaction-application/">AI Redaction Application</a> appeared first on <a href="https://gerinberg.com">Ger Inberg</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<h3>What is redaction?</h3>
<p><span style="color: #202124; font-family: georgia, palatino, serif; font-size: 12pt;">redaction is the blacking out or deletion of text in a document. It is intended to allow the selective disclosure of information in a document while keeping other parts of the document secret. It is common within court documents and in the government. Categories of redacted items are phone numbers, e-mail addresses, bank account numbers, dates and names. It takes quite some time to manually redact documents, but fortunately AI can help to speed up this process. Natural Language Processing (<a href="https://monkeylearn.com/blog/nlp-ai/" target="_blank" rel="noopener">NLP</a>) is a subfield of AI that studies how to analyze and process a piece of natural text. This technology allows us to extract the keywords from the text.</span></p>
<p><span style="color: #202124; font-family: georgia, palatino, serif; font-size: 12pt;"><a href="https://www.slimmer.ai/" target="_blank" rel="noopener">Slimmer AI</a> develops AI software products that support industries, solve real-world challenges and takes professionals into the future. They have developed an API that allows the redaction of PDF files. This API returns the redacted document based on your redaction action (e.g. all phone numbers). I have collaborated with Slimmer AI on building the interface for their new redaction application. </span></p>
<h3>Redaction Application</h3>
<p><span style="color: #202124; font-family: georgia, palatino, serif; font-size: 12pt;">The developed application has the following features:</span></p>
<ul>
<li><span style="color: #202124; font-family: georgia, palatino, serif; font-size: 12pt;">search for keyword(s) in the text, this can be a regular expression</span></li>
<li><span style="color: #202124; font-family: georgia, palatino, serif; font-size: 12pt;">AI search: search for items in a category like phone numbers</span></li>
<li><span style="color: #202124; font-family: georgia, palatino, serif; font-size: 12pt;">select a piece of text in the document </span></li>
<li><span style="color: #202124; font-family: georgia, palatino, serif; font-size: 12pt;">redact the results from the actions above</span></li>
<li><span style="color: #202124; font-family: georgia, palatino, serif; font-size: 12pt;">display the redacted PDF</span></li>
</ul>
<p><span style="color: #202124; font-family: georgia, palatino, serif; font-size: 12pt;">Below you see a screenshot of the application. The left sidebar is the search column where the keyword and AI search can be performed. At the bottom of this sidebar, the results of the search are shown. When a user clicks on a result, it is selected for redaction.</span></p>
<p><span style="color: #202124; font-family: georgia, palatino, serif; font-size: 12pt;">The center of the application contains the document. This is the section where the text selection is performed. Once a piece of text is selected a popup appears that asks if the selected text should be redacted or not. </span></p>
<p><span style="color: #202124; font-family: georgia, palatino, serif; font-size: 12pt;">The right column contains the items that have been selected for redaction. When the user pushes the &#8216;Redact All&#8217; button, the document is processed on the backend and the middle section will show the redacted version of the document.</span></p>
<div class="envira-gallery-feed-output"><img decoding="async" class="envira-gallery-feed-image" src="https://gerinberg.com/wp-content/uploads/2020/12/redacto8-1-1024x486-640x480.png" title="AI Redaction Application" alt="AI Redaction Application" /></div>
<p><span style="color: #202124; font-family: georgia, palatino, serif; font-size: 12pt;">The application uses the <a href="https://mozilla.github.io/pdf.js/" target="_blank" rel="noopener">PDF.JS</a> library for basic functionality like rendering the PDF and selecting some text. It is a free and open source library. There are some commercial libraries that offer more functionality, but they were unrequired. The rest of the technology stack for the application includes Javascript, JQuery, Bootstrap4 and HTML/CSS.</span></p>
<h3>Improvements</h3>
<p><span style="color: #202124; font-family: georgia, palatino, serif; font-size: 12pt;">The application was meant as a Proof of Concept to see if we could create a user-friendly wrapper for the API. Since the current functionality is working well, the application is being further developed. One thing on the improvements list is the option for a rectangle select. So next to redacting a piece of text on a line, like we can do now, this allows the user to redact any rectangular area in the document. </span></p>
<p>The post <a href="https://gerinberg.com/2020/12/15/ai-redaction-application/">AI Redaction Application</a> appeared first on <a href="https://gerinberg.com">Ger Inberg</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Remote data scientists</title>
		<link>https://gerinberg.com/2017/12/18/remote-data-scientists/</link>
					<comments>https://gerinberg.com/2017/12/18/remote-data-scientists/#comments</comments>
		
		<dc:creator><![CDATA[Ger]]></dc:creator>
		<pubDate>Mon, 18 Dec 2017 05:02:25 +0000</pubDate>
				<category><![CDATA[data science]]></category>
		<guid isPermaLink="false">http://gerinberg.com/?p=1038</guid>

					<description><![CDATA[<p>Remote data scientists is a group of people working in the field of data science who do their work (fully) remotely. I recently started this group since it&#8217;s a good way to connect with other data scientists who are in a similar situation. Most of [&#8230;]</p>
<p>The post <a href="https://gerinberg.com/2017/12/18/remote-data-scientists/">Remote data scientists</a> appeared first on <a href="https://gerinberg.com">Ger Inberg</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>Remote data scientists is a group of people working in the field of data science who do their work (fully) remotely.</p>
<p>I recently started this group since it&#8217;s a good way to connect with other data scientists who are in a similar situation. Most of us travel quite a bit and face the same challenges: how do you find remote work? Where is a good place to work that also has a data community? The group can also be used to discuss technical challenges and to keep up to date with cool projects that other people are doing.</p>
<p>Currently, most of the people in the group are staying in Chiang Mai, Thailand. We had our first physical meetup there and we will organize more. I am currently planning to give a workshop about data visualization using R Shiny.</p>
<p>If you are also working remotely as a data scientist, please <a href="https://www.facebook.com/groups/1661358100551622/" target="_blank" rel="noopener">join us</a> on Facebook!</p>
<p>The post <a href="https://gerinberg.com/2017/12/18/remote-data-scientists/">Remote data scientists</a> appeared first on <a href="https://gerinberg.com">Ger Inberg</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://gerinberg.com/2017/12/18/remote-data-scientists/feed/</wfw:commentRss>
			<slash:comments>3</slash:comments>
		
		
			</item>
		<item>
		<title>Live collaboration in Shiny apps</title>
		<link>https://gerinberg.com/2017/08/27/live-collaboration-shiny-apps/</link>
		
		<dc:creator><![CDATA[Ger]]></dc:creator>
		<pubDate>Sun, 27 Aug 2017 08:14:48 +0000</pubDate>
				<category><![CDATA[data science]]></category>
		<guid isPermaLink="false">http://gerinberg.com/?p=842</guid>

					<description><![CDATA[<p>For the past couple of years I have been using the shiny package in R for interactive data visualization. It started as a tool for exploratory analysis but it&#8217;s getting more popular and it has more use cases now. For example I have helped a client with [&#8230;]</p>
<p>The post <a href="https://gerinberg.com/2017/08/27/live-collaboration-shiny-apps/">Live collaboration in Shiny apps</a> appeared first on <a href="https://gerinberg.com">Ger Inberg</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>For the past couple of years I have been using the shiny package in R for interactive data visualization. It started as a tool for exploratory analysis, but it&#8217;s getting more popular and has more use cases now. For example, I have helped a client build a production dashboard to monitor industrial devices in (near) real time. I also notice a growing need for editing data and for collaboration between multiple users.</p>
<p>So when I watched some videos from the useR conference, the <a href="https://channel9.msdn.com/events/useR-international-R-User-conferences/useR-International-R-User-2017-Conference/shinycollections-Google-Docs-like-live-collaboration-in-Shiny?term=Collaboration">Google Docs-like collaboration in Shiny video</a> drew my attention.</p>
<p>The developers of the shiny.collections package make use of a reactive database (RethinkDB). In short, this is a database that you don&#8217;t have to poll for updates; instead, it notifies your application of changes. This paradigm is ideal for realtime apps. Shiny uses the reactive programming model, so using this type of database extends the reactivity from the GUI all the way to your database.</p>
<p>I have built a <a href="https://github.com/ginberg/shiny.collaboration">small application</a> in which you can see the reactivity in action. To make it work on your localhost, please make sure RethinkDB is running. Next, run the application in two browser tabs. When you update a row in the datatable, it immediately appears in the other tab as well.</p>
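<p>To give an idea of how little code this takes, here is a minimal sketch of such an app. It assumes the shiny.collections API (connect(), collection(), insert()); see the package README for the exact signatures:</p>
<pre>library(shiny)
library(shiny.collections)

# Connect to a RethinkDB instance running on localhost
connection &lt;- connect()

ui &lt;- fluidPage(
  textInput("message", "Message"),
  actionButton("send", "Send"),
  uiOutput("messages")
)

server &lt;- function(input, output, session) {
  # A reactive collection: inserts from any session show up in all sessions
  chat &lt;- collection("chat", connection)

  observeEvent(input$send, {
    insert(chat, list(text = input$message))
  })

  output$messages &lt;- renderUI({
    msgs &lt;- chat$collection  # reactive data frame of stored documents
    if (is.data.frame(msgs) &amp;&amp; nrow(msgs) &gt; 0) tags$ul(lapply(msgs$text, tags$li))
  })
}

shinyApp(ui, server)</pre>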
<p><a href="http://gerinberg.com/wp-content/uploads/2017/08/shiny-collaboration.png"><img fetchpriority="high" decoding="async" class="alignnone size-medium wp-image-850" src="http://gerinberg.com/wp-content/uploads/2017/08/shiny-collaboration-300x187.png" alt="" width="300" height="187" srcset="https://gerinberg.com/wp-content/uploads/2017/08/shiny-collaboration-300x187.png 300w, https://gerinberg.com/wp-content/uploads/2017/08/shiny-collaboration-768x479.png 768w, https://gerinberg.com/wp-content/uploads/2017/08/shiny-collaboration-830x517.png 830w, https://gerinberg.com/wp-content/uploads/2017/08/shiny-collaboration-230x143.png 230w, https://gerinberg.com/wp-content/uploads/2017/08/shiny-collaboration-350x218.png 350w, https://gerinberg.com/wp-content/uploads/2017/08/shiny-collaboration-480x299.png 480w, https://gerinberg.com/wp-content/uploads/2017/08/shiny-collaboration.png 964w" sizes="(max-width: 300px) 100vw, 300px" /></a></p>
<p>The post <a href="https://gerinberg.com/2017/08/27/live-collaboration-shiny-apps/">Live collaboration in Shiny apps</a> appeared first on <a href="https://gerinberg.com">Ger Inberg</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Deploying shiny apps on aws using docker</title>
		<link>https://gerinberg.com/2017/04/12/deploying-shiny-apps-on-aws-using-docker/</link>
					<comments>https://gerinberg.com/2017/04/12/deploying-shiny-apps-on-aws-using-docker/#comments</comments>
		
		<dc:creator><![CDATA[Ger]]></dc:creator>
		<pubDate>Wed, 12 Apr 2017 02:05:08 +0000</pubDate>
				<category><![CDATA[data science]]></category>
		<guid isPermaLink="false">http://gerinberg.com/?p=781</guid>

					<description><![CDATA[<p>Recently, I decided to migrate my shiny apps to amazon webservices. Most of them were running on the shinyapps.io platform. Why did I decide to migrate my apps? In the free version of shinyapps it is only possible to deploy 5 apps which is not [&#8230;]</p>
<p>The post <a href="https://gerinberg.com/2017/04/12/deploying-shiny-apps-on-aws-using-docker/">Deploying shiny apps on aws using docker</a> appeared first on <a href="https://gerinberg.com">Ger Inberg</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>Recently, I decided to migrate my shiny apps to amazon webservices. Most of them were running on the <a href="https://www.shinyapps.io" target="_blank" rel="noopener">shinyapps.io</a> platform. Why did I decide to migrate my apps?</p>
<ul>
<li>In the free version of shinyapps it is only possible to deploy 5 apps, which is not enough for me. The cheapest option then is the starter subscription, which costs 9 dollars per month. It&#8217;s not a lot of money, but I think it can be cheaper to use an AWS instance (though it requires more work)</li>
<li>I would like to be flexible and not be tied to the shinyapps configuration. For example, I want to be able to change the configuration of the shiny server.</li>
<li>I would like some more experience with <a href="https://www.docker.com/what-docker" target="_blank" rel="noopener">Docker</a>. Docker is a container system that allows you to &#8220;Build, Ship and Run Any App, Anywhere&#8221;.</li>
</ul>
<h3><span style="font-size: 14pt;"><strong>Docker  &#8211; R Shiny container</strong></span></h3>
<p>To get started with docker you will need an image. An image is an immutable blueprint of an operating environment, while a container is an instance of an image. For programmers: you can compare it with classes and objects, where images are the classes and containers the objects.</p>
<p>There are a lot of images already available on <a href="https://hub.docker.com/">docker hub</a>, so you might want to check them out first. For R and shiny server there is quite some choice as well. I first used the rocker/shiny image to deploy my apps, and found out that most of my applications were unfortunately not starting. So, let&#8217;s look at the log file&#8230; well, it&#8217;s not there! It turns out that if the R process exits successfully, the log files are removed by default (to not waste disk space). I turned this setting off, so I could find out what was going on. The reason my apps were not starting was that some packages were missing. So I installed them; most were related to plotting (plotly, shinyjs, leaflet, etc).</p>
<p>Because of these 2 issues, I decided to create a custom Dockerfile. It is based on rocker/shiny, but it installs the packages I need by default and it preserves the log files.</p>
<pre>FROM rocker/shiny:latest 

MAINTAINER Ger Inberg "*****@****.com"

# install ssl (the "exit 0" lets the build continue even if apt-get update reports errors)
RUN sudo apt-get update; exit 0
RUN sudo apt-get install -y libssl-dev

# install additional packages
RUN R -e "install.packages(c('ggplot2', 'plotly', 'shinyjs', 'shinyBS', 'leaflet', 'ggmap', 'webshot', 'DT', 'shinydashboard'), repos='https://cran.rstudio.com/')"

# copy shiny-server config file
COPY shiny-server.conf /etc/shiny-server/shiny-server.conf

CMD ["/usr/bin/shiny-server.sh"]</pre>
<p>On the <a href="https://docs.docker.com/engine/getstarted/step_four/" target="_blank" rel="noopener">docker site</a>, you can find how to create an image from a Dockerfile.</p>
<h3><span style="font-size: 14pt;"><strong>Shiny Server configuration</strong></span></h3>
<p>Shiny server comes with a default configuration, which includes the default port number (3838), the location of the log file directory, etc. I have only added the lines in blue.</p>
<pre># Instruct Shiny Server to run applications as the user "shiny"
run_as shiny;
<span style="color: #3366ff;"># preserve logs!
preserve_logs true;</span>

# Define a server that listens on port 3838
server {
 listen 3838;

 # Define a location at the base URL
 location / {

 # Host the directory of Shiny Apps stored in this directory
 site_dir /srv/shiny-server;

 # Log all Shiny output to files in this directory
 log_dir /var/log/shiny-server;

 # When a user visits the base URL rather than a particular application,
 # an index of the applications available in this directory will be shown.
 directory_index on;
 }
}</pre>
<h3><span style="font-size: 14pt;"><strong>Run docker instance on AWS</strong></span></h3>
<p>I assume you already have a running AWS instance with Docker installed. Furthermore, I assume your shiny apps are already installed on the AWS instance. From there on it is easy to get up and running.</p>
<p>Below are the commands I have executed to use my own docker image to host my shiny apps.</p>
<pre># Get the image
docker pull ginberg/shiny

# run the docker image such that shiny server is running on port 80
# furthermore use the -v option to map the application directory and log file directory from aws instance fs to docker fs
docker run --rm -p 80:3838 -v /home/ubuntu/shiny/apps:/srv/shiny-server/ -v /home/ubuntu/shiny/apps/logs:/var/log/shiny-server/ ginberg/shiny &amp; 

# view log files directory
ls -al /home/ubuntu/shiny/apps/logs

# to login on the docker image: view running processes and use the id
docker ps
docker exec -it &lt;id&gt; bash</pre>
<h3><span style="font-size: 14pt;">Result</span></h3>
<p>My deployed apps can be found <a href="http://gerinberg.com/shiny">here</a>.</p>
<p>I am wondering how and on which platform you are hosting your shiny apps. Please let me know if you have any questions about this article.</p>
<p>The post <a href="https://gerinberg.com/2017/04/12/deploying-shiny-apps-on-aws-using-docker/">Deploying shiny apps on aws using docker</a> appeared first on <a href="https://gerinberg.com">Ger Inberg</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://gerinberg.com/2017/04/12/deploying-shiny-apps-on-aws-using-docker/feed/</wfw:commentRss>
			<slash:comments>2</slash:comments>
		
		
			</item>
		<item>
		<title>I am a top rated data science freelancer!</title>
		<link>https://gerinberg.com/2017/03/12/top-rated-freelancer/</link>
					<comments>https://gerinberg.com/2017/03/12/top-rated-freelancer/#comments</comments>
		
		<dc:creator><![CDATA[Ger]]></dc:creator>
		<pubDate>Sun, 12 Mar 2017 11:36:35 +0000</pubDate>
				<category><![CDATA[data science]]></category>
		<guid isPermaLink="false">http://gerinberg.com/?p=751</guid>

					<description><![CDATA[<p>Recently, I got the &#8216;top rated&#8217; status on Upwork. Upwork is a global freelancing platform where businesses and freelancers can meet and work remotely on projects. You might wonder what this top rated status is all about? Advantages Well, this status has a couple [&#8230;]</p>
<p>The post <a href="https://gerinberg.com/2017/03/12/top-rated-freelancer/">I am a top rated data science freelancer!</a> appeared first on <a href="https://gerinberg.com">Ger Inberg</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>Recently, I got the &#8216;top rated&#8217; status on <a href="https://www.upwork.com/" target="_blank">Upwork</a>. Upwork is a global freelancing platform where businesses and freelancers can meet and work remotely on projects. You might wonder what this top rated status is all about.</p>
<hr>
<h5>Advantages</h5>
<p>Well, this status has a couple of advantages for freelancers, such as:</p>
<ul>
<li>A badge on your Upwork freelancer profile</li>
<li>Personalized tips to strengthen your profile</li>
<li>Exclusive invitations to submit proposals</li>
<li>Private access to the Top Rated Community forum</li>
<li>An exclusive job-digest email to make it easier for you to find attractive opportunities</li>
</ul>
<p>The most important requirement to get the status is to keep a job success score of at least 90% for a couple of months. This score is, as you might expect, calculated as the number of projects that have been finished successfully divided by the total number of projects.</p>
<p>After I shared my update, I got questions from some people on how I got this done, or &#8220;how I got any project at all?&#8221;. Yes, I know there is quite a lot of competition from all over the world and it&#8217;s not easy.</p>
<h5>Tips</h5>
<p>However, I can give you my tips to get started on Upwork:</p>
<ul>
<li>Have a complete profile and set your availability.</li>
<li>Respond to job invitations, preferably within 24 hours. Upwork keeps statistics about this, which you can view in &#8220;My Stats&#8221;.</li>
<li>Only take on projects you are (pretty) sure you can do. It might be tempting to submit a proposal to a project that looks really cool, especially if you are low on budget. But how sure are you that you can really do it? Even if you can, if the project will take a lot longer than the client requested (e.g. because you have to update your knowledge), it might be better not to do it, since the client will not be totally satisfied and the review won&#8217;t be great.</li>
<li>Be modest in your salary requirements for your first projects. For me it was not so easy to get my first projects, since I didn&#8217;t have any reviews that clients could use as a reference. I realized I needed some good reviews, so I applied to a lot of projects with a rate lower than my ideal rate, to increase the chance that a client would offer the project to me. I think it worked, since I got my first projects, and after that it became easier to get other projects as well.</li>
<li>Communicate with your client regularly. Some clients want to communicate daily, others weekly, so this depends. The same applies to the type of communication: some prefer Upwork, others email or Skype. You can just ask the client about this, so both of you are sure.</li>
</ul>
<p>Hope you find this useful. Please let me know about your experiences or questions!</p>
<p>The post <a href="https://gerinberg.com/2017/03/12/top-rated-freelancer/">I am a top rated data science freelancer!</a> appeared first on <a href="https://gerinberg.com">Ger Inberg</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://gerinberg.com/2017/03/12/top-rated-freelancer/feed/</wfw:commentRss>
			<slash:comments>1</slash:comments>
		
		
			</item>
		<item>
		<title>Reinforcement learning in smartcabs</title>
		<link>https://gerinberg.com/2016/12/22/reinforcement-learning-in-self-driving-cars/</link>
		
		<dc:creator><![CDATA[Ger]]></dc:creator>
		<pubDate>Thu, 22 Dec 2016 04:13:45 +0000</pubDate>
				<category><![CDATA[data science]]></category>
		<guid isPermaLink="false">http://gerinberg.com/?p=712</guid>

					<description><![CDATA[<p>Reinforcement learning is an area in Machine Learning which is quite different from supervised or unsupervised learning. This is because it is not about building a model based upon a dataset with given features and label(s). It is about software agents that take actions in [&#8230;]</p>
<p>The post <a href="https://gerinberg.com/2016/12/22/reinforcement-learning-in-self-driving-cars/">Reinforcement learning in smartcabs</a> appeared first on <a href="https://gerinberg.com">Ger Inberg</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>Reinforcement learning is an area in Machine Learning which is quite different from supervised or unsupervised learning. This is because it is not about building a model based upon a dataset with given features and label(s). It is about software agents that take actions in a certain environment to maximize a reward. It has applications in game theory, operations research and genetic algorithms. It is also used in something that is changing the way we will transport ourselves in the future&#8230; self driving cars!</p>
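<p>To give a flavour of how such an agent learns: one classic algorithm is Q-learning (a common choice for this kind of problem, though see the linked report for the exact approach used), which after every step nudges the estimated value of taking action a in state s towards the observed reward plus the discounted value of the best next action:</p>
<pre>Q(s, a) &lt;- Q(s, a) + alpha * ( r + gamma * max over a' of Q(s', a') - Q(s, a) )

# s, a  : current state and chosen action
# r, s' : observed reward and resulting next state
# alpha : learning rate; gamma: discount factor for future rewards</pre>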
<p>As part of my Machine Learning study, I have used reinforcement learning to create an agent for a self driving car. Please see the report on my <a href="https://htmlpreview.github.io/?https://github.com/ginberg/smartcab/blob/master/report.html" target="_blank">github</a>.</p>
<p><a href="http://gerinberg.com/wp-content/uploads/2016/12/smartcab2.png"><img decoding="async" class="alignnone wp-image-721 size-thumbnail" src="http://gerinberg.com/wp-content/uploads/2016/12/smartcab2-150x150.png" width="150" height="150" /></a><a href="http://gerinberg.com/wp-content/uploads/2016/12/smartcab.png"><img decoding="async" class="alignnone wp-image-719 size-thumbnail" src="http://gerinberg.com/wp-content/uploads/2016/12/smartcab-150x150.png" width="150" height="150" /></a></p>
<p>The post <a href="https://gerinberg.com/2016/12/22/reinforcement-learning-in-self-driving-cars/">Reinforcement learning in smartcabs</a> appeared first on <a href="https://gerinberg.com">Ger Inberg</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>What is apache spark / guest writer</title>
		<link>https://gerinberg.com/2016/11/10/what-is-apache-spark-guest-writer/</link>
		
		<dc:creator><![CDATA[Ger]]></dc:creator>
		<pubDate>Thu, 10 Nov 2016 04:55:15 +0000</pubDate>
				<category><![CDATA[data science]]></category>
		<guid isPermaLink="false">http://gerinberg.com/?p=726</guid>

					<description><![CDATA[<p>A while ago, I was asked to write some data science articles. One of these, an introduction to Apache Spark, can be found on Simplilearn. Enjoy!</p>
<p>The post <a href="https://gerinberg.com/2016/11/10/what-is-apache-spark-guest-writer/">What is apache spark / guest writer</a> appeared first on <a href="https://gerinberg.com">Ger Inberg</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>A while ago, I was asked to write some data science articles. One of these, an introduction to Apache Spark, can be found on <a href="https://www.simplilearn.com/apache-spark-guide-for-newbies-article" target="_blank">Simplilearn</a>. Enjoy!</p>
<p>The post <a href="https://gerinberg.com/2016/11/10/what-is-apache-spark-guest-writer/">What is apache spark / guest writer</a> appeared first on <a href="https://gerinberg.com">Ger Inberg</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>GridSearchCV with Apache Spark</title>
		<link>https://gerinberg.com/2016/10/29/gridsearchcv-with-apache-spark/</link>
		
		<dc:creator><![CDATA[Ger]]></dc:creator>
		<pubDate>Sat, 29 Oct 2016 06:17:18 +0000</pubDate>
				<category><![CDATA[data science]]></category>
		<guid isPermaLink="false">http://gerinberg.com/?p=695</guid>

					<description><![CDATA[<p>This article continues where I left off with Classification with machine learning. Apache Spark is a very popular framework in big data processing. The main reason for this: it&#8217;s fast! It can be used to parallelize your task on a cluster so it will [&#8230;]</p>
<p>The post <a href="https://gerinberg.com/2016/10/29/gridsearchcv-with-apache-spark/">GridSearchCV with Apache Spark</a> appeared first on <a href="https://gerinberg.com">Ger Inberg</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>This article continues where I left off with <a href="http://gerinberg.com/2016/10/29/classification-with-machine-learning/">Classification with machine learning</a>.</p>
<h4>Apache Spark</h4>
<p><a href="http://spark.apache.org/">Apache Spark</a> is a very popular framework in big data processing. The main reason for this: it&#8217;s fast! It can be used to parallelize your task on a cluster so it will be completed earlier than if you would execute it serially.</p>
<p>It also can be used on a single computer, which has the advantage that you can use all in the cores in your computer.  The first solution I have written for the classification was using the sklearn package of python. Sklearn also provides functionality to do multicore processing on a single machine via <a href="https://pythonhosted.org/joblib/">joblib</a>, but since my client wanted to use explicitly Spark I have used that.</p>
<p>I have been looking into how to migrate the sklearn code to Spark ML I found out there are some initiatives already to run a sklearn solution on Spark. Because the most expensive part of the code is to find the hyperparameters with GridSearchCV, it&#8217;s important to parallelize this functionality. Databricks, the company behind the founder of Spark, has developed an <a href="https://github.com/databricks/spark-sklearn">integration package</a> for sklearn on Spark. Unfortunately, it didn&#8217;t work with my code. It was caused by the fact that I used a custom cross validator, StratifiedShuffleSplit, and I need this in order to keep balanced sample classes. I only had to make a slight modification to the code and published this on my <a href="https://github.com/ginberg/spark-sklearn">github</a>.</p>
<p>The python script can be submitted to Spark with the spark-submit command, since Spark 2.0 the pyspark command is not supported anymore to execute scripts. Spark-submit takes the python script as argument as well as some optional arguments. In the example I submit it to my local computer and specify it should use 8 cores.</p>
<pre>spark-submit --master local[8] build_model_spark.py</pre>
<p>Before my modifications, it took my laptop about 14 minutes to build the model on the whole dataset. With Spark this was reduced to less than 4 minutes, which is a pretty good improvement! My client was happy with the result and gave me a good review, so I hope this results in more ML projects!</p>
<p><a href="http://gerinberg.com/wp-content/uploads/2016/10/perf_plot.png"><img loading="lazy" decoding="async" class="alignnone size-medium wp-image-706" src="http://gerinberg.com/wp-content/uploads/2016/10/perf_plot-300x225.png" alt="perf_plot" width="300" height="225" srcset="https://gerinberg.com/wp-content/uploads/2016/10/perf_plot-300x225.png 300w, https://gerinberg.com/wp-content/uploads/2016/10/perf_plot-768x576.png 768w, https://gerinberg.com/wp-content/uploads/2016/10/perf_plot-230x173.png 230w, https://gerinberg.com/wp-content/uploads/2016/10/perf_plot-350x263.png 350w, https://gerinberg.com/wp-content/uploads/2016/10/perf_plot-480x360.png 480w, https://gerinberg.com/wp-content/uploads/2016/10/perf_plot.png 800w" sizes="auto, (max-width: 300px) 100vw, 300px" /></a><a href="http://gerinberg.com/wp-content/uploads/2016/10/upwork.png"><img loading="lazy" decoding="async" class="alignnone size-medium wp-image-697" src="http://gerinberg.com/wp-content/uploads/2016/10/upwork-300x260.png" alt="upwork" width="300" height="260" srcset="https://gerinberg.com/wp-content/uploads/2016/10/upwork-300x260.png 300w, https://gerinberg.com/wp-content/uploads/2016/10/upwork-230x199.png 230w, https://gerinberg.com/wp-content/uploads/2016/10/upwork-350x303.png 350w, https://gerinberg.com/wp-content/uploads/2016/10/upwork.png 467w" sizes="auto, (max-width: 300px) 100vw, 300px" /></a></p>
<p>The post <a href="https://gerinberg.com/2016/10/29/gridsearchcv-with-apache-spark/">GridSearchCV with Apache Spark</a> appeared first on <a href="https://gerinberg.com">Ger Inberg</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Classification with machine learning</title>
		<link>https://gerinberg.com/2016/10/29/classification-with-machine-learning/</link>
		
		<dc:creator><![CDATA[Ger]]></dc:creator>
		<pubDate>Sat, 29 Oct 2016 05:07:47 +0000</pubDate>
				<category><![CDATA[data science]]></category>
		<guid isPermaLink="false">http://gerinberg.com/?p=688</guid>

					<description><![CDATA[<p>Classification with machine learning: that was the title of one of the last projects I did via Upwork. In this post I will explain what the project was about and what I did. Click-through rate (CTR) is the ratio of users who click [&#8230;]</p>
<p>The post <a href="https://gerinberg.com/2016/10/29/classification-with-machine-learning/">Classification with machine learning</a> appeared first on <a href="https://gerinberg.com">Ger Inberg</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>Classification with machine learning: that was the title of one of the last projects I did via Upwork. In this post I will explain what the project was about and what I did.</p>
<p><em>Click-through rate</em> (<em>CTR</em>) is the ratio of users who click on a specific link to the number of total users who view the page that contains the link. For a marketing company, a high CTR normally means an effective campaign, since many people click on the link and thus visit the website of customer X. This project involved predicting the probability that a user would click on a given advertisement.</p>
<p>There were some requirements on the technical side: python in combination with jupyter notebook. For the machine learning part there were 2 options, the first being sklearn, the second (preferred) being Spark ML. Furthermore, a stacking of Gradient Boosted Trees (GBT) with Logistic Regression (LR) had to be used, in combination with GridSearchCV to find the hyperparameters, a modelling approach that proved effective at <a href="https://pdfs.semanticscholar.org/daf9/ed5dc6c6bad5367d7fd8561527da30e9b8dd.pdf">Facebook</a>. A quite challenging job from this perspective, since I hadn&#8217;t used stacking of algorithms that much before.</p>
<h4>Data / Modelling</h4>
<p>The client provided about 2 GB of data from historical campaigns. This included features like country, browser and campaign-id, and also whether a user had clicked on the link or not. So, a <a href="https://en.wikipedia.org/wiki/Supervised_learning">supervised machine learning</a> challenge, since the result feature (clicked or not) is in the dataset.</p>
<p>I first built the model in sklearn. There were some challenges here, because the data was very unbalanced (a lot more data for users that did not click). One challenge with such a dataset is that you have to choose the right metric to optimize for. Let&#8217;s say you have a dataset in which 1% of the entries has a positive result and 99% a negative result. A simple model that always predicts a negative result would be right in 99% of the cases, not bad! In this situation accuracy, which is defined as the number of right predictions divided by the total number of predictions, would not be a good choice. It would be better to use either <a href="https://en.wikipedia.org/wiki/Precision_and_recall">precision or recall</a>.</p>
<p>In the FB paper, the metric Normalized Entropy (NE) is used. It is defined as the average log loss per impression divided by what the average log loss per impression would be if a model predicted the background click-through rate (CTR) for every impression. The lower this score, the better the predictions the model has created. One other thing to be careful about is that GridSearchCV in sklearn always tries to maximize the given metric, so for a metric like NE the <strong>negative score</strong> should be provided to obtain the best result!</p>
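<p>Written out, with y<sub>i</sub> the observed click label (0 or 1), p<sub>i</sub> the predicted click probability, p the background CTR and N the number of impressions, the definition above becomes:</p>
<pre>NE = ( -1/N * sum_i [ y_i * log(p_i) + (1 - y_i) * log(1 - p_i) ] )
     / ( -( p * log(p) + (1 - p) * log(1 - p) ) )</pre>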
<h4>Comparison of Models</h4>
<p>I have compared LR with GBT and the stacked solution of GBT+LR. The stacked solution proved to be just a little better than GBT on its own. Below are the <a href="https://en.wikipedia.org/wiki/Receiver_operating_characteristic">ROC</a> curves for those models.</p>
<p><a href="http://gerinberg.com/wp-content/uploads/2016/10/roc2.png"><img loading="lazy" decoding="async" class="alignnone size-medium wp-image-691" src="http://gerinberg.com/wp-content/uploads/2016/10/roc2-300x216.png" alt="roc2" width="300" height="216" srcset="https://gerinberg.com/wp-content/uploads/2016/10/roc2-300x216.png 300w, https://gerinberg.com/wp-content/uploads/2016/10/roc2-230x165.png 230w, https://gerinberg.com/wp-content/uploads/2016/10/roc2-350x252.png 350w, https://gerinberg.com/wp-content/uploads/2016/10/roc2-480x345.png 480w, https://gerinberg.com/wp-content/uploads/2016/10/roc2.png 544w" sizes="auto, (max-width: 300px) 100vw, 300px" /></a></p>
<p>After the model was finalized, I looked into Apache Spark and how to tweak the performance. In my next article I will tell you something about that.</p>
<p>The post <a href="https://gerinberg.com/2016/10/29/classification-with-machine-learning/">Classification with machine learning</a> appeared first on <a href="https://gerinberg.com">Ger Inberg</a>.</p>
]]></content:encoded>
					
		
		
			</item>
	</channel>
</rss>
