<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Ger Inberg</title>
	<atom:link href="https://gerinberg.com/feed/" rel="self" type="application/rss+xml" />
	<link>https://gerinberg.com/</link>
	<description>data science developer</description>
	<lastBuildDate>Tue, 03 Sep 2024 12:41:39 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.8.1</generator>

<image>
	<url>https://gerinberg.com/wp-content/uploads/2017/05/favicon-150x150.jpg</url>
	<title>Ger Inberg</title>
	<link>https://gerinberg.com/</link>
	<width>32</width>
	<height>32</height>
</image> 
	<item>
		<title>FeatureExtraction on CRAN</title>
		<link>https://gerinberg.com/2024/09/03/featureextraction-on-cran/</link>
		
		<dc:creator><![CDATA[Ger]]></dc:creator>
		<pubDate>Tue, 03 Sep 2024 12:40:01 +0000</pubDate>
				<category><![CDATA[data science]]></category>
		<category><![CDATA[software engineering]]></category>
		<guid isPermaLink="false">https://gerinberg.com/?p=1861</guid>

					<description><![CDATA[<p>Feature engineering is a crucial step in the data science process, often making the difference between a good model and a great one. It involves transforming raw data into meaningful features that can improve the performance of predictive models. For those working in R, the [&#8230;]</p>
<p>The post <a href="https://gerinberg.com/2024/09/03/featureextraction-on-cran/">FeatureExtraction on CRAN</a> appeared first on <a href="https://gerinberg.com">Ger Inberg</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>Feature engineering is a crucial step in the data science process, often making the difference between a good model and a great one. It involves transforming raw data into meaningful features that can improve the performance of predictive models. For those working in R, the <strong>FeatureExtraction</strong> package on <a href="https://cran.r-project.org/web/packages/FeatureExtraction/index.html" target="_blank" rel="noopener">CRAN</a> offers a powerful and flexible toolset for automating and streamlining this process.</p>
<p>Originally developed as part of the OHDSI (Observational Health Data Sciences and Informatics) ecosystem, <strong>FeatureExtraction</strong> is particularly well-suited for working with large-scale observational data. In this post, we&#8217;ll explore the package in detail, focusing on its core features, practical applications, and a step-by-step example to help you get started.</p>
<h3>Key Features and Capabilities</h3>
<h4>1. <strong>Automated Feature Generation</strong></h4>
<p><strong>FeatureExtraction</strong> excels at automatically generating a wide range of features from raw data. These features include basic demographic variables like age and gender, as well as more complex attributes derived from longitudinal data, such as the frequency of medical visits or the presence of certain conditions over time.</p>
<h4>2. <strong>Temporal Features</strong></h4>
<p>Temporal data, such as patient histories or time-dependent events, are common in many fields, especially healthcare. <strong>FeatureExtraction</strong> handles temporal data adeptly, allowing users to define time windows relative to key events (e.g., diagnosis dates). This feature is crucial for creating time-sensitive covariates that capture trends and patterns in data over specified periods.</p>
<h4>3. <strong>Custom Feature Extraction</strong></h4>
<p>While the package offers extensive automated capabilities, it also allows for custom feature extraction. Users can define custom covariates and specify how these should be generated from the underlying data, incorporating domain-specific knowledge into the feature engineering process.</p>
<h4>4. <strong>Scalability</strong></h4>
<p>Feature engineering can become computationally intensive, particularly with large datasets. <strong>FeatureExtraction</strong> is designed for scalability: covariates are constructed by SQL that runs inside the database, and the results are stored on disk rather than held in memory, so feature extraction remains efficient even with big data.</p>
<h4>5. <strong>Integration with OHDSI Tools</strong></h4>
<p>As part of the OHDSI ecosystem, <strong>FeatureExtraction</strong> integrates seamlessly with other tools like <strong>PatientLevelPrediction</strong> and <strong>CohortMethod</strong>, enabling a smooth workflow from data extraction to model building and analysis.</p>
<h3>Getting Started</h3>
<p>Installing the <strong>FeatureExtraction</strong> package is straightforward. You can install it directly from CRAN using the following command:</p>
<p><code>install.packages("FeatureExtraction")</code></p>
<p>Load the package in your R session:</p>
<p><code>library(FeatureExtraction)</code></p>
<div>
<h3>Practical Example: Creating Covariates Based on Other Cohorts</h3>
<p>To illustrate how <strong>FeatureExtraction</strong> can be applied, let&#8217;s walk through an example where we create covariates based on the presence of patients in other cohorts. This is particularly useful in studies where the relationship between different conditions or treatments over time is of interest.</p>
<h4>Step 1: Setting Up the Database Connection</h4>
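<p>First, we need to define the connection to our CDM-compliant database. The <code>createConnectionDetails</code> function comes from the <strong>DatabaseConnector</strong> package, which is installed as a dependency of FeatureExtraction:</p>
<p><code>connectionDetails &lt;- DatabaseConnector::createConnectionDetails(dbms = "postgresql",</code><br />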
<code>server = "your_server",</code><br />
<code>user = "your_username",</code><br />
<code>password = "your_password")</code><br />
<code>cdmDatabaseSchema &lt;- "your_cdm_schema"</code><br />
<code>cohortDatabaseSchema &lt;- "your_cohort_schema"</code></p>
<h4>Step 2: Define the Cohorts of Interest</h4>
<p>Assume we have a cohort of patients with diabetes and another cohort with a history of cardiovascular disease. We want to create a feature that indicates whether a patient in the diabetes cohort has a prior history of cardiovascular disease.</p>
<p><code># Define cohort IDs (these would be predefined in your database)</code><br />
<code>diabetesCohortId &lt;- 1</code><br />
<code>cvdCohortId &lt;- 2</code></p>
<h4>Step 3: Create the Feature Extraction Settings</h4>
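<p>Next, we define the feature extraction settings. <code>createCohortBasedCovariateSettings</code> takes an analysis ID (999 is an arbitrary choice here) and a table with the ID and name of each covariate cohort. The demographic covariates are requested separately with <code>createCovariateSettings</code>, and both settings objects are combined in a list:</p>
<p><code>cohortCovariateSettings &lt;- createCohortBasedCovariateSettings(analysisId = 999,</code><br />
<code>covariateCohorts = data.frame(cohortId = cvdCohortId, cohortName = "Cardiovascular disease"),</code><br />
<code>valueType = "binary",</code><br />
<code>startDay = -365,</code><br />
<code>endDay = 0)</code><br />
<code>demographicsSettings &lt;- createCovariateSettings(useDemographicsGender = TRUE,</code><br />
<code>useDemographicsAge = TRUE)</code><br />
<code>covariateSettings &lt;- list(demographicsSettings, cohortCovariateSettings)</code></p>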
<p>In this example, the <code>startDay</code> and <code>endDay</code> parameters define a time window of one year prior to the cohort&#8217;s index date. This means the feature will reflect whether a patient was in the cardiovascular disease cohort within one year before the index date.</p>
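<p>To make the time window concrete, the base R sketch below shows how <code>startDay</code> and <code>endDay</code> translate into calendar dates. The index date and the event dates are made up for illustration, and the check is a simplification of the idea, not the SQL that FeatureExtraction runs in the database:</p>

```r
# Hypothetical index date of a patient in the diabetes cohort
indexDate <- as.Date("2024-01-15")
startDay <- -365
endDay <- 0

# The window is expressed in days relative to the index date
windowStart <- indexDate + startDay
windowEnd <- indexDate + endDay

# Simplified check: does a (made-up) event date fall inside the window?
inWindow <- function(eventDate) {
  eventDate >= windowStart & eventDate <= windowEnd
}

eventDates <- as.Date(c("2023-06-01", "2022-12-31", "2024-01-15"))
inWindow(eventDates)
```

<p>Changing <code>startDay</code> to, say, -30 would narrow the window to the month before the index date.</p>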
<h4>Step 4: Extract the Features</h4>
<p>Now, we extract the features for the diabetes cohort using the settings we defined:</p>
<p><code>covariateData &lt;- getDbCovariateData(connectionDetails = connectionDetails,</code><br />
<code>cdmDatabaseSchema = cdmDatabaseSchema,</code><br />
<code>cohortDatabaseSchema = cohortDatabaseSchema,</code><br />
<code>cohortTable = "cohort",</code><br />
<code>cohortId = diabetesCohortId,</code><br />
<code>covariateSettings = covariateSettings)</code></p>
<p>This function retrieves the covariate data for the specified cohort, based on the feature extraction settings we provided.</p>
<h4>Step 5: Use the Extracted Features</h4>
<p>The extracted features are now available in the <code>covariateData</code> object, which can be used for further analysis, such as model building or cohort characterization.</p>
</div>
<p><code># Explore the covariate data</code><br />
<code>summary(covariateData)</code></p>
<p>This simple example demonstrates how <strong>FeatureExtraction</strong> can be used to create meaningful features based on different cohorts. The package&#8217;s flexibility and scalability make it a powerful tool for a wide range of applications, from small-scale studies to large observational databases.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>DrugExposure Diagnostics</title>
		<link>https://gerinberg.com/2023/04/01/drugexposurediagnostics/</link>
		
		<dc:creator><![CDATA[Ger]]></dc:creator>
		<pubDate>Sat, 01 Apr 2023 09:32:00 +0000</pubDate>
				<category><![CDATA[data analysis]]></category>
		<category><![CDATA[software engineering]]></category>
		<category><![CDATA[DrugExposureDiagnostics]]></category>
		<guid isPermaLink="false">https://gerinberg.com/?p=1848</guid>

					<description><![CDATA[<p>DrugExposureDiagnostics: A Comprehensive R Package for Assessing Drug Exposure in Clinical Research Drug exposure is an essential aspect of clinical research, as it directly affects the efficacy and safety of drugs. Measuring drug exposure accurately and understanding the factors that influence it is crucial for [&#8230;]</p>
<p>The post <a href="https://gerinberg.com/2023/04/01/drugexposurediagnostics/">DrugExposure Diagnostics</a> appeared first on <a href="https://gerinberg.com">Ger Inberg</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>DrugExposureDiagnostics: A Comprehensive R Package for Assessing Drug Exposure in Clinical Research</p>
<p>Drug exposure is an essential aspect of clinical research, as it directly affects the efficacy and safety of drugs. Measuring drug exposure accurately and understanding the factors that influence it are crucial for clinical decision-making. This is where the R package DrugExposureDiagnostics comes in handy.</p>
<p>As the author of this R package, I am excited to introduce you to this powerful tool for analyzing drug exposure data. Before delving into the package, let&#8217;s first understand what drug exposure is and why it is crucial in clinical research.</p>
<p>Drug exposure refers to the extent to which a drug enters and stays in the body, thereby producing its intended therapeutic effects. Measuring drug exposure accurately involves capturing key metrics, such as drug concentrations, AUC, Cmax, and Tmax. By doing so, researchers can evaluate drug efficacy and safety and make informed decisions regarding dosing and administration.</p>
<p>One way to capture drug exposure data is through the Observational Medical Outcomes Partnership (OMOP) Common Data Model (CDM), developed by the Observational Health Data Sciences and Informatics (OHDSI) community. The OMOP CDM standardizes and integrates data from various sources, allowing for large-scale observational studies and analysis.</p>
<p>DrugExposureDiagnostics is a comprehensive tool for analyzing drug exposure data in the OMOP CDM format. The package includes functions for calculating various exposure metrics, handling missing data, and summarizing data at different levels, such as by subject or visit. Additionally, it provides tools for identifying outliers and comparing exposure between groups.</p>
<p>DrugExposureDiagnostics has been extensively tested and validated, ensuring that it produces accurate results. The package has been released on the <a href="https://cran.r-project.org/web/packages/DrugExposureDiagnostics/index.html">Comprehensive R Archive Network</a> (CRAN), making it easily accessible to R users worldwide. To use the package, simply install it using the <code>install.packages()</code> function in R and load it using the <code>library()</code> function.</p>
<p>If you are interested in learning more about DrugExposureDiagnostics or trying it out for yourself, visit the <a href="https://github.com/darwin-eu/DrugExposureDiagnostics">package&#8217;s GitHub repository</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Multi page shiny apps</title>
		<link>https://gerinberg.com/2022/08/11/multi-page-shiny-apps/</link>
		
		<dc:creator><![CDATA[Ger]]></dc:creator>
		<pubDate>Thu, 11 Aug 2022 18:09:05 +0000</pubDate>
				<category><![CDATA[data viz]]></category>
		<category><![CDATA[shiny]]></category>
		<category><![CDATA[web development]]></category>
		<category><![CDATA[donation]]></category>
		<category><![CDATA[open source]]></category>
		<category><![CDATA[R]]></category>
		<guid isPermaLink="false">https://gerinberg.com/?p=1834</guid>

					<description><![CDATA[<p>Web applications can have rich functionality nowadays. For example, the website of an e-commerce shop has a page about the products it sells, a page about its terms and conditions, a shopping cart page and an order page. Furthermore, it can handle different HTTP requests from [&#8230;]</p>
<p>The post <a href="https://gerinberg.com/2022/08/11/multi-page-shiny-apps/">Multi page shiny apps</a> appeared first on <a href="https://gerinberg.com">Ger Inberg</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>Web applications can have rich functionality nowadays. For example, the website of an e-commerce shop has a page about the products it sells, a page about its terms and conditions, a shopping cart page and an order page. Furthermore, it can handle different HTTP requests from the user. A GET request is used to retrieve a page (e.g. products) from the server, whereas a POST request is used to send information to the server (e.g. to place an order).</p>
<h4>Problem</h4>
<p>Shiny apps are, by default, a bit limited from this perspective. They can only handle GET requests unless you are a technical expert in this field; see <a href="https://stackoverflow.com/questions/25297489/accept-http-request-in-r-shiny-application" target="_blank" rel="noopener">this Stack Overflow post</a>. Furthermore, a Shiny app can have just one entry point, &#8220;/&#8221;, so you can&#8217;t have another entry point such as &#8220;/page2&#8221;. Thus, the e-commerce shop is not possible out of the box in Shiny.</p>
<h4>Solution</h4>
<p>There are multiple solutions to support multiple pages. The one I have been using for a while is the package <a href="https://github.com/ColinFay/brochure" target="_blank" rel="noopener">brochure</a>, developed by Colin Fay. It is still in development, so you might encounter some issues, but I haven&#8217;t found any major bugs yet. A brochure app consists of a series of pages, each defined by an endpoint/path, a UI and a server function. Thus each page has its own Shiny session, its own UI, and its own server! This is important to keep in mind. A separate session for each page has some advantages but also some disadvantages (e.g. how to pass user data between pages?). A very simple brochureApp looks like this:</p>
<pre class="highlight"><code><span class="n">library(shiny)
library(brochure)

brochureApp</span><span class="p">(</span>
  <span class="c1"># First page</span>
  <span class="n">page</span><span class="p">(</span>
    <span class="n">href</span> <span class="o">=</span> <span class="s2">"/"</span><span class="p">,</span>
    <span class="n">ui</span> <span class="o">=</span> <span class="n">fluidPage</span><span class="p">(</span>
      <span class="n">h1</span><span class="p">(</span><span class="s2">"My first page"</span><span class="p">),</span> 
      <span class="n">plotOutput</span><span class="p">(</span><span class="s2">"plot"</span><span class="p">)</span>
    <span class="p">),</span>
    <span class="n">server</span> <span class="o">=</span> <span class="k">function</span><span class="p">(</span><span class="n">input</span><span class="p">,</span> <span class="n">output</span><span class="p">,</span> <span class="n">session</span><span class="p">){</span>
      <span class="n">output</span><span class="o">$</span><span class="n">plot</span> <span class="o">&lt;-</span> <span class="n">renderPlot</span><span class="p">({</span>
        <span class="n">plot</span><span class="p">(</span><span class="n">iris</span><span class="p">)</span>
      <span class="p">})</span>
    <span class="p">}</span>
  <span class="p">),</span> 
  <span class="c1"># Second page, no server-side function</span>
  <span class="n">page</span><span class="p">(</span>
    <span class="n">href</span> <span class="o">=</span> <span class="s2">"/page2"</span><span class="p">,</span> 
    <span class="n">ui</span> <span class="o">=</span>  <span class="n">fluidPage</span><span class="p">(</span>
      <span class="n">h1</span><span class="p">(</span><span class="s2">"My second page"</span><span class="p">)</span>
    <span class="p">)</span>
  <span class="p">)</span>
<span class="p">)</span></code></pre>
<h4>Donation app</h4>
<p>Coming back to the e-commerce shop example, I have developed an app where one can sponsor me for my open source work on R packages. The app integrates with Stripe to make a donation, and it has a thank-you page and an error page. When calling Stripe you have to provide the two endpoints for these pages, and by using brochure I am able to set up these endpoints. See the app on <a href="https://ginberg.shinyapps.io/donate/" target="_blank" rel="noopener">shinyapps.io</a>, and of course I would appreciate it if you use the app! :-)</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Speed skating viz updated</title>
		<link>https://gerinberg.com/2021/12/30/speed-skating-viz-updated/</link>
		
		<dc:creator><![CDATA[Ger]]></dc:creator>
		<pubDate>Thu, 30 Dec 2021 14:54:12 +0000</pubDate>
				<category><![CDATA[data viz]]></category>
		<category><![CDATA[beijing]]></category>
		<category><![CDATA[d3js]]></category>
		<category><![CDATA[olympic games]]></category>
		<category><![CDATA[R]]></category>
		<category><![CDATA[speedskating]]></category>
		<guid isPermaLink="false">https://gerinberg.com/?p=1817</guid>

					<description><![CDATA[<p>Speed skating is one of my favorite sports to practice and to watch. This winter the Winter Olympics will be held in Beijing, China.  Will the Dutch be as successful as they were in Sochi and PyeongChang?  How many medals will the Chinese win? Four [&#8230;]</p>
<p>The post <a href="https://gerinberg.com/2021/12/30/speed-skating-viz-updated/">Speed skating viz updated</a> appeared first on <a href="https://gerinberg.com">Ger Inberg</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<div class="olympic-section-wrapper">
<div id="olympic-introduction">
<p class="bold-start">Speed skating is one of my favorite sports to practice and to watch. This winter the Winter Olympics will be held in Beijing, China.  Will the Dutch be as successful as they were in Sochi and PyeongChang?  How many medals will the Chinese win?</p>
<p><a href="http://gerinberg.com/2017/11/04/speed-skating-olympic-medalists/">Four years ago</a>, I created a visualization about past medal winners at the Olympic Games. I have updated it now with the results of the games in 2018 at PyeongChang.</p>
</div>
</div>
<p>See the <a href="https://gerinberg.com/speedskating">live version</a>. Let me know if you like it or you are interested in some other charts. Enjoy the Winter Olympics!</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Catchment Area Research Dashboard</title>
		<link>https://gerinberg.com/2021/07/13/catchment-area-research-dashboard/</link>
		
		<dc:creator><![CDATA[Ger]]></dc:creator>
		<pubDate>Tue, 13 Jul 2021 08:04:00 +0000</pubDate>
				<category><![CDATA[data viz]]></category>
		<guid isPermaLink="false">https://gerinberg.com/?p=1800</guid>

					<description><![CDATA[<p>In healthcare, the catchment area is the area served by a hospital or medical centre. The Rutgers Cancer Institute of New Jersey has one main goal: to help individuals fight cancer. More specifically, they are targeting cancer with precision medicine, immunotherapy and clinical trials, next to providing advanced cancer care [&#8230;]</p>
<p>The post <a href="https://gerinberg.com/2021/07/13/catchment-area-research-dashboard/">Catchment Area Research Dashboard</a> appeared first on <a href="https://gerinberg.com">Ger Inberg</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>In healthcare, the catchment area is the area served by a hospital or medical centre. The <a href="https://www.cinj.org/" target="_blank" rel="noopener">Rutgers Cancer Institute of New Jersey</a> has one main goal: to help individuals fight cancer. More specifically, they are targeting cancer with precision medicine, immunotherapy and clinical trials, next to providing advanced cancer care to adults and children.</p>
<p>To serve patients as well as they can, researchers need as much quality data as possible. Surveillance, Tracking and Reporting through Informed Data Collection and Engagement (STRIDE) is an interactive data and visualization dashboard. It includes clinical trials enrollment, bio-specimen inventory, tumor registry analytic cases, and catchment area information.</p>
<p>They approached me to improve the user interface of their dashboard, and that&#8217;s what I have been doing! In addition, I have been helping the person who created the initial dashboard with the following concepts:</p>
<ul>
<li>DRY (Don&#8217;t Repeat Yourself): re-use code that you have already written, so you don&#8217;t have to write it again and you end up with less code to maintain.</li>
<li>Reproducible results. When I tried to run the initial dashboard, it didn&#8217;t work because some packages were not imported and it was not clear which versions of those packages I should use. I have added <a href="https://rstudio.github.io/renv/articles/renv.html" target="_blank" rel="noopener">renv</a> for dependency management; it adds a lockfile to your project that lists all packages and their versions. It&#8217;s easy to set up and it&#8217;s worth it!</li>
<li>Interactive charts. The initial charts were made with the package ggplot2, a nice package for data visualization that offers many chart types and display options. However, its charts are not interactive, and that&#8217;s what most people want and expect in a dashboard.</li>
</ul>
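<p>As a small illustration of the DRY principle mentioned above, here is a base R sketch. The data and the helper name <code>toPercent</code> are made up for this example and are not taken from the STRIDE dashboard:</p>

```r
# Repetitive version: the same steps are copied for every column
# trials_pct <- round(100 * trials / sum(trials), 1)
# cases_pct  <- round(100 * cases / sum(cases), 1)

# DRY version: write the logic once and re-use it
toPercent <- function(x, digits = 1) {
  round(100 * x / sum(x), digits)
}

# Made-up counts per region, for illustration only
counts <- data.frame(
  trials = c(10, 30, 60),
  cases = c(25, 25, 50)
)

# Apply the same helper to every column
percentages <- as.data.frame(lapply(counts, toPercent))
percentages
```

<p>If the rounding or the calculation has to change later, only the helper function needs to be edited.</p>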
<p>See below for some screenshots of this app.</p>
<p><div class="envira-gallery-feed-output"><img decoding="async" class="envira-gallery-feed-image" src="https://gerinberg.com/wp-content/uploads/2021/10/dashboard1-1024x522-640x480.png" title="dashboard1" alt="" /></div></p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>CITO public analysis</title>
		<link>https://gerinberg.com/2021/03/12/cito-public-analysis/</link>
		
		<dc:creator><![CDATA[Ger]]></dc:creator>
		<pubDate>Fri, 12 Mar 2021 07:24:00 +0000</pubDate>
				<category><![CDATA[data analysis]]></category>
		<category><![CDATA[data viz]]></category>
		<guid isPermaLink="false">https://gerinberg.com/?p=1750</guid>

					<description><![CDATA[<p>CITO is an institute in the Netherlands that supports governments and schools so that they can develop world-class testing and monitoring systems to complete their educational programs. They have a lot of data regarding testing scores and it could be interesting to combine this data [&#8230;]</p>
<p>The post <a href="https://gerinberg.com/2021/03/12/cito-public-analysis/">CITO public analysis</a> appeared first on <a href="https://gerinberg.com">Ger Inberg</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p><a href="https://www.cito.nl/" target="_blank" rel="noopener">CITO</a> is an institute in the Netherlands that supports governments and schools so that they can develop world-class testing and monitoring systems to complete their educational programs. They have a lot of data regarding testing scores and it could be interesting to combine this data with public data. For example, are testing scores of children living in deprived areas worse than average?</p>
<h5>Exploratory Analysis</h5>
<p>Exploratory data analysis (EDA) is used by data scientists to analyze and investigate data sets and summarize their main characteristics. This is often done by using data visualization methods. The main purpose of EDA is to help look at data before making any assumptions. For me it&#8217;s one of the nicest parts of data science, since you don&#8217;t know yet what&#8217;s in the data and there will always be surprises. It&#8217;s like being on holiday and exploring an area that you are seeing for the first time :-)</p>
<p>For example, is a certain variable in the data normally distributed or not? Is there any missing data, or are there duplicated values? In my experience, in most cases there is missing and duplicated data. We need to fix these issues before we can do the real analysis. This phase is called data cleaning; you might have heard about it before.</p>
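<p>The checks described above can be sketched in a few lines of base R. The data frame below is made up for illustration and is not the CITO data; the Shapiro-Wilk test is just one possible normality check:</p>

```r
# Made-up test scores with one duplicated row and one missing value
scores <- data.frame(
  studentId = c(1, 2, 3, 3, 4, 5),
  score = c(530, 541, 535, 535, NA, 537)
)

# How many values are missing per column?
missingPerColumn <- colSums(is.na(scores))

# How many rows are exact duplicates of an earlier row?
duplicatedRows <- sum(duplicated(scores))

# Data cleaning: drop duplicated rows and rows without a score
cleaned <- scores[!duplicated(scores) & !is.na(scores$score), ]

# Quick numeric look at the distribution of the cleaned scores
summary(cleaned$score)

# One possible normality check (needs at least 3 observations)
shapiro.test(cleaned$score)
```

<p>On real data, a histogram of the scores is usually a good first visual check next to these numbers.</p>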
<h5>Representativeness Analyses</h5>
<p>In general, a representative sample is a group or set chosen from a larger statistical population that adequately replicates the larger group according to whatever characteristic or quality is under study. In the case of CITO, we would like to know whether the sample data set has more or less the same characteristics regarding scores as the total data set. For example, are the average and standard deviation of the sample data set close to those of the total data set? I have plotted the distributions of the two data sets in a single chart in order to compare them. In the subtitle one can find the average, standard deviation and the median.</p>
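<p>A comparison like this could look roughly as follows in base R. The scores are simulated and the helper <code>describe</code> is hypothetical; the actual analysis is in the repository and may be set up differently:</p>

```r
# Simulated scores: a "total" data set and a sample drawn from it
set.seed(42)
totalScores <- rnorm(10000, mean = 535, sd = 10)
sampleScores <- sample(totalScores, size = 500)

# Summary statistics that could go in a chart subtitle
describe <- function(x) {
  c(mean = mean(x), sd = sd(x), median = median(x))
}

comparison <- rbind(
  total = describe(totalScores),
  sample = describe(sampleScores)
)
round(comparison, 2)

# Overlay both distributions in a single chart
plot(density(totalScores), main = "Total vs. sample scores", xlab = "Score")
lines(density(sampleScores), lty = 2)
legend("topright", legend = c("total", "sample"), lty = c(1, 2))
```

<p>If the two density curves and the summary statistics are close, the sample is a reasonable stand-in for the total data set with respect to scores.</p>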
<p>Below you can find some of the charts I made for both the EDA and the representativeness analyses. The code is available in a public repository on <a href="https://github.com/ginberg/cito" target="_blank" rel="noopener">GitHub</a>. It can be run using a Docker container, R and <a href="https://rstudio.github.io/renv/articles/renv.html" target="_blank" rel="noopener">renv</a> for library management.</p>
<p><div class="envira-gallery-feed-output"><img decoding="async" class="envira-gallery-feed-image" src="https://gerinberg.com/wp-content/uploads/2021/06/ex_scores-1-640x480.png" title="Exploratory: scores and score per sex" alt="" /></div></p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>AI Redaction Application</title>
		<link>https://gerinberg.com/2020/12/15/ai-redaction-application/</link>
		
		<dc:creator><![CDATA[Ger]]></dc:creator>
		<pubDate>Tue, 15 Dec 2020 14:32:00 +0000</pubDate>
				<category><![CDATA[data science]]></category>
		<category><![CDATA[AI]]></category>
		<category><![CDATA[redaction]]></category>
		<guid isPermaLink="false">https://gerinberg.com/?p=1548</guid>

					<description><![CDATA[<p>What is redaction? Redaction is the blacking out or deletion of text in a document. It is intended to allow the selective disclosure of information in a document while keeping other parts of the document secret. It is common within court documents and in the [&#8230;]</p>
<p>The post <a href="https://gerinberg.com/2020/12/15/ai-redaction-application/">AI Redaction Application</a> appeared first on <a href="https://gerinberg.com">Ger Inberg</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<h3>What is redaction?</h3>
<p><span style="color: #202124; font-family: georgia, palatino, serif; font-size: 12pt;">Redaction is the blacking out or deletion of text in a document. It is intended to allow the selective disclosure of information in a document while keeping other parts of the document secret. It is common within court documents and in the government. Categories of redacted items are phone numbers, e-mail addresses, bank account numbers, dates and names. It takes quite some time to manually redact documents, but fortunately AI can help to speed up this process. Natural Language Processing (<a href="https://monkeylearn.com/blog/nlp-ai/" target="_blank" rel="noopener">NLP</a>) is a subfield of AI that studies how to analyze and process a piece of natural text. This technology allows us to extract the keywords from the text.</span></p>
<p><span style="color: #202124; font-family: georgia, palatino, serif; font-size: 12pt;"><a href="https://www.slimmer.ai/" target="_blank" rel="noopener">Slimmer AI</a> develops AI software products that support industries, solve real-world challenges and take professionals into the future. They have developed an API that allows the redaction of PDF files. This API returns the redacted document based on your redaction action (e.g. all phone numbers). I have collaborated with Slimmer AI on building the interface for their new redaction application.</span></p>
<h3>Redaction Application</h3>
<p><span style="color: #202124; font-family: georgia, palatino, serif; font-size: 12pt;">The developed application has the following features:</span></p>
<ul>
<li><span style="color: #202124; font-family: georgia, palatino, serif; font-size: 12pt;">search for keyword(s) in the text; the keyword can be a regular expression</span></li>
<li><span style="color: #202124; font-family: georgia, palatino, serif; font-size: 12pt;">AI search: search for items in a category like phone numbers</span></li>
<li><span style="color: #202124; font-family: georgia, palatino, serif; font-size: 12pt;">select a piece of text in the document </span></li>
<li><span style="color: #202124; font-family: georgia, palatino, serif; font-size: 12pt;">redact the results from the actions above</span></li>
<li><span style="color: #202124; font-family: georgia, palatino, serif; font-size: 12pt;">display the redacted PDF</span></li>
</ul>
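<p><span style="color: #202124; font-family: georgia, palatino, serif; font-size: 12pt;">As a rough illustration of a keyword search with a regular expression, here is a minimal sketch in R, the language used for the other code on this blog. The application itself is built in JavaScript and calls the Slimmer AI API, which is not shown here. The sample text, the simplified e-mail pattern and the [REDACTED] placeholder are all assumptions made for this example.</span></p>
<pre># Hypothetical sample text; not taken from a real document
text &lt;- "Contact jan@example.com or piet.de.vries@example.org for details."

# Simplified e-mail pattern (an assumption; real addresses can be more complex)
email_pattern &lt;- "[[:alnum:]._%+-]+@[[:alnum:].-]+\\.[[:alpha:]]{2,}"

# Find all matches, comparable to the search results listed in the sidebar
matches &lt;- regmatches(text, gregexpr(email_pattern, text))[[1]]
print(matches)

# Replace every match with a placeholder to mimic the redaction step
redacted &lt;- gsub(email_pattern, "[REDACTED]", text)
print(redacted)</pre>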
<p><span style="color: #202124; font-family: georgia, palatino, serif; font-size: 12pt;">Below you see a screenshot of the application. The left sidebar is the search column where the keyword and AI search can be performed. At the bottom of this sidebar, the results of the search are shown. When a user clicks on a result, it is selected for redaction.</span></p>
<p><span style="color: #202124; font-family: georgia, palatino, serif; font-size: 12pt;">The center of the application contains the document. This is the section where the text selection is performed. Once a piece of text is selected a popup appears that asks if the selected text should be redacted or not. </span></p>
<p><span style="color: #202124; font-family: georgia, palatino, serif; font-size: 12pt;">The right column contains the items that have been selected for redaction. When the user pushes the &#8216;Redact All&#8217; button, the document is processed on the backend and the middle section will show the redacted version of the document.</span></p>
<div class="envira-gallery-feed-output"><img decoding="async" class="envira-gallery-feed-image" src="https://gerinberg.com/wp-content/uploads/2020/12/redacto8-1-1024x486-640x480.png" title="AI Redaction Application" alt="AI Redaction Application" /></div>
<p><span style="color: #202124; font-family: georgia, palatino, serif; font-size: 12pt;">The application uses the <a href="https://mozilla.github.io/pdf.js/" target="_blank" rel="noopener">PDF.js</a> library for basic functionality like rendering the PDF and selecting text. It is a free and open source library. There are some commercial libraries that offer more functionality, but they were not needed. The rest of the technology stack for the application includes JavaScript, jQuery, Bootstrap 4 and HTML/CSS.</span></p>
<h3>Improvements</h3>
<p><span style="color: #202124; font-family: georgia, palatino, serif; font-size: 12pt;">The application was meant as a Proof of Concept to see if we could create a user-friendly wrapper for the API. Since the current functionality is working well, the application is being developed further. One item on the improvements list is a rectangle select. In addition to redacting a piece of text on a line, as is possible now, this would allow the user to redact any rectangular area in the document. </span></p>
<p>The post <a href="https://gerinberg.com/2020/12/15/ai-redaction-application/">AI Redaction Application</a> appeared first on <a href="https://gerinberg.com">Ger Inberg</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>My Electricity Balance</title>
		<link>https://gerinberg.com/2020/10/14/my-electricity-balance/</link>
		
		<dc:creator><![CDATA[Ger]]></dc:creator>
		<pubDate>Wed, 14 Oct 2020 09:02:38 +0000</pubDate>
				<category><![CDATA[data viz]]></category>
		<category><![CDATA[electricity]]></category>
		<category><![CDATA[solarpanels]]></category>
		<guid isPermaLink="false">https://gerinberg.com/?p=1530</guid>

					<description><![CDATA[<p>Since the beginning of this year I have solar panels on the roof of my house. The electricity they produce should be more than enough for what I currently use. I bought a few more solar panels than I need now, since I expect to have an electric [&#8230;]</p>
<p>The post <a href="https://gerinberg.com/2020/10/14/my-electricity-balance/">My Electricity Balance</a> appeared first on <a href="https://gerinberg.com">Ger Inberg</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>Since the beginning of this year I have solar panels on the roof of my house. The electricity they produce should be more than enough for what I currently use. I bought a few more solar panels than I need now, since I expect to have an electric car in a couple of years.</p>
<p>My <a href="https://www.samenom.nl/" target="_blank" rel="noopener noreferrer">energy provider</a> reports my monthly usage and production. Based on this I have created the visual below. It shows the consumption (red) versus the production (blue) for each month. The electricity meter was installed on March 1, so there is no data before that date. The solar panels were installed in May. So far this year the numbers are looking very good, since the solar panels produce the most in spring and summer. In the coming months the bars will be more red! See below for my current balance or have a look at <a href="https://gerinberg.com/energy/" target="_blank" rel="noopener noreferrer">my energy</a> for the current status.</p>
<p></p>
<div class="envira-gallery-feed-output"><img decoding="async" class="envira-gallery-feed-image" src="https://gerinberg.com/wp-content/uploads/2020/12/usage-550x400.png" title="My Electricity Balance" alt="My Electricity Balance" /></div>

<p>The post <a href="https://gerinberg.com/2020/10/14/my-electricity-balance/">My Electricity Balance</a> appeared first on <a href="https://gerinberg.com">Ger Inberg</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>eRum lightning talk speaker</title>
		<link>https://gerinberg.com/2020/06/16/erum-lightning-talk-speaker/</link>
		
		<dc:creator><![CDATA[Ger]]></dc:creator>
		<pubDate>Tue, 16 Jun 2020 13:44:44 +0000</pubDate>
				<category><![CDATA[data viz]]></category>
		<category><![CDATA[canvasXpress]]></category>
		<category><![CDATA[erum]]></category>
		<guid isPermaLink="false">https://gerinberg.com/?p=1505</guid>

					<description><![CDATA[<p>This week, the European R Users Meeting (ERUM) is happening. It's a biennial conference that brings the R User Community together and this year it would be held in Milan. I am excited to give a lightning talk about "Reproducible Data Visualization with CanvasXpress"!</p>
<p>The post <a href="https://gerinberg.com/2020/06/16/erum-lightning-talk-speaker/">eRum lightning talk speaker</a> appeared first on <a href="https://gerinberg.com">Ger Inberg</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<div class="wp-block-group"><div class="wp-block-group__inner-container is-layout-flow wp-block-group-is-layout-flow">
<p>This week, the European R Users Meeting (eRum) is happening. It&#8217;s a biennial conference that brings the R User Community together and this year it would be held in Milan. Because of covid-19, the organizers decided to do the whole conference online. I am very happy that the conference didn&#8217;t have to be cancelled, though it&#8217;s too bad we can&#8217;t visit Milan. Furthermore I am excited that I will give a lightning talk about &#8220;Reproducible Data Visualization with CanvasXpress&#8221;!</p>
<p>I have been working with <a href="https://canvasxpress.org/index.html" target="_blank" rel="noopener noreferrer">CanvasXpress</a> for a couple of years, so I know it quite well. I wrote about it before in <a href="https://gerinberg.com/2018/07/21/canvasxpress/" target="_blank" rel="noopener noreferrer">this blog</a>. Many times I have been surprised by the amount of functionality that the library provides, especially all the options that are available after the chart has been created. There&#8217;s a &#8216;Reproducible Research&#8217; sub-menu which has been extended lately with a very cool replay option.</p>
<h4>Replay</h4>
<p>When you make changes to a rendered plot, canvasXpress keeps a history of these changes. You can reset the chart back to its original state and replay the changes you have made using the replay button. This button is the second button from the left in the top menu bar, see the screenshot below. It&#8217;s only available if you have made changes to the plot.</p>
<p><a href="https://gerinberg.com/wp-content/uploads/2020/06/replay.png"><img decoding="async" class="size-medium wp-image-1510" src="https://gerinberg.com/wp-content/uploads/2020/06/replay-300x108.png" alt="" width="300" height="108" srcset="https://gerinberg.com/wp-content/uploads/2020/06/replay-300x108.png 300w, https://gerinberg.com/wp-content/uploads/2020/06/replay-768x277.png 768w, https://gerinberg.com/wp-content/uploads/2020/06/replay-830x300.png 830w, https://gerinberg.com/wp-content/uploads/2020/06/replay-230x83.png 230w, https://gerinberg.com/wp-content/uploads/2020/06/replay-350x126.png 350w, https://gerinberg.com/wp-content/uploads/2020/06/replay-480x173.png 480w, https://gerinberg.com/wp-content/uploads/2020/06/replay.png 1005w" sizes="(max-width: 300px) 100vw, 300px" /></a></p>
<p>The replay creates a new window that displays all the user actions step by step. For each step, more information is available when selecting that step. In the example above, I have removed the x-axis on top in step 1. This relates to the property &#8216;xAxisShow&#8217;.</p>
<p>What if you would like to share this replay with your coworker? Well, you can download the chart in PNG format using the camera icon in the top menu. The downloaded image also contains the user actions. So if your coworker imports your canvasXpress PNG, they can run the same replay as you. Pretty nice huh?</p>
<p>I will give a demonstration of the replay functionality in my presentation this Friday. Please have a look at the eRum <a href="https://2020.erum.io/program/" target="_blank" rel="noopener noreferrer">schedule</a> for the exact time and (online) location.</p>
</div></div>
<p>The post <a href="https://gerinberg.com/2020/06/16/erum-lightning-talk-speaker/">eRum lightning talk speaker</a> appeared first on <a href="https://gerinberg.com">Ger Inberg</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Linear Regression with R</title>
		<link>https://gerinberg.com/2020/06/01/r-linear-regression/</link>
		
		<dc:creator><![CDATA[Ger]]></dc:creator>
		<pubDate>Mon, 01 Jun 2020 11:32:00 +0000</pubDate>
				<category><![CDATA[machine learning]]></category>
		<category><![CDATA[linear regression]]></category>
		<guid isPermaLink="false">https://gerinberg.com/?p=1582</guid>

					<description><![CDATA[<p>  You might have heard about linear regression and machine learning before. Basically linear regression is a simple statistics problem.  But what are the different types of linear regression and how to implement these in R? Introduction to Linear Regression Linear regression is an algorithm [&#8230;]</p>
<p>The post <a href="https://gerinberg.com/2020/06/01/r-linear-regression/">Linear Regression with R</a> appeared first on <a href="https://gerinberg.com">Ger Inberg</a>.</p>
]]></description>
										<content:encoded><![CDATA[<aside class="mashsb-container mashsb-main mashsb-stretched">
<div> </div>
</aside>
<p><span data-preserver-spaces="true">You might have heard about linear regression and machine learning before. Basically, linear regression is a simple statistics problem. But what are the different types of linear regression, and how do you implement them in R?</span></p>
<h4 id="intro"><span data-preserver-spaces="true">Introduction to Linear Regression</span></h4>
<p><span data-preserver-spaces="true">Linear regression is an algorithm developed in the field of statistics. As the name suggests, linear regression assumes a linear relationship between the input variable(s) and a single output variable. The output variable, what you’re predicting, has to be continuous. The output variable can be calculated as a linear combination of the input variable(s).</span></p>
<p><span data-preserver-spaces="true">There are two types of linear regression:</span></p>
<ul>
<li><strong><span data-preserver-spaces="true">Simple linear regression</span></strong><span data-preserver-spaces="true"> – only one input variable</span></li>
<li><strong><span data-preserver-spaces="true">Multiple linear regression</span></strong><span data-preserver-spaces="true"> – multiple input variables</span></li>
</ul>
<p><span data-preserver-spaces="true">We will implement both today – simple linear regression from scratch and multiple linear regression with built-in R functions.</span></p>
<p><span data-preserver-spaces="true">You can use a linear regression model to learn which features are important by examining </span><strong><span data-preserver-spaces="true">coefficients</span></strong><span data-preserver-spaces="true">. If a coefficient is close to zero, the corresponding feature is considered to be less important than if the coefficient was a large positive or negative value. This comparison is only meaningful when the input variables are on comparable scales. </span></p>
<p><span data-preserver-spaces="true">That’s how the linear regression model generates the output. Coefficients are multiplied with corresponding input variables, and in the end, the bias (intercept) term is added.</span></p>
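<p><span data-preserver-spaces="true">A minimal sketch of that calculation, using made-up numbers (the intercept, coefficients and input values below are assumptions for illustration only):</span></p>
<pre># Hypothetical model: intercept (bias) and one coefficient per input variable
intercept    &lt;- 10
coefficients &lt;- c(2, -0.5)

# One observation with two input variables
inputs &lt;- c(3, 4)

# Multiply each coefficient with its input, sum the products, add the intercept
prediction &lt;- intercept + sum(coefficients * inputs)
print(prediction)  # 10 + 2*3 + (-0.5)*4 = 14</pre>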
<p><span data-preserver-spaces="true">There’s still one thing we should cover before diving into the code – assumptions of a linear regression model:</span></p>
<ul>
<li><strong><span data-preserver-spaces="true">Linear assumption</span></strong><span data-preserver-spaces="true"> — model assumes that the relationship between variables is linear</span></li>
<li><strong><span data-preserver-spaces="true">No noise</span></strong><span data-preserver-spaces="true"> — model assumes that the input and output variables are not noisy — so remove outliers if possible</span></li>
<li><strong><span data-preserver-spaces="true">No collinearity</span></strong><span data-preserver-spaces="true"> — model will overfit when you have highly correlated input variables</span></li>
<li><strong><span data-preserver-spaces="true">Normal distribution</span></strong><span data-preserver-spaces="true"> — the model will make more reliable predictions if your input and output variables are normally distributed. If that’s not the case, try using some transforms on your variables to make them more normal-looking</span></li>
<li><strong><span data-preserver-spaces="true">Rescaled inputs</span></strong><span data-preserver-spaces="true"> — use scaling or normalization to make more reliable predictions</span></li>
</ul>
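<p><span data-preserver-spaces="true">Two of these assumptions are easy to inspect with base R. The sketch below is one possible way to do it, not a complete diagnostic: it uses the built-in </span><span class="enlighter"><span class="enlighter-text">mtcars</span></span><span data-preserver-spaces="true"> dataset as a stand-in, and both the chosen columns and the 0.8 correlation threshold are assumptions made for this example.</span></p>
<pre># Built-in mtcars data, used only as a stand-in example
inputs &lt;- mtcars[, c("cyl", "disp", "hp", "wt")]

# Collinearity check: pairwise correlations between input variables
cor_matrix &lt;- cor(inputs)
high &lt;- which(abs(cor_matrix) &gt; 0.8 &amp; upper.tri(cor_matrix), arr.ind = TRUE)
high_pairs &lt;- data.frame(
  var1 = rownames(cor_matrix)[high[, "row"]],
  var2 = colnames(cor_matrix)[high[, "col"]],
  r    = cor_matrix[high]
)
print(high_pairs)

# Rescaling: center each input to mean 0 and scale to standard deviation 1
scaled_inputs &lt;- scale(inputs)
print(round(colMeans(scaled_inputs), 3))
print(apply(scaled_inputs, 2, sd))</pre>
<p><span data-preserver-spaces="true">Pairs that show up in the first table are candidates for dropping or combining before fitting a model.</span></p>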
<p><span data-preserver-spaces="true">You should be aware of these assumptions every time you’re creating linear models. We’ll ignore most of them for the purpose of this article, as the goal is to show you the general syntax you can copy-paste between the projects. </span></p>
<h4 id="simple-lr"><span data-preserver-spaces="true">Simple Linear Regression from Scratch</span></h4>
<p><span data-preserver-spaces="true">If you have a single input variable, you’re dealing with simple linear regression. It won’t be the case most of the time, but it can’t hurt to know. A simple linear regression can be expressed as:</span><img decoding="async" class="size-medium wp-image-1645 aligncenter" src="https://gerinberg.com/wp-content/uploads/2020/06/formula-300x82.png" alt="Linear Regression Formula" width="300" height="82" srcset="https://gerinberg.com/wp-content/uploads/2020/06/formula-300x82.png 300w, https://gerinberg.com/wp-content/uploads/2020/06/formula-230x63.png 230w, https://gerinberg.com/wp-content/uploads/2020/06/formula-350x96.png 350w, https://gerinberg.com/wp-content/uploads/2020/06/formula.png 358w" sizes="(max-width: 300px) 100vw, 300px" /><span data-preserver-spaces="true">As you can see, there are two terms you need to calculate beforehand: beta0 and beta1. </span><span data-preserver-spaces="true">You’ll first see how to calculate Beta1, as Beta0 depends on it. This is the formula:</span><img decoding="async" class="size-medium wp-image-1646 aligncenter" src="https://gerinberg.com/wp-content/uploads/2020/06/beta1_equation-300x75.png" alt="Beta1" width="300" height="75" srcset="https://gerinberg.com/wp-content/uploads/2020/06/beta1_equation-300x75.png 300w, https://gerinberg.com/wp-content/uploads/2020/06/beta1_equation-230x57.png 230w, https://gerinberg.com/wp-content/uploads/2020/06/beta1_equation-350x87.png 350w, https://gerinberg.com/wp-content/uploads/2020/06/beta1_equation.png 450w" sizes="(max-width: 300px) 100vw, 300px" /><span data-preserver-spaces="true">And this is the formula for Beta0:</span></p>
<p><img loading="lazy" decoding="async" class="size-medium wp-image-1647 aligncenter" src="https://gerinberg.com/wp-content/uploads/2020/06/beta0_equation-300x64.png" alt="Beta0" width="300" height="64" srcset="https://gerinberg.com/wp-content/uploads/2020/06/beta0_equation-300x64.png 300w, https://gerinberg.com/wp-content/uploads/2020/06/beta0_equation-230x49.png 230w, https://gerinberg.com/wp-content/uploads/2020/06/beta0_equation-350x75.png 350w, https://gerinberg.com/wp-content/uploads/2020/06/beta0_equation.png 412w" sizes="auto, (max-width: 300px) 100vw, 300px" /></p>
<p><span data-preserver-spaces="true">These x’s and y’s with the bar over them represent the mean (average) of the corresponding variables. </span></p>
<p><span data-preserver-spaces="true">Let’s see how all of this works in action. The code snippet below generates </span><strong><span data-preserver-spaces="true">X</span></strong><span data-preserver-spaces="true"> with 500 linearly spaced numbers between 1 and 500, and generates </span><strong><span data-preserver-spaces="true">Y</span></strong><span data-preserver-spaces="true"> as values drawn from a normal distribution with mean 0.5 × X + 70 and a standard deviation of 30, so Y follows a linear trend in X with some noise added. Both X and Y are then combined into a single data frame and visualized as a scatter plot with the </span><span class="enlighter"><span class="enlighter-text">plotly</span></span><span data-preserver-spaces="true"> package:</span></p>
<pre>library(plotly)<br /># Generate synthetic data with a linear relationship
x &lt;- seq(from = 1, to = 500)
y &lt;- rnorm(n = 500, mean = 0.5*x + 70, sd = 30)
lr_data &lt;- data.frame(x, y)

# create the plot
plot_ly(data = lr_data, x = ~x, y = ~y,
marker = list(size = 10)) %&gt;%
layout(title = list(text = paste0('Simple linear regression', '&lt;br&gt;&lt;sup&gt;', 'Linear relation is visible', '&lt;/sup&gt;'))) %&gt;%
config(displayModeBar = F)</pre>
<div class="envira-gallery-feed-output"><img decoding="async" class="envira-gallery-feed-image" src="https://gerinberg.com/wp-content/uploads/2020/12/linear_regression-1-640x480.jpeg" title="linear_regression" alt="" /></div>
<p> </p>
<div id="attachment_6295" class="wp-caption aligncenter">
<p><span style="font-size: inherit;">Let&#8217;s calculate the coefficients now. The coefficients for Beta0 and Beta1 are obtained first, and then wrapped into a </span><span class="enlighter"><span class="enlighter-m0">lr_predict</span><span class="enlighter-g1">() </span></span><span data-preserver-spaces="true">function that implements the line equation.</span></p>
</div>
<p><span data-preserver-spaces="true">The predictions can then be obtained by applying the </span><span class="enlighter"><span class="enlighter-m0">lr_predict</span><span class="enlighter-g1">() </span></span><span data-preserver-spaces="true">function to the vector X – they should all be on a single straight line. Finally, input data and predictions are visualized</span><span data-preserver-spaces="true">:</span></p>
<div id="gist107057482" class="gist">
<div class="gist-file">
<div class="gist-data">
<div class="js-gist-file-update-container js-task-list-container file-box">
<div id="file-linear_regression-r" class="file my-2">
<div class="Box-body p-0 blob-wrapper data type-r  ">
<pre># Calculate coefficients
b1 &lt;- (sum((x - mean(x)) * (y - mean(y)))) / (sum((x - mean(x))^2))
b0 &lt;- mean(y) - b1 * mean(x)

# Define function for generating predictions
lr_predict &lt;- function(x) { return(b0 + b1 * x) }

# Calculate predictions: lr_predict() is vectorized, so it can be applied to x directly
lr_data$ypred &lt;- lr_predict(x)

# Visualize input data and the best fit line
plot_ly(data = lr_data, x = ~x) %&gt;%
add_markers(y = ~y, marker = list(size = 10)) %&gt;%
add_lines(x = ~x, y = lr_data$ypred, line = list(color = "black", width = 5)) %&gt;%
layout(title = list(text = paste0('Applying simple linear regression to data', '&lt;br&gt;&lt;sup&gt;', 'Black line = best fit line', '&lt;/sup&gt;')),
showlegend = FALSE) %&gt;%
config(displayModeBar = F)</pre>
</div>
</div>
</div>
</div>
<div class="gist-meta"> </div>
</div>
</div>
<div id="attachment_6296" class="wp-caption aligncenter">
<div class="envira-gallery-feed-output"><img decoding="async" class="envira-gallery-feed-image" src="https://gerinberg.com/wp-content/uploads/2020/12/linear_regression_applied-1-640x480.jpeg" title="linear_regression_applied" alt="" /></div>
</div>
<p><span data-preserver-spaces="true">And that’s how you can implement simple linear regression in R! </span></p>
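<p><span data-preserver-spaces="true">As a sanity check, the from-scratch coefficients can be compared with the ones R&#8217;s built-in </span><span class="enlighter"><span class="enlighter-text">lm</span></span><span data-preserver-spaces="true"> function finds. The sketch below regenerates the synthetic data so that it runs on its own; the seed value is an assumption added for reproducibility and is not part of the snippets above.</span></p>
<pre># Regenerate the synthetic data (seed chosen arbitrarily for this example)
set.seed(42)
x &lt;- seq(from = 1, to = 500)
y &lt;- rnorm(n = 500, mean = 0.5 * x + 70, sd = 30)

# Coefficients from the formulas above
b1 &lt;- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 &lt;- mean(y) - b1 * mean(x)

# Coefficients from the built-in least-squares fit
fit &lt;- lm(y ~ x)

# Both approaches should agree up to floating point precision
print(c(b0 = b0, b1 = b1))
print(coef(fit))
print(all.equal(unname(coef(fit)), c(b0, b1)))</pre>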
<h4 id="multiple-lr"><span data-preserver-spaces="true">Multiple Linear Regression</span></h4>
<p><span data-preserver-spaces="true">You’ll use the <a href="https://github.com/ginberg/boston_housing/blob/master/housing.csv">Boston Housing</a></span><span data-preserver-spaces="true"> dataset to build your model. To start, the goal is to load in the dataset and check if some of the assumptions hold. Normal distribution and outlier assumptions can be checked with boxplots.</span></p>
<p><span data-preserver-spaces="true">The code snippet below loads in the dataset and visualizes box plots for every feature (not the target):</span></p>
<pre>library(reshape)

# The URL must stay on a single line; a line break inside the string would break it
df &lt;- read.csv("https://raw.githubusercontent.com/ginberg/boston_housing/master/housing.csv")

# Remove target variable
temp_df &lt;- subset(df, select = -c(MEDV))
melt_df &lt;- melt(temp_df)

plot_ly(melt_df, 
        y = ~value, 
        color = ~variable, 
        type = "box") %&gt;%
   config(displayModeBar = F)</pre>
<div class="envira-gallery-feed-output"><img decoding="async" class="envira-gallery-feed-image" src="https://gerinberg.com/wp-content/uploads/2020/12/boxplot-640x480.jpeg" title="boxplot" alt="" /></div>
<p><span data-preserver-spaces="true">A degree of skew seems to be present in all input variables, and they all contain a couple of outliers. We’ll keep this blog focused on machine learning, so we won’t do any data preparation or cleaning.</span></p>
<p><span data-preserver-spaces="true">The next step once you’re done with preparation is to split the data into testing and training data. The </span><strong><span class="enlighter"><span class="enlighter-text">caTools</span></span></strong><span data-preserver-spaces="true"> package is the perfect candidate for this task. </span></p>
<p><span data-preserver-spaces="true">You can train the model on the training set after the split. R has the </span><strong><span class="enlighter"><span class="enlighter-text">lm</span></span></strong><span data-preserver-spaces="true"> function built-in, and it is used to train linear models. Inside the </span><strong><span class="enlighter"><span class="enlighter-text">lm</span></span></strong><span data-preserver-spaces="true"> function, you’ll need to write the target variable on the left and input features on the right, separated by the  </span><span class="enlighter"><span class="enlighter-text">~</span></span><span data-preserver-spaces="true"> sign. If you put a dot instead of feature names, it means you want to train the model on all features.</span></p>
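<p><span data-preserver-spaces="true">To make the formula syntax concrete, here is a small sketch on the built-in </span><span class="enlighter"><span class="enlighter-text">mtcars</span></span><span data-preserver-spaces="true"> dataset. The dataset and the chosen columns are assumptions for illustration; the housing model itself follows further below.</span></p>
<pre># Target on the left of ~, selected input features on the right
fit_two &lt;- lm(mpg ~ wt + hp, data = mtcars)

# A dot means: use all other columns as input features
fit_all &lt;- lm(mpg ~ ., data = mtcars)

# All features except one
fit_no_qsec &lt;- lm(mpg ~ . - qsec, data = mtcars)

# Compare which terms ended up in each model
print(names(coef(fit_two)))
print(names(coef(fit_all)))
print(names(coef(fit_no_qsec)))</pre>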
<p><span data-preserver-spaces="true">After the model is trained, you can call the </span><strong><span class="enlighter"><span class="enlighter-m0">summary</span><span class="enlighter-g1">() </span></span></strong><span data-preserver-spaces="true">function to see how well it performed on the training set. Here’s a code snippet for everything discussed above:</span></p>
<div id="gist107057502" class="gist">
<div class="gist-file">
<div class="gist-data">
<div class="js-gist-file-update-container js-task-list-container file-box">
<div id="file-linear_regression-r" class="file my-2">
<div class="Box-body p-0 blob-wrapper data type-r  ">
<pre>library(caTools)
set.seed(21)

# Train/Test split, 80:20 ratio
sample_split &lt;- sample.split(Y = df$MEDV, SplitRatio = 0.8)
train_set    &lt;- subset(x = df, sample_split == TRUE)
test_set     &lt;- subset(x = df, sample_split == FALSE)
# Fit the model and print summary
model        &lt;- lm(MEDV ~ ., data = train_set)
summary(model)</pre>
</div>
</div>
</div>
</div>
</div>
</div>
<p class="wp-caption aligncenter"><div class="envira-gallery-feed-output"><img decoding="async" class="envira-gallery-feed-image" src="https://gerinberg.com/wp-content/uploads/2020/12/model-640x480.png" title="model" alt="" /></div></p>
<p><span data-preserver-spaces="true">The most interesting results are the p-values, displayed in the </span><span class="enlighter"><span class="enlighter-m0">Pr</span><span class="enlighter-g1">(&gt;</span><span class="enlighter-text">|t|</span><span class="enlighter-g1">) </span></span><span data-preserver-spaces="true">column. A p-value is the probability of observing a t-statistic at least as extreme as the one found if the true coefficient were zero. It’s common to use a 5% significance threshold, so if a p-value is 0.05 or below, we treat the variable as statistically significant for the model.</span></p>
<p><span data-preserver-spaces="true">Let’s make a residuals plot now. As a general rule, if a histogram of residuals looks roughly normally distributed and centered around zero, the linear model fits the data reasonably well. If not, there is likely room for improvement. Here’s the code for visualizing residuals:</span></p>
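<p><span data-preserver-spaces="true">The p-values can also be pulled out of the summary programmatically. This is a minimal sketch on the built-in </span><span class="enlighter"><span class="enlighter-text">mtcars</span></span><span data-preserver-spaces="true"> dataset (an assumption, chosen so that the example runs on its own); the same pattern should work for the housing model.</span></p>
<pre># Small stand-in model on built-in data
fit &lt;- lm(mpg ~ wt + hp, data = mtcars)

# The coefficient table is a matrix with one row per term
coef_table &lt;- summary(fit)$coefficients
p_values   &lt;- coef_table[, "Pr(&gt;|t|)"]

# Keep the terms that pass the 5% significance threshold
significant &lt;- names(p_values)[p_values &lt;= 0.05]
print(round(p_values, 4))
print(significant)</pre>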
<div id="gist107057516" class="gist">
<div class="gist-file">
<div class="gist-data">
<div class="js-gist-file-update-container js-task-list-container file-box">
<div id="file-linear_regression-r" class="file my-2">
<pre class="Box-body p-0 blob-wrapper data type-r  "># Get residuals
lm_residuals &lt;- data.frame(residuals = residuals(model))

# Visualize residuals as a histogram
plot_ly(x = lm_residuals$residuals, type = "histogram") %&gt;%
config(displayModeBar = F)</pre>
<div class="envira-gallery-feed-output"><img decoding="async" class="envira-gallery-feed-image" src="https://gerinberg.com/wp-content/uploads/2020/12/residuals_plot-640x480.jpeg" title="residuals_plot" alt="" /></div>
<div id="gist107057516" class="gist">
<div class="gist-file">
<div class="gist-data">
<div class="js-gist-file-update-container js-task-list-container file-box">
<div id="file-linear_regression-r" class="file my-2"><span style="font-size: inherit;">As you can see, there’s a bit of skew present due to a large error on the far right. </span><span style="font-size: inherit;" data-preserver-spaces="true">Now, let&#8217;s make predictions on the test set. You can use the </span><strong style="font-size: inherit;"><span class="enlighter"><span class="enlighter-m0">predict</span><span class="enlighter-g1">() </span></span></strong><span style="font-size: inherit;" data-preserver-spaces="true">function to apply the model to the test set. You can combine the actual values and predictions into a single data frame, just so the evaluation becomes easier. Here’s how:</span></div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
<div> </div>
<pre># predict price for test_set <br />predicted_prices &lt;- predict(model, newdata = test_set) <br />result &lt;- data.frame(Y = test_set$MEDV, Ypred = predicted_prices)</pre>
<div id="attachment_6300" class="wp-caption aligncenter">
<p id="caption-attachment-6300" class="wp-caption-text"><span style="font-size: inherit; font-family: 'Noto Serif';"><div class="envira-gallery-feed-output"><img decoding="async" class="envira-gallery-feed-image" src="https://gerinberg.com/wp-content/uploads/2020/12/predicted_values-640x480.png" title="predicted_values" alt="" /></div></span></p>
</div>
<p><span data-preserver-spaces="true">A good way of evaluating your regression models is to look at the RMSE (Root Mean Squared Error). This metric summarizes how far off the predictions typically are, expressed in the units of the target variable, and it penalizes large errors more heavily than small ones. In this case, it reports the typical error in price units:</span></p>
<div id="gist107057542" class="gist">
<div class="gist-file">
<div class="gist-data">
<div class="js-gist-file-update-container js-task-list-container file-box">
<pre class="file my-2">mse  &lt;- mean((result$Y - result$Ypred)^2)
rmse &lt;- sqrt(mse)
</pre>
<div><span style="font-size: inherit;" data-preserver-spaces="true">The </span><strong style="font-size: inherit;"><span class="enlighter"><span class="enlighter-text">rmse</span></span></strong><span style="font-size: inherit;" data-preserver-spaces="true"> variable holds the value of 70.821, indicating the model is on average wrong by 70.821 price units.</span></div>
<div> </div>
</div>
</div>
</div>
</div>
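<p><span data-preserver-spaces="true">To see what the metric does, here is a small worked sketch with made-up numbers. The vectors and the helper function names are assumptions for illustration, not part of the housing model. It also computes the mean absolute error (MAE) for comparison, which shows how squaring gives large errors more weight.</span></p>
<pre># Hypothetical actual values and predictions
actual    &lt;- c(10, 20, 30, 40)
predicted &lt;- c(12, 18, 33, 35)

# RMSE: square the errors, average them, take the square root
rmse_fn &lt;- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# MAE: average of the absolute errors
mae_fn &lt;- function(actual, predicted) mean(abs(actual - predicted))

# Errors are -2, 2, -3 and 5, so the MSE is (4 + 4 + 9 + 25) / 4 = 10.5
print(rmse_fn(actual, predicted))  # sqrt(10.5), roughly 3.24
print(mae_fn(actual, predicted))   # (2 + 2 + 3 + 5) / 4 = 3</pre>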
<div id="gist107057542" class="gist">
<div class="gist-file">
<div class="gist-data">
<div class="js-gist-file-update-container js-task-list-container file-box">
<div class="Box-body p-0 blob-wrapper data type-r  ">
<h4><span style="color: inherit; font-size: 1.25em; font-weight: 600;">Conclusion</span></h4>
</div>
</div>
</div>
</div>
</div>
<p><span data-preserver-spaces="true">In this blog you’ve learned how to train linear regression models in R. You’ve implemented a simple linear regression model entirely from scratch. After that you implemented a multiple linear regression model with built-in R functions on a real dataset. You’ve also learned how to evaluate the model through summary functions, residuals plots, and the RMSE metric. </span></p>
<p><strong><span data-preserver-spaces="true">If you want to implement machine learning in your organization, feel free to <a href="https://gerinberg.com/contact">contact</a> me.</span></strong></p>

<p>The post <a href="https://gerinberg.com/2020/06/01/r-linear-regression/">Linear Regression with R</a> appeared first on <a href="https://gerinberg.com">Ger Inberg</a>.</p>
]]></content:encoded>
					
		
		
			</item>
	</channel>
</rss>
