<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>machine learning Archives - Ger Inberg</title>
	<atom:link href="https://gerinberg.com/category/ml/feed/" rel="self" type="application/rss+xml" />
	<link>https://gerinberg.com/category/ml/</link>
	<description>data science developer</description>
	<lastBuildDate>Sun, 27 Dec 2020 16:58:45 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.8.1</generator>

<image>
	<url>https://gerinberg.com/wp-content/uploads/2017/05/favicon-150x150.jpg</url>
	<title>machine learning Archives - Ger Inberg</title>
	<link>https://gerinberg.com/category/ml/</link>
	<width>32</width>
	<height>32</height>
</image> 
	<item>
		<title>Linear Regression with R</title>
		<link>https://gerinberg.com/2020/06/01/r-linear-regression/</link>
		
		<dc:creator><![CDATA[Ger]]></dc:creator>
		<pubDate>Mon, 01 Jun 2020 11:32:00 +0000</pubDate>
				<category><![CDATA[machine learning]]></category>
		<category><![CDATA[linear regression]]></category>
		<guid isPermaLink="false">https://gerinberg.com/?p=1582</guid>

					<description><![CDATA[<p>You might have heard about linear regression and machine learning before. At its core, linear regression is a simple statistical method. But what are the different types of linear regression, and how can you implement them in R? Introduction to Linear Regression Linear regression is an algorithm [&#8230;]</p>
<p>The post <a href="https://gerinberg.com/2020/06/01/r-linear-regression/">Linear Regression with R</a> appeared first on <a href="https://gerinberg.com">Ger Inberg</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p><span data-preserver-spaces="true">You might have heard about linear regression and machine learning before. At its core, linear regression is a simple statistical method. But what are the different types of linear regression, and how can you implement them in R?</span></p>
<h4 id="intro"><span data-preserver-spaces="true">Introduction to Linear Regression</span></h4>
<p><span data-preserver-spaces="true">Linear regression is an algorithm developed in the field of statistics. As the name suggests, linear regression assumes a linear relationship between the input variable(s) and a single output variable. The output variable, what you’re predicting, has to be continuous. The output variable can be calculated as a linear combination of the input variable(s).</span></p>
<p><span data-preserver-spaces="true">There are two types of linear regression:</span></p>
<ul>
<li><strong><span data-preserver-spaces="true">Simple linear regression</span></strong><span data-preserver-spaces="true"> – only one input variable</span></li>
<li><strong><span data-preserver-spaces="true">Multiple linear regression</span></strong><span data-preserver-spaces="true"> – multiple input variables</span></li>
</ul>
<p><span data-preserver-spaces="true">We will implement both today – simple linear regression from scratch and multiple linear regression with built-in R functions.</span></p>
<p><span data-preserver-spaces="true">You can use a linear regression model to learn which features are important by examining </span><strong><span data-preserver-spaces="true">coefficients</span></strong><span data-preserver-spaces="true">. If a coefficient is close to zero, the corresponding feature is considered to be less important than if the coefficient was a large positive or negative value. </span></p>
<p><span data-preserver-spaces="true">The model generates its output by multiplying each coefficient with the corresponding input variable and, at the end, adding the bias (intercept) term.</span></p>
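<p><span data-preserver-spaces="true">That computation can be sketched in a couple of lines of base R; the coefficient and input values below are made up purely for illustration:</span></p>

```r
# Hypothetical coefficients learned by a linear model with two inputs
intercept <- 70
coefs <- c(x1 = 0.5, x2 = -2)

# One hypothetical observation
obs <- c(x1 = 10, x2 = 3)

# Output = each coefficient times its input, plus the intercept (bias) term
pred <- intercept + sum(coefs * obs)
pred  # 70 + 0.5 * 10 + (-2) * 3 = 69
```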
<p><span data-preserver-spaces="true">There’s still one thing we should cover before diving into the code – assumptions of a linear regression model:</span></p>
<ul>
<li><strong><span data-preserver-spaces="true">Linear assumption</span></strong><span data-preserver-spaces="true"> — model assumes that the relationship between variables is linear</span></li>
<li><strong><span data-preserver-spaces="true">No noise</span></strong><span data-preserver-spaces="true"> — model assumes that the input and output variables are not noisy — so remove outliers if possible</span></li>
<li><strong><span data-preserver-spaces="true">No collinearity</span></strong><span data-preserver-spaces="true"> — model will overfit when you have highly correlated input variables</span></li>
<li><strong><span data-preserver-spaces="true">Normal distribution</span></strong><span data-preserver-spaces="true"> — the model will make more reliable predictions if your input and output variables are normally distributed. If that’s not the case, try using some transforms on your variables to make them more normal-looking</span></li>
<li><strong><span data-preserver-spaces="true">Rescaled inputs</span></strong><span data-preserver-spaces="true"> — standardize or normalize the input variables to make more reliable predictions</span></li>
</ul>
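<p><span data-preserver-spaces="true">The collinearity assumption, for example, can be checked quickly with the </span><span class="enlighter"><span class="enlighter-text">cor</span></span><span data-preserver-spaces="true"> function. Here is a minimal sketch on synthetic inputs, where x2 is deliberately an almost exact copy of x1:</span></p>

```r
# Synthetic inputs: x2 is nearly identical to x1, x3 is independent
set.seed(42)
x1 <- rnorm(100)
x2 <- x1 + rnorm(100, sd = 0.01)
x3 <- rnorm(100)
inputs <- data.frame(x1, x2, x3)

# Off-diagonal values near 1 or -1 signal collinearity;
# here cor(x1, x2) is close to 1, so one of that pair should be dropped
round(cor(inputs), 2)
```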
<p><span data-preserver-spaces="true">You should be aware of these assumptions every time you’re creating linear models. We’ll ignore most of them for the purpose of this article, as the goal is to show you the general syntax you can copy-paste between the projects. </span></p>
<h4 id="simple-lr"><span data-preserver-spaces="true">Simple Linear Regression from Scratch</span></h4>
<p><span data-preserver-spaces="true">If you have a single input variable, you’re dealing with simple linear regression. It won’t be the case most of the time, but it can’t hurt to know. A simple linear regression can be expressed as:</span><img decoding="async" class="size-medium wp-image-1645 aligncenter" src="https://gerinberg.com/wp-content/uploads/2020/06/formula-300x82.png" alt="Linear Regression Formula" width="300" height="82" srcset="https://gerinberg.com/wp-content/uploads/2020/06/formula-300x82.png 300w, https://gerinberg.com/wp-content/uploads/2020/06/formula-230x63.png 230w, https://gerinberg.com/wp-content/uploads/2020/06/formula-350x96.png 350w, https://gerinberg.com/wp-content/uploads/2020/06/formula.png 358w" sizes="(max-width: 300px) 100vw, 300px" /><span data-preserver-spaces="true">As you can see, there are two terms you need to calculate beforehand: beta0 and beta1. </span><span data-preserver-spaces="true">You’ll first see how to calculate Beta1, as Beta0 depends on it. This is the formula:</span><img decoding="async" class="size-medium wp-image-1646 aligncenter" src="https://gerinberg.com/wp-content/uploads/2020/06/beta1_equation-300x75.png" alt="Beta1" width="300" height="75" srcset="https://gerinberg.com/wp-content/uploads/2020/06/beta1_equation-300x75.png 300w, https://gerinberg.com/wp-content/uploads/2020/06/beta1_equation-230x57.png 230w, https://gerinberg.com/wp-content/uploads/2020/06/beta1_equation-350x87.png 350w, https://gerinberg.com/wp-content/uploads/2020/06/beta1_equation.png 450w" sizes="(max-width: 300px) 100vw, 300px" /><span data-preserver-spaces="true">And this is the formula for Beta0:</span></p>
<p><img decoding="async" class="size-medium wp-image-1647 aligncenter" src="https://gerinberg.com/wp-content/uploads/2020/06/beta0_equation-300x64.png" alt="Beta0" width="300" height="64" srcset="https://gerinberg.com/wp-content/uploads/2020/06/beta0_equation-300x64.png 300w, https://gerinberg.com/wp-content/uploads/2020/06/beta0_equation-230x49.png 230w, https://gerinberg.com/wp-content/uploads/2020/06/beta0_equation-350x75.png 350w, https://gerinberg.com/wp-content/uploads/2020/06/beta0_equation.png 412w" sizes="(max-width: 300px) 100vw, 300px" /></p>
<p><span data-preserver-spaces="true">These x’s and y’s with the bar over them represent the mean (average) of the corresponding variables. </span></p>
<p><span data-preserver-spaces="true">Let’s see how all of this works in action. The code snippet below generates </span><strong><span data-preserver-spaces="true">X</span></strong><span data-preserver-spaces="true"> as 500 linearly spaced numbers between 1 and 500, and generates </span><strong><span data-preserver-spaces="true">Y</span></strong><span data-preserver-spaces="true"> by drawing, for each X, from a normal distribution whose mean increases linearly with X, with a bit of noise added. Both X and Y are then combined into a single data frame and visualized as a scatter plot with the </span><span class="enlighter"><span class="enlighter-text">plotly</span></span><span data-preserver-spaces="true"> package:</span></p>
<pre>library(plotly)

# Generate synthetic data with a linear relationship
x &lt;- seq(from = 1, to = 500)
y &lt;- rnorm(n = 500, mean = 0.5*x + 70, sd = 30)
lr_data &lt;- data.frame(x, y)

# create the plot
plot_ly(data = lr_data, x = ~x, y = ~y,
marker = list(size = 10)) %&gt;%
layout(title = list(text = paste0('Simple linear regression', '&lt;br&gt;&lt;sup&gt;', 'Linear relation is visible', '&lt;/sup&gt;'))) %&gt;%
config(displayModeBar = F)</pre>
<div class="envira-gallery-feed-output"><img decoding="async" class="envira-gallery-feed-image" src="https://gerinberg.com/wp-content/uploads/2020/12/linear_regression-1-640x480.jpeg" title="linear_regression" alt="" /></div>
<p><span data-preserver-spaces="true">Let&#8217;s calculate the coefficients now. The coefficients beta0 and beta1 are obtained first, and then wrapped into a </span><span class="enlighter"><span class="enlighter-m0">lr_predict</span><span class="enlighter-g1">()</span></span><span data-preserver-spaces="true"> function that implements the line equation.</span></p>
<p><span data-preserver-spaces="true">The predictions can then be obtained by applying the </span><span class="enlighter"><span class="enlighter-m0">lr_predict</span><span class="enlighter-g1">() </span></span><span data-preserver-spaces="true">function to the vector X – they should all be on a single straight line. Finally, input data and predictions are visualized</span><span data-preserver-spaces="true">:</span></p>
<pre># Calculate coefficients
b1 &lt;- (sum((x - mean(x)) * (y - mean(y)))) / (sum((x - mean(x))^2))
b0 &lt;- mean(y) - b1 * mean(x)

# Define function for generating predictions
lr_predict &lt;- function(x) { return(b0 + b1 * x) }

# Calculate predictions: apply lr_predict() to the input
lr_data$ypred &lt;- sapply(x, lr_predict)

# Visualize input data and the best fit line
plot_ly(data = lr_data, x = ~x) %&gt;%
add_markers(y = ~y, marker = list(size = 10)) %&gt;%
add_lines(x = ~x, y = lr_data$ypred, line = list(color = "black", width = 5)) %&gt;%
layout(title = list(text = paste0('Applying simple linear regression to data', '&lt;br&gt;&lt;sup&gt;', 'Black line = best fit line', '&lt;/sup&gt;')),
showlegend = FALSE) %&gt;%
config(displayModeBar = F)</pre>
<div id="attachment_6296" class="wp-caption aligncenter">
<div class="envira-gallery-feed-output"><img decoding="async" class="envira-gallery-feed-image" src="https://gerinberg.com/wp-content/uploads/2020/12/linear_regression_applied-1-640x480.jpeg" title="linear_regression_applied" alt="" /></div>
</div>
<p><span data-preserver-spaces="true">And that’s how you can implement simple linear regression in R! </span></p>
<h4 id="multiple-lr"><span data-preserver-spaces="true">Multiple Linear Regression</span></h4>
<p><span data-preserver-spaces="true">You’ll use the <a href="https://github.com/ginberg/boston_housing/blob/master/housing.csv">Boston Housing</a></span><span data-preserver-spaces="true"> dataset to build your model. To start, the goal is to load in the dataset and check if some of the assumptions hold. Normal distribution and outlier assumptions can be checked with boxplots.</span></p>
<p><span data-preserver-spaces="true">The code snippet below loads in the dataset and visualizes box plots for every feature (not the target):</span></p>
<pre>library(reshape)

df &lt;- read.csv("https://raw.githubusercontent.com/ginberg/boston_housing/master/housing.csv")

# Remove target variable
temp_df &lt;- subset(df, select = -c(MEDV))
melt_df &lt;- melt(temp_df)

plot_ly(melt_df, 
        y = ~value, 
        color = ~variable, 
        type = "box") %&gt;%
   config(displayModeBar = F)</pre>
<div class="envira-gallery-feed-output"><img decoding="async" class="envira-gallery-feed-image" src="https://gerinberg.com/wp-content/uploads/2020/12/boxplot-640x480.jpeg" title="boxplot" alt="" /></div>
<p><span data-preserver-spaces="true">A degree of skew is present in all input variables, and they all contain a couple of outliers. To keep this post focused on machine learning, we won’t do any data preparation/cleaning here.</span></p>
<p><span data-preserver-spaces="true">The next step once you’re done with preparation is to split the data into testing and training data. The </span><strong><span class="enlighter"><span class="enlighter-text">caTools</span></span></strong><span data-preserver-spaces="true"> package is the perfect candidate for this task. </span></p>
<p><span data-preserver-spaces="true">You can train the model on the training set after the split. R has the </span><strong><span class="enlighter"><span class="enlighter-text">lm</span></span></strong><span data-preserver-spaces="true"> function built-in, and it is used to train linear models. Inside the </span><strong><span class="enlighter"><span class="enlighter-text">lm</span></span></strong><span data-preserver-spaces="true"> function, you’ll need to write the target variable on the left and input features on the right, separated by the  </span><span class="enlighter"><span class="enlighter-text">~</span></span><span data-preserver-spaces="true"> sign. If you put a dot instead of feature names, it means you want to train the model on all features.</span></p>
<p><span data-preserver-spaces="true">After the model is trained, you can call the </span><strong><span class="enlighter"><span class="enlighter-m0">summary</span><span class="enlighter-g1">() </span></span></strong><span data-preserver-spaces="true">function to see how well it performed on the training set. Here’s a code snippet for everything discussed above:</span></p>
<pre>library(caTools)
set.seed(21)

# Train/test split, 80:20 ratio
sample_split &lt;- sample.split(Y = df$MEDV, SplitRatio = 0.8)
train_set    &lt;- subset(x = df, sample_split == TRUE)
test_set     &lt;- subset(x = df, sample_split == FALSE)

# Fit the model and print summary
model        &lt;- lm(MEDV ~ ., data = train_set)
summary(model)</pre>
<div class="envira-gallery-feed-output"><img decoding="async" class="envira-gallery-feed-image" src="https://gerinberg.com/wp-content/uploads/2020/12/model-640x480.png" title="model" alt="" /></div>
<p><span data-preserver-spaces="true">The most interesting results are the P-values, displayed in the </span><span class="enlighter"><span class="enlighter-m0">Pr</span><span class="enlighter-g1">(&gt;</span><span class="enlighter-text">|t|</span><span class="enlighter-g1">)</span></span><span data-preserver-spaces="true"> column. Those values indicate the probability of a variable not being important for prediction. It’s common to use a 5% significance threshold, so if a P-value is 0.05 or below, we can say there’s a low chance the variable is not significant for the analysis.</span></p>
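<p><span data-preserver-spaces="true">If you want to pull those P-values out programmatically, they live in the fourth column of the coefficient table returned by </span><span class="enlighter"><span class="enlighter-m0">summary</span><span class="enlighter-g1">()</span></span><span data-preserver-spaces="true">. A small sketch on R’s built-in mtcars dataset (not the Boston data used above):</span></p>

```r
# Fit a multiple linear regression on R's built-in mtcars dataset
model_mtcars <- lm(mpg ~ ., data = mtcars)

# The Pr(>|t|) column is the 4th column of the coefficient table
pvals <- summary(model_mtcars)$coefficients[, 4]

# Names of the variables that clear the 5% significance threshold
names(pvals[pvals <= 0.05])
```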
<p><span data-preserver-spaces="true">Let’s make a residuals plot now. As a general rule, if a histogram of residuals looks normally distributed, the linear model is as good as it can be. If not, it means you can improve it. Here’s the code for visualizing residuals:</span></p>
<pre># Get residuals
lm_residuals &lt;- as.data.frame(residuals(model))

# Visualize residuals as a histogram
plot_ly(x = lm_residuals[, 1], type = "histogram") %&gt;%
config(displayModeBar = F)</pre>
<div class="envira-gallery-feed-output"><img decoding="async" class="envira-gallery-feed-image" src="https://gerinberg.com/wp-content/uploads/2020/12/residuals_plot-640x480.jpeg" title="residuals_plot" alt="" /></div>
<p><span data-preserver-spaces="true">As you can see, there&#8217;s a bit of skew present due to a large error on the far right. Now, let&#8217;s make predictions on the test set. You can use the </span><strong><span class="enlighter"><span class="enlighter-m0">predict</span><span class="enlighter-g1">()</span></span></strong><span data-preserver-spaces="true"> function to apply the model to the test set. You can combine the actual values and predictions into a single data frame so that evaluation becomes easier. Here&#8217;s how:</span></p>
<pre># Predict prices for the test set
predicted_prices &lt;- predict(model, newdata = test_set)
result &lt;- data.frame(Y = test_set$MEDV, Ypred = predicted_prices)</pre>
<div id="attachment_6300" class="wp-caption aligncenter">
<div class="envira-gallery-feed-output"><img decoding="async" class="envira-gallery-feed-image" src="https://gerinberg.com/wp-content/uploads/2020/12/predicted_values-640x480.png" title="predicted_values" alt="" /></div>
</div>
<p><span data-preserver-spaces="true">A good way of evaluating your regression models is to look at the RMSE (Root Mean Squared Error). This metric will inform you how wrong your model is on average. In this case, it reports back the average number of price units the model is wrong:</span></p>
<pre>mse  &lt;- mean((result$Y - result$Ypred)^2)
rmse &lt;- sqrt(mse)</pre>
<p><span data-preserver-spaces="true">The </span><strong><span class="enlighter"><span class="enlighter-text">rmse</span></span></strong><span data-preserver-spaces="true"> variable holds the value of 70.821, indicating the model is on average wrong by 70.821 price units.</span></p>
<h4 id="conclusion"><span data-preserver-spaces="true">Conclusion</span></h4>
<p><span data-preserver-spaces="true">In this blog you’ve learned how to train linear regression models in R. You’ve implemented a simple linear regression model entirely from scratch, and after that a multiple linear regression model with built-in R functions on a real dataset. You’ve also learned how to evaluate the model through the summary function, residual plots, and the RMSE metric.</span></p>
<p><strong><span data-preserver-spaces="true">If you want to implement machine learning in your organization, feel free to <a href="https://gerinberg.com/contact">contact</a> me.</span></strong></p>

<p>The post <a href="https://gerinberg.com/2020/06/01/r-linear-regression/">Linear Regression with R</a> appeared first on <a href="https://gerinberg.com">Ger Inberg</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Air quality prediction with XGBoost</title>
		<link>https://gerinberg.com/2017/05/11/air-quality-prediction-xgboost/</link>
		
		<dc:creator><![CDATA[Ger]]></dc:creator>
		<pubDate>Thu, 11 May 2017 04:23:17 +0000</pubDate>
				<category><![CDATA[machine learning]]></category>
		<guid isPermaLink="false">http://gerinberg.com/?p=830</guid>

					<description><![CDATA[<p>While travelling in South East Asia, I noticed the air quality issues in some bigger cities. It affects people’s lives directly: they might develop breathing problems, stay indoors, and/or wear masks. When people know that at a certain time [&#8230;]</p>
<p>The post <a href="https://gerinberg.com/2017/05/11/air-quality-prediction-xgboost/">Air quality prediction with XGBoost</a> appeared first on <a href="https://gerinberg.com">Ger Inberg</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>While travelling in South East Asia, I noticed the air quality issues in some bigger cities. Poor air quality affects people&#8217;s lives directly: they might develop breathing problems, stay indoors, and/or wear masks. When people know that the air quality will be bad at a certain time, they can take measures to prevent possible (health) problems.</p>
<p>The &#8220;Human health effects of air pollution&#8221; study (Marilena Kampa, Elias Castanas, 2007) demonstrated the relation between air quality and the health of the people exposed to it. This has led to the introduction of the Air Quality Index (AQI), an index for reporting daily air quality. It tells how clean or polluted the air is, and what the associated health effects might be.</p>
<p>For my final project at Udacity&#8217;s Machine Learning nanodegree I have created a model to predict air pollutants, using the data from a <a href="https://www.kaggle.com/c/dsg-hackathon/data" target="_blank" rel="noopener noreferrer">Kaggle competition</a>.</p>
<p>Since the competition is already 5 years old (by the way, did you know Kaggle started back in 2010?), I wanted to use newer techniques to see if I could improve on the scores. Therefore I decided to use eXtreme Gradient Boosting (XGBoost), a very popular algorithm nowadays because of its speed and accuracy.</p>
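<p>For readers who haven&#8217;t used it, here is a minimal, hedged sketch of XGBoost regression in R &#8212; synthetic data and near-default tree-booster settings, not the actual competition pipeline from the report:</p>

```r
library(xgboost)

# Synthetic regression data: y is a linear function of three features plus noise
set.seed(1)
X <- matrix(rnorm(500 * 4), ncol = 4)
y <- 2 * X[, 1] + X[, 2] - 0.5 * X[, 3] + rnorm(500, sd = 0.1)

# Wrap the data in xgboost's DMatrix format and fit a
# gradient-boosted tree ensemble with squared-error loss
dtrain <- xgb.DMatrix(data = X, label = y)
model <- xgb.train(params = list(objective = "reg:squarederror"),
                   data = dtrain,
                   nrounds = 50)

# Training-set predictions and RMSE
preds <- predict(model, dtrain)
sqrt(mean((y - preds)^2))
```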
<p>Please see my final report on <a href="https://github.com/ginberg/mlcapstone/blob/master/report.pdf" target="_blank" rel="noopener noreferrer">github</a> for the results. The exploratory plot below displays the relation between hour of day and the mean pollution level for 39 pollutants.</p>
<p><a href="http://gerinberg.com/wp-content/uploads/2017/05/mean_targets_per_hour.png"><img loading="lazy" decoding="async" class="alignnone wp-image-831 size-medium" src="http://gerinberg.com/wp-content/uploads/2017/05/mean_targets_per_hour-300x245.png" alt="Mean pollution level per hour of day calculated using xgboost" width="300" height="245" srcset="https://gerinberg.com/wp-content/uploads/2017/05/mean_targets_per_hour-300x245.png 300w, https://gerinberg.com/wp-content/uploads/2017/05/mean_targets_per_hour-768x628.png 768w, https://gerinberg.com/wp-content/uploads/2017/05/mean_targets_per_hour-1024x837.png 1024w, https://gerinberg.com/wp-content/uploads/2017/05/mean_targets_per_hour-830x679.png 830w, https://gerinberg.com/wp-content/uploads/2017/05/mean_targets_per_hour-230x188.png 230w, https://gerinberg.com/wp-content/uploads/2017/05/mean_targets_per_hour-350x286.png 350w, https://gerinberg.com/wp-content/uploads/2017/05/mean_targets_per_hour-480x392.png 480w, https://gerinberg.com/wp-content/uploads/2017/05/mean_targets_per_hour.png 1617w" sizes="auto, (max-width: 300px) 100vw, 300px" /></a></p>
<p>The post <a href="https://gerinberg.com/2017/05/11/air-quality-prediction-xgboost/">Air quality prediction with XGBoost</a> appeared first on <a href="https://gerinberg.com">Ger Inberg</a>.</p>
]]></content:encoded>
					
		
		
			</item>
	</channel>
</rss>
