Data Analysis
Avocado (Persea americana) was first introduced to California, United States, in 1871. By the 1950s, several varieties were being sold across the US, with Fuerte being the most consumed. This changed over the following two decades, as the Hass avocado became the most consumed variety not only in the United States but also in many other parts of the world. Today the avocado is one of the most consumed fruits, valued both for its taste and for its health benefits. Studies have shown that about 85% of the avocados produced and sold globally are of the Hass variety, which can be attributed to the fact that Hass trees grow in almost all regions and bear fruit throughout the year. Although Mexico is the world's leading producer of avocados, the US is the leading importer, accounting for about a million tons per year. Since 2008, the US avocado market has been growing at roughly 16% per year, and this growth is expected to continue. States such as California and Hawaii produce large volumes of avocados, but domestic production does not satisfy market demand, so more has to be imported from countries such as Mexico, Chile, Peru, and the Dominican Republic. Consumption, however, is not uniform across the country: in California, for instance, 90% of families consume about three avocados per month.
Studies have shown that about fifty million dollars are spent every year on advertising and promotional activities around healthy avocado consumption (Cavaletto 465). In light of this, the collection and scientific analysis of information on the avocado market and on consumption rates in different states could be of great help to producers, vendors, avocado associations, and companies that process the fruit. Such data could be used to select the right places to sell avocados, to determine where marketing campaigns are likely to succeed, and to develop production innovations and new strategies for increasing sales. This paper uses machine learning techniques to analyze a dataset in order to determine trends in avocado sales in different US states, the number of units sold per month, and total sales in different parts of the country. Such information can help producers, vendors, associations, and companies make informed decisions when planning sales and marketing campaigns, as well as anticipate the sales expected in a given state. The results obtained here could be an essential input for rational decision-making in the avocado market, whether toward encouraging consumption or toward shifting supplies to areas where demand is higher.
The data used in this project was obtained from Kaggle, as provided by the Hass Avocado Board. The Hass Avocado Board is a US government program financed through a tax applied to all Hass avocados sold in the US market. Most of these funds are channeled toward advertisement and promotion programs. The Hass Avocado Board also collects, tracks, analyzes, and disseminates information on the sales of Hass avocados in the US market. This information is used for research and for decisions on the cultivation, harvesting, distribution, and marketing of avocados.
The dataset, named Avocado Prices, covers the period from 2015 to 2018 and comprises 18249 rows and 14 columns. This was determined by first loading the dataset into RStudio and then checking its dimensions, as shown below:
> Avocado <- read.csv("Avoc.csv")
> dim(Avocado)
[1] 18249 14
Some of the most relevant columns include Date, the date on which the observation was made; AveragePrice, the average price of a single avocado; type, whether the avocados are conventional or organic; year, the year of the observation; region, the city or region of the observation; Total Volume, the total number of avocados sold; 4046, the number of avocados sold with Price Look-Up code (PLU) 4046; 4225, the number sold with PLU 4225; and 4770, the number sold with PLU 4770. This was determined as shown below:
> str(Avocado)
'data.frame': 18249 obs. of 15 variables:
 $ Unnamed.0   : int 0 1 2 3 4 5 6 7 8 9 ...
 $ Date        : Factor w/ 169 levels "1/1/2017","1/10/2016",..: 54 51 48 58 42 39 36 45 33 27 ...
 $ months      : logi NA NA NA NA NA NA ...
 $ AveragePrice: num 1.33 1.35 0.93 1.08 1.28 1.26 0.99 0.98 1.02 1.07 ...
 $ Total.Volume: num 64237 54877 118220 78992 51040 ...
 $ X4046       : num 1037 674 795 1132 941 ...
 $ X4225       : num 54455 44639 109150 71976 43838 ...
 $ X4770       : num 48.2 58.3 130.5 72.6 75.8 ...
 $ Total.Bags  : num 8697 9506 8145 5811 6184 ...
 $ Small.Bags  : num 8604 9408 8042 5677 5986 ...
 $ Large.Bags  : num 93.2 97.5 103.1 133.8 197.7 ...
 $ XLarge.Bags : num 0 0 0 0 0 0 0 0 0 0 ...
 $ type        : Factor w/ 2 levels "conventional",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ year        : int 2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
 $ region      : Factor w/ 54 levels "Albany","Atlanta",..: 1 1 1 1 1 1 1 1 1 1 ...
To ensure that the dataset was appropriate for this study, it was subjected to several cleaning processes, following the Cross Industry Standard Process for Data Mining (CRISP-DM) methodology in RStudio. The first step was to search for missing values and blank fields. Apart from the months column, which contains nothing but NA values, and the unnamed first column, there were no missing values in the Avocado Prices dataset, as shown in the output below:
> colSums(is.na(Avocado))
   Unnamed.0         Date       months AveragePrice Total.Volume        X4046
           0            0        18249            0            0            0
       X4225        X4770   Total.Bags   Small.Bags   Large.Bags  XLarge.Bags
           0            0            0            0            0            0
        type         year       region
           0            0            0
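Since the months column carries no information, it can be dropped before modeling; otherwise the default na.omit handling in lm() would discard every row. A minimal cleanup sketch (this step is implicit in the original analysis):
> Avocado$months <- NULL   # all-NA column; keeping it would make lm() drop every row via na.omit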
> summary(Avocado)
   Unnamed.0          Date        AveragePrice    Total.Volume        Small.Bags        XLarge.Bags
 Min.   : 0.00   1/1/2017 :  108  Min.   :0.440   Min.   :      85   Min.   :       0   Min.   :     0.0
 1st Qu.:10.00   1/10/2016:  108  1st Qu.:1.100   1st Qu.:   10839   1st Qu.:    2849   1st Qu.:     0.0
 Median :24.00   1/11/2015:  108  Median :1.370   Median :  107377   Median :   26363   Median :     0.0
 Mean   :24.23   1/14/2018:  108  Mean   :1.406   Mean   :  850644   Mean   :  182195   Mean   :  3106.4
 3rd Qu.:38.00   1/15/2017:  108  3rd Qu.:1.660   3rd Qu.:  432962   3rd Qu.:   83338   3rd Qu.:   132.5
 Max.   :52.00   1/17/2016:  108  Max.   :3.250   Max.   :62505647   Max.   :13384587   Max.   :551693.7
                 (Other)  :17601
     X4046              X4225              X4770           Total.Bags        Large.Bags                type
 Min.   :       0   Min.   :       0   Min.   :      0   Min.   :       0   Min.   :      0   conventional:9126
 1st Qu.:     854   1st Qu.:    3009   1st Qu.:      0   1st Qu.:    5089   1st Qu.:    127   organic     :9123
 Median :    8645   Median :   29061   Median :    185   Median :   39744   Median :   2648
 Mean   :  293008   Mean   :  295155   Mean   :  22840   Mean   :  239639   Mean   :  54338
 3rd Qu.:  111020   3rd Qu.:  150207   3rd Qu.:   6243   3rd Qu.:  110783   3rd Qu.:  22029
 Max.   :22743616   Max.   :20470573   Max.   :2546439   Max.   :19373134   Max.   :5719097
Trends in avocado prices are examined using several visualizations. The first is a box plot, which shows the distribution of prices for both organic and conventional avocados. This is produced with the following code:
> library(ggplot2)
> options(repr.plot.width= 7, repr.plot.height=5)
> ggplot(Avocado, aes(type, AveragePrice))+
+ geom_boxplot(aes(colour = factor(year)))+
+ labs(colour = "Year", x = "Type", y = "Average Price", title = "Average price per year by avocado type")
Average prices across regions are then examined separately for conventional and organic avocados. The data is first grouped by region and type; the conventional subset (grouped_region_conv), which the original excerpt uses without showing its construction, is built the same way as the organic one:
> library(dplyr)
> grouped_region_conv = Avocado %>%
+ select(year, region, type, AveragePrice) %>%
+ filter(type == 'conventional')
> min_con = round(min(grouped_region_conv$AveragePrice),1)-0.1
> max_con = round(max(grouped_region_conv$AveragePrice),1)+0.1
> grouped_region_org = Avocado %>%
+ select(year, region, type, AveragePrice) %>%
+ filter(type == 'organic')
> min_org = round(min(grouped_region_org$AveragePrice),1)-0.1
> max_org = round(max(grouped_region_org$AveragePrice),1)+0.1
> library(ggthemes)   # provides geom_tufteboxplot()
> options(repr.plot.width= 10, repr.plot.height=12)
> ggplot(grouped_region_conv, aes(x=region, y=AveragePrice)) +
+ geom_tufteboxplot() +
+ facet_grid(. ~ year, scales="free") +
+ labs(colour = "Year", x = "Region", y = "Average Price", title = "Average prices of Conventional Avocados for each region by year") +
+ scale_y_continuous(breaks=c(seq(min_con,max_con,0.2)), limits = c(min_con,max_con)) +
+ coord_flip() +
+ theme(axis.text.x = element_text(angle = 90, vjust = 0))
> grouped_region_max_conventional = Avocado %>%
+ group_by(region, type) %>%
+ select(region, type, AveragePrice) %>%
+ summarise(maxPrice = max(AveragePrice)) %>%
+ filter(type == 'conventional')
> grouped_region_max_conventional$region <- factor(grouped_region_max_conventional$region, levels = pull(arrange(grouped_region_max_conventional, maxPrice), region))
> grouped_region_max_organic = Avocado %>%
+ group_by(region, type) %>%
+ select(region, type, AveragePrice) %>%
+ summarise(maxPrice = max(AveragePrice)) %>%
+ filter(type == 'organic')
> grouped_region_max_organic$region <- factor(grouped_region_max_organic$region, levels = pull(arrange(grouped_region_max_organic, maxPrice), region))
> plot3 <- ggplot(grouped_region_max_conventional, aes(x=maxPrice, y=region, label = round(maxPrice, 1)))+
+ geom_segment(aes(x = 0, y = region, xend = maxPrice, yend = region), color = "grey50")+
+ labs(x = "Max. Price", y = "Region", title = "Max. prices of avocado (Conventional)")+
+ geom_point() +
+ geom_text(nudge_x = 0.3)
> # plot2 (the organic counterpart) is not shown in the excerpt; it is assumed to mirror plot3:
> plot2 <- ggplot(grouped_region_max_organic, aes(x=maxPrice, y=region, label = round(maxPrice, 1)))+
+ geom_segment(aes(x = 0, y = region, xend = maxPrice, yend = region), color = "grey50")+
+ labs(x = "Max. Price", y = "Region", title = "Max. prices of avocado (Organic)")+
+ geom_point() +
+ geom_text(nudge_x = 0.3)
> library(gridExtra)
> options(repr.plot.width= 10, repr.plot.height=10)
> grid.arrange(plot3, plot2, ncol=2)
The top regions by PLU code, avocado type, and year, in terms of both total volume sold and net price, can also be determined and visualized.
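The code below references a long-format table named avocadoCodes that is not constructed in this excerpt. A hedged sketch of how it might be derived from Avocado, gathering the three PLU columns into an `Avocado Type`/Volume pair and computing a TotalPrice as Volume times AveragePrice (the exact construction in the original analysis may differ):
> library(tidyr)
> avocadoCodes <- Avocado %>%
+ gather(`Avocado Type`, Volume, X4046, X4225, X4770) %>%   # one row per PLU code per observation
+ mutate(TotalPrice = Volume * AveragePrice)                # assumed definition of TotalPrice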
> grouped_avocado_type = avocadoCodes %>%
+ select(year, type, `Avocado Type`, Volume, TotalPrice, region, AveragePrice) %>%
+ group_by(year, `Avocado Type`, region, type) %>%
+ summarise(Volume = sum(Volume)/1000000, NetPrice = sum(TotalPrice)/1000000) %>%
+ filter(Volume > 0 & NetPrice > 0)
> library(data.table)
> avocado_type_DT <- data.table(grouped_avocado_type)
> top5_volume <- avocado_type_DT[order(-Volume), head(.SD,5), by = .(`Avocado Type`, year, type)]
> top5_price <- avocado_type_DT[order(-NetPrice), head(.SD,5), by = .(`Avocado Type`, year, type)]
> options(repr.plot.width= 8, repr.plot.height=5)
> ggplot(top5_volume, aes(x=region, y=Volume))+
+ geom_point(aes(col=year), size=2) +
+ facet_grid(type ~ `Avocado Type`, scales="free")+
+ labs(y = "Total Volume (in Millions)", x = "Region", title = "Total Volume for top 5 regions, each year by Avocado type and codes (PLUs)", colour = 'Year')+
+ coord_flip()
Analysis of the data and the above visualization shows that TotalUS is an outlier and needs to be removed to obtain a clearer view of the data. This improves the visibility of trends in the other regions and is done with the following code:
> top5_volume = top5_volume %>% filter(region != 'TotalUS')
> top5_price = top5_price %>% filter(region != 'TotalUS')
> ggplot(top5_volume, aes(x=region, y=Volume))+
+ geom_point(aes(col=year), size=2) +
+ facet_grid(type ~ `Avocado Type`, scales="free")+
+ labs(y = "Total Volume (in Millions)", x = "Region", title = "Total Volume for top 5 regions (excluding TotalUS)", colour = 'Year')+
+ coord_flip()
The bottom regions by PLU code, avocado type, and year can be examined in the same way, in relation to total volume and net price. The corresponding code is not shown here, but it follows the same pattern as the top-5 selection above, as sketched below.
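A minimal sketch, assuming the same data.table is reused and the ordering is simply ascending instead of descending (the variable names here are hypothetical):
> bottom5_volume <- avocado_type_DT[order(Volume), head(.SD,5), by = .(`Avocado Type`, year, type)]
> bottom5_price <- avocado_type_DT[order(NetPrice), head(.SD,5), by = .(`Avocado Type`, year, type)]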
Regression analysis is a predictive modeling method used to determine the relationship between dependent and independent variables. It is one of the most suitable methods for investigating cause-and-effect relationships between variables in a dataset. Among its benefits, regression analysis identifies significant relationships between the dependent and independent variables and indicates the strength of the impact of each independent variable on the dependent variable. Using regression analysis, one can determine the extent to which different variables affect one another within a dataset; for instance, it can be used to estimate the effect of price changes on demand and production. Regression analysis is therefore also useful for eliminating variables with no effect when building a predictive model.
When the predictors and the response variable are continuous, linear regression is the natural choice. Several assumptions must hold, however. As the name suggests, there must be a linear relationship between the response variable and the predictors, while the predictors themselves should not be correlated with one another, since correlation between predictors leads to multicollinearity. It is also assumed that the error terms are uncorrelated (otherwise there is autocorrelation) and have constant variance (otherwise there is heteroskedasticity). A quick check of the correlations among the volume-related predictors, sketched below, illustrates the multicollinearity concern in this dataset; the regression model itself is then built with lm().
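A minimal sketch of such a check (not part of the original analysis): Total.Volume is essentially the sum of the PLU and bag columns, so correlations close to 1 are expected here, which is consistent with the singularities reported in the model summary below.
> # pairwise correlations among the volume-related columns; values near 1 signal multicollinearity
> round(cor(Avocado[, c("Total.Volume", "X4046", "X4225", "X4770", "Total.Bags")]), 2)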
> linear_model <- lm(AveragePrice~ ., data = Avocado)
> summary(linear_model)
The output of the above code is as presented below:
Call:
lm(formula = AveragePrice ~ ., data = Avocado)
Residuals:
Min 1Q Median 3Q Max
-0.9836 -0.1208 0.0033 0.1272 1.4155
Coefficients: (3 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.837e+00 1.095e+00 3.505 0.000458 ***
Unnamed.0 -5.094e-02 2.105e-02 -2.420 0.015542 *
Date1/10/2016 -1.533e-01 5.181e-02 -2.959 0.003090 **
Date1/11/2015 -9.140e-03 5.197e-02 -0.176 0.860408
Date1/14/2018 -1.994e+00 8.843e-01 -2.255 0.024134 *
Date1/15/2017 -6.103e-02 5.211e-02 -1.171 0.241558
Date1/17/2016 -1.616e-01 6.987e-02 -2.313 0.020714 *
Date1/18/2015 -3.926e-02 7.007e-02 -0.560 0.575233
Date1/21/2018 -2.092e+00 9.053e-01 -2.311 0.020848 *
Date1/22/2017 -2.076e-01 7.023e-02 -2.956 0.003121 **
Date1/24/2016 -2.447e-01 8.925e-02 -2.741 0.006124 **
Date1/25/2015 -8.432e-02 8.947e-02 -0.942 0.345950
Date1/28/2018 -2.126e+00 9.263e-01 -2.295 0.021718 *
Date1/29/2017 -2.091e-01 8.964e-02 -2.333 0.019670 *
Date1/3/2016 -1.260e-01 3.703e-02 -3.403 0.000668 ***
Date1/31/2016 -2.857e-01 1.093e-01 -2.615 0.008936 **
Date1/4/2015 -2.647e-02 3.714e-02 -0.713 0.476112
Date1/7/2018 -2.009e+00 8.632e-01 -2.327 0.019953 *
Date1/8/2017 -3.415e-02 3.723e-02 -0.917 0.358896
Date10/1/2017 -1.393e+00 8.211e-01 -1.696 0.089821 .
Date10/11/2015 -1.997e+00 8.634e-01 -2.313 0.020710 *
Date10/15/2017 -1.583e+00 8.632e-01 -1.834 0.066632 .
Date10/16/2016 -1.933e+00 8.843e-01 -2.186 0.028800 *
Date10/18/2015 -2.026e+00 8.844e-01 -2.291 0.021964 *
Date10/2/2016 -1.740e+00 8.422e-01 -2.066 0.038811 *
Date10/22/2017 -1.753e+00 8.843e-01 -1.982 0.047501 *
Date10/23/2016 -1.929e+00 9.053e-01 -2.131 0.033120 *
Date10/25/2015 -2.088e+00 9.055e-01 -2.306 0.021124 *
Date10/29/2017 -1.867e+00 9.053e-01 -2.062 0.039217 *
Date10/30/2016 -1.818e+00 9.263e-01 -1.963 0.049654 *
Date10/4/2015 -1.898e+00 8.424e-01 -2.254 0.024238 *
Date10/8/2017 -1.472e+00 8.422e-01 -1.748 0.080426 .
Date10/9/2016 -1.883e+00 8.632e-01 -2.182 0.029131 *
Date11/1/2015 -2.217e+00 9.265e-01 -2.392 0.016750 *
Date11/12/2017 -2.059e+00 9.474e-01 -2.173 0.029772 *
Date11/13/2016 -2.037e+00 9.684e-01 -2.104 0.035431 *
Date11/15/2015 -2.282e+00 9.686e-01 -2.356 0.018504 *
Date11/19/2017 -2.122e+00 9.684e-01 -2.191 0.028463 *
Date11/20/2016 -2.173e+00 9.894e-01 -2.196 0.028127 *
Date11/22/2015 -2.342e+00 9.896e-01 -2.367 0.017956 *
Date11/26/2017 -2.169e+00 9.895e-01 -2.192 0.028377 *
Date11/27/2016 -2.227e+00 1.010e+00 -2.204 0.027554 *
Date11/29/2015 -2.382e+00 1.011e+00 -2.357 0.018458 *
Date11/5/2017 -1.959e+00 9.263e-01 -2.115 0.034484 *
Date11/6/2016 -1.952e+00 9.474e-01 -2.060 0.039407 *
Date11/8/2015 -2.238e+00 9.476e-01 -2.362 0.018190 *
Date12/10/2017 -2.430e+00 1.032e+00 -2.356 0.018478 *
Date12/11/2016 -2.517e+00 1.053e+00 -2.391 0.016802 *
Date12/13/2015 -2.538e+00 1.053e+00 -2.411 0.015905 *
Date12/17/2017 -2.457e+00 1.053e+00 -2.334 0.019612 *
Date12/18/2016 -2.598e+00 1.074e+00 -2.420 0.015516 *
Date12/20/2015 -2.537e+00 1.074e+00 -2.363 0.018151 *
Date12/24/2017 -2.437e+00 1.074e+00 -2.270 0.023216 *
Date12/25/2016 -2.617e+00 1.095e+00 -2.391 0.016815 *
Date12/27/2015 -2.633e+00 1.095e+00 -2.406 0.016149 *
Date12/3/2017 -2.337e+00 1.010e+00 -2.313 0.020747 *
Date12/31/2017 -2.642e+00 1.095e+00 -2.414 0.015799 *
Date12/4/2016 -2.403e+00 1.032e+00 -2.330 0.019823 *
Date12/6/2015 -2.485e+00 1.032e+00 -2.409 0.016009 *
Date2/1/2015 -2.835e-01 1.095e-01 -2.589 0.009626 **
Date2/11/2018 -2.315e+00 9.684e-01 -2.390 0.016839 *
Date2/12/2017 -3.813e-01 1.300e-01 -2.934 0.003356 **
Date2/14/2016 -4.062e-01 1.501e-01 -2.706 0.006827 **
Date2/15/2015 -2.474e-01 1.504e-01 -1.645 0.099923 .
Date2/18/2018 -2.299e+00 9.895e-01 -2.324 0.020138 *
Date2/19/2017 -3.684e-01 1.505e-01 -2.447 0.014402 *
Date2/21/2016 -4.116e-01 1.708e-01 -2.410 0.015966 *
Date2/22/2015 -3.231e-01 1.710e-01 -1.890 0.058828 .
Date2/25/2018 -2.366e+00 1.010e+00 -2.342 0.019216 *
Date2/26/2017 -4.480e-01 1.712e-01 -2.617 0.008881 **
Date2/28/2016 -4.735e-01 1.915e-01 -2.472 0.013450 *
Date2/4/2018 -2.334e+00 9.474e-01 -2.463 0.013780 *
Date2/5/2017 -3.880e-01 1.096e-01 -3.538 0.000403 ***
Date2/7/2016 -4.158e-01 1.296e-01 -3.208 0.001337 **
Date2/8/2015 -2.715e-01 1.298e-01 -2.092 0.036496 *
Date3/1/2015 -4.258e-01 1.918e-01 -2.221 0.026380 *
Date3/11/2018 -2.493e+00 1.053e+00 -2.369 0.017865 *
Date3/12/2017 -3.550e-01 2.127e-01 -1.669 0.095231 .
Date3/13/2016 -6.448e-01 2.332e-01 -2.765 0.005700 **
Date3/15/2015 -4.465e-01 2.334e-01 -1.913 0.055764 .
Date3/18/2018 -2.565e+00 1.074e+00 -2.389 0.016914 *
Date3/19/2017 -3.736e-01 2.336e-01 -1.599 0.109752
Date3/20/2016 -6.954e-01 2.541e-01 -2.737 0.006209 **
Date3/22/2015 -5.407e-01 2.543e-01 -2.126 0.033496 *
Date3/25/2018 -2.583e+00 1.095e+00 -2.360 0.018302 *
Date3/26/2017 -4.891e-01 2.545e-01 -1.922 0.054633 .
Date3/27/2016 -7.001e-01 2.750e-01 -2.546 0.010909 *
Date3/29/2015 -5.422e-01 2.752e-01 -1.970 0.048825 *
Date3/4/2018 -2.425e+00 1.032e+00 -2.351 0.018729 *
Date3/5/2017 -4.515e-01 1.919e-01 -2.353 0.018655 *
Date3/6/2016 -5.355e-01 2.123e-01 -2.522 0.011691 *
Date3/8/2015 -4.273e-01 2.126e-01 -2.010 0.044428 *
Date4/10/2016 -8.720e-01 3.169e-01 -2.752 0.005933 **
Date4/12/2015 -6.769e-01 3.171e-01 -2.135 0.032795 *
Date4/16/2017 -5.410e-01 3.172e-01 -1.705 0.088149 .
Date4/17/2016 -8.775e-01 3.378e-01 -2.597 0.009400 **
Date4/19/2015 -7.213e-01 3.380e-01 -2.134 0.032859 *
Date4/2/2017 -4.949e-01 2.754e-01 -1.797 0.072298 .
Date4/23/2017 -5.515e-01 3.382e-01 -1.631 0.103008
Date4/24/2016 -9.565e-01 3.588e-01 -2.666 0.007685 **
Date4/26/2015 -7.589e-01 3.590e-01 -2.114 0.034526 *
Date4/3/2016 -7.599e-01 2.959e-01 -2.568 0.010236 *
Date4/30/2017 -5.957e-01 3.592e-01 -1.658 0.097247 .
Date4/5/2015 -5.778e-01 2.961e-01 -1.951 0.051057 .
Date4/9/2017 -5.284e-01 2.963e-01 -1.783 0.074560 .
Date5/1/2016 -1.031e+00 3.798e-01 -2.715 0.006630 **
Date5/10/2015 -9.202e-01 4.010e-01 -2.295 0.021747 *
Date5/14/2017 -7.216e-01 4.011e-01 -1.799 0.072039 .
Date5/15/2016 -1.085e+00 4.217e-01 -2.573 0.010096 *
Date5/17/2015 -9.442e-01 4.220e-01 -2.238 0.025252 *
Date5/21/2017 -7.393e-01 4.221e-01 -1.751 0.079915 .
Date5/22/2016 -1.129e+00 4.427e-01 -2.551 0.010749 *
Date5/24/2015 -9.661e-01 4.430e-01 -2.181 0.029191 *
Date5/28/2017 -7.721e-01 4.431e-01 -1.742 0.081463 .
Date5/29/2016 -1.157e+00 4.637e-01 -2.495 0.012593 *
Date5/3/2015 -9.084e-01 3.800e-01 -2.391 0.016824 *
Date5/31/2015 -1.016e+00 4.640e-01 -2.189 0.028604 *
Date5/7/2017 -7.352e-01 3.802e-01 -1.934 0.053148 .
Date5/8/2016 -1.100e+00 4.007e-01 -2.746 0.006048 **
Date6/11/2017 -9.327e-01 4.851e-01 -1.923 0.054552 .
Date6/12/2016 -1.209e+00 5.058e-01 -2.390 0.016877 *
Date6/14/2015 -1.100e+00 5.060e-01 -2.173 0.029760 *
Date6/18/2017 -9.662e-01 5.058e-01 -1.910 0.056131 .
Date6/19/2016 -1.282e+00 5.268e-01 -2.434 0.014958 *
Date6/21/2015 -1.144e+00 5.270e-01 -2.172 0.029893 *
Date6/25/2017 -9.970e-01 5.268e-01 -1.892 0.058455 .
Date6/26/2016 -1.301e+00 5.478e-01 -2.375 0.017552 *
Date6/28/2015 -1.197e+00 5.480e-01 -2.184 0.028988 *
Date6/4/2017 -8.378e-01 4.641e-01 -1.805 0.071077 .
Date6/5/2016 -1.215e+00 4.847e-01 -2.507 0.012182 *
Date6/7/2015 -1.069e+00 4.850e-01 -2.204 0.027543 *
Date7/10/2016 -1.361e+00 5.898e-01 -2.308 0.021000 *
Date7/12/2015 -1.289e+00 5.900e-01 -2.185 0.028935 *
Date7/16/2017 -1.124e+00 5.898e-01 -1.906 0.056663 .
Date7/17/2016 -1.336e+00 6.108e-01 -2.187 0.028790 *
Date7/19/2015 -1.369e+00 6.110e-01 -2.240 0.025086 *
Date7/2/2017 -1.039e+00 5.478e-01 -1.897 0.057835 .
Date7/23/2017 -1.195e+00 6.109e-01 -1.956 0.050496 .
Date7/24/2016 -1.332e+00 6.319e-01 -2.108 0.035017 *
Date7/26/2015 -1.386e+00 6.321e-01 -2.192 0.028367 *
Date7/3/2016 -1.372e+00 5.688e-01 -2.412 0.015855 *
Date7/30/2017 -1.238e+00 6.319e-01 -1.959 0.050152 .
Date7/31/2016 -1.413e+00 6.529e-01 -2.164 0.030444 *
Date7/5/2015 -1.240e+00 5.690e-01 -2.180 0.029305 *
Date7/9/2017 -1.119e+00 5.688e-01 -1.968 0.049062 *
Date8/13/2017 -1.269e+00 6.739e-01 -1.884 0.059640 .
Date8/14/2016 -1.542e+00 6.949e-01 -2.218 0.026545 *
Date8/16/2015 -1.511e+00 6.952e-01 -2.173 0.029782 *
Date8/2/2015 -1.380e+00 6.531e-01 -2.113 0.034585 *
Date8/20/2017 -1.225e+00 6.950e-01 -1.763 0.077901 .
Date8/21/2016 -1.597e+00 7.160e-01 -2.230 0.025762 *
Date8/23/2015 -1.566e+00 7.162e-01 -2.186 0.028816 *
Date8/27/2017 -1.202e+00 7.160e-01 -1.679 0.093186 .
Date8/28/2016 -1.669e+00 7.370e-01 -2.264 0.023586 *
Date8/30/2015 -1.643e+00 7.372e-01 -2.229 0.025857 *
Date8/6/2017 -1.255e+00 6.529e-01 -1.922 0.054564 .
Date8/7/2016 -1.495e+00 6.739e-01 -2.219 0.026531 *
Date8/9/2015 -1.476e+00 6.741e-01 -2.190 0.028561 *
Date9/10/2017 -1.270e+00 7.580e-01 -1.676 0.093786 .
Date9/11/2016 -1.805e+00 7.791e-01 -2.317 0.020533 *
Date9/13/2015 -1.701e+00 7.793e-01 -2.182 0.029107 *
Date9/17/2017 -1.332e+00 7.791e-01 -1.709 0.087441 .
Date9/18/2016 -1.765e+00 8.001e-01 -2.207 0.027359 *
Date9/20/2015 -1.767e+00 8.003e-01 -2.208 0.027242 *
Date9/24/2017 -1.380e+00 8.001e-01 -1.725 0.084564 .
Date9/25/2016 -1.730e+00 8.211e-01 -2.107 0.035101 *
Date9/27/2015 -1.815e+00 8.213e-01 -2.210 0.027117 *
Date9/3/2017 -1.209e+00 7.370e-01 -1.640 0.100974
Date9/4/2016 -1.739e+00 7.580e-01 -2.294 0.021816 *
Date9/6/2015 -1.678e+00 7.582e-01 -2.212 0.026952 *
Total.Volume -4.448e-05 3.524e-05 -1.262 0.206885
X4046 4.447e-05 3.524e-05 1.262 0.207010
X4225 4.446e-05 3.524e-05 1.262 0.207046
X4770 4.467e-05 3.524e-05 1.268 0.204947
Total.Bags -2.243e-02 2.622e-02 -0.856 0.392153
Small.Bags 2.248e-02 2.622e-02 0.857 0.391218
Large.Bags 2.248e-02 2.622e-02 0.857 0.391218
XLarge.Bags 2.248e-02 2.622e-02 0.857 0.391201
typeorganic 4.940e-01 3.542e-03 139.467 < 2e-16 ***
year2016 NA NA NA NA
year2017 NA NA NA NA
year2018 NA NA NA NA
regionAtlanta -2.215e-01 1.737e-02 -12.747 < 2e-16 ***
regionBaltimoreWashington -2.665e-02 1.739e-02 -1.532 0.125462
regionBoise -2.136e-01 1.736e-02 -12.300 < 2e-16 ***
regionBoston -2.859e-02 1.738e-02 -1.644 0.100126
regionBuffaloRochester -4.458e-02 1.736e-02 -2.568 0.010239 *
regionCalifornia -1.717e-01 1.776e-02 -9.664 < 2e-16 ***
regionCharlotte 4.284e-02 1.737e-02 2.467 0.013634 *
regionChicago -1.260e-02 1.745e-02 -0.722 0.470530
regionCincinnatiDayton -3.513e-01 1.738e-02 -20.219 < 2e-16 ***
regionColumbus -3.095e-01 1.736e-02 -17.828 < 2e-16 ***
regionDallasFtWorth -4.738e-01 1.740e-02 -27.232 < 2e-16 ***
regionDenver -3.384e-01 1.747e-02 -19.373 < 2e-16 ***
regionDetroit -2.928e-01 1.740e-02 -16.832 < 2e-16 ***
regionGrandRapids -5.949e-02 1.736e-02 -3.426 0.000613 ***
regionGreatLakes -2.476e-01 1.805e-02 -13.713 < 2e-16 ***
regionHarrisburgScranton -4.779e-02 1.736e-02 -2.753 0.005910 **
regionHartfordSpringfield 2.586e-01 1.737e-02 14.892 < 2e-16 ***
regionHouston -5.110e-01 1.739e-02 -29.387 < 2e-16 ***
regionIndianapolis -2.476e-01 1.736e-02 -14.261 < 2e-16 ***
regionJacksonville -4.966e-02 1.736e-02 -2.860 0.004238 **
regionLasVegas -1.792e-01 1.736e-02 -10.321 < 2e-16 ***
regionLosAngeles -3.541e-01 1.764e-02 -20.069 < 2e-16 ***
regionLouisville -2.745e-01 1.736e-02 -15.814 < 2e-16 ***
regionMiamiFtLauderdale -1.301e-01 1.738e-02 -7.486 7.40e-14 ***
regionMidsouth -1.587e-01 1.754e-02 -9.047 < 2e-16 ***
regionNashville -3.493e-01 1.736e-02 -20.116 < 2e-16 ***
regionNewOrleansMobile -2.571e-01 1.737e-02 -14.806 < 2e-16 ***
regionNewYork 1.711e-01 1.752e-02 9.765 < 2e-16 ***
regionNortheast 5.327e-02 1.879e-02 2.835 0.004589 **
regionNorthernNewEngland -8.248e-02 1.737e-02 -4.748 2.07e-06 ***
regionOrlando -5.377e-02 1.737e-02 -3.096 0.001964 **
regionPhiladelphia 7.187e-02 1.737e-02 4.138 3.52e-05 ***
regionPhoenixTucson -3.319e-01 1.741e-02 -19.061 < 2e-16 ***
regionPittsburgh -1.969e-01 1.736e-02 -11.345 < 2e-16 ***
regionPlains -1.212e-01 1.741e-02 -6.959 3.54e-12 ***
regionPortland -2.436e-01 1.738e-02 -14.018 < 2e-16 ***
regionRaleighGreensboro -8.023e-03 1.737e-02 -0.462 0.644139
regionRichmondNorfolk -2.701e-01 1.736e-02 -15.557 < 2e-16 ***
regionRoanoke -3.132e-01 1.736e-02 -18.039 < 2e-16 ***
regionSacramento 6.138e-02 1.736e-02 3.535 0.000408 ***
regionSanDiego -1.631e-01 1.736e-02 -9.396 < 2e-16 ***
regionSanFrancisco 2.450e-01 1.737e-02 14.098 < 2e-16 ***
regionSeattle -1.178e-01 1.738e-02 -6.780 1.24e-11 ***
regionSouthCarolina -1.580e-01 1.736e-02 -9.098 < 2e-16 ***
regionSouthCentral -4.513e-01 1.799e-02 -25.094 < 2e-16 ***
regionSoutheast -1.517e-01 1.785e-02 -8.495 < 2e-16 ***
regionSpokane -1.157e-01 1.736e-02 -6.666 2.70e-11 ***
regionStLouis -1.308e-01 1.736e-02 -7.536 5.06e-14 ***
regionSyracuse -4.104e-02 1.736e-02 -2.364 0.018082 *
regionTampa -1.509e-01 1.737e-02 -8.688 < 2e-16 ***
regionTotalUS -2.173e-01 2.167e-02 -10.028 < 2e-16 ***
regionWest -2.692e-01 1.811e-02 -14.864 < 2e-16 ***
regionWestTexNewMexico -3.092e-01 1.847e-02 -16.745 < 2e-16 ***
—
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.2257 on 18017 degrees of freedom
Multiple R-squared: 0.6899, Adjusted R-squared: 0.686
F-statistic: 173.5 on 231 and 18017 DF, p-value: < 2.2e-16
The adjusted R-squared is used to assess the goodness of fit of the regression model: the higher the adjusted R-squared, the better the model fits. As shown in the results above, the adjusted R-squared was 0.686; the arithmetic below verifies how it follows from the multiple R-squared. Although this indicates a reasonable fit, the model can be examined further through diagnostic plots, generated with the code that follows.
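As a sanity check (not part of the original output), the adjusted R-squared can be recomputed from the multiple R-squared, the number of observations, and the number of model degrees of freedom reported in the summary:
> # adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)
> r2 <- 0.6899; n <- 18249; p <- 231   # values taken from the summary output above
> 1 - (1 - r2) * (n - 1) / (n - p - 1)
[1] 0.6859241
This matches the reported Adjusted R-squared of 0.686 up to rounding. The diagnostic plots are produced as follows: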
> par(mfrow=c(2,2))
> plot(linear_model)
The residuals vs. fitted graph is the most informative of the four diagnostic charts. The residual is the difference between the actual value and the fitted value, while the fitted value is the value predicted by the model. As the plot shows, the points form a funnel shape widening from one side to the other, implying that the error variance is not constant, a situation called heteroskedasticity. This also explains the visible patterns in the graph. Heteroskedasticity can be handled by taking the log of the response variable. Although the adjusted R-squared already indicates a reasonable model, reducing heteroskedasticity can improve it further, as shown in the code below:
> linear_model <- lm(log(AveragePrice) ~ ., data = Avocado)
> summary(linear_model)
The output of running this code is:
Residual standard error: 0.1582 on 18017 degrees of freedom
Multiple R-squared: 0.7054, Adjusted R-squared: 0.7016
F-statistic: 186.7 on 231 and 18017 DF, p-value: < 2.2e-16
Adjusting the model in this way improved the adjusted R-squared from 0.686 to 0.7016, making the model noticeably better. Loading the Metrics package (installed via install.packages("Metrics")) allows the root mean squared error (RMSE) to be calculated, as shown below.
> library(Metrics)
> rmse(Avocado$AveragePrice, exp(linear_model$fitted.values))
[1] 0.231306
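For reference, Metrics::rmse() computes the square root of the mean squared difference between observations and predictions; the following equivalent one-liner (not in the original analysis) makes the definition explicit. The fitted values are exponentiated because the model was fitted on log(AveragePrice):
> # RMSE by hand: sqrt(mean((y - yhat)^2))
> sqrt(mean((Avocado$AveragePrice - exp(linear_model$fitted.values))^2))
[1] 0.231306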
The decision tree is a supervised learning algorithm with a predefined target variable, most commonly used for classification. It works by dividing a dataset into progressively more homogeneous subsets, determined by the most differentiating input variables. A decision tree comprises a root node, splits, decision nodes, terminal nodes, sub-trees, and parent and child nodes. The root node represents the entire sample and can be divided further into multiple homogeneous sets. The process by which a node is divided into two or more sub-nodes is called splitting and produces decision nodes; nodes that cannot be divided further are known as terminal nodes. In this project, a decision tree is used to build a predictive model in R, implemented with the caret and rpart packages. The caret package is used to ensure that the resulting model is robust and not susceptible to overfitting. The tree is tuned through the complexity parameter (cp), which trades off accuracy on the training data against the complexity of the model: the smaller the cp, the bigger the decision tree, which risks overfitting, while too large a cp risks underfitting by failing to capture the underlying trends properly. The following code was used to obtain the output presented below:
> library(caret)
# setting the tree control parameters
> fitControl <- trainControl(method = "cv", number = 5)
> cartGrid <- expand.grid(.cp = (1:50)*0.01)
# decision tree
> tree_model <- train(AveragePrice ~ ., data = Avocado, method = "rpart", trControl = fitControl, tuneGrid = cartGrid)
> print(tree_model)
CART
18249 samples
13 predictor
No pre-processing
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 14599, 14600, 14597, 14600, 14600
Resampling results across tuning parameters:
cp RMSE Rsquared MAE
0.01 0.2752946 0.5326203 0.2125589
0.02 0.2997619 0.4458872 0.2296112
0.03 0.3075545 0.4166823 0.2380681
0.04 0.3122469 0.3986878 0.2413319
0.05 0.3172388 0.3794028 0.2442293
0.06 0.3172388 0.3794028 0.2442293
0.07 0.3172388 0.3794028 0.2442293
0.08 0.3172388 0.3794028 0.2442293
0.09 0.3172388 0.3794028 0.2442293
0.10 0.3172388 0.3794028 0.2442293
0.11 0.3172388 0.3794028 0.2442293
0.12 0.3172388 0.3794028 0.2442293
0.13 0.3172388 0.3794028 0.2442293
0.14 0.3172388 0.3794028 0.2442293
0.15 0.3172388 0.3794028 0.2442293
0.16 0.3172388 0.3794028 0.2442293
0.17 0.3172388 0.3794028 0.2442293
0.18 0.3172388 0.3794028 0.2442293
0.19 0.3172388 0.3794028 0.2442293
0.20 0.3172388 0.3794028 0.2442293
0.21 0.3172388 0.3794028 0.2442293
0.22 0.3172388 0.3794028 0.2442293
0.23 0.3172388 0.3794028 0.2442293
0.24 0.3172388 0.3794028 0.2442293
0.25 0.3172388 0.3794028 0.2442293
0.26 0.3172388 0.3794028 0.2442293
0.27 0.3172388 0.3794028 0.2442293
0.28 0.3172388 0.3794028 0.2442293
0.29 0.3172388 0.3794028 0.2442293
0.30 0.3172388 0.3794028 0.2442293
0.31 0.3172388 0.3794028 0.2442293
0.32 0.3172388 0.3794028 0.2442293
0.33 0.3172388 0.3794028 0.2442293
0.34 0.3172388 0.3794028 0.2442293
0.35 0.3172388 0.3794028 0.2442293
0.36 0.3172388 0.3794028 0.2442293
0.37 0.3172388 0.3794028 0.2442293
0.38 0.3863415 0.3628635 0.3091749
0.39 0.4026596 NaN 0.3242786
0.40 0.4026596 NaN 0.3242786
0.41 0.4026596 NaN 0.3242786
0.42 0.4026596 NaN 0.3242786
0.43 0.4026596 NaN 0.3242786
0.44 0.4026596 NaN 0.3242786
0.45 0.4026596 NaN 0.3242786
0.46 0.4026596 NaN 0.3242786
0.47 0.4026596 NaN 0.3242786
0.48 0.4026596 NaN 0.3242786
0.49 0.4026596 NaN 0.3242786
0.50 0.4026596 NaN 0.3242786
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was cp = 0.01.
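A minimal sketch of how the selected tree could be drawn, assuming the rpart.plot package is installed (this step is not part of the original analysis, which reports that rendering the full tree failed due to over-plotting):
> library(rpart.plot)
> rpart.plot(tree_model$finalModel)   # draws the final rpart tree chosen at cp = 0.01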
The main objective of this project was to observe trends in the US avocado market based on average price, region, and avocado type. Different machine learning algorithms were used to evaluate and estimate avocado sales; two in particular were evaluated in R: linear regression and the decision tree. Under linear regression, the extent of the linear relationship between the dependent variable and the independent variables was determined, with the average price of a single avocado predicted from the other variables. The fitted model was then used to plot the residuals vs. fitted graph, where the residual represents the difference between the actual and predicted value and the fitted value represents the prediction. The resulting graph was funnel-shaped, indicating that the regression model is affected by unequal variance in the error terms, also known as heteroskedasticity. The other algorithm used to model the avocado prices dataset was the decision tree. Owing to the large size of the data and the difficulty of selecting the best cp, the full tree could not be rendered due to over-plotting errors. As such, linear regression was selected as the better model for the avocado prices dataset.