Data Analysis

Introduction

The avocado, also known as Persea americana, was first introduced in California, United States, in 1871. By the 1950s, several varieties of avocado were being sold in different parts of the US, with Fuerte being the most consumed. This changed over the following twenty years, as the Hass avocado became the most consumed variety not only in the United States but also in many other parts of the world. Today, the avocado has become one of the most consumed fruits, both for its taste and for its contribution to health. Studies have shown that 85% of the avocados produced and sold globally are of the Hass variety, which can be attributed to the fact that Hass avocados grow in almost all regions and throughout the year. Although Mexico is the world's leading producer of avocados, the US is the leading importer, accounting for about a million tons per year. Since 2008, the US avocado market has been growing at 16% every year, and this growth is expected to continue. States such as California and Hawaii produce large volumes of avocados, but this does not satisfy market demand, so more has to be imported from countries such as Mexico, Chile, Peru, and the Dominican Republic. It is worth noting, however, that avocado consumption is not uniform across the country; in California, for instance, 90% of families consume about three avocados per month.

Problem Statement

Studies have shown that about fifty million dollars are spent every year on advertising and promotional activities for healthy avocado consumption (Cavaletto 465). In light of this, the collection and scientific analysis of information on the avocado market and on consumption rates in different states could be of great help to producers, vendors, avocado associations, and companies that process this fruit. Such data could be used to select the right places to sell avocados, to determine where marketing campaigns can be carried out successfully, and to guide the development of production innovations and new strategies for increasing sales. This paper aims to use machine learning techniques to analyze a dataset in order to determine trends in avocado sales in different US states, the number of units sold per month, and the total sales in different parts of the country. Such information will help avocado producers, vendors, associations, and companies make informed decisions when planning their sales and marketing campaigns, as well as anticipate the sales expected in a given state. The results obtained in this paper could be an essential input for rational decisions about the avocado market, whether that means encouraging avocado consumption or shifting supplies to areas where demand is higher.

Data Source

The data used in this project was obtained from Kaggle, as provided by the Hass Avocado Board. The Hass Avocado Board is a US program financed through an assessment applied to all Hass avocados sold in the US market. Most of these funds are channeled toward advertisement and promotion programs. The Hass Avocado Board also collects, tracks, analyzes, and disseminates information on the sales of Hass avocados in the US market. This information is used for research and for decisions on the cultivation, harvesting, distribution, and marketing of avocados.

Dataset

The dataset, named Avocado Prices, was collected between 2015 and 2018 and comprises 18249 rows and 14 columns. This is determined by first loading the dataset into RStudio and then checking its dimensions, as shown below:

> Avocado <- read.csv("Avoc.csv")

> dim(Avocado)

[1] 18249    14

The most relevant columns include Date, the date on which the observation was made; AveragePrice, the average price of a single avocado; type, indicating whether the avocados are conventional or organic; year, the year of the observation; region, the city or region where the observation was made; Total Volume, the total number of avocados sold; and 4046, 4225, and 4770, the total numbers of avocados sold under price look-up (PLU) codes 4046, 4225, and 4770, respectively. This was determined as shown below:

> str(Avocado)

'data.frame':   18249 obs. of  15 variables:

 $ Unnamed.0   : int  0 1 2 3 4 5 6 7 8 9 ...

 $ Date        : Factor w/ 169 levels "1/1/2017","1/10/2016",..: 54 51 48 58 42 39 36 45 33 27 ...

 $ months      : logi  NA NA NA NA NA NA ...

 $ AveragePrice: num  1.33 1.35 0.93 1.08 1.28 1.26 0.99 0.98 1.02 1.07 ...

 $ Total.Volume: num  64237 54877 118220 78992 51040 ...

 $ X4046       : num  1037 674 795 1132 941 ...

 $ X4225       : num  54455 44639 109150 71976 43838 ...

 $ X4770       : num  48.2 58.3 130.5 72.6 75.8 ...

 $ Total.Bags  : num  8697 9506 8145 5811 6184 ...

 $ Small.Bags  : num  8604 9408 8042 5677 5986 ...

 $ Large.Bags  : num  93.2 97.5 103.1 133.8 197.7 ...

 $ XLarge.Bags : num  0 0 0 0 0 0 0 0 0 0 ...

 $ type        : Factor w/ 2 levels "conventional",..: 1 1 1 1 1 1 1 1 1 1 ...

 $ year        : int  2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...

 $ region      : Factor w/ 54 levels "Albany","Atlanta",..: 1 1 1 1 1 1 1 1 1 1 ...

To ensure that the dataset was appropriate for this study, it was subjected to several cleaning processes, following the Cross-Industry Standard Process for Data Mining (CRISP-DM) methodology in RStudio. This was done by first searching for missing values and blank spaces. Apart from the placeholder months column, which contains nothing but missing values, there was no missing data in the Avocado Prices dataset, and only the first column was unnamed, as shown in the output below:

> colSums(is.na(Avocado))

   Unnamed.0         Date       months AveragePrice Total.Volume 

           0            0        18249            0            0 

       X4046        X4225        X4770   Total.Bags   Small.Bags 

           0            0            0            0            0 

  Large.Bags  XLarge.Bags         type         year       region 

           0            0            0            0            0 
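Given these counts, the cleaning step amounts to dropping the two uninformative columns: the unnamed row index and the empty months placeholder. The sketch below is an assumed reconstruction of that step, demonstrated on a toy data frame with the same two defects rather than on the full Avoc.csv file:

```r
# Toy data frame reproducing the two defects flagged above: a row-index
# column carried over from the source file and an all-NA placeholder column.
df <- data.frame(Unnamed.0 = 0:2,
                 months = c(NA, NA, NA),
                 AveragePrice = c(1.33, 1.35, 0.93))

df$Unnamed.0 <- NULL                                     # drop the row index
df <- df[, colSums(is.na(df)) < nrow(df), drop = FALSE]  # drop all-NA columns

sum(is.na(df))   # no missing values remain
```

With the avocado data the same two calls would be applied to the Avocado data frame before any further analysis.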

> summary(Avocado)

   Unnamed.0            Date        AveragePrice    Total.Volume          X4046         
 Min.   : 0.00   1/1/2017 :  108   Min.   :0.440   Min.   :      85   Min.   :       0  
 1st Qu.:10.00   1/10/2016:  108   1st Qu.:1.100   1st Qu.:   10839   1st Qu.:     854  
 Median :24.00   1/11/2015:  108   Median :1.370   Median :  107377   Median :    8645  
 Mean   :24.23   1/14/2018:  108   Mean   :1.406   Mean   :  850644   Mean   :  293008  
 3rd Qu.:38.00   1/15/2017:  108   3rd Qu.:1.660   3rd Qu.:  432962   3rd Qu.:  111020  
 Max.   :52.00   1/17/2016:  108   Max.   :3.250   Max.   :62505647   Max.   :22743616  
                 (Other)  :17601                                                        
     X4225              X4770          Total.Bags         Small.Bags       
 Min.   :       0   Min.   :      0   Min.   :       0   Min.   :       0  
 1st Qu.:    3009   1st Qu.:      0   1st Qu.:    5089   1st Qu.:    2849  
 Median :   29061   Median :    185   Median :   39744   Median :   26363  
 Mean   :  295155   Mean   :  22840   Mean   :  239639   Mean   :  182195  
 3rd Qu.:  150207   3rd Qu.:   6243   3rd Qu.:  110783   3rd Qu.:   83338  
 Max.   :20470573   Max.   :2546439   Max.   :19373134   Max.   :13384587  
   Large.Bags       XLarge.Bags                 type      
 Min.   :      0   Min.   :     0.0   conventional:9126   
 1st Qu.:    127   1st Qu.:     0.0   organic     :9123   
 Median :   2648   Median :     0.0                       
 Mean   :  54338   Mean   :  3106.4                       
 3rd Qu.:  22029   3rd Qu.:   132.5                       
 Max.   :5719097   Max.   :551693.7                       

Trend of Avocado Prices

The trend in the price of avocados will be examined using several visualization diagrams. The first is the box plot, which will help in determining the price trend for both organic and conventional avocados. This will be done using the following code:

> options(repr.plot.width= 7, repr.plot.height=5)

> ggplot(Avocado, aes(type, AveragePrice))+

+     geom_boxplot(aes(colour = factor(year)))+

+     labs(colour = "Year", x = "Type", y = "Average Price", title = "Average price per year by avocado type")

Trend of Avocado Prices by Avocado Type in the Last Four Years by Region

This will be done by first grouping the regions by avocado type using the following code:

> grouped_region_conv = Avocado %>% 

+   select(year, region, type, AveragePrice) %>%

+   filter(type == 'conventional')

> min_con = round(min(grouped_region_conv$AveragePrice),1)-0.1

> max_con = round(max(grouped_region_conv$AveragePrice),1)+0.1

> grouped_region_org = Avocado %>% 

+   select(year, region, type, AveragePrice) %>%

+   filter(type == 'organic')

> min_org = round(min(grouped_region_org$AveragePrice),1)-0.1

> max_org = round(max(grouped_region_org$AveragePrice),1)+0.1

> options(repr.plot.width= 10, repr.plot.height=12)

> ggplot(grouped_region_conv, aes(x=region, y=AveragePrice)) +

+ geom_tufteboxplot() + 

+     facet_grid(.~grouped_region_conv$year, scales="free") +

+   labs(colour = "Year", x = "Region", y = "Average Price", title = "Average prices of Conventional Avocados for each region by year")+

+   scale_y_continuous(breaks=c(seq(min_con,max_con,0.2)), limits = c(min_con,max_con)) +

+   coord_flip() + 

+   theme(axis.text.x = element_text(angle = 90, vjust = 0))

Maximum Avocado Price in Last Four Years by each Region and Type

This will be achieved by running the following code:

> grouped_region_max_conventional = Avocado %>% 

+   group_by(region, type) %>% 

+   select(region, type,AveragePrice) %>%

+   summarise(maxPrice = max(AveragePrice)) %>%

+   filter(type == 'conventional')

> grouped_region_max_conventional$region <- factor(grouped_region_max_conventional$region,

+   levels = pull(arrange(grouped_region_max_conventional, maxPrice), region))

> grouped_region_max_organic = Avocado %>% 

+   group_by(region, type) %>% 

+   select(region, type,AveragePrice) %>%

+   summarise(maxPrice = max(AveragePrice)) %>%

+   filter(type == 'organic')

> grouped_region_max_organic$region <- factor(grouped_region_max_organic$region,

+   levels = pull(arrange(grouped_region_max_organic, maxPrice), region))

> plot3 <- ggplot(grouped_region_max_conventional, aes(x=maxPrice, y=region, label = round(maxPrice, 1)))+

+   geom_segment(aes(x = 0, y = region, xend = maxPrice, yend = region), color = "grey50")+

+   labs(x = "Max. Price", y = "Region", title = "Max. prices of avocado (Conventional)")+

+   geom_point()  +

+   geom_text(nudge_x = 0.3)

> plot2 <- ggplot(grouped_region_max_organic, aes(x=maxPrice, y=region, label = round(maxPrice, 1)))+

+   geom_segment(aes(x = 0, y = region, xend = maxPrice, yend = region), color = "grey50")+

+   labs(x = "Max. Price", y = "Region", title = "Max. prices of avocado (Organic)")+

+   geom_point()  +

+   geom_text(nudge_x = 0.3)

> options(repr.plot.width= 10, repr.plot.height=10)

> grid.arrange(plot3, plot2, ncol=2)

The top regions in avocado production by PLU code, avocado type, and year, in relation to the total volume of avocados sold and the net price, can also be determined and visualized using the following code:

> grouped_avocado_type = avocadoCodes %>% 

+   select(year,type, `Avocado Type`,Volume, TotalPrice, region, AveragePrice) %>%

+   group_by(year,`Avocado Type`, region, type) %>% 

+   summarise(Volume = sum(Volume)/1000000, NetPrice = sum(TotalPrice)/1000000) %>%

+   filter(Volume > 0  & NetPrice > 0)

> avocado_type_DT <- data.table(grouped_avocado_type)

> top5_volume <-  avocado_type_DT[order(-Volume), head(.SD,5),by = .(`Avocado Type`, year, type)]

> top5_price <-  avocado_type_DT[order(-NetPrice), head(.SD,5),by = .(`Avocado Type`, year, type)]

> options(repr.plot.width= 8, repr.plot.height=5)

> ggplot(top5_volume, aes(x=region, y=Volume))+

+     geom_point(aes(col=year), size=2)  +

+     facet_grid(top5_volume$type ~top5_volume$`Avocado Type`, scales="free")+

+     labs(y = "Total Volume (in Millions)", x = "Region", title = "Total Volume for top 5 region, each year by Avocado type and codes (PLUs)", colour = 'Year')+

+     coord_flip()

Analysis of the data and the above visual representation shows that the TotalUS region is an outlier and needs to be removed from the data to obtain a clearer view of the trends in the other regions. This is done using the following code:

> top5_volume = top5_volume %>% filter(region != 'TotalUS')

> top5_price = top5_price %>% filter(region != 'TotalUS')

> ggplot(top5_volume, aes(x=region, y=Volume))+

+     geom_point(aes(col=year), size=2)  +

+     facet_grid(top5_volume$type ~top5_volume$`Avocado Type`, scales="free")+

+     labs(y = "Total Volume (in Millions)", x = "Region", title = "Total Volume for top 5 region (excluding TotalUS)", colour = 'Year')+

+     coord_flip()

The bottom regions by PLU code, avocado type, and year can also be visualized in relation to the total volume and net price of avocados sold. This is done using the following code:

> bottom5_volume <-  avocado_type_DT[order(Volume), head(.SD,5),by = .(`Avocado Type`, year, type)]

> bottom5_price <-  avocado_type_DT[order(NetPrice), head(.SD,5),by = .(`Avocado Type`, year, type)]

> options(repr.plot.width= 9, repr.plot.height=7)

> ggplot(bottom5_volume, aes(x=region, y=Volume))+

+   geom_point(aes(col=year), size=2)  +

+   facet_grid(bottom5_volume$type ~bottom5_volume$`Avocado Type`, scales="free")+

+   theme(axis.text.x = element_text(angle = 90, hjust = 1))+

+   labs(y = "Total Volume (in Millions)", x = "Region", title = "Total Volume for bottom 5 region, each year by Avocado type and codes (PLUs)", colour = 'Year')+

+   coord_flip()

Predictive Modeling using Machine Learning

Regression Analysis

Regression analysis is a predictive modeling method used to determine the relationship between dependent and independent variables, and it is one of the most suitable methods for investigating cause-and-effect relationships between variables in a dataset. Its benefits include identifying significant relationships between the dependent and independent variables and indicating the strength of the impact of each independent variable on the dependent variable. Using regression analysis, one can determine the extent to which variables affect one another within a dataset; for instance, it can be used to determine the effect of price changes on demand and production. Regression analysis is therefore also useful for eliminating variables with no effect when building a predictive model.
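Formally, the regression model discussed here can be written as:

```latex
y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip} + \varepsilon_i
```

where $y_i$ is the response (here, the average price), $x_{i1}, \dots, x_{ip}$ are the predictors, the $\beta_j$ are coefficients estimated from the data, and $\varepsilon_i$ is the error term.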

Linear (Multiple) Regression

Linear regression is the most appropriate regression analysis when there are several predictors and the response variable is continuous. There are, however, several assumptions to consider. As the name suggests, there must be a linear relationship between the response variable and the predictors. The predictors themselves, on the other hand, should not be strongly correlated with one another, since correlation between predictors leads to multicollinearity. It is also assumed that the error terms are uncorrelated, as correlated errors lead to autocorrelation, and that the error variance is constant, since non-constant variance produces heteroskedasticity. The following R code builds the regression model for the avocado prices dataset:

> linear_model <- lm(AveragePrice~ ., data = Avocado)

> summary(linear_model)

 The output of the above code is as presented below:

Call:

lm(formula = AveragePrice ~ ., data = Avocado)

Residuals:

    Min      1Q  Median      3Q     Max 

-0.9836 -0.1208  0.0033  0.1272  1.4155 

Coefficients: (3 not defined because of singularities)

                            Estimate Std. Error t value Pr(>|t|)    

(Intercept)                3.837e+00  1.095e+00   3.505 0.000458 ***

Unnamed.0                 -5.094e-02  2.105e-02  -2.420 0.015542 *  

Date1/10/2016             -1.533e-01  5.181e-02  -2.959 0.003090 ** 

Date1/11/2015             -9.140e-03  5.197e-02  -0.176 0.860408    

Date1/14/2018             -1.994e+00  8.843e-01  -2.255 0.024134 *  

Date1/15/2017             -6.103e-02  5.211e-02  -1.171 0.241558    

Date1/17/2016             -1.616e-01  6.987e-02  -2.313 0.020714 *  

Date1/18/2015             -3.926e-02  7.007e-02  -0.560 0.575233    

Date1/21/2018             -2.092e+00  9.053e-01  -2.311 0.020848 *  

Date1/22/2017             -2.076e-01  7.023e-02  -2.956 0.003121 ** 

Date1/24/2016             -2.447e-01  8.925e-02  -2.741 0.006124 ** 

Date1/25/2015             -8.432e-02  8.947e-02  -0.942 0.345950    

Date1/28/2018             -2.126e+00  9.263e-01  -2.295 0.021718 *  

Date1/29/2017             -2.091e-01  8.964e-02  -2.333 0.019670 *  

Date1/3/2016              -1.260e-01  3.703e-02  -3.403 0.000668 ***

Date1/31/2016             -2.857e-01  1.093e-01  -2.615 0.008936 ** 

Date1/4/2015              -2.647e-02  3.714e-02  -0.713 0.476112    

Date1/7/2018              -2.009e+00  8.632e-01  -2.327 0.019953 *  

Date1/8/2017              -3.415e-02  3.723e-02  -0.917 0.358896    

Date10/1/2017             -1.393e+00  8.211e-01  -1.696 0.089821 .  

Date10/11/2015            -1.997e+00  8.634e-01  -2.313 0.020710 *  

Date10/15/2017            -1.583e+00  8.632e-01  -1.834 0.066632 .  

Date10/16/2016            -1.933e+00  8.843e-01  -2.186 0.028800 *  

Date10/18/2015            -2.026e+00  8.844e-01  -2.291 0.021964 *  

Date10/2/2016             -1.740e+00  8.422e-01  -2.066 0.038811 *  

Date10/22/2017            -1.753e+00  8.843e-01  -1.982 0.047501 *  

Date10/23/2016            -1.929e+00  9.053e-01  -2.131 0.033120 *  

Date10/25/2015            -2.088e+00  9.055e-01  -2.306 0.021124 *  

Date10/29/2017            -1.867e+00  9.053e-01  -2.062 0.039217 *  

Date10/30/2016            -1.818e+00  9.263e-01  -1.963 0.049654 *  

Date10/4/2015             -1.898e+00  8.424e-01  -2.254 0.024238 *  

Date10/8/2017             -1.472e+00  8.422e-01  -1.748 0.080426 .  

Date10/9/2016             -1.883e+00  8.632e-01  -2.182 0.029131 *  

Date11/1/2015             -2.217e+00  9.265e-01  -2.392 0.016750 *  

Date11/12/2017            -2.059e+00  9.474e-01  -2.173 0.029772 *  

Date11/13/2016            -2.037e+00  9.684e-01  -2.104 0.035431 *  

Date11/15/2015            -2.282e+00  9.686e-01  -2.356 0.018504 *  

Date11/19/2017            -2.122e+00  9.684e-01  -2.191 0.028463 *  

Date11/20/2016            -2.173e+00  9.894e-01  -2.196 0.028127 *  

Date11/22/2015            -2.342e+00  9.896e-01  -2.367 0.017956 *  

Date11/26/2017            -2.169e+00  9.895e-01  -2.192 0.028377 *  

Date11/27/2016            -2.227e+00  1.010e+00  -2.204 0.027554 *  

Date11/29/2015            -2.382e+00  1.011e+00  -2.357 0.018458 *  

Date11/5/2017             -1.959e+00  9.263e-01  -2.115 0.034484 *  

Date11/6/2016             -1.952e+00  9.474e-01  -2.060 0.039407 *  

Date11/8/2015             -2.238e+00  9.476e-01  -2.362 0.018190 *  

Date12/10/2017            -2.430e+00  1.032e+00  -2.356 0.018478 *  

Date12/11/2016            -2.517e+00  1.053e+00  -2.391 0.016802 *  

Date12/13/2015            -2.538e+00  1.053e+00  -2.411 0.015905 *  

Date12/17/2017            -2.457e+00  1.053e+00  -2.334 0.019612 *  

Date12/18/2016            -2.598e+00  1.074e+00  -2.420 0.015516 *  

Date12/20/2015            -2.537e+00  1.074e+00  -2.363 0.018151 *  

Date12/24/2017            -2.437e+00  1.074e+00  -2.270 0.023216 *  

Date12/25/2016            -2.617e+00  1.095e+00  -2.391 0.016815 *  

Date12/27/2015            -2.633e+00  1.095e+00  -2.406 0.016149 *  

Date12/3/2017             -2.337e+00  1.010e+00  -2.313 0.020747 *  

Date12/31/2017            -2.642e+00  1.095e+00  -2.414 0.015799 *  

Date12/4/2016             -2.403e+00  1.032e+00  -2.330 0.019823 *  

Date12/6/2015             -2.485e+00  1.032e+00  -2.409 0.016009 *  

Date2/1/2015              -2.835e-01  1.095e-01  -2.589 0.009626 ** 

Date2/11/2018             -2.315e+00  9.684e-01  -2.390 0.016839 *  

Date2/12/2017             -3.813e-01  1.300e-01  -2.934 0.003356 ** 

Date2/14/2016             -4.062e-01  1.501e-01  -2.706 0.006827 ** 

Date2/15/2015             -2.474e-01  1.504e-01  -1.645 0.099923 .  

Date2/18/2018             -2.299e+00  9.895e-01  -2.324 0.020138 *  

Date2/19/2017             -3.684e-01  1.505e-01  -2.447 0.014402 *  

Date2/21/2016             -4.116e-01  1.708e-01  -2.410 0.015966 *  

Date2/22/2015             -3.231e-01  1.710e-01  -1.890 0.058828 .  

Date2/25/2018             -2.366e+00  1.010e+00  -2.342 0.019216 *  

Date2/26/2017             -4.480e-01  1.712e-01  -2.617 0.008881 ** 

Date2/28/2016             -4.735e-01  1.915e-01  -2.472 0.013450 *  

Date2/4/2018              -2.334e+00  9.474e-01  -2.463 0.013780 *  

Date2/5/2017              -3.880e-01  1.096e-01  -3.538 0.000403 ***

Date2/7/2016              -4.158e-01  1.296e-01  -3.208 0.001337 ** 

Date2/8/2015              -2.715e-01  1.298e-01  -2.092 0.036496 *  

Date3/1/2015              -4.258e-01  1.918e-01  -2.221 0.026380 *  

Date3/11/2018             -2.493e+00  1.053e+00  -2.369 0.017865 *  

Date3/12/2017             -3.550e-01  2.127e-01  -1.669 0.095231 .  

Date3/13/2016             -6.448e-01  2.332e-01  -2.765 0.005700 ** 

Date3/15/2015             -4.465e-01  2.334e-01  -1.913 0.055764 .  

Date3/18/2018             -2.565e+00  1.074e+00  -2.389 0.016914 *  

Date3/19/2017             -3.736e-01  2.336e-01  -1.599 0.109752    

Date3/20/2016             -6.954e-01  2.541e-01  -2.737 0.006209 ** 

Date3/22/2015             -5.407e-01  2.543e-01  -2.126 0.033496 *  

Date3/25/2018             -2.583e+00  1.095e+00  -2.360 0.018302 *  

Date3/26/2017             -4.891e-01  2.545e-01  -1.922 0.054633 .  

Date3/27/2016             -7.001e-01  2.750e-01  -2.546 0.010909 *  

Date3/29/2015             -5.422e-01  2.752e-01  -1.970 0.048825 *  

Date3/4/2018              -2.425e+00  1.032e+00  -2.351 0.018729 *  

Date3/5/2017              -4.515e-01  1.919e-01  -2.353 0.018655 *  

Date3/6/2016              -5.355e-01  2.123e-01  -2.522 0.011691 *  

Date3/8/2015              -4.273e-01  2.126e-01  -2.010 0.044428 *  

Date4/10/2016             -8.720e-01  3.169e-01  -2.752 0.005933 ** 

Date4/12/2015             -6.769e-01  3.171e-01  -2.135 0.032795 *  

Date4/16/2017             -5.410e-01  3.172e-01  -1.705 0.088149 .  

Date4/17/2016             -8.775e-01  3.378e-01  -2.597 0.009400 ** 

Date4/19/2015             -7.213e-01  3.380e-01  -2.134 0.032859 *  

Date4/2/2017              -4.949e-01  2.754e-01  -1.797 0.072298 .  

Date4/23/2017             -5.515e-01  3.382e-01  -1.631 0.103008    

Date4/24/2016             -9.565e-01  3.588e-01  -2.666 0.007685 ** 

Date4/26/2015             -7.589e-01  3.590e-01  -2.114 0.034526 *  

Date4/3/2016              -7.599e-01  2.959e-01  -2.568 0.010236 *  

Date4/30/2017             -5.957e-01  3.592e-01  -1.658 0.097247 .  

Date4/5/2015              -5.778e-01  2.961e-01  -1.951 0.051057 .  

Date4/9/2017              -5.284e-01  2.963e-01  -1.783 0.074560 .  

Date5/1/2016              -1.031e+00  3.798e-01  -2.715 0.006630 ** 

Date5/10/2015             -9.202e-01  4.010e-01  -2.295 0.021747 *  

Date5/14/2017             -7.216e-01  4.011e-01  -1.799 0.072039 .  

Date5/15/2016             -1.085e+00  4.217e-01  -2.573 0.010096 *  

Date5/17/2015             -9.442e-01  4.220e-01  -2.238 0.025252 *  

Date5/21/2017             -7.393e-01  4.221e-01  -1.751 0.079915 .  

Date5/22/2016             -1.129e+00  4.427e-01  -2.551 0.010749 *  

Date5/24/2015             -9.661e-01  4.430e-01  -2.181 0.029191 *  

Date5/28/2017             -7.721e-01  4.431e-01  -1.742 0.081463 .  

Date5/29/2016             -1.157e+00  4.637e-01  -2.495 0.012593 *  

Date5/3/2015              -9.084e-01  3.800e-01  -2.391 0.016824 *  

Date5/31/2015             -1.016e+00  4.640e-01  -2.189 0.028604 *  

Date5/7/2017              -7.352e-01  3.802e-01  -1.934 0.053148 .  

Date5/8/2016              -1.100e+00  4.007e-01  -2.746 0.006048 ** 

Date6/11/2017             -9.327e-01  4.851e-01  -1.923 0.054552 .  

Date6/12/2016             -1.209e+00  5.058e-01  -2.390 0.016877 *  

Date6/14/2015             -1.100e+00  5.060e-01  -2.173 0.029760 *  

Date6/18/2017             -9.662e-01  5.058e-01  -1.910 0.056131 .  

Date6/19/2016             -1.282e+00  5.268e-01  -2.434 0.014958 *  

Date6/21/2015             -1.144e+00  5.270e-01  -2.172 0.029893 *  

Date6/25/2017             -9.970e-01  5.268e-01  -1.892 0.058455 .  

Date6/26/2016             -1.301e+00  5.478e-01  -2.375 0.017552 *  

Date6/28/2015             -1.197e+00  5.480e-01  -2.184 0.028988 *  

Date6/4/2017              -8.378e-01  4.641e-01  -1.805 0.071077 .  

Date6/5/2016              -1.215e+00  4.847e-01  -2.507 0.012182 *  

Date6/7/2015              -1.069e+00  4.850e-01  -2.204 0.027543 *  

Date7/10/2016             -1.361e+00  5.898e-01  -2.308 0.021000 *  

Date7/12/2015             -1.289e+00  5.900e-01  -2.185 0.028935 *  

Date7/16/2017             -1.124e+00  5.898e-01  -1.906 0.056663 .  

Date7/17/2016             -1.336e+00  6.108e-01  -2.187 0.028790 *  

Date7/19/2015             -1.369e+00  6.110e-01  -2.240 0.025086 *  

Date7/2/2017              -1.039e+00  5.478e-01  -1.897 0.057835 .  

Date7/23/2017             -1.195e+00  6.109e-01  -1.956 0.050496 .  

Date7/24/2016             -1.332e+00  6.319e-01  -2.108 0.035017 *  

Date7/26/2015             -1.386e+00  6.321e-01  -2.192 0.028367 *  

Date7/3/2016              -1.372e+00  5.688e-01  -2.412 0.015855 *  

Date7/30/2017             -1.238e+00  6.319e-01  -1.959 0.050152 .  

Date7/31/2016             -1.413e+00  6.529e-01  -2.164 0.030444 *  

Date7/5/2015              -1.240e+00  5.690e-01  -2.180 0.029305 *  

Date7/9/2017              -1.119e+00  5.688e-01  -1.968 0.049062 *  

Date8/13/2017             -1.269e+00  6.739e-01  -1.884 0.059640 .  

Date8/14/2016             -1.542e+00  6.949e-01  -2.218 0.026545 *  

Date8/16/2015             -1.511e+00  6.952e-01  -2.173 0.029782 *  

Date8/2/2015              -1.380e+00  6.531e-01  -2.113 0.034585 *  

Date8/20/2017             -1.225e+00  6.950e-01  -1.763 0.077901 .  

Date8/21/2016             -1.597e+00  7.160e-01  -2.230 0.025762 *  

Date8/23/2015             -1.566e+00  7.162e-01  -2.186 0.028816 *  

Date8/27/2017             -1.202e+00  7.160e-01  -1.679 0.093186 .  

Date8/28/2016             -1.669e+00  7.370e-01  -2.264 0.023586 *  

Date8/30/2015             -1.643e+00  7.372e-01  -2.229 0.025857 *  

Date8/6/2017              -1.255e+00  6.529e-01  -1.922 0.054564 .  

Date8/7/2016              -1.495e+00  6.739e-01  -2.219 0.026531 *  

Date8/9/2015              -1.476e+00  6.741e-01  -2.190 0.028561 *  

Date9/10/2017             -1.270e+00  7.580e-01  -1.676 0.093786 .  

Date9/11/2016             -1.805e+00  7.791e-01  -2.317 0.020533 *  

Date9/13/2015             -1.701e+00  7.793e-01  -2.182 0.029107 *  

Date9/17/2017             -1.332e+00  7.791e-01  -1.709 0.087441 .  

Date9/18/2016             -1.765e+00  8.001e-01  -2.207 0.027359 *  

Date9/20/2015             -1.767e+00  8.003e-01  -2.208 0.027242 *  

Date9/24/2017             -1.380e+00  8.001e-01  -1.725 0.084564 .  

Date9/25/2016             -1.730e+00  8.211e-01  -2.107 0.035101 *  

Date9/27/2015             -1.815e+00  8.213e-01  -2.210 0.027117 *  

Date9/3/2017              -1.209e+00  7.370e-01  -1.640 0.100974    

Date9/4/2016              -1.739e+00  7.580e-01  -2.294 0.021816 *  

Date9/6/2015              -1.678e+00  7.582e-01  -2.212 0.026952 *  

Total.Volume              -4.448e-05  3.524e-05  -1.262 0.206885    

X4046                      4.447e-05  3.524e-05   1.262 0.207010    

X4225                      4.446e-05  3.524e-05   1.262 0.207046    

X4770                      4.467e-05  3.524e-05   1.268 0.204947    

Total.Bags                -2.243e-02  2.622e-02  -0.856 0.392153    

Small.Bags                 2.248e-02  2.622e-02   0.857 0.391218    

Large.Bags                 2.248e-02  2.622e-02   0.857 0.391218    

XLarge.Bags                2.248e-02  2.622e-02   0.857 0.391201    

typeorganic                4.940e-01  3.542e-03 139.467  < 2e-16 ***

year2016                          NA         NA      NA       NA    

year2017                          NA         NA      NA       NA    

year2018                          NA         NA      NA       NA    

regionAtlanta             -2.215e-01  1.737e-02 -12.747  < 2e-16 ***

regionBaltimoreWashington -2.665e-02  1.739e-02  -1.532 0.125462    

regionBoise               -2.136e-01  1.736e-02 -12.300  < 2e-16 ***

regionBoston              -2.859e-02  1.738e-02  -1.644 0.100126    

regionBuffaloRochester    -4.458e-02  1.736e-02  -2.568 0.010239 *  

regionCalifornia          -1.717e-01  1.776e-02  -9.664  < 2e-16 ***

regionCharlotte            4.284e-02  1.737e-02   2.467 0.013634 *  

regionChicago             -1.260e-02  1.745e-02  -0.722 0.470530    

regionCincinnatiDayton    -3.513e-01  1.738e-02 -20.219  < 2e-16 ***

regionColumbus            -3.095e-01  1.736e-02 -17.828  < 2e-16 ***

regionDallasFtWorth       -4.738e-01  1.740e-02 -27.232  < 2e-16 ***

regionDenver              -3.384e-01  1.747e-02 -19.373  < 2e-16 ***

regionDetroit             -2.928e-01  1.740e-02 -16.832  < 2e-16 ***

regionGrandRapids         -5.949e-02  1.736e-02  -3.426 0.000613 ***

regionGreatLakes          -2.476e-01  1.805e-02 -13.713  < 2e-16 ***

regionHarrisburgScranton  -4.779e-02  1.736e-02  -2.753 0.005910 ** 

regionHartfordSpringfield  2.586e-01  1.737e-02  14.892  < 2e-16 ***

regionHouston             -5.110e-01  1.739e-02 -29.387  < 2e-16 ***

regionIndianapolis        -2.476e-01  1.736e-02 -14.261  < 2e-16 ***

regionJacksonville        -4.966e-02  1.736e-02  -2.860 0.004238 ** 

regionLasVegas            -1.792e-01  1.736e-02 -10.321  < 2e-16 ***

regionLosAngeles          -3.541e-01  1.764e-02 -20.069  < 2e-16 ***

regionLouisville          -2.745e-01  1.736e-02 -15.814  < 2e-16 ***

regionMiamiFtLauderdale   -1.301e-01  1.738e-02  -7.486 7.40e-14 ***

regionMidsouth            -1.587e-01  1.754e-02  -9.047  < 2e-16 ***

regionNashville           -3.493e-01  1.736e-02 -20.116  < 2e-16 ***

regionNewOrleansMobile    -2.571e-01  1.737e-02 -14.806  < 2e-16 ***

regionNewYork              1.711e-01  1.752e-02   9.765  < 2e-16 ***

regionNortheast            5.327e-02  1.879e-02   2.835 0.004589 ** 

regionNorthernNewEngland  -8.248e-02  1.737e-02  -4.748 2.07e-06 ***

regionOrlando             -5.377e-02  1.737e-02  -3.096 0.001964 ** 

regionPhiladelphia         7.187e-02  1.737e-02   4.138 3.52e-05 ***

regionPhoenixTucson       -3.319e-01  1.741e-02 -19.061  < 2e-16 ***

regionPittsburgh          -1.969e-01  1.736e-02 -11.345  < 2e-16 ***

regionPlains              -1.212e-01  1.741e-02  -6.959 3.54e-12 ***

regionPortland            -2.436e-01  1.738e-02 -14.018  < 2e-16 ***

regionRaleighGreensboro   -8.023e-03  1.737e-02  -0.462 0.644139    

regionRichmondNorfolk     -2.701e-01  1.736e-02 -15.557  < 2e-16 ***

regionRoanoke             -3.132e-01  1.736e-02 -18.039  < 2e-16 ***

regionSacramento           6.138e-02  1.736e-02   3.535 0.000408 ***

regionSanDiego            -1.631e-01  1.736e-02  -9.396  < 2e-16 ***

regionSanFrancisco         2.450e-01  1.737e-02  14.098  < 2e-16 ***

regionSeattle             -1.178e-01  1.738e-02  -6.780 1.24e-11 ***

regionSouthCarolina       -1.580e-01  1.736e-02  -9.098  < 2e-16 ***

regionSouthCentral        -4.513e-01  1.799e-02 -25.094  < 2e-16 ***

regionSoutheast           -1.517e-01  1.785e-02  -8.495  < 2e-16 ***

regionSpokane             -1.157e-01  1.736e-02  -6.666 2.70e-11 ***

regionStLouis             -1.308e-01  1.736e-02  -7.536 5.06e-14 ***

regionSyracuse            -4.104e-02  1.736e-02  -2.364 0.018082 *  

regionTampa               -1.509e-01  1.737e-02  -8.688  < 2e-16 ***

regionTotalUS             -2.173e-01  2.167e-02 -10.028  < 2e-16 ***

regionWest                -2.692e-01  1.811e-02 -14.864  < 2e-16 ***

regionWestTexNewMexico    -3.092e-01  1.847e-02 -16.745  < 2e-16 ***

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2257 on 18017 degrees of freedom

Multiple R-squared:  0.6899,    Adjusted R-squared:  0.686 

F-statistic: 173.5 on 231 and 18017 DF,  p-value: < 2.2e-16

The adjusted R-squared is used to determine the effectiveness of fit, also known as the goodness of fit, of the regression model. A higher adjusted R-squared indicates a better model. As shown in the results above, the adjusted R-squared is 0.686. Although this is a good indication, the model can be improved further, starting from the diagnostic plots generated by the code below.
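For reference, the adjusted R-squared reported above can be reproduced from the multiple R-squared and the model's degrees of freedom. A minimal sketch in R, using the figures from the summary output (taken from the output above, not re-run):

```r
# Adjusted R-squared penalizes R-squared for the number of predictors:
# adj = 1 - (1 - R2) * (n - 1) / (n - p - 1)
r2 <- 0.6899   # Multiple R-squared from the summary above
n  <- 18249    # observations (18017 residual df + 231 model df + 1)
p  <- 231      # model degrees of freedom
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
adj_r2  # approximately 0.686, matching the summary
```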

> par(mfrow=c(2,2))

> plot(linear_model)

The residual vs. fitted graph is the most significant of the four charts plotted above. The residuals represent the differences between the actual values and the fitted values, while the fitted values are the predictions of the model. As the graphical representation shows, the points assume a funnel-like shape from the right side to the left side. This implies that the error variances of the regression model are not equal, a situation called heteroskedasticity, and it also explains the visible patterns displayed in the graph. Heteroskedasticity can be handled by taking the log of the response variable. It is crucial to note that although the above results indicate a good model, as shown by the adjusted R-squared, reducing heteroskedasticity can improve the model further; this is done by refitting with the log of the response variable, as shown in the code below.

> linear_model <- lm(log(AveragePrice) ~ ., data = Avocado)

> summary(linear_model)

> par(mfrow=c(2,2))

> plot(linear_model)

The output of running this code is:

Residual standard error: 0.1582 on 18017 degrees of freedom

Multiple R-squared:  0.7054,    Adjusted R-squared:  0.7016 

F-statistic: 186.7 on 231 and 18017 DF,  p-value: < 2.2e-16

Taking the log of the response variable has improved the adjusted R-squared from 0.686 to 0.7016, making the model even better. Loading the Metrics package, installed with the install.packages("Metrics") command, allows the root mean squared error (RMSE) to be calculated as shown below.

> library(Metrics)

> rmse(Avocado$AveragePrice, exp(linear_model$fitted.values))

[1] 0.231306
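For clarity, rmse() simply computes the square root of the mean squared difference between actual and predicted values; because the refit model predicts log(AveragePrice), the fitted values are exponentiated back to the price scale before the comparison. A minimal sketch of the same calculation, using small illustrative placeholder vectors rather than the avocado data:

```r
# What Metrics::rmse computes: the square root of the mean squared error.
# `actual` and `predicted` are illustrative placeholders, not the real data.
actual    <- c(1.10, 1.35, 0.95, 1.50)
predicted <- c(1.05, 1.40, 1.00, 1.45)
rmse_manual <- sqrt(mean((actual - predicted)^2))
rmse_manual  # 0.05
```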

Decision Tree Algorithm

This is a form of supervised learning algorithm with predefined target variables. One of the most common uses of the decision tree algorithm is the classification of data. The algorithm works by dividing a dataset into distinct but internally similar sets, as determined by the most significant differentiator among the input variables. A decision tree comprises root nodes, splitting, decision nodes, terminal nodes, sub-trees, and parent and child nodes. The root node represents the entire sample and can be further divided into multiple homogeneous sets. The process by which a node is divided into two or more sub-nodes is called splitting and leads to the formation of decision nodes. Nodes that cannot be divided further are known as terminal nodes. In the current project, the decision tree algorithm is used to design a predictive model in R, implemented using the available packages, the most useful of which here are caret and rpart. The caret package is used to ensure that the generated model is robust and not susceptible to overfitting. In addition, the decision tree is tuned in R using cp (the complexity parameter), which measures the trade-off between the precision of the fit on the dataset and the complexity of the resulting model. The smaller the complexity parameter, the bigger the decision tree, which is likely to overfit; conversely, too large a value risks underfitting, since the model fails to capture the underlying trends properly. The following code was used to obtain the output presented below:

#setting the tree control parameters

> fitControl <- trainControl(method = "cv", number = 5)

> cartGrid <- expand.grid(.cp = (1:50)*0.01)

#decision tree

> tree_model <- train(AveragePrice ~ ., data = Avocado, method = "rpart", trControl = fitControl, tuneGrid = cartGrid)

> print(tree_model)

CART 

18249 samples

   13 predictor

No pre-processing

Resampling: Cross-Validated (5 fold) 

Summary of sample sizes: 14599, 14600, 14597, 14600, 14600 

Resampling results across tuning parameters:

  cp         RMSE       Rsquared       MAE      

  0.01  0.2752946  0.5326203  0.2125589

  0.02  0.2997619  0.4458872  0.2296112

  0.03  0.3075545  0.4166823  0.2380681

  0.04  0.3122469  0.3986878  0.2413319

  0.05  0.3172388  0.3794028  0.2442293

  0.06  0.3172388  0.3794028  0.2442293

  0.07  0.3172388  0.3794028  0.2442293

  0.08  0.3172388  0.3794028  0.2442293

  0.09  0.3172388  0.3794028  0.2442293

  0.10  0.3172388  0.3794028  0.2442293

  0.11  0.3172388  0.3794028  0.2442293

  0.12  0.3172388  0.3794028  0.2442293

  0.13  0.3172388  0.3794028  0.2442293

  0.14  0.3172388  0.3794028  0.2442293

  0.15  0.3172388  0.3794028  0.2442293

  0.16  0.3172388  0.3794028  0.2442293

  0.17  0.3172388  0.3794028  0.2442293

  0.18  0.3172388  0.3794028  0.2442293

  0.19  0.3172388  0.3794028  0.2442293

  0.20  0.3172388  0.3794028  0.2442293

  0.21  0.3172388  0.3794028  0.2442293

  0.22  0.3172388  0.3794028  0.2442293

  0.23  0.3172388  0.3794028  0.2442293

  0.24  0.3172388  0.3794028  0.2442293

  0.25  0.3172388  0.3794028  0.2442293

  0.26  0.3172388  0.3794028  0.2442293

  0.27  0.3172388  0.3794028  0.2442293

  0.28  0.3172388  0.3794028  0.2442293

  0.29  0.3172388  0.3794028  0.2442293

  0.30  0.3172388  0.3794028  0.2442293

  0.31  0.3172388  0.3794028  0.2442293

  0.32  0.3172388  0.3794028  0.2442293

  0.33  0.3172388  0.3794028  0.2442293

  0.34  0.3172388  0.3794028  0.2442293

  0.35  0.3172388  0.3794028  0.2442293

  0.36  0.3172388  0.3794028  0.2442293

  0.37  0.3172388  0.3794028  0.2442293

  0.38  0.3863415  0.3628635  0.3091749

  0.39  0.4026596        NaN  0.3242786

  0.40  0.4026596        NaN  0.3242786

  0.41  0.4026596        NaN  0.3242786

  0.42  0.4026596        NaN  0.3242786

  0.43  0.4026596        NaN  0.3242786

  0.44  0.4026596        NaN  0.3242786

  0.45  0.4026596        NaN  0.3242786

  0.46  0.4026596        NaN  0.3242786

  0.47  0.4026596        NaN  0.3242786

  0.48  0.4026596        NaN  0.3242786

  0.49  0.4026596        NaN  0.3242786

  0.50  0.4026596        NaN  0.3242786

RMSE was used to select the optimal model using the smallest value.

The final value used for the model was cp = 0.01.
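To illustrate why the smallest cp wins here, the sketch below shows how cp governs tree size on R's built-in mtcars dataset (not the avocado data); the relative node counts reflect rpart's documented pruning behavior, and the exact sizes are not asserted:

```r
# Illustrative sketch: a smaller cp permits more splits (a deeper tree),
# while a very large cp prunes the tree back, possibly to the root alone.
library(rpart)
fit_small_cp <- rpart(mpg ~ ., data = mtcars, method = "anova",
                      control = rpart.control(cp = 0.01))
fit_large_cp <- rpart(mpg ~ ., data = mtcars, method = "anova",
                      control = rpart.control(cp = 0.50))
nrow(fit_small_cp$frame)  # node count: larger
nrow(fit_large_cp$frame)  # node count: smaller (possibly just the root)
```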

Recommended Classifier

The main objective of this project was to observe trends in the avocado market in the United States based on average price, region, and the type of avocado. This was done using different machine learning algorithms to model the avocado sales data. For this purpose, two major algorithms were evaluated in R: linear regression and the decision tree. Under linear regression, the extent of the linear relationship between the dependent variable and one or more independent variables was determined; here, the other variables were used to predict the average price of a single avocado. The fitted regression model was then used to plot the residual vs. fitted graph, where the residuals represent the differences between the actual and predicted outcome values, and the fitted values represent the predictions. The resulting graph was funnel-shaped from right to left, indicating that the regression model is affected by unequal variance in the error terms, also known as heteroskedasticity, which was mitigated by taking the log of the response variable. The other algorithm used to model the avocado price dataset was the decision tree. Owing to the large size of the data and the challenges in selecting the best cp, the tree could not be generated due to over-plotting errors. As such, linear regression was selected as the best classifier for the avocado price dataset.
