Data Analysis
Avocado (Persea americana) was first introduced to California, United States, in 1871. By the 1950s, several varieties were being sold across the US, with Fuerte being the most consumed. This changed over the following two decades, as the Hass avocado became the most consumed variety not only in the United States but also in many other parts of the world. Today the avocado is one of the most consumed fruits, valued both for its taste and for its health benefits. Studies have shown that about 85% of the avocados produced and sold globally are of the Hass variety, which can be attributed to the fact that Hass trees grow in almost all regions and bear fruit throughout the year. Although Mexico is the world's leading producer of avocados, the US is the leading importer, accounting for about a million tons per year. Since 2008, the US avocado market has been growing at roughly 16% per year, and this growth is expected to continue. States such as California and Hawaii produce large volumes of avocados, but domestic production does not satisfy market demand, so more has to be imported from countries such as Mexico, Chile, Peru, and the Dominican Republic. Consumption, however, is not uniform across the country: in California, for instance, 90% of families consume about three avocados per month.
Studies have shown that about fifty million dollars are spent every year on advertising and promotional activities around healthy avocado consumption (Cavaletto 465). In light of this, the collection and scientific analysis of information on the avocado market and on consumption rates in different states could be of great help to producers, vendors, avocado associations, and companies that process the fruit. Such data could be used to select the right places to sell avocados, to determine where marketing campaigns are likely to succeed, and to develop production innovations and new strategies for increasing sales. This paper uses machine learning techniques to analyze a dataset in order to determine trends in avocado sales in different US states, the number of units sold per month, and total sales in different parts of the country. Such information can help producers, vendors, associations, and companies make informed decisions when planning sales and marketing campaigns, as well as anticipate the sales expected in a given state. The results obtained here could be an essential input for rational decision-making in the avocado market, whether toward encouraging consumption or toward shifting supplies to areas where demand is higher.
The data used in this project was obtained from Kaggle, as provided by the Hass Avocado Board. The Hass Avocado Board is a US government program financed through a tax applied to all Hass avocados sold in the US market. Most of these funds are channeled toward advertisement and promotion programs. The Hass Avocado Board also collects, tracks, analyzes, and disseminates information on the sales of Hass avocados in the US market. This information is used for research and for decisions on the cultivation, harvesting, distribution, and marketing of avocados.
The dataset, named Avocado Prices, covers the period from 2015 to 2018 and comprises 18249 rows and 14 columns. This was determined by first loading the dataset into RStudio and then checking its dimensions, as shown below:
> Avocado <- read.csv("Avoc.csv")
> dim(Avocado)
[1] 18249 14
Some of the most relevant columns include Date, the date on which the observation was made; AveragePrice, the average price of a single avocado; type, whether the avocados are conventional or organic; year, the year of the observation; region, the city or region of the observation; Total Volume, the total number of avocados sold; 4046, the number of avocados sold with Price Look-Up code (PLU) 4046; 4225, the number sold with PLU 4225; and 4770, the number sold with PLU 4770. This was determined as shown below:
> str(Avocado)
'data.frame': 18249 obs. of 15 variables:
 $ Unnamed.0   : int 0 1 2 3 4 5 6 7 8 9 ...
 $ Date        : Factor w/ 169 levels "1/1/2017","1/10/2016",..: 54 51 48 58 42 39 36 45 33 27 ...
 $ months      : logi NA NA NA NA NA NA ...
 $ AveragePrice: num 1.33 1.35 0.93 1.08 1.28 1.26 0.99 0.98 1.02 1.07 ...
 $ Total.Volume: num 64237 54877 118220 78992 51040 ...
 $ X4046       : num 1037 674 795 1132 941 ...
 $ X4225       : num 54455 44639 109150 71976 43838 ...
 $ X4770       : num 48.2 58.3 130.5 72.6 75.8 ...
 $ Total.Bags  : num 8697 9506 8145 5811 6184 ...
 $ Small.Bags  : num 8604 9408 8042 5677 5986 ...
 $ Large.Bags  : num 93.2 97.5 103.1 133.8 197.7 ...
 $ XLarge.Bags : num 0 0 0 0 0 0 0 0 0 0 ...
 $ type        : Factor w/ 2 levels "conventional",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ year        : int 2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
 $ region      : Factor w/ 54 levels "Albany","Atlanta",..: 1 1 1 1 1 1 1 1 1 1 ...
To ensure that the dataset was appropriate for this study, it was subjected to several cleaning processes, following the Cross Industry Standard Process for Data Mining (CRISP-DM) methodology in RStudio. The first step was to search for missing values and blank fields. Apart from the months column, which contains nothing but NA values, and the unnamed first column, there were no missing values in the Avocado Prices dataset, as shown in the output below:
> colSums(is.na(Avocado))
   Unnamed.0         Date       months AveragePrice Total.Volume        X4046
           0            0        18249            0            0            0
       X4225        X4770   Total.Bags   Small.Bags   Large.Bags  XLarge.Bags
           0            0            0            0            0            0
        type         year       region
           0            0            0
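Since the months column carries no information, it can be dropped before modeling; otherwise the default na.omit handling in lm() would discard every row. A minimal cleanup sketch (this step is implicit in the original analysis):
> Avocado$months <- NULL   # all-NA column; keeping it would make lm() drop every row via na.omit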
> summary(Avocado)
   Unnamed.0          Date        AveragePrice    Total.Volume        Small.Bags        XLarge.Bags
 Min.   : 0.00   1/1/2017 :  108  Min.   :0.440   Min.   :      85   Min.   :       0   Min.   :     0.0
 1st Qu.:10.00   1/10/2016:  108  1st Qu.:1.100   1st Qu.:   10839   1st Qu.:    2849   1st Qu.:     0.0
 Median :24.00   1/11/2015:  108  Median :1.370   Median :  107377   Median :   26363   Median :     0.0
 Mean   :24.23   1/14/2018:  108  Mean   :1.406   Mean   :  850644   Mean   :  182195   Mean   :  3106.4
 3rd Qu.:38.00   1/15/2017:  108  3rd Qu.:1.660   3rd Qu.:  432962   3rd Qu.:   83338   3rd Qu.:   132.5
 Max.   :52.00   1/17/2016:  108  Max.   :3.250   Max.   :62505647   Max.   :13384587   Max.   :551693.7
                 (Other)  :17601
     X4046              X4225              X4770           Total.Bags        Large.Bags                type
 Min.   :       0   Min.   :       0   Min.   :      0   Min.   :       0   Min.   :      0   conventional:9126
 1st Qu.:     854   1st Qu.:    3009   1st Qu.:      0   1st Qu.:    5089   1st Qu.:    127   organic     :9123
 Median :    8645   Median :   29061   Median :    185   Median :   39744   Median :   2648
 Mean   :  293008   Mean   :  295155   Mean   :  22840   Mean   :  239639   Mean   :  54338
 3rd Qu.:  111020   3rd Qu.:  150207   3rd Qu.:   6243   3rd Qu.:  110783   3rd Qu.:  22029
 Max.   :22743616   Max.   :20470573   Max.   :2546439   Max.   :19373134   Max.   :5719097
Trends in avocado prices are examined using several visualizations. The first is a box plot, which shows the distribution of prices for both organic and conventional avocados. This is produced with the following code:
> library(ggplot2)
> options(repr.plot.width= 7, repr.plot.height=5)
> ggplot(Avocado, aes(type, AveragePrice))+
+ geom_boxplot(aes(colour = factor(year)))+
+ labs(colour = "Year", x = "Type", y = "Average Price", title = "Average price per year by avocado type")
Average prices across regions are then examined separately for conventional and organic avocados. The data is first grouped by region and type; the conventional subset (grouped_region_conv), which the original excerpt uses without showing its construction, is built the same way as the organic one:
> library(dplyr)
> grouped_region_conv = Avocado %>%
+ select(year, region, type, AveragePrice) %>%
+ filter(type == 'conventional')
> min_con = round(min(grouped_region_conv$AveragePrice),1)-0.1
> max_con = round(max(grouped_region_conv$AveragePrice),1)+0.1
> grouped_region_org = Avocado %>%
+ select(year, region, type, AveragePrice) %>%
+ filter(type == 'organic')
> min_org = round(min(grouped_region_org$AveragePrice),1)-0.1
> max_org = round(max(grouped_region_org$AveragePrice),1)+0.1
> library(ggthemes)   # provides geom_tufteboxplot()
> options(repr.plot.width= 10, repr.plot.height=12)
> ggplot(grouped_region_conv, aes(x=region, y=AveragePrice)) +
+ geom_tufteboxplot() +
+ facet_grid(. ~ year, scales="free") +
+ labs(colour = "Year", x = "Region", y = "Average Price", title = "Average prices of Conventional Avocados for each region by year") +
+ scale_y_continuous(breaks=c(seq(min_con,max_con,0.2)), limits = c(min_con,max_con)) +
+ coord_flip() +
+ theme(axis.text.x = element_text(angle = 90, vjust = 0))
> grouped_region_max_conventional = Avocado %>%
+ group_by(region, type) %>%
+ select(region, type, AveragePrice) %>%
+ summarise(maxPrice = max(AveragePrice)) %>%
+ filter(type == 'conventional')
> grouped_region_max_conventional$region <- factor(grouped_region_max_conventional$region, levels = pull(arrange(grouped_region_max_conventional, maxPrice), region))
> grouped_region_max_organic = Avocado %>%
+ group_by(region, type) %>%
+ select(region, type, AveragePrice) %>%
+ summarise(maxPrice = max(AveragePrice)) %>%
+ filter(type == 'organic')
> grouped_region_max_organic$region <- factor(grouped_region_max_organic$region, levels = pull(arrange(grouped_region_max_organic, maxPrice), region))
> plot3 <- ggplot(grouped_region_max_conventional, aes(x=maxPrice, y=region, label = round(maxPrice, 1)))+
+ geom_segment(aes(x = 0, y = region, xend = maxPrice, yend = region), color = "grey50")+
+ labs(x = "Max. Price", y = "Region", title = "Max. prices of avocado (Conventional)")+
+ geom_point() +
+ geom_text(nudge_x = 0.3)
> # plot2 (the organic counterpart) is not shown in the excerpt; it is assumed to mirror plot3:
> plot2 <- ggplot(grouped_region_max_organic, aes(x=maxPrice, y=region, label = round(maxPrice, 1)))+
+ geom_segment(aes(x = 0, y = region, xend = maxPrice, yend = region), color = "grey50")+
+ labs(x = "Max. Price", y = "Region", title = "Max. prices of avocado (Organic)")+
+ geom_point() +
+ geom_text(nudge_x = 0.3)
> library(gridExtra)
> options(repr.plot.width= 10, repr.plot.height=10)
> grid.arrange(plot3, plot2, ncol=2)
The top regions by PLU code, avocado type, and year, in terms of both total volume sold and net price, can also be determined and visualized.
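The code below references a long-format table named avocadoCodes that is not constructed in this excerpt. A hedged sketch of how it might be derived from Avocado, gathering the three PLU columns into an `Avocado Type`/Volume pair and computing a TotalPrice as Volume times AveragePrice (the exact construction in the original analysis may differ):
> library(tidyr)
> avocadoCodes <- Avocado %>%
+ gather(`Avocado Type`, Volume, X4046, X4225, X4770) %>%   # one row per PLU code per observation
+ mutate(TotalPrice = Volume * AveragePrice)                # assumed definition of TotalPrice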
> grouped_avocado_type = avocadoCodes %>%
+ select(year, type, `Avocado Type`, Volume, TotalPrice, region, AveragePrice) %>%
+ group_by(year, `Avocado Type`, region, type) %>%
+ summarise(Volume = sum(Volume)/1000000, NetPrice = sum(TotalPrice)/1000000) %>%
+ filter(Volume > 0 & NetPrice > 0)
> library(data.table)
> avocado_type_DT <- data.table(grouped_avocado_type)
> top5_volume <- avocado_type_DT[order(-Volume), head(.SD,5), by = .(`Avocado Type`, year, type)]
> top5_price <- avocado_type_DT[order(-NetPrice), head(.SD,5), by = .(`Avocado Type`, year, type)]
> options(repr.plot.width= 8, repr.plot.height=5)
> ggplot(top5_volume, aes(x=region, y=Volume))+
+ geom_point(aes(col=year), size=2) +
+ facet_grid(type ~ `Avocado Type`, scales="free")+
+ labs(y = "Total Volume (in Millions)", x = "Region", title = "Total Volume for top 5 regions, each year by Avocado type and codes (PLUs)", colour = 'Year')+
+ coord_flip()
Analysis of the data and the above visualization shows that TotalUS is an outlier and needs to be removed to obtain a clearer view of the data. This improves the visibility of trends in the other regions and is done with the following code:
> top5_volume = top5_volume %>% filter(region != 'TotalUS')
> top5_price = top5_price %>% filter(region != 'TotalUS')
> ggplot(top5_volume, aes(x=region, y=Volume))+
+ geom_point(aes(col=year), size=2) +
+ facet_grid(type ~ `Avocado Type`, scales="free")+
+ labs(y = "Total Volume (in Millions)", x = "Region", title = "Total Volume for top 5 regions (excluding TotalUS)", colour = 'Year')+
+ coord_flip()
The bottom regions by PLU code, avocado type, and year can be examined in the same way, in relation to total volume and net price. The corresponding code is not shown here, but it follows the same pattern as the top-5 selection above, as sketched below.
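A minimal sketch, assuming the same data.table is reused and the ordering is simply ascending instead of descending (the variable names here are hypothetical):
> bottom5_volume <- avocado_type_DT[order(Volume), head(.SD,5), by = .(`Avocado Type`, year, type)]
> bottom5_price <- avocado_type_DT[order(NetPrice), head(.SD,5), by = .(`Avocado Type`, year, type)]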
Regression analysis is a predictive modeling method used to determine the relationship between dependent and independent variables. It is one of the most suitable methods for investigating cause-and-effect relationships between variables in a dataset. Among its benefits, regression analysis identifies significant relationships between the dependent and independent variables and indicates the strength of the impact of each independent variable on the dependent variable. Using regression analysis, one can determine the extent to which different variables affect one another within a dataset; for instance, it can be used to estimate the effect of price changes on demand and production. Regression analysis is therefore also useful for eliminating variables with no effect when building a predictive model.
When the predictors and the response variable are continuous, linear regression is the natural choice. Several assumptions must hold, however. As the name suggests, there must be a linear relationship between the response variable and the predictors, while the predictors themselves should not be correlated with one another, since correlation between predictors leads to multicollinearity. It is also assumed that the error terms are uncorrelated (otherwise there is autocorrelation) and have constant variance (otherwise there is heteroskedasticity). A quick check of the correlations among the volume-related predictors, sketched below, illustrates the multicollinearity concern in this dataset; the regression model itself is then built with lm().
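A minimal sketch of such a check (not part of the original analysis): Total.Volume is essentially the sum of the PLU and bag columns, so correlations close to 1 are expected here, which is consistent with the singularities reported in the model summary below.
> # pairwise correlations among the volume-related columns; values near 1 signal multicollinearity
> round(cor(Avocado[, c("Total.Volume", "X4046", "X4225", "X4770", "Total.Bags")]), 2)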
> linear_model <- lm(AveragePrice~ ., data = Avocado)
> summary(linear_model)
The output of the above code is as presented below:
Call:
lm(formula = AveragePrice ~ ., data = Avocado)
Residuals:
Min 1Q Median 3Q Max
-0.9836 -0.1208 0.0033 0.1272 1.4155
Coefficients: (3 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.837e+00 1.095e+00 3.505 0.000458 ***
Unnamed.0 -5.094e-02 2.105e-02 -2.420 0.015542 *
Date1/10/2016 -1.533e-01 5.181e-02 -2.959 0.003090 **
Date1/11/2015 -9.140e-03 5.197e-02 -0.176 0.860408
Date1/14/2018 -1.994e+00 8.843e-01 -2.255 0.024134 *
Date1/15/2017 -6.103e-02 5.211e-02 -1.171 0.241558
Date1/17/2016 -1.616e-01 6.987e-02 -2.313 0.020714 *
Date1/18/2015 -3.926e-02 7.007e-02 -0.560 0.575233
Date1/21/2018 -2.092e+00 9.053e-01 -2.311 0.020848 *
Date1/22/2017 -2.076e-01 7.023e-02 -2.956 0.003121 **
Date1/24/2016 -2.447e-01 8.925e-02 -2.741 0.006124 **
Date1/25/2015 -8.432e-02 8.947e-02 -0.942 0.345950
Date1/28/2018 -2.126e+00 9.263e-01 -2.295 0.021718 *
Date1/29/2017 -2.091e-01 8.964e-02 -2.333 0.019670 *
Date1/3/2016 -1.260e-01 3.703e-02 -3.403 0.000668 ***
Date1/31/2016 -2.857e-01 1.093e-01 -2.615 0.008936 **
Date1/4/2015 -2.647e-02 3.714e-02 -0.713 0.476112
Date1/7/2018 -2.009e+00 8.632e-01 -2.327 0.019953 *
Date1/8/2017 -3.415e-02 3.723e-02 -0.917 0.358896
Date10/1/2017 -1.393e+00 8.211e-01 -1.696 0.089821 .
Date10/11/2015 -1.997e+00 8.634e-01 -2.313 0.020710 *
Date10/15/2017 -1.583e+00 8.632e-01 -1.834 0.066632 .
Date10/16/2016 -1.933e+00 8.843e-01 -2.186 0.028800 *
Date10/18/2015 -2.026e+00 8.844e-01 -2.291 0.021964 *
Date10/2/2016 -1.740e+00 8.422e-01 -2.066 0.038811 *
Date10/22/2017 -1.753e+00 8.843e-01 -1.982 0.047501 *
Date10/23/2016 -1.929e+00 9.053e-01 -2.131 0.033120 *
Date10/25/2015 -2.088e+00 9.055e-01 -2.306 0.021124 *
Date10/29/2017 -1.867e+00 9.053e-01 -2.062 0.039217 *
Date10/30/2016 -1.818e+00 9.263e-01 -1.963 0.049654 *
Date10/4/2015 -1.898e+00 8.424e-01 -2.254 0.024238 *
Date10/8/2017 -1.472e+00 8.422e-01 -1.748 0.080426 .
Date10/9/2016 -1.883e+00 8.632e-01 -2.182 0.029131 *
Date11/1/2015 -2.217e+00 9.265e-01 -2.392 0.016750 *
Date11/12/2017 -2.059e+00 9.474e-01 -2.173 0.029772 *
Date11/13/2016 -2.037e+00 9.684e-01 -2.104 0.035431 *
Date11/15/2015 -2.282e+00 9.686e-01 -2.356 0.018504 *
Date11/19/2017 -2.122e+00 9.684e-01 -2.191 0.028463 *
Date11/20/2016 -2.173e+00 9.894e-01 -2.196 0.028127 *
Date11/22/2015 -2.342e+00 9.896e-01 -2.367 0.017956 *
Date11/26/2017 -2.169e+00 9.895e-01 -2.192 0.028377 *
Date11/27/2016 -2.227e+00 1.010e+00 -2.204 0.027554 *
Date11/29/2015 -2.382e+00 1.011e+00 -2.357 0.018458 *
Date11/5/2017 -1.959e+00 9.263e-01 -2.115 0.034484 *
Date11/6/2016 -1.952e+00 9.474e-01 -2.060 0.039407 *
Date11/8/2015 -2.238e+00 9.476e-01 -2.362 0.018190 *
Date12/10/2017 -2.430e+00 1.032e+00 -2.356 0.018478 *
Date12/11/2016 -2.517e+00 1.053e+00 -2.391 0.016802 *
Date12/13/2015 -2.538e+00 1.053e+00 -2.411 0.015905 *
Date12/17/2017 -2.457e+00 1.053e+00 -2.334 0.019612 *
Date12/18/2016 -2.598e+00 1.074e+00 -2.420 0.015516 *
Date12/20/2015 -2.537e+00 1.074e+00 -2.363 0.018151 *
Date12/24/2017 -2.437e+00 1.074e+00 -2.270 0.023216 *
Date12/25/2016 -2.617e+00 1.095e+00 -2.391 0.016815 *
Date12/27/2015 -2.633e+00 1.095e+00 -2.406 0.016149 *
Date12/3/2017 -2.337e+00 1.010e+00 -2.313 0.020747 *
Date12/31/2017 -2.642e+00 1.095e+00 -2.414 0.015799 *
Date12/4/2016 -2.403e+00 1.032e+00 -2.330 0.019823 *
Date12/6/2015 -2.485e+00 1.032e+00 -2.409 0.016009 *
Date2/1/2015 -2.835e-01 1.095e-01 -2.589 0.009626 **
Date2/11/2018 -2.315e+00 9.684e-01 -2.390 0.016839 *
Date2/12/2017 -3.813e-01 1.300e-01 -2.934 0.003356 **
Date2/14/2016 -4.062e-01 1.501e-01 -2.706 0.006827 **
Date2/15/2015 -2.474e-01 1.504e-01 -1.645 0.099923 .
Date2/18/2018 -2.299e+00 9.895e-01 -2.324 0.020138 *
Date2/19/2017 -3.684e-01 1.505e-01 -2.447 0.014402 *
Date2/21/2016 -4.116e-01 1.708e-01 -2.410 0.015966 *
Date2/22/2015 -3.231e-01 1.710e-01 -1.890 0.058828 .
Date2/25/2018 -2.366e+00 1.010e+00 -2.342 0.019216 *
Date2/26/2017 -4.480e-01 1.712e-01 -2.617 0.008881 **
Date2/28/2016 -4.735e-01 1.915e-01 -2.472 0.013450 *
Date2/4/2018 -2.334e+00 9.474e-01 -2.463 0.013780 *
Date2/5/2017 -3.880e-01 1.096e-01 -3.538 0.000403 ***
Date2/7/2016 -4.158e-01 1.296e-01 -3.208 0.001337 **
Date2/8/2015 -2.715e-01 1.298e-01 -2.092 0.036496 *
Date3/1/2015 -4.258e-01 1.918e-01 -2.221 0.026380 *
Date3/11/2018 -2.493e+00 1.053e+00 -2.369 0.017865 *
Date3/12/2017 -3.550e-01 2.127e-01 -1.669 0.095231 .
Date3/13/2016 -6.448e-01 2.332e-01 -2.765 0.005700 **
Date3/15/2015 -4.465e-01 2.334e-01 -1.913 0.055764 .
Date3/18/2018 -2.565e+00 1.074e+00 -2.389 0.016914 *
Date3/19/2017 -3.736e-01 2.336e-01 -1.599 0.109752
Date3/20/2016 -6.954e-01 2.541e-01 -2.737 0.006209 **
Date3/22/2015 -5.407e-01 2.543e-01 -2.126 0.033496 *
Date3/25/2018 -2.583e+00 1.095e+00 -2.360 0.018302 *
Date3/26/2017 -4.891e-01 2.545e-01 -1.922 0.054633 .
Date3/27/2016 -7.001e-01 2.750e-01 -2.546 0.010909 *
Date3/29/2015 -5.422e-01 2.752e-01 -1.970 0.048825 *
Date3/4/2018 -2.425e+00 1.032e+00 -2.351 0.018729 *
Date3/5/2017 -4.515e-01 1.919e-01 -2.353 0.018655 *
Date3/6/2016 -5.355e-01 2.123e-01 -2.522 0.011691 *
Date3/8/2015 -4.273e-01 2.126e-01 -2.010 0.044428 *
Date4/10/2016 -8.720e-01 3.169e-01 -2.752 0.005933 **
Date4/12/2015 -6.769e-01 3.171e-01 -2.135 0.032795 *
Date4/16/2017 -5.410e-01 3.172e-01 -1.705 0.088149 .
Date4/17/2016 -8.775e-01 3.378e-01 -2.597 0.009400 **
Date4/19/2015 -7.213e-01 3.380e-01 -2.134 0.032859 *
Date4/2/2017 -4.949e-01 2.754e-01 -1.797 0.072298 .
Date4/23/2017 -5.515e-01 3.382e-01 -1.631 0.103008
Date4/24/2016 -9.565e-01 3.588e-01 -2.666 0.007685 **
Date4/26/2015 -7.589e-01 3.590e-01 -2.114 0.034526 *
Date4/3/2016 -7.599e-01 2.959e-01 -2.568 0.010236 *
Date4/30/2017 -5.957e-01 3.592e-01 -1.658 0.097247 .
Date4/5/2015 -5.778e-01 2.961e-01 -1.951 0.051057 .
Date4/9/2017 -5.284e-01 2.963e-01 -1.783 0.074560 .
Date5/1/2016 -1.031e+00 3.798e-01 -2.715 0.006630 **
Date5/10/2015 -9.202e-01 4.010e-01 -2.295 0.021747 *
Date5/14/2017 -7.216e-01 4.011e-01 -1.799 0.072039 .
Date5/15/2016 -1.085e+00 4.217e-01 -2.573 0.010096 *
Date5/17/2015 -9.442e-01 4.220e-01 -2.238 0.025252 *
Date5/21/2017 -7.393e-01 4.221e-01 -1.751 0.079915 .
Date5/22/2016 -1.129e+00 4.427e-01 -2.551 0.010749 *
Date5/24/2015 -9.661e-01 4.430e-01 -2.181 0.029191 *
Date5/28/2017 -7.721e-01 4.431e-01 -1.742 0.081463 .
Date5/29/2016 -1.157e+00 4.637e-01 -2.495 0.012593 *
Date5/3/2015 -9.084e-01 3.800e-01 -2.391 0.016824 *
Date5/31/2015 -1.016e+00 4.640e-01 -2.189 0.028604 *
Date5/7/2017 -7.352e-01 3.802e-01 -1.934 0.053148 .
Date5/8/2016 -1.100e+00 4.007e-01 -2.746 0.006048 **
Date6/11/2017 -9.327e-01 4.851e-01 -1.923 0.054552 .
Date6/12/2016 -1.209e+00 5.058e-01 -2.390 0.016877 *
Date6/14/2015 -1.100e+00 5.060e-01 -2.173 0.029760 *
Date6/18/2017 -9.662e-01 5.058e-01 -1.910 0.056131 .
Date6/19/2016 -1.282e+00 5.268e-01 -2.434 0.014958 *
Date6/21/2015 -1.144e+00 5.270e-01 -2.172 0.029893 *
Date6/25/2017 -9.970e-01 5.268e-01 -1.892 0.058455 .
Date6/26/2016 -1.301e+00 5.478e-01 -2.375 0.017552 *
Date6/28/2015 -1.197e+00 5.480e-01 -2.184 0.028988 *
Date6/4/2017 -8.378e-01 4.641e-01 -1.805 0.071077 .
Date6/5/2016 -1.215e+00 4.847e-01 -2.507 0.012182 *
Date6/7/2015 -1.069e+00 4.850e-01 -2.204 0.027543 *
Date7/10/2016 -1.361e+00 5.898e-01 -2.308 0.021000 *
Date7/12/2015 -1.289e+00 5.900e-01 -2.185 0.028935 *
Date7/16/2017 -1.124e+00 5.898e-01 -1.906 0.056663 .
Date7/17/2016 -1.336e+00 6.108e-01 -2.187 0.028790 *
Date7/19/2015 -1.369e+00 6.110e-01 -2.240 0.025086 *
Date7/2/2017 -1.039e+00 5.478e-01 -1.897 0.057835 .
Date7/23/2017 -1.195e+00 6.109e-01 -1.956 0.050496 .
Date7/24/2016 -1.332e+00 6.319e-01 -2.108 0.035017 *
Date7/26/2015 -1.386e+00 6.321e-01 -2.192 0.028367 *
Date7/3/2016 -1.372e+00 5.688e-01 -2.412 0.015855 *
Date7/30/2017 -1.238e+00 6.319e-01 -1.959 0.050152 .
Date7/31/2016 -1.413e+00 6.529e-01 -2.164 0.030444 *
Date7/5/2015 -1.240e+00 5.690e-01 -2.180 0.029305 *
Date7/9/2017 -1.119e+00 5.688e-01 -1.968 0.049062 *
Date8/13/2017 -1.269e+00 6.739e-01 -1.884 0.059640 .
Date8/14/2016 -1.542e+00 6.949e-01 -2.218 0.026545 *
Date8/16/2015 -1.511e+00 6.952e-01 -2.173 0.029782 *
Date8/2/2015 -1.380e+00 6.531e-01 -2.113 0.034585 *
Date8/20/2017 -1.225e+00 6.950e-01 -1.763 0.077901 .
Date8/21/2016 -1.597e+00 7.160e-01 -2.230 0.025762 *
Date8/23/2015 -1.566e+00 7.162e-01 -2.186 0.028816 *
Date8/27/2017 -1.202e+00 7.160e-01 -1.679 0.093186 .
Date8/28/2016 -1.669e+00 7.370e-01 -2.264 0.023586 *
Date8/30/2015 -1.643e+00 7.372e-01 -2.229 0.025857 *
Date8/6/2017 -1.255e+00 6.529e-01 -1.922 0.054564 .
Date8/7/2016 -1.495e+00 6.739e-01 -2.219 0.026531 *
Date8/9/2015 -1.476e+00 6.741e-01 -2.190 0.028561 *
Date9/10/2017 -1.270e+00 7.580e-01 -1.676 0.093786 .
Date9/11/2016 -1.805e+00 7.791e-01 -2.317 0.020533 *
Date9/13/2015 -1.701e+00 7.793e-01 -2.182 0.029107 *
Date9/17/2017 -1.332e+00 7.791e-01 -1.709 0.087441 .
Date9/18/2016 -1.765e+00 8.001e-01 -2.207 0.027359 *
Date9/20/2015 -1.767e+00 8.003e-01 -2.208 0.027242 *
Date9/24/2017 -1.380e+00 8.001e-01 -1.725 0.084564 .
Date9/25/2016 -1.730e+00 8.211e-01 -2.107 0.035101 *
Date9/27/2015 -1.815e+00 8.213e-01 -2.210 0.027117 *
Date9/3/2017 -1.209e+00 7.370e-01 -1.640 0.100974
Date9/4/2016 -1.739e+00 7.580e-01 -2.294 0.021816 *
Date9/6/2015 -1.678e+00 7.582e-01 -2.212 0.026952 *
Total.Volume -4.448e-05 3.524e-05 -1.262 0.206885
X4046 4.447e-05 3.524e-05 1.262 0.207010
X4225 4.446e-05 3.524e-05 1.262 0.207046
X4770 4.467e-05 3.524e-05 1.268 0.204947
Total.Bags -2.243e-02 2.622e-02 -0.856 0.392153
Small.Bags 2.248e-02 2.622e-02 0.857 0.391218
Large.Bags 2.248e-02 2.622e-02 0.857 0.391218
XLarge.Bags 2.248e-02 2.622e-02 0.857 0.391201
typeorganic 4.940e-01 3.542e-03 139.467 < 2e-16 ***
year2016 NA NA NA NA
year2017 NA NA NA NA
year2018 NA NA NA NA
regionAtlanta -2.215e-01 1.737e-02 -12.747 < 2e-16 ***
regionBaltimoreWashington -2.665e-02 1.739e-02 -1.532 0.125462
regionBoise -2.136e-01 1.736e-02 -12.300 < 2e-16 ***
regionBoston -2.859e-02 1.738e-02 -1.644 0.100126
regionBuffaloRochester -4.458e-02 1.736e-02 -2.568 0.010239 *
regionCalifornia -1.717e-01 1.776e-02 -9.664 < 2e-16 ***
regionCharlotte 4.284e-02 1.737e-02 2.467 0.013634 *
regionChicago -1.260e-02 1.745e-02 -0.722 0.470530
regionCincinnatiDayton -3.513e-01 1.738e-02 -20.219 < 2e-16 ***
regionColumbus -3.095e-01 1.736e-02 -17.828 < 2e-16 ***
regionDallasFtWorth -4.738e-01 1.740e-02 -27.232 < 2e-16 ***
regionDenver -3.384e-01 1.747e-02 -19.373 < 2e-16 ***
regionDetroit -2.928e-01 1.740e-02 -16.832 < 2e-16 ***
regionGrandRapids -5.949e-02 1.736e-02 -3.426 0.000613 ***
regionGreatLakes -2.476e-01 1.805e-02 -13.713 < 2e-16 ***
regionHarrisburgScranton -4.779e-02 1.736e-02 -2.753 0.005910 **
regionHartfordSpringfield 2.586e-01 1.737e-02 14.892 < 2e-16 ***
regionHouston -5.110e-01 1.739e-02 -29.387 < 2e-16 ***
regionIndianapolis -2.476e-01 1.736e-02 -14.261 < 2e-16 ***
regionJacksonville -4.966e-02 1.736e-02 -2.860 0.004238 **
regionLasVegas -1.792e-01 1.736e-02 -10.321 < 2e-16 ***
regionLosAngeles -3.541e-01 1.764e-02 -20.069 < 2e-16 ***
regionLouisville -2.745e-01 1.736e-02 -15.814 < 2e-16 ***
regionMiamiFtLauderdale -1.301e-01 1.738e-02 -7.486 7.40e-14 ***
regionMidsouth -1.587e-01 1.754e-02 -9.047 < 2e-16 ***
regionNashville -3.493e-01 1.736e-02 -20.116 < 2e-16 ***
regionNewOrleansMobile -2.571e-01 1.737e-02 -14.806 < 2e-16 ***
regionNewYork 1.711e-01 1.752e-02 9.765 < 2e-16 ***
regionNortheast 5.327e-02 1.879e-02 2.835 0.004589 **
regionNorthernNewEngland -8.248e-02 1.737e-02 -4.748 2.07e-06 ***
regionOrlando -5.377e-02 1.737e-02 -3.096 0.001964 **
regionPhiladelphia 7.187e-02 1.737e-02 4.138 3.52e-05 ***
regionPhoenixTucson -3.319e-01 1.741e-02 -19.061 < 2e-16 ***
regionPittsburgh -1.969e-01 1.736e-02 -11.345 < 2e-16 ***
regionPlains -1.212e-01 1.741e-02 -6.959 3.54e-12 ***
regionPortland -2.436e-01 1.738e-02 -14.018 < 2e-16 ***
regionRaleighGreensboro -8.023e-03 1.737e-02 -0.462 0.644139
regionRichmondNorfolk -2.701e-01 1.736e-02 -15.557 < 2e-16 ***
regionRoanoke -3.132e-01 1.736e-02 -18.039 < 2e-16 ***
regionSacramento 6.138e-02 1.736e-02 3.535 0.000408 ***
regionSanDiego -1.631e-01 1.736e-02 -9.396 < 2e-16 ***
regionSanFrancisco 2.450e-01 1.737e-02 14.098 < 2e-16 ***
regionSeattle -1.178e-01 1.738e-02 -6.780 1.24e-11 ***
regionSouthCarolina -1.580e-01 1.736e-02 -9.098 < 2e-16 ***
regionSouthCentral -4.513e-01 1.799e-02 -25.094 < 2e-16 ***
regionSoutheast -1.517e-01 1.785e-02 -8.495 < 2e-16 ***
regionSpokane -1.157e-01 1.736e-02 -6.666 2.70e-11 ***
regionStLouis -1.308e-01 1.736e-02 -7.536 5.06e-14 ***
regionSyracuse -4.104e-02 1.736e-02 -2.364 0.018082 *
regionTampa -1.509e-01 1.737e-02 -8.688 < 2e-16 ***
regionTotalUS -2.173e-01 2.167e-02 -10.028 < 2e-16 ***
regionWest -2.692e-01 1.811e-02 -14.864 < 2e-16 ***
regionWestTexNewMexico -3.092e-01 1.847e-02 -16.745 < 2e-16 ***
—
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.2257 on 18017 degrees of freedom
Multiple R-squared: 0.6899, Adjusted R-squared: 0.686
F-statistic: 173.5 on 231 and 18017 DF, p-value: < 2.2e-16
The adjusted R-squared is used to assess the goodness of fit of the regression model: the higher the adjusted R-squared, the better the model fits. As shown in the results above, the adjusted R-squared was 0.686; the arithmetic below verifies how it follows from the multiple R-squared. Although this indicates a reasonable fit, the model can be examined further through diagnostic plots, generated with the code that follows.
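As a sanity check (not part of the original output), the adjusted R-squared can be recomputed from the multiple R-squared, the number of observations, and the number of model degrees of freedom reported in the summary:
> # adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)
> r2 <- 0.6899; n <- 18249; p <- 231   # values taken from the summary output above
> 1 - (1 - r2) * (n - 1) / (n - p - 1)
[1] 0.6859241
This matches the reported Adjusted R-squared of 0.686 up to rounding. The diagnostic plots are produced as follows: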
> par(mfrow=c(2,2))
> plot(linear_model)
The residuals vs. fitted graph is the most informative of the four diagnostic charts. The residual is the difference between the actual value and the fitted value, while the fitted value is the value predicted by the model. As the plot shows, the points form a funnel shape widening from one side to the other, implying that the error variance is not constant, a situation called heteroskedasticity. This also explains the visible patterns in the graph. Heteroskedasticity can be handled by taking the log of the response variable. Although the adjusted R-squared already indicates a reasonable model, reducing heteroskedasticity can improve it further, as shown in the code below:
> linear_model <- lm(log(AveragePrice) ~ ., data = Avocado)
> summary(linear_model)
The output of running this code is:
Residual standard error: 0.1582 on 18017 degrees of freedom
Multiple R-squared: 0.7054, Adjusted R-squared: 0.7016
F-statistic: 186.7 on 231 and 18017 DF, p-value: < 2.2e-16
Adjusting the model in this way improved the adjusted R-squared from 0.686 to 0.7016, making the model noticeably better. Loading the Metrics package (installed via install.packages("Metrics")) allows the root mean squared error (RMSE) to be calculated, as shown below.
> library(Metrics)
> rmse(Avocado$AveragePrice, exp(linear_model$fitted.values))
[1] 0.231306
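For reference, Metrics::rmse() computes the square root of the mean squared difference between observations and predictions; the following equivalent one-liner (not in the original analysis) makes the definition explicit. The fitted values are exponentiated because the model was fitted on log(AveragePrice):
> # RMSE by hand: sqrt(mean((y - yhat)^2))
> sqrt(mean((Avocado$AveragePrice - exp(linear_model$fitted.values))^2))
[1] 0.231306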
The decision tree is a supervised learning algorithm with a predefined target variable, most commonly used for classification. It works by dividing a dataset into progressively more homogeneous subsets, determined by the most differentiating input variables. A decision tree comprises a root node, splits, decision nodes, terminal nodes, sub-trees, and parent and child nodes. The root node represents the entire sample and can be divided further into multiple homogeneous sets. The process by which a node is divided into two or more sub-nodes is called splitting and produces decision nodes; nodes that cannot be divided further are known as terminal nodes. In this project, a decision tree is used to build a predictive model in R, implemented with the caret and rpart packages. The caret package is used to ensure that the resulting model is robust and not susceptible to overfitting. The tree is tuned through the complexity parameter (cp), which trades off accuracy on the training data against the complexity of the model: the smaller the cp, the bigger the decision tree, which risks overfitting, while too large a cp risks underfitting by failing to capture the underlying trends properly. The following code was used to obtain the output presented below:
> library(caret)
# setting the tree control parameters
> fitControl <- trainControl(method = "cv", number = 5)
> cartGrid <- expand.grid(.cp = (1:50)*0.01)
# decision tree
> tree_model <- train(AveragePrice ~ ., data = Avocado, method = "rpart", trControl = fitControl, tuneGrid = cartGrid)
> print(tree_model)
CART
18249 samples
13 predictor
No pre-processing
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 14599, 14600, 14597, 14600, 14600
Resampling results across tuning parameters:
cp RMSE Rsquared MAE
0.01 0.2752946 0.5326203 0.2125589
0.02 0.2997619 0.4458872 0.2296112
0.03 0.3075545 0.4166823 0.2380681
0.04 0.3122469 0.3986878 0.2413319
0.05 0.3172388 0.3794028 0.2442293
0.06 0.3172388 0.3794028 0.2442293
0.07 0.3172388 0.3794028 0.2442293
0.08 0.3172388 0.3794028 0.2442293
0.09 0.3172388 0.3794028 0.2442293
0.10 0.3172388 0.3794028 0.2442293
0.11 0.3172388 0.3794028 0.2442293
0.12 0.3172388 0.3794028 0.2442293
0.13 0.3172388 0.3794028 0.2442293
0.14 0.3172388 0.3794028 0.2442293
0.15 0.3172388 0.3794028 0.2442293
0.16 0.3172388 0.3794028 0.2442293
0.17 0.3172388 0.3794028 0.2442293
0.18 0.3172388 0.3794028 0.2442293
0.19 0.3172388 0.3794028 0.2442293
0.20 0.3172388 0.3794028 0.2442293
0.21 0.3172388 0.3794028 0.2442293
0.22 0.3172388 0.3794028 0.2442293
0.23 0.3172388 0.3794028 0.2442293
0.24 0.3172388 0.3794028 0.2442293
0.25 0.3172388 0.3794028 0.2442293
0.26 0.3172388 0.3794028 0.2442293
0.27 0.3172388 0.3794028 0.2442293
0.28 0.3172388 0.3794028 0.2442293
0.29 0.3172388 0.3794028 0.2442293
0.30 0.3172388 0.3794028 0.2442293
0.31 0.3172388 0.3794028 0.2442293
0.32 0.3172388 0.3794028 0.2442293
0.33 0.3172388 0.3794028 0.2442293
0.34 0.3172388 0.3794028 0.2442293
0.35 0.3172388 0.3794028 0.2442293
0.36 0.3172388 0.3794028 0.2442293
0.37 0.3172388 0.3794028 0.2442293
0.38 0.3863415 0.3628635 0.3091749
0.39 0.4026596 NaN 0.3242786
0.40 0.4026596 NaN 0.3242786
0.41 0.4026596 NaN 0.3242786
0.42 0.4026596 NaN 0.3242786
0.43 0.4026596 NaN 0.3242786
0.44 0.4026596 NaN 0.3242786
0.45 0.4026596 NaN 0.3242786
0.46 0.4026596 NaN 0.3242786
0.47 0.4026596 NaN 0.3242786
0.48 0.4026596 NaN 0.3242786
0.49 0.4026596 NaN 0.3242786
0.50 0.4026596 NaN 0.3242786
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was cp = 0.01.
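A minimal sketch of how the selected tree could be drawn, assuming the rpart.plot package is installed (this step is not part of the original analysis, which reports that rendering the full tree failed due to over-plotting):
> library(rpart.plot)
> rpart.plot(tree_model$finalModel)   # draws the final rpart tree chosen at cp = 0.01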
The main objective of this project was to observe trends in the US avocado market based on average price, region, and avocado type. Different machine learning algorithms were used to evaluate and estimate avocado sales; two in particular were evaluated in R: linear regression and the decision tree. Under linear regression, the extent of the linear relationship between the dependent variable and the independent variables was determined, with the average price of a single avocado predicted from the other variables. The fitted model was then used to plot the residuals vs. fitted graph, where the residual represents the difference between the actual and predicted value and the fitted value represents the prediction. The resulting graph was funnel-shaped, indicating that the regression model is affected by unequal variance in the error terms, also known as heteroskedasticity. The other algorithm used to model the avocado prices dataset was the decision tree. Owing to the large size of the data and the difficulty of selecting the best cp, the full tree could not be rendered due to over-plotting errors. As such, linear regression was selected as the better model for the avocado prices dataset.