Data Analysis of Google Play apps using Python

Olakunle Yusuf
5 min read · Mar 23, 2022

Introduction

Apps are a part of our lives now more than ever. The average phone user is likely to check their phone 47 times a day. Gen Z users check their phones even more often, a whopping 86 times a day, according to Deloitte’s 2017 Global Mobile Consumer Survey: U.S. edition, released in December 2017.

Business Task

According to eMarketer, 88% of the time spent on phones is spent on apps. What exactly are people doing on these apps? Which apps are the most popular? Which apps are free? How can a developer make more money from these apps? These questions, and many more, are what I plan to answer with data analysis.

Data Sources

The dataset for this analysis was obtained from Kaggle. It is stored in CSV format and is structured, organized in rows and columns.

Limitations

This dataset does not contain every app on the Google Play Store. It is web-scraped data on about 10k Play Store apps, meant for analyzing the Android market.

Data Cleaning and Manipulation

Python is the tool I chose for this project. The pandas library provides efficient data cleaning tools and quick visualizations for gaining insights.

I downloaded the googlestore dataset from Kaggle. The CSV file contains 10,841 rows and 13 columns: App, Category, Rating, Reviews, Size, Installs, Type, Price, Content Rating, Genres, Last Updated, Current Ver and Android Ver.
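As a minimal sketch of this loading step, the block below reads a CSV into a pandas dataframe and prints the first rows and the summary information. The three sample rows are made up for illustration and only mirror the 13 column names listed above; the file name mentioned in the comment is an assumption.

```python
import io
import pandas as pd

# Hypothetical three-row sample with the same 13 columns as the Kaggle file.
SAMPLE_CSV = io.StringIO(
    "App,Category,Rating,Reviews,Size,Installs,Type,Price,"
    "Content Rating,Genres,Last Updated,Current Ver,Android Ver\n"
    'Sketch Pad,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,'
    'Art & Design,"January 7, 2018",1.0.0,4.0.3 and up\n'
    'Word Puzzle,GAME,4.5,1200,25M,"1,000,000+",Free,0,Everyone,'
    'Puzzle,"March 3, 2018",2.1,4.1 and up\n'
    'Budget Pro,FINANCE,,87,3.2M,"5,000+",Paid,$4.99,Everyone,'
    'Finance,"June 1, 2018",1.3,5.0 and up\n'
)

# With the real download this would be something like
# pd.read_csv("googleplaystore.csv"); the file name is an assumption.
df = pd.read_csv(SAMPLE_CSV)

print(df.head())   # first 5 rows (only 3 exist in this sample)
df.info()          # column dtypes and non-null counts
print(df.shape)    # (rows, columns)
```

On the full dataset, `df.shape` should report the 10,841 rows and 13 columns mentioned above.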

first 5 rows of the googlestore dataset
summary information of the googlestore dataframe

A box plot of the Rating column shows that the values are concentrated around 4.5, with one outlier at 19.0. The maximum possible rating is 5.0, so the 19.0 value must be an error. This row was removed from the dataset.

Box plot showing the rating distribution before data cleaning
Histogram showing the data distribution before data cleaning

A fresh plot shows the values concentrated around 4.0 to 4.5, with all values now between 0 and 5.

Box plot showing the Rating distribution after data cleaning
Histogram showing the data distribution after data cleaning
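The outlier removal described above can be sketched roughly as follows. The small dataframe is hypothetical and only reproduces the situation from the text: one rating of 19.0 among otherwise valid values. Rows with a missing rating are kept here on the assumption that they are handled in a later step.

```python
import pandas as pd

# Hypothetical sample: one impossible rating (19.0) and one missing rating.
ratings = pd.DataFrame({
    "App": ["A", "B", "C", "D"],
    "Rating": [4.5, 19.0, 4.1, None],
})

# Keep rows whose rating is missing or lies within the valid 0-5 range.
valid = ratings["Rating"].isna() | ratings["Rating"].between(0, 5)
cleaned = ratings[valid].reset_index(drop=True)

print(cleaned)
# cleaned["Rating"].plot.box() would redraw the box plot (requires matplotlib).
```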

The dataset contains null values. The Rating column has 1474 null values, the Type column has 1, and the Current Ver and Android Ver columns have 8 and 2 respectively. To keep these rows in the analysis, suitable average values were chosen to fill the null values.
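One way to fill such gaps is sketched below. The choice of the median for the numeric Rating column and the most frequent value (mode) for the text Type column is an assumption for illustration; the exact "average" used in the analysis is not stated. The sample data is hypothetical.

```python
import pandas as pd

# Hypothetical sample with one missing Rating and one missing Type.
apps = pd.DataFrame({
    "Rating": [4.0, None, 4.5, 4.2],
    "Type": ["Free", "Free", None, "Paid"],
})

# Count nulls per column before filling.
print(apps.isnull().sum())

# Assumption: median for numeric columns, mode for categorical columns.
apps["Rating"] = apps["Rating"].fillna(apps["Rating"].median())
apps["Type"] = apps["Type"].fillna(apps["Type"].mode()[0])

print(apps)
```

The same mode-based fill could be applied to the Current Ver and Android Ver columns.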

The Reviews, Installs and Price columns are stored as non-numeric datatypes. These columns were cleaned up and converted to numeric datatypes.
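A minimal sketch of this conversion is shown below. It assumes the raw values look like "10,000+" for Installs and "$4.99" for Price, which is my reading of the scraped format rather than something stated above; the sample values are hypothetical.

```python
import pandas as pd

# Hypothetical raw values, all stored as strings.
raw = pd.DataFrame({
    "Reviews": ["159", "1200", "87"],
    "Installs": ["10,000+", "1,000,000+", "5,000+"],
    "Price": ["0", "0", "$4.99"],
})

# Reviews: plain digits, so a direct conversion is enough.
raw["Reviews"] = pd.to_numeric(raw["Reviews"])

# Installs: strip the thousands separators and the trailing "+".
raw["Installs"] = pd.to_numeric(
    raw["Installs"].str.replace(r"[+,]", "", regex=True)
)

# Price: strip the leading dollar sign.
raw["Price"] = pd.to_numeric(raw["Price"].str.replace("$", "", regex=False))

print(raw.dtypes)
```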

The categories with the most apps in the Google Play Store are FAMILY, GAME, TOOLS, MEDICAL and BUSINESS.

Bar chart showing total number of apps in each category

The categories with the most installs are GAME, COMMUNICATION, PRODUCTIVITY, SOCIAL and TOOLS.

Number of app installs per category
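Both category rankings above come down to a count and a grouped sum. The sketch below shows the idea on hypothetical data, assuming Installs has already been converted to a numeric column.

```python
import pandas as pd

# Hypothetical sample; Installs is assumed to be numeric already.
apps = pd.DataFrame({
    "Category": ["GAME", "GAME", "TOOLS", "FAMILY", "FAMILY", "FAMILY"],
    "Installs": [1_000_000, 500_000, 100_000, 10_000, 5_000, 1_000],
})

# Number of apps in each category, largest first.
apps_per_category = apps["Category"].value_counts()

# Total installs in each category, largest first.
installs_per_category = (
    apps.groupby("Category")["Installs"].sum().sort_values(ascending=False)
)

print(apps_per_category)
print(installs_per_category)
# apps_per_category.plot.bar() would draw the bar chart (requires matplotlib).
```

In this made-up sample, FAMILY has the most apps while GAME has the most installs, which illustrates how the two rankings can differ.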

The categories with the highest total ratings are FAMILY, GAME and TOOLS. This chart does not say much; it is in fact very similar to the chart of the total number of apps in each category. Because the average ratings are closely distributed around 4.5, the total rating mostly reflects how many apps each category contains.

A more insightful chart is the average rating of apps in each category. The categories with the highest average rating per app are EVENTS, EDUCATION, ART_AND_DESIGN, BOOKS_AND_REFERENCE and PERSONALIZATION.

Average app rating per category
Average app rating per category line chart
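The difference between total and average rating per category can be seen in a small sketch. The data is hypothetical: a large category with ordinary ratings next to a small category with a high rating.

```python
import pandas as pd

# Hypothetical sample: FAMILY has many apps, EVENTS has one well-rated app.
ratings = pd.DataFrame({
    "Category": ["FAMILY", "FAMILY", "FAMILY", "EVENTS"],
    "Rating": [4.0, 4.2, 4.1, 4.6],
})

# Count, total and average rating for each category.
by_category = ratings.groupby("Category")["Rating"].agg(["count", "sum", "mean"])

print(by_category.sort_values("mean", ascending=False))
```

The `sum` column follows the `count` column, while the `mean` column ranks categories independently of their size.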

Free apps make up 92.62% of the apps in the googleplaystore dataset, while paid apps account for 7.38%.

Percentage of free and paid apps
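A percentage split like this can be computed with a normalized value count. The sketch below uses a hypothetical Type column with nine free apps and one paid app, so the numbers differ from the real dataset.

```python
import pandas as pd

# Hypothetical Type column: 9 free apps and 1 paid app.
types = pd.Series(["Free"] * 9 + ["Paid"], name="Type")

# Share of each type as a percentage, rounded to two decimals.
share = types.value_counts(normalize=True).mul(100).round(2)

print(share)
# share.plot.pie(autopct="%.2f%%") would draw the pie chart (requires matplotlib).
```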

The highest-earning apps are in the FAMILY, LIFESTYLE, GAME, FINANCE and PHOTOGRAPHY categories.

total earnings of apps per category
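The dataset has no earnings column, so earnings have to be derived. The sketch below assumes a simple estimate of Price multiplied by Installs; that formula is my assumption for illustration and may not be the exact calculation behind the chart. The sample data is hypothetical.

```python
import pandas as pd

# Hypothetical paid apps; Price and Installs are assumed numeric already.
paid = pd.DataFrame({
    "Category": ["FAMILY", "FINANCE", "FAMILY"],
    "Price": [2.0, 5.0, 1.0],
    "Installs": [1000, 100, 500],
})

# Assumed estimate: every install was bought at the listed price.
paid["Earnings"] = paid["Price"] * paid["Installs"]

# Total estimated earnings per category, largest first.
earnings_per_category = (
    paid.groupby("Category")["Earnings"].sum().sort_values(ascending=False)
)

print(earnings_per_category)
```

Such an estimate ignores in-app purchases, discounts and refunds, so it is only a rough proxy.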

No noticeable correlation was observed among the numeric fields in the dataset, such as price, reviews, installs, ratings and earnings.

correlation among all the apps (numbers)
correlation among all the apps

However, when only the paid apps were considered, some interesting correlations came up. The number of installs has a strong correlation with the number of reviews. Earnings also have a strong correlation with the number of installs. The price of an app and its earnings do not have a strong correlation.

correlation among the paid apps (numbers)
correlation for paid apps
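The two correlation tables, for all apps and for paid apps only, can be produced with `DataFrame.corr()`. The sketch below uses hypothetical numbers, so its coefficients say nothing about the real dataset; it only shows how the paid subset is selected before computing the matrix.

```python
import pandas as pd

# Hypothetical sample with numeric columns already cleaned.
apps = pd.DataFrame({
    "Type": ["Free", "Free", "Paid", "Paid", "Paid", "Paid"],
    "Price": [0.0, 0.0, 1.0, 5.0, 2.0, 3.0],
    "Reviews": [100, 2000, 10, 40, 200, 80],
    "Installs": [10000, 50000, 100, 500, 1000, 1000],
})

numeric_cols = ["Price", "Reviews", "Installs"]

# Pearson correlation across all apps.
corr_all = apps[numeric_cols].corr()

# Pearson correlation restricted to paid apps.
corr_paid = apps.loc[apps["Type"] == "Paid", numeric_cols].corr()

print(corr_all)
print(corr_paid)
```

Each result is a square table with 1.0 on the diagonal, and it can be passed to a heatmap function for plotting.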

CONCLUSION/RECOMMENDATIONS

  1. The highest-earning apps are in the FAMILY, LIFESTYLE, GAME, FINANCE and PHOTOGRAPHY categories. A developer or entrepreneur who wants to invest can explore these categories.
  2. The best-rated apps are in the EVENTS, EDUCATION, ART_AND_DESIGN, BOOKS_AND_REFERENCE and PERSONALIZATION categories.
  3. The majority of apps in the Google Play store are free.
  4. There is a high correlation between the number of installs and reviews for paid apps.
  5. There is a high correlation between the number of installs and earnings for paid apps.
  6. The number of reviews and earnings also show a strong correlation.

Olakunle Yusuf

I am a data analyst with strong analytical skills. I recently earned the Google Data Analytics Professional Certificate. SQL | R | Python | Tableau