What do we work on?

aditivrao94
Apr 20, 2020
Updated: Apr 27, 2020

In the first week, we discussed our initial plans. There were three candidate datasets for the final project. The first was AirBnB data, available from the open data website “Get the Data - Inside Airbnb” at http://insideairbnb.com/get-the-data.html. The second was social network data from Twitter, available from the “Stanford Large Network Dataset Collection” at https://snap.stanford.edu/data/ego-Twitter.html. The third was a heart disease dataset from Kaggle - https://www.kaggle.com/volodymyrgavrysh/heart-disease/data?fbclid=IwAR02hGuTleo_sp1xpDAO_hRVGGwI5YxE5C3tchkJ5yYXjsqnIuWP4tL2pZ0

The sections below give more detail on each candidate dataset.

AirBnB Data

The data visualization course seemed to call for a great deal of creativity and a fair amount of effort. So I decided the best way to have fun would be to combine my work with my passion - travel. Among my first few Google searches, I found a website with the full listing, neighborhood, host and review data from AirBnB for the past two years in 70+ cities around the world. This seemed like a great dataset to analyse and blog about. Its benefits were the sheer volume of data available and the range of analyses that could be performed.

The preliminary idea was to perform text analytics on the reviews of each listed property and try to find the best property in Europe to visit. But that posed a serious problem: the reviews were very unstructured, not every property had enough reviews, and the scraped data was not accurate. So we decided to focus on the listing information instead. It has several columns describing each property listed in a given city. The columns can be grouped into three categories: host-related, property-related and review-related. We would like to work on the 41 European cities available on the website and hopefully put together some kind of "Must Visit" list.
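As a rough illustration of the three-way column grouping described above, here is a minimal Python sketch. The column names and the naming rule are assumptions made up for this example; the actual listings file may use different names and may need a hand-written mapping.

```python
from collections import defaultdict

# Hypothetical column names for illustration only; the real listings
# file may name its columns differently.
SAMPLE_COLUMNS = [
    "host_id", "host_since", "host_is_superhost",
    "property_type", "room_type", "price",
    "number_of_reviews", "review_scores_rating",
]


def column_group(name):
    """Assign a column to one of the three groups with a simple name rule."""
    if name.startswith("host_"):
        return "host"
    if "review" in name:
        return "review"
    # Everything else is treated as describing the property itself.
    return "property"


def group_columns(columns):
    """Return a dict mapping each group name to its list of columns."""
    groups = defaultdict(list)
    for name in columns:
        groups[column_group(name)].append(name)
    return dict(groups)


if __name__ == "__main__":
    for group, names in group_columns(SAMPLE_COLUMNS).items():
        print(group, names)
```

A name-based rule like this is only a starting point; columns that do not follow an obvious naming pattern would fall into the property group by default and should be checked by hand.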

Social Network Data (Twitter)

Why did I choose the social network data for the final project? The reason was that network data is highly appropriate for data visualization. In social network analysis, visual exploration plays an important role. Experts in this field tend to use different kinds of network graphs to show the structure and find patterns that can explain something about the relationships between nodes. For example, in social network data, patterns can be visualized at the dyad, triad and group levels. Data visualization techniques can highlight those features in different ways so that readers can easily get the key information. Besides, some advanced visualization techniques, such as interactive views, can also be applied to network data, which helps readers find information by themselves.

In this case, according to the description on the website, the Twitter dataset has 81306 nodes, meaning that information about 81306 Twitter users was collected, and 1768149 edges (ties). Other basic statistics are also presented on the website, giving some information about clustering and size. Although it is a fascinating and large dataset, it has some drawbacks. First, without any social background information, the analysis cannot be very meaningful. The dataset includes only the basic structure of Twitter networks without any attributes, like gender or other behavioral factors. Even if some graph patterns can be found from a technical point of view, it could be hard to explain why and how those patterns occur. Second, it is more academic and less practical. Compared to commercial data, like the AirBnB data, we still do not know how to connect the findings with real life.
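To make the node, edge, dyad and triad terms above concrete, here is a minimal Python sketch that computes them for a tiny made-up network. It assumes the data comes as a whitespace-separated directed edge list with one "source target" pair per line; the sample edges and the function names are invented for this example and are not taken from the Twitter dataset.

```python
# Tiny made-up directed edge list ("source target" per line); not real data.
SAMPLE_EDGES = """\
1 2
2 1
2 3
3 1
1 3
4 1
"""


def parse_edges(text):
    """Parse 'source target' lines into a set of directed edges."""
    edges = set()
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            edges.add((parts[0], parts[1]))
    return edges


def summarize(edges):
    """Count nodes, directed edges, mutual dyads and undirected triangles."""
    nodes = {node for edge in edges for node in edge}

    # A mutual dyad is a pair of nodes with an edge in both directions.
    mutual = sum(1 for a, b in edges if a < b and (b, a) in edges)

    # Undirected neighbour sets, ignoring direction and self-loops.
    neighbours = {node: set() for node in nodes}
    for a, b in edges:
        if a != b:
            neighbours[a].add(b)
            neighbours[b].add(a)

    # Count each triangle once by requiring a < b < c.
    triangles = 0
    for a in nodes:
        for b in neighbours[a]:
            if a < b:
                common = neighbours[a] & neighbours[b]
                triangles += sum(1 for c in common if c > b)

    return {
        "nodes": len(nodes),
        "edges": len(edges),
        "mutual_dyads": mutual,
        "triangles": triangles,
    }


if __name__ == "__main__":
    print(summarize(parse_edges(SAMPLE_EDGES)))
```

Counts like these describe the shape of a network, but as noted above, they do not explain why the patterns occur without attributes for the people behind the nodes.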

Heart Disease Data

One of the other datasets we were planning to analyze was the ‘Heart Disease’ dataset, which had 14 attributes (variables). The response variable was the presence of heart disease in the patient.

There were two major reasons why this dataset was not selected. First, it did not have many explanatory variables (columns), which left less scope for different visualizations and analyses. Second, since it is heart disease data, most of the variables were biomedical terms, and they were not easy to interpret without knowing what each of them represents.

Why did we choose the final idea?

After comparing the merits and flaws of the three candidates, we decided to use the AirBnB data for our final project. There are two reasons for that:

The data website provides plenty of information about AirBnB listings, including the time, the housing type, the price, the location and even guest reviews. This variety of factors broadens what we can do with data visualization, and different combinations can make the visualizations more fun.
The analysis of AirBnB data is practical and meaningful. Since it is from a commercial company focused on travel and housing, we can directly get some clear and useful findings for readers, such as a list of the top 10 recommended rooms.

This was our discussion in the first week. We came up with three potential datasets and plans, and made our final choice.
