Data Mining

London, 24th April 2016


Data pours in at unprecedented speed and volume from everywhere. But making fact-based decisions does not depend solely on the amount of data we have. In fact, having so much data can sometimes be paralysing: it is hard to know where to even begin.

Success will depend on how quickly insights can be discovered from all that data, and on using those insights to drive better actions across the entire organization.

So much data and multitudes of possible decisions! The majority of organizations everywhere seem to struggle with this dilemma. The data is growing, but what about the ability to make decisions based on those huge volumes of data? Is that growing too? For many, unfortunately, the answer is no.

That’s where predictive analytics, data mining, machine learning and decision management come into play.

Predictive analytics helps assess what will happen in the future.

Data mining looks for hidden patterns in data that can be used to predict future behaviour.

Businesses, scientists and governments have used this approach for years to transform data into proactive insights.

Decision management turns those insights into actions that are used in your operational processes.

So while the same approaches can still be applied today, they need to happen faster and at a larger scale, using the most modern techniques available.

Forward-thinking organizations like Facebook, Walmart, Amazon and Pfizer use data mining and predictive analytics to detect fraud and cybersecurity issues, manage risk, anticipate resource demands, increase response rates for marketing campaigns, generate next-best offers, curb customer attrition and identify adverse drug effects during clinical trials, among many other things.

Because they can produce predictive insights from large and diverse data, the technologies of data mining, machine learning and advanced analytical modelling are essential for identifying the factors that can improve organizational performance and, when automated in everyday decisions, create competitive advantage. And with more of everything these days (data, computing power, business questions, risks and consumers), the ability to scale analytical power is essential for staying ahead of your competitors.

Deploying analytical insights quickly ensures that the timeliness of models is not lost to slow, hand-written deployment code. If we can rapidly deploy analytical models, their context and relevance are not lost and competitive advantage is retained. So how do we create an environment that helps an organization deal with all of the data being collected, all of the models being created and all of the decisions that need to be made, all at an increasing scale? The answer is an iterative analytical life cycle that brings together:

• Data – the foundation for decisions.

• Discovery – the process of identifying new insights in data.

• Deployment – the process of using newly found insights to drive improved actions.


Even though the majority of this blog focuses on using data mining for insight discovery, let’s look at the entire iterative analytical life cycle, because that’s what makes predictive discovery achievable and the actions from it more valuable.

  • Ask a business question. It all starts here. First we need a question to start the process. The discovery process is driven by asking business questions that produce innovation. This step is focused on exploring what needs to be known, and how predictive analytics can be applied to the data to solve a problem or improve a process.
  • Prepare data. Collecting data certainly isn’t a problem these days – it’s streaming in from everywhere. Technologies like Hadoop and faster, cheaper computers have made it possible to store and use more data, and more types of data, than ever before. But there is still the issue of joining data in different forms and formats from different sources, and the need to transform raw data into data that can be used as input for data mining. It has been estimated that data scientists still spend much of their time, up to 90%, dealing with the completeness and quality of data.
  • Explore the data. Interactive, self-service visualization tools need to serve a wide range of users in an organization – from business analysts with no analytical background to data scientists – allowing them to search for relationships, trends and patterns and gain a deeper understanding of the information captured by the variables in the data. In this step, the hypothesis formed in the initial phase of the project is refined, and ideas on how to address the business problem from an analytical perspective are developed and tested.
  • Model the data. In this stage, the data scientist applies numerous analytical modelling algorithms to the data to find a robust representation of the relationships in the data that helps answer the business question. Analytical tools search for a combination of data and modelling techniques that reliably predict a desired outcome. Experimentation is key to finding the most reliable answer, and automated model building can help minimize the time to results and boost the productivity of analytical teams. In the past, with manual model-building tools, data miners and data scientists were able to create only a few models in a week or month. Today, they can create hundreds or even thousands. But how can they quickly and reliably find the one model (out of many) that performs best? With automated tournaments of machine-learning algorithms and a clearly defined champion model, this has become a fairly easy process. Analysts and data scientists can now spend their time focusing on more strategic questions and investigations.
  • Implement the models. This is the transition from the discovery phase to the deployment phase – taking the insights learned and putting them into action using repeatable, automated processes. The faster the business can use the answers generated by predictive analytics for better decision making, the more value will be generated. And a transparent process is important for everyone – especially auditors.

Act on the new information. There are two types of decisions that can be made based on analytical results. Strategic decisions are made by humans who look at results and take action. Operational decisions are often automated – like credit scores or recommended best offers – and require little or no human intervention.

Evaluate your results. The next – and perhaps most important – step is to evaluate the outcome of the actions produced by the analytical model. Did the predictive models produce tangible results, such as increased revenue or decreased costs? Continuous monitoring and measurement of the models’ performance makes it possible to verify that they continue to produce the desired results.

More and more organizations are looking to automate operational decisions and provide real-time answers and results to reduce decision latency. Basing operational decisions on answers from analytical models also makes decisions more objective, repeatable and measurable. Integration with enterprise decision management tools enables organizations to build comprehensive operational decision flows that combine data-driven analytics and business rules for optimal automated decisions.

Ask again. Because the data is always growing and continuously changing, the relationships in data that models use for predictions also change over time. Constant evaluation of analytical results should identify the degradation of model accuracy. Even the most accurate models will have to be refreshed over time, and organizations will need to go through the discovery and deployment steps again. It’s a constant and evolving process.

Try R

Data Visualisation & Data Science

Dublin – 13th March 2016

With Big Data gaining more and more importance every day, R is growing in parallel and is rapidly becoming the leading programming language in statistics and data science. Every year the number of R users grows, and organisations are using it in their day-to-day activities to examine and mine their data for future business purposes.

Using R Studio

To begin using R, head to r-proj to download and install R for your desktop or laptop. It runs on Windows and OS X. It is not the only option available, and if you experience issues with the setup it is worth mentioning DataCamp, which also gives you a certificate of completion to link to your LinkedIn account! …but there are 63 chapters to go through, versus the 8 of Try R by Code School.

Here below is my course completion:

Try R course finished

This is how RStudio looks once installed:

R console

The top left window is where you’ll probably do most of your work. That’s the R code editor, which allows you to create a file with multiple lines of R code – or open an existing file – and then run the entire file or portions of it.

Bottom left is the interactive console where you can type in R statements one line at a time. Any lines of code that are run from the editor window also appear in the console.

The top right window shows your workspace, which includes a list of objects currently in memory. There’s also a history tab with a list of your prior commands; what’s handy there is that you can select one, some or all of those lines of code and one-click to send them either to the console or to whatever file is active in your code editor.

The window at bottom right shows a plot if you’ve created a data visualization with your R code. There’s a history of previous plots and an option to export a plot to an image file or PDF. This window also shows external packages (R extensions) that are available on your system, files in your working directory and help files when called from the console.

Basic data types in R

R works with numerous data types. Some of the most basic types to get started are:

Decimal values like 4.5 are called numerics.

Natural numbers like 4 are called integers. Integers are also numerics.

Boolean values (TRUE or FALSE) are called logical (TRUE can be abbreviated to T and FALSE to F).

Text (or string) values are called characters.
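A quick way to check these types for yourself in the console is the class() function:

```r
# class() reports the basic type of a value
class(4.5)    # numeric
class(4L)     # integer (the L suffix forces an integer)
class(TRUE)   # logical
class("text") # character
```

Note that a plain 4 typed in the console is stored as a numeric; you need the L suffix to get an integer.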

What is a Variable?

A basic concept in (statistical) programming is called a variable.

A variable allows you to store a value (e.g. 4) or an object (e.g. a function description) in R. You can then later use this variable’s name to easily access the value or the object that is stored within this variable.

You can assign the value 4 to a variable x with the command:  x <- 4

Suppose you have a fruit basket with five apples. As a data analyst in training (if you are a beginner like me 🙂), you want to store the number of apples in a variable with the name my_apples.

The following code in the editor, my_apples <- 5, assigns the value 5 to my_apples.

Every tasty fruit basket needs oranges, so you decide to add six oranges. As a data analyst, your reflex is to immediately create the variable my_oranges and assign the value 6 to it. Next, you want to calculate how many pieces of fruit you have in total. Since you have given meaningful names to these values, you can now code this in a clear way:

my_apples + my_oranges

Common knowledge tells you not to add apples and oranges. But hey, that is what you just did, no? 🙂 The my_apples and my_oranges variables both contained a number in the example above. The + operator works with numeric variables in R. If you really tried to add “apples” and “oranges” – that is, if you assigned a text value to the variable my_oranges – it would not work. You would be trying to assign the sum of a numeric and a character variable to the variable my_fruit. This is not possible. See the error returned below.

# Assign a value to the variable called 'my_apples'

> my_apples <- 5

# Assign a value to the variable 'my_oranges'

> my_oranges <- "six"

# New variable that contains the total amount of fruit

> my_fruit <- my_apples + my_oranges

The results returned by your console would be the following:

Error in my_apples + my_oranges : non-numeric argument to binary operator

> my_fruit

Error: object 'my_fruit' not found

It makes sense.

Exploratory Vs. Explanatory

Data visualisation is an essential component of your skill set as a data scientist.

Data visualisation combines statistics and design in a meaningful, appropriate way; in that sense, data visualisation is a form of graphical data analysis.

It is important to understand and differentiate between exploratory plots and explanatory plots.

Exploratory visualisations are easily generated, data-heavy and intended for a small, specialist audience. Their primary purpose is graphical data analysis.

Explanatory visualisations are labour-intensive, data-specific and intended for a broader audience, for example in publications or presentations. They are part of the communication process.

As data analysts, our job is not only to explore our data, but also to explain it to a specific audience.

Effectiveness of Insect Sprays: introduction.

In my example below, I look at the effectiveness of insect sprays: the counts of insects in agricultural experimental units treated with different insecticides.

My dataset contains 72 observations on 2 variables.

[,1] count: numeric – the insect count.

[,2] spray: factor – the type of spray.

Source: Beall, G. (1942), ‘The transformation of data from entomological field experiments’, Biometrika, 29, 243–262.

count spray

1 10 A

2 7 A

3 20 A

4 14 A

5 14 A

6 12 A

7 10 A

8 23 A

9 17 A

10 20 A

11 14 A

12 13 A

13 11 B

14 17 B

15 21 B

16 11 B

17 16 B

18 14 B

19 17 B

20 17 B

21 19 B

22 21 B

23 7 B

24 13 B

25 0 C

26 1 C

27 7 C

28 2 C

29 3 C

30 1 C

31 2 C

32 1 C

33 3 C

34 0 C

35 1 C

36 4 C

37 3 D

38 5 D

39 12 D

40 6 D

41 4 D

42 3 D

43 5 D

44 5 D

45 5 D

46 5 D

47 2 D

48 4 D

49 3 E

50 5 E

51 3 E

52 5 E

53 3 E

54 6 E

55 1 E

56 1 E

57 3 E

58 2 E

59 6 E

60 4 E

61 11 F

62 9 F

63 15 F

64 22 F

65 15 F

66 16 F

67 13 F

68 10 F

69 26 F

70 26 F

71 24 F

72 13 F

In our example, we are interested in looking at the relationship between these two variables – the insect count, which is numeric, and the type of spray used, which is a factor – so the most obvious first step is to make a scatter plot, like the one below.

Effectiveness of Insect Sprays: exploratory

We begin to explore our data, which reveals quite immediately that sprays C, D and E are the least effective, while further analysis is required for sprays A, B and F.

R spray simple plot
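A minimal base-R sketch that reproduces a similar plot is shown below; the styling and axis labels are my own choices, not necessarily those used for the image above:

```r
# One jittered point per observation, grouped by spray type
data(InsectSprays)
stripchart(count ~ spray, data = InsectSprays,
           vertical = TRUE, method = "jitter", pch = 19,
           xlab = "Type of spray", ylab = "Insect count",
           main = "InsectSprays data")
```

Because spray is a factor, stripchart() gives a scatter-style view with one column of points per spray type.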

Same data, displayed differently using the boxplot command, which produces box-and-whisker plots of the given (grouped) values.

It is noticeable that in both the spray C and spray D data there is a value which is an anomaly compared to the other sampled data – in our case higher than the others (7 insects for C and 12 for D).

This graph was made with the following command:

R boxplot

boxplot(count ~ spray, data = InsectSprays, xlab = "Type of spray", ylab = "Insect count", main = "InsectSprays data", varwidth = TRUE, col = "lightblue")

Understanding R: a bit of arithmetic…

If you are even newer than a dummy 🙂, as I was, then let’s start with a bit of arithmetic!

In its most basic form, R can be used as a simple calculator. Consider the following arithmetic operators:

Addition: +

Subtraction: -

Multiplication: *

Division: /

Exponentiation: ^

Modulo: %%

The last two might need some explaining:

The ^ operator raises the number on its left to the power of the number on its right: for example, 3^2 is 9.

The modulo operator returns the remainder of dividing the number on its left by the number on its right: for example, 5 modulo 3 (5 %% 3) is 2.
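The operators above can all be tried directly in the console:

```r
# R as a simple calculator
3 + 4    # addition: 7
3 ^ 2    # exponentiation: 9
5 %% 3   # modulo: remainder of 5 / 3, i.e. 2
28 %% 7  # 0, because 7 divides 28 exactly
```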

Create a vector

Feeling lucky? You better, because we are going on a trip to the City of Sins, also known as “Statisticians Paradise” ;-).

Thanks to R and your new data-analytical skills, you will learn how to uplift your performance at the tables and fire off your career as a professional gambler. This chapter will show how you can easily keep track of your betting progress and how you can do some simple analyses on past actions. Next Stop, Vegas Baby… VEGAS!!

Let us focus first!

On your way from rags to riches, you will make extensive use of vectors. Vectors are one-dimensional arrays that can hold numeric data, character data, or logical data. In other words, a vector is a simple tool to store data. For example, you can store your daily gains and losses in the casinos.

In R, you create a vector with the combine function c(). You place the vector elements separated by a comma between the brackets. For example:

numeric_vector <- c(1, 2, 3)

character_vector <- c("a", "b", "c")

boolean_vector <- c(TRUE, FALSE)

Once you have created these vectors in R, you can use them to do calculations.
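For instance, arithmetic on a vector is applied to every element at once:

```r
# A calculation on a vector touches each element
numeric_vector <- c(1, 2, 3)
numeric_vector * 10   # 10 20 30
sum(numeric_vector)   # 6
```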

After one week in Las Vegas and still zero Ferraris in your garage, you decide that it is time to start using your data analytical superpowers.

Before doing a first analysis, you decide to first collect all the winnings and losses for the last week:

For poker_vector:

On Monday you won 140$

Tuesday you lost 50$

Wednesday you won 20$

Thursday you lost 120$

Friday you won 240$

For roulette_vector:

On Monday you lost 24$

Tuesday you lost 50$

Wednesday you won 100$

Thursday you lost 350$

Friday you won 10$

You only played poker and roulette, since there was a delegation of mediums that occupied the craps tables. To be able to use this data in R, you decide to create the variables poker_vector and roulette_vector.

> poker_vector <- c(140, -50, 20, -120, 240)

> roulette_vector <- c(-24, -50, 100, -350, 10)

As a data analyst, it is important to have a clear view on the data that you are using. Understanding what each element refers to is therefore essential.

We created a vector with your winnings over the week. Each vector element refers to a day of the week, but it is hard to tell which element belongs to which day. It would be nice if you could show that in the vector itself.

Calculating total winnings

Now that you have the poker and roulette winnings nicely as a named vector, you can start doing some data analytical magic.

You might want to find out:

1) What was your overall profit or loss per day of the week?

2) Have you lost money over the week in total?

3) Are you winning/losing money on poker or on roulette?

To get the answers, we have to do arithmetic calculations on vectors.

It is important to know that if you sum two vectors in R, it takes the element-wise sum.
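A tiny example of what "element-wise" means:

```r
# Summing two vectors adds them element by element
a <- c(1, 2, 3)
b <- c(10, 20, 30)
a + b   # 11 22 33
```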

Very important consideration: if you want to become a good statistician, you have to become lazy. (If you are already lazy, chances are high you are one of those exceptional, natural-born statistical talents.)

In the previous examples you probably experienced that it is boring and frustrating to type and retype information such as the days of the week. However, when you look at it from a higher perspective, there is a more efficient way to do this, namely, to assign the days of the week vector to a variable!

Just like you did with your poker and roulette returns, you can also create a variable that contains the days of the week. This way you can use and re-use it.

# Creating the variable 'days_vector'

days_vector <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")

# Assign the names of the days to 'roulette_vector' and 'poker_vector'

names(poker_vector) <- days_vector

names(roulette_vector) <- days_vector

# Total daily winnings

total_daily <- poker_vector + roulette_vector



   Monday   Tuesday Wednesday  Thursday    Friday
      116      -100       120      -470       250


Based on the previous analysis, it looks like you had a mix of good and bad days. This is not what your ego expected, and you wonder whether there may be a (very, very, very) tiny chance you have lost money over the week in total.

R poker roulette

A function that helps you to answer this question is sum(). It calculates the sum of all elements of a vector. For example, to calculate the total amount of money you have lost/won with poker you do:

total_poker <- sum(poker_vector)

# Poker winnings from Monday to Friday

poker_vector <- c(140, -50, 20, -120, 240)

# Roulette winnings from Monday to Friday

roulette_vector <- c(-24, -50, 100, -350, 10)

# Give names to both 'poker_vector' and 'roulette_vector'

days_vector <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")

names(roulette_vector) <- days_vector

names(poker_vector) <- days_vector

# Total winnings with poker

total_poker <- sum(poker_vector)


# Up to you now:

total_roulette <- sum(roulette_vector)

total_week <- sum(poker_vector,roulette_vector)


Plotted, this displays a single dot: a total loss of $84.

R weekly loss

Oops, it seems like you are losing money. Time to rethink and adapt your strategy! This will require some deeper analysis…

After a short brainstorm in your hotel’s jacuzzi, you realize that a possible explanation might be that your skills in roulette are not as well developed as your skills in poker. So maybe your total gains in poker are higher (or > ) than in roulette…?!


> # Calculate total gains for poker and roulette

> total_poker= sum(poker_vector)

> total_roulette=sum(roulette_vector)

> total_poker

[1] 230

> total_roulette

[1] -314

> # Check if you realized higher total gains in poker than in roulette

> answer=(total_poker>total_roulette)

> answer

[1] TRUE

Vector selection: the good times

Your guess seemed to be right. It appears that the poker game is more your cup of tea than roulette.

Another possible route for investigation is your performance at the beginning of the working week compared to the end of it. You did have a couple of Margarita cocktails at the end of the week…

To answer that question, you only want to focus on a selection of the vector. In other words, our goal is to select specific elements of the vector. To select elements of a vector, you can use square brackets. Between the square brackets, you indicate what elements to select. For example, to select the first element of the vector, you type poker_vector[1]. To select the second element of the vector, you type poker_vector[2], etc.

> # Define a new variable based on a selection

> poker_wednesday <- poker_vector[3]

> poker_wednesday

Wednesday
       20

How about analysing your week results now?

To select multiple elements from a vector, you can again add square brackets at the end of it, indicating between the brackets which elements should be selected. For example, suppose you want to select the first and the fifth day of the week: use the vector c(1,5) between the square brackets. The code below uses the same idea to select the second, third and fourth elements of poker_vector:


> # Define a new variable based on a selection

> poker_midweek <- poker_vector[c(2,3,4)]

> poker_midweek

  Tuesday Wednesday  Thursday

      -50        20      -120

> Poker_earlyweek=poker_vector[c(1:3)]

> Poker_earlyweek

   Monday   Tuesday Wednesday

      140       -50        20


> poker_endweek=poker_vector[c(4,5)]

> poker_endweek

Thursday   Friday

    -120      240

> mean(Poker_earlyweek)

[1] 36.66667

> mean(poker_midweek)

[1] -50

> mean(poker_endweek)

[1] 60
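The c(1,5) selection mentioned earlier works the same way; a quick sketch:

```r
# Selecting non-adjacent days: first (Monday) and fifth (Friday)
poker_vector <- c(140, -50, 20, -120, 240)
names(poker_vector) <- c("Monday", "Tuesday", "Wednesday",
                         "Thursday", "Friday")
poker_vector[c(1, 5)]   # Monday: 140, Friday: 240
```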

Another way to select from vectors is by using the names of the vector elements (Monday, Tuesday, …) instead of their numeric positions. For example,

poker_vector["Monday"]

will select the first element of poker_vector, since "Monday" is the name of that first element.

Selection by comparison

By making use of comparison operators, we can approach the previous question in a more proactive way.

The (logical) comparison operators known to R are:

< for less than

> for greater than

<= for less than or equal to

>= for greater than or equal to

== for equal to each other

!= not equal to each other

Stating 6 > 5 returns TRUE. The nice thing about R is that you can use these comparison operators also on vectors. For example, the statement c(4,5,6) > 5 returns: FALSE FALSE TRUE. In other words, you test for every element of the vector if the condition stated by the comparison operator is TRUE or FALSE. Do not take our word for it! Try it in the console ;-).

Behind the scenes, R recycles the value 5 when you execute c(4,5,6) > 5. R wants to do an element-wise comparison of each element in c(4,5,6) with each corresponding element in 5. However, 5 is not a vector of length three. To solve this, R automatically replicates the value 5 to generate a vector of three elements, c(5, 5, 5) and then carries out the element-wise comparison.
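You can see the recycling at work by writing the comparison both ways:

```r
# R recycles the scalar 5 to match the vector's length
c(4, 5, 6) > 5            # FALSE FALSE TRUE
c(4, 5, 6) > c(5, 5, 5)   # identical result, written out explicitly
```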

# What days of the week did you make money on poker?

> selection_vector <- poker_vector > 0

> selection_vector

   Monday   Tuesday Wednesday  Thursday    Friday

     TRUE     FALSE      TRUE     FALSE      TRUE


Working with comparisons will make your data analytical life easier. Instead of selecting a subset of days to investigate yourself (like before), you can simply ask R to return only those days where you realized a positive return for poker.

You used selection_vector <- poker_vector > 0 to find the days on which you had a positive poker return. Now, you would like to know not only the days on which you won, but also how much you won on those days.

You can select the desired elements by putting selection_vector between the square brackets that follow poker_vector. This works because R selects only those elements where selection_vector is TRUE. For selection_vector, this means where poker_vector > 0.

> # Select from poker_vector these days

> poker_winning_days <- poker_vector[selection_vector]

> poker_winning_days

   Monday Wednesday    Friday

      140        20       240

Just like you did for poker, you also want to know those days where you realised a positive return for roulette.

> # What days of the week did you make money on roulette?

> selection_vector <-roulette_vector>0


> # Select from roulette_vector these days

> roulette_winning_days= roulette_vector[selection_vector]

> roulette_winning_days

Wednesday    Friday

      100        10

Ok, I hope you feel a bit more confident about R now: although we have just scratched the surface, at the end of the day it is not so difficult, believe me! So keep going, explore it, and enjoy visualising data with plots and charts!












Google Fusion tables – the beauty of having data visualised on a map!

2011 Census data

Displaying results using Fusion Tables


Dublin – 14 February 2016

Fusion tables run by Google: what are they?

Hi! Firstly, I am not going to write yet another guide on how to use Fusion Tables, but I will share a few useful tips below (which, I was told, should increase my passing rate! 🙂) that I discovered while using them. So, first of all: what is a Fusion Table, run by Google?

Google Fusion Tables is a cloud Software as a Service (SaaS) application that enables the hosting, management, sharing and publishing of data online.

Google Fusion Tables primarily enables the visualisation of data stored in tables in the form of graphical charts, maps, time lines and plots, with the ability to publish, share and integrate them with individual users and websites. Google Fusion works by importing data values from the tables created online or from a user spreadsheet and converting it into a meaningful graphical data representation.

Google Fusion Tables is entirely hosted on the Google cloud infrastructure, which maintains the most updated version of data across all the shared users.

…so, in a few words: a great FREE tool to visualise your data using geolocation – that is, on maps!


First step: prepare your data

Displaying data properly is very important, but to achieve it the data behind the graph or map has to be tidied up and put in order: the data has to have a consistent format, and each column has to contain the same type of data. In my example here, I started with a table which contained redundant data that had to be deleted, simplified and checked so it would make sense for the geolocation function, the county borders and, ultimately, the final result.

Census 2011 CSO Central Statistics Office

A word should also be spent on the geolocation function, which reads addresses or simply county names (in my case).

One big caveat, however: the address data needs to be in one field (or two, if you have latitude and longitude data, which I don’t in our example below).

Fusion Tables automatically begins geocoding when you visualise locations on a map, but before doing that you might want to consider adding a hint to Fusion Tables before it starts geocoding your entries. What is that?

Imagine your data contains the word ‘Springfield’: considering that there are more than 70 Springfield locations in the US alone, it becomes understandable why it is good to use the location hint!

Here below is where to click to add the hint:

Geocode location hint

For Fusion Tables to work properly, as mentioned, the address data needs to be in one field (or two, if you have latitude and longitude data). Do you know how to do that? If you are not very familiar with Excel spreadsheets, there is the CONCATENATE formula, which groups data into one cell.
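If you prefer to prepare the data in R rather than Excel, paste() does the same job as CONCATENATE; a minimal sketch (the column values here are my own illustration, not the full census table):

```r
# Combine county names and a country suffix into a single address field
counties <- c("Dublin", "Cork", "Galway")
address  <- paste(counties, "Ireland", sep = ", ")
address   # "Dublin, Ireland" "Cork, Ireland" "Galway, Ireland"
```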

Please also note that the file has to be saved in .CSV format, otherwise Google Fusion Tables will not read it.

So after I cleaned and corrected the data, it appeared like this.

Data table census 2011

As you might notice, I have added ‘Ireland’ beside each county name – it is an alternative to the geolocation hint described above.

Display your data!

Once I had this data, I then created the following heat map, which in all fairness does not tell you much, does it?!

Irish population 2011 by county Table 1 Google Fusion Tables

It simply displays a dot for each county; if you hover over one with the mouse, it returns the data for that county.

It is not incorrect but it does not fit the purpose either.

So, back in the Tools menu (change map, then change feature style) is where I changed the fixed marker icons to buckets…

I assigned a scale and a legend, which is visible on the bottom right of the image below, and… the same data returned a slightly more meaningful result.

Irish population 2011 by county TABLE 1 BUCKETS and pins

It is now a bit easier to identify, with the help of the legend, that fewer people live in the areas stretching from Sligo down to Kilkenny, while the most heavily populated area is Dublin, followed by Cork and Galway.

Although this representation is better than the previous one, it is still basic and requires a certain amount of effort to read.

So I went a step further and searched for a file (well, we had a hint where to find this file – thank you Darren!) containing the boundaries for each county (see example below).


The file was uploaded into Google Fusion Tables and the map returned was similar to my opening image, with the only difference that county names were not displayed.

From here the map was then merged with the existing one: during this merging process it is important to first identify the relationships between the newly imported data and the existing data.

In our case, it had to be linked on the county names.

The next step was to repeat the bucketing process, and I purposely applied the same number of buckets and values.

As you can see below, the image is now much clearer and easier to interpret.


At first glance it is visible that the most populated county is Dublin, followed by Cork and Galway. A clear yellow path in the middle shows the less populated counties.

The curiosity corner

Considering that our table also had a column for men and a column for women, I compared the two genders, with the following visual results.

Including all counties in a bar chart, the most visible difference is in county Dublin, which has the highest difference between women and men in all of Ireland. (Thumbs up folks! If you are reading this blog in DBS, you are in the right county!)

Fusion table - M vs F

In all other counties the difference is not so clear, because the amount of population living in Dublin and the scale of the graph minimise the differences among the less populated counties. So I took out the three most populated – Dublin, Cork and Galway – and the result is the following:

The 7 least populated counties – with the exception of Sligo – have more men than women…


The 7 most populated counties have more women than men.

The ‘Sunny South East’ counties (Wexford, Waterford and Wicklow) have a greater difference in gender than other Irish counties (except Dublin, as mentioned above).

Fusion table - M vs F excluding D C G K

15 counties out of 26 have more women than men; the lowest overall difference between the two genders is in Donegal, with just 91 more women.

TIPS and freebies

As I mentioned at the beginning of this post, here is a useful tip to know before losing the plot, as ‘someone’ (myself) did!

The ‘Change appearance’ option is only available when you create a new chart; if you duplicate a graph instead, the option turns grey and chart adjustments become very limited! An example is my last graph above: I duplicated it and applied a filter to exclude Dublin, Cork, Galway and Kildare, but I could not change the graph title!

Why? Because I duplicated an existing chart.

Only when you create a chart will ‘Change appearance’ be selectable (click Tools –> Change chart –> and on the top right click Change appearance).

Darren, did you know this one?! 🙂

See you at my next post!