The join is the central operation for extracting knowledge from relational databases, and allows you to relate data from different tables, having a common key. It is possible to make a join between two tables also directly in R (in this case we are talking about dataframes, but the operation is the same), without the need to go first to an RDBMS, which is not always immediately available to the data analyst. Here are the main options available in R – DPLYR to create a join between two tables t1 and t2:
- inner-join, all the records present in both tables
- left-join, all the records present in t1
- Anti-join, all the records of t1 not present in the t2
A concrete example of joins between two tables
The generic format for the join statement with DPLYR is of the type:
typeof_join (t1, t2, by = c (“column-to-join-t1” = “column-to-join-t2 “)
Below are the two starting tables:
- t1 – country ISO codes
- t2 – Geographical E
In our example we use as table t1 the two and three digit ISO codes of 256 nations, freely downloadable from https://gist.github.com/tadast/8827699. The table contains ISO codes and descriptions of the individual countries and also the relative geographical coordinates (latitude and longitude).
Table t2 comes from the wikipedia page https://en.wikipedia.org/wiki/List_of_European_countries_by_area which contains the countries of the geographical Europe (therefore also includes countries that are only partially in Europe such as Russia, Turkey etc.) for a total of 50 occurrences. t2 is therefore a subset of t1.
It’s not unusual that, having data from different systems, we do not have a unique ID code to be used for the join. However, we have the name of the nation, which we can use with a bit of caution (it is not unlikely that the name of the same country could be written differently in the two data sources) to relate the two starting tables.
The most common operation is the inner join, which returns all the records in t1 that are also present in t2:
# loading t1: iso country codes # Countries with their (ISO 3166-1) Alpha-2 code, # Alpha-3 code, UN M49, average latitude and longitude # from https://gist.github.com/tadast/8827699 # 256 countries and IDs country_codes <- read_csv("countries_codes_and_coordinates.csv", col_types = cols(`Numeric code` = col_character())) loading t2: european countries # Country names, areas and notes # from https://en.wikipedia.org/wiki/List_of_European_countries_by_area # converted in xlsx, flags removed # 49 countries european_countries <- read_excel("european_countries.xlsx") #inner jon, all matched records in the two tables test_inner_join<-inner_join(country_codes, european_countries, by=c("Country"= "Nation"))
Result of the inner join is a table containing all the fields of the two t1 and t2, with only the records common to both tables. It should be noted that the join returned only 39 records (instead of 50) because probably the name (we do not have an ID field for both tables) of some (eleven) countries is written in a different manner between t1 and t2. We will further investigate the matter in the following paragraphs.
The left Join shows all the records of t1. The fields of the t2 are valued only in correspondence of the matches between t1 and t2, and show NAs in the other cases. The field used for the join is shown only for t1. .
#left join, all the records from table 1, null values for unmatched record in table2 test_left_join<-left_join(country_codes, european_countries, by=c("Country"= "Nation"))
With the anti-join we select all the records in t1 not present in t2
#anti-join, records in table 1 unmatched in table 2 test_anti_join<-anti_join(country_codes, european_countries, by=c("Country"= "Nation"))
It is of course, also possible to invert the relationship by selecting all the records in table 2 that are not found in table 1
test_anti_join2<-anti_join(european_countries, country_codes, by=c("Nation"="Country"))
Notably, we see that the countries with the “*” (indicating the fact that the nation is present in Europe but also in other continents) in t1 are obviously not found in t2. We also can’t find a match, for example, for “Moldova”, which in t1 is written as “Moldova, Republic of” and so on. Anti-joining can easily become a tool for exploring and cleaning the dataset.
With these three operations we have a very powerful tool for our analysis, without the need of a RDBMS. Good analysis.