Accessing World Bank Data from Julia
While looking for ways to access the World Bank data bank programmatically, I found this very useful library to access it with Julia (along with many others for python and R). The library is WorldBankData.jl, by 4gh. Although the repo doesn’t have much activity, the library works great. Here are some examples of what can be done with the World Bank data, and the powerful Julia dataframes. I will also use Plots.jl for visualizations;
using WorldBankData
using Plots
using DataFrames
The World Bank offers two kind of data sources: Countries and Indicators. Indicators are the variables being measured in every country. Every year 13,398 indicators are measured in 258 countries. I have selected some countries to work with, since not all data from the world bank is complete. This filtering will facilitate our task:
countries = ["AW","AF","AO","AL","AD","AE","AR","AM","AS","AG","AU","AT","AZ","BI","BE","BJ","BF","BD","BG","BH","BS","BA","BY","BZ","BM","BO","BR","BB","BN","BT","BW","CF","CA","CH","CL","CN","CI","CM","CD","CG","CO","KM","CV","CR","CU","KY","CY","CZ","DE","DJ","DM","DK","DO","DZ","EC","EG","ER","ES","EE","ET","FI","FJ","FR","FO","FM","GA","GB","GE","GH","GN","GM","GW","GQ","GR","GD","GL","GT","GU","GY","HK","HN","HR","HT","HU","ID","IM","IN","IE","IR","IQ","IS","IL","IT","JM","JO","JP","KZ","KE","KG","KH","KI","KN","KR","XK","KW","LA","LB","LR","LY","LC","LI","LK","LS","LT","LU","LV","MO","MA","MC","MD","MG","MV","MX","MH","MK","ML","MT","MM","ME","MN","MP","MZ","MR","MU","MW","MY","NA","NC","NE","NG","NI","NL","NO","NP","NZ","OM","PK","PA","PE","PH","PW","PG","PL","PR","KP","PT","PY","PF","QA","RO","RU","RW","SA","SD","SN","SG","SB","SL","SV","SM","SO","RS","SS","ST","SR","SK","SI","SE","SZ","SC","SY","TC","TD","TG","TH","TJ","TM","TL","TO","TT","TN","TR","TV","TZ","UG","UA","UY","US","UZ","VC","VE","VI","VN","VU","WS","YE","ZA","ZM","ZW"];
A list of countries has to be given to the wdi()
function, along with an indicator. This function queries the WBDB and returns a dataframe with all data available. If a year is added to the query, only the data for that year will be returned. Let’s work with total country population.
df=wdi("SP.POP.TOTL", countries, 2015)
iso2c | country | SP_POP_TOTL | year | |
---|---|---|---|---|
1 | AW | Aruba | 103889.0 | 2015.0 |
2 | AF | Afghanistan | 3.2526562e7 | 2015.0 |
3 | AO | Angola | 2.5021974e7 | 2015.0 |
4 | AL | Albania | 2.889167e6 | 2015.0 |
5 | AD | Andorra | 70473.0 | 2015.0 |
6 | AE | United Arab Emirates | 9.156963e6 | 2015.0 |
7 | AR | Argentina | 4.3416755e7 | 2015.0 |
8 | AM | Armenia | 3.017712e6 | 2015.0 |
9 | AS | American Samoa | 55538.0 | 2015.0 |
10 | AG | Antigua and Barbuda | 91818.0 | 2015.0 |
11 | AU | Australia | 2.3781169e7 | 2015.0 |
12 | AT | Austria | 8.611088e6 | 2015.0 |
13 | AZ | Azerbaijan | 9.651349e6 | 2015.0 |
14 | BI | Burundi | 1.1178921e7 | 2015.0 |
15 | BE | Belgium | 1.1285721e7 | 2015.0 |
16 | BJ | Benin | 1.0879829e7 | 2015.0 |
17 | BF | Burkina Faso | 1.810557e7 | 2015.0 |
18 | BD | Bangladesh | 1.60995642e8 | 2015.0 |
19 | BG | Bulgaria | 7.177991e6 | 2015.0 |
20 | BH | Bahrain | 1.377237e6 | 2015.0 |
21 | BS | Bahamas, The | 388019.0 | 2015.0 |
22 | BA | Bosnia and Herzegovina | 3.810416e6 | 2015.0 |
23 | BY | Belarus | 9.513e6 | 2015.0 |
24 | BZ | Belize | 359287.0 | 2015.0 |
25 | BM | Bermuda | 65235.0 | 2015.0 |
26 | BO | Bolivia | 1.0724705e7 | 2015.0 |
27 | BR | Brazil | 2.07847528e8 | 2015.0 |
28 | BB | Barbados | 284215.0 | 2015.0 |
29 | BN | Brunei Darussalam | 423188.0 | 2015.0 |
30 | BT | Bhutan | 774830.0 | 2015.0 |
⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
Lets visualize the first 50 countries with a barplot.
bar(df[:iso2c][1:50], df[:SP_POP_TOTL][1:50], size=(900,400), label="Total Population")
The result is somehow expected. China is within the first 50 countries in the list, and it scales the graph for itself. Let’s find out exactly how much is big.
maximum(df[:SP_POP_TOTL])
1.37122e9
More than one billion people live only in china. This is almost 20% of the world’s population:
maximum(df[:SP_POP_TOTL])/sum(df[:SP_POP_TOTL])
0.18750300206223874
Right now we have the population for 208 countries. Almost all the countries measured by the WB.
size(df[:SP_POP_TOTL])
(208,)
In julia to find the index of the maximum value, the indmax()
function can be used. This is useful if you want to know more about the element with the maximum value in a column.
indmax(df[:SP_POP_TOTL])
36
In fact, element number 36 in our dataframe is China:
df[:country][36]
"China"
Now let’s try something else. We want to see the most populated countries in the world. The first thought to come to mind is to sort the dataframe, right?
sort(df)
iso2c | country | SP_POP_TOTL | year | |
---|---|---|---|---|
1 | AD | Andorra | 70473.0 | 2015.0 |
2 | AE | United Arab Emirates | 9.156963e6 | 2015.0 |
3 | AF | Afghanistan | 3.2526562e7 | 2015.0 |
4 | AG | Antigua and Barbuda | 91818.0 | 2015.0 |
5 | AL | Albania | 2.889167e6 | 2015.0 |
6 | AM | Armenia | 3.017712e6 | 2015.0 |
7 | AO | Angola | 2.5021974e7 | 2015.0 |
8 | AR | Argentina | 4.3416755e7 | 2015.0 |
9 | AS | American Samoa | 55538.0 | 2015.0 |
10 | AT | Austria | 8.611088e6 | 2015.0 |
11 | AU | Australia | 2.3781169e7 | 2015.0 |
12 | AW | Aruba | 103889.0 | 2015.0 |
13 | AZ | Azerbaijan | 9.651349e6 | 2015.0 |
14 | BA | Bosnia and Herzegovina | 3.810416e6 | 2015.0 |
15 | BB | Barbados | 284215.0 | 2015.0 |
16 | BD | Bangladesh | 1.60995642e8 | 2015.0 |
17 | BE | Belgium | 1.1285721e7 | 2015.0 |
18 | BF | Burkina Faso | 1.810557e7 | 2015.0 |
19 | BG | Bulgaria | 7.177991e6 | 2015.0 |
20 | BH | Bahrain | 1.377237e6 | 2015.0 |
21 | BI | Burundi | 1.1178921e7 | 2015.0 |
22 | BJ | Benin | 1.0879829e7 | 2015.0 |
23 | BM | Bermuda | 65235.0 | 2015.0 |
24 | BN | Brunei Darussalam | 423188.0 | 2015.0 |
25 | BO | Bolivia | 1.0724705e7 | 2015.0 |
26 | BR | Brazil | 2.07847528e8 | 2015.0 |
27 | BS | Bahamas, The | 388019.0 | 2015.0 |
28 | BT | Bhutan | 774830.0 | 2015.0 |
29 | BW | Botswana | 2.262485e6 | 2015.0 |
30 | BY | Belarus | 9.513e6 | 2015.0 |
⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
Well, not really. This function will sort only by the first column. To sort by other column we can add the cols
argument to the function.
sort(df, cols=[:SP_POP_TOTL])
iso2c | country | SP_POP_TOTL | year | |
---|---|---|---|---|
1 | TV | Tuvalu | 9916.0 | 2015.0 |
2 | PW | Palau | 21291.0 | 2015.0 |
3 | SM | San Marino | 31781.0 | 2015.0 |
4 | TC | Turks and Caicos Islands | 34339.0 | 2015.0 |
5 | LI | Liechtenstein | 37531.0 | 2015.0 |
6 | MC | Monaco | 37731.0 | 2015.0 |
7 | FO | Faroe Islands | 48199.0 | 2015.0 |
8 | MH | Marshall Islands | 52993.0 | 2015.0 |
9 | MP | Northern Mariana Islands | 55070.0 | 2015.0 |
10 | AS | American Samoa | 55538.0 | 2015.0 |
11 | KN | St. Kitts and Nevis | 55572.0 | 2015.0 |
12 | GL | Greenland | 56114.0 | 2015.0 |
13 | KY | Cayman Islands | 59967.0 | 2015.0 |
14 | BM | Bermuda | 65235.0 | 2015.0 |
15 | AD | Andorra | 70473.0 | 2015.0 |
16 | DM | Dominica | 72680.0 | 2015.0 |
17 | IM | Isle of Man | 87780.0 | 2015.0 |
18 | AG | Antigua and Barbuda | 91818.0 | 2015.0 |
19 | SC | Seychelles | 92900.0 | 2015.0 |
20 | VI | Virgin Islands (U.S.) | 103574.0 | 2015.0 |
21 | AW | Aruba | 103889.0 | 2015.0 |
22 | FM | Micronesia, Fed. Sts. | 104460.0 | 2015.0 |
23 | TO | Tonga | 106170.0 | 2015.0 |
24 | GD | Grenada | 106825.0 | 2015.0 |
25 | VC | St. Vincent and the Grenadines | 109462.0 | 2015.0 |
26 | KI | Kiribati | 112423.0 | 2015.0 |
27 | GU | Guam | 169885.0 | 2015.0 |
28 | LC | St. Lucia | 184999.0 | 2015.0 |
29 | ST | Sao Tome and Principe | 190344.0 | 2015.0 |
30 | WS | Samoa | 193228.0 | 2015.0 |
⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
Just one more little detail. This last query has given us a dataframe with the smallest cities in the world. We want to revert the order of the sorting. This time I’ll use the sort!()
function, to save the ordered dataframe into the df
variable.
sort!(df, cols=[:SP_POP_TOTL], rev=true)
iso2c | country | SP_POP_TOTL | year | |
---|---|---|---|---|
1 | CN | China | 1.37122e9 | 2015.0 |
2 | IN | India | 1.311050527e9 | 2015.0 |
3 | US | United States | 3.2141882e8 | 2015.0 |
4 | ID | Indonesia | 2.57563815e8 | 2015.0 |
5 | BR | Brazil | 2.07847528e8 | 2015.0 |
6 | PK | Pakistan | 1.88924874e8 | 2015.0 |
7 | NG | Nigeria | 1.82201962e8 | 2015.0 |
8 | BD | Bangladesh | 1.60995642e8 | 2015.0 |
9 | RU | Russian Federation | 1.44096812e8 | 2015.0 |
10 | MX | Mexico | 1.27017224e8 | 2015.0 |
11 | JP | Japan | 1.26958472e8 | 2015.0 |
12 | PH | Philippines | 1.00699395e8 | 2015.0 |
13 | ET | Ethiopia | 9.939075e7 | 2015.0 |
14 | VN | Vietnam | 9.17038e7 | 2015.0 |
15 | EG | Egypt, Arab Rep. | 9.1508084e7 | 2015.0 |
16 | DE | Germany | 8.1413145e7 | 2015.0 |
17 | IR | Iran, Islamic Rep. | 7.9109272e7 | 2015.0 |
18 | TR | Turkey | 7.866583e7 | 2015.0 |
19 | CD | Congo, Dem. Rep. | 7.7266814e7 | 2015.0 |
20 | TH | Thailand | 6.7959359e7 | 2015.0 |
21 | FR | France | 6.6808385e7 | 2015.0 |
22 | GB | United Kingdom | 6.5138232e7 | 2015.0 |
23 | IT | Italy | 6.0802085e7 | 2015.0 |
24 | ZA | South Africa | 5.495692e7 | 2015.0 |
25 | MM | Myanmar | 5.3897154e7 | 2015.0 |
26 | TZ | Tanzania | 5.347042e7 | 2015.0 |
27 | KR | Korea, Rep. | 5.0617045e7 | 2015.0 |
28 | CO | Colombia | 4.8228704e7 | 2015.0 |
29 | ES | Spain | 4.6418269e7 | 2015.0 |
30 | KE | Kenya | 4.6050302e7 | 2015.0 |
⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
Ready! now our df
variable is in order. Let’s try to visualize it again.
bar(df[:iso2c][1:50], df[:SP_POP_TOTL][1:50],
size=(1000,400),
label="Population Total")
There it is! Clean and tidy. In fact, let’s only use these 50 cities.
biggest = df[1:50,:]
iso2c | country | SP_POP_TOTL | year | |
---|---|---|---|---|
1 | CN | China | 1.37122e9 | 2015.0 |
2 | IN | India | 1.311050527e9 | 2015.0 |
3 | US | United States | 3.2141882e8 | 2015.0 |
4 | ID | Indonesia | 2.57563815e8 | 2015.0 |
5 | BR | Brazil | 2.07847528e8 | 2015.0 |
6 | PK | Pakistan | 1.88924874e8 | 2015.0 |
7 | NG | Nigeria | 1.82201962e8 | 2015.0 |
8 | BD | Bangladesh | 1.60995642e8 | 2015.0 |
9 | RU | Russian Federation | 1.44096812e8 | 2015.0 |
10 | MX | Mexico | 1.27017224e8 | 2015.0 |
11 | JP | Japan | 1.26958472e8 | 2015.0 |
12 | PH | Philippines | 1.00699395e8 | 2015.0 |
13 | ET | Ethiopia | 9.939075e7 | 2015.0 |
14 | VN | Vietnam | 9.17038e7 | 2015.0 |
15 | EG | Egypt, Arab Rep. | 9.1508084e7 | 2015.0 |
16 | DE | Germany | 8.1413145e7 | 2015.0 |
17 | IR | Iran, Islamic Rep. | 7.9109272e7 | 2015.0 |
18 | TR | Turkey | 7.866583e7 | 2015.0 |
19 | CD | Congo, Dem. Rep. | 7.7266814e7 | 2015.0 |
20 | TH | Thailand | 6.7959359e7 | 2015.0 |
21 | FR | France | 6.6808385e7 | 2015.0 |
22 | GB | United Kingdom | 6.5138232e7 | 2015.0 |
23 | IT | Italy | 6.0802085e7 | 2015.0 |
24 | ZA | South Africa | 5.495692e7 | 2015.0 |
25 | MM | Myanmar | 5.3897154e7 | 2015.0 |
26 | TZ | Tanzania | 5.347042e7 | 2015.0 |
27 | KR | Korea, Rep. | 5.0617045e7 | 2015.0 |
28 | CO | Colombia | 4.8228704e7 | 2015.0 |
29 | ES | Spain | 4.6418269e7 | 2015.0 |
30 | KE | Kenya | 4.6050302e7 | 2015.0 |
⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
Next we want to add another indicator: Female population percentage. This indicator has the code: ‘SP.POP.TOTL.FE.ZS’, and it represents the percentage of the total population that are registered as female.
df1 = wdi("SP.POP.TOTL.FE.ZS", countries, 2015)
println(size(df1))
println(size(df))
(191,4)
(208,4)
As you can see, the size of the two data frames we have right now are not the same. This is because some information is missing from the female percentage indicator. Not all indicators can be accurately measured in all countries. therefore, once in a while we can find missing values in them.
A workaround for this is to create a new dataframe with the countries we DO have the information for.
df = DataFrame(iso2c=[],country=[],SP_POP_TOTL=[],SP_POP_TOTL_FE_ZS=[])
for country in biggest[:iso2c]
push!(df, @data([country,biggest[biggest[:iso2c].==country,:][:country][1],
biggest[biggest[:iso2c].==country,:][:SP_POP_TOTL][1],
df1[df1[:iso2c].==country,:][:SP_POP_TOTL_FE_ZS][1]]))
end
df[1:5,:]
iso2c | country | SP_POP_TOTL | SP_POP_TOTL_FE_ZS | |
---|---|---|---|---|
1 | CN | China | 1.37122e9 | 48.4773329025405 |
2 | IN | India | 1.311050527e9 | 48.1676415969283 |
3 | US | United States | 3.2141882e8 | 50.4329265563715 |
4 | ID | Indonesia | 2.57563815e8 | 49.648253579409 |
5 | BR | Brazil | 2.07847528e8 | 50.828985562917 |
See, now we have a dataframe with no missing values. Still, percentage of female population is actually a variable that depends on the total population. Let’s create a new dataframe with absolute values.
population = DataFrame(iso2c=df[:,1],
country=df[:,2],
total=df[:,3],
females=df[:,3].*(df[:,4]./100),
males=df[:,3].*(df[:,4].+(-100))./-100)
population[1:5,:]
iso2c | country | total | females | males | |
---|---|---|---|---|---|
1 | CN | China | 1.37122e9 | 6.64730884226216e8 | 7.064891157737842e8 |
2 | IN | India | 1.311050527e9 | 6.315021189999998e8 | 6.795484080000004e8 |
3 | US | United States | 3.2141882e8 | 1.6210091742895588e8 | 1.593179025710441e8 |
4 | ID | Indonesia | 2.57563815e8 | 1.2787593599999988e8 | 1.2968787900000012e8 |
5 | BR | Brazil | 2.07847528e8 | 1.0564678999999987e8 | 1.0220073800000013e8 |
Of course, this is now an approximation, but it will work for our visualization. Lastly, to plot these three variables we need to set the bounds to plot. This is done with the minimum()
and maximum()
functions.
minimum(population[:total])
2.5155317e7
Lastly, let’s create a visualization to describe all information we have acquired. Since there are three variables we have the option of plotting 3D points in space, but It will look clearer if we use the total population as color or size of a scatterplot. I like to use gr
backend for my plots, but this code will work with almost all of Plots.jl backends.
gr() # GR backend
plt = scatter(population[:,end],population[:,end-1], # Ploting male vs female
xlim=(minimum(population[:total])/2,maximum(population[:total])/1.07), # Limits for the X axis
ylim=(minimum(population[:total])/2,maximum(population[:total])/1.25), # Limits for the Y axis
size=(1000,1000), left_margin=[50px 0px],label="", # Window Size, lateral margins and label
ylabel="Female population",xlabel="Male Papulation", # X and Y labels
xscale=:log10, yscale=:log10, bottom_margin=50px, # Logarithmic scale and bottom margin
markersize=population[:,end-2]./minimum(population[:total]), # The size of the marker is given by the total population
zcolor=map(log,convert(Array,population[:,end-2]./minimum(population[:total]))), # The color is given by the logarithm of the total population
grid=false) # NoGrid Science is art ;)
# Last a line is added to mark weather a coutry hase more male or female population
plot!(plt,x->x,minimum(population[:total])/2,maximum(population[:total])/2,label="", color=:green)
<?xml version=”1.0” encoding=”utf-8”?>