Accessing World Bank Data from Julia

While looking for ways to access the World Bank data bank programmatically, I found this very useful library to access it with Julia (along with many others for python and R). The library is WorldBankData.jl, by 4gh. Although the repo doesn’t have much activity, the library works great. Here are some examples of what can be done with the World Bank data, and the powerful Julia dataframes. I will also use Plots.jl for visualizations;

using WorldBankData
using Plots
using DataFrames

The World Bank offers two kind of data sources: Countries and Indicators. Indicators are the variables being measured in every country. Every year 13,398 indicators are measured in 258 countries. I have selected some countries to work with, since not all data from the world bank is complete. This filtering will facilitate our task:

countries = ["AW","AF","AO","AL","AD","AE","AR","AM","AS","AG","AU","AT","AZ","BI","BE","BJ","BF","BD","BG","BH","BS","BA","BY","BZ","BM","BO","BR","BB","BN","BT","BW","CF","CA","CH","CL","CN","CI","CM","CD","CG","CO","KM","CV","CR","CU","KY","CY","CZ","DE","DJ","DM","DK","DO","DZ","EC","EG","ER","ES","EE","ET","FI","FJ","FR","FO","FM","GA","GB","GE","GH","GN","GM","GW","GQ","GR","GD","GL","GT","GU","GY","HK","HN","HR","HT","HU","ID","IM","IN","IE","IR","IQ","IS","IL","IT","JM","JO","JP","KZ","KE","KG","KH","KI","KN","KR","XK","KW","LA","LB","LR","LY","LC","LI","LK","LS","LT","LU","LV","MO","MA","MC","MD","MG","MV","MX","MH","MK","ML","MT","MM","ME","MN","MP","MZ","MR","MU","MW","MY","NA","NC","NE","NG","NI","NL","NO","NP","NZ","OM","PK","PA","PE","PH","PW","PG","PL","PR","KP","PT","PY","PF","QA","RO","RU","RW","SA","SD","SN","SG","SB","SL","SV","SM","SO","RS","SS","ST","SR","SK","SI","SE","SZ","SC","SY","TC","TD","TG","TH","TJ","TM","TL","TO","TT","TN","TR","TV","TZ","UG","UA","UY","US","UZ","VC","VE","VI","VN","VU","WS","YE","ZA","ZM","ZW"];

A list of countries has to be given to the wdi() function, along with an indicator. This function queries the WBDB and returns a dataframe with all data available. If a year is added to the query, only the data for that year will be returned. Let’s work with total country population.

df=wdi("SP.POP.TOTL", countries, 2015)

	iso2c	country	SP_POP_TOTL	year
1	AW	Aruba	103889.0	2015.0
2	AF	Afghanistan	3.2526562e7	2015.0
3	AO	Angola	2.5021974e7	2015.0
4	AL	Albania	2.889167e6	2015.0
5	AD	Andorra	70473.0	2015.0
6	AE	United Arab Emirates	9.156963e6	2015.0
7	AR	Argentina	4.3416755e7	2015.0
8	AM	Armenia	3.017712e6	2015.0
9	AS	American Samoa	55538.0	2015.0
10	AG	Antigua and Barbuda	91818.0	2015.0
11	AU	Australia	2.3781169e7	2015.0
12	AT	Austria	8.611088e6	2015.0
13	AZ	Azerbaijan	9.651349e6	2015.0
14	BI	Burundi	1.1178921e7	2015.0
15	BE	Belgium	1.1285721e7	2015.0
16	BJ	Benin	1.0879829e7	2015.0
17	BF	Burkina Faso	1.810557e7	2015.0
18	BD	Bangladesh	1.60995642e8	2015.0
19	BG	Bulgaria	7.177991e6	2015.0
20	BH	Bahrain	1.377237e6	2015.0
21	BS	Bahamas, The	388019.0	2015.0
22	BA	Bosnia and Herzegovina	3.810416e6	2015.0
23	BY	Belarus	9.513e6	2015.0
24	BZ	Belize	359287.0	2015.0
25	BM	Bermuda	65235.0	2015.0
26	BO	Bolivia	1.0724705e7	2015.0
27	BR	Brazil	2.07847528e8	2015.0
28	BB	Barbados	284215.0	2015.0
29	BN	Brunei Darussalam	423188.0	2015.0
30	BT	Bhutan	774830.0	2015.0
⋮	⋮	⋮	⋮	⋮

Lets visualize the first 50 countries with a barplot.

bar(df[:iso2c][1:50], df[:SP_POP_TOTL][1:50], size=(900,400), label="Total Population")

The result is somehow expected. China is within the first 50 countries in the list, and it scales the graph for itself. Let’s find out exactly how much is big.

maximum(df[:SP_POP_TOTL])

1.37122e9

More than one billion people live only in china. This is almost 20% of the world’s population:

maximum(df[:SP_POP_TOTL])/sum(df[:SP_POP_TOTL])

0.18750300206223874

Right now we have the population for 208 countries. Almost all the countries measured by the WB.

size(df[:SP_POP_TOTL])

(208,)

In julia to find the index of the maximum value, the indmax() function can be used. This is useful if you want to know more about the element with the maximum value in a column.

indmax(df[:SP_POP_TOTL])

In fact, element number 36 in our dataframe is China:

df[:country][36]

"China"

Now let’s try something else. We want to see the most populated countries in the world. The first thought to come to mind is to sort the dataframe, right?

sort(df)

	iso2c	country	SP_POP_TOTL	year
1	AD	Andorra	70473.0	2015.0
2	AE	United Arab Emirates	9.156963e6	2015.0
3	AF	Afghanistan	3.2526562e7	2015.0
4	AG	Antigua and Barbuda	91818.0	2015.0
5	AL	Albania	2.889167e6	2015.0
6	AM	Armenia	3.017712e6	2015.0
7	AO	Angola	2.5021974e7	2015.0
8	AR	Argentina	4.3416755e7	2015.0
9	AS	American Samoa	55538.0	2015.0
10	AT	Austria	8.611088e6	2015.0
11	AU	Australia	2.3781169e7	2015.0
12	AW	Aruba	103889.0	2015.0
13	AZ	Azerbaijan	9.651349e6	2015.0
14	BA	Bosnia and Herzegovina	3.810416e6	2015.0
15	BB	Barbados	284215.0	2015.0
16	BD	Bangladesh	1.60995642e8	2015.0
17	BE	Belgium	1.1285721e7	2015.0
18	BF	Burkina Faso	1.810557e7	2015.0
19	BG	Bulgaria	7.177991e6	2015.0
20	BH	Bahrain	1.377237e6	2015.0
21	BI	Burundi	1.1178921e7	2015.0
22	BJ	Benin	1.0879829e7	2015.0
23	BM	Bermuda	65235.0	2015.0
24	BN	Brunei Darussalam	423188.0	2015.0
25	BO	Bolivia	1.0724705e7	2015.0
26	BR	Brazil	2.07847528e8	2015.0
27	BS	Bahamas, The	388019.0	2015.0
28	BT	Bhutan	774830.0	2015.0
29	BW	Botswana	2.262485e6	2015.0
30	BY	Belarus	9.513e6	2015.0
⋮	⋮	⋮	⋮	⋮

Well, not really. This function will sort only by the first column. To sort by other column we can add the cols argument to the function.

sort(df, cols=[:SP_POP_TOTL])

	iso2c	country	SP_POP_TOTL	year
1	TV	Tuvalu	9916.0	2015.0
2	PW	Palau	21291.0	2015.0
3	SM	San Marino	31781.0	2015.0
4	TC	Turks and Caicos Islands	34339.0	2015.0
5	LI	Liechtenstein	37531.0	2015.0
6	MC	Monaco	37731.0	2015.0
7	FO	Faroe Islands	48199.0	2015.0
8	MH	Marshall Islands	52993.0	2015.0
9	MP	Northern Mariana Islands	55070.0	2015.0
10	AS	American Samoa	55538.0	2015.0
11	KN	St. Kitts and Nevis	55572.0	2015.0
12	GL	Greenland	56114.0	2015.0
13	KY	Cayman Islands	59967.0	2015.0
14	BM	Bermuda	65235.0	2015.0
15	AD	Andorra	70473.0	2015.0
16	DM	Dominica	72680.0	2015.0
17	IM	Isle of Man	87780.0	2015.0
18	AG	Antigua and Barbuda	91818.0	2015.0
19	SC	Seychelles	92900.0	2015.0
20	VI	Virgin Islands (U.S.)	103574.0	2015.0
21	AW	Aruba	103889.0	2015.0
22	FM	Micronesia, Fed. Sts.	104460.0	2015.0
23	TO	Tonga	106170.0	2015.0
24	GD	Grenada	106825.0	2015.0
25	VC	St. Vincent and the Grenadines	109462.0	2015.0
26	KI	Kiribati	112423.0	2015.0
27	GU	Guam	169885.0	2015.0
28	LC	St. Lucia	184999.0	2015.0
29	ST	Sao Tome and Principe	190344.0	2015.0
30	WS	Samoa	193228.0	2015.0
⋮	⋮	⋮	⋮	⋮

Just one more little detail. This last query has given us a dataframe with the smallest cities in the world. We want to revert the order of the sorting. This time I’ll use the sort!() function, to save the ordered dataframe into the df variable.

sort!(df, cols=[:SP_POP_TOTL], rev=true)

	iso2c	country	SP_POP_TOTL	year
1	CN	China	1.37122e9	2015.0
2	IN	India	1.311050527e9	2015.0
3	US	United States	3.2141882e8	2015.0
4	ID	Indonesia	2.57563815e8	2015.0
5	BR	Brazil	2.07847528e8	2015.0
6	PK	Pakistan	1.88924874e8	2015.0
7	NG	Nigeria	1.82201962e8	2015.0
8	BD	Bangladesh	1.60995642e8	2015.0
9	RU	Russian Federation	1.44096812e8	2015.0
10	MX	Mexico	1.27017224e8	2015.0
11	JP	Japan	1.26958472e8	2015.0
12	PH	Philippines	1.00699395e8	2015.0
13	ET	Ethiopia	9.939075e7	2015.0
14	VN	Vietnam	9.17038e7	2015.0
15	EG	Egypt, Arab Rep.	9.1508084e7	2015.0
16	DE	Germany	8.1413145e7	2015.0
17	IR	Iran, Islamic Rep.	7.9109272e7	2015.0
18	TR	Turkey	7.866583e7	2015.0
19	CD	Congo, Dem. Rep.	7.7266814e7	2015.0
20	TH	Thailand	6.7959359e7	2015.0
21	FR	France	6.6808385e7	2015.0
22	GB	United Kingdom	6.5138232e7	2015.0
23	IT	Italy	6.0802085e7	2015.0
24	ZA	South Africa	5.495692e7	2015.0
25	MM	Myanmar	5.3897154e7	2015.0
26	TZ	Tanzania	5.347042e7	2015.0
27	KR	Korea, Rep.	5.0617045e7	2015.0
28	CO	Colombia	4.8228704e7	2015.0
29	ES	Spain	4.6418269e7	2015.0
30	KE	Kenya	4.6050302e7	2015.0
⋮	⋮	⋮	⋮	⋮

Ready! now our df variable is in order. Let’s try to visualize it again.

bar(df[:iso2c][1:50], df[:SP_POP_TOTL][1:50],
    size=(1000,400),
    label="Population Total")

There it is! Clean and tidy. In fact, let’s only use these 50 cities.

biggest = df[1:50,:]

	iso2c	country	SP_POP_TOTL	year
1	CN	China	1.37122e9	2015.0
2	IN	India	1.311050527e9	2015.0
3	US	United States	3.2141882e8	2015.0
4	ID	Indonesia	2.57563815e8	2015.0
5	BR	Brazil	2.07847528e8	2015.0
6	PK	Pakistan	1.88924874e8	2015.0
7	NG	Nigeria	1.82201962e8	2015.0
8	BD	Bangladesh	1.60995642e8	2015.0
9	RU	Russian Federation	1.44096812e8	2015.0
10	MX	Mexico	1.27017224e8	2015.0
11	JP	Japan	1.26958472e8	2015.0
12	PH	Philippines	1.00699395e8	2015.0
13	ET	Ethiopia	9.939075e7	2015.0
14	VN	Vietnam	9.17038e7	2015.0
15	EG	Egypt, Arab Rep.	9.1508084e7	2015.0
16	DE	Germany	8.1413145e7	2015.0
17	IR	Iran, Islamic Rep.	7.9109272e7	2015.0
18	TR	Turkey	7.866583e7	2015.0
19	CD	Congo, Dem. Rep.	7.7266814e7	2015.0
20	TH	Thailand	6.7959359e7	2015.0
21	FR	France	6.6808385e7	2015.0
22	GB	United Kingdom	6.5138232e7	2015.0
23	IT	Italy	6.0802085e7	2015.0
24	ZA	South Africa	5.495692e7	2015.0
25	MM	Myanmar	5.3897154e7	2015.0
26	TZ	Tanzania	5.347042e7	2015.0
27	KR	Korea, Rep.	5.0617045e7	2015.0
28	CO	Colombia	4.8228704e7	2015.0
29	ES	Spain	4.6418269e7	2015.0
30	KE	Kenya	4.6050302e7	2015.0
⋮	⋮	⋮	⋮	⋮

Next we want to add another indicator: Female population percentage. This indicator has the code: ‘SP.POP.TOTL.FE.ZS’, and it represents the percentage of the total population that are registered as female.

df1 = wdi("SP.POP.TOTL.FE.ZS", countries, 2015)
println(size(df1))
println(size(df))

(191,4)
(208,4)

As you can see, the size of the two data frames we have right now are not the same. This is because some information is missing from the female percentage indicator. Not all indicators can be accurately measured in all countries. therefore, once in a while we can find missing values in them.

A workaround for this is to create a new dataframe with the countries we DO have the information for.

df = DataFrame(iso2c=[],country=[],SP_POP_TOTL=[],SP_POP_TOTL_FE_ZS=[])
for country in biggest[:iso2c]
    push!(df, @data([country,biggest[biggest[:iso2c].==country,:][:country][1],
                    biggest[biggest[:iso2c].==country,:][:SP_POP_TOTL][1],
                    df1[df1[:iso2c].==country,:][:SP_POP_TOTL_FE_ZS][1]]))
end
df[1:5,:]

	iso2c	country	SP_POP_TOTL	SP_POP_TOTL_FE_ZS
1	CN	China	1.37122e9	48.4773329025405
2	IN	India	1.311050527e9	48.1676415969283
3	US	United States	3.2141882e8	50.4329265563715
4	ID	Indonesia	2.57563815e8	49.648253579409
5	BR	Brazil	2.07847528e8	50.828985562917

See, now we have a dataframe with no missing values. Still, percentage of female population is actually a variable that depends on the total population. Let’s create a new dataframe with absolute values.

population = DataFrame(iso2c=df[:,1],
    country=df[:,2],
    total=df[:,3],
    females=df[:,3].*(df[:,4]./100),
    males=df[:,3].*(df[:,4].+(-100))./-100)
population[1:5,:]

	iso2c	country	total	females	males
1	CN	China	1.37122e9	6.64730884226216e8	7.064891157737842e8
2	IN	India	1.311050527e9	6.315021189999998e8	6.795484080000004e8
3	US	United States	3.2141882e8	1.6210091742895588e8	1.593179025710441e8
4	ID	Indonesia	2.57563815e8	1.2787593599999988e8	1.2968787900000012e8
5	BR	Brazil	2.07847528e8	1.0564678999999987e8	1.0220073800000013e8

Of course, this is now an approximation, but it will work for our visualization. Lastly, to plot these three variables we need to set the bounds to plot. This is done with the minimum() and maximum() functions.

minimum(population[:total])

2.5155317e7

Lastly, let’s create a visualization to describe all information we have acquired. Since there are three variables we have the option of plotting 3D points in space, but It will look clearer if we use the total population as color or size of a scatterplot. I like to use gr backend for my plots, but this code will work with almost all of Plots.jl backends.

gr()                                                                   # GR backend
plt = scatter(population[:,end],population[:,end-1],                   # Ploting male vs female
xlim=(minimum(population[:total])/2,maximum(population[:total])/1.07), # Limits for the X axis
ylim=(minimum(population[:total])/2,maximum(population[:total])/1.25), # Limits for the Y axis
size=(1000,1000), left_margin=[50px 0px],label="",                     # Window Size, lateral margins and label
ylabel="Female population",xlabel="Male Papulation",                   # X and Y labels
xscale=:log10, yscale=:log10, bottom_margin=50px,                      # Logarithmic scale and bottom margin
markersize=population[:,end-2]./minimum(population[:total]),           # The size of the marker is given by the total population
zcolor=map(log,convert(Array,population[:,end-2]./minimum(population[:total]))), # The color is given by the logarithm of the total population
grid=false)                                                            # NoGrid Science is art ;)

# Last a line is added to mark weather a coutry hase more male or female population
plot!(plt,x->x,minimum(population[:total])/2,maximum(population[:total])/2,label="", color=:green)

<?xml version=”1.0” encoding=”utf-8”?>