While looking for ways to access the World Bank data bank programmatically, I found this very useful library to access it with Julia (along with many others for python and R). The library is WorldBankData.jl, by 4gh. Although the repo doesn’t have much activity, the library works great. Here are some examples of what can be done with the World Bank data, and the powerful Julia dataframes. I will also use Plots.jl for visualizations;

using WorldBankData
using Plots
using DataFrames

The World Bank offers two kind of data sources: Countries and Indicators. Indicators are the variables being measured in every country. Every year 13,398 indicators are measured in 258 countries. I have selected some countries to work with, since not all data from the world bank is complete. This filtering will facilitate our task:

countries = ["AW","AF","AO","AL","AD","AE","AR","AM","AS","AG","AU","AT","AZ","BI","BE","BJ","BF","BD","BG","BH","BS","BA","BY","BZ","BM","BO","BR","BB","BN","BT","BW","CF","CA","CH","CL","CN","CI","CM","CD","CG","CO","KM","CV","CR","CU","KY","CY","CZ","DE","DJ","DM","DK","DO","DZ","EC","EG","ER","ES","EE","ET","FI","FJ","FR","FO","FM","GA","GB","GE","GH","GN","GM","GW","GQ","GR","GD","GL","GT","GU","GY","HK","HN","HR","HT","HU","ID","IM","IN","IE","IR","IQ","IS","IL","IT","JM","JO","JP","KZ","KE","KG","KH","KI","KN","KR","XK","KW","LA","LB","LR","LY","LC","LI","LK","LS","LT","LU","LV","MO","MA","MC","MD","MG","MV","MX","MH","MK","ML","MT","MM","ME","MN","MP","MZ","MR","MU","MW","MY","NA","NC","NE","NG","NI","NL","NO","NP","NZ","OM","PK","PA","PE","PH","PW","PG","PL","PR","KP","PT","PY","PF","QA","RO","RU","RW","SA","SD","SN","SG","SB","SL","SV","SM","SO","RS","SS","ST","SR","SK","SI","SE","SZ","SC","SY","TC","TD","TG","TH","TJ","TM","TL","TO","TT","TN","TR","TV","TZ","UG","UA","UY","US","UZ","VC","VE","VI","VN","VU","WS","YE","ZA","ZM","ZW"];

A list of countries has to be given to the wdi() function, along with an indicator. This function queries the WBDB and returns a dataframe with all data available. If a year is added to the query, only the data for that year will be returned. Let’s work with total country population.

df=wdi("SP.POP.TOTL", countries, 2015)
iso2ccountrySP_POP_TOTLyear
1AWAruba103889.02015.0
2AFAfghanistan3.2526562e72015.0
3AOAngola2.5021974e72015.0
4ALAlbania2.889167e62015.0
5ADAndorra70473.02015.0
6AEUnited Arab Emirates9.156963e62015.0
7ARArgentina4.3416755e72015.0
8AMArmenia3.017712e62015.0
9ASAmerican Samoa55538.02015.0
10AGAntigua and Barbuda91818.02015.0
11AUAustralia2.3781169e72015.0
12ATAustria8.611088e62015.0
13AZAzerbaijan9.651349e62015.0
14BIBurundi1.1178921e72015.0
15BEBelgium1.1285721e72015.0
16BJBenin1.0879829e72015.0
17BFBurkina Faso1.810557e72015.0
18BDBangladesh1.60995642e82015.0
19BGBulgaria7.177991e62015.0
20BHBahrain1.377237e62015.0
21BSBahamas, The388019.02015.0
22BABosnia and Herzegovina3.810416e62015.0
23BYBelarus9.513e62015.0
24BZBelize359287.02015.0
25BMBermuda65235.02015.0
26BOBolivia1.0724705e72015.0
27BRBrazil2.07847528e82015.0
28BBBarbados284215.02015.0
29BNBrunei Darussalam423188.02015.0
30BTBhutan774830.02015.0

Lets visualize the first 50 countries with a barplot.

bar(df[:iso2c][1:50], df[:SP_POP_TOTL][1:50], size=(900,400), label="Total Population")

The result is somehow expected. China is within the first 50 countries in the list, and it scales the graph for itself. Let’s find out exactly how much is big.

maximum(df[:SP_POP_TOTL])
1.37122e9

More than one billion people live only in china. This is almost 20% of the world’s population:

maximum(df[:SP_POP_TOTL])/sum(df[:SP_POP_TOTL])
0.18750300206223874

Right now we have the population for 208 countries. Almost all the countries measured by the WB.

size(df[:SP_POP_TOTL])
(208,)

In julia to find the index of the maximum value, the indmax() function can be used. This is useful if you want to know more about the element with the maximum value in a column.

indmax(df[:SP_POP_TOTL])
36

In fact, element number 36 in our dataframe is China:

df[:country][36]
"China"

Now let’s try something else. We want to see the most populated countries in the world. The first thought to come to mind is to sort the dataframe, right?

sort(df)
iso2ccountrySP_POP_TOTLyear
1ADAndorra70473.02015.0
2AEUnited Arab Emirates9.156963e62015.0
3AFAfghanistan3.2526562e72015.0
4AGAntigua and Barbuda91818.02015.0
5ALAlbania2.889167e62015.0
6AMArmenia3.017712e62015.0
7AOAngola2.5021974e72015.0
8ARArgentina4.3416755e72015.0
9ASAmerican Samoa55538.02015.0
10ATAustria8.611088e62015.0
11AUAustralia2.3781169e72015.0
12AWAruba103889.02015.0
13AZAzerbaijan9.651349e62015.0
14BABosnia and Herzegovina3.810416e62015.0
15BBBarbados284215.02015.0
16BDBangladesh1.60995642e82015.0
17BEBelgium1.1285721e72015.0
18BFBurkina Faso1.810557e72015.0
19BGBulgaria7.177991e62015.0
20BHBahrain1.377237e62015.0
21BIBurundi1.1178921e72015.0
22BJBenin1.0879829e72015.0
23BMBermuda65235.02015.0
24BNBrunei Darussalam423188.02015.0
25BOBolivia1.0724705e72015.0
26BRBrazil2.07847528e82015.0
27BSBahamas, The388019.02015.0
28BTBhutan774830.02015.0
29BWBotswana2.262485e62015.0
30BYBelarus9.513e62015.0

Well, not really. This function will sort only by the first column. To sort by other column we can add the cols argument to the function.

sort(df, cols=[:SP_POP_TOTL])
iso2ccountrySP_POP_TOTLyear
1TVTuvalu9916.02015.0
2PWPalau21291.02015.0
3SMSan Marino31781.02015.0
4TCTurks and Caicos Islands34339.02015.0
5LILiechtenstein37531.02015.0
6MCMonaco37731.02015.0
7FOFaroe Islands48199.02015.0
8MHMarshall Islands52993.02015.0
9MPNorthern Mariana Islands55070.02015.0
10ASAmerican Samoa55538.02015.0
11KNSt. Kitts and Nevis55572.02015.0
12GLGreenland56114.02015.0
13KYCayman Islands59967.02015.0
14BMBermuda65235.02015.0
15ADAndorra70473.02015.0
16DMDominica72680.02015.0
17IMIsle of Man87780.02015.0
18AGAntigua and Barbuda91818.02015.0
19SCSeychelles92900.02015.0
20VIVirgin Islands (U.S.)103574.02015.0
21AWAruba103889.02015.0
22FMMicronesia, Fed. Sts.104460.02015.0
23TOTonga106170.02015.0
24GDGrenada106825.02015.0
25VCSt. Vincent and the Grenadines109462.02015.0
26KIKiribati112423.02015.0
27GUGuam169885.02015.0
28LCSt. Lucia184999.02015.0
29STSao Tome and Principe190344.02015.0
30WSSamoa193228.02015.0

Just one more little detail. This last query has given us a dataframe with the smallest cities in the world. We want to revert the order of the sorting. This time I’ll use the sort!() function, to save the ordered dataframe into the df variable.

sort!(df, cols=[:SP_POP_TOTL], rev=true)
iso2ccountrySP_POP_TOTLyear
1CNChina1.37122e92015.0
2INIndia1.311050527e92015.0
3USUnited States3.2141882e82015.0
4IDIndonesia2.57563815e82015.0
5BRBrazil2.07847528e82015.0
6PKPakistan1.88924874e82015.0
7NGNigeria1.82201962e82015.0
8BDBangladesh1.60995642e82015.0
9RURussian Federation1.44096812e82015.0
10MXMexico1.27017224e82015.0
11JPJapan1.26958472e82015.0
12PHPhilippines1.00699395e82015.0
13ETEthiopia9.939075e72015.0
14VNVietnam9.17038e72015.0
15EGEgypt, Arab Rep.9.1508084e72015.0
16DEGermany8.1413145e72015.0
17IRIran, Islamic Rep.7.9109272e72015.0
18TRTurkey7.866583e72015.0
19CDCongo, Dem. Rep.7.7266814e72015.0
20THThailand6.7959359e72015.0
21FRFrance6.6808385e72015.0
22GBUnited Kingdom6.5138232e72015.0
23ITItaly6.0802085e72015.0
24ZASouth Africa5.495692e72015.0
25MMMyanmar5.3897154e72015.0
26TZTanzania5.347042e72015.0
27KRKorea, Rep.5.0617045e72015.0
28COColombia4.8228704e72015.0
29ESSpain4.6418269e72015.0
30KEKenya4.6050302e72015.0

Ready! now our df variable is in order. Let’s try to visualize it again.

bar(df[:iso2c][1:50], df[:SP_POP_TOTL][1:50],
    size=(1000,400),
    label="Population Total")

There it is! Clean and tidy. In fact, let’s only use these 50 cities.

biggest = df[1:50,:]
iso2ccountrySP_POP_TOTLyear
1CNChina1.37122e92015.0
2INIndia1.311050527e92015.0
3USUnited States3.2141882e82015.0
4IDIndonesia2.57563815e82015.0
5BRBrazil2.07847528e82015.0
6PKPakistan1.88924874e82015.0
7NGNigeria1.82201962e82015.0
8BDBangladesh1.60995642e82015.0
9RURussian Federation1.44096812e82015.0
10MXMexico1.27017224e82015.0
11JPJapan1.26958472e82015.0
12PHPhilippines1.00699395e82015.0
13ETEthiopia9.939075e72015.0
14VNVietnam9.17038e72015.0
15EGEgypt, Arab Rep.9.1508084e72015.0
16DEGermany8.1413145e72015.0
17IRIran, Islamic Rep.7.9109272e72015.0
18TRTurkey7.866583e72015.0
19CDCongo, Dem. Rep.7.7266814e72015.0
20THThailand6.7959359e72015.0
21FRFrance6.6808385e72015.0
22GBUnited Kingdom6.5138232e72015.0
23ITItaly6.0802085e72015.0
24ZASouth Africa5.495692e72015.0
25MMMyanmar5.3897154e72015.0
26TZTanzania5.347042e72015.0
27KRKorea, Rep.5.0617045e72015.0
28COColombia4.8228704e72015.0
29ESSpain4.6418269e72015.0
30KEKenya4.6050302e72015.0

Next we want to add another indicator: Female population percentage. This indicator has the code: ‘SP.POP.TOTL.FE.ZS’, and it represents the percentage of the total population that are registered as female.

df1 = wdi("SP.POP.TOTL.FE.ZS", countries, 2015)
println(size(df1))
println(size(df))
(191,4)
(208,4)

As you can see, the size of the two data frames we have right now are not the same. This is because some information is missing from the female percentage indicator. Not all indicators can be accurately measured in all countries. therefore, once in a while we can find missing values in them.

A workaround for this is to create a new dataframe with the countries we DO have the information for.

df = DataFrame(iso2c=[],country=[],SP_POP_TOTL=[],SP_POP_TOTL_FE_ZS=[])
for country in biggest[:iso2c]
    push!(df, @data([country,biggest[biggest[:iso2c].==country,:][:country][1],
                    biggest[biggest[:iso2c].==country,:][:SP_POP_TOTL][1],
                    df1[df1[:iso2c].==country,:][:SP_POP_TOTL_FE_ZS][1]]))
end
df[1:5,:]
iso2ccountrySP_POP_TOTLSP_POP_TOTL_FE_ZS
1CNChina1.37122e948.4773329025405
2INIndia1.311050527e948.1676415969283
3USUnited States3.2141882e850.4329265563715
4IDIndonesia2.57563815e849.648253579409
5BRBrazil2.07847528e850.828985562917

See, now we have a dataframe with no missing values. Still, percentage of female population is actually a variable that depends on the total population. Let’s create a new dataframe with absolute values.

population = DataFrame(iso2c=df[:,1],
    country=df[:,2],
    total=df[:,3],
    females=df[:,3].*(df[:,4]./100),
    males=df[:,3].*(df[:,4].+(-100))./-100)
population[1:5,:]
iso2ccountrytotalfemalesmales
1CNChina1.37122e96.64730884226216e87.064891157737842e8
2INIndia1.311050527e96.315021189999998e86.795484080000004e8
3USUnited States3.2141882e81.6210091742895588e81.593179025710441e8
4IDIndonesia2.57563815e81.2787593599999988e81.2968787900000012e8
5BRBrazil2.07847528e81.0564678999999987e81.0220073800000013e8

Of course, this is now an approximation, but it will work for our visualization. Lastly, to plot these three variables we need to set the bounds to plot. This is done with the minimum() and maximum() functions.

minimum(population[:total])
2.5155317e7

Lastly, let’s create a visualization to describe all information we have acquired. Since there are three variables we have the option of plotting 3D points in space, but It will look clearer if we use the total population as color or size of a scatterplot. I like to use gr backend for my plots, but this code will work with almost all of Plots.jl backends.

gr()                                                                   # GR backend
plt = scatter(population[:,end],population[:,end-1],                   # Ploting male vs female
xlim=(minimum(population[:total])/2,maximum(population[:total])/1.07), # Limits for the X axis
ylim=(minimum(population[:total])/2,maximum(population[:total])/1.25), # Limits for the Y axis
size=(1000,1000), left_margin=[50px 0px],label="",                     # Window Size, lateral margins and label
ylabel="Female population",xlabel="Male Papulation",                   # X and Y labels
xscale=:log10, yscale=:log10, bottom_margin=50px,                      # Logarithmic scale and bottom margin
markersize=population[:,end-2]./minimum(population[:total]),           # The size of the marker is given by the total population
zcolor=map(log,convert(Array,population[:,end-2]./minimum(population[:total]))), # The color is given by the logarithm of the total population
grid=false)                                                            # NoGrid Science is art ;)

# Last a line is added to mark weather a coutry hase more male or female population
plot!(plt,x->x,minimum(population[:total])/2,maximum(population[:total])/2,label="", color=:green)
10 7.5 10 8.0 10 8.5 10 9.0 10 7.5 10 8.0 10 8.5 10 9.0 Male Papulation Female population 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0