Using the Python library BeautifulSoup to extract data from a webpage (applied to World Cup rankings)

The Python library BeautifulSoup is an incredible tool for pulling information out of a webpage. You can use it not only to extract tables and lists but also to pull out very specific elements, like a paragraph with a green font color. To briefly illustrate this functionality, and in honor of the upcoming World Cup, we will use BeautifulSoup on world soccer rankings.

Working with BeautifulSoup

If you’re reading this post I’ll assume you know how to install a Python library (hint: easy_install beautifulsoup from the command line). For this example we also use urllib2, which ships with Python 2, to help us open a URL.

To start, of course, you’ll want to import the two libraries:

from BeautifulSoup import BeautifulSoup
import urllib2

With the two libraries imported you can now open the URL and use BeautifulSoup to read the web page. Given that the World Cup is coming up, we decided to apply this example to the FIFA rankings listed on the ESPN FC web page. We're using Mexico as the example (although we'd like to see them move deep into the tournament, we're not hopeful).

Here is what the page on ESPN's soccer site looks like:

[Screenshot: espn_mex]

To open and read the page using BeautifulSoup (and urllib2), you would use the following code:

# ESPN FC page with Mexico's FIFA ranking
url = 'http://www.espnfc.com/spi/rankings/_/view/fifa/teamId/203/mexico?cc=5901'
# Open the URL and hand the raw HTML to BeautifulSoup
page = urllib2.urlopen(url)
soup = BeautifulSoup(page.read())
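The code above targets Python 2 and BeautifulSoup 3. As a rough sketch for readers on Python 3, assuming the bs4 package (BeautifulSoup 4) is installed, the same open-and-parse step might look like this. The helper names make_soup and fetch_soup are ours for illustration, and the demo page is hand-written rather than the ESPN markup.

```python
from urllib.request import urlopen

from bs4 import BeautifulSoup


def make_soup(html):
    """Parse an HTML string (or bytes) with Python's built-in parser."""
    return BeautifulSoup(html, "html.parser")


def fetch_soup(url):
    """Download a page and parse it. Requires network access."""
    with urlopen(url) as page:
        return make_soup(page.read())


# Offline check on a tiny hand-written page (not the real ESPN markup).
demo = make_soup(
    "<html><head><title>Demo</title></head><body><p>Hi</p></body></html>"
)
print(demo.title.string)
print(demo.p.string)
```

With network access, fetch_soup(url) would play the role of the three lines above; the offline demo only shows that parsing returns an object you can navigate by tag name.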

If you were to print out soup you could see the entire webpage. Although it looks like a simple, long text string, there is a lot more to it: soup is a parsed tree of tags that you can search and navigate.

Let's use a simple example. If you were to look at the underlying code (or at soup) you would see a div with the class rank-box, which holds the FIFA rank. Here is what this looks like in the Chrome developer tools console:

[Screenshot: beautiful_soup]

Grabbing this number is incredibly simple: you 'find' the div by class, step down to its h6 child, and get the contents (a list of the element's child nodes, so the number is its first item). All in one line of code:

rank = soup.find("div", {"class": "rank-box"}).h6.contents

It's as easy as this. In our case we want the team name, the rank and the rating. Based on an inspection of the web page DOM we determined that these can be extracted using the following code:

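# FIFA rank: contents of the h6 inside the rank-box div
rank = soup.find("div", {"class": "rank-box"}).h6.contents
# Team name and rating both live inside the team-info div
teaminfo = soup.find("div", {"class": "team-info"})
name = teaminfo.h4.contents
rating = teaminfo.ul.p.span.contents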
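To try these lookups without hitting the live site, here is a minimal sketch that runs them against a hand-written HTML snippet. It assumes Python 3 with the bs4 package; the markup is a guess at the structure implied by the code above, not a copy of the ESPN page, and the rating value is a made-up placeholder. It also shows that contents gives back a list, so get_text is used to obtain plain values.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for the page; the real ESPN markup may differ,
# and the rating of 75.0 is a made-up placeholder.
SAMPLE_HTML = """
<div class="team-info">
  <h4>Mexico</h4>
  <ul><li><p>Rating: <span>75.0</span></p></li></ul>
</div>
<div class="rank-box"><h6>19</h6></div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

rank_box = soup.find("div", {"class": "rank-box"})
teaminfo = soup.find("div", {"class": "team-info"})

# .contents is a list of child nodes, not a bare string.
rank_contents = rank_box.h6.contents

# get_text() returns the text itself, which can then be converted.
rank = int(rank_box.h6.get_text(strip=True))
name = teaminfo.h4.get_text(strip=True)
rating = float(teaminfo.ul.p.span.get_text(strip=True))

print(rank_contents)
print(name, rank, rating)
```

Converting to int and float early makes the later arithmetic on ranks straightforward.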

We can then loop through the teams and grab the data. We were particularly interested in the difference between the FIFA rank and the rank determined by ESPN's Soccer Power Index. Perhaps an average of these two would be a good indicator of success in the tournament. By default the table is sorted by this average. Take a look at the final table below (all columns are sortable and the table is not limited to World Cup teams).
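Once the ranks are collected, the two derived columns are simple arithmetic. The sketch below is our reading of the table rather than the original script: it assumes FIFA-SPI is the absolute difference between the two ranks and FIFA SPI AVG is their mean, which matches the published rows. The three sample teams use ranks taken from the table, and the function name summarize is ours.

```python
# (country, FIFA rank, SPI rank) for three rows of the final table.
teams = [
    ("Portugal", 3, 14),
    ("Spain", 1, 3),
    ("Brazil", 4, 1),
]


def summarize(country, fifa, spi):
    """Build one table row with the difference and average of the two ranks."""
    return {
        "country": country,
        "fifa": fifa,
        "spi": spi,
        "diff": abs(fifa - spi),   # FIFA-SPI column
        "avg": (fifa + spi) / 2,   # FIFA SPI AVG column
    }


# Sort by the average rank, as the table does by default.
rows = sorted((summarize(*team) for team in teams), key=lambda r: r["avg"])

for row in rows:
    print(row["country"], row["fifa"], row["spi"], row["diff"], row["avg"])
```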

The difference between the FIFA and SPI ranks is particularly amazing for Portugal. Who will the US face: 3rd-ranked Portugal or 14th-ranked Portugal?

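Country | Rank (FIFA) | Rank (SPI) | FIFA-SPI | FIFA SPI AVG | WC Group
Spain | 1 | 3 | 2 | 2 | B
Brazil | 4 | 1 | 3 | 2.5 | A
Germany | 2 | 4 | 2 | 3 | G
Argentina | 7 | 2 | 5 | 4.5 | F
Colombia | 5 | 6 | 1 | 5.5 | C
Uruguay | 6 | 8 | 2 | 7 | D
Portugal | 3 | 14 | 11 | 8.5 | G
Chile | 13 | 5 | 8 | 9 | B
England | 11 | 9 | 2 | 10 | D
Italy | 9 | 12 | 3 | 10.5 | D
France | 16 | 7 | 9 | 11.5 | E
Belgium | 12 | 13 | 1 | 12.5 | H
Netherlands | 15 | 10 | 5 | 12.5 | B
Switzerland | 8 | 22 | 14 | 15 | E
Russia | 18 | 17 | 1 | 17.5 | H
Ukraine | 17 | 18 | 1 | 17.5 | Not in WC
United States | 14 | 21 | 7 | 17.5 | G
Ivory Coast | 21 | 16 | 5 | 18.5 | C
Greece | 10 | 27 | 17 | 18.5 | C
Ecuador | 28 | 11 | 17 | 19.5 | E
Bosnia-Herzegovina | 25 | 15 | 10 | 20 | F
Mexico | 19 | 25 | 6 | 22 | A
Croatia | 20 | 30 | 10 | 25 | A
Sweden | 25 | 29 | 4 | 27 | Not in WC
Egypt | 24 | 32 | 8 | 28 | Not in WC
Costa Rica | 34 | 24 | 10 | 29 | D
Honduras | 30 | 33 | 3 | 31.5 | E
Denmark | 23 | 41 | 18 | 32 | Not in WC
Ghana | 38 | 26 | 12 | 32 | G
Peru | 42 | 23 | 19 | 32.5 | Not in WC
Czech Republic | 36 | 34 | 2 | 35 | Not in WC
Scotland | 22 | 48 | 26 | 35 | Not in WC
Nigeria | 44 | 28 | 16 | 36 | F
Austria | 40 | 35 | 5 | 37.5 | Not in WC
Paraguay | 55 | 20 | 35 | 37.5 | Not in WC
Iran | 37 | 39 | 2 | 38 | F
Romania | 32 | 45 | 13 | 38.5 | Not in WC
Venezuela | 41 | 37 | 4 | 39 | Not in WC
Panama | 35 | 46 | 11 | 40.5 | Not in WC
Japan | 47 | 36 | 11 | 41.5 | C
Turkey | 39 | 44 | 5 | 41.5 | Not in WC
Slovenia | 29 | 56 | 27 | 42.5 | Not in WC
South Korea | 55 | 31 | 24 | 43 | H
Cameroon | 50 | 38 | 12 | 44 | A
Armenia | 33 | 61 | 28 | 47 | Not in WC
Algeria | 25 | 69 | 44 | 47 | H
Slovakia | 46 | 52 | 6 | 49 | Not in WC
Australia | 59 | 40 | 19 | 49.5 | B
Finland | 52 | 47 | 5 | 49.5 | Not in WC
Tunisia | 49 | 57 | 8 | 53 | Not in WC
Uzbekistan | 53 | 54 | 1 | 53.5 | Not in WC
Republic of Ireland | 66 | 42 | 24 | 54 | Not in WC
Guinea | 51 | 59 | 8 | 55 | Not in WC
Bolivia | 68 | 43 | 25 | 55.5 | Not in WC
Burkina Faso | 61 | 51 | 10 | 56 | Not in WC
Senegal | 63 | 49 | 14 | 56 | Not in WC
Wales | 47 | 66 | 19 | 56.5 | Not in WC
South Africa | 65 | 53 | 12 | 59 | Not in WC
Hungary | 45 | 75 | 30 | 60 | Not in WC
Bulgaria | 73 | 50 | 23 | 61.5 | Not in WC
Mali | 59 | 64 | 5 | 61.5 | Not in WC
Libya | 62 | 63 | 1 | 62.5 | Not in WC
Norway | 55 | 70 | 15 | 62.5 | Not in WC
Poland | 72 | 58 | 14 | 65 | Not in WC
Iceland | 58 | 73 | 15 | 65.5 | Not in WC
Morocco | 76 | 62 | 14 | 69 | Not in WC
Cape Verde Islands | 42 | 99 | 57 | 70.5 | Not in WC
Belarus | 83 | 60 | 23 | 71.5 | Not in WC
Israel | 78 | 67 | 11 | 72.5 | Not in WC
Jordan | 64 | 81 | 17 | 72.5 | Not in WC
Jamaica | 81 | 65 | 16 | 73 | Not in WC
Albania | 70 | 83 | 13 | 76.5 | Not in WC
Saudi Arabia | 75 | 78 | 3 | 76.5 | Not in WC
Macedonia | 80 | 74 | 6 | 77 | Not in WC
Congo DR | 88 | 71 | 17 | 79.5 | Not in WC
El Salvador | 69 | 92 | 23 | 80.5 | Not in WC
Trinidad and Tobago | 74 | 91 | 17 | 82.5 | Not in WC
Angola | 94 | 76 | 18 | 85 | Not in WC
Oman | 82 | 90 | 8 | 86 | Not in WC
Azerbaijan | 85 | 96 | 11 | 90.5 | Not in WC
New Zealand | 111 | 72 | 39 | 91.5 | Not in WC
Zimbabwe | 98 | 85 | 13 | 91.5 | Not in WC
China | 96 | 88 | 8 | 92 | Not in WC
Georgia | 103 | 82 | 21 | 92.5 | Not in WC
Haiti | 77 | 113 | 36 | 95 | Not in WC
Benin | 97 | 94 | 3 | 95.5 | Not in WC
Northern Ireland | 84 | 107 | 23 | 95.5 | Not in WC
Kenya | 106 | 87 | 19 | 96.5 | Not in WC
Guatemala | 124 | 77 | 47 | 100.5 | Not in WC
Lithuania | 104 | 101 | 3 | 102.5 | Not in WC
Canada | 110 | 97 | 13 | 103.5 | Not in WC
Kuwait | 108 | 100 | 8 | 104 | Not in WC
Moldova | 99 | 110 | 11 | 104.5 | Not in WC
Estonia | 93 | 119 | 26 | 106 | Not in WC
Latvia | 109 | 112 | 3 | 110.5 | Not in WC
Cuba | 90 | 139 | 49 | 114.5 | Not in WC
Kazakhstan | 118 | 115 | 3 | 116.5 | Not in WC
Cyprus | 130 | 123 | 7 | 126.5 | Not in WC
Luxembourg | 112 | 143 | 31 | 127.5 | Not in WC
Rwanda | 131 | 125 | 6 | 128 | Not in WC
Dominican Republic | 126 | 134 | 8 | 130 | Not in WC
Antigua and Barbuda | 142 | 133 | 9 | 137.5 | Not in WC
Malta | 128 | 147 | 19 | 137.5 | Not in WC
Puerto Rico | 149 | 137 | 12 | 143 | Not in WC
Suriname | 131 | 159 | 28 | 145 | Not in WC
Grenada | 136 | 157 | 21 | 146.5 | Not in WC
Guyana | 151 | 142 | 9 | 146.5 | Not in WC
St. Vincent and Grenadines | 126 | 167 | 41 | 146.5 | Not in WC
Liechtenstein | 150 | 144 | 6 | 147 | Not in WC
Belize | 144 | 151 | 7 | 147.5 | Not in WC
Nicaragua | 168 | 138 | 30 | 153 | Not in WC
St. Kitts and Nevis | 153 | 158 | 5 | 155.5 | Not in WC
Faroe Islands | 164 | 153 | 11 | 158.5 | Not in WC
Malaysia | 145 | 172 | 27 | 158.5 | Not in WC
Netherlands Antilles | 157 | 163 | 6 | 160 | Not in WC
Bermuda | 169 | 152 | 17 | 160.5 | Not in WC
St. Lucia | 133 | 193 | 60 | 163 | Not in WC
Barbados | 161 | 169 | 8 | 165 | Not in WC
Hong Kong | 158 | 176 | 18 | 167 | Not in WC
Aruba | 155 | 189 | 34 | 172 | Not in WC
Dominica | 163 | 184 | 21 | 173.5 | Not in WC
Bahamas | 186 | 186 | 0 | 186 | Not in WC
Andorra | 199 | 181 | 18 | 190 | Not in WC
Cayman Islands | 195 | 188 | 7 | 191.5 | Not in WC
Montserrat | 188 | 213 | 25 | 200.5 | Not in WC
San Marino | 207 | 198 | 9 | 202.5 | Not in WC
US Virgin Islands | 194 | 214 | 20 | 204 | Not in WC
British Virgin Islands | 197 | 215 | 18 | 206 | Not in WC
Turks and Caicos Islands | 207 | 208 | 1 | 207.5 | Not in WC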
