Pandas and analysing football data

The Pandas library is a tool used for data manipulation and analysis. Here we will demonstrate some of its features.

import pandas as pd

footy = pd.read_csv('D:\programming\python\data\wc2014\players in wc 2014v2.csv')


Player Position Club Club_league Age Country Caps International_goals
0 Alan PULIDO Forward Tigres UANL Mexico 23 Mexico 5 4
1 Adam TAGGART Forward Newcastle United Jets FC Australia 21 Australia 4 3
2 Reza GHOOCHANNEJAD Forward Charlton Athletic FC England 26 Iran 13 9
3 NEYMAR Forward FC Barcelona Spain 22 Brazil 48 31
4 Didier DROGBA Forward Galatasaray SK Turkey 36 Ivory Coast 100 61

Firstly we import the Pandas library. Pandas can read and write to multiple file types including sql, Excel and text files. Here we will look at a csv file containing data about footballers who were due to play in World Cup 2014. The head() function returns the first n rows from the dataframe, this defaults to 5. At this point it's worth mentioning that a dataframe is a 2-dimensional data structure that contains columns and rows, and is similar to a spreadsheet or database.



FC Bayern Muenchen 15
Manchester United FC 14
FC Barcelona 13
SSC Napoli 12
Real Madrid CF 12
Chelsea FC 12
Juventus FC12
Manchester City FC 10
Liverpool FC 10
Paris Saint-Germain FC 10
Arsenal FC 10
Atletico Madrid 9
FC Porto9
FC Internazionale 8
AC Milan8
FC Zenit St. Petersburg 8
FC Dynamo Moscow7
SS Lazio7
Borussia Dortmund 7
FC Schalke 04 7
dtype: int64


One of the key requirements in data analysis is the ability to count different types of data. Here we use the value_counts() function to sum the number of times a football club appears in the Club column. I have also used the head() function to limit the results, but instead of the default of 5, I want to see the top 20 most represented clubs at the World Cup.



Player Position Club Club_league Age Country Caps International_goals
447 Iker CASILLAS Goalkeeper Real Madrid CF Spain 33 Spain 153 0
271 Yasuhito ENDO Midfielder Gamba Osaka Japan 34 Japan 143 12
435 Gianluigi BUFFON Goalkeeper Juventus FC Italy 36 Italy 139 0
106 Javad NEKOUNAM Midfielder Kuwait SC Kuwait 33 Iran 137 37
293 Georgios KARAGOUNIS Midfielder Fulham FC England 37 Greece 134 10
259 Xavi HERNANDEZ Midfielder FC Barcelona Spain 34 Spain 131 12
13 Miroslav KLOSE Forward SS Lazio Italy 36 Germany 131 68
434 Noel VALLADARES Goalkeeper CD Olimpia Honduras 37 Honduras 120 0
272 Carlos SALCIDO Defender Tigres UANL Mexico 34 Mexico 120 10
211 Rafael MARQUEZ Defender Club Leon Mexico 35 Mexico 119 15

Another key requirement in data analysis is to sort data according to a given feature. Here we use sort_index() to sort our data by the Caps column. Now we can see the ten most capped players at the World Cup.


mygroup = footy[['Player', 'Country', 'Caps']].groupby('Country')

x = (group.sort_index(by='Caps', ascending=False)[:1] for Country,group in mygroup)

highestcaps = pd.DataFrame()

for line in x:
    highestcaps = highestcaps.append(line)


Player Country Caps
312 Madjid BOUGUERRA Algeria 61
416 Javier MASCHERANO Argentina 96
158 Mark BRESCIANO Australia 73
210 Daniel VAN BUYTEN Belgium 79
81 Zvjezdan MISIMOVIC Bosnia & Herzegovina 81
441 JULIO CESAR Brazil 79
28 Samuel ETOO Cameroon 117
471 Claudio BRAVO Chile 79
315 Mario YEPES Columbia 97
430 Michael UMANA Costa Rica 81
154 Darijo SRNA Croatia 111
172 Edison MENDEZ Ecuador 110
152 Steven GERRARD England 110
149 Franck RIBERY France 81
13 Miroslav KLOSE Germany 131
110 Sulley MUNTARI Ghana 80
293 Georgios KARAGOUNIS Greece 134
434 Noel VALLADARES Honduras 120
106 Javad NEKOUNAM Iran 137
435 Gianluigi BUFFON Italy 139
432 Didier ZOKORA Ivory Coast 119
271 Yasuhito ENDO Japan 143
272 Carlos SALCIDO Mexico 120
107 Wesley SNEIJDER Netherlands 98
295 Joseph YOBO Nigeria 96
336 Sergey IGNASHEVICH Russia 97
97 LEE Keunho South Korea 63
447 Iker CASILLAS Spain 153
203 Tranquillo BARNETTA Switzerland 74
185 DaMarcus BEASLEY USA 115
73 Diego FORLAN Uruguay 109

Finally let's look at the highestcaps DataFrame, which is a bit more advanced. It shows us the most capped player for each country. To make this dataframe we have built on our previous use of the sort_index() function by also using the groupby() function. Firstly we select the relevant columns we are interested in Player, Country and Caps and use groupby() to organise them by Country so players from the same country are all together. Next we use a generator comprehension which we will call 'x'. This takes the mygroup object where players are already grouped together by Country and now also orders them by Caps. So each country's players are grouped together and within each group the players are ordered by the number of caps from highest to lowest. From this we use [:1] to only retreive the the most capped player from each country. Now we can create the highestcaps DataFrame and append each line of x to it to show the most capped player for each country.