
My First Preppin' Data Challenge

Last week I had a go at my first Preppin' Data challenge.
I decided to do it in Python, but I wasn't sure of the best way to submit my solution. Carl Allchin from the Data School suggested that I start a blog to show my workings. So here it is.

For week 25's challenge, we were given data on Ed Sheeran and Ben Howard concerts. There were three data sets to work with: 'Gigs Data' and 'Home Locations', both sheets in a single Excel file, and a CSV file of longitudes and latitudes for the concert locations.

└── week 25
    ├── LongLats.csv
    └── Wow _ PD data set.xlsx (Gigs Data, Home Locations)


I started by importing the pandas and NumPy libraries (NumPy for the np.nan I use at the end) and assigning the three data sets to separate data frames.

import pandas as pd
import numpy as np

df_LongLats = pd.read_csv('LongLats.csv')
df_gigs = pd.read_excel('Wow _ PD data set.xlsx', sheet_name='Gigs Data')
df_homes = pd.read_excel('Wow _ PD data set.xlsx', sheet_name='Home Locations')
print('gigs df created... ' + str(df_gigs.shape))

Following the instructions, I first tackled the LongLats data set. I used the pandas Series.str.split method to split the longitudes and latitudes, assigning the new split columns to a temporary variable called 'df_LongLats_split'. After renaming the columns, I joined them to the original 'df_LongLats' data frame. I could then remove the redundant LongLats column.

df_LongLats_split = df_LongLats['LongLats'].str.split(', ', n=1, expand=True)
df_LongLats_split.columns = ['Longitude', 'Latitude']
df_LongLats = pd.merge(df_LongLats, df_LongLats_split, left_index=True, right_index=True)
df_LongLats = df_LongLats.drop('LongLats', axis=1)
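As an aside, the split, rename, and merge can be collapsed into one assignment, since pandas lets you assign the two expanded columns straight onto the frame. A minimal sketch on made-up coordinates (the values and locations here are invented, not the challenge data):

```python
import pandas as pd

# Toy stand-in for the LongLats file (values invented for illustration)
df = pd.DataFrame({'Location': ['London', 'Bristol'],
                   'LongLats': ['-0.1276, 51.5072', '-2.5879, 51.4545']})

# Assign both halves of the split directly, then drop the combined column
df[['Longitude', 'Latitude']] = df['LongLats'].str.split(', ', n=1, expand=True)
df = df.drop('LongLats', axis=1)
```

Either way works; the one-step version just avoids the temporary data frame.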

Splitting the Concert column into fellow artists was a bit more difficult. As above, I used the split method to assign the split columns to a new data frame, this time called 'df_fellowartists'.

df_fellowartists = df_gigs['Concert'].str.split(' / |; ', expand=True)

Output:
                   0           1            2     3     4     5     6     7     8     9     10    11    12    13
0          Ben Howard        None         None  None  None  None  None  None  None  None  None  None  None  None
1    Noah & The Whale  Ben Howard         None  None  None  None  None  None  None  None  None  None  None  None
2  Divide' World Tour        None         None  None  None  None  None  None  None  None  None  None  None  None
3         Divide Tour        None         None  None  None  None  None  None  None  None  None  None  None  None
4          Ed Sheeran   Foy Vance  Ben Kweller  None  None  None  None  None  None  None  None  None  None  None

I made the assumption that if the second column was a null value, then the first column must either be 'Ed Sheeran', 'Ben Howard' or the name of a tour. It couldn't be the name of another artist. As I didn't need Ed, Ben or tour names in the fellow artists column, I made it so the first column would change to 'None', if the second column was 'None'.

I then replaced all other occurrences of 'Ed Sheeran' and 'Ben Howard' with null values.

df_fellowartists.loc[df_fellowartists[1].isnull(), 0] = None
replacements = {'Ed Sheeran': None, 'Ben Howard': None}
df_fellowartists.replace(replacements, inplace=True)
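The same first-column blanking can also be written with Series.where, which keeps a value only where the condition holds. A sketch on a toy frame (the rows below are invented to mirror the split output, not the real data):

```python
import pandas as pd

# Toy version of the first two split columns (names invented)
df = pd.DataFrame({0: ['Ben Howard', 'Noah & The Whale', 'Divide Tour'],
                   1: [None, 'Ben Howard', None]})

# Keep column 0 only where column 1 is populated; otherwise null it out
df[0] = df[0].where(df[1].notna())
# Then null out the headline acts wherever they appear
df = df.replace({'Ed Sheeran': None, 'Ben Howard': None})
```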


Now that 'df_fellowartists' only contained tour names and the names of fellow artists, my next task was to put them into a single column. Since I was going to use the index later to merge data frames, I wanted to make sure that any extra rows I created wouldn't affect the original indexing. I did this by duplicating the index to a regular column, 'index1'.

df_fellowartists = df_fellowartists.rename_axis('index1').reset_index()

The challenge specifications ask that for each gig, Ben and Ed have their own row where the fellow artist column is null. I did this by adding an extra column to 'df_fellowartists' called 'Ed|Ben' with a placeholder value.

df_fellowartists['Ed|Ben'] = 'leave empty'

Output:
   index1                 0          1            2     3     4     5     6     7     8     9    10    11    12    13       Ed|Ben
0       0              None       None         None  None  None  None  None  None  None  None  None  None  None  None  leave empty
1       1  Noah & The Whale       None         None  None  None  None  None  None  None  None  None  None  None  None  leave empty
2       2              None       None         None  None  None  None  None  None  None  None  None  None  None  None  leave empty
3       3              None       None         None  None  None  None  None  None  None  None  None  None  None  None  leave empty
4       4              None  Foy Vance  Ben Kweller  None  None  None  None  None  None  None  None  None  None  None  leave empty

To make a single column from all these values, I created a column of lists.

fellowartist_cols = df_fellowartists[[0,1,2,3,4,5,6,7,8,9,10,11,12,13,'Ed|Ben']]

df_fellowartists['list'] = fellowartist_cols.values.tolist()
df_fellowartists = df_fellowartists.drop(fellowartist_cols, axis=1)

Output:
   index1                                               list
0       0  [None, None, None, None, None, None, None, Non...
1       1  [Noah & The Whale, None, None, None, None, Non...
2       2  [None, None, None, None, None, None, None, Non...
3       3  [None, None, None, None, None, None, None, Non...
4       4  [None, Foy Vance, Ben Kweller, None, None, Non...

I then iterated through the lists to remove the null values.

clean_list = []
for artists in df_fellowartists['list']:
    artists = [x for x in artists if x is not None]
    clean_list.append(artists)
df_fellowartists['fellow artists'] = clean_list
df_fellowartists.drop('list', axis=1, inplace=True)

Output:
   index1                         fellow artists
0       0                          [leave empty]
1       1        [Noah & The Whale, leave empty]
2       2                          [leave empty]
3       3                          [leave empty]
4       4  [Foy Vance, Ben Kweller, leave empty]

Here I used a combination of pandas methods to put each value on to a separate row with its original index value. It was now ready to be merged with the gigs data frame on an outer join.

df_fellowartists = df_fellowartists['fellow artists'].apply(pd.Series).stack().reset_index(1, name='fellow artists').drop('level_1', axis=1)
df_gigs = pd.merge(df_gigs, df_fellowartists, how='outer', left_index=True, right_index=True)    
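For what it's worth, newer pandas versions (0.25 and up) have DataFrame.explode, which does this one-row-per-list-item step directly and repeats the original index for each element, so it could be merged back on the index in the same way. A hedged sketch on toy data rather than the challenge files:

```python
import pandas as pd

# Toy frame mirroring the cleaned lists (artist names taken from the example output)
df = pd.DataFrame({'fellow artists': [['leave empty'],
                                      ['Foy Vance', 'Ben Kweller', 'leave empty']]})

# explode gives one row per list element, keeping the source row's index
exploded = df.explode('fellow artists')
```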

Finally, I removed the duplicates in the gigs data frame, then merged it with the Home Locations and LongLats data frames. The last step was to export the result as a CSV file.

df_gigs.drop_duplicates(subset=df_gigs.columns.difference(['ConcertID']), inplace=True)
print('gigs df duplicates dropped... '+str(df_gigs.shape))

df_gigs = pd.merge(df_gigs, df_homes, how='left', on='Artist')    
df_gigs.rename(columns={'Longitude': 'Home Longitude','Latitude': 'Home Latitude'}, inplace=True)
print('gigs df merged with home longlats... '+str(df_gigs.shape))

df_gigs = pd.merge(df_LongLats, df_gigs, how='right', on='Location')    
df_gigs.replace('leave empty', np.nan, inplace=True)
df_gigs.to_csv('cleaned gig data.csv', sep=',')

At different stages of my code, I printed the row count to check that I hadn't added or removed rows by mistake. The final row count for the finished CSV file was 1755. That's seven rows more than the target of 1748, so I think I may have missed some duplicates.
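If I wanted to track those seven rows down, duplicated(keep=False) would flag every member of each duplicate group, not just the extras, which makes the offending gigs easy to eyeball. A minimal sketch on invented rows (the IDs, artists, and locations below are made up):

```python
import pandas as pd

# Invented rows: two are identical apart from the ID column
df = pd.DataFrame({'ConcertID': [1, 2, 3],
                   'Artist': ['Ed Sheeran', 'Ed Sheeran', 'Ben Howard'],
                   'Location': ['Leeds', 'Leeds', 'Bristol']})

# Flag whole duplicate groups, ignoring the ID column as before
cols = df.columns.difference(['ConcertID'])
dupes = df[df.duplicated(subset=cols, keep=False)]
```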


Output:
df_gigs created...df_gigs - number of rows (1545, 6)
df_fellowartists created from Concerts split...df_fellowartists - number of rows (1545, 14)
Ben and Ed removed from df_fellowartists...df_fellowartists - number of rows (1545, 14)
axis duplicated...df_fellowartists - number of rows (1545, 16)
all fellow artists into single column - number of rows (2332, 1)
df_gigs merged with fellow artists...df_gigs - number of rows (2332, 7)
df_gigs duplicates dropped...df_gigs - number of rows (1755, 7)
df_gigs merged with home longlats...df_gigs - number of rows (1755, 10)
df_gigs merged with gigs longlats...df_gigs - number of rows (1755, 12)

I really enjoyed my first Preppin' Data challenge. It gave me some much-needed Python practice. I'm curious to know whether there's a way to convert Python code into HTML markup, so that next time I can write the blog in the comments as I code.
