An attempt to create a Python script to analyze an iTunes library. It is intended for Python beginners and aims to make learning fun, especially for people who love analytics and music. I worked on this in 2018 as part of learning Python. The library file contains a lot of interesting information, such as Play Count, Skip Count, Play Date, Release Date and Album Name. From it we can get insights such as the most listened-to artists, the most skipped songs and the physical location of each music file.
The library file is stored in XML format, in a location that depends on the OS.
- Windows: C:\Users\<username>\Music\iTunes
If you have trouble finding the library file, you can always export it by clicking
- File → Export Library
Folks who are familiar with the iTunes library file will know that “Library.xml” stores its data as key-value pairs rather than as meaningful XML tags.
The tags <dict>, <key>, <string> and <integer> are ubiquitous in the library file. For example, the name of a track is coded as a <key> tag with Name as its content, immediately followed by a <string> tag with the actual track name, Hurt, as its content.
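To make the key-value layout concrete, here is a minimal sketch that turns one track entry into a Python dictionary by pairing each <key> element with the element that follows it. The sample entry and the helper name track_to_dict are made up for illustration (the Track ID in particular is invented), and the sketch assumes keys and values strictly alternate.

```python
import xml.etree.ElementTree as ET

# A made-up track entry in the key/value layout described above.
SAMPLE_TRACK = """
<dict>
    <key>Track ID</key><integer>1001</integer>
    <key>Name</key><string>Hurt</string>
    <key>Play Count</key><integer>216</integer>
</dict>
"""


def track_to_dict(track_elem):
    """Pair each <key> element with the value element that follows it."""
    children = list(track_elem)
    record = {}
    # Even positions hold <key> elements, odd positions hold their values.
    for key_elem, value_elem in zip(children[0::2], children[1::2]):
        if key_elem.tag == "key":
            record[key_elem.text] = value_elem.text
    return record


track = track_to_dict(ET.fromstring(SAMPLE_TRACK.strip()))
print(track)
```

Every value comes back as a string here, so numeric fields such as Play Count still need converting before any arithmetic.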
Python Script
Before this, it is important to understand the structure of the library file, which is shown in the image below. NB: Indentation of the Python code may have to be corrected manually. All track information is encoded under:
- <key>Tracks</key>
- <dict>
Importing Libraries
import pandas as pd
import xml.etree.ElementTree as ET ## XML parsing
Reading Library file
lib = r'C:\Users\<username>\Music\iTunes\Library.xml'
Parsing the library file
All the important attributes of a track can be obtained under the <dict> tag shown in Fig 2 above. The top-level <dict> of the file is fetched into the main_dict variable. All track information is encoded under the first nested <dict> of that element, stored below as tracks_dict. The list of all track elements is stored in the tracklist variable.
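tree = ET.parse(lib)
root = tree.getroot()
## Top-level <dict> of the library file
main_dict = root.findall('dict')
## The first nested <dict> holds all tracks
for item in list(main_dict[0]):
    if item.tag == "dict":
        tracks_dict = item
        break
## One <dict> element per track
tracklist = list(tracks_dict.findall('dict'))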
From the tracklist obtained above, we split the tracks into podcasts and music:
- podcast: list of XML elements that are podcasts
- purchased: list of XML elements purchased from iTunes
- apple_music: list of XML elements added to the library through an Apple Music subscription
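The splitting code itself is not shown here, so the following is only a sketch of one way to do it. It assumes podcasts can be recognised by a Genre value of "Podcast", and that purchased and subscription tracks have a Kind value starting with "Purchased" or "Apple Music" respectively. The helper names track_value and split_tracks and the sample tracks are hypothetical, so check the values against your own Library.xml.

```python
import xml.etree.ElementTree as ET


def track_value(track, key_name):
    """Return the text of the value element that follows <key>key_name</key>, or None."""
    children = list(track)
    for key_elem, value_elem in zip(children[0::2], children[1::2]):
        if key_elem.tag == "key" and key_elem.text == key_name:
            return value_elem.text
    return None


def split_tracks(tracklist):
    """Split track elements into podcast, purchased and Apple Music lists."""
    podcast, purchased, apple_music = [], [], []
    for track in tracklist:
        kind = track_value(track, "Kind") or ""
        if track_value(track, "Genre") == "Podcast":      # assumption: podcasts carry this genre
            podcast.append(track)
        elif kind.startswith("Purchased"):                 # assumption: e.g. "Purchased AAC audio file"
            purchased.append(track)
        elif kind.startswith("Apple Music"):               # assumption: e.g. "Apple Music AAC audio file"
            apple_music.append(track)
    return podcast, purchased, apple_music


# Made-up sample standing in for the real tracklist.
SAMPLE = """<dict>
<dict><key>Name</key><string>Episode 1</string><key>Genre</key><string>Podcast</string></dict>
<dict><key>Name</key><string>Song A</string><key>Kind</key><string>Purchased AAC audio file</string></dict>
<dict><key>Name</key><string>Song B</string><key>Kind</key><string>Apple Music AAC audio file</string></dict>
</dict>"""
sample_tracks = ET.fromstring(SAMPLE).findall("dict")
sample_split = split_tracks(sample_tracks)
print([len(group) for group in sample_split])

# With the real library:
# podcast, purchased, apple_music = split_tracks(tracklist)
```

Tracks that match none of the three rules (for example, files ripped from CDs) are ignored by this sketch.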
Print the number of tracks of each kind: podcast, purchased and Apple Music
print("Number of tracks under Podcast:", len(podcast))
print("Number of tracks Purchased:", len(purchased))
print("Number of tracks added through Apple Music subscription:", len(apple_music))
Function to fetch all columns from the three kinds of tracks: podcast, purchased and Apple Music. It accepts a list of XML elements as its parameter and returns the column names as a set, which is used to create the final DataFrame.
- Column names can be obtained from the content of the <key> tags. (Please refer to Fig 1: A snapshot of Library.xml)
- podcast_cols: set of all column names from podcasts
- purchased_cols: set of all column names from purchased music
- apple_music_cols: set of all column names from Apple Music subscription tracks
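As a rough sketch of such a function, the version below walks every track element and collects the text of each <key> tag into a set. The name get_columns and the sample tracks are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def get_columns(tracks):
    """Collect the text of every <key> element across a list of track elements."""
    columns = set()
    for track in tracks:
        for child in track:
            if child.tag == "key":
                columns.add(child.text)
    return columns


# Made-up sample: two tracks with partly different keys.
SAMPLE = """<dict>
<dict><key>Name</key><string>Song A</string><key>Play Count</key><integer>12</integer></dict>
<dict><key>Name</key><string>Song B</string><key>Skip Count</key><integer>1</integer></dict>
</dict>"""
sample_cols = get_columns(ET.fromstring(SAMPLE).findall("dict"))
print(sorted(sample_cols))

# With the real library:
# podcast_cols = get_columns(podcast)
# purchased_cols = get_columns(purchased)
# apple_music_cols = get_columns(apple_music)
```

Using a set means a key that appears on only some tracks still becomes a column, and duplicates are dropped automatically.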
Function to create the final DataFrame. It accepts a list of XML elements and a column set as parameters and returns a DataFrame.
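One possible shape for this function is sketched below. It builds one record per track from the key-value pairs and lets pandas fill missing columns with NaN. The name tracks_to_dataframe and the sample data are hypothetical. Boolean flags are stored as empty <true/> or <false/> tags, so the sketch stores the tag name for those.

```python
import pandas as pd
import xml.etree.ElementTree as ET


def tracks_to_dataframe(tracks, columns):
    """Build a DataFrame with one row per track and one column per key name."""
    records = []
    for track in tracks:
        children = list(track)
        record = {}
        for key_elem, value_elem in zip(children[0::2], children[1::2]):
            if value_elem.tag in ("true", "false"):
                # Boolean flags have no text, so keep the tag name instead.
                record[key_elem.text] = value_elem.tag
            else:
                record[key_elem.text] = value_elem.text
        records.append(record)
    # Keys missing from a track become NaN in that row.
    return pd.DataFrame(records, columns=sorted(columns))


# Made-up sample data for illustration.
SAMPLE = """<dict>
<dict><key>Name</key><string>Song A</string><key>Play Count</key><integer>12</integer><key>Compilation</key><true/></dict>
<dict><key>Name</key><string>Song B</string><key>Play Count</key><integer>3</integer></dict>
</dict>"""
sample_df = tracks_to_dataframe(
    ET.fromstring(SAMPLE).findall("dict"),
    {"Name", "Play Count", "Compilation"},
)
print(sample_df)
```

All values arrive as strings, so count columns need pd.to_numeric before they can be summed or sorted numerically.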
Since the Purchased and Apple Music DataFrames have similar contents, it is better to combine them into a single data set containing only the columns of interest.
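A minimal sketch of the combining step follows, assuming the two DataFrames are called df_purchased and df_apple_music and that the result is the df_songs frame used in the analysis further down. The list of columns of interest is my own guess, and the column names are converted to snake_case (for example play_count) to match the analysis code.

```python
import pandas as pd

# Assumed columns of interest; adjust to the keys present in your library.
COLUMNS_OF_INTEREST = ["Name", "Artist", "Album", "Genre", "Year", "Play Count", "Skip Count"]


def combine_songs(df_purchased, df_apple_music, columns=COLUMNS_OF_INTEREST):
    """Stack both DataFrames and keep only the selected columns."""
    df_songs = pd.concat([df_purchased, df_apple_music], ignore_index=True)
    df_songs = df_songs.reindex(columns=list(columns))
    # 'Play Count' -> 'play_count', and so on.
    df_songs.columns = [c.lower().replace(" ", "_") for c in df_songs.columns]
    # Counts are read from XML as strings; a missing count is treated as 0.
    for col in ("play_count", "skip_count"):
        if col in df_songs.columns:
            df_songs[col] = pd.to_numeric(df_songs[col], errors="coerce").fillna(0).astype(int)
    return df_songs


# Made-up sample frames for illustration.
df_purchased = pd.DataFrame(
    [{"Name": "Song A", "Artist": "Artist X", "Play Count": "12", "Kind": "Purchased AAC audio file"}]
)
df_apple_music = pd.DataFrame(
    [{"Name": "Song B", "Artist": "Artist Y", "Play Count": "3", "Skip Count": "1"}]
)
df_songs = combine_songs(df_purchased, df_apple_music)
print(df_songs[["name", "artist", "play_count", "skip_count"]])
```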
Analysis
The resulting data sets were loaded into a local Oracle XE database, and visual reports were created with Power BI. The SQL queries are not included.
Below are some of the analyses that can be done on the resulting data set.
Artist Word Cloud
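The word cloud is created in Python and is based on total play count by artist.
from wordcloud import WordCloud  ## word cloud generation
import matplotlib.pyplot as plt  ## plotting
## Total play count per artist
df = df_songs.groupby(['artist'])['play_count'].sum().reset_index()
## Repeat each artist name once per play so that word size reflects play count
df['desc'] = (df['artist'] + ' ') * df['play_count']
text = " ".join(item for item in df.desc)
wordcloud = WordCloud(background_color="white", collocations=False).generate(text)
plt.figure(figsize=(8, 8), facecolor=None)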
plt.imshow(wordcloud)
plt.axis("off")
plt.tight_layout(pad = 0)
plt.show()
Genre Distribution
We explore how tracks are distributed among different genres. From the image below, we can see that the majority of the songs in the library are Rock, followed by Pop.
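The chart itself was built in a reporting tool, but a similar count can be sketched directly in pandas. The snippet assumes df_songs has a genre column. The function name genre_distribution and the sample rows are made up for illustration.

```python
import pandas as pd


def genre_distribution(df_songs):
    """Count tracks per genre and express each count as a share of the library."""
    counts = df_songs["genre"].value_counts()   # sorted from most to least common
    result = counts.to_frame(name="tracks")
    result["share_pct"] = (counts / counts.sum() * 100).round(1)
    return result


# Made-up sample rows.
sample_songs = pd.DataFrame({"genre": ["Rock", "Rock", "Pop"]})
genre_table = genre_distribution(sample_songs)
print(genre_table)
```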
Top Artists by Play Count
Here we analyze favorite artists by the total play count of all their tracks in the library. My favorites are Eric Clapton & Bruce Springsteen :)
Track Distribution by Era
This shows how the tracks in my library are distributed across different eras. Most songs are from after 2010.
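One way to derive an era for each track is to bin the release year with pd.cut. This is only a sketch: it assumes df_songs has a year column, and the era boundaries and labels are my own choice rather than the ones used for the chart.

```python
import pandas as pd


def add_era(df_songs, year_col="year"):
    """Return a copy of df_songs with an 'era' column derived from the release year."""
    years = pd.to_numeric(df_songs[year_col], errors="coerce")
    # Bins are right-inclusive: (1979, 1989] is the 1980s, and so on.
    bins = [0, 1979, 1989, 1999, 2009, float("inf")]
    labels = ["Pre-1980", "1980s", "1990s", "2000s", "2010 onwards"]
    return df_songs.assign(era=pd.cut(years, bins=bins, labels=labels))


# Made-up sample years.
sample_songs = pd.DataFrame({"name": ["A", "B", "C", "D"], "year": ["1975", "1985", "2010", "2015"]})
songs_with_era = add_era(sample_songs)
print(songs_with_era["era"].value_counts())
```

Because era is a categorical column, value_counts also lists eras with zero tracks.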
Most Played & Skipped Songs
Below are some of the most played and most skipped songs in my library. The most played is “Bailando” (277), followed by “Hurt” (216), with skip counts of 6 and 3 respectively.
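A ranking like this can be sketched with DataFrame.nlargest, assuming df_songs has name, play_count and skip_count columns. The first two sample rows use the counts quoted above. The third row and the function name top_tracks are made up.

```python
import pandas as pd


def top_tracks(df_songs, by, n=10):
    """Return the n tracks with the highest value in the column named by `by`."""
    return df_songs.nlargest(n, by)[["name", "play_count", "skip_count"]]


sample_songs = pd.DataFrame(
    {
        "name": ["Hurt", "Bailando", "Some Other Song"],   # third row is made up
        "play_count": [216, 277, 40],
        "skip_count": [3, 6, 25],
    }
)
most_played = top_tracks(sample_songs, by="play_count", n=2)
most_skipped = top_tracks(sample_songs, by="skip_count", n=1)
print(most_played)
print(most_skipped)
```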
Recommended Songs
Below are some songs by my favorite artists that are missing from my library. The list is obtained from the iTunes Search API. You can fetch some of the latest songs by your favorite artists and compare them with your existing library.
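The lookup code is not included, so here is a hedged sketch of how the comparison could work. It queries the public iTunes Search API endpoint for songs matching an artist name and keeps the results whose trackName is not already in the library. The function names, the limit of 25 and the case-insensitive title match are my own assumptions.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

SEARCH_URL = "https://itunes.apple.com/search"


def build_search_url(artist, limit=25):
    """Build a song search URL for the given artist name."""
    params = {"term": artist, "media": "music", "entity": "song", "limit": limit}
    return f"{SEARCH_URL}?{urlencode(params)}"


def fetch_songs(artist, limit=25):
    """Call the API and return the list of result dictionaries (needs network access)."""
    with urlopen(build_search_url(artist, limit), timeout=10) as response:
        return json.load(response)["results"]


def missing_from_library(results, library_titles):
    """Keep results whose track name is not in the library (case-insensitive match)."""
    owned = {title.casefold() for title in library_titles}
    return [r for r in results if r.get("trackName", "").casefold() not in owned]


# Example usage (not run here because it needs network access):
# results = fetch_songs("Eric Clapton")
# for song in missing_from_library(results, df_songs["name"].dropna()):
#     print(song["trackName"], "-", song.get("collectionName"))
```

Matching on title alone is crude: live versions and remasters will show up as missing, so comparing on artist and album as well may give cleaner suggestions.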