iTunes Library analysis using Python

5 min read · Jul 2, 2019

This is an attempt to create a Python script that analyzes an iTunes library. It is intended for Python beginners, to make learning fun, especially for people who love analytics and music. I worked on this in 2018 while learning Python. The library file holds a lot of interesting information, such as Play Count, Skip Count, Play Date, Release Date and Album Name. From it we can draw insights such as the most-listened artists, the most-skipped songs and the physical location of each music file.

The library file is stored in XML format, at a location that depends on the OS.

Windows: C:\Users\<username>\Music\iTunes

If you have trouble finding the library file, you can always export it from iTunes by clicking

File → Library → Export Library

Anyone familiar with the iTunes library file will know that “Library.xml” stores its data as key-value pairs rather than as meaningful XML tags.

The tags <dict>, <key>, <string> and <integer> are ubiquitous in the library file. For example, the name of a track is encoded as a <key> tag with Name as its content, immediately followed by a <string> tag with the actual track name, Hurt, as its content.
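To make the key-value layout concrete, here is a minimal sketch that reads such pairs with ElementTree. The XML fragment is hand-written for illustration (it is not copied from a real library file), and the helper name pairs_to_dict is my own assumption.

```python
import xml.etree.ElementTree as ET

# Hand-written fragment imitating one track entry (not real library data).
SAMPLE = """
<dict>
    <key>Track ID</key><integer>1001</integer>
    <key>Name</key><string>Hurt</string>
    <key>Play Count</key><integer>216</integer>
</dict>
"""

def pairs_to_dict(track):
    """Pair each <key> element with the value element that follows it."""
    children = list(track)
    result = {}
    # Children alternate: key, value, key, value, ...
    for key_el, value_el in zip(children[0::2], children[1::2]):
        result[key_el.text] = value_el.text
    return result

track = ET.fromstring(SAMPLE)
info = pairs_to_dict(track)
print(info["Name"])
```

Every value comes back as text, so numeric fields such as Play Count still need converting before any arithmetic.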

Python Script

Before writing the script, it is important to understand the structure of the library file, which is shown in the image below. NB: Indentation of the Python code may have to be corrected manually. All track information is encoded under:

<key>Tracks</key>
<dict>

Fig 2: A snapshot of collapsed Library XML

Importing Libraries

import pandas as pd
import xml.etree.ElementTree as ET ## XML parsing

Reading Library file

lib = r'C:\Users\<username>\Music\iTunes\Library.xml'

Parsing the library file

All the important attributes of a track sit under the <dict> tag shown in Fig 2 above. First, fetch the top-level <dict> elements and store them in the main_dict variable. All track information is encoded under the first nested <dict> of that element, which is stored in tracks_dict below. The list of all track elements is then stored in the tracklist variable.

tree = ET.parse(lib)
root = tree.getroot()
main_dict=root.findall('dict')
for item in list(main_dict[0]):    
    if item.tag=="dict":
        tracks_dict=item
        break
tracklist=list(tracks_dict.findall('dict'))

From the tracklist obtained above, we split the tracks into podcasts and music:

podcast :- list of XML elements under podcast
purchased:- list of XML elements purchased from iTunes
apple_music:- list of XML elements added to library through Apple Music subscription
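The splitting code itself is not reproduced here, so the following is only a sketch of one way to do it. It assumes that podcasts carry a Genre value of "Podcast" and that the Kind value starts with "Purchased" or "Apple Music" for the other two groups; check these against your own library file. The helper names track_value and split_tracks are hypothetical.

```python
import xml.etree.ElementTree as ET

def track_value(track, key_name):
    """Return the text of the element following <key>key_name</key>, or None."""
    children = list(track)
    for key_el, value_el in zip(children[0::2], children[1::2]):
        if key_el.text == key_name:
            return value_el.text
    return None

def split_tracks(tracklist):
    """Split track elements into podcast, purchased and Apple Music lists."""
    podcast, purchased, apple_music = [], [], []
    for track in tracklist:
        kind = track_value(track, "Kind") or ""
        # Assumed criteria: Genre for podcasts, Kind prefix for the rest
        if track_value(track, "Genre") == "Podcast":
            podcast.append(track)
        elif kind.startswith("Purchased"):
            purchased.append(track)
        elif kind.startswith("Apple Music"):
            apple_music.append(track)
    return podcast, purchased, apple_music

# Tiny hand-written stand-in for the real tracklist
demo = ET.fromstring("""
<dict>
  <dict><key>Name</key><string>Episode 1</string><key>Genre</key><string>Podcast</string></dict>
  <dict><key>Name</key><string>Song A</string><key>Kind</key><string>Purchased AAC audio file</string></dict>
  <dict><key>Name</key><string>Song B</string><key>Kind</key><string>Apple Music AAC audio file</string></dict>
</dict>
""")
podcast, purchased, apple_music = split_tracks(demo.findall("dict"))
```

Tracks that match none of the three criteria (for example, files imported from CDs) are simply left out of all three lists.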

Fig 4: What the apple_music variable looks like

Print the number of tracks of each kind: podcast, purchased and Apple Music

print("Number of tracks under Podcast:", len(podcast))
print("Number of tracks Purchased:", len(purchased))
print("Number of tracks added through Apple Music subscription:", len(apple_music))

The next function fetches all column names from the three kinds of tracks: podcast, purchased and Apple Music. It accepts a list of XML elements as its parameter and returns the column names as a set, which is later used to create the final DataFrame.

Column names can be obtained from the content of the <key> tags (please refer to Fig 1: A snapshot of Library.xml).
podcast_cols :- set of all column names from podcast
purchased_cols:- set of all column names from purchased Music
apple_music_cols:- set of all column names from Subscription
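A minimal sketch of such a column-collecting function, assuming each track is an ElementTree <dict> element whose <key> children hold the column names. The function name get_columns and the demo data are my own, not taken from the original script.

```python
import xml.etree.ElementTree as ET

def get_columns(tracks):
    """Collect the distinct <key> names found across a list of track elements."""
    columns = set()
    for track in tracks:
        for child in track:
            if child.tag == "key":
                columns.add(child.text)
    return columns

# Hand-written demo: the second track has no Play Count key
demo = ET.fromstring("""
<dict>
  <dict><key>Name</key><string>Song A</string><key>Artist</key><string>Artist A</string><key>Play Count</key><integer>4</integer></dict>
  <dict><key>Name</key><string>Song B</string><key>Artist</key><string>Artist B</string></dict>
</dict>
""")
demo_cols = get_columns(demo.findall("dict"))
print(sorted(demo_cols))
```

Scanning every track matters because not all tracks carry the same keys; a never-played song, for instance, may have no Play Count entry at all.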

The next function creates the final DataFrame. It accepts one list of XML elements and a column set as parameters and returns a DataFrame.
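One possible shape for this function is sketched below. It assumes the same alternating key/value layout inside each track element; the name tracks_to_dataframe and the demo data are hypothetical.

```python
import xml.etree.ElementTree as ET
import pandas as pd

def tracks_to_dataframe(tracks, columns):
    """Build one row per track; keys absent from a track become missing values."""
    rows = []
    for track in tracks:
        children = list(track)
        # Pair every <key> with the value element that follows it
        values = {k.text: v.text for k, v in zip(children[0::2], children[1::2])}
        rows.append({col: values.get(col) for col in columns})
    # Sort the column names so the column order is reproducible
    return pd.DataFrame(rows, columns=sorted(columns))

# Hand-written demo: the second track has no Play Count key
demo = ET.fromstring("""
<dict>
  <dict><key>Name</key><string>Song A</string><key>Artist</key><string>Artist A</string><key>Play Count</key><integer>4</integer></dict>
  <dict><key>Name</key><string>Song B</string><key>Artist</key><string>Artist B</string></dict>
</dict>
""")
df_demo = tracks_to_dataframe(demo.findall("dict"), {"Name", "Artist", "Play Count"})
print(df_demo)
```

All values arrive as strings, so count columns can be converted afterwards with pd.to_numeric.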

Since the Purchased and Apple Music DataFrames have similar contents, it is better to combine both data sets, keeping only the columns of interest.
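A sketch of the combining step using pd.concat. The list of columns of interest is a hypothetical choice, and the renaming to snake_case is an assumption inferred from the column names (artist, play_count) used in the word cloud code later in this article.

```python
import pandas as pd

# Hypothetical columns of interest; adjust to the keys present in your library
COLS = ["Name", "Artist", "Genre", "Play Count"]

def combine(df_purchased, df_apple_music):
    """Stack both DataFrames, keeping only COLS plus a source marker."""
    # reindex keeps the requested columns and fills absent ones with NaN
    parts = [
        df_purchased.reindex(columns=COLS).assign(source="purchased"),
        df_apple_music.reindex(columns=COLS).assign(source="apple_music"),
    ]
    combined = pd.concat(parts, ignore_index=True)
    # "Play Count" -> "play_count", and so on
    combined = combined.rename(columns=lambda c: c.lower().replace(" ", "_"))
    # XML values are text; convert the count so it can be summed later
    combined["play_count"] = pd.to_numeric(combined["play_count"]).fillna(0).astype(int)
    return combined

# Hand-made demo frames, each with one extra column that gets dropped
demo_purchased = pd.DataFrame({"Name": ["Song A"], "Artist": ["Artist A"],
                               "Genre": ["Rock"], "Play Count": ["5"], "Size": ["100"]})
demo_apple = pd.DataFrame({"Name": ["Song B"], "Artist": ["Artist B"],
                           "Genre": ["Pop"], "Play Count": [None], "Bit Rate": ["256"]})
df_demo = combine(demo_purchased, demo_apple)
print(df_demo)
```

The source column is optional, but it keeps the origin of each row visible after the two sets are merged.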

Analysis

The resulting data sets were loaded into a local Oracle XE database, and visual reports were created with Power BI; the SQL queries are not included.
Below are some of the analyses that can be done on the resulting data set.

Artist Word Cloud

This one is created using Python. The word cloud is based on total play count by artist.

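import matplotlib.pyplot as plt
from wordcloud import WordCloud  # third-party package: pip install wordcloud

# Total play count per artist
df = df_songs.groupby(['artist'])['play_count'].sum().reset_index()
# Repeat each artist name once per play so word frequency reflects play count
df['desc'] = (df['artist'] + ' ').str.repeat(df['play_count'].astype(int))
text = " ".join(item for item in df.desc)
wordcloud = WordCloud(background_color="white", collocations=False).generate(text)
plt.figure(figsize=(8, 8), facecolor=None)
plt.imshow(wordcloud)
plt.axis("off")
plt.tight_layout(pad=0)
plt.show()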

Genre Distribution

Here we explore how tracks are distributed among genres. From the image below, we can see that the majority of the songs in the library are Rock, followed by Pop.
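The chart in this article was built in Power BI, but the same counts can be sketched in pandas. This assumes a DataFrame with a genre column; the demo rows are made up for illustration.

```python
import pandas as pd

# Made-up rows standing in for the real track data
df_demo = pd.DataFrame({"genre": ["Rock", "Pop", "Rock", "Jazz", "Rock", "Pop"]})

# Number of tracks per genre, largest first
genre_counts = df_demo["genre"].value_counts()
# The same figures as a percentage of the library
genre_share = (genre_counts / genre_counts.sum() * 100).round(1)
print(genre_share)
```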

Top Artists by Play Count

Here we analyze favorite artists by the total play count of all their tracks in the library. My favorites are Eric Clapton & Bruce Springsteen :)

Track Distribution by Era

This shows how the tracks in my library are distributed among different eras. Most songs are from after 2010.
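One way to group tracks into eras in pandas is pd.cut. The sketch below assumes a numeric year column; the bin edges, the labels and the demo rows are my own choices, not the ones used for the chart.

```python
import pandas as pd

# Made-up rows standing in for the real track data
df_demo = pd.DataFrame({
    "name": ["Song A", "Song B", "Song C", "Song D", "Song E"],
    "year": [1975, 1992, 2004, 2013, 2018],
})

# Each bin includes its right edge: (0, 1979], (1979, 1989], ...
bins = [0, 1979, 1989, 1999, 2009, 9999]
labels = ["Pre-1980", "1980s", "1990s", "2000s", "2010 onwards"]
df_demo["era"] = pd.cut(df_demo["year"], bins=bins, labels=labels)

# Count tracks per era, listed in chronological order
era_counts = df_demo["era"].value_counts().sort_index()
print(era_counts)
```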

Most Played & Skipped Songs

Below are some of the most played and most skipped songs in my library. The most played is “Bailando” (277 plays), followed by “Hurt” (216 plays), with skip counts of 6 and 3 respectively.
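A pandas sketch of the same ranking, assuming numeric play_count and skip_count columns. The rows are made up for illustration and are not taken from my library.

```python
import pandas as pd

# Made-up rows standing in for the real track data
df_demo = pd.DataFrame({
    "name": ["Song A", "Song B", "Song C", "Song D"],
    "play_count": [12, 40, 7, 25],
    "skip_count": [3, 1, 9, 0],
})

# Top 2 by plays, shown together with their skip counts
most_played = df_demo.nlargest(2, "play_count")[["name", "play_count", "skip_count"]]
# Top 2 by skips
most_skipped = df_demo.nlargest(2, "skip_count")[["name", "skip_count"]]
print(most_played)
print(most_skipped)
```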

Recommended Songs

Below are some songs by my favorite artists that are missing from my library. The list is obtained from the iTunes Search API. You can fetch some of the latest songs by your favorite artists and compare them with your existing library.
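A rough sketch of how such a comparison could work, using only the standard library. It assumes the public search endpoint at itunes.apple.com/search with the term, media, entity and limit parameters, and that each result carries a trackName field. The function names are hypothetical, and fetch_songs needs network access.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

SEARCH_URL = "https://itunes.apple.com/search"

def build_search_url(artist, limit=25):
    """Build a song search URL for one artist."""
    params = {"term": artist, "media": "music", "entity": "song", "limit": limit}
    return SEARCH_URL + "?" + urlencode(params)

def fetch_songs(artist, limit=25):
    """Query the API (needs network access) and return the list under "results"."""
    with urlopen(build_search_url(artist, limit)) as response:
        return json.load(response)["results"]

def missing_from_library(results, library_titles):
    """Return result titles that are absent from the library, ignoring case."""
    owned = {title.casefold() for title in library_titles}
    missing = []
    for item in results:
        title = item.get("trackName")
        if title and title.casefold() not in owned and title not in missing:
            missing.append(title)
    return missing

print(build_search_url("Eric Clapton", limit=5))
```

A call such as missing_from_library(fetch_songs("Eric Clapton"), my_titles), where my_titles is the list of track names already in the library, would then give the candidate recommendations. Matching on title alone is crude, since live versions and remasters count as different songs.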