Spotify: Library Data Scraped by ‘Copyright Extremists’

by Priyanka Patel

A massive trove of Spotify data, comprising 256 million rows of track metadata and 86 million audio files, has surfaced on file-sharing networks, a development that’s less a holiday gift and more a headache for the streaming giant.

Metadata Dump Raises AI Training Concerns

The exposed data could potentially be used to train artificial intelligence models, posing a risk to music licensing efforts.

  • A ‘pirate activist group’ accessed and released the Spotify data via Anna’s Archive, a platform known for ‘shadow libraries.’
  • Spotify confirmed an unauthorized scrape of public metadata and illicit access to some audio files.
  • While building a Spotify clone is technically possible, the bigger concern is the potential for AI firms to leverage the data without proper licensing.
  • The incident doesn’t appear to impact user security, as the data was obtained through public APIs and circumvention of DRM.

The release, reported by Billboard, includes what’s being called the “world’s first ‘preservation archive’ for music,” encompassing approximately 99.6% of all Spotify listens. A blog post on Anna’s Archive claimed, “We backed up Spotify (metadata and music files). It’s distributed in bulk torrents (~300TB), grouped by popularity.” The archive boasts 256 million tracks and 186 million unique ISRCs.

Could someone simply recreate Spotify with this data? Technically, yes. But the music industry is notoriously protective of its copyrights. Any attempt to launch a competing service would likely face swift legal action, mirroring the recent case where major labels sued and settled with The Internet Archive over a preservation archive of older recordings.

Did you know? – An ISRC (International Standard Recording Code) is a unique identifier for each recording. It helps track and manage royalties for artists and rights holders.

Spotify acknowledged the incident, telling Music Ally in a statement that an investigation revealed a third party scraped public metadata and used “illicit tactics to circumvent DRM” to access audio files. The company characterized the actors as “anti-copyright extremists who’ve previously pirated content from YouTube and other platforms.”

The company emphasized that this wasn’t a typical security breach affecting users. Investigators believe the activists primarily used Spotify’s public web API to gather the metadata. However, the parallel drawn by Spotify to similar data releases from YouTube is noteworthy. Datasets from YouTube have reportedly been used to train unlicensed generative AI music services.

Pro tip – APIs (Application Programming Interfaces) allow different software systems to communicate. Public APIs are generally accessible, but scraping data from them can violate terms of service.

That’s where the real worry lies for the music industry. While a free Spotify clone is unlikely to flourish, the availability of this massive dataset could complicate licensing negotiations.

Reader question – Do you think data preservation efforts should be prioritized even if they potentially conflict with copyright laws? What are your thoughts?
