Patreon seem to find it acceptable to apply a whole bunch of security-through-obscurity measures to their site that verge on DRM. This is a symptom of the times, really. They don't provide an API for users, although they do provide one for creators. Let's try to scrape MP3s.
<a data-tag="post-file-download"
href="https://c10.patreonusercontent.com/LONGSTRING" class="sc-fzpans kmqqXw">Myfile.mp3</a>
This is what the link for a download looks like. We are after the URL in the href attribute. The query string parameters token-hash and token-time are included in the URL, which means it's actually possible to download these files without sending any cookies: the authentication details are already encoded in the URL itself.
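You can poke at one of these links from the console to confirm the token parameters for yourself. A quick sketch, assuming at least one download anchor is currently rendered on the page:
const link = document.querySelector('[data-tag="post-file-download"]');
const url = new URL(link.href);
// Both parameters ride along in the query string, so no cookie is needed.
console.log(url.searchParams.get('token-hash'));
console.log(url.searchParams.get('token-time'));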
So that means the problem reduces to the following:
- Expanding the infinite scroll of the site
- Finding all download links and correlating them with some metadata
Expand the infinite scroll
Start out at a URL with a tag filter that matches the posts you want.
https://www.patreon.com/somecreator/posts?filters%5Btag%5D=Patron%20Only%20Audio
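If you want to build that URL yourself, the filter is just a URL-encoded query parameter. A small sketch, where the creator name and tag are placeholders:
const creator = 'somecreator';  // placeholder
const tag = 'Patron Only Audio';
// encodeURIComponent('filters[tag]') === 'filters%5Btag%5D'
const url = 'https://www.patreon.com/' + creator + '/posts?filters%5Btag%5D=' + encodeURIComponent(tag);
console.log(url);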
We look for the 'Load more' button and expand it. Make sure you actually physically scroll down the page until the first instance of this button appears. Filtering on el.children.length === 0 lets us pick the deepest element in the hierarchy before going upwards to its parent. This prevents us from accidentally picking an outer container element which would also transitively contain the target text, 'Load more'.
function findLoadMore() {
  // Leaf divs whose text is exactly 'Load more'; children.length === 0 excludes outer containers.
  const buttonMatches = Array.from(document.querySelectorAll('div'))
    .filter(el => el.textContent === 'Load more' && el.children.length === 0);
  if (buttonMatches.length === 0) {
    return undefined;
  }
  if (buttonMatches.length > 1) {
    throw new Error("ambiguous");
  }
  const theMatch = buttonMatches[0];
  // The clickable element is the button wrapping the matched div.
  const theButton = theMatch.parentNode;
  if (theButton.tagName !== 'BUTTON') {
    throw new Error("unexpected");
  }
  return theButton;
}

function expandScroll() {
  const button = findLoadMore();
  if (button === undefined) {
    console.log("terminating expansion");
  } else {
    console.log("click");
    button.click();
    // Give the (slow) page 30 seconds to load the next batch before clicking again.
    window.setTimeout(expandScroll, 30000);
  }
}
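Paste both functions into the console and kick things off with a single call; it stops by itself once no 'Load more' button is left:
expandScroll();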
The site is so ungodly slow that this can hang Chrome quite easily. I suggest either having an unrealistic amount of patience, or pausing script execution in the "Sources" tab in the inspector after a while.
Scrape the URLs
Scraping the URLs is a bit more interesting. We need two pieces of information; the first is the post publication date:
<a data-tag="post-published-at" href="/posts/some-post-12345" class="sc-fzpans bwkMGo">May 16 at 9:28am</a>
There may be more information in the show title, but I don't think we can guarantee uniqueness from that alone. The second piece of information is the download URL.
From some experimentation in the inspector, I can tell that the common ancestor
of these elements is the following: <div data-tag="post-card">...</div>
This at least is a bit semantic.
Let's scrape:
// Collect one record per post card that actually has a file download.
var postNodes = document.querySelectorAll('[data-tag="post-card"]');
var data = Array.from(postNodes).reduce((result, el) => {
  const publishedAt = el.querySelector('[data-tag="post-published-at"]');
  const postFileDownload = el.querySelector('[data-tag="post-file-download"]');
  if (postFileDownload !== null) {
    result.push({
      publishedAt: publishedAt.textContent,
      href: postFileDownload.getAttribute('href')
    });
  }
  return result;
}, []);
We have to handle the case where a post card has no download link at all, which can happen with mis-tagged items.
Now you can console.log(JSON.stringify(data)), paste the output into a text editor, and save it as exported.json. Using that JSON data, the following Python script downloads the files into timestamped filenames:
import json
import re
import unicodedata

import requests

with open('exported.json', 'r') as f:
    data = json.load(f)

def slugify(value):
    # Normalise to ASCII and reduce to a filesystem-friendly slug.
    value = unicodedata.normalize('NFKD', value).encode('ascii', 'ignore').decode('ascii')
    value = re.sub(r'[^:\w\s-]', '', value).strip().lower()
    return re.sub(r'[-\s]+', '-', value)

for rec in data:
    timestamp = rec['publishedAt']
    href = rec['href']
    filename = "{}.mp3".format(slugify(timestamp))
    print(filename)
    # Stream the response so large MP3s are not held entirely in memory.
    r = requests.get(href, stream=True, timeout=30)
    r.raise_for_status()
    with open(filename, 'wb') as fd:
        for chunk in r.iter_content(chunk_size=128):
            fd.write(chunk)
Update 2021-10-13: Confirmed that this method is still working.