Patreon seem to find it acceptable to apply a whole bunch of security-through-obscurity measures to their site that verge on DRM. This is a symptom of the times, really. They don't provide an API for users, although they do provide one for creators. Let's try to scrape MP3s.
<a data-tag="post-file-download"
href="https://c10.patreonusercontent.com/LONGSTRING" class="sc-fzpans kmqqXw">Myfile.mp3</a>
This is what the link for a download looks like. We are after the URL in the href attribute. The query string parameters token-hash and token-time are included in the URL, which means it's actually possible to download these files without sending any cookies: the authentication details are already encoded in the URL itself.
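You can poke at one of these links from the console to confirm the token parameters for yourself. A quick sketch, assuming at least one download anchor is currently rendered on the page:
const link = document.querySelector('[data-tag="post-file-download"]');
const url = new URL(link.href);
// Both parameters ride along in the query string, so no cookie is needed.
console.log(url.searchParams.get('token-hash'));
console.log(url.searchParams.get('token-time'));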
So that means the problem reduces to the following:
- Expanding the infinite scroll of the site
- Finding all download links and correlating them with some metadata
Expand the infinite scroll
Start out at a URL with a tag filter that matches the posts you want.
https://www.patreon.com/somecreator/posts?filters%5Btag%5D=Patron%20Only%20Audio
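If you want to build that URL yourself, the filter is just a URL-encoded query parameter. A small sketch, where the creator name and tag are placeholders:
const creator = 'somecreator';  // placeholder
const tag = 'Patron Only Audio';
// encodeURIComponent('filters[tag]') === 'filters%5Btag%5D'
const url = 'https://www.patreon.com/' + creator + '/posts?filters%5Btag%5D=' + encodeURIComponent(tag);
console.log(url);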
We look for the 'Load more' button and expand it. Make sure you actually physically scroll down the page until the first instance of this button appears. Filtering on el.children.length === 0 lets us pick the deepest element in the hierarchy before going upwards to its parent. This prevents us from accidentally picking an outer container element which would also transitively contain the target text, 'Load more'.
function findLoadMore() {
  // Leaf divs whose text is exactly 'Load more'; children.length === 0 excludes outer containers.
  const buttonMatches = Array.from(document.querySelectorAll('div'))
    .filter(el => el.textContent === 'Load more' && el.children.length === 0);
  if (buttonMatches.length === 0) {
    return undefined;
  }
  if (buttonMatches.length > 1) {
    throw new Error("ambiguous");
  }
  const theMatch = buttonMatches[0];
  // The clickable element is the button wrapping the matched div.
  const theButton = theMatch.parentNode;
  if (theButton.tagName !== 'BUTTON') {
    throw new Error("unexpected");
  }
  return theButton;
}

function expandScroll() {
  const button = findLoadMore();
  if (button === undefined) {
    console.log("terminating expansion");
  } else {
    console.log("click");
    button.click();
    // Give the (slow) page 30 seconds to load the next batch before clicking again.
    window.setTimeout(expandScroll, 30000);
  }
}
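Paste both functions into the console and kick things off with a single call; it stops by itself once no 'Load more' button is left:
expandScroll();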
The site is so ungodly slow that this can hang Chrome quite easily. I suggest either having an unrealistic amount of patience, or pausing script execution in the "Sources" tab in the inspector after a while.
Scrape the URLs
Scraping the URLs is a bit more interesting. We need two pieces of information; the first is the post publication date:
<a data-tag="post-published-at" href="/posts/some-post-12345" class="sc-fzpans bwkMGo">May 16 at 9:28am</a>
There may be more information in the show title, but I don't think we can guarantee uniqueness from that alone. The second piece of information is the download URL.
From some experimentation in the inspector, I can tell that the common ancestor
of these elements is the following: <div data-tag="post-card">...</div>
This at least is a bit semantic.
Let's scrape:
// Collect one record per post card that actually has a file download.
var postNodes = document.querySelectorAll('[data-tag="post-card"]');
var data = Array.from(postNodes).reduce((result, el) => {
  const publishedAt = el.querySelector('[data-tag="post-published-at"]');
  const postFileDownload = el.querySelector('[data-tag="post-file-download"]');
  if (postFileDownload !== null) {
    result.push({
      publishedAt: publishedAt.textContent,
      href: postFileDownload.getAttribute('href')
    });
  }
  return result;
}, []);
We have to handle the case where a post card has no download link at all, which can happen with mis-tagged items.
Now you can console.log(JSON.stringify(data)), paste the output into a text editor, and save it as exported.json. Using that JSON data, the following Python script downloads the files into timestamped filenames:
import json
import re
import unicodedata

import requests

with open('exported.json', 'r') as f:
    data = json.load(f)

def slugify(value):
    # Normalise to ASCII and reduce to a filesystem-friendly slug.
    value = unicodedata.normalize('NFKD', value).encode('ascii', 'ignore').decode('ascii')
    value = re.sub(r'[^:\w\s-]', '', value).strip().lower()
    return re.sub(r'[-\s]+', '-', value)

for rec in data:
    timestamp = rec['publishedAt']
    href = rec['href']
    filename = "{}.mp3".format(slugify(timestamp))
    print(filename)
    # Stream the response so large MP3s are not held entirely in memory.
    r = requests.get(href, stream=True, timeout=30)
    r.raise_for_status()
    with open(filename, 'wb') as fd:
        for chunk in r.iter_content(chunk_size=128):
            fd.write(chunk)
Update 2021-10-13: Confirmed that this method is still working.