Extract public posts from Facebook page without AP

2019-08-30 17:57发布

问题:

Just to clarify I don't have a Facebook account and I have no intent of creating one. Also, what I'm trying to achieve is perfectly legal in my country.

Instead of using the Facebook API to get the latest timeline posts of a Facebook page, I want to send a get request directly to the page URL (e.g. this page) and extract the posts from the HTML source code.
(I'd like to get the text and the creation time of the post.)

When I run this in the web console:

document.getElementsByClassName('userContent')

I get a list of elements containing the text of the latest posts.

But I'd like to extract that information from a nodejs script. I could probably do it quite easily using a headless browser like puppeteer or the like, but that would create a ton of unnecessary overhead. I'd really like to a simple approach like downloading the HTML code, passing it to cheerio and use cheeriio's jQuery-like API to extract the posts.

Here is my attempt of trying exactly that:

// npm i request cheerio request-promise-native
const rp = require('request-promise-native'); // requires installation of `request`
const cheerio = require('cheerio');

rp.get('https://www.facebook.com/pg/officialstackoverflow/posts/').then( postsHtml => {
    const $ = cheerio.load(postsHtml);

    const timeLinePostEls = $('.userContent');
    console.log(timeLinePostEls.html()); // should NOT be null
    const newestPostEl = timeLinePostEls.get(0);
    console.log(newestPostEl.html()); // should NOT be null
    const newestPostText = newestPostEl.text();
    console.log(newestPostText);
    //const newestPostTime = newestPostEl.parent(??).child('.livetimestamp').title;
    //console.log(newestPostTime);
}).catch(console.error);

unfortunately $('.userContent') does not work. However, I was able to verify that the data I'm looking for is embedded somewhere in that HTML code.

But I couldn't really come up with a with a good regex approach or the like to extract that data.

Depending on the post content the number of HTML tags within the post varies heavily.

Here is a simple example of a post containing one link:

<div class="_5pbx userContent _3576" data-ft="&#123;&quot;tn&quot;:&quot;K&quot;&#125;"><p>We&#039;re proud to be named one of Built In NYC&#039;s Best Places to Work in 2019, ranking in the top 10 for Best Midsize Places to Work and top 3 (!) for Best Perks and Benefits. See what it took to make the list and check out our profile to see some of our job openings. <a href="https://l.facebook.com/l.php?u=https%3A%2F%2Fbit.ly%2F2H3Kbr2&amp;h=AT29h2HyDsEk0rHRWqJA-Fa4M1qi3nJT1NBi95othaR3qeFuFAMNiVS2Dgtv5KR5m0xqjw6kfwZdhZt0_D3UQT1Oel2UhxRql-KwkA1xqWvrql4u1jDhzrkGVT_XxoUd8_w8_fzLZzzhz23a8yPCK6IPfWKB76_CEFjG3b78y4dFJvY9Z08AYlR01dmi5_FvWVEVytkN-123u6alYE8pqL6Jb6dtIQUTWGXYJPaNMrtxkCUZniEVXEcILkwHGSuHqCTAarboyMP55F1vhYO3OAiVMkvjbN274fVq92YvbK3bi90bU9T-5ADWHDUJ-CwcofSBTW47chstQeY0n_UluD_rBIPLsfXVSnCtpRkR2kXi9zzHLnNeIYeNssv3i7UKS_f5Z2pnVT6xe3zJbNpB68doH1Z__I9nsTCNIyFyKx2VxabecoL03DIawbRrzBoxLAwzNPLACBjTkpEQhdVn4_wdAIjXRL4cLQDcZkLEoG_sspBgRePH23TFbNufQOBly-FNtLHnkUDO2Ca-FYvAGXpcu6J4B1aH3XFPB803lsz-GRdACyOFOgXDXJfwr4WtWzUHxfiOPULWiI43yI5L4aU6wYRhPjxua3RuRZ8oj9fXa1w4Jrht94Ue2wfKtz8" target="_blank" data-ft="&#123;&quot;tn&quot;:&quot;-U&quot;&#125;" rel="noopener nofollow" data-lynx-mode="async">http://*******/2H3Kbr2</a></p></div>

Formatted in a more readable form it looks somewhat like this:

<div class="_5pbx userContent _3576" data-ft="&#123;&quot;tn&quot;:&quot;K&quot;&#125;">
    <p>
        We&#039;re proud to be named one of Built In NYC&#039;s Best Places to Work in 
        2019, ranking in the top 10 for Best Midsize Places to Work and top 3 (!) for 
        Best Perks and Benefits. See what it took to make the list and check out our 
        profile to see some of our job openings.
        <a href="VERY_LONG_URL.........." target="_blank" data-ft="&#123;&quot;tn&quot;:&quot;-U&quot;&#125;" rel="noopener nofollow" data-lynx-mode="async">SHORT_LINK.....</a>
    </p>
</div>

This regex seems to work okay, but I don't think it is very reliable:

/<div class="[^"]+ userContent [^"]+" data-ft="[^"]+">(.+?)<\/div>/g

If for example the post contained another div-element then it wouldn't work properly. In addition to that I have no way of knowing the time/date the post was created using this approach?

Any ideas how I could relatively reliably extract the most recent 2-3 posts including the creation date/time?

回答1:

Okay, I finally figured it out. I hope this will be useful to others. This function will extract the 20 latest posts, including the creation time:

// npm i request cheerio request-promise-native
const rp = require('request-promise-native'); // requires installation of `request`
const cheerio = require('cheerio');

function GetFbPosts(pageUrl) {
    const requestOptions = {
        url: pageUrl,
        headers: {
            'User-Agent': 'Mozilla/5.0 (X11; Fedora; Linux x86_64; rv:64.0) Gecko/20100101 Firefox/64.0'
        }
    };
    return rp.get(requestOptions).then( postsHtml => {
        const $ = cheerio.load(postsHtml);
        const timeLinePostEls = $('.userContent').map((i,el)=>$(el)).get();
        const posts = timeLinePostEls.map(post=>{
            return {
                message: post.html(),
                created_time: post.parents('.userContentWrapper').find('.timestampContent').html()
            }
        });
        return posts;
    });
}
GetFbPosts('https://www.facebook.com/pg/officialstackoverflow/posts/').then(posts=>{
    // Log all posts
    for (const post of posts) {
        console.log(post.created_at, post.message);
    }
});

Since Facebook messages can have complicated formatting the message is not plain text, but HTML. But you could remove the formatting and just get the text by replacing message: post.html() with message: post.text().