Scraping Hacker News on a Schedule with TaskPipes

Background

TaskPipes is a tool to turn any data into a spreadsheet. But why?

Well, spreadsheets are flexible. They’re accessible to the full spectrum of users, from completely non-technical to Turing Award winners. Tabular data is just really easy to manipulate, modify and get into the format you need.

An Example

If you’d like to follow along with this, please go to taskpipes.com/examples and make a copy of the Hacker News Pipe.

Let’s use TaskPipes to scrape the front page of Hacker News every day, pull out any stories from GitHub, and email me the results.

Although TaskPipes can pull in data from a range of sources, here we want the “External Link” option. Let’s set this to the Hacker News front page, news.ycombinator.com.

TaskPipes extracts any tables that are present in the HTML and, if you view the source on news.ycombinator.com, you’ll notice that there are three columns. We only want the third column, so let’s remove the first two.
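To make the idea concrete, here is a minimal Python sketch of the same step: read the table cells out of some HTML and keep only the third column. This is not how TaskPipes works internally. The `TableRows` class, the `third_column` helper and the sample markup are all made up for illustration, and the real Hacker News page is more involved than this.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of each <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            # Collapse runs of whitespace inside the cell.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


# Hypothetical, simplified markup standing in for the real page.
SAMPLE_HTML = """
<table>
  <tr><td>1.</td><td>^</td><td>Show HN: Foo (github.com)</td></tr>
  <tr><td>2.</td><td>^</td><td>Why spreadsheets win (example.com)</td></tr>
</table>
"""


def third_column(html):
    """Parse the HTML and drop the first two columns of every row."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return [row[2:] for row in parser.rows if len(row) >= 3]


titles = third_column(SAMPLE_HTML)
```

Each entry in `titles` is a one-cell row holding only the third column of the sample table.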


Next, we want to extract the number of points of each submission. We use the “Extract Text” functionality to get the text between the start position and the first occurrence of the word “points”.


No more regex!
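If you wanted to do the same thing by hand, the idea looks roughly like the Python sketch below. It uses plain string searching instead of a regular expression. The `extract_between` helper and the sample cell text are assumptions for illustration, not part of TaskPipes.

```python
def extract_between(text, start, end):
    """Return the text between `start` and the first `end` that follows it.

    An empty `start` means "from the beginning of the text".
    Returns None when either marker cannot be found.
    """
    begin = 0
    if start:
        pos = text.find(start)
        if pos == -1:
            return None
        begin = pos + len(start)
    stop = text.find(end, begin)
    if stop == -1:
        return None
    return text[begin:stop].strip()


# Hypothetical cell text, loosely modelled on a Hacker News subtext line.
cell = "142 points by someuser 3 hours ago | 57 comments"

points = int(extract_between(cell, "", "points"))
comments = int(extract_between(cell, "|", "comments"))
```

The same helper covers both cases here: the points sit between the start of the text and “points”, and the comment count sits between the “|” separator and “comments”.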

We do a similar thing to pull out the headline, domain and the number of comments, so that each story ends up as a row with columns for the points, headline, domain and number of comments.

Now, let’s apply a filter to extract only the stories from github.com, and our pipe is set up.
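As a rough equivalent in Python, the filter amounts to keeping only the rows whose domain matches. The row layout, the sample stories and the `from_domain` helper below are hypothetical. Treating subdomains such as gist.github.com as a match is also an assumption of this sketch.

```python
# Hypothetical rows in the shape produced by the earlier steps.
stories = [
    {"headline": "Show HN: Foo", "domain": "github.com", "points": 142, "comments": 57},
    {"headline": "Why spreadsheets win", "domain": "example.com", "points": 88, "comments": 12},
    {"headline": "A handy snippet", "domain": "gist.github.com", "points": 31, "comments": 4},
    {"headline": "Lookalike site", "domain": "notgithub.com", "points": 5, "comments": 0},
]


def from_domain(rows, domain):
    """Keep rows whose domain is `domain` itself or one of its subdomains."""
    return [
        row
        for row in rows
        if row["domain"] == domain or row["domain"].endswith("." + domain)
    ]


github_stories = from_domain(stories, "github.com")
```

Checking for the `"." + domain` suffix rather than a bare substring keeps lookalike domains such as notgithub.com out of the results.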

We can set this process to run on a schedule, and we will be emailed a CSV file with the results.
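For a sense of what ends up in that CSV file, here is a small Python sketch that serialises rows with the standard `csv` module. The column names, the sample row and the `to_csv` helper are assumptions for illustration. Scheduling and emailing are left out, since TaskPipes handles those for you.

```python
import csv
import io

# Hypothetical column names for the cleaned-up data.
FIELDS = ["headline", "domain", "points", "comments"]


def to_csv(rows):
    """Serialise a list of row dicts to CSV text, header line first."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical filtered result.
rows = [
    {"headline": "Show HN: Foo", "domain": "github.com", "points": 142, "comments": 57},
]

csv_text = to_csv(rows)
```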


Alternatively, you can send this data to an external API, a database, a Google spreadsheet, or elsewhere.

Wrapping Up

You can try the example above by visiting taskpipes.com/examples.

TaskPipes can pull data from almost anywhere, including web pages, the body of emails or even email attachments.

You can then clean and manipulate the data, and send it to a range of different destinations.

Sign up for a free TaskPipes account at taskpipes.com
