
Written by Lance Martin

In 2017, the ReachForce development team set a goal to go serverless. That meant we needed a new solution for scraping webpages and retrieving form HTML to help our customers self-configure our SmartForms product. After some brainstorming, we decided that our technology of choice would be a Node-based AWS lambda paired with phantomJS running as a child process. We hit a few obstacles along the way, but we took away some insightful lessons that we’re excited to share. This post will walk you through how to build a serverless web scraper.

Getting the Proper Version of Phantom

The first step of putting phantomJS onto an AWS lambda is to download a version compatible with the lambda environment. Since lambda runs on a 64-bit version of Linux, you will need the corresponding build of phantomJS. In our case, we used phantomjs-2.1.1-linux-x86_64.

Requiring the Necessary Modules

For running phantomJS in Node, you will need just a few of Node’s built-in modules. You’ll need the path module in order to provide the proper path to your phantomJS bin, the spawn function from the child_process module to run phantom as a child process, and the file system module so that data from phantom can be written to a temporary file and then read once the process is done.
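A minimal set of requires for the lambda handler might look like this:

```javascript
// Node built-ins used by the lambda handler
const path = require('path');                 // build the path to the phantomJS bin
const spawn = require('child_process').spawn; // run phantomJS as a child process
const fs = require('fs');                     // create and read the temporary data file
```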

Running phantomJS as a Child Process

First, you will need to configure your environment PATH so that your phantom bin can be properly accessed, and then store the path to the phantom executable in a constant.
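Here’s a sketch of that setup, assuming the phantomjs binary is bundled at the root of your deployment package (the phantomPath name is just our choice):

```javascript
// Make the bundled binary discoverable by appending the deployment package
// directory (exposed by Lambda as LAMBDA_TASK_ROOT) to the PATH.
process.env.PATH = process.env.PATH + ':' + process.env.LAMBDA_TASK_ROOT;

// Absolute path to the phantomjs binary shipped alongside the handler
const phantomPath = path.join(__dirname, 'phantomjs');
```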

After you’ve created your path, you’ll want to establish any arguments that you want to pass to your phantom process. Any phantomJS flags should come first, followed by the path to the script you want phantom to execute, and then any additional arguments can be passed afterward. In our use case, we set the flags to relax some of phantom’s HTTPS restrictions so that our scraper is more tolerant of configuration variances. We then pass a URL that we want phantom to open as our additional argument.
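Here is roughly what that argument list looks like; the two flags are standard phantomJS switches for relaxing HTTPS checks, while scraper.js and targetUrl are placeholder names for your phantom script and the URL you want it to open:

```javascript
// 1) phantom flags, 2) the phantom script to run, 3) extra args for that script
const phantomArgs = [
  '--ignore-ssl-errors=true',           // tolerate self-signed or misconfigured certificates
  '--ssl-protocol=any',                 // don't insist on a specific SSL/TLS version
  path.join(__dirname, 'scraper.js'),   // the script phantom will execute
  targetUrl                             // the URL we want phantom to open
];
```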

Now that we have the path and arguments established, we can spawn phantom as a child process. Make sure to save it into a reference variable so that you can listen to its events.
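With the pieces above in place, the spawn itself is a one-liner:

```javascript
// Spawn phantomJS and keep a reference so we can listen to its events
const phantomProcess = spawn(phantomPath, phantomArgs);
```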

Congrats! You now have phantomJS running as a child process of your node lambda. But our work isn’t done yet. We still need to listen to our child process so that we can look for any data it outputs or log any errors that it reports.

Listening to Events from Your Phantom Process and Processing Data

To listen for console.log()s and errors coming from your child phantom process, you’ll want to tie into its standard out and standard error. To do so, you can reference your phantom process’s stdout and stderr properties and add an event listener to each, as such:
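In our handler that amounts to a pair of data listeners on the spawned process, along these lines:

```javascript
// Anything the phantom script console.log()s arrives on stdout;
// anything it reports as an error arrives on stderr.
phantomProcess.stdout.on('data', (data) => {
  console.log('phantom stdout: ' + data.toString());
});

phantomProcess.stderr.on('data', (data) => {
  console.error('phantom stderr: ' + data.toString());
});
```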

It is very easy to transfer data from your phantom script to your node parent process over stdout, but please be aware that stdout is a stream: its output arrives in chunks of limited size, so a large payload will be split across multiple data events, and it can cause many headaches to successfully piece your data back together. We advise instead that you use stdout only for communicating very small bits of data back to your parent process and write large data into a temporary file that can later be read by the parent process.

You may also find it very useful to provide a callback function for when your phantom process is done running. You can do so by listening for the close event of the phantom spawn, as such:
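A bare-bones version of that listener looks like this (we flesh it out in the final section):

```javascript
// 'close' fires once phantom has exited and its stdio streams have closed
phantomProcess.on('close', (code) => {
  console.log('phantom exited with code ' + code);
  // read the temp file and invoke the lambda callback here (see below)
});
```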

Establishing a Temporary File for phantomJS to Write Large Amounts of Data

If you expect your phantom child process to communicate large data objects back to your parent process, then you should create a temporary file for the processes to share. This can be accomplished by synchronously opening the file with the file system module, creating it if one doesn’t already exist, and then synchronously closing it.
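On Lambda the only writable location is /tmp, so we create the file there; the file name below is just an example:

```javascript
// Lambda only lets you write under /tmp
const tmpFile = '/tmp/scrape-result.json';

// openSync with the 'w' flag creates the file if it doesn't exist
// (and truncates any leftover contents); we close the descriptor right away.
fs.closeSync(fs.openSync(tmpFile, 'w'));
```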

Setting Up Your Phantom Script

The first thing you will need to do inside of your phantom script is require the necessary modules: one to access the arguments passed from the parent process, one to write and access temporary files, and the webpage module so that phantomJS can open the desired web page. To do this, you’ll want to require the system module, the webpage module, and once again the file system module (phantomJS ships its own).
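At the top of the phantom script (which runs under phantomJS, not Node), that looks roughly like this:

```javascript
// phantomJS-side modules (these are phantom's own, not Node's)
var system = require('system');          // access to the arguments we were passed
var fs = require('fs');                  // phantom's file system module, for the temp file
var page = require('webpage').create();  // a page object we can open and evaluate
```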

Next you will need to access the arguments passed from the parent process. To do this, access the args property of the system module. In our case, the argument being passed is the URL of the page we want to open.
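Keep in mind that system.args[0] is the script’s own path, so the arguments you passed in start at index 1:

```javascript
// system.args[0] is the script path; our URL is the first real argument
var url = system.args[1];
```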

With the desired URL in hand, you can use the webpage module to open the page and communicate the result back to your parent process.
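Here’s a sketch of that step; the evaluate callback below grabs the first form’s HTML as a stand-in for whatever your scraper actually needs to pull out of the page:

```javascript
page.open(url, function (status) {
  if (status !== 'success') {
    console.log('Failed to load ' + url); // surfaces on the parent's stdout
    phantom.exit(1);
    return;
  }

  // Runs inside the context of the loaded page
  var formHtml = page.evaluate(function () {
    var form = document.querySelector('form');
    return form ? form.outerHTML : null;
  });

  // ...write the result to the temp file and exit (shown below)
});
```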

Once you evaluate your page and scrape the data you are looking for, you can write that data into the temporary file that was created by the parent process. In our use case we write a JSON blob into the file with a function structured much like this:
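Something along these lines, with the temp file path matching the one the parent process created:

```javascript
// Write the scraped data into the shared temp file, then exit.
// '/tmp/scrape-result.json' must match the path the parent opened.
function writeResult(data) {
  fs.write('/tmp/scrape-result.json', JSON.stringify(data), 'w');
  phantom.exit(0); // exiting triggers the parent's close listener
}
```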

After writing the desired data to our temporary file, we exit the phantom process, which activates the close listener in our parent process, where the data in the temp file is read.

Reading Data from the Phantom Child Process Stored in the Temporary File

In our specific use case, it was only necessary to read data stored in the temporary file once our phantom child process was done running. To accomplish this, we put all of our logic, such as checking to see if our temporary file contained JSON and sending a callback, inside of our phantom spawn’s close event. We also delete the temporary file since it is no longer necessary.
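Putting it all together, the close handler ends up looking something like this (callback is the lambda handler’s callback, and tmpFile is the path we created earlier):

```javascript
phantomProcess.on('close', (code) => {
  let result = null;
  try {
    // The phantom script wrote its output here; only accept valid JSON
    result = JSON.parse(fs.readFileSync(tmpFile, 'utf8'));
  } catch (err) {
    console.error('Temp file did not contain valid JSON: ' + err.message);
  }

  // The temp file has served its purpose, so clean it up
  fs.unlinkSync(tmpFile);

  // Hand the result (or an error) back to the lambda caller
  if (result) {
    callback(null, result);
  } else {
    callback(new Error('Scrape failed, phantom exit code ' + code));
  }
});
```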

Conclusion

It was a long process, but you’ve now built a phantom web scraper running inside of a node lambda that’s capable of scraping large amounts of data! We hope this guide has helped you accomplish your serverless scraper goals.