Table of Contents
- When to Webdrive
- Selenium Webdriver
- First Steps
- Usage + Example Program
- Referencing Webpage (DOM) Elements
Do you have menial or repetitive tasks you find yourself performing manually within a web browser, over and over again? Maybe your job requires you to search for or message relevant people on LinkedIn. Maybe you run a publishing company that wastes countless hours uploading ebooks and their metadata to different services like Amazon and Pubit. If you were given an intern who could follow instructions perfectly, would you be able to describe your exact process for completing these repetitive tasks? If so, this tutorial walks you through automating such workflows by:
- Formalizing our workflow into an Algorithm (a generic set of repeatable steps)
- Translating our human instructions into discrete actions (e.g. clicking, typing) applied to specific browser "elements", in a way that computers can understand and reference
- See the Render v. Markup section, which describes how we interpret and represent webpages differently than computers do.
- Programming a script which controls the flow of your Algorithm
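As a rough illustration of the first two bullets, a workflow can be written down as an ordered list of discrete actions, each aimed at one element. The `Step` type, the step wording, and the LinkedIn search scenario below are hypothetical, chosen only to show the shape of such a list. Nothing here talks to a browser yet.

```python
from collections import namedtuple

# One discrete browser action aimed at one page element.
#   action: "open", "type" or "click"
#   target: how a human would describe the element
#   value:  the text to type or the URL to open (None when not needed)
Step = namedtuple("Step", ["action", "target", "value"])

# A hypothetical "search for a person" workflow, written the way you might
# explain it to an intern.
WORKFLOW = [
    Step("open", "browser address bar", "https://www.linkedin.com"),
    Step("type", "search box at the top of the page", "technical editor"),
    Step("click", "search button", None),
]


def describe(workflow):
    """Render the workflow as numbered, human-readable instructions."""
    lines = []
    for number, step in enumerate(workflow, start=1):
        detail = "" if step.value is None else " {!r}".format(step.value)
        lines.append("{}. {}{} -> {}".format(number, step.action, detail, step.target))
    return lines


for line in describe(WORKFLOW):
    print(line)
```

The `target` descriptions are still human descriptions. Replacing them with identifiers the browser understands is the hard part, and is what the Render v. Markup section is about.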
Webdriving is the act of having a computer perform steps and actions within a web browser on your behalf, such as typing keys, performing searches, submitting forms, and clicking links or buttons. It is useful for automating menial, repetitive tasks (the kind of work Mechanical Turk [mturk] is often used for) and also in software development, where it lets programmers write tests and catch irregularities when their web application runs on different browsers (e.g. Firefox, Chrome, Safari, Opera). In some organizations, hundreds of people independently contribute to the code which runs a web application. Can you imagine comprehensively testing every web page feature in every browser, every time one of these changes occurs?
- Amplification: Whatever workflow you were doing manually, after you have a script, you can run it multiple times and amplify your productivity.
- Productivity: You can focus your time on things which require expertise.
- Speed: Webdrivers work much faster than we do, since they know exactly where elements on the page are without moving a mouse.
- Correctness: You can avoid mistakes by relying on the computer to step through your workflow without errors.
Writing Webdrivers isn't "free": it takes a significant time investment to formalize the steps of your routine and translate your manual workflow into steps the computer can understand. For the non-programmer, expect your first Selenium script to take you about a week (2h a day for 5 days). Most of your time will be spent installing, setting up, and learning the basics, reading the included resources, and above all, trying to reconcile your visual interpretation of the elements you see within a webpage with the browser's internal identifiers for them. More on this in Render v. Markup.
Webdriving excels when your workflow is:
- Frequent: A repetitive task which needs to be run often (especially when the required frequency is greater than a human could possibly achieve).
- Deterministic: Neither the task nor the webpage has much variability or dynamic, unpredictable content (it's generally the same each time).
- Tedious and Lengthy: The task involves a lot of steps (especially when the process is so long that doing it by hand many times costs more than learning the steps once and programming a script to handle the repeats).
- Highly Repetitive: The same workflow must be done many times, with only minor changes.
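The trade-off behind these criteria can be put into rough numbers. The sketch below estimates how many runs it takes before a script has paid back its setup time. The 10 hours come from the estimate above (2h a day for 5 days); the per-run times are made-up figures, and maintenance costs are ignored.

```python
import math


def break_even_runs(setup_hours, manual_minutes_per_run, scripted_minutes_per_run=0.0):
    """Return how many runs it takes before the script has paid for itself."""
    saved_per_run = manual_minutes_per_run - scripted_minutes_per_run
    if saved_per_run <= 0:
        raise ValueError("the script must save time on every run")
    return math.ceil(setup_hours * 60 / saved_per_run)


# 10 hours of setup against a hypothetical task that takes 15 minutes by hand
# and 1 minute of supervision once scripted: 600 / 14, rounded up.
print(break_even_runs(10, 15, 1))  # prints 43
```

If you expect to run the workflow fewer times than that, doing it by hand is probably cheaper.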
Maintenance & Execution Costs
In addition to programming costs, there's also the cost of execution and maintenance. When we as humans run into exceptions in our workflow (e.g. if LinkedIn updates their website and changes the location of a button), we can adapt. Your computer program, by contrast, will break and halt its work. If you aren't a good manager (i.e. you don't check your script often), you can lose significant time and progress.
- Monitoring. Someone or something needs to ensure the script is still running, and running correctly. A popular method is to have your program send an email notification as soon as it fails or, if it is acceptable to skip some repetitions, to programmatically skip or otherwise handle edge cases.
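Both monitoring ideas (notify on failure, skip what can be skipped) can be sketched in a few lines. The names `run_all`, `process`, and `notify` are hypothetical. `notify` stands in for whatever actually sends the email, and the stopping threshold of three failures is an arbitrary choice.

```python
def run_all(items, process, notify, max_failures=3):
    """Process every item, skipping the ones that fail.

    Each failure is reported through `notify`. Once `max_failures` items
    have failed, the run stops so a human can investigate.
    Returns (completed, skipped).
    """
    completed, skipped = [], []
    for item in items:
        try:
            process(item)
        except Exception as error:  # one broken page should not kill the whole batch
            skipped.append(item)
            notify("Skipped {!r}: {}".format(item, error))
            if len(skipped) >= max_failures:
                notify("Stopping after {} failures".format(len(skipped)))
                break
        else:
            completed.append(item)
    return completed, skipped


# Stand-in for a real webdriving step: it fails on one particular item.
def flaky(item):
    if item == "b":
        raise RuntimeError("button not found")


messages = []  # a real script would email these instead of collecting them
done, skipped = run_all(["a", "b", "c"], flaky, messages.append)
print(done, skipped, messages)
```

Stopping after repeated failures matters because a changed website usually breaks every repetition, not just one.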
Webdriving is very rarely a silver bullet, especially since many workflows require the intervention of an expert. Some tasks have very repetitive steps which can be automated in the background, but occasionally the workflow has specific steps which require completion by a human expert. For example, with a LinkedIn webdriver which finds relevant people, maybe we want to pause the webdriving process to write each person a personal message. This pausing disrupts the whole automation process, requires the human to become involved, and eliminates most of the advantage of using a webdriver.
One of the biggest components of designing effective automation pipelines is getting around this type of problem. Optimizing such scenarios is one of the biggest opportunities for gains. We won't be going too deep into this now, but you can imagine solving this problem by templating or batching.
- Templating: Eliminate expertise by using generic templates (such as an email template). Templates generally allow greater scale at the cost of quality.
- Batching: Have the expert do their part of the work up front, store it somewhere (such as a database), and then have your webdriver refer to this stored content as it is needed. In this way, the expert is no longer the bottleneck of the workflow.
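A minimal sketch of both ideas together, assuming a personal-message workflow like the LinkedIn example. The message wording, the `outbox` table, and the function names are all hypothetical. An in-memory SQLite database stands in for the "somewhere" the batch is stored.

```python
import sqlite3
from string import Template

# Templating: one generic message, personalised only by name and topic.
MESSAGE = Template("Hi $name, I enjoyed your post about $topic. Could we talk?")


def prepare_batch(connection, people):
    """Batching: write every message up front, before any webdriving starts."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS outbox "
        "(name TEXT, message TEXT, sent INTEGER DEFAULT 0)"
    )
    rows = [(person["name"], MESSAGE.substitute(person)) for person in people]
    connection.executemany("INSERT INTO outbox (name, message) VALUES (?, ?)", rows)
    connection.commit()


def next_unsent(connection):
    """What the webdriver would call whenever it is ready for the next message."""
    return connection.execute(
        "SELECT name, message FROM outbox WHERE sent = 0 ORDER BY rowid LIMIT 1"
    ).fetchone()


def mark_sent(connection, name):
    """Record that the webdriver has delivered the message for `name`."""
    connection.execute("UPDATE outbox SET sent = 1 WHERE name = ?", (name,))
    connection.commit()


connection = sqlite3.connect(":memory:")  # a file path would keep the batch on disk
prepare_batch(connection, [
    {"name": "Ada", "topic": "testing"},
    {"name": "Grace", "topic": "compilers"},
])
print(next_unsent(connection))
```

The expert only touches `prepare_batch`. The webdriver loop only touches `next_unsent` and `mark_sent`, so neither waits on the other.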
A popular tool for "Webdriving" is Selenium. Selenium is a toolkit which you can import and program from languages such as Java and Python. In my opinion, the Python interface for Selenium (the set of functions and the conventions for calling them) is much simpler for the first-time programmer, in installing and setting up the tools, in writing the code, and in executing it.
First Step: A Visual (Non Programming) Example
Before getting started with Webdriver programming, there's a neat way to use Selenium interactively through Firefox, without requiring direct programming: http://www.seleniumhq.org/projects/ide/ (see the download link to install and try the plugin). You can watch a short video explaining how to use the graphical interface to record your browser actions here: https://www.youtube.com/watch?v=gsHyDIyA3dg
After watching, you should be familiar at a high level with the concept of recording your browser actions, but many of the features and specifics (such as addressing elements) will probably leave you a bit confused. These are addressed in Render v. Markup.
- Install Python 3.4
- Install Selenium
- Install Selenium Python Bindings
- sudo pip install selenium
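One quick way to confirm the last step worked is to ask Python whether it can find the package. This is only a sanity check; the helper name `is_installed` is made up for this sketch.

```python
import importlib.util


def is_installed(module_name):
    """Return True if this Python can find a module called `module_name`."""
    return importlib.util.find_spec(module_name) is not None


print("selenium installed:", is_installed("selenium"))
```

If this prints False, the pip command above most likely installed Selenium for a different Python than the one you are running.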
Usage + Example Program
- To start Python, open a command shell and type: python
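A first program might look like the sketch below, which opens a page, types into a search field, and submits it. Several things here are assumptions rather than part of this tutorial: that Firefox and a matching driver are available on your machine, that google.com names its search field `q`, and the function names `run_search` and `main`.

```python
def run_search(driver, url, field_locator, query):
    """Open `url`, type `query` into the located field, submit, return the page title."""
    driver.get(url)
    box = driver.find_element(*field_locator)  # field_locator is a (strategy, value) pair
    box.send_keys(query)
    box.submit()
    return driver.title


def main():
    # Imported here so the function above can be read and tried without a browser.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Firefox()  # assumes Firefox and its driver are installed
    try:
        # The field name "q" is an assumption about google.com's markup.
        title = run_search(driver, "https://www.google.com", (By.NAME, "q"), "selenium webdriver")
        print(title)
    finally:
        driver.quit()  # always close the browser, even if a step fails
```

After pasting this into the Python prompt, calling `main()` should open a Firefox window and run the search. How the search field was identified as `q` is the subject of the next section.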
Referencing Webpage (DOM) Elements
The power of Selenium is being able to programmatically address different browser elements. To do this, we need to understand the language the browser uses, learn how the browser addresses elements, and see how this differs from the way we talk about browser elements.
Render v. Markup
Have you ever tried to give map directions using visual indicators to a friend who only understands areas by names? Perhaps you forget the exact address of a location but you know it's next to a red coffee shop and there is a big willow tree out front. This means nothing to your friend, who needs the coffee shop's name. Webdriving is a comparably frustrating task.
One thing you'll notice (the hardest part about webdriving using Selenium) is that the browser understands a web page and its structure very differently than a human does. While different elements of a webpage (such as a hypertext link or a button) may appear to us humans as being identical, or visually appear to be in a certain logical order, the browser may have an entirely different internal representation of what the page "looks like". The actual structure and organization of a webpage is not how it visually appears to the user (this is called rendering), but how the web page's HTML elements are marked up.
What a webpage is
Without diving in too deep, when you (a user) visit a webpage in your browser (say you navigate to the URL google.com), what you see is your browser's rendering of an HTML file sent by the server you requested. In this case, Google's server returns an HTML file for google.com to your browser, which interprets the HTML markup and renders it according to its own rules for HTML.
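To make the difference between markup and rendering concrete, the sketch below feeds a tiny, made-up HTML page to Python's built-in HTML parser and prints two views of it: the text a reader would see, and the tags and attributes the browser actually works with. The page contents and the `Outline` class are illustrative only.

```python
from html.parser import HTMLParser

# A made-up page: two links and a button.
PAGE = """
<html>
  <body>
    <a id="home" class="nav" href="/">Home</a>
    <a id="about" class="nav" href="/about">About</a>
    <button id="send" class="primary">Send</button>
  </body>
</html>
"""


class Outline(HTMLParser):
    """Collect what the markup says (tags, attributes) and what a reader sees (text)."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        self.tags.append((tag, dict(attrs)))

    def handle_data(self, data):
        if data.strip():  # ignore the whitespace between tags
            self.text.append(data.strip())


outline = Outline()
outline.feed(PAGE)
print("What you see:", outline.text)
for tag, attrs in outline.tags:
    print("What the browser has:", tag, attrs)
```

A reader sees three words. The browser sees five nested tags, three of which carry an `id` and a `class`. Those attributes are what a webdriver uses to find an element.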
Viewing a Webpage's Source
By default, we see the rendered version of the resources (web pages) we request. However, you can view the underlying HTML markup your browser received from the server by viewing the webpage's source (Ctrl + U) or, as most people prefer, by using the browser's element inspector developer tool (Chrome and Firefox both have comparable inspectors).
Tag Names, Element IDs, CSS Selectors, and XPath
Within the source, you will see a bunch of HTML tags. These tags are all a part of the page's Document Object Model (the DOM). Tags within the DOM can be referenced in various ways. Just as humans have names, social security numbers, parents or children, telephone numbers, and email addresses (each a way of being referenced), every HTML tag within the DOM has at least one way of being referenced.
A browser can refer to an element:
- By its tag's ID (if it exists)
- By its tag name
- By a class name
- By its XPath (the location of this element relative to others within the document / the page's DOM)
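In Selenium's Python bindings, each of these four ways corresponds to a locator strategy passed to `find_element`. The sketch below is an illustration only: the ID `send`, the class `primary`, the XPath, and the helper names are hypothetical, and would need to match the markup of whatever page you are driving.

```python
def example_locators():
    """Return one (strategy, value) pair for each way of referencing an element."""
    # Imported here so the helper below can be read and tried without Selenium.
    from selenium.webdriver.common.by import By

    return {
        "by id": (By.ID, "send"),
        "by tag name": (By.TAG_NAME, "button"),
        "by class name": (By.CLASS_NAME, "primary"),
        "by xpath": (By.XPATH, "//body//button[@id='send']"),
    }


def tag_names(driver, locators):
    """Look up each locator and report the tag name of the element it found."""
    return {
        label: driver.find_element(by, value).tag_name
        for label, (by, value) in locators.items()
    }
```

With a `driver` opened on a page containing such a button, `tag_names(driver, example_locators())` would look the same element up four different ways. `find_element` raises an exception when nothing matches, which is usually the first sign that a locator is wrong.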
Using IDs, tag names, and class names is generally more convenient than using XPaths. The following explains effective ways of determining the XPaths of elements: