Stateful programmatic web browsing in Python, after Andy Lester's Perl module WWW::Mechanize.
- mechanize.Browser is a subclass of mechanize.UserAgent, which is, in turn, a subclass of urllib2.OpenerDirector (ClientCookie.OpenerDirector for pre-2.4 versions of Python), so any URL can be opened, not just http: (see the short sketch after this list).
- mechanize.UserAgent offers easy dynamic configuration of user-agent features like protocol, cookie, redirection and robots.txt handling, without having to make a new OpenerDirector each time, e.g. by calling build_opener() (this interface is not yet stable, though).
- Browser history (.back() and .reload() methods).
- The Referer HTTP header is added properly (optional).
- Observance of robots.txt (on by default).
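As a minimal illustration of the "any URL, not just http:" point, here is a sketch of opening a local file through the same interface. The file path is hypothetical, and it assumes the default handler set covers file: URLs, as urllib2's does:

from mechanize import Browser

br = Browser()
# Open a non-http URL: the default handlers also cover schemes such as
# file: and ftp: (assumption: same default handler set as urllib2).
response = br.open("file:///tmp/example.html")  # hypothetical local file
print response.read()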
An example:
import re
from mechanize import Browser

br = Browser()
br.open("http://www.example.com/")

# follow second link with element text matching regular expression
response = br.follow_link(text_regex=re.compile(r"cheese\s*shop"), nr=1)
assert br.viewing_html()
print br.title()
print response.geturl()
print response.info()  # headers
print response.read()  # body
response.close()

br.select_form(name="order")
# Browser passes through unknown attributes (including methods)
# to the selected HTMLForm (from ClientForm).
br["cheeses"] = ["mozzarella", "caerphilly"]  # (the method here is __setitem__)
response2 = br.submit()  # submit current form

response3 = br.back()  # back to cheese shop
# the history mechanism uses cached requests and responses
assert response3 is response
# we can still use the response, even though we closed it:
response3.seek(0)
response3.read()

response4 = br.reload()
assert response4 is not response3

for form in br.forms():
    print form

# .links() optionally accepts the keyword args of .follow_/.find_link()
for link in br.links(url_regex=re.compile("python.org")):
    print link
    br.follow_link(link)  # takes EITHER Link instance OR keyword args
    br.back()
You may control the browser's policy by using the methods of mechanize.Browser's base class, mechanize.UserAgent. For example:
br = Browser()

# Don't handle HTTP-EQUIV headers (HTTP headers embedded in HTML).
br.set_handle_equiv(False)

# Ignore robots.txt.  Do not do this without thought and consideration.
br.set_handle_robots(False)

# Don't handle cookies
br.set_cookiejar()

# Supply your own ClientCookie.CookieJar (NOTE: cookie handling is ON by
# default: no need to do this unless you have some reason to use a
# particular cookiejar)
br.set_cookiejar(cj)

# Print information about HTTP redirects and Refreshes.
br.set_debug_redirects(True)

# Print HTTP response bodies (ie. the HTML, most of the time).
br.set_debug_responses(True)

# Print HTTP headers.
br.set_debug_http(True)
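Because Browser ultimately derives from OpenerDirector, the inherited addheaders attribute is another knob: a list of (name, value) pairs added to every request that does not already set them. A small sketch (the User-Agent string here is made up, and replacing the whole list discards the default entry):

from mechanize import Browser

br = Browser()
# addheaders is inherited from urllib2.OpenerDirector; these headers are
# added to every request that does not already define them.
br.addheaders = [("User-agent", "my-script/0.1")]  # hypothetical value
br.open("http://www.example.com/")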
Full documentation is in the docstrings.
Thanks to Ian Bicking, for persuading me that a UserAgent class would be useful.
Still to do:

- A .response() method (each call should return an independent pointer to the same data). It should also be possible to clone responses, so that the HTML can be processed. This needs some careful thought: the multiple layers of response objects in ClientCookie and the standard library want cleaning up.
- mechanize.UserAgent.
- A Browser.load_response() method.
- Browser.form_as_string() and Browser.__str__() methods.
- Something for cases where the mechanize.Browser / ClientForm API is not sufficient. (DOMForm is similar: an implementation of the ClientForm interface on top of an HTML DOM, but it is buggy and unmaintained, and the DOM is not as nice an API as BeautifulSoup.)
All documentation (including this web page) is included in the distribution.
This is an alpha (development) release: interfaces may change, and there will be bugs.
For installation instructions, see the INSTALL file included in the distribution.
The Subversion (SVN) trunk is http://codespeak.net/svn/wwwsearch/mechanize/trunk, so to check out the source:
svn co http://codespeak.net/svn/wwwsearch/mechanize/trunk mechanize
Richard Jones' webunit (this is not the same as Steven Purcell's code of the same name). webunit and mechanize are quite similar. On the minus side, webunit is missing things like browser history, high-level forms and links handling, thorough cookie handling, refresh redirection, adding of the Referer header, observance of robots.txt and easy extensibility. On the plus side, webunit has a bunch of utility functions bound up in its WebFetcher class, which look useful for writing tests (though they'd be easy to duplicate using mechanize). In general, webunit has more of a frameworky emphasis, with aims limited to writing tests, where mechanize and the modules it depends on try hard to be general-purpose libraries.
There are many related links in the General FAQ page, too.
Requires Python 2.2 or above, plus the ClientCookie, ClientForm and pullparser modules.
The versions of those required modules are listed in the setup.py for mechanize (included with the download). The dependencies are automatically fetched by easy_install when you run python setup.py install.
The BSD license (included in distribution).
I prefer questions and comments to be sent to the mailing list rather than direct to me.
John J. Lee, November 2005.