| Commit | Line | Data |
|---|---|---|
| ac4d1142 NL | 1 | title: normalize(//h1) |
| | 2 | |
| | 3 | author: //td/p[position()=last()]/em |
| | 4 | |
| | 5 | # I swear, this is really the best way to do this |
| | 6 | date: normalize(//td[contains(@style, "color: #ffffff")]) |
| | 7 | |
| | 8 | # my god, it's full of tables |
| | 9 | body: /table/tbody/tr[5]//table/tbody//table/tbody/tr/td |
| | 10 | strip: //h1 |
| | 11 | |
| | 12 | # the following two lines strip the byline at the end of the article (the byline is a <p> that consists of an em dash and then some text in an <em>). I have no idea why I can't just strip //p[position()=last()], but trying to do so includes a bunch of other crap in the output. |
| | 13 | strip: //p[position()=last()]/em |
| | 14 | strip: //p[position()=last()]/child::text() |
| | 15 | test_url: http://www.fnal.gov/pub/today/archive_2011/today11-11-09_MuonDepartmentReadMore.html |
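
The comment on line 12 wonders why `//p[position()=last()]` cannot be stripped directly. One XPath detail that may be relevant (it is not confirmed as the cause here): in XPath 1.0 that expression selects the last `<p>` child of every parent element, not the single final `<p>` in the document, which would be written `(//p)[last()]`. The sketch below illustrates the difference with Python's standard `xml.etree.ElementTree`, whose limited XPath subset supports `[last()]`. The sample markup and both helper names are hypothetical, not taken from the Fermilab page or from the extraction tool.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup for illustration only; not the real page structure.
SAMPLE = """
<td>
  <div><p>First section.</p><p>Second section.</p></div>
  <p>Article text.</p>
  <p>- <em>Jane Doe</em></p>
</td>
""".strip()


def last_p_per_parent(root):
    """Analogue of //p[position()=last()]: the last <p> child of each parent."""
    return root.findall(".//p[last()]")


def last_p_in_document(root):
    """Analogue of (//p)[last()]: the single final <p> in document order."""
    paragraphs = root.findall(".//p")
    return paragraphs[-1] if paragraphs else None


def text_of(element):
    """Concatenate all text inside an element, including nested tags."""
    return "".join(element.itertext()).strip()


root = ET.fromstring(SAMPLE)
per_parent = [text_of(p) for p in last_p_per_parent(root)]
final = text_of(last_p_in_document(root))

print(per_parent)  # one entry per parent that contains a <p>
print(final)       # only the byline paragraph
```

In this sample the per-parent form matches two paragraphs (the last one inside the `<div>` and the byline), while the document-order form matches only the byline.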