Commit | Line | Data |
---|---|---|
ac4d1142 NL |
1 | # TODO: clean up the extra junk at the end of articles |
2 | | |
3 | # general text formatting | |
4 | prune: no | |
5 | convert_double_br_tags:yes | |
6 | | |
7 | # where to find the basic metadata | |
8 | author://a[@class='articleauthor'] | |
9 | date://a[starts-with(@href,'/en/search/published/')] | |
10 | title:substring-before(//h2[@class='title'],'—') | |
11 | body://div[@id='maincontainer'] | |
12 | | |
13 | dissolve://div[starts-with(@id,'commentableblock')] | |
14 | | |
15 | # clean up the crap | |
16 | strip://div[contains(@class,'domusnetwork')] | |
17 | strip://div[contains(@class,'relative_wrapper')] | |
18 | | |
19 | strip://div[contains(@class,'captionsubimage')]/img[contains(@class,'arrow')] | |
20 | wrap_in(em): //div[contains(@class,'captionsubimage')]/span | |
21 | test_url: http://www.domusweb.it/en/design/in-praise-of-lost-time/ |
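
The directives in this file follow a `name: value` pattern, with an optional parenthesised argument as in `wrap_in(em)`. As a rough illustration only, here is a minimal Python sketch of how such lines could be split into their parts. The function `parse_site_config` and the regular expression are hypothetical and are not taken from the tool that actually consumes this file. The sketch assumes that the first colon after the directive name separates name from value, that lines starting with `#` are comments, and that CRLF line endings should be tolerated.

```python
import re

# Hypothetical pattern (an assumption, not the real parser's grammar):
# a directive name, an optional "(argument)", a colon, then the value.
DIRECTIVE = re.compile(
    r"^(?P<name>\w+)(?:\((?P<arg>[^)]*)\))?\s*:\s*(?P<value>.*)$"
)


def parse_site_config(text):
    """Return a list of (name, arg, value) tuples.

    Blank lines and lines starting with '#' are skipped. `arg` is None
    when the directive has no parenthesised argument.
    """
    directives = []
    for raw in text.splitlines():  # splitlines() also handles CRLF endings
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        match = DIRECTIVE.match(line)
        if match is None:
            continue  # ignore lines that do not look like "name: value"
        directives.append((match["name"], match["arg"], match["value"]))
    return directives


if __name__ == "__main__":
    sample = (
        "# general text formatting\r\n"
        "prune: no\r\n"
        "author://a[@class='articleauthor']\r\n"
        "wrap_in(em): //div[contains(@class,'captionsubimage')]/span\r\n"
        "test_url: http://www.domusweb.it/en/design/in-praise-of-lost-time/\r\n"
    )
    for directive in parse_site_config(sample):
        print(directive)
```

Because only the first colon after the name is treated as the separator, values that themselves contain colons (such as the `test_url` address) or begin with `//` (the XPath expressions) are kept intact.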