# NOTE: If testing this configuration yields bad results, including junk text like "Try BostonGlobe.com today" and "THIS STORY APPEARED IN", please replace the Test URL with a current-day headline link from bostonglobe.com.

title: //div[@class="header"]/h1
author: substring-after(//div[@class="byline"]/h2[@class="author"],"By ")
date: //div[@class="byline"]/p[last()]
body: //div[@class="article-body"]

strip_id_or_class: aside
strip_id_or_class: promo
strip_id_or_class: skip-nav
strip_id_or_class: article-more
strip_id_or_class: article-bar

# This removes image captions. If the parser starts saving images from bostonglobe.com (currently, it does not), then this directive should be removed.
strip_id_or_class: figure
test_url: http://bostonglobe.com/news/nation/2012/03/17/illinois-primary-could-pivotal/PsDzFZqvhEYyXbOcF9FOkO/story.html
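
To see roughly what the rules above select, the following is a minimal Python sketch that applies them to a small sample page. Several things here are assumptions rather than part of the configuration: the `SAMPLE` markup is hypothetical and far simpler than a real bostonglobe.com page, the helper names `strip_id_or_class` and `extract` are made up for illustration, and `strip_id_or_class` is assumed to match any element whose `id` or `class` contains the given text. The sketch uses only the standard library, so `substring-after` is emulated with `str.partition`, and the input must be well-formed XML, which real pages often are not.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified article markup (an assumption, not the real page).
SAMPLE = """
<html><body>
  <div class="header"><h1>Sample headline</h1></div>
  <div class="byline">
    <h2 class="author">By Jane Doe</h2>
    <p>Globe Staff</p>
    <p>March 17, 2012</p>
  </div>
  <div class="article-body">
    <p>First paragraph.</p>
    <div class="figure"><p>Photo caption</p></div>
    <div class="promo">Try BostonGlobe.com today</div>
    <p>Second paragraph.</p>
  </div>
</body></html>
"""

# The values of the strip_id_or_class directives in the configuration.
STRIP_TOKENS = ["aside", "promo", "skip-nav", "article-more", "article-bar", "figure"]


def strip_id_or_class(root, tokens):
    """Remove every element whose id or class contains one of the tokens.

    Substring matching is an assumption about how the directive behaves.
    """
    for parent in list(root.iter()):
        for child in list(parent):
            attrs = child.get("id", "") + " " + child.get("class", "")
            if any(token in attrs for token in tokens):
                parent.remove(child)


def extract(html):
    """Apply the title, author, date and body rules to a well-formed page."""
    root = ET.fromstring(html.strip())
    # Simplification: strip first, then select. The real parser's order may differ.
    strip_id_or_class(root, STRIP_TOKENS)

    title = root.find('.//div[@class="header"]/h1').text
    byline = root.find('.//div[@class="byline"]/h2[@class="author"]').text
    # Emulates substring-after(..., "By "): text after the first "By ", else "".
    author = byline.partition("By ")[2]
    date = root.find('.//div[@class="byline"]/p[last()]').text
    body = root.find('.//div[@class="article-body"]')
    # Collapse whitespace so the remaining body text is easy to inspect.
    body_text = " ".join("".join(body.itertext()).split())

    return {"title": title, "author": author, "date": date, "body": body_text}


if __name__ == "__main__":
    for field, value in extract(SAMPLE).items():
        print(f"{field}: {value}")
```

In this sample, the caption and the promo block are dropped from the body because their `class` values contain `figure` and `promo`, and the date rule picks the last `p` inside the byline rather than the first.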