-title: substring-before(//title, '—')
-test_url: http://512pixels.net/more-on-linked-lists/
\ No newline at end of file
+title: //meta[@property='og:title']/@content
+test_url: http://www.512pixels.net/blog/2014/10/the-move
Full-Text RSS site config files
================
-[Full-Text RSS](http://fivefilters.org/content-only/), our article extraction tool, makes use of site-specific extraction rules to improve results. Each time a URL is processed, it checks to see if there are extraction rules for the site being processed. If there are no site patterns, it tries to detect the content block automatically.
+[Full-Text RSS](http://fivefilters.org/content-only/), our article extraction tool, makes use of site-specific extraction rules to improve results. Each time a URL is processed, it checks to see if there are extraction rules for the site being processed. If no rules are found, it tries to detect the content block automatically.
-This repository contains the site config files we use in Full-Text RSS.
+This repository contains the site-specific extraction rules we rely on in Full-Text RSS.
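
As a rough illustration of the lookup described above, here is a minimal Python sketch. It is not the Full-Text RSS implementation: the function names, the `hostname.txt` file naming, and the leading-dot wildcard files are assumptions made for this example only.

```python
from urllib.parse import urlparse


def candidate_config_names(url):
    """Return config file names to try, most specific first.

    The "<host>.txt" and ".<parent-domain>.txt" naming is an assumption
    made for this sketch.
    """
    host = (urlparse(url).hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    names = [host + ".txt"] if host else []
    # Also try parent domains as wildcard files, e.g. ".example.com.txt".
    parts = host.split(".")
    for i in range(1, len(parts) - 1):
        names.append("." + ".".join(parts[i:]) + ".txt")
    return names


def choose_strategy(url, available):
    """Use site rules when a matching config exists, else auto-detect."""
    for name in candidate_config_names(url):
        if name in available:
            return ("site_rules", name)
    return ("autodetect", None)


if __name__ == "__main__":
    configs = {"bbc.co.uk.txt"}  # hypothetical set of config files on disk
    print(choose_strategy("http://www.bbc.co.uk/news/business-15060862", configs))
    print(choose_strategy("http://example.org/post", configs))
```

The point of the sketch is the order of operations: a site-specific rule file is preferred when one matches, and automatic content detection is only the fallback.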
### Contributing changes
+We run automated tests on these files to detect issues. If you'd like to help keep these up to date, please look at the [test results](http://siteconfig.fivefilters.org/test/) and see which files you'd like to contribute fixes for.
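
For a sense of what such a check might involve, here is a minimal Python sketch that reads the `test_url` and `test_contains` directives from a config file and reports which expected phrases are missing from an extracted article. This is not our actual test runner: the helper names are hypothetical, and the sketch assumes each `test_contains` line applies to the `test_url` directly above it.

```python
def parse_tests(config_text):
    """Group each test_url with the test_contains lines that follow it."""
    tests = []
    for raw in config_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, sep, value = line.partition(":")
        if not sep:
            continue  # not a "key: value" directive
        key, value = key.strip(), value.strip()
        if key == "test_url":
            tests.append({"url": value, "contains": []})
        elif key == "test_contains" and tests:
            tests[-1]["contains"].append(value)
    return tests


def missing_snippets(test, extracted_text):
    """Return the expected phrases that do not appear in the extracted text."""
    return [s for s in test["contains"] if s not in extracted_text]


if __name__ == "__main__":
    sample = (
        "body: //article\n"
        "test_url: http://example.org/a\n"
        "test_contains: first phrase\n"
        "test_contains: second phrase\n"
    )
    for test in parse_tests(sample):
        print(test["url"], missing_snippets(test, "text with the first phrase only"))
```

Fetching the page and running the extraction are left out on purpose; the sketch only shows how the two directives pair up and how a failing file could be spotted.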
+
We chose GitHub for this set of files because they offer one feature which we hope will make contributing changes easier: [file editing](https://github.com/blog/844-forking-with-the-edit-button) through the web interface.
You can now make changes to any of our site config files and request that your changes be pulled into the main set we maintain. This is what GitHub calls the Fork and Pull model:
> And, recognizing that your efforts could be useful to a wide range of other tools and services, I'll make the list of all of these site-specific configurations available to the public, free, with no strings attached.
-Most of the extraction rules in our set are borrowed from Instapaper. You can see the list maintained by Instapaper at [instapaper.com/bodytext/](http://instapaper.com/bodytext/) (login required).
+Most of the extraction rules in our set are borrowed from Instapaper. You can see the list maintained by Instapaper at [instapaper.com/bodytext/](http://instapaper.com/bodytext/) (no longer available since Instapaper was sold).
### Testing site config files
body: //section[@class='content']
date: //span[1]
author: //h1[@id='sitetitle']
-test_url: https://alexduner.com/blog/2013/1/something-i-learned-today
\ No newline at end of file
+test_url: http://alexduner.com/blog/something-i-learned-today
+body: //section[@class='main_cont']/img | //div[@class='articleContent']
+title: //div[@class='blog_top_left']//h2
author: //a[@class='b'][1]
date: substring-after(substring-before(//div, 'Posted in'), ' on ')
strip_image_src: /content/images/globals/
single_page_link: concat('http://www.anandtech.com/print/', substring-after(//meta[@property='og:url']/@content, '/show/'))
-test_url: http://www.anandtech.com/show/5812/eurocom-monster-10-clevos-little-monster/
\ No newline at end of file
+test_url: http://www.anandtech.com/show/8370/gigabyte-am1m-s2h-review
+test_url: http://www.anandtech.com/show/8402/sandisk-releases-ultra-ii-ssd-the-second-tlc-nand-ssd-in-the-market
+test_url: http://www.anandtech.com/show/8400/arms-cortex-m-even-smaller-and-lower-power-cpu-cores
--- /dev/null
+# Author: zinnober
+
+prune: no
+
+title: substring-before(//div[@id='content']/h1, ',')
+
+single_page_link: //a[@title='Seite drucken']
+
+body: //div[@id='detail-body']
+
+replace_string(<span class="description">): <em>
+replace_string(<p class="leadtext"><small>): <p class="leadtext">
+
+# Fix headlines
+replace_string(Patrick Hollstein):
+replace_string(APOTHEKE ADHOC):
+replace_string(dpa):
+replace_string(Katharina Lübke):
+replace_string(Julia Pradel):
+replace_string(Franziska Gerhardt):
+
+test_url: http://www.apotheke-adhoc.de/nachrichten/politik/nachricht-detail-politik/deutscher-apothekertag-antraege-gegen-lieferengpaesse-2/
+
strip: //div[@class='pager']
next_page_link: //nav//a[span/@class='next']/@href
+native_ad_clue: //meta[@property="og:url" and contains(@content, '/sponsored/')]
+
test_url: http://arstechnica.com/tech-policy/news/2012/02/gigabit-internet-for-80-the-unlikely-success-of-californias-sonicnet.ars
test_url: http://arstechnica.com/apple/2005/04/macosx-10-4/
--- /dev/null
+title: //div[@class='col-center']/h1
+author: //div[@class='personality']/a
+date: //div[@class='personality-date']
+body: //div[@class='content-top ']//div[@class='content'][1] | //div[contains(@class,'article-body')] | //div[contains(@class,'main-article')]
+
+next_page_link: //div[@id='review-link']/a
+
+strip: //div[@class='author-block']
+strip: //p//iframe[contains(@src,'signup')]/preceding::p[1]
+
+test_url: http://www.autocar.co.uk/car-review/volkswagen/golf
+test_url: http://www.autocar.co.uk/car-news/pebble-beach/saleen-unveils-performance-electric-vehicle-based-tesla-model-s
+test_url: http://www.autocar.co.uk/car-review/rolls-royce/first-drives/rolls-royce-ghost-series-ii-first-drive-review
#strip: //div[@class="story-feature narrow"]
#strip: //div[@class="story-feature wide"]
#strip: //div[@class="story-feature dslideshow-enclosure"]
-strip: //div[contains(@class, "story-feature")]
+strip: //div[contains(@class, "story-feature") and not(contains(@class, 'full-width'))]
strip: //span[@class="story-date"]
#strip: //div[@class="caption body-narrow-width"]
strip: //div[@class="warning"]//p
strip: //div[contains(@class, 'share-tools')]
strip: //div[@id='also-related-links']
+strip_id_or_class: share-help
+strip_id_or_class: comments_module
+
replace_string(<noscript>): <div>
replace_string(</noscript>): </div>
+tidy: no
prune: no
dissolve: //h2
+
test_url: http://www.bbc.co.uk/sport/0/football/23224017
+test_contains: Swansea City have completed the club-record signing
+
test_url: http://www.bbc.co.uk/news/business-15060862
+test_contains: Europe's leaders are meeting again to try to solve
+
+# news feed
+test_url: http://feeds.bbci.co.uk/news/rss.xml
+# sports feed
+test_url: http://feeds.bbci.co.uk/sport/0/football/rss.xml?edition=int
# video entry
-test_url: http://www.bbc.co.uk/news/world-asia-22056933
\ No newline at end of file
+test_url: http://www.bbc.co.uk/news/world-asia-22056933
--- /dev/null
+body: //div[@class="story-body"]
+# for video entries
+body: //div[contains(@class, "videoInStory") or @id="meta-information"]
+title: //h1[@class="story-header"]
+date: //span[@class="story-date"]/span[@class='date']
+# for sport site
+date: //meta[@name='DCTERMS.created']/@content
+author: //div[@id='headline']//span[@class='byline-name']
+
+# recipes, e.g. http://www.bbc.co.uk/food/recipes/mymincepies_71055
+body: //div[contains(@class, 'hrecipe')]//div[@id='subcolumn-1']
+
+#strip: //div[@class="story-feature narrow"]
+#strip: //div[@class="story-feature wide"]
+#strip: //div[@class="story-feature dslideshow-enclosure"]
+strip: //div[contains(@class, "story-feature") and not(contains(@class, 'full-width'))]
+strip: //span[@class="story-date"]
+#strip: //div[@class="caption body-narrow-width"]
+strip: //div[@class="warning"]//p
+strip: //div[@id='page-bookmark-links-head']
+strip: //object
+strip: //div[contains(@class, "bbccom_advert_placeholder")]
+strip: //div[contains(@class, "embedded-hyper")]
+strip: //div[contains(@class, 'market-data')]
+strip: //a[contains(@class, 'hidden')]
+strip: //div[contains(@class, 'hypertabs')]
+strip: //div[contains(@class, 'related')]
+strip: //form[@id='comment-form']
+strip: //div[contains(@class, 'comment-introduction')]
+strip: //div[contains(@class, 'share-tools')]
+strip: //div[@id='also-related-links']
+
+strip_id_or_class: share-help
+strip_id_or_class: comments_module
+
+replace_string(<noscript>): <div>
+replace_string(</noscript>): </div>
+
+native_ad_clue: //meta[@property="og:url" and contains(@content, '/sponsored/')]
+
+tidy: no
+prune: no
+
+dissolve: //h2
+
+test_url: http://www.bbc.com/sport/0/football/28918021
+test_contains: Cameroonian footballer Albert Ebosse has died
+
+test_url: http://www.bbc.com/sport/0/football/23224017
+
+test_url: http://www.bbc.com/news/business-15060862
+test_contains: Europe's leaders are meeting again to try
+
+
+# news feed
+test_url: http://feeds.bbci.co.uk/news/rss.xml
+# sports feed
+test_url: http://feeds.bbci.co.uk/sport/0/football/rss.xml?edition=int
+# video entry
+test_url: http://www.bbc.com/news/world-asia-22056933
--- /dev/null
+body: //div[@id='column_1']
+next_page_link: //div[@class='next']/a[not(contains(@href, '/comments') or contains(@href, '/news/'))]
+prune: no
+
+author: substring-after(//p[@class='byline'], 'by ')
+date: substring-before(substring-after(//p[@class='byline'], 'on '), ' by')
+
+strip: //h1
+strip_id_or_class: socialLinks
+strip_id_or_class: byline
+strip_id_or_class: pageSelector
+strip_id_or_class: articleTabs
+strip_id_or_class: pageNav
+strip_id_or_class: share
+strip_id_or_class: commentsContainer
+strip_id_or_class: below_article_related
+
+test_url: http://www.bit-tech.net/hardware/storage/2014/08/13/ocz-arc-100-240gb-review/1
+test_url: http://www.bit-tech.net/news/bits/2014/08/15/google-trojan/1
--- /dev/null
+body: //div[contains(@class, 'article_pages')]
+
+strip_id_or_class: article_page-header
+strip_id_or_class: paginator
+strip_id_or_class: article_info
+
+find_string: src="data:image
+replace_string: ignore-src="data:image
+find_string: data-defer-src="
+replace_string: src="
+
+prune: no
+
+test_url: http://bleacherreport.com/articles/feed
+test_url: http://bleacherreport.com/articles/2137787-christian-ponders-newborn-daughter-was-named-after-fsu-legend-bobby-bowden
+test_url: http://bleacherreport.com/articles/2137596-college-football-week-1-picks-unlv-runnin-rebels-vs-arizona-wildcats/
\ No newline at end of file
--- /dev/null
+# Author: zinnober
+
+tidy: no
+prune: no
+
+# Set author
+author: //a[@rel='author']
+
+# Set date
+date: //span[@class='Datum']
+
+# Content is here
+body: //div[@class='Artikel']
+
+# Tidy up before article
+strip: //div[@id='FAZHeaderNeu']
+strip: //h2[@itemprop='headline']
+strip: //span[@class='Datum']
+strip: //span[@class='Autor']
+strip_id_or_class: ArticlePagerTop
+strip: //div[@class='FAZArtikelEinleitung']/h2
+
+# General cleanup
+strip: //div[@class='clear']
+strip: //span[@class='Bildnachweis']
+strip: //iframe
+strip_id_or_class: Community
+strip: ' · '
+
+# Remove tracking and ads
+strip_image_src: /l.gif?
+strip: //img[@width='1']
+strip_id_or_class: invisible
+strip_id_or_class: Anzeige
+strip_id_or_class: billboard
+
+# Remove clutter after article
+strip_id_or_class: Tagline
+strip_id_or_class: ArtikelAbbinder
+strip_id_or_class: FAZArtikelKommentare
+strip_id_or_class: ArtikelKommentieren
+strip_id_or_class: FAZContentRight
+
+# Try it yourself
+test_url: http://blogs.faz.net/wost/2014/08/17/viel-fuck-und-wenig-guter-sex-1239/
strip: //div[starts-with(@id, 'sumario') and contains(., 'más información')]
strip: //div[@id='coment' or @id='foros_not']
-test_url: http://elpais.com/elpais/2012/02/06/gente/1328526783_491687.html
-test_url: http://www.elpais.com/articulo/cultura/mano/retrato/materia/elpepicul/20120207elpepicul_2/Tes
+test_url: http://brasil.elpais.com/brasil/2014/10/15/politica/1413334841_878730.html
+test_contains: O PT quer intensificar a presença do ex-presidente
+
+test_url: http://brasil.elpais.com/brasil/2014/10/13/internacional/1413225730_450761.html
+test_contains: Todos na localidade onde ele nasceu ainda falavam da façanha
-# story has several pages, should be detected
-body: //div[@id='storyBody']
-body: //div[@id='article_body']
-body: //div[@id='story_body']
+# include the lead graphic in the body, if available
+body: //div[contains(concat(' ', normalize-space(@id), ' '), ' lead_graphic ')] | //div[contains(concat(' ', normalize-space(@itemprop), ' '), ' articleBody ')]
+title: //h1[contains(concat(' ', normalize-space(@itemprop), ' '), ' headline ')]
+date: //time[contains(concat(' ', normalize-space(@itemprop), ' '), ' datePublished ')]
-title://h1[@id='article_headline']
-
-# article author
-author: //p[@class='author']/a
-# story author(s)
-author: substring-after(//p[@class='byline'], 'By ')
-
-# article date
-date: //span[@class='published_date']
-# story date
-date: //span[@class='date']
-
-date: substring-after(//div[contains(@class,'attributor')],'on')
-strip_id_or_class: inset
-strip: //p/span[@class='photoCredit']
-strip: //h1
-
-strip_id_or_class: page_count
-strip_id_or_class: tools
-strip_id_or_class: pagination
-
-single_page_link: //li[@id='stPrint']/a
+strip_id_or_class: photo_credit
+strip_id_or_class: photo_caption
+strip_id_or_class: inline_gallery
+# pull quote, often inside a blockquote element
+strip_id_or_class: pq
+strip_id_or_class: credit
+strip_id_or_class: figcaption
+strip_id_or_class: related_item
test_url: http://www.businessweek.com/magazine/buyback-insurance-a-good-deal-for-retailers-07282011.html
-test_url: http://www.businessweek.com/articles/2012-06-06/american-pain-the-largest-u-dot-s-dot-pill-mills-rise-and-fall
\ No newline at end of file
+test_url: http://www.businessweek.com/articles/2012-06-06/american-pain-the-largest-u-dot-s-dot-pill-mills-rise-and-fall
+test_url: http://www.businessweek.com/articles/2014-07-09/american-apparel-dov-charneys-sleazy-struggle-for-control
body: //div[@data-print='body']
body: //section[@data-print='body']
+find_string: rel:bf_image_src=
+replace_string: src=
+find_string: src="data:
+replace_string: disabled_src="data:
+
+native_ad_clue: //meta[@property="article:section" and @content="Advertiser"]
+
# For various things...
strip: *[@data-print="ignore"]
-test_url: http://www.buzzfeed.com/hgrant/35-reasons-why-dogs-hate-the-holidays
\ No newline at end of file
+test_url: http://www.buzzfeed.com/hgrant/35-reasons-why-dogs-hate-the-holidays
+# Native ad
+test_url: http://www.buzzfeed.com/bravo/ways-to-up-your-online-dating-game
\ No newline at end of file
--- /dev/null
+# Author: zinnober
+
+tidy: no
+prune: no
+
+# Set title
+title: //h2
+
+date: //li[@class='time']
+
+# Set author
+author: //a[contains(@rel, 'author')]
+
+# Content is here
+body: //div[@id='content']
+
+# Tidy up before article
+strip: //div[@class='meta']
+
+# Tidy up after article
+strip_id_or_class: nr_related_placeholder
+strip_id_or_class: twitter-share-button
+strip_id_or_class: afterpost
+strip_id_or_class: tags
+
+# Try it yourself
+test_url: http://www.canonrumors.com/2014/09/chuck-westfall-talks-canon-eos-7d-mark-ii/
+test_url: http://www.canonrumors.com/2014/09/canon-cinema-eos-captures-space-in-4k-for-new-imax-3d-film/
author: //div[@class='author']
prune: no
-test_url: http://www.chomsky.info/onchomsky/2002----.htm
\ No newline at end of file
+test_url: http://www.chomsky.info/onchomsky/2002----.htm
+test_contains: The propaganda model argues
title: //div[@id='maincontent']//h1
body: //div[@id='resizeableText']
+single_page_link: concat(//link[@rel='canonical']/@href, '?sp=true')
+
test_url: http://cn.reuters.com/article/CNAnalysesNews/idCNKBS0FF0NM20140710
-test_url: http://cn.reuters.feedsportal.com/CNAnalysesNews
\ No newline at end of file
+test_url: http://cn.reuters.feedsportal.com/CNAnalysesNews
+# multipage link
+test_url: http://cn.reuters.com/article/idCNKBS0FF0UL20140710
\ No newline at end of file
-body: //div[@id='content']
+body: //div[@id='readme']
+
+test_url: http://code.fivefilters.org/full-text-rss
tidy: no
prune: no
-test_url: www.csmonitor.com/World/Middle-East/2011/1108/Imminent-Iran-nuclear-threat-A-timeline-of-warnings-since-1979/Earliest-warnings-1979-84
\ No newline at end of file
+test_url: http://www.csmonitor.com/World/Middle-East/2011/1108/Imminent-Iran-nuclear-threat-A-timeline-of-warnings-since-1979/Earliest-warnings-1979-84
tidy: no
prune: no
-test_url: da.feedsportal.com/c/585/f/413794/s/17037b5a/l/0L0Stelegraaf0Bnl0Cbinnenland0C10A2757860C0I0IKlacht0Itegen0Idr0B0IFrank0Iniet0I0Eontvankelijk0I0I0Bhtml0Dcid0Frss/ia1.htm
\ No newline at end of file
+test_url: http://da.feedsportal.com/c/585/f/413794/s/17037b5a/l/0L0Stelegraaf0Bnl0Cbinnenland0C10A2757860C0I0IKlacht0Itegen0Idr0B0IFrank0Iniet0I0Eontvankelijk0I0I0Bhtml0Dcid0Frss/ia1.htm
--- /dev/null
+# Author: zinnober
+
+tidy: no
+prune: no
+
+# Set title
+title: //header/h1
+
+# Set author
+author: //a[@rel='author']
+
+# Content is here
+body: //article
+
+# Tidy up before article
+strip: //header
+
+# Tidy up article
+strip: //div[contains(@id, 'gallery-')]
+replace_string(<a rel="attachment): <p rel="attachment
+
+
+# Tidy up after article
+strip: //div[@class='sm']
+strip_id_or_class: related
+strip_id_or_class: comments
+strip: //footer
+
+# Try it yourself
+test_url: http://www.designsponge.com/2010/06/seattle-design-guide.html
+test_url: http://www.designsponge.com/2012/04/sneak-peek-liz-cook.html
body: (//div[starts-with(@id, 'post_message')])[1]
prune: no
-tidy: no
\ No newline at end of file
+tidy: no
+
+test_url: http://www.desitvforum.net/forum/watch-online/431739-creature-3d-2014-watch-online-download-dvd-rip.html
--- /dev/null
+# Author: zinnober
+
+prune: yes
+tidy: yes
+
+title: //h1
+date: //p[@class='news_datum']
+author: //span[@class='author']
+
+body: //div[@class='tagesnews-content']
+
+# General cleanup
+strip_id_or_class: dachzeile
+strip: //h3
+strip: //p[@class='bodytext']//a
+strip_id_or_class: autor_datum
+strip_id_or_class: comments
+strip_id_or_class: banner-
+
+strip: //p[contains(., 'Lesen Sie')]
+strip: //p[contains(., '– in DAZ')]
+
+# Fix image captions
+replace_string(<p class="image_caption">): <p><small><em>
+replace_string(</dd>): </em></small></dd>
+
+test_url: http://www.deutsche-apotheker-zeitung.de/pharmazie/news/2014/09/03/weniger-nebenwirkungen-aber-kein-zusatznutzen/13715.html
+test_url: http://www.deutsche-apotheker-zeitung.de/recht/news/2014/09/02/urteile-zum-cannabis-eigenanbau-bfarm-geht-in-berufung/13716.html
+
-title: //h1[@id='query_h1']
-body: //div[contains(@class, 'lunatext results_content')]
-strip_id_or_class: spl_unshd
-#replace_string(<div class="dicTl">): <div class="dicTl">------------------<br />
+body: //div[contains(@class, 'source-data')]
+strip: //button
prune: no
-test_url: http://www.wired.com/cloudline/2011/10/meet-arms-cortex-a15-the-future-of-the-ipad-and-possibly-the-macbook-air/
\ No newline at end of file
+test_url: http://dictionary.reference.com/browse/propaganda
-single_page_link: //a[@id='download_button_link']
\ No newline at end of file
+single_page_link: //a[@id='download_button_link']
+
+test_url: https://www.dropbox.com/s/qmocfrco2t0d28o/Fluffbeast.docx
--- /dev/null
+# Author: Marvin Dickhaus <github@marvindickhaus.de>
+# 2014-10-08
+
+#Tidy just messes up the DOM
+tidy: no
+
+title: //h1
+body: //h2 | //div[@id='artikelteaser'] | //div[@id='artikeltext']
+
+#Strip
+strip_image_src: artikel_a_merken.gif
+strip: //div[@class='zusatzinfo']
+
+#Author: substring is used to remove the " Von " prefix.
+author: substring(//li[@class='artikelautor'], 5)
+
+date: //li[@class='artikeldatum']
+
+#The first two URLs will at some point no longer show
+#the full article. There is a time-based paywall
+#installed. Using the feed should present valid output
+test_url: http://www.echo-online.de/art1231,5503063
+test_url: http://www.echo-online.de/art1168,5502598
+test_url: http://www.echo-online.de/rss/darmstadt.xml
body: //div[@class='main-content']
+body: //article[contains(@class, 'resp-node')]
date: //time[@class='date-created']
strip: //aside
prune: no
autodetect_next_page: no
-test_url: http://www.economist.com/node/21528429
\ No newline at end of file
+test_url: http://www.economist.com/node/21528429
+
+test_url: http://www.economist.com/news/essays/21623373-which-something-old-and-powerful-encountered-vault
+test_contains: the calfskin pages are smooth
+test_contains: Books will evolve online and off
-body: //div[ @class='content' ] | //div[ @class='blog-entry' ]
+body: //p[@class='strapline'] | //div[@class='cover-image'] | //article[@class='hd']
+strip: //div[@class='social top']
+strip: //p[@class='byline']
-strip: //h2/abbr | //div[ @class='lowleader' ] | //*[ @class='discussion' ] | //img[ @class='play-button' ] | //div[ @class='boxout' ] | //h2/a | //h2 | //h2/div | //p[ @class='timestamp' ] | //a[ @class='eurogamer-author' ] | //p[ @class='aPager' ] | //h1 | //div[ @id='lowleader' ] | //a[ @class='next' ] | //div[contains(concat(' ', normalize-space(@class), ' '), ' pullquote ')]
+date: //span[@itemprop='datePublished']
+author: //a[@itemprop='author']/text()
-date://p[ @class='timestamp' ]
-
-author://a[ @class='eurogamer-author' ]
-test_url: http://www.eurogamer.net/articles/digitalfoundry-vs-unreal-engine-4
\ No newline at end of file
+test_url: http://www.eurogamer.net/articles/2014-08-20-bungie-ordered-to-return-shares-to-composer-marty-odonnell
+test_url: http://www.eurogamer.net/articles/2014-08-20-invisible-inc-does-espionage-justice
body: //div[@id='imagestage']
+body: //div[contains(@class, 'userContentWrapper')]
+
+strip_id_or_class: commentable
+
prune: no
tidy: no
-test_url: https://www.facebook.com/feeds/page.php?id=338077742912613&format=rss20
\ No newline at end of file
+# single_page_link: replace(substring-after(//noscript//meta[@http-equiv="refresh"]/@content, 'URL='), "&amp;", "&")
+
+test_url: https://www.facebook.com/permalink.php?story_fbid=10154584776550183&id=294468630182
+test_contains: holding an extraordinary session in Brussels this month
strip: //div[@id='y-article-related']
strip: //div[@id='ypf-article-related']
prune: no
+tidy: no
single_page_link: //div[@class='ft']//a[contains(@href, 'page=all')]
-test_url: http://sg.finance.yahoo.com/news/Motorola-takes-wraps-249-rsg-3508842732.html?x=0&.v=1
-test_url: http://finance.yahoo.com/news/super-young-retirement-savers.html
\ No newline at end of file
+test_url: http://finance.yahoo.com/news/canadian-orebodies-gives-notice-exercise-130000032.html
\ No newline at end of file
body: //div[@class='entry']
-test_url: http://www.fivechapters.com/2010/paris-part-one/
\ No newline at end of file
+test_url: http://www.fivechapters.com/2014/the-saddest-writer-in-america-part-two/
-prune: no
\ No newline at end of file
+body: //section[contains(@class, 'container')]
+prune: no
+
+test_url: http://fivefilters.org/kindle-it/
title: //div[@class='translateHead']//h1 | //div[@id='art-mast']//h1
author: substring-after(//span[@id='by-line'], 'BY ')
date: //span[@id='pub-date']
-body: //div[@id='art-mast']/h2 | //div[@class='translateBody'] | //div[@id='art-body']
+body: (//article//img[contains(@class, 'main_photo')])[1] | (//article//div[contains(@class, 'full_post_content')])[1]
+#body: //div[@id='art-mast']/h2 | //div[@class='translateBody'] | //div[@id='art-body']
#Strip inside article content
strip: //div[@id='share-box']
-strip: //div[@id='special-box']
+strip: //div[@id='special-box']
+
+strip_id_or_class: side_panel
prune: no
single_page_link: //span[@id='controls']/a[contains(@href, 'print=yes')]
single_page_link: //a[text()='SINGLE PAGE']
+test_url: http://www.foreignpolicy.com/articles/2014/07/22/the_end_game_in_gaza_netanyahu_hamas
test_url: http://www.foreignpolicy.com/articles/2011/08/01/a_murderers_manifesto_and_me
test_url: http://www.foreignpolicy.com/articles/2012/02/29/five_years_in_damascus
\ No newline at end of file
-# Jens Kohl, jens.kohl@...
-# - Added publication date
-# - Striped pagination block
-# - Added single page link
-# - Added xpath-querys for the printer friendly version
+# Author: zinnober
+# Rewrite of original template which fetched the printer-version without pictures
-title: //h1
-body: //div[@class='formatted']
+tidy: no
prune: no
-date: substring-after(//li[2][@class="text1"], 'Datum:')
-strip: //ol[@class="list-chapters"]
-strip_comments: yes
-
-# next: commands for printer friendly pages
-single_page_link: //a[contains(@href, 'print.php?a=')]/@href
-title: //body/h3
-strip_image_src: staticrl/images/logo.jpg
-strip_image_src: http://cpx.golem.de/cpx.php?class=7
-strip: //body/h3
-strip: //body/b[1]
-strip: //body/b[2]
-strip: //body/b[3]
-strip: //div[1]
-test_url: http://www.golem.de/1112/88696.html
\ No newline at end of file
+# Set full title
+title: //h1
+
+date: //time
+
+# Content is here
+body: //article
+
+# Fetch full multipage articles
+next_page_link: //a[@id='atoc_next']
+
+# Remove tracking and ads
+strip_id_or_class: iqadtile4
+
+# General Cleanup
+strip_id_or_class: list-jtoc
+strip_id_or_class: table-jtoc
+strip_id_or_class: implied
+strip_id_or_class: social-
+strip_id_or_class: comments
+strip_id_or_class: footer
+
+# Tidy up galleries (could still be improved, though)
+strip: //img[@src='']
+
+# Try yourself
+test_url: http://www.golem.de/news/intel-core-i7-5960x-im-test-die-pc-revolution-beginnt-mit-octacore-und-ddr4-1408-108893.html
+test_url: http://www.golem.de/news/test-infamous-first-light-neonbunter-actionspass-1408-108914.html
-#second part of single_page_link for telepolis-articles (desktop-version of site)
-single_page_link: //p[@class='news_option']/a | //a[@id='tp-druckversion']
+# Author: zinnober
+# Template should work well with either desktop or mobile version (m.heise.de)
+prune: no
+
+title: //article/h1 | //h1
date: //p[@class='news_datum']
-title: //h1
-body: //div[@class='meldung_wrapper']
+author: //h4[@class='author']
+
+body: //article | //div[@class='meldung_wrapper']
+
+# General cleanup
+strip: //time
+strip: //h4[@class='author']
+strip: //p[@class='news_datum']
+strip: //p[@class='artikel_datum']
+strip: //a[contains(@href, 'mailto')]
+strip_id_or_class: comments
+strip_id_or_class: ISI_IGNORE
+strip_id_or_class: clear
+
+strip_id_or_class: linkurl_grossbild
+strip_id_or_class: image-num
+strip_id_or_class: heisebox_right
+strip_id_or_class: dossier
+
+# Strip Ads
+strip_id_or_class: ad_
+
+# Some optimizations
+replace_string(<h5>): <h2>
+replace_string(</h5>): </h2>
+replace_string(<span class="bild_rechts"): <p
+replace_string(<div class="heisebox">): <blockquote>
+
+
+next_page_link: //a[@class='next']
+next_page_link: //a[@title='vor']
-test_url: http://www.heise.de/newsticker/meldung/Europa-soll-Grundrechteschutz-im-Netz-staerken-1392664.html
-test_url: http://www.heise.de/tp/artikel/42/42579/1.html
+test_url: http://www.heise.de/open/artikel/Die-Neuerungen-von-Linux-3-15-2196231.html
+test_url: http://m.heise.de/open/artikel/Die-Neuerungen-von-Linux-3-15-2196231.html
+test_url: http://www.heise.de/newsticker/meldung/Ueberwachungstechnik-Die-globale-Handy-Standortueberwachung-2301494.html
tidy: no
strip_image_src: analytics.apnewsregistry
-test_url: http://hosted.ap.org/dynamic/stories/U/US_SPENDING_SHOWDOWN?SITE=FLPET&SECTION=HOME&TEMPLATE=DEFAULT&CTIME=2011-04-06-07-46-50
\ No newline at end of file
+test_url: http://hosted.ap.org/dynamic/stories/E/EU_TURKEY_KURDS?SITE=KSNEW&SECTION=HOME&TEMPLATE=DEFAULT&CTIME=2014-10-14-10-50-25
--- /dev/null
+body: //div[@id='left-stack' or contains(@class, 'center-stack')]
+
+find_string: class="artwork" src="
+replace_string: class="artwork" src-disabled="
+find_string: src-swap-high-dpi="
+replace_string: src="
+
+strip_id_or_class: rating
+strip_id_or_class: listeners-also-bought
+
+prune: no
+
+test_url: https://itunes.apple.com/us/rss/topaudiobooks/limit=10/xml
+test_url: https://itunes.apple.com/us/audiobook/the-giver-unabridged/id356345850
\ No newline at end of file
tidy: no
test_url: http://www.kachiblog.com/2013/05/samsung-galaxy-s4-vs-samsung-galaxy.html
-test_url: http://www.kachiblog.com/feeds/posts/default
\ No newline at end of file
+test_url: http://www.kachiblog.com/feed
--- /dev/null
+title: //div[@itemprop='headline']
+body: //noscript/img | //div[@itemprop='text']
+author: //div[@class='meta meta--post']//a[@class='is-author']
+date: //div[@class='meta meta--post']//time/@datetime
+
+test_url: http://www.lifehacker.co.uk/2014/08/22/dealhacker-10-google-chromecast-super-cheap-batteries-much
+test_url: http://www.lifehacker.co.uk/2014/08/18/andrognito-hides-files-youd-like-keep-away-prying-eyes
#Comments
strip: //table
strip: //p/following-sibling::*[0]
-test_url: http://www.mainpost.de/ueberregional/meinung/Dioxin-Skandal-bringt-Agrarministerin-in-Bedraengnis;art9517,5920211
\ No newline at end of file
+test_url: http://www.mainpost.de/regional/wuerzburg/Autobahnschuetze-Staatsanwalt-fordert-zwoelf-Jahre;art492151,8386332
strip_id_or_class: article-tools
strip_id_or_class: pagenav
prune: no
-test_url: http://www.medialens.org/index.php/alerts/alert-archive/2012/713-the-illusion-of-democracy.html
\ No newline at end of file
+test_url: http://www.medialens.org/index.php/alerts/alert-archive/2012/713-the-illusion-of-democracy.html
+test_contains: In an era of permanent war, economic meltdown
-body: //div[contains(@class, 'post-content-inner')]
-strip_id_or_class: follow-ups
-strip_id_or_class: footer
+body: //div[contains(@class, 'postContent-inner')]
+strip_id_or_class: supplementalPostContent
prune: no
-test_url: https://medium.com/p/6844c0d7893b
\ No newline at end of file
+test_url: https://medium.com/@savolai/kaytettavyyden-haasteet-keskustelukulttuurista-2-3-6844c0d7893b
+test_contains: Jos käytettävyysongelmat ovat kerran niin tyypillisiä
+test_contains: Keskustelukulttuuriongelmasta (subjective vs. objective bugs)
+
+test_url: https://medium.com/health-the-future/thirty-things-ive-learned-482765ee3503
+test_contains: Remember you will die
+test_contains: You have to have some faith.
--- /dev/null
+strip: //div[contains(@style, 'float:right') and contains(., 'advertisement')]
+body: //div[@style="float:left;width:740px;"]
+
+tidy: no
+
+test_url: http://www.menshealth.com.sg/fitness/mh-picks-under-armour-clutchfit-nitro-mid-cleats
+test_contains: These cleats are made for one thing
+
+test_url: http://www.menshealth.com.sg/fitness/top-10-fat-burning-bodyweight-moves-you-can-do-10-minutes
+test_contains: let this workout fool you
+
+test_url: http://www.menshealth.com.sg/fitness/feed
\ No newline at end of file
strip_id_or_class: ezc_comments
strip_comments: yes
-test_url: http://www.northumberlandview.ca/index.php?module=news&func=display&sid=5972
\ No newline at end of file
+test_url: http://www.northumberlandview.ca/index.php?module=news&type=user&func=display&sid=31127
author:substring-after(//h6[@class='byline'],'By ')
test_url: http://www.nytimes.com/2011/07/24/books/review/an-academic-authors-unintentional-masterpiece.html
+test_contains: In this column I want to look at a not uncommon way of writing
+
test_url: http://www.nytimes.com/2012/06/10/arts/television/the-newsroom-aaron-sorkins-return-to-tv.html
+test_contains: IF you’ve seen enough of Aaron Sorkin’s theater
+
test_url: http://www.nytimes.com/2013/03/25/world/middleeast/israeli-military-responds-after-patrols-come-under-fire-from-syria.html
test_url: http://www.nytimes.com/2013/08/15/nyregion/when-the-new-york-city-subway-ran-without-rails.html
test_url: http://www.nytimes.com/2004/02/29/weekinreview/correspondence-class-consciousness-china-s-wealthy-live-creed-hobbes-darwin-meet.html
-test_url: http://www.nytimes.com/2014/06/19/opinion/gail-collins-romney-and-the-2016-contenders-huddle.html
\ No newline at end of file
+test_url: http://www.nytimes.com/2014/06/19/opinion/gail-collins-romney-and-the-2016-contenders-huddle.html
-body: //div[@id='_ctl12__ctl0_Article']
+body: //div[contains(@class, 'article-photo-wrapper')]
prune: no
-autodetect_on_failure: no
\ No newline at end of file
+
+test_url: http://www.real.gr/DefaultArthro.aspx?page=arthro&id=360962&catID=1
+test_contains: Επισήμως το αποψινό υπουργικό
# this doesn't work for some reason...?
date: //p[@class="tagline"]//@datetime
-body: //div[@class="expando"]//div[@class="usertext-body"]
+body: (//div[contains(@class, 'noncollapsed')]//div[contains(@class, 'usertext-body')])[1]
strip_id_or_class: tagline
strip_id_or_class: unvotable-message
single_page_link: //p[@class="title"]/a[contains(@href, 'http://')]
test_url: http://www.reddit.com/r/truegaming/comments/wfe7r/i_wrote_about_the_problems_i_honestly_feel_that/
-test_url: http://www.reddit.com/r/worldnews/comments/1as37r/twelve_north_korean_soldiers_attempting_to_defect/
\ No newline at end of file
+test_url: http://www.reddit.com/r/worldnews/comments/1as37r/twelve_north_korean_soldiers_attempting_to_defect/
+test_url: http://www.reddit.com/r/WritingPrompts/comments/2786lw/wp_in_a_world_where_puns_are_illegal_one_man/chybk8e
\ No newline at end of file
-body: //div[@class="storyBox"]
+body: //div[contains(concat(' ',normalize-space(@class),' '),' article ') and (contains(concat(' ',normalize-space(@class),' '),' clear '))]
title: //div[@class="storyBox"]/h1
author: //a[@rel="author"]
date: substring-before(//span[@class="dateline"], 'by')
#grab the actual content div
body: //div[@class='rt-article']
-test_url: http://www.sourcebooks.com/next/sourcebooks-next-our-blog/1601-another-piece-of-the-e-puzzle-or-when-good-ebook-promotions-go-bad.html
\ No newline at end of file
+test_url: http://www.sourcebooks.com/blog/happy-27th-birthday-sourcebooks.html
--- /dev/null
+body: //div[contains(@class, 'story-text')]
+
+strip_id_or_class: related
+
+test_url: http://www.tabletmag.com/jewish-news-and-politics/181181/mossberg-parallel-states?all=1
\ No newline at end of file
--- /dev/null
+# Author: zinnober
+# Should work with "normal" articles as well as with image galleries
+
+prune: no
+
+# Title
+title: //h1/span[@class='hcf-headline']
+
+# Set author
+author: //a[@rel='author']
+
+# Set date
+date: //span[@class='date hcf-atlas']
+
+# Fetch full multipage articles
+next_page_link: //a[contains(@class, 'hcf-forward')]
+
+# Content is here
+body: //article
+body: //div[contains(@class, 'hcf-screen')]
+
+# Remove tracking and ads
+strip_id_or_class: hcf-ad
+strip_id_or_class: hcf-autoload-ad
+strip_id_or_class: hcf-content-ad
+
+# Tidy up before article
+strip: //article/h1
+strip_id_or_class: hcf-atlas
+strip_id_or_class: hcf-author
+strip_id_or_class: date hcf-atlas
+
+# General cleanup
+strip: //div[contains(@class, 'hcf-screen')]//h1
+strip: //div[@class='hcf-subpage-titles']//ul
+strip_id_or_class: hcf-doctype-media
+strip_id_or_class: hcf-inline-gallery
+strip_id_or_class: hcf-doctype-video
+strip_id_or_class: hcf-links
+strip_id_or_class: hcf-mini-navi
+strip_id_or_class: hcf-media-control
+strip_id_or_class: hcf-hidden
+replace_string(<span class="hcf-update">Update</span>): <strong>Update: </strong>
+
+# Fix pictures and captions
+replace_string(<a class="hcf-doctype-gallery): <p class="hcf-doctype-gallery
+replace_string(<a class="hcf-doctype-enlarge): <p class="hcf-doctype-enlarge
+replace_string(<figcaption class="hcf-caption">): <br><small><em>
+replace_string(</figcaption>): </em></small>
+
+# Fix image galleries
+replace_string(<a class=" ajaxify): <p class="ajaxify
+replace_string(<div class="hcf-caption"><div><p>): <small><em>
+
+# Try it yourself
+test_url: http://www.tagesspiegel.de/berlin/bezirke/wedding/wedding-jetzt/auf-der-suche-nach-einem-stadtteil-wilder-weiter-wedding/8757156.html
+test_url: http://www.tagesspiegel.de/berlin/olympia-in-berlin-der-flughafen-tegel-soll-das-olympische-dorf-werden/10645036.html
+test_url: http://www.tagesspiegel.de/mediacenter/fotostrecken/berlin/bildergalerie-kreuzberger-der-woche/9305534.html
+
single_page_link_in_feed: //b/a
-test_url_feed: http://www.techmeme.com/feed.xml
\ No newline at end of file
+test_url: http://www.techmeme.com/feed.xml
single_page_link: //article//a[contains(@class, 'print')]
+native_ad_clue: //meta[@property="og:url" and contains(@content, '/sponsored/')]
+
test_url: http://www.theatlantic.com/technology/archive/2011/04/want-to-see-how-crazy-a-bot-run-market-can-be/237773/
test_url: http://www.theatlantic.com/magazine/archive/2007/11/the-autumn-of-the-multitaskers/6342/
test_url: http://www.theatlantic.com/entertainment/archive/2012/04/30-rock-live-a-funny-reminder-of-why-sitcoms-arent-shot-live-anymore/256447/
\ No newline at end of file
+body: //div[contains(@class, 'entry-content')]//div[contains(@class, 'column-2')]
single_page_link: //div[contains(@class, 'pagination')]//a[contains(@title, 'ingle page')]
+strip_id_or_class: entry-related
+strip_id_or_class: entry-sidebar
+strip_id_or_class: entry-pagination
tidy: no
prune: no
-test_url: http://www.theglobeandmail.com/report-on-business/rob-magazine/how-a-novice-miner-survived-a-summer-in-the-klondike/article2345350/
\ No newline at end of file
+test_url: http://www.theglobeandmail.com/report-on-business/rob-magazine/how-a-novice-miner-survived-a-summer-in-the-klondike/article2345350/
+test_url: http://www.theglobeandmail.com/report-on-business/industry-news/energy-and-resources/cliffs-natural-resources-looking-to-exit-ontarios-ring-of-fire/article20651617/
\ No newline at end of file
#strip: //a[not(text())]
strip_id_or_class: pocket-btn
author: //li[@class='byline']
+native_ad_clue: //meta[@property="article:tag" and contains(@content, "Partner zone")]
+native_ad_clue: //meta[@property="video:tag" and contains(@content, "Partner zone")]
prune: no
tidy: no
+
test_url: http://www.theguardian.com/world/2013/oct/04/nsa-gchq-attack-tor-network-encryption
+test_contains: The National Security Agency has made repeated attempts to develop
+test_contains: The agency did not directly address those questions, instead providing a statement.
+
test_url: http://www.theguardian.com/world/2013/oct/03/edward-snowden-files-john-lanchester
-test_url: http://www.theguardian.com/commentisfree/2014/jun/15/britishness-search-identity-my-part-in-camerons-odyssey
\ No newline at end of file
+test_contains: In August, the editor of the Guardian rang me up and asked if I would spend a week in New York
+test_contains: As the second most senior judge in the country, Lord Hoffmann, said in 2004 about a previous version of our anti-terrorism laws
+
+test_url: http://www.theguardian.com/commentisfree/2014/jun/15/britishness-search-identity-my-part-in-camerons-odyssey
+# Native ad
+test_url: http://www.theguardian.com/sustainable-business/2014/jul/18/ben-jerry-turn-ice-cream-into-energy
strip: //img[contains(@class, 'vox-lazy-load')]
# deal with bad parsing
strip: //div[contains(@class, 'story-image')]//div[contains(., 'function(')]
+strip: //div[contains(@class, 'm-linkset')]
+strip: //div[contains(@class, 'm-entry__sidebar')]
+strip: //ul[contains(@class, 'm-article__sources')]
+strip: //div[contains(@class, 'chorus-emc__content')]
+
strip_id_or_class: gallery
strip_id_or_class: article-meta
test_url: http://www.theverge.com/2011/11/3/2534861/nokia-lumia-800-review
test_url: http://www.theverge.com/2013/2/24/4026114/barnes-noble-shifting-focus-away-from-nook-hardware
test_url: http://www.theverge.com/2014/6/19/5824072/top-shelf-living-the-dream
-test_url: http://www.theverge.com/rss/frontpage
\ No newline at end of file
+test_url: http://www.theverge.com/rss/frontpage
--- /dev/null
+# Author: zinnober
+
+tidy: no
+prune: no
+
+# Set author
+author: //a[contains(@rel, 'author')]
+
+# Content is here
+body: //article
+
+# Tidy up before article
+strip: //header
+
+# Get rid of duplicate images
+strip: //img[contains(@class, '-hidden')]
+
+# Tidy up after article
+strip_id_or_class: social-list
+strip_id_or_class: meta-info
+strip: //footer
+
+# Try it yourself
+test_url: http://www.thisiscolossal.com/2014/09/chicago-in-the-fog-by-michael-salisbury/
+test_url: http://www.thisiscolossal.com/2014/09/bird-portraits-ruffling-with-personality-by-leila-jeffreys/
--- /dev/null
+title: //div[@id='headline']
+body: //div[@class='entry_text']
+author: //div[text() = 'Author:']/following-sibling::div/a
+date: //div[text() = 'Published:']/following-sibling::div
+single_page_link: //a[@href='noscript.html']
+prune: no
+
+test_url: http://towerofthehand.com/blog/2014/08/08-pitch-this-got-spinoff/index.html
+test_url: http://towerofthehand.com/blog/2014/07/31-definitions-and-embodiments/index.html
+test_url: http://towerofthehand.com/blog/2014/07/03-hero-with-thousand-faces/index.html
prune: no
tidy: no
-test_url: https://twitter.com/medialens/status/216883678582804480
\ No newline at end of file
+test_url: https://twitter.com/medialens/status/216883678582804480
+test_contains: is all but alone in challenging the tsunami of UK
author: //div[contains(@class, 'byline')]//span[contains(@class, 'name')]
date: //div[contains(@class, 'cn_date_time')]
body: //div[contains(@class, 'pageContainers')]
+body: //div[@id='main']
body: //article[@id='items-container']
#body: //h2[@class='sub-header'] | //div[contains(@class, 'contributor-type') or @class='display-date' or @class='content-container']
single_page_link: //a[@title='Print this page']
test_url: http://www.vanityfair.com/politics/features/2011/05/egypt-revolutionaries-201105
+test_contains: nothing can take away from the miracle of Tahrir Square
+
test_url: http://www.vanityfair.com/politics/features/2008/08/hitchens200808
-test_url: http://www.vanityfair.com/style/2012/01/prisoners-of-style-201201
\ No newline at end of file
+test_url: http://www.vanityfair.com/style/2012/01/prisoners-of-style-201201
--- /dev/null
+author: //div[@id='main']//div[@class='col right']//div[contains(@class, 'attribute-author')]
+body: //div[@id='main']//div[@class='col right']
+strip_id_or_class: boxes
+strip_id_or_class: lazy
+strip_id_or_class: comment_box
+strip_id_or_class: fb_comments
+
+find_string: <noscript>
+replace_string: <div>
+find_string: </noscript>
+replace_string: </div>
+
+prune: no
+tidy: no
+
+test_url: http://www.wn.de/Muenster/Kultur/1742956-Wilm-Weppelmann-verlaesst-die-Einsiedelei-Und-dann-ab-unter-die-Dusche
+# feed
+test_url: http://www.wn.de/rss/feed/wn_muenster
\ No newline at end of file
-# 2014-10-21 [Marmo] added stripping of inline ads and appropriate test_url
# 2013.10.30 [rezor92] fixed single_page_link
# 2012-12-23 [carlo@...] fixed half-assed headlines in articles, removed inline author profiles, adjusted picture captions
# 2012-03-17 [dkless@...] Cut metadata parts in the beginning and the ends of the content block; copyright entries for pictures removed; Author fixed, not sure if old entries still valid (I left them); Weird problems with some pages addressed (see last section for removing hidden section)
strip_id_or_class: articleheader
strip: //div[@id="comments"] | //div[@class="pagination block"] | //p[@class="ressortbacklink"] | //div[@id="relatedArticles"] | //div[@class="inline portrait"]
-#Remove inline ads
-strip: //div[@class="innerad"]
#Removes author and date from the start
strip: //ul[@class="tools"]
footnotes: no
test_url: http://www.zeit.de/kultur/film/2012-12/Kurzfilmtag
-test_url: http://www.zeit.de/wissen/2014-10/ebola-nigeria-who