git.immae.eu Git - github/wallabag/wallabag.git/commitdiff
updated site_config 888/head
author Nicolas Lœuillet <nicolas@loeuillet.org>
Mon, 27 Oct 2014 05:46:13 +0000 (06:46 +0100)
committer Nicolas Lœuillet <nicolas@loeuillet.org>
Mon, 27 Oct 2014 05:46:13 +0000 (06:46 +0100)
64 files changed:
inc/3rdparty/site_config/standard/512pixels.net.txt
inc/3rdparty/site_config/standard/README.md
inc/3rdparty/site_config/standard/alexduner.com.txt
inc/3rdparty/site_config/standard/anandtech.com.txt
inc/3rdparty/site_config/standard/apotheke-adhoc.de.txt [new file with mode: 0755]
inc/3rdparty/site_config/standard/arstechnica.com.txt
inc/3rdparty/site_config/standard/autocar.co.uk.txt [new file with mode: 0755]
inc/3rdparty/site_config/standard/bbc.co.uk.txt
inc/3rdparty/site_config/standard/bbc.com.txt [new file with mode: 0755]
inc/3rdparty/site_config/standard/bit-tech.net.txt [new file with mode: 0755]
inc/3rdparty/site_config/standard/bleacherreport.com.txt [new file with mode: 0755]
inc/3rdparty/site_config/standard/blogs.faz.net.txt [new file with mode: 0755]
inc/3rdparty/site_config/standard/brasil.elpais.com.txt
inc/3rdparty/site_config/standard/businessweek.com.txt
inc/3rdparty/site_config/standard/buzzfeed.com.txt
inc/3rdparty/site_config/standard/canonrumors.com.txt [new file with mode: 0755]
inc/3rdparty/site_config/standard/chomsky.info.txt
inc/3rdparty/site_config/standard/cn.reuters.com.txt
inc/3rdparty/site_config/standard/code.fivefilters.org.txt
inc/3rdparty/site_config/standard/csmonitor.com.txt
inc/3rdparty/site_config/standard/da.feedsportal.com.txt
inc/3rdparty/site_config/standard/designsponge.com.txt [new file with mode: 0755]
inc/3rdparty/site_config/standard/desitvforum.net.txt
inc/3rdparty/site_config/standard/deutsche-apotheker-zeitung.de.txt [new file with mode: 0755]
inc/3rdparty/site_config/standard/dictionary.reference.com.txt
inc/3rdparty/site_config/standard/dropbox.com.txt
inc/3rdparty/site_config/standard/echo-online.de.txt [new file with mode: 0755]
inc/3rdparty/site_config/standard/economist.com.txt
inc/3rdparty/site_config/standard/eurogamer.net.txt
inc/3rdparty/site_config/standard/facebook.com.txt
inc/3rdparty/site_config/standard/faz.net.txt [changed mode: 0644->0755]
inc/3rdparty/site_config/standard/finance.yahoo.com.txt
inc/3rdparty/site_config/standard/fivechapters.com.txt
inc/3rdparty/site_config/standard/fivefilters.org.txt
inc/3rdparty/site_config/standard/foreignpolicy.com.txt
inc/3rdparty/site_config/standard/golem.de.txt
inc/3rdparty/site_config/standard/heise.de.txt
inc/3rdparty/site_config/standard/hosted.ap.org.txt
inc/3rdparty/site_config/standard/itunes.apple.com.txt [new file with mode: 0755]
inc/3rdparty/site_config/standard/kachiblog.com.txt
inc/3rdparty/site_config/standard/lifehacker.co.uk.txt [new file with mode: 0755]
inc/3rdparty/site_config/standard/mainpost.de.txt
inc/3rdparty/site_config/standard/medialens.org.txt
inc/3rdparty/site_config/standard/medium.com.txt
inc/3rdparty/site_config/standard/menshealth.com.sg.txt [new file with mode: 0755]
inc/3rdparty/site_config/standard/northumberlandview.ca.txt
inc/3rdparty/site_config/standard/nytimes.com.txt
inc/3rdparty/site_config/standard/real.gr.txt
inc/3rdparty/site_config/standard/reddit.com.txt
inc/3rdparty/site_config/standard/searchengineland.com.txt
inc/3rdparty/site_config/standard/sourcebooks.com.txt
inc/3rdparty/site_config/standard/tabletmag.com.txt [new file with mode: 0755]
inc/3rdparty/site_config/standard/tagesspiegel.de.txt [new file with mode: 0755]
inc/3rdparty/site_config/standard/techmeme.com.txt
inc/3rdparty/site_config/standard/theatlantic.com.txt
inc/3rdparty/site_config/standard/theglobeandmail.com.txt
inc/3rdparty/site_config/standard/theguardian.com.txt
inc/3rdparty/site_config/standard/theverge.com.txt
inc/3rdparty/site_config/standard/thisiscolossal.com.txt [new file with mode: 0755]
inc/3rdparty/site_config/standard/towerofthehand.com.txt [new file with mode: 0755]
inc/3rdparty/site_config/standard/twitter.com.txt
inc/3rdparty/site_config/standard/vanityfair.com.txt
inc/3rdparty/site_config/standard/wn.de.txt [new file with mode: 0755]
inc/3rdparty/site_config/standard/zeit.de.txt

index e458980fe0db1b7aaecd0b5c2e62cf06a1820138..02a996f79eb0c6fe2aacbb37d89d4b73417adf90 100755 (executable)
@@ -1,2 +1,2 @@
-title: substring-before(//title, '&mdash;')
-test_url: http://512pixels.net/more-on-linked-lists/
\ No newline at end of file
+title: //meta[@property='og:title']/@content
+test_url: http://www.512pixels.net/blog/2014/10/the-move
index 9040ba8522bd3abb14d41d005eba4b51201c8cc2..ab5b12d9c0f4ce267741c74cffcaf80077d85f08 100755 (executable)
@@ -1,12 +1,14 @@
 Full-Text RSS site config files
 ================
 
-[Full-Text RSS](http://fivefilters.org/content-only/), our article extraction tool, makes use of site-specific extraction rules to improve results. Each time a URL is processed, it checks to see if there are extraction rules for the site being processed. If there are no site patterns, it tries to detect the content block automatically.
+[Full-Text RSS](http://fivefilters.org/content-only/), our article extraction tool, makes use of site-specific extraction rules to improve results. Each time a URL is processed, it checks to see if there are extraction rules for the site being processed. If no rules are found, it tries to detect the content block automatically.
 
-This repository contains the site config files we use in Full-Text RSS.
+This repository contains the site-specific extraction rules we rely on in Full-Text RSS.
 
 ### Contributing changes
 
+We run automated tests on these files to detect issues. If you'd like to help keep these up to date, please look at the [test results](http://siteconfig.fivefilters.org/test/) and see which files you'd like to contribute fixes for.
+
 We chose GitHub for this set of files because they offer one feature which we hope will make contributing changes easier: [file editing](https://github.com/blog/844-forking-with-the-edit-button) through the web interface. 
 
 You can now make changes to any of our site config files and request that your changes be pulled into the main set we maintain. This is what GitHub calls the Fork and Pull model:
@@ -31,7 +33,7 @@ Marco, Instapaper's creator, graciously opened up the database of contributions
 
 > And, recognizing that your efforts could be useful to a wide range of other tools and services, I'll make the list of all of these site-specific configurations available to the public, free, with no strings attached.
 
-Most of the extraction rules in our set are borrowed from Instapaper. You can see the list maintained by Instapaper at [instapaper.com/bodytext/](http://instapaper.com/bodytext/) (login required).
+Most of the extraction rules in our set are borrowed from Instapaper. You can see the list maintained by Instapaper at [instapaper.com/bodytext/](http://instapaper.com/bodytext/) (no longer available since Instapaper was sold).
 
 ### Testing site config files
 
index bd9de9d70e66bb610c89d77599405a0f45e303d2..3897f9ec165d75a7e7b76ed0470f3bc909f9de80 100755 (executable)
@@ -1,4 +1,4 @@
 body: //section[@class='content']
 date: //span[1]
 author: //h1[@id='sitetitle']
-test_url: https://alexduner.com/blog/2013/1/something-i-learned-today
\ No newline at end of file
+test_url: http://alexduner.com/blog/something-i-learned-today
index 7d80491852a98bd5994386a06f64c75b87bcd5dd..fc95c5d8a3bf3135a2159a9faf89df6ba2cdbbc3 100755 (executable)
@@ -1,3 +1,5 @@
+body: //section[@class='main_cont']/img | //div[@class='articleContent']
+title: //div[@class='blog_top_left']//h2
 author: //a[@class='b'][1]
 date: substring-after(substring-before(//div, 'Posted in'), ' on ')
 strip_image_src: /content/images/globals/
@@ -8,4 +10,6 @@ prune: no
 
 single_page_link: concat('http://www.anandtech.com/print/', substring-after(//meta[@property='og:url']/@content, '/show/'))
 
-test_url: http://www.anandtech.com/show/5812/eurocom-monster-10-clevos-little-monster/
\ No newline at end of file
+test_url: http://www.anandtech.com/show/8370/gigabyte-am1m-s2h-review
+test_url: http://www.anandtech.com/show/8402/sandisk-releases-ultra-ii-ssd-the-second-tlc-nand-ssd-in-the-market
+test_url: http://www.anandtech.com/show/8400/arms-cortex-m-even-smaller-and-lower-power-cpu-cores
diff --git a/inc/3rdparty/site_config/standard/apotheke-adhoc.de.txt b/inc/3rdparty/site_config/standard/apotheke-adhoc.de.txt
new file mode 100755 (executable)
index 0000000..3a702e7
--- /dev/null
@@ -0,0 +1,23 @@
+# Author: zinnober
+
+prune: no
+
+title: substring-before(//div[@id='content']/h1, ',')
+
+single_page_link: //a[@title='Seite drucken']
+
+body: //div[@id='detail-body']
+
+replace_string(<span class="description">): <em>
+replace_string(<p class="leadtext"><small>): <p class="leadtext">
+
+# Fix headlines
+replace_string(Patrick Hollstein): &nbsp;
+replace_string(APOTHEKE ADHOC): &nbsp;
+replace_string(dpa): &nbsp;
+replace_string(Katharina Lübke): &nbsp;
+replace_string(Julia Pradel): &nbsp;
+replace_string(Franziska Gerhardt): &nbsp;
+
+test_url: http://www.apotheke-adhoc.de/nachrichten/politik/nachricht-detail-politik/deutscher-apothekertag-antraege-gegen-lieferengpaesse-2/
+
index 767f6800c829f21d82d10b318e8d8a1e237f2a10..eb92aa2c7a4fd31ef43054ac7e3372ff3abe9046 100755 (executable)
@@ -13,5 +13,7 @@ title: //div[@id='story']//h2[@class='title']
 strip: //div[@class='pager']
 next_page_link: //nav//a[span/@class='next']/@href
 
+native_ad_clue: //meta[@property="og:url" and contains(@content, '/sponsored/')]
+
 test_url: http://arstechnica.com/tech-policy/news/2012/02/gigabit-internet-for-80-the-unlikely-success-of-californias-sonicnet.ars
 test_url: http://arstechnica.com/apple/2005/04/macosx-10-4/
diff --git a/inc/3rdparty/site_config/standard/autocar.co.uk.txt b/inc/3rdparty/site_config/standard/autocar.co.uk.txt
new file mode 100755 (executable)
index 0000000..9f4fe18
--- /dev/null
@@ -0,0 +1,13 @@
+title: //div[@class='col-center']/h1
+author: //div[@class='personality']/a
+date: //div[@class='personality-date']
+body: //div[@class='content-top ']//div[@class='content'][1] | //div[contains(@class,'article-body')] | //div[contains(@class,'main-article')]
+
+next_page_link: //div[@id='review-link']/a
+
+strip: //div[@class='author-block']
+strip: //p//iframe[contains(@src,'signup')]/preceding::p[1]
+
+test_url: http://www.autocar.co.uk/car-review/volkswagen/golf
+test_url: http://www.autocar.co.uk/car-news/pebble-beach/saleen-unveils-performance-electric-vehicle-based-tesla-model-s
+test_url: http://www.autocar.co.uk/car-review/rolls-royce/first-drives/rolls-royce-ghost-series-ii-first-drive-review
index ef1f491ae0adc229a7db0c6c274525dfa38a2acc..bad77654b02c29981afcbb54079f77d60ab4096e 100755 (executable)
@@ -13,7 +13,7 @@ body: //div[contains(@class, 'hrecipe')]//div[@id='subcolumn-1']
 #strip: //div[@class="story-feature narrow"]
 #strip: //div[@class="story-feature wide"]
 #strip: //div[@class="story-feature dslideshow-enclosure"]
-strip: //div[contains(@class, "story-feature")]
+strip: //div[contains(@class, "story-feature") and not(contains(@class, 'full-width'))]
 strip: //span[@class="story-date"]
 #strip: //div[@class="caption body-narrow-width"]
 strip: //div[@class="warning"]//p
@@ -30,13 +30,26 @@ strip: //div[contains(@class, 'comment-introduction')]
 strip: //div[contains(@class, 'share-tools')]
 strip: //div[@id='also-related-links']
 
+strip_id_or_class: share-help
+strip_id_or_class: comments_module
+
 replace_string(<noscript>): <div>
 replace_string(</noscript>): </div>
 
+tidy: no
 prune: no
 
 dissolve: //h2
+
 test_url: http://www.bbc.co.uk/sport/0/football/23224017
+test_contains: Swansea City have completed the club-record signing 
+
 test_url: http://www.bbc.co.uk/news/business-15060862
+test_contains: Europe's leaders are meeting again to try to solve
+
+# news feed
+test_url: http://feeds.bbci.co.uk/news/rss.xml
+# sports feed
+test_url: http://feeds.bbci.co.uk/sport/0/football/rss.xml?edition=int
 # video entry
-test_url: http://www.bbc.co.uk/news/world-asia-22056933
\ No newline at end of file
+test_url: http://www.bbc.co.uk/news/world-asia-22056933
diff --git a/inc/3rdparty/site_config/standard/bbc.com.txt b/inc/3rdparty/site_config/standard/bbc.com.txt
new file mode 100755 (executable)
index 0000000..c04a683
--- /dev/null
@@ -0,0 +1,60 @@
+body: //div[@class="story-body"]
+# for video entries
+body: //div[contains(@class, "videoInStory") or @id="meta-information"]
+title: //h1[@class="story-header"]
+date: //span[@class="story-date"]/span[@class='date']
+# for sport site
+date: //meta[@name='DCTERMS.created']/@content
+author: //div[@id='headline']//span[@class='byline-name']
+
+# recipes, e.g. http://www.bbc.co.uk/food/recipes/mymincepies_71055
+body: //div[contains(@class, 'hrecipe')]//div[@id='subcolumn-1']
+
+#strip: //div[@class="story-feature narrow"]
+#strip: //div[@class="story-feature wide"]
+#strip: //div[@class="story-feature dslideshow-enclosure"]
+strip: //div[contains(@class, "story-feature") and not(contains(@class, 'full-width'))]
+strip: //span[@class="story-date"]
+#strip: //div[@class="caption body-narrow-width"]
+strip: //div[@class="warning"]//p
+strip: //div[@id='page-bookmark-links-head']
+strip: //object
+strip: //div[contains(@class, "bbccom_advert_placeholder")]
+strip: //div[contains(@class, "embedded-hyper")]
+strip: //div[contains(@class, 'market-data')]
+strip: //a[contains(@class, 'hidden')]
+strip: //div[contains(@class, 'hypertabs')]
+strip: //div[contains(@class, 'related')]
+strip: //form[@id='comment-form']
+strip: //div[contains(@class, 'comment-introduction')]
+strip: //div[contains(@class, 'share-tools')]
+strip: //div[@id='also-related-links']
+
+strip_id_or_class: share-help
+strip_id_or_class: comments_module
+
+replace_string(<noscript>): <div>
+replace_string(</noscript>): </div>
+
+native_ad_clue: //meta[@property="og:url" and contains(@content, '/sponsored/')]
+
+tidy: no
+prune: no
+
+dissolve: //h2
+
+test_url: http://www.bbc.com/sport/0/football/28918021
+test_contains: Cameroonian footballer Albert Ebosse has died
+
+test_url: http://www.bbc.com/sport/0/football/23224017
+
+test_url: http://www.bbc.com/news/business-15060862
+test_contains: Europe's leaders are meeting again to try
+
+
+# news feed
+test_url: http://feeds.bbci.co.uk/news/rss.xml
+# sports feed
+test_url: http://feeds.bbci.co.uk/sport/0/football/rss.xml?edition=int
+# video entry
+test_url: http://www.bbc.com/news/world-asia-22056933
diff --git a/inc/3rdparty/site_config/standard/bit-tech.net.txt b/inc/3rdparty/site_config/standard/bit-tech.net.txt
new file mode 100755 (executable)
index 0000000..c6f5b20
--- /dev/null
@@ -0,0 +1,19 @@
+body: //div[@id='column_1']
+next_page_link: //div[@class='next']/a[not(contains(@href, '/comments') or contains(@href, '/news/'))]
+prune: no
+
+author: substring-after(//p[@class='byline'], 'by ')
+date: substring-before(substring-after(//p[@class='byline'], 'on '), ' by')
+
+strip: //h1
+strip_id_or_class: socialLinks
+strip_id_or_class: byline
+strip_id_or_class: pageSelector
+strip_id_or_class: articleTabs
+strip_id_or_class: pageNav
+strip_id_or_class: share
+strip_id_or_class: commentsContainer
+strip_id_or_class: below_article_related
+
+test_url: http://www.bit-tech.net/hardware/storage/2014/08/13/ocz-arc-100-240gb-review/1
+test_url: http://www.bit-tech.net/news/bits/2014/08/15/google-trojan/1
diff --git a/inc/3rdparty/site_config/standard/bleacherreport.com.txt b/inc/3rdparty/site_config/standard/bleacherreport.com.txt
new file mode 100755 (executable)
index 0000000..9205e44
--- /dev/null
@@ -0,0 +1,16 @@
+body: //div[contains(@class, 'article_pages')]
+
+strip_id_or_class: article_page-header
+strip_id_or_class: paginator
+strip_id_or_class: article_info
+
+find_string: src="data:image
+replace_string: ignore-src="data:image
+find_string: data-defer-src="
+replace_string: src="
+
+prune: no
+
+test_url: http://bleacherreport.com/articles/feed
+test_url: http://bleacherreport.com/articles/2137787-christian-ponders-newborn-daughter-was-named-after-fsu-legend-bobby-bowden
+test_url: http://bleacherreport.com/articles/2137596-college-football-week-1-picks-unlv-runnin-rebels-vs-arizona-wildcats/
\ No newline at end of file
diff --git a/inc/3rdparty/site_config/standard/blogs.faz.net.txt b/inc/3rdparty/site_config/standard/blogs.faz.net.txt
new file mode 100755 (executable)
index 0000000..4f2626f
--- /dev/null
@@ -0,0 +1,45 @@
+# Author: zinnober
+
+tidy: no
+prune: no
+
+# Set author
+author: //a[@rel='author']
+
+# Set date
+date: //span[@class='Datum']
+
+# Content is here
+body: //div[@class='Artikel']
+
+# Tidy up before article
+strip: //div[@id='FAZHeaderNeu']
+strip: //h2[@itemprop='headline']
+strip: //span[@class='Datum']
+strip: //span[@class='Autor']
+strip_id_or_class: ArticlePagerTop
+strip: //div[@class='FAZArtikelEinleitung']/h2
+
+# General cleanup
+strip: //div[@class='clear']
+strip: //span[@class='Bildnachweis']
+strip: //iframe
+strip_id_or_class: Community
+strip: ' ·  '
+
+# Remove tracking and ads
+strip_image_src: /l.gif?
+strip: //img[@width='1']
+strip_id_or_class: invisible
+strip_id_or_class: Anzeige
+strip_id_or_class: billboard
+
+# Remove clutter after article
+strip_id_or_class: Tagline
+strip_id_or_class: ArtikelAbbinder
+strip_id_or_class: FAZArtikelKommentare
+strip_id_or_class: ArtikelKommentieren
+strip_id_or_class: FAZContentRight
+
+# Try it yourself
+test_url: http://blogs.faz.net/wost/2014/08/17/viel-fuck-und-wenig-guter-sex-1239/
index 0b8feb6a559844d40751d150b5cdf87776612bce..6a22dcb76f124e9375541089cb91fd8cf2dc07d6 100755 (executable)
@@ -19,5 +19,8 @@ strip: //p[@class='nota_pie']
 strip: //div[starts-with(@id, 'sumario') and contains(., 'más información')]
 strip: //div[@id='coment' or @id='foros_not']
 
-test_url: http://elpais.com/elpais/2012/02/06/gente/1328526783_491687.html
-test_url: http://www.elpais.com/articulo/cultura/mano/retrato/materia/elpepicul/20120207elpepicul_2/Tes
+test_url: http://brasil.elpais.com/brasil/2014/10/15/politica/1413334841_878730.html
+test_contains: O PT quer intensificar a presença do ex-presidente
+
+test_url: http://brasil.elpais.com/brasil/2014/10/13/internacional/1413225730_450761.html
+test_contains: Todos na localidade onde ele nasceu ainda falavam da façanha
index 03085593f9587f1a1dea15f671922cf647d3c97a..f546b708f09f1a103856335e6bbfc33c0ebbf028 100755 (executable)
@@ -1,30 +1,17 @@
-# story has several pages, should be detected
-body: //div[@id='storyBody']
-body: //div[@id='article_body']
-body: //div[@id='story_body']
+# include the lead graphic in the body, if available
+body: //div[contains(concat(' ', normalize-space(@id), ' '), ' lead_graphic ')] | //div[contains(concat(' ', normalize-space(@itemprop), ' '), ' articleBody ')]
+title: //h1[contains(concat(' ', normalize-space(@itemprop), ' '), ' headline ')]
+date: //time[contains(concat(' ', normalize-space(@itemprop), ' '), ' datePublished ')]
 
-title://h1[@id='article_headline']
-
-# article author
-author: //p[@class='author']/a
-# story author(s)
-author: substring-after(//p[@class='byline'], 'By ')
-
-# article date
-date: //span[@class='published_date']
-# story date
-date: //span[@class='date']
-
-date: substring-after(//div[contains(@class,'attributor')],'on')
-strip_id_or_class: inset
-strip: //p/span[@class='photoCredit']
-strip: //h1
-
-strip_id_or_class: page_count
-strip_id_or_class: tools
-strip_id_or_class: pagination
-
-single_page_link: //li[@id='stPrint']/a
+strip_id_or_class: photo_credit
+strip_id_or_class: photo_caption
+strip_id_or_class: inline_gallery
+# pull quote, often inside a blockquote element
+strip_id_or_class: pq
+strip_id_or_class: credit
+strip_id_or_class: figcaption
+strip_id_or_class: related_item
 
 test_url: http://www.businessweek.com/magazine/buyback-insurance-a-good-deal-for-retailers-07282011.html
-test_url: http://www.businessweek.com/articles/2012-06-06/american-pain-the-largest-u-dot-s-dot-pill-mills-rise-and-fall
\ No newline at end of file
+test_url: http://www.businessweek.com/articles/2012-06-06/american-pain-the-largest-u-dot-s-dot-pill-mills-rise-and-fall
+test_url: http://www.businessweek.com/articles/2014-07-09/american-apparel-dov-charneys-sleazy-struggle-for-control
index 97dddaee9e95bc81c296a021391b77210d7b1c7a..ea88ea472d2bd8435c66f6807c6d7b6968344ff1 100755 (executable)
@@ -10,6 +10,15 @@ date: //time[@data-print='date']
 body: //div[@data-print='body']
 body: //section[@data-print='body']
 
+find_string: rel:bf_image_src=
+replace_string: src=
+find_string: src="data:
+replace_string: disabled_src="data:
+
+native_ad_clue: //meta[@property="article:section" and @content="Advertiser"]
+
 # For various things...
 strip: *[@data-print="ignore"]
-test_url: http://www.buzzfeed.com/hgrant/35-reasons-why-dogs-hate-the-holidays
\ No newline at end of file
+test_url: http://www.buzzfeed.com/hgrant/35-reasons-why-dogs-hate-the-holidays
+# Native ad
+test_url: http://www.buzzfeed.com/bravo/ways-to-up-your-online-dating-game
\ No newline at end of file
diff --git a/inc/3rdparty/site_config/standard/canonrumors.com.txt b/inc/3rdparty/site_config/standard/canonrumors.com.txt
new file mode 100755 (executable)
index 0000000..c22cf4f
--- /dev/null
@@ -0,0 +1,28 @@
+# Author: zinnober
+
+tidy: no
+prune: no
+
+# Set title
+title: //h2
+
+date: //li[@class='time']
+
+# Set author
+author: //a[contains(@rel, 'author')]
+
+# Content is here
+body: //div[@id='content']
+
+# Tidy up before article
+strip: //div[@class='meta']
+
+# Tidy up after article
+strip_id_or_class: nr_related_placeholder
+strip_id_or_class: twitter-share-button
+strip_id_or_class: afterpost
+strip_id_or_class: tags
+
+# Try it yourself
+test_url: http://www.canonrumors.com/2014/09/chuck-westfall-talks-canon-eos-7d-mark-ii/
+test_url: http://www.canonrumors.com/2014/09/canon-cinema-eos-captures-space-in-4k-for-new-imax-3d-film/
index 31440538085fe3af4fe264b009f8b9cf902ba3b2..2645f119dc965826cee481875efae17d6d784403 100755 (executable)
@@ -2,4 +2,5 @@ title: //div[@class='title']
 author: //div[@class='author']
 prune: no
 
-test_url: http://www.chomsky.info/onchomsky/2002----.htm
\ No newline at end of file
+test_url: http://www.chomsky.info/onchomsky/2002----.htm
+test_contains: The propaganda model argues
index b38786622fab6e62abc2c53493b4f862d7db4800..28f104725ca568e48067da326b216257d8a4fc32 100755 (executable)
@@ -1,5 +1,9 @@
 title: //div[@id='maincontent']//h1
 body: //div[@id='resizeableText']
 
+single_page_link: concat(//link[@rel='canonical']/@href, '?sp=true')
+
 test_url: http://cn.reuters.com/article/CNAnalysesNews/idCNKBS0FF0NM20140710
-test_url: http://cn.reuters.feedsportal.com/CNAnalysesNews
\ No newline at end of file
+test_url: http://cn.reuters.feedsportal.com/CNAnalysesNews
+# multipage link
+test_url: http://cn.reuters.com/article/idCNKBS0FF0UL20140710
\ No newline at end of file
index 269fb54783e805f175de34a1f6c96f5ad406ef6a..f8a88caeed67ed49d2a865b2dd3ea7683675761d 100755 (executable)
@@ -1 +1,3 @@
-body: //div[@id='content']
+body: //div[@id='readme']
+
+test_url: http://code.fivefilters.org/full-text-rss
index b482e34e9377072793524f2d7a51b5aa629e61e8..70ab98854ed87f84c49ca0f9834c68c0fb14688b 100755 (executable)
@@ -15,4 +15,4 @@ strip_id_or_class: promotion-tag
 tidy: no
 prune: no
 
-test_url: www.csmonitor.com/World/Middle-East/2011/1108/Imminent-Iran-nuclear-threat-A-timeline-of-warnings-since-1979/Earliest-warnings-1979-84
\ No newline at end of file
+test_url: http://www.csmonitor.com/World/Middle-East/2011/1108/Imminent-Iran-nuclear-threat-A-timeline-of-warnings-since-1979/Earliest-warnings-1979-84
index 381446e597ad04237e9e2ff5fd6abab6ab8d79fd..2bd66be82159464d06174566540a239c9eb16b4f 100755 (executable)
@@ -2,4 +2,4 @@ single_page_link: //a
 tidy: no
 prune: no
 
-test_url: da.feedsportal.com/c/585/f/413794/s/17037b5a/l/0L0Stelegraaf0Bnl0Cbinnenland0C10A2757860C0I0IKlacht0Itegen0Idr0B0IFrank0Iniet0I0Eontvankelijk0I0I0Bhtml0Dcid0Frss/ia1.htm
\ No newline at end of file
+test_url: http://da.feedsportal.com/c/585/f/413794/s/17037b5a/l/0L0Stelegraaf0Bnl0Cbinnenland0C10A2757860C0I0IKlacht0Itegen0Idr0B0IFrank0Iniet0I0Eontvankelijk0I0I0Bhtml0Dcid0Frss/ia1.htm
diff --git a/inc/3rdparty/site_config/standard/designsponge.com.txt b/inc/3rdparty/site_config/standard/designsponge.com.txt
new file mode 100755 (executable)
index 0000000..2cd2f1f
--- /dev/null
@@ -0,0 +1,31 @@
+# Author: zinnober
+
+tidy: no
+prune: no
+
+# Set title
+title: //header/h1
+
+# Set author
+author: //a[@rel='author']
+
+# Content is here
+body: //article
+
+# Tidy up before article
+strip: //header
+
+# Tidy up article
+strip: //div[contains(@id, 'gallery-')]
+replace_string(<a rel="attachment): <p rel="attachment
+
+
+# Tidy up after article
+strip: //div[@class='sm']
+strip_id_or_class: related
+strip_id_or_class: comments
+strip: //footer
+
+# Try it yourself
+test_url: http://www.designsponge.com/2010/06/seattle-design-guide.html
+test_url: http://www.designsponge.com/2012/04/sneak-peek-liz-cook.html
index efa85f763120ae01285556b33950881d3ce55e42..c77007b7bccdea318348e271b4bb3d08cf641027 100755 (executable)
@@ -2,4 +2,6 @@ body: (//blockquote[contains(@class, 'postcontent')])[1]
 body: (//div[starts-with(@id, 'post_message')])[1]
 
 prune: no
-tidy: no
\ No newline at end of file
+tidy: no
+
+test_url: http://www.desitvforum.net/forum/watch-online/431739-creature-3d-2014-watch-online-download-dvd-rip.html
diff --git a/inc/3rdparty/site_config/standard/deutsche-apotheker-zeitung.de.txt b/inc/3rdparty/site_config/standard/deutsche-apotheker-zeitung.de.txt
new file mode 100755 (executable)
index 0000000..36709ca
--- /dev/null
@@ -0,0 +1,29 @@
+# Author: zinnober
+
+prune: yes
+tidy: yes
+
+title: //h1
+date: //p[@class='news_datum']
+author: //span[@class='author']
+
+body: //div[@class='tagesnews-content']
+
+# General cleanup
+strip_id_or_class: dachzeile
+strip: //h3
+strip: //p[@class='bodytext']//a
+strip_id_or_class: autor_datum
+strip_id_or_class: comments
+strip_id_or_class: banner-
+
+strip: //p[contains(., 'Lesen Sie')]
+strip: //p[contains(., '– in DAZ')]
+
+# Fix image captions
+replace_string(<p class="image_caption">): <p><small><em>
+replace_string(</dd>): </em></small></dd>
+
+test_url: http://www.deutsche-apotheker-zeitung.de/pharmazie/news/2014/09/03/weniger-nebenwirkungen-aber-kein-zusatznutzen/13715.html
+test_url: http://www.deutsche-apotheker-zeitung.de/recht/news/2014/09/02/urteile-zum-cannabis-eigenanbau-bfarm-geht-in-berufung/13716.html
+
index f8b79c8007a467bce82af40e173863d00d52b4da..b8243d0c5e6d5f50cb5eba93caf20b040cd61495 100755 (executable)
@@ -1,8 +1,6 @@
-title: //h1[@id='query_h1']
-body: //div[contains(@class, 'lunatext results_content')]
-strip_id_or_class: spl_unshd
-#replace_string(<div class="dicTl">): <div class="dicTl">------------------<br />
+body: //div[contains(@class, 'source-data')]
+strip: //button
 
 prune: no
 
-test_url: http://www.wired.com/cloudline/2011/10/meet-arms-cortex-a15-the-future-of-the-ipad-and-possibly-the-macbook-air/
\ No newline at end of file
+test_url: http://dictionary.reference.com/browse/propaganda
index 92ae31b218831b76f6df27e81864476a836b107f..3b51569f77851c70ba594c5e8f47135d6cdc9cb3 100755 (executable)
@@ -1 +1,3 @@
-single_page_link: //a[@id='download_button_link']
\ No newline at end of file
+single_page_link: //a[@id='download_button_link']
+
+test_url: https://www.dropbox.com/s/qmocfrco2t0d28o/Fluffbeast.docx
diff --git a/inc/3rdparty/site_config/standard/echo-online.de.txt b/inc/3rdparty/site_config/standard/echo-online.de.txt
new file mode 100755 (executable)
index 0000000..e53de23
--- /dev/null
@@ -0,0 +1,24 @@
+# Author: Marvin Dickhaus <github@marvindickhaus.de>
+# 2014-10-08
+
+#Tidy just messes up the DOM
+tidy: no
+
+title: //h1
+body: //h2 | //div[@id='artikelteaser'] | //div[@id='artikeltext']
+
+#Strip 
+strip_image_src: artikel_a_merken.gif
+strip: //div[@class='zusatzinfo']
+
+#Author: substring is used to remove the " Von " prefix.
+author: substring(//li[@class='artikelautor'], 5)
+
+date: //li[@class='artikeldatum']
+
+#The first two URLs will at some point no longer show 
+#the full article. There is a time-based paywall 
+#installed. Using the feed should present valid output
+test_url: http://www.echo-online.de/art1231,5503063
+test_url: http://www.echo-online.de/art1168,5502598
+test_url: http://www.echo-online.de/rss/darmstadt.xml
index 16c9ed646f1369609bfea90d6c0a741d819b8915..8db5fdd66b1ac3dfc17e9e485788cfc328ce1f34 100755 (executable)
@@ -1,8 +1,13 @@
 body: //div[@class='main-content']
+body: //article[contains(@class, 'resp-node')]
 date: //time[@class='date-created']
 strip: //aside
 prune: no
 
 autodetect_next_page: no
 
-test_url: http://www.economist.com/node/21528429
\ No newline at end of file
+test_url: http://www.economist.com/node/21528429
+
+test_url: http://www.economist.com/news/essays/21623373-which-something-old-and-powerful-encountered-vault
+test_contains: the calfskin pages are smooth
+test_contains: Books will evolve online and off
index 8a3516671b27afde0e93955a4d00d0693047a8c8..8931becb1166435d99becc5d2ab6a5f235047ab5 100755 (executable)
@@ -1,8 +1,9 @@
-body: //div[ @class='content' ]  |  //div[ @class='blog-entry' ]
+body: //p[@class='strapline'] | //div[@class='cover-image'] | //article[@class='hd']
+strip: //div[@class='social top']
+strip: //p[@class='byline']
 
-strip: //h2/abbr  |  //div[ @class='lowleader' ]  |  //*[ @class='discussion' ]  |  //img[ @class='play-button' ]  |  //div[ @class='boxout' ] | //h2/a | //h2 | //h2/div | //p[ @class='timestamp' ] | //a[ @class='eurogamer-author' ] | //p[ @class='aPager' ] | //h1 | //div[ @id='lowleader' ] | //a[ @class='next' ]  |  //div[contains(concat(' ', normalize-space(@class), ' '), ' pullquote ')]
+date: //span[@itemprop='datePublished']
+author: //a[@itemprop='author']/text()
 
-date://p[ @class='timestamp' ]
-
-author://a[ @class='eurogamer-author' ]
-test_url: http://www.eurogamer.net/articles/digitalfoundry-vs-unreal-engine-4
\ No newline at end of file
+test_url: http://www.eurogamer.net/articles/2014-08-20-bungie-ordered-to-return-shares-to-composer-marty-odonnell
+test_url: http://www.eurogamer.net/articles/2014-08-20-invisible-inc-does-espionage-justice
index 6a49276740868cf4c14446403cf33354f5e7ab16..26d4f90594043445744ee6d9627ac3b5b97eb4d7 100755 (executable)
@@ -1,5 +1,12 @@
 body: //div[@id='imagestage']
+body: //div[contains(@class, 'userContentWrapper')]
+
+strip_id_or_class: commentable
+
 prune: no
 tidy: no
 
-test_url: https://www.facebook.com/feeds/page.php?id=338077742912613&format=rss20
\ No newline at end of file
+# single_page_link: replace(substring-after(//noscript//meta[@http-equiv="refresh"]/@content, 'URL='), "&amp;", "&")
+
+test_url: https://www.facebook.com/permalink.php?story_fbid=10154584776550183&id=294468630182
+test_contains: holding an extraordinary session in Brussels this month
old mode 100644 (file)
new mode 100755 (executable)
index 248522cbdd4f14434316ae0a9c4ed7e335fe04c5..0c967db0575b97d5b44fba3577b534624e4e8802 100755 (executable)
@@ -5,8 +5,8 @@ strip: //div[contains(@class, 'related-companies')]
 strip: //div[@id='y-article-related']
 strip: //div[@id='ypf-article-related']
 prune: no
+tidy: no
 
 single_page_link: //div[@class='ft']//a[contains(@href, 'page=all')]
 
-test_url: http://sg.finance.yahoo.com/news/Motorola-takes-wraps-249-rsg-3508842732.html?x=0&.v=1
-test_url: http://finance.yahoo.com/news/super-young-retirement-savers.html
\ No newline at end of file
+test_url: http://finance.yahoo.com/news/canadian-orebodies-gives-notice-exercise-130000032.html
\ No newline at end of file
index d9c5e42e77fe7b6351e3b24d8642d3d3fbc1c956..9614d2f6e861bce836aa487e58a495a2b4c188a8 100755 (executable)
@@ -1,2 +1,2 @@
 body: //div[@class='entry']
-test_url: http://www.fivechapters.com/2010/paris-part-one/
\ No newline at end of file
+test_url: http://www.fivechapters.com/2014/the-saddest-writer-in-america-part-two/
index dc1db432f9386d52665b6556712b9875f78d2b7a..f37f02b93e53dcdbff8d042acdacb187278484a0 100755 (executable)
@@ -1 +1,4 @@
-prune: no
\ No newline at end of file
+body: //section[contains(@class, 'container')]
+prune: no
+
+test_url: http://fivefilters.org/kindle-it/
index 4e84b9893435c951ba98136129336dd0e54ec330..853a5b7bd9f676c612dce06604899d81fe13a7a0 100755 (executable)
@@ -1,15 +1,19 @@
 title: //div[@class='translateHead']//h1 | //div[@id='art-mast']//h1
 author: substring-after(//span[@id='by-line'], 'BY ')
 date: //span[@id='pub-date']
-body: //div[@id='art-mast']/h2 | //div[@class='translateBody'] | //div[@id='art-body']
+body: (//article//img[contains(@class, 'main_photo')])[1] | (//article//div[contains(@class, 'full_post_content')])[1]
+#body: //div[@id='art-mast']/h2 | //div[@class='translateBody'] | //div[@id='art-body']
 #Strip inside article content
 strip: //div[@id='share-box']
-strip: //div[@id='special-box']
+strip: //div[@id='special-box']
+
+strip_id_or_class: side_panel
 
 prune: no
 
 single_page_link: //span[@id='controls']/a[contains(@href, 'print=yes')]
 single_page_link: //a[text()='SINGLE PAGE']
 
+test_url: http://www.foreignpolicy.com/articles/2014/07/22/the_end_game_in_gaza_netanyahu_hamas
 test_url: http://www.foreignpolicy.com/articles/2011/08/01/a_murderers_manifesto_and_me
 test_url: http://www.foreignpolicy.com/articles/2012/02/29/five_years_in_damascus
\ No newline at end of file
index 6afdebe8d98fd376a3d055379decf0e17b407249..c64860c09c51f34f59deac35eaec44a02ae621be 100755 (executable)
@@ -1,25 +1,34 @@
-# Jens Kohl, jens.kohl@...
-# - Added publication date
-# - Striped pagination block
-# - Added single page link
-# - Added xpath-querys for the printer friendly version
+# Author: zinnober
+# Rewrite of original template which fetched the printer-version without pictures
 
-title: //h1
-body: //div[@class='formatted']
+tidy: no
 prune: no
 
-date: substring-after(//li[2][@class="text1"], 'Datum:')
-strip: //ol[@class="list-chapters"]
-strip_comments: yes
-
-# next: commands for printer friendly pages
-single_page_link: //a[contains(@href, 'print.php?a=')]/@href
-title: //body/h3
-strip_image_src: staticrl/images/logo.jpg
-strip_image_src: http://cpx.golem.de/cpx.php?class=7
-strip: //body/h3
-strip: //body/b[1]
-strip: //body/b[2]
-strip: //body/b[3]
-strip: //div[1]
-test_url: http://www.golem.de/1112/88696.html
\ No newline at end of file
+# Set full title
+title: //h1
+
+date: //time
+
+# Content is here
+body: //article
+
+# Fetch full multipage articles
+next_page_link: //a[@id='atoc_next']
+
+# Remove tracking and ads
+strip_id_or_class: iqadtile4
+
+# General Cleanup
+strip_id_or_class: list-jtoc
+strip_id_or_class: table-jtoc
+strip_id_or_class: implied
+strip_id_or_class: social-
+strip_id_or_class: comments
+strip_id_or_class: footer
+
+# Tidy up galleries (could still be improved, though)
+strip: //img[@src='']
+
+# Try yourself
+test_url: http://www.golem.de/news/intel-core-i7-5960x-im-test-die-pc-revolution-beginnt-mit-octacore-und-ddr4-1408-108893.html
+test_url: http://www.golem.de/news/test-infamous-first-light-neonbunter-actionspass-1408-108914.html
index 37a4aaf0d0d37684155775e516eb259f845a5217..9433104b0077b28ae3fec092cf9a1930eab18c06 100755 (executable)
@@ -1,9 +1,42 @@
-#second part of single_page_link for telepolis-articles (desktop-version of site)
-single_page_link: //p[@class='news_option']/a | //a[@id='tp-druckversion']
+# Author: zinnober
+# Template should work well with either desktop or mobile version (m.heise.de)
 
+prune: no
+
+title: //article/h1 | //h1
 date: //p[@class='news_datum']
-title: //h1
-body: //div[@class='meldung_wrapper']
+author: //h4[@class='author']
+
+body: //article | //div[@class='meldung_wrapper']
+
+# General cleanup
+strip: //time
+strip: //h4[@class='author']
+strip: //p[@class='news_datum']
+strip: //p[@class='artikel_datum']
+strip: //a[contains(@href, 'mailto')]
+strip_id_or_class: comments
+strip_id_or_class: ISI_IGNORE
+strip_id_or_class: clear
+
+strip_id_or_class: linkurl_grossbild
+strip_id_or_class: image-num
+strip_id_or_class: heisebox_right
+strip_id_or_class: dossier
+
+# Strip Ads
+strip_id_or_class: ad_
+
+# Some optimizations
+replace_string(<h5>): <h2>
+replace_string(</h5>): </h2>
+replace_string(<span class="bild_rechts"): <p
+replace_string(<div class="heisebox">): <blockquote>
+
+
+next_page_link: //a[@class='next']
+next_page_link: //a[@title='vor']
 
-test_url: http://www.heise.de/newsticker/meldung/Europa-soll-Grundrechteschutz-im-Netz-staerken-1392664.html
-test_url: http://www.heise.de/tp/artikel/42/42579/1.html
+test_url: http://www.heise.de/open/artikel/Die-Neuerungen-von-Linux-3-15-2196231.html
+test_url: http://m.heise.de/open/artikel/Die-Neuerungen-von-Linux-3-15-2196231.html
+test_url: http://www.heise.de/newsticker/meldung/Ueberwachungstechnik-Die-globale-Handy-Standortueberwachung-2301494.html
index dfd8193763648147edccde0ea4ef23fd72f69863..a660f23b11cba56aa4e787d3cdd81f18a2d6bd69 100755 (executable)
@@ -2,4 +2,4 @@ body: //table[@class='ap-smallphoto-table'] | //div[@class='body']//*[@class='en
 tidy: no
 strip_image_src: analytics.apnewsregistry
 
-test_url: http://hosted.ap.org/dynamic/stories/U/US_SPENDING_SHOWDOWN?SITE=FLPET&SECTION=HOME&TEMPLATE=DEFAULT&CTIME=2011-04-06-07-46-50
\ No newline at end of file
+test_url: http://hosted.ap.org/dynamic/stories/E/EU_TURKEY_KURDS?SITE=KSNEW&SECTION=HOME&TEMPLATE=DEFAULT&CTIME=2014-10-14-10-50-25
diff --git a/inc/3rdparty/site_config/standard/itunes.apple.com.txt b/inc/3rdparty/site_config/standard/itunes.apple.com.txt
new file mode 100755 (executable)
index 0000000..ffd9556
--- /dev/null
@@ -0,0 +1,14 @@
+body: //div[@id='left-stack' or contains(@class, 'center-stack')]
+
+find_string: class="artwork" src="
+replace_string: class="artwork" src-disabled="
+find_string: src-swap-high-dpi="
+replace_string: src="
+
+strip_id_or_class: rating
+strip_id_or_class: listeners-also-bought
+
+prune: no
+
+test_url: https://itunes.apple.com/us/rss/topaudiobooks/limit=10/xml
+test_url: https://itunes.apple.com/us/audiobook/the-giver-unabridged/id356345850
\ No newline at end of file
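Note on the string rules: the `find_string:` / `replace_string:` pairs in the itunes.apple.com config, and the `replace_string(find): replace` shorthand used in the heise.de and tagesspiegel.de configs of this commit, both describe literal substitutions on the raw HTML. The sketch below is a minimal illustration only, assuming the pairs are applied in file order as plain substring replacement. The helper names `parse_replacements` and `apply_replacements` and the sample `<img>` tag are hypothetical and not part of wallabag.

```python
import re

def parse_replacements(config_text):
    """Collect (find, replace) pairs from site-config text (hypothetical helper)."""
    pairs = []
    pending_find = None
    for raw in config_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        # Shorthand form: replace_string(find): replace
        shorthand = re.match(r"replace_string\((.*?)\):\s?(.*)$", line)
        if shorthand:
            pairs.append((shorthand.group(1), shorthand.group(2)))
        elif line.startswith("find_string:"):
            pending_find = line[len("find_string:"):].strip()
        elif line.startswith("replace_string:") and pending_find is not None:
            # Two-line form: a find_string line followed by a replace_string line
            pairs.append((pending_find, line[len("replace_string:"):].strip()))
            pending_find = None
    return pairs

def apply_replacements(html, pairs):
    """Apply each pair in order as a literal substring replacement (assumed semantics)."""
    for find, replace in pairs:
        html = html.replace(find, replace)
    return html

# The four string rules from the itunes.apple.com config above.
CONFIG = '''
find_string: class="artwork" src="
replace_string: class="artwork" src-disabled="
find_string: src-swap-high-dpi="
replace_string: src="
'''

# Made-up sample markup, for illustration only.
HTML = '<img class="artwork" src="low.png" src-swap-high-dpi="high.png">'
result = apply_replacements(HTML, parse_replacements(CONFIG))
```

Under these assumptions the first pair disables the low-resolution `src` and the second promotes the high-DPI attribute to `src`, so the order of the pairs matters.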
index 35baf8dfc6c781025fa0c9f7b886c2c7c9c21801..57ab0de1a48db595cdad21b5f313e2df208ffb51 100755 (executable)
@@ -4,4 +4,4 @@ body: //div[@itemprop='articleBody']
 tidy: no
 
 test_url: http://www.kachiblog.com/2013/05/samsung-galaxy-s4-vs-samsung-galaxy.html
-test_url: http://www.kachiblog.com/feeds/posts/default
\ No newline at end of file
+test_url: http://www.kachiblog.com/feed
diff --git a/inc/3rdparty/site_config/standard/lifehacker.co.uk.txt b/inc/3rdparty/site_config/standard/lifehacker.co.uk.txt
new file mode 100755 (executable)
index 0000000..c540f7f
--- /dev/null
@@ -0,0 +1,7 @@
+title: //div[@itemprop='headline']
+body: //noscript/img | //div[@itemprop='text']
+author: //div[@class='meta meta--post']//a[@class='is-author']
+date: //div[@class='meta meta--post']//time/@datetime
+
+test_url: http://www.lifehacker.co.uk/2014/08/22/dealhacker-10-google-chromecast-super-cheap-batteries-much
+test_url: http://www.lifehacker.co.uk/2014/08/18/andrognito-hides-files-youd-like-keep-away-prying-eyes
index 2136de3fdc9a159317a707bcf303a7166c008aee..2f6382f187f76c3fa0517ecdda3d8d0808fb2853 100755 (executable)
@@ -25,4 +25,4 @@ strip_id_or_class: 'rightimage'
 #Comments
 strip: //table
 strip: //p/following-sibling::*[0]
-test_url: http://www.mainpost.de/ueberregional/meinung/Dioxin-Skandal-bringt-Agrarministerin-in-Bedraengnis;art9517,5920211
\ No newline at end of file
+test_url: http://www.mainpost.de/regional/wuerzburg/Autobahnschuetze-Staatsanwalt-fordert-zwoelf-Jahre;art492151,8386332
index 4c333aa116247e2e5b2ba64057b31dc5afeaf297..c26bac553d493ea8eca54602ff9255c00c4041a4 100755 (executable)
@@ -1,4 +1,5 @@
 strip_id_or_class: article-tools
 strip_id_or_class: pagenav
 prune: no
-test_url: http://www.medialens.org/index.php/alerts/alert-archive/2012/713-the-illusion-of-democracy.html
\ No newline at end of file
+test_url: http://www.medialens.org/index.php/alerts/alert-archive/2012/713-the-illusion-of-democracy.html
+test_contains: In an era of permanent war, economic meltdown
index acf7cc9028c1a5f5cc8280616f552bbb702a4a13..9e9c6895167d59d35f4457516d48353e099a9f91 100755 (executable)
@@ -1,7 +1,12 @@
-body: //div[contains(@class, 'post-content-inner')]
-strip_id_or_class: follow-ups
-strip_id_or_class: footer
+body: //div[contains(@class, 'postContent-inner')]
+strip_id_or_class: supplementalPostContent
 
 prune: no
 
-test_url: https://medium.com/p/6844c0d7893b
\ No newline at end of file
+test_url: https://medium.com/@savolai/kaytettavyyden-haasteet-keskustelukulttuurista-2-3-6844c0d7893b
+test_contains: Jos käytettävyysongelmat ovat kerran niin tyypillisiä
+test_contains: Keskustelukulttuuriongelmasta (subjective vs. objective bugs)
+
+test_url: https://medium.com/health-the-future/thirty-things-ive-learned-482765ee3503
+test_contains: Remember you will die
+test_contains: You have to have some faith.
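Note on the test directives: in the medium.com config each `test_contains:` line follows the `test_url:` it is meant to verify. The sketch below shows one way such lines could be grouped and checked, assuming a `test_contains` always belongs to the nearest preceding `test_url`. The helpers `group_tests` and `passes` are hypothetical and are not wallabag code.

```python
def group_tests(config_text):
    """Attach each test_contains snippet to the preceding test_url (assumed pairing)."""
    tests = []
    for raw in config_text.splitlines():
        line = raw.strip()
        if line.startswith("test_url:"):
            tests.append({"url": line[len("test_url:"):].strip(), "contains": []})
        elif line.startswith("test_contains:") and tests:
            tests[-1]["contains"].append(line[len("test_contains:"):].strip())
    return tests

def passes(test, extracted_text):
    """A test passes when every expected snippet occurs in the extracted text."""
    return all(snippet in extracted_text for snippet in test["contains"])

# Two of the test entries from the medium.com config above.
CONFIG = '''
test_url: https://medium.com/health-the-future/thirty-things-ive-learned-482765ee3503
test_contains: Remember you will die
test_contains: You have to have some faith.

test_url: https://medium.com/p/6844c0d7893b
'''

tests = group_tests(CONFIG)
```

A `test_url` with no `test_contains` lines passes trivially in this sketch, which matches how URL-only entries are used elsewhere in these configs.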
diff --git a/inc/3rdparty/site_config/standard/menshealth.com.sg.txt b/inc/3rdparty/site_config/standard/menshealth.com.sg.txt
new file mode 100755 (executable)
index 0000000..6a66925
--- /dev/null
@@ -0,0 +1,12 @@
+strip: //div[contains(@style, 'float:right') and contains(., 'advertisement')]
+body: //div[@style="float:left;width:740px;"]
+
+tidy: no
+
+test_url: http://www.menshealth.com.sg/fitness/mh-picks-under-armour-clutchfit-nitro-mid-cleats
+test_contains: These cleats are made for one thing
+
+test_url: http://www.menshealth.com.sg/fitness/top-10-fat-burning-bodyweight-moves-you-can-do-10-minutes
+test_contains: let this workout fool you
+
+test_url: http://www.menshealth.com.sg/fitness/feed
\ No newline at end of file
index 88429a7835c1ec57e478a8b87dd711363d715caa..f698d98e0711ee0e8c56ba1b6c2238ca05042b46 100755 (executable)
@@ -8,4 +8,4 @@ strip_id_or_class: news_morearticlesincat
 strip_id_or_class: ezc_comments
 strip_comments: yes
 
-test_url: http://www.northumberlandview.ca/index.php?module=news&func=display&sid=5972
\ No newline at end of file
+test_url: http://www.northumberlandview.ca/index.php?module=news&type=user&func=display&sid=31127
index 23c9ad11979eab45974b1e3178ebd6a9e82486a8..54735ec7b643982a38191b31446ce7838c87874a 100755 (executable)
@@ -42,8 +42,12 @@ strip://h6[@class = 'kicker']
 author:substring-after(//h6[@class='byline'],'By ')
 
 test_url: http://www.nytimes.com/2011/07/24/books/review/an-academic-authors-unintentional-masterpiece.html
+test_contains: In this column I want to look at a not uncommon way of writing
+
 test_url: http://www.nytimes.com/2012/06/10/arts/television/the-newsroom-aaron-sorkins-return-to-tv.html
+test_contains: IF you’ve seen enough of Aaron Sorkin’s theater
+
 test_url: http://www.nytimes.com/2013/03/25/world/middleeast/israeli-military-responds-after-patrols-come-under-fire-from-syria.html
 test_url: http://www.nytimes.com/2013/08/15/nyregion/when-the-new-york-city-subway-ran-without-rails.html
 test_url: http://www.nytimes.com/2004/02/29/weekinreview/correspondence-class-consciousness-china-s-wealthy-live-creed-hobbes-darwin-meet.html
-test_url: http://www.nytimes.com/2014/06/19/opinion/gail-collins-romney-and-the-2016-contenders-huddle.html
\ No newline at end of file
+test_url: http://www.nytimes.com/2014/06/19/opinion/gail-collins-romney-and-the-2016-contenders-huddle.html
index 1a33610d6f031be9ff0c408dadf9d74218af7a68..ce0a3c4309ece51bc515de63bfcba905311d4957 100755 (executable)
@@ -1,3 +1,5 @@
-body: //div[@id='_ctl12__ctl0_Article']
+body: //div[contains(@class, 'article-photo-wrapper')]
 prune: no
-autodetect_on_failure: no
\ No newline at end of file
+
+test_url: http://www.real.gr/DefaultArthro.aspx?page=arthro&id=360962&catID=1
+test_contains: Επισήμως το αποψινό υπουργικό
index 8871f5644c5f27ff5b057ddd78c70420877ce4d6..ba342c7cd2aa0fe1f6f85cb28b80d1e29719cc5b 100755 (executable)
@@ -7,7 +7,7 @@ author: //p[@class="tagline"]/a
 # this doesn't work for some reason...?
 date: //p[@class="tagline"]//@datetime
 
-body: //div[@class="expando"]//div[@class="usertext-body"]
+body: (//div[contains(@class, 'noncollapsed')]//div[contains(@class, 'usertext-body')])[1]
 
 strip_id_or_class: tagline
 strip_id_or_class: unvotable-message
@@ -17,4 +17,5 @@ strip_id_or_class: buttons
 single_page_link: //p[@class="title"]/a[contains(@href, 'http://')]
 
 test_url: http://www.reddit.com/r/truegaming/comments/wfe7r/i_wrote_about_the_problems_i_honestly_feel_that/
-test_url: http://www.reddit.com/r/worldnews/comments/1as37r/twelve_north_korean_soldiers_attempting_to_defect/
\ No newline at end of file
+test_url: http://www.reddit.com/r/worldnews/comments/1as37r/twelve_north_korean_soldiers_attempting_to_defect/
+test_url: http://www.reddit.com/r/WritingPrompts/comments/2786lw/wp_in_a_world_where_puns_are_illegal_one_man/chybk8e
\ No newline at end of file
index fb6a1074a4b06e2a081a2bb094a06e6841e7f1f8..9ccc5898133a2e8754cf7c3e44d97e2cb021bd28 100755 (executable)
@@ -1,4 +1,4 @@
-body: //div[@class="storyBox"]
+body: //div[contains(concat(' ',normalize-space(@class),' '),' article ') and (contains(concat(' ',normalize-space(@class),' '),' clear '))]
 title: //div[@class="storyBox"]/h1
 author: //a[@rel="author"]
 date: substring-before(//span[@class="dateline"], 'by')
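Note on the new body rule: `contains(concat(' ', normalize-space(@class), ' '), ' article ')` matches `article` only as a whole class token, whereas a plain `contains(@class, 'article')` would also match a class such as `article-footer`. The sketch below emulates the two XPath string functions in plain Python to show the difference. It is not an XPath engine, and the function names are hypothetical.

```python
def has_class_token(class_attr, token):
    """Emulate contains(concat(' ', normalize-space(@class), ' '), ' token ')."""
    # normalize-space(): trim the ends and collapse whitespace runs to one space
    normalized = " ".join(class_attr.split())
    return f" {token} " in f" {normalized} "

def matches_body_rule(class_attr):
    """The rule above requires both 'article' and 'clear' as whole class tokens."""
    return has_class_token(class_attr, "article") and has_class_token(class_attr, "clear")

def naive_contains(class_attr, token):
    """Emulate the looser contains(@class, 'token') substring test."""
    return token in class_attr
```

For a made-up class value such as `article-footer clear`, `naive_contains` accepts `article` while `matches_body_rule` rejects the element, which is the reason for the longer expression.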
index b52169da921e2e8a4b5a78ebf0038c6ae245a5ed..86e3df5ee0086eea0eac0197fc90764c1c3fd9ca 100755 (executable)
@@ -1,4 +1,4 @@
 #grab the actual content div
 body: //div[@class='rt-article']
 
-test_url: http://www.sourcebooks.com/next/sourcebooks-next-our-blog/1601-another-piece-of-the-e-puzzle-or-when-good-ebook-promotions-go-bad.html
\ No newline at end of file
+test_url: http://www.sourcebooks.com/blog/happy-27th-birthday-sourcebooks.html
diff --git a/inc/3rdparty/site_config/standard/tabletmag.com.txt b/inc/3rdparty/site_config/standard/tabletmag.com.txt
new file mode 100755 (executable)
index 0000000..58b1f5b
--- /dev/null
@@ -0,0 +1,5 @@
+body: //div[contains(@class, 'story-text')]
+
+strip_id_or_class: related
+
+test_url: http://www.tabletmag.com/jewish-news-and-politics/181181/mossberg-parallel-states?all=1
\ No newline at end of file
diff --git a/inc/3rdparty/site_config/standard/tagesspiegel.de.txt b/inc/3rdparty/site_config/standard/tagesspiegel.de.txt
new file mode 100755 (executable)
index 0000000..57e7d3d
--- /dev/null
@@ -0,0 +1,60 @@
+# Author: zinnober
+# Should work with "normal" articles as well as with image galleries
+
+prune: no
+
+# Title
+title: //h1/span[@class='hcf-headline']
+
+# Set author
+author: //a[@rel='author']
+
+# Set date
+date: //span[@class='date hcf-atlas']
+
+# Fetch full multipage articles
+next_page_link: //a[contains(@class, 'hcf-forward')]
+
+# Content is here
+body: //article
+body: //div[contains(@class, 'hcf-screen')]
+
+# Remove tracking and ads
+strip_id_or_class: hcf-ad
+strip_id_or_class: hcf-autoload-ad
+strip_id_or_class: hcf-content-ad
+
+# Tidy up before article
+strip: //article/h1
+strip_id_or_class: hcf-atlas
+strip_id_or_class: hcf-author
+strip_id_or_class: date hcf-atlas
+strip_id_or_class: date hcf-atlas
+
+# General cleanup
+strip: //div[contains(@class, 'hcf-screen')]//h1
+strip: //div[@class='hcf-subpage-titles']//ul
+strip_id_or_class: hcf-doctype-media
+strip_id_or_class: hcf-inline-gallery
+strip_id_or_class: hcf-doctype-video
+strip_id_or_class: hcf-links
+strip_id_or_class: hcf-mini-navi
+strip_id_or_class: hcf-media-control
+strip_id_or_class: hcf-hidden
+replace_string(<span class="hcf-update">Update</span>): <strong>Update: </strong>
+
+# Fix pictures and captions
+replace_string(<a class="hcf-doctype-gallery): <p class="hcf-doctype-gallery
+replace_string(<a class="hcf-doctype-enlarge): <p class="hcf-doctype-enlarge
+replace_string(<figcaption class="hcf-caption">): <br><small><em>
+replace_string(</figcaption>): </em></small>
+
+# Fix image galleries
+replace_string(<a class=" ajaxify): <p class="ajaxify
+replace_string(<div class="hcf-caption"><div><p>): <small><em>
+
+# Try it yourself
+test_url: http://www.tagesspiegel.de/berlin/bezirke/wedding/wedding-jetzt/auf-der-suche-nach-einem-stadtteil-wilder-weiter-wedding/8757156.html
+test_url: http://www.tagesspiegel.de/berlin/olympia-in-berlin-der-flughafen-tegel-soll-das-olympische-dorf-werden/10645036.html
+test_url: http://www.tagesspiegel.de/mediacenter/fotostrecken/berlin/bildergalerie-kreuzberger-der-woche/9305534.html
+
index 0b4bfbd6afa64266e4b9aad433da12bfc3ba4a8f..26eb37b06f435aa079789cfffc22bada59e3eaf7 100755 (executable)
@@ -1,3 +1,3 @@
 single_page_link_in_feed: //b/a
 
-test_url_feed: http://www.techmeme.com/feed.xml
\ No newline at end of file
+test_url: http://www.techmeme.com/feed.xml
index aa41b1533f2efb3b9b264f6a2de6f817d34f0481..3fc5611b2c9964666b671da5cd293fb56f8600a4 100755 (executable)
@@ -15,6 +15,8 @@ strip: //div[@class='earthbox']
 
 single_page_link: //article//a[contains(@class, 'print')]
 
+native_ad_clue: //meta[@property="og:url" and contains(@content, '/sponsored/')]
+
 test_url: http://www.theatlantic.com/technology/archive/2011/04/want-to-see-how-crazy-a-bot-run-market-can-be/237773/
 test_url: http://www.theatlantic.com/magazine/archive/2007/11/the-autumn-of-the-multitaskers/6342/
 test_url: http://www.theatlantic.com/entertainment/archive/2012/04/30-rock-live-a-funny-reminder-of-why-sitcoms-arent-shot-live-anymore/256447/
\ No newline at end of file
index 750f8473302d60b5597fc448f7145719abe244bf..2473cad2c532063623c768aaccc08f95d62115b8 100755 (executable)
@@ -1,5 +1,10 @@
+body: //div[contains(@class, 'entry-content')]//div[contains(@class, 'column-2')]
 single_page_link: //div[contains(@class, 'pagination')]//a[contains(@title, 'ingle page')]
+strip_id_or_class: entry-related
+strip_id_or_class: entry-sidebar
+strip_id_or_class: entry-pagination
 tidy: no
 prune: no
 
-test_url: http://www.theglobeandmail.com/report-on-business/rob-magazine/how-a-novice-miner-survived-a-summer-in-the-klondike/article2345350/
\ No newline at end of file
+test_url: http://www.theglobeandmail.com/report-on-business/rob-magazine/how-a-novice-miner-survived-a-summer-in-the-klondike/article2345350/
+test_url: http://www.theglobeandmail.com/report-on-business/industry-news/energy-and-resources/cliffs-natural-resources-looking-to-exit-ontarios-ring-of-fire/article20651617/
\ No newline at end of file
index c803e4e41532b3718d0d9454295ad90c37109c74..88e2ecf4e29f74874621923e34bdeca5b9dc184c 100755 (executable)
@@ -6,8 +6,19 @@ strip: //div[contains(@class, 'kindleWidget')]
 #strip: //a[not(text())]
 strip_id_or_class: pocket-btn
 author: //li[@class='byline']
+native_ad_clue: //meta[@property="article:tag" and contains(@content, "Partner zone")]
+native_ad_clue: //meta[@property="video:tag" and contains(@content, "Partner zone")]
 prune: no
 tidy: no
+
 test_url: http://www.theguardian.com/world/2013/oct/04/nsa-gchq-attack-tor-network-encryption
+test_contains: The National Security Agency has made repeated attempts to develop
+test_contains: The agency did not directly address those questions, instead providing a statement.
+
 test_url: http://www.theguardian.com/world/2013/oct/03/edward-snowden-files-john-lanchester
-test_url: http://www.theguardian.com/commentisfree/2014/jun/15/britishness-search-identity-my-part-in-camerons-odyssey
\ No newline at end of file
+test_contains: In August, the editor of the Guardian rang me up and asked if I would spend a week in New York
+test_contains: As the second most senior judge in the country, Lord Hoffmann, said in 2004 about a previous version of our anti-terrorism laws
+
+test_url: http://www.theguardian.com/commentisfree/2014/jun/15/britishness-search-identity-my-part-in-camerons-odyssey
+# Native ad
+test_url: http://www.theguardian.com/sustainable-business/2014/jul/18/ben-jerry-turn-ice-cream-into-energy
index 1e1ce58f0405caecd78a30c25810cbd065681ff9..78f8654a00fd7a8fd71d803278d2c86684fdd992 100755 (executable)
@@ -15,6 +15,11 @@ strip: //nav
 strip: //img[contains(@class, 'vox-lazy-load')]
 # deal with bad parsing
 strip: //div[contains(@class, 'story-image')]//div[contains(., 'function(')]
+strip: //div[contains(@class, 'm-linkset')]
+strip: //div[contains(@class, 'm-entry__sidebar')]
+strip: //ul[contains(@class, 'm-article__sources')]
+strip: //div[contains(@class, 'chorus-emc__content')]
+
 
 strip_id_or_class: gallery
 strip_id_or_class: article-meta
@@ -45,4 +50,4 @@ test_url: http://www.theverge.com/2012/2/29/2821763/lytro-review
 test_url: http://www.theverge.com/2011/11/3/2534861/nokia-lumia-800-review
 test_url: http://www.theverge.com/2013/2/24/4026114/barnes-noble-shifting-focus-away-from-nook-hardware
 test_url: http://www.theverge.com/2014/6/19/5824072/top-shelf-living-the-dream
-test_url: http://www.theverge.com/rss/frontpage
\ No newline at end of file
+test_url: http://www.theverge.com/rss/frontpage
diff --git a/inc/3rdparty/site_config/standard/thisiscolossal.com.txt b/inc/3rdparty/site_config/standard/thisiscolossal.com.txt
new file mode 100755 (executable)
index 0000000..ab16ce1
--- /dev/null
@@ -0,0 +1,25 @@
+# Author: zinnober
+
+tidy: no
+prune: no
+
+# Set author
+author: //a[contains(@rel, 'author')]
+
+# Content is here
+body: //article
+
+# Tidy up before article
+strip: //header
+
+# Get rid of doubled images
+strip: //img[contains(@class, '-hidden')]
+
+# Tidy up after article
+strip_id_or_class: social-list
+strip_id_or_class: meta-info
+strip: //footer
+
+# Try it yourself
+test_url: http://www.thisiscolossal.com/2014/09/chicago-in-the-fog-by-michael-salisbury/
+test_url: http://www.thisiscolossal.com/2014/09/bird-portraits-ruffling-with-personality-by-leila-jeffreys/
diff --git a/inc/3rdparty/site_config/standard/towerofthehand.com.txt b/inc/3rdparty/site_config/standard/towerofthehand.com.txt
new file mode 100755 (executable)
index 0000000..a4d87d1
--- /dev/null
@@ -0,0 +1,10 @@
+title: //div[@id='headline']
+body: //div[@class='entry_text']
+author: //div[text() = 'Author:']/following-sibling::div/a
+date: //div[text() = 'Published:']/following-sibling::div
+single_page_link: //a[@href='noscript.html']
+prune: no
+
+test_url: http://towerofthehand.com/blog/2014/08/08-pitch-this-got-spinoff/index.html
+test_url: http://towerofthehand.com/blog/2014/07/31-definitions-and-embodiments/index.html
+test_url: http://towerofthehand.com/blog/2014/07/03-hero-with-thousand-faces/index.html
index 520ebd85925d71ca6ea309531d68e2d6b79b179b..0e5b74878abbb3822b7cd1ff1c7ec96ec19bd0e1 100755 (executable)
@@ -6,4 +6,5 @@ date: //span[contains(@class, 'js-short-timestamp')]/@data-time
 prune: no
 tidy: no
 
-test_url: https://twitter.com/medialens/status/216883678582804480
\ No newline at end of file
+test_url: https://twitter.com/medialens/status/216883678582804480
+test_contains: is all but alone in challenging the tsunami of UK
index efa382244c671d24f5b35b6a7113919bc708ece7..f52339cfa167b7578db950cbe407b379a8e54595 100755 (executable)
@@ -2,6 +2,7 @@ title: //meta[@property="og:title"]/@content
 author: //div[contains(@class, 'byline')]//span[contains(@class, 'name')]
 date: //div[contains(@class, 'cn_date_time')]
 body: //div[contains(@class, 'pageContainers')]
+body: //div[@id='main']
 body: //article[@id='items-container']
 #body: //h2[@class='sub-header'] | //div[contains(@class, 'contributor-type') or @class='display-date' or @class='content-container']
 
@@ -26,5 +27,7 @@ strip: //li[@class='blogNavPrev']
 single_page_link: //a[@title='Print this page']
 
 test_url: http://www.vanityfair.com/politics/features/2011/05/egypt-revolutionaries-201105
+test_contains: nothing can take away from the miracle of Tahrir Square
+
 test_url: http://www.vanityfair.com/politics/features/2008/08/hitchens200808
-test_url: http://www.vanityfair.com/style/2012/01/prisoners-of-style-201201
\ No newline at end of file
+test_url: http://www.vanityfair.com/style/2012/01/prisoners-of-style-201201
diff --git a/inc/3rdparty/site_config/standard/wn.de.txt b/inc/3rdparty/site_config/standard/wn.de.txt
new file mode 100755 (executable)
index 0000000..ef18c8a
--- /dev/null
@@ -0,0 +1,18 @@
+author: //div[@id='main']//div[@class='col right']//div[contains(@class, 'attribute-author')]
+body: //div[@id='main']//div[@class='col right']
+strip_id_or_class: boxes
+strip_id_or_class: lazy
+strip_id_or_class: comment_box
+strip_id_or_class: fb_comments
+
+find_string: <noscript>
+replace_string: <div>
+find_string: </noscript>
+replace_string: </div>
+
+prune: no
+tidy: no
+
+test_url: http://www.wn.de/Muenster/Kultur/1742956-Wilm-Weppelmann-verlaesst-die-Einsiedelei-Und-dann-ab-unter-die-Dusche
+# feed
+test_url: http://www.wn.de/rss/feed/wn_muenster
\ No newline at end of file
index 8c9c1718cc70cf575653d4e7162fa2344787cccd..9815d478f06e3cd269e409ba8ac7f072d7404c28 100755 (executable)
@@ -1,4 +1,3 @@
-# 2014-10-21 [Marmo] added stripping of inline ads and appropriate test_url
 # 2013.10.30 [rezor92] fixed single_page_link
 # 2012-12-23 [carlo@...] fixed half-assed headlines in articles, removed inline author profiles, adjusted picture captions
 # 2012-03-17 [dkless@...] Cut metadata parts in the beginning and the ends of the content block; copyright entries for pictures removed; Author fixed, not sure if old entries still valid (I left them); Weird problems with some pages addressed (see last section for removing hidden section)
@@ -17,8 +16,6 @@ author: substring-after(//li[@class='source first '], 'Quelle: ')
 
 strip_id_or_class: articleheader
 strip: //div[@id="comments"] | //div[@class="pagination block"] | //p[@class="ressortbacklink"] | //div[@id="relatedArticles"]  |  // div[@class="inline portrait"]
-#Remove inline ads
-strip: //div[@class="innerad"]
 
 #Removes author and date from the start
 strip: //ul[@class="tools"]
@@ -46,4 +43,3 @@ strip_id_or_class:"pagination"
 
 footnotes: no
 test_url: http://www.zeit.de/kultur/film/2012-12/Kurzfilmtag
-test_url: http://www.zeit.de/wissen/2014-10/ebola-nigeria-who