morss: use final request url

Code is not very elegant...
crawler: return dict instead of tuple
2020-04-28 22:30:21 +02:00 · 2020-04-28 22:29:07 +02:00 · 2020-04-28 22:07:25 +02:00 · 2020-04-28 22:03:49 +02:00 · 2020-04-28 21:58:26 +02:00 · 2020-04-28 14:47:23 +02:00
14 changed files with 844 additions and 638 deletions
--- a/8
+++ b/8
@@ -0,0 +1,8 @@
 FROM alpine:latest
 RUN apk add python3 py3-lxml py3-gunicorn py3-pip git
 ADD . /app
 RUN pip3 install /app
 CMD gunicorn --bind 0.0.0.0:8080 -w 4 morss:cgi_standalone_app
--- a/README.md
+++ b/README.md
@@ -24,15 +24,13 @@ hand-written rules (ie. there's no automatic detection of links to build feeds).
 Please mind that feeds based on html files may stop working unexpectedly, due to
 html structure changes on the target website.
-Additionally morss can grab the source xml feed of iTunes podcast, and detect
-rss feeds in html pages' `<meta>`.
+Additionally morss can detect rss feeds in html pages' `<meta>`.
 You can use this program online for free at **[morss.it](https://morss.it/)**.
 Some features of morss:
 - Read RSS/Atom feeds
 - Create RSS feeds from json/html pages
-- Convert iTunes podcast links into xml links
 - Export feeds as RSS/JSON/CSV/HTML
 - Fetch full-text content of feed items
 - Follow 301/meta redirects
@@ -48,6 +46,7 @@ You do need:
 - [python](http://www.python.org/) >= 2.6 (python 3 is supported)
 - [lxml](http://lxml.de/) for xml parsing
 - [bs4](https://pypi.org/project/bs4/) for badly-formatted html pages
 - [dateutil](http://labix.org/python-dateutil) to parse feed dates
 - [chardet](https://pypi.python.org/pypi/chardet)
 - [six](https://pypi.python.org/pypi/six), a dependency of chardet
@@ -56,9 +55,13 @@ You do need:
 Simplest way to get these:
 ```shell
-pip install -r requirements.txt
+pip install git+https://git.pictuga.com/pictuga/morss.git@master
 ```
 The dependency `lxml` is fairly long to install (especially on Raspberry Pi, as
 C code needs to be compiled). If possible on your distribution, try installing
 it with the system package manager.
 You may also need:
 - Apache, with python-cgi support, to run on a server
@@ -74,9 +77,10 @@ The arguments are:
 - Change what morss does
 	- `json`: output as JSON
 	- `html`: output as HTML
 	- `csv`: output as CSV
 	- `proxy`: doesn't fill the articles
 	- `clip`: stick the full article content under the original feed content (useful for twitter)
 	- `keep`: by default, morss drops the feed description whenever the full content is found (so as not to mislead users who use Firefox, since the latter only shows the description in the feed preview, so they might believe morss doesn't work), but with this argument, the description is kept
 	- `search=STRING`: does a basic case-sensitive search in the feed
 - Advanced
 	- `csv`: export to csv
@@ -85,14 +89,11 @@ The arguments are:
 	- `noref`: drop items' link
 	- `cache`: only take articles from the cache (ie. don't grab new articles' content), so as to save time
 	- `debug`: to have some feedback from the script execution. Useful for debugging
 	- `mono`: disable multithreading while fetching, makes debugging easier
 	- `theforce`: force download the rss feed and ignore cached http errors
 	- `silent`: don't output the final RSS (useless on its own, but can be nice when debugging)
 	- `encoding=ENCODING`: overrides the encoding auto-detection of the crawler. Some web developers did not quite understand the importance of setting charset/encoding tags correctly...
 - http server only
 	- `callback=NAME`: for JSONP calls
 	- `cors`: allow Cross-origin resource sharing (allows XHR calls from other servers)
 	- `html`: changes the http content-type to html, so that python cgi errors (written in html) are readable in a web browser
 	- `txt`: changes the http content-type to txt (for faster "`view-source:`")
 - Custom feeds: you can turn any HTML page into a RSS feed using morss, using xpath rules. The article content will be fetched as usual (with readabilite). Please note that you will have to **replace** any `/` in your rule with a `|` when using morss as a webserver
 	- `items`: (**mandatory** to activate the custom feeds function) xpath rule to match all the RSS entries
@@ -111,7 +112,6 @@ morss will auto-detect what "mode" to use.
 For this, you'll want to change a bit the architecture of the files, for example
 into something like this.
 ```
 /
 ├── cgi
@@ -143,20 +143,40 @@ ensure that the provided `/www/.htaccess` works well with your server.
 Running this command should do:
 ```shell
-uwsgi --http :9090 --plugin python --wsgi-file main.py
+uwsgi --http :8080 --plugin python --wsgi-file main.py
 ```
-However, one problem might be how to serve the provided `index.html` file if it
-isn't in the same directory. Therefore you can add this at the end of the
-command to point to another directory `--pyargv '--root ../../www/'`.
+#### Using Gunicorn
 ```shell
 gunicorn morss:cgi_standalone_app
 ```
 #### Using docker
 Build & run
 ```shell
 docker build https://git.pictuga.com/pictuga/morss.git -t morss
 docker run -p 8080:8080 morss
 ```
 In one line
 ```shell
 docker run -p 8080:8080 $(docker build -q https://git.pictuga.com/pictuga/morss.git)
 ```
 #### Using morss' internal HTTP server
 Morss can run its own HTTP server. The latter should start when you run morss
 without any argument, on port 8080.
-You can change the port and the location of the `www/` folder like this `python -m morss 9000 --root ../../www`.
+```shell
 morss
 ```
 You can change the port like this `morss 9000`.
 #### Passing arguments
@@ -176,9 +196,9 @@ Works like a charm with [Tiny Tiny RSS](http://tt-rss.org/redmine/projects/tt-rs
 Run:
 ```
-python[2.7] -m morss [argwithoutvalue] [argwithvalue=value] [...] FEEDURL
+morss [argwithoutvalue] [argwithvalue=value] [...] FEEDURL
 ```
-For example: `python -m morss debug http://feeds.bbci.co.uk/news/rss.xml`
+For example: `morss debug http://feeds.bbci.co.uk/news/rss.xml`
 *(Brackets indicate optional text)*
@@ -191,9 +211,9 @@ scripts can be run on top of the RSS feed, using its
 To use this script, you have to enable "(Unix) command" in liferea feed settings, and use the command:
 ```
-[python[2.7]] PATH/TO/MORSS/main.py [argwithoutvalue] [argwithvalue=value] [...] FEEDURL
+morss [argwithoutvalue] [argwithvalue=value] [...] FEEDURL
 ```
-For example: `python2.7 PATH/TO/MORSS/main.py http://feeds.bbci.co.uk/news/rss.xml`
+For example: `morss http://feeds.bbci.co.uk/news/rss.xml`
 *(Brackets indicate optional text)*
@@ -230,20 +250,22 @@ url = 'http://newspaper.example/feed.xml'
 options = morss.Options(csv=True) # arguments
 morss.crawler.sqlite_default = '/tmp/morss-cache.db' # sqlite cache location
-rss = morss.FeedFetch(url, options) # this only grabs the RSS feed
+url = morss.UrlFix(url) # make sure the url is properly formatted
 url, rss = morss.FeedFetch(url, options) # this only grabs the RSS feed
 rss = morss.FeedGather(rss, url, options) # this fills the feed and cleans it up
-output = morss.Format(rss, options) # formats final feed
+output = morss.FeedFormat(rss, options, 'unicode') # formats final feed
 ```
 ## Cache information
-morss uses caching to make loading faster. There are 2 possible cache backends
+morss uses caching to make loading faster. There are 3 possible cache backends
 (visible in `morss/crawler.py`):
 - `{}`: a simple python in-memory dict() object
 - `SQLiteCache`: sqlite3 cache. Default file location is in-memory (i.e. it will
 be cleared every time the program is run)
-- `MySQLCacheHandler`: /!\ Does NOT support multi-threading
+- `MySQLCacheHandler`
 ## Configuration
 ### Length limitation
@@ -262,7 +284,6 @@ different values at the top of the script.
 - `DELAY` sets the browser cache delay, only for HTTP clients
 - `TIMEOUT` sets the HTTP timeout when fetching rss feeds and articles
 - `THREADS` sets the number of threads to use. `1` makes no use of multithreading.
 ### Content matching
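
The command-line form shown in the README (`morss [argwithoutvalue] [argwithvalue=value] [...] FEEDURL`) can be sketched as follows. This is only an illustration under assumptions, not morss' actual code: the names `parse_options` and `split_command_line` are hypothetical, and the sketch assumes that the last argument is always the feed URL.

```python
def parse_options(args):
    """Turn ['debug', 'search=foo', 'cors=false'] into a dict of options."""
    out = {}
    for arg in args:
        key, sep, value = arg.partition('=')
        if not sep:
            # bare flag, e.g. 'debug'
            out[key] = True
        elif value.lower() == 'true':
            out[key] = True
        elif value.lower() == 'false':
            out[key] = False
        else:
            # free-form value, e.g. 'search=foo' or 'encoding=utf-8'
            out[key] = value
    return out


def split_command_line(argv):
    """Hypothetical helper: the last argument is FEEDURL, the rest are options."""
    *options, url = argv
    return parse_options(options), url


if __name__ == '__main__':
    print(split_command_line(['debug', 'http://feeds.bbci.co.uk/news/rss.xml']))
```

Splitting on the first `=` only keeps values that themselves contain `=` intact.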
--- a/main.py
+++ b/main.py
@@ -1,6 +1,6 @@
 #!/usr/bin/env python
-from morss import main, cgi_wrapper as application
+from morss import main, cgi_standalone_app as application
 if __name__ == '__main__':
    main()
--- a/morss/crawler.py
+++ b/morss/crawler.py
@@ -7,14 +7,19 @@ import chardet
 from cgi import parse_header
 import lxml.html
 import time
 import random
 try:
    # python 2
    from urllib2 import BaseHandler, HTTPCookieProcessor, Request, addinfourl, parse_keqv_list, parse_http_list, build_opener
    from urllib import quote
    from urlparse import urlparse, urlunparse
    import mimetools
 except ImportError:
    # python 3
    from urllib.request import BaseHandler, HTTPCookieProcessor, Request, addinfourl, parse_keqv_list, parse_http_list, build_opener
    from urllib.parse import quote
    from urllib.parse import urlparse, urlunparse
    import email
 try:
@@ -27,13 +32,56 @@ except NameError:
 MIMETYPE = {
    'xml': ['text/xml', 'application/xml', 'application/rss+xml', 'application/rdf+xml', 'application/atom+xml', 'application/xhtml+xml'],
    'rss': ['application/rss+xml', 'application/rdf+xml', 'application/atom+xml'],
    'html': ['text/html', 'application/xhtml+xml', 'application/xml']}
-DEFAULT_UA = 'Mozilla/5.0 (X11; Linux x86_64; rv:25.0) Gecko/20100101 Firefox/25.0'
+DEFAULT_UAS = [
    #https://gist.github.com/fijimunkii/952acac988f2d25bef7e0284bc63c406
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.157 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_4) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.1 Safari/605.1.15",
    "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:67.0) Gecko/20100101 Firefox/67.0",
    "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36"
    ]
-def custom_handler(accept=None, strict=False, delay=None, encoding=None, basic=False):
+PROTOCOL = ['http', 'https']
 def get(*args, **kwargs):
    return adv_get(*args, **kwargs)['data']
 def adv_get(url, timeout=None, *args, **kwargs):
    url = sanitize_url(url)
    if timeout is None:
        con = custom_handler(*args, **kwargs).open(url)
    else:
        con = custom_handler(*args, **kwargs).open(url, timeout=timeout)
    data = con.read()
    contenttype = con.info().get('Content-Type', '').split(';')[0]
    encoding= detect_encoding(data, con)
    return {
        'data':data,
        'url': con.geturl(),
        'con': con,
        'contenttype': contenttype,
        'encoding': encoding
    }
 def custom_handler(follow=None, delay=None, encoding=None):
    handlers = []
     # as per urllib2 source code, these Handlers are added first
@@ -50,21 +98,54 @@ def custom_handler(accept=None, strict=False, delay=None, encoding=None, basic=F
    handlers.append(GZIPHandler())
    handlers.append(HTTPEquivHandler())
    handlers.append(HTTPRefreshHandler())
-    handlers.append(UAHandler(DEFAULT_UA))
+    handlers.append(UAHandler(random.choice(DEFAULT_UAS)))
-
+    handlers.append(BrowserlyHeaderHandler())
    if not basic:
        handlers.append(AutoRefererHandler())
    handlers.append(EncodingFixHandler(encoding))
-    if accept:
+    if follow:
-        handlers.append(ContentNegociationHandler(MIMETYPE[accept], strict))
+        handlers.append(AlternateHandler(MIMETYPE[follow]))
    handlers.append(CacheHandler(force_min=delay))
    return build_opener(*handlers)
 def is_ascii(string):
    # there's a native function in py3, but home-made fix for backward compatibility
    try:
        string.encode('ascii')
    except UnicodeError:
        return False
    else:
        return True
 def sanitize_url(url):
    if isinstance(url, bytes):
        url = url.decode()
    if url.split(':', 1)[0] not in PROTOCOL:
        url = 'http://' + url
    url = url.replace(' ', '%20')
    # Escape non-ascii unicode characters
    # https://stackoverflow.com/a/4391299
    parts = list(urlparse(url))
    for i in range(len(parts)):
        if not is_ascii(parts[i]):
            if i == 1:
                parts[i] = parts[i].encode('idna').decode('ascii')
            else:
                parts[i] = quote(parts[i].encode('utf-8'))
    return urlunparse(parts)
 class DebugHandler(BaseHandler):
    handler_order = 2000
@@ -132,6 +213,15 @@ class GZIPHandler(BaseHandler):
 def detect_encoding(data, resp=None):
    enc = detect_raw_encoding(data, resp)
    if enc == 'gb2312':
        enc = 'gbk'
    return enc
 def detect_raw_encoding(data, resp=None):
    if resp is not None:
        enc = resp.headers.get('charset')
        if enc is not None:
@@ -196,48 +286,37 @@ class UAHandler(BaseHandler):
    https_request = http_request
-class AutoRefererHandler(BaseHandler):
+class BrowserlyHeaderHandler(BaseHandler):
    """ Add more headers to look less suspicious """
    def http_request(self, req):
-        req.add_unredirected_header('Referer', 'http://%s' % req.host)
+        req.add_unredirected_header('Accept', 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8')
        req.add_unredirected_header('Accept-Language', 'en-US,en;q=0.5')
        return req
    https_request = http_request
-class ContentNegociationHandler(BaseHandler):
+class AlternateHandler(BaseHandler):
-    " Handler for content negociation. Also parses <link rel='alternate' type='application/rss+xml' href='...' /> "
+    " Follow <link rel='alternate' type='application/rss+xml' href='...' /> "
-    def __init__(self, accept=None, strict=False):
+    def __init__(self, follow=None):
-        self.accept = accept
+        self.follow = follow or []
        self.strict = strict
    def http_request(self, req):
        if self.accept is not None:
            if isinstance(self.accept, basestring):
                self.accept = (self.accept,)
            string = ','.join(self.accept)
            if self.strict:
                string += ',*/*;q=0.9'
            req.add_unredirected_header('Accept', string)
        return req
    def http_response(self, req, resp):
        contenttype = resp.info().get('Content-Type', '').split(';')[0]
-        if 200 <= resp.code < 300 and self.accept is not None and self.strict and contenttype in MIMETYPE['html'] and contenttype not in self.accept:
+        if 200 <= resp.code < 300 and len(self.follow) and contenttype in MIMETYPE['html'] and contenttype not in self.follow:
             # oops, not what we were looking for, let's see if the html page suggests an alternative page of the right type
            data = resp.read()
            links = lxml.html.fromstring(data[:10000]).findall('.//link[@rel="alternate"]')
            for link in links:
-                if link.get('type', '') in self.accept:
+                if link.get('type', '') in self.follow:
                    resp.code = 302
                    resp.msg = 'Moved Temporarily'
                    resp.headers['location'] = link.get('href')
                    break
            fp = BytesIO(data)
            old_resp = resp
@@ -246,7 +325,6 @@ class ContentNegociationHandler(BaseHandler):
        return resp
    https_request = http_request
    https_response = http_response
@@ -384,7 +462,7 @@ class CacheHandler(BaseHandler):
        elif  self.force_min is None and ('no-cache' in cc_list
                                        or 'no-store' in cc_list
-                                        or ('private' in cc_list and not self.private)):
+                                        or ('private' in cc_list and not self.private_cache)):
            # kindly follow web servers indications, refresh
            return None
@@ -419,7 +497,7 @@ class CacheHandler(BaseHandler):
            cc_list = [x for x in cache_control if '=' not in x]
-            if 'no-cache' in cc_list or 'no-store' in cc_list or ('private' in cc_list and not self.private):
+            if 'no-cache' in cc_list or 'no-store' in cc_list or ('private' in cc_list and not self.private_cache):
                # kindly follow web servers indications
                return resp
@@ -461,6 +539,8 @@ class CacheHandler(BaseHandler):
 class BaseCache:
    """ Subclasses must behave like a dict """
    def __contains__(self, url):
        try:
            self[url]
@@ -477,7 +557,7 @@ import sqlite3
 class SQLiteCache(BaseCache):
    def __init__(self, filename=':memory:'):
-        self.con = sqlite3.connect(filename or sqlite_default, detect_types=sqlite3.PARSE_DECLTYPES, check_same_thread=False)
+        self.con = sqlite3.connect(filename, detect_types=sqlite3.PARSE_DECLTYPES, check_same_thread=False)
        with self.con:
            self.con.execute('CREATE TABLE IF NOT EXISTS data (url UNICODE PRIMARY KEY, code INT, msg UNICODE, headers UNICODE, data BLOB, timestamp INT)')
@@ -513,18 +593,20 @@ import pymysql.cursors
 class MySQLCacheHandler(BaseCache):
    " NB. Requires mono-threading, as pymysql isn't thread-safe "
    def __init__(self, user, password, database, host='localhost'):
-        self.con = pymysql.connect(host=host, user=user, password=password, database=database, charset='utf8', autocommit=True)
+        self.user = user
        self.password = password
        self.database = database
        self.host = host
-        with self.con.cursor() as cursor:
+        with self.cursor() as cursor:
            cursor.execute('CREATE TABLE IF NOT EXISTS data (url VARCHAR(255) NOT NULL PRIMARY KEY, code INT, msg TEXT, headers TEXT, data BLOB, timestamp INT)')
-    def __del__(self):
+    def cursor(self):
-        self.con.close()
+        return pymysql.connect(host=self.host, user=self.user, password=self.password, database=self.database, charset='utf8', autocommit=True).cursor()
    def __getitem__(self, url):
-        cursor = self.con.cursor()
+        cursor = self.cursor()
        cursor.execute('SELECT * FROM data WHERE url=%s', (url,))
        row = cursor.fetchone()
@@ -535,10 +617,17 @@ class MySQLCacheHandler(BaseCache):
    def __setitem__(self, url, value): # (code, msg, headers, data, timestamp)
        if url in self:
-            with self.con.cursor() as cursor:
+            with self.cursor() as cursor:
                cursor.execute('UPDATE data SET code=%s, msg=%s, headers=%s, data=%s, timestamp=%s WHERE url=%s',
                    value + (url,))
        else:
-            with self.con.cursor() as cursor:
+            with self.cursor() as cursor:
                cursor.execute('INSERT INTO data VALUES (%s,%s,%s,%s,%s,%s)', (url,) + value)
 if __name__ == '__main__':
    req = adv_get(sys.argv[1] if len(sys.argv) > 1 else 'https://morss.it')
    if not sys.flags.interactive:
        print(req['data'].decode(req['encoding']))
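
A standalone sketch of the URL clean-up that `adv_get` applies before opening a connection, following `sanitize_url` in the diff above. It assumes Python 3 only (the Python 2 import fallback is dropped), and the sample URLs are made up for illustration.

```python
from urllib.parse import quote, urlparse, urlunparse

PROTOCOL = ['http', 'https']


def is_ascii(string):
    try:
        string.encode('ascii')
    except UnicodeError:
        return False
    return True


def sanitize_url(url):
    if isinstance(url, bytes):
        url = url.decode()

    # prepend a scheme when none of the known ones is present
    if url.split(':', 1)[0] not in PROTOCOL:
        url = 'http://' + url

    url = url.replace(' ', '%20')

    # escape non-ascii characters: IDNA for the host, percent-encoding elsewhere
    parts = list(urlparse(url))
    for i, part in enumerate(parts):
        if not is_ascii(part):
            if i == 1:
                parts[i] = part.encode('idna').decode('ascii')
            else:
                parts[i] = quote(part.encode('utf-8'))

    return urlunparse(parts)


if __name__ == '__main__':
    for sample in ('example.com/a b', 'http://example.com/café'):
        print(sanitize_url(sample))
```

Index `1` of the `urlparse` result is the network location, which is why only that component goes through the `idna` codec.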
--- a/morss/feedify.ini
+++ b/morss/feedify.ini
@@ -90,8 +90,11 @@ item_updated = updated
 [html]
 mode = html
 path =
  http://localhost/
 title = //div[@id='header']/h1
-desc = //div[@id='header']/h2
+desc = //div[@id='header']/p
 items = //div[@id='content']/div
 item_title = ./a
@@ -99,7 +102,7 @@ item_link = ./a/@href
 item_desc = ./div[class=desc]
 item_content = ./div[class=content]
-base = <!DOCTYPE html> <html> <head> <title>Feed reader by morss</title> <meta name="viewport" content="width=device-width; initial-scale=1.0; maximum-scale=1.0;" /> </head> <body> <div id="header"> <h1>@feed.title</h1> <h2>@feed.desc</h2> <p>- via morss</p> </div> <div id="content"> <div class="item"> <a class="title link" href="@item.link" target="_blank">@item.title</a> <div class="desc">@item.desc</div> <div class="content">@item.content</div> </div> </div> <script> var items = document.getElementsByClassName('item') for (var i in items) items[i].onclick = function() { this.classList.toggle('active') document.body.classList.toggle('noscroll') } </script> </body> </html>
+base = file:sheet.xsl
 [twitter]
 mode = html
--- a/morss/feedify.py
+++ b/morss/feedify.py
@@ -1,28 +0,0 @@
 import re
 import json
 from . import crawler
 try:
    basestring
 except NameError:
    basestring = str
 def pre_worker(url):
    if url.startswith('http://itunes.apple.com/') or url.startswith('https://itunes.apple.com/'):
        match = re.search('/id([0-9]+)(\?.*)?$', url)
        if match:
            iid = match.groups()[0]
            redirect = 'https://itunes.apple.com/lookup?id=%s' % iid
            try:
                con = crawler.custom_handler(basic=True).open(redirect, timeout=4)
                data = con.read()
            except (IOError, HTTPException):
                raise
            return json.loads(data.decode('utf-8', 'replace'))['results'][0]['feedUrl']
    return None
--- a/morss/feeds.py
+++ b/morss/feeds.py
@@ -15,6 +15,7 @@ import dateutil.parser
 from copy import deepcopy
 import lxml.html
 from .readabilite import parse as html_parse
 json.encoder.c_make_encoder = None
@@ -45,14 +46,32 @@ def parse_rules(filename=None):
    rules = dict([(x, dict(config.items(x))) for x in config.sections()])
    for section in rules.keys():
        # for each ruleset
        for arg in rules[section].keys():
-            if '\n' in rules[section][arg]:
+            # for each rule
            if rules[section][arg].startswith('file:'):
                paths = [os.path.join(sys.prefix, 'share/morss/www', rules[section][arg][5:]),
                    os.path.join(os.path.dirname(__file__), '../www', rules[section][arg][5:]),
                    os.path.join(os.path.dirname(__file__), '../..', rules[section][arg][5:])]
                for path in paths:
                    try:
                        file_raw = open(path).read()
                        file_clean = re.sub('<[/?]?(xsl|xml)[^>]+?>', '', file_raw)
                        rules[section][arg] = file_clean
                    except IOError:
                        pass
            elif '\n' in rules[section][arg]:
                rules[section][arg] = rules[section][arg].split('\n')[1:]
    return rules
-def parse(data, url=None, mimetype=None):
+def parse(data, url=None, encoding=None):
    " Determine which ruleset to use "
    rulesets = parse_rules()
@@ -66,28 +85,22 @@ def parse(data, url=None, mimetype=None):
                for path in ruleset['path']:
                    if fnmatch(url, path):
                        parser = [x for x in parsers if x.mode == ruleset['mode']][0]
-                        return parser(data, ruleset) 
+                        return parser(data, ruleset, encoding=encoding)
-    # 2) Look for a parser based on mimetype
+    # 2) Try each and every parser
    if mimetype is not None:
        parser_candidates = [x for x in parsers if mimetype in x.mimetype]
    if mimetype is None or parser_candidates is None:
        parser_candidates = parsers
    # 3) Look for working ruleset for given parser
        # 3a) See if parsing works
        # 3b) See if .items matches anything
-    for parser in parser_candidates:
+    for parser in parsers:
        ruleset_candidates = [x for x in rulesets.values() if x['mode'] == parser.mode and 'path' not in x]
             # 'path' as they should have been caught beforehand
        try:
-            feed = parser(data)
+            feed = parser(data, encoding=encoding)
-        except (ValueError):
+        except (ValueError, SyntaxError):
            # parsing did not work
            pass
@@ -112,7 +125,7 @@ def parse(data, url=None, mimetype=None):
 class ParserBase(object):
-    def __init__(self, data=None, rules=None, parent=None):
+    def __init__(self, data=None, rules=None, parent=None, encoding=None):
        if rules is None:
            rules = parse_rules()[self.default_ruleset]
@@ -121,9 +134,10 @@ class ParserBase(object):
        if data is None:
            data = rules['base']
-        self.root = self.parse(data)
         self.parent = parent
+        self.encoding = encoding
+        self.root = self.parse(data)
    def parse(self, raw):
        pass
@@ -148,15 +162,15 @@ class ParserBase(object):
        c = csv.writer(out, dialect=csv.excel)
        for item in self.items:
-            row = [getattr(item, x) for x in item.dic]
-            if encoding != 'unicode':
-                row = [x.encode(encoding) if isinstance(x, unicode) else x for x in row]
-            c.writerow(row)
+            c.writerow([getattr(item, x) for x in item.dic])
         out.seek(0)
-        return out.read()
+        out = out.read()
+        if encoding != 'unicode':
+            out = out.encode(encoding)
+        return out
    def tohtml(self, **k):
        return self.convert(FeedHTML).tostring(**k)
@@ -267,8 +281,15 @@ class ParserBase(object):
        except AttributeError:
            # does not exist, have to create it
-            self.rule_create(self.rules[rule_name])
+            try:
-            self.rule_set(self.rules[rule_name], value)
+                self.rule_create(self.rules[rule_name])
            except AttributeError:
                # no way to create it, give up
                pass
            else:
                self.rule_set(self.rules[rule_name], value)
    def rmv(self, rule_name):
        # easy deleter
@@ -286,10 +307,7 @@ class ParserXML(ParserBase):
    NSMAP = {'atom': 'http://www.w3.org/2005/Atom',
        'atom03': 'http://purl.org/atom/ns#',
        'media': 'http://search.yahoo.com/mrss/',
        'rdf': 'http://www.w3.org/1999/02/22-rdf-syntax-ns#',
        'slash': 'http://purl.org/rss/1.0/modules/slash/',
        'dc': 'http://purl.org/dc/elements/1.1/',
        'content': 'http://purl.org/rss/1.0/modules/content/',
        'rssfake': 'http://purl.org/rss/1.0/'}
@@ -383,7 +401,8 @@ class ParserXML(ParserBase):
            return
        elif key is not None:
-            del x.attrib[key]
+            if key in match.attrib:
                del match.attrib[key]
        else:
            match.getparent().remove(match)
@@ -401,13 +420,14 @@ class ParserXML(ParserBase):
        else:
            if html_rich:
                # atom stuff
                if 'atom' in rule:
                    match.attrib['type'] = 'xhtml'
                self._clean_node(match)
                match.append(lxml.html.fragment_fromstring(value, create_parent='div'))
-                match.find('div').drop_tag()
+
                if self.rules['mode'] == 'html':
                    match.find('div').drop_tag() # not supported by lxml.etree
                else: # i.e. if atom
                    match.attrib['type'] = 'xhtml'
            else:
                if match is not None and len(match):
@@ -440,8 +460,7 @@ class ParserHTML(ParserXML):
    mimetype = ['text/html', 'application/xhtml+xml']
    def parse(self, raw):
-        parser = etree.HTMLParser(remove_blank_text=True) # remove_blank_text needed for pretty_print
-        return etree.fromstring(raw, parser)
+        return html_parse(raw, encoding=self.encoding)
    def tostring(self, encoding='unicode', **k):
        return lxml.html.tostring(self.root, encoding=encoding, **k)
@@ -467,6 +486,9 @@ class ParserHTML(ParserXML):
            element = deepcopy(match)
            match.getparent().append(element)
        else:
            raise AttributeError('no way to create item')
 def parse_time(value):
    if value is None or value == 0:
@@ -474,13 +496,13 @@ def parse_time(value):
    elif isinstance(value, basestring):
        if re.match(r'^[0-9]+$', value):
-            return datetime.fromtimestamp(int(value), tz.UTC)
+            return datetime.fromtimestamp(int(value), tz.tzutc())
        else:
-            return dateutil.parser.parse(value)
+            return dateutil.parser.parse(value).replace(tzinfo=tz.tzutc())
    elif isinstance(value, int):
-        return datetime.fromtimestamp(value, tz.UTC)
+        return datetime.fromtimestamp(value, tz.tzutc())
    elif isinstance(value, datetime):
        return value
@@ -732,3 +754,14 @@ class ItemJSON(Item, ParserJSON):
                return
            cur = cur[node]
 if __name__ == '__main__':
    from . import crawler
    req = crawler.adv_get(sys.argv[1] if len(sys.argv) > 1 else 'https://www.nytimes.com/', follow='rss')
    feed = parse(req['data'], url=req['url'], encoding=req['encoding'])
    if not sys.flags.interactive:
        for item in feed.items:
            print(item.title, item.link)
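
The `file:` branch added to `parse_rules` reads a stylesheet and strips its XSL/XML tags so that the remaining HTML can serve as the `base` template. A minimal sketch of that clean-up step, using the same regular expression as the diff; the function name `strip_xsl_tags` and the sample stylesheet are assumptions made for illustration.

```python
import re


def strip_xsl_tags(raw):
    """Remove <?xml ...?>, <xsl:...> and </xsl:...> tags, keep everything else."""
    return re.sub('<[/?]?(xsl|xml)[^>]+?>', '', raw)


if __name__ == '__main__':
    # invented sample, much shorter than a real stylesheet
    sample = (
        '<?xml version="1.0"?>'
        '<xsl:stylesheet version="1.1">'
        '<html><body><h1>title</h1></body></html>'
        '</xsl:stylesheet>'
    )
    print(strip_xsl_tags(sample))
```

Note that the pattern only removes the tags themselves: any text placed between XSL elements stays in the output.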
--- a/morss/morss.py
+++ b/morss/morss.py
@@ -1,9 +1,10 @@
 import sys
 import os
 import os.path
 import time
-import threading
+import time
 from datetime import datetime
 from dateutil import tz
 from fnmatch import fnmatch
 import re
@@ -12,47 +13,44 @@ import lxml.etree
 import lxml.html
 from . import feeds
 from . import feedify
 from . import crawler
 from . import readabilite
 import wsgiref.simple_server
 import wsgiref.handlers
 import cgitb
 try:
    # python 2
    from Queue import Queue
    from httplib import HTTPException
-    from urllib import quote_plus
+    from urllib import unquote
    from urlparse import urlparse, urljoin, parse_qs
 except ImportError:
    # python 3
    from queue import Queue
    from http.client import HTTPException
-    from urllib.parse import quote_plus
+    from urllib.parse import unquote
    from urllib.parse import urlparse, urljoin, parse_qs
-LIM_ITEM = 100  # deletes what's beyond
+MAX_ITEM = 5  # cache-only beyond
-LIM_TIME = 7  # deletes what's after
+MAX_TIME = 2  # cache-only after (in sec)
-MAX_ITEM = 50  # cache-only beyond
+
-MAX_TIME = 7  # cache-only after (in sec)
+LIM_ITEM = 10  # deletes what's beyond
 LIM_TIME = 2.5  # deletes what's after
 DELAY = 10 * 60  # xml cache & ETag cache (in sec)
 TIMEOUT = 4  # http timeout (in sec)
 THREADS = 10  # number of threads (1 for single-threaded)
 DEBUG = False
 PORT = 8080
 PROTOCOL = ['http', 'https', 'ftp']
 def filterOptions(options):
    return options
    # example of filtering code below
-    #allowed = ['proxy', 'clip', 'keep', 'cache', 'force', 'silent', 'pro', 'debug']
+    #allowed = ['proxy', 'clip', 'cache', 'force', 'silent', 'pro', 'debug']
    #filtered = dict([(key,value) for (key,value) in options.items() if key in allowed])
    #return filtered
@@ -66,6 +64,7 @@ def log(txt, force=False):
    if DEBUG or force:
        if 'REQUEST_URI' in os.environ:
            open('morss.log', 'a').write("%s\n" % repr(txt))
        else:
            print(repr(txt))
@@ -73,6 +72,7 @@ def log(txt, force=False):
 def len_html(txt):
    if len(txt):
        return len(lxml.html.fromstring(txt).text_content())
    else:
        return 0
@@ -80,6 +80,7 @@ def len_html(txt):
 def count_words(txt):
    if len(txt):
        return len(lxml.html.fromstring(txt).text_content().split())
    return 0
@@ -88,12 +89,14 @@ class Options:
        if len(args):
            self.options = args
            self.options.update(options or {})
        else:
            self.options = options or {}
    def __getattr__(self, key):
        if key in self.options:
            return self.options[key]
        else:
            return False
@@ -107,17 +110,23 @@ class Options:
 def parseOptions(options):
    """ Turns ['md=True'] into {'md':True} """
    out = {}
    for option in options:
        split = option.split('=', 1)
        if len(split) > 1:
             if split[1].lower() == 'true':
                 out[split[0]] = True
             elif split[1].lower() == 'false':
                 out[split[0]] = False
            else:
                out[split[0]] = split[1]
        else:
            out[split[0]] = True
    return out
@@ -125,7 +134,7 @@ def ItemFix(item, feedurl='/'):
    """ Improves feed items (absolute links, resolve feedburner links, etc) """
    # check unwanted uppercase title
-    if len(item.title) > 20 and item.title.isupper():
+    if item.title is not None and len(item.title) > 20 and item.title.isupper():
        item.title = item.title.title()
    # check if it includes link
@@ -158,6 +167,11 @@ def ItemFix(item, feedurl='/'):
        item.link = parse_qs(urlparse(item.link).query)['url'][0]
        log(item.link)
    # pocket
    if fnmatch(item.link, 'https://getpocket.com/redirect?url=*'):
        item.link = parse_qs(urlparse(item.link).query)['url'][0]
        log(item.link)
    # facebook
    if fnmatch(item.link, 'https://www.facebook.com/l.php?u=*'):
        item.link = parse_qs(urlparse(item.link).query)['u'][0]
@@ -183,7 +197,7 @@ def ItemFix(item, feedurl='/'):
    # reddit
    if urlparse(feedurl).netloc == 'www.reddit.com':
-        match = lxml.html.fromstring(item.desc).xpath('//a[text()="[link]"]/@href')
+        match = lxml.html.fromstring(item.content).xpath('//a[text()="[link]"]/@href')
        if len(match):
            item.link = match[0]
            log(item.link)
@@ -208,6 +222,7 @@ def ItemFill(item, options, feedurl='/', fast=False):
        if len(match):
            link = match[0]
            log(link)
        else:
            link = None
@@ -217,6 +232,7 @@ def ItemFill(item, options, feedurl='/', fast=False):
        if len(match) and urlparse(match[0]).netloc != 'www.facebook.com':
            link = match[0]
            log(link)
        else:
            link = None
@@ -232,19 +248,17 @@ def ItemFill(item, options, feedurl='/', fast=False):
        delay = -2
    try:
-        con = crawler.custom_handler('html', False, delay, options.encoding).open(link, timeout=TIMEOUT)
-        data = con.read()
+        req = crawler.adv_get(url=link, delay=delay, timeout=TIMEOUT)
    except (IOError, HTTPException) as e:
        log('http error')
        return False # let's just delete errors stuff when in cache mode
-    contenttype = con.info().get('Content-Type', '').split(';')[0]
-    if contenttype not in crawler.MIMETYPE['html'] and contenttype != 'text/plain':
+    if req['contenttype'] not in crawler.MIMETYPE['html'] and req['contenttype'] != 'text/plain':
        log('non-text page')
        return True
-    out = readabilite.get_article(data, link, options.encoding or crawler.detect_encoding(data, con))
+    out = readabilite.get_article(req['data'], url=req['url'], encoding_in=req['encoding'], encoding_out='unicode')
    if out is not None:
        item.content = out
@@ -268,9 +282,6 @@ def ItemAfter(item, options):
        item.content = item.desc + "<br/><br/><center>* * *</center><br/><br/>" + item.content
        del item.desc
-    if not options.keep and not options.proxy:
-        del item.desc
    if options.nolink and item.content:
        content = lxml.html.fromstring(item.content)
        for link in content.xpath('//a'):
@@ -285,26 +296,6 @@ def ItemAfter(item, options):
 def FeedFetch(url, options):
    # basic url clean-up
    if url is None:
        raise MorssException('No url provided')
    if urlparse(url).scheme not in PROTOCOL:
        url = 'http://' + url
        log(url)
    url = url.replace(' ', '%20')
    if isinstance(url, bytes):
        url = url.decode()
    # allow for code execution for feedify
    pre = feedify.pre_worker(url)
    if pre:
        url = pre
        log('url redirect')
        log(url)
    # fetch feed
    delay = DELAY
@@ -312,44 +303,43 @@ def FeedFetch(url, options):
        delay = 0
    try:
-        con = crawler.custom_handler(accept='xml', strict=True, delay=delay,
-            encoding=options.encoding, basic=not options.items) \
-            .open(url, timeout=TIMEOUT * 2)
-        xml = con.read()
+        req = crawler.adv_get(url=url, follow='rss', delay=delay, timeout=TIMEOUT * 2)
    except (IOError, HTTPException):
        raise MorssException('Error downloading feed')
-    contenttype = con.info().get('Content-Type', '').split(';')[0]
    if options.items:
        # using custom rules
-        rss = feeds.FeedHTML(xml, url, contenttype)
+        rss = feeds.FeedHTML(req['data'], encoding=req['encoding'])
-        feed.rule
+
        rss.rules['title'] = options.title              if options.title        else '//head/title'
        rss.rules['desc'] = options.desc                if options.desc         else '//head/meta[@name="description"]/@content'
        rss.rules['items'] = options.items
-        if options.item_title:
-            rss.rules['item_title'] = options.item_title
-        if options.item_link:
-            rss.rules['item_link'] = options.item_link
+        rss.rules['item_title'] = options.item_title    if options.item_title   else './/a|.'
+        rss.rules['item_link'] = options.item_link      if options.item_link    else './@href|.//a/@href'
+
        if options.item_content:
            rss.rules['item_content'] = options.item_content
        if options.item_time:
            rss.rules['item_time'] = options.item_time
        rss = rss.convert(feeds.FeedXML)
    else:
        try:
-            rss = feeds.parse(xml, url, contenttype)
+            rss = feeds.parse(req['data'], url=url, encoding=req['encoding'])
            rss = rss.convert(feeds.FeedXML)
                # contains all fields, otherwise much-needed data can be lost
        except TypeError:
            log('random page')
-            log(contenttype)
+            log(req['contenttype'])
            raise MorssException('Link provided is not a valid feed')
-    return rss
+    return req['url'], rss
 def FeedGather(rss, url, options):
@@ -361,34 +351,22 @@ def FeedGather(rss, url, options):
    lim_time = LIM_TIME
    max_item = MAX_ITEM
    max_time = MAX_TIME
    threads = THREADS
    if options.cache:
        max_time = 0
-    if options.mono:
-        threads = 1
-
-    # set
-    def runner(queue):
-        while True:
-            value = queue.get()
-            try:
-                worker(*value)
-            except Exception as e:
-                log('Thread Error: %s' % e.message)
-            queue.task_done()
-    def worker(i, item):
+    now = datetime.now(tz.tzutc())
+    sorted_items = sorted(rss.items, key=lambda x:x.updated or x.time or now, reverse=True)
+    for i, item in enumerate(sorted_items):
        if time.time() - start_time > lim_time >= 0 or i + 1 > lim_item >= 0:
            log('dropped')
            item.remove()
-            return
+            continue
        item = ItemBefore(item, options)
        if item is None:
-            return
+            continue
        item = ItemFix(item, url)
@@ -396,7 +374,7 @@ def FeedGather(rss, url, options):
            if not options.proxy:
                if ItemFill(item, options, url, True) is False:
                    item.remove()
-                    return
+                    continue
        else:
            if not options.proxy:
@@ -404,22 +382,6 @@ def FeedGather(rss, url, options):
        item = ItemAfter(item, options)
-    queue = Queue()
-    for i in range(threads):
-        t = threading.Thread(target=runner, args=(queue,))
-        t.daemon = True
-        t.start()
-    for i, item in enumerate(list(rss.items)):
-        if threads == 1:
-            worker(*[i, item])
-        else:
-            queue.put([i, item])
-    if threads != 1:
-        queue.join()
    if options.ad:
        new = rss.items.append()
        new.title = "Are you hungry?"
@@ -433,37 +395,38 @@ def FeedGather(rss, url, options):
    return rss
-def FeedFormat(rss, options):
+def FeedFormat(rss, options, encoding='utf-8'):
    if options.callback:
        if re.match(r'^[a-zA-Z0-9\.]+$', options.callback) is not None:
-            return '%s(%s)' % (options.callback, rss.tojson())
+            out = '%s(%s)' % (options.callback, rss.tojson(encoding='unicode'))
+            return out if encoding == 'unicode' else out.encode(encoding)
        else:
            raise MorssException('Invalid callback var name')
    elif options.json:
        if options.indent:
-            return rss.tojson(encoding='UTF-8', indent=4)
+            return rss.tojson(encoding=encoding, indent=4)
        else:
-            return rss.tojson(encoding='UTF-8')
+            return rss.tojson(encoding=encoding)
    elif options.csv:
-        return rss.tocsv(encoding='UTF-8')
+        return rss.tocsv(encoding=encoding)
-    elif options.reader:
+    elif options.html:
        if options.indent:
-            return rss.tohtml(encoding='UTF-8', pretty_print=True)
+            return rss.tohtml(encoding=encoding, pretty_print=True)
        else:
-            return rss.tohtml(encoding='UTF-8')
+            return rss.tohtml(encoding=encoding)
    else:
        if options.indent:
-            return rss.torss(xml_declaration=True, encoding='UTF-8', pretty_print=True)
+            return rss.torss(xml_declaration=(not encoding == 'unicode'), encoding=encoding, pretty_print=True)
        else:
-            return rss.torss(xml_declaration=True, encoding='UTF-8')
+            return rss.torss(xml_declaration=(not encoding == 'unicode'), encoding=encoding)
 def process(url, cache=None, options=None):
@@ -475,14 +438,15 @@ def process(url, cache=None, options=None):
    if cache:
        crawler.default_cache = crawler.SQLiteCache(cache)
-    rss = FeedFetch(url, options)
+    url, rss = FeedFetch(url, options)
    rss = FeedGather(rss, url, options)
-    return FeedFormat(rss, options)
+    return FeedFormat(rss, options, 'unicode')
-def cgi_app(environ, start_response):
+def cgi_parse_environ(environ):
    # get options
    if 'REQUEST_URI' in environ:
        url = environ['REQUEST_URI'][1:]
    else:
@@ -496,7 +460,7 @@ def cgi_app(environ, start_response):
    if url.startswith(':'):
        split = url.split('/', 1)
-        options = split[0].replace('|', '/').replace('\\\'', '\'').split(':')[1:]
+        raw_options = unquote(split[0]).replace('|', '/').replace('\\\'', '\'').split(':')[1:]
        if len(split) > 1:
            url = split[1]
@@ -504,15 +468,22 @@ def cgi_app(environ, start_response):
            url = ''
    else:
-        options = []
+        raw_options = []
    # init
-    options = Options(filterOptions(parseOptions(options)))
+    options = Options(filterOptions(parseOptions(raw_options)))
    headers = {}
    global DEBUG
    DEBUG = options.debug
+    return (url, options)
+def cgi_app(environ, start_response):
+    url, options = cgi_parse_environ(environ)
+    headers = {}
    # headers
    headers['status'] = '200 OK'
    headers['cache-control'] = 'max-age=%s' % DELAY
@@ -520,7 +491,7 @@ def cgi_app(environ, start_response):
    if options.cors:
        headers['access-control-allow-origin'] = '*'
-    if options.html or options.reader:
+    if options.html:
        headers['content-type'] = 'text/html'
    elif options.txt or options.silent:
        headers['content-type'] = 'text/plain'
@@ -534,10 +505,12 @@ def cgi_app(environ, start_response):
    else:
        headers['content-type'] = 'text/xml'
    headers['content-type'] += '; charset=utf-8'
    crawler.default_cache = crawler.SQLiteCache(os.path.join(os.getcwd(), 'morss-cache.db'))
    # get the work done
-    rss = FeedFetch(url, options)
+    url, rss = FeedFetch(url, options)
    if headers['content-type'] == 'text/xml':
        headers['content-type'] = rss.mimetype[0]
@@ -547,18 +520,42 @@ def cgi_app(environ, start_response):
    rss = FeedGather(rss, url, options)
    out = FeedFormat(rss, options)
-    if not options.silent:
+    if options.silent:
-        return out
+        return ['']
    else:
        return [out]
-def cgi_wrapper(environ, start_response):
-    # simple http server for html and css
+def middleware(func):
+    " Decorator to turn a function into a wsgi middleware "
+    # This is called when parsing the "@middleware" code
+    def app_builder(app):
+        # This is called when doing app = cgi_wrapper(app)
+        def app_wrap(environ, start_response):
+            # This is called when a http request is being processed
+            return func(environ, start_response, app)
+        return app_wrap
+    return app_builder
+@middleware
+def cgi_file_handler(environ, start_response, app):
+    " Simple HTTP server to serve static files (.html, .css, etc.) "
    files = {
        '': 'text/html',
-        'index.html': 'text/html'}
+        'index.html': 'text/html',
+        'sheet.xsl': 'text/xsl'}
    if 'REQUEST_URI' in environ:
        url = environ['REQUEST_URI'][1:]
    else:
        url = environ['PATH_INFO'][1:]
@@ -568,35 +565,103 @@ def cgi_wrapper(environ, start_response):
        if url == '':
            url = 'index.html'
-        if '--root' in sys.argv[1:]:
-            path = os.path.join(sys.argv[-1], url)
+        paths = [os.path.join(sys.prefix, 'share/morss/www', url),
+            os.path.join(os.path.dirname(__file__), '../www', url)]
+        for path in paths:
+            try:
+                body = open(path, 'rb').read()
+                headers['status'] = '200 OK'
+                headers['content-type'] = files[url]
+                start_response(headers['status'], list(headers.items()))
+                return [body]
+            except IOError:
+                continue
        else:
-            path = url
-        try:
-            body = open(path, 'rb').read()
-            headers['status'] = '200 OK'
-            headers['content-type'] = files[url]
-            start_response(headers['status'], list(headers.items()))
-            return [body]
-        except IOError:
+            # the for loop did not return, so here we are, i.e. no file found
            headers['status'] = '404 Not found'
            start_response(headers['status'], list(headers.items()))
            return ['Error %s' % headers['status']]
-    # actual morss use
+    else:
        return app(environ, start_response)
 def cgi_get(environ, start_response):
    url, options = cgi_parse_environ(environ)
    # get page
    req = crawler.adv_get(url=url, timeout=TIMEOUT)
    if req['contenttype'] in ['text/html', 'application/xhtml+xml', 'application/xml']:
        if options.get == 'page':
            html = readabilite.parse(req['data'], encoding=req['encoding'])
            html.make_links_absolute(req['url'])
            kill_tags = ['script', 'iframe', 'noscript']
            for tag in kill_tags:
                for elem in html.xpath('//'+tag):
                    elem.getparent().remove(elem)
            output = lxml.etree.tostring(html.getroottree(), encoding='utf-8')
        elif options.get == 'article':
            output = readabilite.get_article(req['data'], url=req['url'], encoding_in=req['encoding'], encoding_out='utf-8', debug=options.debug)
        else:
            raise MorssException('no :get option passed')
    else:
        output = req['data']
    # return html page
    headers = {'status': '200 OK', 'content-type': 'text/html; charset=utf-8'}
    start_response(headers['status'], list(headers.items()))
    return [output]
 dispatch_table = {
    'get': cgi_get,
    }
@middleware
 def cgi_dispatcher(environ, start_response, app):
    url, options = cgi_parse_environ(environ)
    for key in dispatch_table.keys():
        if key in options:
            return dispatch_table[key](environ, start_response)
    return app(environ, start_response)
@middleware
 def cgi_error_handler(environ, start_response, app):
    try:
-        return [cgi_app(environ, start_response) or '(empty)']
+        return app(environ, start_response)
    except (KeyboardInterrupt, SystemExit):
        raise
    except Exception as e:
-        headers = {'status': '500 Oops', 'content-type': 'text/plain'}
+        headers = {'status': '500 Oops', 'content-type': 'text/html'}
        start_response(headers['status'], list(headers.items()), sys.exc_info())
-        log('ERROR <%s>: %s' % (url, e.message), force=True)
+        log('ERROR: %s' % repr(e), force=True)
-        return ['An error happened:\n%s' % e.message]
+        return [cgitb.html(sys.exc_info())]
@middleware
 def cgi_encode(environ, start_response, app):
    out = app(environ, start_response)
    return [x if isinstance(x, bytes) else str(x).encode('utf-8') for x in out]
 cgi_standalone_app = cgi_encode(cgi_error_handler(cgi_dispatcher(cgi_file_handler(cgi_app))))
 def cli_app():
@@ -608,12 +673,12 @@ def cli_app():
    crawler.default_cache = crawler.SQLiteCache(os.path.expanduser('~/.cache/morss-cache.db'))
-    rss = FeedFetch(url, options)
+    url, rss = FeedFetch(url, options)
    rss = FeedGather(rss, url, options)
-    out = FeedFormat(rss, options)
+    out = FeedFormat(rss, options, 'unicode')
    if not options.silent:
-        print(out.decode('utf-8', 'replace') if isinstance(out, bytes) else out)
+        print(out)
    log('done')
@@ -622,6 +687,7 @@ def isInt(string):
    try:
        int(string)
        return True
    except ValueError:
        return False
@@ -629,31 +695,46 @@ def isInt(string):
 def main():
    if 'REQUEST_URI' in os.environ:
        # mod_cgi
-        wsgiref.handlers.CGIHandler().run(cgi_wrapper)
-    elif len(sys.argv) <= 1 or isInt(sys.argv[1]) or '--root' in sys.argv[1:]:
+        app = cgi_app
+        app = cgi_dispatcher(app)
+        app = cgi_error_handler(app)
+        app = cgi_encode(app)
+        wsgiref.handlers.CGIHandler().run(app)
+    elif len(sys.argv) <= 1 or isInt(sys.argv[1]):
        # start internal (basic) http server
        if len(sys.argv) > 1 and isInt(sys.argv[1]):
            argPort = int(sys.argv[1])
            if argPort > 0:
                port = argPort
            else:
                raise MorssException('Port must be positive integer')
        else:
            port = PORT
-        print('Serving http://localhost:%s/'%port)
-        httpd = wsgiref.simple_server.make_server('', port, cgi_wrapper)
+        app = cgi_app
+        app = cgi_file_handler(app)
+        app = cgi_dispatcher(app)
+        app = cgi_error_handler(app)
+        app = cgi_encode(app)
+        print('Serving http://localhost:%s/' % port)
+        httpd = wsgiref.simple_server.make_server('', port, app)
        httpd.serve_forever()
    else:
        # as a CLI app
        try:
            cli_app()
        except (KeyboardInterrupt, SystemExit):
            raise
        except Exception as e:
            print('ERROR: %s' % repr(e))
--- a/morss/readabilite.py
+++ b/morss/readabilite.py
@@ -1,13 +1,17 @@
 import lxml.etree
 import lxml.html
 from bs4 import BeautifulSoup
 import re
 def parse(data, encoding=None):
    if encoding:
-        parser = lxml.html.HTMLParser(remove_blank_text=True, remove_comments=True, encoding=encoding)
+        data = BeautifulSoup(data, 'lxml', from_encoding=encoding).prettify('utf-8')
    else:
-        parser = lxml.html.HTMLParser(remove_blank_text=True, remove_comments=True)
+        data = BeautifulSoup(data, 'lxml').prettify('utf-8')
    parser = lxml.html.HTMLParser(remove_blank_text=True, remove_comments=True, encoding='utf-8')
    return lxml.html.fromstring(data, parser=parser)
@@ -43,6 +47,12 @@ def count_content(node):
    return count_words(node.text_content()) + len(node.findall('.//img'))
 def percentile(N, P):
    # https://stackoverflow.com/a/7464107
    n = max(int(round(P * len(N) + 0.5)), 2)
    return N[n-2]
 class_bad = ['comment', 'community', 'extra', 'foot',
    'sponsor', 'pagination', 'pager', 'tweet', 'twitter', 'com-', 'masthead',
    'media', 'meta', 'related', 'shopping', 'tags', 'tool', 'author', 'about',
@@ -60,9 +70,10 @@ class_good = ['and', 'article', 'body', 'column', 'main',
 regex_good = re.compile('|'.join(class_good), re.I)
-tags_junk = ['script', 'head', 'iframe', 'object', 'noscript',
+tags_dangerous = ['script', 'head', 'iframe', 'object', 'style', 'link', 'meta']
-    'param', 'embed', 'layer', 'applet', 'style', 'form', 'input', 'textarea',
+
-    'button', 'footer']
+tags_junk = tags_dangerous + ['noscript', 'param', 'embed', 'layer', 'applet',
    'form', 'input', 'textarea', 'button', 'footer']
 tags_bad = tags_junk + ['a', 'aside']
@@ -93,10 +104,21 @@ def score_node(node):
    class_id = node.get('class', '') + node.get('id', '')
    if (isinstance(node, lxml.html.HtmlComment)
-            or node.tag in tags_bad
-            or regex_bad.search(class_id)):
+            or isinstance(node, lxml.html.HtmlProcessingInstruction)):
        return 0
    if node.tag in tags_dangerous:
        return 0
    if node.tag in tags_junk:
        score += -1 # actually -2, as tags_junk is included in tags_bad
    if node.tag in tags_bad:
        score += -1
    if regex_bad.search(class_id):
        score += -1
    if node.tag in tags_good:
        score += 4
@@ -114,33 +136,42 @@ def score_node(node):
    return score
-def score_all(node, grades=None):
+def score_all(node):
    " Fairly dumb loop to score all worthwhile nodes. Tries to be fast "
-    if grades is None:
-        grades = {}
    for child in node:
        score = score_node(child)
-        child.attrib['seen'] = 'yes, ' + str(int(score))
+        child.attrib['morss_own_score'] = str(float(score))
-        if score > 0:
+        if score > 0 or len(list(child.iterancestors())) <= 2:
-            spread_score(child, score, grades)
+            spread_score(child, score)
-            score_all(child, grades)
+            score_all(child)
-    return grades
-def spread_score(node, score, grades):
+def set_score(node, value):
+    node.attrib['morss_score'] = str(float(value))
+def get_score(node):
+    return float(node.attrib.get('morss_score', 0))
+def incr_score(node, delta):
+    set_score(node, get_score(node) + delta)
+def get_all_scores(node):
+    return {x:get_score(x) for x in list(node.iter()) if get_score(x) != 0}
+def spread_score(node, score):
    " Spread the node's score to its parents, on a linear way "
    delta = score / 2
    for ancestor in [node,] + list(node.iterancestors()):
        if score >= 1 or ancestor is node:
-            try:
-                grades[ancestor] += score
-            except KeyError:
-                grades[ancestor] = score
+            incr_score(ancestor, score)
            score -= delta
@@ -148,26 +179,29 @@ def spread_score(node, score, grades):
            break
-def write_score_all(root, grades):
-    " Useful for debugging "
-    for node in root.iter():
-        node.attrib['score'] = str(int(grades.get(node, 0)))
-def clean_root(root):
+def clean_root(root, keep_threshold=None):
    for node in list(root):
-        clean_root(node)
-        clean_node(node)
+        # bottom-up approach, i.e. starting with children before cleaning current node
+        clean_root(node, keep_threshold)
+        clean_node(node, keep_threshold)
-def clean_node(node):
+def clean_node(node, keep_threshold=None):
    parent = node.getparent()
    if parent is None:
        # this is <html/> (or a removed element waiting for GC)
        return
    # remove dangerous tags, no matter what
    if node.tag in tags_dangerous:
        parent.remove(node)
        return
    if keep_threshold is not None and get_score(node) >= keep_threshold:
        # high score, so keep
        return
    gdparent = parent.getparent()
    # remove shitty tags
@@ -266,41 +300,56 @@ def lowest_common_ancestor(nodeA, nodeB, max_depth=None):
    return nodeA # should always find one tho, at least <html/>, but needed for max_depth
-def rank_nodes(grades):
+def rank_grades(grades):
    # largest score to smallest
    return sorted(grades.items(), key=lambda x: x[1], reverse=True)
-def get_best_node(grades):
+def get_best_node(ranked_grades):
    " To pick the best (raw) node. Another function will clean it "
-    if len(grades) == 1:
+    if len(ranked_grades) == 1:
-        return grades[0]
+        return ranked_grades[0][0]
-    top = rank_nodes(grades)
+    lowest = lowest_common_ancestor(ranked_grades[0][0], ranked_grades[1][0], 3)
-    lowest = lowest_common_ancestor(top[0][0], top[1][0], 3)
    return lowest
-def get_article(data, url=None, encoding=None):
+def get_article(data, url=None, encoding_in=None, encoding_out='unicode', debug=False, threshold=5):
    " Input a raw html string, returns a raw html string of the article "
-    html = parse(data, encoding)
+    html = parse(data, encoding_in)
-    scores = score_all(html)
+    score_all(html)
+    scores = rank_grades(get_all_scores(html))
-    if not len(scores):
+    if not len(scores) or scores[0][1] < threshold:
        return None
    best = get_best_node(scores)
+    if not debug:
+        keep_threshold = percentile([x[1] for x in scores], 0.1)
+        clean_root(best, keep_threshold)
    wc = count_words(best.text_content())
    wca = count_words(' '.join([x.text_content() for x in best.findall('.//a')]))
-    if wc - wca < 50 or float(wca) / wc > 0.3:
+    if not debug and (wc - wca < 50 or float(wca) / wc > 0.3):
        return None
    if url:
        best.make_links_absolute(url)
-    clean_root(best)
+    return lxml.etree.tostring(best if not debug else html, pretty_print=True, encoding=encoding_out)
-    return lxml.etree.tostring(best, pretty_print=True)
+
 if __name__ == '__main__':
    import sys
    from . import crawler
    req = crawler.adv_get(sys.argv[1] if len(sys.argv) > 1 else 'https://morss.it')
    article = get_article(req['data'], url=req['url'], encoding_in=req['encoding'], encoding_out='unicode')
    if not sys.flags.interactive:
        print(article)
--- a/morss/reader.html.template
+++ b/morss/reader.html.template
@@ -1,210 +0,0 @@
@require(feed)
 <!DOCTYPE html>
 <html>
 	<head>
 		<title>@feed.title &#8211; via morss</title>
 		<meta charset="UTF-8" />
 		<meta name="description" content="@feed.desc (via morss)" />
 		<meta name="viewport" content="width=device-width; initial-scale=1.0; maximum-scale=1.0;" />
 		<style type="text/css">
 			/* columns - from https://thisisdallas.github.io/Simple-Grid/simpleGrid.css */
 			* {
 				box-sizing: border-box;
 			}
 			#content {
 				width: 100%;
 				max-width: 1140px;
 				min-width: 755px;
 				margin: 0 auto;
 				overflow: hidden;
 				padding-top: 20px;
 				padding-left: 20px; /* grid-space to left */
 				padding-right: 0px; /* grid-space to right: (grid-space-left - column-space) e.g. 20px-20px=0 */
 			}
 			.item {
 				width: 33.33%;
 				float: left;
 				padding-right: 20px; /* column-space */
 			}
 			@@media handheld, only screen and (max-width: 767px) { /* @@ to escape from the template engine */
 				#content {
 					width: 100%;
 					min-width: 0;
 					margin-left: 0px;
 					margin-right: 0px;
 					padding-left: 20px; /* grid-space to left */
 					padding-right: 10px; /* grid-space to right: (grid-space-left - column-space) e.g. 20px-10px=10px */
 				}
 				.item {
 					width: auto;
 					float: none;
 					margin-left: 0px;
 					margin-right: 0px;
 					margin-top: 10px;
 					margin-bottom: 10px;
 					padding-left: 0px;
 					padding-right: 10px; /* column-space */
 				}
 			}
 			/* design */
 			#header h1, #header h2, #header p {
 				font-family: sans;
 				text-align: center;
 				margin: 0;
 				padding: 0;
 			}
 			#header h1 {
 				font-size: 2.5em;
 				font-weight: bold;
 				padding: 1em 0 0.25em;
 			}
 			#header h2 {
 				font-size: 1em;
 				font-weight: normal;
 			}
 			#header p {
 				color: gray;
 				font-style: italic;
 				font-size: 0.75em;
 			}
 			#content {
 				text-align: justify;
 			}
 				.item .title {
 					font-weight: bold;
 					display: block;
 					text-align: center;
 				}
 				.item .link {
 					color: inherit;
 					text-decoration: none;
 				}
 				.item:not(.active) {
 					cursor: pointer;
 					height: 20em;
 					margin-bottom: 20px;
 					overflow: hidden;
 					text-overflow: ellpisps;
 					padding: 0.25em;
 					position: relative;
 				}
 					.item:not(.active) .title {
 						padding-bottom: 0.1em;
 						margin-bottom: 0.1em;
 						border-bottom: 1px solid silver;
 					}
 					.item:not(.active):before {
 						content: " ";
 						display: block;
 						width: 100%;
 						position: absolute;
 						top: 18.5em;
 						height: 1.5em;
 						background: linear-gradient(to bottom, rgba(255,255,255,0) 0%, rgba(255,255,255,1) 100%);
 					}
 					.item:not(.active) .article * {
 						max-width: 100%;
 						font-size: 1em !important;
 						font-weight: normal;
 						display: inline;
 						margin: 0;
 					}
 				.item.active {
 					background: white;
 					position: fixed;
 					overflow: auto;
 					top: 0;
 					left: 0;
 					height: 100%;
 					width: 100%;
 					z-index: 1;
 				}
 					body.noscroll {
 						overflow: hidden;
 					}
 					.item.active > * {
 						max-width: 700px;
 						margin: auto;
 					}
 					.item.active .title {
 						font-size: 2em;
 						padding: 0.5em 0;
 					}
 					.item.active .article object,
 					.item.active .article video,
 					.item.active .article audio {
 						display: none;
 					}
 					.item.active .article img {
 						max-height: 20em;
 						max-width: 100%;
 					}
 		</style>
 	</head>
 	<body>
 		<div id="header">
 			<h1>@feed.title</h1>
 			@if feed.desc:
 				<h2>@feed.desc</h2>
 			@end
 			<p>- via morss</p>
 		</div>
 		<div id="content">
 			@for item in feed.items:
 				<div class="item">
 					@if item.link:
 						<a class="title link" href="@item.link" target="_blank">@item.title</a>
 					@else:
 						<span class="title">@item.title</span>
 					@end
 					<div class="article">
 						@if item.content:
 							@item.content
 						@else:
 							@item.desc
 						@end
 					</div>
 				</div>
 			@end
 		</div>
 	<script>
 		var items = document.getElementsByClassName('item')
 		for (var i in items)
 			items[i].onclick = function()
 			{
 				this.classList.toggle('active')
 				document.body.classList.toggle('noscroll')
 			}
 	</script>
 	</body>
 </html>
--- a/requirements.txt
+++ b/requirements.txt
@@ -1,4 +0,0 @@
-lxml
-python-dateutil <= 1.5
-chardet
-pymysql
--- a/setup.py
+++ b/setup.py
@@ -1,14 +1,24 @@
-from setuptools import setup, find_packages
+from setuptools import setup
 from glob import glob
 package_name = 'morss'
 setup(
-    name=package_name,
+    name = package_name,
-    description='Get full-text RSS feeds',
+    description = 'Get full-text RSS feeds',
-    author='pictuga, Samuel Marks',
+    author = 'pictuga, Samuel Marks',
-    author_email='contact at pictuga dot com',
+    author_email = 'contact at pictuga dot com',
-    url='http://morss.it/',
+    url = 'http://morss.it/',
-    license='AGPL v3',
+    download_url = 'https://git.pictuga.com/pictuga/morss',
-    package_dir={package_name: package_name},
+    license = 'AGPL v3',
-    packages=find_packages(),
+    packages = [package_name],
-    package_data={package_name: ['feedify.ini', 'reader.html.template']},
+    install_requires = ['lxml', 'bs4', 'python-dateutil', 'chardet', 'pymysql'],
-    test_suite=package_name + '.tests')
+    package_data = {package_name: ['feedify.ini']},
    data_files = [
        ('share/' + package_name, ['README.md', 'LICENSE']),
        ('share/' + package_name + '/www', glob('www/*.*')),
        ('share/' + package_name + '/www/cgi', [])
    ],
    entry_points = {
        'console_scripts': [package_name + '=' + package_name + ':main']
    })
--- a/www/index.html
+++ b/www/index.html
@@ -35,8 +35,8 @@
 			<input type="text" id="url" name="url" placeholder="Feed url (http://example.com/feed.xml)" />
 		</form>
-		<code>Copyright: pictuga 2013-2014<br/>
+		<code>Copyright: pictuga 2013-2020<br/>
-		Source code: https://github.com/pictuga/morss</code>
+		Source code: https://git.pictuga.com/pictuga/morss</code>
 		<script>
 			form = document.forms[0]
--- a/www/sheet.xsl
+++ b/www/sheet.xsl
@@ -1,5 +1,12 @@
 <?xml version="1.0" encoding="utf-8"?>
-<xsl:stylesheet version="1.1" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
+<xsl:stylesheet version="1.1"
 	xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
 	xmlns:atom="http://www.w3.org/2005/Atom"
 	xmlns:atom03="http://purl.org/atom/ns#"
 	xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
 	xmlns:content="http://purl.org/rss/1.0/modules/content/"
 	xmlns:rssfake="http://purl.org/rss/1.0/"
 	>
 	<xsl:output method="html"/>
@@ -8,115 +15,262 @@
 		<head>
 			<title>RSS feed by morss</title>
 			<meta name="viewport" content="width=device-width; initial-scale=1.0; maximum-scale=1.0;" />
 			<meta name="robots" content="noindex" />
 			<style type="text/css">
 				body {
 					overflow-wrap: anywhere;
 					word-wrap: anywhere;
 					font-family: sans;
 				}
-				#url {
+				input, select {
-					background-color: rgba(255, 165, 0, 0.25);
+					font-family: inherit;
-					padding: 1% 5%;
+					font-size: inherit;
-					display: inline-block;
+					text-align: inherit;
 				}
 				header {
 					text-align: center;
 					border-bottom: 1px solid silver;
 				}
 				.input-combo {
 					display: flex;
 					flex-flow: row;
 					align-items: stretch;
 					width: 800px;
 					max-width: 100%;
-				}
+					margin: auto;
-				body > ul {
+					border: 1px solid grey;
 					padding: .5em .5em;
 					background-color: #FFFAF4;
 				}
 				.input-combo * {
 					display: inline-block;
 					line-height: 2em;
 					border: 0;
 					background: transparent;
 				}
 				.input-combo > :not(.button) {
 					max-width: 100%;
 					flex-grow: 1;
 					flex-shrink: 0;
 					white-space: nowrap;
 					text-overflow: ellipsis;
 					overflow: hidden;
 				}
 				.input-combo .button {
 					flex-grow: 0;
 					flex-shrink: 1;
 					cursor: pointer;
 					min-width: 2em;
 					text-align: center;
 					border-left: 1px solid silver;
 					color: #06f;
 				}
 				[onclick_title] {
 					cursor: pointer;
 					position: relative;
 				}
 				[onclick_title]::before {
 					opacity: 0;
 					content: attr(onclick_title);
 					font-weight: normal;
 					position: absolute;
 					left: -300%;
 					z-index: 1;
 					background: grey;
 					color: white;
 					border-radius: 0.5em;
 					padding: 0 1em;
 				}
 				[onclick_title]:not(:active)::before {
 					transition: opacity 1s ease-in-out;
 				}
 				[onclick_title]:active::before {
 					opacity: 1;
 				}
 				header > form {
 					text-align: center;
 					margin: 1%;
 				}
 				header a {
 					text-decoration: inherit;
 					color: #FF7B0A;
 					font-weight: bold;
 				}
 				.item {
 					background-color: #FFFAF4;
 					border: 1px solid silver;
 					margin: 1%;
 					max-width: 100%;
 				}
 				.item > * {
 					padding: 1%;
 				}
 				.item > :not(:last-child) {
 					border-bottom: 1px solid silver;
 				}
 				.item > a {
 					display: block;
 					font-weight: bold;
 					font-size: 1.5em;
 				}
 				.desc, .content {
 					overflow: hidden;
 				}
 				.desc *, .content * {
 					max-width: 100%;
 				}
 				ul {
 					list-style-type: none;
 				}
 				.tag {
 					color: darkred;
 				}
 				.attr {
 					color: darksalmon;
 				}
 				.value {
 					color: darkblue;
 				}
 				.comment {
 					color: lightgrey;
 				}
 				pre {
 					margin: 0;
 					max-width: 100%;
 					white-space: normal;
 				}
 			</style>
 		</head>
 		<body>
-			<h1>RSS feed by morss</h1>
+			<header>
 				<h1>RSS feed by morss</h1>
-			<p>Your RSS feed is <strong style="color: green">ready</strong>. You
+				<p>Your RSS feed is <strong style="color: green">ready</strong>. You
-			can enter the following url in your newsreader:</p>
+				can enter the following url in your newsreader:</p>
-			<div id="url"></div>
+				<div class="input-combo">
 					<input id="url" readonly="readonly"/>
 					<span class="button" onclick="copy_link()" title="Copy" onclick_title="Copied">
 						<svg width="16px" height="16px" viewBox="0 0 16 16" fill="currentColor" xmlns="http://www.w3.org/2000/svg">
 							<path fill-rule="evenodd" d="M4 1.5H3a2 2 0 00-2 2V14a2 2 0 002 2h10a2 2 0 002-2V3.5a2 2 0 00-2-2h-1v1h1a1 1 0 011 1V14a1 1 0 01-1 1H3a1 1 0 01-1-1V3.5a1 1 0 011-1h1v-1z" clip-rule="evenodd"/>
 							<path fill-rule="evenodd" d="M9.5 1h-3a.5.5 0 00-.5.5v1a.5.5 0 00.5.5h3a.5.5 0 00.5-.5v-1a.5.5 0 00-.5-.5zm-3-1A1.5 1.5 0 005 1.5v1A1.5 1.5 0 006.5 4h3A1.5 1.5 0 0011 2.5v-1A1.5 1.5 0 009.5 0h-3z" clip-rule="evenodd"/>
 						</svg>
 					</span>
 				</div>
-			<ul>
+				<form onchange="open_feed()">
-				<xsl:apply-templates/>
+					More options: Output the 
-			</ul>
+					<select>
 						<option value="">full-text</option>
 						<option value=":proxy">original</option>
 						<option value=":clip">original + full-text</option>
 					</select>
 					feed as 
 					<select>
 						<option value="">RSS</option>
 						<option value=":json:cors">JSON</option>
 						<option value=":html">HTML</option>
 						<option value=":csv">CSV</option>
 					</select>
 					and 
 					<select>
 						<option value="">keep</option>
 						<option value=":nolink:noref">remove</option>
 					</select>
 					links
 					<input type="hidden" value="" name="extra_options"/>
 				</form>
 				<p>Click <a href="/">here</a> to go back to morss</p>
 			</header>
 			<div id="header">
 				<h1>
 					<xsl:value-of select="rdf:RDF/rssfake:channel/rssfake:title|rss/channel/title|atom:feed/atom:title|atom03:feed/atom03:title"/>
 				</h1>
 				<p>
 					<xsl:value-of select="rdf:RDF/rssfake:channel/rssfake:description|rss/channel/description|atom:feed/atom:subtitle|atom03:feed/atom03:subtitle"/>
 				</p>
 			</div>
 			<div id="content">
 				<xsl:for-each select="rdf:RDF/rssfake:channel/rssfake:item|rss/channel/item|atom:feed/atom:entry|atom03:feed/atom03:entry">
 					<div class="item">
 						<a href="/" target="_blank"><xsl:attribute name="href"><xsl:value-of select="rssfake:link|link|atom:link/@href|atom03:link/@href"/></xsl:attribute>
 								<xsl:value-of select="rssfake:title|title|atom:title|atom03:title"/>
 						</a>
 						<div class="desc">
 							<xsl:copy-of select="rssfake:description|description|atom:summary|atom03:summary"/>
 						</div>
 						<div class="content">
 							<xsl:copy-of select="content:encoded|atom:content|atom03:content"/>
 						</div>
 					</div>
 				</xsl:for-each>
 			</div>
 			<script>
-				document.getElementById("url").innerHTML = window.location.href;
+				document.getElementById("url").value = window.location.href
 				if (!/:html/.test(window.location.href))
 					for (var content of document.querySelectorAll(".desc,.content"))
 						content.innerHTML = (content.innerText.match(/>/g) || []).length > 10 ? content.innerText : content.innerHTML
 				var options = parse_location()[0]
 				if (options) {
 					for (var select of document.forms[0].elements)
 						if (select.tagName == 'SELECT')
 							for (var option of select)
 								if (option.value)
 									if (options.match(option.value)) {
 										select.value = option.value
 										options = options.replace(option.value, '')
 										break
 									}
 					document.forms[0]['extra_options'].value = options
 				}
 				function copy_content(input) {
 					input.focus()
 					input.select()
 					document.execCommand('copy')
 				}
 				function copy_link() {
 					copy_content(document.getElementById("url"))
 				}
 				function parse_location() {
 					return (window.location.pathname + window.location.search).match(/^\/(?:(:[^\/]+)\/)?(.*$)$/).slice(1)
 				}
 				function open_feed() {
 					var url = parse_location()[1]
 					var options = Array.from(document.forms[0].elements).map(x=>x.value).join('')
 					var target = '/' + (options ? options + '/' : '') + url
 					if (target != window.location.pathname)
 						window.location.href = target
 				}
 			</script>
 		</body>
 		</html>
 	</xsl:template>
 	<xsl:template match="*">
 		<li>
 			<span class="element">
 				&lt;
 					<span class="tag"><xsl:value-of select="name()"/></span>
 					<xsl:for-each select="@*">
 						<span class="attr"> <xsl:value-of select="name()"/></span>
 						=
 						"<span class="value"><xsl:value-of select="."/></span>"
 					</xsl:for-each>
 				&gt;
 			</span>
 			<xsl:if test="node()">
 				<ul>
 					<xsl:apply-templates/>
 				</ul>
 			</xsl:if>
 			<span class="element">
 				&lt;/
 					<span class="tag"><xsl:value-of select="name()"/></span>
 				&gt;
 			</span>
 		</li>
 	</xsl:template>
 	<xsl:template match="comment()">
 		<li>
 			<pre class="comment"><![CDATA[<!--]]><xsl:value-of select="."/><![CDATA[-->]]></pre>
 		</li>
 	</xsl:template>
 	<xsl:template match="text()">
 		<li>
 			<pre>
 				<xsl:value-of select="normalize-space(.)"/>
 			</pre>
 		</li>
 	</xsl:template>
 	<xsl:template match="text()[not(normalize-space())]"/>
 </xsl:stylesheet>
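The script in `www/sheet.xsl` above splits the page location into an optional `:option` prefix and the feed url (`parse_location()`), then rebuilds the path from the form values (`open_feed()`). As a rough illustration only, the same split can be sketched in Python. The names `LOCATION_RE` and `build_target` are hypothetical and not part of morss, and this is not how the server itself parses requests:

```python
import re

# Same pattern as parse_location() in www/sheet.xsl, without the redundant inner "$".
# Group 1: optional ":opt1:opt2" prefix; group 2: everything after it.
LOCATION_RE = re.compile(r'^/(?:(:[^/]+)/)?(.*)$')


def parse_location(path):
    """Split '/:opt1:opt2/feed-url' into (options, url)."""
    match = LOCATION_RE.match(path)
    if match is None:
        raise ValueError('path must start with "/": %r' % path)
    options, url = match.groups()
    # The JavaScript version yields undefined when there is no prefix;
    # an empty string is used here instead.
    return options or '', url


def build_target(options, url):
    """Rebuild the path the way open_feed() does (hypothetical helper name)."""
    return '/' + (options + '/' if options else '') + url


if __name__ == '__main__':
    print(parse_location('/:json:cors/http://example.com/feed.xml'))
    print(build_target(':html', 'example.com/feed.xml'))
```

Like the original script, the sketch treats the first path segment as options only when it starts with `:`; anything else is taken as part of the feed url.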
Author	SHA1	Message	Date
pictuga	27a42c47aa	morss: use final request url Code is not very elegant...	2020-04-28 22:30:21 +02:00
pictuga	c27c38f7c7	crawler: return dict instead of tuple	2020-04-28 22:29:07 +02:00
pictuga	a1dc96cb50	feeds: remove mimetype from function call as no longer used	2020-04-28 22:07:25 +02:00
pictuga	749acc87fc	Centralize url clean up in crawler.py	2020-04-28 22:03:49 +02:00
pictuga	c186188557	README: warning about lxml installation	2020-04-28 21:58:26 +02:00
pictuga	cb69e3167f	crawler: accept non-ascii urls Covering one more corner case!	2020-04-28 14:47:23 +02:00
pictuga	c3f06da947	morss: process(): specify encoding for clarity	2020-04-28 14:45:00 +02:00
pictuga	44a3e0edc4	readabilite: specify in- and out-going encoding	2020-04-28 14:44:35 +02:00
pictuga	4a9b505499	README: update python lib instructions	2020-04-27 18:12:14 +02:00
pictuga	818cdaaa9b	Make it possible to call sub-libs in non interactive mode Run `python -m morss.feeds http://lemonde.fr` and so on	2020-04-27 18:00:14 +02:00
pictuga	2806c64326	Make it possible to directly run sub-libs (feeds, crawler, readabilite) Run `python -im morss.feeds http://website.sample/rss.xml` and so on	2020-04-27 17:19:31 +02:00
pictuga	d39d7bb19d	sheet.xsl: limit overflow	2020-04-25 15:27:49 +02:00
pictuga	e5e3746fc6	sheet.xsl: show plain url	2020-04-25 15:27:13 +02:00
pictuga	960c9d10d6	sheet.xsl: customize output feed form	2020-04-25 15:26:47 +02:00
pictuga	0e7a5b9780	sheet.xsl: wrap header in <header>	2020-04-25 15:24:57 +02:00
pictuga	186bedcf62	sheet.xsl: smarter html reparser	2020-04-25 15:22:25 +02:00
pictuga	5847e18e42	sheet: improved feed address output (w/ c/c)	2020-04-25 15:21:47 +02:00
pictuga	f6bc23927f	readabilite: drop dangerous tags (script, style)	2020-04-25 12:25:02 +02:00
pictuga	c86572374e	readabilite: minimum score requirement	2020-04-25 12:24:36 +02:00
pictuga	59ef5af9e2	feeds: fix bug when deleting attr in html	2020-04-24 22:12:05 +02:00
pictuga	6a0531ca03	crawler: randomize user agent	2020-04-24 11:28:39 +02:00
pictuga	8187876a06	crawler: stop at first alternative link Should save a few ms and the first one is usually (?) the most relevant/generic	2020-04-23 11:23:45 +02:00
pictuga	325a373e3e	feeds: add SyntaxError catch	2020-04-20 16:15:15 +02:00
pictuga	2719bd6776	crawler: fix chinese encoding	2020-04-20 16:14:55 +02:00
pictuga	285e1e5f42	docker: pip install local	2020-04-19 13:25:53 +02:00
pictuga	41a63900c2	README: improve docker instructions	2020-04-19 13:01:08 +02:00
pictuga	ec8edb02f1	Various small bug fixes	2020-04-19 12:54:02 +02:00
pictuga	d01b943597	Remove leftover threading var	2020-04-19 12:51:11 +02:00
pictuga	b361aa2867	Add timeout to :get	2020-04-19 12:50:26 +02:00
pictuga	4ce3c7cb32	Small code clean ups	2020-04-19 12:50:05 +02:00
pictuga	7e45b2611d	Disable multi-threading Impact was mostly negative due to locks	2020-04-19 12:29:52 +02:00
pictuga	036e5190f1	crawler: remove unused code	2020-04-18 21:40:02 +02:00
pictuga	e99c5b3b71	morss: more sensible default MAX/LIM values	2020-04-18 17:21:45 +02:00
pictuga	4f44df8d63	Make all ports default to 8080	2020-04-18 17:15:59 +02:00
pictuga	497c14db81	Add dockerfile & how to in README	2020-04-18 17:04:44 +02:00
pictuga	a4e1dba8b7	sheet.xsl: improve url display	2020-04-16 10:33:36 +02:00
pictuga	7375adce33	sheet.xsl: fix & improve	2020-04-15 23:34:28 +02:00
pictuga	663212de0a	sheet.xsl: various cosmetic improvements	2020-04-15 23:22:45 +02:00
pictuga	4a2ea1bce9	README: add gunicorn instructions	2020-04-15 22:31:21 +02:00
pictuga	fe82b19c91	Merge .xsl & html template Turns out they somehow serve a similar purpose	2020-04-15 22:30:45 +02:00
pictuga	0b31e97492	morss: remove debug code in http file handler	2020-04-14 23:20:03 +02:00
pictuga	b0ad7c259d	Add README & LICENSE to data_files	2020-04-14 19:34:12 +02:00
pictuga	bffb23f884	README: how to use cli	2020-04-14 18:21:32 +02:00
pictuga	59139272fd	Auto-detect the location of www/ Either ../www or /usr/share/morss Adapted README accordingly	2020-04-14 18:07:19 +02:00
pictuga	39b0a1d7cc	setup.py: fix deps & files	2020-04-14 17:36:42 +02:00
pictuga	65803b328d	New git url and updated date in provided index.html	2020-04-13 15:30:32 +02:00
pictuga	e6b7c0eb33	Fix app definition for uwsgi	2020-04-13 15:30:09 +02:00
pictuga	67c096ad5b	feeds: add fake path to default html parser Without it, some websites were accidentally matching it (false positives)	2020-04-12 13:00:56 +02:00
pictuga	f018437544	crawler: make mysql backend thread safe	2020-04-12 12:53:05 +02:00
pictuga	8e5e8d24a4	Timezone fixes	2020-04-10 20:33:59 +02:00
pictuga	ee78a7875a	morss: focus on the most recent feed items	2020-04-10 16:08:13 +02:00
pictuga	9e7b9d95ee	feeds: properly use html template	2020-04-09 20:00:51 +02:00
pictuga	987a719c4e	feeds: try all parsers regardless of contenttype Turns out some websites send the wrong contenttype (json for html, html for xml, etc.)	2020-04-09 19:17:51 +02:00
pictuga	47b33f4baa	morss: specify server output encoding	2020-04-09 19:10:45 +02:00
pictuga	3c7f512583	feeds: handle several errors	2020-04-09 19:09:10 +02:00
pictuga	a32f5a8536	readabilite: add debug option (also used by :get)	2020-04-09 19:08:13 +02:00
pictuga	63a06524b7	morss: various encoding fixes	2020-04-09 19:06:51 +02:00
pictuga	b0f80c6d3c	morss: fix csv output encoding	2020-04-09 19:05:50 +02:00
pictuga	78cea10ead	morss: replace :getpage with :get Also provides readabilite debugging	2020-04-09 18:43:20 +02:00
pictuga	e5a82ff1f4	crawler: drop auto-referer Was solving some issues. But creating even more issues.	2020-04-07 10:39:21 +02:00
pictuga	f3d1f92b39	Detect encoding everytime	2020-04-07 10:38:36 +02:00
pictuga	7691df5257	Use wrapper for http calls	2020-04-07 10:30:17 +02:00
pictuga	0ae0dbc175	README: mention csv output	2020-04-07 09:24:32 +02:00
pictuga	f1d0431e68	morss: drop :html, replaced with :reader README updated accordingly	2020-04-07 09:23:29 +02:00
pictuga	a09831415f	feeds: fix bug when mimetype matches nothing	2020-04-06 18:53:07 +02:00
pictuga	bfad6b7a4a	readabilite: clean before counting To remove links which are not kept anyway	2020-04-06 16:55:39 +02:00
pictuga	6b8c3e51e7	readabilite: fix threshold feature Awkward typo...	2020-04-06 16:52:06 +02:00
pictuga	dc9e425247	readabilite: don't clean-out the top 10% nodes Loosen up the code once again to limit over-kill	2020-04-06 14:26:28 +02:00
pictuga	2f48e18bb1	readabilite: put scores directly in html node Probably slower but makes code somewhat cleaner...	2020-04-06 14:21:41 +02:00
pictuga	31cac921c7	README: remove ref to iTunes	2020-04-05 22:20:33 +02:00
pictuga	a82ec96eb7	Delete feedify.py leftover code iTunes integration untested, unreliable and not working...	2020-04-05 22:16:52 +02:00
pictuga	aad2398e69	feeds: turns out lxml.etree doesn't have drop_tag	2020-04-05 21:50:38 +02:00
pictuga	eeac630855	crawler: add more "realistic" headers	2020-04-05 21:11:57 +02:00
pictuga	e136b0feb2	readabilite: loosen the slayer Previous impl. lead to too many empty results	2020-04-05 20:47:30 +02:00
pictuga	6cf32af6c0	readabilite: also use BS	2020-04-05 20:46:42 +02:00
pictuga	568e7d7dd2	feeds: make BS's output bytes for lxml's sake	2020-04-05 20:46:04 +02:00
pictuga	3617f86e9d	morss: make cgi_encore more robust	2020-04-05 16:43:11 +02:00
pictuga	d90756b337	morss: drop 'keep' option Because the Firefox behaviour it is working around is no longer in use	2020-04-05 16:37:27 +02:00
pictuga	40c69f17d2	feeds: parse html with BS More robust & to make it consistent with :getpage	2020-04-05 16:12:41 +02:00
pictuga	99461ea185	crawler: fix var name issues (private_cache)	2020-04-05 16:11:36 +02:00
pictuga	bf86c1e962	crawler: make AutoUA match http(s) type	2020-04-05 16:07:51 +02:00
pictuga	d20f6237bd	crawler: replace ContentNegoHandler with AlternateHandler More basic. Sends the same headers no matter what. Make requests more "replicable". Also, drop "text/xml" from RSS contenttype, too broad, matches garbage	2020-04-05 16:05:59 +02:00
pictuga	8a4d68d72c	crawler: drop 'basic' toggle Can't even remember the use case	2020-04-05 16:03:06 +02:00
pictuga	e6811138fd	morss: use redirected url in :getpage Still have to find how to do the same thing with feeds...	2020-04-04 20:04:57 +02:00
pictuga	35b702fffd	morss: default values for feed creation	2020-04-04 19:39:32 +02:00
pictuga	4a88886767	morss: get_page to act as a basic proxy (for iframes)	2020-04-04 16:37:15 +02:00
pictuga	1653394cf7	morss: cgi_dispatcher to be able to create extra functions	2020-04-04 16:35:16 +02:00
pictuga	a8a90cf414	morss: move url/options parsing to own function For future re-use	2020-04-04 16:33:52 +02:00
pictuga	bdbaf0f8a7	morss/cgi: fix handling of special chars in url	2020-04-04 16:21:37 +02:00
pictuga	d0e447a2a6	ItemFix: clean up Pocket links	2020-04-04 16:20:39 +02:00
pictuga	e6817e01b4	sheet.xsl: set font to "sans" Browsers don't all have the same default font. Overriding for consistency	2020-04-03 17:47:19 +02:00
pictuga	7c3091d64c	morss: code spacing One of those commits that make me feel useful	2020-03-21 23:41:46 +01:00
pictuga	37b4e144a9	morss: small fixes Includes dropping off ftp support	2020-03-21 23:30:18 +01:00
pictuga	bd4b7b5bb2	morss: convert HTML feeds to XML ones for completeness	2020-03-21 23:27:42 +01:00
pictuga	68d920d4b5	morss: make FeedFormat more flexible with encoding	2020-03-21 23:26:35 +01:00
pictuga	758ff404a8	morss: fix cgi_app silent output Must return sth	2020-03-21 23:25:25 +01:00
pictuga	463530f02c	morss: middleware to enforce encoding bytes are always expected	2020-03-21 23:23:50 +01:00
pictuga	ec0a28a91d	morss: use middleware for wsgi apps	2020-03-21 23:23:21 +01:00
pictuga	421acb439d	morss: make errors more readable over http	2020-03-21 23:08:29 +01:00
pictuga	42c5d09ccb	morss: split "options" var into "raw_options" & "options" To make it clearer who-is-what	2020-03-21 23:07:07 +01:00
pictuga	056de12484	morss: add sheet.xsl to file handled by http server	2020-03-21 23:06:28 +01:00
pictuga	961a31141f	morss: fix url fixing	2020-03-21 17:28:00 +01:00
pictuga	a7b01ee85e	readabilite: further html processing instructions fix	2020-03-21 17:23:50 +01:00