Compare commits

63 commits: v1.1...0b31e97492
Commit SHA1s (author and date columns were not captured):

0b31e97492, b0ad7c259d, bffb23f884, 59139272fd, 39b0a1d7cc, 65803b328d,
e6b7c0eb33, 67c096ad5b, f018437544, 8e5e8d24a4, ee78a7875a, 9e7b9d95ee,
987a719c4e, 47b33f4baa, 3c7f512583, a32f5a8536, 63a06524b7, b0f80c6d3c,
78cea10ead, e5a82ff1f4, f3d1f92b39, 7691df5257, 0ae0dbc175, f1d0431e68,
a09831415f, bfad6b7a4a, 6b8c3e51e7, dc9e425247, 2f48e18bb1, 31cac921c7,
a82ec96eb7, aad2398e69, eeac630855, e136b0feb2, 6cf32af6c0, 568e7d7dd2,
3617f86e9d, d90756b337, 40c69f17d2, 99461ea185, bf86c1e962, d20f6237bd,
8a4d68d72c, e6811138fd, 35b702fffd, 4a88886767, 1653394cf7, a8a90cf414,
bdbaf0f8a7, d0e447a2a6, e6817e01b4, 7c3091d64c, 37b4e144a9, bd4b7b5bb2,
68d920d4b5, 758ff404a8, 463530f02c, ec0a28a91d, 421acb439d, 42c5d09ccb,
056de12484, 961a31141f, a7b01ee85e
README.md

````diff
@@ -24,15 +24,13 @@ hand-written rules (ie. there's no automatic detection of links to build feeds).
 Please mind that feeds based on html files may stop working unexpectedly, due to
 html structure changes on the target website.
 
-Additionally morss can grab the source xml feed of iTunes podcast, and detect
-rss feeds in html pages' `<meta>`.
+Additionally morss can detect rss feeds in html pages' `<meta>`.
 
 You can use this program online for free at **[morss.it](https://morss.it/)**.
 
 Some features of morss:
 - Read RSS/Atom feeds
 - Create RSS feeds from json/html pages
-- Convert iTunes podcast links into xml links
 - Export feeds as RSS/JSON/CSV/HTML
 - Fetch full-text content of feed items
 - Follow 301/meta redirects
@@ -48,6 +46,7 @@ You do need:
 
 - [python](http://www.python.org/) >= 2.6 (python 3 is supported)
 - [lxml](http://lxml.de/) for xml parsing
 - [bs4](https://pypi.org/project/bs4/) for badly-formatted html pages
 - [dateutil](http://labix.org/python-dateutil) to parse feed dates
+- [chardet](https://pypi.python.org/pypi/chardet)
 - [six](https://pypi.python.org/pypi/six), a dependency of chardet
@@ -56,7 +55,7 @@ You do need:
 Simplest way to get these:
 
 ```shell
-pip install -r requirements.txt
+pip install git+https://git.pictuga.com/pictuga/morss.git@master
 ```
 
 You may also need:
@@ -74,9 +73,10 @@ The arguments are:
 
 - Change what morss does
   - `json`: output as JSON
+  - `html`: output as HTML
+  - `csv`: output as CSV
   - `proxy`: doesn't fill the articles
   - `clip`: stick the full article content under the original feed content (useful for twitter)
   - `keep`: by default, morss does drop feed description whenever the full-content is found (so as not to mislead users who use Firefox, since the latter only shows the description in the feed preview, so they might believe morss doesn't work), but with this argument, the description is kept
   - `search=STRING`: does a basic case-sensitive search in the feed
 - Advanced
-  - `csv`: export to csv
@@ -88,11 +88,9 @@ The arguments are:
   - `mono`: disable multithreading while fetching, makes debugging easier
   - `theforce`: force download the rss feed and ignore cached http errors
   - `silent`: don't output the final RSS (useless on its own, but can be nice when debugging)
   - `encoding=ENCODING`: overrides the encoding auto-detection of the crawler. Some web developers did not quite understand the importance of setting charset/encoding tags correctly...
 - http server only
   - `callback=NAME`: for JSONP calls
   - `cors`: allow Cross-origin resource sharing (allows XHR calls from other servers)
-  - `html`: changes the http content-type to html, so that python cgi errors (written in html) are readable in a web browser
-  - `txt`: changes the http content-type to txt (for faster "`view-source:`")
 - Custom feeds: you can turn any HTML page into a RSS feed using morss, using xpath rules. The article content will be fetched as usual (with readabilite). Please note that you will have to **replace** any `/` in your rule with a `|` when using morss as a webserver
   - `items`: (**mandatory** to activate the custom feeds function) xpath rule to match all the RSS entries
@@ -146,17 +144,16 @@ Running this command should do:
 uwsgi --http :9090 --plugin python --wsgi-file main.py
 ```
 
 However, one problem might be how to serve the provided `index.html` file if it
 isn't in the same directory. Therefore you can add this at the end of the
 command to point to another directory `--pyargv '--root ../../www/'`.
 
 
 #### Using morss' internal HTTP server
 
 Morss can run its own HTTP server. The latter should start when you run morss
 without any argument, on port 8080.
 
-You can change the port and the location of the `www/` folder like this `python -m morss 9000 --root ../../www`.
+```shell
+morss
+```
+
+You can change the port like this `morss 9000`.
 
 #### Passing arguments
@@ -176,9 +173,9 @@ Works like a charm with [Tiny Tiny RSS](http://tt-rss.org/redmine/projects/tt-rs
 
 Run:
 ```
-python[2.7] -m morss [argwithoutvalue] [argwithvalue=value] [...] FEEDURL
+morss [argwithoutvalue] [argwithvalue=value] [...] FEEDURL
 ```
-For example: `python -m morss debug http://feeds.bbci.co.uk/news/rss.xml`
+For example: `morss debug http://feeds.bbci.co.uk/news/rss.xml`
 
 *(Brackets indicate optional text)*
@@ -191,9 +188,9 @@ scripts can be run on top of the RSS feed, using its
 
 To use this script, you have to enable "(Unix) command" in liferea feed settings, and use the command:
 ```
-[python[2.7]] PATH/TO/MORSS/main.py [argwithoutvalue] [argwithvalue=value] [...] FEEDURL
+morss [argwithoutvalue] [argwithvalue=value] [...] FEEDURL
 ```
-For example: `python2.7 PATH/TO/MORSS/main.py http://feeds.bbci.co.uk/news/rss.xml`
+For example: `morss http://feeds.bbci.co.uk/news/rss.xml`
 
 *(Brackets indicate optional text)*
````
main.py

```diff
@@ -1,6 +1,6 @@
 #!/usr/bin/env python
 
-from morss import main, cgi_wrapper as application
+from morss import main, cgi_standalone_app as application
 
 if __name__ == '__main__':
     main()
```
morss/crawler.py

```diff
@@ -27,13 +27,33 @@ except NameError:
 
 MIMETYPE = {
     'xml': ['text/xml', 'application/xml', 'application/rss+xml', 'application/rdf+xml', 'application/atom+xml', 'application/xhtml+xml'],
+    'rss': ['application/rss+xml', 'application/rdf+xml', 'application/atom+xml'],
     'html': ['text/html', 'application/xhtml+xml', 'application/xml']}
 
 
 DEFAULT_UA = 'Mozilla/5.0 (X11; Linux x86_64; rv:25.0) Gecko/20100101 Firefox/25.0'
 
 
-def custom_handler(accept=None, strict=False, delay=None, encoding=None, basic=False):
+def get(*args, **kwargs):
+    return adv_get(*args, **kwargs)[0]
+
+
+def adv_get(url, timeout=None, *args, **kwargs):
+    if timeout is None:
+        con = custom_handler(*args, **kwargs).open(url)
+
+    else:
+        con = custom_handler(*args, **kwargs).open(url, timeout=timeout)
+
+    data = con.read()
+
+    contenttype = con.info().get('Content-Type', '').split(';')[0]
+    encoding = detect_encoding(data, con)
+
+    return data, con, contenttype, encoding
+
+
+def custom_handler(follow=None, delay=None, encoding=None):
     handlers = []
 
     # as per urllib2 source code, these Handelers are added first
```
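The new `adv_get` derives the media type by splitting the `Content-Type` header at the first `;`, dropping parameters such as `charset`. A minimal standalone sketch of that step (the helper name and the extra `strip()`/`lower()` normalization are additions of this sketch, not part of the diff):

```python
def parse_contenttype(header_value):
    # Keep only the media type, dropping parameters such as charset
    return header_value.split(';')[0].strip().lower()

print(parse_contenttype('text/html; charset=UTF-8'))
```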
```diff
@@ -51,14 +71,12 @@ def custom_handler(accept=None, strict=False, delay=None, encoding=None, basic=F
     handlers.append(HTTPEquivHandler())
     handlers.append(HTTPRefreshHandler())
     handlers.append(UAHandler(DEFAULT_UA))
-
-    if not basic:
-        handlers.append(AutoRefererHandler())
+    handlers.append(BrowserlyHeaderHandler())
 
     handlers.append(EncodingFixHandler(encoding))
 
-    if accept:
-        handlers.append(ContentNegociationHandler(MIMETYPE[accept], strict))
+    if follow:
+        handlers.append(AlternateHandler(MIMETYPE[follow]))
 
     handlers.append(CacheHandler(force_min=delay))
@@ -196,45 +214,33 @@ class UAHandler(BaseHandler):
     https_request = http_request
 
 
-class AutoRefererHandler(BaseHandler):
-    def http_request(self, req):
-        req.add_unredirected_header('Referer', 'http://%s' % req.host)
-        return req
+class BrowserlyHeaderHandler(BaseHandler):
+    """ Add more headers to look less suspicious """
+
+    def http_request(self, req):
+        req.add_unredirected_header('Accept', 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8')
+        req.add_unredirected_header('Accept-Language', 'en-US,en;q=0.5')
+        return req
 
     https_request = http_request
 
 
-class ContentNegociationHandler(BaseHandler):
-    " Handler for content negociation. Also parses <link rel='alternate' type='application/rss+xml' href='...' /> "
+class AlternateHandler(BaseHandler):
+    " Follow <link rel='alternate' type='application/rss+xml' href='...' /> "
 
-    def __init__(self, accept=None, strict=False):
-        self.accept = accept
-        self.strict = strict
-
-    def http_request(self, req):
-        if self.accept is not None:
-            if isinstance(self.accept, basestring):
-                self.accept = (self.accept,)
-
-            string = ','.join(self.accept)
-
-            if self.strict:
-                string += ',*/*;q=0.9'
-
-            req.add_unredirected_header('Accept', string)
-
-        return req
+    def __init__(self, follow=None):
+        self.follow = follow or []
 
     def http_response(self, req, resp):
         contenttype = resp.info().get('Content-Type', '').split(';')[0]
-        if 200 <= resp.code < 300 and self.accept is not None and self.strict and contenttype in MIMETYPE['html'] and contenttype not in self.accept:
+        if 200 <= resp.code < 300 and len(self.follow) and contenttype in MIMETYPE['html'] and contenttype not in self.follow:
             # opps, not what we were looking for, let's see if the html page suggests an alternative page of the right types
 
             data = resp.read()
             links = lxml.html.fromstring(data[:10000]).findall('.//link[@rel="alternate"]')
 
             for link in links:
-                if link.get('type', '') in self.accept:
+                if link.get('type', '') in self.follow:
                     resp.code = 302
                     resp.msg = 'Moved Temporarily'
                     resp.headers['location'] = link.get('href')
@@ -246,7 +252,6 @@ class ContentNegociationHandler(BaseHandler):
 
         return resp
 
-    https_request = http_request
     https_response = http_response
```
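`AlternateHandler` scans the HTML response for `<link rel='alternate'>` tags of a wanted type using lxml. The same discovery can be sketched with only the standard library's `html.parser`; the class name and sample markup below are made up for illustration:

```python
from html.parser import HTMLParser

class AlternateLinkFinder(HTMLParser):
    """ Collect <link rel='alternate'> hrefs whose type is in the wanted list """
    def __init__(self, follow):
        HTMLParser.__init__(self)
        self.follow = follow
        self.found = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'link' and attrs.get('rel') == 'alternate' \
                and attrs.get('type', '') in self.follow:
            self.found.append(attrs.get('href'))

finder = AlternateLinkFinder(['application/rss+xml'])
finder.feed('<html><head>'
            '<link rel="alternate" type="application/rss+xml" href="/feed.xml">'
            '</head></html>')
print(finder.found)
```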
```diff
@@ -384,7 +389,7 @@ class CacheHandler(BaseHandler):
 
         elif self.force_min is None and ('no-cache' in cc_list
                                         or 'no-store' in cc_list
-                                        or ('private' in cc_list and not self.private)):
+                                        or ('private' in cc_list and not self.private_cache)):
             # kindly follow web servers indications, refresh
             return None
@@ -419,7 +424,7 @@ class CacheHandler(BaseHandler):
 
         cc_list = [x for x in cache_control if '=' not in x]
 
-        if 'no-cache' in cc_list or 'no-store' in cc_list or ('private' in cc_list and not self.private):
+        if 'no-cache' in cc_list or 'no-store' in cc_list or ('private' in cc_list and not self.private_cache):
            # kindly follow web servers indications
            return resp
```
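The `cc_list` used above holds the bare (valueless) directives of a `Cache-Control` header; directives like `max-age=600` are filtered out because they carry an `=`. A self-contained sketch of that filtering (the function name is an invention of this sketch):

```python
def cc_tokens(cache_control):
    # Split a Cache-Control header into its directives, keeping only the
    # bare ones (no key=value pairs such as max-age=600)
    directives = [x.strip().lower() for x in cache_control.split(',')]
    return [x for x in directives if '=' not in x]

print(cc_tokens('private, max-age=600, no-cache'))
```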
```diff
@@ -513,18 +518,20 @@ import pymysql.cursors
 
 
 class MySQLCacheHandler(BaseCache):
-    " NB. Requires mono-threading, as pymysql isn't thread-safe "
     def __init__(self, user, password, database, host='localhost'):
-        self.con = pymysql.connect(host=host, user=user, password=password, database=database, charset='utf8', autocommit=True)
+        self.user = user
+        self.password = password
+        self.database = database
+        self.host = host
 
-        with self.con.cursor() as cursor:
+        with self.cursor() as cursor:
             cursor.execute('CREATE TABLE IF NOT EXISTS data (url VARCHAR(255) NOT NULL PRIMARY KEY, code INT, msg TEXT, headers TEXT, data BLOB, timestamp INT)')
 
-    def __del__(self):
-        self.con.close()
+    def cursor(self):
+        return pymysql.connect(host=self.host, user=self.user, password=self.password, database=self.database, charset='utf8', autocommit=True).cursor()
 
     def __getitem__(self, url):
-        cursor = self.con.cursor()
+        cursor = self.cursor()
         cursor.execute('SELECT * FROM data WHERE url=%s', (url,))
         row = cursor.fetchone()
@@ -535,10 +542,10 @@ class MySQLCacheHandler(BaseCache):
 
     def __setitem__(self, url, value): # (code, msg, headers, data, timestamp)
         if url in self:
-            with self.con.cursor() as cursor:
+            with self.cursor() as cursor:
                 cursor.execute('UPDATE data SET code=%s, msg=%s, headers=%s, data=%s, timestamp=%s WHERE url=%s',
                     value + (url,))
 
        else:
-            with self.con.cursor() as cursor:
+            with self.cursor() as cursor:
                 cursor.execute('INSERT INTO data VALUES (%s,%s,%s,%s,%s,%s)', (url,) + value)
```
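The `MySQLCacheHandler` change replaces one long-lived shared connection with a fresh connection per operation, sidestepping pymysql's lack of thread safety. The same pattern can be demonstrated with the standard library's sqlite3 (used here only because it needs no server; class and table names are inventions of this sketch):

```python
import os
import sqlite3
import tempfile

class PerCallCache:
    """ Open a fresh connection per operation, so no handle is ever shared
    between threads (the same idea as MySQLCacheHandler.cursor()) """
    def __init__(self, path):
        self.path = path
        with self.connect() as con:
            con.execute('CREATE TABLE IF NOT EXISTS data (url TEXT PRIMARY KEY, payload TEXT)')

    def connect(self):
        # a brand-new connection each time; 'with' commits on success
        return sqlite3.connect(self.path)

    def __setitem__(self, url, payload):
        with self.connect() as con:
            con.execute('INSERT OR REPLACE INTO data VALUES (?, ?)', (url, payload))

    def __getitem__(self, url):
        with self.connect() as con:
            row = con.execute('SELECT payload FROM data WHERE url=?', (url,)).fetchone()
        if row is None:
            raise KeyError(url)
        return row[0]

path = os.path.join(tempfile.mkdtemp(), 'cache.db')
cache = PerCallCache(path)
cache['http://example.com/'] = 'hello'
print(cache['http://example.com/'])
```

Opening a connection per call trades a little latency for correctness under multithreading, which matches the multithreaded fetching in `FeedGather`.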
morss/feedify.ini

```diff
@@ -90,6 +90,9 @@ item_updated = updated
 [html]
 mode = html
 
+path =
+    http://localhost/
+
 title = //div[@id='header']/h1
 desc = //div[@id='header']/h2
 items = //div[@id='content']/div
@@ -99,7 +102,7 @@ item_link = ./a/@href
 item_desc = ./div[class=desc]
 item_content = ./div[class=content]
 
-base = <!DOCTYPE html> <html> <head> <title>Feed reader by morss</title> <meta name="viewport" content="width=device-width; initial-scale=1.0; maximum-scale=1.0;" /> </head> <body> <div id="header"> <h1>@feed.title</h1> <h2>@feed.desc</h2> <p>- via morss</p> </div> <div id="content"> <div class="item"> <a class="title link" href="@item.link" target="_blank">@item.title</a> <div class="desc">@item.desc</div> <div class="content">@item.content</div> </div> </div> <script> var items = document.getElementsByClassName('item') for (var i in items) items[i].onclick = function() { this.classList.toggle('active') document.body.classList.toggle('noscroll') } </script> </body> </html>
+base = file:reader.html.template
 
 [twitter]
 mode = html
```

morss/feedify.py (deleted)

```diff
@@ -1,28 +0,0 @@
-import re
-import json
-
-from . import crawler
-
-try:
-    basestring
-except NameError:
-    basestring = str
-
-
-def pre_worker(url):
-    if url.startswith('http://itunes.apple.com/') or url.startswith('https://itunes.apple.com/'):
-        match = re.search('/id([0-9]+)(\?.*)?$', url)
-        if match:
-            iid = match.groups()[0]
-            redirect = 'https://itunes.apple.com/lookup?id=%s' % iid
-
-            try:
-                con = crawler.custom_handler(basic=True).open(redirect, timeout=4)
-                data = con.read()
-
-            except (IOError, HTTPException):
-                raise
-
-            return json.loads(data.decode('utf-8', 'replace'))['results'][0]['feedUrl']
-
-    return None
```
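The deleted `pre_worker` resolved iTunes page links to their lookup endpoint by extracting the numeric id with a regex. The extraction step (no network involved) can be reproduced on its own; the helper name below is made up:

```python
import re

def itunes_lookup_url(url):
    # mirrors the regex from the removed pre_worker: grab the numeric id
    # from an itunes.apple.com page url and build the lookup endpoint
    match = re.search(r'/id([0-9]+)(\?.*)?$', url)
    if match:
        return 'https://itunes.apple.com/lookup?id=%s' % match.groups()[0]
    return None

print(itunes_lookup_url('https://itunes.apple.com/us/podcast/foo/id12345?mt=2'))
```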
morss/feeds.py

```diff
@@ -15,6 +15,7 @@ import dateutil.parser
 from copy import deepcopy
 
 import lxml.html
+from .readabilite import parse as html_parse
 
 json.encoder.c_make_encoder = None
@@ -46,13 +47,17 @@ def parse_rules(filename=None):
 
     for section in rules.keys():
         for arg in rules[section].keys():
-            if '\n' in rules[section][arg]:
+            if rules[section][arg].startswith('file:'):
+                import_file = os.path.join(os.path.dirname(__file__), rules[section][arg][5:])
+                rules[section][arg] = open(import_file).read()
+
+            elif '\n' in rules[section][arg]:
                 rules[section][arg] = rules[section][arg].split('\n')[1:]
 
     return rules
 
 
-def parse(data, url=None, mimetype=None):
+def parse(data, url=None, mimetype=None, encoding=None):
     " Determine which ruleset to use "
 
     rulesets = parse_rules()
```
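The new `file:` branch in `parse_rules` lets a rule value reference an external file (as `feedify.ini` now does with `base = file:reader.html.template`). A standalone sketch of that expansion, using a temporary directory instead of the package directory:

```python
import os
import tempfile

def expand_rule(value, base_dir):
    # 'file:NAME' means: replace the value with the content of NAME,
    # resolved relative to base_dir (morss uses the package directory)
    if value.startswith('file:'):
        with open(os.path.join(base_dir, value[5:])) as f:
            return f.read()
    return value

base_dir = tempfile.mkdtemp()
with open(os.path.join(base_dir, 'reader.html.template'), 'w') as f:
    f.write('<!DOCTYPE html>')

print(expand_rule('file:reader.html.template', base_dir))
print(expand_rule('//head/title', base_dir))
```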
```diff
@@ -66,26 +71,20 @@ def parse(data, url=None, mimetype=None):
         for path in ruleset['path']:
             if fnmatch(url, path):
                 parser = [x for x in parsers if x.mode == ruleset['mode']][0]
-                return parser(data, ruleset)
+                return parser(data, ruleset, encoding=encoding)
 
-    # 2) Look for a parser based on mimetype
-
-    if mimetype is not None:
-        parser_candidates = [x for x in parsers if mimetype in x.mimetype]
-
-    if mimetype is None or parser_candidates is None:
-        parser_candidates = parsers
+    # 2) Try each and every parser
 
     # 3) Look for working ruleset for given parser
     # 3a) See if parsing works
     # 3b) See if .items matches anything
 
-    for parser in parser_candidates:
+    for parser in parsers:
         ruleset_candidates = [x for x in rulesets.values() if x['mode'] == parser.mode and 'path' not in x]
         # 'path' as they should have been caught beforehands
 
         try:
-            feed = parser(data)
+            feed = parser(data, encoding=encoding)
 
         except (ValueError):
             # parsing did not work
```
```diff
@@ -112,7 +111,7 @@ def parse(data, url=None, mimetype=None):
 
 
 class ParserBase(object):
-    def __init__(self, data=None, rules=None, parent=None):
+    def __init__(self, data=None, rules=None, parent=None, encoding=None):
         if rules is None:
             rules = parse_rules()[self.default_ruleset]
@@ -121,9 +120,10 @@ class ParserBase(object):
 
         if data is None:
             data = rules['base']
 
-        self.root = self.parse(data)
         self.parent = parent
+        self.encoding = encoding
+
+        self.root = self.parse(data)
 
     def parse(self, raw):
         pass
@@ -148,15 +148,15 @@ class ParserBase(object):
 
         c = csv.writer(out, dialect=csv.excel)
 
         for item in self.items:
-            row = [getattr(item, x) for x in item.dic]
-
-            if encoding != 'unicode':
-                row = [x.encode(encoding) if isinstance(x, unicode) else x for x in row]
-
-            c.writerow(row)
+            c.writerow([getattr(item, x) for x in item.dic])
 
         out.seek(0)
-        return out.read()
+        out = out.read()
+
+        if encoding != 'unicode':
+            out = out.encode(encoding)
+
+        return out
 
     def tohtml(self, **k):
         return self.convert(FeedHTML).tostring(**k)
```
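The `tocsv` change moves encoding out of the per-row loop: rows are written as text and the whole output is encoded once at the end. A self-contained sketch of that order of operations (the function name and sample rows are inventions of this sketch):

```python
import csv
import io

def rows_to_csv(rows, encoding='unicode'):
    # write everything as text first, encode the complete output at the end
    out = io.StringIO()
    writer = csv.writer(out, dialect=csv.excel)

    for row in rows:
        writer.writerow(row)

    out.seek(0)
    text = out.read()
    return text if encoding == 'unicode' else text.encode(encoding)

data = rows_to_csv([['title', 'link'], ['héllo', 'http://example.com/']], 'utf-8')
print(type(data))  # bytes when an encoding is given
```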
```diff
@@ -267,8 +267,15 @@ class ParserBase(object):
 
         except AttributeError:
             # does not exist, have to create it
-            self.rule_create(self.rules[rule_name])
-            self.rule_set(self.rules[rule_name], value)
+            try:
+                self.rule_create(self.rules[rule_name])
+
+            except AttributeError:
+                # no way to create it, give up
+                pass
+
+            else:
+                self.rule_set(self.rules[rule_name], value)
 
     def rmv(self, rule_name):
         # easy deleter
```
```diff
@@ -401,13 +408,14 @@ class ParserXML(ParserBase):
 
         else:
             if html_rich:
                 # atom stuff
-                if 'atom' in rule:
-                    match.attrib['type'] = 'xhtml'
 
                 self._clean_node(match)
                 match.append(lxml.html.fragment_fromstring(value, create_parent='div'))
-                match.find('div').drop_tag()
+
+                if self.rules['mode'] == 'html':
+                    match.find('div').drop_tag() # not supported by lxml.etree
+
+                else: # i.e. if atom
+                    match.attrib['type'] = 'xhtml'
 
         else:
             if match is not None and len(match):
@@ -440,8 +448,7 @@ class ParserHTML(ParserXML):
     mimetype = ['text/html', 'application/xhtml+xml']
 
     def parse(self, raw):
-        parser = etree.HTMLParser(remove_blank_text=True) # remove_blank_text needed for pretty_print
-        return etree.fromstring(raw, parser)
+        return html_parse(raw, encoding=self.encoding)
 
     def tostring(self, encoding='unicode', **k):
         return lxml.html.tostring(self.root, encoding=encoding, **k)
@@ -467,6 +474,9 @@ class ParserHTML(ParserXML):
             element = deepcopy(match)
             match.getparent().append(element)
 
+        else:
+            raise AttributeError('no way to create item')
+
 
 def parse_time(value):
     if value is None or value == 0:
@@ -474,13 +484,13 @@ def parse_time(value):
 
     elif isinstance(value, basestring):
         if re.match(r'^[0-9]+$', value):
-            return datetime.fromtimestamp(int(value), tz.UTC)
+            return datetime.fromtimestamp(int(value), tz.tzutc())
 
         else:
-            return dateutil.parser.parse(value)
+            return dateutil.parser.parse(value).replace(tzinfo=tz.tzutc())
 
     elif isinstance(value, int):
-        return datetime.fromtimestamp(value, tz.UTC)
+        return datetime.fromtimestamp(value, tz.tzutc())
 
     elif isinstance(value, datetime):
         return value
```
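The `parse_time` change makes every returned datetime timezone-aware (UTC via dateutil's `tz.tzutc()`), so items can later be compared and sorted safely. A simplified sketch of the numeric branches using the standard library's `timezone.utc` in place of dateutil:

```python
from datetime import datetime, timezone

def parse_time(value):
    # simplified sketch: only the numeric branches are shown, and
    # timezone.utc stands in for dateutil's tz.tzutc()
    if isinstance(value, str) and value.isdigit():
        return datetime.fromtimestamp(int(value), timezone.utc)

    elif isinstance(value, int):
        return datetime.fromtimestamp(value, timezone.utc)

    elif isinstance(value, datetime):
        return value

    return None

print(parse_time('86400'))  # aware UTC datetime, one day past the epoch
```

Aware datetimes are essential here: comparing a naive and an aware datetime raises `TypeError`, which is exactly what sorting mixed feed dates would hit otherwise.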
morss/morss.py

```diff
@@ -1,7 +1,10 @@
 import sys
 import os
 import os.path
 
 import time
+
+from datetime import datetime
+from dateutil import tz
 
 import threading
@@ -12,25 +15,25 @@ import lxml.etree
 import lxml.html
 
 from . import feeds
-from . import feedify
 from . import crawler
 from . import readabilite
 
 import wsgiref.simple_server
 import wsgiref.handlers
 import cgitb
 
 
 try:
     # python 2
     from Queue import Queue
     from httplib import HTTPException
     from urllib import quote_plus
+    from urllib import unquote
     from urlparse import urlparse, urljoin, parse_qs
 except ImportError:
     # python 3
     from queue import Queue
     from http.client import HTTPException
     from urllib.parse import quote_plus
+    from urllib.parse import unquote
     from urllib.parse import urlparse, urljoin, parse_qs
 
 LIM_ITEM = 100 # deletes what's beyond
@@ -44,7 +47,7 @@ THREADS = 10 # number of threads (1 for single-threaded)
 DEBUG = False
 PORT = 8080
 
-PROTOCOL = ['http', 'https', 'ftp']
+PROTOCOL = ['http', 'https']
 
 
 def filterOptions(options):
@@ -52,7 +55,7 @@ def filterOptions(options):
 
     # example of filtering code below
 
-    #allowed = ['proxy', 'clip', 'keep', 'cache', 'force', 'silent', 'pro', 'debug']
+    #allowed = ['proxy', 'clip', 'cache', 'force', 'silent', 'pro', 'debug']
     #filtered = dict([(key,value) for (key,value) in options.items() if key in allowed])
 
     #return filtered
```
```diff
@@ -66,6 +69,7 @@ def log(txt, force=False):
     if DEBUG or force:
         if 'REQUEST_URI' in os.environ:
             open('morss.log', 'a').write("%s\n" % repr(txt))
+
         else:
             print(repr(txt))
@@ -73,6 +77,7 @@ def log(txt, force=False):
 def len_html(txt):
     if len(txt):
         return len(lxml.html.fromstring(txt).text_content())
+
     else:
         return 0
@@ -80,6 +85,7 @@ def len_html(txt):
 def count_words(txt):
     if len(txt):
         return len(lxml.html.fromstring(txt).text_content().split())
+
     return 0
 
 
@@ -88,12 +94,14 @@ class Options:
         if len(args):
             self.options = args
             self.options.update(options or {})
+
         else:
             self.options = options or {}
 
     def __getattr__(self, key):
         if key in self.options:
             return self.options[key]
+
         else:
             return False
@@ -107,17 +115,23 @@ class Options:
 
 def parseOptions(options):
     """ Turns ['md=True'] into {'md':True} """
     out = {}
+
     for option in options:
         split = option.split('=', 1)
+
         if len(split) > 1:
             if split[0].lower() == 'true':
                 out[split[0]] = True
+
             elif split[0].lower() == 'false':
                 out[split[0]] = False
+
             else:
                 out[split[0]] = split[1]
+
         else:
             out[split[0]] = True
+
     return out
```
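`parseOptions` turns the raw CLI/URL option strings into a dict, mapping bare flags to `True` and the strings `true`/`false` to real booleans. A simplified sketch of that parsing idea, checking the value string as the docstring intends (the function name is changed to mark it as a sketch):

```python
def parse_options(options):
    """ Turn ['json', 'items=//div', 'md=True'] into a dict """
    out = {}

    for option in options:
        split = option.split('=', 1)

        if len(split) > 1:
            # map the strings 'true'/'false' to real booleans, keep the rest
            if split[1].lower() == 'true':
                out[split[0]] = True

            elif split[1].lower() == 'false':
                out[split[0]] = False

            else:
                out[split[0]] = split[1]

        else:
            # a bare flag simply becomes True
            out[split[0]] = True

    return out

print(parse_options(['json', 'items=//div', 'md=True']))
```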
```diff
@@ -158,6 +172,11 @@ def ItemFix(item, feedurl='/'):
         item.link = parse_qs(urlparse(item.link).query)['url'][0]
         log(item.link)
 
+    # pocket
+    if fnmatch(item.link, 'https://getpocket.com/redirect?url=*'):
+        item.link = parse_qs(urlparse(item.link).query)['url'][0]
+        log(item.link)
+
     # facebook
     if fnmatch(item.link, 'https://www.facebook.com/l.php?u=*'):
         item.link = parse_qs(urlparse(item.link).query)['u'][0]
@@ -208,6 +227,7 @@ def ItemFill(item, options, feedurl='/', fast=False):
     if len(match):
         link = match[0]
         log(link)
+
     else:
         link = None
@@ -217,6 +237,7 @@ def ItemFill(item, options, feedurl='/', fast=False):
     if len(match) and urlparse(match[0]).netloc != 'www.facebook.com':
         link = match[0]
         log(link)
+
     else:
         link = None
```
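The new pocket rule in `ItemFix` unwraps tracking/redirect links by pulling the real destination out of the query string with `parse_qs`, the same way the existing google and facebook rules do. The core step in isolation (helper name and sample URL invented for this sketch):

```python
from urllib.parse import urlparse, parse_qs

def unwrap_redirect(link, param):
    # pull the real destination out of a redirect url's query string;
    # parse_qs also takes care of percent-decoding
    return parse_qs(urlparse(link).query)[param][0]

pocket = 'https://getpocket.com/redirect?url=http%3A%2F%2Fexample.com%2Fpost'
print(unwrap_redirect(pocket, 'url'))
```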
```diff
@@ -232,19 +253,17 @@ def ItemFill(item, options, feedurl='/', fast=False):
             delay = -2
 
     try:
-        con = crawler.custom_handler('html', False, delay, options.encoding).open(link, timeout=TIMEOUT)
-        data = con.read()
+        data, con, contenttype, encoding = crawler.adv_get(url=link, delay=delay, timeout=TIMEOUT)
 
     except (IOError, HTTPException) as e:
         log('http error')
         return False # let's just delete errors stuff when in cache mode
 
-    contenttype = con.info().get('Content-Type', '').split(';')[0]
     if contenttype not in crawler.MIMETYPE['html'] and contenttype != 'text/plain':
         log('non-text page')
         return True
 
-    out = readabilite.get_article(data, link, options.encoding or crawler.detect_encoding(data, con))
+    out = readabilite.get_article(data, url=con.geturl(), encoding=encoding)
 
     if out is not None:
         item.content = out
@@ -268,9 +287,6 @@ def ItemAfter(item, options):
         item.content = item.desc + "<br/><br/><center>* * *</center><br/><br/>" + item.content
         del item.desc
 
-    if not options.keep and not options.proxy:
-        del item.desc
-
     if options.nolink and item.content:
         content = lxml.html.fromstring(item.content)
         for link in content.xpath('//a'):
```
```diff
@@ -284,27 +300,23 @@ def ItemAfter(item, options):
     return item
 
 
-def FeedFetch(url, options):
-    # basic url clean-up
+def UrlFix(url):
     if url is None:
         raise MorssException('No url provided')
 
+    if isinstance(url, bytes):
+        url = url.decode()
+
     if urlparse(url).scheme not in PROTOCOL:
         url = 'http://' + url
         log(url)
 
     url = url.replace(' ', '%20')
 
-    if isinstance(url, bytes):
-        url = url.decode()
+    return url
 
-    # allow for code execution for feedify
-    pre = feedify.pre_worker(url)
-    if pre:
-        url = pre
-        log('url redirect')
-        log(url)
 
+def FeedFetch(url, options):
     # fetch feed
     delay = DELAY
```
```diff
@@ -312,35 +324,34 @@ def FeedFetch(url, options):
         delay = 0
 
     try:
-        con = crawler.custom_handler(accept='xml', strict=True, delay=delay,
-            encoding=options.encoding, basic=not options.items) \
-            .open(url, timeout=TIMEOUT * 2)
-        xml = con.read()
+        xml, con, contenttype, encoding = crawler.adv_get(url=url, follow='rss', delay=delay, timeout=TIMEOUT * 2)
 
     except (IOError, HTTPException):
         raise MorssException('Error downloading feed')
 
-    contenttype = con.info().get('Content-Type', '').split(';')[0]
-
     if options.items:
         # using custom rules
-        rss = feeds.FeedHTML(xml, url, contenttype)
-        feed.rule
+        rss = feeds.FeedHTML(xml, encoding=encoding)
 
         rss.rules['title'] = options.title if options.title else '//head/title'
         rss.rules['desc'] = options.desc if options.desc else '//head/meta[@name="description"]/@content'
 
         rss.rules['items'] = options.items
 
-        if options.item_title:
-            rss.rules['item_title'] = options.item_title
-        if options.item_link:
-            rss.rules['item_link'] = options.item_link
+        rss.rules['item_title'] = options.item_title if options.item_title else './/a|.'
+        rss.rules['item_link'] = options.item_link if options.item_link else './@href|.//a/@href'
 
         if options.item_content:
             rss.rules['item_content'] = options.item_content
 
         if options.item_time:
             rss.rules['item_time'] = options.item_time
 
         rss = rss.convert(feeds.FeedXML)
 
     else:
         try:
-            rss = feeds.parse(xml, url, contenttype)
+            rss = feeds.parse(xml, url, contenttype, encoding=encoding)
             rss = rss.convert(feeds.FeedXML)
             # contains all fields, otherwise much-needed data can be lost
```
```diff
@@ -375,6 +386,7 @@ def FeedGather(rss, url, options):
             value = queue.get()
             try:
                 worker(*value)
+
             except Exception as e:
                 log('Thread Error: %s' % e.message)
             queue.task_done()
@@ -411,9 +423,12 @@ def FeedGather(rss, url, options):
         t.daemon = True
         t.start()
 
-    for i, item in enumerate(list(rss.items)):
+    now = datetime.now(tz.tzutc())
+    sorted_items = sorted(rss.items, key=lambda x: x.updated or x.time or now, reverse=True)
+    for i, item in enumerate(sorted_items):
         if threads == 1:
             worker(*[i, item])
+
         else:
             queue.put([i, item])
```
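`FeedGather` now processes items newest-first, substituting "now" for missing dates so undated items sort to the top. The fallback-key trick in isolation (the tuples below are stand-ins for feed items):

```python
from datetime import datetime, timezone

# sort newest first, treating undated items as "now" so they stay on top
now = datetime.now(timezone.utc)
items = [
    ('undated', None),
    ('old', datetime(2019, 1, 1, tzinfo=timezone.utc)),
    ('new', datetime(2020, 1, 1, tzinfo=timezone.utc)),
]
sorted_items = sorted(items, key=lambda x: x[1] or now, reverse=True)
print([name for name, _ in sorted_items])
```

Note the `or` chain only works because all datetimes involved are timezone-aware; mixing naive and aware values would raise `TypeError` during comparison.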
@@ -433,37 +448,38 @@ def FeedGather(rss, url, options):
    return rss


def FeedFormat(rss, options):
def FeedFormat(rss, options, encoding='utf-8'):
    if options.callback:
        if re.match(r'^[a-zA-Z0-9\.]+$', options.callback) is not None:
            return '%s(%s)' % (options.callback, rss.tojson())
            out = '%s(%s)' % (options.callback, rss.tojson(encoding='unicode'))
            return out if encoding == 'unicode' else out.encode(encoding)

        else:
            raise MorssException('Invalid callback var name')

    elif options.json:
        if options.indent:
            return rss.tojson(encoding='UTF-8', indent=4)
            return rss.tojson(encoding=encoding, indent=4)

        else:
            return rss.tojson(encoding='UTF-8')
            return rss.tojson(encoding=encoding)

    elif options.csv:
        return rss.tocsv(encoding='UTF-8')
        return rss.tocsv(encoding=encoding)

    elif options.reader:
    elif options.html:
        if options.indent:
            return rss.tohtml(encoding='UTF-8', pretty_print=True)
            return rss.tohtml(encoding=encoding, pretty_print=True)

        else:
            return rss.tohtml(encoding='UTF-8')
            return rss.tohtml(encoding=encoding)

    else:
        if options.indent:
            return rss.torss(xml_declaration=True, encoding='UTF-8', pretty_print=True)
            return rss.torss(xml_declaration=(not encoding == 'unicode'), encoding=encoding, pretty_print=True)

        else:
            return rss.torss(xml_declaration=True, encoding='UTF-8')
            return rss.torss(xml_declaration=(not encoding == 'unicode'), encoding=encoding)

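`FeedFormat` now threads an `encoding` argument through to the serializers, following the lxml/ElementTree convention where `encoding='unicode'` yields a text string and a byte encoding yields bytes. A minimal stdlib sketch of that convention (not morss code):

```python
import xml.etree.ElementTree as ET

el = ET.Element('rss')

as_text = ET.tostring(el, encoding='unicode')   # returns str, no XML declaration
as_bytes = ET.tostring(el, encoding='utf-8')    # returns bytes, with declaration

print(type(as_text).__name__, type(as_bytes).__name__)  # str bytes
```

This is why the rewritten code also disables `xml_declaration` when `encoding == 'unicode'`: a declaration naming a byte encoding makes no sense on a text string.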
def process(url, cache=None, options=None):
@@ -475,14 +491,16 @@ def process(url, cache=None, options=None):
    if cache:
        crawler.default_cache = crawler.SQLiteCache(cache)

    url = UrlFix(url)
    rss = FeedFetch(url, options)
    rss = FeedGather(rss, url, options)

    return FeedFormat(rss, options)


def cgi_app(environ, start_response):
def cgi_parse_environ(environ):
    # get options

    if 'REQUEST_URI' in environ:
        url = environ['REQUEST_URI'][1:]
    else:
@@ -496,7 +514,7 @@ def cgi_app(environ, start_response):
    if url.startswith(':'):
        split = url.split('/', 1)

        options = split[0].replace('|', '/').replace('\\\'', '\'').split(':')[1:]
        raw_options = unquote(split[0]).replace('|', '/').replace('\\\'', '\'').split(':')[1:]

        if len(split) > 1:
            url = split[1]
@@ -504,15 +522,22 @@ def cgi_app(environ, start_response):
            url = ''

    else:
        options = []
        raw_options = []

    # init
    options = Options(filterOptions(parseOptions(options)))
    headers = {}
    options = Options(filterOptions(parseOptions(raw_options)))

    global DEBUG
    DEBUG = options.debug

    return (url, options)


def cgi_app(environ, start_response):
    url, options = cgi_parse_environ(environ)

    headers = {}

    # headers
    headers['status'] = '200 OK'
    headers['cache-control'] = 'max-age=%s' % DELAY
@@ -520,7 +545,7 @@ def cgi_app(environ, start_response):
    if options.cors:
        headers['access-control-allow-origin'] = '*'

    if options.html or options.reader:
    if options.html:
        headers['content-type'] = 'text/html'
    elif options.txt or options.silent:
        headers['content-type'] = 'text/plain'
@@ -534,9 +559,12 @@ def cgi_app(environ, start_response):
    else:
        headers['content-type'] = 'text/xml'

    headers['content-type'] += '; charset=utf-8'

    crawler.default_cache = crawler.SQLiteCache(os.path.join(os.getcwd(), 'morss-cache.db'))

    # get the work done
    url = UrlFix(url)
    rss = FeedFetch(url, options)

    if headers['content-type'] == 'text/xml':
@@ -547,18 +575,42 @@ def cgi_app(environ, start_response):
    rss = FeedGather(rss, url, options)
    out = FeedFormat(rss, options)

    if not options.silent:
        return out
    if options.silent:
        return ['']

    else:
        return [out]


def cgi_wrapper(environ, start_response):
    # simple http server for html and css
def middleware(func):
    " Decorator to turn a function into a wsgi middleware "
    # This is called when parsing the code

    def app_builder(app):
        # This is called when doing app = cgi_wrapper(app)

        def app_wrap(environ, start_response):
            # This is called when a http request is being processed

            return func(environ, start_response, app)

        return app_wrap

    return app_builder

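The `middleware` decorator introduced here turns a three-argument function into a composable WSGI wrapper: the decorated function receives the wrapped `app` as an extra argument and may delegate to it or post-process its output. A self-contained toy example of the same pattern (the `upper_case` and `hello` names are illustrative, not from morss):

```python
def middleware(func):
    " Decorator to turn a function into a wsgi middleware "
    def app_builder(app):
        def app_wrap(environ, start_response):
            return func(environ, start_response, app)
        return app_wrap
    return app_builder

@middleware
def upper_case(environ, start_response, app):
    # post-process the wrapped app's output
    return [chunk.upper() for chunk in app(environ, start_response)]

def hello(environ, start_response):
    start_response('200 OK', [('content-type', 'text/plain')])
    return ['hello']

wrapped = upper_case(hello)
print(wrapped({}, lambda status, headers: None))  # ['HELLO']
```

Because each wrapper has the same `(environ, start_response)` signature as a WSGI app, wrappers can be stacked in any order, which is exactly what `cgi_standalone_app` and `main()` do below.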
@middleware
def cgi_file_handler(environ, start_response, app):
    " Simple HTTP server to serve static files (.html, .css, etc.) "

    files = {
        '': 'text/html',
        'index.html': 'text/html'}
        'index.html': 'text/html',
        'sheet.xsl': 'text/xsl'}

    if 'REQUEST_URI' in environ:
        url = environ['REQUEST_URI'][1:]

    else:
        url = environ['PATH_INFO'][1:]

@@ -568,35 +620,108 @@ def cgi_wrapper(environ, start_response):
    if url == '':
        url = 'index.html'

    if '--root' in sys.argv[1:]:
        path = os.path.join(sys.argv[-1], url)
    paths = [os.path.join(sys.prefix, 'share/morss/www', url),
        os.path.join(os.path.dirname(__file__), '../www', url)]

    for path in paths:
        try:
            body = open(path, 'rb').read()

            headers['status'] = '200 OK'
            headers['content-type'] = files[url]
            start_response(headers['status'], list(headers.items()))
            return [body]

        except IOError:
            continue

    else:
        path = url

    try:
        body = open(path, 'rb').read()

        headers['status'] = '200 OK'
        headers['content-type'] = files[url]
        start_response(headers['status'], list(headers.items()))
        return [body]

    except IOError:
        # the for loop did not return, so here we are, i.e. no file found
        headers['status'] = '404 Not found'
        start_response(headers['status'], list(headers.items()))
        return ['Error %s' % headers['status']]

    # actual morss use
    else:
        return app(environ, start_response)


def cgi_get(environ, start_response):
    url, options = cgi_parse_environ(environ)

    # get page
    PROTOCOL = ['http', 'https']

    if urlparse(url).scheme not in ['http', 'https']:
        url = 'http://' + url

    data, con, contenttype, encoding = crawler.adv_get(url=url)

    if contenttype in ['text/html', 'application/xhtml+xml', 'application/xml']:
        if options.get == 'page':
            html = readabilite.parse(data, encoding=encoding)
            html.make_links_absolute(con.geturl())

            kill_tags = ['script', 'iframe', 'noscript']

            for tag in kill_tags:
                for elem in html.xpath('//'+tag):
                    elem.getparent().remove(elem)

            output = lxml.etree.tostring(html.getroottree(), encoding='utf-8')

        elif options.get == 'article':
            output = readabilite.get_article(data, url=con.geturl(), encoding=encoding, debug=options.debug)

        else:
            raise MorssException('no :get option passed')

    else:
        output = data

    # return html page
    headers = {'status': '200 OK', 'content-type': 'text/html; charset=utf-8'}
    start_response(headers['status'], list(headers.items()))
    return [output]


dispatch_table = {
    'get': cgi_get,
    }


@middleware
def cgi_dispatcher(environ, start_response, app):
    url, options = cgi_parse_environ(environ)

    for key in dispatch_table.keys():
        if key in options:
            return dispatch_table[key](environ, start_response)

    return app(environ, start_response)

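`cgi_dispatcher` routes a request to a handler by scanning the parsed options against `dispatch_table`, falling through to the wrapped app when no key matches. A stripped-down sketch of that lookup (the `handle_get` helper is illustrative, not from morss):

```python
def handle_get(arg):
    return 'got %s' % arg

dispatch_table = {
    'get': handle_get,
    }

def dispatch(options, arg):
    # first matching option wins; otherwise fall through to the default app
    for key in dispatch_table:
        if key in options:
            return dispatch_table[key](arg)
    return 'fallthrough'

print(dispatch({'get': 'page'}, 'x'))  # got x
print(dispatch({}, 'x'))               # fallthrough
```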
@middleware
def cgi_error_handler(environ, start_response, app):
    try:
        return [cgi_app(environ, start_response) or '(empty)']
        return app(environ, start_response)

    except (KeyboardInterrupt, SystemExit):
        raise

    except Exception as e:
        headers = {'status': '500 Oops', 'content-type': 'text/plain'}
        headers = {'status': '500 Oops', 'content-type': 'text/html'}
        start_response(headers['status'], list(headers.items()), sys.exc_info())
        log('ERROR <%s>: %s' % (url, e.message), force=True)
        return ['An error happened:\n%s' % e.message]
        log('ERROR: %s' % repr(e), force=True)
        return [cgitb.html(sys.exc_info())]


@middleware
def cgi_encode(environ, start_response, app):
    out = app(environ, start_response)
    return [x if isinstance(x, bytes) else str(x).encode('utf-8') for x in out]


cgi_standalone_app = cgi_encode(cgi_error_handler(cgi_dispatcher(cgi_file_handler(cgi_app))))


def cli_app():
@@ -608,12 +733,13 @@ def cli_app():

    crawler.default_cache = crawler.SQLiteCache(os.path.expanduser('~/.cache/morss-cache.db'))

    url = UrlFix(url)
    rss = FeedFetch(url, options)
    rss = FeedGather(rss, url, options)
    out = FeedFormat(rss, options)
    out = FeedFormat(rss, options, 'unicode')

    if not options.silent:
        print(out.decode('utf-8', 'replace') if isinstance(out, bytes) else out)
        print(out)

    log('done')

@@ -622,6 +748,7 @@ def isInt(string):
    try:
        int(string)
        return True

    except ValueError:
        return False

@@ -629,31 +756,46 @@ def isInt(string):
def main():
    if 'REQUEST_URI' in os.environ:
        # mod_cgi
        wsgiref.handlers.CGIHandler().run(cgi_wrapper)

    elif len(sys.argv) <= 1 or isInt(sys.argv[1]) or '--root' in sys.argv[1:]:
        app = cgi_app
        app = cgi_dispatcher(app)
        app = cgi_error_handler(app)
        app = cgi_encode(app)

        wsgiref.handlers.CGIHandler().run(app)

    elif len(sys.argv) <= 1 or isInt(sys.argv[1]):
        # start internal (basic) http server

        if len(sys.argv) > 1 and isInt(sys.argv[1]):
            argPort = int(sys.argv[1])
            if argPort > 0:
                port = argPort

            else:
                raise MorssException('Port must be positive integer')

        else:
            port = PORT

        print('Serving http://localhost:%s/'%port)
        httpd = wsgiref.simple_server.make_server('', port, cgi_wrapper)
        app = cgi_app
        app = cgi_file_handler(app)
        app = cgi_dispatcher(app)
        app = cgi_error_handler(app)
        app = cgi_encode(app)

        print('Serving http://localhost:%s/' % port)
        httpd = wsgiref.simple_server.make_server('', port, app)
        httpd.serve_forever()

    else:
        # as a CLI app
        try:
            cli_app()

        except (KeyboardInterrupt, SystemExit):
            raise

        except Exception as e:
            print('ERROR: %s' % e.message)

@@ -1,13 +1,17 @@
import lxml.etree
import lxml.html
from bs4 import BeautifulSoup
import re


def parse(data, encoding=None):
    if encoding:
        parser = lxml.html.HTMLParser(remove_blank_text=True, remove_comments=True, encoding=encoding)
        data = BeautifulSoup(data, 'lxml', from_encoding=encoding).prettify('utf-8')

    else:
        parser = lxml.html.HTMLParser(remove_blank_text=True, remove_comments=True)
        data = BeautifulSoup(data, 'lxml').prettify('utf-8')

    parser = lxml.html.HTMLParser(remove_blank_text=True, remove_comments=True, encoding='utf-8')

    return lxml.html.fromstring(data, parser=parser)

@@ -43,6 +47,12 @@ def count_content(node):
    return count_words(node.text_content()) + len(node.findall('.//img'))


def percentile(N, P):
    # https://stackoverflow.com/a/7464107
    n = max(int(round(P * len(N) + 0.5)), 2)
    return N[n-2]

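The new `percentile` helper picks the value at fraction `P` of a sorted list, with the index clamped so it never falls below the list's lowest entry. A quick check of its behaviour (assuming, as in the linked Stack Overflow answer, that `N` is sorted ascending):

```python
def percentile(N, P):
    # N must be sorted ascending; https://stackoverflow.com/a/7464107
    n = max(int(round(P * len(N) + 0.5)), 2)
    return N[n-2]

N = list(range(1, 11))        # [1, 2, ..., 10]
print(percentile(N, 0.5))     # 5
print(percentile(N, 0.1))     # 1  (clamped to the lowest entry)
```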
class_bad = ['comment', 'community', 'extra', 'foot',
    'sponsor', 'pagination', 'pager', 'tweet', 'twitter', 'com-', 'masthead',
    'media', 'meta', 'related', 'shopping', 'tags', 'tool', 'author', 'about',
@@ -62,7 +72,7 @@ regex_good = re.compile('|'.join(class_good), re.I)

tags_junk = ['script', 'head', 'iframe', 'object', 'noscript',
    'param', 'embed', 'layer', 'applet', 'style', 'form', 'input', 'textarea',
    'button', 'footer']
    'button', 'footer', 'link', 'meta']

tags_bad = tags_junk + ['a', 'aside']

@@ -93,10 +103,18 @@ def score_node(node):
    class_id = node.get('class', '') + node.get('id', '')

    if (isinstance(node, lxml.html.HtmlComment)
            or node.tag in tags_bad
            or regex_bad.search(class_id)):
            or isinstance(node, lxml.html.HtmlProcessingInstruction)):
        return 0

    if node.tag in tags_junk:
        score += -1 # actuall -2 as tags_junk is included tags_bad

    if node.tag in tags_bad:
        score += -1

    if regex_bad.search(class_id):
        score += -1

    if node.tag in tags_good:
        score += 4

@@ -114,33 +132,42 @@ def score_node(node):
    return score


def score_all(node, grades=None):
def score_all(node):
    " Fairly dumb loop to score all worthwhile nodes. Tries to be fast "

    if grades is None:
        grades = {}

    for child in node:
        score = score_node(child)
        child.attrib['seen'] = 'yes, ' + str(int(score))

        if score > 0:
            spread_score(child, score, grades)
            score_all(child, grades)

    return grades
        if score > 0 or len(list(child.iterancestors())) <= 2:
            spread_score(child, score)
            score_all(child)


def spread_score(node, score, grades):
def set_score(node, value):
    node.attrib['morss_score'] = str(float(value))


def get_score(node):
    return float(node.attrib.get('morss_score', 0))


def incr_score(node, delta):
    set_score(node, get_score(node) + delta)


def get_all_scores(node):
    return {x:get_score(x) for x in list(node.iter()) if get_score(x) != 0}

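The `set_score` / `get_score` / `incr_score` helpers replace the external `grades` dict by storing each node's score directly in a `morss_score` attribute, so scores travel with the nodes. The same idea works with stdlib elements (a toy sketch, not morss code, since `iterancestors` itself is lxml-only):

```python
import xml.etree.ElementTree as ET

def set_score(node, value):
    node.attrib['morss_score'] = str(float(value))

def get_score(node):
    # unscored nodes read as 0
    return float(node.attrib.get('morss_score', 0))

def incr_score(node, delta):
    set_score(node, get_score(node) + delta)

div = ET.Element('div')
incr_score(div, 3)
incr_score(div, 1.5)
print(get_score(div))  # 4.5
```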
def spread_score(node, score):
    " Spread the node's score to its parents, on a linear way "

    delta = score / 2

    for ancestor in [node,] + list(node.iterancestors()):
        if score >= 1 or ancestor is node:
            try:
                grades[ancestor] += score
            except KeyError:
                grades[ancestor] = score
            incr_score(ancestor, score)

            score -= delta

@@ -148,26 +175,24 @@ def spread_score(node, score, grades):
            break


def write_score_all(root, grades):
    " Useful for debugging "

    for node in root.iter():
        node.attrib['score'] = str(int(grades.get(node, 0)))


def clean_root(root):
def clean_root(root, keep_threshold=None):
    for node in list(root):
        clean_root(node)
        clean_node(node)
        # bottom-up approach, i.e. starting with children before cleaning current node
        clean_root(node, keep_threshold)
        clean_node(node, keep_threshold)


def clean_node(node):
def clean_node(node, keep_threshold=None):
    parent = node.getparent()

    if parent is None:
        # this is <html/> (or a removed element waiting for GC)
        return

    if keep_threshold is not None and get_score(node) >= keep_threshold:
        # high score, so keep
        return

    gdparent = parent.getparent()

    # remove shitty tags
@@ -266,41 +291,45 @@ def lowest_common_ancestor(nodeA, nodeB, max_depth=None):
    return nodeA # should always find one tho, at least <html/>, but needed for max_depth


def rank_nodes(grades):
def rank_grades(grades):
    # largest score to smallest
    return sorted(grades.items(), key=lambda x: x[1], reverse=True)

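`rank_grades` is a plain descending sort of the `{node: score}` mapping. With hashable strings standing in for lxml nodes:

```python
def rank_grades(grades):
    # largest score to smallest
    return sorted(grades.items(), key=lambda x: x[1], reverse=True)

grades = {'a': 2.0, 'b': 5.0, 'c': 1.0}
print(rank_grades(grades))  # [('b', 5.0), ('a', 2.0), ('c', 1.0)]
```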
def get_best_node(grades):
def get_best_node(ranked_grades):
    " To pick the best (raw) node. Another function will clean it "

    if len(grades) == 1:
        return grades[0]
    if len(ranked_grades) == 1:
        return ranked_grades[0]

    top = rank_nodes(grades)
    lowest = lowest_common_ancestor(top[0][0], top[1][0], 3)
    lowest = lowest_common_ancestor(ranked_grades[0][0], ranked_grades[1][0], 3)

    return lowest


def get_article(data, url=None, encoding=None):
def get_article(data, url=None, encoding=None, debug=False):
    " Input a raw html string, returns a raw html string of the article "

    html = parse(data, encoding)
    scores = score_all(html)
    score_all(html)
    scores = rank_grades(get_all_scores(html))

    if not len(scores):
        return None

    best = get_best_node(scores)

    if not debug:
        keep_threshold = percentile([x[1] for x in scores], 0.1)
        clean_root(best, keep_threshold)

    wc = count_words(best.text_content())
    wca = count_words(' '.join([x.text_content() for x in best.findall('.//a')]))

    if wc - wca < 50 or float(wca) / wc > 0.3:
    if not debug and (wc - wca < 50 or float(wca) / wc > 0.3):
        return None

    if url:
        best.make_links_absolute(url)

    clean_root(best)

    return lxml.etree.tostring(best, pretty_print=True)
    return lxml.etree.tostring(best if not debug else html, pretty_print=True)

@@ -1,11 +1,9 @@
@require(feed)
<!DOCTYPE html>
<html>
<head>
    <title>@feed.title – via morss</title>
    <meta charset="UTF-8" />
    <meta name="description" content="@feed.desc (via morss)" />
    <title>Feed reader by morss</title>
    <meta name="viewport" content="width=device-width; initial-scale=1.0; maximum-scale=1.0;" />
    <meta name="robots" content="noindex" />

    <style type="text/css">
        /* columns - from https://thisisdallas.github.io/Simple-Grid/simpleGrid.css */
@@ -32,7 +30,7 @@
            padding-right: 20px; /* column-space */
        }

        @@media handheld, only screen and (max-width: 767px) { /* @@ to escape from the template engine */
        @media handheld, only screen and (max-width: 767px) {
            #content {
                width: 100%;
                min-width: 0;
@@ -82,6 +80,7 @@

        #content {
            text-align: justify;
            line-height: 1.5em;
        }

        .item .title {
@@ -171,30 +170,17 @@

<body>
    <div id="header">
        <h1>@feed.title</h1>
        @if feed.desc:
            <h2>@feed.desc</h2>
        @end
        <h1>RSS feed</h1>
        <h2>with full text articles</h2>
        <p>- via morss</p>
    </div>

    <div id="content">
        @for item in feed.items:
            <div class="item">
                @if item.link:
                    <a class="title link" href="@item.link" target="_blank">@item.title</a>
                @else:
                    <span class="title">@item.title</span>
                @end
                <div class="article">
                    @if item.content:
                        @item.content
                    @else:
                        @item.desc
                    @end
                </div>
                <a class="title link" href="@item.link" target="_blank"></a>
                <div class="desc"></div>
                <div class="content"></div>
            </div>
        @end
    </div>

    <script>
@@ -1,4 +0,0 @@
lxml
python-dateutil <= 1.5
chardet
pymysql
32  setup.py
@@ -1,14 +1,24 @@
from setuptools import setup, find_packages
from setuptools import setup
from glob import glob

package_name = 'morss'

setup(
    name=package_name,
    description='Get full-text RSS feeds',
    author='pictuga, Samuel Marks',
    author_email='contact at pictuga dot com',
    url='http://morss.it/',
    license='AGPL v3',
    package_dir={package_name: package_name},
    packages=find_packages(),
    package_data={package_name: ['feedify.ini', 'reader.html.template']},
    test_suite=package_name + '.tests')
    name = package_name,
    description = 'Get full-text RSS feeds',
    author = 'pictuga, Samuel Marks',
    author_email = 'contact at pictuga dot com',
    url = 'http://morss.it/',
    download_url = 'https://git.pictuga.com/pictuga/morss',
    license = 'AGPL v3',
    packages = [package_name],
    install_requires = ['lxml', 'bs4', 'python-dateutil', 'chardet', 'pymysql'],
    package_data = {package_name: ['feedify.ini', 'reader.html.template']},
    data_files = [
        ('share/' + package_name, ['README.md', 'LICENSE']),
        ('share/' + package_name + '/www', glob('www/*.*')),
        ('share/' + package_name + '/www/cgi', [])
    ],
    entry_points = {
        'console_scripts': [package_name + '=' + package_name + ':main']
    })
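The `console_scripts` entry above is built by string concatenation; it expands to the spec `morss=morss:main`, i.e. an installed `morss` command that calls `morss.main()`:

```python
# Reproduce the entry-point spec that setup.py assembles
package_name = 'morss'
entry = package_name + '=' + package_name + ':main'
print(entry)  # morss=morss:main
```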
@@ -35,8 +35,8 @@
    <input type="text" id="url" name="url" placeholder="Feed url (http://example.com/feed.xml)" />
</form>

<code>Copyright: pictuga 2013-2014<br/>
Source code: https://github.com/pictuga/morss</code>
<code>Copyright: pictuga 2013-2020<br/>
Source code: https://git.pictuga.com/pictuga/morss</code>

<script>
    form = document.forms[0]
@@ -13,6 +13,7 @@
body {
    overflow-wrap: anywhere;
    word-wrap: anywhere;
    font-family: sans;
}

#url {