Compare commits

78 Commits

Author SHA1 Message Date
41a63900c2 README: improve docker instructions 2020-04-19 13:01:08 +02:00
ec8edb02f1 Various small bug fixes 2020-04-19 12:54:02 +02:00
d01b943597 Remove leftover threading var 2020-04-19 12:51:11 +02:00
b361aa2867 Add timeout to :get 2020-04-19 12:50:26 +02:00
4ce3c7cb32 Small code clean ups 2020-04-19 12:50:05 +02:00
7e45b2611d Disable multi-threading
Impact was mostly negative due to locks
2020-04-19 12:29:52 +02:00
036e5190f1 crawler: remove unused code 2020-04-18 21:40:02 +02:00
e99c5b3b71 morss: more sensible default MAX/LIM values 2020-04-18 17:21:45 +02:00
4f44df8d63 Make all ports default to 8080 2020-04-18 17:15:59 +02:00
497c14db81 Add dockerfile & how to in README 2020-04-18 17:04:44 +02:00
a4e1dba8b7 sheet.xsl: improve url display 2020-04-16 10:33:36 +02:00
7375adce33 sheet.xsl: fix & improve 2020-04-15 23:34:28 +02:00
663212de0a sheet.xsl: various cosmetic improvements 2020-04-15 23:22:45 +02:00
4a2ea1bce9 README: add gunicorn instructions 2020-04-15 22:31:21 +02:00
fe82b19c91 Merge .xsl & html template
Turns out they somehow serve a similar purpose
2020-04-15 22:30:45 +02:00
0b31e97492 morss: remove debug code in http file handler 2020-04-14 23:20:03 +02:00
b0ad7c259d Add README & LICENSE to data_files 2020-04-14 19:34:12 +02:00
bffb23f884 README: how to use cli 2020-04-14 18:21:32 +02:00
59139272fd Auto-detect the location of www/
Either ../www or /usr/share/morss
Adapted README accordingly
2020-04-14 18:07:19 +02:00
39b0a1d7cc setup.py: fix deps & files 2020-04-14 17:36:42 +02:00
65803b328d New git url and updated date in provided index.html 2020-04-13 15:30:32 +02:00
e6b7c0eb33 Fix app definition for uwsgi 2020-04-13 15:30:09 +02:00
67c096ad5b feeds: add fake path to default html parser
Without it, some websites were accidentally matching it (false positives)
2020-04-12 13:00:56 +02:00
f018437544 crawler: make mysql backend thread safe 2020-04-12 12:53:05 +02:00
8e5e8d24a4 Timezone fixes 2020-04-10 20:33:59 +02:00
ee78a7875a morss: focus on the most recent feed items 2020-04-10 16:08:13 +02:00
9e7b9d95ee feeds: properly use html template 2020-04-09 20:00:51 +02:00
987a719c4e feeds: try all parsers regardless of contenttype
Turns out some websites send the wrong contenttype (json for html, html for xml, etc.)
2020-04-09 19:17:51 +02:00
47b33f4baa morss: specify server output encoding 2020-04-09 19:10:45 +02:00
3c7f512583 feeds: handle several errors 2020-04-09 19:09:10 +02:00
a32f5a8536 readabilite: add debug option (also used by :get) 2020-04-09 19:08:13 +02:00
63a06524b7 morss: various encoding fixes 2020-04-09 19:06:51 +02:00
b0f80c6d3c morss: fix csv output encoding 2020-04-09 19:05:50 +02:00
78cea10ead morss: replace :getpage with :get
Also provides readabilite debugging
2020-04-09 18:43:20 +02:00
e5a82ff1f4 crawler: drop auto-referer
Was solving some issues. But creating even more issues.
2020-04-07 10:39:21 +02:00
f3d1f92b39 Detect encoding everytime 2020-04-07 10:38:36 +02:00
7691df5257 Use wrapper for http calls 2020-04-07 10:30:17 +02:00
0ae0dbc175 README: mention csv output 2020-04-07 09:24:32 +02:00
f1d0431e68 morss: drop :html, replaced with :reader
README updated accordingly
2020-04-07 09:23:29 +02:00
a09831415f feeds: fix bug when mimetype matches nothing 2020-04-06 18:53:07 +02:00
bfad6b7a4a readabilite: clean before counting
To remove links which are not kept anyway
2020-04-06 16:55:39 +02:00
6b8c3e51e7 readabilite: fix threshold feature
Awkward typo...
2020-04-06 16:52:06 +02:00
dc9e425247 readabilite: don't clean-out the top 10% nodes
Loosen up the code once again to limit over-kill
2020-04-06 14:26:28 +02:00
2f48e18bb1 readabilite: put scores directly in html node
Probably slower but makes code somewhat cleaner...
2020-04-06 14:21:41 +02:00
31cac921c7 README: remove ref to iTunes 2020-04-05 22:20:33 +02:00
a82ec96eb7 Delete feedify.py leftover code
iTunes integration untested, unreliable and not working...
2020-04-05 22:16:52 +02:00
aad2398e69 feeds: turns out lxml.etree doesn't have drop_tag 2020-04-05 21:50:38 +02:00
eeac630855 crawler: add more "realistic" headers 2020-04-05 21:11:57 +02:00
e136b0feb2 readabilite: loosen the slayer
Previous impl. lead to too many empty results
2020-04-05 20:47:30 +02:00
6cf32af6c0 readabilite: also use BS 2020-04-05 20:46:42 +02:00
568e7d7dd2 feeds: make BS's output bytes for lxml's sake 2020-04-05 20:46:04 +02:00
3617f86e9d morss: make cgi_encore more robust 2020-04-05 16:43:11 +02:00
d90756b337 morss: drop 'keep' option
Because the Firefox behaviour it is working around is no longer in use
2020-04-05 16:37:27 +02:00
40c69f17d2 feeds: parse html with BS
More robust & to make it consistent with :getpage
2020-04-05 16:12:41 +02:00
99461ea185 crawler: fix var name issues (private_cache) 2020-04-05 16:11:36 +02:00
bf86c1e962 crawler: make AutoUA match http(s) type 2020-04-05 16:07:51 +02:00
d20f6237bd crawler: replace ContentNegoHandler with AlternateHandler
More basic. Sends the same headers no matter what. Make requests more "replicable".
Also, drop "text/xml" from RSS contenttype, too broad, matches garbage
2020-04-05 16:05:59 +02:00
8a4d68d72c crawler: drop 'basic' toggle
Can't even remember the use case
2020-04-05 16:03:06 +02:00
e6811138fd morss: use redirected url in :getpage
Still have to find how to do the same thing with feeds...
2020-04-04 20:04:57 +02:00
35b702fffd morss: default values for feed creation 2020-04-04 19:39:32 +02:00
4a88886767 morss: get_page to act as a basic proxy (for iframes) 2020-04-04 16:37:15 +02:00
1653394cf7 morss: cgi_dispatcher to be able to create extra functions 2020-04-04 16:35:16 +02:00
a8a90cf414 morss: move url/options parsing to own function
For future re-use
2020-04-04 16:33:52 +02:00
bdbaf0f8a7 morss/cgi: fix handling of special chars in url 2020-04-04 16:21:37 +02:00
d0e447a2a6 ItemFix: clean up Pocket links 2020-04-04 16:20:39 +02:00
e6817e01b4 sheet.xsl: set font to "sans"
Browsers don't all have the same default font. Overriding for consistency
2020-04-03 17:47:19 +02:00
7c3091d64c morss: code spacing
One of those commits that make me feel useful
2020-03-21 23:41:46 +01:00
37b4e144a9 morss: small fixes
Includes dropping off ftp support
2020-03-21 23:30:18 +01:00
bd4b7b5bb2 morss: convert HTML feeds to XML ones for completeness 2020-03-21 23:27:42 +01:00
68d920d4b5 morss: make FeedFormat more flexible with encoding 2020-03-21 23:26:35 +01:00
758ff404a8 morss: fix cgi_app silent output
*Must* return sth
2020-03-21 23:25:25 +01:00
463530f02c morss: middleware to enforce encoding
bytes are always expected
2020-03-21 23:23:50 +01:00
ec0a28a91d morss: use middleware for wsgi apps 2020-03-21 23:23:21 +01:00
421acb439d morss: make errors more readable over http 2020-03-21 23:08:29 +01:00
42c5d09ccb morss: split "options" var into "raw_options" & "options"
To make it clearer who-is-what
2020-03-21 23:07:07 +01:00
056de12484 morss: add sheet.xsl to file handled by http server 2020-03-21 23:06:28 +01:00
961a31141f morss: fix url fixing 2020-03-21 17:28:00 +01:00
a7b01ee85e readabilite: further html processing instructions fix 2020-03-21 17:23:50 +01:00
14 changed files with 550 additions and 602 deletions

8
Dockerfile Normal file
@@ -0,0 +1,8 @@
FROM alpine:latest
RUN apk add python3 py3-lxml py3-pip git
RUN pip3 install gunicorn
RUN pip3 install git+https://git.pictuga.com/pictuga/morss.git@master
CMD gunicorn --bind 0.0.0.0:8080 -w 4 morss:cgi_standalone_app

@@ -24,15 +24,13 @@ hand-written rules (ie. there's no automatic detection of links to build feeds).
Please mind that feeds based on html files may stop working unexpectedly, due to
html structure changes on the target website.
Additionally morss can grab the source xml feed of iTunes podcast, and detect
rss feeds in html pages' `<meta>`.
Additionally morss can detect rss feeds in html pages' `<meta>`.
You can use this program online for free at **[morss.it](https://morss.it/)**.
Some features of morss:
- Read RSS/Atom feeds
- Create RSS feeds from json/html pages
- Convert iTunes podcast links into xml links
- Export feeds as RSS/JSON/CSV/HTML
- Fetch full-text content of feed items
- Follow 301/meta redirects
@@ -48,6 +46,7 @@ You do need:
- [python](http://www.python.org/) >= 2.6 (python 3 is supported)
- [lxml](http://lxml.de/) for xml parsing
- [bs4](https://pypi.org/project/bs4/) for badly-formatted html pages
- [dateutil](http://labix.org/python-dateutil) to parse feed dates
- [chardet](https://pypi.python.org/pypi/chardet)
- [six](https://pypi.python.org/pypi/six), a dependency of chardet
@@ -56,7 +55,7 @@ You do need:
Simplest way to get these:
```shell
pip install -r requirements.txt
pip install git+https://git.pictuga.com/pictuga/morss.git@master
```
You may also need:
@@ -74,9 +73,10 @@ The arguments are:
- Change what morss does
- `json`: output as JSON
- `html`: output as HTML
- `csv`: output as CSV
- `proxy`: doesn't fill the articles
- `clip`: stick the full article content under the original feed content (useful for twitter)
- `keep`: by default, morss does drop feed description whenever the full-content is found (so as not to mislead users who use Firefox, since the latter only shows the description in the feed preview, so they might believe morss doesn't work), but with this argument, the description is kept
- `search=STRING`: does a basic case-sensitive search in the feed
- Advanced
- `csv`: export to csv
@@ -85,14 +85,11 @@ The arguments are:
- `noref`: drop items' link
- `cache`: only take articles from the cache (ie. don't grab new articles' content), so as to save time
- `debug`: to have some feedback from the script execution. Useful for debugging
- `mono`: disable multithreading while fetching, makes debugging easier
- `theforce`: force download the rss feed and ignore cached http errors
- `silent`: don't output the final RSS (useless on its own, but can be nice when debugging)
- `encoding=ENCODING`: overrides the encoding auto-detection of the crawler. Some web developers did not quite understand the importance of setting charset/encoding tags correctly...
- http server only
- `callback=NAME`: for JSONP calls
- `cors`: allow Cross-origin resource sharing (allows XHR calls from other servers)
- `html`: changes the http content-type to html, so that python cgi errors (written in html) are readable in a web browser
- `txt`: changes the http content-type to txt (for faster "`view-source:`")
- Custom feeds: you can turn any HTML page into a RSS feed using morss, using xpath rules. The article content will be fetched as usual (with readabilite). Please note that you will have to **replace** any `/` in your rule with a `|` when using morss as a webserver
- `items`: (**mandatory** to activate the custom feeds function) xpath rule to match all the RSS entries
@@ -111,7 +108,6 @@ morss will auto-detect what "mode" to use.
For this, you'll want to change a bit the architecture of the files, for example
into something like this.
```
/
├── cgi
@@ -143,20 +139,40 @@ ensure that the provided `/www/.htaccess` works well with your server.
Running this command should do:
```shell
uwsgi --http :9090 --plugin python --wsgi-file main.py
uwsgi --http :8080 --plugin python --wsgi-file main.py
```
However, one problem might be how to serve the provided `index.html` file if it
isn't in the same directory. Therefore you can add this at the end of the
command to point to another directory `--pyargv '--root ../../www/'`.
#### Using Gunicorn
```shell
gunicorn morss:cgi_standalone_app
```
#### Using docker
Build & run
```shell
docker build https://git.pictuga.com/pictuga/morss.git -t morss
docker run -p 8080:8080 morss
```
In one line
```shell
docker run -p 8080:8080 $(docker build -q https://git.pictuga.com/pictuga/morss.git)
```
#### Using morss' internal HTTP server
Morss can run its own HTTP server. The latter should start when you run morss
without any argument, on port 8080.
You can change the port and the location of the `www/` folder like this `python -m morss 9000 --root ../../www`.
```shell
morss
```
You can change the port like this `morss 9000`.
#### Passing arguments
@@ -176,9 +192,9 @@ Works like a charm with [Tiny Tiny RSS](http://tt-rss.org/redmine/projects/tt-rs
Run:
```
python[2.7] -m morss [argwithoutvalue] [argwithvalue=value] [...] FEEDURL
morss [argwithoutvalue] [argwithvalue=value] [...] FEEDURL
```
For example: `python -m morss debug http://feeds.bbci.co.uk/news/rss.xml`
For example: `morss debug http://feeds.bbci.co.uk/news/rss.xml`
*(Brackets indicate optional text)*
@@ -191,9 +207,9 @@ scripts can be run on top of the RSS feed, using its
To use this script, you have to enable "(Unix) command" in liferea feed settings, and use the command:
```
[python[2.7]] PATH/TO/MORSS/main.py [argwithoutvalue] [argwithvalue=value] [...] FEEDURL
morss [argwithoutvalue] [argwithvalue=value] [...] FEEDURL
```
For example: `python2.7 PATH/TO/MORSS/main.py http://feeds.bbci.co.uk/news/rss.xml`
For example: `morss http://feeds.bbci.co.uk/news/rss.xml`
*(Brackets indicate optional text)*
@@ -238,12 +254,13 @@ output = morss.Format(rss, options) # formats final feed
## Cache information
morss uses caching to make loading faster. There are 2 possible cache backends
morss uses caching to make loading faster. There are 3 possible cache backends
(visible in `morss/crawler.py`):
- `{}`: a simple python in-memory dict() object
- `SQLiteCache`: sqlite3 cache. Default file location is in-memory (i.e. it will
be cleared every time the program is run)
- `MySQLCacheHandler`: /!\ Does NOT support multi-threading
- `MySQLCacheHandler`
## Configuration
### Length limitation
@@ -262,7 +279,6 @@ different values at the top of the script.
- `DELAY` sets the browser cache delay, only for HTTP clients
- `TIMEOUT` sets the HTTP timeout when fetching rss feeds and articles
- `THREADS` sets the number of threads to use. `1` makes no use of multithreading.
### Content matching

@@ -1,6 +1,6 @@
#!/usr/bin/env python
from morss import main, cgi_wrapper as application
from morss import main, cgi_standalone_app as application
if __name__ == '__main__':
main()

@@ -27,13 +27,33 @@ except NameError:
MIMETYPE = {
'xml': ['text/xml', 'application/xml', 'application/rss+xml', 'application/rdf+xml', 'application/atom+xml', 'application/xhtml+xml'],
'rss': ['application/rss+xml', 'application/rdf+xml', 'application/atom+xml'],
'html': ['text/html', 'application/xhtml+xml', 'application/xml']}
DEFAULT_UA = 'Mozilla/5.0 (X11; Linux x86_64; rv:25.0) Gecko/20100101 Firefox/25.0'
def custom_handler(accept=None, strict=False, delay=None, encoding=None, basic=False):
def get(*args, **kwargs):
return adv_get(*args, **kwargs)[0]
def adv_get(url, timeout=None, *args, **kwargs):
if timeout is None:
con = custom_handler(*args, **kwargs).open(url)
else:
con = custom_handler(*args, **kwargs).open(url, timeout=timeout)
data = con.read()
contenttype = con.info().get('Content-Type', '').split(';')[0]
encoding = detect_encoding(data, con)
return data, con, contenttype, encoding
def custom_handler(follow=None, delay=None, encoding=None):
handlers = []
# as per urllib2 source code, these Handlers are added first
@@ -51,14 +71,11 @@ def custom_handler(accept=None, strict=False, delay=None, encoding=None, basic=F
handlers.append(HTTPEquivHandler())
handlers.append(HTTPRefreshHandler())
handlers.append(UAHandler(DEFAULT_UA))
if not basic:
handlers.append(AutoRefererHandler())
handlers.append(BrowserlyHeaderHandler())
handlers.append(EncodingFixHandler(encoding))
if accept:
handlers.append(ContentNegociationHandler(MIMETYPE[accept], strict))
if follow:
handlers.append(AlternateHandler(MIMETYPE[follow]))
handlers.append(CacheHandler(force_min=delay))
@@ -196,45 +213,33 @@ class UAHandler(BaseHandler):
https_request = http_request
class AutoRefererHandler(BaseHandler):
class BrowserlyHeaderHandler(BaseHandler):
""" Add more headers to look less suspicious """
def http_request(self, req):
req.add_unredirected_header('Referer', 'http://%s' % req.host)
req.add_unredirected_header('Accept', 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8')
req.add_unredirected_header('Accept-Language', 'en-US,en;q=0.5')
return req
https_request = http_request
class ContentNegociationHandler(BaseHandler):
" Handler for content negociation. Also parses <link rel='alternate' type='application/rss+xml' href='...' /> "
class AlternateHandler(BaseHandler):
" Follow <link rel='alternate' type='application/rss+xml' href='...' /> "
def __init__(self, accept=None, strict=False):
self.accept = accept
self.strict = strict
def http_request(self, req):
if self.accept is not None:
if isinstance(self.accept, basestring):
self.accept = (self.accept,)
string = ','.join(self.accept)
if self.strict:
string += ',*/*;q=0.9'
req.add_unredirected_header('Accept', string)
return req
def __init__(self, follow=None):
self.follow = follow or []
def http_response(self, req, resp):
contenttype = resp.info().get('Content-Type', '').split(';')[0]
if 200 <= resp.code < 300 and self.accept is not None and self.strict and contenttype in MIMETYPE['html'] and contenttype not in self.accept:
if 200 <= resp.code < 300 and len(self.follow) and contenttype in MIMETYPE['html'] and contenttype not in self.follow:
# oops, not what we were looking for, let's see if the html page suggests an alternative page of the right types
data = resp.read()
links = lxml.html.fromstring(data[:10000]).findall('.//link[@rel="alternate"]')
for link in links:
if link.get('type', '') in self.accept:
if link.get('type', '') in self.follow:
resp.code = 302
resp.msg = 'Moved Temporarily'
resp.headers['location'] = link.get('href')
@@ -246,7 +251,6 @@ class ContentNegociationHandler(BaseHandler):
return resp
https_request = http_request
https_response = http_response
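
The lookup that `AlternateHandler` performs on an HTML response can be sketched on its own. This is a minimal illustration, not the project's code: it swaps `lxml.html` for the standard library's `html.parser`, and the names `AlternateLinkFinder` and `find_alternate` are hypothetical. Only the list of feed types is taken from the `rss` entry of `MIMETYPE` in the hunk above.

```python
from html.parser import HTMLParser

# Feed MIME types, copied from the 'rss' entry of MIMETYPE in the diff.
RSS_TYPES = ['application/rss+xml', 'application/rdf+xml', 'application/atom+xml']


class AlternateLinkFinder(HTMLParser):
    """Collect (type, href) pairs from <link rel="alternate"> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != 'link':
            return
        attrs = dict(attrs)
        if attrs.get('rel') == 'alternate' and attrs.get('href'):
            self.links.append((attrs.get('type', ''), attrs['href']))


def find_alternate(html, follow=RSS_TYPES):
    """Return the href of the first alternate link whose type is in `follow`, else None."""
    finder = AlternateLinkFinder()
    finder.feed(html)
    for link_type, href in finder.links:
        if link_type in follow:
            return href
    return None


page = '<html><head><link rel="alternate" type="application/rss+xml" href="/feed.xml"></head></html>'
print(find_alternate(page))  # /feed.xml
```

In the handler itself the matching `href` is not returned but written into a synthetic 302 response, so that the regular redirect machinery fetches the feed.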
@@ -384,7 +388,7 @@ class CacheHandler(BaseHandler):
elif self.force_min is None and ('no-cache' in cc_list
or 'no-store' in cc_list
or ('private' in cc_list and not self.private)):
or ('private' in cc_list and not self.private_cache)):
# kindly follow web servers indications, refresh
return None
@@ -419,7 +423,7 @@ class CacheHandler(BaseHandler):
cc_list = [x for x in cache_control if '=' not in x]
if 'no-cache' in cc_list or 'no-store' in cc_list or ('private' in cc_list and not self.private):
if 'no-cache' in cc_list or 'no-store' in cc_list or ('private' in cc_list and not self.private_cache):
# kindly follow web servers indications
return resp
@@ -461,6 +465,8 @@ class CacheHandler(BaseHandler):
class BaseCache:
""" Subclasses must behave like a dict """
def __contains__(self, url):
try:
self[url]
@@ -477,7 +483,7 @@ import sqlite3
class SQLiteCache(BaseCache):
def __init__(self, filename=':memory:'):
self.con = sqlite3.connect(filename or sqlite_default, detect_types=sqlite3.PARSE_DECLTYPES, check_same_thread=False)
self.con = sqlite3.connect(filename, detect_types=sqlite3.PARSE_DECLTYPES, check_same_thread=False)
with self.con:
self.con.execute('CREATE TABLE IF NOT EXISTS data (url UNICODE PRIMARY KEY, code INT, msg UNICODE, headers UNICODE, data BLOB, timestamp INT)')
@@ -513,18 +519,20 @@ import pymysql.cursors
class MySQLCacheHandler(BaseCache):
" NB. Requires mono-threading, as pymysql isn't thread-safe "
def __init__(self, user, password, database, host='localhost'):
self.con = pymysql.connect(host=host, user=user, password=password, database=database, charset='utf8', autocommit=True)
self.user = user
self.password = password
self.database = database
self.host = host
with self.con.cursor() as cursor:
with self.cursor() as cursor:
cursor.execute('CREATE TABLE IF NOT EXISTS data (url VARCHAR(255) NOT NULL PRIMARY KEY, code INT, msg TEXT, headers TEXT, data BLOB, timestamp INT)')
def __del__(self):
self.con.close()
def cursor(self):
return pymysql.connect(host=self.host, user=self.user, password=self.password, database=self.database, charset='utf8', autocommit=True).cursor()
def __getitem__(self, url):
cursor = self.con.cursor()
cursor = self.cursor()
cursor.execute('SELECT * FROM data WHERE url=%s', (url,))
row = cursor.fetchone()
@@ -535,10 +543,10 @@ class MySQLCacheHandler(BaseCache):
def __setitem__(self, url, value): # (code, msg, headers, data, timestamp)
if url in self:
with self.con.cursor() as cursor:
with self.cursor() as cursor:
cursor.execute('UPDATE data SET code=%s, msg=%s, headers=%s, data=%s, timestamp=%s WHERE url=%s',
value + (url,))
else:
with self.con.cursor() as cursor:
with self.cursor() as cursor:
cursor.execute('INSERT INTO data VALUES (%s,%s,%s,%s,%s,%s)', (url,) + value)
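
The `BaseCache` docstring added above says that subclasses must behave like a dict. A minimal sketch of what that contract could look like, assuming an in-memory SQLite table with the same columns as in the diff; the class name `TinySQLiteCache` is hypothetical and this is not the project's `SQLiteCache`:

```python
import sqlite3


class TinySQLiteCache:
    """Hypothetical dict-like cache keyed by url (illustration only)."""

    def __init__(self, filename=':memory:'):
        self.con = sqlite3.connect(filename)
        with self.con:
            self.con.execute('CREATE TABLE IF NOT EXISTS data '
                             '(url TEXT PRIMARY KEY, code INT, msg TEXT, '
                             'headers TEXT, data BLOB, timestamp INT)')

    def __getitem__(self, url):
        row = self.con.execute(
            'SELECT code, msg, headers, data, timestamp FROM data WHERE url=?',
            (url,)).fetchone()
        if row is None:
            raise KeyError(url)  # same signal a plain dict would give
        return row

    def __setitem__(self, url, value):  # value: (code, msg, headers, data, timestamp)
        with self.con:
            self.con.execute('INSERT OR REPLACE INTO data VALUES (?,?,?,?,?,?)',
                             (url,) + tuple(value))

    def __contains__(self, url):
        # Mirrors the BaseCache approach: membership is "does lookup succeed?"
        try:
            self[url]
        except KeyError:
            return False
        return True


cache = TinySQLiteCache()
cache['http://example.com/'] = (200, 'OK', '', b'<html></html>', 0)
```

The MySQL change in the hunk above follows a different design choice: rather than sharing one connection, it opens a fresh connection for every cursor, which avoids sharing a connection object across threads at the cost of extra connection setup.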

@@ -90,8 +90,11 @@ item_updated = updated
[html]
mode = html
path =
http://localhost/
title = //div[@id='header']/h1
desc = //div[@id='header']/h2
desc = //div[@id='header']/p
items = //div[@id='content']/div
item_title = ./a
@@ -99,7 +102,7 @@ item_link = ./a/@href
item_desc = ./div[class=desc]
item_content = ./div[class=content]
base = <!DOCTYPE html> <html> <head> <title>Feed reader by morss</title> <meta name="viewport" content="width=device-width; initial-scale=1.0; maximum-scale=1.0;" /> </head> <body> <div id="header"> <h1>@feed.title</h1> <h2>@feed.desc</h2> <p>- via morss</p> </div> <div id="content"> <div class="item"> <a class="title link" href="@item.link" target="_blank">@item.title</a> <div class="desc">@item.desc</div> <div class="content">@item.content</div> </div> </div> <script> var items = document.getElementsByClassName('item') for (var i in items) items[i].onclick = function() { this.classList.toggle('active') document.body.classList.toggle('noscroll') } </script> </body> </html>
base = file:sheet.xsl
[twitter]
mode = html

@@ -1,28 +0,0 @@
import re
import json
from . import crawler
try:
basestring
except NameError:
basestring = str
def pre_worker(url):
if url.startswith('http://itunes.apple.com/') or url.startswith('https://itunes.apple.com/'):
match = re.search('/id([0-9]+)(\?.*)?$', url)
if match:
iid = match.groups()[0]
redirect = 'https://itunes.apple.com/lookup?id=%s' % iid
try:
con = crawler.custom_handler(basic=True).open(redirect, timeout=4)
data = con.read()
except (IOError, HTTPException):
raise
return json.loads(data.decode('utf-8', 'replace'))['results'][0]['feedUrl']
return None

@@ -15,6 +15,7 @@ import dateutil.parser
from copy import deepcopy
import lxml.html
from .readabilite import parse as html_parse
json.encoder.c_make_encoder = None
@@ -45,14 +46,32 @@ def parse_rules(filename=None):
rules = dict([(x, dict(config.items(x))) for x in config.sections()])
for section in rules.keys():
# for each ruleset
for arg in rules[section].keys():
if '\n' in rules[section][arg]:
# for each rule
if rules[section][arg].startswith('file:'):
paths = [os.path.join(sys.prefix, 'share/morss/www', rules[section][arg][5:]),
os.path.join(os.path.dirname(__file__), '../www', rules[section][arg][5:]),
os.path.join(os.path.dirname(__file__), '../..', rules[section][arg][5:])]
for path in paths:
try:
file_raw = open(path).read()
file_clean = re.sub('<[/?]?(xsl|xml)[^>]+?>', '', file_raw)
rules[section][arg] = file_clean
except IOError:
pass
elif '\n' in rules[section][arg]:
rules[section][arg] = rules[section][arg].split('\n')[1:]
return rules
def parse(data, url=None, mimetype=None):
def parse(data, url=None, mimetype=None, encoding=None):
" Determine which ruleset to use "
rulesets = parse_rules()
@@ -66,26 +85,20 @@ def parse(data, url=None, mimetype=None):
for path in ruleset['path']:
if fnmatch(url, path):
parser = [x for x in parsers if x.mode == ruleset['mode']][0]
return parser(data, ruleset)
return parser(data, ruleset, encoding=encoding)
# 2) Look for a parser based on mimetype
if mimetype is not None:
parser_candidates = [x for x in parsers if mimetype in x.mimetype]
if mimetype is None or parser_candidates is None:
parser_candidates = parsers
# 2) Try each and every parser
# 3) Look for working ruleset for given parser
# 3a) See if parsing works
# 3b) See if .items matches anything
for parser in parser_candidates:
for parser in parsers:
ruleset_candidates = [x for x in rulesets.values() if x['mode'] == parser.mode and 'path' not in x]
# 'path' as they should have been caught beforehand
try:
feed = parser(data)
feed = parser(data, encoding=encoding)
except (ValueError):
# parsing did not work
@@ -112,7 +125,7 @@ def parse(data, url=None, mimetype=None):
class ParserBase(object):
def __init__(self, data=None, rules=None, parent=None):
def __init__(self, data=None, rules=None, parent=None, encoding=None):
if rules is None:
rules = parse_rules()[self.default_ruleset]
@@ -121,9 +134,10 @@ class ParserBase(object):
if data is None:
data = rules['base']
self.root = self.parse(data)
self.parent = parent
self.encoding = encoding
self.root = self.parse(data)
def parse(self, raw):
pass
@@ -148,15 +162,15 @@ class ParserBase(object):
c = csv.writer(out, dialect=csv.excel)
for item in self.items:
row = [getattr(item, x) for x in item.dic]
if encoding != 'unicode':
row = [x.encode(encoding) if isinstance(x, unicode) else x for x in row]
c.writerow(row)
c.writerow([getattr(item, x) for x in item.dic])
out.seek(0)
return out.read()
out = out.read()
if encoding != 'unicode':
out = out.encode(encoding)
return out
def tohtml(self, **k):
return self.convert(FeedHTML).tostring(**k)
@@ -267,7 +281,14 @@ class ParserBase(object):
except AttributeError:
# does not exist, have to create it
try:
self.rule_create(self.rules[rule_name])
except AttributeError:
# no way to create it, give up
pass
else:
self.rule_set(self.rules[rule_name], value)
def rmv(self, rule_name):
@@ -286,10 +307,7 @@ class ParserXML(ParserBase):
NSMAP = {'atom': 'http://www.w3.org/2005/Atom',
'atom03': 'http://purl.org/atom/ns#',
'media': 'http://search.yahoo.com/mrss/',
'rdf': 'http://www.w3.org/1999/02/22-rdf-syntax-ns#',
'slash': 'http://purl.org/rss/1.0/modules/slash/',
'dc': 'http://purl.org/dc/elements/1.1/',
'content': 'http://purl.org/rss/1.0/modules/content/',
'rssfake': 'http://purl.org/rss/1.0/'}
@@ -401,13 +419,14 @@ class ParserXML(ParserBase):
else:
if html_rich:
# atom stuff
if 'atom' in rule:
match.attrib['type'] = 'xhtml'
self._clean_node(match)
match.append(lxml.html.fragment_fromstring(value, create_parent='div'))
match.find('div').drop_tag()
if self.rules['mode'] == 'html':
match.find('div').drop_tag() # not supported by lxml.etree
else: # i.e. if atom
match.attrib['type'] = 'xhtml'
else:
if match is not None and len(match):
@@ -440,8 +459,7 @@ class ParserHTML(ParserXML):
mimetype = ['text/html', 'application/xhtml+xml']
def parse(self, raw):
parser = etree.HTMLParser(remove_blank_text=True) # remove_blank_text needed for pretty_print
return etree.fromstring(raw, parser)
return html_parse(raw, encoding=self.encoding)
def tostring(self, encoding='unicode', **k):
return lxml.html.tostring(self.root, encoding=encoding, **k)
@@ -467,6 +485,9 @@ class ParserHTML(ParserXML):
element = deepcopy(match)
match.getparent().append(element)
else:
raise AttributeError('no way to create item')
def parse_time(value):
if value is None or value == 0:
@@ -474,13 +495,13 @@ def parse_time(value):
elif isinstance(value, basestring):
if re.match(r'^[0-9]+$', value):
return datetime.fromtimestamp(int(value), tz.UTC)
return datetime.fromtimestamp(int(value), tz.tzutc())
else:
return dateutil.parser.parse(value)
return dateutil.parser.parse(value).replace(tzinfo=tz.tzutc())
elif isinstance(value, int):
return datetime.fromtimestamp(value, tz.UTC)
return datetime.fromtimestamp(value, tz.tzutc())
elif isinstance(value, datetime):
return value
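
The timestamp branches of `parse_time` can be illustrated with the standard library alone. This is a sketch under stated assumptions: `datetime.timezone.utc` stands in for `dateutil`'s `tz.tzutc()`, the helper name `to_utc_datetime` is hypothetical, returning `None` for empty values is assumed, and free-form date strings (handled by `dateutil.parser.parse` in the diff) are left out.

```python
import re
from datetime import datetime, timezone


def to_utc_datetime(value):
    """Hypothetical helper: Unix timestamp (int or digit string) -> aware UTC datetime."""
    if value is None or value == 0:
        return None  # assumption: empty values yield no date
    if isinstance(value, str) and re.match(r'^[0-9]+$', value):
        value = int(value)
    if isinstance(value, int):
        # Passing a tzinfo makes the result timezone-aware instead of naive local time.
        return datetime.fromtimestamp(value, timezone.utc)
    if isinstance(value, datetime):
        return value
    raise ValueError('unsupported time value: %r' % (value,))


print(to_utc_datetime('86400').isoformat())  # 1970-01-02T00:00:00+00:00
```

Note that in the diff the string branch calls `.replace(tzinfo=tz.tzutc())` on the parsed date, which labels the result as UTC without converting it, so any offset present in the input string is overwritten rather than applied.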

@@ -1,9 +1,10 @@
import sys
import os
import os.path
import time
import threading
import time
from datetime import datetime
from dateutil import tz
from fnmatch import fnmatch
import re
@@ -12,39 +13,38 @@ import lxml.etree
import lxml.html
from . import feeds
from . import feedify
from . import crawler
from . import readabilite
import wsgiref.simple_server
import wsgiref.handlers
import cgitb
try:
# python 2
from Queue import Queue
from httplib import HTTPException
from urllib import quote_plus
from urllib import unquote
from urlparse import urlparse, urljoin, parse_qs
except ImportError:
# python 3
from queue import Queue
from http.client import HTTPException
from urllib.parse import quote_plus
from urllib.parse import unquote
from urllib.parse import urlparse, urljoin, parse_qs
LIM_ITEM = 100 # deletes what's beyond
LIM_TIME = 7 # deletes what's after
MAX_ITEM = 50 # cache-only beyond
MAX_TIME = 7 # cache-only after (in sec)
MAX_ITEM = 5 # cache-only beyond
MAX_TIME = 2 # cache-only after (in sec)
LIM_ITEM = 10 # deletes what's beyond
LIM_TIME = 2.5 # deletes what's after
DELAY = 10 * 60 # xml cache & ETag cache (in sec)
TIMEOUT = 4 # http timeout (in sec)
THREADS = 10 # number of threads (1 for single-threaded)
DEBUG = False
PORT = 8080
PROTOCOL = ['http', 'https', 'ftp']
PROTOCOL = ['http', 'https']
def filterOptions(options):
@@ -52,7 +52,7 @@ def filterOptions(options):
# example of filtering code below
#allowed = ['proxy', 'clip', 'keep', 'cache', 'force', 'silent', 'pro', 'debug']
#allowed = ['proxy', 'clip', 'cache', 'force', 'silent', 'pro', 'debug']
#filtered = dict([(key,value) for (key,value) in options.items() if key in allowed])
#return filtered
@@ -66,6 +66,7 @@ def log(txt, force=False):
if DEBUG or force:
if 'REQUEST_URI' in os.environ:
open('morss.log', 'a').write("%s\n" % repr(txt))
else:
print(repr(txt))
@@ -73,6 +74,7 @@ def log(txt, force=False):
def len_html(txt):
if len(txt):
return len(lxml.html.fromstring(txt).text_content())
else:
return 0
@@ -80,6 +82,7 @@ def len_html(txt):
def count_words(txt):
if len(txt):
return len(lxml.html.fromstring(txt).text_content().split())
return 0
@@ -88,12 +91,14 @@ class Options:
if len(args):
self.options = args
self.options.update(options or {})
else:
self.options = options or {}
def __getattr__(self, key):
if key in self.options:
return self.options[key]
else:
return False
@@ -107,17 +112,23 @@ class Options:
def parseOptions(options):
""" Turns ['md=True'] into {'md':True} """
out = {}
for option in options:
split = option.split('=', 1)
if len(split) > 1:
if split[0].lower() == 'true':
out[split[0]] = True
elif split[0].lower() == 'false':
out[split[0]] = False
else:
out[split[0]] = split[1]
else:
out[split[0]] = True
return out
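
The behaviour described by the `parseOptions` docstring can be sketched as a standalone function. The name `parse_options` and the use of `str.partition` are choices made for this illustration, not the project's code; the point shown is that the text after `=` is what gets compared against `true`/`false`, as the docstring's `'md=True'` example implies.

```python
def parse_options(options):
    """Turn ['md=True', 'debug'] into {'md': True, 'debug': True} (illustrative sketch)."""
    out = {}
    for option in options:
        # Split on the first '=' only, so values may themselves contain '='.
        key, sep, value = option.partition('=')
        if not sep:
            out[key] = True  # bare flag, e.g. 'debug'
        elif value.lower() == 'true':
            out[key] = True
        elif value.lower() == 'false':
            out[key] = False
        else:
            out[key] = value
    return out


print(parse_options(['md=True', 'debug', 'search=a=b', 'cors=false']))
```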
@@ -125,7 +136,7 @@ def ItemFix(item, feedurl='/'):
""" Improves feed items (absolute links, resolve feedburner links, etc) """
# check unwanted uppercase title
if len(item.title) > 20 and item.title.isupper():
if item.title is not None and len(item.title) > 20 and item.title.isupper():
item.title = item.title.title()
# check if it includes link
@@ -158,6 +169,11 @@ def ItemFix(item, feedurl='/'):
item.link = parse_qs(urlparse(item.link).query)['url'][0]
log(item.link)
# pocket
if fnmatch(item.link, 'https://getpocket.com/redirect?url=*'):
item.link = parse_qs(urlparse(item.link).query)['url'][0]
log(item.link)
# facebook
if fnmatch(item.link, 'https://www.facebook.com/l.php?u=*'):
item.link = parse_qs(urlparse(item.link).query)['u'][0]
@@ -183,7 +199,7 @@ def ItemFix(item, feedurl='/'):
# reddit
if urlparse(feedurl).netloc == 'www.reddit.com':
match = lxml.html.fromstring(item.desc).xpath('//a[text()="[link]"]/@href')
match = lxml.html.fromstring(item.content).xpath('//a[text()="[link]"]/@href')
if len(match):
item.link = match[0]
log(item.link)
@@ -208,6 +224,7 @@ def ItemFill(item, options, feedurl='/', fast=False):
if len(match):
link = match[0]
log(link)
else:
link = None
@@ -217,6 +234,7 @@ def ItemFill(item, options, feedurl='/', fast=False):
if len(match) and urlparse(match[0]).netloc != 'www.facebook.com':
link = match[0]
log(link)
else:
link = None
@@ -232,19 +250,17 @@ def ItemFill(item, options, feedurl='/', fast=False):
delay = -2
try:
con = crawler.custom_handler('html', False, delay, options.encoding).open(link, timeout=TIMEOUT)
data = con.read()
data, con, contenttype, encoding = crawler.adv_get(url=link, delay=delay, timeout=TIMEOUT)
except (IOError, HTTPException) as e:
log('http error')
return False # when in cache mode, just drop items whose fetch failed
contenttype = con.info().get('Content-Type', '').split(';')[0]
if contenttype not in crawler.MIMETYPE['html'] and contenttype != 'text/plain':
log('non-text page')
return True
out = readabilite.get_article(data, link, options.encoding or crawler.detect_encoding(data, con))
out = readabilite.get_article(data, url=con.geturl(), encoding=encoding)
if out is not None:
item.content = out
@@ -268,9 +284,6 @@ def ItemAfter(item, options):
item.content = item.desc + "<br/><br/><center>* * *</center><br/><br/>" + item.content
del item.desc
if not options.keep and not options.proxy:
del item.desc
if options.nolink and item.content:
content = lxml.html.fromstring(item.content)
for link in content.xpath('//a'):
@@ -284,27 +297,23 @@ def ItemAfter(item, options):
return item
def FeedFetch(url, options):
# basic url clean-up
def UrlFix(url):
if url is None:
raise MorssException('No url provided')
if isinstance(url, bytes):
url = url.decode()
if urlparse(url).scheme not in PROTOCOL:
url = 'http://' + url
log(url)
url = url.replace(' ', '%20')
if isinstance(url, bytes):
url = url.decode()
return url
# allow for code execution for feedify
pre = feedify.pre_worker(url)
if pre:
url = pre
log('url redirect')
log(url)
def FeedFetch(url, options):
# fetch feed
delay = DELAY
@@ -312,35 +321,34 @@ def FeedFetch(url, options):
delay = 0
try:
con = crawler.custom_handler(accept='xml', strict=True, delay=delay,
encoding=options.encoding, basic=not options.items) \
.open(url, timeout=TIMEOUT * 2)
xml = con.read()
xml, con, contenttype, encoding = crawler.adv_get(url=url, follow='rss', delay=delay, timeout=TIMEOUT * 2)
except (IOError, HTTPException):
raise MorssException('Error downloading feed')
contenttype = con.info().get('Content-Type', '').split(';')[0]
if options.items:
# using custom rules
rss = feeds.FeedHTML(xml, url, contenttype)
feed.rule
rss = feeds.FeedHTML(xml, encoding=encoding)
rss.rules['title'] = options.title if options.title else '//head/title'
rss.rules['desc'] = options.desc if options.desc else '//head/meta[@name="description"]/@content'
rss.rules['items'] = options.items
if options.item_title:
rss.rules['item_title'] = options.item_title
if options.item_link:
rss.rules['item_link'] = options.item_link
rss.rules['item_title'] = options.item_title if options.item_title else './/a|.'
rss.rules['item_link'] = options.item_link if options.item_link else './@href|.//a/@href'
if options.item_content:
rss.rules['item_content'] = options.item_content
if options.item_time:
rss.rules['item_time'] = options.item_time
rss = rss.convert(feeds.FeedXML)
else:
try:
rss = feeds.parse(xml, url, contenttype)
rss = feeds.parse(xml, url, contenttype, encoding=encoding)
rss = rss.convert(feeds.FeedXML)
# contains all fields, otherwise much-needed data can be lost
@@ -361,34 +369,22 @@ def FeedGather(rss, url, options):
lim_time = LIM_TIME
max_item = MAX_ITEM
max_time = MAX_TIME
threads = THREADS
if options.cache:
max_time = 0
if options.mono:
threads = 1
# set
def runner(queue):
while True:
value = queue.get()
try:
worker(*value)
except Exception as e:
log('Thread Error: %s' % e.message)
queue.task_done()
def worker(i, item):
now = datetime.now(tz.tzutc())
sorted_items = sorted(rss.items, key=lambda x:x.updated or x.time or now, reverse=True)
for i, item in enumerate(sorted_items):
if time.time() - start_time > lim_time >= 0 or i + 1 > lim_item >= 0:
log('dropped')
item.remove()
return
continue
item = ItemBefore(item, options)
if item is None:
return
continue
item = ItemFix(item, url)
@@ -396,7 +392,7 @@ def FeedGather(rss, url, options):
if not options.proxy:
if ItemFill(item, options, url, True) is False:
item.remove()
return
continue
else:
if not options.proxy:
@@ -404,22 +400,6 @@ def FeedGather(rss, url, options):
item = ItemAfter(item, options)
queue = Queue()
for i in range(threads):
t = threading.Thread(target=runner, args=(queue,))
t.daemon = True
t.start()
for i, item in enumerate(list(rss.items)):
if threads == 1:
worker(*[i, item])
else:
queue.put([i, item])
if threads != 1:
queue.join()
if options.ad:
new = rss.items.append()
new.title = "Are you hungry?"
@@ -433,37 +413,38 @@ def FeedGather(rss, url, options):
return rss
def FeedFormat(rss, options):
def FeedFormat(rss, options, encoding='utf-8'):
if options.callback:
if re.match(r'^[a-zA-Z0-9\.]+$', options.callback) is not None:
return '%s(%s)' % (options.callback, rss.tojson())
out = '%s(%s)' % (options.callback, rss.tojson(encoding='unicode'))
return out if encoding == 'unicode' else out.encode(encoding)
else:
raise MorssException('Invalid callback var name')
elif options.json:
if options.indent:
return rss.tojson(encoding='UTF-8', indent=4)
return rss.tojson(encoding=encoding, indent=4)
else:
return rss.tojson(encoding='UTF-8')
return rss.tojson(encoding=encoding)
elif options.csv:
return rss.tocsv(encoding='UTF-8')
return rss.tocsv(encoding=encoding)
elif options.reader:
elif options.html:
if options.indent:
return rss.tohtml(encoding='UTF-8', pretty_print=True)
return rss.tohtml(encoding=encoding, pretty_print=True)
else:
return rss.tohtml(encoding='UTF-8')
return rss.tohtml(encoding=encoding)
else:
if options.indent:
return rss.torss(xml_declaration=True, encoding='UTF-8', pretty_print=True)
return rss.torss(xml_declaration=(not encoding == 'unicode'), encoding=encoding, pretty_print=True)
else:
return rss.torss(xml_declaration=True, encoding='UTF-8')
return rss.torss(xml_declaration=(not encoding == 'unicode'), encoding=encoding)
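The callback branch of `FeedFormat` above validates the JSONP callback name, wraps the JSON, and returns text or bytes depending on `encoding`. Here is a standalone sketch of that idea, assuming a plain `json.dumps` payload in place of `rss.tojson()`. `jsonp_wrap` is a hypothetical helper, not part of morss.

```python
import json
import re


def jsonp_wrap(callback, payload, encoding='utf-8'):
    """Wrap a JSON payload in a JSONP call after validating the callback name."""
    # Only letters, digits and dots are accepted, so the name cannot carry script code
    if re.match(r'^[a-zA-Z0-9\.]+$', callback) is None:
        raise ValueError('Invalid callback var name')
    out = '%s(%s)' % (callback, json.dumps(payload))
    # 'unicode' is the sentinel the hunk uses for "return text, not bytes"
    return out if encoding == 'unicode' else out.encode(encoding)


print(jsonp_wrap('my.cb', {'a': 1}, 'unicode'))
```

The CLI path in this diff passes `'unicode'` so that `print` receives text, while the WSGI path keeps the default and gets bytes.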
def process(url, cache=None, options=None):
@@ -475,14 +456,16 @@ def process(url, cache=None, options=None):
if cache:
crawler.default_cache = crawler.SQLiteCache(cache)
url = UrlFix(url)
rss = FeedFetch(url, options)
rss = FeedGather(rss, url, options)
return FeedFormat(rss, options)
def cgi_app(environ, start_response):
def cgi_parse_environ(environ):
# get options
if 'REQUEST_URI' in environ:
url = environ['REQUEST_URI'][1:]
else:
@@ -496,7 +479,7 @@ def cgi_app(environ, start_response):
if url.startswith(':'):
split = url.split('/', 1)
options = split[0].replace('|', '/').replace('\\\'', '\'').split(':')[1:]
raw_options = unquote(split[0]).replace('|', '/').replace('\\\'', '\'').split(':')[1:]
if len(split) > 1:
url = split[1]
@@ -504,15 +487,22 @@ def cgi_app(environ, start_response):
url = ''
else:
options = []
raw_options = []
# init
options = Options(filterOptions(parseOptions(options)))
headers = {}
options = Options(filterOptions(parseOptions(raw_options)))
global DEBUG
DEBUG = options.debug
return (url, options)
def cgi_app(environ, start_response):
url, options = cgi_parse_environ(environ)
headers = {}
# headers
headers['status'] = '200 OK'
headers['cache-control'] = 'max-age=%s' % DELAY
@@ -520,7 +510,7 @@ def cgi_app(environ, start_response):
if options.cors:
headers['access-control-allow-origin'] = '*'
if options.html or options.reader:
if options.html:
headers['content-type'] = 'text/html'
elif options.txt or options.silent:
headers['content-type'] = 'text/plain'
@@ -534,9 +524,12 @@ def cgi_app(environ, start_response):
else:
headers['content-type'] = 'text/xml'
headers['content-type'] += '; charset=utf-8'
crawler.default_cache = crawler.SQLiteCache(os.path.join(os.getcwd(), 'morss-cache.db'))
# get the work done
url = UrlFix(url)
rss = FeedFetch(url, options)
if headers['content-type'] == 'text/xml':
@@ -547,18 +540,42 @@ def cgi_app(environ, start_response):
rss = FeedGather(rss, url, options)
out = FeedFormat(rss, options)
if not options.silent:
return out
if options.silent:
return ['']
else:
return [out]
def cgi_wrapper(environ, start_response):
# simple http server for html and css
def middleware(func):
" Decorator to turn a function into a wsgi middleware "
# This is called when parsing the "@middleware" code
def app_builder(app):
# This is called when doing app = cgi_wrapper(app)
def app_wrap(environ, start_response):
# This is called when an HTTP request is being processed
return func(environ, start_response, app)
return app_wrap
return app_builder
@middleware
def cgi_file_handler(environ, start_response, app):
" Simple HTTP server to serve static files (.html, .css, etc.) "
files = {
'': 'text/html',
'index.html': 'text/html'}
'index.html': 'text/html',
'sheet.xsl': 'text/xsl'}
if 'REQUEST_URI' in environ:
url = environ['REQUEST_URI'][1:]
else:
url = environ['PATH_INFO'][1:]
@@ -568,12 +585,10 @@ def cgi_wrapper(environ, start_response):
if url == '':
url = 'index.html'
if '--root' in sys.argv[1:]:
path = os.path.join(sys.argv[-1], url)
else:
path = url
paths = [os.path.join(sys.prefix, 'share/morss/www', url),
os.path.join(os.path.dirname(__file__), '../www', url)]
for path in paths:
try:
body = open(path, 'rb').read()
@@ -583,20 +598,95 @@ def cgi_wrapper(environ, start_response):
return [body]
except IOError:
continue
else:
# the for loop did not return, so here we are, i.e. no file found
headers['status'] = '404 Not found'
start_response(headers['status'], list(headers.items()))
return ['Error %s' % headers['status']]
# actual morss use
else:
return app(environ, start_response)
def cgi_get(environ, start_response):
url, options = cgi_parse_environ(environ)
# get page
PROTOCOL = ['http', 'https']
if urlparse(url).scheme not in ['http', 'https']:
url = 'http://' + url
data, con, contenttype, encoding = crawler.adv_get(url=url, timeout=TIMEOUT)
if contenttype in ['text/html', 'application/xhtml+xml', 'application/xml']:
if options.get == 'page':
html = readabilite.parse(data, encoding=encoding)
html.make_links_absolute(con.geturl())
kill_tags = ['script', 'iframe', 'noscript']
for tag in kill_tags:
for elem in html.xpath('//'+tag):
elem.getparent().remove(elem)
output = lxml.etree.tostring(html.getroottree(), encoding='utf-8')
elif options.get == 'article':
output = readabilite.get_article(data, url=con.geturl(), encoding=encoding, debug=options.debug)
else:
raise MorssException('no :get option passed')
else:
output = data
# return html page
headers = {'status': '200 OK', 'content-type': 'text/html; charset=utf-8'}
start_response(headers['status'], list(headers.items()))
return [output]
dispatch_table = {
'get': cgi_get,
}
@middleware
def cgi_dispatcher(environ, start_response, app):
url, options = cgi_parse_environ(environ)
for key in dispatch_table.keys():
if key in options:
return dispatch_table[key](environ, start_response)
return app(environ, start_response)
@middleware
def cgi_error_handler(environ, start_response, app):
try:
return [cgi_app(environ, start_response) or '(empty)']
return app(environ, start_response)
except (KeyboardInterrupt, SystemExit):
raise
except Exception as e:
headers = {'status': '500 Oops', 'content-type': 'text/plain'}
headers = {'status': '500 Oops', 'content-type': 'text/html'}
start_response(headers['status'], list(headers.items()), sys.exc_info())
log('ERROR <%s>: %s' % (url, e.message), force=True)
return ['An error happened:\n%s' % e.message]
log('ERROR: %s' % repr(e), force=True)
return [cgitb.html(sys.exc_info())]
@middleware
def cgi_encode(environ, start_response, app):
out = app(environ, start_response)
return [x if isinstance(x, bytes) else str(x).encode('utf-8') for x in out]
cgi_standalone_app = cgi_encode(cgi_error_handler(cgi_dispatcher(cgi_file_handler(cgi_app))))
def cli_app():
@@ -608,12 +698,13 @@ def cli_app():
crawler.default_cache = crawler.SQLiteCache(os.path.expanduser('~/.cache/morss-cache.db'))
url = UrlFix(url)
rss = FeedFetch(url, options)
rss = FeedGather(rss, url, options)
out = FeedFormat(rss, options)
out = FeedFormat(rss, options, 'unicode')
if not options.silent:
print(out.decode('utf-8', 'replace') if isinstance(out, bytes) else out)
print(out)
log('done')
@@ -622,6 +713,7 @@ def isInt(string):
try:
int(string)
return True
except ValueError:
return False
@@ -629,31 +721,46 @@ def isInt(string):
def main():
if 'REQUEST_URI' in os.environ:
# mod_cgi
wsgiref.handlers.CGIHandler().run(cgi_wrapper)
elif len(sys.argv) <= 1 or isInt(sys.argv[1]) or '--root' in sys.argv[1:]:
app = cgi_app
app = cgi_dispatcher(app)
app = cgi_error_handler(app)
app = cgi_encode(app)
wsgiref.handlers.CGIHandler().run(app)
elif len(sys.argv) <= 1 or isInt(sys.argv[1]):
# start internal (basic) http server
if len(sys.argv) > 1 and isInt(sys.argv[1]):
argPort = int(sys.argv[1])
if argPort > 0:
port = argPort
else:
raise MorssException('Port must be positive integer')
else:
port = PORT
app = cgi_app
app = cgi_file_handler(app)
app = cgi_dispatcher(app)
app = cgi_error_handler(app)
app = cgi_encode(app)
print('Serving http://localhost:%s/' % port)
httpd = wsgiref.simple_server.make_server('', port, cgi_wrapper)
httpd = wsgiref.simple_server.make_server('', port, app)
httpd.serve_forever()
else:
# as a CLI app
try:
cli_app()
except (KeyboardInterrupt, SystemExit):
raise
except Exception as e:
print('ERROR: %s' % e.message)
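The `middleware` decorator and the `app = ...(app)` chain in `main()` above follow a common WSGI wrapping pattern. The sketch below reproduces it in isolation. `hello_app`, `encode` and `error_handler` are hypothetical stand-ins for `cgi_app`, `cgi_encode` and `cgi_error_handler`, and the fake `start_response` is only there to capture the status.

```python
def middleware(func):
    """Turn func(environ, start_response, app) into a WSGI middleware factory."""
    def app_builder(app):
        def app_wrap(environ, start_response):
            return func(environ, start_response, app)
        return app_wrap
    return app_builder


def hello_app(environ, start_response):
    # Hypothetical innermost app; it returns text chunks, not bytes
    start_response('200 OK', [('content-type', 'text/plain; charset=utf-8')])
    return ['hello ', environ.get('PATH_INFO', '')]


@middleware
def encode(environ, start_response, app):
    # WSGI bodies must be bytes, so encode whatever text the inner app returned
    return [x if isinstance(x, bytes) else str(x).encode('utf-8')
            for x in app(environ, start_response)]


@middleware
def error_handler(environ, start_response, app):
    try:
        return app(environ, start_response)
    except Exception as e:
        start_response('500 Oops', [('content-type', 'text/plain')])
        return ['ERROR: %r' % e]


# Outermost wrapper is applied last, as in main()
app = encode(error_handler(hello_app))

statuses = []
body = app({'PATH_INFO': '/feed'},
           lambda status, headers, exc_info=None: statuses.append(status))
print(statuses, body)
```

Because `encode` is outermost, the text returned by the error handler is also converted to bytes before it reaches the server.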


@@ -1,13 +1,17 @@
import lxml.etree
import lxml.html
from bs4 import BeautifulSoup
import re
def parse(data, encoding=None):
if encoding:
parser = lxml.html.HTMLParser(remove_blank_text=True, remove_comments=True, encoding=encoding)
data = BeautifulSoup(data, 'lxml', from_encoding=encoding).prettify('utf-8')
else:
parser = lxml.html.HTMLParser(remove_blank_text=True, remove_comments=True)
data = BeautifulSoup(data, 'lxml').prettify('utf-8')
parser = lxml.html.HTMLParser(remove_blank_text=True, remove_comments=True, encoding='utf-8')
return lxml.html.fromstring(data, parser=parser)
@@ -43,6 +47,12 @@ def count_content(node):
return count_words(node.text_content()) + len(node.findall('.//img'))
def percentile(N, P):
# https://stackoverflow.com/a/7464107
n = max(int(round(P * len(N) + 0.5)), 2)
return N[n-2]
class_bad = ['comment', 'community', 'extra', 'foot',
'sponsor', 'pagination', 'pager', 'tweet', 'twitter', 'com-', 'masthead',
'media', 'meta', 'related', 'shopping', 'tags', 'tool', 'author', 'about',
@@ -62,7 +72,7 @@ regex_good = re.compile('|'.join(class_good), re.I)
tags_junk = ['script', 'head', 'iframe', 'object', 'noscript',
'param', 'embed', 'layer', 'applet', 'style', 'form', 'input', 'textarea',
'button', 'footer']
'button', 'footer', 'link', 'meta']
tags_bad = tags_junk + ['a', 'aside']
@@ -93,10 +103,18 @@ def score_node(node):
class_id = node.get('class', '') + node.get('id', '')
if (isinstance(node, lxml.html.HtmlComment)
or node.tag in tags_bad
or regex_bad.search(class_id)):
or isinstance(node, lxml.html.HtmlProcessingInstruction)):
return 0
if node.tag in tags_junk:
score += -1 # actually -2, as tags_junk is included in tags_bad
if node.tag in tags_bad:
score += -1
if regex_bad.search(class_id):
score += -1
if node.tag in tags_good:
score += 4
@@ -114,33 +132,42 @@ def score_node(node):
return score
def score_all(node, grades=None):
def score_all(node):
" Fairly dumb loop to score all worthwhile nodes. Tries to be fast "
if grades is None:
grades = {}
for child in node:
score = score_node(child)
child.attrib['seen'] = 'yes, ' + str(int(score))
child.attrib['morss_own_score'] = str(float(score))
if score > 0:
spread_score(child, score, grades)
score_all(child, grades)
return grades
if score > 0 or len(list(child.iterancestors())) <= 2:
spread_score(child, score)
score_all(child)
def spread_score(node, score, grades):
def set_score(node, value):
node.attrib['morss_score'] = str(float(value))
def get_score(node):
return float(node.attrib.get('morss_score', 0))
def incr_score(node, delta):
set_score(node, get_score(node) + delta)
def get_all_scores(node):
return {x:get_score(x) for x in list(node.iter()) if get_score(x) != 0}
def spread_score(node, score):
" Spread the node's score to its parents, in a linear way "
delta = score / 2
for ancestor in [node,] + list(node.iterancestors()):
if score >= 1 or ancestor is node:
try:
grades[ancestor] += score
except KeyError:
grades[ancestor] = score
incr_score(ancestor, score)
score -= delta
@@ -148,26 +175,24 @@ def spread_score(node, score, grades):
break
def write_score_all(root, grades):
" Useful for debugging "
for node in root.iter():
node.attrib['score'] = str(int(grades.get(node, 0)))
def clean_root(root):
def clean_root(root, keep_threshold=None):
for node in list(root):
clean_root(node)
clean_node(node)
# bottom-up approach, i.e. starting with children before cleaning current node
clean_root(node, keep_threshold)
clean_node(node, keep_threshold)
def clean_node(node):
def clean_node(node, keep_threshold=None):
parent = node.getparent()
if parent is None:
# this is <html/> (or a removed element waiting for GC)
return
if keep_threshold is not None and get_score(node) >= keep_threshold:
# high score, so keep
return
gdparent = parent.getparent()
# remove shitty tags
@@ -266,41 +291,45 @@ def lowest_common_ancestor(nodeA, nodeB, max_depth=None):
return nodeA # should always find one though, at least <html/>, but needed for max_depth
def rank_nodes(grades):
def rank_grades(grades):
# largest score to smallest
return sorted(grades.items(), key=lambda x: x[1], reverse=True)
def get_best_node(grades):
def get_best_node(ranked_grades):
" To pick the best (raw) node. Another function will clean it "
if len(grades) == 1:
return grades[0]
if len(ranked_grades) == 1:
return ranked_grades[0]
top = rank_nodes(grades)
lowest = lowest_common_ancestor(top[0][0], top[1][0], 3)
lowest = lowest_common_ancestor(ranked_grades[0][0], ranked_grades[1][0], 3)
return lowest
def get_article(data, url=None, encoding=None):
def get_article(data, url=None, encoding=None, debug=False):
" Input a raw html string, returns a raw html string of the article "
html = parse(data, encoding)
scores = score_all(html)
score_all(html)
scores = rank_grades(get_all_scores(html))
if not len(scores):
return None
best = get_best_node(scores)
if not debug:
keep_threshold = percentile([x[1] for x in scores], 0.1)
clean_root(best, keep_threshold)
wc = count_words(best.text_content())
wca = count_words(' '.join([x.text_content() for x in best.findall('.//a')]))
if wc - wca < 50 or float(wca) / wc > 0.3:
if not debug and (wc - wca < 50 or float(wca) / wc > 0.3):
return None
if url:
best.make_links_absolute(url)
clean_root(best)
return lxml.etree.tostring(best, pretty_print=True)
return lxml.etree.tostring(best if not debug else html, pretty_print=True)
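The `spread_score` hunk above adds a node's score to the node itself and passes a shrinking share up to its ancestors. The sketch below shows the arithmetic without lxml. The `Node` class is a hypothetical stand-in for an lxml element, and the `else: break` branch is an assumption about the lines elided between the two hunks.

```python
class Node:
    """Hypothetical stand-in for an lxml element: a parent link and a score."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.score = 0.0

    def iterancestors(self):
        # Yield parent, grandparent, ... up to the root
        node = self.parent
        while node is not None:
            yield node
            node = node.parent


def spread_score(node, score):
    """Add score to node, then a share reduced by score/2 per ancestor level."""
    delta = score / 2
    for ancestor in [node] + list(node.iterancestors()):
        if score >= 1 or ancestor is node:
            ancestor.score += score
            score -= delta
        else:
            # Assumed: stop climbing once the remaining share drops below 1
            break


html = Node('html')
body = Node('body', html)
div = Node('div', body)
p = Node('p', div)

spread_score(p, 4)
print([(n.name, n.score) for n in (p, div, body, html)])
```

With a score of 4, the paragraph receives 4 and its parent 2; the remaining share is 0, so higher ancestors are left untouched.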


@@ -1,210 +0,0 @@
@require(feed)
<!DOCTYPE html>
<html>
<head>
<title>@feed.title &#8211; via morss</title>
<meta charset="UTF-8" />
<meta name="description" content="@feed.desc (via morss)" />
<meta name="viewport" content="width=device-width; initial-scale=1.0; maximum-scale=1.0;" />
<style type="text/css">
/* columns - from https://thisisdallas.github.io/Simple-Grid/simpleGrid.css */
* {
box-sizing: border-box;
}
#content {
width: 100%;
max-width: 1140px;
min-width: 755px;
margin: 0 auto;
overflow: hidden;
padding-top: 20px;
padding-left: 20px; /* grid-space to left */
padding-right: 0px; /* grid-space to right: (grid-space-left - column-space) e.g. 20px-20px=0 */
}
.item {
width: 33.33%;
float: left;
padding-right: 20px; /* column-space */
}
@@media handheld, only screen and (max-width: 767px) { /* @@ to escape from the template engine */
#content {
width: 100%;
min-width: 0;
margin-left: 0px;
margin-right: 0px;
padding-left: 20px; /* grid-space to left */
padding-right: 10px; /* grid-space to right: (grid-space-left - column-space) e.g. 20px-10px=10px */
}
.item {
width: auto;
float: none;
margin-left: 0px;
margin-right: 0px;
margin-top: 10px;
margin-bottom: 10px;
padding-left: 0px;
padding-right: 10px; /* column-space */
}
}
/* design */
#header h1, #header h2, #header p {
font-family: sans;
text-align: center;
margin: 0;
padding: 0;
}
#header h1 {
font-size: 2.5em;
font-weight: bold;
padding: 1em 0 0.25em;
}
#header h2 {
font-size: 1em;
font-weight: normal;
}
#header p {
color: gray;
font-style: italic;
font-size: 0.75em;
}
#content {
text-align: justify;
}
.item .title {
font-weight: bold;
display: block;
text-align: center;
}
.item .link {
color: inherit;
text-decoration: none;
}
.item:not(.active) {
cursor: pointer;
height: 20em;
margin-bottom: 20px;
overflow: hidden;
text-overflow: ellipsis;
padding: 0.25em;
position: relative;
}
.item:not(.active) .title {
padding-bottom: 0.1em;
margin-bottom: 0.1em;
border-bottom: 1px solid silver;
}
.item:not(.active):before {
content: " ";
display: block;
width: 100%;
position: absolute;
top: 18.5em;
height: 1.5em;
background: linear-gradient(to bottom, rgba(255,255,255,0) 0%, rgba(255,255,255,1) 100%);
}
.item:not(.active) .article * {
max-width: 100%;
font-size: 1em !important;
font-weight: normal;
display: inline;
margin: 0;
}
.item.active {
background: white;
position: fixed;
overflow: auto;
top: 0;
left: 0;
height: 100%;
width: 100%;
z-index: 1;
}
body.noscroll {
overflow: hidden;
}
.item.active > * {
max-width: 700px;
margin: auto;
}
.item.active .title {
font-size: 2em;
padding: 0.5em 0;
}
.item.active .article object,
.item.active .article video,
.item.active .article audio {
display: none;
}
.item.active .article img {
max-height: 20em;
max-width: 100%;
}
</style>
</head>
<body>
<div id="header">
<h1>@feed.title</h1>
@if feed.desc:
<h2>@feed.desc</h2>
@end
<p>- via morss</p>
</div>
<div id="content">
@for item in feed.items:
<div class="item">
@if item.link:
<a class="title link" href="@item.link" target="_blank">@item.title</a>
@else:
<span class="title">@item.title</span>
@end
<div class="article">
@if item.content:
@item.content
@else:
@item.desc
@end
</div>
</div>
@end
</div>
<script>
var items = document.getElementsByClassName('item')
for (var i in items)
items[i].onclick = function()
{
this.classList.toggle('active')
document.body.classList.toggle('noscroll')
}
</script>
</body>
</html>


@@ -1,4 +0,0 @@
lxml
python-dateutil <= 1.5
chardet
pymysql


@@ -1,14 +1,24 @@
from setuptools import setup, find_packages
from setuptools import setup
from glob import glob
package_name = 'morss'
setup(
name = package_name,
description = 'Get full-text RSS feeds',
author = 'pictuga, Samuel Marks',
author_email = 'contact at pictuga dot com',
url = 'http://morss.it/',
download_url = 'https://git.pictuga.com/pictuga/morss',
license = 'AGPL v3',
package_dir={package_name: package_name},
packages=find_packages(),
package_data={package_name: ['feedify.ini', 'reader.html.template']},
test_suite=package_name + '.tests')
packages = [package_name],
install_requires = ['lxml', 'bs4', 'python-dateutil', 'chardet', 'pymysql'],
package_data = {package_name: ['feedify.ini']},
data_files = [
('share/' + package_name, ['README.md', 'LICENSE']),
('share/' + package_name + '/www', glob('www/*.*')),
('share/' + package_name + '/www/cgi', [])
],
entry_points = {
'console_scripts': [package_name + '=' + package_name + ':main']
})


@@ -35,8 +35,8 @@
<input type="text" id="url" name="url" placeholder="Feed url (http://example.com/feed.xml)" />
</form>
<code>Copyright: pictuga 2013-2014<br/>
Source code: https://github.com/pictuga/morss</code>
<code>Copyright: pictuga 2013-2020<br/>
Source code: https://git.pictuga.com/pictuga/morss</code>
<script>
form = document.forms[0]


@@ -1,5 +1,12 @@
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="1.1" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:stylesheet version="1.1"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:atom="http://www.w3.org/2005/Atom"
xmlns:atom03="http://purl.org/atom/ns#"
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:content="http://purl.org/rss/1.0/modules/content/"
xmlns:rssfake="http://purl.org/rss/1.0/"
>
<xsl:output method="html"/>
@@ -8,11 +15,13 @@
<head>
<title>RSS feed by morss</title>
<meta name="viewport" content="width=device-width; initial-scale=1.0; maximum-scale=1.0;" />
<meta name="robots" content="noindex" />
<style type="text/css">
body {
overflow-wrap: anywhere;
word-wrap: anywhere;
font-family: sans;
}
#url {
@@ -22,37 +31,31 @@
max-width: 100%;
}
body > ul {
.item {
background-color: #FFFAF4;
border: 1px solid silver;
margin: 1%;
max-width: 100%;
}
.item > * {
padding: 1%;
}
.item > :not(:last-child) {
border-bottom: 1px solid silver;
}
.item > a {
display: block;
font-weight: bold;
font-size: 1.5em;
}
.content * {
max-width: 100%;
}
ul {
list-style-type: none;
}
.tag {
color: darkred;
}
.attr {
color: darksalmon;
}
.value {
color: darkblue;
}
.comment {
color: lightgrey;
}
pre {
margin: 0;
max-width: 100%;
white-space: normal;
}
</style>
</head>
@@ -64,59 +67,44 @@
<div id="url"></div>
<ul>
<xsl:apply-templates/>
</ul>
<hr/>
<div id="header">
<h1>
<xsl:value-of select="rdf:RDF/rssfake:channel/rssfake:title|rss/channel/title|atom:feed/atom:title|atom03:feed/atom03:title"/>
</h1>
<p>
<xsl:value-of select="rdf:RDF/rssfake:channel/rssfake:description|rss/channel/description|atom:feed/atom:subtitle|atom03:feed/atom03:subtitle"/>
</p>
</div>
<div id="content">
<xsl:for-each select="rdf:RDF/rssfake:channel/rssfake:item|rss/channel/item|atom:feed/atom:entry|atom03:feed/atom03:entry">
<div class="item">
<a href="/" target="_blank"><xsl:attribute name="href"><xsl:value-of select="rssfake:link|link|atom:link/@href|atom03:link/@href"/></xsl:attribute>
<xsl:value-of select="rssfake:title|title|atom:title|atom03:title"/>
</a>
<div class="desc">
<xsl:copy-of select="rssfake:description|description|atom:summary|atom03:summary"/>
</div>
<div class="content">
<xsl:copy-of select="content:encoded|atom:content|atom03:content"/>
</div>
</div>
</xsl:for-each>
</div>
<script>
document.getElementById("url").innerHTML = window.location.href;
document.getElementById("url").innerHTML = window.location.href.replace(/:html\/?/, '')
if (!/:html/.test(window.location.href))
for (var content of document.querySelectorAll(".desc,.content"))
content.innerHTML = content.children.children ? content.innerHTML : content.innerText
</script>
</body>
</html>
</xsl:template>
<xsl:template match="*">
<li>
<span class="element">
&lt;
<span class="tag"><xsl:value-of select="name()"/></span>
<xsl:for-each select="@*">
<span class="attr"> <xsl:value-of select="name()"/></span>
=
"<span class="value"><xsl:value-of select="."/></span>"
</xsl:for-each>
&gt;
</span>
<xsl:if test="node()">
<ul>
<xsl:apply-templates/>
</ul>
</xsl:if>
<span class="element">
&lt;/
<span class="tag"><xsl:value-of select="name()"/></span>
&gt;
</span>
</li>
</xsl:template>
<xsl:template match="comment()">
<li>
<pre class="comment"><![CDATA[<!--]]><xsl:value-of select="."/><![CDATA[-->]]></pre>
</li>
</xsl:template>
<xsl:template match="text()">
<li>
<pre>
<xsl:value-of select="normalize-space(.)"/>
</pre>
</li>
</xsl:template>
<xsl:template match="text()[not(normalize-space())]"/>
</xsl:stylesheet>