Compare commits: v1.1...9815794a97 (132 commits)
Dockerfile (new file, 8 lines)
@@ -0,0 +1,8 @@
+FROM alpine:latest
+
+RUN apk add python3 py3-lxml py3-gunicorn py3-pip git
+
+ADD . /app
+RUN pip3 install /app
+
+CMD gunicorn --bind 0.0.0.0:8080 -w 4 morss:cgi_standalone_app
README.md (68 changed lines)
@@ -24,15 +24,13 @@ hand-written rules (ie. there's no automatic detection of links to build feeds).
 Please mind that feeds based on html files may stop working unexpectedly, due to
 html structure changes on the target website.
 
-Additionally morss can grab the source xml feed of iTunes podcast, and detect
-rss feeds in html pages' `<meta>`.
+Additionally morss can detect rss feeds in html pages' `<meta>`.
 
 You can use this program online for free at **[morss.it](https://morss.it/)**.
 
 Some features of morss:
 - Read RSS/Atom feeds
 - Create RSS feeds from json/html pages
-- Convert iTunes podcast links into xml links
 - Export feeds as RSS/JSON/CSV/HTML
 - Fetch full-text content of feed items
 - Follow 301/meta redirects
@@ -48,6 +46,7 @@ You do need:
 
 - [python](http://www.python.org/) >= 2.6 (python 3 is supported)
 - [lxml](http://lxml.de/) for xml parsing
+- [bs4](https://pypi.org/project/bs4/) for badly-formatted html pages
 - [dateutil](http://labix.org/python-dateutil) to parse feed dates
 - [chardet](https://pypi.python.org/pypi/chardet)
 - [six](https://pypi.python.org/pypi/six), a dependency of chardet
@@ -56,9 +55,13 @@ You do need:
 Simplest way to get these:
 
 ```shell
-pip install -r requirements.txt
+pip install git+https://git.pictuga.com/pictuga/morss.git@master
 ```
 
+The dependency `lxml` is fairly long to install (especially on Raspberry Pi, as
+C code needs to be compiled). If possible on your distribution, try installing
+it with the system package manager.
+
 You may also need:
 
 - Apache, with python-cgi support, to run on a server
@@ -74,9 +77,10 @@ The arguments are:
 
 - Change what morss does
 - `json`: output as JSON
+- `html`: outpout as HTML
+- `csv`: outpout as CSV
 - `proxy`: doesn't fill the articles
 - `clip`: stick the full article content under the original feed content (useful for twitter)
-- `keep`: by default, morss does drop feed description whenever the full-content is found (so as not to mislead users who use Firefox, since the latter only shows the description in the feed preview, so they might believe morss doens't work), but with this argument, the description is kept
 - `search=STRING`: does a basic case-sensitive search in the feed
 - Advanced
 - `csv`: export to csv
@@ -85,14 +89,11 @@ The arguments are:
 - `noref`: drop items' link
 - `cache`: only take articles from the cache (ie. don't grab new articles' content), so as to save time
 - `debug`: to have some feedback from the script execution. Useful for debugging
-- `mono`: disable multithreading while fetching, makes debugging easier
-- `theforce`: force download the rss feed and ignore cached http errros
+- `force`: force refetch the rss feed and articles
 - `silent`: don't output the final RSS (useless on its own, but can be nice when debugging)
-- `encoding=ENCODING`: overrides the encoding auto-detection of the crawler. Some web developers did not quite understand the importance of setting charset/encoding tags correctly...
 - http server only
 - `callback=NAME`: for JSONP calls
 - `cors`: allow Cross-origin resource sharing (allows XHR calls from other servers)
-- `html`: changes the http content-type to html, so that python cgi erros (written in html) are readable in a web browser
 - `txt`: changes the http content-type to txt (for faster "`view-source:`")
 - Custom feeds: you can turn any HTML page into a RSS feed using morss, using xpath rules. The article content will be fetched as usual (with readabilite). Please note that you will have to **replace** any `/` in your rule with a `|` when using morss as a webserver
 - `items`: (**mandatory** to activate the custom feeds function) xpath rule to match all the RSS entries
@@ -111,7 +112,6 @@ morss will auto-detect what "mode" to use.
 For this, you'll want to change a bit the architecture of the files, for example
 into something like this.
 
-
 ```
 /
 ├── cgi
@@ -143,20 +143,40 @@ ensure that the provided `/www/.htaccess` works well with your server.
 Running this command should do:
 
 ```shell
-uwsgi --http :9090 --plugin python --wsgi-file main.py
+uwsgi --http :8080 --plugin python --wsgi-file main.py
 ```
 
-However, one problem might be how to serve the provided `index.html` file if it
-isn't in the same directory. Therefore you can add this at the end of the
-command to point to another directory `--pyargv '--root ../../www/'`.
+#### Using Gunicorn
 
+```shell
+gunicorn morss:cgi_standalone_app
+```
+
+#### Using docker
+
+Build & run
+
+```shell
+docker build https://git.pictuga.com/pictuga/morss.git -t morss
+docker run -p 8080:8080 morss
+```
+
+In one line
+
+```shell
+docker run -p 8080:8080 $(docker build -q https://git.pictuga.com/pictuga/morss.git)
+```
+
 #### Using morss' internal HTTP server
 
 Morss can run its own HTTP server. The later should start when you run morss
 without any argument, on port 8080.
 
-You can change the port and the location of the `www/` folder like this `python -m morss 9000 --root ../../www`.
+```shell
+morss
+```
+
+You can change the port like this `morss 9000`.
 
 #### Passing arguments
 
@@ -176,9 +196,9 @@ Works like a charm with [Tiny Tiny RSS](http://tt-rss.org/redmine/projects/tt-rs
 
 Run:
 ```
-python[2.7] -m morss [argwithoutvalue] [argwithvalue=value] [...] FEEDURL
+morss [argwithoutvalue] [argwithvalue=value] [...] FEEDURL
 ```
-For example: `python -m morss debug http://feeds.bbci.co.uk/news/rss.xml`
+For example: `morss debug http://feeds.bbci.co.uk/news/rss.xml`
 
 *(Brackets indicate optional text)*
 
@@ -191,9 +211,9 @@ scripts can be run on top of the RSS feed, using its
 
 To use this script, you have to enable "(Unix) command" in liferea feed settings, and use the command:
 ```
-[python[2.7]] PATH/TO/MORSS/main.py [argwithoutvalue] [argwithvalue=value] [...] FEEDURL
+morss [argwithoutvalue] [argwithvalue=value] [...] FEEDURL
 ```
-For example: `python2.7 PATH/TO/MORSS/main.py http://feeds.bbci.co.uk/news/rss.xml`
+For example: `morss http://feeds.bbci.co.uk/news/rss.xml`
 
 *(Brackets indicate optional text)*
 
@@ -230,20 +250,21 @@ url = 'http://newspaper.example/feed.xml'
 options = morss.Options(csv=True) # arguments
 morss.crawler.sqlite_default = '/tmp/morss-cache.db' # sqlite cache location
 
-rss = morss.FeedFetch(url, options) # this only grabs the RSS feed
+url, rss = morss.FeedFetch(url, options) # this only grabs the RSS feed
 rss = morss.FeedGather(rss, url, options) # this fills the feed and cleans it up
 
-output = morss.Format(rss, options) # formats final feed
+output = morss.FeedFormat(rss, options, 'unicode') # formats final feed
 ```
 
 ## Cache information
 
-morss uses caching to make loading faster. There are 2 possible cache backends
+morss uses caching to make loading faster. There are 3 possible cache backends
 (visible in `morss/crawler.py`):
 
+- `{}`: a simple python in-memory dict() object
 - `SQLiteCache`: sqlite3 cache. Default file location is in-memory (i.e. it will
 be cleared every time the program is run
-- `MySQLCacheHandler`: /!\ Does NOT support multi-threading
+- `MySQLCacheHandler`
 
 ## Configuration
 ### Length limitation
@@ -262,7 +283,6 @@ different values at the top of the script.
 
 - `DELAY` sets the browser cache delay, only for HTTP clients
 - `TIMEOUT` sets the HTTP timeout when fetching rss feeds and articles
-- `THREADS` sets the number of threads to use. `1` makes no use of multithreading.
 
 ### Content matching
 
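The library-facing API change in the README hunk above (`FeedFetch` now returning a `(url, rss)` tuple, `Format` renamed to `FeedFormat` with an explicit output encoding) can be put together into a short usage sketch. This is illustrative only, based on the lines shown in that hunk; the feed URL is the placeholder used in the README.

```python
import morss

# placeholder feed URL, taken from the README example
url = 'http://newspaper.example/feed.xml'
options = morss.Options(csv=True)

# FeedFetch now also returns the (possibly updated) url, per the README diff
url, rss = morss.FeedFetch(url, options)
rss = morss.FeedGather(rss, url, options)            # fill items and clean up
output = morss.FeedFormat(rss, options, 'unicode')   # render as a unicode string

print(output)
```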
main.py (2 changed lines)
@@ -1,6 +1,6 @@
 #!/usr/bin/env python
 
-from morss import main, cgi_wrapper as application
+from morss import main, cgi_standalone_app as application
 
 if __name__ == '__main__':
     main()
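`main.py` now exposes the renamed WSGI callable as `application` (previously `cgi_wrapper`). A minimal sketch of serving it with the standard library, assuming `cgi_standalone_app` follows the usual WSGI signature (it is the same callable gunicorn runs in the new Dockerfile):

```python
from wsgiref.simple_server import make_server
from morss import cgi_standalone_app

# serve the same callable that gunicorn runs as `morss:cgi_standalone_app`
httpd = make_server('0.0.0.0', 8080, cgi_standalone_app)
httpd.serve_forever()
```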
morss/crawler.py (248 changed lines)
@@ -7,14 +7,19 @@ import chardet
 from cgi import parse_header
 import lxml.html
 import time
+import random
 
 try:
     # python 2
     from urllib2 import BaseHandler, HTTPCookieProcessor, Request, addinfourl, parse_keqv_list, parse_http_list, build_opener
+    from urllib import quote
+    from urlparse import urlparse, urlunparse
     import mimetools
 except ImportError:
     # python 3
     from urllib.request import BaseHandler, HTTPCookieProcessor, Request, addinfourl, parse_keqv_list, parse_http_list, build_opener
+    from urllib.parse import quote
+    from urllib.parse import urlparse, urlunparse
     import email
 
 try:
@@ -27,13 +32,56 @@ except NameError:
 
 MIMETYPE = {
     'xml': ['text/xml', 'application/xml', 'application/rss+xml', 'application/rdf+xml', 'application/atom+xml', 'application/xhtml+xml'],
+    'rss': ['application/rss+xml', 'application/rdf+xml', 'application/atom+xml'],
     'html': ['text/html', 'application/xhtml+xml', 'application/xml']}
 
 
-DEFAULT_UA = 'Mozilla/5.0 (X11; Linux x86_64; rv:25.0) Gecko/20100101 Firefox/25.0'
+DEFAULT_UAS = [
+    #https://gist.github.com/fijimunkii/952acac988f2d25bef7e0284bc63c406
+    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36",
+    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36",
+    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0",
+    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.157 Safari/537.36",
+    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36",
+    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_4) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.1 Safari/605.1.15",
+    "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36",
+    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36",
+    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:67.0) Gecko/20100101 Firefox/67.0",
+    "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36"
+]
 
 
-def custom_handler(accept=None, strict=False, delay=None, encoding=None, basic=False):
+PROTOCOL = ['http', 'https']
+
+
+def get(*args, **kwargs):
+    return adv_get(*args, **kwargs)['data']
+
+
+def adv_get(url, timeout=None, *args, **kwargs):
+    url = sanitize_url(url)
+
+    if timeout is None:
+        con = custom_handler(*args, **kwargs).open(url)
+
+    else:
+        con = custom_handler(*args, **kwargs).open(url, timeout=timeout)
+
+    data = con.read()
+
+    contenttype = con.info().get('Content-Type', '').split(';')[0]
+    encoding= detect_encoding(data, con)
+
+    return {
+        'data':data,
+        'url': con.geturl(),
+        'con': con,
+        'contenttype': contenttype,
+        'encoding': encoding
+    }
+
+
+def custom_handler(follow=None, delay=None, encoding=None):
     handlers = []
 
     # as per urllib2 source code, these Handelers are added first
@@ -45,26 +93,65 @@ def custom_handler(accept=None, strict=False, delay=None, encoding=None, basic=F
     # & HTTPSHandler
 
     #handlers.append(DebugHandler())
-    handlers.append(SizeLimitHandler(100*1024)) # 100KiB
+    handlers.append(SizeLimitHandler(500*1024)) # 500KiB
     handlers.append(HTTPCookieProcessor())
     handlers.append(GZIPHandler())
     handlers.append(HTTPEquivHandler())
     handlers.append(HTTPRefreshHandler())
-    handlers.append(UAHandler(DEFAULT_UA))
-
-    if not basic:
-        handlers.append(AutoRefererHandler())
+    handlers.append(UAHandler(random.choice(DEFAULT_UAS)))
+    handlers.append(BrowserlyHeaderHandler())
 
     handlers.append(EncodingFixHandler(encoding))
 
-    if accept:
-        handlers.append(ContentNegociationHandler(MIMETYPE[accept], strict))
+    if follow:
+        handlers.append(AlternateHandler(MIMETYPE[follow]))
 
     handlers.append(CacheHandler(force_min=delay))
 
     return build_opener(*handlers)
 
 
+def is_ascii(string):
+    # there's a native function in py3, but home-made fix for backward compatibility
+    try:
+        string.encode('ascii')
+
+    except UnicodeError:
+        return False
+
+    else:
+        return True
+
+
+def sanitize_url(url):
+    # make sure the url is unicode, i.e. not bytes
+    if isinstance(url, bytes):
+        url = url.decode()
+
+    # make sure there's a protocol (http://)
+    if url.split(':', 1)[0] not in PROTOCOL:
+        url = 'http://' + url
+
+    # turns out some websites have really badly fomatted urls (fix http:/badurl)
+    url = re.sub('^(https?):/([^/])', r'\1://\2', url)
+
+    # escape spaces
+    url = url.replace(' ', '%20')
+
+    # escape non-ascii unicode characters
+    # https://stackoverflow.com/a/4391299
+    parts = list(urlparse(url))
+
+    for i in range(len(parts)):
+        if not is_ascii(parts[i]):
+            if i == 1:
+                parts[i] = parts[i].encode('idna').decode('ascii')
+
+            else:
+                parts[i] = quote(parts[i].encode('utf-8'))
+
+    return urlunparse(parts)
+
+
 class DebugHandler(BaseHandler):
     handler_order = 2000
 
@@ -132,6 +219,15 @@ class GZIPHandler(BaseHandler):
 
 
 def detect_encoding(data, resp=None):
+    enc = detect_raw_encoding(data, resp)
+
+    if enc.lower() == 'gb2312':
+        enc = 'gbk'
+
+    return enc
+
+
+def detect_raw_encoding(data, resp=None):
     if resp is not None:
         enc = resp.headers.get('charset')
         if enc is not None:
@@ -165,12 +261,8 @@ class EncodingFixHandler(BaseHandler):
         if 200 <= resp.code < 300 and maintype == 'text':
             data = resp.read()
 
-            if not self.encoding:
-                enc = detect_encoding(data, resp)
-            else:
-                enc = self.encoding
+            enc = self.encoding or detect_encoding(data, resp)
 
-            if enc:
             data = data.decode(enc, 'replace')
             data = data.encode(enc)
 
@@ -196,48 +288,37 @@ class UAHandler(BaseHandler):
     https_request = http_request
 
 
-class AutoRefererHandler(BaseHandler):
+class BrowserlyHeaderHandler(BaseHandler):
+    """ Add more headers to look less suspicious """
+
     def http_request(self, req):
-        req.add_unredirected_header('Referer', 'http://%s' % req.host)
+        req.add_unredirected_header('Accept', 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8')
+        req.add_unredirected_header('Accept-Language', 'en-US,en;q=0.5')
         return req
 
     https_request = http_request
 
 
-class ContentNegociationHandler(BaseHandler):
-    " Handler for content negociation. Also parses <link rel='alternate' type='application/rss+xml' href='...' /> "
+class AlternateHandler(BaseHandler):
+    " Follow <link rel='alternate' type='application/rss+xml' href='...' /> "
 
-    def __init__(self, accept=None, strict=False):
-        self.accept = accept
-        self.strict = strict
-
-    def http_request(self, req):
-        if self.accept is not None:
-            if isinstance(self.accept, basestring):
-                self.accept = (self.accept,)
-
-            string = ','.join(self.accept)
-
-            if self.strict:
-                string += ',*/*;q=0.9'
-
-            req.add_unredirected_header('Accept', string)
-
-        return req
+    def __init__(self, follow=None):
+        self.follow = follow or []
 
     def http_response(self, req, resp):
         contenttype = resp.info().get('Content-Type', '').split(';')[0]
-        if 200 <= resp.code < 300 and self.accept is not None and self.strict and contenttype in MIMETYPE['html'] and contenttype not in self.accept:
+        if 200 <= resp.code < 300 and len(self.follow) and contenttype in MIMETYPE['html'] and contenttype not in self.follow:
            # opps, not what we were looking for, let's see if the html page suggests an alternative page of the right types
 
             data = resp.read()
             links = lxml.html.fromstring(data[:10000]).findall('.//link[@rel="alternate"]')
 
             for link in links:
-                if link.get('type', '') in self.accept:
+                if link.get('type', '') in self.follow:
                     resp.code = 302
                     resp.msg = 'Moved Temporarily'
                     resp.headers['location'] = link.get('href')
+                    break
 
             fp = BytesIO(data)
             old_resp = resp
@@ -246,7 +327,6 @@ class ContentNegociationHandler(BaseHandler):
 
         return resp
 
-    https_request = http_request
     https_response = http_response
 
 
@@ -303,12 +383,28 @@ default_cache = {}
 class CacheHandler(BaseHandler):
     " Cache based on etags/last-modified "
 
-    private_cache = False # False to behave like a CDN (or if you just don't care), True like a PC
+    private_cache = False # Websites can indicate whether the page should be
+                          # cached by CDNs (e.g. shouldn't be the case for
+                          # private/confidential/user-specific pages.
+                          # With this setting, decide whether (False) you want
+                          # the cache to behave like a CDN (i.e. don't cache
+                          # private pages), or (True) to behave like a end-cache
+                          # private pages. If unsure, False is the safest bet.
     handler_order = 499
 
     def __init__(self, cache=None, force_min=None):
         self.cache = cache or default_cache
-        self.force_min = force_min # force_min (seconds) to bypass http headers, -1 forever, 0 never, -2 do nothing if not in cache
+        self.force_min = force_min
+        # Servers indicate how long they think their content is "valid".
+        # With this parameter (force_min, expressed in seconds), we can
+        # override the validity period (i.e. bypassing http headers)
+        # Special values:
+        #   -1: valid forever, i.e. use the cache no matter what (and fetch
+        #       the page online if not present in cache)
+        #    0: valid zero second, i.e. force refresh
+        #   -2: same as -1, i.e. use the cache no matter what, but do NOT
+        #       fetch the page online if not present in cache, throw an
+        #       error instead
 
     def load(self, url):
         try:
@@ -338,6 +434,10 @@ class CacheHandler(BaseHandler):
         return req
 
     def http_open(self, req):
+        # Reminder of how/when this function is called by urllib2:
+        # If 'None' is returned, try your chance with the next-available handler
+        # If a 'resp' is returned, stop there, and proceed with 'http_response'
+
         (code, msg, headers, data, timestamp) = self.load(req.get_full_url())
 
         # some info needed to process everything
@@ -360,6 +460,7 @@ class CacheHandler(BaseHandler):
             pass
 
         else:
+            # raise an error, via urllib handlers
             headers['Morss'] = 'from_cache'
             resp = addinfourl(BytesIO(), headers, req.get_full_url(), 409)
             resp.msg = 'Conflict'
@@ -378,14 +479,18 @@ class CacheHandler(BaseHandler):
             return None
 
         elif code == 301 and cache_age < 7*24*3600:
-            # "301 Moved Permanently" has to be cached...as long as we want (awesome HTTP specs), let's say a week (why not?)
-            # use force_min=0 if you want to bypass this (needed for a proper refresh)
+            # "301 Moved Permanently" has to be cached...as long as we want
+            # (awesome HTTP specs), let's say a week (why not?). Use force_min=0
+            # if you want to bypass this (needed for a proper refresh)
             pass
 
         elif self.force_min is None and ('no-cache' in cc_list
                                         or 'no-store' in cc_list
-                                        or ('private' in cc_list and not self.private)):
+                                        or ('private' in cc_list and not self.private_cache)):
             # kindly follow web servers indications, refresh
+            # if the same settings are used all along, this section shouldn't be
+            # of any use, since the page woudln't be cached in the first place
+            # the check is only performed "just in case"
             return None
 
         elif 'max-age' in cc_values and int(cc_values['max-age']) > cache_age:
@@ -400,7 +505,7 @@ class CacheHandler(BaseHandler):
             # according to the www, we have to refresh when nothing is said
             return None
 
-        # return the cache as a response
+        # return the cache as a response. This code is reached with 'pass' above
         headers['morss'] = 'from_cache' # TODO delete the morss header from incoming pages, to avoid websites messing up with us
         resp = addinfourl(BytesIO(data), headers, req.get_full_url(), code)
         resp.msg = msg
@@ -419,7 +524,7 @@ class CacheHandler(BaseHandler):
 
         cc_list = [x for x in cache_control if '=' not in x]
 
-        if 'no-cache' in cc_list or 'no-store' in cc_list or ('private' in cc_list and not self.private):
+        if 'no-cache' in cc_list or 'no-store' in cc_list or ('private' in cc_list and not self.private_cache):
             # kindly follow web servers indications
             return resp
 
@@ -431,6 +536,8 @@ class CacheHandler(BaseHandler):
         data = resp.read()
         self.save(req.get_full_url(), resp.code, resp.msg, resp.headers, data, time.time())
 
+        # the below is only needed because of 'resp.read()' above, as we can't
+        # seek(0) on arbitraty file-like objects (e.g. sockets)
         fp = BytesIO(data)
         old_resp = resp
         resp = addinfourl(fp, old_resp.headers, old_resp.url, old_resp.code)
@@ -450,10 +557,14 @@ class CacheHandler(BaseHandler):
                 unverifiable=True)
 
             new.add_unredirected_header('Morss', 'from_304')
+            # create a "fake" new request to just re-run through the various
+            # handlers
 
             return self.parent.open(new, timeout=req.timeout)
 
-        return None
+        return None # when returning 'None', the next-available handler is used
+                    # the 'HTTPRedirectHandler' has no 'handler_order', i.e.
+                    # uses the default of 500, therefore executed after this
 
     https_request = http_request
     https_open = http_open
@@ -461,6 +572,8 @@ class CacheHandler(BaseHandler):
 
 
 class BaseCache:
+    """ Subclasses must behave like a dict """
+
     def __contains__(self, url):
         try:
             self[url]
@@ -477,7 +590,7 @@ import sqlite3
 
 class SQLiteCache(BaseCache):
     def __init__(self, filename=':memory:'):
-        self.con = sqlite3.connect(filename or sqlite_default, detect_types=sqlite3.PARSE_DECLTYPES, check_same_thread=False)
+        self.con = sqlite3.connect(filename, detect_types=sqlite3.PARSE_DECLTYPES, check_same_thread=False)
 
         with self.con:
             self.con.execute('CREATE TABLE IF NOT EXISTS data (url UNICODE PRIMARY KEY, code INT, msg UNICODE, headers UNICODE, data BLOB, timestamp INT)')
@@ -499,32 +612,28 @@ class SQLiteCache(BaseCache):
         value[3] = sqlite3.Binary(value[3]) # data
         value = tuple(value)
 
-        if url in self:
-            with self.con:
-                self.con.execute('UPDATE data SET code=?, msg=?, headers=?, data=?, timestamp=? WHERE url=?',
-                    value + (url,))
-
-        else:
-            with self.con:
-                self.con.execute('INSERT INTO data VALUES (?,?,?,?,?,?)', (url,) + value)
+        with self.con:
+            self.con.execute('INSERT INTO data VALUES (?,?,?,?,?,?) ON CONFLICT(url) DO UPDATE SET code=?, msg=?, headers=?, data=?, timestamp=?', (url,) + value + value)
 
 
 import pymysql.cursors
 
 
 class MySQLCacheHandler(BaseCache):
-    " NB. Requires mono-threading, as pymysql isn't thread-safe "
-
     def __init__(self, user, password, database, host='localhost'):
-        self.con = pymysql.connect(host=host, user=user, password=password, database=database, charset='utf8', autocommit=True)
+        self.user = user
+        self.password = password
+        self.database = database
+        self.host = host
 
-        with self.con.cursor() as cursor:
+        with self.cursor() as cursor:
             cursor.execute('CREATE TABLE IF NOT EXISTS data (url VARCHAR(255) NOT NULL PRIMARY KEY, code INT, msg TEXT, headers TEXT, data BLOB, timestamp INT)')
 
-    def __del__(self):
-        self.con.close()
+    def cursor(self):
+        return pymysql.connect(host=self.host, user=self.user, password=self.password, database=self.database, charset='utf8', autocommit=True).cursor()
 
     def __getitem__(self, url):
-        cursor = self.con.cursor()
+        cursor = self.cursor()
         cursor.execute('SELECT * FROM data WHERE url=%s', (url,))
         row = cursor.fetchone()
 
@@ -534,11 +643,16 @@ class MySQLCacheHandler(BaseCache):
         return row[1:]
 
     def __setitem__(self, url, value): # (code, msg, headers, data, timestamp)
-        if url in self:
-            with self.con.cursor() as cursor:
-                cursor.execute('UPDATE data SET code=%s, msg=%s, headers=%s, data=%s, timestamp=%s WHERE url=%s',
-                    value + (url,))
+        with self.cursor() as cursor:
+            cursor.execute('INSERT INTO data VALUES (%s,%s,%s,%s,%s,%s) ON DUPLICATE KEY UPDATE code=%s, msg=%s, headers=%s, data=%s, timestamp=%s',
+                (url,) + value + value)
 
-        else:
-            with self.con.cursor() as cursor:
-                cursor.execute('INSERT INTO data VALUES (%s,%s,%s,%s,%s,%s)', (url,) + value)
+
+if __name__ == '__main__':
+    req = adv_get(sys.argv[1] if len(sys.argv) > 1 else 'https://morss.it')
+
+    if sys.flags.interactive:
+        print('>>> Interactive shell: try using `req`')
+
+    else:
+        print(req['data'].decode(req['encoding']))
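The new `get`/`adv_get` helpers wrap the whole handler chain (user-agent rotation, browser-like headers, cache, alternate-link following) behind one call. A minimal sketch of how they might be used, mirroring the `__main__` block added at the bottom of `morss/crawler.py`; the URL is just the default from that block:

```python
from morss import crawler

# adv_get() sanitizes the url, opens it through the custom handler chain and
# returns a dict with 'data', 'url', 'con', 'contenttype' and 'encoding'
out = crawler.adv_get('https://morss.it', follow='rss', timeout=4)

print(out['url'], out['contenttype'])
print(out['data'].decode(out['encoding']))
```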
@@ -90,8 +90,11 @@ item_updated = updated
 [html]
 mode = html
 
+path =
+    http://localhost/
+
 title = //div[@id='header']/h1
-desc = //div[@id='header']/h2
+desc = //div[@id='header']/p
 items = //div[@id='content']/div
 
 item_title = ./a
@@ -99,7 +102,7 @@ item_link = ./a/@href
 item_desc = ./div[class=desc]
 item_content = ./div[class=content]
 
-base = <!DOCTYPE html> <html> <head> <title>Feed reader by morss</title> <meta name="viewport" content="width=device-width; initial-scale=1.0; maximum-scale=1.0;" /> </head> <body> <div id="header"> <h1>@feed.title</h1> <h2>@feed.desc</h2> <p>- via morss</p> </div> <div id="content"> <div class="item"> <a class="title link" href="@item.link" target="_blank">@item.title</a> <div class="desc">@item.desc</div> <div class="content">@item.content</div> </div> </div> <script> var items = document.getElementsByClassName('item') for (var i in items) items[i].onclick = function() { this.classList.toggle('active') document.body.classList.toggle('noscroll') } </script> </body> </html>
+base = file:sheet.xsl
 
 [twitter]
 mode = html
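The `base = file:sheet.xsl` value above relies on the new `file:` handling added to `parse_rules()` in `morss/feeds.py` (shown further down in this diff): the referenced file is read from one of a few candidate directories and its `<xsl ...>`/`<xml ...>` tags are stripped before being stored as the rule value. An illustrative sketch only, assuming these `[html]` rules live in the default ruleset file loaded by `parse_rules()`:

```python
from morss import feeds

rules = feeds.parse_rules()

# the 'html' ruleset shown above; 'base' should now hold the contents of
# sheet.xsl with its <xsl ...> / <xml ...> tags removed by parse_rules()
print(rules['html']['mode'])        # expected: 'html'
print(rules['html']['base'][:80])   # start of the inlined stylesheet
```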
@@ -1,28 +0,0 @@
-import re
-import json
-
-from . import crawler
-
-try:
-    basestring
-except NameError:
-    basestring = str
-
-
-def pre_worker(url):
-    if url.startswith('http://itunes.apple.com/') or url.startswith('https://itunes.apple.com/'):
-        match = re.search('/id([0-9]+)(\?.*)?$', url)
-        if match:
-            iid = match.groups()[0]
-            redirect = 'https://itunes.apple.com/lookup?id=%s' % iid
-
-            try:
-                con = crawler.custom_handler(basic=True).open(redirect, timeout=4)
-                data = con.read()
-
-            except (IOError, HTTPException):
-                raise
-
-            return json.loads(data.decode('utf-8', 'replace'))['results'][0]['feedUrl']
-
-    return None
morss/feeds.py (134 changed lines)
@@ -15,6 +15,7 @@ import dateutil.parser
 from copy import deepcopy
 
 import lxml.html
+from .readabilite import parse as html_parse
 
 json.encoder.c_make_encoder = None
 
@@ -45,14 +46,32 @@ def parse_rules(filename=None):
     rules = dict([(x, dict(config.items(x))) for x in config.sections()])
 
     for section in rules.keys():
+        # for each ruleset
+
         for arg in rules[section].keys():
-            if '\n' in rules[section][arg]:
+            # for each rule
+
+            if rules[section][arg].startswith('file:'):
+                paths = [os.path.join(sys.prefix, 'share/morss/www', rules[section][arg][5:]),
+                    os.path.join(os.path.dirname(__file__), '../www', rules[section][arg][5:]),
+                    os.path.join(os.path.dirname(__file__), '../..', rules[section][arg][5:])]
+
+                for path in paths:
+                    try:
+                        file_raw = open(path).read()
+                        file_clean = re.sub('<[/?]?(xsl|xml)[^>]+?>', '', file_raw)
+                        rules[section][arg] = file_clean
+
+                    except IOError:
+                        pass
+
+            elif '\n' in rules[section][arg]:
                 rules[section][arg] = rules[section][arg].split('\n')[1:]
 
     return rules
 
 
-def parse(data, url=None, mimetype=None):
+def parse(data, url=None, encoding=None):
     " Determine which ruleset to use "
 
     rulesets = parse_rules()
@@ -66,28 +85,22 @@ def parse(data, url=None, mimetype=None):
         for path in ruleset['path']:
             if fnmatch(url, path):
                 parser = [x for x in parsers if x.mode == ruleset['mode']][0]
-                return parser(data, ruleset)
+                return parser(data, ruleset, encoding=encoding)
 
-    # 2) Look for a parser based on mimetype
+    # 2) Try each and every parser
 
-    if mimetype is not None:
-        parser_candidates = [x for x in parsers if mimetype in x.mimetype]
-
-    if mimetype is None or parser_candidates is None:
-        parser_candidates = parsers
-
     # 3) Look for working ruleset for given parser
     # 3a) See if parsing works
     # 3b) See if .items matches anything
 
-    for parser in parser_candidates:
+    for parser in parsers:
         ruleset_candidates = [x for x in rulesets.values() if x['mode'] == parser.mode and 'path' not in x]
         # 'path' as they should have been caught beforehands
 
         try:
-            feed = parser(data)
+            feed = parser(data, encoding=encoding)
 
-        except (ValueError):
+        except (ValueError, SyntaxError):
             # parsing did not work
             pass
 
@@ -112,7 +125,7 @@ def parse(data, url=None, mimetype=None):
 
 
 class ParserBase(object):
-    def __init__(self, data=None, rules=None, parent=None):
+    def __init__(self, data=None, rules=None, parent=None, encoding=None):
         if rules is None:
             rules = parse_rules()[self.default_ruleset]
 
@@ -121,9 +134,10 @@ class ParserBase(object):
         if data is None:
             data = rules['base']
 
-        self.root = self.parse(data)
         self.parent = parent
+        self.encoding = encoding
+
+        self.root = self.parse(data)
 
     def parse(self, raw):
         pass
@@ -148,15 +162,15 @@ class ParserBase(object):
         c = csv.writer(out, dialect=csv.excel)
 
         for item in self.items:
-            row = [getattr(item, x) for x in item.dic]
-
-            if encoding != 'unicode':
-                row = [x.encode(encoding) if isinstance(x, unicode) else x for x in row]
-
-            c.writerow(row)
+            c.writerow([getattr(item, x) for x in item.dic])
 
         out.seek(0)
-        return out.read()
+        out = out.read()
+
+        if encoding != 'unicode':
+            out = out.encode(encoding)
+
+        return out
 
     def tohtml(self, **k):
         return self.convert(FeedHTML).tostring(**k)
@@ -267,7 +281,14 @@ class ParserBase(object):
 
         except AttributeError:
             # does not exist, have to create it
+            try:
                 self.rule_create(self.rules[rule_name])
+
+            except AttributeError:
+                # no way to create it, give up
+                pass
+
+            else:
                 self.rule_set(self.rules[rule_name], value)
 
     def rmv(self, rule_name):
@@ -286,10 +307,7 @@ class ParserXML(ParserBase):
 
     NSMAP = {'atom': 'http://www.w3.org/2005/Atom',
             'atom03': 'http://purl.org/atom/ns#',
-            'media': 'http://search.yahoo.com/mrss/',
             'rdf': 'http://www.w3.org/1999/02/22-rdf-syntax-ns#',
-            'slash': 'http://purl.org/rss/1.0/modules/slash/',
-            'dc': 'http://purl.org/dc/elements/1.1/',
             'content': 'http://purl.org/rss/1.0/modules/content/',
             'rssfake': 'http://purl.org/rss/1.0/'}
 
@@ -301,7 +319,7 @@ class ParserXML(ParserBase):
         return self.root.getparent().remove(self.root)
 
     def tostring(self, encoding='unicode', **k):
-        return etree.tostring(self.root, encoding=encoding, **k)
+        return etree.tostring(self.root, encoding=encoding, method='xml', **k)
 
     def _rule_parse(self, rule):
         test = re.search(r'^(.*)/@([a-z]+)$', rule) # to match //div/a/@href
@@ -383,7 +401,8 @@ class ParserXML(ParserBase):
             return
 
         elif key is not None:
-            del x.attrib[key]
+            if key in match.attrib:
+                del match.attrib[key]
 
         else:
             match.getparent().remove(match)
@@ -401,13 +420,14 @@ class ParserXML(ParserBase):
 
         else:
             if html_rich:
-                # atom stuff
-                if 'atom' in rule:
-                    match.attrib['type'] = 'xhtml'
-
                 self._clean_node(match)
                 match.append(lxml.html.fragment_fromstring(value, create_parent='div'))
-                match.find('div').drop_tag()
+
+                if self.rules['mode'] == 'html':
+                    match.find('div').drop_tag() # not supported by lxml.etree
+
+                else: # i.e. if atom
+                    match.attrib['type'] = 'xhtml'
 
             else:
                 if match is not None and len(match):
@@ -440,11 +460,10 @@ class ParserHTML(ParserXML):
     mimetype = ['text/html', 'application/xhtml+xml']
 
     def parse(self, raw):
-        parser = etree.HTMLParser(remove_blank_text=True) # remove_blank_text needed for pretty_print
-        return etree.fromstring(raw, parser)
+        return html_parse(raw, encoding=self.encoding)
 
     def tostring(self, encoding='unicode', **k):
-        return lxml.html.tostring(self.root, encoding=encoding, **k)
+        return lxml.html.tostring(self.root, encoding=encoding, method='html', **k)
 
     def rule_search_all(self, rule):
         try:
@@ -467,6 +486,9 @@ class ParserHTML(ParserXML):
             element = deepcopy(match)
             match.getparent().append(element)
 
+        else:
+            raise AttributeError('no way to create item')
+
 
 def parse_time(value):
     if value is None or value == 0:
@@ -474,13 +496,13 @@ def parse_time(value):
 
     elif isinstance(value, basestring):
         if re.match(r'^[0-9]+$', value):
-            return datetime.fromtimestamp(int(value), tz.UTC)
+            return datetime.fromtimestamp(int(value), tz.tzutc())
 
         else:
-            return dateutil.parser.parse(value)
+            return dateutil.parser.parse(value).replace(tzinfo=tz.tzutc())
 
     elif isinstance(value, int):
-        return datetime.fromtimestamp(value, tz.UTC)
+        return datetime.fromtimestamp(value, tz.tzutc())
 
     elif isinstance(value, datetime):
         return value
@@ -696,13 +718,31 @@ class Item(Uniq):
 class FeedXML(Feed, ParserXML):
     itemsClass = 'ItemXML'
 
+    def root_siblings(self):
+        out = []
+        current = self.root.getprevious()
+
+        while current is not None:
+            out.append(current)
+            current = current.getprevious()
+
+        return out
+
     def tostring(self, encoding='unicode', **k):
         # override needed due to "getroottree" inclusion
+        # and to add stylesheet
+
+        stylesheets = [x for x in self.root_siblings() if isinstance(x, etree.PIBase) and x.target == 'xml-stylesheet']
+
+        if len(stylesheets):
+            # remove all stylesheets present (be that ours or others')
+            for stylesheet in stylesheets:
+                self.root.append(stylesheet) # needed as we can't delete root siblings https://stackoverflow.com/a/60232366
+                self.root.remove(stylesheet)
 
-        if self.root.getprevious() is None:
-            self.root.addprevious(etree.PI('xml-stylesheet', 'type="text/xsl" href="/sheet.xsl"'))
+        self.root.addprevious(etree.PI('xml-stylesheet', 'type="text/xsl" href="/sheet.xsl"'))
 
-        return etree.tostring(self.root.getroottree(), encoding=encoding, **k)
+        return etree.tostring(self.root.getroottree(), encoding=encoding, method='xml', **k)
 
 
 class ItemXML(Item, ParserXML):
@@ -732,3 +772,17 @@ class ItemJSON(Item, ParserJSON):
                 return
 
             cur = cur[node]
+
+
+if __name__ == '__main__':
+    from . import crawler
+
+    req = crawler.adv_get(sys.argv[1] if len(sys.argv) > 1 else 'https://www.nytimes.com/', follow='rss')
+    feed = parse(req['data'], url=req['url'], encoding=req['encoding'])
+
+    if sys.flags.interactive:
+        print('>>> Interactive shell: try using `feed`')
+
+    else:
+        for item in feed.items:
+            print(item.title, item.link)
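Tying the two modules together, feed parsing now takes the encoding detected by the crawler instead of a mimetype hint. A minimal sketch mirroring the `__main__` block added to `morss/feeds.py`; the URL is the default used there:

```python
from morss import crawler, feeds

# fetch the page (following <link rel='alternate'> to an rss feed if needed),
# then parse it with the encoding reported by the crawler
req = crawler.adv_get('https://www.nytimes.com/', follow='rss')
feed = feeds.parse(req['data'], url=req['url'], encoding=req['encoding'])

for item in feed.items:
    print(item.title, item.link)
```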
405
morss/morss.py
405
morss/morss.py
@@ -1,9 +1,10 @@
import sys
import os
import os.path
-import time

-import threading
+import time
+from datetime import datetime
+from dateutil import tz

from fnmatch import fnmatch
import re
@@ -12,47 +13,44 @@ import lxml.etree
import lxml.html

from . import feeds
-from . import feedify
from . import crawler
from . import readabilite

import wsgiref.simple_server
import wsgiref.handlers
+import cgitb


try:
    # python 2
-    from Queue import Queue
    from httplib import HTTPException
-    from urllib import quote_plus
+    from urllib import unquote
    from urlparse import urlparse, urljoin, parse_qs

except ImportError:
    # python 3
-    from queue import Queue
    from http.client import HTTPException
-    from urllib.parse import quote_plus
+    from urllib.parse import unquote
    from urllib.parse import urlparse, urljoin, parse_qs

-LIM_ITEM = 100 # deletes what's beyond
-LIM_TIME = 7 # deletes what's after
-MAX_ITEM = 50 # cache-only beyond
-MAX_TIME = 7 # cache-only after (in sec)
+MAX_ITEM = 5 # cache-only beyond
+MAX_TIME = 2 # cache-only after (in sec)
+LIM_ITEM = 10 # deletes what's beyond
+LIM_TIME = 2.5 # deletes what's after

DELAY = 10 * 60 # xml cache & ETag cache (in sec)
TIMEOUT = 4 # http timeout (in sec)
-THREADS = 10 # number of threads (1 for single-threaded)

DEBUG = False
PORT = 8080

-PROTOCOL = ['http', 'https', 'ftp']
-

def filterOptions(options):
    return options

    # example of filtering code below

-    #allowed = ['proxy', 'clip', 'keep', 'cache', 'force', 'silent', 'pro', 'debug']
+    #allowed = ['proxy', 'clip', 'cache', 'force', 'silent', 'pro', 'debug']
    #filtered = dict([(key,value) for (key,value) in options.items() if key in allowed])

    #return filtered
@@ -66,6 +64,7 @@ def log(txt, force=False):
    if DEBUG or force:
        if 'REQUEST_URI' in os.environ:
            open('morss.log', 'a').write("%s\n" % repr(txt))

        else:
            print(repr(txt))

@@ -73,6 +72,7 @@ def log(txt, force=False):
def len_html(txt):
    if len(txt):
        return len(lxml.html.fromstring(txt).text_content())

    else:
        return 0

@@ -80,6 +80,7 @@ def len_html(txt):
def count_words(txt):
    if len(txt):
        return len(lxml.html.fromstring(txt).text_content().split())

    return 0


@@ -88,12 +89,14 @@ class Options:
        if len(args):
            self.options = args
            self.options.update(options or {})

        else:
            self.options = options or {}

    def __getattr__(self, key):
        if key in self.options:
            return self.options[key]

        else:
            return False

@@ -107,25 +110,31 @@ class Options:
def parseOptions(options):
    """ Turns ['md=True'] into {'md':True} """
    out = {}

    for option in options:
        split = option.split('=', 1)

        if len(split) > 1:
            if split[0].lower() == 'true':
                out[split[0]] = True

            elif split[0].lower() == 'false':
                out[split[0]] = False

            else:
                out[split[0]] = split[1]

        else:
            out[split[0]] = True

    return out


-def ItemFix(item, feedurl='/'):
+def ItemFix(item, options, feedurl='/'):
    """ Improves feed items (absolute links, resolve feedburner links, etc) """

    # check unwanted uppercase title
-    if len(item.title) > 20 and item.title.isupper():
+    if item.title is not None and len(item.title) > 20 and item.title.isupper():
        item.title = item.title.title()

    # check if it includes link
@@ -140,6 +149,13 @@ def ItemFix(item, feedurl='/'):
            item.link = match[0]
            log(item.link)

+    # at user's election, use first <a>
+    if options.firstlink and (item.desc or item.content):
+        match = lxml.html.fromstring(item.desc or item.content).xpath('//a/@href')
+        if len(match):
+            item.link = match[0]
+            log(item.link)
+
    # check relative urls
    item.link = urljoin(feedurl, item.link)

@@ -158,6 +174,11 @@ def ItemFix(item, feedurl='/'):
        item.link = parse_qs(urlparse(item.link).query)['url'][0]
        log(item.link)

+    # pocket
+    if fnmatch(item.link, 'https://getpocket.com/redirect?url=*'):
+        item.link = parse_qs(urlparse(item.link).query)['url'][0]
+        log(item.link)
+
    # facebook
    if fnmatch(item.link, 'https://www.facebook.com/l.php?u=*'):
        item.link = parse_qs(urlparse(item.link).query)['u'][0]

@@ -183,7 +204,7 @@ def ItemFix(item, feedurl='/'):

    # reddit
    if urlparse(feedurl).netloc == 'www.reddit.com':
-        match = lxml.html.fromstring(item.desc).xpath('//a[text()="[link]"]/@href')
+        match = lxml.html.fromstring(item.content).xpath('//a[text()="[link]"]/@href')
        if len(match):
            item.link = match[0]
            log(item.link)

@@ -196,55 +217,36 @@ def ItemFill(item, options, feedurl='/', fast=False):

    if not item.link:
        log('no link')
-        return item
+        return True

    log(item.link)

-    link = item.link
-
-    # twitter
-    if urlparse(feedurl).netloc == 'twitter.com':
-        match = lxml.html.fromstring(item.desc).xpath('//a/@data-expanded-url')
-        if len(match):
-            link = match[0]
-            log(link)
-        else:
-            link = None
-
-    # facebook
-    if urlparse(feedurl).netloc == 'graph.facebook.com':
-        match = lxml.html.fromstring(item.content).xpath('//a/@href')
-        if len(match) and urlparse(match[0]).netloc != 'www.facebook.com':
-            link = match[0]
-            log(link)
-        else:
-            link = None
-
-    if link is None:
-        log('no used link')
-        return True
-
    # download
    delay = -1

-    if fast:
-        # super-fast mode
+    if fast or options.fast:
+        # force cache, don't fetch
        delay = -2

+    elif options.force:
+        # force refresh
+        delay = 0
+
+    else:
+        delay = 24*60*60 # 24h
+
    try:
-        con = crawler.custom_handler('html', False, delay, options.encoding).open(link, timeout=TIMEOUT)
-        data = con.read()
+        req = crawler.adv_get(url=item.link, delay=delay, timeout=TIMEOUT)

    except (IOError, HTTPException) as e:
        log('http error')
        return False # let's just delete errors stuff when in cache mode

-    contenttype = con.info().get('Content-Type', '').split(';')[0]
-    if contenttype not in crawler.MIMETYPE['html'] and contenttype != 'text/plain':
+    if req['contenttype'] not in crawler.MIMETYPE['html'] and req['contenttype'] != 'text/plain':
        log('non-text page')
        return True

-    out = readabilite.get_article(data, link, options.encoding or crawler.detect_encoding(data, con))
+    out = readabilite.get_article(req['data'], url=req['url'], encoding_in=req['encoding'], encoding_out='unicode')

    if out is not None:
        item.content = out

@@ -265,10 +267,7 @@ def ItemBefore(item, options):

def ItemAfter(item, options):
    if options.clip and item.desc and item.content:
-        item.content = item.desc + "<br/><br/><center>* * *</center><br/><br/>" + item.content
+        item.content = item.desc + "<br/><br/><hr/><br/><br/>" + item.content
-        del item.desc
-
-    if not options.keep and not options.proxy:
        del item.desc

    if options.nolink and item.content:
@@ -276,7 +275,7 @@ def ItemAfter(item, options):
        for link in content.xpath('//a'):
            log(link.text_content())
            link.drop_tag()
-        item.content = lxml.etree.tostring(content)
+        item.content = lxml.etree.tostring(content, method='html')

    if options.noref:
        item.link = ''

@@ -285,71 +284,50 @@ def ItemAfter(item, options):


def FeedFetch(url, options):
-    # basic url clean-up
-    if url is None:
-        raise MorssException('No url provided')
-
-    if urlparse(url).scheme not in PROTOCOL:
-        url = 'http://' + url
-        log(url)
-
-    url = url.replace(' ', '%20')
-
-    if isinstance(url, bytes):
-        url = url.decode()
-
-    # allow for code execution for feedify
-    pre = feedify.pre_worker(url)
-    if pre:
-        url = pre
-        log('url redirect')
-        log(url)
-
    # fetch feed
    delay = DELAY

-    if options.theforce:
+    if options.force:
        delay = 0

    try:
-        con = crawler.custom_handler(accept='xml', strict=True, delay=delay,
-            encoding=options.encoding, basic=not options.items) \
-            .open(url, timeout=TIMEOUT * 2)
-        xml = con.read()
+        req = crawler.adv_get(url=url, follow=('rss' if not options.items else None), delay=delay, timeout=TIMEOUT * 2)

    except (IOError, HTTPException):
        raise MorssException('Error downloading feed')

-    contenttype = con.info().get('Content-Type', '').split(';')[0]
-
    if options.items:
        # using custom rules
-        rss = feeds.FeedHTML(xml, url, contenttype)
-        feed.rule
+        rss = feeds.FeedHTML(req['data'], encoding=req['encoding'])
+        rss.rules['title'] = options.title if options.title else '//head/title'
+        rss.rules['desc'] = options.desc if options.desc else '//head/meta[@name="description"]/@content'

        rss.rules['items'] = options.items

-        if options.item_title:
-            rss.rules['item_title'] = options.item_title
-        if options.item_link:
-            rss.rules['item_link'] = options.item_link
+        rss.rules['item_title'] = options.item_title if options.item_title else './/a|.'
+        rss.rules['item_link'] = options.item_link if options.item_link else './@href|.//a/@href'

        if options.item_content:
            rss.rules['item_content'] = options.item_content

        if options.item_time:
            rss.rules['item_time'] = options.item_time

+        rss = rss.convert(feeds.FeedXML)
+
    else:
        try:
-            rss = feeds.parse(xml, url, contenttype)
+            rss = feeds.parse(req['data'], url=url, encoding=req['encoding'])
            rss = rss.convert(feeds.FeedXML)
            # contains all fields, otherwise much-needed data can be lost

        except TypeError:
            log('random page')
-            log(contenttype)
+            log(req['contenttype'])
            raise MorssException('Link provided is not a valid feed')

-    return rss
+    return req['url'], rss


def FeedGather(rss, url, options):

@@ -361,42 +339,30 @@ def FeedGather(rss, url, options):
    lim_time = LIM_TIME
    max_item = MAX_ITEM
    max_time = MAX_TIME
-    threads = THREADS

    if options.cache:
        max_time = 0

-    if options.mono:
-        threads = 1
-
-    # set
-    def runner(queue):
-        while True:
-            value = queue.get()
-            try:
-                worker(*value)
-            except Exception as e:
-                log('Thread Error: %s' % e.message)
-            queue.task_done()
-
-    def worker(i, item):
+    now = datetime.now(tz.tzutc())
+    sorted_items = sorted(rss.items, key=lambda x:x.updated or x.time or now, reverse=True)
+    for i, item in enumerate(sorted_items):
        if time.time() - start_time > lim_time >= 0 or i + 1 > lim_item >= 0:
            log('dropped')
            item.remove()
-            return
+            continue

        item = ItemBefore(item, options)

        if item is None:
-            return
+            continue

-        item = ItemFix(item, url)
+        item = ItemFix(item, options, url)

        if time.time() - start_time > max_time >= 0 or i + 1 > max_item >= 0:
            if not options.proxy:
                if ItemFill(item, options, url, True) is False:
                    item.remove()
-                    return
+                    continue

        else:
            if not options.proxy:

@@ -404,22 +370,6 @@ def FeedGather(rss, url, options):

        item = ItemAfter(item, options)

-    queue = Queue()
-
-    for i in range(threads):
-        t = threading.Thread(target=runner, args=(queue,))
-        t.daemon = True
-        t.start()
-
-    for i, item in enumerate(list(rss.items)):
-        if threads == 1:
-            worker(*[i, item])
-        else:
-            queue.put([i, item])
-
-    if threads != 1:
-        queue.join()
-
    if options.ad:
        new = rss.items.append()
        new.title = "Are you hungry?"

@@ -433,37 +383,38 @@ def FeedGather(rss, url, options):
    return rss


-def FeedFormat(rss, options):
+def FeedFormat(rss, options, encoding='utf-8'):
    if options.callback:
        if re.match(r'^[a-zA-Z0-9\.]+$', options.callback) is not None:
-            return '%s(%s)' % (options.callback, rss.tojson())
+            out = '%s(%s)' % (options.callback, rss.tojson(encoding='unicode'))
+            return out if encoding == 'unicode' else out.encode(encoding)

        else:
            raise MorssException('Invalid callback var name')

    elif options.json:
        if options.indent:
-            return rss.tojson(encoding='UTF-8', indent=4)
+            return rss.tojson(encoding=encoding, indent=4)

        else:
-            return rss.tojson(encoding='UTF-8')
+            return rss.tojson(encoding=encoding)

    elif options.csv:
-        return rss.tocsv(encoding='UTF-8')
+        return rss.tocsv(encoding=encoding)

-    elif options.reader:
+    elif options.html:
        if options.indent:
-            return rss.tohtml(encoding='UTF-8', pretty_print=True)
+            return rss.tohtml(encoding=encoding, pretty_print=True)

        else:
-            return rss.tohtml(encoding='UTF-8')
+            return rss.tohtml(encoding=encoding)

    else:
        if options.indent:
-            return rss.torss(xml_declaration=True, encoding='UTF-8', pretty_print=True)
+            return rss.torss(xml_declaration=(not encoding == 'unicode'), encoding=encoding, pretty_print=True)

        else:
-            return rss.torss(xml_declaration=True, encoding='UTF-8')
+            return rss.torss(xml_declaration=(not encoding == 'unicode'), encoding=encoding)


def process(url, cache=None, options=None):

@@ -475,14 +426,15 @@ def process(url, cache=None, options=None):
    if cache:
        crawler.default_cache = crawler.SQLiteCache(cache)

-    rss = FeedFetch(url, options)
+    url, rss = FeedFetch(url, options)
    rss = FeedGather(rss, url, options)

-    return FeedFormat(rss, options)
+    return FeedFormat(rss, options, 'unicode')


-def cgi_app(environ, start_response):
+def cgi_parse_environ(environ):
    # get options

    if 'REQUEST_URI' in environ:
        url = environ['REQUEST_URI'][1:]
    else:

@@ -496,7 +448,7 @@ def cgi_app(environ, start_response):
    if url.startswith(':'):
        split = url.split('/', 1)

-        options = split[0].replace('|', '/').replace('\\\'', '\'').split(':')[1:]
+        raw_options = unquote(split[0]).replace('|', '/').replace('\\\'', '\'').split(':')[1:]

        if len(split) > 1:
            url = split[1]

@@ -504,23 +456,31 @@ def cgi_app(environ, start_response):
            url = ''

    else:
-        options = []
+        raw_options = []

    # init
-    options = Options(filterOptions(parseOptions(options)))
+    options = Options(filterOptions(parseOptions(raw_options)))
-    headers = {}

    global DEBUG
    DEBUG = options.debug

+    return (url, options)
+
+
+def cgi_app(environ, start_response):
+    url, options = cgi_parse_environ(environ)
+
+    headers = {}
+
    # headers
    headers['status'] = '200 OK'
    headers['cache-control'] = 'max-age=%s' % DELAY
+    headers['x-content-type-options'] = 'nosniff' # safari work around

    if options.cors:
        headers['access-control-allow-origin'] = '*'

-    if options.html or options.reader:
+    if options.html:
        headers['content-type'] = 'text/html'
    elif options.txt or options.silent:
        headers['content-type'] = 'text/plain'

@@ -534,31 +494,54 @@ def cgi_app(environ, start_response):
    else:
        headers['content-type'] = 'text/xml'

+    headers['content-type'] += '; charset=utf-8'
+
    crawler.default_cache = crawler.SQLiteCache(os.path.join(os.getcwd(), 'morss-cache.db'))

    # get the work done
-    rss = FeedFetch(url, options)
+    url, rss = FeedFetch(url, options)

-    if headers['content-type'] == 'text/xml':
-        headers['content-type'] = rss.mimetype[0]
-
    start_response(headers['status'], list(headers.items()))

    rss = FeedGather(rss, url, options)
    out = FeedFormat(rss, options)

-    if not options.silent:
-        return out
+    if options.silent:
+        return ['']
+
+    else:
+        return [out]
+
+
-def cgi_wrapper(environ, start_response):
-    # simple http server for html and css
+def middleware(func):
+    " Decorator to turn a function into a wsgi middleware "
+    # This is called when parsing the "@middleware" code
+
+    def app_builder(app):
+        # This is called when doing app = cgi_wrapper(app)
+
+        def app_wrap(environ, start_response):
+            # This is called when a http request is being processed
+
+            return func(environ, start_response, app)
+
+        return app_wrap
+
+    return app_builder
+
+
+@middleware
+def cgi_file_handler(environ, start_response, app):
+    " Simple HTTP server to serve static files (.html, .css, etc.) "
+
    files = {
        '': 'text/html',
-        'index.html': 'text/html'}
+        'index.html': 'text/html',
+        'sheet.xsl': 'text/xsl'}

    if 'REQUEST_URI' in environ:
        url = environ['REQUEST_URI'][1:]

    else:
        url = environ['PATH_INFO'][1:]

@@ -568,12 +551,10 @@ def cgi_wrapper(environ, start_response):
    if url == '':
        url = 'index.html'

-    if '--root' in sys.argv[1:]:
-        path = os.path.join(sys.argv[-1], url)
-    else:
-        path = url
+    paths = [os.path.join(sys.prefix, 'share/morss/www', url),
+        os.path.join(os.path.dirname(__file__), '../www', url)]

+    for path in paths:
        try:
            body = open(path, 'rb').read()

@@ -583,20 +564,90 @@ def cgi_wrapper(environ, start_response):
            return [body]

        except IOError:
+            continue
+
+    else:
+        # the for loop did not return, so here we are, i.e. no file found
        headers['status'] = '404 Not found'
        start_response(headers['status'], list(headers.items()))
        return ['Error %s' % headers['status']]

-    # actual morss use
+    else:
+        return app(environ, start_response)
+
+
+def cgi_get(environ, start_response):
+    url, options = cgi_parse_environ(environ)
+
+    # get page
+    req = crawler.adv_get(url=url, timeout=TIMEOUT)
+
+    if req['contenttype'] in ['text/html', 'application/xhtml+xml', 'application/xml']:
+        if options.get == 'page':
+            html = readabilite.parse(req['data'], encoding=req['encoding'])
+            html.make_links_absolute(req['url'])
+
+            kill_tags = ['script', 'iframe', 'noscript']
+
+            for tag in kill_tags:
+                for elem in html.xpath('//'+tag):
+                    elem.getparent().remove(elem)
+
+            output = lxml.etree.tostring(html.getroottree(), encoding='utf-8', method='html')
+
+        elif options.get == 'article':
+            output = readabilite.get_article(req['data'], url=req['url'], encoding_in=req['encoding'], encoding_out='utf-8', debug=options.debug)
+
+        else:
+            raise MorssException('no :get option passed')
+
+    else:
+        output = req['data']
+
+    # return html page
+    headers = {'status': '200 OK', 'content-type': 'text/html; charset=utf-8'}
+    start_response(headers['status'], list(headers.items()))
+    return [output]
+
+
+dispatch_table = {
+    'get': cgi_get,
+    }
+
+
+@middleware
+def cgi_dispatcher(environ, start_response, app):
+    url, options = cgi_parse_environ(environ)
+
+    for key in dispatch_table.keys():
+        if key in options:
+            return dispatch_table[key](environ, start_response)
+
+    return app(environ, start_response)
+
+
+@middleware
+def cgi_error_handler(environ, start_response, app):
    try:
-        return [cgi_app(environ, start_response) or '(empty)']
+        return app(environ, start_response)

    except (KeyboardInterrupt, SystemExit):
        raise

    except Exception as e:
-        headers = {'status': '500 Oops', 'content-type': 'text/plain'}
+        headers = {'status': '500 Oops', 'content-type': 'text/html'}
        start_response(headers['status'], list(headers.items()), sys.exc_info())
-        log('ERROR <%s>: %s' % (url, e.message), force=True)
-        return ['An error happened:\n%s' % e.message]
+        log('ERROR: %s' % repr(e), force=True)
+        return [cgitb.html(sys.exc_info())]
+
+
+@middleware
+def cgi_encode(environ, start_response, app):
+    out = app(environ, start_response)
+    return [x if isinstance(x, bytes) else str(x).encode('utf-8') for x in out]
+
+
+cgi_standalone_app = cgi_encode(cgi_error_handler(cgi_dispatcher(cgi_file_handler(cgi_app))))


def cli_app():

@@ -608,12 +659,12 @@ def cli_app():

    crawler.default_cache = crawler.SQLiteCache(os.path.expanduser('~/.cache/morss-cache.db'))

-    rss = FeedFetch(url, options)
+    url, rss = FeedFetch(url, options)
    rss = FeedGather(rss, url, options)
-    out = FeedFormat(rss, options)
+    out = FeedFormat(rss, options, 'unicode')

    if not options.silent:
-        print(out.decode('utf-8', 'replace') if isinstance(out, bytes) else out)
+        print(out)

    log('done')

@@ -622,6 +673,7 @@ def isInt(string):
    try:
        int(string)
        return True

    except ValueError:
        return False

@@ -629,31 +681,46 @@ def isInt(string):
def main():
    if 'REQUEST_URI' in os.environ:
        # mod_cgi
-        wsgiref.handlers.CGIHandler().run(cgi_wrapper)
-
-    elif len(sys.argv) <= 1 or isInt(sys.argv[1]) or '--root' in sys.argv[1:]:
+        app = cgi_app
+        app = cgi_dispatcher(app)
+        app = cgi_error_handler(app)
+        app = cgi_encode(app)
+
+        wsgiref.handlers.CGIHandler().run(app)
+
+    elif len(sys.argv) <= 1 or isInt(sys.argv[1]):
        # start internal (basic) http server

        if len(sys.argv) > 1 and isInt(sys.argv[1]):
            argPort = int(sys.argv[1])
            if argPort > 0:
                port = argPort

            else:
                raise MorssException('Port must be positive integer')

        else:
            port = PORT

+        app = cgi_app
+        app = cgi_file_handler(app)
+        app = cgi_dispatcher(app)
+        app = cgi_error_handler(app)
+        app = cgi_encode(app)
+
        print('Serving http://localhost:%s/' % port)
-        httpd = wsgiref.simple_server.make_server('', port, cgi_wrapper)
+        httpd = wsgiref.simple_server.make_server('', port, app)
        httpd.serve_forever()

    else:
        # as a CLI app
        try:
            cli_app()

        except (KeyboardInterrupt, SystemExit):
            raise

        except Exception as e:
            print('ERROR: %s' % e.message)
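The `middleware` decorator introduced above is the heart of the new WSGI plumbing: each `cgi_*` handler takes `(environ, start_response, app)` and either answers the request itself or delegates to the wrapped `app`, and `cgi_standalone_app` chains them with encoding applied last so every inner return value ends up as bytes. A minimal, self-contained sketch of the same pattern (only `middleware` is copied from the diff; `hello_app` and `encode` are made-up names for illustration):

import wsgiref.simple_server


def middleware(func):
    " Decorator to turn a function into a wsgi middleware "

    def app_builder(app):
        # called when doing app = some_middleware(app)

        def app_wrap(environ, start_response):
            # called for every http request
            return func(environ, start_response, app)

        return app_wrap

    return app_builder


def hello_app(environ, start_response):
    # hypothetical innermost app, standing in for morss' cgi_app
    start_response('200 OK', [('content-type', 'text/plain')])
    return ['hello']


@middleware
def encode(environ, start_response, app):
    # same idea as cgi_encode in the diff: only bytes go out on the wire
    out = app(environ, start_response)
    return [x if isinstance(x, bytes) else str(x).encode('utf-8') for x in out]


if __name__ == '__main__':
    app = encode(hello_app)  # compose wrappers, like cgi_standalone_app does
    httpd = wsgiref.simple_server.make_server('', 8080, app)
    httpd.serve_forever()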
morss/readabilite.py

@@ -1,13 +1,17 @@
import lxml.etree
import lxml.html
+from bs4 import BeautifulSoup
import re


def parse(data, encoding=None):
    if encoding:
-        parser = lxml.html.HTMLParser(remove_blank_text=True, remove_comments=True, encoding=encoding)
+        data = BeautifulSoup(data, 'lxml', from_encoding=encoding).prettify('utf-8')

    else:
-        parser = lxml.html.HTMLParser(remove_blank_text=True, remove_comments=True)
+        data = BeautifulSoup(data, 'lxml').prettify('utf-8')

+    parser = lxml.html.HTMLParser(remove_comments=True, encoding='utf-8')
+
    return lxml.html.fromstring(data, parser=parser)

@@ -43,6 +47,12 @@ def count_content(node):
    return count_words(node.text_content()) + len(node.findall('.//img'))


+def percentile(N, P):
+    # https://stackoverflow.com/a/7464107
+    n = max(int(round(P * len(N) + 0.5)), 2)
+    return N[n-2]
+
+
class_bad = ['comment', 'community', 'extra', 'foot',
    'sponsor', 'pagination', 'pager', 'tweet', 'twitter', 'com-', 'masthead',
    'media', 'meta', 'related', 'shopping', 'tags', 'tool', 'author', 'about',

@@ -60,9 +70,10 @@ class_good = ['and', 'article', 'body', 'column', 'main',
regex_good = re.compile('|'.join(class_good), re.I)


-tags_junk = ['script', 'head', 'iframe', 'object', 'noscript',
-    'param', 'embed', 'layer', 'applet', 'style', 'form', 'input', 'textarea',
-    'button', 'footer']
+tags_dangerous = ['script', 'head', 'iframe', 'object', 'style', 'link', 'meta']
+
+tags_junk = tags_dangerous + ['noscript', 'param', 'embed', 'layer', 'applet',
+    'form', 'input', 'textarea', 'button', 'footer']

tags_bad = tags_junk + ['a', 'aside']

@@ -90,13 +101,24 @@ def score_node(node):
    " Score individual node "

    score = 0
-    class_id = node.get('class', '') + node.get('id', '')
+    class_id = (node.get('class') or '') + (node.get('id') or '')

    if (isinstance(node, lxml.html.HtmlComment)
-            or node.tag in tags_bad
-            or regex_bad.search(class_id)):
+            or isinstance(node, lxml.html.HtmlProcessingInstruction)):
        return 0

+    if node.tag in tags_dangerous:
+        return 0
+
+    if node.tag in tags_junk:
+        score += -1 # actuall -2 as tags_junk is included tags_bad
+
+    if node.tag in tags_bad:
+        score += -1
+
+    if regex_bad.search(class_id):
+        score += -1
+
    if node.tag in tags_good:
        score += 4

@@ -114,33 +136,42 @@ def score_node(node):
    return score


-def score_all(node, grades=None):
+def score_all(node):
    " Fairly dumb loop to score all worthwhile nodes. Tries to be fast "

-    if grades is None:
-        grades = {}
-
    for child in node:
        score = score_node(child)
-        child.attrib['seen'] = 'yes, ' + str(int(score))
+        child.attrib['morss_own_score'] = str(float(score))

-        if score > 0:
-            spread_score(child, score, grades)
-            score_all(child, grades)
+        if score > 0 or len(list(child.iterancestors())) <= 2:
+            spread_score(child, score)
+            score_all(child)

-    return grades

-def spread_score(node, score, grades):
+def set_score(node, value):
+    node.attrib['morss_score'] = str(float(value))
+
+
+def get_score(node):
+    return float(node.attrib.get('morss_score', 0))
+
+
+def incr_score(node, delta):
+    set_score(node, get_score(node) + delta)
+
+
+def get_all_scores(node):
+    return {x:get_score(x) for x in list(node.iter()) if get_score(x) != 0}
+
+
+def spread_score(node, score):
    " Spread the node's score to its parents, on a linear way "

    delta = score / 2

    for ancestor in [node,] + list(node.iterancestors()):
        if score >= 1 or ancestor is node:
-            try:
-                grades[ancestor] += score
-            except KeyError:
-                grades[ancestor] = score
+            incr_score(ancestor, score)

            score -= delta

@@ -148,26 +179,29 @@ def spread_score(node, score, grades):
            break


-def write_score_all(root, grades):
-    " Useful for debugging "
-
-    for node in root.iter():
-        node.attrib['score'] = str(int(grades.get(node, 0)))
-
-
-def clean_root(root):
+def clean_root(root, keep_threshold=None):
    for node in list(root):
-        clean_root(node)
-        clean_node(node)
+        # bottom-up approach, i.e. starting with children before cleaning current node
+        clean_root(node, keep_threshold)
+        clean_node(node, keep_threshold)


-def clean_node(node):
+def clean_node(node, keep_threshold=None):
    parent = node.getparent()

    if parent is None:
        # this is <html/> (or a removed element waiting for GC)
        return

+    # remove dangerous tags, no matter what
+    if node.tag in tags_dangerous:
+        parent.remove(node)
+        return
+
+    if keep_threshold is not None and get_score(node) >= keep_threshold:
+        # high score, so keep
+        return
+
    gdparent = parent.getparent()

    # remove shitty tags

@@ -266,41 +300,59 @@ def lowest_common_ancestor(nodeA, nodeB, max_depth=None):
    return nodeA # should always find one tho, at least <html/>, but needed for max_depth


-def rank_nodes(grades):
+def rank_grades(grades):
+    # largest score to smallest
    return sorted(grades.items(), key=lambda x: x[1], reverse=True)


-def get_best_node(grades):
+def get_best_node(ranked_grades):
    " To pick the best (raw) node. Another function will clean it "

-    if len(grades) == 1:
-        return grades[0]
+    if len(ranked_grades) == 1:
+        return ranked_grades[0]

-    top = rank_nodes(grades)
-    lowest = lowest_common_ancestor(top[0][0], top[1][0], 3)
+    lowest = lowest_common_ancestor(ranked_grades[0][0], ranked_grades[1][0], 3)

    return lowest


-def get_article(data, url=None, encoding=None):
+def get_article(data, url=None, encoding_in=None, encoding_out='unicode', debug=False, threshold=5):
    " Input a raw html string, returns a raw html string of the article "

-    html = parse(data, encoding)
-    scores = score_all(html)
+    html = parse(data, encoding_in)
+    score_all(html)
+    scores = rank_grades(get_all_scores(html))

-    if not len(scores):
+    if not len(scores) or scores[0][1] < threshold:
        return None

    best = get_best_node(scores)

+    if not debug:
+        keep_threshold = percentile([x[1] for x in scores], 0.1)
+        clean_root(best, keep_threshold)
+
    wc = count_words(best.text_content())
    wca = count_words(' '.join([x.text_content() for x in best.findall('.//a')]))

-    if wc - wca < 50 or float(wca) / wc > 0.3:
+    if not debug and (wc - wca < 50 or float(wca) / wc > 0.3):
        return None

    if url:
        best.make_links_absolute(url)

-    clean_root(best)
-
-    return lxml.etree.tostring(best, pretty_print=True)
+    return lxml.etree.tostring(best if not debug else html, method='html', encoding=encoding_out)
+
+
+if __name__ == '__main__':
+    import sys
+    from . import crawler
+
+    req = crawler.adv_get(sys.argv[1] if len(sys.argv) > 1 else 'https://morss.it')
+    article = get_article(req['data'], url=req['url'], encoding_in=req['encoding'], encoding_out='unicode')
+
+    if sys.flags.interactive:
+        print('>>> Interactive shell: try using `article`')
+
+    else:
+        print(article)
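readabilite now stores scores directly on the lxml nodes (`morss_own_score` / `morss_score` attributes) instead of keeping a separate `grades` dict keyed by elements. A small sketch of that idea, using the `set_score` / `get_score` / `incr_score` helpers from the diff (the sample HTML below is made up for the example):

import lxml.html

def set_score(node, value):
    node.attrib['morss_score'] = str(float(value))

def get_score(node):
    return float(node.attrib.get('morss_score', 0))

def incr_score(node, delta):
    set_score(node, get_score(node) + delta)

doc = lxml.html.fromstring('<div><p>some text</p></div>')
p = doc.find('.//p')

incr_score(p, 4)        # e.g. a <p> gets +4, as in score_node
incr_score(p, 1.5)
print(get_score(p))     # 5.5
print(lxml.html.tostring(doc))  # the score travels with the node

Because the score lives in the element attributes, it survives later tree manipulation such as `clean_root()` and is visible in the serialized output when `debug` is on.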
morss/reader.html.template (file deleted)

@@ -1,210 +0,0 @@
-@require(feed)
-<!DOCTYPE html>
-<html>
-<head>
-    <title>@feed.title – via morss</title>
-    <meta charset="UTF-8" />
-    <meta name="description" content="@feed.desc (via morss)" />
-    <meta name="viewport" content="width=device-width; initial-scale=1.0; maximum-scale=1.0;" />
-
-    <style type="text/css">
-        /* columns - from https://thisisdallas.github.io/Simple-Grid/simpleGrid.css */
-
-        * {
-            box-sizing: border-box;
-        }
-
-        #content {
-            width: 100%;
-            max-width: 1140px;
-            min-width: 755px;
-            margin: 0 auto;
-            overflow: hidden;
-
-            padding-top: 20px;
-            padding-left: 20px; /* grid-space to left */
-            padding-right: 0px; /* grid-space to right: (grid-space-left - column-space) e.g. 20px-20px=0 */
-        }
-
-        .item {
-            width: 33.33%;
-            float: left;
-            padding-right: 20px; /* column-space */
-        }
-
-        @@media handheld, only screen and (max-width: 767px) { /* @@ to escape from the template engine */
-            #content {
-                width: 100%;
-                min-width: 0;
-                margin-left: 0px;
-                margin-right: 0px;
-                padding-left: 20px; /* grid-space to left */
-                padding-right: 10px; /* grid-space to right: (grid-space-left - column-space) e.g. 20px-10px=10px */
-            }
-
-            .item {
-                width: auto;
-                float: none;
-                margin-left: 0px;
-                margin-right: 0px;
-                margin-top: 10px;
-                margin-bottom: 10px;
-                padding-left: 0px;
-                padding-right: 10px; /* column-space */
-            }
-        }
-
-        /* design */
-
-        #header h1, #header h2, #header p {
-            font-family: sans;
-            text-align: center;
-            margin: 0;
-            padding: 0;
-        }
-
-        #header h1 {
-            font-size: 2.5em;
-            font-weight: bold;
-            padding: 1em 0 0.25em;
-        }
-
-        #header h2 {
-            font-size: 1em;
-            font-weight: normal;
-        }
-
-        #header p {
-            color: gray;
-            font-style: italic;
-            font-size: 0.75em;
-        }
-
-        #content {
-            text-align: justify;
-        }
-
-        .item .title {
-            font-weight: bold;
-            display: block;
-            text-align: center;
-        }
-
-        .item .link {
-            color: inherit;
-            text-decoration: none;
-        }
-
-        .item:not(.active) {
-            cursor: pointer;
-
-            height: 20em;
-            margin-bottom: 20px;
-            overflow: hidden;
-            text-overflow: ellpisps;
-
-            padding: 0.25em;
-            position: relative;
-        }
-
-        .item:not(.active) .title {
-            padding-bottom: 0.1em;
-            margin-bottom: 0.1em;
-            border-bottom: 1px solid silver;
-        }
-
-        .item:not(.active):before {
-            content: " ";
-            display: block;
-            width: 100%;
-            position: absolute;
-            top: 18.5em;
-            height: 1.5em;
-            background: linear-gradient(to bottom, rgba(255,255,255,0) 0%, rgba(255,255,255,1) 100%);
-        }
-
-        .item:not(.active) .article * {
-            max-width: 100%;
-            font-size: 1em !important;
-            font-weight: normal;
-            display: inline;
-            margin: 0;
-        }
-
-        .item.active {
-            background: white;
-            position: fixed;
-            overflow: auto;
-            top: 0;
-            left: 0;
-            height: 100%;
-            width: 100%;
-            z-index: 1;
-        }
-
-        body.noscroll {
-            overflow: hidden;
-        }
-
-        .item.active > * {
-            max-width: 700px;
-            margin: auto;
-        }
-
-        .item.active .title {
-            font-size: 2em;
-            padding: 0.5em 0;
-        }
-
-        .item.active .article object,
-        .item.active .article video,
-        .item.active .article audio {
-            display: none;
-        }
-
-        .item.active .article img {
-            max-height: 20em;
-            max-width: 100%;
-        }
-    </style>
-</head>
-
-<body>
-    <div id="header">
-        <h1>@feed.title</h1>
-        @if feed.desc:
-            <h2>@feed.desc</h2>
-        @end
-        <p>- via morss</p>
-    </div>
-
-    <div id="content">
-        @for item in feed.items:
-            <div class="item">
-                @if item.link:
-                    <a class="title link" href="@item.link" target="_blank">@item.title</a>
-                @else:
-                    <span class="title">@item.title</span>
-                @end
-                <div class="article">
-                    @if item.content:
-                        @item.content
-                    @else:
-                        @item.desc
-                    @end
-                </div>
-            </div>
-        @end
-    </div>
-
-    <script>
-        var items = document.getElementsByClassName('item')
-        for (var i in items)
-            items[i].onclick = function()
-            {
-                this.classList.toggle('active')
-                document.body.classList.toggle('noscroll')
-            }
-    </script>
-</body>
-</html>
@@ -1,4 +0,0 @@
-lxml
-python-dateutil <= 1.5
-chardet
-pymysql
setup.py (20 changes)

@@ -1,14 +1,24 @@
-from setuptools import setup, find_packages
+from setuptools import setup
+from glob import glob

package_name = 'morss'

setup(
    name = package_name,
    description = 'Get full-text RSS feeds',
    author = 'pictuga, Samuel Marks',
    author_email = 'contact at pictuga dot com',
    url = 'http://morss.it/',
+    download_url = 'https://git.pictuga.com/pictuga/morss',
    license = 'AGPL v3',
-    package_dir={package_name: package_name},
-    packages=find_packages(),
-    package_data={package_name: ['feedify.ini', 'reader.html.template']},
-    test_suite=package_name + '.tests')
+    packages = [package_name],
+    install_requires = ['lxml', 'bs4', 'python-dateutil', 'chardet', 'pymysql'],
+    package_data = {package_name: ['feedify.ini']},
+    data_files = [
+        ('share/' + package_name, ['README.md', 'LICENSE']),
+        ('share/' + package_name + '/www', glob('www/*.*')),
+        ('share/' + package_name + '/www/cgi', [])
+    ],
+    entry_points = {
+        'console_scripts': [package_name + '=' + package_name + ':main']
+    })
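With the new `entry_points` block, installing the package makes setuptools generate a `morss` console command that calls `morss:main`. Roughly equivalent to this hand-written wrapper (a sketch; the real launcher script is generated by setuptools):

# sketch of what the generated `morss` console script boils down to
import sys
from morss import main

if __name__ == '__main__':
    sys.exit(main())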
@@ -35,8 +35,8 @@
    <input type="text" id="url" name="url" placeholder="Feed url (http://example.com/feed.xml)" />
</form>

-<code>Copyright: pictuga 2013-2014<br/>
-Source code: https://github.com/pictuga/morss</code>
+<code>Copyright: pictuga 2013-2020<br/>
+Source code: https://git.pictuga.com/pictuga/morss</code>

<script>
form = document.forms[0]
www/sheet.xsl (332 changes)
@@ -1,5 +1,12 @@
<?xml version="1.0" encoding="utf-8"?>
-<xsl:stylesheet version="1.1" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
+<xsl:stylesheet version="1.1"
+    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
+    xmlns:atom="http://www.w3.org/2005/Atom"
+    xmlns:atom03="http://purl.org/atom/ns#"
+    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
+    xmlns:content="http://purl.org/rss/1.0/modules/content/"
+    xmlns:rssfake="http://purl.org/rss/1.0/"
+    >
+
<xsl:output method="html"/>

@@ -7,116 +14,273 @@
<html>
<head>
    <title>RSS feed by morss</title>
-    <meta name="viewport" content="width=device-width; initial-scale=1.0; maximum-scale=1.0;" />
+    <meta name="viewport" content="width=device-width; initial-scale=1.0;" />
+    <meta name="robots" content="noindex" />

    <style type="text/css">
        body {
            overflow-wrap: anywhere;
            word-wrap: anywhere;
+            word-break: break-word;
+
+            font-family: sans-serif;
        }

-        #url {
-            background-color: rgba(255, 165, 0, 0.25);
-            padding: 1% 5%;
-            display: inline-block;
+        input, select {
+            font-family: inherit;
+            font-size: inherit;
+            text-align: inherit;
+        }
+
+        header {
+            text-align: justify;
+            text-align-last: center;
+            border-bottom: 1px solid silver;
+        }
+
+        .input-combo {
+            display: flex;
+            flex-flow: row;
+            align-items: stretch;
+
+            width: 800px;
            max-width: 100%;
-        }
+            margin: auto;

-        body > ul {
+            border: 1px solid grey;
+            padding: .5em .5em;
            background-color: #FFFAF4;
+        }
+
+        .input-combo * {
+            display: inline-block;
+            line-height: 2em;
+            border: 0;
+            background: transparent;
+        }
+
+        .input-combo > :not(.button) {
+            max-width: 100%;
+            flex-grow: 1;
+            flex-shrink 0;
+
+            white-space: nowrap;
+            text-overflow: ellipsis;
+            overflow: hidden;
+        }
+
+        .input-combo .button {
+            flex-grow: 0;
+            flex-shrink 1;
+
+            cursor: pointer;
+            min-width: 2em;
+            text-align: center;
+            border-left: 1px solid silver;
+            color: #06f;
+        }
+
+        [onclick_title] {
+            cursor: pointer;
+            position: relative;
+        }
+
+        [onclick_title]::before {
+            opacity: 0;
+
+            content: attr(onclick_title);
+            font-weight: normal;
+
+            position: absolute;
+            left: -300%;
+
+            z-index: 1;
+
+            background: grey;
+            color: white;
+
+            border-radius: 0.5em;
+            padding: 0 1em;
+        }
+
+        [onclick_title]:not(:active)::before {
+            transition: opacity 1s ease-in-out;
+        }
+
+        [onclick_title]:active::before {
+            opacity: 1;
+        }
+
+        header > form {
+            margin: 1%;
+        }
+
+        header a {
+            text-decoration: inherit;
+            color: #FF7B0A;
+            font-weight: bold;
+        }
+
+        .item {
+            background-color: #FFFAF4;
+            border: 1px solid silver;
+            margin: 1%;
+            max-width: 100%;
+        }
+
+        .item > * {
            padding: 1%;
+        }
+
+        .item > :not(:last-child) {
+            border-bottom: 1px solid silver;
+        }
+
+        .item > a {
+
+            display: block;
+            font-weight: bold;
+            font-size: 1.5em;
+        }
+
+        .desc, .content {
+            overflow: hidden;
+        }
+
+        .desc *, .content * {
            max-width: 100%;
        }

-        ul {
-            list-style-type: none;
-        }
-
-        .tag {
-            color: darkred;
-        }
-
-        .attr {
-            color: darksalmon;
-        }
-
-        .value {
-            color: darkblue;
-        }
-
-        .comment {
-            color: lightgrey;
-        }
-
-        pre {
-            margin: 0;
-            max-width: 100%;
-            white-space: normal;
-        }
    </style>
</head>

<body>
+    <header>
    <h1>RSS feed by morss</h1>

    <p>Your RSS feed is <strong style="color: green">ready</strong>. You
    can enter the following url in your newsreader:</p>

-    <div id="url"></div>
+    <div class="input-combo">
+        <input id="url" readonly="readonly"/>
+        <span class="button" onclick="copy_link()" title="Copy" onclick_title="Copied">
+            <svg width="16px" height="16px" viewBox="0 0 16 16" fill="currentColor" xmlns="http://www.w3.org/2000/svg">
+                <path fill-rule="evenodd" d="M4 1.5H3a2 2 0 00-2 2V14a2 2 0 002 2h10a2 2 0 002-2V3.5a2 2 0 00-2-2h-1v1h1a1 1 0 011 1V14a1 1 0 01-1 1H3a1 1 0 01-1-1V3.5a1 1 0 011-1h1v-1z" clip-rule="evenodd"/>
+                <path fill-rule="evenodd" d="M9.5 1h-3a.5.5 0 00-.5.5v1a.5.5 0 00.5.5h3a.5.5 0 00.5-.5v-1a.5.5 0 00-.5-.5zm-3-1A1.5 1.5 0 005 1.5v1A1.5 1.5 0 006.5 4h3A1.5 1.5 0 0011 2.5v-1A1.5 1.5 0 009.5 0h-3z" clip-rule="evenodd"/>
+            </svg>
+        </span>
+    </div>

-    <ul>
-        <xsl:apply-templates/>
-    </ul>
+    <form onchange="open_feed()">
+        More options: Output the
+        <select>
+            <option value="">full-text</option>
+            <option value=":proxy">original</option>
+            <option value=":clip" title="original + full-text: keep the original description above the full article. Useful for reddit feeds for example, to keep the comment links">combined (?)</option>
+        </select>
+        feed as
+        <select>
+            <option value="">RSS</option>
+            <option value=":json:cors">JSON</option>
+            <option value=":html">HTML</option>
+            <option value=":csv">CSV</option>
+        </select>
+        using the
+        <select>
+            <option value="">standard</option>
+            <option value=":firstlink" title="Pull the article from the first available link in the description, instead of the standard link. Useful for Twitter feeds for example, to get the articles referred to in tweets rather than the tweet itself">first (?)</option>
+        </select>
+        link and
+        <select>
+            <option value="">keep</option>
+            <option value=":nolink:noref">remove</option>
+        </select>
+        links
+        <input type="hidden" value="" name="extra_options"/>
+    </form>
+
+    <p>You can find a <em>preview</em> of the feed below. You need a <em>feed reader</em> for optimal use</p>
+    <p>Click <a href="/">here</a> to go back to morss and/or to use the tool on another feed</p>
+    </header>
+
+    <div id="header" dir="auto">
+        <h1>
+            <xsl:value-of select="rdf:RDF/rssfake:channel/rssfake:title|rss/channel/title|atom:feed/atom:title|atom03:feed/atom03:title"/>
+        </h1>
+
+        <p>
+            <xsl:value-of select="rdf:RDF/rssfake:channel/rssfake:description|rss/channel/description|atom:feed/atom:subtitle|atom03:feed/atom03:subtitle"/>
+        </p>
+    </div>
+
+    <div id="content">
+        <xsl:for-each select="rdf:RDF/rssfake:channel/rssfake:item|rss/channel/item|atom:feed/atom:entry|atom03:feed/atom03:entry">
+            <div class="item" dir="auto">
+                <a href="/" target="_blank"><xsl:attribute name="href"><xsl:value-of select="rssfake:link|link|atom:link/@href|atom03:link/@href"/></xsl:attribute>
+                    <xsl:value-of select="rssfake:title|title|atom:title|atom03:title"/>
+                </a>
+
+                <div class="desc">
+                    <xsl:copy-of select="rssfake:description|description|atom:summary|atom03:summary"/>
+                </div>
+
+                <div class="content">
+                    <xsl:copy-of select="content:encoded|atom:content|atom03:content"/>
+                </div>
+            </div>
+        </xsl:for-each>
+    </div>

    <script>
-        document.getElementById("url").innerHTML = window.location.href;
+        //<![CDATA[
+        document.getElementById("url").value = window.location.href
+
+        if (!/:html/.test(window.location.href))
+            for (var content of document.querySelectorAll(".desc,.content"))
+                content.innerHTML = (content.innerText.match(/>/g) || []).length > 10 ? content.innerText : content.innerHTML
+
+        var options = parse_location()[0]
+
+        if (options) {
+            for (var select of document.forms[0].elements)
+                if (select.tagName == 'SELECT')
+                    for (var option of select)
+                        if (option.value && options.match(option.value)) {
+                            select.value = option.value
+                            options = options.replace(option.value, '')
+                            break
+                        }
+
+            document.forms[0]['extra_options'].value = options
+        }
+
+        function copy_content(input) {
+            input.focus()
+            input.select()
+            document.execCommand('copy')
+            input.blur()
+        }
+
+        function copy_link() {
+            copy_content(document.getElementById("url"))
+        }
+
+        function parse_location() {
+            return (window.location.pathname + window.location.search).match(/^\/(?:(:[^\/]+)\/)?(.*$)$/).slice(1)
+        }
+
+        function open_feed() {
+            var url = parse_location()[1]
+            var options = Array.from(document.forms[0].elements).map(x=>x.value).join('')
+
+            var target = '/' + (options ? options + '/' : '') + url
+
+            if (target != window.location.pathname)
+                window.location.href = target
+        }
+        //]]>
    </script>
</body>
</html>
</xsl:template>

-<xsl:template match="*">
-    <li>
-        <span class="element">
-            <
-            <span class="tag"><xsl:value-of select="name()"/></span>
-
-            <xsl:for-each select="@*">
-                <span class="attr"> <xsl:value-of select="name()"/></span>
-                =
-                "<span class="value"><xsl:value-of select="."/></span>"
-            </xsl:for-each>
-            >
-        </span>
-
-        <xsl:if test="node()">
-            <ul>
-                <xsl:apply-templates/>
-            </ul>
-        </xsl:if>
-
-        <span class="element">
-            </
-            <span class="tag"><xsl:value-of select="name()"/></span>
-            >
-        </span>
-    </li>
|
|
||||||
</xsl:template>
|
|
||||||
|
|
||||||
<xsl:template match="comment()">
|
|
||||||
<li>
|
|
||||||
<pre class="comment"><![CDATA[<!--]]><xsl:value-of select="."/><![CDATA[-->]]></pre>
|
|
||||||
</li>
|
|
||||||
</xsl:template>
|
|
||||||
|
|
||||||
<xsl:template match="text()">
|
|
||||||
<li>
|
|
||||||
<pre>
|
|
||||||
<xsl:value-of select="normalize-space(.)"/>
|
|
||||||
</pre>
|
|
||||||
</li>
|
|
||||||
</xsl:template>
|
|
||||||
|
|
||||||
<xsl:template match="text()[not(normalize-space())]"/>
|
|
||||||
|
|
||||||
</xsl:stylesheet>
|
</xsl:stylesheet>
|