Compare commits

...

276 Commits

Author SHA1 Message Date
c5b2df754e Save auto version number
All checks were successful
default / test-lint (push) Successful in 1m31s
default / python-publish (push) Successful in 35s
default / docker-publish-deploy (push) Successful in 1m56s
Fixed #108
2023-06-27 22:36:29 +02:00
6529fdbdd8 Clean up sqlite code
All checks were successful
default / test-lint (push) Successful in 1m26s
default / python-publish (push) Successful in 30s
default / docker-publish-deploy (push) Successful in 1m35s
2023-06-26 01:30:47 +02:00
f4da40fffb actions: fix deploy 2023-06-26 01:29:00 +02:00
d27fc93f75 actions: clean up 2023-06-26 01:28:33 +02:00
dfb2b83c06 actions: fix python setup
Some checks reported warnings
default / publish-deploy (push) Has been cancelled
default / docker-publish (push) Has been cancelled
default / test-lint (push) Successful in 1m33s
2023-06-24 01:50:12 +02:00
4340b678d0 actions: change image
Some checks failed
default / test-lint (push) Failing after 23s
default / publish-deploy (push) Failing after 12s
default / docker-publish (push) Successful in 2m19s
2023-06-23 23:14:32 +02:00
ff9503b0d0 Switch from Drone to Gitea Actions
Some checks failed
default / publish-deploy (push) Failing after 45s
default / docker-publish (push) Failing after 10s
default / test-lint (push) Failing after 11m42s
2023-05-17 22:54:05 +02:00
Nesswit 8bdcd8f386 Add mode option 2023-05-04 16:01:52 +09:00
ea2ebedfcb Added systemd service file
Some checks failed
continuous-integration/drone/push Build is failing
Fixing #94
2022-12-13 23:01:42 +01:00
438c32a312 Remove sqlite & mysql cache backends
Some checks failed
continuous-integration/drone/push Build is failing
Obsoleted since the introduction of diskcache & redis
2022-12-13 22:40:13 +01:00
8b26797e93 README: add recommended install way
Some checks reported errors
continuous-integration/drone/push Build was killed
continuous-integration/drone Build is passing
Part of discussions on #94
2022-12-13 22:07:21 +01:00
e1ed33f320 crawler: improve html iter code
All checks were successful
continuous-integration/drone/push Build is passing
Ignores tags without attributes. Avoids bug with unclosed tags.
2022-02-09 15:57:12 +01:00
b65272daab crawler: accept more meta redirects
All checks were successful
continuous-integration/drone/push Build is passing
2022-02-01 23:32:49 +01:00
4d64afe9cb crawler: fix regression from d6b90448f3
Some checks failed
continuous-integration/drone/push Build is failing
2022-02-01 23:18:16 +01:00
d3b623482d pytest: crawler 2022-02-01 23:16:43 +01:00
32645548c2 pytest: first batch with test_feeds
Some checks failed
continuous-integration/drone/push Build is failing
And multiple related fixes
2022-01-31 08:32:34 +01:00
d6b90448f3 crawler: improve handling of non-ascii urls 2022-01-30 23:27:49 +01:00
da81edc651 log to stderr
Some checks failed
continuous-integration/drone/push Build is failing
2022-01-26 07:57:57 +01:00
4f2895f931 cli: update --help
Some checks failed
continuous-integration/drone/push Build is failing
2022-01-25 22:36:57 +01:00
b2b04691d6 Ability to pass custom data_files location 2022-01-25 22:36:34 +01:00
bfaf7b0fac feeds: clean up default item_link
Some checks failed
continuous-integration/drone/push Build is failing
To be supported by feeds' `_rule_parse`
2022-01-24 16:16:14 +00:00
32d9bc9d9d feeds: proceed with conversion when rules do not match
Some checks failed
continuous-integration/drone/push Build is failing
2022-01-24 09:34:57 +00:00
b138f11771 util: support more data_files location
All checks were successful
continuous-integration/drone/push Build is passing
2022-01-23 12:40:18 +01:00
a01258700d More ordering options
Some checks reported errors
continuous-integration/drone/push Build was killed
2022-01-23 12:27:07 +01:00
4d6d3c9239 wsgi: limit supported mimetypes & return actual mimetype
All checks were successful
continuous-integration/drone/push Build is passing
2022-01-23 11:44:07 +01:00
e81f6b173f readabilite: remove code duplicate 2022-01-23 11:41:32 +01:00
fe5dbf1ce0 wsgi: reuse mimetype table from crawler 2022-01-22 13:22:39 +01:00
fdf9acd32b helper: fix reload code
All checks were successful
continuous-integration/drone/push Build is passing
2022-01-19 13:44:15 +01:00
d05706e056 crawler: fix typo
Some checks reported errors
continuous-integration/drone/push Build was killed
2022-01-19 13:41:12 +01:00
e88a823ada feeds: better handle rulesets without a 'mode' specified
Some checks failed
continuous-integration/drone/push Build is failing
2022-01-19 13:08:33 +01:00
750850c162 crawler: avoid too many .append() 2022-01-19 13:04:33 +01:00
c8669002e4 feeds: exotic xpath in html as well
All checks were successful
continuous-integration/drone/push Build is passing
2022-01-17 14:22:48 +00:00
c524e54d2d feeds: support some exotic xpath rules returning a single string
All checks were successful
continuous-integration/drone/push Build is passing
2022-01-17 13:59:58 +00:00
ef14567d87 Handle morss-helper with setup.py
All checks were successful
continuous-integration/drone/push Build is passing
2022-01-08 16:10:51 +01:00
fb643f5ef1 readabilite: remove unneeded reference to features (overriden by builder)
All checks were successful
continuous-integration/drone/push Build is passing
2022-01-03 18:01:12 +00:00
dbdca910d8 readabilite: fix new parser code & drop PIs
Some checks reported errors
continuous-integration/drone/push Build was killed
2022-01-03 17:51:49 +00:00
9eb19fac04 readabilite: use custom html parser within bs4's lxml parser
All checks were successful
continuous-integration/drone/push Build is passing
Solves the following obscure error:
ValueError: Invalid PI name 'b'xml''
2022-01-03 16:26:17 +00:00
d424e394d1 readabilite: use lxml bs4 parser for speed
All checks were successful
continuous-integration/drone/push Build is passing
2022-01-01 14:52:48 +01:00
3f92787b38 readabilite: limit html comments related issues
All checks were successful
continuous-integration/drone/push Build is passing
2022-01-01 13:58:42 +01:00
afc31eb6e9 readabilite: avoid double parsing of html
All checks were successful
continuous-integration/drone/push Build is passing
2022-01-01 12:51:30 +01:00
87d2fe772d wsgi: fix py2 compatibility 2022-01-01 12:35:41 +01:00
917aa0fbc5 crawler: do not re-save cached response
All checks were successful
continuous-integration/drone/push Build is passing
Otherwise cache never gets invalidated!
2021-12-31 19:28:11 +01:00
3e2b81286f xsl: add link to favicon
To limit error output when failing to fetch favicon.ico
2021-12-31 19:25:53 +01:00
15430a2b83 helper: restore run if no param passed
All checks were successful
continuous-integration/drone/push Build is passing
2021-12-29 23:35:16 +01:00
ecdb74812d Make helper & main.py executable
All checks were successful
continuous-integration/drone/push Build is passing
2021-12-29 15:47:05 +01:00
2c7844942c drone: re order deploy commands
Some checks reported errors
continuous-integration/drone/push Build was killed
2021-12-29 15:41:29 +01:00
e12cb4567a helper: more debug options 2021-12-29 15:41:03 +01:00
b74365b121 Make helper more posix compliant 2021-12-29 15:40:43 +01:00
2020543469 Make morss-helper executable 2021-12-29 15:37:12 +01:00
676be4a4fe helper: work around for systems only having py3 binary
Some checks are pending
continuous-integration/drone/push Build is running
2021-12-29 14:07:12 +01:00
8870400a6e Clean up morss-helper
Some checks failed
continuous-integration/drone/push Build is failing
2021-12-28 16:30:20 +01:00
8e9cc541b0 Turns out exec array is not supported in HEALTHCHECK
Some checks failed
continuous-integration/drone/push Build is failing
2021-12-28 15:23:40 +01:00
2a7a1b83ec Use alpine:edge to have up-to-date py packages
All checks were successful
continuous-integration/drone/push Build is passing
2021-12-28 13:41:42 +01:00
106f59afa1 docker: shift HEALTHCHECK to helper
Some checks failed
continuous-integration/drone/push Build is failing
2021-12-27 16:08:55 +01:00
ee514e2da3 helper: remove unneeded sudo
All checks were successful
continuous-integration/drone/push Build is passing
2021-12-25 22:00:22 +00:00
e7578e859a Clean up install/exec
All checks were successful
continuous-integration/drone/push Build is passing
2021-12-25 18:21:55 +01:00
3bcb8db974 Improve cloud-init (append & env var) 2021-12-25 11:02:27 +01:00
7751792942 Shift htaccess to README 2021-12-24 18:03:55 +01:00
6e2e5ffa00 README: cloud-init indication for env var
All checks were successful
continuous-integration/drone/push Build is passing
2021-12-24 11:44:56 +01:00
f6da0e1e9b Make use of GUNICORN_CMD_ARGS 2021-12-24 11:44:24 +01:00
2247ba13c5 drone: clean up file
All checks were successful
continuous-integration/drone/push Build is passing
2021-12-23 12:03:50 +01:00
d17b9a2f27 Fix typo in DISKCACHE_DIR var name
Some checks reported errors
continuous-integration/drone/push Build was killed
2021-12-23 12:02:24 +01:00
5ab45e60af README: scale back on logos
All checks were successful
continuous-integration/drone/push Build is passing
2021-12-16 09:03:43 +00:00
368e4683d6 util: clean paths code
Some checks reported errors
continuous-integration/drone/push Build was killed
2021-12-16 08:53:18 +00:00
9fd8c7d6af drone: add back install command on deploy
All checks were successful
continuous-integration/drone/push Build is passing
Was lost on the way
2021-12-14 15:42:02 +00:00
89f5d07408 drone: use docker for ssh
Some checks reported errors
continuous-integration/drone/push Build was killed
ssh pipelines require a separate runner
2021-12-14 15:33:38 +00:00
495bd44893 drone: escape full command
Some checks reported errors
continuous-integration/drone/push Build was killed
2021-12-14 15:16:21 +00:00
ff12dbea39 drone: escape $ sign 2021-12-14 15:12:22 +00:00
7885ab48df drone: deploy 2021-12-14 15:10:46 +00:00
7cdcbd23e1 wsgi: fix another typo
All checks were successful
continuous-integration/drone/push Build is passing
2021-12-14 12:06:08 +00:00
25f283da1f wsgi: fix bug following the removal of the loop
Some checks reported errors
continuous-integration/drone/push Build was killed
2021-12-14 11:56:55 +00:00
727d14e539 wsgi: use data_files helper
Some checks reported errors
continuous-integration/drone/push Build was killed
2021-12-14 11:47:10 +00:00
3392ae3973 util: try one more path for data_files
All checks were successful
continuous-integration/drone/push Build is passing
2021-12-14 11:10:26 +00:00
0111ea1749 README: add py stats
All checks were successful
continuous-integration/drone/push Build is passing
2021-12-14 09:45:38 +00:00
def397de5e Dockerfile: add setuptools for gunicorn
All checks were successful
continuous-integration/drone/push Build is passing
Otherwise gets removed with pip
2021-12-12 21:11:38 +00:00
d07aa566ed Dockerfile: keep source files
All checks were successful
continuous-integration/drone/push Build is passing
Will need to be sorted out. `docker-entry.sh` was also deleted.
2021-12-12 18:38:39 +00:00
0ee16d4a7d Install setproctitle from pkg mgrs
All checks were successful
continuous-integration/drone/push Build is passing
2021-12-11 18:50:28 +01:00
ac9859d955 setup: add setproctitle to full install for gunicorn
Some checks failed
continuous-integration/drone/push Build is failing
Also update the readme section regarding the full install
2021-12-11 18:37:24 +01:00
580565da77 Dockerfile: reduce # of steps & image size
Some checks reported errors
continuous-integration/drone/push Build was killed
2021-12-11 18:28:35 +01:00
b2600152ea docker: remove unneeded git dep
All checks were successful
continuous-integration/drone/push Build is passing
2021-12-10 15:14:50 +00:00
27d8f06308 README: badge for one click deployment
Some checks failed
continuous-integration/drone/push Build is failing
2021-12-10 15:11:57 +00:00
79c4477cfc cloud-init: simplify install
All checks were successful
continuous-integration/drone/push Build is passing
Use pypi package, fix typo in command
2021-12-10 14:05:58 +00:00
c09aa8400a README: add badge for docker architectures
All checks were successful
continuous-integration/drone/push Build is passing
2021-12-10 13:39:21 +00:00
861c275f5b README: badge time
All checks were successful
continuous-integration/drone/push Build is passing
2021-12-10 11:58:20 +00:00
99a855c8fc drone: add arm/v7 docker builds
All checks were successful
continuous-integration/drone/push Build is passing
2021-12-10 00:51:22 +01:00
bef7899cdd setup.py: min python version
Some checks reported errors
continuous-integration/drone/push Build was killed
2021-12-10 00:44:29 +01:00
7513a3e74d setup: source code & bug tracker link 2021-12-10 00:44:18 +01:00
5bf93b83df drone: warn about qemu deps
All checks were successful
continuous-integration/drone/push Build is passing
2021-12-08 23:11:01 +01:00
e7ecc018c5 drone: working multi arch docker build
Some checks reported errors
continuous-integration/drone/push Build encountered an error
2021-12-07 12:53:06 +01:00
34b7468ba5 drone: fix bug
All checks were successful
continuous-integration/drone/push Build is passing
2021-12-06 21:28:06 +01:00
5336d26204 drone: separate pipeline for py/docker 2021-12-06 21:23:03 +01:00
c7082dcf6c README: typo in link 2021-12-06 20:50:43 +01:00
c785adb4c3 README: nicer links
All checks were successful
continuous-integration/drone/push Build is passing
2021-12-06 08:31:17 +01:00
73798d2fc1 README: ref to pypi & docker hub
All checks were successful
continuous-integration/drone/push Build is passing
2021-12-05 20:31:44 +01:00
18daf378e8 README: docker hub instructions
All checks were successful
continuous-integration/drone/push Build is passing
2021-12-05 20:28:03 +01:00
aa2b747c5e drone: mono platform docker
All checks were successful
continuous-integration/drone/push Build is passing
2021-12-05 20:06:32 +01:00
d390ed9715 drone: docker buildx
Some checks failed
continuous-integration/drone/push Build is failing
2021-12-05 19:51:42 +01:00
0a5a8ceb7f README: add pip pkg instructions
All checks were successful
continuous-integration/drone/push Build is passing
2021-12-05 19:25:02 +01:00
d2d9d7f22e setup: long desc md
All checks were successful
continuous-integration/drone/push Build is passing
2021-12-05 19:14:57 +01:00
29ae99c24d setup: long desc from readme
Some checks failed
continuous-integration/drone/push Build is failing
2021-12-05 18:59:14 +01:00
ed06ae6398 setup: clean up author string
All checks were successful
continuous-integration/drone/push Build is passing
2021-12-05 18:48:47 +01:00
c3318d4af0 setup: auto version based on date (yyyymmdd.hhmm) 2021-12-05 18:48:29 +01:00
4e577d3266 setup: proper email to suit pypi
Some checks failed
continuous-integration/drone/push Build is failing
2021-12-05 18:04:24 +01:00
22fc0e076b cloud-init per boot
Some checks failed
continuous-integration/drone/push Build is failing
2021-12-05 16:49:40 +01:00
856be36769 pypi typo
Some checks reported errors
continuous-integration/drone/push Build is failing
continuous-integration/drone Build was killed
2021-12-05 16:30:59 +01:00
397bd61374 drone: lint once pylint is installed...
Some checks reported errors
continuous-integration/drone/push Build was killed
2021-12-05 16:29:08 +01:00
25d63f2aee drone: pypy publish
Some checks failed
continuous-integration/drone/push Build is failing
2021-12-05 16:26:53 +01:00
4a8dca1fbf drone: simplify lint 2021-12-05 16:25:49 +01:00
51f1d330a4 Fn to access data_files & pkg files
Some checks are pending
continuous-integration/drone Build is running
continuous-integration/drone/push Build is passing
2021-12-05 12:09:01 +01:00
11bc9f643e setup: add [dev] for pylint
All checks were successful
continuous-integration/drone Build is passing
continuous-integration/drone/push Build is passing
2021-12-04 14:26:40 +01:00
b600bbc256 drone: Remove buggy pylint pkg
Some checks failed
continuous-integration/drone Build is failing
continuous-integration/drone/push Build is failing
2021-12-04 12:01:10 +01:00
502366db10 cloud-init open port
Some checks failed
continuous-integration/drone Build is failing
continuous-integration/drone/push Build is failing
2021-11-30 23:08:37 +01:00
296b69f40e cloud-init fix typo in pkg name 2021-11-30 22:43:05 +01:00
a2deb90185 drone: remove gevent leftover 2021-11-30 22:42:26 +01:00
72024f2864 Remove default settings for gunicorn 2021-11-26 07:27:05 +01:00
440f7d6797 gunicorn: broader customization
Some checks failed
continuous-integration/drone/push Build is failing
2021-11-25 22:43:40 +01:00
eb47aac6f1 morss: respect timeout settings in all cases
Some checks failed
continuous-integration/drone/push Build is failing
Special treatment of feed fetch not justified and not documented
2021-11-25 22:13:38 +01:00
eca546b890 Change HTTP error code to 404
Some checks failed
continuous-integration/drone/push Build is failing
To tell them apart from 'true' 500 errors
2021-11-25 21:34:46 +01:00
5422d4e14c Move away from gevent
Some checks failed
continuous-integration/drone/push Build is failing
Might not be that reliable
2021-11-25 21:21:59 +01:00
1837eda25f heroku: add WORKERS to env vars
Some checks failed
continuous-integration/drone/push Build is failing
2021-11-24 21:40:46 +01:00
321763710d heroku: make env var customizable 2021-11-24 21:40:34 +01:00
e79c426c6e Add ability to change workers count for gunicorn 2021-11-24 21:36:28 +01:00
92a28be0b0 drone: add gevent dep
All checks were successful
continuous-integration/drone/push Build is passing
2021-11-24 21:25:29 +01:00
70db524664 docker: allow sh run
Some checks failed
continuous-integration/drone/push Build is failing
2021-11-24 21:23:16 +01:00
d8cc07223e readabilite: fix bug when nothing above threshold
Some checks failed
continuous-integration/drone/push Build is failing
2021-11-23 20:53:00 +01:00
37e08f8b4c Include gunicorn and gevent in [full]
Some checks failed
continuous-integration/drone/push Build is failing
2021-11-23 20:32:22 +01:00
8f576adb64 Add gevents deps
All checks were successful
continuous-integration/drone/push Build is passing
2021-11-23 20:22:47 +01:00
528b3448e4 Get gevents from pkgs (long to build)
All checks were successful
continuous-integration/drone/push Build is passing
2021-11-23 20:14:22 +01:00
f627f1b12b cloud-init: rename py3 packages for ubuntu 2021-11-23 20:13:40 +01:00
53fd97651e Fix [full] install instructions 2021-11-23 20:09:52 +01:00
4dd77b4bcc Add gevent to for deployment to get the latest one
All checks were successful
continuous-integration/drone/push Build is passing
2021-11-23 20:07:56 +01:00
deffeebd85 gunicorn: use more aggressive multi-threading settings
All checks were successful
continuous-integration/drone/push Build is passing
2021-11-23 20:00:10 +01:00
765e0ba728 Pass py error msg in http headers
All checks were successful
continuous-integration/drone/push Build is passing
2021-11-22 23:22:13 +01:00
12073ac7d8 Simplify cloud-init code
All checks were successful
continuous-integration/drone/push Build is passing
2021-11-22 21:53:18 +01:00
6d049935e3 More cloud instructions
All checks were successful
continuous-integration/drone/push Build is passing
2021-11-21 22:38:26 +01:00
7b64c963c4 README: nicer links 2021-11-21 22:37:25 +01:00
6900b9053c Heroku click to deploy
All checks were successful
continuous-integration/drone/push Build is passing
incl. workaround for their weird use of entrypoint
2021-11-21 21:57:06 +01:00
6ec3fb47d1 readabilite: .strip() first to save time
All checks were successful
continuous-integration/drone/push Build is passing
2021-11-15 21:54:07 +01:00
1083f3ffbc crawler: make sure to use HTTPMessage
All checks were successful
continuous-integration/drone/push Build is passing
2021-11-11 10:21:48 +01:00
7eeb1d696c crawler: clean up code
All checks were successful
continuous-integration/drone/push Build is passing
2021-11-10 23:25:03 +01:00
e42df98f83 crawler: fix regression brought with 44a6b2591
All checks were successful
continuous-integration/drone/push Build is passing
2021-11-10 23:08:31 +01:00
cb21871c35 crawler: clean up caching code
All checks were successful
continuous-integration/drone/push Build is passing
2021-11-08 22:02:23 +01:00
c71cf5d5ce caching: fix diskcache implementation 2021-11-08 21:57:43 +01:00
44a6b2591d crawler: cleaner http header object import 2021-11-07 19:44:36 +01:00
a890536601 morss: comment code a bit 2021-11-07 18:26:07 +01:00
8de309f2d4 caching: add diskcache backend 2021-11-07 18:15:20 +01:00
cbf7b3f77b caching: simplify sqlite code 2021-11-07 18:14:18 +01:00
1ff7e4103c Docker: make it possible to use it as cli
All checks were successful
continuous-integration/drone/push Build is passing
2021-10-31 17:15:08 +01:00
3f12258e98 Docker: default to non-root exec
All checks were successful
continuous-integration/drone/push Build is passing
2021-10-19 22:20:44 +02:00
d023ec8d73 Change default port to 8000 2021-10-19 22:19:59 +02:00
5473b77416 Post-clean up isort
All checks were successful
continuous-integration/drone/push Build is passing
2021-09-21 08:11:04 +02:00
0365232a73 readabilite: custom xpath for article detection
Some checks failed
continuous-integration/drone/push Build is failing
2021-09-21 08:04:45 +02:00
a523518ae8 cache: avoid name collision 2021-09-21 08:04:45 +02:00
52c48b899f readability: better var names 2021-09-21 08:04:45 +02:00
9649cabb1b morss: do not crash on empty pages 2021-09-21 08:04:45 +02:00
0c29102788 drone: full module install 2021-09-21 08:04:45 +02:00
10535a17c5 cache: fix isort 2021-09-21 08:04:45 +02:00
7d86972e58 Add Redis cache backend 2021-09-21 08:04:45 +02:00
62e04549ac Optdeps in REAMDE and Dockerfile 2021-09-21 08:04:45 +02:00
5da7121a77 Fix Options class behaviour 2021-09-21 08:04:45 +02:00
bb82902ad1 Move cache code to its own file 2021-09-21 08:04:45 +02:00
04afa28fe7 crawler: cache pickle'd array 2021-09-21 08:04:45 +02:00
75bb69f0fd Make mysql optdep 2021-09-21 08:04:45 +02:00
97d9dda547 crawler: support 308 redirects 2021-09-21 08:04:45 +02:00
0c31d9f6db ci: add spell check dict 2021-09-21 08:04:45 +02:00
49e29208ef ci: fix spell check 2021-09-21 08:04:45 +02:00
d8d608a4de ci: fix pylint install 2021-09-21 08:04:45 +02:00
5437e40a15 drone: use alpine image (to benefit from pkgs) 2021-09-21 08:04:45 +02:00
6c1f8da692 ci: added pylint (triggered upon error w/ score < 8 only) 2021-09-21 08:04:45 +02:00
a1a26d8209 README: add ci badge 2021-09-21 08:04:45 +02:00
edbb580f33 ci/cd: fix isort args 2021-09-21 08:04:45 +02:00
4fd730b983 Further isort implementation 2021-09-21 08:04:45 +02:00
198353d6b9 ci/cd attempt 2021-09-21 08:04:45 +02:00
0b3e6d7749 Apply isort 2021-09-21 08:04:23 +02:00
06e0ada95b Allow POST requests 2021-09-08 20:43:21 +02:00
71d9c7a027 README: updated ttrss link 2021-09-07 20:30:18 +02:00
37f5a92b05 wsgi: fix apache / workload 2021-09-06 22:01:48 +02:00
24c26d3850 wsgi: handle url decoding 2021-08-31 21:51:21 +02:00
8f24214915 crawler: better name for custom fns 2021-08-29 00:22:40 +02:00
d5942fe5a7 feeds: fix issues when mode not explicited in ruleset 2021-08-29 00:20:29 +02:00
6f50443995 morss: Options return None instead of False if no match
Better for default fn values
2021-08-29 00:19:09 +02:00
5582fbef31 crawler: comment 2021-08-29 00:18:50 +02:00
da5442a1dc feedify: support any type (json, xml, html) 2021-08-29 00:17:28 +02:00
f9d7794bcc feedifi.ini: json time zone handling 2021-04-23 20:43:00 +02:00
e37c8346d0 feeds: add fallback for time parser 2021-04-22 21:57:16 +02:00
3a1d564992 feeds: fix time zone handling 2021-04-22 21:51:00 +02:00
6880a443e0 crawler: improve CacheHandler code 2021-03-25 23:54:08 +01:00
7342ab26d2 crawler: comment on how urllib works 2021-03-25 23:49:58 +01:00
981da9e66a crawler: SQLITE_PATH point to .db file instead of folder 2021-03-25 23:48:21 +01:00
6ea9d012a2 readme: fix newsreader hook 2021-01-23 21:40:46 +01:00
95d6143636 docker: log to stdout 2021-01-15 23:31:27 +01:00
03cad120d0 README: comment on http timeout 2021-01-14 22:34:57 +01:00
01a7667032 Fix error due to remaining log force code 2021-01-14 00:51:47 +01:00
3e886caaab crawler: drop encoding setting 2020-10-30 22:41:16 +01:00
ad927e03a7 crawler: use regex instead of lxml
Less reliable but should be faster
2020-10-30 22:21:19 +01:00
0efb096fa7 crawler: shift gzip & encoding-fix to intermediary handler 2020-10-30 22:16:51 +01:00
9ab2e488ef crawler: add intermediary handlers 2020-10-30 22:15:35 +01:00
b525ab0d26 crawler: fix typo 2020-10-30 22:12:43 +01:00
fb19b1241f sheet.xsl: update to :format 2020-10-11 22:49:33 +02:00
9d062ef24b README: indicate time unit for env vars 2020-10-03 22:22:20 +02:00
447f62dc45 README: indicate to use build --no-cache --pull by default 2020-10-03 22:18:51 +02:00
18ec10fe44 Use hopefully more-up-to-date pip gunicorn 2020-10-03 21:59:43 +02:00
891c385b69 Dockerfile: no cache for pip to save space 2020-10-03 21:59:17 +02:00
0629bb98fb dockerfile: add python wheel 2020-10-03 21:16:04 +02:00
ae7ba458ce README fixes 2020-10-03 20:56:40 +02:00
bd0bca69fc crawler: ignore ssl via env var 2020-10-03 19:57:08 +02:00
8abd951d40 More sensible default values for cache autotrim (1k entries, 1min) 2020-10-03 19:55:57 +02:00
2514fabd38 Replace memory-leak-prone Uniq with @uniq_wrapper 2020-10-03 19:43:55 +02:00
8cb7002fe6 feeds: make it possible to append empty items
And return the newly appended items, to make it easy to edit them
2020-10-03 16:56:07 +02:00
6966e03bef Clean up itemClass code
To avoid globals()
2020-10-03 16:25:29 +02:00
03a122c41f Dockerfile: add --no-cache to save some space 2020-10-01 22:33:29 +02:00
5cd6c22d73 Reorganise the README file 2020-10-01 22:25:53 +02:00
e1b41b5f64 Typo in README 2020-10-01 00:18:48 +02:00
9ce6acba20 Fix gunicorn related typo 2020-10-01 00:07:41 +02:00
6192ff4081 gunicorn with --preload
To only load the code once (and start autotrim once)
2020-10-01 00:05:39 +02:00
056a1b143f crawler: autotrim: make ctrl+c working 2020-10-01 00:04:36 +02:00
eed949736a crawler: add ability to limit cache size 2020-09-30 23:59:55 +02:00
2fc7cd391c Shift __main__'s wsgi code where it belongs 2020-09-30 23:24:51 +02:00
d9f46b23a6 crawler: default value for MYSQL_HOST (localhost) 2020-09-30 13:17:02 +02:00
bbada0436a Quick guide to ignore SSL certs 2020-09-27 16:48:22 +02:00
039a672f4e wsgi: clean up url reconstruction 2020-09-27 16:28:26 +02:00
b290568e14 README: decent line length
Obtained from the output of:
python -m morss --help | cat
2020-09-15 23:01:42 +02:00
9ecf856f10 Add :resolve to remove (some?) tracking links 2020-09-15 22:57:52 +02:00
504ede624d Logo CC BY-NC-SA 4.0 2020-09-03 13:17:58 +02:00
0d89f0e6f2 Add a logo
B&W edition of the logo at https://morss.it/
2020-08-28 21:28:07 +02:00
56e0c2391d Missing import for served files 2020-08-28 20:53:03 +02:00
679f406a12 Default mimetype for served files 2020-08-28 20:52:43 +02:00
f6d641eeef Serve any file in www/
Also fixes #41
2020-08-28 20:45:39 +02:00
2456dd9bbc Fix broken pieces
Including #43
2020-08-28 19:38:48 +02:00
0f33db248a Add license info in each file 2020-08-26 20:08:22 +02:00
d57f543c7b README: remove todo 2020-08-24 21:17:31 +02:00
fba112147c README: make it clear that the internal server is _very_ basic 2020-08-24 21:14:48 +02:00
8697c3f0df Remove remaining --debug from README 2020-08-24 19:39:27 +02:00
75935114e4 Remove leftover code 2020-08-23 19:07:12 +02:00
5bd2557619 Fix typo in provided .htaccess 2020-08-23 19:01:34 +02:00
598a2591f1 Dockerfile: remove confusing one-liner code 2020-08-23 18:59:16 +02:00
e76ab2b631 Update gunicorn instructions 2020-08-23 18:59:02 +02:00
aa9143302b Remove now-unused isInt code 2020-08-23 18:51:09 +02:00
0d62a7625b Define http port via env vars as well 2020-08-23 18:50:18 +02:00
bd0efb1529 crawler: missing os import 2020-08-23 18:45:44 +02:00
47a17614ef Rename morss/cgi.py into morss/wsgi.py
To avoid name collision with the built-in cgi lib
2020-08-23 18:44:49 +02:00
4dfebe78f7 Pick caching backend via env vars 2020-08-23 18:43:18 +02:00
dcd3e4a675 cgi.py: add missing impots 2020-08-23 18:31:05 +02:00
e968b2ea7f Remove leftover :debug code 2020-08-23 16:59:34 +02:00
0ac590c798 Set MAX_/LIM_* settings via env var 2020-08-23 16:09:58 +02:00
fa1b5aef09 Instructions for DEBUG= use 2020-08-23 15:31:11 +02:00
7f6309f618 README: :silent was explained twice 2020-08-23 14:34:04 +02:00
f65fb45030 :debug completely deprecated in favour of DEBUG= 2020-08-23 14:33:32 +02:00
6dd40e5cc4 cli.py: fix Options code 2020-08-23 14:25:09 +02:00
0acfce5a22 cli.py: remove log 2020-08-23 14:24:57 +02:00
97ccc15db0 cgi.py: rename parseOptions to parse_options 2020-08-23 14:24:23 +02:00
7a560181f7 Use env var for DEBUG 2020-08-23 14:23:45 +02:00
baccd3b22b Move parseOptions to cgi.py
As it is no longer used in cli.py
2020-08-22 00:37:34 +02:00
f79938ab11 Add :silent to readme & argparse 2020-08-22 00:02:08 +02:00
5b8bd47829 cli.py: remove draft code 2020-08-21 23:59:12 +02:00
b5b355aa6e readabilite: increase penalty for high link density 2020-08-21 23:55:04 +02:00
94097f481a sheet.xsl: better handle some corner cases 2020-08-21 23:54:35 +02:00
8161baa7ae sheet.xsl: improve css 2020-08-21 23:54:12 +02:00
bd182bcb85 Move cli code to argParse
Related code changes (incl. :format=xyz)
2020-08-21 23:52:56 +02:00
c7c2c5d749 Removed unused filterOptions code 2020-08-21 23:23:33 +02:00
c6b52e625f split morss.py into __main__/cgi/cli.py
Should hopefully allow cleaner code in the future
2020-08-21 22:17:55 +02:00
c6d3a0eb53 readabilite: clean up code 2020-07-15 00:49:34 +02:00
c628ee802c README: add docker-compose instructions 2020-07-13 20:50:39 +02:00
6021b912ff morss: fix item removal
Usual issue when editing a list while looping over it
2020-07-06 19:25:48 +02:00
f18a128ee6 Change :first for :newest
i.e. toggle default for the more-obvious option
2020-07-06 19:25:17 +02:00
64af86c11e crawler: catch html parsing errors 2020-07-06 12:25:38 +02:00
15951d228c Add :first to NOT sort items by date 2020-07-06 11:39:08 +02:00
c1b1f5f58a morss: restrict iframe use from :get to avoid abuse 2020-06-09 12:33:37 +02:00
985185f47f morss: more flexible feed creator auto-detection 2020-06-08 13:03:24 +02:00
3190d1ec5a feeds: remove useless if(len) before loop 2020-06-02 13:57:45 +02:00
9815794a97 sheet.xsl: make text more self explanatory 2020-05-27 21:42:00 +02:00
758b6861b9 sheet.xsl: fix text alignment 2020-05-27 21:36:11 +02:00
ce4cf01aa6 crawler: clean up encoding detection code 2020-05-27 21:35:24 +02:00
dcfdb75a15 crawler: fix chinese encoding support 2020-05-27 21:34:43 +02:00
4ccc0dafcd Basic help for sub-lib interactive use 2020-05-26 19:34:20 +02:00
2fe3e0b8ee feeds: clean up other stylesheets before putting ours 2020-05-26 19:26:36 +02:00
51 changed files with 11627 additions and 1046 deletions

.github/workflows/default.yml (new file, 78 lines)

@@ -0,0 +1,78 @@
name: default
on:
push:
branches:
- master
jobs:
test-lint:
runs-on: ubuntu-latest
steps:
- name: Checkout
uses: actions/checkout@v3
with:
fetch-depth: 0
- name: Prepare image
run: apt-get -y update && apt-get -y install python3-pip libenchant-2-2 aspell-en
- name: Install dependencies
run: pip3 install .[full] .[dev]
- run: isort --check-only --diff .
- run: pylint morss --rcfile=.pylintrc --disable=C,R,W --fail-under=8
- run: pytest --cov=morss tests
python-publish:
runs-on: ubuntu-latest
steps:
- name: Checkout
uses: actions/checkout@v3
with:
fetch-depth: 0
- name: Prepare image
run: apt-get -y update && apt-get -y install python3-pip python3-build
- name: Build package
run: python3 -m build
- name: Publish package
uses: https://github.com/pypa/gh-action-pypi-publish@release/v1
with:
password: ${{ secrets.pypi_api_token }}
docker-publish-deploy:
runs-on: ubuntu-latest
container:
image: catthehacker/ubuntu:act-latest
steps:
- name: Checkout
uses: actions/checkout@v3
- name: Set up QEMU
uses: https://github.com/docker/setup-qemu-action@v2
- name: Set up Docker Buildx
uses: https://github.com/docker/setup-buildx-action@v2
- name: Login to Docker Hub
uses: https://github.com/docker/login-action@v2
with:
username: ${{ secrets.docker_user }}
password: ${{ secrets.docker_pwd }}
- name: Build and push
uses: https://github.com/docker/build-push-action@v4
with:
context: .
platforms: linux/amd64,linux/arm64,linux/arm/v7
push: true
tags: ${{ secrets.docker_repo }}
- name: Deploy on server
uses: https://github.com/appleboy/ssh-action@v0.1.10
with:
host: ${{ secrets.ssh_host }}
username: ${{ secrets.ssh_user }}
key: ${{ secrets.ssh_key }}
script: morss-update
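
For reference, the test-lint job above boils down to the commands below, which can also be run locally from a checkout of the repository (a sketch; it assumes pip plus the `[full]`/`[dev]` extras and the spell-check system packages are available):

```shell
# Same steps the workflow runs, reproduced locally
pip3 install .[full] .[dev]
isort --check-only --diff .
pylint morss --rcfile=.pylintrc --disable=C,R,W --fail-under=8
pytest --cov=morss tests
```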

.pylintrc (new file, 50 lines)

@@ -0,0 +1,50 @@
[MASTER]
ignore=CVS
suggestion-mode=yes
extension-pkg-allow-list=lxml.etree
[MESSAGES CONTROL]
disable=missing-function-docstring,
missing-class-docstring,
missing-module-docstring,
wrong-spelling-in-comment,
[REPORTS]
reports=yes
score=yes
[SPELLING]
spelling-dict=en_GB
spelling-ignore-words=morss
[STRING]
check-quote-consistency=yes
check-str-concat-over-line-jumps=yes
[VARIABLES]
allow-global-unused-variables=no
init-import=no
[FORMAT]
expected-line-ending-format=LF
indent-string=' '
max-line-length=120
max-module-lines=1000
[BASIC]
argument-naming-style=snake_case
attr-naming-style=snake_case
class-attribute-naming-style=snake_case
class-const-naming-style=UPPER_CASE
class-naming-style=PascalCase
const-naming-style=UPPER_CASE
function-naming-style=snake_case
inlinevar-naming-style=snake_case
method-naming-style=snake_case
module-naming-style=snake_case
variable-naming-style=snake_case
include-naming-hint=yes
bad-names=foo, bar
good-names=i, j, k

Dockerfile

@@ -1,8 +1,16 @@
FROM alpine:latest
RUN apk add python3 py3-lxml py3-gunicorn py3-pip git
FROM alpine:edge
ADD . /app
RUN pip3 install /app
CMD gunicorn --bind 0.0.0.0:8080 -w 4 morss:cgi_standalone_app
RUN set -ex; \
apk add --no-cache --virtual .run-deps python3 py3-lxml py3-setproctitle py3-setuptools; \
apk add --no-cache --virtual .build-deps py3-pip py3-wheel; \
pip3 install --no-cache-dir /app[full]; \
apk del .build-deps
USER 1000:1000
ENTRYPOINT ["/bin/sh", "/app/morss-helper"]
CMD ["run"]
HEALTHCHECK CMD /bin/sh /app/morss-helper check
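
A minimal sketch of building and starting this image, using the commands given later in the README diff:

```shell
# Build the image from the upstream repository and run the server on port 8000
docker build --tag morss https://git.pictuga.com/pictuga/morss.git --no-cache --pull
docker run -p 8000:8000 morss
```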

README.md (505 changed lines)

@@ -1,10 +1,14 @@
# Morss - Get full-text RSS feeds
_GNU AGPLv3 code_
[Homepage](https://morss.it/) •
[Upstream source code](https://git.pictuga.com/pictuga/morss) •
[Github mirror](https://github.com/pictuga/morss) (for Issues & Pull requests)
Upstream source code: https://git.pictuga.com/pictuga/morss
Github mirror (for Issues & Pull requests): https://github.com/pictuga/morss
Homepage: https://morss.it/
[![Build Status](https://ci.pictuga.com/api/badges/pictuga/morss/status.svg)](https://ci.pictuga.com/pictuga/morss)
[![Github Stars](https://img.shields.io/github/stars/pictuga/morss?logo=github)](https://github.com/pictuga/morss/stargazers)
[![Github Forks](https://img.shields.io/github/forks/pictuga/morss?logo=github)](https://github.com/pictuga/morss/network/members)
[![GNU AGPLv3 code](https://img.shields.io/static/v1?label=license&message=AGPLv3)](https://git.pictuga.com/pictuga/morss/src/branch/master/LICENSE)
[![Logo is CC BY-NC-SA 4.0](https://img.shields.io/static/v1?label=CC&message=BY-NC-SA%204.0)](https://creativecommons.org/licenses/by-nc-sa/4.0/)
This tool's goal is to get full-text RSS feeds out of striped RSS feeds,
commonly available on internet. Indeed most newspapers only make a small
@@ -18,7 +22,7 @@ Morss also provides additional features, such as: .csv and json export, extended
control over output. A strength of morss is its ability to deal with broken
feeds, and to replace tracking links with direct links to the actual content.
Morss can also generate feeds from html and json files (see `feedify.py`), which
Morss can also generate feeds from html and json files (see `feeds.py`), which
for instance makes it possible to get feeds for Facebook or Twitter, using
hand-written rules (ie. there's no automatic detection of links to build feeds).
Please mind that feeds based on html files may stop working unexpectedly, due to
@@ -29,6 +33,7 @@ Additionally morss can detect rss feeds in html pages' `<meta>`.
You can use this program online for free at **[morss.it](https://morss.it/)**.
Some features of morss:
- Read RSS/Atom feeds
- Create RSS feeds from json/html pages
- Export feeds as RSS/JSON/CSV/HTML
@@ -36,77 +41,213 @@ Some features of morss:
- Follow 301/meta redirects
- Recover xml feeds with corrupt encoding
- Supports gzip-compressed http content
- HTTP caching with 3 different backends (in-memory/sqlite/mysql)
- HTTP caching with different backends (in-memory/redis/diskcache)
- Works as server/cli tool
- Deobfuscate various tracking links
## Dependencies
## Install
You do need:
### Python package
- [python](http://www.python.org/) >= 2.6 (python 3 is supported)
- [lxml](http://lxml.de/) for xml parsing
- [bs4](https://pypi.org/project/bs4/) for badly-formatted html pages
- [dateutil](http://labix.org/python-dateutil) to parse feed dates
- [chardet](https://pypi.python.org/pypi/chardet)
- [six](https://pypi.python.org/pypi/six), a dependency of chardet
- pymysql
![Build Python](https://img.shields.io/badge/dynamic/json?label=build%20python&query=$.stages[?(@.name=='python')].status&url=https://ci.pictuga.com/api/repos/pictuga/morss/builds/latest)
[![PyPI](https://img.shields.io/pypi/v/morss)](https://pypi.org/project/morss/)
[![PyPI Downloads](https://img.shields.io/pypi/dm/morss)](https://pypistats.org/packages/morss)
Simplest way to get these:
Simple install (without optional dependencies)
From pip
```shell
pip install git+https://git.pictuga.com/pictuga/morss.git@master
pip install morss
```
From git
```shell
pip install git+https://git.pictuga.com/pictuga/morss.git
```
Full installation (including optional dependencies)
From pip
```shell
pip install morss[full]
```
From git
```shell
pip install git+https://git.pictuga.com/pictuga/morss.git#egg=morss[full]
```
The full install includes all the cache backends. Otherwise, only in-memory
cache is available. The full install also includes gunicorn (for more efficient
HTTP handling).
The dependency `lxml` is fairly long to install (especially on Raspberry Pi, as
C code needs to be compiled). If possible on your distribution, try installing
it with the system package manager.
You may also need:
### Docker
- Apache, with python-cgi support, to run on a server
- a fast internet connection
![Build Docker](https://img.shields.io/badge/dynamic/json?label=build%20docker&query=$.stages[?(@.name=='docker')].status&url=https://ci.pictuga.com/api/repos/pictuga/morss/builds/latest)
[![Docker Hub](https://img.shields.io/docker/pulls/pictuga/morss)](https://hub.docker.com/r/pictuga/morss)
[![Docker Arch](https://img.shields.io/badge/dynamic/json?color=blue&label=docker%20arch&query=$.results[0].images[*].architecture&url=https://hub.docker.com/v2/repositories/pictuga/morss/tags)](https://hub.docker.com/r/pictuga/morss/tags)
## Arguments
From docker hub
morss accepts some arguments, to lightly alter the output of morss. Arguments
may need to have a value (usually a string or a number). In the different "Use
cases" below is detailed how to pass those arguments to morss.
With cli
The arguments are:
```shell
docker pull pictuga/morss
```
- Change what morss does
- `json`: output as JSON
- `html`: outpout as HTML
- `csv`: outpout as CSV
- `proxy`: doesn't fill the articles
- `clip`: stick the full article content under the original feed content (useful for twitter)
- `search=STRING`: does a basic case-sensitive search in the feed
- Advanced
- `csv`: export to csv
- `indent`: returns indented XML or JSON, takes more place, but human-readable
- `nolink`: drop links, but keeps links' inner text
- `noref`: drop items' link
- `cache`: only take articles from the cache (ie. don't grab new articles' content), so as to save time
- `debug`: to have some feedback from the script execution. Useful for debugging
- `force`: force refetch the rss feed and articles
- `silent`: don't output the final RSS (useless on its own, but can be nice when debugging)
- http server only
- `callback=NAME`: for JSONP calls
- `cors`: allow Cross-origin resource sharing (allows XHR calls from other servers)
- `txt`: changes the http content-type to txt (for faster "`view-source:`")
- Custom feeds: you can turn any HTML page into a RSS feed using morss, using xpath rules. The article content will be fetched as usual (with readabilite). Please note that you will have to **replace** any `/` in your rule with a `|` when using morss as a webserver
- `items`: (**mandatory** to activate the custom feeds function) xpath rule to match all the RSS entries
- `item_link`: xpath rule relative to `items` to point to the entry's link
- `item_title`: entry's title
- `item_content`: entry's description
- `item_time`: entry's date & time (accepts a wide range of time formats)
With docker-compose **(recommended)**
## Use cases
```yml
services:
app:
image: pictuga/morss
ports:
- '8000:8000'
```
Build from source
With cli
```shell
docker build --tag morss https://git.pictuga.com/pictuga/morss.git --no-cache --pull
```
With docker-compose
```yml
services:
app:
build: https://git.pictuga.com/pictuga/morss.git
image: morss
ports:
- '8000:8000'
```
Then execute
```shell
docker-compose build --no-cache --pull
```
### Cloud providers
One-click deployment:
[![Heroku](https://img.shields.io/static/v1?label=deploy%20to&message=heroku&logo=heroku&color=79589F)](https://heroku.com/deploy?template=https://github.com/pictuga/morss)
[![Google Cloud](https://img.shields.io/static/v1?label=deploy%20to&message=google&logo=google&color=4285F4)](https://deploy.cloud.run/?git_repo=https://github.com/pictuga/morss.git)
Providers supporting `cloud-init` (AWS, Oracle Cloud Infrastructure), based on Ubuntu:
``` yml
#cloud-config
packages:
- python3-pip
- python3-wheel
- python3-lxml
- python3-setproctitle
- ca-certificates
write_files:
- path: /etc/environment
append: true
content: |
DEBUG=1
CACHE=diskcache
CACHE_SIZE=1073741824 # 1GiB
- path: /var/lib/cloud/scripts/per-boot/morss.sh
permissions: 744
content: |
#!/bin/sh
/usr/local/bin/morss-helper daemon
runcmd:
- source /etc/environment
- update-ca-certificates
- iptables -I INPUT 6 -m state --state NEW -p tcp --dport ${PORT:-8000} -j ACCEPT
- netfilter-persistent save
- pip install morss[full]
```
## Run
morss will auto-detect what "mode" to use.
### Running on a server
### Running on/as a server
Set up the server as indicated below, then visit:
```
http://PATH/TO/MORSS/[main.py/][:argwithoutvalue[:argwithvalue=value[...]]]/FEEDURL
```
For example: `http://morss.example/:clip/https://twitter.com/pictuga`
*(Brackets indicate optional text)*
The `main.py` part is only needed if your server doesn't support the Apache
redirect rule set in the provided `.htaccess`.
Works like a charm with [Tiny Tiny RSS](https://tt-rss.org/), and most probably
other clients.
#### Using Docker
From docker hub
```shell
docker run -p 8000:8000 pictuga/morss
```
From source
```shell
docker run -p 8000:8000 morss
```
With docker-compose **(recommended)**
```shell
docker-compose up
```
#### Using Gunicorn
```shell
gunicorn --preload morss
```
#### Using uWSGI
Running this command should do:
```shell
uwsgi --http :8000 --plugin python --wsgi-file main.py
```
#### Using morss' internal HTTP server
Morss can run its own, **very basic**, HTTP server, meant for debugging mostly.
The latter should start when you run morss without any argument, on port 8000.
I'd highly recommend you to use gunicorn or something similar for better
performance.
```shell
morss
```
You can change the port using environment variables like this `PORT=9000 morss`.
#### Via mod_cgi/FastCGI with Apache/nginx
For this, you'll want to change a bit the architecture of the files, for example
@@ -135,73 +276,49 @@ For this, you need to make sure your host allows python script execution. This
method uses HTTP calls to fetch the RSS feeds, which will be handled through
`mod_cgi` for example on Apache severs.
Please pay attention to `main.py` permissions for it to be executable. Also
ensure that the provided `/www/.htaccess` works well with your server.
Please pay attention to `main.py` permissions for it to be executable. See below
some tips for the `.htaccess` file.
#### Using uWSGI
```htaccess
Options -Indexes
Running this command should do:
ErrorDocument 404 /cgi/main.py
```shell
uwsgi --http :8080 --plugin python --wsgi-file main.py
# Turn debug on for all requests
SetEnv DEBUG 1
# Turn debug on for requests with :debug in the url
SetEnvIf Request_URI :debug DEBUG=1
<Files ~ "\.(py|pyc|db|log)$">
deny from all
</Files>
<Files main.py>
allow from all
AddHandler cgi-script .py
Options +ExecCGI
</Files>
```
#### Using Gunicorn
```shell
gunicorn morss:cgi_standalone_app
```
#### Using docker
Build & run
```shell
docker build https://git.pictuga.com/pictuga/morss.git -t morss
docker run -p 8080:8080 morss
```
In one line
```shell
docker run -p 8080:8080 $(docker build -q https://git.pictuga.com/pictuga/morss.git)
```
#### Using morss' internal HTTP server
Morss can run its own HTTP server. The later should start when you run morss
without any argument, on port 8080.
```shell
morss
```
You can change the port like this `morss 9000`.
#### Passing arguments
Then visit:
```
http://PATH/TO/MORSS/[main.py/][:argwithoutvalue[:argwithvalue=value[...]]]/FEEDURL
```
For example: `http://morss.example/:clip/https://twitter.com/pictuga`
*(Brackets indicate optional text)*
The `main.py` part is only needed if your server doesn't support the Apache redirect rule set in the provided `.htaccess`.
Works like a charm with [Tiny Tiny RSS](http://tt-rss.org/redmine/projects/tt-rss/wiki), and most probably other clients.
### As a CLI application
Run:
```
morss [argwithoutvalue] [argwithvalue=value] [...] FEEDURL
morss [--argwithoutvalue] [--argwithvalue=value] [...] FEEDURL
```
For example: `morss debug http://feeds.bbci.co.uk/news/rss.xml`
For example: `morss --clip http://feeds.bbci.co.uk/news/rss.xml`
*(Brackets indicate optional text)*
If using Docker:
```shell
docker run morss --clip http://feeds.bbci.co.uk/news/rss.xml
```
### As a newsreader hook
To use it, the newsreader [Liferea](http://lzone.de/liferea/) is required
@@ -209,10 +326,13 @@ To use it, the newsreader [Liferea](http://lzone.de/liferea/) is required
scripts can be run on top of the RSS feed, using its
[output](http://lzone.de/liferea/scraping.htm) as an RSS feed.
To use this script, you have to enable "(Unix) command" in liferea feed settings, and use the command:
To use this script, you have to enable "(Unix) command" in liferea feed
settings, and use the command:
```
morss [argwithoutvalue] [argwithvalue=value] [...] FEEDURL
morss [--argwithoutvalue] [--argwithvalue=value] [...] FEEDURL
```
For example: `morss http://feeds.bbci.co.uk/news/rss.xml`
*(Brackets indicate optional text)*
@@ -220,6 +340,7 @@ For example: `morss http://feeds.bbci.co.uk/news/rss.xml`
### As a python library
Quickly get a full-text feed:
```python
>>> import morss
>>> xml_string = morss.process('http://feeds.bbci.co.uk/news/rss.xml')
@@ -228,10 +349,11 @@ Quickly get a full-text feed:
```
Using cache and passing arguments:
```python
>>> import morss
>>> url = 'http://feeds.bbci.co.uk/news/rss.xml'
>>> cache = '/tmp/morss-cache.db' # sqlite cache location
>>> cache = '/tmp/morss-cache' # diskcache cache location
>>> options = {'csv':True}
>>> xml_string = morss.process(url, cache, options)
>>> xml_string[:50]
@@ -243,12 +365,12 @@ possible to call the simpler functions, to have more control on what's happening
under the hood.
Doing it step-by-step:
```python
import morss, morss.crawler
import morss
url = 'http://newspaper.example/feed.xml'
options = morss.Options(csv=True) # arguments
morss.crawler.sqlite_default = '/tmp/morss-cache.db' # sqlite cache location
url, rss = morss.FeedFetch(url, options) # this only grabs the RSS feed
rss = morss.FeedGather(rss, url, options) # this fills the feed and cleans it up
@@ -256,40 +378,152 @@ rss = morss.FeedGather(rss, url, options) # this fills the feed and cleans it up
output = morss.FeedFormat(rss, options, 'unicode') # formats final feed
```
## Cache information
## Arguments and settings
morss uses caching to make loading faster. There are 3 possible cache backends
(visible in `morss/crawler.py`):
### Arguments
- `{}`: a simple python in-memory dict() object
- `SQLiteCache`: sqlite3 cache. Default file location is in-memory (i.e. it will
be cleared every time the program is run
- `MySQLCacheHandler`
morss accepts some arguments, to lightly alter the output of morss. Arguments
may need to have a value (usually a string or a number). How to pass those
arguments to morss is explained in Run above.
## Configuration
### Length limitation
The list of arguments can be obtained by running `morss --help`
```
usage: morss [-h] [--post STRING] [--xpath XPATH]
[--format {rss,json,html,csv}] [--search STRING] [--clip]
[--indent] [--cache] [--force] [--proxy]
[--order {first,last,newest,oldest}] [--firstlink] [--resolve]
[--items XPATH] [--item_link XPATH] [--item_title XPATH]
[--item_content XPATH] [--item_time XPATH]
[--mode {xml,html,json}] [--nolink] [--noref] [--silent]
url
Get full-text RSS feeds
positional arguments:
url feed url
options:
-h, --help show this help message and exit
--post STRING POST request
--xpath XPATH xpath rule to manually detect the article
output:
--format {rss,json,html,csv}
output format
--search STRING does a basic case-sensitive search in the feed
--clip stick the full article content under the original feed
content (useful for twitter)
--indent returns indented XML or JSON, takes more place, but
human-readable
action:
--cache only take articles from the cache (ie. don't grab new
articles' content), so as to save time
--force force refetch the rss feed and articles
--proxy doesn't fill the articles
--order {first,last,newest,oldest}
order in which to process items (which are however NOT
sorted in the output)
--firstlink pull the first article mentioned in the description
instead of the default link
--resolve replace tracking links with direct links to articles
(not compatible with --proxy)
custom feeds:
--items XPATH (mandatory to activate the custom feeds function)
xpath rule to match all the RSS entries
--item_link XPATH xpath rule relative to items to point to the entry's
link
--item_title XPATH entry's title
--item_content XPATH entry's content
--item_time XPATH entry's date & time (accepts a wide range of time
formats)
--mode {xml,html,json}
parser to use for the custom feeds
misc:
--nolink drop links, but keeps links' inner text
--noref drop items' link
--silent don't output the final RSS (useless on its own, but
can be nice when debugging)
GNU AGPLv3 code
```
Further HTTP-only options:
- `callback=NAME`: for JSONP calls
- `cors`: allow Cross-origin resource sharing (allows XHR calls from other
servers)
- `txt`: changes the http content-type to txt (for faster "`view-source:`")
### Environment variables
To pass environment variables:
- Docker-cli: `docker run -p 8000:8000 morss --env KEY=value`
- docker-compose: add an `environment:` section in the .yml file
- Gunicorn/uWSGI/CLI: prepend `KEY=value` before the command
- Apache: via the `SetEnv` instruction (see sample `.htaccess` provided)
- cloud-init: in the `/etc/environment` file
Generic:
- `DEBUG=1`: to have some feedback from the script execution. Useful for
debugging.
- `IGNORE_SSL=1`: to ignore SSL certs when fetch feeds and articles
- `DELAY` (seconds) sets the browser cache delay, only for HTTP clients
- `TIMEOUT` (seconds) sets the HTTP timeout when fetching rss feeds and articles
- `DATA_PATH`: to set custom file location for the `www` folder
When parsing long feeds, with a lot of items (100+), morss might take a lot of
time to parse it, or might even run into a memory overflow on some shared
hosting plans (limits around 10Mb), in which case you might want to adjust the
different values at the top of the script.
below settings via environment variables.
- `MAX_TIME` sets the maximum amount of time spent *fetching* articles, more time might be spent taking older articles from cache. `-1` for unlimited.
- `MAX_ITEM` sets the maximum number of articles to fetch. `-1` for unlimited. More articles will be taken from cache following the nexts settings.
- `LIM_TIME` sets the maximum amount of time spent working on the feed (whether or not it's already cached). Articles beyond that limit will be dropped from the feed. `-1` for unlimited.
- `LIM_ITEM` sets the maximum number of article checked, limiting both the number of articles fetched and taken from cache. Articles beyond that limit will be dropped from the feed, even if they're cached. `-1` for unlimited.
Also, if the request takes too long to process, the http request might be
discarded. See relevant config for
[gunicorn](https://docs.gunicorn.org/en/stable/settings.html#timeout) or
[nginx](http://nginx.org/en/docs/http/ngx_http_proxy_module.html#proxy_read_timeout).
### Other settings
- `MAX_TIME` (seconds) sets the maximum amount of time spent *fetching*
articles, more time might be spent taking older articles from cache. `-1` for
unlimited.
- `MAX_ITEM` sets the maximum number of articles to fetch. `-1` for unlimited.
More articles will be taken from cache following the nexts settings.
- `LIM_TIME` (seconds) sets the maximum amount of time spent working on the feed
(whether or not it's already cached). Articles beyond that limit will be dropped
from the feed. `-1` for unlimited.
- `LIM_ITEM` sets the maximum number of article checked, limiting both the
number of articles fetched and taken from cache. Articles beyond that limit will
be dropped from the feed, even if they're cached. `-1` for unlimited.
- `DELAY` sets the browser cache delay, only for HTTP clients
- `TIMEOUT` sets the HTTP timeout when fetching rss feeds and articles
morss uses caching to make loading faster. There are 3 possible cache backends:
- `(nothing/default)`: a simple python in-memory dict-like object.
- `CACHE=redis`: Redis cache. Connection can be defined with the following
environment variables: `REDIS_HOST`, `REDIS_PORT`, `REDIS_DB`, `REDIS_PWD`
- `CACHE=diskcache`: disk-based cache. Target directory can be defined with
`DISKCACHE_DIR`.
To limit the size of the cache:
- `CACHE_SIZE` sets the target number of items in the cache (further items will
be deleted but the cache might be temporarily bigger than that). Defaults to 1k
entries. NB. When using `diskcache`, this is the cache max size in Bytes.
- `CACHE_LIFESPAN` (seconds) sets how often the cache must be trimmed (i.e. cut
down to the number of items set in `CACHE_SIZE`). Defaults to 1min.
Gunicorn also accepts command line arguments via the `GUNICORN_CMD_ARGS`
environment variable.
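
For instance, several of the settings above can be combined by prepending them to the CLI invocation (illustrative values; the feed URL is the one used elsewhere in this README):

```shell
# Debug output, 5s HTTP timeout, disk-based cache in a custom directory, JSON output
DEBUG=1 TIMEOUT=5 CACHE=diskcache DISKCACHE_DIR=/tmp/morss-cache morss --format json http://feeds.bbci.co.uk/news/rss.xml
```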
### Content matching
The content of articles is grabbed with our own readability fork. This means
that most of the time the right content is matched. However sometimes it fails,
therefore some tweaking is required. Most of the time, what has to be done is to
add some "rules" in the main script file in *readability* (not in morss).
add some "rules" in the main script file in `readabilite.py` (not in morss).
Most of the time when hardly nothing is matched, it means that the main content
of the article is made of images, videos, pictures, etc., which readability
@@ -300,14 +534,3 @@ morss will also try to figure out whether the full content is already in place
(for those websites which understood the whole point of RSS feeds). However this
detection is very simple, and only works if the actual content is put in the
"content" section in the feed and not in the "summary" section.
***
## Todo
You can contribute to this project. If you're not sure what to do, you can pick
from this list:
- Add ability to run morss.py as an update daemon
- Add ability to use custom xpath rule instead of readability
- More ideas here <https://github.com/pictuga/morss/issues/15>

app.json (new file, 21 lines)

@@ -0,0 +1,21 @@
{
"stack": "container",
"env": {
"DEBUG": {
"value": 1,
"required": false
},
"GUNICORN_CMD_ARGS": {
"value": "",
"required": false
},
"CACHE": {
"value": "diskcache",
"required": false
},
"CACHE_SIZE": {
"value": 1073741824,
"required": false
}
}
}

heroku.yml (new file, 3 lines)

@@ -0,0 +1,3 @@
build:
docker:
web: Dockerfile

main.py (20 changed lines; normal file → executable file)

@@ -1,6 +1,24 @@
#!/usr/bin/env python
from morss import main, cgi_standalone_app as application
# This file is part of morss
#
# Copyright (C) 2013-2020 pictuga <contact@pictuga.com>
#
# This program is free software: you can redistribute it and/or modify it under
# the terms of the GNU Affero General Public License as published by the Free
# Software Foundation, either version 3 of the License, or (at your option) any
# later version.
#
# This program is distributed in the hope that it will be useful, but WITHOUT
# ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
# FOR A PARTICULAR PURPOSE. See the GNU Affero General Public License for more
# details.
#
# You should have received a copy of the GNU Affero General Public License along
# with this program. If not, see <https://www.gnu.org/licenses/>.
from morss.__main__ import main
from morss.wsgi import application
if __name__ == '__main__':
main()

47
morss-helper Executable file

@@ -0,0 +1,47 @@
#! /bin/sh
set -ex
if ! command -v python && command -v python3 ; then
alias python='python3'
fi
run() {
gunicorn --bind 0.0.0.0:${PORT:-8000} --preload --access-logfile - morss
}
daemon() {
gunicorn --bind 0.0.0.0:${PORT:-8000} --preload --access-logfile - --daemon morss
}
reload() {
pid=$(pidof 'gunicorn: master [morss]' || true)
# NB. requires python-setproctitle
# `|| true` due to `set -e`
if [ -z "$pid" ]; then
# if gunicorn is not currently running
daemon
else
kill -s USR2 $pid
kill -s WINCH $pid
sleep 1 # give gunicorn some time to reload
kill -s TERM $pid
fi
}
check() {
python -m morss.crawler http://localhost:${PORT:-8000}/ > /dev/null 2>&1
}
if [ -z "$1" ]; then
run
elif [ "$1" = "sh" ] || [ "$1" = "bash" ] || command -v "$1" ; then
$@
else
python -m morss $@
fi

13
morss.service Normal file

@@ -0,0 +1,13 @@
[Unit]
Description=morss server (gunicorn)
After=network.target
[Service]
ExecStart=/usr/local/bin/morss-helper run
ExecReload=/usr/local/bin/morss-helper reload
KillMode=process
Restart=always
User=http
[Install]
WantedBy=multi-user.target

morss/__init__.py

@@ -1,2 +1,25 @@
# This file is part of morss
#
# Copyright (C) 2013-2020 pictuga <contact@pictuga.com>
#
# This program is free software: you can redistribute it and/or modify it under
# the terms of the GNU Affero General Public License as published by the Free
# Software Foundation, either version 3 of the License, or (at your option) any
# later version.
#
# This program is distributed in the hope that it will be useful, but WITHOUT
# ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
# FOR A PARTICULAR PURPOSE. See the GNU Affero General Public License for more
# details.
#
# You should have received a copy of the GNU Affero General Public License along
# with this program. If not, see <https://www.gnu.org/licenses/>.
# ran on `import morss`
# pylint: disable=unused-import,unused-variable
__version__ = ""
from .morss import *
from .wsgi import application

morss/__main__.py

@@ -1,5 +1,48 @@
# This file is part of morss
#
# Copyright (C) 2013-2020 pictuga <contact@pictuga.com>
#
# This program is free software: you can redistribute it and/or modify it under
# the terms of the GNU Affero General Public License as published by the Free
# Software Foundation, either version 3 of the License, or (at your option) any
# later version.
#
# This program is distributed in the hope that it will be useful, but WITHOUT
# ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
# FOR A PARTICULAR PURPOSE. See the GNU Affero General Public License for more
# details.
#
# You should have received a copy of the GNU Affero General Public License along
# with this program. If not, see <https://www.gnu.org/licenses/>.
# ran on `python -m morss`
from .morss import main
import os
import sys
from . import cli, wsgi
from .morss import MorssException
def main():
if 'REQUEST_URI' in os.environ:
# mod_cgi (w/o file handler)
wsgi.cgi_handle_request()
elif len(sys.argv) <= 1:
# start internal (basic) http server (w/ file handler)
wsgi.cgi_start_server()
else:
# as a CLI app
try:
cli.cli_app()
except (KeyboardInterrupt, SystemExit):
raise
except Exception as e:
print('ERROR: %s' % e.message)
if __name__ == '__main__':
main()

122
morss/caching.py Normal file

@@ -0,0 +1,122 @@
# This file is part of morss
#
# Copyright (C) 2013-2020 pictuga <contact@pictuga.com>
#
# This program is free software: you can redistribute it and/or modify it under
# the terms of the GNU Affero General Public License as published by the Free
# Software Foundation, either version 3 of the License, or (at your option) any
# later version.
#
# This program is distributed in the hope that it will be useful, but WITHOUT
# ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
# FOR A PARTICULAR PURPOSE. See the GNU Affero General Public License for more
# details.
#
# You should have received a copy of the GNU Affero General Public License along
# with this program. If not, see <https://www.gnu.org/licenses/>.
import os
import threading
import time
from collections import OrderedDict
CACHE_SIZE = int(os.getenv('CACHE_SIZE', 1000)) # max number of items in cache (default: 1k items)
CACHE_LIFESPAN = int(os.getenv('CACHE_LIFESPAN', 60)) # how often to auto-clear the cache (default: 1min)
class BaseCache:
""" Subclasses must behave like a dict """
def trim(self):
pass
def autotrim(self, delay=CACHE_LIFESPAN):
# trim the cache every so often
self.trim()
t = threading.Timer(delay, self.autotrim)
t.daemon = True
t.start()
def __contains__(self, url):
try:
self[url]
except KeyError:
return False
else:
return True
class CappedDict(OrderedDict, BaseCache):
def trim(self):
if CACHE_SIZE >= 0:
for i in range( max( len(self) - CACHE_SIZE , 0 )):
self.popitem(False)
def __setitem__(self, key, data):
# https://docs.python.org/2/library/collections.html#ordereddict-examples-and-recipes
if key in self:
del self[key]
OrderedDict.__setitem__(self, key, data)
try:
import redis # isort:skip
except ImportError:
pass
class RedisCacheHandler(BaseCache):
def __init__(self, host='localhost', port=6379, db=0, password=None):
self.r = redis.Redis(host=host, port=port, db=db, password=password)
def __getitem__(self, key):
return self.r.get(key)
def __setitem__(self, key, data):
self.r.set(key, data)
try:
import diskcache # isort:skip
except ImportError:
pass
class DiskCacheHandler(BaseCache):
def __init__(self, directory=None, **kwargs):
self.cache = diskcache.Cache(directory=directory, eviction_policy='least-frequently-used', **kwargs)
def __del__(self):
self.cache.close()
def trim(self):
self.cache.cull()
def __getitem__(self, key):
return self.cache[key]
def __setitem__(self, key, data):
self.cache.set(key, data)
if 'CACHE' in os.environ:
if os.environ['CACHE'] == 'redis':
default_cache = RedisCacheHandler(
host = os.getenv('REDIS_HOST', 'localhost'),
port = int(os.getenv('REDIS_PORT', 6379)),
db = int(os.getenv('REDIS_DB', 0)),
password = os.getenv('REDIS_PWD', None)
)
elif os.environ['CACHE'] == 'diskcache':
default_cache = DiskCacheHandler(
directory = os.getenv('DISKCACHE_DIR', '/tmp/morss-diskcache'),
size_limit = CACHE_SIZE # in Bytes
)
else:
default_cache = CappedDict()
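A hedged usage sketch (placeholder URL and payload): whichever backend ends up selected, `default_cache` behaves like a plain dict keyed by URL, which is how the crawler's `CacheHandler` consumes it.

from morss.caching import default_cache

default_cache['https://example.com/feed'] = b'<rss/>'  # store raw bytes (placeholder URL/payload)
'https://example.com/feed' in default_cache            # True, via BaseCache.__contains__
data = default_cache['https://example.com/feed']       # read it back (KeyError if missing)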

72
morss/cli.py Normal file

@@ -0,0 +1,72 @@
# This file is part of morss
#
# Copyright (C) 2013-2020 pictuga <contact@pictuga.com>
#
# This program is free software: you can redistribute it and/or modify it under
# the terms of the GNU Affero General Public License as published by the Free
# Software Foundation, either version 3 of the License, or (at your option) any
# later version.
#
# This program is distributed in the hope that it will be useful, but WITHOUT
# ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
# FOR A PARTICULAR PURPOSE. See the GNU Affero General Public License for more
# details.
#
# You should have received a copy of the GNU Affero General Public License along
# with this program. If not, see <https://www.gnu.org/licenses/>.
import argparse
import os.path
import sys
from .morss import FeedFetch, FeedFormat, FeedGather, Options
def cli_app():
parser = argparse.ArgumentParser(
prog='morss',
description='Get full-text RSS feeds',
epilog='GNU AGPLv3 code'
)
parser.add_argument('url', help='feed url')
parser.add_argument('--post', action='store', type=str, metavar='STRING', help='POST request')
parser.add_argument('--xpath', action='store', type=str, metavar='XPATH', help='xpath rule to manually detect the article')
group = parser.add_argument_group('output')
group.add_argument('--format', default='rss', choices=('rss', 'json', 'html', 'csv'), help='output format')
group.add_argument('--search', action='store', type=str, metavar='STRING', help='does a basic case-sensitive search in the feed')
group.add_argument('--clip', action='store_true', help='stick the full article content under the original feed content (useful for twitter)')
group.add_argument('--indent', action='store_true', help='returns indented XML or JSON, takes more place, but human-readable')
group = parser.add_argument_group('action')
group.add_argument('--cache', action='store_true', help='only take articles from the cache (ie. don\'t grab new articles\' content), so as to save time')
group.add_argument('--force', action='store_true', help='force refetch the rss feed and articles')
group.add_argument('--proxy', action='store_true', help='doesn\'t fill the articles')
group.add_argument('--order', default='first', choices=('first', 'last', 'newest', 'oldest'), help='order in which to process items (which are however NOT sorted in the output)')
group.add_argument('--firstlink', action='store_true', help='pull the first article mentioned in the description instead of the default link')
group.add_argument('--resolve', action='store_true', help='replace tracking links with direct links to articles (not compatible with --proxy)')
group = parser.add_argument_group('custom feeds')
group.add_argument('--items', action='store', type=str, metavar='XPATH', help='(mandatory to activate the custom feeds function) xpath rule to match all the RSS entries')
group.add_argument('--item_link', action='store', type=str, metavar='XPATH', help='xpath rule relative to items to point to the entry\'s link')
group.add_argument('--item_title', action='store', type=str, metavar='XPATH', help='entry\'s title')
group.add_argument('--item_content', action='store', type=str, metavar='XPATH', help='entry\'s content')
group.add_argument('--item_time', action='store', type=str, metavar='XPATH', help='entry\'s date & time (accepts a wide range of time formats)')
group.add_argument('--mode', default=None, choices=('xml', 'html', 'json'), help='parser to use for the custom feeds')
group = parser.add_argument_group('misc')
group.add_argument('--nolink', action='store_true', help='drop links, but keeps links\' inner text')
group.add_argument('--noref', action='store_true', help='drop items\' link')
group.add_argument('--silent', action='store_true', help='don\'t output the final RSS (useless on its own, but can be nice when debugging)')
options = Options(vars(parser.parse_args()))
url = options.url
url, rss = FeedFetch(url, options)
rss = FeedGather(rss, url, options)
out = FeedFormat(rss, options, 'unicode')
if not options.silent:
print(out)
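For reference, a hedged sketch of the same pipeline driven directly from Python instead of argparse, mirroring the last lines of cli_app above (the URL is a placeholder):

from morss.morss import FeedFetch, FeedFormat, FeedGather, Options

options = Options({'format': 'json', 'indent': True})  # roughly equivalent to `--format json --indent`
url, rss = FeedFetch('https://example.com/feed', options)
rss = FeedGather(rss, url, options)
print(FeedFormat(rss, options, 'unicode'))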

morss/crawler.py

@@ -1,26 +1,52 @@
import sys
# This file is part of morss
#
# Copyright (C) 2013-2020 pictuga <contact@pictuga.com>
#
# This program is free software: you can redistribute it and/or modify it under
# the terms of the GNU Affero General Public License as published by the Free
# Software Foundation, either version 3 of the License, or (at your option) any
# later version.
#
# This program is distributed in the hope that it will be useful, but WITHOUT
# ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
# FOR A PARTICULAR PURPOSE. See the GNU Affero General Public License for more
# details.
#
# You should have received a copy of the GNU Affero General Public License along
# with this program. If not, see <https://www.gnu.org/licenses/>.
import zlib
from io import BytesIO, StringIO
import re
import chardet
from cgi import parse_header
import lxml.html
import time
import os
import pickle
import random
import re
import sys
import time
import zlib
from cgi import parse_header
from collections import OrderedDict
from io import BytesIO, StringIO
import chardet
from .caching import default_cache
try:
# python 2
from urllib2 import BaseHandler, HTTPCookieProcessor, Request, addinfourl, parse_keqv_list, parse_http_list, build_opener
from urllib import quote
from urlparse import urlparse, urlunparse
import mimetools
from httplib import HTTPMessage
from urllib2 import (BaseHandler, HTTPCookieProcessor, HTTPRedirectHandler,
Request, addinfourl, build_opener, parse_http_list,
parse_keqv_list)
from urlparse import urlsplit
except ImportError:
# python 3
from urllib.request import BaseHandler, HTTPCookieProcessor, Request, addinfourl, parse_keqv_list, parse_http_list, build_opener
from urllib.parse import quote
from urllib.parse import urlparse, urlunparse
import email
from email import message_from_string
from http.client import HTTPMessage
from urllib.parse import quote, urlsplit
from urllib.request import (BaseHandler, HTTPCookieProcessor,
HTTPRedirectHandler, Request, addinfourl,
build_opener, parse_http_list, parse_keqv_list)
try:
# python 2
@@ -33,7 +59,9 @@ except NameError:
MIMETYPE = {
'xml': ['text/xml', 'application/xml', 'application/rss+xml', 'application/rdf+xml', 'application/atom+xml', 'application/xhtml+xml'],
'rss': ['application/rss+xml', 'application/rdf+xml', 'application/atom+xml'],
'html': ['text/html', 'application/xhtml+xml', 'application/xml']}
'html': ['text/html', 'application/xhtml+xml', 'application/xml'],
'json': ['application/json'],
}
DEFAULT_UAS = [
@@ -58,14 +86,17 @@ def get(*args, **kwargs):
return adv_get(*args, **kwargs)['data']
def adv_get(url, timeout=None, *args, **kwargs):
def adv_get(url, post=None, timeout=None, *args, **kwargs):
url = sanitize_url(url)
if post is not None:
post = post.encode('utf-8')
if timeout is None:
con = custom_handler(*args, **kwargs).open(url)
con = custom_opener(*args, **kwargs).open(url, data=post)
else:
con = custom_handler(*args, **kwargs).open(url, timeout=timeout)
con = custom_opener(*args, **kwargs).open(url, data=post, timeout=timeout)
data = con.read()
@@ -73,7 +104,7 @@ def adv_get(url, timeout=None, *args, **kwargs):
encoding= detect_encoding(data, con)
return {
'data':data,
'data': data,
'url': con.geturl(),
'con': con,
'contenttype': contenttype,
@@ -81,9 +112,7 @@ def adv_get(url, timeout=None, *args, **kwargs):
}
def custom_handler(follow=None, delay=None, encoding=None):
handlers = []
def custom_opener(follow=None, policy=None, force_min=None, force_max=None):
# as per urllib2 source code, these handlers are added first
# *unless* one of the custom handlers inherits from one of them
#
@@ -91,21 +120,33 @@ def custom_handler(follow=None, delay=None, encoding=None):
# HTTPDefaultErrorHandler, HTTPRedirectHandler,
# FTPHandler, FileHandler, HTTPErrorProcessor]
# & HTTPSHandler
#
# when processing a request:
# (1) all the *_request are run
# (2) the *_open are run until sth is returned (other than None)
# (3) all the *_response are run
#
# During (3), if an http error occurs (i.e. not a 2XX response code), the
# http_error_* are run until sth is returned (other than None). If they all
# return nothing, a python error is raised
#handlers.append(DebugHandler())
handlers.append(SizeLimitHandler(500*1024)) # 500KiB
handlers.append(HTTPCookieProcessor())
handlers.append(GZIPHandler())
handlers.append(HTTPEquivHandler())
handlers.append(HTTPRefreshHandler())
handlers.append(UAHandler(random.choice(DEFAULT_UAS)))
handlers.append(BrowserlyHeaderHandler())
handlers.append(EncodingFixHandler(encoding))
handlers = [
#DebugHandler(),
SizeLimitHandler(500*1024), # 500KiB
HTTPCookieProcessor(),
GZIPHandler(),
HTTPAllRedirectHandler(),
HTTPEquivHandler(),
HTTPRefreshHandler(),
UAHandler(random.choice(DEFAULT_UAS)),
BrowserlyHeaderHandler(),
EncodingFixHandler(),
]
if follow:
handlers.append(AlternateHandler(MIMETYPE[follow]))
handlers.append(CacheHandler(force_min=delay))
handlers.append(CacheHandler(policy=policy, force_min=force_min, force_max=force_max))
return build_opener(*handlers)
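# Hedged usage sketch of the opener built above (placeholder URL), which is what
# adv_get() relies on:
#
#     con = custom_opener(follow='rss', policy='cached').open('https://example.com/', timeout=4)
#     data = con.read()
#
# 'cached' serves from the cache whenever an entry exists, fetching online otherwise.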
@@ -122,10 +163,20 @@ def is_ascii(string):
return True
def soft_quote(string):
" url-quote only when not a valid ascii string "
if is_ascii(string):
return string
else:
return quote(string.encode('utf-8'))
def sanitize_url(url):
# make sure the url is unicode, i.e. not bytes
if isinstance(url, bytes):
url = url.decode()
url = url.decode('utf-8')
# make sure there's a protocol (http://)
if url.split(':', 1)[0] not in PROTOCOL:
@@ -138,18 +189,64 @@ def sanitize_url(url):
url = url.replace(' ', '%20')
# escape non-ascii unicode characters
# https://stackoverflow.com/a/4391299
parts = list(urlparse(url))
parts = urlsplit(url)
for i in range(len(parts)):
if not is_ascii(parts[i]):
if i == 1:
parts[i] = parts[i].encode('idna').decode('ascii')
parts = parts._replace(
netloc=parts.netloc.replace(
parts.hostname,
parts.hostname.encode('idna').decode('ascii')
),
path=soft_quote(parts.path),
query=soft_quote(parts.query),
fragment=soft_quote(parts.fragment),
)
else:
parts[i] = quote(parts[i].encode('utf-8'))
return parts.geturl()
return urlunparse(parts)
class RespDataHandler(BaseHandler):
" Make it easier to use the reponse body "
def data_reponse(self, req, resp, data):
pass
def http_response(self, req, resp):
# read data
data = resp.read()
# process data and use returned content (if any)
data = self.data_response(req, resp, data) or data
# reformat the stuff
fp = BytesIO(data)
old_resp = resp
resp = addinfourl(fp, old_resp.headers, old_resp.url, old_resp.code)
resp.msg = old_resp.msg
return resp
https_response = http_response
class RespStrHandler(RespDataHandler):
" Make it easier to use the _decoded_ reponse body "
def str_reponse(self, req, resp, data_str):
pass
def data_response(self, req, resp, data):
#decode
enc = detect_encoding(data, resp)
data_str = data.decode(enc, 'replace')
#process
data_str = self.str_response(req, resp, data_str)
# return
data = data_str.encode(enc) if data_str is not None else data
#return
return data
class DebugHandler(BaseHandler):
@@ -172,7 +269,7 @@ class SizeLimitHandler(BaseHandler):
handler_order = 450
def __init__(self, limit=5*1024^2):
def __init__(self, limit=5*1024**2):
self.limit = limit
def http_response(self, req, resp):
@@ -193,35 +290,23 @@ def UnGzip(data):
return zlib.decompressobj(zlib.MAX_WBITS | 32).decompress(data)
class GZIPHandler(BaseHandler):
class GZIPHandler(RespDataHandler):
def http_request(self, req):
req.add_unredirected_header('Accept-Encoding', 'gzip')
return req
def http_response(self, req, resp):
def data_response(self, req, resp, data):
if 200 <= resp.code < 300:
if resp.headers.get('Content-Encoding') == 'gzip':
data = resp.read()
data = UnGzip(data)
resp.headers['Content-Encoding'] = 'identity'
fp = BytesIO(data)
old_resp = resp
resp = addinfourl(fp, old_resp.headers, old_resp.url, old_resp.code)
resp.msg = old_resp.msg
return resp
https_response = http_response
https_request = http_request
return UnGzip(data)
def detect_encoding(data, resp=None):
enc = detect_raw_encoding(data, resp)
if enc == 'gb2312':
if enc.lower() == 'gb2312':
enc = 'gbk'
return enc
@@ -252,32 +337,9 @@ def detect_raw_encoding(data, resp=None):
return 'utf-8'
class EncodingFixHandler(BaseHandler):
def __init__(self, encoding=None):
self.encoding = encoding
def http_response(self, req, resp):
maintype = resp.info().get('Content-Type', '').split('/')[0]
if 200 <= resp.code < 300 and maintype == 'text':
data = resp.read()
if not self.encoding:
enc = detect_encoding(data, resp)
else:
enc = self.encoding
if enc:
data = data.decode(enc, 'replace')
data = data.encode(enc)
fp = BytesIO(data)
old_resp = resp
resp = addinfourl(fp, old_resp.headers, old_resp.url, old_resp.code)
resp.msg = old_resp.msg
return resp
https_response = http_response
class EncodingFixHandler(RespStrHandler):
def str_response(self, req, resp, data_str):
return data_str
class UAHandler(BaseHandler):
@@ -303,60 +365,58 @@ class BrowserlyHeaderHandler(BaseHandler):
https_request = http_request
class AlternateHandler(BaseHandler):
def iter_html_tag(html_str, tag_name):
" To avoid parsing whole pages when looking for a simple tag "
re_tag = r'<%s\s+[^>]+>' % tag_name
re_attr = r'(?P<key>[^=\s]+)=[\'"](?P<value>[^\'"]+)[\'"]'
for tag_match in re.finditer(re_tag, html_str):
attr_match = re.findall(re_attr, tag_match.group(0))
if attr_match is not None:
yield dict(attr_match)
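# For instance (hedged illustration), given the snippet
#     '<link rel="alternate" type="application/rss+xml" href="/feed.xml">'
# iter_html_tag(html_str, 'link') yields
#     {'rel': 'alternate', 'type': 'application/rss+xml', 'href': '/feed.xml'}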
class AlternateHandler(RespStrHandler):
" Follow <link rel='alternate' type='application/rss+xml' href='...' /> "
def __init__(self, follow=None):
self.follow = follow or []
def http_response(self, req, resp):
def str_response(self, req, resp, data_str):
contenttype = resp.info().get('Content-Type', '').split(';')[0]
if 200 <= resp.code < 300 and len(self.follow) and contenttype in MIMETYPE['html'] and contenttype not in self.follow:
# oops, not what we were looking for, let's see if the html page suggests an alternative page of the right types
data = resp.read()
links = lxml.html.fromstring(data[:10000]).findall('.//link[@rel="alternate"]')
for link in links:
if link.get('type', '') in self.follow:
for link in iter_html_tag(data_str[:10000], 'link'):
if (link.get('rel') == 'alternate'
and link.get('type') in self.follow
and 'href' in link):
resp.code = 302
resp.msg = 'Moved Temporarily'
resp.headers['location'] = link.get('href')
break
fp = BytesIO(data)
old_resp = resp
resp = addinfourl(fp, old_resp.headers, old_resp.url, old_resp.code)
resp.msg = old_resp.msg
return resp
https_response = http_response
class HTTPEquivHandler(BaseHandler):
class HTTPEquivHandler(RespStrHandler):
" Handler to support <meta http-equiv='...' content='...' />, since it defines HTTP headers "
handler_order = 600
def http_response(self, req, resp):
def str_response(self, req, resp, data_str):
contenttype = resp.info().get('Content-Type', '').split(';')[0]
if 200 <= resp.code < 300 and contenttype in MIMETYPE['html']:
data = resp.read()
headers = lxml.html.fromstring(data[:10000]).findall('.//meta[@http-equiv]')
for meta in iter_html_tag(data_str[:10000], 'meta'):
if 'http-equiv' in meta and 'content' in meta:
resp.headers[meta.get('http-equiv').lower()] = meta.get('content')
for header in headers:
resp.headers[header.get('http-equiv').lower()] = header.get('content')
fp = BytesIO(data)
old_resp = resp
resp = addinfourl(fp, old_resp.headers, old_resp.url, old_resp.code)
resp.msg = old_resp.msg
return resp
https_response = http_response
class HTTPAllRedirectHandler(HTTPRedirectHandler):
def http_error_308(self, req, fp, code, msg, headers):
return self.http_error_301(req, fp, 301, msg, headers)
class HTTPRefreshHandler(BaseHandler):
@@ -365,7 +425,7 @@ class HTTPRefreshHandler(BaseHandler):
def http_response(self, req, resp):
if 200 <= resp.code < 300:
if resp.headers.get('refresh'):
regex = r'(?i)^(?P<delay>[0-9]+)\s*;\s*url=(["\']?)(?P<url>.+)\2$'
regex = r'(?i)^(?P<delay>[0-9]+)\s*;\s*url\s*=\s*(["\']?)(?P<url>.+)\2$'
match = re.search(regex, resp.headers.get('refresh'))
if match:
@@ -381,59 +441,124 @@ class HTTPRefreshHandler(BaseHandler):
https_response = http_response
default_cache = {}
def parse_headers(text=u'\n\n'):
if sys.version_info[0] >= 3:
# python 3
return message_from_string(text, _class=HTTPMessage)
else:
# python 2
return HTTPMessage(StringIO(text))
def error_response(code, msg, url=''):
# return an error as a response
resp = addinfourl(BytesIO(), parse_headers(), url, code)
resp.msg = msg
return resp
class CacheHandler(BaseHandler):
" Cache based on etags/last-modified "
private_cache = False # Websites can indicate whether the page should be
# cached by CDNs (e.g. shouldn't be the case for
# private/confidential/user-specific pages.
# With this setting, decide whether (False) you want
# the cache to behave like a CDN (i.e. don't cache
# private pages), or (True) to behave like a end-cache
# private pages. If unsure, False is the safest bet.
privacy = 'private' # Websites can indicate whether the page should be cached
# by CDNs (e.g. shouldn't be the case for
# private/confidential/user-specific pages). With this
# setting, decide whether you want the cache to behave
# like a CDN (i.e. don't cache private pages, 'public'),
# or like an end-user cache (i.e. also cache private
# pages, 'private'). If unsure, 'public' is the safest bet,
# but many websites abuse this feature...
# NB. This overrides all the other min/max/policy settings.
handler_order = 499
def __init__(self, cache=None, force_min=None):
def __init__(self, cache=None, force_min=None, force_max=None, policy=None):
self.cache = cache or default_cache
self.force_min = force_min
# Servers indicate how long they think their content is "valid".
# With this parameter (force_min, expressed in seconds), we can
# override the validity period (i.e. bypassing http headers)
# Special values:
# -1: valid forever, i.e. use the cache no matter what (and fetch
# the page online if not present in cache)
# 0: valid zero second, i.e. force refresh
# -2: same as -1, i.e. use the cache no matter what, but do NOT
# fetch the page online if not present in cache, throw an
self.force_max = force_max
self.policy = policy # can be cached/refresh/offline/None (default)
# Servers indicate how long they think their content is "valid". With
# this parameter (force_min/max, expressed in seconds), we can override
# the validity period (i.e. bypassing http headers)
# Special choices, via "policy":
# cached: use the cache no matter what (and fetch the page online if
# not present in cache)
# refresh: valid zero second, i.e. force refresh
# offline: same as cached, i.e. use the cache no matter what, but do
# NOT fetch the page online if not present in cache, throw an
# error instead
# None: just follow protocols
# sanity checks
assert self.force_max is None or self.force_max >= 0
assert self.force_min is None or self.force_min >= 0
assert self.force_max is None or self.force_min is None or self.force_max >= self.force_min
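# Hedged illustration of the policy/override combinations described above:
#     CacheHandler(policy='cached')    # serve whatever is cached, fetch online only if absent
#     CacheHandler(policy='offline')   # cache only; a 409 error response if nothing is cached
#     CacheHandler(policy='refresh')   # always refetch
#     CacheHandler(force_min=5*60)     # trust cache entries younger than 5 minutes
#     CacheHandler(force_max=60*60)    # never serve an entry older than 1 hour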
def load(self, url):
try:
out = list(self.cache[url])
data = pickle.loads(self.cache[url])
except KeyError:
out = [None, None, unicode(), bytes(), 0]
data = None
if sys.version_info[0] >= 3:
out[2] = email.message_from_string(out[2] or unicode()) # headers
else:
out[2] = mimetools.Message(StringIO(out[2] or unicode()))
data['headers'] = parse_headers(data['headers'] or unicode())
return out
return data
def save(self, url, code, msg, headers, data, timestamp):
self.cache[url] = (code, msg, unicode(headers), data, timestamp)
def save(self, key, data):
data['headers'] = unicode(data['headers'])
self.cache[key] = pickle.dumps(data, 0)
def cached_response(self, req, fallback=None):
req.from_morss_cache = True
data = self.load(req.get_full_url())
if data is not None:
# return the cache as a response
resp = addinfourl(BytesIO(data['data']), data['headers'], req.get_full_url(), data['code'])
resp.msg = data['msg']
return resp
else:
return fallback
def save_response(self, req, resp):
if req.from_morss_cache:
# do not re-save (would reset the timing)
return resp
data = resp.read()
self.save(req.get_full_url(), {
'code': resp.code,
'msg': resp.msg,
'headers': resp.headers,
'data': data,
'timestamp': time.time()
})
fp = BytesIO(data)
old_resp = resp
resp = addinfourl(fp, old_resp.headers, old_resp.url, old_resp.code)
resp.msg = old_resp.msg
return resp
def http_request(self, req):
(code, msg, headers, data, timestamp) = self.load(req.get_full_url())
req.from_morss_cache = False # to track whether it comes from cache
if 'etag' in headers:
req.add_unredirected_header('If-None-Match', headers['etag'])
data = self.load(req.get_full_url())
if 'last-modified' in headers:
req.add_unredirected_header('If-Modified-Since', headers.get('last-modified'))
if data is not None:
if 'etag' in data['headers']:
req.add_unredirected_header('If-None-Match', data['headers']['etag'])
if 'last-modified' in data['headers']:
req.add_unredirected_header('If-Modified-Since', data['headers']['last-modified'])
return req
@@ -442,218 +567,121 @@ class CacheHandler(BaseHandler):
# If 'None' is returned, try your chance with the next-available handler
# If a 'resp' is returned, stop there, and proceed with 'http_response'
(code, msg, headers, data, timestamp) = self.load(req.get_full_url())
# Here, we try to see whether we want to use data from cache (i.e.
# return 'resp'), or whether we want to refresh the content (return
# 'None')
data = self.load(req.get_full_url())
if data is not None:
# some info needed to process everything
cache_control = parse_http_list(headers.get('cache-control', ()))
cache_control += parse_http_list(headers.get('pragma', ()))
cache_control = parse_http_list(data['headers'].get('cache-control', ()))
cache_control += parse_http_list(data['headers'].get('pragma', ()))
cc_list = [x for x in cache_control if '=' not in x]
cc_values = parse_keqv_list([x for x in cache_control if '=' in x])
cache_age = time.time() - timestamp
cache_age = time.time() - data['timestamp']
# list in a simple way what to do when
if req.get_header('Morss') == 'from_304': # for whatever reason, we need an uppercase
# we're just in the middle of a dirty trick, use cache
pass
# list in a simple way what to do in special cases
elif self.force_min == -2:
if code is not None:
# already in cache, perfect, use cache
pass
else:
# raise an error, via urllib handlers
headers['Morss'] = 'from_cache'
resp = addinfourl(BytesIO(), headers, req.get_full_url(), 409)
resp.msg = 'Conflict'
return resp
elif code is None:
# cache empty, refresh
if data is not None and 'private' in cc_list and self.privacy == 'public':
# private data but public cache, do not use cache
# privacy concern, so handled first and foremost
# (and doesn't need to be addressed anymore afterwards)
return None
elif self.force_min == -1:
# force use cache
pass
elif self.policy == 'offline':
# use cache, or return an error
return self.cached_response(
req,
error_response(409, 'Conflict', req.get_full_url())
)
elif self.force_min == 0:
elif self.policy == 'cached':
# use cache, or fetch online
return self.cached_response(req, None)
elif self.policy == 'refresh':
# force refresh
return None
elif code == 301 and cache_age < 7*24*3600:
elif data is None:
# we have already settled all the cases that don't need the cache.
# all the following ones need the cached item
return None
elif self.force_max is not None and cache_age > self.force_max:
# older than we want, refresh
return None
elif self.force_min is not None and cache_age < self.force_min:
# recent enough, use cache
return self.cached_response(req)
elif data['code'] == 301 and cache_age < 7*24*3600:
# "301 Moved Permanently" has to be cached...as long as we want
# (awesome HTTP specs), let's say a week (why not?). Use force_min=0
# if you want to bypass this (needed for a proper refresh)
pass
return self.cached_response(req)
elif self.force_min is None and ('no-cache' in cc_list
or 'no-store' in cc_list
or ('private' in cc_list and not self.private_cache)):
# kindly follow web servers indications, refresh
# if the same settings are used all along, this section shouldn't be
# of any use, since the page wouldn't be cached in the first place
# the check is only performed "just in case"
elif self.force_min is None and ('no-cache' in cc_list or 'no-store' in cc_list):
# kindly follow web servers indications, refresh if the same
# settings are used all along, this section shouldn't be of any use,
# since the page wouldn't be cached in the first place the check is
# only performed "just in case"
# NB. NOT respected if force_min is set
return None
elif 'max-age' in cc_values and int(cc_values['max-age']) > cache_age:
# server says it's still fine (and we trust him, if not, use force_min=0), use cache
pass
elif self.force_min is not None and self.force_min > cache_age:
# still recent enough for us, use cache
pass
# server says it's still fine (and we trust him, if not, use overrides), use cache
return self.cached_response(req)
else:
# according to the www, we have to refresh when nothing is said
return None
# return the cache as a response. This code is reached with 'pass' above
headers['morss'] = 'from_cache' # TODO delete the morss header from incoming pages, to avoid websites messing up with us
resp = addinfourl(BytesIO(data), headers, req.get_full_url(), code)
resp.msg = msg
return resp
def http_response(self, req, resp):
# code for after-fetch, to know whether to save to hard-drive (if stiking to http headers' will)
# code for after-fetch, to know whether to save to hard-drive (if sticking to http headers' will)
if resp.code == 304:
return resp
if resp.code == 304 and resp.url in self.cache:
# we are hopefully the first after the HTTP handler, so no need
# to re-run all the *_response
# here: cached page, returning from cache
return self.cached_response(req)
if ('cache-control' in resp.headers or 'pragma' in resp.headers) and self.force_min is None:
elif self.force_min is None and ('cache-control' in resp.headers or 'pragma' in resp.headers):
cache_control = parse_http_list(resp.headers.get('cache-control', ()))
cache_control += parse_http_list(resp.headers.get('pragma', ()))
cc_list = [x for x in cache_control if '=' not in x]
if 'no-cache' in cc_list or 'no-store' in cc_list or ('private' in cc_list and not self.private_cache):
# kindly follow web servers indications
if 'no-cache' in cc_list or 'no-store' in cc_list or ('private' in cc_list and self.privacy == 'public'):
# kindly follow web servers indications (do not save & return)
return resp
if resp.headers.get('Morss') == 'from_cache':
# it comes from cache, so no need to save it again
return resp
else:
# save
return self.save_response(req, resp)
# save to disk
data = resp.read()
self.save(req.get_full_url(), resp.code, resp.msg, resp.headers, data, time.time())
# the below is only needed because of 'resp.read()' above, as we can't
# seek(0) on arbitraty file-like objects (e.g. sockets)
fp = BytesIO(data)
old_resp = resp
resp = addinfourl(fp, old_resp.headers, old_resp.url, old_resp.code)
resp.msg = old_resp.msg
return resp
def http_error_304(self, req, fp, code, msg, headers):
cache = list(self.load(req.get_full_url()))
if cache[0]:
cache[-1] = time.time()
self.save(req.get_full_url(), *cache)
new = Request(req.get_full_url(),
headers=req.headers,
unverifiable=True)
new.add_unredirected_header('Morss', 'from_304')
# create a "fake" new request to just re-run through the various
# handlers
return self.parent.open(new, timeout=req.timeout)
return None # when returning 'None', the next-available handler is used
# the 'HTTPRedirectHandler' has no 'handler_order', i.e.
# uses the default of 500, therefore executed after this
else:
return self.save_response(req, resp)
https_request = http_request
https_open = http_open
https_response = http_response
class BaseCache:
""" Subclasses must behave like a dict """
def __contains__(self, url):
try:
self[url]
except KeyError:
return False
else:
return True
import sqlite3
class SQLiteCache(BaseCache):
def __init__(self, filename=':memory:'):
self.con = sqlite3.connect(filename, detect_types=sqlite3.PARSE_DECLTYPES, check_same_thread=False)
with self.con:
self.con.execute('CREATE TABLE IF NOT EXISTS data (url UNICODE PRIMARY KEY, code INT, msg UNICODE, headers UNICODE, data BLOB, timestamp INT)')
self.con.execute('pragma journal_mode=WAL')
def __del__(self):
self.con.close()
def __getitem__(self, url):
row = self.con.execute('SELECT * FROM data WHERE url=?', (url,)).fetchone()
if not row:
raise KeyError
return row[1:]
def __setitem__(self, url, value): # value = (code, msg, headers, data, timestamp)
value = list(value)
value[3] = sqlite3.Binary(value[3]) # data
value = tuple(value)
with self.con:
self.con.execute('INSERT INTO data VALUES (?,?,?,?,?,?) ON CONFLICT(url) DO UPDATE SET code=?, msg=?, headers=?, data=?, timestamp=?', (url,) + value + value)
import pymysql.cursors
class MySQLCacheHandler(BaseCache):
def __init__(self, user, password, database, host='localhost'):
self.user = user
self.password = password
self.database = database
self.host = host
with self.cursor() as cursor:
cursor.execute('CREATE TABLE IF NOT EXISTS data (url VARCHAR(255) NOT NULL PRIMARY KEY, code INT, msg TEXT, headers TEXT, data BLOB, timestamp INT)')
def cursor(self):
return pymysql.connect(host=self.host, user=self.user, password=self.password, database=self.database, charset='utf8', autocommit=True).cursor()
def __getitem__(self, url):
cursor = self.cursor()
cursor.execute('SELECT * FROM data WHERE url=%s', (url,))
row = cursor.fetchone()
if not row:
raise KeyError
return row[1:]
def __setitem__(self, url, value): # (code, msg, headers, data, timestamp)
with self.cursor() as cursor:
cursor.execute('INSERT INTO data VALUES (%s,%s,%s,%s,%s,%s) ON DUPLICATE KEY UPDATE code=%s, msg=%s, headers=%s, data=%s, timestamp=%s',
(url,) + value + value)
if 'IGNORE_SSL' in os.environ:
import ssl
ssl._create_default_https_context = ssl._create_unverified_context
if __name__ == '__main__':
req = adv_get(sys.argv[1] if len(sys.argv) > 1 else 'https://morss.it')
if not sys.flags.interactive:
if sys.flags.interactive:
print('>>> Interactive shell: try using `req`')
else:
print(req['data'].decode(req['encoding']))

morss/feedify.ini

@@ -73,7 +73,7 @@ item_updated = atom03:updated
mode = json
mimetype = application/json
timeformat = %Y-%m-%dT%H:%M:%SZ
timeformat = %Y-%m-%dT%H:%M:%S%z
base = {}
title = title
@@ -90,9 +90,6 @@ item_updated = updated
[html]
mode = html
path =
http://localhost/
title = //div[@id='header']/h1
desc = //div[@id='header']/p
items = //div[@id='content']/div

morss/feeds.py

@@ -1,32 +1,45 @@
import sys
import os.path
# This file is part of morss
#
# Copyright (C) 2013-2020 pictuga <contact@pictuga.com>
#
# This program is free software: you can redistribute it and/or modify it under
# the terms of the GNU Affero General Public License as published by the Free
# Software Foundation, either version 3 of the License, or (at your option) any
# later version.
#
# This program is distributed in the hope that it will be useful, but WITHOUT
# ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
# FOR A PARTICULAR PURPOSE. See the GNU Affero General Public License for more
# details.
#
# You should have received a copy of the GNU Affero General Public License along
# with this program. If not, see <https://www.gnu.org/licenses/>.
from datetime import datetime
import re
import json
import csv
import json
import re
from copy import deepcopy
from datetime import datetime
from fnmatch import fnmatch
from lxml import etree
from dateutil import tz
import dateutil.parser
from copy import deepcopy
import lxml.html
from dateutil import tz
from lxml import etree
from .readabilite import parse as html_parse
from .util import *
json.encoder.c_make_encoder = None
try:
# python 2
from StringIO import StringIO
from ConfigParser import RawConfigParser
from StringIO import StringIO
except ImportError:
# python 3
from io import StringIO
from configparser import RawConfigParser
from io import StringIO
try:
# python 2
@@ -38,7 +51,7 @@ except NameError:
def parse_rules(filename=None):
if not filename:
filename = os.path.join(os.path.dirname(__file__), 'feedify.ini')
filename = pkg_path('feedify.ini')
config = RawConfigParser()
config.read(filename)
@@ -52,39 +65,36 @@ def parse_rules(filename=None):
# for each rule
if rules[section][arg].startswith('file:'):
paths = [os.path.join(sys.prefix, 'share/morss/www', rules[section][arg][5:]),
os.path.join(os.path.dirname(__file__), '../www', rules[section][arg][5:]),
os.path.join(os.path.dirname(__file__), '../..', rules[section][arg][5:])]
for path in paths:
try:
path = data_path('www', rules[section][arg][5:])
file_raw = open(path).read()
file_clean = re.sub('<[/?]?(xsl|xml)[^>]+?>', '', file_raw)
rules[section][arg] = file_clean
except IOError:
pass
elif '\n' in rules[section][arg]:
rules[section][arg] = rules[section][arg].split('\n')[1:]
return rules
def parse(data, url=None, encoding=None):
def parse(data, url=None, encoding=None, ruleset=None):
" Determine which ruleset to use "
rulesets = parse_rules()
if ruleset is not None:
rulesets = [ruleset]
else:
rulesets = parse_rules().values()
parsers = [FeedXML, FeedHTML, FeedJSON]
# 1) Look for a ruleset based on path
if url is not None:
for ruleset in rulesets.values():
for ruleset in rulesets:
if 'path' in ruleset:
for path in ruleset['path']:
if fnmatch(url, path):
parser = [x for x in parsers if x.mode == ruleset['mode']][0]
parser = [x for x in parsers if x.mode == ruleset.get('mode')][0] # FIXME what if no mode specified?
return parser(data, ruleset, encoding=encoding)
# 2) Try each and every parser
@@ -94,9 +104,6 @@ def parse(data, url=None, encoding=None):
# 3b) See if .items matches anything
for parser in parsers:
ruleset_candidates = [x for x in rulesets.values() if x['mode'] == parser.mode and 'path' not in x]
# 'path' as they should have been caught beforehand
try:
feed = parser(data, encoding=encoding)
@@ -107,13 +114,17 @@ def parse(data, url=None, encoding=None):
else:
# parsing worked, now we try the rulesets
ruleset_candidates = [x for x in rulesets if x.get('mode') in (parser.mode, None) and 'path' not in x]
# 'path' as they should have been caught beforehand
# try anyway if no 'mode' specified
for ruleset in ruleset_candidates:
feed.rules = ruleset
try:
feed.items[0]
except (AttributeError, IndexError):
except (AttributeError, IndexError, TypeError):
# parsing and or item picking did not work out
pass
@@ -176,11 +187,12 @@ class ParserBase(object):
return self.convert(FeedHTML).tostring(**k)
def convert(self, TargetParser):
if type(self) == TargetParser:
return self
target = TargetParser()
if type(self) == TargetParser and self.rules == target.rules:
# check both type *AND* rules (e.g. when going from freeform xml to rss)
return self
for attr in target.dic:
if attr == 'items':
for item in self.items:
@@ -349,7 +361,13 @@ class ParserXML(ParserBase):
def rule_search_all(self, rule):
try:
return self.root.xpath(rule, namespaces=self.NSMAP)
match = self.root.xpath(rule, namespaces=self.NSMAP)
if isinstance(match, str):
# some xpath rules return a single string instead of an array (e.g. concatenate() )
return [match,]
else:
return match
except etree.XPathEvalError:
return []
@@ -412,7 +430,7 @@ class ParserXML(ParserBase):
match = self.rule_search(rrule)
html_rich = ('atom' in rule or self.rules['mode'] == 'html') \
html_rich = ('atom' in rule or self.rules.get('mode') == 'html') \
and rule in [self.rules.get('item_desc'), self.rules.get('item_content')]
if key is not None:
@@ -423,7 +441,7 @@ class ParserXML(ParserBase):
self._clean_node(match)
match.append(lxml.html.fragment_fromstring(value, create_parent='div'))
if self.rules['mode'] == 'html':
if self.rules.get('mode') == 'html':
match.find('div').drop_tag() # not supported by lxml.etree
else: # i.e. if atom
@@ -439,7 +457,7 @@ class ParserXML(ParserBase):
def rule_str(self, rule):
match = self.rule_search(rule)
html_rich = ('atom' in rule or self.rules['mode'] == 'html') \
html_rich = ('atom' in rule or self.mode == 'html') \
and rule in [self.rules.get('item_desc'), self.rules.get('item_content')]
if isinstance(match, etree._Element):
@@ -472,7 +490,14 @@ class ParserHTML(ParserXML):
repl = r'[@class and contains(concat(" ", normalize-space(@class), " "), " \1 ")]'
rule = re.sub(pattern, repl, rule)
return self.root.xpath(rule)
match = self.root.xpath(rule)
if isinstance(match, str):
# for some xpath rules, see XML parser
return [match,]
else:
return match
except etree.XPathEvalError:
return []
@@ -491,24 +516,31 @@ class ParserHTML(ParserXML):
def parse_time(value):
# parsing per se
if value is None or value == 0:
return None
time = None
elif isinstance(value, basestring):
if re.match(r'^[0-9]+$', value):
return datetime.fromtimestamp(int(value), tz.tzutc())
time = datetime.fromtimestamp(int(value))
else:
return dateutil.parser.parse(value).replace(tzinfo=tz.tzutc())
time = dateutil.parser.parse(value)
elif isinstance(value, int):
return datetime.fromtimestamp(value, tz.tzutc())
time = datetime.fromtimestamp(value)
elif isinstance(value, datetime):
return value
time = value
else:
return None
time = None
# add default time zone if none set
if time is not None and time.tzinfo is None:
time = time.replace(tzinfo=tz.tzutc())
return time
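# Hedged examples of what the normalisation above returns:
#     parse_time('2022-02-01T23:32:49Z')  # timezone-aware datetime (zone taken from the string)
#     parse_time(1643753569)              # timezone-aware datetime (UTC attached since none was given)
#     parse_time(None), parse_time(0)     # None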
class ParserJSON(ParserBase):
@@ -609,34 +641,41 @@ class ParserJSON(ParserBase):
return out.replace('\n', '<br/>') if out else out
class Uniq(object):
_map = {}
_id = None
def wrap_uniq(wrapper_fn_name):
" Wraps the output of the function with the specified function "
# This is called when parsing "wrap_uniq('wrap_item')"
def __new__(cls, *args, **kwargs):
# check if a wrapper was already created for it
# if so, reuse it
# if not, create a new one
# note that the item itself (the tree node) is created beforehand
def decorator(func):
# This is called when parsing "@wrap_uniq('wrap_item')"
tmp_id = cls._gen_id(*args, **kwargs)
if tmp_id in cls._map:
return cls._map[tmp_id]
def wrapped_func(self, *args, **kwargs):
# This is called when the wrapped function is called
output = func(self, *args, **kwargs)
output_id = id(output)
try:
return self._map[output_id]
except (KeyError, AttributeError):
if not hasattr(self, '_map'):
self._map = {}
wrapper_fn = getattr(self, wrapper_fn_name)
obj = wrapper_fn(output)
self._map[output_id] = obj
else:
obj = object.__new__(cls) #, *args, **kwargs)
cls._map[tmp_id] = obj
return obj
return wrapped_func
return decorator
class Feed(object):
itemsClass = 'Item'
itemsClass = property(lambda x: Item) # because Item is define below, i.e. afterwards
dic = ('title', 'desc', 'items')
def wrap_items(self, items):
itemsClass = globals()[self.itemsClass]
return [itemsClass(x, self.rules, self) for x in items]
title = property(
lambda f: f.get('title'),
lambda f,x: f.set('title', x),
@@ -652,10 +691,7 @@ class Feed(object):
self.rule_create(self.rules['items'])
item = self.items[-1]
if new is None:
return
for attr in globals()[self.itemsClass].dic:
for attr in self.itemsClass.dic:
try:
setattr(item, attr, getattr(new, attr))
@@ -663,11 +699,17 @@ class Feed(object):
try:
setattr(item, attr, new[attr])
except (IndexError, TypeError):
except (KeyError, IndexError, TypeError):
pass
return item
def wrap_item(self, item):
return self.itemsClass(item, self.rules, self)
@wrap_uniq('wrap_item')
def __getitem__(self, key):
return self.wrap_items(self.get_raw('items'))[key]
return self.get_raw('items')[key]
def __delitem__(self, key):
self[key].remove()
@@ -676,7 +718,7 @@ class Feed(object):
return len(self.get_raw('items'))
class Item(Uniq):
class Item(object):
dic = ('title', 'link', 'desc', 'content', 'time', 'updated')
def __init__(self, xml=None, rules=None, parent=None):
@@ -715,32 +757,45 @@ class Item(Uniq):
lambda f: f.rmv('item_updated') )
class FeedXML(Feed, ParserXML):
itemsClass = 'ItemXML'
def tostring(self, encoding='unicode', **k):
# override needed due to "getroottree" inclusion
if self.root.getprevious() is None:
self.root.addprevious(etree.PI('xml-stylesheet', 'type="text/xsl" href="/sheet.xsl"'))
return etree.tostring(self.root.getroottree(), encoding=encoding, method='xml', **k)
class ItemXML(Item, ParserXML):
pass
class FeedHTML(Feed, ParserHTML):
itemsClass = 'ItemHTML'
class FeedXML(Feed, ParserXML):
itemsClass = ItemXML
def root_siblings(self):
out = []
current = self.root.getprevious()
while current is not None:
out.append(current)
current = current.getprevious()
return out
def tostring(self, encoding='unicode', **k):
# override needed due to "getroottree" inclusion
# and to add stylesheet
stylesheets = [x for x in self.root_siblings() if isinstance(x, etree.PIBase) and x.target == 'xml-stylesheet']
for stylesheet in stylesheets:
# remove all stylesheets present (be that ours or others')
self.root.append(stylesheet) # needed as we can't delete root siblings https://stackoverflow.com/a/60232366
self.root.remove(stylesheet)
self.root.addprevious(etree.PI('xml-stylesheet', 'type="text/xsl" href="/sheet.xsl"'))
return etree.tostring(self.root.getroottree(), encoding=encoding, method='xml', **k)
class ItemHTML(Item, ParserHTML):
pass
class FeedJSON(Feed, ParserJSON):
itemsClass = 'ItemJSON'
class FeedHTML(Feed, ParserHTML):
itemsClass = ItemHTML
class ItemJSON(Item, ParserJSON):
@@ -755,13 +810,21 @@ class ItemJSON(Item, ParserJSON):
cur = cur[node]
class FeedJSON(Feed, ParserJSON):
itemsClass = ItemJSON
if __name__ == '__main__':
import sys
from . import crawler
req = crawler.adv_get(sys.argv[1] if len(sys.argv) > 1 else 'https://www.nytimes.com/', follow='rss')
feed = parse(req['data'], url=req['url'], encoding=req['encoding'])
if not sys.flags.interactive:
if sys.flags.interactive:
print('>>> Interactive shell: try using `feed`')
else:
for item in feed.items:
print(item.title, item.link)

morss/morss.py

@@ -1,72 +1,66 @@
import sys
import os
import os.path
# This file is part of morss
#
# Copyright (C) 2013-2020 pictuga <contact@pictuga.com>
#
# This program is free software: you can redistribute it and/or modify it under
# the terms of the GNU Affero General Public License as published by the Free
# Software Foundation, either version 3 of the License, or (at your option) any
# later version.
#
# This program is distributed in the hope that it will be useful, but WITHOUT
# ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
# FOR A PARTICULAR PURPOSE. See the GNU Affero General Public License for more
# details.
#
# You should have received a copy of the GNU Affero General Public License along
# with this program. If not, see <https://www.gnu.org/licenses/>.
import os
import re
import sys
import time
from datetime import datetime
from dateutil import tz
from fnmatch import fnmatch
import re
import lxml.etree
import lxml.html
from dateutil import tz
from . import feeds
from . import crawler
from . import readabilite
import wsgiref.simple_server
import wsgiref.handlers
import cgitb
from . import caching, crawler, feeds, readabilite
try:
# python 2
from httplib import HTTPException
from urllib import unquote
from urlparse import urlparse, urljoin, parse_qs
from urlparse import parse_qs, urljoin, urlparse
except ImportError:
# python 3
from http.client import HTTPException
from urllib.parse import unquote
from urllib.parse import urlparse, urljoin, parse_qs
MAX_ITEM = 5 # cache-only beyond
MAX_TIME = 2 # cache-only after (in sec)
LIM_ITEM = 10 # deletes what's beyond
LIM_TIME = 2.5 # deletes what's after
DELAY = 10 * 60 # xml cache & ETag cache (in sec)
TIMEOUT = 4 # http timeout (in sec)
DEBUG = False
PORT = 8080
from urllib.parse import parse_qs, urljoin, urlparse
def filterOptions(options):
return options
MAX_ITEM = int(os.getenv('MAX_ITEM', 5)) # cache-only beyond
MAX_TIME = int(os.getenv('MAX_TIME', 2)) # cache-only after (in sec)
# example of filtering code below
LIM_ITEM = int(os.getenv('LIM_ITEM', 10)) # deletes what's beyond
LIM_TIME = int(os.getenv('LIM_TIME', 2.5)) # deletes what's after
#allowed = ['proxy', 'clip', 'cache', 'force', 'silent', 'pro', 'debug']
#filtered = dict([(key,value) for (key,value) in options.items() if key in allowed])
#return filtered
DELAY = int(os.getenv('DELAY', 10 * 60)) # xml cache & ETag cache (in sec)
TIMEOUT = int(os.getenv('TIMEOUT', 4)) # http timeout (in sec)
class MorssException(Exception):
pass
def log(txt, force=False):
if DEBUG or force:
def log(txt):
if 'DEBUG' in os.environ:
if 'REQUEST_URI' in os.environ:
# when running on Apache
open('morss.log', 'a').write("%s\n" % repr(txt))
else:
print(repr(txt))
# when using internal server or cli
print(repr(txt), file=sys.stderr)
def len_html(txt):
@@ -93,12 +87,12 @@ class Options:
else:
self.options = options or {}
def __getattr__(self, key):
def __getattr__(self, key, default=None):
if key in self.options:
return self.options[key]
else:
return False
return default
def __setitem__(self, key, value):
self.options[key] = value
@@ -106,28 +100,7 @@ class Options:
def __contains__(self, key):
return key in self.options
def parseOptions(options):
""" Turns ['md=True'] into {'md':True} """
out = {}
for option in options:
split = option.split('=', 1)
if len(split) > 1:
if split[0].lower() == 'true':
out[split[0]] = True
elif split[0].lower() == 'false':
out[split[0]] = False
else:
out[split[0]] = split[1]
else:
out[split[0]] = True
return out
get = __getitem__ = __getattr__
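# Hedged illustration of the accessors above:
#     options = Options({'format': 'json', 'indent': True})
#     options.indent                  # True
#     options.format                  # 'json'
#     options.proxy                   # None -- missing keys now fall back to the default
#     options.get('order', 'first')   # 'first' when the key is absent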
def ItemFix(item, options, feedurl='/'):
@@ -222,21 +195,20 @@ def ItemFill(item, options, feedurl='/', fast=False):
log(item.link)
# download
delay = -1
if fast or options.fast:
if fast or options.cache:
# force cache, don't fetch
delay = -2
policy = 'offline'
elif options.force:
# force refresh
delay = 0
policy = 'refresh'
else:
delay = 24*60*60 # 24h
policy = None
try:
req = crawler.adv_get(url=item.link, delay=delay, timeout=TIMEOUT)
req = crawler.adv_get(url=item.link, policy=policy, force_min=24*60*60, timeout=TIMEOUT)
except (IOError, HTTPException) as e:
log('http error')
@@ -246,11 +218,18 @@ def ItemFill(item, options, feedurl='/', fast=False):
log('non-text page')
return True
out = readabilite.get_article(req['data'], url=req['url'], encoding_in=req['encoding'], encoding_out='unicode')
if not req['data']:
log('empty page')
return True
out = readabilite.get_article(req['data'], url=req['url'], encoding_in=req['encoding'], encoding_out='unicode', xpath=options.xpath)
if out is not None:
item.content = out
if options.resolve:
item.link = req['url']
return True
@@ -287,33 +266,43 @@ def FeedFetch(url, options):
# fetch feed
delay = DELAY
if options.force:
delay = 0
if options.cache:
policy = 'offline'
elif options.force:
policy = 'refresh'
else:
policy = None
try:
req = crawler.adv_get(url=url, follow=('rss' if not options.items else None), delay=delay, timeout=TIMEOUT * 2)
req = crawler.adv_get(url=url, post=options.post, follow=('rss' if not options.items else None), policy=policy, force_min=5*60, force_max=60*60, timeout=TIMEOUT)
except (IOError, HTTPException):
raise MorssException('Error downloading feed')
if options.items:
# using custom rules
rss = feeds.FeedHTML(req['data'], encoding=req['encoding'])
ruleset = {}
rss.rules['title'] = options.title if options.title else '//head/title'
rss.rules['desc'] = options.desc if options.desc else '//head/meta[@name="description"]/@content'
ruleset['items'] = options.items
rss.rules['items'] = options.items
if options.mode:
ruleset['mode'] = options.mode
rss.rules['item_title'] = options.item_title if options.item_title else './/a|.'
rss.rules['item_link'] = options.item_link if options.item_link else './@href|.//a/@href'
ruleset['title'] = options.get('title', '//head/title')
ruleset['desc'] = options.get('desc', '//head/meta[@name="description"]/@content')
ruleset['item_title'] = options.get('item_title', '.')
ruleset['item_link'] = options.get('item_link', '(.|.//a|ancestor::a)/@href')
if options.item_content:
rss.rules['item_content'] = options.item_content
ruleset['item_content'] = options.item_content
if options.item_time:
rss.rules['item_time'] = options.item_time
ruleset['item_time'] = options.item_time
rss = feeds.parse(req['data'], encoding=req['encoding'], ruleset=ruleset)
rss = rss.convert(feeds.FeedXML)
else:
@@ -343,9 +332,23 @@ def FeedGather(rss, url, options):
if options.cache:
max_time = 0
# sort
sorted_items = list(rss.items)
if options.order == 'last':
# `first` does nothing from a practical standpoint, so only `last` needs
# to be addressed
sorted_items = reversed(sorted_items)
elif options.order in ['newest', 'oldest']:
now = datetime.now(tz.tzutc())
sorted_items = sorted(rss.items, key=lambda x:x.updated or x.time or now, reverse=True)
sorted_items = sorted(sorted_items, key=lambda x:x.updated or x.time or now) # oldest to newest
if options.order == 'newest':
sorted_items = reversed(sorted_items)
for i, item in enumerate(sorted_items):
# hard cap
if time.time() - start_time > lim_time >= 0 or i + 1 > lim_item >= 0:
log('dropped')
item.remove()
@@ -358,6 +361,7 @@ def FeedGather(rss, url, options):
item = ItemFix(item, options, url)
# soft cap
if time.time() - start_time > max_time >= 0 or i + 1 > max_item >= 0:
if not options.proxy:
if ItemFill(item, options, url, True) is False:
@@ -392,24 +396,24 @@ def FeedFormat(rss, options, encoding='utf-8'):
else:
raise MorssException('Invalid callback var name')
elif options.json:
elif options.format == 'json':
if options.indent:
return rss.tojson(encoding=encoding, indent=4)
else:
return rss.tojson(encoding=encoding)
elif options.csv:
elif options.format == 'csv':
return rss.tocsv(encoding=encoding)
elif options.html:
elif options.format == 'html':
if options.indent:
return rss.tohtml(encoding=encoding, pretty_print=True)
else:
return rss.tohtml(encoding=encoding)
else:
else: # i.e. format == 'rss'
if options.indent:
return rss.torss(xml_declaration=(not encoding == 'unicode'), encoding=encoding, pretty_print=True)
@@ -424,305 +428,9 @@ def process(url, cache=None, options=None):
options = Options(options)
if cache:
crawler.default_cache = crawler.SQLiteCache(cache)
caching.default_cache = caching.DiskCacheHandler(cache)
url, rss = FeedFetch(url, options)
rss = FeedGather(rss, url, options)
return FeedFormat(rss, options, 'unicode')
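For orientation (illustrative, not part of the diff): the helper above can be driven directly from Python. The feed URL is only an example and a working network connection is assumed; the module path morss.morss matches the imports used elsewhere in this diff.
# minimal sketch, assuming morss is installed and the URL is reachable
from morss.morss import process

out = process('https://example.com/feed.xml', options={'format': 'json'})
print(out)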
def cgi_parse_environ(environ):
# get options
if 'REQUEST_URI' in environ:
url = environ['REQUEST_URI'][1:]
else:
url = environ['PATH_INFO'][1:]
if environ['QUERY_STRING']:
url += '?' + environ['QUERY_STRING']
url = re.sub(r'^/?(cgi/)?(morss.py|main.py)/', '', url)
if url.startswith(':'):
split = url.split('/', 1)
raw_options = unquote(split[0]).replace('|', '/').replace('\\\'', '\'').split(':')[1:]
if len(split) > 1:
url = split[1]
else:
url = ''
else:
raw_options = []
# init
options = Options(filterOptions(parseOptions(raw_options)))
global DEBUG
DEBUG = options.debug
return (url, options)
def cgi_app(environ, start_response):
url, options = cgi_parse_environ(environ)
headers = {}
# headers
headers['status'] = '200 OK'
headers['cache-control'] = 'max-age=%s' % DELAY
headers['x-content-type-options'] = 'nosniff' # safari work around
if options.cors:
headers['access-control-allow-origin'] = '*'
if options.html:
headers['content-type'] = 'text/html'
elif options.txt or options.silent:
headers['content-type'] = 'text/plain'
elif options.json:
headers['content-type'] = 'application/json'
elif options.callback:
headers['content-type'] = 'application/javascript'
elif options.csv:
headers['content-type'] = 'text/csv'
headers['content-disposition'] = 'attachment; filename="feed.csv"'
else:
headers['content-type'] = 'text/xml'
headers['content-type'] += '; charset=utf-8'
crawler.default_cache = crawler.SQLiteCache(os.path.join(os.getcwd(), 'morss-cache.db'))
# get the work done
url, rss = FeedFetch(url, options)
start_response(headers['status'], list(headers.items()))
rss = FeedGather(rss, url, options)
out = FeedFormat(rss, options)
if options.silent:
return ['']
else:
return [out]
def middleware(func):
" Decorator to turn a function into a wsgi middleware "
# This is called when parsing the "@middleware" code
def app_builder(app):
# This is called when doing app = cgi_wrapper(app)
def app_wrap(environ, start_response):
# This is called when a http request is being processed
return func(environ, start_response, app)
return app_wrap
return app_builder
@middleware
def cgi_file_handler(environ, start_response, app):
" Simple HTTP server to serve static files (.html, .css, etc.) "
files = {
'': 'text/html',
'index.html': 'text/html',
'sheet.xsl': 'text/xsl'}
if 'REQUEST_URI' in environ:
url = environ['REQUEST_URI'][1:]
else:
url = environ['PATH_INFO'][1:]
if url in files:
headers = {}
if url == '':
url = 'index.html'
paths = [os.path.join(sys.prefix, 'share/morss/www', url),
os.path.join(os.path.dirname(__file__), '../www', url)]
for path in paths:
try:
body = open(path, 'rb').read()
headers['status'] = '200 OK'
headers['content-type'] = files[url]
start_response(headers['status'], list(headers.items()))
return [body]
except IOError:
continue
else:
# the for loop did not return, so here we are, i.e. no file found
headers['status'] = '404 Not found'
start_response(headers['status'], list(headers.items()))
return ['Error %s' % headers['status']]
else:
return app(environ, start_response)
def cgi_get(environ, start_response):
url, options = cgi_parse_environ(environ)
# get page
req = crawler.adv_get(url=url, timeout=TIMEOUT)
if req['contenttype'] in ['text/html', 'application/xhtml+xml', 'application/xml']:
if options.get == 'page':
html = readabilite.parse(req['data'], encoding=req['encoding'])
html.make_links_absolute(req['url'])
kill_tags = ['script', 'iframe', 'noscript']
for tag in kill_tags:
for elem in html.xpath('//'+tag):
elem.getparent().remove(elem)
output = lxml.etree.tostring(html.getroottree(), encoding='utf-8', method='html')
elif options.get == 'article':
output = readabilite.get_article(req['data'], url=req['url'], encoding_in=req['encoding'], encoding_out='utf-8', debug=options.debug)
else:
raise MorssException('no :get option passed')
else:
output = req['data']
# return html page
headers = {'status': '200 OK', 'content-type': 'text/html; charset=utf-8'}
start_response(headers['status'], list(headers.items()))
return [output]
dispatch_table = {
'get': cgi_get,
}
@middleware
def cgi_dispatcher(environ, start_response, app):
url, options = cgi_parse_environ(environ)
for key in dispatch_table.keys():
if key in options:
return dispatch_table[key](environ, start_response)
return app(environ, start_response)
@middleware
def cgi_error_handler(environ, start_response, app):
try:
return app(environ, start_response)
except (KeyboardInterrupt, SystemExit):
raise
except Exception as e:
headers = {'status': '500 Oops', 'content-type': 'text/html'}
start_response(headers['status'], list(headers.items()), sys.exc_info())
log('ERROR: %s' % repr(e), force=True)
return [cgitb.html(sys.exc_info())]
@middleware
def cgi_encode(environ, start_response, app):
out = app(environ, start_response)
return [x if isinstance(x, bytes) else str(x).encode('utf-8') for x in out]
cgi_standalone_app = cgi_encode(cgi_error_handler(cgi_dispatcher(cgi_file_handler(cgi_app))))
def cli_app():
options = Options(filterOptions(parseOptions(sys.argv[1:-1])))
url = sys.argv[-1]
global DEBUG
DEBUG = options.debug
crawler.default_cache = crawler.SQLiteCache(os.path.expanduser('~/.cache/morss-cache.db'))
url, rss = FeedFetch(url, options)
rss = FeedGather(rss, url, options)
out = FeedFormat(rss, options, 'unicode')
if not options.silent:
print(out)
log('done')
def isInt(string):
try:
int(string)
return True
except ValueError:
return False
def main():
if 'REQUEST_URI' in os.environ:
# mod_cgi
app = cgi_app
app = cgi_dispatcher(app)
app = cgi_error_handler(app)
app = cgi_encode(app)
wsgiref.handlers.CGIHandler().run(app)
elif len(sys.argv) <= 1 or isInt(sys.argv[1]):
# start internal (basic) http server
if len(sys.argv) > 1 and isInt(sys.argv[1]):
argPort = int(sys.argv[1])
if argPort > 0:
port = argPort
else:
raise MorssException('Port must be positive integer')
else:
port = PORT
app = cgi_app
app = cgi_file_handler(app)
app = cgi_dispatcher(app)
app = cgi_error_handler(app)
app = cgi_encode(app)
print('Serving http://localhost:%s/' % port)
httpd = wsgiref.simple_server.make_server('', port, app)
httpd.serve_forever()
else:
# as a CLI app
try:
cli_app()
except (KeyboardInterrupt, SystemExit):
raise
except Exception as e:
print('ERROR: %s' % e.message)
if __name__ == '__main__':
main()

morss/readabilite.py

@@ -1,19 +1,36 @@
# This file is part of morss
#
# Copyright (C) 2013-2020 pictuga <contact@pictuga.com>
#
# This program is free software: you can redistribute it and/or modify it under
# the terms of the GNU Affero General Public License as published by the Free
# Software Foundation, either version 3 of the License, or (at your option) any
# later version.
#
# This program is distributed in the hope that it will be useful, but WITHOUT
# ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
# FOR A PARTICULAR PURPOSE. See the GNU Affero General Public License for more
# details.
#
# You should have received a copy of the GNU Affero General Public License along
# with this program. If not, see <https://www.gnu.org/licenses/>.
import re
import bs4.builder._lxml
import lxml.etree
import lxml.html
from bs4 import BeautifulSoup
import re
import lxml.html.soupparser
class CustomTreeBuilder(bs4.builder._lxml.LXMLTreeBuilder):
def default_parser(self, encoding):
return lxml.html.HTMLParser(target=self, remove_comments=True, remove_pis=True, encoding=encoding)
def parse(data, encoding=None):
if encoding:
data = BeautifulSoup(data, 'lxml', from_encoding=encoding).prettify('utf-8')
else:
data = BeautifulSoup(data, 'lxml').prettify('utf-8')
parser = lxml.html.HTMLParser(remove_comments=True, encoding='utf-8')
return lxml.html.fromstring(data, parser=parser)
kwargs = {'from_encoding': encoding} if encoding else {}
return lxml.html.soupparser.fromstring(data, builder=CustomTreeBuilder, **kwargs)
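Illustrative usage only (not part of the diff): assuming bs4 and lxml are installed, the custom builder above should drop HTML comments while still tolerating broken markup.
# sketch: parse a small fragment; the comment should not survive in the output
doc = parse(b'<p>Hello <!-- hidden --> world</p>')
print(lxml.etree.tostring(doc, method='html'))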
def count_words(string):
@@ -26,6 +43,8 @@ def count_words(string):
if string is None:
return 0
string = string.strip()
i = 0
count = 0
@@ -47,12 +66,6 @@ def count_content(node):
return count_words(node.text_content()) + len(node.findall('.//img'))
def percentile(N, P):
# https://stackoverflow.com/a/7464107
n = max(int(round(P * len(N) + 0.5)), 2)
return N[n-2]
class_bad = ['comment', 'community', 'extra', 'foot',
'sponsor', 'pagination', 'pager', 'tweet', 'twitter', 'com-', 'masthead',
'media', 'meta', 'related', 'shopping', 'tags', 'tool', 'author', 'about',
@@ -131,7 +144,7 @@ def score_node(node):
if wc != 0:
wca = count_words(' '.join([x.text_content() for x in node.findall('.//a')]))
score = score * ( 1 - float(wca)/wc )
score = score * ( 1 - 2 * float(wca)/wc )
return score
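A quick numeric illustration (not part of the diff) of the stronger link-density penalty introduced above:
# sketch: with the doubled factor, a node whose words are 50% inside links scores 0
wc, wca, score = 100, 50, 10.0
print(score * (1 - 2 * float(wca) / wc))   # 0.0 (the old factor would give 5.0)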
@@ -141,15 +154,20 @@ def score_all(node):
for child in node:
score = score_node(child)
child.attrib['morss_own_score'] = str(float(score))
set_score(child, score, 'morss_own_score')
if score > 0 or len(list(child.iterancestors())) <= 2:
spread_score(child, score)
score_all(child)
def set_score(node, value):
node.attrib['morss_score'] = str(float(value))
def set_score(node, value, label='morss_score'):
try:
node.attrib[label] = str(float(value))
except KeyError:
# catch issues with e.g. html comments
pass
def get_score(node):
@@ -189,6 +207,12 @@ def clean_root(root, keep_threshold=None):
def clean_node(node, keep_threshold=None):
parent = node.getparent()
# remove comments
if (isinstance(node, lxml.html.HtmlComment)
or isinstance(node, lxml.html.HtmlProcessingInstruction)):
parent.remove(node)
return
if parent is None:
# this is <html/> (or a removed element waiting for GC)
return
@@ -198,8 +222,8 @@ def clean_node(node, keep_threshold=None):
parent.remove(node)
return
if keep_threshold is not None and get_score(node) >= keep_threshold:
# high score, so keep
if keep_threshold is not None and keep_threshold > 0 and get_score(node) >= keep_threshold:
return
gdparent = parent.getparent()
@@ -220,11 +244,6 @@ def clean_node(node, keep_threshold=None):
parent.remove(node)
return
# remove comments
if isinstance(node, lxml.html.HtmlComment) or isinstance(node, lxml.html.HtmlProcessingInstruction):
parent.remove(node)
return
# remove if too many kids & too high link density
wc = count_words(node.text_content())
if wc != 0 and len(list(node.iter())) > 3:
@@ -282,62 +301,79 @@ def clean_node(node, keep_threshold=None):
gdparent.insert(gdparent.index(parent)+1, new_node)
def lowest_common_ancestor(nodeA, nodeB, max_depth=None):
ancestorsA = list(nodeA.iterancestors())
ancestorsB = list(nodeB.iterancestors())
def lowest_common_ancestor(node_a, node_b, max_depth=None):
ancestors_a = list(node_a.iterancestors())
ancestors_b = list(node_b.iterancestors())
if max_depth is not None:
ancestorsA = ancestorsA[:max_depth]
ancestorsB = ancestorsB[:max_depth]
ancestors_a = ancestors_a[:max_depth]
ancestors_b = ancestors_b[:max_depth]
ancestorsA.insert(0, nodeA)
ancestorsB.insert(0, nodeB)
ancestors_a.insert(0, node_a)
ancestors_b.insert(0, node_b)
for ancestorA in ancestorsA:
if ancestorA in ancestorsB:
return ancestorA
for ancestor_a in ancestors_a:
if ancestor_a in ancestors_b:
return ancestor_a
return nodeA # should always find one tho, at least <html/>, but needed for max_depth
return node_a # should always find one tho, at least <html/>, but needed for max_depth
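A tiny check (not in the diff) of the helper above on a hand-built tree:
# sketch: the lowest common ancestor of the two <li> elements is their <ul>
tree = lxml.html.fromstring('<div><ul><li id="a">A</li><li id="b">B</li></ul></div>')
node_a = tree.get_element_by_id('a')
node_b = tree.get_element_by_id('b')
assert lowest_common_ancestor(node_a, node_b).tag == 'ul'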
def rank_grades(grades):
# largest score to smallest
return sorted(grades.items(), key=lambda x: x[1], reverse=True)
def get_best_node(html, threshold=5):
# score all nodes
score_all(html)
# rank all nodes (largest to smallest)
ranked_nodes = sorted(html.iter(), key=lambda x: get_score(x), reverse=True)
# minimum threshold
if not len(ranked_nodes) or get_score(ranked_nodes[0]) < threshold:
return None
# take common ancestor or the two highest rated nodes
if len(ranked_nodes) > 1:
best = lowest_common_ancestor(ranked_nodes[0], ranked_nodes[1], 3)
else:
best = ranked_nodes[0]
return best
def get_best_node(ranked_grades):
" To pick the best (raw) node. Another function will clean it "
if len(ranked_grades) == 1:
return ranked_grades[0]
lowest = lowest_common_ancestor(ranked_grades[0][0], ranked_grades[1][0], 3)
return lowest
def get_article(data, url=None, encoding_in=None, encoding_out='unicode', debug=False, threshold=5):
def get_article(data, url=None, encoding_in=None, encoding_out='unicode', debug=False, threshold=5, xpath=None):
" Input a raw html string, returns a raw html string of the article "
html = parse(data, encoding_in)
score_all(html)
scores = rank_grades(get_all_scores(html))
if not len(scores) or scores[0][1] < threshold:
if xpath is not None:
xpath_match = html.xpath(xpath)
if len(xpath_match):
best = xpath_match[0]
else:
best = get_best_node(html, threshold)
else:
best = get_best_node(html, threshold)
if best is None:
# if threshold not met
return None
best = get_best_node(scores)
# clean up
if not debug:
keep_threshold = percentile([x[1] for x in scores], 0.1)
keep_threshold = get_score(best) * 3/4
clean_root(best, keep_threshold)
# check for spammy content (links only)
wc = count_words(best.text_content())
wca = count_words(' '.join([x.text_content() for x in best.findall('.//a')]))
if not debug and (wc - wca < 50 or float(wca) / wc > 0.3):
return None
# fix urls
if url:
best.make_links_absolute(url)
@@ -346,10 +382,14 @@ def get_article(data, url=None, encoding_in=None, encoding_out='unicode', debug=
if __name__ == '__main__':
import sys
from . import crawler
req = crawler.adv_get(sys.argv[1] if len(sys.argv) > 1 else 'https://morss.it')
article = get_article(req['data'], url=req['url'], encoding_in=req['encoding'], encoding_out='unicode')
if not sys.flags.interactive:
if sys.flags.interactive:
print('>>> Interactive shell: try using `article`')
else:
print(article)

morss/util.py Normal file

@@ -0,0 +1,57 @@
# This file is part of morss
#
# Copyright (C) 2013-2020 pictuga <contact@pictuga.com>
#
# This program is free software: you can redistribute it and/or modify it under
# the terms of the GNU Affero General Public License as published by the Free
# Software Foundation, either version 3 of the License, or (at your option) any
# later version.
#
# This program is distributed in the hope that it will be useful, but WITHOUT
# ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
# FOR A PARTICULAR PURPOSE. See the GNU Affero General Public License for more
# details.
#
# You should have received a copy of the GNU Affero General Public License along
# with this program. If not, see <https://www.gnu.org/licenses/>.
import os
import os.path
import sys
def pkg_path(*path_elements):
return os.path.join(os.path.dirname(__file__), *path_elements)
data_path_base = None
def data_path(*path_elements):
global data_path_base
path = os.path.join(*path_elements)
if data_path_base is not None:
return os.path.join(data_path_base, path)
bases = [
os.path.join(sys.prefix, 'share/morss'), # when installed as root
pkg_path('../../../share/morss'),
pkg_path('../../../../share/morss'),
pkg_path('../share/morss'), # for `pip install --target=dir morss`
pkg_path('..'), # when running from source tree
]
if 'DATA_PATH' in os.environ:
bases.append(os.environ['DATA_PATH'])
for base in bases:
full_path = os.path.join(base, path)
if os.path.isfile(full_path):
data_path_base = os.path.abspath(base)
return data_path(path)
else:
raise IOError()
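Illustrative only: the resolver above can be exercised as follows, assuming morss and its data files are installed.
# sketch: resolves e.g. <sys.prefix>/share/morss/www/index.html, or raises IOError
from morss.util import data_path

print(data_path('www', 'index.html'))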

morss/wsgi.py Normal file

@@ -0,0 +1,298 @@
# This file is part of morss
#
# Copyright (C) 2013-2020 pictuga <contact@pictuga.com>
#
# This program is free software: you can redistribute it and/or modify it under
# the terms of the GNU Affero General Public License as published by the Free
# Software Foundation, either version 3 of the License, or (at your option) any
# later version.
#
# This program is distributed in the hope that it will be useful, but WITHOUT
# ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
# FOR A PARTICULAR PURPOSE. See the GNU Affero General Public License for more
# details.
#
# You should have received a copy of the GNU Affero General Public License along
# with this program. If not, see <https://www.gnu.org/licenses/>.
import cgitb
import mimetypes
import os.path
import re
import sys
import wsgiref.handlers
import wsgiref.simple_server
import wsgiref.util
import lxml.etree
try:
# python 2
from urllib import unquote
except ImportError:
# python 3
from urllib.parse import unquote
from . import caching, crawler, readabilite
from .morss import (DELAY, TIMEOUT, FeedFetch, FeedFormat, FeedGather,
MorssException, Options, log)
from .util import data_path
PORT = int(os.getenv('PORT', 8000))
def parse_options(options):
""" Turns ['md=True'] into {'md':True} """
out = {}
for option in options:
split = option.split('=', 1)
if len(split) > 1:
out[split[0]] = unquote(split[1]).replace('|', '/') # | -> / for backward compatibility (and Apache)
else:
out[split[0]] = True
return out
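An illustrative call (not in the diff) of the option parser above, fed with options already split on ':':
# sketch: ':order=newest:format=json:cors' style options after splitting
print(parse_options(['order=newest', 'format=json', 'cors']))
# -> {'order': 'newest', 'format': 'json', 'cors': True}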
def request_uri(environ):
if 'REQUEST_URI' in environ:
# when running on Apache/uwsgi
url = environ['REQUEST_URI']
elif 'RAW_URI' in environ:
# gunicorn
url = environ['RAW_URI']
else:
# when using other servers
url = environ['PATH_INFO']
if environ['QUERY_STRING']:
url += '?' + environ['QUERY_STRING']
return url
def cgi_parse_environ(environ):
# get options
url = request_uri(environ)[1:]
url = re.sub(r'^(cgi/)?(morss.py|main.py)/', '', url)
if url.startswith(':'):
parts = url.split('/', 1)
raw_options = parts[0].split(':')[1:]
url = parts[1] if len(parts) > 1 else ''
else:
raw_options = []
# init
options = Options(parse_options(raw_options))
return (url, options)
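To make the URL splitting concrete (illustrative only, with a hand-built environ containing just the keys the code above reads):
# sketch: options are taken from the leading ':...' segment, the rest is the target url
environ = {'PATH_INFO': '/:format=json:cors/https://example.com/feed', 'QUERY_STRING': ''}
url, options = cgi_parse_environ(environ)
# url -> 'https://example.com/feed'
# options.format -> 'json', options.cors -> True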
def cgi_app(environ, start_response):
url, options = cgi_parse_environ(environ)
headers = {}
# headers
headers['status'] = '200 OK'
headers['cache-control'] = 'max-age=%s' % DELAY
headers['x-content-type-options'] = 'nosniff' # safari work around
if options.cors:
headers['access-control-allow-origin'] = '*'
if options.format == 'html':
headers['content-type'] = 'text/html'
elif options.txt or options.silent:
headers['content-type'] = 'text/plain'
elif options.format == 'json':
headers['content-type'] = 'application/json'
elif options.callback:
headers['content-type'] = 'application/javascript'
elif options.format == 'csv':
headers['content-type'] = 'text/csv'
headers['content-disposition'] = 'attachment; filename="feed.csv"'
else:
headers['content-type'] = 'text/xml'
headers['content-type'] += '; charset=utf-8'
# get the work done
url, rss = FeedFetch(url, options)
start_response(headers['status'], list(headers.items()))
rss = FeedGather(rss, url, options)
out = FeedFormat(rss, options)
if options.silent:
return ['']
else:
return [out]
def middleware(func):
" Decorator to turn a function into a wsgi middleware "
# This is called when parsing the "@middleware" code
def app_builder(app):
# This is called when doing app = cgi_wrapper(app)
def app_wrap(environ, start_response):
# This is called when a http request is being processed
return func(environ, start_response, app)
return app_wrap
return app_builder
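For illustration only: a custom middleware written against the decorator above might look like this (the middleware name and header are made-up examples).
# sketch: wrap any WSGI app and log the request path before delegating to it
@middleware
def logging_middleware(environ, start_response, app):
    log('handling %s' % environ.get('PATH_INFO', ''))
    return app(environ, start_response)

# wrapped_app = logging_middleware(cgi_app)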
@middleware
def cgi_file_handler(environ, start_response, app):
" Simple HTTP server to serve static files (.html, .css, etc.) "
url = request_uri(environ)[1:]
if url == '':
url = 'index.html'
if re.match(r'^/?([a-zA-Z0-9_-][a-zA-Z0-9\._-]+/?)*$', url):
# if it is a legitimate url (no funny relative paths)
try:
path = data_path('www', url)
f = open(path, 'rb')
except IOError:
# problem with file (cannot open or not found)
pass
else:
# file successfully open
headers = {}
headers['status'] = '200 OK'
headers['content-type'] = mimetypes.guess_type(path)[0] or 'application/octet-stream'
start_response(headers['status'], list(headers.items()))
return wsgiref.util.FileWrapper(f)
# regex didn't validate or no file found
return app(environ, start_response)
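As a quick check (not in the diff) of the path-validation regex used above:
# sketch: the regex accepts plain relative paths and rejects directory traversal
pattern = r'^/?([a-zA-Z0-9_-][a-zA-Z0-9\._-]+/?)*$'
print(bool(re.match(pattern, 'index.html')))      # True
print(bool(re.match(pattern, 'css/style.css')))   # True
print(bool(re.match(pattern, '../etc/passwd')))   # False (leading dot not allowed)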
def cgi_get(environ, start_response):
url, options = cgi_parse_environ(environ)
# get page
if options['get'] in ('page', 'article'):
req = crawler.adv_get(url=url, timeout=TIMEOUT)
if req['contenttype'] in crawler.MIMETYPE['html']:
if options['get'] == 'page':
html = readabilite.parse(req['data'], encoding=req['encoding'])
html.make_links_absolute(req['url'])
kill_tags = ['script', 'iframe', 'noscript']
for tag in kill_tags:
for elem in html.xpath('//'+tag):
elem.getparent().remove(elem)
output = lxml.etree.tostring(html.getroottree(), encoding='utf-8', method='html')
else: # i.e. options['get'] == 'article'
output = readabilite.get_article(req['data'], url=req['url'], encoding_in=req['encoding'], encoding_out='utf-8', debug=options.debug)
elif req['contenttype'] in crawler.MIMETYPE['xml'] + crawler.MIMETYPE['rss'] + crawler.MIMETYPE['json']:
output = req['data']
else:
raise MorssException('unsupported mimetype')
else:
raise MorssException('no :get option passed')
# return html page
headers = {'status': '200 OK', 'content-type': req['contenttype'], 'X-Frame-Options': 'SAMEORIGIN'} # SAMEORIGIN to avoid potential abuse
start_response(headers['status'], list(headers.items()))
return [output]
dispatch_table = {
'get': cgi_get,
}
@middleware
def cgi_dispatcher(environ, start_response, app):
url, options = cgi_parse_environ(environ)
for key in dispatch_table.keys():
if key in options:
return dispatch_table[key](environ, start_response)
return app(environ, start_response)
@middleware
def cgi_error_handler(environ, start_response, app):
try:
return app(environ, start_response)
except (KeyboardInterrupt, SystemExit):
raise
except Exception as e:
headers = {'status': '404 Not Found', 'content-type': 'text/html', 'x-morss-error': repr(e)}
start_response(headers['status'], list(headers.items()), sys.exc_info())
log('ERROR: %s' % repr(e))
return [cgitb.html(sys.exc_info())]
@middleware
def cgi_encode(environ, start_response, app):
out = app(environ, start_response)
return [x if isinstance(x, bytes) else str(x).encode('utf-8') for x in out]
application = cgi_app
application = cgi_file_handler(application)
application = cgi_dispatcher(application)
application = cgi_error_handler(application)
application = cgi_encode(application)
def cgi_handle_request():
app = cgi_app
app = cgi_dispatcher(app)
app = cgi_error_handler(app)
app = cgi_encode(app)
wsgiref.handlers.CGIHandler().run(app)
class WSGIRequestHandlerRequestUri(wsgiref.simple_server.WSGIRequestHandler):
def get_environ(self):
env = wsgiref.simple_server.WSGIRequestHandler.get_environ(self)
env['REQUEST_URI'] = self.path
return env
def cgi_start_server():
caching.default_cache.autotrim()
print('Serving http://localhost:%s/' % PORT)
httpd = wsgiref.simple_server.make_server('', PORT, application, handler_class=WSGIRequestHandlerRequestUri)
httpd.serve_forever()
if 'gunicorn' in os.getenv('SERVER_SOFTWARE', ''):
caching.default_cache.autotrim()

setup.py

@@ -1,24 +1,60 @@
from setuptools import setup
from datetime import datetime
from glob import glob
from setuptools import setup
def get_version():
with open('morss/__init__.py', 'r+') as file:
lines = file.readlines()
# look for hard coded version number
for i in range(len(lines)):
if lines[i].startswith('__version__'):
version = lines[i].split('"')[1]
break
# create (& save) one if none found
if version == '':
version = datetime.now().strftime('%Y%m%d.%H%M')
lines[i] = '__version__ = "' + version + '"\n'
file.seek(0)
file.writelines(lines)
# return version number
return version
package_name = 'morss'
setup(
name = package_name,
version = get_version(),
description = 'Get full-text RSS feeds',
author = 'pictuga, Samuel Marks',
author_email = 'contact at pictuga dot com',
long_description = open('README.md').read(),
long_description_content_type = 'text/markdown',
author = 'pictuga',
author_email = 'contact@pictuga.com',
url = 'http://morss.it/',
download_url = 'https://git.pictuga.com/pictuga/morss',
project_urls = {
'Source': 'https://git.pictuga.com/pictuga/morss',
'Bug Tracker': 'https://github.com/pictuga/morss/issues',
},
license = 'AGPL v3',
packages = [package_name],
install_requires = ['lxml', 'bs4', 'python-dateutil', 'chardet', 'pymysql'],
install_requires = ['lxml', 'bs4', 'python-dateutil', 'chardet'],
extras_require = {
'full': ['redis', 'diskcache', 'gunicorn', 'setproctitle'],
'dev': ['pylint', 'pyenchant', 'pytest', 'pytest-cov'],
},
python_requires = '>=2.7',
package_data = {package_name: ['feedify.ini']},
data_files = [
('share/' + package_name, ['README.md', 'LICENSE']),
('share/' + package_name + '/www', glob('www/*.*')),
('share/' + package_name + '/www/cgi', [])
],
entry_points = {
'console_scripts': [package_name + '=' + package_name + ':main']
})
'console_scripts': [package_name + '=' + package_name + '.__main__:main'],
},
scripts = ['morss-helper'],
)

tests/conftest.py Normal file

@@ -0,0 +1,60 @@
import os
import os.path
import threading
import pytest
try:
# python2
from BaseHTTPServer import BaseHTTPRequestHandler, HTTPServer
from SimpleHTTPServer import SimpleHTTPRequestHandler
except:
# python3
from http.server import (BaseHTTPRequestHandler, HTTPServer,
SimpleHTTPRequestHandler)
class HTTPReplayHandler(SimpleHTTPRequestHandler):
" Serves pages saved alongside with headers. See `curl --http1.1 -is http://...` "
directory = os.path.join(os.path.dirname(__file__), './samples/')
__init__ = BaseHTTPRequestHandler.__init__
def do_GET(self):
path = self.translate_path(self.path)
if os.path.isdir(path):
f = self.list_directory(path)
else:
f = open(path, 'rb')
try:
self.copyfile(f, self.wfile)
finally:
f.close()
class MuteHTTPServer(HTTPServer):
def handle_error(self, request, client_address):
# mute errors
pass
def make_server(port=8888):
print('Serving http://localhost:%s/' % port)
return MuteHTTPServer(('', port), RequestHandlerClass=HTTPReplayHandler)
@pytest.fixture
def replay_server():
httpd = make_server()
thread = threading.Thread(target=httpd.serve_forever)
thread.start()
yield
httpd.shutdown()
thread.join()
if __name__ == '__main__':
httpd = make_server()
httpd.serve_forever()

tests/samples/200-ok.txt Normal file

@@ -0,0 +1,4 @@
HTTP/1.1 200 OK
content-type: text/plain
success

tests/samples/301-redirect-abs.txt Normal file

@@ -0,0 +1,3 @@
HTTP/1.1 301 Moved Permanently
location: /200-ok.txt

tests/samples/301-redirect-rel.txt Normal file

@@ -0,0 +1,3 @@
HTTP/1.1 301 Moved Permanently
location: ./200-ok.txt

tests/samples/301-redirect-url.txt Normal file

@@ -0,0 +1,3 @@
HTTP/1.1 301 Moved Permanently
location: http://localhost:8888/200-ok.txt

tests/samples/308-redirect.txt Normal file

@@ -0,0 +1,4 @@
HTTP/1.1 308 Permanent Redirect
location: /200-ok.txt
/200-ok.txt

tests/samples/alternate-abs.txt Normal file

@@ -0,0 +1,8 @@
HTTP/1.1 200 OK
content-type: text/html; charset=UTF-8
<!DOCTYPE html>
<html>
<head><link rel="alternate" type="application/rss+xml" href="/200-ok.txt" /></head>
<body>meta redirect</body>
</html>

tests/samples/enc-gb2312-header.txt Normal file

@@ -0,0 +1,4 @@
HTTP/1.1 200 OK
content-type: text/plain; charset=gb2312
成功

tests/samples/enc-gb2312-meta.txt Normal file

@@ -0,0 +1,10 @@
HTTP/1.1 200 OK
content-type: text/html
<!DOCTYPE html>
<html>
<head><meta charset="gb2312"/></head>
<body>
成功
</body></html>

tests/samples/enc-iso-8859-1-header.txt Normal file

@@ -0,0 +1,4 @@
HTTP/1.1 200 OK
content-type: text/plain; charset=iso-8859-1
succès

tests/samples/enc-iso-8859-1-missing.txt Normal file

@@ -0,0 +1,4 @@
HTTP/1.1 200 OK
content-type: text/plain
succès

tests/samples/enc-utf-8-header.txt Normal file

@@ -0,0 +1,4 @@
HTTP/1.1 200 OK
content-type: text/plain; charset=UTF-8
succès

tests/samples/feed-atom-utf-8.txt Normal file

@@ -0,0 +1,16 @@
HTTP/1.1 200 OK
Content-Type: text/xml; charset=utf-8
<?xml version='1.0' encoding='utf-8'?>
<feed xmlns="http://www.w3.org/2005/Atom">
<title>!TITLE!</title>
<subtitle>!DESC!</subtitle>
<entry>
<title>!ITEM_TITLE!</title>
<summary>!ITEM_DESC!</summary>
<content type="html">!ITEM_CONTENT!</content>
<link href="!ITEM_LINK!"/>
<updated>2022-01-01T00:00:01+01:00</updated>
<published>2022-01-01T00:00:02+01:00</published>
</entry>
</feed>

tests/samples/feed-atom03-utf-8.txt Normal file

@@ -0,0 +1,15 @@
HTTP/1.1 200 OK
content-type: application/xml
<?xml version='1.0' encoding='utf-8' ?>
<feed version='0.3' xmlns='http://purl.org/atom/ns#'>
<title>!TITLE!</title>
<subtitle>!DESC!</subtitle>
<entry>
<title>!ITEM_TITLE!</title>
<link rel='alternate' type='text/html' href='!ITEM_LINK!' />
<summary>!ITEM_DESC!</summary>
<content>!ITEM_CONTENT!</content>
<issued>2022-01-01T00:00:01+01:00</issued> <!-- FIXME -->
</entry>
</feed>

tests/samples/feed-html-utf-8.txt Normal file

@@ -0,0 +1,22 @@
HTTP/1.1 200 OK
Content-Type: text/html; charset=utf-8
<html>
<head></head>
<body>
<div id="header">
<h1>!TITLE!</h1>
<p>!DESC!</p>
</div>
<div id="content">
<div class="item">
<a target="_blank" href="!ITEM_LINK!">!ITEM_TITLE!</a>
<div class="desc">!ITEM_DESC!</div>
<div class="content">!ITEM_CONTENT!</div>
</div>
</div>
</body>
</html>

tests/samples/feed-json-utf-8.txt Normal file

@@ -0,0 +1,16 @@
HTTP/1.1 200 OK
Content-Type: application/json; charset=utf-8
{
"title": "!TITLE!",
"desc": "!DESC!",
"items": [
{
"title": "!ITEM_TITLE!",
"time": "2022-01-01T00:00:01+0100",
"url": "!ITEM_LINK!",
"desc": "!ITEM_DESC!",
"content": "!ITEM_CONTENT!"
}
]
}

tests/samples/feed-rss-channel-utf-8.txt Normal file

@@ -0,0 +1,17 @@
HTTP/1.1 200 OK
Content-Type: text/xml; charset=utf-8
<?xml version='1.0' encoding='utf-8'?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" version="2.0">
<channel>
<title>!TITLE!</title>
<description>!DESC!</description>
<item>
<title>!ITEM_TITLE!</title>
<pubDate>Mon, 01 Jan 2022 00:00:01 +0100</pubDate>
<link>!ITEM_LINK!</link>
<description>!ITEM_DESC!</description>
<content:encoded>!ITEM_CONTENT!</content:encoded>
</item>
</channel>
</rss>

tests/samples/gzip.txt Normal file (binary file, not shown)

tests/samples/header-refresh.txt Normal file

@@ -0,0 +1,3 @@
HTTP/1.1 200 OK
refresh: 0;url=/200-ok.txt

tests/samples/meta-redirect-abs.txt Normal file

@@ -0,0 +1,8 @@
HTTP/1.1 200 OK
content-type: text/html; charset=UTF-8
<!DOCTYPE html>
<html>
<head><meta http-equiv="refresh" content="2; url = /200-ok.txt" /></head>
<body>meta redirect</body>
</html>

tests/samples/meta-redirect-rel.txt Normal file

@@ -0,0 +1,8 @@
HTTP/1.1 200 OK
content-type: text/html; charset=UTF-8
<!DOCTYPE html>
<html>
<head><meta http-equiv="refresh" content="2; url = ./200-ok.txt" /></head>
<body>meta redirect</body>
</html>

tests/samples/meta-redirect-url.txt Normal file

@@ -0,0 +1,8 @@
HTTP/1.1 200 OK
content-type: text/html; charset=UTF-8
<!DOCTYPE html>
<html>
<head><meta http-equiv="refresh" content="2; url = http://localhost:8888/200-ok.txt" /></head>
<body>meta redirect</body>
</html>

tests/samples/size-1MiB.txt Normal file (diff not shown: file too large)

tests/test_crawler.py Normal file

@@ -0,0 +1,62 @@
import pytest
from morss.crawler import *
def test_get(replay_server):
assert get('http://localhost:8888/200-ok.txt') == b'success\r\n'
def test_adv_get(replay_server):
assert adv_get('http://localhost:8888/200-ok.txt')['data'] == b'success\r\n'
@pytest.mark.parametrize('before,after', [
(b'http://localhost:8888/', 'http://localhost:8888/'),
('localhost:8888/', 'http://localhost:8888/'),
('http:/localhost:8888/', 'http://localhost:8888/'),
('http://localhost:8888/&/', 'http://localhost:8888/&/'),
('http://localhost:8888/ /', 'http://localhost:8888/%20/'),
('http://localhost-€/€/', 'http://xn--localhost--077e/%E2%82%AC/'),
('http://localhost-€:8888/€/', 'http://xn--localhost--077e:8888/%E2%82%AC/'),
])
def test_sanitize_url(before, after):
assert sanitize_url(before) == after
@pytest.mark.parametrize('opener', [custom_opener(), build_opener(SizeLimitHandler(500*1024))])
def test_size_limit_handler(replay_server, opener):
assert len(opener.open('http://localhost:8888/size-1MiB.txt').read()) == 500*1024
@pytest.mark.parametrize('opener', [custom_opener(), build_opener(GZIPHandler())])
def test_gzip_handler(replay_server, opener):
assert opener.open('http://localhost:8888/gzip.txt').read() == b'success\n'
@pytest.mark.parametrize('opener', [custom_opener(), build_opener(EncodingFixHandler())])
@pytest.mark.parametrize('url', [
'enc-gb2312-header.txt', 'enc-gb2312-meta.txt', #'enc-gb2312-missing.txt',
'enc-iso-8859-1-header.txt', 'enc-iso-8859-1-missing.txt',
'enc-utf-8-header.txt',
])
def test_encoding_fix_handler(replay_server, opener, url):
out = adv_get('http://localhost:8888/%s' % url)
out = out['data'].decode(out['encoding'])
assert 'succes' in out or 'succès' in out or '成功' in out
@pytest.mark.parametrize('opener', [custom_opener(follow='rss'), build_opener(AlternateHandler(MIMETYPE['rss']))])
def test_alternate_handler(replay_server, opener):
assert opener.open('http://localhost:8888/alternate-abs.txt').geturl() == 'http://localhost:8888/200-ok.txt'
@pytest.mark.parametrize('opener', [custom_opener(), build_opener(HTTPEquivHandler(), HTTPRefreshHandler())])
def test_http_equiv_handler(replay_server, opener):
assert opener.open('http://localhost:8888/meta-redirect-abs.txt').geturl() == 'http://localhost:8888/200-ok.txt'
assert opener.open('http://localhost:8888/meta-redirect-rel.txt').geturl() == 'http://localhost:8888/200-ok.txt'
assert opener.open('http://localhost:8888/meta-redirect-url.txt').geturl() == 'http://localhost:8888/200-ok.txt'
@pytest.mark.parametrize('opener', [custom_opener(), build_opener(HTTPAllRedirectHandler())])
def test_http_all_redirect_handler(replay_server, opener):
assert opener.open('http://localhost:8888/308-redirect.txt').geturl() == 'http://localhost:8888/200-ok.txt'
assert opener.open('http://localhost:8888/301-redirect-abs.txt').geturl() == 'http://localhost:8888/200-ok.txt'
assert opener.open('http://localhost:8888/301-redirect-rel.txt').geturl() == 'http://localhost:8888/200-ok.txt'
assert opener.open('http://localhost:8888/301-redirect-url.txt').geturl() == 'http://localhost:8888/200-ok.txt'
@pytest.mark.parametrize('opener', [custom_opener(), build_opener(HTTPRefreshHandler())])
def test_http_refresh_handler(replay_server, opener):
assert opener.open('http://localhost:8888/header-refresh.txt').geturl() == 'http://localhost:8888/200-ok.txt'

tests/test_feeds.py Normal file

@@ -0,0 +1,108 @@
import pytest
from morss.crawler import adv_get
from morss.feeds import *
def get_feed(url):
url = 'http://localhost:8888/%s' % url
out = adv_get(url)
feed = parse(out['data'], url=url, encoding=out['encoding'])
return feed
def check_feed(feed):
# NB. time and updated not covered
assert feed.title == '!TITLE!'
assert feed.desc == '!DESC!'
assert feed[0] == feed.items[0]
assert feed[0].title == '!ITEM_TITLE!'
assert feed[0].link == '!ITEM_LINK!'
assert '!ITEM_DESC!' in feed[0].desc # broader test due to possible inclusion of surrounding <div> in xml
assert '!ITEM_CONTENT!' in feed[0].content
def check_output(feed):
output = feed.tostring()
assert '!TITLE!' in output
assert '!DESC!' in output
assert '!ITEM_TITLE!' in output
assert '!ITEM_LINK!' in output
assert '!ITEM_DESC!' in output
assert '!ITEM_CONTENT!' in output
def check_change(feed):
feed.title = '!TITLE2!'
feed.desc = '!DESC2!'
feed[0].title = '!ITEM_TITLE2!'
feed[0].link = '!ITEM_LINK2!'
feed[0].desc = '!ITEM_DESC2!'
feed[0].content = '!ITEM_CONTENT2!'
assert feed.title == '!TITLE2!'
assert feed.desc == '!DESC2!'
assert feed[0].title == '!ITEM_TITLE2!'
assert feed[0].link == '!ITEM_LINK2!'
assert '!ITEM_DESC2!' in feed[0].desc
assert '!ITEM_CONTENT2!' in feed[0].content
def check_add(feed):
feed.append({
'title': '!ITEM_TITLE3!',
'link': '!ITEM_LINK3!',
'desc': '!ITEM_DESC3!',
'content': '!ITEM_CONTENT3!',
})
assert feed[1].title == '!ITEM_TITLE3!'
assert feed[1].link == '!ITEM_LINK3!'
assert '!ITEM_DESC3!' in feed[1].desc
assert '!ITEM_CONTENT3!' in feed[1].content
each_format = pytest.mark.parametrize('url', [
'feed-rss-channel-utf-8.txt', 'feed-atom-utf-8.txt',
'feed-atom03-utf-8.txt', 'feed-json-utf-8.txt', 'feed-html-utf-8.txt',
])
each_check = pytest.mark.parametrize('check', [
check_feed, check_output, check_change, check_add,
])
@each_format
@each_check
def test_parse(replay_server, url, check):
feed = get_feed(url)
check(feed)
@each_format
@each_check
def test_convert_rss(replay_server, url, check):
feed = get_feed(url)
feed = feed.convert(FeedXML)
check(feed)
@each_format
@each_check
def test_convert_json(replay_server, url, check):
feed = get_feed(url)
feed = feed.convert(FeedJSON)
check(feed)
@each_format
@each_check
def test_convert_html(replay_server, url, check):
feed = get_feed(url)
feed = feed.convert(FeedHTML)
if len(feed) > 1:
# remove the 'blank' default html item
del feed[0]
check(feed)
@each_format
def test_convert_csv(replay_server, url):
# only csv output, not a csv feed, so the check is different
feed = get_feed(url)
output = feed.tocsv()
assert '!ITEM_TITLE!' in output
assert '!ITEM_LINK!' in output
assert '!ITEM_DESC!' in output
assert '!ITEM_CONTENT!' in output


@@ -1,9 +0,0 @@
Options -Indexes
ErrorDocument 403 "Access forbidden"
ErrorDocument 404 /cgi/main.py
ErrorDocument 500 "A very nasty bug found his way onto this very server"
<Files ~ "\.(py|pyc|db|log)$">
deny from all
</Files>


@@ -1,9 +0,0 @@
order allow,deny
deny from all
<Files main.py>
allow from all
AddHandler cgi-script .py
Options +ExecCGI
</Files>

www/index.html

@@ -4,6 +4,7 @@
<title>morss</title>
<meta name="viewport" content="width=device-width; initial-scale=1.0; maximum-scale=1.0;" />
<meta charset="UTF-8" />
<link rel="shortcut icon" type="image/svg+xml" href="/logo.svg" sizes="any" />
<style type="text/css">
body
{

www/logo.svg Normal file

@@ -0,0 +1,17 @@
<?xml version="1.0" encoding="UTF-8"?>
<svg width="16" height="16" viewBox="0 0 16 16" shape-rendering="crispEdges" fill="black" version="1.1" xmlns="http://www.w3.org/2000/svg">
<rect x="2" y="4" width="2" height="2" />
<rect x="5" y="4" width="6" height="2" />
<rect x="12" y="4" width="2" height="2" />
<rect x="2" y="7" width="2" height="2" />
<rect x="7" y="7" width="2" height="2" />
<rect x="12" y="7" width="2" height="2" />
<rect x="2" y="10" width="2" height="2" />
<rect x="7" y="10" width="2" height="2" />
<rect x="12" y="10" width="2" height="2" />
</svg>
<!-- This work by pictuga is licensed under CC BY-NC-SA 4.0. To view a copy of
this license, visit https://creativecommons.org/licenses/by-nc-sa/4.0 -->


www/sheet.xsl

@@ -16,14 +16,21 @@
<title>RSS feed by morss</title>
<meta name="viewport" content="width=device-width; initial-scale=1.0;" />
<meta name="robots" content="noindex" />
<link rel="shortcut icon" type="image/svg+xml" href="/logo.svg" sizes="any" />
<style type="text/css">
body * {
box-sizing: border-box;
}
body {
overflow-wrap: anywhere;
word-wrap: anywhere;
word-break: break-word;
font-family: sans-serif;
-webkit-tap-highlight-color: transparent; /* safari work around */
}
input, select {
@@ -33,7 +40,8 @@
}
header {
text-align: center;
text-align: justify;
text-align-last: center;
border-bottom: 1px solid silver;
}
@@ -112,7 +120,6 @@
}
header > form {
text-align: center;
margin: 1%;
}
@@ -133,6 +140,10 @@
padding: 1%;
}
.item > *:empty {
display: none;
}
.item > :not(:last-child) {
border-bottom: 1px solid silver;
}
@@ -176,21 +187,28 @@
<select>
<option value="">full-text</option>
<option value=":proxy">original</option>
<option value=":clip">original + full-text</option>
<option value=":clip" title="original + full-text: keep the original description above the full article. Useful for reddit feeds for example, to keep the comment links">combined (?)</option>
</select>
feed as
<select>
<option value="">RSS</option>
<option value=":json:cors">JSON</option>
<option value=":html">HTML</option>
<option value=":csv">CSV</option>
<option value=":format=json:cors">JSON</option>
<option value=":format=html">HTML</option>
<option value=":format=csv">CSV</option>
</select>
using
using the
<select>
<option value="">the standard link</option>
<option value=":firstlink" title="Useful for Twitter feeds for example, to get the articles referred to in tweets rather than the tweet itself">the first link from the description (?)</option>
<option value="">standard</option>
<option value=":firstlink" title="Pull the article from the first available link in the description, instead of the standard link. Useful for Twitter feeds for example, to get the articles referred to in tweets rather than the tweet itself">first (?)</option>
</select>
and
link of the
<select>
<option value="">first</option>
<option value=":order=newest" title="Select feed items by publication date (instead of appearing order)">newest (?)</option>
<option value=":order=last">last</option>
<option value=":order=oldest">oldest</option>
</select>
items and
<select>
<option value="">keep</option>
<option value=":nolink:noref">remove</option>
@@ -199,7 +217,8 @@
<input type="hidden" value="" name="extra_options"/>
</form>
<p>Click <a href="/">here</a> to go back to morss</p>
<p>You can find a <em>preview</em> of the feed below. You need a <em>feed reader</em> for optimal use</p>
<p>Click <a href="/">here</a> to go back to morss and/or to use the tool on another feed</p>
</header>
<div id="header" dir="auto">
@@ -215,7 +234,7 @@
<div id="content">
<xsl:for-each select="rdf:RDF/rssfake:channel/rssfake:item|rss/channel/item|atom:feed/atom:entry|atom03:feed/atom03:entry">
<div class="item" dir="auto">
<a href="/" target="_blank"><xsl:attribute name="href"><xsl:value-of select="rssfake:link|link|atom:link/@href|atom03:link/@href"/></xsl:attribute>
<a target="_blank"><xsl:attribute name="href"><xsl:value-of select="rssfake:link|link|atom:link/@href|atom03:link/@href"/></xsl:attribute>
<xsl:value-of select="rssfake:title|title|atom:title|atom03:title"/>
</a>
@@ -236,7 +255,7 @@
if (!/:html/.test(window.location.href))
for (var content of document.querySelectorAll(".desc,.content"))
content.innerHTML = (content.innerText.match(/>/g) || []).length > 10 ? content.innerText : content.innerHTML
content.innerHTML = (content.innerText.match(/>/g) || []).length > 3 ? content.innerText : content.innerHTML
var options = parse_location()[0]