Compare commits

..

1 Commit

SHA1: 8af45cdbd4
Message: More cloud-aware instructions
Date: 2021-11-21 22:37:36 +01:00
All checks were successful (continuous-integration/drone/push: Build is passing)
49 changed files with 343 additions and 10186 deletions

15
.drone.yml Normal file

@@ -0,0 +1,15 @@
kind: pipeline
name: default
steps:
- name: isort
image: python:alpine
commands:
- pip install isort
- isort --check-only --diff .
- name: pylint
image: alpine
commands:
- apk add --no-cache python3 py3-lxml py3-pip py3-wheel py3-pylint py3-enchant hunspell-en
- pip3 install --no-cache-dir .[full]
- pylint morss --rcfile=.pylintrc --disable=C,R,W --fail-under=8


@@ -1,78 +0,0 @@
name: default
on:
push:
branches:
- master
jobs:
test-lint:
runs-on: ubuntu-latest
steps:
- name: Checkout
uses: actions/checkout@v3
with:
fetch-depth: 0
- name: Prepare image
run: apt-get -y update && apt-get -y install python3-pip libenchant-2-2 aspell-en
- name: Install dependencies
run: pip3 install .[full] .[dev]
- run: isort --check-only --diff .
- run: pylint morss --rcfile=.pylintrc --disable=C,R,W --fail-under=8
- run: pytest --cov=morss tests
python-publish:
runs-on: ubuntu-latest
steps:
- name: Checkout
uses: actions/checkout@v3
with:
fetch-depth: 0
- name: Prepare image
run: apt-get -y update && apt-get -y install python3-pip python3-build
- name: Build package
run: python3 -m build
- name: Publish package
uses: https://github.com/pypa/gh-action-pypi-publish@release/v1
with:
password: ${{ secrets.pypi_api_token }}
docker-publish-deploy:
runs-on: ubuntu-latest
container:
image: catthehacker/ubuntu:act-latest
steps:
- name: Checkout
uses: actions/checkout@v3
- name: Set up QEMU
uses: https://github.com/docker/setup-qemu-action@v2
- name: Set up Docker Buildx
uses: https://github.com/docker/setup-buildx-action@v2
- name: Login to Docker Hub
uses: https://github.com/docker/login-action@v2
with:
username: ${{ secrets.docker_user }}
password: ${{ secrets.docker_pwd }}
- name: Build and push
uses: https://github.com/docker/build-push-action@v4
with:
context: .
platforms: linux/amd64,linux/arm64,linux/arm/v7
push: true
tags: ${{ secrets.docker_repo }}
- name: Deploy on server
uses: https://github.com/appleboy/ssh-action@v0.1.10
with:
host: ${{ secrets.ssh_host }}
username: ${{ secrets.ssh_user }}
key: ${{ secrets.ssh_key }}
script: morss-update


@@ -1,16 +1,11 @@
FROM alpine:edge
FROM alpine:latest
RUN apk add --no-cache python3 py3-lxml py3-pip py3-wheel git
ADD . /app
RUN set -ex; \
apk add --no-cache --virtual .run-deps python3 py3-lxml py3-setproctitle py3-setuptools; \
apk add --no-cache --virtual .build-deps py3-pip py3-wheel; \
pip3 install --no-cache-dir /app[full]; \
apk del .build-deps
RUN pip3 install --no-cache-dir /app[full] gunicorn
USER 1000:1000
ENTRYPOINT ["/bin/sh", "/app/morss-helper"]
ENTRYPOINT ["/bin/sh", "/app/docker-entry.sh"]
CMD ["run"]
HEALTHCHECK CMD /bin/sh /app/morss-helper check

193
README.md

@@ -1,14 +1,13 @@
# Morss - Get full-text RSS feeds
[Homepage](https://morss.it/) •
[Upstream source code](https://git.pictuga.com/pictuga/morss) •
[Github mirror](https://github.com/pictuga/morss) (for Issues & Pull requests)
[![Build Status](https://ci.pictuga.com/api/badges/pictuga/morss/status.svg)](https://ci.pictuga.com/pictuga/morss)
[![Github Stars](https://img.shields.io/github/stars/pictuga/morss?logo=github)](https://github.com/pictuga/morss/stargazers)
[![Github Forks](https://img.shields.io/github/forks/pictuga/morss?logo=github)](https://github.com/pictuga/morss/network/members)
[![GNU AGPLv3 code](https://img.shields.io/static/v1?label=license&message=AGPLv3)](https://git.pictuga.com/pictuga/morss/src/branch/master/LICENSE)
[![Logo is CC BY-NC-SA 4.0](https://img.shields.io/static/v1?label=CC&message=BY-NC-SA%204.0)](https://creativecommons.org/licenses/by-nc-sa/4.0/)
_GNU AGPLv3 code_
_Provided logo is CC BY-NC-SA 4.0_
Upstream source code: <https://git.pictuga.com/pictuga/morss>
Github mirror (for Issues & Pull requests): <https://github.com/pictuga/morss>
Homepage: <https://morss.it/>
This tool's goal is to get full-text RSS feeds out of stripped RSS feeds,
commonly available on the internet. Indeed, most newspapers only make a small
@@ -41,7 +40,7 @@ Some features of morss:
- Follow 301/meta redirects
- Recover xml feeds with corrupt encoding
- Supports gzip-compressed http content
- HTTP caching with different backends (in-memory/redis/diskcache)
- HTTP caching with different backends (in-memory/sqlite/mysql/redis/diskcache)
- Works as server/cli tool
- Deobfuscate various tracking links
@@ -49,41 +48,20 @@ Some features of morss:
### Python package
![Build Python](https://img.shields.io/badge/dynamic/json?label=build%20python&query=$.stages[?(@.name=='python')].status&url=https://ci.pictuga.com/api/repos/pictuga/morss/builds/latest)
[![PyPI](https://img.shields.io/pypi/v/morss)](https://pypi.org/project/morss/)
[![PyPI Downloads](https://img.shields.io/pypi/dm/morss)](https://pypistats.org/packages/morss)
Simple install (without optional dependencies)
From pip
```shell
pip install morss
```
From git
```shell
pip install git+https://git.pictuga.com/pictuga/morss.git
```
Full installation (including optional dependencies)
From pip
```shell
pip install morss[full]
pip install git+https://git.pictuga.com/pictuga/morss.git#[full]
```
From git
```shell
pip install git+https://git.pictuga.com/pictuga/morss.git#egg=morss[full]
```
The full install includes all the cache backends. Otherwise, only in-memory
cache is available. The full install also includes gunicorn (for more efficient
HTTP handling).
The full install includes mysql, redis and diskcache (possible cache backends).
Otherwise, only in-memory and sqlite3 caches are available.
The `lxml` dependency takes quite a long time to install (especially on a
Raspberry Pi, as C code needs to be compiled). If possible on your distribution, try installing
@@ -91,37 +69,13 @@ it with the system package manager.
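On Debian or Ubuntu based systems, for instance, that could look like the following (a sketch; the package name may differ on other distributions):

```shell
sudo apt-get install python3-lxml
```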
### Docker
![Build Docker](https://img.shields.io/badge/dynamic/json?label=build%20docker&query=$.stages[?(@.name=='docker')].status&url=https://ci.pictuga.com/api/repos/pictuga/morss/builds/latest)
[![Docker Hub](https://img.shields.io/docker/pulls/pictuga/morss)](https://hub.docker.com/r/pictuga/morss)
[![Docker Arch](https://img.shields.io/badge/dynamic/json?color=blue&label=docker%20arch&query=$.results[0].images[*].architecture&url=https://hub.docker.com/v2/repositories/pictuga/morss/tags)](https://hub.docker.com/r/pictuga/morss/tags)
From docker hub
With cli
```shell
docker pull pictuga/morss
```
With docker-compose **(recommended)**
```yml
services:
app:
image: pictuga/morss
ports:
- '8000:8000'
```
Build from source
With cli
Build
```shell
docker build --tag morss https://git.pictuga.com/pictuga/morss.git --no-cache --pull
```
With docker-compose
With docker-compose:
```yml
services:
@@ -142,40 +96,48 @@ docker-compose build --no-cache --pull
One-click deployment:
[![Heroku](https://img.shields.io/static/v1?label=deploy%20to&message=heroku&logo=heroku&color=79589F)](https://heroku.com/deploy?template=https://github.com/pictuga/morss)
[![Google Cloud](https://img.shields.io/static/v1?label=deploy%20to&message=google&logo=google&color=4285F4)](https://deploy.cloud.run/?git_repo=https://github.com/pictuga/morss.git)
* Heroku: <https://heroku.com/deploy?template=https://github.com/pictuga/morss>
* Google Cloud: <https://deploy.cloud.run/?git_repo=https://github.com/pictuga/morss.git>
Providers supporting `cloud-init` (AWS, Oracle Cloud Infrastructure), based on Ubuntu:
Providers supporting `cloud-init`:
``` yml
#cloud-config
packages:
- python3-pip
- python3-wheel
- python3-lxml
- python3-setproctitle
- docker.io
- docker-compose
- git
- ca-certificates
groups:
- docker
system_info:
default_user:
groups: [docker]
write_files:
- path: /etc/environment
append: true
- path: /docker-compose.yml
permissions: "0644"
content: |
DEBUG=1
CACHE=diskcache
CACHE_SIZE=1073741824 # 1GiB
- path: /var/lib/cloud/scripts/per-boot/morss.sh
permissions: 744
content: |
#!/bin/sh
/usr/local/bin/morss-helper daemon
version: '3.7'
services:
app:
build: 'https://git.pictuga.com/pictuga/morss.git#master'
image: morss
ports:
- 80:8000
environment:
- DEBUG=1
- CACHE=diskcache
- CACHE_SIZE=1073741824
restart: always
runcmd:
- source /etc/environment
- update-ca-certificates
- iptables -I INPUT 6 -m state --state NEW -p tcp --dport ${PORT:-8000} -j ACCEPT
- netfilter-persistent save
- pip install morss[full]
- docker-compose -f /docker-compose.yml build --no-cache
- docker-compose -f /docker-compose.yml up -dV
```
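To use such a file, pass it to the provider as user data when creating the instance. A minimal sketch with the AWS CLI (the AMI id, instance type and file name are placeholders):

```shell
aws ec2 run-instances --image-id ami-xxxxxxxx --instance-type t3.micro \
    --user-data file://cloud-config.yml
```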
## Run
@@ -203,19 +165,13 @@ other clients.
#### Using Docker
From docker hub
```shell
docker run -p 8000:8000 pictuga/morss
```
From source
Run
```shell
docker run -p 8000:8000 morss
```
With docker-compose **(recommended)**
With docker-compose:
```shell
docker-compose up
@@ -276,30 +232,8 @@ For this, you need to make sure your host allows python script execution. This
method uses HTTP calls to fetch the RSS feeds, which will be handled through
`mod_cgi`, for example, on Apache servers.
Please make sure `main.py` has execute permissions. See below for some tips on
the `.htaccess` file.
```htaccess
Options -Indexes
ErrorDocument 404 /cgi/main.py
# Turn debug on for all requests
SetEnv DEBUG 1
# Turn debug on for requests with :debug in the url
SetEnvIf Request_URI :debug DEBUG=1
<Files ~ "\.(py|pyc|db|log)$">
deny from all
</Files>
<Files main.py>
allow from all
AddHandler cgi-script .py
Options +ExecCGI
</Files>
```
Please make sure `main.py` has execute permissions. Also ensure that the
provided `/www/.htaccess` works well with your server.
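As a quick check (a sketch; the exact path depends on where the `www` folder ends up on your host):

```shell
# give the CGI entry point execute permissions (path shown is an assumption)
chmod 755 www/cgi/main.py
```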
### As a CLI application
@@ -353,7 +287,7 @@ Using cache and passing arguments:
```python
>>> import morss
>>> url = 'http://feeds.bbci.co.uk/news/rss.xml'
>>> cache = '/tmp/morss-cache' # diskcache cache location
>>> cache = '/tmp/morss-cache.db' # sqlite cache location
>>> options = {'csv':True}
>>> xml_string = morss.process(url, cache, options)
>>> xml_string[:50]
@@ -367,10 +301,11 @@ under the hood.
Doing it step-by-step:
```python
import morss
import morss, morss.crawler
url = 'http://newspaper.example/feed.xml'
options = morss.Options(csv=True) # arguments
morss.crawler.sqlite_default = '/tmp/morss-cache.db' # sqlite cache location
url, rss = morss.FeedFetch(url, options) # this only grabs the RSS feed
rss = morss.FeedGather(rss, url, options) # this fills the feed and cleans it up
@@ -391,11 +326,10 @@ The list of arguments can be obtained by running `morss --help`
```
usage: morss [-h] [--post STRING] [--xpath XPATH]
[--format {rss,json,html,csv}] [--search STRING] [--clip]
[--indent] [--cache] [--force] [--proxy]
[--order {first,last,newest,oldest}] [--firstlink] [--resolve]
[--items XPATH] [--item_link XPATH] [--item_title XPATH]
[--item_content XPATH] [--item_time XPATH]
[--mode {xml,html,json}] [--nolink] [--noref] [--silent]
[--indent] [--cache] [--force] [--proxy] [--newest] [--firstlink]
[--resolve] [--items XPATH] [--item_link XPATH]
[--item_title XPATH] [--item_content XPATH] [--item_time XPATH]
[--nolink] [--noref] [--silent]
url
Get full-text RSS feeds
@@ -403,7 +337,7 @@ Get full-text RSS feeds
positional arguments:
url feed url
options:
optional arguments:
-h, --help show this help message and exit
--post STRING POST request
--xpath XPATH xpath rule to manually detect the article
@@ -422,9 +356,8 @@ action:
articles' content), so as to save time
--force force refetch the rss feed and articles
--proxy doesn't fill the articles
--order {first,last,newest,oldest}
order in which to process items (which are however NOT
sorted in the output)
--newest return the feed items in chronological order (morss
otherwise shows the items in order of appearance)
--firstlink pull the first article mentioned in the description
instead of the default link
--resolve replace tracking links with direct links to articles
@@ -439,8 +372,6 @@ custom feeds:
--item_content XPATH entry's content
--item_time XPATH entry's date & time (accepts a wide range of time
formats)
--mode {xml,html,json}
parser to use for the custom feeds
misc:
--nolink drop links, but keeps links' inner text
@@ -466,7 +397,6 @@ To pass environment variables:
- docker-compose: add an `environment:` section in the .yml file
- Gunicorn/uWSGI/CLI: prepend `KEY=value` before the command
- Apache: via the `SetEnv` instruction (see sample `.htaccess` provided)
- cloud-init: in the `/etc/environment` file
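For instance, with the CLI the variables can simply be prefixed to the command (the values below are purely illustrative):

```shell
DEBUG=1 CACHE=diskcache CACHE_SIZE=1073741824 morss http://feeds.bbci.co.uk/news/rss.xml
```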
Generic:
@@ -475,7 +405,6 @@ debugging.
- `IGNORE_SSL=1`: to ignore SSL certs when fetch feeds and articles
- `DELAY` (seconds) sets the browser cache delay, only for HTTP clients
- `TIMEOUT` (seconds) sets the HTTP timeout when fetching rss feeds and articles
- `DATA_PATH`: to set custom file location for the `www` folder
When parsing long feeds, with a lot of items (100+), morss might take a lot of
time to parse it, or might even run into a memory overflow on some shared
@@ -502,10 +431,15 @@ be dropped from the feed, even if they're cached. `-1` for unlimited.
morss uses caching to make loading faster. There are 3 possible cache backends:
- `(nothing/default)`: a simple python in-memory dict-like object.
- `CACHE=sqlite`: sqlite3 cache. Default file location is in-memory (i.e. it
will be cleared every time the program is run). Path can be defined with
`SQLITE_PATH`.
- `CACHE=mysql`: MySQL cache. Connection can be defined with the following
environment variables: `MYSQL_USER`, `MYSQL_PWD`, `MYSQL_DB`, `MYSQL_HOST`
- `CACHE=redis`: Redis cache. Connection can be defined with the following
environment variables: `REDIS_HOST`, `REDIS_PORT`, `REDIS_DB`, `REDIS_PWD`
- `CACHE=diskcache`: disk-based cache. Target directory can be defined with
`DISKCACHE_DIR`.
`DISKCAHE_DIR`.
To limit the size of the cache:
@@ -515,9 +449,6 @@ entries. NB. When using `diskcache`, this is the cache max size in Bytes.
- `CACHE_LIFESPAN` (seconds) sets how often the cache must be trimmed (i.e. cut
down to the number of items set in `CACHE_SIZE`). Defaults to 1min.
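Putting these settings together, a Redis-backed instance trimmed to 1000 entries every minute could be started along these lines (a sketch; the Redis host is a placeholder):

```shell
CACHE=redis REDIS_HOST=redis.example CACHE_SIZE=1000 CACHE_LIFESPAN=60 \
    gunicorn --bind 0.0.0.0:8000 morss
```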
Gunicorn also accepts command line arguments via the `GUNICORN_CMD_ARGS`
environment variable.
### Content matching
The content of articles is grabbed with our own readability fork. This means


@@ -1,21 +1 @@
{
"stack": "container",
"env": {
"DEBUG": {
"value": 1,
"required": false
},
"GUNICORN_CMD_ARGS": {
"value": "",
"required": false
},
"CACHE": {
"value": "diskcache",
"required": false
},
"CACHE_SIZE": {
"value": 1073741824,
"required": false
}
}
}
{"stack": "container"}

7
docker-entry.sh Executable file

@@ -0,0 +1,7 @@
#! /bin/sh
if [ -z "$1" ] || [ "$@" = "run" ]; then
gunicorn --bind 0.0.0.0:${PORT:-8000} -w 4 --preload --access-logfile - morss
else
morss $@
fi


@@ -1,3 +1,9 @@
setup:
config:
DEBUG: 1
CACHE: diskcache
CACHE_SIZE: 1073741824
build:
docker:
web: Dockerfile

0
main.py Executable file → Normal file


@@ -1,47 +0,0 @@
#! /bin/sh
set -ex
if ! command -v python && command -v python3 ; then
alias python='python3'
fi
run() {
gunicorn --bind 0.0.0.0:${PORT:-8000} --preload --access-logfile - morss
}
daemon() {
gunicorn --bind 0.0.0.0:${PORT:-8000} --preload --access-logfile - --daemon morss
}
reload() {
pid=$(pidof 'gunicorn: master [morss]' || true)
# NB. requires python-setproctitle
# `|| true` due to `set -e`
if [ -z "$pid" ]; then
# if gunicorn is not currently running
daemon
else
kill -s USR2 $pid
kill -s WINCH $pid
sleep 1 # give gunicorn some time to reload
kill -s TERM $pid
fi
}
check() {
python -m morss.crawler http://localhost:${PORT:-8000}/ > /dev/null 2>&1
}
if [ -z "$1" ]; then
run
elif [ "$1" = "sh" ] || [ "$1" = "bash" ] || command -v "$1" ; then
$@
else
python -m morss $@
fi


@@ -1,13 +0,0 @@
[Unit]
Description=morss server (gunicorn)
After=network.target
[Service]
ExecStart=/usr/local/bin/morss-helper run
ExecReload=/usr/local/bin/morss-helper reload
KillMode=process
Restart=always
User=http
[Install]
WantedBy=multi-user.target


@@ -19,7 +19,5 @@
# pylint: disable=unused-import,unused-variable
__version__ = ""
from .morss import *
from .wsgi import application


@@ -16,6 +16,7 @@
# with this program. If not, see <https://www.gnu.org/licenses/>.
import os
import pickle
import threading
import time
from collections import OrderedDict
@@ -50,6 +51,83 @@ class BaseCache:
return True
try:
import sqlite3 # isort:skip
except ImportError:
pass
class SQLiteCache(BaseCache):
def __init__(self, path=':memory:'):
self.con = sqlite3.connect(path, detect_types=sqlite3.PARSE_DECLTYPES, check_same_thread=False)
with self.con:
self.con.execute('CREATE TABLE IF NOT EXISTS data (ky UNICODE PRIMARY KEY, data BLOB, timestamp INT)')
self.con.execute('pragma journal_mode=WAL')
self.trim()
def __del__(self):
self.con.close()
def trim(self):
with self.con:
self.con.execute('DELETE FROM data WHERE timestamp <= ( SELECT timestamp FROM ( SELECT timestamp FROM data ORDER BY timestamp DESC LIMIT 1 OFFSET ? ) foo )', (CACHE_SIZE,))
def __getitem__(self, key):
row = self.con.execute('SELECT * FROM data WHERE ky=?', (key,)).fetchone()
if not row:
raise KeyError
return row[1]
def __setitem__(self, key, data):
with self.con:
self.con.execute('INSERT INTO data VALUES (?,?,?) ON CONFLICT(ky) DO UPDATE SET data=?, timestamp=?', (key, data, time.time(), data, time.time()))
try:
import pymysql.cursors # isort:skip
except ImportError:
pass
class MySQLCacheHandler(BaseCache):
def __init__(self, user, password, database, host='localhost'):
self.user = user
self.password = password
self.database = database
self.host = host
with self.cursor() as cursor:
cursor.execute('CREATE TABLE IF NOT EXISTS data (ky VARCHAR(255) NOT NULL PRIMARY KEY, data MEDIUMBLOB, timestamp INT)')
self.trim()
def cursor(self):
return pymysql.connect(host=self.host, user=self.user, password=self.password, database=self.database, charset='utf8', autocommit=True).cursor()
def trim(self):
with self.cursor() as cursor:
cursor.execute('DELETE FROM data WHERE timestamp <= ( SELECT timestamp FROM ( SELECT timestamp FROM data ORDER BY timestamp DESC LIMIT 1 OFFSET %s ) foo )', (CACHE_SIZE,))
def __getitem__(self, key):
cursor = self.cursor()
cursor.execute('SELECT * FROM data WHERE ky=%s', (key,))
row = cursor.fetchone()
if not row:
raise KeyError
return row[1]
def __setitem__(self, key, data):
with self.cursor() as cursor:
cursor.execute('INSERT INTO data VALUES (%s,%s,%s) ON DUPLICATE KEY UPDATE data=%s, timestamp=%s',
(key, data, time.time(), data, time.time()))
class CappedDict(OrderedDict, BaseCache):
def trim(self):
if CACHE_SIZE >= 0:
@@ -104,7 +182,20 @@ class DiskCacheHandler(BaseCache):
if 'CACHE' in os.environ:
if os.environ['CACHE'] == 'redis':
if os.environ['CACHE'] == 'mysql':
default_cache = MySQLCacheHandler(
user = os.getenv('MYSQL_USER'),
password = os.getenv('MYSQL_PWD'),
database = os.getenv('MYSQL_DB'),
host = os.getenv('MYSQL_HOST', 'localhost')
)
elif os.environ['CACHE'] == 'sqlite':
default_cache = SQLiteCache(
os.getenv('SQLITE_PATH', ':memory:')
)
elif os.environ['CACHE'] == 'redis':
default_cache = RedisCacheHandler(
host = os.getenv('REDIS_HOST', 'localhost'),
port = int(os.getenv('REDIS_PORT', 6379)),
@@ -114,7 +205,7 @@ if 'CACHE' in os.environ:
elif os.environ['CACHE'] == 'diskcache':
default_cache = DiskCacheHandler(
directory = os.getenv('DISKCACHE_DIR', '/tmp/morss-diskcache'),
directory = os.getenv('DISKCAHE_DIR', '/tmp/morss-diskcache'),
size_limit = CACHE_SIZE # in Bytes
)


@@ -44,7 +44,7 @@ def cli_app():
group.add_argument('--cache', action='store_true', help='only take articles from the cache (ie. don\'t grab new articles\' content), so as to save time')
group.add_argument('--force', action='store_true', help='force refetch the rss feed and articles')
group.add_argument('--proxy', action='store_true', help='doesn\'t fill the articles')
group.add_argument('--order', default='first', choices=('first', 'last', 'newest', 'oldest'), help='order in which to process items (which are however NOT sorted in the output)')
group.add_argument('--newest', action='store_true', help='return the feed items in chronological order (morss otherwise shows the items in order of appearance)')
group.add_argument('--firstlink', action='store_true', help='pull the first article mentioned in the description instead of the default link')
group.add_argument('--resolve', action='store_true', help='replace tracking links with direct links to articles (not compatible with --proxy)')
@@ -54,7 +54,6 @@ def cli_app():
group.add_argument('--item_title', action='store', type=str, metavar='XPATH', help='entry\'s title')
group.add_argument('--item_content', action='store', type=str, metavar='XPATH', help='entry\'s content')
group.add_argument('--item_time', action='store', type=str, metavar='XPATH', help='entry\'s date & time (accepts a wide range of time formats)')
group.add_argument('--mode', default=None, choices=('xml', 'html', 'json'), help='parser to use for the custom feeds')
group = parser.add_argument_group('misc')
group.add_argument('--nolink', action='store_true', help='drop links, but keeps links\' inner text')


@@ -38,12 +38,12 @@ try:
from urllib2 import (BaseHandler, HTTPCookieProcessor, HTTPRedirectHandler,
Request, addinfourl, build_opener, parse_http_list,
parse_keqv_list)
from urlparse import urlsplit
from urlparse import urlparse, urlunparse
except ImportError:
# python 3
from email import message_from_string
from http.client import HTTPMessage
from urllib.parse import quote, urlsplit
from urllib.parse import quote, urlparse, urlunparse
from urllib.request import (BaseHandler, HTTPCookieProcessor,
HTTPRedirectHandler, Request, addinfourl,
build_opener, parse_http_list, parse_keqv_list)
@@ -59,9 +59,7 @@ except NameError:
MIMETYPE = {
'xml': ['text/xml', 'application/xml', 'application/rss+xml', 'application/rdf+xml', 'application/atom+xml', 'application/xhtml+xml'],
'rss': ['application/rss+xml', 'application/rdf+xml', 'application/atom+xml'],
'html': ['text/html', 'application/xhtml+xml', 'application/xml'],
'json': ['application/json'],
}
'html': ['text/html', 'application/xhtml+xml', 'application/xml']}
DEFAULT_UAS = [
@@ -113,6 +111,8 @@ def adv_get(url, post=None, timeout=None, *args, **kwargs):
def custom_opener(follow=None, policy=None, force_min=None, force_max=None):
handlers = []
# as per urllib2 source code, these Handlers are added first
# *unless* one of the custom handlers inherits from one of them
#
@@ -130,18 +130,16 @@ def custom_opener(follow=None, policy=None, force_min=None, force_max=None):
# http_error_* are run until sth is returned (other than None). If they all
# return nothing, a python error is raised
handlers = [
#DebugHandler(),
SizeLimitHandler(500*1024), # 500KiB
HTTPCookieProcessor(),
GZIPHandler(),
HTTPAllRedirectHandler(),
HTTPEquivHandler(),
HTTPRefreshHandler(),
UAHandler(random.choice(DEFAULT_UAS)),
BrowserlyHeaderHandler(),
EncodingFixHandler(),
]
#handlers.append(DebugHandler())
handlers.append(SizeLimitHandler(500*1024)) # 500KiB
handlers.append(HTTPCookieProcessor())
handlers.append(GZIPHandler())
handlers.append(HTTPAllRedirectHandler())
handlers.append(HTTPEquivHandler())
handlers.append(HTTPRefreshHandler())
handlers.append(UAHandler(random.choice(DEFAULT_UAS)))
handlers.append(BrowserlyHeaderHandler())
handlers.append(EncodingFixHandler())
if follow:
handlers.append(AlternateHandler(MIMETYPE[follow]))
@@ -163,20 +161,10 @@ def is_ascii(string):
return True
def soft_quote(string):
" url-quote only when not a valid ascii string "
if is_ascii(string):
return string
else:
return quote(string.encode('utf-8'))
def sanitize_url(url):
# make sure the url is unicode, i.e. not bytes
if isinstance(url, bytes):
url = url.decode('utf-8')
url = url.decode()
# make sure there's a protocol (http://)
if url.split(':', 1)[0] not in PROTOCOL:
@@ -189,19 +177,18 @@ def sanitize_url(url):
url = url.replace(' ', '%20')
# escape non-ascii unicode characters
parts = urlsplit(url)
# https://stackoverflow.com/a/4391299
parts = list(urlparse(url))
parts = parts._replace(
netloc=parts.netloc.replace(
parts.hostname,
parts.hostname.encode('idna').decode('ascii')
),
path=soft_quote(parts.path),
query=soft_quote(parts.query),
fragment=soft_quote(parts.fragment),
)
for i in range(len(parts)):
if not is_ascii(parts[i]):
if i == 1:
parts[i] = parts[i].encode('idna').decode('ascii')
return parts.geturl()
else:
parts[i] = quote(parts[i].encode('utf-8'))
return urlunparse(parts)
class RespDataHandler(BaseHandler):
@@ -368,7 +355,7 @@ class BrowserlyHeaderHandler(BaseHandler):
def iter_html_tag(html_str, tag_name):
" To avoid parsing whole pages when looking for a simple tag "
re_tag = r'<%s\s+[^>]+>' % tag_name
re_tag = r'<%s(\s*[^>])*>' % tag_name
re_attr = r'(?P<key>[^=\s]+)=[\'"](?P<value>[^\'"]+)[\'"]'
for tag_match in re.finditer(re_tag, html_str):
@@ -425,7 +412,7 @@ class HTTPRefreshHandler(BaseHandler):
def http_response(self, req, resp):
if 200 <= resp.code < 300:
if resp.headers.get('refresh'):
regex = r'(?i)^(?P<delay>[0-9]+)\s*;\s*url\s*=\s*(["\']?)(?P<url>.+)\2$'
regex = r'(?i)^(?P<delay>[0-9]+)\s*;\s*url=(["\']?)(?P<url>.+)\2$'
match = re.search(regex, resp.headers.get('refresh'))
if match:
@@ -513,8 +500,6 @@ class CacheHandler(BaseHandler):
self.cache[key] = pickle.dumps(data, 0)
def cached_response(self, req, fallback=None):
req.from_morss_cache = True
data = self.load(req.get_full_url())
if data is not None:
@@ -527,10 +512,6 @@ class CacheHandler(BaseHandler):
return fallback
def save_response(self, req, resp):
if req.from_morss_cache:
# do not re-save (would reset the timing)
return resp
data = resp.read()
self.save(req.get_full_url(), {
@@ -549,8 +530,6 @@ class CacheHandler(BaseHandler):
return resp
def http_request(self, req):
req.from_morss_cache = False # to track whether it comes from cache
data = self.load(req.get_full_url())
if data is not None:
@@ -642,7 +621,8 @@ class CacheHandler(BaseHandler):
return None
def http_response(self, req, resp):
# code for after-fetch, to know whether to save to hard-drive (if sticking to http headers' will)
# code for after-fetch, to know whether to save to hard-drive (if stiking to http headers' will)
# NB. It might re-save requests pulled from cache, which will re-set the time() to the latest, i.e. lengthen its useful life
if resp.code == 304 and resp.url in self.cache:
# we are hopefully the first after the HTTP handler, so no need


@@ -90,6 +90,9 @@ item_updated = updated
[html]
mode = html
path =
http://localhost/
title = //div[@id='header']/h1
desc = //div[@id='header']/p
items = //div[@id='content']/div


@@ -17,7 +17,9 @@
import csv
import json
import os.path
import re
import sys
from copy import deepcopy
from datetime import datetime
from fnmatch import fnmatch
@@ -28,7 +30,6 @@ from dateutil import tz
from lxml import etree
from .readabilite import parse as html_parse
from .util import *
json.encoder.c_make_encoder = None
@@ -51,7 +52,7 @@ except NameError:
def parse_rules(filename=None):
if not filename:
filename = pkg_path('feedify.ini')
filename = os.path.join(os.path.dirname(__file__), 'feedify.ini')
config = RawConfigParser()
config.read(filename)
@@ -65,11 +66,19 @@ def parse_rules(filename=None):
# for each rule
if rules[section][arg].startswith('file:'):
path = data_path('www', rules[section][arg][5:])
paths = [os.path.join(sys.prefix, 'share/morss/www', rules[section][arg][5:]),
os.path.join(os.path.dirname(__file__), '../www', rules[section][arg][5:]),
os.path.join(os.path.dirname(__file__), '../..', rules[section][arg][5:])]
for path in paths:
try:
file_raw = open(path).read()
file_clean = re.sub('<[/?]?(xsl|xml)[^>]+?>', '', file_raw)
rules[section][arg] = file_clean
except IOError:
pass
elif '\n' in rules[section][arg]:
rules[section][arg] = rules[section][arg].split('\n')[1:]
@@ -94,7 +103,7 @@ def parse(data, url=None, encoding=None, ruleset=None):
if 'path' in ruleset:
for path in ruleset['path']:
if fnmatch(url, path):
parser = [x for x in parsers if x.mode == ruleset.get('mode')][0] # FIXME what if no mode specified?
parser = [x for x in parsers if x.mode == ruleset['mode']][0]
return parser(data, ruleset, encoding=encoding)
# 2) Try each and every parser
@@ -114,7 +123,7 @@ def parse(data, url=None, encoding=None, ruleset=None):
else:
# parsing worked, now we try the rulesets
ruleset_candidates = [x for x in rulesets if x.get('mode') in (parser.mode, None) and 'path' not in x]
ruleset_candidates = [x for x in rulesets if x.get('mode', None) in (parser.mode, None) and 'path' not in x]
# 'path' as they should have been caught beforehand
# try anyway if no 'mode' specified
@@ -187,12 +196,11 @@ class ParserBase(object):
return self.convert(FeedHTML).tostring(**k)
def convert(self, TargetParser):
target = TargetParser()
if type(self) == TargetParser and self.rules == target.rules:
# check both type *AND* rules (e.g. when going from freeform xml to rss)
if type(self) == TargetParser:
return self
target = TargetParser()
for attr in target.dic:
if attr == 'items':
for item in self.items:
@@ -361,13 +369,7 @@ class ParserXML(ParserBase):
def rule_search_all(self, rule):
try:
match = self.root.xpath(rule, namespaces=self.NSMAP)
if isinstance(match, str):
# some xpath rules return a single string instead of an array (e.g. concatenate() )
return [match,]
else:
return match
return self.root.xpath(rule, namespaces=self.NSMAP)
except etree.XPathEvalError:
return []
@@ -430,7 +432,7 @@ class ParserXML(ParserBase):
match = self.rule_search(rrule)
html_rich = ('atom' in rule or self.rules.get('mode') == 'html') \
html_rich = ('atom' in rule or self.rules['mode'] == 'html') \
and rule in [self.rules.get('item_desc'), self.rules.get('item_content')]
if key is not None:
@@ -441,7 +443,7 @@ class ParserXML(ParserBase):
self._clean_node(match)
match.append(lxml.html.fragment_fromstring(value, create_parent='div'))
if self.rules.get('mode') == 'html':
if self.rules['mode'] == 'html':
match.find('div').drop_tag() # not supported by lxml.etree
else: # i.e. if atom
@@ -490,14 +492,7 @@ class ParserHTML(ParserXML):
repl = r'[@class and contains(concat(" ", normalize-space(@class), " "), " \1 ")]'
rule = re.sub(pattern, repl, rule)
match = self.root.xpath(rule)
if isinstance(match, str):
# for some xpath rules, see XML parser
return [match,]
else:
return match
return self.root.xpath(rule)
except etree.XPathEvalError:
return []
@@ -699,7 +694,7 @@ class Feed(object):
try:
setattr(item, attr, new[attr])
except (KeyError, IndexError, TypeError):
except (IndexError, TypeError):
pass
return item
@@ -815,8 +810,6 @@ class FeedJSON(Feed, ParserJSON):
if __name__ == '__main__':
import sys
from . import crawler
req = crawler.adv_get(sys.argv[1] if len(sys.argv) > 1 else 'https://www.nytimes.com/', follow='rss')


@@ -17,7 +17,6 @@
import os
import re
import sys
import time
from datetime import datetime
from fnmatch import fnmatch
@@ -60,7 +59,7 @@ def log(txt):
else:
# when using internal server or cli
print(repr(txt), file=sys.stderr)
print(repr(txt))
def len_html(txt):
@@ -276,7 +275,7 @@ def FeedFetch(url, options):
policy = None
try:
req = crawler.adv_get(url=url, post=options.post, follow=('rss' if not options.items else None), policy=policy, force_min=5*60, force_max=60*60, timeout=TIMEOUT)
req = crawler.adv_get(url=url, post=options.post, follow=('rss' if not options.items else None), policy=policy, force_min=5*60, force_max=60*60, timeout=TIMEOUT * 2)
except (IOError, HTTPException):
raise MorssException('Error downloading feed')
@@ -287,14 +286,11 @@ def FeedFetch(url, options):
ruleset['items'] = options.items
if options.mode:
ruleset['mode'] = options.mode
ruleset['title'] = options.get('title', '//head/title')
ruleset['desc'] = options.get('desc', '//head/meta[@name="description"]/@content')
ruleset['item_title'] = options.get('item_title', '.')
ruleset['item_link'] = options.get('item_link', '(.|.//a|ancestor::a)/@href')
ruleset['item_link'] = options.get('item_link', './@href|.//a/@href|ancestor::a/@href')
if options.item_content:
ruleset['item_content'] = options.item_content
@@ -332,20 +328,14 @@ def FeedGather(rss, url, options):
if options.cache:
max_time = 0
# sort
sorted_items = list(rss.items)
if options.order == 'last':
# `first` does nothing from a practical standpoint, so only `last` needs
# to be addressed
sorted_items = reversed(sorted_items)
elif options.order in ['newest', 'oldest']:
if options.newest:
# :newest takes the newest items (instead of appearing order)
now = datetime.now(tz.tzutc())
sorted_items = sorted(sorted_items, key=lambda x:x.updated or x.time or now) # oldest to newest
sorted_items = sorted(rss.items, key=lambda x:x.updated or x.time or now, reverse=True)
if options.order == 'newest':
sorted_items = reversed(sorted_items)
else:
# default behavior, take the first items (in appearing order)
sorted_items = list(rss.items)
for i, item in enumerate(sorted_items):
# hard cap
@@ -428,7 +418,7 @@ def process(url, cache=None, options=None):
options = Options(options)
if cache:
caching.default_cache = caching.DiskCacheHandler(cache)
caching.default_cache = caching.SQLiteCache(cache)
url, rss = FeedFetch(url, options)
rss = FeedGather(rss, url, options)


@@ -17,20 +17,21 @@
import re
import bs4.builder._lxml
import lxml.etree
import lxml.html
import lxml.html.soupparser
class CustomTreeBuilder(bs4.builder._lxml.LXMLTreeBuilder):
def default_parser(self, encoding):
return lxml.html.HTMLParser(target=self, remove_comments=True, remove_pis=True, encoding=encoding)
from bs4 import BeautifulSoup
def parse(data, encoding=None):
kwargs = {'from_encoding': encoding} if encoding else {}
return lxml.html.soupparser.fromstring(data, builder=CustomTreeBuilder, **kwargs)
if encoding:
data = BeautifulSoup(data, 'lxml', from_encoding=encoding).prettify('utf-8')
else:
data = BeautifulSoup(data, 'lxml').prettify('utf-8')
parser = lxml.html.HTMLParser(remove_comments=True, encoding='utf-8')
return lxml.html.fromstring(data, parser=parser)
def count_words(string):
@@ -154,20 +155,15 @@ def score_all(node):
for child in node:
score = score_node(child)
set_score(child, score, 'morss_own_score')
child.attrib['morss_own_score'] = str(float(score))
if score > 0 or len(list(child.iterancestors())) <= 2:
spread_score(child, score)
score_all(child)
def set_score(node, value, label='morss_score'):
try:
node.attrib[label] = str(float(value))
except KeyError:
# catch issues with e.g. html comments
pass
def set_score(node, value):
node.attrib['morss_score'] = str(float(value))
def get_score(node):
@@ -207,12 +203,6 @@ def clean_root(root, keep_threshold=None):
def clean_node(node, keep_threshold=None):
parent = node.getparent()
# remove comments
if (isinstance(node, lxml.html.HtmlComment)
or isinstance(node, lxml.html.HtmlProcessingInstruction)):
parent.remove(node)
return
if parent is None:
# this is <html/> (or a removed element waiting for GC)
return
@@ -244,6 +234,11 @@ def clean_node(node, keep_threshold=None):
parent.remove(node)
return
# remove comments
if isinstance(node, lxml.html.HtmlComment) or isinstance(node, lxml.html.HtmlProcessingInstruction):
parent.remove(node)
return
# remove if too many kids & too high link density
wc = count_words(node.text_content())
if wc != 0 and len(list(node.iter())) > 3:
@@ -357,10 +352,6 @@ def get_article(data, url=None, encoding_in=None, encoding_out='unicode', debug=
else:
best = get_best_node(html, threshold)
if best is None:
# if threshold not met
return None
# clean up
if not debug:
keep_threshold = get_score(best) * 3/4


@@ -1,57 +0,0 @@
# This file is part of morss
#
# Copyright (C) 2013-2020 pictuga <contact@pictuga.com>
#
# This program is free software: you can redistribute it and/or modify it under
# the terms of the GNU Affero General Public License as published by the Free
# Software Foundation, either version 3 of the License, or (at your option) any
# later version.
#
# This program is distributed in the hope that it will be useful, but WITHOUT
# ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
# FOR A PARTICULAR PURPOSE. See the GNU Affero General Public License for more
# details.
#
# You should have received a copy of the GNU Affero General Public License along
# with this program. If not, see <https://www.gnu.org/licenses/>.
import os
import os.path
import sys
def pkg_path(*path_elements):
return os.path.join(os.path.dirname(__file__), *path_elements)
data_path_base = None
def data_path(*path_elements):
global data_path_base
path = os.path.join(*path_elements)
if data_path_base is not None:
return os.path.join(data_path_base, path)
bases = [
os.path.join(sys.prefix, 'share/morss'), # when installed as root
pkg_path('../../../share/morss'),
pkg_path('../../../../share/morss'),
pkg_path('../share/morss'), # for `pip install --target=dir morss`
pkg_path('..'), # when running from source tree
]
if 'DATA_PATH' in os.environ:
bases.append(os.environ['DATA_PATH'])
for base in bases:
full_path = os.path.join(base, path)
if os.path.isfile(full_path):
data_path_base = os.path.abspath(base)
return data_path(path)
else:
raise IOError()


@@ -36,7 +36,6 @@ except ImportError:
from . import caching, crawler, readabilite
from .morss import (DELAY, TIMEOUT, FeedFetch, FeedFormat, FeedGather,
MorssException, Options, log)
from .util import data_path
PORT = int(os.getenv('PORT', 8000))
@@ -168,13 +167,18 @@ def cgi_file_handler(environ, start_response, app):
if re.match(r'^/?([a-zA-Z0-9_-][a-zA-Z0-9\._-]+/?)*$', url):
# if it is a legitimate url (no funny relative paths)
paths = [
os.path.join(sys.prefix, 'share/morss/www', url),
os.path.join(os.path.dirname(__file__), '../www', url)
]
for path in paths:
try:
path = data_path('www', url)
f = open(path, 'rb')
except IOError:
# problem with file (cannot open or not found)
pass
continue
else:
# file successfully open
@@ -192,10 +196,9 @@ def cgi_get(environ, start_response):
url, options = cgi_parse_environ(environ)
# get page
if options['get'] in ('page', 'article'):
req = crawler.adv_get(url=url, timeout=TIMEOUT)
if req['contenttype'] in crawler.MIMETYPE['html']:
if req['contenttype'] in ['text/html', 'application/xhtml+xml', 'application/xml']:
if options['get'] == 'page':
html = readabilite.parse(req['data'], encoding=req['encoding'])
html.make_links_absolute(req['url'])
@@ -208,20 +211,17 @@ def cgi_get(environ, start_response):
output = lxml.etree.tostring(html.getroottree(), encoding='utf-8', method='html')
else: # i.e. options['get'] == 'article'
elif options['get'] == 'article':
output = readabilite.get_article(req['data'], url=req['url'], encoding_in=req['encoding'], encoding_out='utf-8', debug=options.debug)
elif req['contenttype'] in crawler.MIMETYPE['xml'] + crawler.MIMETYPE['rss'] + crawler.MIMETYPE['json']:
output = req['data']
else:
raise MorssException('unsupported mimetype')
else:
raise MorssException('no :get option passed')
else:
output = req['data']
# return html page
headers = {'status': '200 OK', 'content-type': req['contenttype'], 'X-Frame-Options': 'SAMEORIGIN'} # SAMEORIGIN to avoid potential abuse
headers = {'status': '200 OK', 'content-type': 'text/html; charset=utf-8', 'X-Frame-Options': 'SAMEORIGIN'} # SAMEORIGIN to avoid potential abuse
start_response(headers['status'], list(headers.items()))
return [output]
@@ -251,7 +251,7 @@ def cgi_error_handler(environ, start_response, app):
raise
except Exception as e:
headers = {'status': '404 Not Found', 'content-type': 'text/html', 'x-morss-error': repr(e)}
headers = {'status': '500 Oops', 'content-type': 'text/html'}
start_response(headers['status'], list(headers.items()), sys.exc_info())
log('ERROR: %s' % repr(e))
return [cgitb.html(sys.exc_info())]
@@ -281,7 +281,7 @@ def cgi_handle_request():
class WSGIRequestHandlerRequestUri(wsgiref.simple_server.WSGIRequestHandler):
def get_environ(self):
env = wsgiref.simple_server.WSGIRequestHandler.get_environ(self)
env = super().get_environ()
env['REQUEST_URI'] = self.path
return env


@@ -1,60 +1,26 @@
from datetime import datetime
from glob import glob
from setuptools import setup
def get_version():
with open('morss/__init__.py', 'r+') as file:
lines = file.readlines()
# look for hard coded version number
for i in range(len(lines)):
if lines[i].startswith('__version__'):
version = lines[i].split('"')[1]
break
# create (& save) one if none found
if version == '':
version = datetime.now().strftime('%Y%m%d.%H%M')
lines[i] = '__version__ = "' + version + '"\n'
file.seek(0)
file.writelines(lines)
# return version number
return version
package_name = 'morss'
setup(
name = package_name,
version = get_version(),
description = 'Get full-text RSS feeds',
long_description = open('README.md').read(),
long_description_content_type = 'text/markdown',
author = 'pictuga',
author_email = 'contact@pictuga.com',
author = 'pictuga, Samuel Marks',
author_email = 'contact at pictuga dot com',
url = 'http://morss.it/',
project_urls = {
'Source': 'https://git.pictuga.com/pictuga/morss',
'Bug Tracker': 'https://github.com/pictuga/morss/issues',
},
download_url = 'https://git.pictuga.com/pictuga/morss',
license = 'AGPL v3',
packages = [package_name],
install_requires = ['lxml', 'bs4', 'python-dateutil', 'chardet'],
extras_require = {
'full': ['redis', 'diskcache', 'gunicorn', 'setproctitle'],
'dev': ['pylint', 'pyenchant', 'pytest', 'pytest-cov'],
},
python_requires = '>=2.7',
extras_require = {'full': ['pymysql', 'redis', 'diskcache']},
package_data = {package_name: ['feedify.ini']},
data_files = [
('share/' + package_name, ['README.md', 'LICENSE']),
('share/' + package_name + '/www', glob('www/*.*')),
('share/' + package_name + '/www/cgi', [])
],
entry_points = {
'console_scripts': [package_name + '=' + package_name + '.__main__:main'],
},
scripts = ['morss-helper'],
)
'console_scripts': [package_name + '=' + package_name + '.__main__:main']
})


@@ -1,60 +0,0 @@
import os
import os.path
import threading
import pytest
try:
# python2
from BaseHTTPServer import BaseHTTPRequestHandler, HTTPServer
from SimpleHTTPServer import SimpleHTTPRequestHandler
except:
# python3
from http.server import (BaseHTTPRequestHandler, HTTPServer,
SimpleHTTPRequestHandler)
class HTTPReplayHandler(SimpleHTTPRequestHandler):
" Serves pages saved alongside with headers. See `curl --http1.1 -is http://...` "
directory = os.path.join(os.path.dirname(__file__), './samples/')
__init__ = BaseHTTPRequestHandler.__init__
def do_GET(self):
path = self.translate_path(self.path)
if os.path.isdir(path):
f = self.list_directory(path)
else:
f = open(path, 'rb')
try:
self.copyfile(f, self.wfile)
finally:
f.close()
class MuteHTTPServer(HTTPServer):
def handle_error(self, request, client_address):
# mute errors
pass
def make_server(port=8888):
print('Serving http://localhost:%s/' % port)
return MuteHTTPServer(('', port), RequestHandlerClass=HTTPReplayHandler)
@pytest.fixture
def replay_server():
httpd = make_server()
thread = threading.Thread(target=httpd.serve_forever)
thread.start()
yield
httpd.shutdown()
thread.join()
if __name__ == '__main__':
httpd = make_server()
httpd.serve_forever()


@@ -1,4 +0,0 @@
HTTP/1.1 200 OK
content-type: text/plain
success


@@ -1,3 +0,0 @@
HTTP/1.1 301 Moved Permanently
location: /200-ok.txt


@@ -1,3 +0,0 @@
HTTP/1.1 301 Moved Permanently
location: ./200-ok.txt


@@ -1,3 +0,0 @@
HTTP/1.1 301 Moved Permanently
location: http://localhost:8888/200-ok.txt


@@ -1,4 +0,0 @@
HTTP/1.1 308 Permanent Redirect
location: /200-ok.txt
/200-ok.txt


@@ -1,8 +0,0 @@
HTTP/1.1 200 OK
content-type: text/html; charset=UTF-8
<!DOCTYPE html>
<html>
<head><link rel="alternate" type="application/rss+xml" href="/200-ok.txt" /></head>
<body>meta redirect</body>
</html>


@@ -1,4 +0,0 @@
HTTP/1.1 200 OK
content-type: text/plain; charset=gb2312
<EFBFBD>ɹ<EFBFBD>


@@ -1,10 +0,0 @@
HTTP/1.1 200 OK
content-type: text/html
<!DOCTYPE html>
<html>
<head><meta charset="gb2312"/></head>
<body>
<EFBFBD>ɹ<EFBFBD>
</body></html>


@@ -1,4 +0,0 @@
HTTP/1.1 200 OK
content-type: text/plain; charset=iso-8859-1
succ<EFBFBD>s


@@ -1,4 +0,0 @@
HTTP/1.1 200 OK
content-type: text/plain
succ<EFBFBD>s


@@ -1,4 +0,0 @@
HTTP/1.1 200 OK
content-type: text/plain; charset=UTF-8
succès


@@ -1,16 +0,0 @@
HTTP/1.1 200 OK
Content-Type: text/xml; charset=utf-8
<?xml version='1.0' encoding='utf-8'?>
<feed xmlns="http://www.w3.org/2005/Atom">
<title>!TITLE!</title>
<subtitle>!DESC!</subtitle>
<entry>
<title>!ITEM_TITLE!</title>
<summary>!ITEM_DESC!</summary>
<content type="html">!ITEM_CONTENT!</content>
<link href="!ITEM_LINK!"/>
<updated>2022-01-01T00:00:01+01:00</updated>
<published>2022-01-01T00:00:02+01:00</published>
</entry>
</feed>


@@ -1,15 +0,0 @@
HTTP/1.1 200 OK
content-type: application/xml
<?xml version='1.0' encoding='utf-8' ?>
<feed version='0.3' xmlns='http://purl.org/atom/ns#'>
<title>!TITLE!</title>
<subtitle>!DESC!</subtitle>
<entry>
<title>!ITEM_TITLE!</title>
<link rel='alternate' type='text/html' href='!ITEM_LINK!' />
<summary>!ITEM_DESC!</summary>
<content>!ITEM_CONTENT!</content>
<issued>2022-01-01T00:00:01+01:00</issued> <!-- FIXME -->
</entry>
</feed>


@@ -1,22 +0,0 @@
HTTP/1.1 200 OK
Content-Type: text/html; charset=utf-8
<html>
<head></head>
<body>
<div id="header">
<h1>!TITLE!</h1>
<p>!DESC!</p>
</div>
<div id="content">
<div class="item">
<a target="_blank" href="!ITEM_LINK!">!ITEM_TITLE!</a>
<div class="desc">!ITEM_DESC!</div>
<div class="content">!ITEM_CONTENT!</div>
</div>
</div>
</body>
</html>


@@ -1,16 +0,0 @@
HTTP/1.1 200 OK
Content-Type: application/json; charset=utf-8
{
"title": "!TITLE!",
"desc": "!DESC!",
"items": [
{
"title": "!ITEM_TITLE!",
"time": "2022-01-01T00:00:01+0100",
"url": "!ITEM_LINK!",
"desc": "!ITEM_DESC!",
"content": "!ITEM_CONTENT!"
}
]
}


@@ -1,17 +0,0 @@
HTTP/1.1 200 OK
Content-Type: text/xml; charset=utf-8
<?xml version='1.0' encoding='utf-8'?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" version="2.0">
<channel>
<title>!TITLE!</title>
<description>!DESC!</description>
<item>
<title>!ITEM_TITLE!</title>
<pubDate>Mon, 01 Jan 2022 00:00:01 +0100</pubDate>
<link>!ITEM_LINK!</link>
<description>!ITEM_DESC!</description>
<content:encoded>!ITEM_CONTENT!</content:encoded>
</item>
</channel>
</rss>

Binary file not shown.


@@ -1,3 +0,0 @@
HTTP/1.1 200 OK
refresh: 0;url=/200-ok.txt


@@ -1,8 +0,0 @@
HTTP/1.1 200 OK
content-type: text/html; charset=UTF-8
<!DOCTYPE html>
<html>
<head><meta http-equiv="refresh" content="2; url = /200-ok.txt" /></head>
<body>meta redirect</body>
</html>


@@ -1,8 +0,0 @@
HTTP/1.1 200 OK
content-type: text/html; charset=UTF-8
<!DOCTYPE html>
<html>
<head><meta http-equiv="refresh" content="2; url = ./200-ok.txt" /></head>
<body>meta redirect</body>
</html>


@@ -1,8 +0,0 @@
HTTP/1.1 200 OK
content-type: text/html; charset=UTF-8
<!DOCTYPE html>
<html>
<head><meta http-equiv="refresh" content="2; url = http://localhost:8888/200-ok.txt" /></head>
<body>meta redirect</body>
</html>

File diff suppressed because it is too large.


@@ -1,62 +0,0 @@
import pytest
from morss.crawler import *
def test_get(replay_server):
assert get('http://localhost:8888/200-ok.txt') == b'success\r\n'
def test_adv_get(replay_server):
assert adv_get('http://localhost:8888/200-ok.txt')['data'] == b'success\r\n'
@pytest.mark.parametrize('before,after', [
(b'http://localhost:8888/', 'http://localhost:8888/'),
('localhost:8888/', 'http://localhost:8888/'),
('http:/localhost:8888/', 'http://localhost:8888/'),
('http://localhost:8888/&/', 'http://localhost:8888/&/'),
('http://localhost:8888/ /', 'http://localhost:8888/%20/'),
('http://localhost-€/€/', 'http://xn--localhost--077e/%E2%82%AC/'),
('http://localhost-€:8888/€/', 'http://xn--localhost--077e:8888/%E2%82%AC/'),
])
def test_sanitize_url(before, after):
assert sanitize_url(before) == after
@pytest.mark.parametrize('opener', [custom_opener(), build_opener(SizeLimitHandler(500*1024))])
def test_size_limit_handler(replay_server, opener):
assert len(opener.open('http://localhost:8888/size-1MiB.txt').read()) == 500*1024
@pytest.mark.parametrize('opener', [custom_opener(), build_opener(GZIPHandler())])
def test_gzip_handler(replay_server, opener):
assert opener.open('http://localhost:8888/gzip.txt').read() == b'success\n'
@pytest.mark.parametrize('opener', [custom_opener(), build_opener(EncodingFixHandler())])
@pytest.mark.parametrize('url', [
'enc-gb2312-header.txt', 'enc-gb2312-meta.txt', #'enc-gb2312-missing.txt',
'enc-iso-8859-1-header.txt', 'enc-iso-8859-1-missing.txt',
'enc-utf-8-header.txt',
])
def test_encoding_fix_handler(replay_server, opener, url):
out = adv_get('http://localhost:8888/%s' % url)
out = out['data'].decode(out['encoding'])
assert 'succes' in out or 'succès' in out or '成功' in out
@pytest.mark.parametrize('opener', [custom_opener(follow='rss'), build_opener(AlternateHandler(MIMETYPE['rss']))])
def test_alternate_handler(replay_server, opener):
assert opener.open('http://localhost:8888/alternate-abs.txt').geturl() == 'http://localhost:8888/200-ok.txt'
@pytest.mark.parametrize('opener', [custom_opener(), build_opener(HTTPEquivHandler(), HTTPRefreshHandler())])
def test_http_equiv_handler(replay_server, opener):
assert opener.open('http://localhost:8888/meta-redirect-abs.txt').geturl() == 'http://localhost:8888/200-ok.txt'
assert opener.open('http://localhost:8888/meta-redirect-rel.txt').geturl() == 'http://localhost:8888/200-ok.txt'
assert opener.open('http://localhost:8888/meta-redirect-url.txt').geturl() == 'http://localhost:8888/200-ok.txt'
@pytest.mark.parametrize('opener', [custom_opener(), build_opener(HTTPAllRedirectHandler())])
def test_http_all_redirect_handler(replay_server, opener):
assert opener.open('http://localhost:8888/308-redirect.txt').geturl() == 'http://localhost:8888/200-ok.txt'
assert opener.open('http://localhost:8888/301-redirect-abs.txt').geturl() == 'http://localhost:8888/200-ok.txt'
assert opener.open('http://localhost:8888/301-redirect-rel.txt').geturl() == 'http://localhost:8888/200-ok.txt'
assert opener.open('http://localhost:8888/301-redirect-url.txt').geturl() == 'http://localhost:8888/200-ok.txt'
@pytest.mark.parametrize('opener', [custom_opener(), build_opener(HTTPRefreshHandler())])
def test_http_refresh_handler(replay_server, opener):
assert opener.open('http://localhost:8888/header-refresh.txt').geturl() == 'http://localhost:8888/200-ok.txt'


@@ -1,108 +0,0 @@
import pytest
from morss.crawler import adv_get
from morss.feeds import *
def get_feed(url):
url = 'http://localhost:8888/%s' % url
out = adv_get(url)
feed = parse(out['data'], url=url, encoding=out['encoding'])
return feed
def check_feed(feed):
# NB. time and updated not covered
assert feed.title == '!TITLE!'
assert feed.desc == '!DESC!'
assert feed[0] == feed.items[0]
assert feed[0].title == '!ITEM_TITLE!'
assert feed[0].link == '!ITEM_LINK!'
assert '!ITEM_DESC!' in feed[0].desc # broader test due to possible inclusion of surrounding <div> in xml
assert '!ITEM_CONTENT!' in feed[0].content
def check_output(feed):
output = feed.tostring()
assert '!TITLE!' in output
assert '!DESC!' in output
assert '!ITEM_TITLE!' in output
assert '!ITEM_LINK!' in output
assert '!ITEM_DESC!' in output
assert '!ITEM_CONTENT!' in output
def check_change(feed):
feed.title = '!TITLE2!'
feed.desc = '!DESC2!'
feed[0].title = '!ITEM_TITLE2!'
feed[0].link = '!ITEM_LINK2!'
feed[0].desc = '!ITEM_DESC2!'
feed[0].content = '!ITEM_CONTENT2!'
assert feed.title == '!TITLE2!'
assert feed.desc == '!DESC2!'
assert feed[0].title == '!ITEM_TITLE2!'
assert feed[0].link == '!ITEM_LINK2!'
assert '!ITEM_DESC2!' in feed[0].desc
assert '!ITEM_CONTENT2!' in feed[0].content
def check_add(feed):
feed.append({
'title': '!ITEM_TITLE3!',
'link': '!ITEM_LINK3!',
'desc': '!ITEM_DESC3!',
'content': '!ITEM_CONTENT3!',
})
assert feed[1].title == '!ITEM_TITLE3!'
assert feed[1].link == '!ITEM_LINK3!'
assert '!ITEM_DESC3!' in feed[1].desc
assert '!ITEM_CONTENT3!' in feed[1].content
each_format = pytest.mark.parametrize('url', [
'feed-rss-channel-utf-8.txt', 'feed-atom-utf-8.txt',
'feed-atom03-utf-8.txt', 'feed-json-utf-8.txt', 'feed-html-utf-8.txt',
])
each_check = pytest.mark.parametrize('check', [
check_feed, check_output, check_change, check_add,
])
@each_format
@each_check
def test_parse(replay_server, url, check):
feed = get_feed(url)
check(feed)
@each_format
@each_check
def test_convert_rss(replay_server, url, check):
feed = get_feed(url)
feed = feed.convert(FeedXML)
check(feed)
@each_format
@each_check
def test_convert_json(replay_server, url, check):
feed = get_feed(url)
feed = feed.convert(FeedJSON)
check(feed)
@each_format
@each_check
def test_convert_html(replay_server, url, check):
feed = get_feed(url)
feed = feed.convert(FeedHTML)
if len(feed) > 1:
# remove the 'blank' default html item
del feed[0]
check(feed)
@each_format
def test_convert_csv(replay_server, url):
# only csv output, not csv feed, therefore the check is different
feed = get_feed(url)
output = feed.tocsv()
assert '!ITEM_TITLE!' in output
assert '!ITEM_LINK!' in output
assert '!ITEM_DESC!' in output
assert '!ITEM_CONTENT!' in output

15
www/.htaccess Normal file

@@ -0,0 +1,15 @@
Options -Indexes
ErrorDocument 403 "Access forbidden"
ErrorDocument 404 /cgi/main.py
ErrorDocument 500 "A very nasty bug found his way onto this very server"
# Uncomment below line to turn debug on for all requests
#SetEnv DEBUG 1
# Uncomment below line to turn debug on for requests with :debug in the url
#SetEnvIf Request_URI :debug DEBUG=1
<Files ~ "\.(py|pyc|db|log)$">
deny from all
</Files>

9
www/cgi/.htaccess Normal file

@@ -0,0 +1,9 @@
order allow,deny
deny from all
<Files main.py>
allow from all
AddHandler cgi-script .py
Options +ExecCGI
</Files>


@@ -16,7 +16,6 @@
<title>RSS feed by morss</title>
<meta name="viewport" content="width=device-width; initial-scale=1.0;" />
<meta name="robots" content="noindex" />
<link rel="shortcut icon" type="image/svg+xml" href="/logo.svg" sizes="any" />
<style type="text/css">
body * {
@@ -204,9 +203,7 @@
link of the
<select>
<option value="">first</option>
<option value=":order=newest" title="Select feed items by publication date (instead of appearing order)">newest (?)</option>
<option value=":order=last">last</option>
<option value=":order=oldest">oldest</option>
<option value=":newest" title="Select feed items by publication date (instead of appearing order)">newest (?)</option>
</select>
items and
<select>