Compare commits

...

17 Commits

Author SHA1 Message Date
c5b2df754e Save auto version number
All checks were successful
default / test-lint (push) Successful in 1m31s
default / python-publish (push) Successful in 35s
default / docker-publish-deploy (push) Successful in 1m56s
Fixed #108
2023-06-27 22:36:29 +02:00
6529fdbdd8 Clean up sqlite code
All checks were successful
default / test-lint (push) Successful in 1m26s
default / python-publish (push) Successful in 30s
default / docker-publish-deploy (push) Successful in 1m35s
2023-06-26 01:30:47 +02:00
f4da40fffb actions: fix deploy 2023-06-26 01:29:00 +02:00
d27fc93f75 actions: clean up 2023-06-26 01:28:33 +02:00
dfb2b83c06 actions: fix python setup
Some checks reported warnings
default / publish-deploy (push) Has been cancelled
default / docker-publish (push) Has been cancelled
default / test-lint (push) Successful in 1m33s
2023-06-24 01:50:12 +02:00
4340b678d0 actions: change image
Some checks failed
default / test-lint (push) Failing after 23s
default / publish-deploy (push) Failing after 12s
default / docker-publish (push) Successful in 2m19s
2023-06-23 23:14:32 +02:00
ff9503b0d0 Switch from Drone to Gitea Actions
Some checks failed
default / publish-deploy (push) Failing after 45s
default / docker-publish (push) Failing after 10s
default / test-lint (push) Failing after 11m42s
2023-05-17 22:54:05 +02:00
Nesswit
8bdcd8f386 Add mode option 2023-05-04 16:01:52 +09:00
ea2ebedfcb Added systemd service file
Some checks failed
continuous-integration/drone/push Build is failing
Fixing #94
2022-12-13 23:01:42 +01:00
438c32a312 Remove sqlite & mysql cache backends
Some checks failed
continuous-integration/drone/push Build is failing
Obsoleted since the introduction of diskcache & redis
2022-12-13 22:40:13 +01:00
8b26797e93 README: add recommended install way
Some checks reported errors
continuous-integration/drone/push Build was killed
continuous-integration/drone Build is passing
Part of discussions on #94
2022-12-13 22:07:21 +01:00
e1ed33f320 crawler: improve html iter code
All checks were successful
continuous-integration/drone/push Build is passing
Ignores tags without attributes. Avoids bug with unclosed tags.
2022-02-09 15:57:12 +01:00
b65272daab crawler: accept more meta redirects
All checks were successful
continuous-integration/drone/push Build is passing
2022-02-01 23:32:49 +01:00
4d64afe9cb crawler: fix regression from d6b90448f3
Some checks failed
continuous-integration/drone/push Build is failing
2022-02-01 23:18:16 +01:00
d3b623482d pytest: crawler 2022-02-01 23:16:43 +01:00
32645548c2 pytest: first batch with test_feeds
Some checks failed
continuous-integration/drone/push Build is failing
And multiple related fixes
2022-01-31 08:32:34 +01:00
d6b90448f3 crawler: improve handling of non-ascii urls 2022-01-30 23:27:49 +01:00
37 changed files with 9777 additions and 224 deletions

View File

@@ -1,93 +0,0 @@
---
kind: pipeline
name: test
steps:
- name: lint
image: alpine:edge
commands:
- apk add --no-cache python3 py3-lxml py3-setproctitle py3-pip py3-wheel py3-enchant hunspell-en
- pip3 install --no-cache-dir .[full] .[dev]
- isort --check-only --diff .
- pylint morss --rcfile=.pylintrc --disable=C,R,W --fail-under=8
---
kind: pipeline
name: python
steps:
- name: publish
image: plugins/pypi
settings:
username:
from_secret: pypi_user
password:
from_secret: pypi_pwd
commands:
- /bin/drone-pypi
- cp dist/morss-*.tar.gz dist/morss.tar.gz
- name: push
image: appleboy/drone-scp
settings:
host:
from_secret: ssh_host
username:
from_secret: ssh_user
key:
from_secret: ssh_key
source:
- dist/morss.tar.gz
target: /home/ubuntu
- name: deploy
image: appleboy/drone-ssh
settings:
host:
from_secret: ssh_host
username:
from_secret: ssh_user
key:
from_secret: ssh_key
script_stop: true
script:
- sudo pip install --upgrade dist/morss.tar.gz[full]
- sudo rm -r dist
- sudo morss-helper reload
trigger:
branch:
- master
event:
- push
depends_on:
- test
---
kind: pipeline
name: docker
steps:
- name: publish
image: thegeeklab/drone-docker-buildx
# NB. this requires qemu installed on host
privileged: true
settings:
username:
from_secret: docker_user
password:
from_secret: docker_pwd
repo:
from_secret: docker_repo
tags: latest
platforms: linux/amd64,linux/arm64,linux/arm/v7
trigger:
branch:
- master
event:
- push
depends_on:
- test

78
.github/workflows/default.yml vendored Normal file
View File

@@ -0,0 +1,78 @@
name: default
on:
push:
branches:
- master
jobs:
test-lint:
runs-on: ubuntu-latest
steps:
- name: Checkout
uses: actions/checkout@v3
with:
fetch-depth: 0
- name: Prepare image
run: apt-get -y update && apt-get -y install python3-pip libenchant-2-2 aspell-en
- name: Install dependencies
run: pip3 install .[full] .[dev]
- run: isort --check-only --diff .
- run: pylint morss --rcfile=.pylintrc --disable=C,R,W --fail-under=8
- run: pytest --cov=morss tests
python-publish:
runs-on: ubuntu-latest
steps:
- name: Checkout
uses: actions/checkout@v3
with:
fetch-depth: 0
- name: Prepare image
run: apt-get -y update && apt-get -y install python3-pip python3-build
- name: Build package
run: python3 -m build
- name: Publish package
uses: https://github.com/pypa/gh-action-pypi-publish@release/v1
with:
password: ${{ secrets.pypi_api_token }}
docker-publish-deploy:
runs-on: ubuntu-latest
container:
image: catthehacker/ubuntu:act-latest
steps:
- name: Checkout
uses: actions/checkout@v3
- name: Set up QEMU
uses: https://github.com/docker/setup-qemu-action@v2
- name: Set up Docker Buildx
uses: https://github.com/docker/setup-buildx-action@v2
- name: Login to Docker Hub
uses: https://github.com/docker/login-action@v2
with:
username: ${{ secrets.docker_user }}
password: ${{ secrets.docker_pwd }}
- name: Build and push
uses: https://github.com/docker/build-push-action@v4
with:
context: .
platforms: linux/amd64,linux/arm64,linux/arm/v7
push: true
tags: ${{ secrets.docker_repo }}
- name: Deploy on server
uses: https://github.com/appleboy/ssh-action@v0.1.10
with:
host: ${{ secrets.ssh_host }}
username: ${{ secrets.ssh_user }}
key: ${{ secrets.ssh_key }}
script: morss-update

View File

@@ -41,7 +41,7 @@ Some features of morss:
- Follow 301/meta redirects
- Recover xml feeds with corrupt encoding
- Supports gzip-compressed http content
- - HTTP caching with different backends (in-memory/sqlite/mysql/redis/diskcache)
+ - HTTP caching with different backends (in-memory/redis/diskcache)
- Works as server/cli tool
- Deobfuscate various tracking links
@@ -81,9 +81,9 @@ From git
pip install git+https://git.pictuga.com/pictuga/morss.git#egg=morss[full]
```
- The full install includes all the cache backends. Otherwise, only in-memory and
- sqlite3 caches are available. The full install also includes gunicorn (for more
- efficient HTTP handling).
+ The full install includes all the cache backends. Otherwise, only in-memory
+ cache is available. The full install also includes gunicorn (for more efficient
+ HTTP handling).
The dependency `lxml` is fairly long to install (especially on Raspberry Pi, as
C code needs to be compiled). If possible on your distribution, try installing
@@ -103,7 +103,7 @@ With cli
docker pull pictuga/morss
```
- With docker-compose
+ With docker-compose **(recommended)**
```yml
services:
@@ -215,7 +215,7 @@ From source
docker run -p 8000:8000 morss
```
- With docker-compose
+ With docker-compose **(recommended)**
```shell
docker-compose up
@@ -353,7 +353,7 @@ Using cache and passing arguments:
```python
>>> import morss
>>> url = 'http://feeds.bbci.co.uk/news/rss.xml'
- >>> cache = '/tmp/morss-cache.db' # sqlite cache location
+ >>> cache = '/tmp/morss-cache' # diskcache cache location
>>> options = {'csv':True}
>>> xml_string = morss.process(url, cache, options)
>>> xml_string[:50]
@@ -367,11 +367,10 @@ under the hood.
Doing it step-by-step:
```python
- import morss, morss.crawler
+ import morss
url = 'http://newspaper.example/feed.xml'
options = morss.Options(csv=True) # arguments
- morss.crawler.sqlite_default = '/tmp/morss-cache.db' # sqlite cache location
url, rss = morss.FeedFetch(url, options) # this only grabs the RSS feed
rss = morss.FeedGather(rss, url, options) # this fills the feed and cleans it up
@@ -395,8 +394,8 @@ usage: morss [-h] [--post STRING] [--xpath XPATH]
[--indent] [--cache] [--force] [--proxy]
[--order {first,last,newest,oldest}] [--firstlink] [--resolve]
[--items XPATH] [--item_link XPATH] [--item_title XPATH]
- [--item_content XPATH] [--item_time XPATH] [--nolink] [--noref]
- [--silent]
+ [--item_content XPATH] [--item_time XPATH]
+ [--mode {xml,html,json}] [--nolink] [--noref] [--silent]
url

Get full-text RSS feeds
@@ -440,6 +439,8 @@ custom feeds:
--item_content XPATH entry's content
--item_time XPATH entry's date & time (accepts a wide range of time
formats)
+ --mode {xml,html,json}
+ parser to use for the custom feeds

misc:
--nolink drop links, but keeps links' inner text
@@ -501,11 +502,6 @@ be dropped from the feed, even if they're cached. `-1` for unlimited.
morss uses caching to make loading faster. There are 3 possible cache backends:
- `(nothing/default)`: a simple python in-memory dict-like object.
- - `CACHE=sqlite`: sqlite3 cache. Default file location is in-memory (i.e. it
- will be cleared every time the program is run). Path can be defined with
- `SQLITE_PATH`.
- - `CACHE=mysql`: MySQL cache. Connection can be defined with the following
- environment variables: `MYSQL_USER`, `MYSQL_PWD`, `MYSQL_DB`, `MYSQL_HOST`
- `CACHE=redis`: Redis cache. Connection can be defined with the following
environment variables: `REDIS_HOST`, `REDIS_PORT`, `REDIS_DB`, `REDIS_PWD`
- `CACHE=diskcache`: disk-based cache. Target directory can be defined with
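For readers skimming the compare view, here is a minimal sketch of how the remaining backends are selected. It only uses the `CACHE` and `REDIS_*` variables documented above, and assumes morss is imported *after* the environment is set, since the cache is chosen at import time in morss/caching.py (see the diff further down).

```python
import os

# Pick the redis backend before importing morss; morss/caching.py reads the
# CACHE environment variable at import time (see the caching.py diff below).
os.environ['CACHE'] = 'redis'
os.environ['REDIS_HOST'] = 'localhost'  # defaults shown in caching.py
os.environ['REDIS_PORT'] = '6379'

import morss

# Any subsequent fetch now goes through the redis-backed HTTP cache.
xml_string = morss.process('http://feeds.bbci.co.uk/news/rss.xml')
print(xml_string[:50])
```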

13
morss.service Normal file
View File

@@ -0,0 +1,13 @@
[Unit]
Description=morss server (gunicorn)
After=network.target
[Service]
ExecStart=/usr/local/bin/morss-helper run
ExecReload=/usr/local/bin/morss-helper reload
KillMode=process
Restart=always
User=http
[Install]
WantedBy=multi-user.target

View File

@@ -19,5 +19,7 @@
# pylint: disable=unused-import,unused-variable
+ __version__ = ""
from .morss import *
from .wsgi import application

View File

@@ -16,7 +16,6 @@
# with this program. If not, see <https://www.gnu.org/licenses/>.
import os
- import pickle
import threading
import time
from collections import OrderedDict
@@ -51,83 +50,6 @@ class BaseCache:
return True
try:
import sqlite3 # isort:skip
except ImportError:
pass
class SQLiteCache(BaseCache):
def __init__(self, path=':memory:'):
self.con = sqlite3.connect(path, detect_types=sqlite3.PARSE_DECLTYPES, check_same_thread=False)
with self.con:
self.con.execute('CREATE TABLE IF NOT EXISTS data (ky UNICODE PRIMARY KEY, data BLOB, timestamp INT)')
self.con.execute('pragma journal_mode=WAL')
self.trim()
def __del__(self):
self.con.close()
def trim(self):
with self.con:
self.con.execute('DELETE FROM data WHERE timestamp <= ( SELECT timestamp FROM ( SELECT timestamp FROM data ORDER BY timestamp DESC LIMIT 1 OFFSET ? ) foo )', (CACHE_SIZE,))
def __getitem__(self, key):
row = self.con.execute('SELECT * FROM data WHERE ky=?', (key,)).fetchone()
if not row:
raise KeyError
return row[1]
def __setitem__(self, key, data):
with self.con:
self.con.execute('INSERT INTO data VALUES (?,?,?) ON CONFLICT(ky) DO UPDATE SET data=?, timestamp=?', (key, data, time.time(), data, time.time()))
try:
import pymysql.cursors # isort:skip
except ImportError:
pass
class MySQLCacheHandler(BaseCache):
def __init__(self, user, password, database, host='localhost'):
self.user = user
self.password = password
self.database = database
self.host = host
with self.cursor() as cursor:
cursor.execute('CREATE TABLE IF NOT EXISTS data (ky VARCHAR(255) NOT NULL PRIMARY KEY, data MEDIUMBLOB, timestamp INT)')
self.trim()
def cursor(self):
return pymysql.connect(host=self.host, user=self.user, password=self.password, database=self.database, charset='utf8', autocommit=True).cursor()
def trim(self):
with self.cursor() as cursor:
cursor.execute('DELETE FROM data WHERE timestamp <= ( SELECT timestamp FROM ( SELECT timestamp FROM data ORDER BY timestamp DESC LIMIT 1 OFFSET %s ) foo )', (CACHE_SIZE,))
def __getitem__(self, key):
cursor = self.cursor()
cursor.execute('SELECT * FROM data WHERE ky=%s', (key,))
row = cursor.fetchone()
if not row:
raise KeyError
return row[1]
def __setitem__(self, key, data):
with self.cursor() as cursor:
cursor.execute('INSERT INTO data VALUES (%s,%s,%s) ON DUPLICATE KEY UPDATE data=%s, timestamp=%s',
(key, data, time.time(), data, time.time()))
class CappedDict(OrderedDict, BaseCache):
def trim(self):
if CACHE_SIZE >= 0:
@@ -182,20 +104,7 @@ class DiskCacheHandler(BaseCache):
if 'CACHE' in os.environ:
- if os.environ['CACHE'] == 'mysql':
- default_cache = MySQLCacheHandler(
- user = os.getenv('MYSQL_USER'),
- password = os.getenv('MYSQL_PWD'),
- database = os.getenv('MYSQL_DB'),
- host = os.getenv('MYSQL_HOST', 'localhost')
- )
- elif os.environ['CACHE'] == 'sqlite':
- default_cache = SQLiteCache(
- os.getenv('SQLITE_PATH', ':memory:')
- )
- elif os.environ['CACHE'] == 'redis':
+ if os.environ['CACHE'] == 'redis':
default_cache = RedisCacheHandler(
host = os.getenv('REDIS_HOST', 'localhost'),
port = int(os.getenv('REDIS_PORT', 6379)),

View File

@@ -54,6 +54,7 @@ def cli_app():
group.add_argument('--item_title', action='store', type=str, metavar='XPATH', help='entry\'s title')
group.add_argument('--item_content', action='store', type=str, metavar='XPATH', help='entry\'s content')
group.add_argument('--item_time', action='store', type=str, metavar='XPATH', help='entry\'s date & time (accepts a wide range of time formats)')
+ group.add_argument('--mode', default=None, choices=('xml', 'html', 'json'), help='parser to use for the custom feeds')
group = parser.add_argument_group('misc')
group.add_argument('--nolink', action='store_true', help='drop links, but keeps links\' inner text')

View File

@@ -38,12 +38,12 @@ try:
from urllib2 import (BaseHandler, HTTPCookieProcessor, HTTPRedirectHandler,
Request, addinfourl, build_opener, parse_http_list,
parse_keqv_list)
- from urlparse import urlparse, urlunparse
+ from urlparse import urlsplit
except ImportError:
# python 3
from email import message_from_string
from http.client import HTTPMessage
- from urllib.parse import quote, urlparse, urlunparse
+ from urllib.parse import quote, urlsplit
from urllib.request import (BaseHandler, HTTPCookieProcessor,
HTTPRedirectHandler, Request, addinfourl,
build_opener, parse_http_list, parse_keqv_list)
@@ -163,10 +163,20 @@ def is_ascii(string):
return True
+ def soft_quote(string):
+ " url-quote only when not a valid ascii string "
+ if is_ascii(string):
+ return string
+ else:
+ return quote(string.encode('utf-8'))
def sanitize_url(url):
# make sure the url is unicode, i.e. not bytes
if isinstance(url, bytes):
- url = url.decode()
+ url = url.decode('utf-8')
# make sure there's a protocol (http://)
if url.split(':', 1)[0] not in PROTOCOL:
@@ -179,18 +189,19 @@ def sanitize_url(url):
url = url.replace(' ', '%20')
# escape non-ascii unicode characters
- # https://stackoverflow.com/a/4391299
- parts = list(urlparse(url))
- for i in range(len(parts)):
- if not is_ascii(parts[i]):
- if i == 1:
- parts[i] = parts[i].encode('idna').decode('ascii')
- else:
- parts[i] = quote(parts[i].encode('utf-8'))
- return urlunparse(parts)
+ parts = urlsplit(url)
+ parts = parts._replace(
+ netloc=parts.netloc.replace(
+ parts.hostname,
+ parts.hostname.encode('idna').decode('ascii')
+ ),
+ path=soft_quote(parts.path),
+ query=soft_quote(parts.query),
+ fragment=soft_quote(parts.fragment),
+ )
+ return parts.geturl()
class RespDataHandler(BaseHandler):
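A quick illustration of the rewritten `sanitize_url` (the expected output is taken from `tests/test_crawler.py` further down): the hostname goes through IDNA, while path, query and fragment are only percent-encoded when they are not already plain ASCII.

```python
from morss.crawler import sanitize_url

# Non-ascii hostname -> IDNA; non-ascii path -> percent-encoded via soft_quote.
print(sanitize_url('http://localhost-€:8888/€/'))
# expected, per tests/test_crawler.py: http://xn--localhost--077e:8888/%E2%82%AC/
```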
@@ -357,7 +368,7 @@ class BrowserlyHeaderHandler(BaseHandler):
def iter_html_tag(html_str, tag_name):
" To avoid parsing whole pages when looking for a simple tag "
- re_tag = r'<%s(\s*[^>])*>' % tag_name
+ re_tag = r'<%s\s+[^>]+>' % tag_name
re_attr = r'(?P<key>[^=\s]+)=[\'"](?P<value>[^\'"]+)[\'"]'
for tag_match in re.finditer(re_tag, html_str):
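A small standalone sketch of what the tightened pattern changes (per the commit message above, tags without attributes are now ignored, which avoids the unclosed-tag bug); the sample HTML string is made up for illustration.

```python
import re

re_tag = r'<%s\s+[^>]+>' % 'link'   # new pattern: at least one attribute required
re_attr = r'(?P<key>[^=\s]+)=[\'"](?P<value>[^\'"]+)[\'"]'

html = '<link rel="alternate" type="application/rss+xml" href="/feed.xml"><link>'

for tag_match in re.finditer(re_tag, html):
    # only the first <link ...> matches; the bare, attribute-less <link> is skipped
    print(dict(re.findall(re_attr, tag_match.group(0))))
```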
@@ -414,7 +425,7 @@ class HTTPRefreshHandler(BaseHandler):
def http_response(self, req, resp):
if 200 <= resp.code < 300:
if resp.headers.get('refresh'):
- regex = r'(?i)^(?P<delay>[0-9]+)\s*;\s*url=(["\']?)(?P<url>.+)\2$'
+ regex = r'(?i)^(?P<delay>[0-9]+)\s*;\s*url\s*=\s*(["\']?)(?P<url>.+)\2$'
match = re.search(regex, resp.headers.get('refresh'))
if match:

View File

@@ -90,9 +90,6 @@ item_updated = updated
[html]
mode = html
- path =
- http://localhost/
title = //div[@id='header']/h1
desc = //div[@id='header']/p
items = //div[@id='content']/div

View File

@@ -699,7 +699,7 @@ class Feed(object):
try:
setattr(item, attr, new[attr])
- except (IndexError, TypeError):
+ except (KeyError, IndexError, TypeError):
pass
return item

View File

@@ -287,6 +287,9 @@ def FeedFetch(url, options):
ruleset['items'] = options.items
+ if options.mode:
+ ruleset['mode'] = options.mode
ruleset['title'] = options.get('title', '//head/title')
ruleset['desc'] = options.get('desc', '//head/meta[@name="description"]/@content')
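To make the new `mode` plumbing concrete, here is a hedged sketch of driving a custom HTML feed from Python; the URL and the XPath are illustrative (the XPath mirrors the `[html]` ruleset in feedify.ini), and `mode='html'` corresponds to the new `--mode html` CLI flag.

```python
import morss

# Force the html parser for a custom feed instead of relying on autodetection.
options = morss.Options(
    items="//div[@id='content']/div",  # illustrative XPath for the entries
    mode='html',
)

url, rss = morss.FeedFetch('http://newspaper.example/', options)  # grabs & parses the page
rss = morss.FeedGather(rss, url, options)                         # fills the feed and cleans it up
```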
@@ -425,7 +428,7 @@ def process(url, cache=None, options=None):
options = Options(options)
if cache:
- caching.default_cache = caching.SQLiteCache(cache)
+ caching.default_cache = caching.DiskCacheHandler(cache)
url, rss = FeedFetch(url, options)
rss = FeedGather(rss, url, options)

View File

@@ -3,11 +3,33 @@ from glob import glob
from setuptools import setup
def get_version():
with open('morss/__init__.py', 'r+') as file:
lines = file.readlines()
# look for hard coded version number
for i in range(len(lines)):
if lines[i].startswith('__version__'):
version = lines[i].split('"')[1]
break
# create (& save) one if none found
if version == '':
version = datetime.now().strftime('%Y%m%d.%H%M')
lines[i] = '__version__ = "' + version + '"\n'
file.seek(0)
file.writelines(lines)
# return version number
return version
package_name = 'morss'
setup(
name = package_name,
- version = datetime.now().strftime('%Y%m%d.%H%M'),
+ version = get_version(),
description = 'Get full-text RSS feeds',
long_description = open('README.md').read(),
long_description_content_type = 'text/markdown',
@@ -22,8 +44,8 @@ setup(
packages = [package_name],
install_requires = ['lxml', 'bs4', 'python-dateutil', 'chardet'],
extras_require = {
- 'full': ['pymysql', 'redis', 'diskcache', 'gunicorn', 'setproctitle'],
- 'dev': ['pylint']
+ 'full': ['redis', 'diskcache', 'gunicorn', 'setproctitle'],
+ 'dev': ['pylint', 'pyenchant', 'pytest', 'pytest-cov'],
},
python_requires = '>=2.7',
package_data = {package_name: ['feedify.ini']},
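The intent of `get_version()` above, in one hedged example: the first build stamps a date-based number into `morss/__init__.py` (which now ships the empty `__version__ = ""` placeholder shown earlier), so the installed package can report the same version afterwards.

```python
import morss

# After `python3 -m build` (or an install from a built sdist), setup.py has
# replaced the empty placeholder with a date-based number such as "20230627.2236";
# the exact value depends on when the package was first built.
print(morss.__version__)
```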

60
tests/conftest.py Normal file
View File

@@ -0,0 +1,60 @@
import os
import os.path
import threading
import pytest
try:
# python2
from BaseHTTPServer import BaseHTTPRequestHandler, HTTPServer
from SimpleHTTPServer import SimpleHTTPRequestHandler
except:
# python3
from http.server import (BaseHTTPRequestHandler, HTTPServer,
SimpleHTTPRequestHandler)
class HTTPReplayHandler(SimpleHTTPRequestHandler):
" Serves pages saved alongside with headers. See `curl --http1.1 -is http://...` "
directory = os.path.join(os.path.dirname(__file__), './samples/')
__init__ = BaseHTTPRequestHandler.__init__
def do_GET(self):
path = self.translate_path(self.path)
if os.path.isdir(path):
f = self.list_directory(path)
else:
f = open(path, 'rb')
try:
self.copyfile(f, self.wfile)
finally:
f.close()
class MuteHTTPServer(HTTPServer):
def handle_error(self, request, client_address):
# mute errors
pass
def make_server(port=8888):
print('Serving http://localhost:%s/' % port)
return MuteHTTPServer(('', port), RequestHandlerClass=HTTPReplayHandler)
@pytest.fixture
def replay_server():
httpd = make_server()
thread = threading.Thread(target=httpd.serve_forever)
thread.start()
yield
httpd.shutdown()
thread.join()
if __name__ == '__main__':
httpd = make_server()
httpd.serve_forever()
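A short sketch of how the `replay_server` fixture is consumed by the test modules further down: each file under `tests/samples/` is a raw HTTP response (status line, headers, blank line, body) served back verbatim on port 8888.

```python
from morss.crawler import get

def test_replayed_sample(replay_server):
    # 200-ok.txt (shown below) is served with its recorded headers and body
    assert get('http://localhost:8888/200-ok.txt') == b'success\r\n'
```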

4
tests/samples/200-ok.txt Normal file
View File

@@ -0,0 +1,4 @@
HTTP/1.1 200 OK
content-type: text/plain
success

View File

@@ -0,0 +1,3 @@
HTTP/1.1 301 Moved Permanently
location: /200-ok.txt

View File

@@ -0,0 +1,3 @@
HTTP/1.1 301 Moved Permanently
location: ./200-ok.txt

View File

@@ -0,0 +1,3 @@
HTTP/1.1 301 Moved Permanently
location: http://localhost:8888/200-ok.txt

View File

@@ -0,0 +1,4 @@
HTTP/1.1 308 Permanent Redirect
location: /200-ok.txt
/200-ok.txt

View File

@@ -0,0 +1,8 @@
HTTP/1.1 200 OK
content-type: text/html; charset=UTF-8
<!DOCTYPE html>
<html>
<head><link rel="alternate" type="application/rss+xml" href="/200-ok.txt" /></head>
<body>meta redirect</body>
</html>

View File

@@ -0,0 +1,4 @@
HTTP/1.1 200 OK
content-type: text/plain; charset=gb2312
成功

View File

@@ -0,0 +1,10 @@
HTTP/1.1 200 OK
content-type: text/html
<!DOCTYPE html>
<html>
<head><meta charset="gb2312"/></head>
<body>
成功
</body></html>

View File

@@ -0,0 +1,4 @@
HTTP/1.1 200 OK
content-type: text/plain; charset=iso-8859-1
succès

View File

@@ -0,0 +1,4 @@
HTTP/1.1 200 OK
content-type: text/plain
succès

View File

@@ -0,0 +1,4 @@
HTTP/1.1 200 OK
content-type: text/plain; charset=UTF-8
succès

View File

@@ -0,0 +1,16 @@
HTTP/1.1 200 OK
Content-Type: text/xml; charset=utf-8
<?xml version='1.0' encoding='utf-8'?>
<feed xmlns="http://www.w3.org/2005/Atom">
<title>!TITLE!</title>
<subtitle>!DESC!</subtitle>
<entry>
<title>!ITEM_TITLE!</title>
<summary>!ITEM_DESC!</summary>
<content type="html">!ITEM_CONTENT!</content>
<link href="!ITEM_LINK!"/>
<updated>2022-01-01T00:00:01+01:00</updated>
<published>2022-01-01T00:00:02+01:00</published>
</entry>
</feed>

View File

@@ -0,0 +1,15 @@
HTTP/1.1 200 OK
content-type: application/xml
<?xml version='1.0' encoding='utf-8' ?>
<feed version='0.3' xmlns='http://purl.org/atom/ns#'>
<title>!TITLE!</title>
<subtitle>!DESC!</subtitle>
<entry>
<title>!ITEM_TITLE!</title>
<link rel='alternate' type='text/html' href='!ITEM_LINK!' />
<summary>!ITEM_DESC!</summary>
<content>!ITEM_CONTENT!</content>
<issued>2022-01-01T00:00:01+01:00</issued> <!-- FIXME -->
</entry>
</feed>

View File

@@ -0,0 +1,22 @@
HTTP/1.1 200 OK
Content-Type: text/html; charset=utf-8
<html>
<head></head>
<body>
<div id="header">
<h1>!TITLE!</h1>
<p>!DESC!</p>
</div>
<div id="content">
<div class="item">
<a target="_blank" href="!ITEM_LINK!">!ITEM_TITLE!</a>
<div class="desc">!ITEM_DESC!</div>
<div class="content">!ITEM_CONTENT!</div>
</div>
</div>
</body>
</html>

View File

@@ -0,0 +1,16 @@
HTTP/1.1 200 OK
Content-Type: application/json; charset=utf-8
{
"title": "!TITLE!",
"desc": "!DESC!",
"items": [
{
"title": "!ITEM_TITLE!",
"time": "2022-01-01T00:00:01+0100",
"url": "!ITEM_LINK!",
"desc": "!ITEM_DESC!",
"content": "!ITEM_CONTENT!"
}
]
}

View File

@@ -0,0 +1,17 @@
HTTP/1.1 200 OK
Content-Type: text/xml; charset=utf-8
<?xml version='1.0' encoding='utf-8'?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" version="2.0">
<channel>
<title>!TITLE!</title>
<description>!DESC!</description>
<item>
<title>!ITEM_TITLE!</title>
<pubDate>Mon, 01 Jan 2022 00:00:01 +0100</pubDate>
<link>!ITEM_LINK!</link>
<description>!ITEM_DESC!</description>
<content:encoded>!ITEM_CONTENT!</content:encoded>
</item>
</channel>
</rss>

BIN
tests/samples/gzip.txt Normal file

Binary file not shown.

View File

@@ -0,0 +1,3 @@
HTTP/1.1 200 OK
refresh: 0;url=/200-ok.txt

View File

@@ -0,0 +1,8 @@
HTTP/1.1 200 OK
content-type: text/html; charset=UTF-8
<!DOCTYPE html>
<html>
<head><meta http-equiv="refresh" content="2; url = /200-ok.txt" /></head>
<body>meta redirect</body>
</html>

View File

@@ -0,0 +1,8 @@
HTTP/1.1 200 OK
content-type: text/html; charset=UTF-8
<!DOCTYPE html>
<html>
<head><meta http-equiv="refresh" content="2; url = ./200-ok.txt" /></head>
<body>meta redirect</body>
</html>

View File

@@ -0,0 +1,8 @@
HTTP/1.1 200 OK
content-type: text/html; charset=UTF-8
<!DOCTYPE html>
<html>
<head><meta http-equiv="refresh" content="2; url = http://localhost:8888/200-ok.txt" /></head>
<body>meta redirect</body>
</html>

9220
tests/samples/size-1MiB.txt Normal file

File diff suppressed because it is too large

62
tests/test_crawler.py Normal file
View File

@@ -0,0 +1,62 @@
import pytest
from morss.crawler import *
def test_get(replay_server):
assert get('http://localhost:8888/200-ok.txt') == b'success\r\n'
def test_adv_get(replay_server):
assert adv_get('http://localhost:8888/200-ok.txt')['data'] == b'success\r\n'
@pytest.mark.parametrize('before,after', [
(b'http://localhost:8888/', 'http://localhost:8888/'),
('localhost:8888/', 'http://localhost:8888/'),
('http:/localhost:8888/', 'http://localhost:8888/'),
('http://localhost:8888/&/', 'http://localhost:8888/&/'),
('http://localhost:8888/ /', 'http://localhost:8888/%20/'),
('http://localhost-€/€/', 'http://xn--localhost--077e/%E2%82%AC/'),
('http://localhost-€:8888/€/', 'http://xn--localhost--077e:8888/%E2%82%AC/'),
])
def test_sanitize_url(before, after):
assert sanitize_url(before) == after
@pytest.mark.parametrize('opener', [custom_opener(), build_opener(SizeLimitHandler(500*1024))])
def test_size_limit_handler(replay_server, opener):
assert len(opener.open('http://localhost:8888/size-1MiB.txt').read()) == 500*1024
@pytest.mark.parametrize('opener', [custom_opener(), build_opener(GZIPHandler())])
def test_gzip_handler(replay_server, opener):
assert opener.open('http://localhost:8888/gzip.txt').read() == b'success\n'
@pytest.mark.parametrize('opener', [custom_opener(), build_opener(EncodingFixHandler())])
@pytest.mark.parametrize('url', [
'enc-gb2312-header.txt', 'enc-gb2312-meta.txt', #'enc-gb2312-missing.txt',
'enc-iso-8859-1-header.txt', 'enc-iso-8859-1-missing.txt',
'enc-utf-8-header.txt',
])
def test_encoding_fix_handler(replay_server, opener, url):
out = adv_get('http://localhost:8888/%s' % url)
out = out['data'].decode(out['encoding'])
assert 'succes' in out or 'succès' in out or '成功' in out
@pytest.mark.parametrize('opener', [custom_opener(follow='rss'), build_opener(AlternateHandler(MIMETYPE['rss']))])
def test_alternate_handler(replay_server, opener):
assert opener.open('http://localhost:8888/alternate-abs.txt').geturl() == 'http://localhost:8888/200-ok.txt'
@pytest.mark.parametrize('opener', [custom_opener(), build_opener(HTTPEquivHandler(), HTTPRefreshHandler())])
def test_http_equiv_handler(replay_server, opener):
assert opener.open('http://localhost:8888/meta-redirect-abs.txt').geturl() == 'http://localhost:8888/200-ok.txt'
assert opener.open('http://localhost:8888/meta-redirect-rel.txt').geturl() == 'http://localhost:8888/200-ok.txt'
assert opener.open('http://localhost:8888/meta-redirect-url.txt').geturl() == 'http://localhost:8888/200-ok.txt'
@pytest.mark.parametrize('opener', [custom_opener(), build_opener(HTTPAllRedirectHandler())])
def test_http_all_redirect_handler(replay_server, opener):
assert opener.open('http://localhost:8888/308-redirect.txt').geturl() == 'http://localhost:8888/200-ok.txt'
assert opener.open('http://localhost:8888/301-redirect-abs.txt').geturl() == 'http://localhost:8888/200-ok.txt'
assert opener.open('http://localhost:8888/301-redirect-rel.txt').geturl() == 'http://localhost:8888/200-ok.txt'
assert opener.open('http://localhost:8888/301-redirect-url.txt').geturl() == 'http://localhost:8888/200-ok.txt'
@pytest.mark.parametrize('opener', [custom_opener(), build_opener(HTTPRefreshHandler())])
def test_http_refresh_handler(replay_server, opener):
assert opener.open('http://localhost:8888/header-refresh.txt').geturl() == 'http://localhost:8888/200-ok.txt'

108
tests/test_feeds.py Normal file
View File

@@ -0,0 +1,108 @@
import pytest
from morss.crawler import adv_get
from morss.feeds import *
def get_feed(url):
url = 'http://localhost:8888/%s' % url
out = adv_get(url)
feed = parse(out['data'], url=url, encoding=out['encoding'])
return feed
def check_feed(feed):
# NB. time and updated not covered
assert feed.title == '!TITLE!'
assert feed.desc == '!DESC!'
assert feed[0] == feed.items[0]
assert feed[0].title == '!ITEM_TITLE!'
assert feed[0].link == '!ITEM_LINK!'
assert '!ITEM_DESC!' in feed[0].desc # broader test due to possible inclusion of surrounding <div> in xml
assert '!ITEM_CONTENT!' in feed[0].content
def check_output(feed):
output = feed.tostring()
assert '!TITLE!' in output
assert '!DESC!' in output
assert '!ITEM_TITLE!' in output
assert '!ITEM_LINK!' in output
assert '!ITEM_DESC!' in output
assert '!ITEM_CONTENT!' in output
def check_change(feed):
feed.title = '!TITLE2!'
feed.desc = '!DESC2!'
feed[0].title = '!ITEM_TITLE2!'
feed[0].link = '!ITEM_LINK2!'
feed[0].desc = '!ITEM_DESC2!'
feed[0].content = '!ITEM_CONTENT2!'
assert feed.title == '!TITLE2!'
assert feed.desc == '!DESC2!'
assert feed[0].title == '!ITEM_TITLE2!'
assert feed[0].link == '!ITEM_LINK2!'
assert '!ITEM_DESC2!' in feed[0].desc
assert '!ITEM_CONTENT2!' in feed[0].content
def check_add(feed):
feed.append({
'title': '!ITEM_TITLE3!',
'link': '!ITEM_LINK3!',
'desc': '!ITEM_DESC3!',
'content': '!ITEM_CONTENT3!',
})
assert feed[1].title == '!ITEM_TITLE3!'
assert feed[1].link == '!ITEM_LINK3!'
assert '!ITEM_DESC3!' in feed[1].desc
assert '!ITEM_CONTENT3!' in feed[1].content
each_format = pytest.mark.parametrize('url', [
'feed-rss-channel-utf-8.txt', 'feed-atom-utf-8.txt',
'feed-atom03-utf-8.txt', 'feed-json-utf-8.txt', 'feed-html-utf-8.txt',
])
each_check = pytest.mark.parametrize('check', [
check_feed, check_output, check_change, check_add,
])
@each_format
@each_check
def test_parse(replay_server, url, check):
feed = get_feed(url)
check(feed)
@each_format
@each_check
def test_convert_rss(replay_server, url, check):
feed = get_feed(url)
feed = feed.convert(FeedXML)
check(feed)
@each_format
@each_check
def test_convert_json(replay_server, url, check):
feed = get_feed(url)
feed = feed.convert(FeedJSON)
check(feed)
@each_format
@each_check
def test_convert_html(replay_server, url, check):
feed = get_feed(url)
feed = feed.convert(FeedHTML)
if len(feed) > 1:
# remove the 'blank' default html item
del feed[0]
check(feed)
@each_format
def test_convert_csv(replay_server, url):
# only csv output, not csv feed, check is therefore different
feed = get_feed(url)
output = feed.tocsv()
assert '!ITEM_TITLE!' in output
assert '!ITEM_LINK!' in output
assert '!ITEM_DESC!' in output
assert '!ITEM_CONTENT!' in output