# Morss - Get full-text RSS feeds

_GNU AGPLv3 code_

Upstream source code: https://git.pictuga.com/pictuga/morss
GitHub mirror (for Issues & Pull requests): https://github.com/pictuga/morss
Homepage: https://morss.it/

This tool's goal is to get full-text RSS feeds out of stripped-down RSS feeds,
as commonly available on the internet. Indeed, most newspapers only make a
short description available to users in their RSS feeds, which makes the feeds
rather useless. This tool intends to fix that problem.
This tool opens the links from the RSS feed, then downloads the full article
from the newspaper's website and puts it back into the feed.
Morss also provides additional features, such as .csv and JSON export, and
extended control over the output. A strength of morss is its ability to deal
with broken feeds, and to replace tracking links with direct links to the
actual content. Morss can also generate feeds from HTML and JSON files (see
`feedify.py`), which for instance makes it possible to get feeds for Facebook
or Twitter, using hand-written rules (i.e. there is no automatic detection of
links to build feeds). Please mind that feeds based on HTML files may stop
working unexpectedly, due to HTML structure changes on the target website.

Additionally, morss can detect RSS feeds advertised in an HTML page's `<meta>`
tags.
You can use this program online for free at **[morss.it](https://morss.it/)**.
Some features of morss:
Some features of morss:
- Read RSS/Atom feeds
- Create RSS feeds from json/html pages
- Export feeds as RSS/JSON/CSV/HTML
- Fetch full-text content of feed items
- Follow 301/meta redirects
- Recover XML feeds with corrupt encoding
- Support gzip-compressed HTTP content
- HTTP caching with 3 different backends (in-memory/sqlite/mysql)
- Work as a server or a CLI tool
- Deobfuscate various tracking links
## Dependencies
You do need:

- [python](http://www.python.org/) >= 2.6 (python 3 is supported)
- [lxml](http://lxml.de/) for XML parsing
- [bs4](https://pypi.org/project/bs4/) for badly-formatted HTML pages
- [dateutil](http://labix.org/python-dateutil) to parse feed dates
- [chardet](https://pypi.python.org/pypi/chardet) to detect character encodings
- [six](https://pypi.python.org/pypi/six), a dependency of chardet
- pymysql, for the MySQL cache backend
Simplest way to get these:

```shell
pip install git+https://git.pictuga.com/pictuga/morss.git@master
```
The dependency `lxml` takes fairly long to install (especially on a Raspberry
Pi, as C code needs to be compiled). If possible on your distribution, try
installing it with the system package manager.
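For instance, on Debian/Ubuntu-based systems (the package name is an assumption
here; it varies per distribution):

```shell
# Install lxml as a pre-built system package instead of compiling it via pip
# (Debian/Ubuntu package name; other distributions use different names)
sudo apt-get install python3-lxml
```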
You may also need:
- Apache, with python-cgi support, to run on a server
- a fast internet connection
## Arguments
morss accepts some arguments to lightly alter its output. Some arguments
require a value (usually a string or a number). The "Use cases" sections below
detail how to pass those arguments to morss.

The list of arguments can be obtained by running `morss --help`:

```
usage: morss [-h] [--format {rss,json,html,csv}] [--search STRING] [--clip] [--indent] [--cache] [--force] [--proxy] [--newest] [--firstlink] [--items XPATH] [--item_link XPATH]
[--item_title XPATH] [--item_content XPATH] [--item_time XPATH] [--nolink] [--noref] [--debug]
url
Get full-text RSS feeds
positional arguments:
url feed url
optional arguments:
-h, --help show this help message and exit
output:
--format {rss,json,html,csv}
output format
--search STRING does a basic case-sensitive search in the feed
--clip stick the full article content under the original feed content (useful for twitter)
--indent returns indented XML or JSON, takes more place, but human-readable
action:
--cache only take articles from the cache (ie. don't grab new articles' content), so as to save time
--force force refetch the rss feed and articles
--proxy doesn't fill the articles
  --newest return the feed items in chronological order (morss otherwise shows the items by appearing order)
--firstlink pull the first article mentioned in the description instead of the default link
custom feeds:
--items XPATH (mandatory to activate the custom feeds function) xpath rule to match all the RSS entries
--item_link XPATH xpath rule relative to items to point to the entry's link
--item_title XPATH entry's title
--item_content XPATH entry's content
--item_time XPATH entry's date & time (accepts a wide range of time formats)
misc:
--nolink drop links, but keeps links' inner text
--noref drop items' link
  --silent don't output the final RSS (useless on its own, but can be nice when debugging)
GNU AGPLv3 code
```
Further options:
- Change what morss does
- Environment variable `DEBUG=`: gives some feedback from the script execution. Useful for debugging. On Apache, it can be set via the `SetEnv` instruction
- `silent`: don't output the final RSS (useless on its own, but can be nice when debugging)
- `callback=NAME`: for JSONP calls
- `cors`: allow Cross-Origin Resource Sharing (allows XHR calls from other servers)
- `txt`: changes the HTTP content-type to txt (for faster "`view-source:`")
## Use cases
morss will auto-detect what "mode" to use.
### Running on a server
#### Via mod_cgi/FastCGI with Apache/nginx
For this, you'll want to change the file layout a bit, for example into
something like this:

```
/
├── cgi
│   │
│   ├── main.py
│   ├── morss
│   │   ├── __init__.py
│   │   ├── __main__.py
│   │   ├── morss.py
│   │   └── ...
│   │
│   ├── dateutil
│   └── ...
│
├── .htaccess
├── index.html
└── ...
```
For this, you need to make sure your host allows python script execution. This
method uses HTTP calls to fetch the RSS feeds, which will be handled through
`mod_cgi`, for example, on Apache servers.

Please pay attention to `main.py`'s permissions so that it is executable. Also
ensure that the provided `/www/.htaccess` works well with your server.
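As a rough sketch, the kind of Apache directives involved looks like the
following (hypothetical snippet; the provided `.htaccess` may already contain
an equivalent, server-specific version):

```apache
# Hypothetical sketch: allow CGI execution of the python entry point
Options +ExecCGI
AddHandler cgi-script .py
```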
#### Using uWSGI

Running this command should do:

```shell
uwsgi --http :8080 --plugin python --wsgi-file main.py
```
2020-04-15 20:31:21 +00:00
#### Using Gunicorn
```shell
gunicorn morss:cgi_standalone_app
```
2020-04-18 15:04:44 +00:00
#### Using docker

Build & run

```shell
docker build https://git.pictuga.com/pictuga/morss.git -t morss
docker run -p 8080:8080 morss
```

In one line

```shell
docker run -p 8080:8080 $(docker build -q https://git.pictuga.com/pictuga/morss.git)
```
With docker-compose:
```yml
services:
app:
build: https://git.pictuga.com/pictuga/morss.git
ports:
- '8080:8080'
```
Then run
```shell
docker-compose up --build
```
#### Using morss' internal HTTP server

Morss can run its own HTTP server. It should start when you run morss without
any argument, listening on port 8080.

```shell
morss
```

You can change the port by passing it as an argument: `morss 9000`.
#### Passing arguments

Then visit:

```
http://PATH/TO/MORSS/[main.py/][:argwithoutvalue[:argwithvalue=value[...]]]/FEEDURL
```

For example: `http://morss.example/:clip/https://twitter.com/pictuga`

*(Brackets indicate optional text)*

The `main.py` part is only needed if your server doesn't support the Apache
redirect rule set in the provided `.htaccess`.
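The URL scheme above can be sketched with a small helper (a hypothetical
illustration, not part of morss itself):

```python
def morss_url(base, feed_url, *flags, **params):
    """Build a morss request URL of the form base/:flag:key=value/FEEDURL.

    Hypothetical helper for illustration only; not part of morss.
    """
    args = list(flags) + ['%s=%s' % (k, v) for k, v in params.items()]
    path = (':' + ':'.join(args) + '/') if args else ''
    return '%s/%s%s' % (base.rstrip('/'), path, feed_url)

# Reproduces the example above:
print(morss_url('http://morss.example', 'https://twitter.com/pictuga', 'clip'))
# -> http://morss.example/:clip/https://twitter.com/pictuga
```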
Works like a charm with [Tiny Tiny RSS](http://tt-rss.org/redmine/projects/tt-rss/wiki), and most probably other clients.
### As a CLI application

Run:

```shell
morss [--argwithoutvalue] [--argwithvalue=value] [...] FEEDURL
```

For example: `morss --debug http://feeds.bbci.co.uk/news/rss.xml`

*(Brackets indicate optional text)*
### As a newsreader hook

To use it, the newsreader [Liferea](http://lzone.de/liferea/) is required
(unless other newsreaders provide the same kind of feature), since custom
scripts can be run on top of the RSS feed, using their
[output](http://lzone.de/liferea/scraping.htm) as an RSS feed.

To use this script, enable "(Unix) command" in Liferea's feed settings, and
use the command:

```shell
morss [argwithoutvalue] [argwithvalue=value] [...] FEEDURL
```

For example: `morss http://feeds.bbci.co.uk/news/rss.xml`

*(Brackets indicate optional text)*
### As a python library
Quickly get a full-text feed:

```python
>>> import morss
>>> xml_string = morss.process('http://feeds.bbci.co.uk/news/rss.xml')
>>> xml_string[:50]
"<?xml version='1.0' encoding='UTF-8'?>\n<?xml-style"
```
Using cache and passing arguments:

```python
>>> import morss
>>> url = 'http://feeds.bbci.co.uk/news/rss.xml'
>>> cache = '/tmp/morss-cache.db' # sqlite cache location
>>> options = {'csv':True}
>>> xml_string = morss.process(url, cache, options)
>>> xml_string[:50]
'{"title": "BBC News - Home", "desc": "The latest s'
```
`morss.process` is actually a wrapper around simpler functions. It's still
possible to call those functions directly, to have more control over what's
happening under the hood.

Doing it step-by-step:
```python
import morss, morss.crawler

url = 'http://newspaper.example/feed.xml'
options = morss.Options(csv=True) # arguments
morss.crawler.sqlite_default = '/tmp/morss-cache.db' # sqlite cache location

url, rss = morss.FeedFetch(url, options) # this only grabs the RSS feed
rss = morss.FeedGather(rss, url, options) # this fills the feed and cleans it up

output = morss.FeedFormat(rss, options, 'unicode') # formats final feed
```
## Cache information
morss uses caching to make loading faster. There are 3 possible cache backends
(visible in `morss/crawler.py`):

- `{}`: a simple python in-memory `dict()` object
- `SQLiteCache`: sqlite3 cache. The default file location is in-memory (i.e. it
will be cleared every time the program is run)
- `MySQLCacheHandler`: MySQL-based cache (requires pymysql)
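To illustrate the idea, here is a simplified sketch (not morss's actual
implementation): a cache backend only needs dict-like item access, which is why
a plain `{}` works for the in-memory case.

```python
import sqlite3

class SQLiteCacheSketch:
    """Dict-like HTTP cache backed by sqlite3 (illustrative sketch only;
    morss's real SQLiteCache in morss/crawler.py stores more fields)."""

    def __init__(self, path=':memory:'):
        # ':memory:' mirrors the default: the cache vanishes when the program exits
        self.con = sqlite3.connect(path)
        self.con.execute('CREATE TABLE IF NOT EXISTS data (url TEXT PRIMARY KEY, body TEXT)')

    def __setitem__(self, url, body):
        self.con.execute('INSERT OR REPLACE INTO data VALUES (?, ?)', (url, body))
        self.con.commit()

    def __getitem__(self, url):
        row = self.con.execute('SELECT body FROM data WHERE url = ?', (url,)).fetchone()
        if row is None:
            raise KeyError(url)
        return row[0]

cache = SQLiteCacheSketch()  # or a plain {} for the in-memory backend
cache['http://example.com/feed'] = '<rss>...</rss>'
```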
## Configuration
### Length limitation
When parsing long feeds with a lot of items (100+), morss might take a long
time to parse them, or might even run into a memory overflow on some shared
hosting plans (limits around 10MB), in which case you might want to adjust the
different values at the top of the script.

- `MAX_TIME` sets the maximum amount of time spent *fetching* articles; more time might be spent taking older articles from cache. `-1` for unlimited.
- `MAX_ITEM` sets the maximum number of articles to fetch. `-1` for unlimited. Further articles will be taken from cache, following the next settings.
- `LIM_TIME` sets the maximum amount of time spent working on the feed (whether or not it's already cached). Articles beyond that limit will be dropped from the feed. `-1` for unlimited.
- `LIM_ITEM` sets the maximum number of articles checked, limiting both the number of articles fetched and taken from cache. Articles beyond that limit will be dropped from the feed, even if they're cached. `-1` for unlimited.
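To make the interplay concrete, here is a rough sketch (not morss's actual
code, and `fill_feed` is a made-up name) of how such limits could decide which
items get fetched:

```python
import time

MAX_ITEM = 5   # fetch at most 5 articles from the network; -1 for unlimited
MAX_TIME = 2   # stop fetching after 2 seconds; -1 for unlimited

def fill_feed(items, fetch_full_text):
    """Illustrative sketch: items beyond the limits are kept as-is
    (i.e. served from cache or left untouched) instead of being fetched."""
    start = time.time()
    out = []
    for i, item in enumerate(items):
        over_time = MAX_TIME >= 0 and time.time() - start > MAX_TIME
        over_count = MAX_ITEM >= 0 and i >= MAX_ITEM
        if over_time or over_count:
            out.append(item)                   # don't fetch: keep stub/cached version
        else:
            out.append(fetch_full_text(item))  # fetch the article's full text
    return out
```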
### Other settings
- `DELAY` sets the browser cache delay, only for HTTP clients
- `TIMEOUT` sets the HTTP timeout when fetching RSS feeds and articles
### Content matching
The content of articles is grabbed with our own readability fork. This means
that most of the time the right content is matched. However, it sometimes
fails, and some tweaking is then required. Most of the time, what has to be
done is to add some "rules" in the main script file of *readability* (not in
morss).
2015-08-29 10:45:36 +00:00
Most of the time when hardly nothing is matched, it means that the main content
of the article is made of images, videos, pictures, etc., which readability
doesn't detect. Also, readability has some trouble to match content of very
small articles.
morss will also try to figure out whether the full content is already in place
(for those websites which understood the whole point of RSS feeds). However,
this detection is very simple, and only works if the actual content is put in
the "content" section of the feed and not in the "summary" section.
***
## Todo
You can contribute to this project. If you're not sure what to do, you can pick
from this list:
- Add ability to run morss.py as an update daemon
- Add ability to use custom xpath rule instead of readability
- More ideas here: <https://github.com/pictuga/morss/issues/15>