pictuga
|
b525ab0d26
|
crawler: fix typo
|
2020-10-30 22:12:43 +01:00 |
pictuga
|
bd0bca69fc
|
crawler: ignore ssl via env var
|
2020-10-03 19:57:08 +02:00 |
pictuga
|
8abd951d40
|
More sensible default values for cache autotrim (1k entries, 1min)
|
2020-10-03 19:55:57 +02:00 |
pictuga
|
2514fabd38
|
Replace memory-leak-prone Uniq with @uniq_wrapper
|
2020-10-03 19:43:55 +02:00 |
pictuga
|
8cb7002fe6
|
feeds: make it possible to append empty items
And return the newly appended items, to make it easy to edit them
|
2020-10-03 16:56:07 +02:00 |
pictuga
|
6966e03bef
|
Clean up itemClass code
To avoid globals()
|
2020-10-03 16:25:29 +02:00 |
pictuga
|
9ce6acba20
|
Fix gunicorn related typo
|
2020-10-01 00:07:41 +02:00 |
pictuga
|
056a1b143f
|
crawler: autotrim: make ctrl+c work
|
2020-10-01 00:04:36 +02:00 |
pictuga
|
eed949736a
|
crawler: add ability to limit cache size
|
2020-09-30 23:59:55 +02:00 |
pictuga
|
2fc7cd391c
|
Shift __main__'s wsgi code where it belongs
|
2020-09-30 23:24:51 +02:00 |
pictuga
|
d9f46b23a6
|
crawler: default value for MYSQL_HOST (localhost)
|
2020-09-30 13:17:02 +02:00 |
pictuga
|
bbada0436a
|
Quick guide to ignore SSL certs
|
2020-09-27 16:48:22 +02:00 |
pictuga
|
039a672f4e
|
wsgi: clean up url reconstruction
|
2020-09-27 16:28:26 +02:00 |
pictuga
|
9ecf856f10
|
Add :resolve to remove (some?) tracking links
|
2020-09-15 22:57:52 +02:00 |
pictuga
|
56e0c2391d
|
Missing import for served files
|
2020-08-28 20:53:03 +02:00 |
pictuga
|
679f406a12
|
Default mimetype for served files
|
2020-08-28 20:52:43 +02:00 |
pictuga
|
f6d641eeef
|
Serve any file in www/
Also fixes #41
|
2020-08-28 20:45:39 +02:00 |
pictuga
|
2456dd9bbc
|
Fix broken pieces
Including #43
|
2020-08-28 19:38:48 +02:00 |
pictuga
|
0f33db248a
|
Add license info in each file
|
2020-08-26 20:08:22 +02:00 |
pictuga
|
75935114e4
|
Remove leftover code
|
2020-08-23 19:07:12 +02:00 |
pictuga
|
aa9143302b
|
Remove now-unused isInt code
|
2020-08-23 18:51:09 +02:00 |
pictuga
|
0d62a7625b
|
Define http port via env vars as well
|
2020-08-23 18:50:18 +02:00 |
pictuga
|
bd0efb1529
|
crawler: missing os import
|
2020-08-23 18:45:44 +02:00 |
pictuga
|
47a17614ef
|
Rename morss/cgi.py into morss/wsgi.py
To avoid name collision with the built-in cgi lib
|
2020-08-23 18:44:49 +02:00 |
pictuga
|
4dfebe78f7
|
Pick caching backend via env vars
|
2020-08-23 18:43:18 +02:00 |
pictuga
|
dcd3e4a675
|
cgi.py: add missing imports
|
2020-08-23 18:31:05 +02:00 |
pictuga
|
e968b2ea7f
|
Remove leftover :debug code
|
2020-08-23 16:59:34 +02:00 |
pictuga
|
0ac590c798
|
Set MAX_/LIM_* settings via env var
|
2020-08-23 16:09:58 +02:00 |
pictuga
|
f65fb45030
|
:debug completely deprecated in favour of DEBUG=
|
2020-08-23 14:33:32 +02:00 |
pictuga
|
6dd40e5cc4
|
cli.py: fix Options code
|
2020-08-23 14:25:09 +02:00 |
pictuga
|
0acfce5a22
|
cli.py: remove log
|
2020-08-23 14:24:57 +02:00 |
pictuga
|
97ccc15db0
|
cgi.py: rename parseOptions to parse_options
|
2020-08-23 14:24:23 +02:00 |
pictuga
|
7a560181f7
|
Use env var for DEBUG
|
2020-08-23 14:23:45 +02:00 |
pictuga
|
baccd3b22b
|
Move parseOptions to cgi.py
As it is no longer used in cli.py
|
2020-08-22 00:37:34 +02:00 |
pictuga
|
f79938ab11
|
Add :silent to readme & argparse
|
2020-08-22 00:02:08 +02:00 |
pictuga
|
5b8bd47829
|
cli.py: remove draft code
|
2020-08-21 23:59:12 +02:00 |
pictuga
|
b5b355aa6e
|
readabilite: increase penalty for high link density
|
2020-08-21 23:55:04 +02:00 |
pictuga
|
bd182bcb85
|
Move cli code to argParse
Related code changes (incl. :format=xyz)
|
2020-08-21 23:52:56 +02:00 |
pictuga
|
c7c2c5d749
|
Removed unused filterOptions code
|
2020-08-21 23:23:33 +02:00 |
pictuga
|
c6b52e625f
|
split morss.py into __main__/cgi/cli.py
Should hopefully allow cleaner code in the future
|
2020-08-21 22:17:55 +02:00 |
pictuga
|
c6d3a0eb53
|
readabilite: clean up code
|
2020-07-15 00:49:34 +02:00 |
pictuga
|
6021b912ff
|
morss: fix item removal
Usual issue when editing a list while looping over it
|
2020-07-06 19:25:48 +02:00 |
pictuga
|
f18a128ee6
|
Change :first for :newest
i.e. toggle default for the more-obvious option
|
2020-07-06 19:25:17 +02:00 |
pictuga
|
64af86c11e
|
crawler: catch html parsing errors
|
2020-07-06 12:25:38 +02:00 |
pictuga
|
15951d228c
|
Add :first to NOT sort items by date
|
2020-07-06 11:39:08 +02:00 |
pictuga
|
c1b1f5f58a
|
morss: restrict iframe use from :get to avoid abuse
|
2020-06-09 12:33:37 +02:00 |
pictuga
|
985185f47f
|
morss: more flexible feed creator auto-detection
|
2020-06-08 13:03:24 +02:00 |
pictuga
|
3190d1ec5a
|
feeds: remove useless if(len) before loop
|
2020-06-02 13:57:45 +02:00 |
pictuga
|
ce4cf01aa6
|
crawler: clean up encoding detection code
|
2020-05-27 21:35:24 +02:00 |
pictuga
|
dcfdb75a15
|
crawler: fix Chinese encoding support
|
2020-05-27 21:34:43 +02:00 |
pictuga
|
4ccc0dafcd
|
Basic help for sub-lib interactive use
|
2020-05-26 19:34:20 +02:00 |
pictuga
|
2fe3e0b8ee
|
feeds: clean up other stylesheets before putting ours
|
2020-05-26 19:26:36 +02:00 |
pictuga
|
68c46a1823
|
morss: remove deprecated twitter/fb link handling
|
2020-05-13 12:31:09 +02:00 |
pictuga
|
91be2d229e
|
morss: ability to use first link from desc instead of default link
|
2020-05-13 12:29:53 +02:00 |
pictuga
|
038f267ea2
|
Rename :theforce into :force
|
2020-05-13 11:49:15 +02:00 |
pictuga
|
22005065e8
|
Use etree.tostring 'method' arg
Gives appropriately formatted html code.
Some pages might otherwise be rendered as blank.
|
2020-05-13 11:44:34 +02:00 |
pictuga
|
7d0d416610
|
morss: cache articles for 24hrs
Also make it possible to refetch articles, regardless of cache
|
2020-05-12 21:10:31 +02:00 |
pictuga
|
5dac4c69a1
|
crawler: more code comments
|
2020-05-12 20:44:25 +02:00 |
pictuga
|
36e2a1c3fd
|
crawler: increase size limit from 100KiB to 500
I'm looking at you, worldbankgroup.csod.com/ats/careersite/search.aspx
|
2020-05-12 19:34:16 +02:00 |
pictuga
|
83dd2925d3
|
readabilite: better parsing
Keeping blank_text keeps the tree closer to as-is, making the final output closer to expectations
|
2020-05-12 14:15:53 +02:00 |
pictuga
|
e09d0abf54
|
morss: remove deprecated piece of code
|
2020-05-07 16:05:30 +02:00 |
pictuga
|
ff26a560cb
|
Shift Safari workaround to morss.py
|
2020-05-07 16:04:54 +02:00 |
pictuga
|
f685139137
|
crawler: use UPSERT statements
Avoid potential race conditions
|
2020-05-03 21:27:45 +02:00 |
pictuga
|
73b477665e
|
morss: separate :clip with <hr> instead of stars
|
2020-05-02 19:19:54 +02:00 |
pictuga
|
b425992783
|
morss: don't follow alt=rss with custom feeds
To have the same page as with :get=page and to avoid shitty feeds
|
2020-05-02 19:18:58 +02:00 |
pictuga
|
271ac8f80f
|
crawler: comment code a bit
|
2020-05-02 19:18:01 +02:00 |
pictuga
|
64e41b807d
|
crawler: handle http:/ (single slash)
Fixing one more corner case! malayalam.oneindia.com
|
2020-05-02 19:17:15 +02:00 |
pictuga
|
27a42c47aa
|
morss: use final request url
Code is not very elegant...
|
2020-04-28 22:30:21 +02:00 |
pictuga
|
c27c38f7c7
|
crawler: return dict instead of tuple
|
2020-04-28 22:29:07 +02:00 |
pictuga
|
a1dc96cb50
|
feeds: remove mimetype from function call as no longer used
|
2020-04-28 22:07:25 +02:00 |
pictuga
|
749acc87fc
|
Centralize url clean up in crawler.py
|
2020-04-28 22:03:49 +02:00 |
pictuga
|
cb69e3167f
|
crawler: accept non-ascii urls
Covering one more corner case!
|
2020-04-28 14:47:23 +02:00 |
pictuga
|
c3f06da947
|
morss: process(): specify encoding for clarity
|
2020-04-28 14:45:00 +02:00 |
pictuga
|
44a3e0edc4
|
readabilite: specify in- and out-going encoding
|
2020-04-28 14:44:35 +02:00 |
pictuga
|
818cdaaa9b
|
Make it possible to call sub-libs in non interactive mode
Run `python -m morss.feeds http://lemonde.fr` and so on
|
2020-04-27 18:00:14 +02:00 |
pictuga
|
2806c64326
|
Make it possible to directly run sub-libs (feeds, crawler, readabilite)
Run `python -im morss.feeds http://website.sample/rss.xml` and so on
|
2020-04-27 17:19:31 +02:00 |
pictuga
|
f6bc23927f
|
readabilite: drop dangerous tags (script, style)
|
2020-04-25 12:25:02 +02:00 |
pictuga
|
c86572374e
|
readabilite: minimum score requirement
|
2020-04-25 12:24:36 +02:00 |
pictuga
|
59ef5af9e2
|
feeds: fix bug when deleting attr in html
|
2020-04-24 22:12:05 +02:00 |
pictuga
|
6a0531ca03
|
crawler: randomize user agent
|
2020-04-24 11:28:39 +02:00 |
pictuga
|
8187876a06
|
crawler: stop at first alternative link
Should save a few ms and the first one is usually (?) the most relevant/generic
|
2020-04-23 11:23:45 +02:00 |
pictuga
|
325a373e3e
|
feeds: add SyntaxError catch
|
2020-04-20 16:15:15 +02:00 |
pictuga
|
2719bd6776
|
crawler: fix Chinese encoding
|
2020-04-20 16:14:55 +02:00 |
pictuga
|
ec8edb02f1
|
Various small bug fixes
|
2020-04-19 12:54:02 +02:00 |
pictuga
|
d01b943597
|
Remove leftover threading var
|
2020-04-19 12:51:11 +02:00 |
pictuga
|
b361aa2867
|
Add timeout to :get
|
2020-04-19 12:50:26 +02:00 |
pictuga
|
4ce3c7cb32
|
Small code clean ups
|
2020-04-19 12:50:05 +02:00 |
pictuga
|
7e45b2611d
|
Disable multi-threading
Impact was mostly negative due to locks
|
2020-04-19 12:29:52 +02:00 |
pictuga
|
036e5190f1
|
crawler: remove unused code
|
2020-04-18 21:40:02 +02:00 |
pictuga
|
e99c5b3b71
|
morss: more sensible default MAX/LIM values
|
2020-04-18 17:21:45 +02:00 |
pictuga
|
7375adce33
|
sheet.xsl: fix & improve
|
2020-04-15 23:34:28 +02:00 |
pictuga
|
fe82b19c91
|
Merge .xsl & html template
Turns out they somehow serve a similar purpose
|
2020-04-15 22:30:45 +02:00 |
pictuga
|
0b31e97492
|
morss: remove debug code in http file handler
|
2020-04-14 23:20:03 +02:00 |
pictuga
|
b0ad7c259d
|
Add README & LICENSE to data_files
|
2020-04-14 19:34:12 +02:00 |
pictuga
|
59139272fd
|
Auto-detect the location of www/
Either ../www or /usr/share/morss
Adapted README accordingly
|
2020-04-14 18:07:19 +02:00 |
pictuga
|
e6b7c0eb33
|
Fix app definition for uwsgi
|
2020-04-13 15:30:09 +02:00 |
pictuga
|
67c096ad5b
|
feeds: add fake path to default html parser
Without it, some websites were accidentally matching it (false positives)
|
2020-04-12 13:00:56 +02:00 |
pictuga
|
f018437544
|
crawler: make mysql backend thread safe
|
2020-04-12 12:53:05 +02:00 |
pictuga
|
8e5e8d24a4
|
Timezone fixes
|
2020-04-10 20:33:59 +02:00 |
pictuga
|
ee78a7875a
|
morss: focus on the most recent feed items
|
2020-04-10 16:08:13 +02:00 |