pictuga
f685139137
crawler: use UPSERT statements
...
Avoid potential race conditions
2020-05-03 21:27:45 +02:00
pictuga
271ac8f80f
crawler: comment code a bit
2020-05-02 19:18:01 +02:00
pictuga
64e41b807d
crawler: handle http:/ (single slash)
...
Fixing one more corner case! malayalam.oneindia.com
2020-05-02 19:17:15 +02:00
pictuga
c27c38f7c7
crawler: return dict instead of tuple
2020-04-28 22:29:07 +02:00
pictuga
749acc87fc
Centralize url clean up in crawler.py
2020-04-28 22:03:49 +02:00
pictuga
cb69e3167f
crawler: accept non-ascii urls
...
Covering one more corner case!
2020-04-28 14:47:23 +02:00
pictuga
818cdaaa9b
Make it possible to call sub-libs in non-interactive mode
...
Run `python -m morss.feeds http://lemonde.fr` and so on
2020-04-27 18:00:14 +02:00
pictuga
2806c64326
Make it possible to directly run sub-libs (feeds, crawler, readabilite)
...
Run `python -im morss.feeds http://website.sample/rss.xml` and so on
2020-04-27 17:19:31 +02:00
pictuga
6a0531ca03
crawler: randomize user agent
2020-04-24 11:28:39 +02:00
pictuga
8187876a06
crawler: stop at first alternative link
...
Should save a few ms and the first one is usually (?) the most relevant/generic
2020-04-23 11:23:45 +02:00
pictuga
2719bd6776
crawler: fix Chinese encoding
2020-04-20 16:14:55 +02:00
pictuga
ec8edb02f1
Various small bug fixes
2020-04-19 12:54:02 +02:00
pictuga
4ce3c7cb32
Small code clean ups
2020-04-19 12:50:05 +02:00
pictuga
036e5190f1
crawler: remove unused code
2020-04-18 21:40:02 +02:00
pictuga
f018437544
crawler: make mysql backend thread safe
2020-04-12 12:53:05 +02:00
pictuga
e5a82ff1f4
crawler: drop auto-referer
...
Was solving some issues, but creating even more.
2020-04-07 10:39:21 +02:00
pictuga
7691df5257
Use wrapper for http calls
2020-04-07 10:30:17 +02:00
pictuga
eeac630855
crawler: add more "realistic" headers
2020-04-05 21:11:57 +02:00
pictuga
99461ea185
crawler: fix var name issues (private_cache)
2020-04-05 16:11:36 +02:00
pictuga
bf86c1e962
crawler: make AutoUA match http(s) type
2020-04-05 16:07:51 +02:00
pictuga
d20f6237bd
crawler: replace ContentNegoHandler with AlternateHandler
...
More basic. Sends the same headers no matter what. Makes requests more "replicable".
Also drop "text/xml" from the RSS content types: too broad, matches garbage
2020-04-05 16:05:59 +02:00
pictuga
8a4d68d72c
crawler: drop 'basic' toggle
...
Can't even remember the use case
2020-04-05 16:03:06 +02:00
pictuga
5288cc8796
Clean up unused imports
2020-03-19 15:09:53 +01:00
pictuga
90110a4661
crawler: reduce max file size
2018-10-25 01:15:09 +02:00
pictuga
91a084e5ed
crawler: make py2/3 code distinction clearer
2018-10-25 01:14:46 +02:00
pictuga
945e0dceab
crawler: typo in comment
2018-09-30 21:59:50 +02:00
pictuga
f9217102f3
crawler: fix sqlite/binary issue
2017-11-25 19:58:14 +01:00
pictuga
21480f90de
Move from gzip to zlib to decompress data
...
Faster on incomplete files
2017-11-25 19:57:41 +01:00
pictuga
d091e74d56
crawler: add MySQL backend
...
With extra dependency
2017-11-04 14:51:41 +01:00
pictuga
f29a107a09
crawler: make SQLiteCache inherit from BaseCache
...
Saves some time for other cache backends
2017-11-04 14:48:00 +01:00
pictuga
b7db78f631
crawler: use BLOB in sqlite and drop "buffer"
...
Can't really remember why "buffer" was introduced in the first place
2017-11-04 13:54:40 +01:00
pictuga
194465544a
crawler: separate CacheHandler and actual caching
...
Default cache is now just an in-memory {}
2017-11-04 12:41:56 +01:00
pictuga
523b250907
crawler: SQL request in CAPS for readability
2017-11-04 12:36:58 +01:00
pictuga
a8c2df7f41
crawler: fix truncated gzip reader
...
For python 3
2017-11-04 12:07:08 +01:00
pictuga
d39d0f4cae
crawler: properly define default sqlite file
2017-11-02 22:50:40 +01:00
pictuga
0df6409b0e
crawler: use `with con` to commit, journal WAL for perf
2017-10-28 01:28:47 +02:00
pictuga
7b85f692a0
crawler: fix encoding detection
2017-10-27 23:14:08 +02:00
pictuga
840842d246
crawler: limit download to 500KiB
...
More can only be linked to a fraudulent/incorrect use of the service
2017-10-27 23:12:40 +02:00
pictuga
fbe811384a
crawler: add (unused) DebugHandler to output headers sent/received
...
Saves a lot of time when debugging
2017-10-27 23:10:03 +02:00
pictuga
df22396838
Only use chardet on 2k letters
...
Takes forever otherwise
2017-07-16 23:59:06 +02:00
pictuga
6f0efd5802
crawler: add cookies support
...
Somehow got dropped when splitting the big handler
2017-03-25 19:51:42 -10:00
pictuga
505b02d70d
crawler: remove debugging print()
2017-03-25 13:45:12 -10:00
pictuga
9c331300eb
crawler: move UAHandler to basic
...
Fuck u feedburner
2017-03-19 01:49:17 -10:00
pictuga
99f3c519f2
crawler: fix accept code
2017-03-18 23:37:51 -10:00
pictuga
67f5a21019
Move build_opener to crawler
...
Forgotten
2017-03-18 23:03:04 -10:00
pictuga
f7d570d4c8
crawler: accept some broken mimetypes as rss
...
Seen out there
2017-03-18 23:00:13 -10:00
pictuga
2003e2760b
Move custom_handler to crawler
...
Makes more sense. Easier to reuse. Also cleaned up the code a bit
2017-03-18 22:51:27 -10:00
pictuga
e1a13a623c
crawler: remove inefficient feedburner-specific code
2017-03-18 22:31:03 -10:00
pictuga
e3ab3c6823
crawler: use fewer ternary operators
...
Inherited from fork
2017-03-18 22:23:39 -10:00
pictuga
65055290d4
crawler: better use of chardet
...
Scan the whole doc, since the beginning of HTML pages tends to be too regular. Ignore ASCII detection for the same reason.
2017-03-18 22:19:54 -10:00