|
fbe811384a
|
crawler: add (unused) DebugHandler to output headers sent/received
Saves a lot of time when debugging
|
2017-10-27 23:10:03 +02:00 |
|
|
df22396838
|
Only use chardet on 2k letters
Takes forever otherwise
|
2017-07-16 23:59:06 +02:00 |
|
|
6f0efd5802
|
crawler: add cookies support
Somehow got dropped when splitting the big handler
|
2017-03-25 19:51:42 -10:00 |
|
|
505b02d70d
|
crawler: remove debugging print()
|
2017-03-25 13:45:12 -10:00 |
|
|
9c331300eb
|
crawler: move UAHandler to basic
Fuck u feedburner
|
2017-03-19 01:49:17 -10:00 |
|
|
99f3c519f2
|
crawler: fix accept code
|
2017-03-18 23:37:51 -10:00 |
|
|
67f5a21019
|
Move build_opener to crawler
Forgotten
|
2017-03-18 23:03:04 -10:00 |
|
|
f7d570d4c8
|
crawler: accept some broken mimetypes as rss
Seen out there
|
2017-03-18 23:00:13 -10:00 |
|
|
2003e2760b
|
Move custom_handler to crawler
Makes more sense. Easier to reuse. Also cleaned up the code a bit
|
2017-03-18 22:51:27 -10:00 |
|
|
e1a13a623c
|
crawler: remove inefficient feedburner-specific code
|
2017-03-18 22:31:03 -10:00 |
|
|
e3ab3c6823
|
crawler: use fewer ternary operators
Inherited from fork
|
2017-03-18 22:23:39 -10:00 |
|
|
65055290d4
|
crawler: better use of chardet
Scan the whole doc, since the beginning of HTML pages tends to be too regular. Ignore ASCII detection for the same reason.
|
2017-03-18 22:19:54 -10:00 |
|
|
9ee6ff60e1
|
crawler: 301 http code doesn't respect headers
More or less according to the specs
|
2017-03-18 22:18:10 -10:00 |
|
|
c952b85d92
|
crawler: cache 301 HTTP code, for a week
|
2017-03-09 09:37:05 -10:00 |
|
|
e8023e4336
|
crawler: remove unused NotInCache error-class
|
2017-03-09 09:35:40 -10:00 |
|
Florian Muenchbach
|
993ac638a3
|
Added override for auto-detected character encoding of parsed pages.
|
2017-03-08 18:45:20 -10:00 |
|
|
e5f8e43659
|
Shifted the <link rel='alternate'/> redirect to crawler
Now using MIMETYPE var from crawler within morss.py
|
2017-03-08 18:03:34 -10:00 |
|
|
fb8825b410
|
crawler: parse html to get http-equiv
For sure slower, but way cleaner (and probably more stable)
|
2017-03-08 17:50:57 -10:00 |
|
|
ad9bf946ec
|
crawler: use chardet again
Always nice in case no encoding is specified. Somehow got dropped with commit 245ba99. Most probably by accident
|
2017-03-08 11:37:12 -10:00 |
|
|
026903ce73
|
crawler: change http header after uncompressing
Change content-encoding to "identity"
|
2017-02-25 18:10:43 -10:00 |
|
|
8a1c00abf0
|
Typo in python version check
|
2015-08-28 19:29:09 +02:00 |
|
Massimo Vannucci
|
8656e53b84
|
Correct Python version check
|
2015-08-05 23:36:11 +02:00 |
|
|
931fd53da6
|
Fix 304-cache handling
To make sure that the cached request also gets processed (by GZip and stuff)
|
2015-05-04 22:25:26 +08:00 |
|
|
131ba09207
|
Change :cache mode behavior
Makes underlying code way cleaner
|
2015-04-07 09:38:22 +08:00 |
|
|
32aa96afa7
|
Cache HTTP content using a custom Handler
Much much cleaner. Nothing comparable
|
2015-04-06 23:26:12 +08:00 |
|
|
1b4fc88ad0
|
Replace MetaRedirect handler with two cleaner ones
One for <meta http-equiv> and one for HTTP 'refresh' header
|
2015-04-06 23:03:17 +08:00 |
|
|
f2fe4fc364
|
Drop HTTPS SSL certificate verification
Breaks everything with python 3. Now built-in in recent python 2.7.9 and python 3.4-ish
|
2015-04-06 22:54:59 +08:00 |
|
|
29d9e4702f
|
Force enc det to return utf-8 rather than nothing
|
2015-03-24 23:22:56 +08:00 |
|
|
656b29e0ef
|
2to3: using unicode/str to please py3
|
2015-03-11 01:05:02 +08:00 |
|
|
cbeb01e555
|
2to3: fix urllib header retrieval
|
2015-03-11 01:03:16 +08:00 |
|
|
2f542005d1
|
2to3: urllib host
|
2015-03-03 00:59:00 +08:00 |
|
|
dbb3883516
|
2to3: urllib mimetype
|
2015-03-03 00:55:58 +08:00 |
|
|
7bd448789d
|
2to3: first attempt to fix strings
|
2015-02-26 00:50:23 +08:00 |
|
|
a0f2e0d995
|
2to3: crawler.py improve except
|
2015-02-25 18:07:09 +08:00 |
|
|
6a06b742f9
|
2to3: crawler.py port try as
|
2015-02-25 18:03:54 +08:00 |
|
|
c2d85e2bf9
|
2to3: crawler.py port httplib
|
2015-02-25 18:02:29 +08:00 |
|
|
4f224888d8
|
2to3: crawler.py port urllib2 and StringIO
|
2015-02-25 17:53:36 +08:00 |
|
|
27cf8f6498
|
2to3: (iter)items to list
|
2015-02-25 12:02:53 +08:00 |
|
|
8131ea2244
|
HTTPS SSL certificate validation
Specific error message added
|
2014-11-19 11:59:59 +01:00 |
|
|
1b26c5f0e3
|
Split SimpleDownload in a lot of Handlers
Cleaner code, easier to edit, more flexibility. Paves the way to SSL certificates validation.
Still have to clean up the code of AcceptHeadersHandler.
|
2014-11-19 11:57:40 +01:00 |
|
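As an illustration of commit 026903ce73 ("change http header after uncompressing"), here is a minimal sketch of a urllib handler that decompresses a gzip body and then relabels the response as "identity". The class name `GzipIdentityHandler` is hypothetical and this is not the project's actual code, only one way the idea could look under those assumptions.

```python
import gzip
import io
from email.message import Message
from urllib.request import BaseHandler
from urllib.response import addinfourl


class GzipIdentityHandler(BaseHandler):
    """Hypothetical handler: gunzip the body, then mark it as 'identity'."""

    def http_request(self, req):
        # Advertise gzip support to the server.
        req.add_header("Accept-Encoding", "gzip")
        return req

    def http_response(self, req, resp):
        encoding = resp.headers.get("Content-Encoding", "")
        if encoding.lower() != "gzip":
            return resp  # nothing to do for uncompressed bodies

        body = gzip.decompress(resp.read())

        # The body is no longer compressed, so the headers must say so,
        # otherwise later handlers might try to decompress it again.
        headers = resp.headers
        del headers["Content-Encoding"]
        headers["Content-Encoding"] = "identity"
        if "Content-Length" in headers:
            del headers["Content-Length"]
            headers["Content-Length"] = str(len(body))

        new_resp = addinfourl(io.BytesIO(body), headers, resp.url, resp.code)
        new_resp.msg = getattr(resp, "msg", "")
        return new_resp

    https_request = http_request
    https_response = http_response
```

Such a handler could be installed with `urllib.request.build_opener(GzipIdentityHandler())`. Rewriting the header after decompression keeps the response self-consistent for anything that processes it later, for example a caching handler.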