148 Commits (eac2e7a79aa678b2dd055f7684bf44cd8fde769e)

Author SHA1 Message Date
pictuga ae5e947417 Add support for facebook, google links 2013-10-01 19:49:53 +02:00
pictuga cf5040020e Use desktop browser UA for xml
Convenient for rss links detection and "feedify"
2013-10-01 19:47:45 +02:00
pictuga e3c1cd8619 Accept more types as "text" before readability 2013-10-01 19:47:06 +02:00
pictuga 78706952fe Remove "clip" from Fill
Put that in Gather. Also removed from feeds.py. "alone" mode was also added (it removes the description).
2013-10-01 19:45:54 +02:00
pictuga 1b7fe8fbee Use "options" in Gather instead of "progress"
Also made it possible to set Fill's toggle through parameters
2013-09-29 15:32:58 +02:00
pictuga a5a327388a Add ability not to fetch an item's article 2013-09-25 13:47:05 +02:00
pictuga 0657077191 Add support for twitter
Grabs "feed" from the html page, clips tweet and article together.
2013-09-25 12:37:14 +02:00
pictuga da14242bcf Add feedify, and use it in morss 2013-09-25 12:36:21 +02:00
pictuga 9bc4417be3 More flexible xml caching
Now includes a 'type' var, to remember what was done with it (normal, nothing, grabbed xml link, etc.). xml/html mimetypes are now saved in a dict, for easier editing and consistency.
2013-09-25 12:32:40 +02:00
pictuga edff54a016 Add pushContent in feeds.py
Useful for twitter (later) for its "clip" toggle, which keeps the original desc/content above the article. Makes changing the content, while keeping the original stub in place, easier.
2013-09-25 12:18:22 +02:00
pictuga 208d70d3db Use separate var in Fill for final url
That way the url can be changed altogether for the article-fetching part, without changing the item link itself. Useful for upcoming twitter feeds.
2013-09-25 11:51:48 +02:00
pictuga fd1501a0c0 Check relative url earlier 2013-09-25 11:49:45 +02:00
pictuga 1e621099e0 Log cache hash in Gather 2013-09-25 11:15:11 +02:00
pictuga 3d6d7e70b6 Remove useless "as" in error catch 2013-09-25 11:14:22 +02:00
pictuga e73cbf56c2 Add 'html' option, useful to see errors on server 2013-09-25 11:13:33 +02:00
pictuga 03014a8cbf Typo in UA_HTML var name 2013-09-25 11:11:11 +02:00
pictuga 4a5cbcfd18 Move httplib in common code
Needed for error catch
2013-09-25 11:10:16 +02:00
pictuga 3fd34ff1a6 decodeHTML works without connection object 2013-09-25 11:08:58 +02:00
pictuga 658f51e5a9 Support feeds handed out as text/html
<http://www.pro-linux.de/rss/index1.xml> and <http://tehrantimes.com/index.php?option=com_ninjarsssyndicator&feed_id=1&format=raw> are on an equal footing…
2013-09-16 00:33:24 +02:00
pictuga 8eb2f7c249 Added another letter to feedsportal table 2013-09-15 19:38:59 +02:00
pictuga 23246ca6c1 Save the key in cache file 2013-09-15 19:20:51 +02:00
pictuga 1b7777c331 Find RSS links within html pages' <head>
And cache those links
2013-09-15 19:19:50 +02:00
pictuga 1bd17f1365 Faster relative link resolution 2013-09-15 19:18:39 +02:00
pictuga 7575291f8f Log url in Gather
Useful for upcoming commits
2013-09-15 18:53:35 +02:00
pictuga 532852a408 Use cleaner http error catch
One error type was inheriting from another one
2013-09-15 18:52:34 +02:00
pictuga 43bf021f23 Catch more http exceptions
Such as InvalidURL. Subclasses of httplib.HTTPException
2013-09-15 17:19:33 +02:00
pictuga 9252e75923 Ensure vars in parseOptions are defined
Caused a bug on morss.it's server
2013-09-15 15:56:08 +02:00
pictuga 75b51fc2c2 Add ability to bypass ETag support
Add the ":force" argument over http to bypass ETag support, which is convenient to debug code
2013-09-15 15:54:42 +02:00
pictuga d2de6cf23d Extra doc for DELAY, for xml cache, & now for etag 2013-09-15 15:45:15 +02:00
pictuga 89187ab6a6 Log generation time 2013-09-15 15:44:25 +02:00
pictuga 04840d9843 More flexible parameters can be passed
Multiple parameters can now be passed. The HTTP "API" has been improved, and urls now have to be like "http://<path to morss>/:<param1>:<param2>/<url>". The code handling parameter parsing is now much cleaner. The debug toggle is now a var, which can be changed with parameters. Also, http logging is no longer done into a file, which tended to grow way too fast while lacking an "error 403 protection"; instead, the parameter ":debug" can be passed in the url, and the page will be delivered as "text/plain" with the debug output written into it. Therefore some logging had to be moved around, so as not to output anything during http headers definition.
2013-09-15 15:38:03 +02:00
pictuga c25aec7107 Only perform <meta> redirects on html pages 2013-09-15 15:33:14 +02:00
pictuga 3ba74649f6 Test if linked pages are text documents
Useful for feeds such as HackerNews
2013-09-10 15:25:55 +02:00
pictuga 5ebd84ee55 Fix broken feeds.py calls for items count 2013-09-08 15:47:15 +02:00
pictuga d3c163fb74 Use ETag for user-side caching
Pretty hard-coded ETag use. ETag is just a timestamp, and the server checks whether it's recent enough.
2013-08-24 23:43:32 +02:00
pictuga e2c3375eb6 Log url earlier
Now logging it in both use cases
2013-08-24 23:41:40 +02:00
pictuga 0c6e28205a Use seconds for every parameter 2013-08-24 23:40:37 +02:00
pictuga b350602232 Remove legacy "xml map" declaration 2013-08-24 23:16:23 +02:00
pictuga 1ba22516fe Small help for etag handler 2013-07-19 00:02:52 +02:00
pictuga 90efb84c57 Don't log word counts
Nobody cares
2013-07-18 23:55:58 +02:00
pictuga 9e324465e4 Use etag/last-modified to fetch xml feeds 2013-07-18 23:54:13 +02:00
pictuga 70df746416 Accept None as value to cache 2013-07-18 23:51:11 +02:00
pictuga 71129b5898 Fix headers definition
Based on what's done inside urllib2.py.
2013-07-17 14:41:29 +02:00
pictuga d3213ea1e7 Implement user-agent in HTMLDownloader
It was forgotten in the previous commit
2013-07-17 14:40:29 +02:00
pictuga 918dede4be Extend urllib2 to download pages, use gzip
Cleaner than dirty function. Handles decoding, gzip decompression, meta redirects (eg. Washington Post). Might need extra testing.
2013-07-16 23:33:45 +02:00
pictuga 1fa8c4c535 Remove cleanXML()
This function is way too strong, and no longer needed (even for the targeted feed). It led to other bugs with other feeds, where needed spaces were stripped.
2013-07-15 11:10:19 +02:00
pictuga 0718303eb7 Use ' instead of " when possible 2013-07-14 19:00:16 +02:00
pictuga 7275bb1a59 Better content insertion
Also takes care of description, by creating one, when missing.
2013-07-14 18:58:48 +02:00
pictuga 054f5c0846 Detect provided content with word count
This is instead of character count.
2013-07-14 18:57:12 +02:00
pictuga 7fa183d713 Change morss.py to use feeds.py
No other changes should appear in this commit
2013-07-14 18:44:11 +02:00
pictuga cf3934a513 Change http output mimetype to xml 2013-06-28 13:34:12 +02:00
pictuga 1f4c219880 Common code for url/options handling 2013-06-25 13:13:23 +02:00
pictuga d2418a47c2 Add support for reddit.com feeds
The content of the linked article is used for the content. The original content (with a link to comments) is still available in the "description" of the feed item.
2013-06-11 13:02:47 +02:00
pictuga f0b237364f Better annotation of feedburner/feedsportal code 2013-06-11 13:02:16 +02:00
pictuga 0978e76356 str.decode() within EncDownload() 2013-06-08 17:32:55 +02:00
pictuga 89354e1528 Use file's built-in readlines() to split file 2013-06-08 17:30:53 +02:00
pictuga bbf5c92ba2 Fix lenHTML() with empty string 2013-06-08 17:30:11 +02:00
pictuga e05d1c9deb Replace uppercase title with "title-case" 2013-06-02 23:45:41 +02:00
pictuga b78f0bfba5 Improve options and limits
New limits are possible: time limit, max number of items fetched, and max number of items taken from cache. Fill's third argument is now Fast=True, which is self-explanatory. (The complexity of the changes made separate commits impossible.)
2013-05-15 17:56:58 +02:00
pictuga 2a71fe07f2 Improve Cache code
Removed _new flag. Slightly more stable and cleaner.
2013-05-15 17:48:39 +02:00
pictuga bf647ba5f8 Make Fill return True when it has done something useful 2013-05-15 17:38:52 +02:00
pictuga 9694a31052 Add 'feedurl' argument to Fill
Was needed for commit f3c2c34
2013-05-15 17:36:00 +02:00
pictuga 8e2aab55e7 Check url before looking for provided content
Also use the lenHTML() function defined recently
2013-05-15 17:32:42 +02:00
pictuga 85e40cde4e Check article length is big enough
Avoids replacing rather useful descriptions with empty string
2013-05-15 17:24:27 +02:00
pictuga 222b1369e5 Support for relative urls in feed 2013-05-15 17:13:57 +02:00
pictuga d88719c87f Use urlparse library to check feed urls 2013-05-15 17:12:59 +02:00
pictuga 1506a5c0cd Fix string output in XMLMap 2013-05-05 16:04:42 +02:00
pictuga adebe23232 Better logging when running as Liferea hook 2013-05-05 15:33:46 +02:00
pictuga 32514941b4 Try to improve support for bogus xml feed 2013-05-05 15:32:57 +02:00
pictuga b34ecb8ad3 Fix cache crash with one entry with empty value 2013-05-05 15:32:05 +02:00
pictuga e518f2cced Better timeout error handling
For older versions of Python
2013-05-05 15:31:11 +02:00
pictuga 03501edccd Add/fix extra modes
'progress' mode now works on Chrome. 'cache' mode only relies on cache to load faster.
2013-05-05 15:30:06 +02:00
pictuga 65090870ac Remove temp debug print statement 2013-05-05 15:28:32 +02:00
pictuga e77278dda9 Remove leftover SERVER var from source code 2013-05-01 19:31:24 +02:00
pictuga 949582ba19 Add progress view. 2013-05-01 17:57:09 +02:00
pictuga 5ee5dbf359 Cache http errors to save time. 2013-05-01 17:56:03 +02:00
pictuga 2f1ae1ce91 Use less suspicious user-agents. 2013-05-01 17:54:17 +02:00
pictuga 0a97a2a2b5 Support for combined feedsportal and feedburner. 2013-05-01 17:43:43 +02:00
pictuga 93b098ab11 Added http timeout. 2013-04-30 19:54:32 +02:00
pictuga 9f175994c6 Fix regex implementation. 2013-04-30 19:51:29 +02:00
pictuga 93f971896b Improved feedsportal url recognition. 2013-04-28 10:10:58 +02:00
pictuga fa7cd957df Save Cache when it's new.
So as to avoid crashes on first fetch.
2013-04-23 00:24:41 +02:00
pictuga ca90d082c3 Library import list made cleaner. 2013-04-23 00:04:44 +02:00
pictuga 1480bd7af4 Auto-detection of server-mode, better caching.
The SERVER variable is no longer needed. The RSS .xml file is now cached for a very short time, so as to make loading faster, and hopefully reduce bans a little. Use a more common User-Agent to try to cut down on bans. Added ability to test whether a key is in the Cache.
2013-04-23 00:00:07 +02:00
pictuga a616c96e32 Removed another unused var. 2013-04-22 23:58:20 +02:00
pictuga f95c5dcf0d Fixed caching. 2013-04-22 22:56:38 +02:00
pictuga 83d0dcce4d Delete unused var declaration. 2013-04-22 22:56:21 +02:00
pictuga 2d05653190 Better detection of feedsportal, extra url logging. 2013-04-19 11:44:25 +02:00
pictuga 8ce9812dfd Meta redirects are now supported. 2013-04-19 11:43:47 +02:00
pictuga 80ba60d295 Better detection of feeds with content provided. 2013-04-19 11:42:54 +02:00
pictuga d2b74819b4 Improved caching.
No longer writes every time a value is added, since it could cause some issues if two instances of the script were run at the same time. Now it only writes when the Cache object is no longer in use (i.e. garbage collected).
2013-04-19 11:40:35 +02:00
pictuga 4abf7b699c Use readability to fetch article content.
Makes the whole "xpath rules" thing useless. Almost any feed is now supported. CSS liferea stylesheets are also unneeded now, since readability cleans up html code in a more efficient way. README was updated.
2013-04-19 11:37:43 +02:00
pictuga 17db2584da Fixed caching.
For scary reasons, the re-used cache was deleted every time. This is now fixed. Loading is now *really* fast.
2013-04-16 16:13:42 +02:00
pictuga 5a74babf24 Improved logging on server. 2013-04-16 16:13:14 +02:00
pictuga 7b1c32eac2 Added support for 404 redirect.
ie. http://domain.com/bbc.co.uk/feed.xml will redirect to http://domain.com/morss.py/bbc.co.uk/feed.xml and work.
2013-04-16 16:11:34 +02:00
pictuga af8879049f Another huge commit.
Now uses OOP where it fits. Atom feeds are supported, but no real tests were made. Unix globbing is now possible for urls. Caching is done in a cleaner way. Feedburner links are also replaced. HTML is cleaned in a more efficient way. Code is now much cleaner, using lxml.objectify and a small wrapper to access Atom feeds as if they were RSS feeds (and much faster than feedparser). README has been updated.
2013-04-15 18:51:55 +02:00
pictuga d6e6d61199 Bypass feedsportal. 2013-04-04 19:29:22 +02:00
pictuga 851dacdfbc Renamed to .py. 2013-04-04 18:17:12 +02:00
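
Note on commit 04840d9843: the url scheme it describes, "http://<path to morss>/:<param1>:<param2>/<url>", could be parsed roughly as sketched below. This is only an illustration under stated assumptions, not the code from the commit: the function name parse_request_path is hypothetical, options are assumed to be simple on/off flags, and the sketch is written for Python 3, whereas the log mentions urllib2 and httplib, which suggests the original was Python 2. The ":debug" and ":force" flags are the ones named in the commit messages above.

```python
from typing import Dict, Tuple


def parse_request_path(path: str) -> Tuple[Dict[str, bool], str]:
    """Split '/:<param1>:<param2>/<url>' into (options, url).

    Hypothetical helper: the name and the bool-flag representation
    are assumptions made for this sketch.
    """
    path = path.lstrip('/')
    options: Dict[str, bool] = {}

    if path.startswith(':'):
        # Everything before the first '/' is the colon-separated
        # parameter list; the remainder is the feed url.
        params, _, path = path.partition('/')
        for name in params.split(':'):
            if name:  # skip the empty piece before the leading ':'
                options[name] = True

    return options, path


if __name__ == '__main__':
    print(parse_request_path('/:debug:force/example.org/feed.xml'))
    print(parse_request_path('/example.org/feed.xml'))
```

A path without a leading ":" segment is treated as a plain feed url with no options set, so urls that pass no parameters are still accepted.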