Compare commits: v1.0...4a88886767 (202 commits)
LICENSE (new file, 661 lines)
@@ -0,0 +1,661 @@
GNU AFFERO GENERAL PUBLIC LICENSE
Version 3, 19 November 2007

Copyright (C) 2007 Free Software Foundation, Inc. <https://fsf.org/>
Everyone is permitted to copy and distribute verbatim copies
of this license document, but changing it is not allowed.

Preamble

The GNU Affero General Public License is a free, copyleft license for
software and other kinds of works, specifically designed to ensure
cooperation with the community in the case of network server software.

The licenses for most software and other practical works are designed
to take away your freedom to share and change the works. By contrast,
our General Public Licenses are intended to guarantee your freedom to
share and change all versions of a program--to make sure it remains free
software for all its users.

When we speak of free software, we are referring to freedom, not
price. Our General Public Licenses are designed to make sure that you
have the freedom to distribute copies of free software (and charge for
them if you wish), that you receive source code or can get it if you
want it, that you can change the software or use pieces of it in new
free programs, and that you know you can do these things.

Developers that use our General Public Licenses protect your rights
with two steps: (1) assert copyright on the software, and (2) offer
you this License which gives you legal permission to copy, distribute
and/or modify the software.

A secondary benefit of defending all users' freedom is that
improvements made in alternate versions of the program, if they
receive widespread use, become available for other developers to
incorporate. Many developers of free software are heartened and
encouraged by the resulting cooperation. However, in the case of
software used on network servers, this result may fail to come about.
The GNU General Public License permits making a modified version and
letting the public access it on a server without ever releasing its
source code to the public.

The GNU Affero General Public License is designed specifically to
ensure that, in such cases, the modified source code becomes available
to the community. It requires the operator of a network server to
provide the source code of the modified version running there to the
users of that server. Therefore, public use of a modified version, on
a publicly accessible server, gives the public access to the source
code of the modified version.

An older license, called the Affero General Public License and
published by Affero, was designed to accomplish similar goals. This is
a different license, not a version of the Affero GPL, but Affero has
released a new version of the Affero GPL which permits relicensing under
this license.

The precise terms and conditions for copying, distribution and
modification follow.

TERMS AND CONDITIONS

0. Definitions.

"This License" refers to version 3 of the GNU Affero General Public License.

"Copyright" also means copyright-like laws that apply to other kinds of
works, such as semiconductor masks.

"The Program" refers to any copyrightable work licensed under this
License. Each licensee is addressed as "you". "Licensees" and
"recipients" may be individuals or organizations.

To "modify" a work means to copy from or adapt all or part of the work
in a fashion requiring copyright permission, other than the making of an
exact copy. The resulting work is called a "modified version" of the
earlier work or a work "based on" the earlier work.

A "covered work" means either the unmodified Program or a work based
on the Program.

To "propagate" a work means to do anything with it that, without
permission, would make you directly or secondarily liable for
infringement under applicable copyright law, except executing it on a
computer or modifying a private copy. Propagation includes copying,
distribution (with or without modification), making available to the
public, and in some countries other activities as well.

To "convey" a work means any kind of propagation that enables other
parties to make or receive copies. Mere interaction with a user through
a computer network, with no transfer of a copy, is not conveying.

An interactive user interface displays "Appropriate Legal Notices"
to the extent that it includes a convenient and prominently visible
feature that (1) displays an appropriate copyright notice, and (2)
tells the user that there is no warranty for the work (except to the
extent that warranties are provided), that licensees may convey the
work under this License, and how to view a copy of this License. If
the interface presents a list of user commands or options, such as a
menu, a prominent item in the list meets this criterion.

1. Source Code.

The "source code" for a work means the preferred form of the work
for making modifications to it. "Object code" means any non-source
form of a work.

A "Standard Interface" means an interface that either is an official
standard defined by a recognized standards body, or, in the case of
interfaces specified for a particular programming language, one that
is widely used among developers working in that language.

The "System Libraries" of an executable work include anything, other
than the work as a whole, that (a) is included in the normal form of
packaging a Major Component, but which is not part of that Major
Component, and (b) serves only to enable use of the work with that
Major Component, or to implement a Standard Interface for which an
implementation is available to the public in source code form. A
"Major Component", in this context, means a major essential component
(kernel, window system, and so on) of the specific operating system
(if any) on which the executable work runs, or a compiler used to
produce the work, or an object code interpreter used to run it.

The "Corresponding Source" for a work in object code form means all
the source code needed to generate, install, and (for an executable
work) run the object code and to modify the work, including scripts to
control those activities. However, it does not include the work's
System Libraries, or general-purpose tools or generally available free
programs which are used unmodified in performing those activities but
which are not part of the work. For example, Corresponding Source
includes interface definition files associated with source files for
the work, and the source code for shared libraries and dynamically
linked subprograms that the work is specifically designed to require,
such as by intimate data communication or control flow between those
subprograms and other parts of the work.

The Corresponding Source need not include anything that users
can regenerate automatically from other parts of the Corresponding
Source.

The Corresponding Source for a work in source code form is that
same work.

2. Basic Permissions.

All rights granted under this License are granted for the term of
copyright on the Program, and are irrevocable provided the stated
conditions are met. This License explicitly affirms your unlimited
permission to run the unmodified Program. The output from running a
covered work is covered by this License only if the output, given its
content, constitutes a covered work. This License acknowledges your
rights of fair use or other equivalent, as provided by copyright law.

You may make, run and propagate covered works that you do not
convey, without conditions so long as your license otherwise remains
in force. You may convey covered works to others for the sole purpose
of having them make modifications exclusively for you, or provide you
with facilities for running those works, provided that you comply with
the terms of this License in conveying all material for which you do
not control copyright. Those thus making or running the covered works
for you must do so exclusively on your behalf, under your direction
and control, on terms that prohibit them from making any copies of
your copyrighted material outside their relationship with you.

Conveying under any other circumstances is permitted solely under
the conditions stated below. Sublicensing is not allowed; section 10
makes it unnecessary.

3. Protecting Users' Legal Rights From Anti-Circumvention Law.

No covered work shall be deemed part of an effective technological
measure under any applicable law fulfilling obligations under article
11 of the WIPO copyright treaty adopted on 20 December 1996, or
similar laws prohibiting or restricting circumvention of such
measures.

When you convey a covered work, you waive any legal power to forbid
circumvention of technological measures to the extent such circumvention
is effected by exercising rights under this License with respect to
the covered work, and you disclaim any intention to limit operation or
modification of the work as a means of enforcing, against the work's
users, your or third parties' legal rights to forbid circumvention of
technological measures.

4. Conveying Verbatim Copies.

You may convey verbatim copies of the Program's source code as you
receive it, in any medium, provided that you conspicuously and
appropriately publish on each copy an appropriate copyright notice;
keep intact all notices stating that this License and any
non-permissive terms added in accord with section 7 apply to the code;
keep intact all notices of the absence of any warranty; and give all
recipients a copy of this License along with the Program.

You may charge any price or no price for each copy that you convey,
and you may offer support or warranty protection for a fee.

5. Conveying Modified Source Versions.

You may convey a work based on the Program, or the modifications to
produce it from the Program, in the form of source code under the
terms of section 4, provided that you also meet all of these conditions:

a) The work must carry prominent notices stating that you modified
it, and giving a relevant date.

b) The work must carry prominent notices stating that it is
released under this License and any conditions added under section
7. This requirement modifies the requirement in section 4 to
"keep intact all notices".

c) You must license the entire work, as a whole, under this
License to anyone who comes into possession of a copy. This
License will therefore apply, along with any applicable section 7
additional terms, to the whole of the work, and all its parts,
regardless of how they are packaged. This License gives no
permission to license the work in any other way, but it does not
invalidate such permission if you have separately received it.

d) If the work has interactive user interfaces, each must display
Appropriate Legal Notices; however, if the Program has interactive
interfaces that do not display Appropriate Legal Notices, your
work need not make them do so.

A compilation of a covered work with other separate and independent
works, which are not by their nature extensions of the covered work,
and which are not combined with it such as to form a larger program,
in or on a volume of a storage or distribution medium, is called an
"aggregate" if the compilation and its resulting copyright are not
used to limit the access or legal rights of the compilation's users
beyond what the individual works permit. Inclusion of a covered work
in an aggregate does not cause this License to apply to the other
parts of the aggregate.

6. Conveying Non-Source Forms.

You may convey a covered work in object code form under the terms
of sections 4 and 5, provided that you also convey the
machine-readable Corresponding Source under the terms of this License,
in one of these ways:

a) Convey the object code in, or embodied in, a physical product
(including a physical distribution medium), accompanied by the
Corresponding Source fixed on a durable physical medium
customarily used for software interchange.

b) Convey the object code in, or embodied in, a physical product
(including a physical distribution medium), accompanied by a
written offer, valid for at least three years and valid for as
long as you offer spare parts or customer support for that product
model, to give anyone who possesses the object code either (1) a
copy of the Corresponding Source for all the software in the
product that is covered by this License, on a durable physical
medium customarily used for software interchange, for a price no
more than your reasonable cost of physically performing this
conveying of source, or (2) access to copy the
Corresponding Source from a network server at no charge.

c) Convey individual copies of the object code with a copy of the
written offer to provide the Corresponding Source. This
alternative is allowed only occasionally and noncommercially, and
only if you received the object code with such an offer, in accord
with subsection 6b.

d) Convey the object code by offering access from a designated
place (gratis or for a charge), and offer equivalent access to the
Corresponding Source in the same way through the same place at no
further charge. You need not require recipients to copy the
Corresponding Source along with the object code. If the place to
copy the object code is a network server, the Corresponding Source
may be on a different server (operated by you or a third party)
that supports equivalent copying facilities, provided you maintain
clear directions next to the object code saying where to find the
Corresponding Source. Regardless of what server hosts the
Corresponding Source, you remain obligated to ensure that it is
available for as long as needed to satisfy these requirements.

e) Convey the object code using peer-to-peer transmission, provided
you inform other peers where the object code and Corresponding
Source of the work are being offered to the general public at no
charge under subsection 6d.

A separable portion of the object code, whose source code is excluded
from the Corresponding Source as a System Library, need not be
included in conveying the object code work.

A "User Product" is either (1) a "consumer product", which means any
tangible personal property which is normally used for personal, family,
or household purposes, or (2) anything designed or sold for incorporation
into a dwelling. In determining whether a product is a consumer product,
doubtful cases shall be resolved in favor of coverage. For a particular
product received by a particular user, "normally used" refers to a
typical or common use of that class of product, regardless of the status
of the particular user or of the way in which the particular user
actually uses, or expects or is expected to use, the product. A product
is a consumer product regardless of whether the product has substantial
commercial, industrial or non-consumer uses, unless such uses represent
the only significant mode of use of the product.

"Installation Information" for a User Product means any methods,
procedures, authorization keys, or other information required to install
and execute modified versions of a covered work in that User Product from
a modified version of its Corresponding Source. The information must
suffice to ensure that the continued functioning of the modified object
code is in no case prevented or interfered with solely because
modification has been made.

If you convey an object code work under this section in, or with, or
specifically for use in, a User Product, and the conveying occurs as
part of a transaction in which the right of possession and use of the
User Product is transferred to the recipient in perpetuity or for a
fixed term (regardless of how the transaction is characterized), the
Corresponding Source conveyed under this section must be accompanied
by the Installation Information. But this requirement does not apply
if neither you nor any third party retains the ability to install
modified object code on the User Product (for example, the work has
been installed in ROM).

The requirement to provide Installation Information does not include a
requirement to continue to provide support service, warranty, or updates
for a work that has been modified or installed by the recipient, or for
the User Product in which it has been modified or installed. Access to a
network may be denied when the modification itself materially and
adversely affects the operation of the network or violates the rules and
protocols for communication across the network.

Corresponding Source conveyed, and Installation Information provided,
in accord with this section must be in a format that is publicly
documented (and with an implementation available to the public in
source code form), and must require no special password or key for
unpacking, reading or copying.

7. Additional Terms.

"Additional permissions" are terms that supplement the terms of this
License by making exceptions from one or more of its conditions.
Additional permissions that are applicable to the entire Program shall
be treated as though they were included in this License, to the extent
that they are valid under applicable law. If additional permissions
apply only to part of the Program, that part may be used separately
under those permissions, but the entire Program remains governed by
this License without regard to the additional permissions.

When you convey a copy of a covered work, you may at your option
remove any additional permissions from that copy, or from any part of
it. (Additional permissions may be written to require their own
removal in certain cases when you modify the work.) You may place
additional permissions on material, added by you to a covered work,
for which you have or can give appropriate copyright permission.

Notwithstanding any other provision of this License, for material you
add to a covered work, you may (if authorized by the copyright holders of
that material) supplement the terms of this License with terms:

a) Disclaiming warranty or limiting liability differently from the
terms of sections 15 and 16 of this License; or

b) Requiring preservation of specified reasonable legal notices or
author attributions in that material or in the Appropriate Legal
Notices displayed by works containing it; or

c) Prohibiting misrepresentation of the origin of that material, or
requiring that modified versions of such material be marked in
reasonable ways as different from the original version; or

d) Limiting the use for publicity purposes of names of licensors or
authors of the material; or

e) Declining to grant rights under trademark law for use of some
trade names, trademarks, or service marks; or

f) Requiring indemnification of licensors and authors of that
material by anyone who conveys the material (or modified versions of
it) with contractual assumptions of liability to the recipient, for
any liability that these contractual assumptions directly impose on
those licensors and authors.

All other non-permissive additional terms are considered "further
restrictions" within the meaning of section 10. If the Program as you
received it, or any part of it, contains a notice stating that it is
governed by this License along with a term that is a further
restriction, you may remove that term. If a license document contains
a further restriction but permits relicensing or conveying under this
License, you may add to a covered work material governed by the terms
of that license document, provided that the further restriction does
not survive such relicensing or conveying.

If you add terms to a covered work in accord with this section, you
must place, in the relevant source files, a statement of the
additional terms that apply to those files, or a notice indicating
where to find the applicable terms.

Additional terms, permissive or non-permissive, may be stated in the
form of a separately written license, or stated as exceptions;
the above requirements apply either way.

8. Termination.

You may not propagate or modify a covered work except as expressly
provided under this License. Any attempt otherwise to propagate or
modify it is void, and will automatically terminate your rights under
this License (including any patent licenses granted under the third
paragraph of section 11).

However, if you cease all violation of this License, then your
license from a particular copyright holder is reinstated (a)
provisionally, unless and until the copyright holder explicitly and
finally terminates your license, and (b) permanently, if the copyright
holder fails to notify you of the violation by some reasonable means
prior to 60 days after the cessation.

Moreover, your license from a particular copyright holder is
reinstated permanently if the copyright holder notifies you of the
violation by some reasonable means, this is the first time you have
received notice of violation of this License (for any work) from that
copyright holder, and you cure the violation prior to 30 days after
your receipt of the notice.

Termination of your rights under this section does not terminate the
licenses of parties who have received copies or rights from you under
this License. If your rights have been terminated and not permanently
reinstated, you do not qualify to receive new licenses for the same
material under section 10.

9. Acceptance Not Required for Having Copies.

You are not required to accept this License in order to receive or
run a copy of the Program. Ancillary propagation of a covered work
occurring solely as a consequence of using peer-to-peer transmission
to receive a copy likewise does not require acceptance. However,
nothing other than this License grants you permission to propagate or
modify any covered work. These actions infringe copyright if you do
not accept this License. Therefore, by modifying or propagating a
covered work, you indicate your acceptance of this License to do so.

10. Automatic Licensing of Downstream Recipients.

Each time you convey a covered work, the recipient automatically
receives a license from the original licensors, to run, modify and
propagate that work, subject to this License. You are not responsible
for enforcing compliance by third parties with this License.

An "entity transaction" is a transaction transferring control of an
organization, or substantially all assets of one, or subdividing an
organization, or merging organizations. If propagation of a covered
work results from an entity transaction, each party to that
transaction who receives a copy of the work also receives whatever
licenses to the work the party's predecessor in interest had or could
give under the previous paragraph, plus a right to possession of the
Corresponding Source of the work from the predecessor in interest, if
the predecessor has it or can get it with reasonable efforts.

You may not impose any further restrictions on the exercise of the
rights granted or affirmed under this License. For example, you may
not impose a license fee, royalty, or other charge for exercise of
rights granted under this License, and you may not initiate litigation
(including a cross-claim or counterclaim in a lawsuit) alleging that
any patent claim is infringed by making, using, selling, offering for
sale, or importing the Program or any portion of it.

11. Patents.

A "contributor" is a copyright holder who authorizes use under this
License of the Program or a work on which the Program is based. The
work thus licensed is called the contributor's "contributor version".

A contributor's "essential patent claims" are all patent claims
owned or controlled by the contributor, whether already acquired or
hereafter acquired, that would be infringed by some manner, permitted
by this License, of making, using, or selling its contributor version,
but do not include claims that would be infringed only as a
consequence of further modification of the contributor version. For
purposes of this definition, "control" includes the right to grant
patent sublicenses in a manner consistent with the requirements of
this License.

Each contributor grants you a non-exclusive, worldwide, royalty-free
patent license under the contributor's essential patent claims, to
make, use, sell, offer for sale, import and otherwise run, modify and
propagate the contents of its contributor version.

In the following three paragraphs, a "patent license" is any express
agreement or commitment, however denominated, not to enforce a patent
(such as an express permission to practice a patent or covenant not to
sue for patent infringement). To "grant" such a patent license to a
party means to make such an agreement or commitment not to enforce a
patent against the party.

If you convey a covered work, knowingly relying on a patent license,
and the Corresponding Source of the work is not available for anyone
to copy, free of charge and under the terms of this License, through a
publicly available network server or other readily accessible means,
then you must either (1) cause the Corresponding Source to be so
available, or (2) arrange to deprive yourself of the benefit of the
patent license for this particular work, or (3) arrange, in a manner
consistent with the requirements of this License, to extend the patent
license to downstream recipients. "Knowingly relying" means you have
actual knowledge that, but for the patent license, your conveying the
covered work in a country, or your recipient's use of the covered work
in a country, would infringe one or more identifiable patents in that
country that you have reason to believe are valid.

If, pursuant to or in connection with a single transaction or
arrangement, you convey, or propagate by procuring conveyance of, a
covered work, and grant a patent license to some of the parties
receiving the covered work authorizing them to use, propagate, modify
or convey a specific copy of the covered work, then the patent license
you grant is automatically extended to all recipients of the covered
work and works based on it.

A patent license is "discriminatory" if it does not include within
the scope of its coverage, prohibits the exercise of, or is
conditioned on the non-exercise of one or more of the rights that are
specifically granted under this License. You may not convey a covered
work if you are a party to an arrangement with a third party that is
in the business of distributing software, under which you make payment
to the third party based on the extent of your activity of conveying
the work, and under which the third party grants, to any of the
parties who would receive the covered work from you, a discriminatory
patent license (a) in connection with copies of the covered work
conveyed by you (or copies made from those copies), or (b) primarily
for and in connection with specific products or compilations that
contain the covered work, unless you entered into that arrangement,
or that patent license was granted, prior to 28 March 2007.

Nothing in this License shall be construed as excluding or limiting
any implied license or other defenses to infringement that may
otherwise be available to you under applicable patent law.

12. No Surrender of Others' Freedom.

If conditions are imposed on you (whether by court order, agreement or
otherwise) that contradict the conditions of this License, they do not
excuse you from the conditions of this License. If you cannot convey a
covered work so as to satisfy simultaneously your obligations under this
License and any other pertinent obligations, then as a consequence you may
not convey it at all. For example, if you agree to terms that obligate you
to collect a royalty for further conveying from those to whom you convey
the Program, the only way you could satisfy both those terms and this
License would be to refrain entirely from conveying the Program.

13. Remote Network Interaction; Use with the GNU General Public License.

Notwithstanding any other provision of this License, if you modify the
Program, your modified version must prominently offer all users
interacting with it remotely through a computer network (if your version
supports such interaction) an opportunity to receive the Corresponding
Source of your version by providing access to the Corresponding Source
from a network server at no charge, through some standard or customary
means of facilitating copying of software. This Corresponding Source
shall include the Corresponding Source for any work covered by version 3
of the GNU General Public License that is incorporated pursuant to the
following paragraph.

Notwithstanding any other provision of this License, you have
permission to link or combine any covered work with a work licensed
under version 3 of the GNU General Public License into a single
combined work, and to convey the resulting work. The terms of this
License will continue to apply to the part which is the covered work,
but the work with which it is combined will remain governed by version
3 of the GNU General Public License.

14. Revised Versions of this License.

The Free Software Foundation may publish revised and/or new versions of
the GNU Affero General Public License from time to time. Such new versions
will be similar in spirit to the present version, but may differ in detail to
address new problems or concerns.

Each version is given a distinguishing version number. If the
Program specifies that a certain numbered version of the GNU Affero General
Public License "or any later version" applies to it, you have the
option of following the terms and conditions either of that numbered
version or of any later version published by the Free Software
Foundation. If the Program does not specify a version number of the
GNU Affero General Public License, you may choose any version ever published
by the Free Software Foundation.

If the Program specifies that a proxy can decide which future
versions of the GNU Affero General Public License can be used, that proxy's
public statement of acceptance of a version permanently authorizes you
to choose that version for the Program.

Later license versions may give you additional or different
permissions. However, no additional obligations are imposed on any
author or copyright holder as a result of your choosing to follow a
later version.

15. Disclaimer of Warranty.

THERE IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED BY
APPLICABLE LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT
HOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY
OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO,
THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE PROGRAM
IS WITH YOU. SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF
ALL NECESSARY SERVICING, REPAIR OR CORRECTION.

16. Limitation of Liability.

IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING
WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MODIFIES AND/OR CONVEYS
THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY
GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE
USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED TO LOSS OF
DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD
PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER PROGRAMS),
EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF
SUCH DAMAGES.

17. Interpretation of Sections 15 and 16.

If the disclaimer of warranty and limitation of liability provided
above cannot be given local legal effect according to their terms,
reviewing courts shall apply local law that most closely approximates
an absolute waiver of all civil liability in connection with the
Program, unless a warranty or assumption of liability accompanies a
copy of the Program in return for a fee.

END OF TERMS AND CONDITIONS

How to Apply These Terms to Your New Programs

If you develop a new program, and you want it to be of the greatest
possible use to the public, the best way to achieve this is to make it
free software which everyone can redistribute and change under these terms.

To do so, attach the following notices to the program. It is safest
to attach them to the start of each source file to most effectively
state the exclusion of warranty; and each file should have at least
the "copyright" line and a pointer to where the full notice is found.

<one line to give the program's name and a brief idea of what it does.>
Copyright (C) <year> <name of author>

This program is free software: you can redistribute it and/or modify
it under the terms of the GNU Affero General Public License as published
by the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU Affero General Public License for more details.

You should have received a copy of the GNU Affero General Public License
along with this program. If not, see <https://www.gnu.org/licenses/>.

Also add information on how to contact you by electronic and paper mail.

If your software can interact with users remotely through a computer
network, you should also make sure that it provides a way for users to
get its source. For example, if your program is a web application, its
interface could display a "Source" link that leads users to an archive
of the code. There are many ways you could offer source, and different
solutions will be better for different programs; see section 13 for the
specific requirements.

You should also get your employer (if you work as a programmer) or school,
if any, to sign a "copyright disclaimer" for the program, if necessary.
For more information on this, and how to apply and follow the GNU AGPL, see
<https://www.gnu.org/licenses/>.
README.md (80 lines changed)
@@ -1,5 +1,11 @@
 # Morss - Get full-text RSS feeds
 
+_GNU AGPLv3 code_
+
+Upstream source code: https://git.pictuga.com/pictuga/morss
+Github mirror (for Issues & Pull requests): https://github.com/pictuga/morss
+Homepage: https://morss.it/
+
 This tool's goal is to get full-text RSS feeds out of striped RSS feeds,
 commonly available on internet. Indeed most newspapers only make a small
 description available to users in their rss feeds, which makes the RSS feed
@@ -21,8 +27,20 @@ html structure changes on the target website.
 Additionally morss can grab the source xml feed of iTunes podcast, and detect
 rss feeds in html pages' `<meta>`.
 
-You can use this program online for free at **[morss.it](http://morss.it/)**
-(there's also a [test](http://test.morss.it/) version).
+You can use this program online for free at **[morss.it](https://morss.it/)**.
 
+Some features of morss:
+- Read RSS/Atom feeds
+- Create RSS feeds from json/html pages
+- Convert iTunes podcast links into xml links
+- Export feeds as RSS/JSON/CSV/HTML
+- Fetch full-text content of feed items
+- Follow 301/meta redirects
+- Recover xml feeds with corrupt encoding
+- Supports gzip-compressed http content
+- HTTP caching with 3 different backends (in-memory/sqlite/mysql)
+- Works as server/cli tool
+- Deobfuscate various tracking links
+
 ## Dependencies
 
@@ -30,23 +48,23 @@ You do need:
 
 - [python](http://www.python.org/) >= 2.6 (python 3 is supported)
 - [lxml](http://lxml.de/) for xml parsing
+- [bs4](https://pypi.org/project/bs4/) for badly-formatted html pages
 - [dateutil](http://labix.org/python-dateutil) to parse feed dates
-- [html2text](http://www.aaronsw.com/2002/html2text/)
-- [OrderedDict](https://pypi.python.org/pypi/ordereddict) if using python < 2.7
-- [wheezy.template](https://pypi.python.org/pypi/wheezy.template) to generate HTML pages
 - [chardet](https://pypi.python.org/pypi/chardet)
+- [six](https://pypi.python.org/pypi/six), a dependency of chardet
+- pymysql
 
 Simplest way to get these:
 
+```shell
 pip install -r requirements.txt
+```
 
 You may also need:
 
 - Apache, with python-cgi support, to run on a server
 - a fast internet connection
 
-GPL3 code.
-
 ## Arguments
 
 morss accepts some arguments, to lightly alter the output of morss. Arguments
@@ -63,11 +81,9 @@ The arguments are:
 - `search=STRING`: does a basic case-sensitive search in the feed
 - Advanced
 - `csv`: export to csv
-- `md`: convert articles to Markdown
 - `indent`: returns indented XML or JSON, takes more place, but human-readable
 - `nolink`: drop links, but keeps links' inner text
 - `noref`: drop items' link
-- `hungry`: grab full-article even if it already looks long enough
 - `cache`: only take articles from the cache (ie. don't grab new articles' content), so as to save time
 - `debug`: to have some feedback from the script execution. Useful for debugging
 - `mono`: disable multithreading while fetching, makes debugging easier
@@ -79,9 +95,12 @@ The arguments are:
 - `cors`: allow Cross-origin resource sharing (allows XHR calls from other servers)
 - `html`: changes the http content-type to html, so that python cgi erros (written in html) are readable in a web browser
 - `txt`: changes the http content-type to txt (for faster "`view-source:`")
-- Completely useless
-- `strip`: remove all description and content from feed items
-- `empty`: remove all feed items
+- Custom feeds: you can turn any HTML page into a RSS feed using morss, using xpath rules. The article content will be fetched as usual (with readabilite). Please note that you will have to **replace** any `/` in your rule with a `|` when using morss as a webserver
+- `items`: (**mandatory** to activate the custom feeds function) xpath rule to match all the RSS entries
+- `item_link`: xpath rule relative to `items` to point to the entry's link
+- `item_title`: entry's title
+- `item_content`: entry's description
+- `item_time`: entry's date & time (accepts a wide range of time formats)
 
 ## Use cases
 
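To make the custom-feed syntax above concrete, here is a sketch of what such a request could look like. Everything in it is hypothetical (server name, target page, and xpath rules are all invented; the real rules depend on the page being scraped): the xpath `//div[@class="post"]` becomes `||div[@class="post"]` once every `/` is replaced with `|`, as the note above requires.

```
http://morss.example/:items=||div[@class="post"]:item_title=.||h2:item_link=.||a/https://blog.example/
```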
@@ -124,7 +143,9 @@ ensure that the provided `/www/.htaccess` works well with your server.
 
 Running this command should do:
 
+```shell
 uwsgi --http :9090 --plugin python --wsgi-file main.py
+```
 
 However, one problem might be how to serve the provided `index.html` file if it
 isn't in the same directory. Therefore you can add this at the end of the
@@ -140,8 +161,12 @@ You can change the port and the location of the `www/` folder like this `python
 
 #### Passing arguments
 
-Then visit: **`http://PATH/TO/MORSS/[main.py/][:argwithoutvalue[:argwithvalue=value[...]]]/FEEDURL`**
+Then visit:
+```
+http://PATH/TO/MORSS/[main.py/][:argwithoutvalue[:argwithvalue=value[...]]]/FEEDURL
+```
 For example: `http://morss.example/:clip/https://twitter.com/pictuga`
 
 *(Brackets indicate optional text)*
 
 The `main.py` part is only needed if your server doesn't support the Apache redirect rule set in the provided `.htaccess`.
@@ -150,8 +175,12 @@ Works like a charm with [Tiny Tiny RSS](http://tt-rss.org/redmine/projects/tt-rs
 
 ### As a CLI application
 
-Run: **`python[2.7] -m morss [argwithoutvalue] [argwithvalue=value] [...] FEEDURL`**
+Run:
+```
+python[2.7] -m morss [argwithoutvalue] [argwithvalue=value] [...] FEEDURL
+```
 For example: `python -m morss debug http://feeds.bbci.co.uk/news/rss.xml`
 
 *(Brackets indicate optional text)*
 
 ### As a newsreader hook
@@ -161,8 +190,12 @@ To use it, the newsreader [Liferea](http://lzone.de/liferea/) is required
 scripts can be run on top of the RSS feed, using its
 [output](http://lzone.de/liferea/scraping.htm) as an RSS feed.
 
-To use this script, you have to enable "(Unix) command" in liferea feed settings, and use the command: **`[python2.7] PATH/TO/MORSS/main.py [argwithoutvalue] [argwithvalue=value] [...] FEEDURL`**
+To use this script, you have to enable "(Unix) command" in liferea feed settings, and use the command:
+```
+[python[2.7]] PATH/TO/MORSS/main.py [argwithoutvalue] [argwithvalue=value] [...] FEEDURL
+```
 For example: `python2.7 PATH/TO/MORSS/main.py http://feeds.bbci.co.uk/news/rss.xml`
 
 *(Brackets indicate optional text)*
 
 ### As a python library
@@ -180,7 +213,7 @@ Using cache and passing arguments:
 >>> import morss
 >>> url = 'http://feeds.bbci.co.uk/news/rss.xml'
 >>> cache = '/tmp/morss-cache.db' # sqlite cache location
->>> options = {'csv':True, 'md':True}
+>>> options = {'csv':True}
 >>> xml_string = morss.process(url, cache, options)
 >>> xml_string[:50]
 '{"title": "BBC News - Home", "desc": "The latest s'
@@ -195,7 +228,7 @@ Doing it step-by-step:
 import morss, morss.crawler
 
 url = 'http://newspaper.example/feed.xml'
-options = morss.Options(csv=True, md=True) # arguments
+options = morss.Options(csv=True) # arguments
 morss.crawler.sqlite_default = '/tmp/morss-cache.db' # sqlite cache location
 
 rss = morss.FeedFetch(url, options) # this only grabs the RSS feed
@@ -206,12 +239,12 @@ output = morss.Format(rss, options) # formats final feed
 
 ## Cache information
 
-morss uses a small cache directory to make the loading faster. Given the way
-it's designed, the cache doesn't need to be purged each while and then, unless
-you stop following a big amount of feeds. Only in the case of mass un-subscribing,
-you might want to delete the cache files corresponding to the bygone feeds. If
-morss is running as a server, the cache folder is at `MORSS_DIRECTORY/cache/`,
-and in `$HOME/.cache/morss` otherwise.
+morss uses caching to make loading faster. There are 2 possible cache backends
+(visible in `morss/crawler.py`):
+
+- `SQLiteCache`: sqlite3 cache. Default file location is in-memory (i.e. it will
+be cleared every time the program is run
+- `MySQLCacheHandler`: /!\ Does NOT support multi-threading
 
 ## Configuration
 ### Length limitation
@@ -258,3 +291,4 @@ from this list:
|
|||||||
|
|
||||||
- Add ability to run morss.py as an update daemon
|
- Add ability to run morss.py as an update daemon
|
||||||
- Add ability to use custom xpath rule instead of readability
|
- Add ability to use custom xpath rule instead of readability
|
||||||
|
- More ideas here <https://github.com/pictuga/morss/issues/15>
|
||||||
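
The backends listed above are selected by assigning `morss.crawler.default_cache`, as the `process()` change further down in this diff shows. A minimal sketch, with an illustrative path and credentials:

```
import morss, morss.crawler

# persistent sqlite cache (the default SQLiteCache is in-memory only)
morss.crawler.default_cache = morss.crawler.SQLiteCache('/tmp/morss-cache.db')

# or MySQL (single-threaded use only, as warned above)
# morss.crawler.default_cache = morss.crawler.MySQLCacheHandler(
#     user='morss', password='secret', database='morss')
```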

main.py | 1

@@ -1,4 +1,5 @@
 #!/usr/bin/env python
 
 from morss import main, cgi_wrapper as application
 
 if __name__ == '__main__':

morss/__init__.py

@@ -1 +1,2 @@
+# ran on `import morss`
 from .morss import *

morss/__main__.py

@@ -1,3 +1,4 @@
+# ran on `python -m morss`
 from .morss import main
 
 if __name__ == '__main__':

morss/crawler.py | 256

@@ -1,34 +1,111 @@
 import sys
 
-import ssl
-import socket
+import zlib
 
-from gzip import GzipFile
 from io import BytesIO, StringIO
 import re
 import chardet
+from cgi import parse_header
 import lxml.html
-import sqlite3
 import time
 
 try:
-    from urllib2 import BaseHandler, Request, addinfourl, parse_keqv_list, parse_http_list
+    # python 2
+    from urllib2 import BaseHandler, HTTPCookieProcessor, Request, addinfourl, parse_keqv_list, parse_http_list, build_opener
     import mimetools
 except ImportError:
-    from urllib.request import BaseHandler, Request, addinfourl, parse_keqv_list, parse_http_list
+    # python 3
+    from urllib.request import BaseHandler, HTTPCookieProcessor, Request, addinfourl, parse_keqv_list, parse_http_list, build_opener
     import email
 
 try:
+    # python 2
     basestring
 except NameError:
+    # python 3
     basestring = unicode = str
-    buffer = memoryview
 
 
 MIMETYPE = {
-    'xml': ['text/xml', 'application/xml', 'application/rss+xml', 'application/rdf+xml', 'application/atom+xml'],
+    'xml': ['text/xml', 'application/xml', 'application/rss+xml', 'application/rdf+xml', 'application/atom+xml', 'application/xhtml+xml'],
     'html': ['text/html', 'application/xhtml+xml', 'application/xml']}
 
 
+DEFAULT_UA = 'Mozilla/5.0 (X11; Linux x86_64; rv:25.0) Gecko/20100101 Firefox/25.0'
+
+
+def custom_handler(accept=None, strict=False, delay=None, encoding=None, basic=False):
+    handlers = []
+
+    # as per urllib2 source code, these Handlers are added first
+    # *unless* one of the custom handlers inherits from one of them
+    #
+    # [ProxyHandler, UnknownHandler, HTTPHandler,
+    #  HTTPDefaultErrorHandler, HTTPRedirectHandler,
+    #  FTPHandler, FileHandler, HTTPErrorProcessor]
+    # & HTTPSHandler
+
+    #handlers.append(DebugHandler())
+    handlers.append(SizeLimitHandler(100*1024)) # 100KiB
+    handlers.append(HTTPCookieProcessor())
+    handlers.append(GZIPHandler())
+    handlers.append(HTTPEquivHandler())
+    handlers.append(HTTPRefreshHandler())
+    handlers.append(UAHandler(DEFAULT_UA))
+
+    if not basic:
+        handlers.append(AutoRefererHandler())
+
+    handlers.append(EncodingFixHandler(encoding))
+
+    if accept:
+        handlers.append(ContentNegociationHandler(MIMETYPE[accept], strict))
+
+    handlers.append(CacheHandler(force_min=delay))
+
+    return build_opener(*handlers)
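
The factory above returns a standard urllib opener, so the usual `open()`/`read()` calls apply. A short usage sketch, assuming the handler classes defined below; the feed URL is only an illustration:

```
# Build an opener that negotiates XML, caches for an hour, and
# transparently handles gzip, cookies, charset and meta-refresh.
opener = custom_handler(accept='xml', strict=True, delay=3600)
response = opener.open('http://feeds.bbci.co.uk/news/rss.xml', timeout=8)
body = response.read()  # bytes, already gunzipped by GZIPHandler
```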
 
 
+class DebugHandler(BaseHandler):
+    handler_order = 2000
+
+    def http_request(self, req):
+        print(repr(req.header_items()))
+        return req
+
+    def http_response(self, req, resp):
+        print(resp.headers.__dict__)
+        return resp
+
+    https_request = http_request
+    https_response = http_response
+
+
+class SizeLimitHandler(BaseHandler):
+    """ Limit file size, defaults to 5MiB """
+
+    handler_order = 450
+
+    def __init__(self, limit=5*1024*1024):
+        self.limit = limit
+
+    def http_response(self, req, resp):
+        data = resp.read(self.limit)
+
+        fp = BytesIO(data)
+        old_resp = resp
+        resp = addinfourl(fp, old_resp.headers, old_resp.url, old_resp.code)
+        resp.msg = old_resp.msg
+
+        return resp
+
+    https_response = http_response
+
+
+def UnGzip(data):
+    " Supports truncated files "
+    return zlib.decompressobj(zlib.MAX_WBITS | 32).decompress(data)
+
+
 class GZIPHandler(BaseHandler):
     def http_request(self, req):
         req.add_unredirected_header('Accept-Encoding', 'gzip')
@@ -38,7 +115,9 @@ class GZIPHandler(BaseHandler):
         if 200 <= resp.code < 300:
             if resp.headers.get('Content-Encoding') == 'gzip':
                 data = resp.read()
-                data = GzipFile(fileobj=BytesIO(data), mode='r').read()
+
+                data = UnGzip(data)
+
                 resp.headers['Content-Encoding'] = 'identity'
 
                 fp = BytesIO(data)
@@ -52,9 +131,15 @@ class GZIPHandler(BaseHandler):
     https_request = http_request
 
 
-def detect_encoding(data, con=None):
-    if con is not None and con.info().get('charset'):
-        return con.info().get('charset')
+def detect_encoding(data, resp=None):
+    if resp is not None:
+        enc = resp.headers.get('charset')
+        if enc is not None:
+            return enc
+
+        enc = parse_header(resp.headers.get('content-type', ''))[1].get('charset')
+        if enc is not None:
+            return enc
 
     match = re.search(b'charset=["\']?([0-9a-zA-Z-]+)', data[:1000])
     if match:
@@ -64,8 +149,8 @@ def detect_encoding(data, con=None):
     if match:
         return match.groups()[0].lower().decode()
 
-    enc = chardet.detect(data[:1000])['encoding']
-    if enc:
+    enc = chardet.detect(data[-2000:])['encoding']
+    if enc and enc != 'ascii':
         return enc
 
     return 'utf-8'
@@ -79,7 +164,11 @@ class EncodingFixHandler(BaseHandler):
         maintype = resp.info().get('Content-Type', '').split('/')[0]
         if 200 <= resp.code < 300 and maintype == 'text':
             data = resp.read()
-            enc = detect_encoding(data, resp) if not self.encoding else self.encoding
+
+            if not self.encoding:
+                enc = detect_encoding(data, resp)
+            else:
+                enc = self.encoding
+
             if enc:
                 data = data.decode(enc, 'replace')
@@ -109,7 +198,6 @@ class UAHandler(BaseHandler):
 
 class AutoRefererHandler(BaseHandler):
     def http_request(self, req):
-        if req.host != 'feeds.feedburner.com':
-            req.add_unredirected_header('Referer', 'http://%s' % req.host)
+        req.add_unredirected_header('Referer', 'http://%s' % req.host)
         return req
 
 
@@ -139,7 +227,7 @@ class ContentNegociationHandler(BaseHandler):
 
     def http_response(self, req, resp):
         contenttype = resp.info().get('Content-Type', '').split(';')[0]
-        if 200 <= resp.code < 300 and self.strict and contenttype in MIMETYPE['html'] and contenttype not in self.accept:
+        if 200 <= resp.code < 300 and self.accept is not None and self.strict and contenttype in MIMETYPE['html'] and contenttype not in self.accept:
             # oops, not what we were looking for, let's see if the html page suggests an alternative page of the right types
 
             data = resp.read()
@@ -209,42 +297,37 @@ class HTTPRefreshHandler(BaseHandler):
     https_response = http_response
 
 
-class BaseCacheHandler(BaseHandler):
-    " Cache based on etags/last-modified. Inherit from this to implement actual storage "
+default_cache = {}
+
+
+class CacheHandler(BaseHandler):
+    " Cache based on etags/last-modified "
 
     private_cache = False # False to behave like a CDN (or if you just don't care), True like a PC
     handler_order = 499
 
-    def __init__(self, force_min=None):
+    def __init__(self, cache=None, force_min=None):
+        self.cache = cache or default_cache
         self.force_min = force_min # force_min (seconds) to bypass http headers, -1 forever, 0 never, -2 do nothing if not in cache
 
-    def _load(self, url):
-        out = list(self.load(url))
+    def load(self, url):
+        try:
+            out = list(self.cache[url])
+        except KeyError:
+            out = [None, None, unicode(), bytes(), 0]
 
         if sys.version_info[0] >= 3:
            out[2] = email.message_from_string(out[2] or unicode()) # headers
        else:
            out[2] = mimetools.Message(StringIO(out[2] or unicode()))
 
-        out[3] = out[3] or bytes() # data
-        out[4] = out[4] or 0 # timestamp
-
        return out
 
-    def load(self, url):
-        " Return the basic vars (code, msg, headers, data, timestamp) "
-        return (None, None, None, None, None)
-
-    def _save(self, url, code, msg, headers, data, timestamp):
-        headers = unicode(headers)
-        self.save(url, code, msg, headers, data, timestamp)
-
     def save(self, url, code, msg, headers, data, timestamp):
-        " Save values to disk "
-        pass
+        self.cache[url] = (code, msg, unicode(headers), data, timestamp)
 
     def http_request(self, req):
-        (code, msg, headers, data, timestamp) = self._load(req.get_full_url())
+        (code, msg, headers, data, timestamp) = self.load(req.get_full_url())
 
         if 'etag' in headers:
             req.add_unredirected_header('If-None-Match', headers['etag'])
@@ -255,7 +338,7 @@ class BaseCacheHandler(BaseHandler):
         return req
 
     def http_open(self, req):
-        (code, msg, headers, data, timestamp) = self._load(req.get_full_url())
+        (code, msg, headers, data, timestamp) = self.load(req.get_full_url())
 
         # some info needed to process everything
         cache_control = parse_http_list(headers.get('cache-control', ()))
@@ -294,6 +377,11 @@ class BaseCacheHandler(BaseHandler):
             # force refresh
             return None
 
+        elif code == 301 and cache_age < 7*24*3600:
+            # "301 Moved Permanently" has to be cached...as long as we want (awesome HTTP specs), let's say a week (why not?)
+            # use force_min=0 if you want to bypass this (needed for a proper refresh)
+            pass
+
         elif self.force_min is None and ('no-cache' in cc_list
                                         or 'no-store' in cc_list
                                         or ('private' in cc_list and not self.private)):
@@ -308,11 +396,6 @@ class BaseCacheHandler(BaseHandler):
             # still recent enough for us, use cache
             pass
 
-        elif code == 301 and cache_age < 7*24*3600:
-            # "301 Moved Permanently" has to be cached...as long as we want (awesome HTTP specs), let's say a week (why not?)
-            # use force_min=0 if you want to bypass this (needed for a proper refresh)
-            pass
-
         else:
             # according to the www, we have to refresh when nothing is said
             return None
@@ -346,7 +429,7 @@ class BaseCacheHandler(BaseHandler):
 
         # save to disk
         data = resp.read()
-        self._save(req.get_full_url(), resp.code, resp.msg, resp.headers, data, time.time())
+        self.save(req.get_full_url(), resp.code, resp.msg, resp.headers, data, time.time())
 
         fp = BytesIO(data)
         old_resp = resp
@@ -356,11 +439,11 @@ class BaseCacheHandler(BaseHandler):
         return resp
 
     def http_error_304(self, req, fp, code, msg, headers):
-        cache = list(self._load(req.get_full_url()))
+        cache = list(self.load(req.get_full_url()))
 
         if cache[0]:
             cache[-1] = time.time()
-            self._save(req.get_full_url(), *cache)
+            self.save(req.get_full_url(), *cache)
 
         new = Request(req.get_full_url(),
                       headers=req.headers,
@@ -377,36 +460,85 @@ class BaseCacheHandler(BaseHandler):
     https_response = http_response
 
 
-sqlite_default = ':memory'
+class BaseCache:
+    def __contains__(self, url):
+        try:
+            self[url]
+
+        except KeyError:
+            return False
+
+        else:
+            return True
 
 
-class SQliteCacheHandler(BaseCacheHandler):
-    def __init__(self, force_min=-1, filename=None):
-        BaseCacheHandler.__init__(self, force_min)
+import sqlite3
 
 
+class SQLiteCache(BaseCache):
+    def __init__(self, filename=':memory:'):
         self.con = sqlite3.connect(filename or sqlite_default, detect_types=sqlite3.PARSE_DECLTYPES, check_same_thread=False)
-        self.con.execute('create table if not exists data (url unicode PRIMARY KEY, code int, msg unicode, headers unicode, data bytes, timestamp int)')
-        self.con.commit()
+
+        with self.con:
+            self.con.execute('CREATE TABLE IF NOT EXISTS data (url UNICODE PRIMARY KEY, code INT, msg UNICODE, headers UNICODE, data BLOB, timestamp INT)')
+            self.con.execute('pragma journal_mode=WAL')
 
     def __del__(self):
         self.con.close()
 
-    def load(self, url):
-        row = self.con.execute('select * from data where url=?', (url,)).fetchone()
+    def __getitem__(self, url):
+        row = self.con.execute('SELECT * FROM data WHERE url=?', (url,)).fetchone()
 
         if not row:
-            return (None, None, None, None, None)
+            raise KeyError
 
         return row[1:]
 
-    def save(self, url, code, msg, headers, data, timestamp):
-        data = buffer(data)
+    def __setitem__(self, url, value): # value = (code, msg, headers, data, timestamp)
+        value = list(value)
+        value[3] = sqlite3.Binary(value[3]) # data
+        value = tuple(value)
 
-        if self.con.execute('select code from data where url=?', (url,)).fetchone():
-            self.con.execute('update data set code=?, msg=?, headers=?, data=?, timestamp=? where url=?',
-                (code, msg, headers, data, timestamp, url))
+        if url in self:
+            with self.con:
+                self.con.execute('UPDATE data SET code=?, msg=?, headers=?, data=?, timestamp=? WHERE url=?',
+                    value + (url,))
 
         else:
-            self.con.execute('insert into data values (?,?,?,?,?,?)', (url, code, msg, headers, data, timestamp))
-
-        self.con.commit()
+            with self.con:
+                self.con.execute('INSERT INTO data VALUES (?,?,?,?,?,?)', (url,) + value)
+
+
+import pymysql.cursors
+
+
+class MySQLCacheHandler(BaseCache):
+    " NB. Requires mono-threading, as pymysql isn't thread-safe "
+    def __init__(self, user, password, database, host='localhost'):
+        self.con = pymysql.connect(host=host, user=user, password=password, database=database, charset='utf8', autocommit=True)
+
+        with self.con.cursor() as cursor:
+            cursor.execute('CREATE TABLE IF NOT EXISTS data (url VARCHAR(255) NOT NULL PRIMARY KEY, code INT, msg TEXT, headers TEXT, data BLOB, timestamp INT)')
+
+    def __del__(self):
+        self.con.close()
+
+    def __getitem__(self, url):
+        cursor = self.con.cursor()
+        cursor.execute('SELECT * FROM data WHERE url=%s', (url,))
+        row = cursor.fetchone()
+
+        if not row:
+            raise KeyError
+
+        return row[1:]
+
+    def __setitem__(self, url, value): # (code, msg, headers, data, timestamp)
+        if url in self:
+            with self.con.cursor() as cursor:
+                cursor.execute('UPDATE data SET code=%s, msg=%s, headers=%s, data=%s, timestamp=%s WHERE url=%s',
+                    value + (url,))
+
+        else:
+            with self.con.cursor() as cursor:
+                cursor.execute('INSERT INTO data VALUES (%s,%s,%s,%s,%s,%s)', (url,) + value)
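
`CacheHandler` only needs a dict-like object: `__getitem__` raising `KeyError` on a miss, `__setitem__`, and the `__contains__` that `BaseCache` derives from them. A hypothetical third backend, for illustration only:

```
from collections import OrderedDict

class CappedCache(BaseCache):
    " Hypothetical dict-backed cache evicting the oldest entry past a cap "
    def __init__(self, cap=1000):
        self.cap = cap
        self.data = OrderedDict()

    def __getitem__(self, url):
        return self.data[url]  # raises KeyError on miss, as CacheHandler expects

    def __setitem__(self, url, value):  # value = (code, msg, headers, data, timestamp)
        self.data[url] = value
        if len(self.data) > self.cap:
            self.data.popitem(last=False)  # drop the oldest entry
```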

morss/feedify.ini

@@ -1,70 +1,146 @@
+[rss-rdf]
+mode = xml
+
+timeformat = %a, %d %b %Y %H:%M:%S %Z
+
+base = <?xml version="1.0"?><rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns="http://purl.org/rss/1.0/"></rdf:RDF>
+
+title = /rdf:RDF/rssfake:channel/rssfake:title
+desc = /rdf:RDF/rssfake:channel/rssfake:description
+items = /rdf:RDF/rssfake:channel/rssfake:item
+
+item_title = rssfake:title
+item_link = rssfake:link
+item_desc = rssfake:description
+item_content = content:encoded
+item_time = rssfake:pubDate
+
+
+[rss-channel]
+mode = xml
+
+timeformat = %a, %d %b %Y %H:%M:%S %Z
+
+base = <?xml version="1.0"?><rss version="2.0"></rss>
+
+title = /rss/channel/title
+desc = /rss/channel/description
+items = /rss/channel/item
+
+item_title = title
+item_link = link
+item_desc = description
+item_content = content:encoded
+item_time = pubDate
+
+
+[rss-atom]
+mode = xml
+
+timeformat = %Y-%m-%dT%H:%M:%SZ
+
+base = <?xml version="1.0"?><feed xmlns="http://www.w3.org/2005/Atom"></feed>
+
+title = /atom:feed/atom:title
+desc = /atom:feed/atom:subtitle
+items = /atom:feed/atom:entry
+
+item_title = atom:title
+item_link = atom:link/@href
+item_desc = atom:summary
+item_content = atom:content
+item_time = atom:published
+item_updated = atom:updated
+
+[rss-atom03]
+mode = xml
+
+timeformat = %Y-%m-%dT%H:%M:%SZ
+
+base = <?xml version="1.0"?><feed version="0.3" xmlns="http://purl.org/atom/ns#"></feed>
+title = /atom03:feed/atom03:title
+desc = /atom03:feed/atom03:subtitle
+items = /atom03:feed/atom03:entry
+
+item_title = atom03:title
+item_link = atom03:link/@href
+item_desc = atom03:summary
+item_content = atom03:content
+item_time = atom03:published
+item_updated = atom03:updated
+
+[json]
+mode = json
+
+mimetype = application/json
+timeformat = %Y-%m-%dT%H:%M:%SZ
+base = {}
+
+title = title
+desc = desc
+items = items.[]
+
+item_title = title
+item_link = url
+item_desc = desc
+item_content = content
+item_time = time
+item_updated = updated
+
+[html]
+mode = html
+
+title = //div[@id='header']/h1
+desc = //div[@id='header']/h2
+items = //div[@id='content']/div
+
+item_title = ./a
+item_link = ./a/@href
+item_desc = ./div[class=desc]
+item_content = ./div[class=content]
+
+base = <!DOCTYPE html> <html> <head> <title>Feed reader by morss</title> <meta name="viewport" content="width=device-width; initial-scale=1.0; maximum-scale=1.0;" /> </head> <body> <div id="header"> <h1>@feed.title</h1> <h2>@feed.desc</h2> <p>- via morss</p> </div> <div id="content"> <div class="item"> <a class="title link" href="@item.link" target="_blank">@item.title</a> <div class="desc">@item.desc</div> <div class="content">@item.content</div> </div> </div> <script> var items = document.getElementsByClassName('item') for (var i in items) items[i].onclick = function() { this.classList.toggle('active') document.body.classList.toggle('noscroll') } </script> </body> </html>
 
 [twitter]
-mode = xpath
+mode = html
 
 path =
     http://twitter.com/*
     https://twitter.com/*
     http://www.twitter.com/*
     https://www.twitter.com/*
 
-title = //head/title/text()
-items = //div[class=js-tweet]|//div[class=tweet]
+title = //head/title
+items = //table[class=tweet]|//div[class=tweet]
 
-item_title = ./@data-name " (@" ./@data-screen-name ")"
-item_link = .//a[class=js-permalink]/@href
-item_content = .//p[class=js-tweet-text]
-item_time = .//span/@data-time
+item_title = .//div[class=username]
+item_link = ./@href
+item_desc = .//div[class=tweet-text]/div
 
 [google]
-mode = xpath
+mode = html
 
 path =
     http://google.com/search?q=*
     http://www.google.com/search?q=*
 
-title = //head/title/text()
+title = //head/title
 items = //li[class=g]
 
-item_title = .//h3//text()
+item_title = .//h3
 item_link = .//a/@href
-item_content = .//span[class=st]
+item_desc = .//span[class=st]
 
 [ddg.gg]
-mode = xpath
+mode = html
 
 path =
     http://duckduckgo.com/html/?q=*
     https://duckduckgo.com/html/?q=*
 
-title = //head/title/text()
+title = //head/title
 items = //div[class=results_links][not(contains(@class,'sponsored'))]
 
-item_title = .//a[class=large]//text()
-item_link = .//a[class=large]/@href
-item_content = .//div[class=snippet]
+item_title = .//a[class=result__a]
+item_link = .//a[class=result__a]/@href
+item_desc = .//a[class=result__snippet]
 
-[facebook home]
-mode = json
-path =
-    https://graph.facebook.com/*/home*
-    https://graph.facebook.com/*/feed*
-
-title = "Facebook"
-items = data
-
-item_title = from.name {" > " to.data.name<", ">}
-item_link = actions.link[0]
-item_content = message story{"<br/><br/><a href='" link "'><img src='" picture "' /></a>"}{"<blockquote><a href='" link "'>" name "</a><br/>" description "</blockquote>"}{"<br/><br/> – @ " place.name}
-item_time = created_time
-item_id = id
-
-[facebook message/post]
-mode = json
-path =
-    https://graph.facebook.com/*
-    https://graph.facebook.com/*
-
-title = "Facebook"
-items = comments.data
-
-item_title = from.name
-item_content = message
-item_time = created_time
-item_id = id
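
The `[class=...]` shorthand in these rules is not valid XPath; it abbreviates a class-attribute test. The expansion is the `to_class` helper that the next file's diff removes, reproduced here as a standalone sketch:

```
import re

def to_class(query):
    " Expand the [class=...] shorthand into a real XPath class test "
    pattern = r'\[class=([^\]]+)\]'
    repl = r'[@class and contains(concat(" ", normalize-space(@class), " "), " \1 ")]'
    return re.sub(pattern, repl, query)

print(to_class('//div[class=tweet]'))
# //div[@class and contains(concat(" ", normalize-space(@class), " "), " tweet ")]
```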

morss/feedify.py | 209

@@ -1,215 +1,28 @@
-#!/usr/bin/env python
-
-import os.path
-
 import re
 import json
 
-from fnmatch import fnmatch
-import lxml.html
-
-from . import feeds
 from . import crawler
 
-try:
-    from ConfigParser import ConfigParser
-    from urlparse import urlparse, urljoin
-    from urllib2 import urlopen
-except ImportError:
-    from configparser import ConfigParser
-    from urllib.parse import urlparse, urljoin
-    from urllib.request import urlopen
-
 try:
     basestring
 except NameError:
     basestring = str
 
 
-def to_class(query):
-    pattern = r'\[class=([^\]]+)\]'
-    repl = r'[@class and contains(concat(" ", normalize-space(@class), " "), " \1 ")]'
-    return re.sub(pattern, repl, query)
-
-
-def get_rule(link):
-    config = ConfigParser()
-    config.read(os.path.join(os.path.dirname(__file__), 'feedify.ini'))
-
-    for section in config.sections():
-        values = dict(config.items(section))
-        values['path'] = values['path'].split('\n')[1:]
-        for path in values['path']:
-            if fnmatch(link, path):
-                return values
-    return False
-
-
-def supported(link):
-    return get_rule(link) is not False
-
-
-def format_string(string, getter, error=False):
-    out = ""
-    char = string[0]
-
-    follow = string[1:]
-
-    if char == '"':
-        match = follow.partition('"')
-        out = match[0]
-        if len(match) >= 2:
-            next_match = match[2]
-        else:
-            next_match = None
-    elif char == '{':
-        match = follow.partition('}')
-        try:
-            test = format_string(match[0], getter, True)
-        except (ValueError, KeyError):
-            pass
-        else:
-            out = test
-
-        next_match = match[2]
-    elif char == ' ':
-        next_match = follow
-    elif re.search(r'^([^{}<>" ]+)(?:<"([^>]+)">)?(.*)$', string):
-        match = re.search(r'^([^{}<>" ]+)(?:<"([^>]+)">)?(.*)$', string).groups()
-        raw_value = getter(match[0])
-        if not isinstance(raw_value, basestring):
-            if match[1] is not None:
-                out = match[1].join(raw_value)
-            else:
-                out = ''.join(raw_value)
-        if not out and error:
-            raise ValueError
-        next_match = match[2]
-    else:
-        raise ValueError('bogus string')
-
-    if next_match is not None and len(next_match):
-        return out + format_string(next_match, getter, error)
-    else:
-        return out
-
-
 def pre_worker(url):
-    if urlparse(url).netloc == 'itunes.apple.com':
+    if url.startswith('http://itunes.apple.com/') or url.startswith('https://itunes.apple.com/'):
         match = re.search('/id([0-9]+)(\?.*)?$', url)
         if match:
             iid = match.groups()[0]
-            redirect = 'https://itunes.apple.com/lookup?id={id}'.format(id=iid)
-            return redirect
+            redirect = 'https://itunes.apple.com/lookup?id=%s' % iid
+
+            try:
+                con = crawler.custom_handler(basic=True).open(redirect, timeout=4)
+                data = con.read()
+
+            except (IOError, HTTPException):
+                raise
+
+            return json.loads(data.decode('utf-8', 'replace'))['results'][0]['feedUrl']
 
     return None
-
-
-class Builder(object):
-    def __init__(self, link, data=None):
-        self.link = link
-        self.data = data
-
-        if self.data is None:
-            self.data = urlopen(link).read()
-
-        self.encoding = crawler.detect_encoding(self.data)
-
-        if isinstance(self.data, bytes):
-            self.data = self.data.decode(crawler.detect_encoding(self.data), 'replace')
-
-        self.rule = get_rule(link)
-
-        if self.rule['mode'] == 'xpath':
-            self.doc = lxml.html.fromstring(self.data)
-        elif self.rule['mode'] == 'json':
-            self.doc = json.loads(self.data)
-
-        self.feed = feeds.FeedParserAtom()
-
-    def raw(self, html, expr):
-        " Returns selected items, thru a stupid query "
-
-        if self.rule['mode'] == 'xpath':
-            return html.xpath(to_class(expr))
-
-        elif self.rule['mode'] == 'json':
-            a = [html]
-            b = []
-            for x in expr.strip(".").split("."):
-                match = re.search('^([^\[]+)(?:\[([0-9]+)\])?$', x).groups()
-                for elem in a:
-                    if isinstance(elem, dict):
-                        kids = elem.get(match[0])
-                        if kids is None:
-                            pass
-                        elif isinstance(kids, list):
-                            b += kids
-                        elif isinstance(kids, basestring):
-                            b.append(kids.replace('\n', '<br/>'))
-                        else:
-                            b.append(kids)
-
-                if match[1] is None:
-                    a = b
-                else:
-                    if len(b) - 1 >= int(match[1]):
-                        a = [b[int(match[1])]]
-                    else:
-                        a = []
-                b = []
-            return a
-
-    def strings(self, html, expr):
-        " Turns the results into a nice array of strings (ie. sth useful) "
-
-        if self.rule['mode'] == 'xpath':
-            out = []
-            for match in self.raw(html, expr):
-                if isinstance(match, basestring):
-                    out.append(match)
-                elif isinstance(match, lxml.html.HtmlElement):
-                    out.append(lxml.html.tostring(match))
-
-        elif self.rule['mode'] == 'json':
-            out = self.raw(html, expr)
-
-        out = [x.decode(self.encoding) if isinstance(x, bytes) else x for x in out]
-        return out
-
-    def string(self, html, expr):
-        " Makes a formatted string out of the getter and rule "
-
-        getter = lambda x: self.strings(html, x)
-        return format_string(self.rule[expr], getter)
-
-    def build(self):
-        " Builds the actual rss feed "
-
-        if 'title' in self.rule:
-            self.feed.title = self.string(self.doc, 'title')
-
-        if 'items' in self.rule:
-            matches = self.raw(self.doc, self.rule['items'])
-            if matches and len(matches):
-                for item in matches:
-                    feed_item = {}
-
-                    if 'item_title' in self.rule:
-                        feed_item['title'] = self.string(item, 'item_title')
-                    if 'item_link' in self.rule:
-                        url = self.string(item, 'item_link')
-                        if url:
-                            url = urljoin(self.link, url)
-                        feed_item['link'] = url
-                    if 'item_desc' in self.rule:
-                        feed_item['desc'] = self.string(item, 'item_desc')
-                    if 'item_content' in self.rule:
-                        feed_item['content'] = self.string(item, 'item_content')
-                    if 'item_time' in self.rule:
-                        feed_item['updated'] = self.string(item, 'item_time')
-                    if 'item_id' in self.rule:
-                        feed_item['id'] = self.string(item, 'item_id')
-                        feed_item['is_permalink'] = False
-
-                    self.feed.items.append(feed_item)
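
With this rewrite, `pre_worker` resolves an iTunes page directly to its `feedUrl` (or returns `None` for any other URL) instead of handing back a redirect for the caller to fetch. A usage sketch; the id is made up:

```
# Hypothetical iTunes podcast page: pre_worker extracts the numeric id,
# queries the lookup endpoint and returns the advertised feedUrl.
feed_url = pre_worker('https://itunes.apple.com/us/podcast/example/id123456789')
if feed_url is not None:
    print(feed_url)
```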
morss/feeds.py | 1318
(file diff suppressed because it is too large)

morss/morss.py | 355
@@ -7,10 +7,10 @@ import threading
 
 from fnmatch import fnmatch
 import re
-import json
 
 import lxml.etree
 import lxml.html
+from bs4 import BeautifulSoup
 
 from . import feeds
 from . import feedify
@@ -19,22 +19,20 @@ from . import readabilite
 
 import wsgiref.simple_server
 import wsgiref.handlers
+import cgitb
 
-from html2text import HTML2Text
 
 try:
+    # python 2
     from Queue import Queue
     from httplib import HTTPException
-    from urllib2 import build_opener
-    from urllib2 import HTTPError
-    from urllib import quote_plus
+    from urllib import unquote
     from urlparse import urlparse, urljoin, parse_qs
 except ImportError:
+    # python 3
     from queue import Queue
     from http.client import HTTPException
-    from urllib.request import build_opener
-    from urllib.error import HTTPError
-    from urllib.parse import quote_plus
+    from urllib.parse import unquote
     from urllib.parse import urlparse, urljoin, parse_qs
 
 LIM_ITEM = 100 # deletes what's beyond
@@ -48,9 +46,7 @@ THREADS = 10 # number of threads (1 for single-threaded)
 DEBUG = False
 PORT = 8080
 
-DEFAULT_UA = 'Mozilla/5.0 (X11; Linux x86_64; rv:25.0) Gecko/20100101 Firefox/25.0'
-
-PROTOCOL = ['http', 'https', 'ftp']
+PROTOCOL = ['http', 'https']
 
 
 def filterOptions(options):
@@ -72,6 +68,7 @@ def log(txt, force=False):
     if DEBUG or force:
         if 'REQUEST_URI' in os.environ:
             open('morss.log', 'a').write("%s\n" % repr(txt))
+
         else:
             print(repr(txt))
 
@@ -79,6 +76,7 @@ def log(txt, force=False):
 def len_html(txt):
     if len(txt):
         return len(lxml.html.fromstring(txt).text_content())
+
     else:
         return 0
 
@@ -86,6 +84,7 @@ def len_html(txt):
 def count_words(txt):
     if len(txt):
         return len(lxml.html.fromstring(txt).text_content().split())
+
     return 0
 
 
@@ -94,12 +93,14 @@ class Options:
         if len(args):
             self.options = args
             self.options.update(options or {})
+
         else:
             self.options = options or {}
 
     def __getattr__(self, key):
         if key in self.options:
             return self.options[key]
+
         else:
             return False
 
@@ -113,33 +114,26 @@ class Options:
 def parseOptions(options):
     """ Turns ['md=True'] into {'md':True} """
     out = {}
 
     for option in options:
         split = option.split('=', 1)
 
         if len(split) > 1:
             if split[1].lower() == 'true':
                 out[split[0]] = True
 
             elif split[1].lower() == 'false':
                 out[split[0]] = False
 
             else:
                 out[split[0]] = split[1]
 
         else:
             out[split[0]] = True
 
     return out
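
For reference, `parseOptions` maps a bare token to `True`, coerces the literal strings `true`/`false`, and leaves any other value as a string:

```
>>> parseOptions(['csv', 'indent=4', 'debug=false'])
{'csv': True, 'indent': '4', 'debug': False}
```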
 
 
-default_handlers = [crawler.GZIPHandler(), crawler.UAHandler(DEFAULT_UA),
-                    crawler.AutoRefererHandler(), crawler.HTTPEquivHandler(),
-                    crawler.HTTPRefreshHandler()]
-
-def custom_handler(accept, strict=False, delay=DELAY, encoding=None):
-    handlers = default_handlers[:]
-    handlers.append(crawler.EncodingFixHandler(encoding))
-    handlers.append(crawler.ContentNegociationHandler(crawler.MIMETYPE[accept], strict))
-    handlers.append(crawler.SQliteCacheHandler(delay))
-
-    return build_opener(*handlers)
-
-
 def ItemFix(item, feedurl='/'):
     """ Improves feed items (absolute links, resolve feedburner links, etc) """
 
@@ -177,14 +171,19 @@ def ItemFix(item, feedurl='/'):
         item.link = parse_qs(urlparse(item.link).query)['url'][0]
         log(item.link)
 
+    # pocket
+    if fnmatch(item.link, 'https://getpocket.com/redirect?url=*'):
+        item.link = parse_qs(urlparse(item.link).query)['url'][0]
+        log(item.link)
+
     # facebook
     if fnmatch(item.link, 'https://www.facebook.com/l.php?u=*'):
         item.link = parse_qs(urlparse(item.link).query)['u'][0]
         log(item.link)
 
-    # feedburner
-    feeds.NSMAP['feedburner'] = 'http://rssnamespace.org/feedburner/ext/1.0'
-    match = item.xval('feedburner:origLink')
+    # feedburner FIXME only works if RSS...
+    item.NSMAP['feedburner'] = 'http://rssnamespace.org/feedburner/ext/1.0'
+    match = item.rule_str('feedburner:origLink')
     if match:
         item.link = match
 
@@ -219,30 +218,15 @@ def ItemFill(item, options, feedurl='/', fast=False):
 
     log(item.link)
 
-    # content already provided?
-    count_content = count_words(item.content)
-    count_desc = count_words(item.desc)
-
-    if not options.hungry and max(count_content, count_desc) > 500:
-        if count_desc > count_content:
-            item.content = item.desc
-            del item.desc
-            log('reversed sizes')
-        log('long enough')
-        return True
-
-    if not options.hungry and count_content > 5 * count_desc > 0 and count_content > 50:
-        log('content bigger enough')
-        return True
-
     link = item.link
 
     # twitter
     if urlparse(feedurl).netloc == 'twitter.com':
-        match = lxml.html.fromstring(item.content).xpath('//a/@data-expanded-url')
+        match = lxml.html.fromstring(item.desc).xpath('//a/@data-expanded-url')
         if len(match):
             link = match[0]
             log(link)
 
         else:
             link = None
 
@@ -252,6 +236,7 @@ def ItemFill(item, options, feedurl='/', fast=False):
         if len(match) and urlparse(match[0]).netloc != 'www.facebook.com':
             link = match[0]
             log(link)
+
         else:
             link = None
 
@@ -267,7 +252,7 @@ def ItemFill(item, options, feedurl='/', fast=False):
         delay = -2
 
     try:
-        con = custom_handler('html', False, delay, options.encoding).open(link, timeout=TIMEOUT)
+        con = crawler.custom_handler('html', False, delay, options.encoding).open(link, timeout=TIMEOUT)
         data = con.read()
 
     except (IOError, HTTPException) as e:
@@ -279,14 +264,10 @@ def ItemFill(item, options, feedurl='/', fast=False):
         log('non-text page')
         return True
 
-    out = readabilite.get_article(data, options.encoding)
+    out = readabilite.get_article(data, link, options.encoding or crawler.detect_encoding(data, con))
 
-    if options.hungry or count_words(out) > max(count_content, count_desc):
-        item.push_content(out)
-
-    else:
-        log('link not bigger enough')
-        return True
+    if out is not None:
+        item.content = out
 
     return True
 
@@ -294,10 +275,6 @@ def ItemFill(item, options, feedurl='/', fast=False):
 def ItemBefore(item, options):
     # return None if item deleted
 
-    if options.empty:
-        item.remove()
-        return None
-
     if options.search:
         if options.search not in item.title:
             item.remove()
@@ -307,10 +284,6 @@ def ItemBefore(item, options):
 
 
 def ItemAfter(item, options):
-    if options.strip:
-        del item.desc
-        del item.content
-
     if options.clip and item.desc and item.content:
         item.content = item.desc + "<br/><br/><center>* * *</center><br/><br/>" + item.content
         del item.desc
@@ -328,36 +301,30 @@ def ItemAfter(item, options):
     if options.noref:
         item.link = ''
 
-    if options.md:
-        conv = HTML2Text(baseurl=item.link)
-        conv.unicode_snob = True
-
-        if item.desc:
-            item.desc = conv.handle(item.desc)
-        if item.content:
-            item.content = conv.handle(item.content)
-
     return item
 
 
-def FeedFetch(url, options):
-    # basic url clean-up
+def UrlFix(url):
     if url is None:
         raise MorssException('No url provided')
 
+    if isinstance(url, bytes):
+        url = url.decode()
+
     if urlparse(url).scheme not in PROTOCOL:
         url = 'http://' + url
         log(url)
 
     url = url.replace(' ', '%20')
 
-    if isinstance(url, bytes):
-        url = url.decode()
-
-    # do some useful facebook work
+    return url
+
+
+def FeedFetch(url, options):
+    # allow for code execution for feedify
     pre = feedify.pre_worker(url)
     if pre:
-        url = pre
+        url = UrlFix(pre)
         log('url redirect')
         log(url)
 
@@ -368,31 +335,40 @@ def FeedFetch(url, options):
     delay = 0
 
     try:
-        con = custom_handler('xml', True, delay, options.encoding).open(url, timeout=TIMEOUT * 2)
+        con = crawler.custom_handler(accept='xml', strict=True, delay=delay,
+                                     encoding=options.encoding, basic=not options.items) \
+                     .open(url, timeout=TIMEOUT * 2)
         xml = con.read()
 
-    except (HTTPError) as e:
-        raise MorssException('Error downloading feed (HTTP Error %s)' % e.code)
-
     except (IOError, HTTPException):
         raise MorssException('Error downloading feed')
 
     contenttype = con.info().get('Content-Type', '').split(';')[0]
 
-    if url.startswith('https://itunes.apple.com/lookup?id='):
-        link = json.loads(xml.decode('utf-8', 'replace'))['results'][0]['feedUrl']
-        log('itunes redirect: %s' % link)
-        return FeedFetch(link, options)
-
-    elif re.match(b'\s*<?xml', xml) is not None or contenttype in crawler.MIMETYPE['xml']:
-        rss = feeds.parse(xml)
-
-    elif feedify.supported(url):
-        feed = feedify.Builder(url, xml)
-        feed.build()
-        rss = feed.feed
+    if options.items:
+        # using custom rules
+        rss = feeds.FeedHTML(xml)
+
+        rss.rules['items'] = options.items
+
+        if options.item_title:
+            rss.rules['item_title'] = options.item_title
+        if options.item_link:
+            rss.rules['item_link'] = options.item_link
+        if options.item_content:
+            rss.rules['item_content'] = options.item_content
+        if options.item_time:
+            rss.rules['item_time'] = options.item_time
+
+        rss = rss.convert(feeds.FeedXML)
 
     else:
+        try:
+            rss = feeds.parse(xml, url, contenttype)
+            rss = rss.convert(feeds.FeedXML)
+            # contains all fields, otherwise much-needed data can be lost
+
+        except TypeError:
             log('random page')
             log(contenttype)
             raise MorssException('Link provided is not a valid feed')
@@ -423,6 +399,7 @@ def FeedGather(rss, url, options):
         value = queue.get()
         try:
             worker(*value)
+
         except Exception as e:
             log('Thread Error: %s' % e.message)
         queue.task_done()
@@ -462,6 +439,7 @@ def FeedGather(rss, url, options):
     for i, item in enumerate(list(rss.items)):
         if threads == 1:
             worker(*[i, item])
+
         else:
             queue.put([i, item])
 
@@ -481,26 +459,38 @@ def FeedGather(rss, url, options):
     return rss
 
 
-def FeedFormat(rss, options):
+def FeedFormat(rss, options, encoding='utf-8'):
     if options.callback:
         if re.match(r'^[a-zA-Z0-9\.]+$', options.callback) is not None:
-            return '%s(%s)' % (options.callback, rss.tojson())
+            out = '%s(%s)' % (options.callback, rss.tojson(encoding='unicode'))
+            return out if encoding == 'unicode' else out.encode(encoding)
 
         else:
             raise MorssException('Invalid callback var name')
 
     elif options.json:
         if options.indent:
-            return rss.tojson(indent=4)
+            return rss.tojson(encoding=encoding, indent=4)
 
         else:
-            return rss.tojson()
+            return rss.tojson(encoding=encoding)
 
     elif options.csv:
-        return rss.tocsv()
+        return rss.tocsv(encoding=encoding)
 
     elif options.reader:
-        return rss.tohtml()
+        if options.indent:
+            return rss.tohtml(encoding=encoding, pretty_print=True)
+
+        else:
+            return rss.tohtml(encoding=encoding)
 
     else:
         if options.indent:
-            return rss.tostring(xml_declaration=True, encoding='UTF-8', pretty_print=True)
+            return rss.torss(xml_declaration=True, encoding=encoding, pretty_print=True)
 
         else:
-            return rss.tostring(xml_declaration=True, encoding='UTF-8')
+            return rss.torss(xml_declaration=True, encoding=encoding)
 
 
 def process(url, cache=None, options=None):
@@ -510,40 +500,55 @@ def process(url, cache=None, options=None):
     options = Options(options)
 
     if cache:
-        crawler.sqlite_default = cache
+        crawler.default_cache = crawler.SQLiteCache(cache)
 
+    url = UrlFix(url)
     rss = FeedFetch(url, options)
     rss = FeedGather(rss, url, options)
 
     return FeedFormat(rss, options)
 
 
-def cgi_app(environ, start_response):
+def cgi_parse_environ(environ):
     # get options
 
     if 'REQUEST_URI' in environ:
         url = environ['REQUEST_URI'][1:]
     else:
         url = environ['PATH_INFO'][1:]
 
-    url = re.sub(r'^/?(morss.py|main.py|cgi/main.py)/', '', url)
+        if environ['QUERY_STRING']:
+            url += '?' + environ['QUERY_STRING']
+
+    url = re.sub(r'^/?(cgi/)?(morss.py|main.py)/', '', url)
 
     if url.startswith(':'):
         split = url.split('/', 1)
-        options = split[0].split(':')[1:]
+
+        raw_options = unquote(split[0]).replace('|', '/').replace('\\\'', '\'').split(':')[1:]
+
         if len(split) > 1:
             url = split[1]
         else:
             url = ''
 
     else:
-        options = []
+        raw_options = []
 
     # init
-    options = Options(filterOptions(parseOptions(options)))
-    headers = {}
+    options = Options(filterOptions(parseOptions(raw_options)))
 
     global DEBUG
     DEBUG = options.debug
 
+    return (url, options)
+
+
+def cgi_app(environ, start_response):
+    url, options = cgi_parse_environ(environ)
+
+    headers = {}
 
     # headers
     headers['status'] = '200 OK'
     headers['cache-control'] = 'max-age=%s' % DELAY
@@ -553,7 +558,7 @@ def cgi_app(environ, start_response):
 
     if options.html or options.reader:
         headers['content-type'] = 'text/html'
-    elif options.txt:
+    elif options.txt or options.silent:
         headers['content-type'] = 'text/plain'
     elif options.json:
         headers['content-type'] = 'application/json'
@@ -565,33 +570,56 @@ def cgi_app(environ, start_response):
     else:
         headers['content-type'] = 'text/xml'
 
-    crawler.sqlite_default = os.path.join(os.getcwd(), 'morss-cache.db')
+    crawler.default_cache = crawler.SQLiteCache(os.path.join(os.getcwd(), 'morss-cache.db'))
 
     # get the work done
+    url = UrlFix(url)
     rss = FeedFetch(url, options)
 
     if headers['content-type'] == 'text/xml':
-        headers['content-type'] = rss.mimetype
+        headers['content-type'] = rss.mimetype[0]
 
     start_response(headers['status'], list(headers.items()))
 
     rss = FeedGather(rss, url, options)
     out = FeedFormat(rss, options)
 
-    if not options.silent:
-        return out
+    if options.silent:
+        return ['']
 
-    log('done')
+    else:
+        return [out]
 
 
-def cgi_wrapper(environ, start_response):
-    # simple http server for html and css
+def middleware(func):
+    " Decorator to turn a function into a wsgi middleware "
+    # This is called when parsing the code
+
+    def app_builder(app):
+        # This is called when doing app = cgi_wrapper(app)
+
+        def app_wrap(environ, start_response):
+            # This is called when a http request is being processed
+
+            return func(environ, start_response, app)
+
+        return app_wrap
+
+    return app_builder
|
|
||||||
|
|
||||||
|
@middleware
|
||||||
|
def cgi_file_handler(environ, start_response, app):
|
||||||
|
" Simple HTTP server to serve static files (.html, .css, etc.) "
|
||||||
|
|
||||||
files = {
|
files = {
|
||||||
'': 'text/html',
|
'': 'text/html',
|
||||||
'index.html': 'text/html'}
|
'index.html': 'text/html',
|
||||||
|
'sheet.xsl': 'text/xsl'}
|
||||||
|
|
||||||
if 'REQUEST_URI' in environ:
|
if 'REQUEST_URI' in environ:
|
||||||
url = environ['REQUEST_URI'][1:]
|
url = environ['REQUEST_URI'][1:]
|
||||||
|
|
||||||
else:
|
else:
|
||||||
url = environ['PATH_INFO'][1:]
|
url = environ['PATH_INFO'][1:]
|
||||||
|
|
||||||
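Note: `middleware` is a curried wrapper: decorating `func(environ, start_response, app)` yields a factory that takes the next `app` in the chain and returns a plain WSGI callable. A self-contained toy example of the same pattern (the `add_header` middleware is illustrative, not part of morss):

def middleware(func):
    # same shape as in the diff: func(environ, start_response, app)
    def app_builder(app):
        def app_wrap(environ, start_response):
            return func(environ, start_response, app)
        return app_wrap
    return app_builder

@middleware
def add_header(environ, start_response, app):
    # intercept start_response to append one extra header
    def patched(status, headers, exc_info=None):
        return start_response(status, headers + [('X-Demo', 'yes')], exc_info)
    return app(environ, patched)

def inner(environ, start_response):
    start_response('200 OK', [('Content-Type', 'text/plain')])
    return [b'hello']

app = add_header(inner)   # composed exactly like `app = cgi_encode(app)` below in main()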
@@ -613,23 +641,87 @@ def cgi_wrapper(environ, start_response):
             headers['status'] = '200 OK'
             headers['content-type'] = files[url]
             start_response(headers['status'], list(headers.items()))
-            return body
+            return [body]

         except IOError:
             headers['status'] = '404 Not found'
             start_response(headers['status'], list(headers.items()))
-            return 'Error %s' % headers['status']
+            return ['Error %s' % headers['status']]

-    # actual morss use
+    else:
+        return app(environ, start_response)
+
+
+def cgi_page(environ, start_response):
+    url, options = cgi_parse_environ(environ)
+
+    # get page
+    PROTOCOL = ['http', 'https']
+
+    if urlparse(url).scheme not in ['http', 'https']:
+        url = 'http://' + url
+
+    con = crawler.custom_handler().open(url)
+    data = con.read()
+
+    contenttype = con.info().get('Content-Type', '').split(';')[0]
+
+    if contenttype in ['text/html', 'application/xhtml+xml', 'application/xml']:
+        html = lxml.html.fromstring(BeautifulSoup(data, 'lxml').prettify())
+        html.make_links_absolute(url)
+
+        kill_tags = ['script', 'iframe', 'noscript']
+
+        for tag in kill_tags:
+            for elem in html.xpath('//'+tag):
+                elem.getparent().remove(elem)
+
+        output = lxml.etree.tostring(html.getroottree(), encoding='utf-8')
+
+    else:
+        output = None
+
+    # return html page
+    headers = {'status': '200 OK', 'content-type': 'text/html'}
+    start_response(headers['status'], list(headers.items()))
+    return [output]
+
+
+dispatch_table = {
+    'getpage': cgi_page
+}
+
+
+@middleware
+def cgi_dispatcher(environ, start_response, app):
+    url, options = cgi_parse_environ(environ)
+
+    for key in dispatch_table.keys():
+        if key in options:
+            return dispatch_table[key](environ, start_response)
+
+    return app(environ, start_response)
+
+
+@middleware
+def cgi_error_handler(environ, start_response, app):
     try:
-        return cgi_app(environ, start_response) or []
+        return app(environ, start_response)

     except (KeyboardInterrupt, SystemExit):
         raise

     except Exception as e:
-        headers = {'status': '500 Oops', 'content-type': 'text/plain'}
+        headers = {'status': '500 Oops', 'content-type': 'text/html'}
         start_response(headers['status'], list(headers.items()), sys.exc_info())
-        log('ERROR <%s>: %s' % (url, e.message), force=True)
-        return 'An error happened:\n%s' % e.message
+        log('ERROR: %s' % repr(e), force=True)
+        return [cgitb.html(sys.exc_info())]
+
+
+@middleware
+def cgi_encode(environ, start_response, app):
+    out = app(environ, start_response)
+    return [x if isinstance(x, bytes) else x.encode('utf-8') for x in out]


 def cli_app():

@@ -639,8 +731,9 @@ def cli_app():
     global DEBUG
     DEBUG = options.debug

-    crawler.sqlite_default = os.path.expanduser('~/.cache/morss-cache.db')
+    crawler.default_cache = crawler.SQLiteCache(os.path.expanduser('~/.cache/morss-cache.db'))

+    url = UrlFix(url)
     rss = FeedFetch(url, options)
     rss = FeedGather(rss, url, options)
     out = FeedFormat(rss, options)

@@ -655,6 +748,7 @@ def isInt(string):
     try:
         int(string)
         return True

     except ValueError:
         return False

@@ -662,7 +756,13 @@ def isInt(string):
 def main():
     if 'REQUEST_URI' in os.environ:
         # mod_cgi
-        wsgiref.handlers.CGIHandler().run(cgi_wrapper)
+        app = cgi_app
+        app = cgi_dispatcher(app)
+        app = cgi_error_handler(app)
+        app = cgi_encode(app)
+
+        wsgiref.handlers.CGIHandler().run(app)

     elif len(sys.argv) <= 1 or isInt(sys.argv[1]) or '--root' in sys.argv[1:]:
         # start internal (basic) http server

@@ -671,22 +771,31 @@ def main():
             argPort = int(sys.argv[1])
             if argPort > 0:
                 port = argPort

             else:
                 raise MorssException('Port must be positive integer')

         else:
             port = PORT

+        app = cgi_app
+        app = cgi_file_handler(app)
+        app = cgi_dispatcher(app)
+        app = cgi_error_handler(app)
+        app = cgi_encode(app)
+
         print('Serving http://localhost:%s/' % port)
-        httpd = wsgiref.simple_server.make_server('', port, cgi_wrapper)
+        httpd = wsgiref.simple_server.make_server('', port, app)
         httpd.serve_forever()

     else:
         # as a CLI app
         try:
             cli_app()

         except (KeyboardInterrupt, SystemExit):
             raise

         except Exception as e:
             print('ERROR: %s' % e.message)

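Note: since every layer is now a standard WSGI callable, the stack can be smoke-tested without a server. A hypothetical test harness (self-contained; `encode_wrap` re-states what `cgi_encode(app)` evaluates to after the decorator, and `demo_app` stands in for `cgi_app`):

from wsgiref.util import setup_testing_defaults

def encode_wrap(app):
    # what cgi_encode(app) evaluates to: force every body chunk to bytes
    def app_wrap(environ, start_response):
        out = app(environ, start_response)
        return [x if isinstance(x, bytes) else x.encode('utf-8') for x in out]
    return app_wrap

def demo_app(environ, start_response):
    start_response('200 OK', [('content-type', 'text/plain')])
    return ['hello']          # str on purpose, to exercise the encoder

environ = {}
setup_testing_defaults(environ)   # fills in a minimal valid WSGI environ

statuses = []
app = encode_wrap(demo_app)
body = app(environ, lambda status, headers, exc_info=None: statuses.append(status))
print(statuses[0], body)          # 200 OK [b'hello']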
@@ -19,12 +19,15 @@ def count_words(string):
     And so in about every language (sorry chinese).
     Basically skips spaces in the count. """

+    if string is None:
+        return 0
+
     i = 0
     count = 0

     try:
         while True:
-            if string[i] not in '\n\t ':
+            if string[i] not in "\r\n\t ":
                 count += 1
                 i += 6
             else:
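Note: `i += 6` is deliberate, not an off-by-one: on a non-whitespace hit the loop counts one "word" and jumps ahead roughly one average word length (about 6 characters including the following space), which is why the docstring claims it works "in about every language". The loop presumably terminates on the IndexError once `i` runs past the string, as the `return count` opening the next hunk suggests. A standalone rerun of the approximation:

def count_words(string):
    # same ~one-word-per-6-non-space-chars sampling as in the diff
    if string is None:
        return 0

    i = 0
    count = 0
    try:
        while True:
            if string[i] not in "\r\n\t ":
                count += 1
                i += 6
            else:
                i += 1
    except IndexError:
        return count

print(count_words('the quick brown fox jumps over the lazy dog'))  # 7 (actual: 9)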
@@ -35,123 +38,270 @@ def count_words(string):
     return count


-regex_bad = re.compile('|'.join(['robots-nocontent', 'combx', 'comment',
-    'community', 'disqus', 'extra', 'foot', 'header', 'menu', 'remark', 'rss',
-    'shoutbox', 'sidebar', 'sponsor', 'ad-break', 'agegate', 'pagination',
-    'pager', 'popup', 'tweet', 'twitter', 'com-', 'sharing', 'share', 'social',
-    'contact', 'footnote', 'masthead', 'media', 'meta', 'outbrain', 'promo',
-    'related', 'scroll', 'shoutbox', 'sidebar', 'sponsor', 'shopping', 'tags',
-    'tool', 'widget']), re.I)
-
-regex_good = re.compile('|'.join(['and', 'article', 'body', 'column',
-    'main', 'shadow', 'content', 'entry', 'hentry', 'main', 'page',
-    'pagination', 'post', 'text', 'blog', 'story', 'par']), re.I)
-
-tags_junk = ['script', 'head', 'iframe', 'object', 'noscript', 'param', 'embed', 'layer', 'applet', 'style']
-
-attributes_fine = ['title', 'src', 'href', 'type', 'name', 'for', 'value']
+def count_content(node):
+    # count words and imgs
+    return count_words(node.text_content()) + len(node.findall('.//img'))
+
+
+class_bad = ['comment', 'community', 'extra', 'foot',
+    'sponsor', 'pagination', 'pager', 'tweet', 'twitter', 'com-', 'masthead',
+    'media', 'meta', 'related', 'shopping', 'tags', 'tool', 'author', 'about',
+    'head', 'robots-nocontent', 'combx', 'disqus', 'menu', 'remark', 'rss',
+    'shoutbox', 'sidebar', 'ad-', 'agegate', 'popup', 'sharing', 'share',
+    'social', 'contact', 'footnote', 'outbrain', 'promo', 'scroll', 'hidden',
+    'widget', 'hide']
+
+regex_bad = re.compile('|'.join(class_bad), re.I)
+
+class_good = ['and', 'article', 'body', 'column', 'main',
+    'shadow', 'content', 'entry', 'hentry', 'main', 'page', 'pagination',
+    'post', 'text', 'blog', 'story', 'par', 'editorial']
+
+regex_good = re.compile('|'.join(class_good), re.I)
+
+
+tags_junk = ['script', 'head', 'iframe', 'object', 'noscript',
+    'param', 'embed', 'layer', 'applet', 'style', 'form', 'input', 'textarea',
+    'button', 'footer']
+
+tags_bad = tags_junk + ['a', 'aside']
+
+tags_good = ['h1', 'h2', 'h3', 'article', 'p', 'cite', 'section', 'figcaption',
+    'figure', 'em', 'strong', 'pre', 'br', 'hr', 'headline']
+
+tags_meaning = ['a', 'abbr', 'address', 'acronym', 'audio', 'article', 'aside',
+    'b', 'bdi', 'bdo', 'big', 'blockquote', 'br', 'caption', 'cite', 'center',
+    'code', 'col', 'colgroup', 'data', 'dd', 'del', 'details', 'description',
+    'dfn', 'dl', 'font', 'dt', 'em', 'figure', 'figcaption', 'h1', 'h2', 'h3',
+    'h4', 'h5', 'h6', 'hr', 'i', 'img', 'ins', 'kbd', 'li', 'main', 'mark',
+    'nav', 'ol', 'p', 'pre', 'q', 'ruby', 'rp', 'rt', 's', 'samp', 'small',
+    'source', 'strike', 'strong', 'sub', 'summary', 'sup', 'table', 'tbody',
+    'td', 'tfoot', 'th', 'thead', 'time', 'tr', 'track', 'tt', 'u', 'ul', 'var',
+    'wbr', 'video']
+    # adapted from tt-rss source code, to keep even as shells
+
+tags_void = ['img', 'hr', 'br'] # to keep even if empty
+
+
+attributes_fine = ['title', 'src', 'href', 'type', 'value']


 def score_node(node):
+    " Score individual node "
+
     score = 0

-    if node.tag in tags_junk:
-        return 0
-
-    if isinstance(node, lxml.html.HtmlComment):
-        return 0
-
-    if node.tag in ['a']:
-        score -= 1
-
-    if node.tag in ['h1', 'h2', 'article']:
-        score += 8
-
-    if node.tag in ['p']:
-        score += 3
-
     class_id = node.get('class', '') + node.get('id', '')

-    score += len(regex_good.findall(class_id) * 4)
-    score -= len(regex_bad.findall(class_id) * 3)
+    if (isinstance(node, lxml.html.HtmlComment)
+        or isinstance(node, lxml.html.HtmlProcessingInstruction)
+        or node.tag in tags_bad
+        or regex_bad.search(class_id)):
+        return 0

-    score += count_words(''.join([node.text or ''] + [x.tail or '' for x in node])) / 10. # the .tail part is to include *everything* in that node
+    if node.tag in tags_good:
+        score += 4
+
+    if regex_good.search(class_id):
+        score += 3
+
+    wc = count_words(node.text_content())
+
+    score += min(int(wc/10), 3) # give 1pt bonus for every 10 words, max of 3
+
+    if wc != 0:
+        wca = count_words(' '.join([x.text_content() for x in node.findall('.//a')]))
+        score = score * ( 1 - float(wca)/wc )

     return score

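Note: the rewritten scorer is additive then multiplicative: +4 for a content-ish tag, +3 for a content-ish class/id, up to +3 for raw length, all scaled down by the node's link density. A toy rerun of just the arithmetic (made-up numbers):

def score(tag_is_good, class_is_good, wc, wca):
    # wc: words in the node, wca: words inside its <a> descendants
    s = 0
    if tag_is_good:
        s += 4
    if class_is_good:
        s += 3
    s += min(int(wc / 10), 3)          # +1 per 10 words, capped at 3
    if wc != 0:
        s = s * (1 - float(wca) / wc)  # damp by link density
    return s

# an <article class="content"> with 120 words, 12 of them inside links:
print(score(True, True, 120, 12))      # (4 + 3 + 3) * (1 - 0.1) = 9.0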
-def score_all(root):
-    grades = {}
+def score_all(node, grades=None):
+    " Fairly dumb loop to score all worthwhile nodes. Tries to be fast "
+
+    if grades is None:
+        grades = {}

-    for item in root.iter():
-        score = score_node(item)
-
-        grades[item] = score
-
-        parent = item.getparent()
-        if parent is not None:
-            grades[parent] += score / 2.
-
-            gdparent = parent.getparent()
-            if gdparent is not None:
-                grades[gdparent] += score / 4.
+    for child in node:
+        score = score_node(child)
+        child.attrib['seen'] = 'yes, ' + str(int(score))
+
+        if score > 0:
+            spread_score(child, score, grades)
+            score_all(child, grades)

     return grades


-def get_best_node(root):
-    return sorted(score_all(root).items(), key=lambda x: x[1], reverse=True)[0][0]
+def spread_score(node, score, grades):
+    " Spread the node's score to its parents, on a linear way "
+
+    delta = score / 2
+    for ancestor in [node,] + list(node.iterancestors()):
+        if score >= 1 or ancestor is node:
+            try:
+                grades[ancestor] += score
+            except KeyError:
+                grades[ancestor] = score
+
+            score -= delta
+
+        else:
+            break


-def clean_html(root):
-    for item in root.iter():
-        # Step 1. Do we keep the node?
-
-        if item.tag in tags_junk:
-            item.getparent().remove(item)
-
-        class_id = item.get('class', '') + item.get('id', '')
-        if regex_bad.match(class_id):
-            item.getparent().remove(item)
-
-        if isinstance(item, lxml.html.HtmlComment):
-            item.getparent().remove(item)
-
-        # Step 2. Clean the node's attributes
-
-        for attrib in item.attrib:
-            if attrib not in attributes_fine:
-                del item.attrib[attrib]
-
-
-def br2p(root):
-    for item in root.iterfind('.//br'):
-        parent = item.getparent()
-        if parent is None:
-            continue
-
-        gdparent = parent.getparent()
-        if gdparent is None:
-            continue
-
-        if item.tail is None:
-            # if <br/> is at the end of a div (to avoid having <p/>)
-            continue
-
-        else:
-            # set up new item
-            new_item = lxml.html.Element(parent.tag)
-            new_item.text = item.tail
-
-            for child in item.itersiblings():
-                new_item.append(child)
+def write_score_all(root, grades):
+    " Useful for debugging "
+
+    for node in root.iter():
+        node.attrib['score'] = str(int(grades.get(node, 0)))
+
+
+def clean_root(root):
+    for node in list(root):
+        clean_root(node)
+        clean_node(node)
+
+
+def clean_node(node):
+    parent = node.getparent()
+
+    if parent is None:
+        # this is <html/> (or a removed element waiting for GC)
+        return
+
+    gdparent = parent.getparent()
+
+    # remove shitty tags
+    if node.tag in tags_junk:
+        parent.remove(node)
+        return
+
+    # remove shitty class/id FIXME TODO too efficient, might want to add a toggle
+    class_id = node.get('class', '') + node.get('id', '')
+    if len(regex_bad.findall(class_id)) >= 2:
+        node.getparent().remove(node)
+        return
+
+    # remove shitty link
+    if node.tag == 'a' and len(list(node.iter())) > 3:
+        parent.remove(node)
+        return
+
+    # remove comments
+    if isinstance(node, lxml.html.HtmlComment) or isinstance(node, lxml.html.HtmlProcessingInstruction):
+        parent.remove(node)
+        return
+
+    # remove if too many kids & too high link density
+    wc = count_words(node.text_content())
+    if wc != 0 and len(list(node.iter())) > 3:
+        wca = count_words(' '.join([x.text_content() for x in node.findall('.//a')]))
+        if float(wca)/wc > 0.8:
+            parent.remove(node)
+            return
+
+    # squash text-less elements shells
+    if node.tag in tags_void:
+        # keep 'em
+        pass
+
+    elif node.tag in tags_meaning:
+        # remove if content-less
+        if not count_content(node):
+            parent.remove(node)
+            return
+
+    else:
+        # squash non-meaningful if no direct text
+        content = (node.text or '') + ' '.join([child.tail or '' for child in node])
+        if not count_words(content):
+            node.drop_tag()
+            return
+
+    # for http://vice.com/fr/
+    if node.tag == 'img' and 'data-src' in node.attrib:
+        node.attrib['src'] = node.attrib['data-src']
+
+    # clean the node's attributes
+    for attrib in node.attrib:
+        if attrib not in attributes_fine:
+            del node.attrib[attrib]
+
+    # br2p
+    if node.tag == 'br':
+        if gdparent is None:
+            return
+
+        if not count_words(node.tail):
+            # if <br/> is at the end of a div (to avoid having <p/>)
+            return
+
+        else:
+            # set up new node
+            new_node = lxml.html.Element(parent.tag)
+            new_node.text = node.tail
+
+            for child in node.itersiblings():
+                new_node.append(child)

             # delete br
-            item.tail = None
-            parent.remove(item)
+            node.tail = None
+            parent.remove(node)

-            gdparent.insert(gdparent.index(parent)+1, new_item)
+            gdparent.insert(gdparent.index(parent)+1, new_node)


-def get_article(data, encoding=None):
-    return lxml.etree.tostring(get_best_node(parse(data, encoding)))
+def lowest_common_ancestor(nodeA, nodeB, max_depth=None):
+    ancestorsA = list(nodeA.iterancestors())
+    ancestorsB = list(nodeB.iterancestors())
+
+    if max_depth is not None:
+        ancestorsA = ancestorsA[:max_depth]
+        ancestorsB = ancestorsB[:max_depth]
+
+    ancestorsA.insert(0, nodeA)
+    ancestorsB.insert(0, nodeB)
+
+    for ancestorA in ancestorsA:
+        if ancestorA in ancestorsB:
+            return ancestorA
+
+    return nodeA # should always find one tho, at least <html/>, but needed for max_depth
+
+
+def rank_nodes(grades):
+    return sorted(grades.items(), key=lambda x: x[1], reverse=True)
+
+
+def get_best_node(grades):
+    " To pick the best (raw) node. Another function will clean it "
+
+    if len(grades) == 1:
+        return grades[0]
+
+    top = rank_nodes(grades)
+    lowest = lowest_common_ancestor(top[0][0], top[1][0], 3)
+
+    return lowest
+
+
+def get_article(data, url=None, encoding=None):
+    " Input a raw html string, returns a raw html string of the article "
+
+    html = parse(data, encoding)
+    scores = score_all(html)
+
+    if not len(scores):
+        return None
+
+    best = get_best_node(scores)
+
+    wc = count_words(best.text_content())
+    wca = count_words(' '.join([x.text_content() for x in best.findall('.//a')]))
+
+    if wc - wca < 50 or float(wca) / wc > 0.3:
+        return None
+
+    if url:
+        best.make_links_absolute(url)
+
+    clean_root(best)
+
+    return lxml.etree.tostring(best, pretty_print=True)
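Note: end to end, `get_article` scores the whole tree, picks the lowest common ancestor of the two top-ranked nodes, prunes it with `clean_root`, and rejects the result if it is too short or too link-heavy. A hypothetical driver, assuming `get_article` and `parse` from this module are in scope (the sample HTML is made up):

SAMPLE = ('<html><body>'
          '<div class="sidebar"><a href="/a">lots</a> <a href="/b">of links</a></div>'
          '<article class="content"><h1>Title</h1><p>' + 'word ' * 80 + '</p></article>'
          '</body></html>')

article = get_article(SAMPLE)  # bytes, or None if the sanity checks fail

if article is None:
    print('no article found (too short or too link-heavy)')
else:
    print(article.decode('utf-8'))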
@@ -6,7 +6,167 @@
 <meta charset="UTF-8" />
 <meta name="description" content="@feed.desc (via morss)" />
 <meta name="viewport" content="width=device-width; initial-scale=1.0; maximum-scale=1.0;" />
-<link rel="stylesheet" href="https://thisisdallas.github.io/Simple-Grid/simpleGrid.css" />
+<style type="text/css">
+	/* columns - from https://thisisdallas.github.io/Simple-Grid/simpleGrid.css */
+
+	* {
+		box-sizing: border-box;
+	}
+
+	#content {
+		width: 100%;
+		max-width: 1140px;
+		min-width: 755px;
+		margin: 0 auto;
+		overflow: hidden;
+
+		padding-top: 20px;
+		padding-left: 20px; /* grid-space to left */
+		padding-right: 0px; /* grid-space to right: (grid-space-left - column-space) e.g. 20px-20px=0 */
+	}
+
+	.item {
+		width: 33.33%;
+		float: left;
+		padding-right: 20px; /* column-space */
+	}
+
+	@@media handheld, only screen and (max-width: 767px) { /* @@ to escape from the template engine */
+		#content {
+			width: 100%;
+			min-width: 0;
+			margin-left: 0px;
+			margin-right: 0px;
+			padding-left: 20px; /* grid-space to left */
+			padding-right: 10px; /* grid-space to right: (grid-space-left - column-space) e.g. 20px-10px=10px */
+		}
+
+		.item {
+			width: auto;
+			float: none;
+			margin-left: 0px;
+			margin-right: 0px;
+			margin-top: 10px;
+			margin-bottom: 10px;
+			padding-left: 0px;
+			padding-right: 10px; /* column-space */
+		}
+	}
+
+	/* design */
+
+	#header h1, #header h2, #header p {
+		font-family: sans;
+		text-align: center;
+		margin: 0;
+		padding: 0;
+	}
+
+	#header h1 {
+		font-size: 2.5em;
+		font-weight: bold;
+		padding: 1em 0 0.25em;
+	}
+
+	#header h2 {
+		font-size: 1em;
+		font-weight: normal;
+	}
+
+	#header p {
+		color: gray;
+		font-style: italic;
+		font-size: 0.75em;
+	}
+
+	#content {
+		text-align: justify;
+	}
+
+	.item .title {
+		font-weight: bold;
+		display: block;
+		text-align: center;
+	}
+
+	.item .link {
+		color: inherit;
+		text-decoration: none;
+	}
+
+	.item:not(.active) {
+		cursor: pointer;
+
+		height: 20em;
+		margin-bottom: 20px;
+		overflow: hidden;
+		text-overflow: ellpisps;
+
+		padding: 0.25em;
+		position: relative;
+	}
+
+	.item:not(.active) .title {
+		padding-bottom: 0.1em;
+		margin-bottom: 0.1em;
+		border-bottom: 1px solid silver;
+	}
+
+	.item:not(.active):before {
+		content: " ";
+		display: block;
+		width: 100%;
+		position: absolute;
+		top: 18.5em;
+		height: 1.5em;
+		background: linear-gradient(to bottom, rgba(255,255,255,0) 0%, rgba(255,255,255,1) 100%);
+	}
+
+	.item:not(.active) .article * {
+		max-width: 100%;
+		font-size: 1em !important;
+		font-weight: normal;
+		display: inline;
+		margin: 0;
+	}
+
+	.item.active {
+		background: white;
+		position: fixed;
+		overflow: auto;
+		top: 0;
+		left: 0;
+		height: 100%;
+		width: 100%;
+		z-index: 1;
+	}
+
+	body.noscroll {
+		overflow: hidden;
+	}
+
+	.item.active > * {
+		max-width: 700px;
+		margin: auto;
+	}
+
+	.item.active .title {
+		font-size: 2em;
+		padding: 0.5em 0;
+	}
+
+	.item.active .article object,
+	.item.active .article video,
+	.item.active .article audio {
+		display: none;
+	}
+
+	.item.active .article img {
+		max-height: 20em;
+		max-width: 100%;
+	}
+</style>
 </head>

 <body>

@@ -18,9 +178,8 @@
 <p>- via morss</p>
 </div>

-<div id="content" class="grid grid-pad">
+<div id="content">
 @for item in feed.items:
-    <div class="col-1-3">
     <div class="item">
     @if item.link:
         <a class="title link" href="@item.link" target="_blank">@item.title</a>

@@ -35,7 +194,6 @@
     @end
     </div>
     </div>
-    </div>
 @end
 </div>

@@ -45,6 +203,7 @@
     items[i].onclick = function()
     {
         this.classList.toggle('active')
+        document.body.classList.toggle('noscroll')
     }
 </script>
 </body>

@@ -1,6 +1,5 @@
 lxml
+bs4
 python-dateutil <= 1.5
-html2text
-ordereddict
-wheezy.template
 chardet
+pymysql
2
setup.py
@@ -7,7 +7,7 @@ setup(
     author='pictuga, Samuel Marks',
     author_email='contact at pictuga dot com',
     url='http://morss.it/',
-    license='GPL 3+',
+    license='AGPL v3',
     package_dir={package_name: package_name},
     packages=find_packages(),
     package_data={package_name: ['feedify.ini', 'reader.html.template']},
@@ -4,14 +4,6 @@ ErrorDocument 403 "Access forbidden"
 ErrorDocument 404 /cgi/main.py
 ErrorDocument 500 "A very nasty bug found his way onto this very server"

-Order Deny,Allow
-
 <Files ~ "\.(py|pyc|db|log)$">
 	deny from all
 </Files>
-
-<Files main.py>
-	allow from all
-	AddHandler cgi-script .py
-	Options +ExecCGI
-</Files>
9
www/cgi/.htaccess
Normal file
@@ -0,0 +1,9 @@
+order allow,deny
+
+deny from all
+
+<Files main.py>
+	allow from all
+	AddHandler cgi-script .py
+	Options +ExecCGI
+</Files>
@@ -1,27 +0,0 @@
-<?php
-
-define('FBAPPID', "<insert yours>");
-define('FBSECRET', "<insert yours>");
-define('FBAPPTOKEN', FBAPPID . '|' . FBSECRET);
-
-if (isset($_GET['code']))
-{
-	# get real token from code
-	$code = $_GET['code'];
-	$eurl = sprintf("https://graph.facebook.com/oauth/access_token?client_id=%s&redirect_uri=%s&client_secret=%s&code=%s",
-		FBAPPID, $_SERVER['SCRIPT_URI'], FBSECRET, $code);
-	parse_str(file_get_contents($eurl), $values);
-	$token = $values['access_token'];
-
-	# get long-lived access token
-	$eurl = sprintf("https://graph.facebook.com/oauth/access_token?grant_type=fb_exchange_token&client_id=%s&client_secret=%s&fb_exchange_token=%s",
-		FBAPPID, FBSECRET, $token);
-	parse_str(file_get_contents($eurl), $values);
-	$ltoken = $values['access_token'];
-
-	setcookie('token', $ltoken, 0, '/');
-
-	# headers
-	header('status: 303 See Other');
-	header('location: http://' . $_SERVER['SERVER_NAME'] . '/');
-}
@@ -2,7 +2,7 @@
 <html>
 <head>
 	<title>morss</title>
-	<meta name="viewport" content="width=device-width; initial-scale=1.0; maximum-scale=1.0;">
+	<meta name="viewport" content="width=device-width; initial-scale=1.0; maximum-scale=1.0;" />
 	<meta charset="UTF-8" />
 	<style type="text/css">
 		body
123
www/sheet.xsl
Normal file
@@ -0,0 +1,123 @@
+<?xml version="1.0" encoding="utf-8"?>
+<xsl:stylesheet version="1.1" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
+
+	<xsl:output method="html"/>
+
+	<xsl:template match="/">
+		<html>
+			<head>
+				<title>RSS feed by morss</title>
+				<meta name="viewport" content="width=device-width; initial-scale=1.0; maximum-scale=1.0;" />
+
+				<style type="text/css">
+					body {
+						overflow-wrap: anywhere;
+						word-wrap: anywhere;
+						font-family: sans;
+					}
+
+					#url {
+						background-color: rgba(255, 165, 0, 0.25);
+						padding: 1% 5%;
+						display: inline-block;
+						max-width: 100%;
+					}
+
+					body > ul {
+						background-color: #FFFAF4;
+						padding: 1%;
+						max-width: 100%;
+					}
+
+					ul {
+						list-style-type: none;
+					}
+
+					.tag {
+						color: darkred;
+					}
+
+					.attr {
+						color: darksalmon;
+					}
+
+					.value {
+						color: darkblue;
+					}
+
+					.comment {
+						color: lightgrey;
+					}
+
+					pre {
+						margin: 0;
+						max-width: 100%;
+						white-space: normal;
+					}
+				</style>
+			</head>
+
+			<body>
+				<h1>RSS feed by morss</h1>
+
+				<p>Your RSS feed is <strong style="color: green">ready</strong>. You
+				can enter the following url in your newsreader:</p>
+
+				<div id="url"></div>
+
+				<ul>
+					<xsl:apply-templates/>
+				</ul>
+
+				<script>
+					document.getElementById("url").innerHTML = window.location.href;
+				</script>
+			</body>
+		</html>
+	</xsl:template>
+
+	<xsl:template match="*">
+		<li>
+			<span class="element">
+				&lt;
+				<span class="tag"><xsl:value-of select="name()"/></span>
+
+				<xsl:for-each select="@*">
+					<span class="attr"> <xsl:value-of select="name()"/></span>
+					=
+					"<span class="value"><xsl:value-of select="."/></span>"
+				</xsl:for-each>
+				&gt;
+			</span>
+
+			<xsl:if test="node()">
+				<ul>
+					<xsl:apply-templates/>
+				</ul>
+			</xsl:if>
+
+			<span class="element">
+				&lt;/
+				<span class="tag"><xsl:value-of select="name()"/></span>
+				&gt;
+			</span>
+		</li>
+	</xsl:template>
+
+	<xsl:template match="comment()">
+		<li>
+			<pre class="comment"><![CDATA[<!--]]><xsl:value-of select="."/><![CDATA[-->]]></pre>
+		</li>
+	</xsl:template>
+
+	<xsl:template match="text()">
+		<li>
+			<pre>
+				<xsl:value-of select="normalize-space(.)"/>
+			</pre>
+		</li>
+	</xsl:template>
+
+	<xsl:template match="text()[not(normalize-space())]"/>
+
+</xsl:stylesheet>
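Note: for this stylesheet to take effect, the XML feed has to reference it with an `xml-stylesheet` processing instruction (which is also why `cgi_file_handler` above now serves `sheet.xsl` as `text/xsl`). How morss attaches that PI is not shown in this diff; a hedged lxml sketch of the general mechanism:

from lxml import etree

feed = etree.fromstring(b'<rss version="2.0"><channel><title>demo</title></channel></rss>')

# attach <?xml-stylesheet ...?> before the root element
pi = etree.ProcessingInstruction('xml-stylesheet', 'type="text/xsl" href="/sheet.xsl"')
feed.addprevious(pi)

print(etree.tostring(feed.getroottree(), xml_declaration=True, encoding='utf-8').decode())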