Rewrite of the url-proxy servlet (YaCyProxyServlet "/proxy.html?url=xyz")
using a different method to rewrite urls (JSoup instead of regex, for more flexibility).
The problem with regex was that it also modified HTML header tags, causing problems with some relative link urls
and with an included header tag.
Design goal of this urlproxy
- option to handle links/urls the owner/user clicked on
- index visited pages on the fly (without having to configure a permanent "transparent" proxy)
For this goal, and in distinction from the "transparent" proxy, we don't want (or need) to route all content
through the proxy (e.g. we are not interested in transporting css etc. but concentrate on searchable content).
General functionality to implement
1 - check user access rights
2 - get target url from parameter
3 - check that target url is acceptable
4 - load target url
5 - index target url
6 - perform any custom event/treatment (for/on this user clicked url) - not implemented
7 - modify loaded target content (like rewrite links to get proxied)
8 - optionally add augmentation / interaction - not implemented
9 - deliver to client browser
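
Steps 2 and 3 above could be sketched roughly as follows. This is only an illustration using the Java standard library, not the servlet's actual code: the class name TargetUrlCheck, the method acceptableTarget and the acceptance rule (absolute http/https url with a host) are assumptions made for the example.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;
import java.util.Optional;

// Hypothetical helper, not part of YaCy: illustrates steps 2 and 3 only.
final class TargetUrlCheck {

    /**
     * Returns the parsed target if the raw value of the "url" parameter is
     * acceptable under the assumed rule (absolute http/https url with a host),
     * otherwise an empty Optional.
     */
    static Optional<URI> acceptableTarget(String rawUrlParam) {
        // step 2: the parameter must be present
        if (rawUrlParam == null || rawUrlParam.isBlank()) {
            return Optional.empty();
        }
        final URI target;
        try {
            target = new URI(rawUrlParam.trim());
        } catch (URISyntaxException e) {
            return Optional.empty(); // not parseable as a url
        }
        // step 3: assumed acceptance rule, adjust to the real policy
        String scheme = target.getScheme();
        if (scheme == null || target.getHost() == null) {
            return Optional.empty();
        }
        scheme = scheme.toLowerCase(Locale.ROOT);
        boolean ok = scheme.equals("http") || scheme.equals("https");
        return ok ? Optional.of(target) : Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(acceptableTarget("http://example.org/page.html").isPresent());
        System.out.println(acceptableTarget("ftp://example.org/file").isPresent());
    }
}
```

A real servlet would answer a rejected target with an error response instead of an empty Optional; the Optional only keeps the sketch free of servlet dependencies.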
The rewrite of links can't be perfect, as all kinds of scripting etc. can be involved;
with JSoup only the link attributes in the body are modified. This helps to display
the page correctly, but it also means that e.g. with forms and javascript menus, links will not
point to the original site (but to the proxy url instead).
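
The core of such a link rewrite, turning one link found in a page into a proxied link, might look like the sketch below. It uses only the Java standard library and leaves out the JSoup traversal. The class name ProxyLinkRewriter and the method toProxyUrl are hypothetical. The proxy address is the one from the bookmark hint in this comment. It is also an assumption that the servlet accepts a url-encoded target in the "url" parameter.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of YaCy: rewrites a single link value.
final class ProxyLinkRewriter {

    // Assumption: address of the local peer, as used in the bookmark hint.
    static final String PROXY_PREFIX = "http://localhost:9090/proxy.html?url=";

    /**
     * Resolves href against the url of the page it was found on and wraps the
     * result into a proxy url. Throws IllegalArgumentException if either
     * argument is not a syntactically valid URI.
     */
    static String toProxyUrl(String pageUrl, String href) {
        // relative links are resolved against the page url, absolute links stay as they are
        URI absolute = URI.create(pageUrl).resolve(href);
        // encode the target so that its own query string survives as one parameter value
        return PROXY_PREFIX + URLEncoder.encode(absolute.toString(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(toProxyUrl("http://example.org/dir/page.html", "other.html"));
    }
}
```

Resolving before wrapping is what keeps relative links pointing at the original site's content; links that are only built by scripts at runtime never pass through such a helper, which is the limitation described above.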
TODO: instead of running JSoup on top (parsing twice, for indexing and for content rewrite), check the option to join the parsing steps
Hint: a browser favorite (bookmark) with the address
javascript: window.location.href = ('http://localhost:9090/proxy.html?url=' + location.href);
will start the urlproxy with the current browser address.
This class is linked to YaCy within Jetty using the defaults/web.xml configuration.