Very quick-and-dirty hack to strip noscript tags

html5lib doesn't try to parse anything inside of noscript tags; it just
HTML-escapes the contents instead. Since woodwind always removes JavaScript,
we want to preserve the content inside of noscript tags as HTML.

Since the noscript tags have to be removed before the document reaches
BeautifulSoup, I'm resorting to a regex.

Fixes #31
Kyle Mahan 2015-08-14 08:35:42 -07:00
parent 9a0cdb6925
commit a1ec24aca6

@@ -378,6 +378,10 @@ def process_xml_feed_for_new_entries(feed, content, backfill, now):
def process_html_feed_for_new_entries(feed, content, backfill, now):
    # strip noscript tags before parsing, since we definitely aren't
    # going to preserve js
    content = re.sub('</?noscript[^>]*>', '', content)

    parsed = mf2util.interpret_feed(
        mf2py.parse(url=feed.feed, doc=content), feed.feed)
    hfeed = parsed.get('entries', [])
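
A minimal standalone sketch of what the substitution above does, for anyone who wants to try the pattern outside of woodwind. The helper name `strip_noscript_tags` and the sample markup are hypothetical; only the regex itself comes from this commit.

```python
import re

# Same pattern as in the commit: matches an opening or closing noscript tag,
# including any attributes, but not the content between the tags.
NOSCRIPT_TAG = re.compile('</?noscript[^>]*>')


def strip_noscript_tags(content):
    """Remove the noscript tags themselves and keep whatever they wrapped.

    Hypothetical helper for illustration; the commit calls re.sub inline.
    """
    return NOSCRIPT_TAG.sub('', content)


# Made-up sample: an image that is only present inside a noscript fallback.
sample = ('<div class="h-entry">'
          '<noscript><img class="u-photo" src="photo.jpg"></noscript>'
          '</div>')

print(strip_noscript_tags(sample))
# prints: <div class="h-entry"><img class="u-photo" src="photo.jpg"></div>
```

As written, the pattern is case-sensitive, so an uppercase `<NOSCRIPT>` tag would pass through untouched; passing `re.IGNORECASE` would be one way to cover that, if it ever turns out to matter.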