Very quick-and-dirty hack to strip noscript tags

html5lib doesn't parse anything inside noscript tags; it just HTML-escapes the contents instead. Since woodwind always removes JavaScript, we want to preserve the content inside noscript tags as HTML. The noscript tags have to be removed before the document reaches BeautifulSoup, so I'm resorting to a regex. Fixes #31
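
A minimal sketch of what the substitution does, using the same pattern as the change in this commit. The helper name `strip_noscript_tags` and the sample markup are hypothetical and only for illustration; the commit itself inlines the `re.sub` call in `process_html_feed_for_new_entries`.

```python
import re

# Same pattern as in the commit: matches an opening or closing noscript
# tag, including any attributes, but not the content between the tags.
NOSCRIPT_TAG = re.compile(r'</?noscript[^>]*>')


def strip_noscript_tags(html):
    """Hypothetical helper: drop the noscript tags, keep what they wrap."""
    return NOSCRIPT_TAG.sub('', html)


# Illustrative input, not taken from a real feed.
sample = '<div><noscript><img src="photo.jpg" class="u-photo"></noscript></div>'
print(strip_noscript_tags(sample))
# prints: <div><img src="photo.jpg" class="u-photo"></div>
```

Note that the pattern is case-sensitive as written, so an uppercase `<NOSCRIPT>` tag would pass through unchanged; passing `re.IGNORECASE` would be one way to cover that case if it turns out to matter.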
2015-08-14 08:35:42 -07:00
commit a1ec24aca6
parent 9a0cdb6925
1 changed file with 4 additions and 0 deletions
--- a/woodwind/tasks.py
+++ b/woodwind/tasks.py
@@ -378,6 +378,10 @@ def process_xml_feed_for_new_entries(feed, content, backfill, now):


 def process_html_feed_for_new_entries(feed, content, backfill, now):
+    # strip noscript tags before parsing, since we definitely aren't
+    # going to preserve js
+    content = re.sub('</?noscript[^>]*>', '', content)
+
     parsed = mf2util.interpret_feed(
         mf2py.parse(url=feed.feed, doc=content), feed.feed)
     hfeed = parsed.get('entries', [])