Very quick-and-dirty hack to strip noscript tags
html5lib doesn't try to parse anything inside noscript tags; it just HTML-escapes the content instead. Since woodwind definitely removes JavaScript, we want to preserve the content inside noscript tags as HTML. Because the noscript tags have to be removed before the document hits BeautifulSoup, I'm resorting to a regex. Fixes #31
parent 9a0cdb6925
commit a1ec24aca6
1 changed file with 4 additions and 0 deletions
@@ -378,6 +378,10 @@ def process_xml_feed_for_new_entries(feed, content, backfill, now):
 
 
 def process_html_feed_for_new_entries(feed, content, backfill, now):
+    # strip noscript tags before parsing, since we definitely aren't
+    # going to preserve js
+    content = re.sub('</?noscript[^>]*>', '', content)
+
     parsed = mf2util.interpret_feed(
         mf2py.parse(url=feed.feed, doc=content), feed.feed)
     hfeed = parsed.get('entries', [])
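
To illustrate what the substitution does on its own, here is a minimal sketch that applies the same pattern to a small HTML string. The helper name `strip_noscript_tags` and the sample markup are hypothetical and not part of woodwind. Only the opening and closing noscript tags are removed; the markup between them is left in place so a later parser sees it as ordinary HTML.

```python
import re

# Same pattern as in the commit: matches <noscript ...> and </noscript>.
# Note: as written it is case-sensitive, so <NOSCRIPT> is left untouched.
NOSCRIPT_TAG = re.compile('</?noscript[^>]*>')


def strip_noscript_tags(content):
    """Hypothetical helper: drop noscript tags but keep their inner markup."""
    return NOSCRIPT_TAG.sub('', content)


# Hypothetical sample document with an image fallback inside noscript.
html = ('<div class="h-entry">'
        '<noscript><img class="u-photo" src="a.jpg"></noscript>'
        '</div>')

print(strip_noscript_tags(html))
# prints: <div class="h-entry"><img class="u-photo" src="a.jpg"></div>
```

Being a plain regex, this knows nothing about HTML structure, which is consistent with the "quick-and-dirty" label in the commit title.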