Since I ditched Thunderbird for miniflux as my main RSS reader, I've spent quite some time improving it.
I was casually browsing its code when I stumbled upon the following
regex:
imgRegex = regexp.MustCompile(`<img [^>]+>`), used in a single place:
doc.Find("noscript").Each(func(i int, noscript *goquery.Selection) {
	matches := imgRegex.FindAllString(noscript.Text(), 2)
	if len(matches) == 1 {
		changed = true
		noscript.ReplaceWithHtml(matches[0])
	}
})
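To get a feel for how brittle such a pattern is, here is a minimal sketch that runs the same regex against a few hand-written tags. The `firstImg` helper and the sample strings are made up for illustration, they are not part of miniflux.

```go
package main

import (
	"fmt"
	"regexp"
)

// Same pattern as the one quoted from miniflux above.
var imgRegex = regexp.MustCompile(`<img [^>]+>`)

// firstImg is a hypothetical helper: it returns the first match of
// imgRegex in s, or "" when nothing matches.
func firstImg(s string) string {
	return imgRegex.FindString(s)
}

func main() {
	samples := []string{
		`<img src="a.png" alt="ok">`,    // plain tag
		`<img src="a.png" alt="a > b">`, // '>' inside an attribute value
		"<img\nsrc=\"a.png\">",          // newline instead of a space after <img
	}
	for _, s := range samples {
		fmt.Printf("%q -> %q\n", s, firstImg(s))
	}
}
```

The first sample matches in full, the second is cut short at the `>` inside the `alt` attribute, and the third doesn't match at all because the pattern insists on a literal space after `<img`.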
This looks like a terrible idea, and shouldn't be hard to replace with something better like this:
doc.Find("noscript").Each(func(i int, noscript *goquery.Selection) {
	if img := noscript.Find("img"); img.Length() == 1 {
		img.Unwrap()
		changed = true
	}
})
Unfortunately, it didn't work, and led to a significant amount of time being
spent trying to debug and understand what was going on.
Turns out goquery is using cascadia, which in turn uses Go's x/net/html, which
parses HTML with scripting enabled, so it doesn't play nice with <noscript>
tags: their content is kept as raw text instead of being parsed into elements.
An issue was opened upstream in July 2016 and closed by a fix in April 2019,
but unfortunately that fix only works for <noscript> tags in <head>, meh.
In goquery's issue on the topic, someone suggested using something like this
to populate <noscript>'s HTML content:
root.Find(`noscript`).Each(func(i int, noscript *goquery.Selection) {
	noscript.SetHtml(noscript.Text())
})
Unfortunately, this didn't work for me. A horrible alternative would be to use
x/net/html or goquery to manually parse noscript.Html(), but this would
be ridiculously overkill; surely there is a better way.
ParseOptionEnableScripting's documentation doesn't say anything about <head>
context, and by looking at the history of html/parse.go, we can see that
namusyaka implemented <noscript> parsing in <body> as well in December 2019!
So the proper solution is this simple diff:
- doc, err := goquery.NewDocumentFromReader(strings.NewReader(entryContent))
+ parserHtml, err := nethtml.ParseWithOptions(strings.NewReader(entryContent), nethtml.ParseOptionEnableScripting(false))
+ doc := goquery.NewDocumentFromNode(parserHtml)
The corresponding miniflux pull-request can be found here, no more ugly regex! May this little blogpost prevent others from wasting as much time as I did.