Parsing <noscript> tags with goquery
Fri 13 December 2024

As I ditched Thunderbird for miniflux as my main RSS reader, I've spent quite some time improving it.

I was casually browsing its code when I stumbled upon the following regex: imgRegex = regexp.MustCompile(`<img [^>]+>`), used in a single place:

doc.Find("noscript").Each(func(i int, noscript *goquery.Selection) {
    matches := imgRegex.FindAllString(noscript.Text(), 2)

    if len(matches) == 1 {
        changed = true

        noscript.ReplaceWithHtml(matches[0])
    }
})
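To make the fragility concrete, here is a minimal, standard-library-only sketch of what that snippet does with the text of a <noscript>. The `singleImg` helper is a hypothetical name made up for illustration, not something from miniflux; only the regex itself comes from the code above.

```go
package main

import (
	"fmt"
	"regexp"
)

// Same pattern as the one quoted from miniflux above.
var imgRegex = regexp.MustCompile(`<img [^>]+>`)

// singleImg returns the only <img ...> match found in text, or "" when
// there are zero or several matches. Hypothetical helper, for illustration.
func singleImg(text string) string {
	// A limit of 2 is enough to tell "exactly one" from "more than one".
	matches := imgRegex.FindAllString(text, 2)
	if len(matches) == 1 {
		return matches[0]
	}
	return ""
}

func main() {
	// One well-behaved tag: returned unchanged.
	fmt.Printf("%q\n", singleImg(`<img src="a.png" alt="a">`))
	// Two tags: nothing is returned, so nothing would be replaced.
	fmt.Printf("%q\n", singleImg(`<img src="a.png"><img src="b.png">`))
	// A '>' inside an attribute value cuts the match short.
	fmt.Printf("%q\n", singleImg(`<img alt="1 > 0" src="a.png">`))
}
```

The last call is the worrying one: the pattern stops at the first `>`, so a perfectly valid attribute value is enough to produce a truncated tag.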

This looks like a terrible idea, and shouldn't be hard to replace with something better like this:

doc.Find("noscript").Each(func(i int, noscript *goquery.Selection) {
    if img := noscript.Find("img"); img.Length() == 1 {
        img.Unwrap()
        changed = true
    }
})

Unfortunately, it didn't work, and led to a significant amount of time spent trying to debug and understand what was going on. It turns out that goquery uses cascadia, which in turn uses Go's x/net/html, which parses HTML with scripting enabled, making it not play nice with <noscript> tags: their content is parsed as plain text rather than as elements. An issue was opened upstream in July 2016 and closed by a fix from April 2019, but unfortunately that fix only works for <noscript> tags in <head>, meh.

In goquery's issue on the topic, someone suggested using something like this to populate <noscript>'s HTML content:

root.Find(`noscript`).Each(func(i int, noscript *goquery.Selection) {
    noscript.SetHtml(noscript.Text())
})

Unfortunately, this didn't work for me. A horrible alternative would be to use x/net/html or goquery to manually parse noscript.Html(), but that would be ridiculous overkill; surely there is a better way. ParseOptionEnableScripting's documentation doesn't say anything about a <head> context, and the history of html/parse.go shows that namusyaka implemented <noscript> parsing in <body> as well in December 2019! So the proper solution is this simple diff:

-       doc, err := goquery.NewDocumentFromReader(strings.NewReader(entryContent))
+       parserHtml, err := nethtml.ParseWithOptions(strings.NewReader(entryContent), nethtml.ParseOptionEnableScripting(false))
+       doc := goquery.NewDocumentFromNode(parserHtml)

The corresponding miniflux pull-request can be found here, no more ugly regex! May this little blogpost prevent others from wasting as much time as I did.