How to decode Reddit's RSS using Golang?

Posted 2020-07-11 08:50

Question:

I've been playing about with Go's XML package and cannot see what is wrong with the following code.

package main

import (
    "encoding/xml"
    "fmt"
    "net/http"
) 

type Channel struct {
    Items Item
}

type Item struct {
    Title       string `xml:"title"`
    Link        string `xml:"link"`
    Description string `xml:"description"`
}

func main() {

    var items = new(Channel)
    res, err := http.Get("http://www.reddit.com/r/google.xml")

    if err != nil {
        fmt.Printf("Error: %v\n", err)
    } else {
        decoded := xml.NewDecoder(res.Body)

        err = decoded.Decode(items)

        if err != nil {
            fmt.Printf("Error: %v\n", err)
        }

        fmt.Printf("Title: %s\n", items.Items.Title)
    }
}

The above code runs without any errors and prints to the terminal:

Title:

The struct seems empty but I can't see why it isn't getting populated with the XML data.

Answer 1:

Your program comes close, but needs to specify just a little bit more context to match the XML document.

You need to revise your field tags to help guide the XML binding down through your Channel structure to your Item structure:

type Channel struct {
    Items []Item `xml:"channel>item"`
}

type Item struct {
    Title       string `xml:"title"`
    Link        string `xml:"link"`
    Description string `xml:"description"`
}

Per the documentation for encoding/xml.Unmarshal(), the following rule applies here:

If the XML element contains a sub-element whose name matches the prefix of a tag formatted as "a" or "a>b>c", unmarshal will descend into the XML structure looking for elements with the given names, and will map the innermost elements to that struct field. A tag starting with ">" is equivalent to one starting with the field name followed by ">".

In your case, you're looking to descend through the top-level <rss> element's <channel> element to find each <item> element. Note, though, that we don't need to (and in fact can't) specify that the Channel struct should burrow through the top-level <rss> element by writing the Items field's tag as

`xml:"rss>channel>item"`

That context is implicit; the struct supplied to Unmarshal() already maps to the top-level XML element.

Note too that your Channel struct's Items field should be a slice of Item ([]Item), not just a single Item, because the feed contains many <item> elements.
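To see both points in isolation, here is a minimal sketch that decodes a small made-up RSS document from a string instead of fetching it from Reddit. The sample document, the WithRoot type, and the countItems helper are assumptions introduced only for illustration; they are not part of the original program.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"strings"
)

// sampleRSS is a made-up RSS 2.0 document used only for illustration.
const sampleRSS = `<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>First post</title>
      <link>http://example.com/1</link>
      <description>One</description>
    </item>
    <item>
      <title>Second post</title>
      <link>http://example.com/2</link>
      <description>Two</description>
    </item>
  </channel>
</rss>`

type Item struct {
	Title       string `xml:"title"`
	Link        string `xml:"link"`
	Description string `xml:"description"`
}

// Channel maps to the root <rss> element, so the tag path starts below it.
type Channel struct {
	Items []Item `xml:"channel>item"`
}

// WithRoot (hypothetical) repeats the root element in the path.
// The decoder then looks for an <rss> child inside <rss> and finds nothing.
type WithRoot struct {
	Items []Item `xml:"rss>channel>item"`
}

// countItems decodes doc twice and reports how many items each struct received.
func countItems(doc string) (good, withRoot int, err error) {
	var c Channel
	if err = xml.NewDecoder(strings.NewReader(doc)).Decode(&c); err != nil {
		return 0, 0, err
	}
	var w WithRoot
	if err = xml.NewDecoder(strings.NewReader(doc)).Decode(&w); err != nil {
		return 0, 0, err
	}
	return len(c.Items), len(w.Items), nil
}

func main() {
	good, withRoot, err := countItems(sampleRSS)
	if err != nil {
		fmt.Println("Error:", err)
		return
	}
	fmt.Println("items via channel>item:", good)
	fmt.Println("items via rss>channel>item:", withRoot)
}
```

With this sample, the Channel struct should receive both items while WithRoot receives none, and neither decode reports an error. That silent mismatch is the same reason the program in the question printed an empty title.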


You mentioned that you're having trouble getting the proposal to work. Here's a complete listing that I find works as one would expect:

package main

import (
    "encoding/xml"
    "fmt"
    "net/http"
    "os"
) 

type Channel struct {
    Items []Item `xml:"channel>item"`
}

type Item struct {
    Title       string `xml:"title"`
    Link        string `xml:"link"`
    Description string `xml:"description"`
}

func main() {
    res, err := http.Get("http://www.reddit.com/r/google.xml")
    if err != nil {
        fmt.Println("Error retrieving resource:", err)
        os.Exit(1)
    }

    // Decode the document, then close the body whether or not decoding succeeded.
    channel := Channel{}
    err = xml.NewDecoder(res.Body).Decode(&channel)
    res.Body.Close()
    if err != nil {
        fmt.Println("Error:", err)
        os.Exit(1)
    }

    // Print the first item, if the feed contained any.
    if len(channel.Items) != 0 {
        item := channel.Items[0]
        fmt.Println("First title:", item.Title)
        fmt.Println("First link:", item.Link)
        fmt.Println("First description:", item.Description)
    }
}
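One thing the listing does not check is the HTTP status. If a server answers with an error status and a non-feed body, the decoder may either fail with a confusing XML error or succeed with zero items. The sketch below shows one way to guard against that. The decodeResponse and fakeResponse helpers are hypothetical names, and the responses are built in memory with net/http/httptest so the example runs without network access; with a real feed you would pass in the result of http.Get instead.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"net/http"
	"net/http/httptest"
)

type Channel struct {
	Items []Item `xml:"channel>item"`
}

type Item struct {
	Title       string `xml:"title"`
	Link        string `xml:"link"`
	Description string `xml:"description"`
}

// sampleFeed is a made-up, minimal RSS document.
const sampleFeed = `<rss><channel><item><title>Hello</title></item></channel></rss>`

// decodeResponse (hypothetical helper) rejects non-200 responses before
// decoding, and always closes the response body.
func decodeResponse(res *http.Response) ([]Item, error) {
	defer res.Body.Close()
	if res.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("unexpected HTTP status %q", res.Status)
	}
	var channel Channel
	if err := xml.NewDecoder(res.Body).Decode(&channel); err != nil {
		return nil, fmt.Errorf("decoding feed: %w", err)
	}
	return channel.Items, nil
}

// fakeResponse builds an in-memory *http.Response for the demonstration.
func fakeResponse(status int, body string) *http.Response {
	rec := httptest.NewRecorder()
	rec.WriteHeader(status)
	rec.WriteString(body)
	return rec.Result()
}

func main() {
	// A successful response decodes normally.
	items, err := decodeResponse(fakeResponse(http.StatusOK, sampleFeed))
	fmt.Println(len(items), err)

	// An error response is reported instead of being decoded.
	_, err = decodeResponse(fakeResponse(http.StatusTooManyRequests, "<html>slow down</html>"))
	fmt.Println(err)
}
```

Returning an error from a helper, rather than calling os.Exit inside it, also lets the deferred Close run on every path.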


Answer 2:

I'd be completely explicit and name all the XML parts, as in the structs below.

See the playground for a full working example.

type Rss struct {
    Channel Channel `xml:"channel"`
}

type Channel struct {
    Title       string `xml:"title"`
    Link        string `xml:"link"`
    Description string `xml:"description"`
    Items       []Item `xml:"item"`
}

type Item struct {
    Title       string `xml:"title"`
    Link        string `xml:"link"`
    Description string `xml:"description"`
}
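As a rough illustration of how these explicit structs are used, the sketch below decodes a made-up RSS document from a string. The XMLName field, the sample documents, and the decodeRss helper are assumptions added for this example and are not part of the answer above. XMLName is optional: with it, the decoder returns an error when the root element is not <rss> instead of silently returning empty structs.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"strings"
)

type Rss struct {
	// XMLName is an optional addition: it makes decoding fail on a non-<rss> root.
	XMLName xml.Name `xml:"rss"`
	Channel Channel  `xml:"channel"`
}

type Channel struct {
	Title       string `xml:"title"`
	Link        string `xml:"link"`
	Description string `xml:"description"`
	Items       []Item `xml:"item"`
}

type Item struct {
	Title       string `xml:"title"`
	Link        string `xml:"link"`
	Description string `xml:"description"`
}

// sampleRSS and sampleAtom are made-up documents used only for illustration.
const sampleRSS = `<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <link>http://example.com/</link>
    <description>A sample channel</description>
    <item><title>First post</title><link>http://example.com/1</link></item>
    <item><title>Second post</title><link>http://example.com/2</link></item>
  </channel>
</rss>`

const sampleAtom = `<feed xmlns="http://www.w3.org/2005/Atom"><title>Not RSS</title></feed>`

// decodeRss (hypothetical helper) decodes a whole RSS document from a string.
func decodeRss(doc string) (*Rss, error) {
	var rss Rss
	if err := xml.NewDecoder(strings.NewReader(doc)).Decode(&rss); err != nil {
		return nil, err
	}
	return &rss, nil
}

func main() {
	rss, err := decodeRss(sampleRSS)
	if err != nil {
		fmt.Println("Error:", err)
		return
	}
	fmt.Println("Channel title:", rss.Channel.Title)
	for _, item := range rss.Channel.Items {
		fmt.Println("Item:", item.Title, item.Link)
	}

	// Because of XMLName, a document with a different root is rejected.
	if _, err := decodeRss(sampleAtom); err != nil {
		fmt.Println("Atom document rejected:", err)
	}
}
```

Compared with the "channel>item" tag path, this layout costs one extra type but also gives access to the channel's own title, link, and description.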


Answer 3:

Nowadays the Reddit feed seems to have changed to the Atom format, which means that parsing it as RSS no longer works. The Atom functionality of go-rss can parse such feeds:

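// Feed maps to the root <feed> element of an Atom document.
type Feed struct {
  Entry []Entry `xml:"entry"`
}

// Entry maps to each <entry> element in the Feed.
type Entry struct {
  ID      string `xml:"id"`
  Title   string `xml:"title"`
  Updated string `xml:"updated"`
}

// Atom parses an Atom feed from an HTTP response and closes the response body.
func Atom(resp *http.Response) (*Feed, error) {
  defer resp.Body.Close()
  xmlDecoder := xml.NewDecoder(resp.Body)
  // charset is a third-party package whose import is not shown in this snippet;
  // its NewReader must have the form func(charset string, input io.Reader) (io.Reader, error).
  xmlDecoder.CharsetReader = charset.NewReader
  feed := Feed{}
  if err := xmlDecoder.Decode(&feed); err != nil {
    return nil, err
  }
  return &feed, nil
}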
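For a feed that is already UTF-8, the standard library alone is enough. The following is a minimal sketch, not the go-rss implementation: the sample document, the Link type, the extra Title and Link fields, and the decodeAtom helper are assumptions made for illustration. It also shows why the RSS structs from the earlier answers come back empty on an Atom document.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"strings"
	"time"
)

// sampleAtom is a made-up Atom document used only for illustration.
const sampleAtom = `<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example Atom feed</title>
  <entry>
    <id>entry-1</id>
    <title>First entry</title>
    <updated>2020-07-11T08:50:00+00:00</updated>
    <link href="http://example.com/1"/>
  </entry>
</feed>`

// Feed maps to the root <feed> element.
type Feed struct {
	Title   string  `xml:"title"`
	Entries []Entry `xml:"entry"`
}

// Entry maps to each <entry> element.
type Entry struct {
	ID      string `xml:"id"`
	Title   string `xml:"title"`
	Updated string `xml:"updated"`
	Link    Link   `xml:"link"`
}

// Link reads the href attribute of an Atom <link> element.
type Link struct {
	Href string `xml:"href,attr"`
}

// RSSChannel is the RSS mapping from the earlier answers, kept for comparison.
type RSSChannel struct {
	Items []struct {
		Title string `xml:"title"`
	} `xml:"channel>item"`
}

// decodeAtom (hypothetical helper) decodes an Atom document from a string.
func decodeAtom(doc string) (*Feed, error) {
	var feed Feed
	if err := xml.NewDecoder(strings.NewReader(doc)).Decode(&feed); err != nil {
		return nil, err
	}
	return &feed, nil
}

// countRSSItems decodes doc with the RSS mapping and returns the item count.
func countRSSItems(doc string) (int, error) {
	var ch RSSChannel
	err := xml.NewDecoder(strings.NewReader(doc)).Decode(&ch)
	return len(ch.Items), err
}

func main() {
	feed, err := decodeAtom(sampleAtom)
	if err != nil {
		fmt.Println("Error:", err)
		return
	}
	for _, e := range feed.Entries {
		// Atom timestamps use the RFC 3339 format.
		updated, err := time.Parse(time.RFC3339, e.Updated)
		fmt.Println(e.Title, e.Link.Href, updated.Year(), err)
	}

	// The RSS mapping finds no <channel> in an Atom document.
	n, err := countRSSItems(sampleAtom)
	fmt.Println("RSS items found:", n, err)
}
```

A CharsetReader, as in the snippet above, only becomes necessary when the document declares an encoding other than UTF-8.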