Refactoring to v.2.0

This commit is contained in:
relikd
2018-12-27 21:11:59 +01:00
parent f9e672661a
commit 62c5bef463
50 changed files with 2574 additions and 3128 deletions

@@ -1,39 +1,66 @@
# RSXML
This is utility code for parsing XML and HTML using libXML2's SAX parser.
This is utility code for parsing XML and HTML using libXML2's SAX parser. It does not depend on any other third-party frameworks and builds two targets: one for Mac, one for iOS.
It builds two framework targets: one for Mac, one for iOS. It does not depend on any other third-party frameworks. The code is Objective-C with ARC.
**Note:** This is an actively maintained fork of the [RSXML library by Brent Simmons](https://github.com/brentsimmons/RSXML). The original library seems to be inactive in favor of the new version [RSParser](https://github.com/brentsimmons/RSParser) which is written with Swift support in mind. If you prefer Swift you should go ahead and work with that project. However, the reason for this fork is to keep a version alive which is Objective-C only.
#### The gist
To parse XML, create an `RSSAXParserDelegate`. (There are examples in the framework that you can crib from.)
To parse HTML, create an `RSSAXHTMLParserDelegate`. (There are examples for this too.)
### Why use libXML2's SAX API?
#### Goodies and Extras
Brent Simmons placed great value on a low memory footprint and fast parsing. In his own words: "RSXML was written to avoid allocating Objective-C objects except when absolutely needed. You'll note use of things like `memcmp` and `strncmp`". This promise will not be broken in future development.
There are three XML parsers included, for OPML, RSS, and Atom. To parse OPML, see `RSOPMLParser`. To parse RSS and Atom, see `RSFeedParser`.
These parsers may or may not be complete enough for your needs. You could, in theory, start writing an RSS reader just with these. (And, if you want to, go for it, with my blessing.)
There are two HTML parsers included. `RSHTMLMetadataParser` pulls metadata from the head section of an HTML document. `RSHTMLLinkParser` pulls all the links (anchors, <a href=…> tags) from an HTML document.
### Refactoring v.2.0
Other possibly interesting things:
The refactoring that led to version 2.0 changed many things. With nearly all files touched, I would say roughly 80% of the code was updated. The parser architecture was rewritten and every parser is now a subclass of `RSXMLParser`. The parsing interface uses generic return types and some of the returned documents have changed as well.
`RSDateParser` makes it easy to parse dates in the formats found in various types of feeds.
In general, performance did not change, and where it did, it only improved slightly. However, the performance of the HTML metadata parser improved by 80% to 90% (by canceling the parse after the head tag). At the same time, heap allocations dropped to between 30% and 50% for the test cases (same reason).
In the previous version, the test case for parsing a non-OPML file (with `RSOPMLParser`) took 13 seconds, whereas the parser now cancels after a few milliseconds.
## Usage
```
RSXMLData *xmlData = [[RSXMLData alloc] initWithData:d urlString:@"https://www.example.org"];
// TODO: check xmlData.parserError
RSFeedParser *parser = [[RSFeedParser alloc] initWithXMLData:xmlData];
// TODO: check [parser canParse]
// TODO: alternatively check error after parseSync:
NSError *parseError;
RSParsedFeed *document = [parser parseSync:&parseError];
```
`RSXMLData` will return an error in `.parserError` if the provided data is not in XML format (see `RSXMLError` for possible reasons). The other point of failure is after initializing a parser with the `RSXMLData`. This will set an error if the parser does not match the underlying data (e.g., if you try to parse an `.opml` file with an Atom or RSS parser).
If you don't care about the parser used to decode the data, `[xmlData getParser]` will return the most suitable parser. You can use that parser right away to call `parseSync:`. You can also parse the XML file asynchronously with `parseAsync:`.
```
[[xmlData getParser] parseAsync:^(RSParsedFeed *parsedDocument, NSError *error) {
// process feed items ...
}];
```
### Available parsers
This library includes parsers for RSS, Atom, OPML, and HTML metadata. The latter return links to feed URLs and icon files, or all anchor tags in the document. Use `RSFeedParser` to parse a feed regardless of type (Atom: `RSAtomParser`, RSS: `RSRSSParser`). To parse `.opml` files, use `RSOPMLParser`. For `.html` files there are two parsers available: `RSHTMLMetadataParser` (icons and feed links) and `RSHTMLLinkParser` (all anchor tags).
Depending on the parser, the return value of `parseSync`/`parseAsync` is `RSParsedFeed`, `RSOPMLItem`, `RSHTMLMetadata`, or `RSHTMLMetadataAnchor`.
You can define the parser type by declaring it like this: `RSXMLData<RSFeedParser*> *xmlData`. That won't force the selection of the parser, though. But `[xmlData getParser]` will return the correct type, which in turn will return the appropriate document type (same as using a specific parser in the first place).
### Extras
`RSDateParser` makes it easy to parse dates from various formats found in different feed types.
`NSString+RSXML` decodes HTML entities.
Also note: there are some unit tests.
#### Why use libXML2's SAX API?
SAX is kind of a pain because of all the state you have to manage. But it's fastest and uses the least amount of memory.
An alternative is to use `NSXMLParser`, which is event-driven like SAX. However, RSXML was written to avoid allocating Objective-C objects except when absolutely needed. You'll note use of things like `memcmp` and `strncmp`.
Normally I avoid this kind of thing *strenuously*. I prefer to work at the highest level possible.
But my more-than-a-decade of experience parsing XML has led me to this solution, which — last time I checked, which was, admittedly, a few years ago — was not only fastest but also uses the least memory. (The two things are related, of course: creating objects is bad for performance, so this code attempts to do the minimum possible.)
All that low-level stuff is encapsulated, however. If you parse a feed, for instance, the caller gets an `RSParsedFeed` which contains `RSParsedArticle`s, and they're standard Objective-C objects. It's only inside your `RSSAXParserDelegate` and `RSSAXHTMLParserDelegate` where you'll need to deal with C.
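To give a rough idea of what that C-level work can look like, here is a minimal sketch of comparing parser-supplied bytes against known names without allocating any objects. The helpers `name_equals` and `name_has_prefix` are hypothetical and are not taken from RSXML; they only illustrate the `memcmp`/`strncmp` style mentioned above. The sketch assumes a buffer-plus-length input, which is how the `characters` callback of libxml2's SAX interface delivers text.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of RSXML): returns true when the first
 * `length` bytes of `name` are exactly the NUL-terminated `literal`.
 * The buffer itself need not be NUL-terminated, so the lengths are
 * compared first and memcmp then compares the raw bytes. */
static bool name_equals(const unsigned char *name, size_t length, const char *literal) {
    if (name == NULL || literal == NULL) {
        return false;
    }
    return strlen(literal) == length && memcmp(name, literal, length) == 0;
}

/* Hypothetical helper: returns true when the NUL-terminated `name`
 * starts with `prefix`, e.g. to spot a namespace prefix. */
static bool name_has_prefix(const char *name, const char *prefix) {
    return strncmp(name, prefix, strlen(prefix)) == 0;
}
```

A call such as `name_equals(bytes, 5, "title")` works directly on the bytes handed over by the parser, so no `NSString` has to be created just to decide whether an element is interesting.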