update readme

This commit is contained in:
relikd
2022-04-08 21:00:49 +02:00
parent e871e6f03e
commit a25b62d934
2 changed files with 57 additions and 15 deletions


@@ -5,13 +5,13 @@ A collection of scripts to parse and process input files (html, RSS, Atom) w
<sup>1</sup>: Well, except for `pytelegrambotapi`, but feel free to replace it with another module.
Includes tools for extracting “news” articles, duplicate detection, a content downloader, cron-like repeated events, and a Telegram bot client.
Basically everything you need for quick and dirty format processing (`html->RSS`, `notification->telegram`, `rss/podcast->download`, `web scraper`, ...) without writing the same code over and over again.
The architecture is modular and pipeline oriented.
Use whatever suits the task at hand.
## Usage
There is a short [usage](./botlib/README.md) documentation on the individual components of this lib.
There are also some [examples](./examples/) on how to combine them.
Lastly, for web scraping, open [playground.py](./examples/web-scraper/playground.py) to test your regex.
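As a taste of what testing a scraper regex involves, here is a standalone snippet; the HTML sample is invented for illustration and unrelated to the playground script:

```py
import re

# made-up HTML snippet standing in for a fetched page
html = '<a href="/a.html">First</a> <a href="/b.html">Second</a>'

# capture (href, link text) pairs; real pages may need a proper HTML parser
links = re.findall(r'<a href="([^"]+)">([^<]+)</a>', html)
print(links)  # [('/a.html', 'First'), ('/b.html', 'Second')]
```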


@@ -17,7 +17,17 @@ from botlib.tgclient import TGClient
## Cli, DirType
A wrapper around `argparse` with shorter calls for common parameters.
If `.arg_dir()` is not enough, use `DirType` for params that must be existing directories.
```py
cli = Cli()
cli.arg_dir('dest_dir', help='...')
cli.arg_file('FILE', help='...')
cli.arg_bool('--dry-run', help='...')
cli.arg('source', help='...')
args = cli.parse()
```
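The `arg_*` names above suggest thin convenience wrappers; purely as a guess (this is not the repo's code), such a class could be built on `argparse` like so:

```py
import argparse
import os


def DirType(value):
    """argparse type check: value must be an existing directory (guessed semantics)."""
    if not os.path.isdir(value):
        raise argparse.ArgumentTypeError('not a directory: %s' % value)
    return value


class Cli:
    """Hypothetical sketch of a short-hand argparse wrapper."""
    def __init__(self):
        self._parser = argparse.ArgumentParser()

    def arg(self, name, **kwargs):
        self._parser.add_argument(name, **kwargs)

    def arg_bool(self, flag, **kwargs):
        # boolean flags default to False and flip to True when present
        self._parser.add_argument(flag, action='store_true', **kwargs)

    def arg_dir(self, name, **kwargs):
        self._parser.add_argument(name, type=DirType, **kwargs)

    def arg_file(self, name, **kwargs):
        self._parser.add_argument(name, type=argparse.FileType('r'), **kwargs)

    def parse(self, args=None):
        return self._parser.parse_args(args)
```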
@@ -54,42 +64,74 @@ Includes an etag / last-modified check to reduce network load.
```py
# for these 3 calls, create a download connection just once.
# the result is stored in cache/curl-example.org-1234..ABC.data
Curl.get('https://example.org', cache_only=False)
Curl.get('https://example.org')
Curl.get('https://example.org')
# if the URL path is different, the content is downloaded into another file
Curl.get('https://example.org?1')
# download files
Curl.file('https://example.org/image.png', './example-image.png')
# ... or json
Curl.json('https://example.org/data.json', fallback={})
# ... or just open a connection
with Curl.open('https://example.org') as wp:
    wp.read()
```
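The cache filename in the comment above looks like `curl-<host>-<hash>.data`; one plausible way to derive such a key (an assumption about the naming scheme, not taken from the code):

```py
import hashlib
from urllib.parse import urlsplit


def cache_key(url):
    """Stable cache filename: host plus a short hash of the full URL."""
    host = urlsplit(url).netloc
    digest = hashlib.sha256(url.encode()).hexdigest()[:12]
    return 'curl-%s-%s.data' % (host, digest)


# same URL -> same file; a different query -> a different cache file
print(cache_key('https://example.org'))
print(cache_key('https://example.org?1'))
```

Because the key hashes the full URL, `?1` lands in a separate cache file, matching the behavior described in the comments above.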
There is also an easy getter that downloads multiple files only if they do not exist locally yet.
```py
Curl.once('./dest/dir/', 'filename', [
    'https://example.org/audio.mp3',
    'https://example.org/image.jpg'
], date)
```
This will check whether `./dest/dir/filename.mp3` and `./dest/dir/filename.jpg` exist and, if not, download them.
All file modification dates will be set to `date` (if set).
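The described behavior (one base filename, extension taken from each URL, skip existing files, stamp `date`) can be sketched as follows; `fetch` and `download_once` are placeholders, not botlib's actual API:

```py
import os


def download_once(dest_dir, name, urls, fetch, date=None):
    """For each URL build dest_dir/name.<ext>; fetch only if that file is missing."""
    for url in urls:
        ext = os.path.splitext(url)[1]             # '.mp3', '.jpg', ...
        target = os.path.join(dest_dir, name + ext)
        if os.path.exists(target):
            continue                               # already downloaded earlier
        with open(target, 'wb') as fp:
            fp.write(fetch(url))
        if date is not None:
            os.utime(target, (date, date))         # unix timestamp -> file mtime
```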
## Feed2List
Parse RSS or Atom XML and return a list of entries. Similar to `HTML2list`.
```py
for entry in reversed(Feed2List(fp, keys=[
    'link', 'title', 'description', 'enclosure',  # audio
    'pubDate', 'media:content',  # image
    # 'itunes:image', 'itunes:duration', 'itunes:summary'
])):
    date = entry.get('pubDate')
    process_entry(entry, date)
```
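A rough idea of what such a parser presumably does for plain RSS, pared down to `xml.etree` (a stand-in, not `Feed2List` itself; namespaced keys like `media:content` would additionally need ElementTree namespace handling):

```py
import xml.etree.ElementTree as ET


def feed_to_list(xml_text, keys):
    """Collect the children of each RSS <item> whose tag is in `keys` into dicts."""
    entries = []
    for item in ET.fromstring(xml_text).iter('item'):
        entries.append({c.tag: c.text for c in item if c.tag in keys})
    return entries


# tiny made-up feed for illustration
rss = ('<rss><channel>'
       '<item><title>Ep 1</title><link>https://example.org/1</link></item>'
       '<item><title>Ep 2</title><link>https://example.org/2</link></item>'
       '</channel></rss>')
for entry in reversed(feed_to_list(rss, keys=['link', 'title'])):
    print(entry['title'], entry['link'])  # reversed: Ep 2, then Ep 1
```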
## Log, FileTime, StrFormat, FileWrite
A few tools that may be helpful.
- `Log.error` and `Log.info` will just print a message (including the current timestamp) to the console.
- `FileTime.get` and `FileTime.set` reads and writes file times, e.g., set the date to the same time when a podcast episode was released.
- `StrFormat` has:
  - `.strip_html()`: make plain text from HTML
  - `.to_date()`: convert a string to a date
  - `.safe_filename()`: strip all filesystem-unsafe characters
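The library's implementations are not shown here, so the following are merely plausible stand-ins for `strip_html()` and `safe_filename()` (naive regex versions):

```py
import re


def strip_html(text):
    """Naive plain-text conversion: drop tags, collapse whitespace."""
    text = re.sub(r'<[^>]+>', ' ', text)
    return re.sub(r'\s+', ' ', text).strip()


def safe_filename(name):
    """Replace characters that are unsafe on common filesystems."""
    return re.sub(r'[\\/:*?"<>|]+', '_', name).strip()


print(strip_html('<p>Hello <b>world</b></p>'))  # Hello world
print(safe_filename('a/b: c?'))                 # a_b_ c_
```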
`FileWrite.once` works exactly the same way as `Curl.once` does.
The only difference is that `FileWrite` accepts just one filename and requires a callback method:
```py
@FileWrite.once(dest_dir, 'description.txt', date, overwrite=False)
def _fn():
    desc = title + '\n' + '=' * len(title)
    desc += '\n\n' + StrFormat.strip_html(entry.get('description', ''))
    return desc + '\n\n\n' + entry.get('link', '') + '\n'
```
The callback `_fn` is only called if the file does not exist yet.
This avoids unnecessary processing in repeated calls.
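A hypothetical reimplementation of that decorator behavior (write-if-missing with a callback); this is a sketch, not the actual `FileWrite` code, and it omits the `date` stamping the real helper performs:

```py
import os


def write_once(dest_dir, filename, overwrite=False):
    """Decorator sketch: write fn()'s return value unless the file already exists."""
    def decorator(fn):
        target = os.path.join(dest_dir, filename)
        if overwrite or not os.path.exists(target):
            with open(target, 'w') as fp:
                fp.write(fn())  # the callback only runs when a write happens
        return fn
    return decorator
```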