update readme
12
README.md
@@ -5,13 +5,13 @@ A collection of scripts to parse and process input files (html, RSS, Atom) – w
 <sup>1</sup>: Well, except for `pytelegrambotapi`, but feel free to replace it with another module.
 
 Includes tools for extracting “news” articles, detection of duplicates, content downloader, cron-like repeated events, and Telegram bot client.
-Basically everything you need for quick and dirty format processing (html->RSS, notification->telegram, rss/podcast->download, webscraper, ...) without writing the same code over and over again.
-The architecture is modular and pipeline processing oriented.
+Basically everything you need for quick and dirty format processing (`html->RSS`, `notification->telegram`, `rss/podcast->download`, `web scraper`, ...) without writing the same code over and over again.
+The architecture is modular and pipeline oriented.
 Use whatever suits the task at hand.
 
 
-### In progress
+## Usage
 
-Documentation, examples, tests and `setup.py` will be added soon.
-
-Meanwhile, take a look at the [Usage](botlib/README.md) documentation.
+There is short [usage](./botlib/README.md) documentation on the individual components of this lib.
+And there are some [examples](./examples/) on how to combine them.
+Lastly, for web scraping, open [playground.py](./examples/web-scraper/playground.py) to test your regex.
@@ -17,7 +17,17 @@ from botlib.tgclient import TGClient
 
 ## Cli, DirType
 
-TODO
+A wrapper around `argparse` that provides shorter calls for common parameters.
+If `.arg_dir()` is not enough, use `DirType` for parameters that must be existing directories.
+
+```py
+cli = Cli()
+cli.arg_dir('dest_dir', help='...')
+cli.arg_file('FILE', help='...')
+cli.arg_bool('--dry-run', help='...')
+cli.arg('source', help='...')
+args = cli.parse()
+```
 
 
 
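For comparison, here is a rough sketch of what the calls above might correspond to in plain `argparse`. This is an assumption for illustration only: the commit does not show how `Cli` or `DirType` are implemented, and `existing_dir`, `existing_file`, and `build_parser` are hypothetical names that are not part of botlib.

```python
import argparse
import os


def existing_dir(value: str) -> str:
    """argparse `type=` callable: accept only paths of existing directories."""
    if not os.path.isdir(value):
        raise argparse.ArgumentTypeError(f'not a directory: {value!r}')
    return value


def existing_file(value: str) -> str:
    """argparse `type=` callable: accept only paths of existing files."""
    if not os.path.isfile(value):
        raise argparse.ArgumentTypeError(f'not a file: {value!r}')
    return value


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical plain-argparse counterpart of the Cli example."""
    parser = argparse.ArgumentParser()
    parser.add_argument('dest_dir', type=existing_dir, help='...')
    parser.add_argument('FILE', type=existing_file, help='...')
    parser.add_argument('--dry-run', action='store_true', help='...')
    parser.add_argument('source', help='...')
    return parser
```

A `type=` callable that raises `argparse.ArgumentTypeError` lets the parser report a readable usage error instead of failing later when the path is first used.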
@@ -54,42 +64,74 @@ Includes an etag / last-modified check to reduce network load.
 ```py
 # for these 3 calls, create a download connection just once.
 # the result is stored in cache/curl-example.org-1234..ABC.data
-Curl.get('https://example.org')
+Curl.get('https://example.org', cache_only=False)
 Curl.get('https://example.org')
 Curl.get('https://example.org')
 # if the URL differs, the content is downloaded into another file
 Curl.get('https://example.org?1')
 # download files
 Curl.file('https://example.org/image.png', './example-image.png')
+# ... or json
+Curl.json('https://example.org/data.json', fallback={})
 # ... or just open a connection
 with Curl.open('https://example.org') as wp:
     wp.read()
 ```
 
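The hunk header above mentions an etag / last-modified check. As a minimal sketch of how such a conditional request can be built with the standard library (not necessarily how `Curl` does it; `conditional_request` is a hypothetical helper), the validators stored from a previous response are sent back as request headers:

```python
import urllib.request
from typing import Optional


def conditional_request(url: str, etag: Optional[str] = None,
                        last_modified: Optional[str] = None) -> urllib.request.Request:
    """Build a GET request that lets the server skip the body if nothing changed."""
    req = urllib.request.Request(url)
    if etag:
        # Value of the ETag response header from the previous download.
        req.add_header('If-None-Match', etag)
    if last_modified:
        # Value of the Last-Modified response header from the previous download.
        req.add_header('If-Modified-Since', last_modified)
    return req
```

If the resource is unchanged, the server may answer `304 Not Modified` with an empty body. With `urllib.request.urlopen` this surfaces as an `HTTPError` whose `code` is 304, at which point the cached copy can be reused.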
-There is also an easy-getter to download files only if they do not appear locally.
+There is also an easy-getter to download multiple files that have not been downloaded yet.
 
 ```py
 Curl.once('./dest/dir/', 'filename', [
     'https://example.org/audio.mp3',
     'https://example.org/image.jpg'
-], date, desc='my super long description')
+], date)
 ```
 
 This will check whether `./dest/dir/filename.mp3` and `./dest/dir/filename.jpg` exist – and if not, download them.
-It will also put the content of desc in `./dest/dir/filename.txt` (again, only if the file does not exist yet).
-All file modification dates will be set to `date`.
+All file modification dates will be set to `date` (if provided).
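The "download only if missing" behaviour described above can be sketched roughly as follows. This assumes the target name is the given base name plus the extension taken from each URL, which matches the `filename.mp3` / `filename.jpg` example but is otherwise a guess. `pending_downloads` and `set_mtime` are hypothetical helpers, not botlib functions.

```python
import os
from typing import List, Optional, Tuple
from urllib.parse import urlparse


def pending_downloads(dest_dir: str, basename: str,
                      urls: List[str]) -> List[Tuple[str, str]]:
    """Return (url, target_path) pairs for targets that do not exist yet."""
    pending = []
    for url in urls:
        ext = os.path.splitext(urlparse(url).path)[1]  # e.g. '.mp3'
        target = os.path.join(dest_dir, basename + ext)
        if not os.path.exists(target):
            pending.append((url, target))
    return pending


def set_mtime(path: str, timestamp: Optional[float]) -> None:
    """Set access and modification time (seconds since epoch), if a timestamp is given."""
    if timestamp is not None:
        os.utime(path, (timestamp, timestamp))
```

Only the pairs returned by `pending_downloads` would need a network request, so repeated runs over the same feed stay cheap.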
 
 
 
 ## Feed2List
 
-TODO
+Parses RSS or Atom XML and returns a list. Similar to `HTML2list`.
+
+```py
+for entry in reversed(Feed2List(fp, keys=[
+    'link', 'title', 'description', 'enclosure',  # audio
+    'pubDate', 'media:content',  # image
+    # 'itunes:image', 'itunes:duration', 'itunes:summary'
+])):
+    date = entry.get('pubDate')
+    process_entry(entry, date)
+```
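To illustrate the general idea of turning a feed into a list of dicts, here is a small sketch using only `xml.etree.ElementTree`. It is not the `Feed2List` implementation: `rss_items` is a hypothetical function, it handles plain RSS `<item>` elements only, and it ignores namespaced keys such as `media:content` as well as Atom feeds.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Iterable, List


def rss_items(xml_text: str, keys: Iterable[str]) -> List[Dict[str, str]]:
    """Collect the text of the requested child tags for every RSS <item>."""
    keys = list(keys)
    root = ET.fromstring(xml_text)
    result = []
    for item in root.iter('item'):
        entry = {}
        for key in keys:
            node = item.find(key)
            if node is not None and node.text:
                entry[key] = node.text.strip()
        result.append(entry)
    return result
```

Feeds usually list the newest entry first, so wrapping the result in `reversed(...)`, as the example above does, processes entries oldest first.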
 
 
 
-## Log, FileTime, StrFormat
+## Log, FileTime, StrFormat, FileWrite
 
-TODO
+A few tools that may be helpful.
+
+- `Log.error` and `Log.info` just print a message (including the current timestamp) to the console.
+- `FileTime.get` and `FileTime.set` read and write file times, e.g., to set a file's date to the time a podcast episode was released.
+- `StrFormat` has:
+  - `.strip_html()`: makes plain text from HTML
+  - `.to_date()`: converts a string to a date
+  - `.safe_filename()`: strips all filesystem-unsafe characters
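The three string helpers listed above could be approximated with the standard library as follows. This is only a sketch under assumptions: the real `StrFormat` methods may accept other date formats, strip a different character set, or treat whitespace differently. The function names mirror the list but are standalone and hypothetical.

```python
import re
from datetime import datetime
from email.utils import parsedate_to_datetime
from html.parser import HTMLParser


class _TextOnly(HTMLParser):
    """Collects text nodes and drops all tags."""

    def __init__(self) -> None:
        super().__init__()
        self.parts = []

    def handle_data(self, data: str) -> None:
        self.parts.append(data)


def strip_html(markup: str) -> str:
    """Return the text content of an HTML fragment."""
    parser = _TextOnly()
    parser.feed(markup)
    parser.close()
    return ''.join(parser.parts).strip()


def to_date(text: str) -> datetime:
    """Parse an RFC 2822 date, the format used by RSS `pubDate`."""
    return parsedate_to_datetime(text)


def safe_filename(name: str) -> str:
    """Remove characters that are unsafe on common filesystems (assumed set)."""
    return re.sub(r'[\\/:*?"<>|\x00-\x1f]', '', name).strip()
```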
+
+`FileWrite.once` works exactly the same way as `Curl.once` does.
+The only difference is that `FileWrite` accepts just one filename and requires a callback method:
+
+```py
+@FileWrite.once(dest_dir, 'description.txt', date, overwrite=False)
+def _fn():
+    desc = title + '\n' + '=' * len(title)
+    desc += '\n\n' + StrFormat.strip_html(entry.get('description', ''))
+    return desc + '\n\n\n' + entry.get('link', '') + '\n'
+```
+
+The callback `_fn` is only called if the file does not exist yet.
+This avoids unnecessary processing in repeated calls.
 
 
 
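A decorator that runs its callback immediately, and only when the target file is missing, can be sketched like this. It illustrates the pattern rather than the `FileWrite` code: `write_once` is a hypothetical name, and the `timestamp` and `overwrite` parameters are assumed to behave as the example above suggests.

```python
import os
from typing import Callable, Optional


def write_once(dest_dir: str, filename: str, timestamp: Optional[float] = None,
               overwrite: bool = False) -> Callable[[Callable[[], str]], bool]:
    """Decorator factory: call the decorated function only if the file must be written."""
    path = os.path.join(dest_dir, filename)

    def decorator(fn: Callable[[], str]) -> bool:
        if os.path.exists(path) and not overwrite:
            return False  # file is already there; fn is never called
        with open(path, 'w', encoding='utf-8') as fp:
            fp.write(fn())
        if timestamp is not None:
            # Align the file's modification time with, e.g., the release date.
            os.utime(path, (timestamp, timestamp))
        return True

    return decorator
```

Because the decorator executes at definition time, the decorated name ends up bound to the boolean result rather than to a function, which is acceptable for a throwaway name such as `_fn`.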