Title: warctika: warctika 1.0: First production release
Type: Software
Tom Nicholls (2014): warctika: warctika 1.0: First production release. Zenodo. Software. https://zenodo.org/record/12183
Links
- Item record in Zenodo: https://zenodo.org/record/12183
- Digital object URL: https://doi.org/10.5281/zenodo.12183
Summary
This library processes web crawl data fetched with the Heritrix web crawler (or other tools that produce WARC files), extracts the plain text from structured formats, and re-saves the data as WARC "conversion" records.
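The extract-and-resave step described above can be illustrated with a minimal sketch. This is not warctika's own code or API: it uses Python's standard-library html.parser as a stand-in for Apache Tika, and the function names extract_text and build_conversion_record are hypothetical. It assumes the WARC/1.0 record layout (a version line, named header fields, a blank line, the payload, then two CRLF pairs).

```python
from datetime import datetime, timezone
from html.parser import HTMLParser
import uuid


class _TextCollector(HTMLParser):
    """Collects visible text nodes, skipping <script> and <style> content."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_text(html):
    """Hypothetical stand-in for the Tika extraction step (HTML only)."""
    parser = _TextCollector()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


def build_conversion_record(target_uri, refers_to_id, text):
    """Serialise extracted text as a WARC/1.0 'conversion' record (bytes).

    refers_to_id is the WARC-Record-ID of the original response record.
    """
    payload = text.encode("utf-8")
    now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    headers = [
        "WARC/1.0",
        "WARC-Type: conversion",
        "WARC-Record-ID: <urn:uuid:%s>" % uuid.uuid4(),
        "WARC-Date: " + now,
        "WARC-Target-URI: " + target_uri,
        "WARC-Refers-To: " + refers_to_id,
        "Content-Type: text/plain; charset=utf-8",
        "Content-Length: %d" % len(payload),
    ]
    head = "\r\n".join(headers).encode("utf-8") + b"\r\n\r\n"
    # A WARC record ends with two CRLF pairs after the payload block.
    return head + payload + b"\r\n\r\n"


if __name__ == "__main__":
    page = "<html><body><p>Hello <b>world</b></p></body></html>"
    record = build_conversion_record(
        "http://example.org/", "<urn:uuid:original-record-id>", extract_text(page)
    )
    print(record.decode("utf-8"))
```

In a real pipeline the input would come from a WARC reader (such as the hanzo library mentioned below) rather than a literal string, and the extraction step would also need to handle non-HTML formats, which is what Apache Tika provides.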
The primary use for this tool is to extract text from web crawl data sets for use in machine learning and supervised classification work.
WARC (Web ARChive) is a file format for storing web crawls: http://bibnum.bnf.fr/WARC/
The hanzo library, on which this code depends, can be installed with 'pip install warctools'. Beware that several old versions are published under different names in the package index.
The software at this stage should be considered feature-complete, though it may have minor additions in the future.
More information
- DOI: 10.5281/zenodo.12183
Subjects
- Python, WARC files, Heritrix, OpenWayback, Text extraction, Apache Tika
Dates
- Publication date: 2014
- Issued: October 10, 2014
Rights
- https://opensource.org/licenses/GPL-2.0 GNU General Public License v2.0 only
- info:eu-repo/semantics/openAccess Open Access
Format
electronic resource
Related items
Relationship | Uri |
---|---|
IsSupplementTo | https://github.com/pmyteh/warctika/tree/v1.0 |
IsVersionOf | https://doi.org/10.5281/zenodo.592694 |
IsPartOf | https://zenodo.org/communities/zenodo |