Title: warctika: warctika 1.0: First production release

Type: Software

Citation: Tom Nicholls (2014): warctika: warctika 1.0: First production release. Zenodo. Software. https://zenodo.org/record/12183

Author: Tom Nicholls (Oxford Internet Institute, University of Oxford)

Summary

This library is designed to handle web crawl data fetched using the Heritrix web crawler (or other tools producing WARC files), extract the plain text from structured formats, and resave the data as WARC "conversion" records.

The primary use of this tool is to extract text from web crawl datasets for machine learning and supervised classification work.

WARC (Web ARChive) is a file format for storing web crawls: http://bibnum.bnf.fr/WARC/
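To illustrate what a WARC "conversion" record looks like on disk, the following is a minimal sketch using only the Python standard library. It is not warctika's own code and does not use the hanzo library or Apache Tika; the function name `build_conversion_record` and its parameters are hypothetical, and the layout assumes the WARC 1.0 record structure (version line, named header fields, a blank line, the content block, then two CRLF pairs).

```python
import uuid
from datetime import datetime, timezone


def build_conversion_record(target_uri, refers_to_id, text, when=None):
    """Serialise already-extracted plain text as one WARC 1.0 'conversion' record.

    Hypothetical helper for illustration only; not part of warctika.
    `refers_to_id` is the WARC-Record-ID of the record the text was derived from.
    `when` is assumed to be a UTC datetime; it defaults to the current time.
    """
    payload = text.encode("utf-8")
    when = when or datetime.now(timezone.utc)
    headers = [
        ("WARC-Type", "conversion"),
        ("WARC-Record-ID", "<urn:uuid:%s>" % uuid.uuid4()),
        ("WARC-Date", when.strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Target-URI", target_uri),
        ("WARC-Refers-To", refers_to_id),
        ("Content-Type", "text/plain; charset=utf-8"),
        # Content-Length counts the bytes of the content block, not characters.
        ("Content-Length", str(len(payload))),
    ]
    head = "WARC/1.0\r\n" + "".join("%s: %s\r\n" % kv for kv in headers) + "\r\n"
    # Each record is terminated by two CRLF pairs after the content block.
    return head.encode("utf-8") + payload + b"\r\n\r\n"
```

Appending the returned bytes to a file opened in binary mode would give an uncompressed WARC file holding one conversion record per call. How warctika itself writes its records may differ.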

The hanzo library, on which this code depends, can be installed with 'pip install warctools'. Beware that several old versions are floating around under different names in the index.

At this stage the software should be considered feature-complete, though minor additions may follow in the future.

More information

DOI: 10.5281/zenodo.12183

Subjects

Python, WARC files, Heritrix, OpenWayback, Text extraction, Apache Tika

Dates

Publication date: 2014
Issued: October 10, 2014

Rights

https://opensource.org/licenses/GPL-2.0 GNU General Public License v2.0 only
info:eu-repo/semantics/openAccess Open Access

Format

electronic resource

Related items

Description	Item type	Relationship	Uri
		IsSupplementTo	https://github.com/pmyteh/warctika/tree/v1.0
		IsVersionOf	https://doi.org/10.5281/zenodo.592694
		IsPartOf	https://zenodo.org/communities/zenodo
