Title: warctika: warctika 1.0: First production release

Type: Software

Tom Nicholls (2014): warctika: warctika 1.0: First production release. Zenodo. Software. https://zenodo.org/record/12183

Author: Tom Nicholls (Oxford Internet Institute, University of Oxford)

Summary

This library is designed to handle web crawl data fetched using the Heritrix web crawler (or other tools producing WARC files), extract the plain text from structured formats and resave the data as WARC "conversion" records.
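
The extract-and-resave step described above can be illustrated with a minimal, standard-library-only sketch. It is not warctika's own code: the names `html_to_text` and `make_conversion_record` are hypothetical, the simple HTML stripper merely stands in for whatever extractor the library actually uses, and the header set is an assumed minimal one based on the WARC 1.0 format.

```python
import uuid
from datetime import datetime, timezone
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html):
    """Hypothetical helper: reduce an HTML string to newline-separated text."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


def make_conversion_record(target_uri, refers_to, text):
    """Hypothetical helper: serialise text as a WARC/1.0 'conversion' record.

    refers_to is the WARC-Record-ID of the original record that the text
    was derived from. Returns the record as bytes.
    """
    payload = text.encode("utf-8")
    headers = [
        ("WARC-Type", "conversion"),
        ("WARC-Record-ID", "<urn:uuid:%s>" % uuid.uuid4()),
        ("WARC-Date", datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Target-URI", target_uri),
        ("WARC-Refers-To", refers_to),
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(payload))),
    ]
    head = "WARC/1.0\r\n" + "".join("%s: %s\r\n" % kv for kv in headers) + "\r\n"
    # A WARC record is its header block, the payload, then two CRLFs.
    return head.encode("utf-8") + payload + b"\r\n\r\n"
```

In use, each response record read from a crawl would be passed through `html_to_text` and the result written out with `make_conversion_record`, so that the new record points back at its source via `WARC-Refers-To`.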

The primary use for this tool is to extract text from web crawl data sets for use in machine learning and supervised classification work.

WARC (Web ARChive) is a file format for storing web crawls: http://bibnum.bnf.fr/WARC/

The hanzo library, on which this code depends, can be installed with 'pip install warctools'. Beware that several old versions are published in the package index under different names.

The software at this stage should be considered feature-complete, though it may have minor additions in the future.

More information

  • DOI: 10.5281/zenodo.12183

Subjects

  • Python, WARC files, Heritrix, OpenWayback, Text extraction, Apache Tika

Dates

  • Publication date: 2014
  • Issued: October 10, 2014

Format

electronic resource

Related items

  • IsSupplementTo: https://github.com/pmyteh/warctika/tree/v1.0
  • IsVersionOf: https://doi.org/10.5281/zenodo.592694
  • IsPartOf: https://zenodo.org/communities/zenodo