Title: SciKit-Learn Laboratory (SKLL) 1.0.0

Type: Software

Citation: Dan Blanchard, Nitin Madnani, Michael Heilman, Nils Murrugarra Llerena, Diane M. Napolitano, Aoife Cahill, Keelan Evanini, Chee Wee Leong (2014): SciKit-Learn Laboratory (SKLL) 1.0.0. Zenodo. Software. https://zenodo.org/record/12825

Authors: Dan Blanchard (Educational Testing Service); Nitin Madnani (Educational Testing Service); Michael Heilman (Educational Testing Service); Nils Murrugarra Llerena (University of Pittsburgh); Diane M. Napolitano (Educational Testing Service); Aoife Cahill (Educational Testing Service); Keelan Evanini (Educational Testing Service); Chee Wee Leong (Educational Testing Service)

Summary

The 1.0 release is finally here! It's been a little over a year since our first public release, and we're ready to say that SKLL is 1.0. Read our massive release notes:

We did make some API- and config-file-breaking changes. They are listed at the end of the release notes. They should all be addressable by a quick find-and-replace.

Bug fixes

- Fixed path problems in iris example (issue #103, PR #171)
- Fixed bug where ablated_features field was incorrect when config file contained multiple feature sets (issue #125)
- Fixed bug where CV would crash with rare classes (issue #109, PR #165)
- Fixed issue where warning about extremely large feature values was being issued before rescaling
- Fixed issue where some warning messages used mix of new-style and old-style replacement strings with old-style formatting.
- Fixed a number of bugs with filtering FeatureSet objects and writing filtered sets to files.
- Fixed bug in FeatureSet.__sub__ where feature names were being passed instead of indices.
- Fixed issue where MegaMWriter could not print numbers in Python 2.7.

New features

- SKLL releases are now for specific versions of scikit-learn. 1.0.0 requires scikit-learn 0.15.2 (issue #138, PR #170)
- Added tutorial to documentation that walks new users through using SKLL in much the same way as our PyData talks.
- Added support for custom learners (issue #92, PR #183)
- Added two command-line utilities, join_features and filter_features, for joining and filtering feature files. These replace join_megam and filter_megam (issue #79, PR #198)
- Added support for specifying the field in ARFF, CSV, or TSV files that contains the IDs for each instance (issue #204, PR #206)
- Added train/test set sizes to result files (issue #150, PR #161)
- Added intercept to print_model_weights output (issue #155, PR #163)
- Added total time and end time-stamp to experiment results (issue #91, PR #167)
- Added exception when featureset_name is longer than 210 characters (issue #121, PR #168)
- Added regression example data, boston (issue #162)
- Added ability to specify number of grid search folds (issue #122, PR #175)
- Added warning message when number of features in training model are different than those for FeatureSet passed to Learner.predict() (issue #145)
- Added conda.yaml file to repository to make conda package creation simpler (issue #159, PR #173)
- Added loads more unit tests, greatly increased unit test coverage, and generally cleaned up test modules (issues #97, #148, #157, #188, and #202; PRs #176, #184, #196, #203, and #205)
- Added train_file and test_file fields to config files, which can be used to specify single file feature sets. This greatly simplifies running simple experiments (issue #12, PR #197)
- Added support for merging feature sets with IDs in different orders (issue #149, PR #177)
- Added ValueError when invalid tuning objective is specified (issues #117 and #179; PRs #174 and #181)
- Added shuffle option to config files to decide whether training data should be shuffled before training. By default this is False, but if grid_search is True, we will automatically shuffle. Previously, the default was True, and there was no option in the config files. (issue #189, PR #190)
- Updated documentation to indicate that we're using StratifiedKFold (issue #160)
- Added FeatureSet.__eq__ and FeatureSet.__getitem__ methods.
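
The shuffling rule described above (shuffle defaults to False, but grid search forces shuffling) can be sketched in a few lines of Python. This is only an illustration and not SKLL's own code: the section names ([General], [Input], [Tuning]), the file names, and the helper effective_shuffle are assumptions made for the example.

```python
import configparser

# Hypothetical minimal experiment config. The section layout and the file
# names are assumptions for illustration, not copied from the SKLL docs.
CONFIG_TEXT = """
[General]
experiment_name = example
task = evaluate

[Input]
train_file = train.csv
test_file = test.csv
learners = ["LogisticRegression"]

[Tuning]
grid_search = true
"""


def effective_shuffle(config):
    """Return True if training data would be shuffled under the rule above."""
    # shuffle is off unless the config file turns it on ...
    shuffle = config.getboolean("Input", "shuffle", fallback=False)
    # ... but enabling grid search implies shuffling regardless.
    grid_search = config.getboolean("Tuning", "grid_search", fallback=False)
    return shuffle or grid_search


config = configparser.ConfigParser()
config.read_string(CONFIG_TEXT)
print(effective_shuffle(config))
```

Because the sample config enables grid_search, the helper reports that the data would be shuffled even though no shuffle field is present.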

Minor changes without issues

- Updated docstrings all over the place to be more accurate.
- Updated generate_predictions to use new Reader API.
- Added argv optional argument to all utility script main functions to simplify testing.
- Added mock tests, so SKLL now requires mock to work with Python 2.7.
- Added prettier SVG badges to README.
- Added link to Data Science at the Command Line to README.
- LibSVMReader now converts the UTF-8 replacement characters that LibSVMWriter uses when a feature name contains an =, |, #, :, or space back to the original ASCII characters.

API breaking changes

Old	New
FeatureSetWriter	Writer
load_examples(path)	Reader.for_path(path).read()
write_feature_file(...)	Writer.for_path(FeatureSet(...)).write()
FeatureSet.classes	FeatureSet.labels
FeatureSet.feat_vectorizer	FeatureSet.vectorizer
run_ablation(all_combos=True)	run_configuration(ablation=None)
run_ablation()	run_configuration(ablation=1)
ExamplesTuple	FeatureSet

- All other instances of the word "classes" were changed to "labels" (#166).
- Removed feature_hasher argument to all Learner methods, because it's unnecessary.
- Learner.model_type is now the actual type of the underlying model instead of just a string.
- FeatureSet.__len__ now returns the number of examples instead of the number of features.
- Removed skll.learner._REGRESSION_MODELS; we now check for regression by seeing if the model is a subclass of RegressorMixin.
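
The last change above, replacing a hard-coded list of regression models with a subclass check, can be illustrated with a small dependency-free sketch. The RegressorMixin class here is a local stand-in for sklearn.base.RegressorMixin, and ToyRegressor, ToyClassifier, and is_regressor are hypothetical names; SKLL's actual check may be written differently.

```python
class RegressorMixin:
    """Local stand-in for sklearn.base.RegressorMixin (assumption for this sketch)."""


class ToyRegressor(RegressorMixin):
    """Hypothetical learner that is marked as a regressor via the mixin."""


class ToyClassifier:
    """Hypothetical learner that does not inherit from the mixin."""


def is_regressor(model):
    """Decide by inheritance rather than by looking the class name up in a list."""
    return issubclass(type(model), RegressorMixin)


print(is_regressor(ToyRegressor()), is_regressor(ToyClassifier()))
```

One advantage of this style of check is that any learner deriving from the mixin, including a custom one, is recognized without having to be registered anywhere.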

Config file breaking changes

- Removed all short names for learners (PR #199)
- Can no longer use classifiers instead of learners

Old	New
train_location	train_directory
test_location	test_directory
cv_folds_location	cv_folds_file
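
Since the summary says these changes should be addressable by a quick find-and-replace, here is a minimal Python sketch of one for the renamed config fields. The helper name migrate_config and the sample text are hypothetical, and the sketch assumes that test_location was meant to map to test_directory. It does not expand the removed short learner names, because the mapping for those is not given here.

```python
import re

# Old -> new config field names, based on the list above.
RENAMED_FIELDS = {
    "train_location": "train_directory",
    "test_location": "test_directory",  # assumption: intended new name
    "cv_folds_location": "cv_folds_file",
    "classifiers": "learners",
}

# Match an old field name at the start of a line, followed by "=" or ":",
# so that values which merely mention an old name are left untouched.
_FIELD_RE = re.compile(
    r"^([ \t]*)(" + "|".join(map(re.escape, RENAMED_FIELDS)) + r")([ \t]*[=:])",
    re.MULTILINE,
)


def migrate_config(text):
    """Rename pre-1.0 field names in the text of a config file."""
    return _FIELD_RE.sub(
        lambda m: m.group(1) + RENAMED_FIELDS[m.group(2)] + m.group(3),
        text,
    )


old_text = "[Input]\ntrain_location = train\nclassifiers = ['SVC']\n"
print(migrate_config(old_text))
```

Anchoring the pattern to the start of a line keeps the replacement from rewriting paths or comments that happen to contain one of the old names.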

More information

DOI: 10.5281/zenodo.12825

Dates

Publication date: 2014
Issued: November 22, 2014

Rights

http://www.opensource.org/licenses/bsd-license.php BSD licenses (New and Simplified)
info:eu-repo/semantics/openAccess Open Access

Format

electronic resource

Related items

Description	Item type	Relationship	Uri
		IsSupplementTo	https://github.com/EducationalTestingService/skll/tree/1.0.0
		IsVersionOf	https://doi.org/10.5281/zenodo.591574
		IsPartOf	https://zenodo.org/communities/zenodo
