Data Science Software, Libraries, and Packages

Package Managers

Software, Libraries, and Packages

  • Deep learning and neural networks
    • Torch - A scientific computing framework with wide support for machine learning algorithms that puts GPUs first
    • Caffe - A deep learning framework made with expression, speed, and modularity in mind
    • DL4J - Open-Source, Distributed, Deep Learning Library for the JVM
    • Theano - Python library that allows you to define, optimize, and evaluate mathematical expressions involving multi-dimensional arrays efficiently
    • TensorFlow - Open source software library for numerical computation using data flow graphs
    • Amazon Deep Scalable Sparse Tensor Network Engine (DSSTNE) - An Amazon-developed library for building Deep Learning (DL) machine learning (ML) models
    • Keras: Deep Learning library for Theano and TensorFlow - A high-level neural networks library, written in Python and capable of running on top of either TensorFlow or Theano
  • Weka - A collection of machine learning algorithms for data mining tasks
  • Anaconda - Open data science platform powered by Python
  • Python(x,y) - Free scientific and engineering development software for numerical computations, data analysis, and data visualization, based on the Python programming language, Qt graphical user interfaces, and the Spyder interactive scientific development environment
  • Python
    • IPython Documentation - Comprehensive environment for interactive and exploratory computing
    • Jupyter notebook - A web application that allows you to create and share documents that contain live code, equations, visualizations and explanatory text
    • Matplotlib - A Python 2D plotting library which produces publication-quality figures in a variety of hardcopy formats and interactive environments across platforms
    • Natural Language Toolkit - A leading platform for building Python programs to work with human language data
    • NumPy - The fundamental package for scientific computing with Python
    • SciPy - A Python-based ecosystem of open-source software for mathematics, science, and engineering
    • Pandas - An open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language
    • PyBrain - Python-Based Reinforcement Learning, Artificial Intelligence and Neural Network Library
    • Scikit-image - A collection of algorithms for image processing
    • Scikit-learn - A Python module for machine learning
    • Seaborn - A Python visualization library based on matplotlib
    • StatsModels - A Python module that allows users to explore data, estimate statistical models, and perform statistical tests
    • Pattern - Web mining module for Python, with tools for scraping, natural language processing, machine learning, network analysis and visualization
    • Scrapy - An open source and collaborative framework for extracting the data you need from websites
    • ggplot - A package for plotting in Python
    • Altair - Declarative statistical visualization library for Python
    • Blaze - Provides Python users high-level access to efficient computation on inconveniently large data
    • Dask - A flexible parallel computing library for analytic computing
    • Bokeh - A Python interactive visualization library that targets modern web browsers for presentation
    • Basemap - A library for plotting 2D data on maps in Python
    • NetworkX - A Python language software package for the creation, manipulation, and study of the structure, dynamics, and functions of complex networks
    • Beautiful Soup - A Python library for pulling data out of HTML and XML files
    • Gensim - Python framework for fast Vector Space Modelling
    • Shogun - Machine learning toolbox that provides a wide range of unified and efficient Machine Learning (ML) methods
    • Chainer - A Powerful, Flexible, and Intuitive Framework for Neural Networks
    • NuPIC - An open source project based on a theory of neocortex called Hierarchical Temporal Memory (HTM)
    • Neon - Python-based deep learning library
    • PyMC - A Python module that implements Bayesian statistical models and fitting algorithms, including Markov chain Monte Carlo
    • Fuel - A data pipeline framework which provides your machine learning models with the data they need
    • PyMVPA - PyMVPA stands for MultiVariate Pattern Analysis (MVPA) in Python
    • DEAP - A novel evolutionary computation framework for rapid prototyping and testing of ideas
    • Annoy - Approximate Nearest Neighbors in C++/Python optimized for memory usage and loading/saving to disk
  • R
    • General CRAN List - By task
    • General CRAN List - NLP/Text analytics
    • General CRAN List
    • ggplot2 - A plotting system for R
    • ISLR - The collection of datasets used in the book "An Introduction to Statistical Learning with Applications in R"
    • Rcpp - Provides R functions as well as C++ classes which offer a seamless integration of R and C++
    • dplyr - A fast, consistent tool for working with data frame like objects, both in memory and out of memory
    • plyr - A set of tools for breaking a big problem down into manageable pieces, operating on each piece, and putting the pieces back together
    • stringr - A consistent, simple and easy to use set of wrappers around the fantastic 'stringi' package
    • shiny - Makes it easy to build interactive web applications with R
    • knitr - A general-purpose tool for dynamic report generation in R using Literate Programming techniques
    • readr - Read flat/tabular text files from disk (or a connection)
    • R Markdown - Convert R Markdown documents into a variety of formats
    • tidyr - Designed specifically for data tidying (not general reshaping or aggregating); works well with 'dplyr' data pipelines
    • lubridate - Functions to work with date-times and time-spans
    • lme4 - Fit linear and generalized linear mixed-effects models
    • nlme - Fit and compare Gaussian linear and nonlinear mixed-effects models
    • mime - Guesses the MIME type from a filename extension using the data derived from /etc/mime.types in UNIX-type systems
    • mda - Mixture and flexible discriminant analysis, multivariate adaptive regression splines (MARS), BRUTO, ...
    • lasso2 - Routines and documentation for solving regression problems while imposing an L1 constraint on the estimates
    • lars - Efficient procedures for fitting an entire lasso sequence with the cost of a single least squares fit
    • digest - Implementation of a function 'digest()' for the creation of hash digests of arbitrary R objects (using the 'md5', 'sha-1', 'sha-256', 'crc32', 'xxhash' and 'murmurhash' algorithms) permitting easy comparison of R language objects, as well as a function 'hmac()' to create hash-based message authentication code
    • reshape2 - Flexibly restructure and aggregate data using just two functions: 'melt' and 'dcast' (or 'acast')
    • colorspace - Carries out mapping between assorted color spaces including RGB, HSV, HLS, CIEXYZ, CIELUV, HCL (polar CIELUV), CIELAB and polar CIELAB
    • RColorBrewer - Provides color schemes for maps (and other graphics)
    • manipulate - Interactive plotting functions for use within RStudio
    • scales - Graphical scales map data to aesthetics, and provide methods for automatically determining breaks and labels for axes and legends
    • labeling - Provides a range of axis labeling algorithms
    • proto - An object-oriented system using object-based (also called prototype-based) rather than class-based object-oriented ideas
    • randomForest - Classification and regression based on a forest of trees using random inputs
    • glmnet - Extremely efficient procedures for fitting the entire lasso or elastic-net regularization path for linear regression, logistic and multinomial regression models, Poisson regression and the Cox model
    • caret - Miscellaneous functions for training and plotting classification and regression models
    • ggvis - An implementation of an interactive grammar of graphics, taking the best parts of 'ggplot2', combining them with the reactive framework of 'shiny' and drawing web graphics using 'vega'
    • rgl - Provides medium to high level functions for 3D interactive graphics, including functions modelled on base graphics (plot3d(), etc.) as well as functions for constructing representations of geometric objects (cube3d(), etc.)
    • htmlwidgets - A framework for creating HTML widgets that render in various contexts including the R console, 'R Markdown' documents, and 'Shiny' web applications
    • leaflet - Create and customize interactive maps using the 'Leaflet' JavaScript library and the 'htmlwidgets' package
    • dygraphs - An R interface to the 'dygraphs' JavaScript charting library
    • googleVis - R interface to Google Charts API, allowing users to create interactive charts based on data frames
    • zoo - An S3 class with methods for totally ordered indexed observations. It is particularly aimed at irregular time series of numeric vectors/matrices and factors
    • RCurl - A wrapper for 'libcurl' (http://curl.haxx.se/libcurl/) that provides functions for composing general HTTP requests, along with convenience functions to fetch URIs, get and post forms, etc., and process the results returned by the web server
    • jsonlite - A fast JSON parser and generator optimized for statistical data and the web
    • bitops - Functions for bitwise operations on integer vectors
    • devtools - Collection of package development tools
    • magrittr - Provides a mechanism for chaining commands with a new forward-pipe operator, %>%
    • packrat - Manage the R packages your project depends on in an isolated, portable, and reproducible way
    • haven - Import foreign statistical formats into R via the embedded 'ReadStat' C library
    • DT - Data objects in R can be rendered as HTML tables using the JavaScript library 'DataTables' (typically via R Markdown or Shiny)
    • mice - Multiple imputation using Fully Conditional Specification (FCS) implemented by the MICE algorithm
    • rpart - Recursive partitioning for classification, regression and survival trees
    • party - A computational toolbox for recursive partitioning
    • nnet - Software for feed-forward neural networks with a single hidden layer, and for multinomial log-linear models
    • e1071 - Functions for latent class analysis, short time Fourier transform, fuzzy clustering, support vector machines, shortest path computation, bagged clustering, naive Bayes classifier, ...
    • kernlab - Kernel-based machine learning methods for classification, regression, clustering, novelty detection, quantile regression and dimensionality reduction
    • gbm - Includes regression methods for least squares, absolute loss, t-distribution loss, quantile regression, logistic, multinomial logistic, Poisson, Cox proportional hazards partial likelihood, AdaBoost exponential loss, Huberized hinge loss, and Learning to Rank measures (LambdaMart)
    • wordcloud - Pretty word clouds
    • C50 - C5.0 decision trees and rule-based models for pattern recognition
    • class - Various functions for classification, including k-nearest neighbour, Learning Vector Quantization and Self-Organizing Maps
    • neuralnet - Training of neural networks using backpropagation, resilient backpropagation with (Riedmiller, 1994) or without weight backtracking (Riedmiller and Braun, 1993) or the modified globally convergent version by Anastasiadis et al. (2005)
    • tm - A framework for text mining applications within R
    • gmodels - Various R programming tools for model fitting
    • RODBC - An ODBC database interface
    • princurve - Fits a principal curve to a data matrix in arbitrary dimensions
  • Analytics