leveldb - A fast and lightweight key/value database library by Google.
cpy-leveldb - Python bindings for LevelDB using leveldb c api.
The Chromium Projects - The Chromium projects include Chromium and Chromium OS, the open-source projects behind the Google Chrome browser and Google Chrome OS, respectively.
C++ base libraries
toft - C++ Base Library for Linux server side development.
stringencoders - A collection of high performance c-string transformations, frequently 2x faster than standard implementations (if they exist at all).
Numpy - NumPy is the fundamental package for scientific computing with Python.
Natural language processing libraries
NLTK - NLTK -- the Natural Language Toolkit -- is a suite of open source Python modules, data sets and tutorials supporting research and development in Natural Language Processing.
NLTK Book
gensim - Gensim is a Python library for topic modelling, document indexing and similarity retrieval with large corpora. Target audience is the natural language processing (NLP) and information retrieval (IR) community.
Stanford CoreNLP - Stanford CoreNLP provides a set of natural language analysis tools which can take raw English language text input and give the base forms of words, their parts of speech, whether they are names of companies, people, etc., normalize dates, times, and numeric quantities, and mark up the structure of sentences in terms of phrases and word dependencies, and indicate which noun phrases refer to the same entities.
openNLP - The Apache OpenNLP library is a machine learning based toolkit for the processing of natural language text.
SRILM - SRILM is a toolkit for building and applying statistical language models (LMs), primarily for use in speech recognition, statistical tagging and segmentation, and machine translation.
IRSTLM - The IRST Language Modeling Toolkit features algorithms and data structures suitable to estimate, store, and access very large LMs.
KenLM - KenLM estimates unpruned language models with modified Kneser-Ney smoothing.
Moses - Moses is a statistical machine translation system that allows you to automatically train translation models for any language pair.
GIZA++ - GIZA++ is a statistical machine translation toolkit that is used to train IBM Models 1-5 and an HMM word alignment model.
ReVerb - ReVerb is a program that automatically identifies and extracts binary relationships from English sentences. ReVerb is designed for Web-scale information extraction, where the target relations cannot be specified in advance and speed is important.
Lemur - The Lemur Project develops search engines, browser toolbars, text analysis tools, and data resources that support research and development of information retrieval and text mining software.
Lucene - The Apache Lucene project develops open-source search software.
Solr - Solr is the popular, blazing fast open source enterprise search platform from the Apache Lucene project. Its major features include powerful full-text search, hit highlighting, faceted search, near real-time indexing, dynamic clustering, database integration, rich document (e.g., Word, PDF) handling, and geospatial search. Solr is highly reliable, scalable and fault tolerant, providing distributed indexing, replication and load-balanced querying, automated failover and recovery, centralized configuration and more. Solr powers the search and navigation features of many of the world's largest internet sites.
gensim - Gensim is a Python library for topic modelling, document indexing and similarity retrieval with large corpora. Target audience is the natural language processing (NLP) and information retrieval (IR) community.
LASSO - LASSO is a parallel machine learning system that learns a regression model from large data. It works in either of two modes: IPM-mode and MPI-mode.
libsvm - A Library for Support Vector Machines.
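A Plain-Language Introduction to Support Vector Machines (Three Levels of Understanding SVM), by the researcher July. In this article you will see that understanding SVM comes in three levels:
Level 1: Get to know SVM (a rough idea of what it is will be enough);
Level 2: Dig into SVM (you will work through SVM's inner principles with me and follow every thread, so that you can apply it with ease later);
Level 3: Prove SVM (once you understand all the principles, you will feel the urge to pick up a pen and try proving it).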
liblinear - A Library for Large Linear Classification.
RankLib - RankLib is a library of learning to rank algorithms.
svmlight - SVMlight is an implementation of Support Vector Machines (SVMs) in C.
plda - A parallel C++ implementation of fast Gibbs sampling of Latent Dirichlet Allocation
GibbsLDA++ - A C/C++ implementation of Latent Dirichlet Allocation (LDA) using Gibbs Sampling technique for parameter estimation and inference.
Yahoo_LDA - Yahoo!'s topic modelling framework using Latent Dirichlet Allocation
maxent - A simple C++ library for maximum entropy classification.
easyME - This is a simple implementation of Maximum Entropy model. Algorithms implemented include: GIS, SCGIS, LBFGS, Gaussian smoothing and Exponential smoothing.
libLBFGS - This library is a C port of the implementation of Limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) method written by Jorge Nocedal.
OWL-QN - The Orthant-Wise Limited-memory Quasi-Newton algorithm (OWL-QN) is a numerical optimization procedure for finding the optimum of an objective of the form {smooth function} plus {L1-norm of the parameters}. It has been used for training log-linear models (such as logistic regression) with L1-regularization.
CRF++ - CRF++ is a simple, customizable, and open source implementation of Conditional Random Fields (CRFs) for segmenting/labeling sequential data. CRF++ is designed for generic purpose and will be applied to a variety of NLP tasks, such as Named Entity Recognition, Information Extraction and Text Chunking.
CRFsuite - A fast implementation of Conditional Random Fields (CRFs).
Wapiti - Wapiti is a very fast toolkit for segmenting and labeling sequences with discriminative models. It is based on maxent models, maximum entropy Markov models and linear-chain CRF and proposes various optimization and regularization methods to improve both the computational complexity and the prediction performance of standard models.
sofia-ml - Suite of Fast Incremental Algorithms for Machine Learning. Includes methods for learning classification and ranking models, using Pegasos SVM, SGD-SVM, ROMMA, Passive-Aggressive Perceptron, Perceptron with Margins, and Logistic Regression.
mahout - The Apache Mahout machine learning library's goal is to build scalable machine learning libraries.
MLTK - MLTK -- the Machine Learning Toolkit -- is a suite of C++ open source modules of Machine Learning.
FP-growth - An implementation of the FP-growth algorithm in pure Python.
MLcomp - MLcomp is a free website for objectively comparing machine learning programs across various datasets for multiple problem domains.
PyBrain - PyBrain is a modular Machine Learning Library for Python. Its goal is to offer flexible, easy-to-use yet still powerful algorithms for Machine Learning Tasks and a variety of predefined environments to test and compare your algorithms. PyBrain is short for Python-Based Reinforcement Learning, Artificial Intelligence and Neural Network Library. In fact, we came up with the name first and later reverse-engineered this quite descriptive "Backronym".
vowpal_wabbit - John Langford's original release of Vowpal Wabbit -- a fast online learning algorithm.
Theano - Theano is a Python library that allows you to define, optimize, and evaluate mathematical expressions involving multi-dimensional arrays efficiently.
Caffe - Caffe is a deep learning framework developed with cleanliness, readability, and speed in mind. It was created by Yangqing Jia during his PhD at UC Berkeley, and is in active development by the Berkeley Vision and Learning Center (BVLC) and by community contributors. Caffe is released under the BSD 2-Clause license.
Data interchange protocols
protobuf - Protocol Buffers - Google's data interchange format.
tinyxml2 - TinyXML-2 is a simple, small, efficient, C++ XML parser that can be easily integrated into other programs.
thrift - The Apache Thrift software framework, for scalable cross-language services development, combines a software stack with a code generation engine to build services that work efficiently and seamlessly between C++, Java, Python, PHP, Ruby, Erlang, Perl, Haskell, C#, Cocoa, JavaScript, Node.js, Smalltalk, OCaml and Delphi and other languages.
Databases
MySQL++ - MySQL++ is a C++ wrapper for MySQL’s C API.
MongoDB - MongoDB (from "humongous") is an open-source document database, and the leading NoSQL database. Written in C++.
memcached - Free & open source, high-performance, distributed memory object caching system, generic in nature, but intended for use in speeding up dynamic web applications by alleviating database load.
leveldb - A fast and lightweight key/value database library by Google.
SSDB - A fast NoSQL database server with zset data type, an alternative to Redis.
SSDB is a high-performance key-value (key-string, key-zset, key-hashmap) NoSQL persistent storage server, using Google LevelDB as its storage engine. SSDB is stable, production-ready and is widely used by many Internet companies such as QIHU 360.
RocksDB - RocksDB is an embeddable persistent key-value store for fast storage. RocksDB can also be the foundation for a client-server database but our current focus is on embedded workloads.
RocksDB builds on LevelDB to be scalable to run on servers with many CPU cores, to efficiently use fast storage, to support IO-bound, in-memory and write-once workloads, and to be flexible to allow for innovation.
fatcache - Memcache on SSD. Think of fatcache as a cache for your big data.
thrift - The Apache Thrift software framework, for scalable cross-language services development, combines a software stack with a code generation engine to build services that work efficiently and seamlessly between C++, Java, Python, PHP, Ruby, Erlang, Perl, Haskell, C#, Cocoa, JavaScript, Node.js, Smalltalk, OCaml and Delphi and other languages.
Flask - Flask is a microframework for Python based on Werkzeug and Jinja2. It's intended for getting started very quickly and was developed with best intentions in mind.
Chinese docs
Bootstrap - Sleek, intuitive, and powerful front-end framework for faster and easier web development.
Django - Django is a high-level Python Web framework that encourages rapid development and clean, pragmatic design.
Distributed computing
Hadoop - The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.
ZooKeeper - ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.
Storm - Distributed and fault-tolerant realtime computation.
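Storm wiki - Provides extensive, high-quality documentation on Storm and its theoretical foundations, along with tutorials on obtaining Storm and setting up a new project. You will also find practical documentation on many aspects of Storm, including using Storm in local mode, in cluster mode, and on Amazon.
A thorough class tree for Storm is available on GitHub, detailing Storm's classes and interfaces.
Process real-time big data with Twitter Storm - An introduction to streaming big data. Summary: Storm is an open source big-data processing system that, unlike other systems, is intended for distributed real-time processing and is language independent. Learn about Twitter Storm, its architecture, and the landscape of batch and stream processing solutions.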
Storm getting-started tutorial - From the official 量子恒道 blog.
storm-starter - Learn to use Storm!
StreamCpp - A small C++ wrapper for Storm. Some documentation can be found at http://demeter.inf.ed.ac.uk/cross/stormcpp.html
storm-kafka - storm-kafka provides a regular spout implementation and a TransactionalSpout implementation for Apache Kafka 0.7.
Puppet - Puppet is IT automation software that helps system administrators manage infrastructure throughout its lifecycle, from provisioning and configuration to orchestration and reporting. Using Puppet, you can easily automate repetitive tasks, quickly deploy critical applications, and proactively manage change, scaling from 10s of servers to 1000s, on-premise or in the cloud.
Skynet - Skynet is a framework for distributed services in Go.
mapreduce-lite - A C++ implementation of MapReduce without a distributed filesystem.
GraphChi - GraphChi[huahua] is a spin-off of the GraphLab[rador's retriever] project.
GraphChi can run very large graph computations on just a single machine, by using a novel algorithm for processing the graph from disk (SSD or hard drive). Programs for GraphChi are written in a vertex-centric model similar to GraphLab's. GraphChi runs vertex-centric programs asynchronously (i.e., changes written to edges are immediately visible to subsequent computation), and in parallel. GraphChi also supports streaming graph updates and changing the graph structure while computing.
GraphChi ppt.
GraphChi Paper.
GraphChi Video.
GraphChi's C++ version - Disk-based large-scale graph computation. Big Data - small machine.
Celery --- Distributed Task Queue - Celery is a simple, flexible and reliable distributed system to process vast amounts of messages, while providing operations with the tools required to maintain such a system.
It’s a task queue with focus on real-time processing, while also supporting task scheduling.
This framework is close to the ultimate solution for asynchronous messaging architectures in Python.
Regular expressions
re2 - an efficient, principled regular expression library.
Build tools
SCons - SCons is an Open Source software construction tool—that is, a next-generation build tool. Think of SCons as an improved, cross-platform substitute for the classic Make utility with integrated functionality similar to autoconf/automake and compiler caches such as ccache. In short, SCons is an easier, more reliable and faster way to build software.
CMake - the cross-platform, open-source build system.
spf13-vim - spf13-vim is a distribution of vim plugins and resources for Vim, GVim and MacVim. It is a completely cross platform distribution that stays true to the feel of vim while providing modern features like a plugin management system, autocomplete, tags and tons more.
Maximum Awesome - Config files for vim and tmux, lovingly tended by a small subculture of peace-loving hippies. Built for Mac OS X.
VimClojure - A filetype, syntax and indent plugin for Clojure.
pycrumbs - Bits and Bytes of Python from the Internet.
Automated deployment engines
Docker - Docker is an open-source project to easily create lightweight, portable, self-sufficient containers from any application. The same container that a developer builds and tests on a laptop can run at scale, in production, on VMs, bare metal, OpenStack clusters, public clouds and more.
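Docker is an open-source automated deployment engine that can package any application into a simple, portable, self-sufficient container, making it easy to deploy in all kinds of virtual environments for debugging. It preserves the application's privacy while shortening the debug-and-deploy cycle, making the test, package, and deploy workflow easier and more convenient. Docker is still under intensive development, though, and once that work is complete it should bring unprecedented convenience to developers.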
Other
Valgrind - Valgrind is an instrumentation framework for building dynamic analysis tools. There are Valgrind tools that can automatically detect many memory management and threading bugs, and profile your programs in detail. You can also use Valgrind to build new tools.