Useful resources for text processing in Ruby
This curated list comprises awesome resources, libraries and information sources about the computational processing of human languages with Ruby. It comes from our day-to-day work on Language Models and NLP Tools. Read why this list is awesome.
Any help, suggestions and contributions are welcome! Please read the Contributors Guide and refer to the Contributing section.
- NLP Pipeline Subtasks
- High Level Tasks
- Machine Learning Libraries
- Language Aware String Manipulation
- Other Online Resources
- Talks and Presentations
- Books
- Community
- Contributing
- License
- Open NLP - Ruby Bindings for the OpenNLP Toolkit.
- Stanford Core NLP - Ruby Bindings for the Stanford CoreNLP tools.
- Treat - Natural Language Processing framework for Ruby.
Tools for Tokenization, Word and Sentence Boundary Detection and Disambiguation.
- pragmatic_tokenizer - Multilingual tokenizer to split a string into tokens.
- nlp-pure - Natural language processing algorithms implemented in pure Ruby with minimal dependencies.
- textoken - Simple and customizable text tokenization library.
- pragmatic_segmenter - Word Boundary Disambiguation with many cookies.
- punkt-segmenter - Pure Ruby implementation of the Punkt Segmenter.
- Tactful_Tokenizer - RegExp based tokenizer for different languages.
- scapel - Sentence Boundary Disambiguation tool.
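To see why boundary detection needs dedicated tools, here is a minimal sketch in plain Ruby. It does not use any of the gems listed above; the method names `naive_tokenize` and `naive_sentences` and both regular expressions are assumptions made up for this illustration.

```ruby
# Naive tokenizer: a token is either a run of letters/digits (optionally
# joined by apostrophes or hyphens) or a single punctuation character.
def naive_tokenize(text)
  text.scan(/[[:alnum:]]+(?:['-][[:alnum:]]+)*|[^[:alnum:][:space:]]/)
end

# Naive sentence splitter: break at whitespace that follows '.', '!' or '?'
# and precedes an uppercase letter.
def naive_sentences(text)
  text.split(/(?<=[.!?])\s+(?=[[:upper:]])/)
end

p naive_tokenize("Don't panic, Mr. Smith!")
# => ["Don't", "panic", ",", "Mr", ".", "Smith", "!"]

p naive_sentences("Dr. Smith arrived. He was late.")
# => ["Dr.", "Smith arrived.", "He was late."]
```

The second call splits after the abbreviation "Dr.", which is exactly the kind of ambiguity that the segmenters in this section try to resolve.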
Stemming is the term used in information retrieval for the process of
reducing word forms to some base representation. Stemming should be distinguished
from Lemmatization, since stems do not necessarily have a linguistic motivation.
- ruby-stemmer - Ruby-Stemmer exposes the SnowBall API to Ruby. [tutorial]
- uea-stemmer - Conservative stemmer for search and indexing.
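The point that stems need not be linguistically motivated can be shown with a toy suffix stripper. This is only a sketch: it is not the Snowball or UEA algorithm, and the `SUFFIXES` list, the minimum stem length of 3 and the name `toy_stem` are assumptions chosen for the example.

```ruby
# Hypothetical suffix list, checked in this order (longer suffixes first).
SUFFIXES = %w[ational ization ing ies ed es s].freeze

# Strip the first matching suffix, provided at least 3 characters remain.
def toy_stem(word)
  w = word.downcase
  suffix = SUFFIXES.find { |s| w.end_with?(s) && w.length - s.length >= 3 }
  suffix ? w[0...-suffix.length] : w
end

p toy_stem("cats")     # => "cat"
p toy_stem("running")  # => "runn"
p toy_stem("ponies")   # => "pon"
```

"runn" and "pon" are not words, yet they may still be useful as index keys, which is where stemming differs from lemmatization.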
Lemmatization is the process of finding the base (dictionary) form of a word. Lemmas are often collected in dictionaries.
- lemmatizer - WordNet based Lemmatizer for English texts.
- wc - a rubygem to count word occurrences in a given text
- word_count - a word counter for String and Hash in Ruby
- Word Count Analyzer - analyzes a string for potential areas of the text that might cause word count discrepancies depending on the tool used
- WordsCounted - a highly customisable Ruby text analyser
- N-Gram - N-Gram generator in Ruby
- ngram - break words and phrases into ngrams
- raingrams - a flexible and general-purpose ngrams library written in Ruby
- stanfordparser - Ruby based wrapper for the Stanford Parser.
- amatch - collection of five types of distances between strings (including Levenshtein, Sellers, Jaro-Winkler and 'pair distance'; the last one seems to work well for finding similarity in long phrases)
- damerau-levenshtein - calculates edit distance using the Damerau-Levenshtein algorithm
- FuzzyMatch - find a needle in a haystack based on string similarity and regular expression rules
- fuzzy-string-match - fuzzy string matching library for ruby
- FuzzyTools - In-memory TF-IDF fuzzy document finding with a fancy default tokenizer tuned on diverse record linkage datasets for easy out-of-the-box use
- Going the Distance - contains scripts that do various distance calculations
- hotwater - Fast Ruby FFI string edit distance algorithms
- levenshtein-ffi - fast string edit distance computation, using the Damerau-Levenshtein algorithm
- TF-IDF - Term Frequency - Inverse Document Frequency in Ruby
- tf-idf-similarity - calculate the similarity between texts using tf*idf
- SentimentLib - Simple extensible sentiment analysis gem.
- alignment - Alignment functions for corpus linguistics (Gale-Church implementation)
- Google API Client - Google API Ruby Client
- microsoft_translator - Ruby client for the microsoft translator API
- termit - Google Translate with speech synthesis in your terminal as ruby gem
- chatterbot - Straightforward ruby-based Twitter Bot Framework, using OAuth to authenticate.
- Lita - Lita is a chat bot written in Ruby with persistent storage provided by Redis.
- Stimmung - Semantic Polarity based on [SentiWS](http://wortschatz.informatik.uni-leipzig.de/download/sentiws.html)
- Chronic - pure Ruby natural language date parser
- Chronic Between - simple Ruby natural language parser for date and time ranges
- Chronic Duration - simple Ruby natural language parser for elapsed time
- Kronic - dirt simple library for parsing and formatting human readable dates
- Nickel - extracts date, time, and message information from naturally worded text
- Tickle - natural language parser for recurring events
- Confidential Info Redactor - a Ruby gem to semi-automatically redact confidential information from a text
- ruby-ner - named entity recognition with Stanford NER and Ruby
- ruby-nlp - Ruby Binding for the Stanford POS Tagger and Named Entity Recognizer
- espeak-ruby - small Ruby API for utilizing 'espeak' and 'lame' to create text-to-speech mp3 files
- Isabella - a voice-computing assistant built in Ruby
- tts - a ruby gem for converting text-to-speech using the Google translate service
- att_speech - A Ruby library for consuming the AT&T Speech API for speech to text
- pocketsphinx-ruby - Ruby speech recognition with Pocketsphinx
- Speech2Text - a simple interface for converting audio files using the Google Speech to Text API
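Two ideas recur among the libraries listed above: n-grams and edit distance. The following plain-Ruby sketch illustrates both without depending on any of those gems. The method names `ngrams`, `char_ngrams` and `levenshtein` are chosen for this example only, and the distance shown is plain Levenshtein (no transpositions), not the Damerau variant.

```ruby
# Word n-grams: every run of n consecutive tokens.
def ngrams(tokens, n)
  tokens.each_cons(n).to_a
end

# Character n-grams of a string.
def char_ngrams(text, n)
  text.chars.each_cons(n).map(&:join)
end

# Levenshtein distance computed row by row: the minimum number of
# single-character insertions, deletions and substitutions.
def levenshtein(a, b)
  prev = (0..b.length).to_a
  a.each_char.with_index(1) do |ca, i|
    curr = [i]
    b.each_char.with_index(1) do |cb, j|
      cost = ca == cb ? 0 : 1
      curr << [prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost].min
    end
    prev = curr
  end
  prev.last
end

p ngrams(%w[the quick brown fox], 2)
# => [["the", "quick"], ["quick", "brown"], ["brown", "fox"]]
p char_ngrams("ruby", 2)            # => ["ru", "ub", "by"]
p levenshtein("kitten", "sitting")  # => 3
```

The native-extension and FFI gems in the list are likely to be considerably faster on long strings; this version is only meant to make the idea concrete.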
Libraries in pure Ruby or written in other programming languages with appropriate bindings for Ruby.
- rb-libsvm - Support Vector Machines with Ruby.
- weka-jruby - JRuby bindings for Weka, different ML algorithms implemented through Weka. [tutorial]
- decisiontree - Decision Tree ID3 Algorithm in pure Ruby.
- rtimbl - Memory based learners from the Timbl framework.
- Classifier - a general module to allow Bayesian and other types of classifications
- classifier-reborn - (a fork of cardmagic/classifier) a general classifier module to allow Bayesian and other types of classifications
- Latent Dirichlet Allocation - used to automatically cluster documents into topics
- liblinear-ruby-swig - Ruby interface to LIBLINEAR (much more efficient than LIBSVM for text classification and other large linear classifications)
- linnaeus - a redis-backed Bayesian classifier
- maxent_string_classifier - a JRuby maximum entropy classifier for string data, based on the OpenNLP Maxent framework
- Naive-Bayes - simple Naive Bayes classifier
- nbayes - a full-featured, Ruby implementation of Naive Bayes
- omnicat - a generalized rack framework for text classifications
- omnicat-bayes - Naive Bayes text classification implementation as an OmniCat classifier strategy
- stuff-classifier - a library for classifying text into multiple categories
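Several of the classifiers above are built on Naive Bayes. As a rough sketch of the idea, here is a tiny multinomial Naive Bayes text classifier with add-one smoothing in plain Ruby. The class `TinyNaiveBayes` and its methods are hypothetical and do not mirror the API of any gem in this list.

```ruby
class TinyNaiveBayes
  def initialize
    @word_counts = Hash.new { |h, k| h[k] = Hash.new(0) } # label => word => count
    @doc_counts = Hash.new(0)                              # label => number of documents
    @vocabulary = {}                                       # set of all seen words
  end

  # Record one training document for the given label.
  def train(label, text)
    @doc_counts[label] += 1
    tokenize(text).each do |word|
      @word_counts[label][word] += 1
      @vocabulary[word] = true
    end
  end

  # Return the label with the highest log posterior score.
  def classify(text)
    total_docs = @doc_counts.values.sum
    words = tokenize(text)
    @doc_counts.keys.max_by do |label|
      total_words = @word_counts[label].values.sum
      log_prior = Math.log(@doc_counts[label].to_f / total_docs)
      log_likelihood = words.sum do |word|
        # Add-one (Laplace) smoothing so unseen words do not zero out a label.
        Math.log((@word_counts[label][word] + 1.0) / (total_words + @vocabulary.size))
      end
      log_prior + log_likelihood
    end
  end

  private

  def tokenize(text)
    text.downcase.scan(/[[:alnum:]]+/)
  end
end

nb = TinyNaiveBayes.new
nb.train(:spam, "buy cheap pills now")
nb.train(:spam, "cheap pills online")
nb.train(:ham, "meeting at noon tomorrow")
nb.train(:ham, "lunch tomorrow at noon")
p nb.classify("cheap pills")       # => :spam
p nb.classify("meeting tomorrow")  # => :ham
```

Working in log space avoids floating-point underflow when many word probabilities are multiplied together.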
Libraries for language aware string manipulation, i.e. search, pattern matching, case conversion, transcoding, regular expressions which need information about the underlying language.
- active_support - RoR ActiveSupport gem has various string extensions that can handle case.
- u - U extends Ruby’s Unicode support.
- unicode - Unicode normalization library.
- CommonRegexRuby - Find a lot of kinds of common information in a string.
- regexp-examples - Generate strings that match a given regular expression.
- verbal_expressions - Make difficult regular expressions easy.
- 2016
  - Quickly Create a Telegram Bot in Ruby by Ardian Haxha [tutorial]
- 2015
  - N-gram Analysis for Fun and Profit by Jesus Castello [tutorial]
  - Machine Learning made simple with Ruby by Lorenzo Masini [tutorial]
  - Using Ruby Machine Learning to Find Paris Hilton Quotes by Rick Carlino [tutorial]
  - Exploring Natural Language Processing in Ruby by Kevin Dias [slides]
- 2014
  - Natural Language Parsing with Ruby by Glauco Custódio [tutorial]
  - Demystifying Data Science: Analyzing Conference Talks with Rails and Ngrams by Todd Schneider [video | code]
  - Natural Language Processing with Ruby by Konstantin Tennhard [video | video]
- 2013
  - How to parse 'go' - Natural Language Processing in Ruby by Tom Cartwright [slides]
  - Natural Language Processing in Ruby by Brandon Black [slides | video]
  - Natural Language Processing with Ruby: n-grams by Nathan Kleyn [tutorial]
  - A Tour Through Random Ruby by Robert Qualls [tutorial]
- Miller, Rob. Text Processing with Ruby: Extract Value from the Data That Surrounds You. Pragmatic Programmers, 2015. [link]
- Watson, Mark. Scripting Intelligence: Web 3.0 Information Gathering and Processing. APRESS, 2010. [link]
We are very glad to see you in this section and highly appreciate any help!
But we also care about the quality of this list. If you want to contribute, please
- read the Contribution Guidelines carefully and
- agree that your work will be published under the terms of the CC0 license.
Some of the open tasks for contributors are listed in the todo file. You may want to start there.
Awesome NLP in Ruby by Andrei Beliankou
To the extent possible under law, the person who associated CC0 with Awesome NLP in Ruby has waived all copyright and related or neighboring rights to Awesome NLP in Ruby.
You should have received a copy of the CC0 legalcode along with this work. If not, see http://creativecommons.org/publicdomain/zero/1.0/.