Forums  > Software  > Database management system fit for text mining?  
Page 1 of 1


Total Posts: 9
Joined: Jul 2017
Posted: 2018-11-24 16:49
Hello NP,

I've been doing some text mining lately, and have chosen PostgreSQL as the data store for a batch of Twitter statuses and text snippets from other sites.

I don't like the idea of moving the data out of the database, doing something to it in memory, and putting the results back, as this is incredibly inefficient. Instead I try to do everything in-database, in SQL so far, letting the system figure out which algorithm to pick and how/when to move data between disk and RAM. I'm using PostgreSQL's built-in full-text search, but I'm running into irritating issues that need C development, like getting it to split URLs into interesting components when building the tsvector. (By default the parser tries to be smart and keeps URLs whole.) Since Makefiles gave me PTSD, this is enough to make me start considering alternatives.
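For what it's worth, one C-free workaround is to pre-split URLs client-side before the text ever reaches `to_tsvector`. A rough Python sketch (the regex and function name are just illustrative, not anyone's actual setup):

```python
import re
from urllib.parse import urlsplit

URL_RE = re.compile(r'https?://\S+')

def explode_urls(text):
    """Replace each URL in `text` with its host and path components,
    so that to_tsvector indexes them as separate lexemes instead of
    one opaque URL token."""
    def _split(match):
        parts = urlsplit(match.group(0))
        # Keep the host plus the non-empty path segments as plain words.
        pieces = [parts.hostname or ""] + [p for p in parts.path.split("/") if p]
        return " ".join(pieces)
    return URL_RE.sub(_split, text)

print(explode_urls("see https://example.com/blog/postgres-fts for details"))
# -> see example.com blog postgres-fts for details
```

You'd run this in whatever loader script does your inserts, so the database only ever sees the pre-exploded text.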

Has anyone here done text mining, and if so what's your setup? When it comes to scalability, I need something that handles on the order of ten to a hundred Go for now.


Total Posts: 368
Joined: Jan 2015
Posted: 2018-11-27 21:47
I don't do much in text, but it seems to me that you should be using Spark. Its raison d'ĂȘtre is exactly what you describe: keeping data local in memory through successive transformations. Most of your pre-existing SQL logic could probably be ported over as long as you use the DataFrame API (which you pretty much want anyway for performance). Custom functions that can't easily be expressed in SQL can be done with UDFs, which let you use arbitrary Java/Scala/Python/R code.

You give up ACID transactions by moving away from a DBMS. But it sounds to me like you're using this in a research context for WORM data loaded in ETL batches, not as a production system. On Spark's plus side, you get pretty seamless horizontal scalability, so if need be you can just spin up more VMs to speed things up.

I'm not a text guy, so take this with a grain of salt, but you may also want to consider Elasticsearch for the underlying data layer. If most of your queries are searching and parsing text, performance is going to be a lot better than Postgres. But if you're doing relational logic like joins, need consistency, or have significant writes relative to reads, then stick with SQL.

Good questions outrank easy answers. -Paul Samuelson


Total Posts: 43
Joined: Jul 2018
Posted: 2018-11-27 23:43
Just clarifying, am I right that the Go code you're referring to is client-side application code that you're trying to replace with native database queries? Or do you mean you need a DBMS with Go bindings?

Solr ships with a document database with SQL support out of the box and has a lot of built-in functionality for text mining, e.g. tokenization, stemming.

I'm not too familiar with Twitter statuses and how they need to be processed, but we do some NLP work in preproduction for high-freq market making, and we found Solr has more mature integrations than other Lucene derivatives (like Elasticsearch) for our use cases (Tika for parsing, UIMA for annotations, GATE for information extraction, morphlines for transformations before loading into Solr, OpenNLP for coreference resolution, segmentation and NER, Mahout for model construction).


Total Posts: 1177
Joined: Feb 2007
Posted: 2018-11-28 15:35
I'm pretty sure Spark does not do 'calculate in place' any more than Postgres does.

Oracle used to sell a version of R that actually did do calculate-in-place on stuff stored in their DB. It probably didn't support latent Dirichlet allocation or whatever, but you could code up naive Bayes and friends.
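To illustrate the "naive Bayes and friends" point, here's a toy sketch with Python's built-in sqlite3 standing in for the real store (the schema and corpus are made up): the counting stays in SQL, and only small aggregates ever come back to the client.

```python
import math
import sqlite3

# Toy corpus standing in for the real store: (doc_id, label, text).
docs = [
    (1, "spam", "win money now"),
    (2, "spam", "free money offer"),
    (3, "ham",  "meeting at noon"),
    (4, "ham",  "lunch meeting tomorrow"),
]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE tokens (doc INTEGER, label TEXT, word TEXT)")
con.executemany(
    "INSERT INTO tokens VALUES (?, ?, ?)",
    [(d, lab, w) for d, lab, text in docs for w in text.split()],
)

def classify(text, alpha=1.0):
    """Multinomial naive Bayes with add-alpha smoothing.
    All counting happens in SQL; Python only sums the log scores."""
    vocab = con.execute("SELECT COUNT(DISTINCT word) FROM tokens").fetchone()[0]
    n_docs = con.execute("SELECT COUNT(DISTINCT doc) FROM tokens").fetchone()[0]
    scores = {}
    for label, n_words in con.execute(
            "SELECT label, COUNT(*) FROM tokens GROUP BY label"):
        label_docs = con.execute(
            "SELECT COUNT(DISTINCT doc) FROM tokens WHERE label = ?",
            (label,)).fetchone()[0]
        score = math.log(label_docs / n_docs)  # log prior
        for w in text.split():
            c = con.execute(
                "SELECT COUNT(*) FROM tokens WHERE label = ? AND word = ?",
                (label, w)).fetchone()[0]
            score += math.log((c + alpha) / (n_words + alpha * vocab))
        scores[label] = score
    return max(scores, key=scores.get)

print(classify("free money"))     # -> spam
print(classify("lunch meeting"))  # -> ham
```

Per-word round trips like this are exactly what a real in-database implementation would fold into one aggregate query, but the shape of the computation is the same.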

It's funny that open-source stuff hasn't solved this kind of problem yet, as it's all in principle solved. Sort of like how open-source stuff sucks at time series.

"Learning, n. The kind of ignorance distinguishing the studious."


Total Posts: 9
Joined: Jul 2017
Posted: 2018-12-07 13:59
EspressoLover, while I dislike SQL, I like the generality of the relational model and the protection from some unknown unknowns that ACID gives you. So I wish to keep them as long as possible, as I'm not even sure about my current schema. It doesn't have to be realtime, indeed. However, I'm enjoying the fast feedback while developing, since I don't know what I'm doing. I haven't yet determined that most of my queries will be about text, but so far they are. I thought the Lucene-based world, including Elasticsearch, was more geared towards sysadmins and log analysis, but from prikolno's answer it seems I was wrong.

prikolno, I'm using Python, not Go (I meant Giga octets). Mostly for nltk. For now, I'm merely focused on figuring out a proper equivalence relation between ID data (comparing names at first), and topic extraction. I'll definitely check out Solr; I thought that was old stuff.
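For the name-comparison part, something as simple as stdlib difflib gets a first cut (the threshold here is a guess to tune against real data):

```python
from difflib import SequenceMatcher

def same_entity(a, b, threshold=0.85):
    """Crude first cut at 'same name?': normalize, then compare with
    difflib's ratio(). The 0.85 cutoff is arbitrary."""
    norm = lambda s: " ".join(s.lower().replace(".", " ").split())
    return SequenceMatcher(None, norm(a), norm(b)).ratio() >= threshold

print(same_entity("J. Smith", "j smith"))       # -> True
print(same_entity("Jane Smith", "John Smith"))  # -> False
```

One caveat: a threshold on a similarity score isn't transitive (A≈B and B≈C don't imply A≈C), so this alone isn't a true equivalence relation; you'd need a clustering or canonicalization step on top to get proper equivalence classes.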

jslade, if you have literature about this that exposes the state of the art, I'm all ears. (Out of curiosity, I'm interested in the same about time series.)