Sources and experiences on outlier detection with textual data

Maggette


Total Posts: 1319
Joined: Jun 2007
 
Posted: 2021-09-06 08:57
Hi, most of the real-world work I have done on this topic was on numbers (int, long, decimals) or categorical data that is easy to map to numbers.

Outlier detection is something I am interested in generally (for time series, graphs, online/streaming detectors, decision making, etc.). There are a lot of interesting things going on, I have experimented with a lot of stuff, and I have a fair share of things running in production.

I have no practical application for it right now, but I would like to know if anybody here has experience with, or suggestions on, outlier detection on textual data: either whole documents or database columns with strings.
Thanks

I came here and saw you and your people smiling, and said to myself: Maggette, screw the small talk, better let your fists do the talking...

ronin


Total Posts: 684
Joined: May 2006
 
Posted: 2021-09-06 09:50
This might or might not be relevant.

At some point, I learned how to do EEG signal analysis. Don't ask. But anyway, in EEG, you are looking to eliminate artefacts which happen due to muscle movement, blinking etc.

So you decompose the signal using ICA, and the artefact typically generates a single IC. And there are some automated strategies for selecting which ICs are artefacts and which are not - for instance, the signal has a brown noise spectrum, and the artefact is white noise. Or something along those lines.

But that only works if the signal is multi-channel, and the artefact is contemporaneous in all channels. If the artefact has to propagate through the channels, ICA doesn't work. You are looking at doing decomposition using directed information and stuff like that.

And I wouldn't even know how to start with that on single-channel data like textual data. You can't decompose a single channel into independent components.

I would try some clustering methods, or even just a filter. But I don't have any direct experience with this.
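One possible reading of "just a filter" for a column of strings, as a rough sketch rather than anything tested in practice: map each string to a few numeric features and flag values that sit far from the median, using a robust z-score based on the median absolute deviation (MAD). The feature set, the function names, the 3.5 threshold and the sample column are all assumptions chosen for illustration.

```python
from statistics import median


def features(s: str) -> list[float]:
    """Hypothetical features: length, share of digits, share of letters."""
    n = max(len(s), 1)  # avoid division by zero for empty strings
    digits = sum(c.isdigit() for c in s) / n
    letters = sum(c.isalpha() for c in s) / n
    return [float(len(s)), digits, letters]


def robust_z(xs: list[float]) -> list[float]:
    """Absolute deviation from the median, scaled by 1.4826 * MAD."""
    med = median(xs)
    mad = median([abs(x - med) for x in xs])
    if mad == 0:
        # More than half the values are identical: anything else is off-pattern.
        return [0.0 if x == med else float("inf") for x in xs]
    return [abs(x - med) / (1.4826 * mad) for x in xs]


def flag_outliers(values: list[str], threshold: float = 3.5) -> list[str]:
    """Return the strings whose robust z-score exceeds the threshold on any feature."""
    columns = list(zip(*(features(v) for v in values)))
    zs = [robust_z(list(col)) for col in columns]
    return [v for i, v in enumerate(values) if any(z[i] > threshold for z in zs)]


# Made-up column: mostly IDs of one shape, plus one free-text entry.
column = ["AB-1001", "AB-1002", "AB-1003", "AB-1004", "AB-1005", "see attached email"]
flagged = flag_outliers(column)
```

This only catches strings that differ in shape (length, character mix), not in meaning, so it suits database columns better than whole documents.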

"There is a SIX am?" -- Arthur

nikol


Total Posts: 1377
Joined: Jun 2005
 
Posted: 2021-09-06 19:50
I have come across these two libs:
- https://github.com/life4/textdistance
- https://github.com/UKPLab/sentence-transformers

I am planning to explore them, but have not yet. They map a pair of texts in various forms (text1, text2) to distances/metrics, so numerical signal/noise separation is equally applicable.
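A minimal sketch of that distance route for a column of strings, under stated assumptions: it uses `difflib.SequenceMatcher` from the Python standard library as a stand-in for the metrics in textdistance (it is not that library's API), and the function names and the sample column are made up for illustration. Each string is scored by its median distance to all the others; the highest scores are the outlier candidates.

```python
from difflib import SequenceMatcher
from statistics import median


def distance(a: str, b: str) -> float:
    """Dissimilarity in [0, 1]: 0 for identical strings, 1 for no matching characters."""
    return 1.0 - SequenceMatcher(None, a, b).ratio()


def outlier_scores(values: list[str]) -> list[float]:
    """Median distance from each string to every other string in the list."""
    scores = []
    for i, a in enumerate(values):
        others = [distance(a, b) for j, b in enumerate(values) if j != i]
        scores.append(median(others))
    return scores


# Made-up column: mostly ISO dates, plus one free-text entry.
column = ["2021-09-06", "2021-09-07", "2021-08-30", "2020-12-01", "call me back"]
scores = outlier_scores(column)

# Rank from most to least unusual.
ranked = sorted(zip(scores, column), reverse=True)
most_unusual = ranked[0][1]
```

Once the strings are reduced to scores like this, any numerical outlier detector can take over. For whole documents the same scheme should carry over with embedding distances (for example from sentence-transformers) in place of the character-level metric, though that is untested here.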

... What is a man
If his chief good and market of his time
Be but to sleep and feed? (c)

Maggette


Total Posts: 1319
Joined: Jun 2007
 
Posted: 2021-09-07 13:52
Thx everybody.
