When Tycho Brahe turned his instruments towards the stars in the night sky he did it for the “wrong” reason. He was the king’s astrologer at the Danish court and wanted to improve astrology by improving the accuracy of the measurements of the stars and the planets. Instead of furthering the supernatural belief in a correspondence between the micro- and the macrocosmos, that our small lives on earth are destined by or related to the movements of the celestial bodies, he laid a solid foundation for modern astronomy and science. He started the movement, one can say, that came to question the very reason that he made the effort to study the stars in the first place. The data he produced came to be used by “scientific” people in later times to crusade against “un-scientific” people such as astrologers.

The same goes for Isaac Newton. He was an alchemist and a Christian, and his works came to be used in much the same way. The scientific mind-set born of these two people has been, and still is, used to argue that belief in alchemy, religion or, sometimes, even belief in anything at all prior to data is childish and unnecessary for advancing our understanding of life, the universe and everything. To be fair, from where I stand it looks like there is today a welcome movement of gradual reconciliation between the different viewpoints of the interpretative humanities and the empirical natural sciences.

Once upon a time there was a young student…

When I was still working on my bachelor’s degree at University College of Halmstad some 18 years ago I got an intuition, or a hunch, that it would be a good idea to translate personality type models such as C.G. Jung’s type theory into linguistic models and run them on media content using machine learning tools. The idea, in its most basic form, is to construct an experiment to study memetics, or how ideas spread. Ideas such as saving energy, sharing things or being cautious about gender bias in language can reasonably be traced over time in large populations thanks to social media. The new approach that I suggest is to categorise individuals in a closed network such as Twitter into discrete psychological profiles based on writing style and study the dynamics over time. This very general idea raises several specific questions that must be dealt with, such as:

1. Does people’s use of language in social media actually fall into discrete categories?

2. Are such “linguistic groupings of people” reasonably stable over time?

3. Are the words and phrases defining cultural ideas (memes) possible to trace over time in a reliable way?
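To make the third question a bit more concrete, here is a minimal sketch in Python of what “tracing a phrase over time” could mean in its simplest form: counting, month by month, how many posts mention it. The function name, the `(date, text)` post format and the toy posts are all my own assumptions for illustration, not data or code from an actual experiment, and plain substring matching is of course far too crude to count as a reliable method.

```python
from collections import Counter
from datetime import date


def phrase_counts_by_month(posts, phrase):
    """Count posts per (year, month) that mention `phrase`, case-insensitively.

    `posts` is an iterable of (date, text) pairs; this format is an assumption.
    """
    needle = phrase.lower()
    counts = Counter()
    for posted_on, text in posts:
        if needle in text.lower():
            counts[(posted_on.year, posted_on.month)] += 1
    # Sort by (year, month) so the result reads as a timeline.
    return dict(sorted(counts.items()))


# Invented toy posts, only here to show the shape of the input.
posts = [
    (date(2015, 1, 3), "Trying to start saving energy at home."),
    (date(2015, 1, 20), "Saving energy is easier than I thought!"),
    (date(2015, 2, 11), "Anyone else into saving energy?"),
    (date(2015, 2, 12), "Nice weather today."),
]

timeline = phrase_counts_by_month(posts, "saving energy")
print(timeline)
```

Whether counts like these say anything reliable about how an idea spreads is exactly what the question asks. The sketch only shows where one would start.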

There are a lot of methodological questions involved, as anyone can see. Lots of uncertainty. How much data is needed? How many individuals, and over how long a period of time? What aspects of language are relevant to study? Is it, as James Pennebaker at University of Texas, Austin suggests, style words (function words, such as “and”, “him”, “it”, “or”), or is it instead the “meaning words” such as adjectives, verbs or even interjections like “sorry”, “thank you” et cetera? In any case I believe it is necessary to let the data speak for itself as much as possible and resist the impulse to use one’s own pre-understanding of psychological models of the human psyche until after the empirical, mathematical part is done.
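As an illustration of the “style words” option, here is a small Python sketch that turns a text into a profile of relative function-word frequencies. The word list is just the four examples mentioned above, not a real lexicon, and the function name and the simple tokenisation are my own assumptions. A serious study would need a proper function-word dictionary and much more text per person.

```python
import re
from collections import Counter

# Only the four examples from the text; a stand-in for a real function-word list.
FUNCTION_WORDS = ("and", "him", "it", "or")


def function_word_profile(text, vocabulary=FUNCTION_WORDS):
    """Return each function word's share of all word tokens in `text`."""
    # Crude tokenisation: lowercase runs of letters and ASCII apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    total = len(tokens)
    counts = Counter(token for token in tokens if token in vocabulary)
    # An empty text gets an all-zero profile instead of a division error.
    return {word: (counts[word] / total if total else 0.0) for word in vocabulary}


profile = function_word_profile("It was him or me, and it was late.")
print(profile)
```

Profiles like this, one per person, are the kind of numeric description a clustering method could then try to group, without any psychological model being applied beforehand.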

Back to the beginnings again… 

At the time when I started thinking about this, social media was almost non-existent even in the minds of us who studied media and communication. I didn’t even have a blog myself. When I started one, many years later, I was pretty scared about telling anyone about “my idea”, but you can trace it even in my first words in my first blog post if you know what you are looking for. Today mountains of linguistic samples of people’s everyday language are produced and made accessible via APIs and screen scraping technology, so the behavior of people (leaving language samples in a form accessible for computers to read) has changed. The machine learning technology itself has advanced and become more accessible for a non-technical person (well, I’m catching up) like me.

I’ve been holding on to this idea ever since. I’ve been holding on to it without even remembering what it was all about in the first place, molding it into new ideas and merging it into what I’ve been doing for other reasons. Looking back at what I now recall of the original idea, I’ve not accomplished much, but I’ve accomplished something. While I’ve worked in different areas of media and communication and even started a company around this idea, I’ve always had this itching feeling that I’m not doing enough to further this project. That has made me spend a lot of years trying to combine a “normal” life with investing a large share of my own money and energy into moving this project of psychological analysis of social media forward, a tiny bit every year, most of the years. I’ve scattered and exhausted myself, which hasn’t been beneficial for either side of my life.

But why?

Is this even a reasonable idea?

I’ve read and been told that personality type theory is just a modern form of superstition. Maybe. It would be great to have data on that! I’ve been aware since the beginning that this type of research activity is often motivated by ad-targeting. Well, reducing the amount of ill-targeted ads would be great! Others have pointed out that spooky people in the military and security business use this sort of thing and implied that no good can come out of it. Yes, I’ve actually been inspired by the fact that both psychometrics and network analysis have been and are still used by highly advanced intelligence units from all sides of the spectrum. Of course any powerful technology can be used for both “good” and “bad”. It’s not the methodology but the actions taken based on publicly available data, such as social media content, that I think we should be concerned with.

I’ve been told by lots of well-meaning people that there is no use in applying “blind”, cold computers to trying to understand anything worth knowing about the depths of people’s motivations, intentions or dreams. How about irony? How about the different roles we play in different social settings? I believe large data sets counteract the first objection. The second is counteracted by the fact that I’m studying the language people use when they are in the same social setting, namely the public-facing social media persona, plus, again, large amounts of data.

It is true that this interest of mine has grown into an obsession from time to time. I’ve on occasion paid more attention to this idea than has been good for my personal finances and my psychological health. Part of the price has definitely been missing out on a lot of opportunities over many of these years with regard to relationships, career and plain simple relaxation and enjoyment of life.

So it is a fully legitimate question to ask why I do things like spending years studying different fields such as developmental psychology, computer programming and spiritual practices and traditions without a degree to show for it, or even a peaceful mind. Instead I’ve, with my eyes fully open, allowed myself to be a dilettante and an amateur in field after field and social setting after social setting. I’ve accomplished comparatively little in terms of worldly success, and almost nothing when it comes to actual results in the project I’ve been pursuing for almost two decades.

I still haven’t been able to construct a solid experiment and get some actual results, apart from a really nice experiment constructed by an anonymous guy on the web based on my half-assed, not-so-meticulous attempts some eight years ago with typealyzer.com.

So, why?

Well, my short-term motivation is of course to learn useful things so that I can make a decent living doing what I enjoy doing. Who wouldn’t? But I deeply enjoy the idea of science for science’s own sake, and accepting that this is something that by necessity takes time, and must take time, in order to be properly done. I believe it is a good thing to produce new knowledge about how ideas spread among people, how we use personas to express ourselves and relate to other people, if and how life-style and interests can be predicted based on language style, et cetera. That holds even if the immediate usefulness is not obvious, and even if not every nook and cranny that such a broad research area leads into turns out to be commercially fruitful or even of human value. Part of doing this as research is to leave open the possibility that the answer is no. Nothing here. But then, at least, we will know that, which is a good thing in itself.

Some thoughts about where this might go, if the initial research is successful 

Since this project is really basic research it is very hard to predict what might come out of it in terms of further research or possible fruitful new areas of study. The applications, if this type of research is able to produce useful results, are easier to predict, such as improving marketing, dating, the study of public opinion etc. I don’t even know what to call this field of study, which is a frustration in itself. Is it media and communication studies? Digital humanities? Memetics? Computational sociology? I truly don’t know and I doubt that anyone else knows either, yet. This type of research is probably at its cutting edge at private companies such as Facebook and not in academia, which might explain why there has been more focus on doing than on defining this newborn discipline of applying machine learning to social network content and structures.

Finding applications, of course, is not the hard part of this field of research. The hard part is constructing the methodology and the experiments to comply with scientific standards of reliability, validity and reproducibility. I have deep respect for how hard it is to do the science. It’s even been hard to collect enough linguistic samples and learn to code enough to be able to do that first necessary step. Not to mention applying natural language processing and statistics to the data samples. It’s taken years. I’ve even started to get grey streaks in my beard, which is a positive side-effect from my perspective. With each new step that brings me closer to actually doing psychological analysis of large amounts of people’s social media data, a new and even more cognitively taxing area of expertise has emerged. Heck, math and science were my absolutely worst subjects in school and learning even the basic level of programming that I’m at has definitely not come easily to me.

But. The point of all this was to answer what motivates me. This is the best I can do right now:

The possibility of amazingly interesting new knowledge and insight into how we use media and language to form society, share ideas and interact with each other lies at the other end, and now more than ever I feel glad that I’ve kept at it. It is so very, very rewarding to be at the point where I am now and to be able to see the first glimpses of knowledge being produced. Learning about the different aspects of linguistics and psychology and, not to forget, the tremendous joy of learning to code are what put a smile on my lips most of the time when I sit down with this.

It’s such a fascinating idea that this type of knowledge production is actually starting to be possible, even for a single person (very shallowly) and small teams of enthusiasts (more in-depth)! It wasn’t long ago that even the methodologies and data collection capabilities were restricted to the budgets and specialized teams of large technology companies.

And finally, here is what I feel motivates me at the end of the day. Doing this at all, even if the data shows that there is no correlation between psychological writing style and interests, or anything useful at all, will provide a small piece of the puzzle to the greatest question of them all, especially in what are likely to be the coming decades of further computerization of society:

what does it mean to be human?