Bitcoin Trading and Investor Sentiment
Amy Jang, Aria Li, Edward Hong, Xiaohui Li
September 30, 2018
For this year’s Columbia Data Science Hackathon, our team decided to work on sentiment analysis of the bitcoin price using data provided by Qu Capital. The two CSV files provided are a dataset of tick-level bitcoin price time series on a major exchange (Jan. 1, 2018 to Mar. 31, 2018, UTC), and a dataset of posts from relevant subreddits made in that period.
We aim to design a model that extracts sentiment signals from the reddit data and, along with other features from both datasets, predicts the trend (direction) of investment returns over the next period of time, which will help the investor make trading decisions.
We start by plotting the timeseries.
The chart above is a visualization of the original data. However, the raw data does not directly give the market return (here, the percent change in price), which is what helps determine when it would be an ideal time to buy or sell. Computing the return every second is not helpful for the user either, as the time frame is too short. Thus, it is necessary to compress the data so that each row covers one minute of the bitcoin market.
Our first step was to group the data by minute instead of keeping a separate row for each transaction. This allowed us to calculate the mean, minimum, and maximum of the buying and selling prices for each minute. It also allowed us to derive the standard deviation and the return per minute. This information can then be combined with the cleaned reddit data to determine changes in the market.
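The per-minute grouping described above can be sketched with pandas roughly as follows. This is a minimal illustration, not our exact hackathon code: the column names `buy_price` and `sell_price` are hypothetical, and the return is assumed here to be the percent change of the mean buying price from one minute to the next.

```python
import pandas as pd

def aggregate_per_minute(ticks: pd.DataFrame) -> pd.DataFrame:
    """Collapse tick-level quotes into one row per minute.

    Assumes `ticks` has a DatetimeIndex (UTC) and the hypothetical columns
    `buy_price` and `sell_price`; the real column names may differ.
    """
    stats = ["mean", "min", "max", "std"]
    per_minute = ticks.resample("1min").agg(
        {"buy_price": stats, "sell_price": stats}
    )
    # Flatten the two-level columns: ("buy_price", "mean") -> "buy_price_mean"
    per_minute.columns = ["_".join(col) for col in per_minute.columns]
    # Drop minutes in which no trade happened
    per_minute = per_minute.dropna(subset=["buy_price_mean"])
    # Return per minute: percent change of the mean buying price (an assumption)
    per_minute["return"] = per_minute["buy_price_mean"].pct_change()
    return per_minute

# Tiny made-up example: two ticks in each of two consecutive minutes
ticks = pd.DataFrame(
    {"buy_price": [100.0, 102.0, 101.0, 103.0],
     "sell_price": [100.5, 102.5, 101.5, 103.5]},
    index=pd.to_datetime(
        ["2018-01-01 00:00:05", "2018-01-01 00:00:40",
         "2018-01-01 00:01:10", "2018-01-01 00:01:50"], utc=True),
)
per_minute = aggregate_per_minute(ticks)
print(per_minute[["buy_price_mean", "buy_price_std", "return"]])
```

The first row has no return because there is no previous minute to compare against.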
The reddit dataset contains multiple fields, and we start by dropping those that don’t add information for our analysis, for instance `link_id`; we also drop columns like `edited` and `gilded`, for their values are mostly zero. We focus on the `body` of the reddit posts and the time they were created, along with indicators like `score`.
As for the body of the reddit posts, we applied several filtering steps. Firstly, we formed a list of individual words for each post by splitting the text on whitespace and removing irrelevant symbols.
Secondly, we removed all the stopwords, which do not express any emotion, such as “me”, “he”, “is”, “the”, and “this”.
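The two filtering steps can be sketched in a few lines of Python. The stopword set below is a tiny hand-picked list for illustration only, and the choice to lowercase the text and keep only letters and digits is an assumption rather than a description of our exact cleaning rules.

```python
import re

# A tiny illustrative stopword list; a real pipeline would use a fuller one.
STOPWORDS = {"me", "he", "is", "the", "this", "a", "an", "and", "to", "of", "i", "it"}

def tokenize(post):
    """Lowercase a post, strip symbols, split on whitespace, drop stopwords."""
    # Replace anything that is not a letter, digit, or whitespace with a space
    cleaned = re.sub(r"[^a-z0-9\s]", " ", post.lower())
    return [word for word in cleaned.split() if word not in STOPWORDS]

tokens = tokenize("This is the moon!!! He told me to HODL, and I did.")
print(tokens)  # ['moon', 'told', 'hodl', 'did']
```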
Word cloud
Design
We use NLP techniques to turn words into high-dimensional vectors, which constitute part of our features. Along with other features such as the standard deviation of buying prices and the trading volume, these vectors are fed to an LSTM network and an XGBoost model to predict future investment returns.
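One simple way to combine text vectors with market features is sketched below. It is only an illustration under stated assumptions: the three-dimensional `EMBEDDINGS` table is made up, averaging word vectors is just one common way to get a fixed-length vector per post, and the two market values stand in for features like the price standard deviation and the trading volume.

```python
import numpy as np

# Toy 3-dimensional embeddings, purely hypothetical; real vectors would come
# from a trained embedding model and have far more dimensions.
EMBEDDINGS = {
    "moon": np.array([0.9, 0.1, 0.0]),
    "crash": np.array([-0.8, 0.2, 0.1]),
    "hodl": np.array([0.5, 0.5, 0.0]),
}
DIM = 3

def post_vector(tokens):
    """Average the vectors of known tokens; return zeros if none are known."""
    vectors = [EMBEDDINGS[t] for t in tokens if t in EMBEDDINGS]
    if not vectors:
        return np.zeros(DIM)
    return np.mean(vectors, axis=0)

def build_features(tokens, market_features):
    """Concatenate the text vector with the numeric market features."""
    market = np.asarray(market_features, dtype=float)
    return np.concatenate([post_vector(tokens), market])

# Hypothetical market features: buying-price std and trading volume
row = build_features(["moon", "hodl", "unknown"], [1.41, 250.0])
print(row)
```

Each row built this way has a fixed length regardless of how long the post is, which is what tree-based models such as XGBoost expect as input.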
Feature Engineering
In total we have 213 features to predict the price trends.
We analyzed the distribution of the return per minute and split the returns into three classes according to their percentiles. When the return is lower than -0.0018, we label it “decreasing” and assign it a value of 0; when the return is between -0.0018 and 0.0008, we label it “unchanged” and assign it a value of 1; and when the return is higher than 0.0008, we label it “increasing” and assign it a value of 2.
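The labeling rule can be written down directly. In this sketch the thresholds are the ones quoted above; treating returns exactly equal to a threshold as “unchanged” is an assumption, since the text does not say which side the boundaries fall on.

```python
import numpy as np

LOWER, UPPER = -0.0018, 0.0008  # thresholds quoted in the text

def label_returns(returns):
    """Map returns to 0 (decreasing), 1 (unchanged), or 2 (increasing)."""
    returns = np.asarray(returns, dtype=float)
    labels = np.ones(returns.shape, dtype=int)  # default class: unchanged
    labels[returns < LOWER] = 0                 # strictly below the lower bound
    labels[returns > UPPER] = 2                 # strictly above the upper bound
    return labels

labels = label_returns([-0.01, -0.0018, 0.0, 0.0008, 0.005])
print(labels)  # [0 1 1 1 2]
```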
Impact
This model, although very rudimentary at the moment due to limited time and data (only reddit data was used for sentiment analysis), has great potential, as it can be expanded to provide deeper insight into how social media affects the market. If this model were implemented with additional data from other platforms such as Facebook, Instagram, and Twitter, its accuracy and usefulness would presumably increase.
Additionally, this model could be used to study price fluctuations in markets other than bitcoin. Applying it to a different or larger market could help highlight the effect of social media on that market.
Training and testing
We have a notebook that contains our NLP processing here.
This is our feature importance chart.
And below is our model metrics chart:
Metric | XGBoost | LSTM
---|---|---
Accuracy | 0.61 | 0.55
Precision | 0.77 | 0.71
Recall | 0.60 | 0.52
F1 Score | 0.64 | 0.57
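For readers unfamiliar with these metrics in a three-class setting, the sketch below shows how they can be computed from true and predicted labels. It uses macro averaging (an unweighted mean over the three classes), which is an assumption: the table above does not state which averaging scheme was used, and the labels in the example are made up.

```python
def classification_metrics(y_true, y_pred, labels=(0, 1, 2)):
    """Accuracy plus macro-averaged precision, recall, and F1."""
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precisions, recalls, f1s = [], [], []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        predicted = sum(p == label for _, p in pairs)  # predicted as this class
        actual = sum(t == label for t, _ in pairs)     # truly this class
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    return {
        "accuracy": accuracy,
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }

# Made-up labels: 0 = decreasing, 1 = unchanged, 2 = increasing
metrics = classification_metrics([0, 0, 1, 1, 2, 2], [0, 1, 1, 1, 2, 0])
print(metrics)
```

With imbalanced classes, a weighted average would give different numbers, so the averaging scheme matters when comparing models.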
A possible way to improve our model, if we had more time, is to obtain labels (e.g. positive/negative/neutral) for each reddit post and turn our sentiment analysis into a supervised learning process.
Another point worth noting can be seen in the above histogram: buying bids vary more than selling ones. This difference between the two prices suggests that there are nuances in the prices yet to be discovered.