Predicting Stock Prices with Sentiment Analysis 📈😄

Published: June 14, 2022

The idea for this project came from our capstone project under Dr. Seema Bhardwaj. We aimed to build an ML-based predictor to help protect my friends from dubious websites and irregular patterns. Simple Stock, the product we developed, focused on prediction and on teaching students about stock market trends. This blog deals with the prediction part and our approach to it. If you’re interested in learning more about the product, here is a much lengthier brief about it.

In this blog post, I’ll take you through one of my exciting data science projects where I used sentiment analysis of news articles to predict stock prices using an LSTM model. Let’s dive in! 🚀

Introduction

Predicting stock prices is a challenging task due to the numerous factors influencing them. In this project, we will use the VADER sentiment analysis model to analyze news articles and incorporate their sentiment scores as features in our stock price prediction model. Our model will use an LSTM (Long Short-Term Memory) architecture to make predictions based on historical stock prices and news sentiments.

Step-by-Step Guide

Step 1: Data Collection 📝

First, we need to collect historical stock prices and news articles. For this project, I used news articles from the New York Times and historical stock prices of the top 10 companies in the S&P 500 index.

import yfinance as yf
import pandas as pd

# List of top 10 S&P 500 companies
# (Meta's ticker changed from "FB" to "META" in June 2022)
companies = ["AAPL", "MSFT", "AMZN", "GOOGL", "META", "TSLA", "BRK-B", "JPM", "JNJ", "V"]

# Download historical stock prices
stock_data = {}
for company in companies:
    stock_data[company] = yf.download(company, start="2015-01-01", end="2020-12-31")
    
# Convert to DataFrame
stock_prices = pd.concat(stock_data, axis=1)
stock_prices.to_csv("stock_prices.csv")

Step 2: Sentiment Analysis with VADER 😃😞

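Next, we’ll use VADER to score the sentiment of news articles. VADER is a lexicon- and rule-based sentiment analysis tool that was designed with social media text in mind, so applying it to news paragraphs is an approximation. For each text it returns a compound score between -1 (most negative) and +1 (most positive), which is what we’ll use as a feature.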
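
To get a feel for what the compound score is, here’s a simplified sketch of how (as far as I understand VADER’s implementation) a raw sum of word valences gets squashed into the range -1 to 1, plus the commonly used ±0.05 cut-offs for labelling a text. The function names are my own, and the real library also handles negation, punctuation, and capitalization, which this sketch ignores.

```python
import math


def normalize_compound(valence_sum, alpha=15.0):
    """Squash an unbounded sum of word valences into the open range (-1, 1).

    Assumption: this mirrors the x / sqrt(x^2 + alpha) normalization used by
    VADER, with alpha = 15. It is a simplified illustration, not the library.
    """
    return valence_sum / math.sqrt(valence_sum * valence_sum + alpha)


def label_from_compound(compound, threshold=0.05):
    """Turn a compound score into a coarse label using the usual cut-offs."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# Hypothetical valence sums, just to show the shape of the mapping
for total in (-4.0, 0.0, 0.1, 4.0):
    score = normalize_compound(total)
    print(f"{total:+.1f} -> {score:+.3f} ({label_from_compound(score)})")
```

The takeaway is that the compound score saturates: very emotional text cannot push it past ±1, which keeps it on a scale comparable to the normalized prices.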

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
import requests

analyzer = SentimentIntensityAnalyzer()

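# Function to fetch articles and compute one sentiment score per day
def fetch_news_sentiment(company):
    url = "https://api.nytimes.com/svc/search/v2/articlesearch.json"
    params = {"q": company, "api-key": "YOUR_API_KEY"}
    # Note: each request returns a single page of results;
    # use the "page" parameter to fetch more articles
    response = requests.get(url, params=params, timeout=30)
    response.raise_for_status()
    articles = response.json()["response"]["docs"]

    dates, scores = [], []
    for article in articles:
        text = article.get("lead_paragraph")
        if not text:
            continue  # skip articles without a lead paragraph
        scores.append(analyzer.polarity_scores(text)["compound"])
        # Keep the publication date so scores can be matched to trading days
        dates.append(pd.to_datetime(article["pub_date"]).date())

    # Average the compound scores of articles published on the same day
    daily = pd.Series(scores, index=pd.to_datetime(dates), dtype=float)
    return daily.groupby(level=0).mean()

# Analyze sentiments for each company
news_sentiments = {}
for company in companies:
    news_sentiments[company] = fetch_news_sentiment(company)

# Convert to a date-indexed DataFrame (one column per company)
sentiment_df = pd.DataFrame(news_sentiments)
sentiment_df.to_csv("news_sentiments.csv")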
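
Since prices come one row per trading day while articles arrive at arbitrary times, the scores need to be collapsed to one value per day before they can sit next to the prices. Here’s a minimal, library-free sketch of that idea. The helper names and the sample values are made up for illustration (they are not project data), and treating days without articles as neutral (0.0) is an assumption on my part.

```python
from collections import defaultdict


def daily_mean_sentiment(scored_articles):
    """Average compound scores per calendar day.

    scored_articles: iterable of (ISO timestamp string, compound score) pairs.
    """
    buckets = defaultdict(list)
    for pub_date, score in scored_articles:
        buckets[pub_date[:10]].append(score)  # keep only YYYY-MM-DD
    return {day: sum(scores) / len(scores) for day, scores in buckets.items()}


def align_to_trading_days(daily_scores, trading_days, neutral=0.0):
    """Return one score per trading day, using `neutral` when no article exists."""
    return [daily_scores.get(day, neutral) for day in trading_days]


# Made-up example: two articles on the 2nd, none on the 3rd, one on the 4th
articles = [
    ("2020-03-02T09:15:00+0000", 0.50),
    ("2020-03-02T17:40:00+0000", -0.10),
    ("2020-03-04T11:05:00+0000", 0.30),
]
trading_days = ["2020-03-02", "2020-03-03", "2020-03-04"]

daily = daily_mean_sentiment(articles)
aligned = align_to_trading_days(daily, trading_days)
print(aligned)
```

Averaging is only one choice; a sum or a count of articles per day would carry information about news volume as well.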

Step 3: Data Preparation 📊

We need to prepare our data by merging the stock prices and news sentiments on their dates. We’ll also standardize the prices (zero mean, unit variance per column) so that all features are on a comparable scale.

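# Load data (both files are assumed to be indexed by date)
stock_prices = pd.read_csv("stock_prices.csv", header=[0, 1], index_col=0, parse_dates=True)
sentiments = pd.read_csv("news_sentiments.csv", index_col=0, parse_dates=True)

# Flatten the (ticker, field) column MultiIndex, e.g. "AAPL_Close"
stock_prices.columns = ["_".join(col) for col in stock_prices.columns]

# Normalize stock prices (z-score per column)
# Note: mean and std are computed over the full history, including the later test period
stock_prices_norm = (stock_prices - stock_prices.mean()) / stock_prices.std()

# Align sentiments to trading days; days without articles are treated as neutral (0)
sentiments = sentiments.add_suffix("_sentiment").reindex(stock_prices_norm.index).fillna(0.0)

# Merge data
data = pd.concat([stock_prices_norm, sentiments], axis=1)
data.dropna(inplace=True)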
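
One thing worth experimenting with: computing the mean and standard deviation over the whole history lets information from the test period leak into the training features. Here’s a hedged sketch of the alternative, where the statistics come from the training portion only. The function name, the 80% split, and the toy prices are assumptions for illustration, not what the project used.

```python
import numpy as np


def standardize_with_train_stats(values, train_fraction=0.8):
    """Z-score every column using statistics from the first `train_fraction` rows.

    Returns the scaled array and the index where the training portion ends.
    """
    values = np.asarray(values, dtype=float)
    split = int(len(values) * train_fraction)
    mean = values[:split].mean(axis=0)
    std = values[:split].std(axis=0, ddof=1)  # ddof=1 matches the pandas default
    std = np.where(std == 0, 1.0, std)        # avoid dividing by zero on flat columns
    return (values - mean) / std, split


# Toy single-column "price" series with a jump in the last (test) row
prices = np.array([[10.0], [12.0], [14.0], [16.0], [30.0]])
scaled, split = standardize_with_train_stats(prices)
print(split)
print(scaled.ravel())
```

With this variant the jump in the last row does not influence how the earlier rows are scaled, which is closer to what the model would face on genuinely unseen data.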

Step 4: Building the LSTM Model 🧠

We’ll build our LSTM model using TensorFlow/Keras. The model will have an input layer, two LSTM layers, and a dense output layer.

import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense
from sklearn.model_selection import train_test_split

# Prepare data for LSTM
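def create_dataset(data, window_size):
    # Each sample is a window of `window_size` consecutive rows (all features);
    # the target is column 0 of `data` at the step right after the window.
    X, y = [], []
    for i in range(len(data) - window_size):
        X.append(data[i:i + window_size])
        y.append(data[i + window_size, 0])
    return np.array(X), np.array(y)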

window_size = 60
X, y = create_dataset(data.values, window_size)

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)

# Build LSTM model
model = Sequential([
    LSTM(64, return_sequences=True, input_shape=(window_size, X.shape[2])),
    LSTM(32),
    # Linear output: the normalized target can be negative, which ReLU cannot produce
    Dense(1)
])

model.compile(optimizer='adam', loss='mse')
model.summary()

# Train model (hold out the last 10% of the training data for validation,
# so the test set stays unseen until evaluation)
model.fit(X_train, y_train, epochs=100, batch_size=32, validation_split=0.1)
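
If the windowing step feels abstract, it helps to run it on a tiny array and look at the shapes. The sketch below uses `create_windows`, a hypothetical copy of `create_dataset` with an explicit `target_col` parameter added, on made-up data with 10 time steps and 2 features.

```python
import numpy as np


def create_windows(data, window_size, target_col=0):
    """Build (samples, window_size, features) inputs and next-step targets."""
    X, y = [], []
    for i in range(len(data) - window_size):
        X.append(data[i:i + window_size])                 # rows i .. i+window_size-1
        y.append(data[i + window_size, target_col])       # value right after the window
    return np.array(X), np.array(y)


# Toy data: 10 time steps, 2 features (values 0..19 laid out row by row)
toy = np.arange(20, dtype=float).reshape(10, 2)
X, y = create_windows(toy, window_size=3)

print(X.shape)  # (7, 3, 2)
print(y.shape)  # (7,)
print(y[0])     # 6.0
```

The three-dimensional shape (samples, time steps, features) is what the first LSTM layer expects, and each target is the first column’s value one step after its window ends.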

Step 5: Evaluating the Model 📉

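We’ll evaluate our model on the held-out test set using the Root Mean Squared Error (RMSE). Since the prices were normalized in Step 3, the RMSE is in units of standard deviations rather than dollars.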

from sklearn.metrics import mean_squared_error

# Predict (flatten the (n, 1) predictions to match the shape of y_test)
y_pred = model.predict(X_test).ravel()
rmse = np.sqrt(mean_squared_error(y_test, y_pred))

print(f"RMSE: {rmse}")
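
An RMSE on its own is hard to judge, so it’s worth comparing it against a naive baseline that simply predicts “tomorrow equals today”. Here’s a small sketch of that comparison on a made-up series. The helper names are hypothetical and this baseline was not part of the original project.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error between two arrays of equal length."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def persistence_forecast(series):
    """Predict each value as the previous one; returns (y_true, y_pred)."""
    series = np.asarray(series, dtype=float)
    return series[1:], series[:-1]


# Made-up series, only to show the mechanics
toy_series = [1.0, 2.0, 4.0, 7.0]
y_true, y_naive = persistence_forecast(toy_series)
print(f"Naive RMSE: {rmse(y_true, y_naive):.4f}")
```

With the arrays from Step 4, the equivalent naive prediction would be `X_test[:, -1, 0]` (the last value of the target column in each window). If the LSTM’s RMSE isn’t clearly below that baseline’s, the sentiment features and the network aren’t adding much yet.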

Conclusion 🎉

In this project, we successfully used sentiment analysis of news articles to improve stock price prediction with an LSTM model. By combining historical stock prices and sentiment scores, our model achieved some pretty impressive results. But hey, there’s always room for improvement! So, don’t hesitate to experiment with different model architectures and datasets. Happy coding! 😊

Oh, and if you want to check out the initial stages and thought process behind this blog, check out these slides from back when I was in 3rd year. Do remember, I’ve significantly improved at making presentations—oops, I mean coding—since then!

Yash