In July 2019, a quantitative fund noticed something unusual in credit card transaction data: Chipotle's same-store sales were accelerating dramatically, up 12% year-over-year in the first three weeks of July. This was surprising—the company had guided for 6-8% growth for Q2. The fund went long Chipotle two weeks before earnings. When the company reported 10% same-store sales growth and raised guidance, the stock jumped 8% in a day.
This wasn't insider trading. The fund didn't have access to Chipotle's internal sales data. They had aggregated, anonymized credit card transactions from millions of consumers, processed through sophisticated normalization algorithms to account for panel changes and seasonality. The signal was real, legal, and profitable.
Credit card transaction data has become one of the most valuable alternative data sources for equity investors. It provides real-time visibility into consumer spending—the engine of the US economy—weeks before companies report earnings. But it's also one of the most complex data sources: privacy regulations are tightening, panel biases are significant, and the signal-to-noise ratio is lower than most people think.
This article covers the complete journey: understanding data providers and their methodologies, normalizing transaction data for panel changes, adjusting for seasonality, building revenue forecasting models, and navigating the privacy and legal landscape. We'll discuss what works, what doesn't, and why many transaction data strategies fail.
Before 2010, fundamental investors relied on quarterly earnings reports, management guidance, and channel checks (calling stores to ask about sales). This created a three-month information lag—you learned about Q1 sales in April, Q2 sales in July, and so on.
Credit card transaction data inverted this dynamic. Instead of waiting for companies to report sales, you could estimate sales yourself using consumer spending data. The first movers—firms like Yodlee and Second Measure—aggregated transaction data from millions of consumers and sold insights to hedge funds.
The edge was enormous initially. In 2012-2015, transaction data predicted earnings surprises with 70-80% accuracy. Funds using this data generated Sharpe ratios above 2.0 on earnings-driven strategies. It was a gold rush.
But as adoption increased, the edge decayed. By 2018, transaction data was widely used, and the accuracy dropped to 55-60%. The market had incorporated the information. Today, transaction data is table stakes for fundamental equity investors—you need it to compete, but it's no longer a significant edge on its own.
The value has shifted from standalone signals to combination strategies: transaction data + web scraping + satellite imagery + traditional fundamentals. The firms that succeed are those that integrate multiple data sources and extract signals that others miss.
The credit card transaction data market is dominated by a few major providers, each with different methodologies, coverage, and pricing.
Facteus aggregates transaction data from 100+ million US consumers, covering roughly 30% of US consumer spending. They partner with banks and fintech apps to access anonymized transaction data, which they clean, categorize, and aggregate by merchant.
Strengths: massive scale, daily updates, merchant-level granularity
Weaknesses: US-only, potential selection bias (skews toward certain demographics)
Cost: $50,000-500,000/year depending on coverage and frequency
Facteus data is the industry standard for large-cap US retailers and restaurants. If you're tracking Walmart, Target, or Starbucks, Facteus probably has the best coverage.
Second Measure, acquired by Bloomberg in 2020, focuses on e-commerce and direct-to-consumer brands. They aggregate transaction data from 100+ million consumers, with particularly strong coverage of online spending.
Strengths: e-commerce focus, Bloomberg integration, good for DTC brands
Weaknesses: weaker brick-and-mortar coverage, panel skews young/urban
Cost: included with Bloomberg Terminal (or $30,000-300,000/year standalone)
Second Measure excels for tracking Amazon, Shopify merchants, and subscription services. If you're analyzing e-commerce trends or DTC brands, Second Measure is often the best choice.
Earnest Research (acquired by Cardlytics) uses a smaller but higher-quality panel of 5+ million US consumers. They focus on panel stability and demographic representativeness rather than raw size.
Strengths: high-quality panel, demographic data, stable coverage
Weaknesses: smaller sample size, higher cost per data point
Cost: $30,000-300,000/year
Earnest is preferred for categories where panel quality matters more than size: luxury goods, niche retailers, regional chains. Their demographic data (age, income, location) enables cohort analysis that other providers can't support.
All transaction data providers suffer from selection bias: their panels aren't representative of the US population. Consumers who opt into data sharing (via banking apps, rewards programs, etc.) skew younger, more tech-savvy, and higher-income than the general population.
This creates systematic biases: spending at merchants favored by young, urban, higher-income consumers is overcounted, while cash-heavy, lower-income, and rural spending is undercounted.
For some companies, this doesn't matter—if you're tracking Uber or DoorDash, a young/urban panel is fine. For others, it's a major issue—if you're tracking Walmart or Dollar General, panel bias creates noise.
We adjust for panel bias using demographic reweighting: if the panel is 60% urban but the US is 40% urban, we downweight urban transactions. This helps but doesn't fully solve the problem—you can't reweight what you don't observe (cash transactions).
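The reweighting described above can be sketched as simple post-stratification. This is an illustrative example, not our production pipeline: `reweight_spending`, the stratum labels, and the share figures are hypothetical.

```python
import pandas as pd

def reweight_spending(txn: pd.DataFrame,
                      panel_shares: dict, population_shares: dict) -> float:
    """Post-stratification: weight each transaction so every demographic
    stratum contributes in proportion to its share of the US population,
    not its (biased) share of the panel.
    `txn` needs columns: 'stratum', 'spend'."""
    weights = {s: population_shares[s] / panel_shares[s] for s in panel_shares}
    return float((txn['spend'] * txn['stratum'].map(weights)).sum())

# Hypothetical panel that is 60% urban, vs. a 40% urban population.
txn = pd.DataFrame({
    'stratum': ['urban', 'urban', 'urban', 'rural', 'rural'],
    'spend':   [100.0,   120.0,   80.0,    50.0,    50.0],
})
adjusted = reweight_spending(txn,
                             panel_shares={'urban': 0.6, 'rural': 0.4},
                             population_shares={'urban': 0.4, 'rural': 0.6})
# raw total is 400; the reweighted total is about 350 (urban downweighted)
```

The limitation in the text applies directly: strata that never appear in the panel (cash-only consumers) get no weight at all, because there is nothing to reweight.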
Raw transaction data is messy and non-stationary. Panel size changes as consumers opt in and out. Merchant categorization changes as providers refine their algorithms. Seasonal patterns vary by year. Without careful normalization, you're trading on noise.
The biggest challenge is panel drift: the number of consumers in the panel changes over time. If the panel grows from 50 million to 100 million consumers, raw transaction volumes double—but this doesn't mean spending doubled.
The naive approach—dividing total spending by panel size—is wrong. Panel growth isn't random: new consumers who opt in might have different spending patterns than existing consumers. If high-income consumers join the panel, average spending per consumer increases even if actual spending is flat.
The solution is cohort-based normalization: track the same consumers over time and measure their spending changes. If the same 10 million consumers increased Starbucks spending by 5%, that's a real signal. If total spending increased 5% but it's driven by panel growth, it's noise.
We implement this using a matched panel approach: identify consumers present in both the current period and the comparison period (e.g., this July vs. last July), calculate spending changes for this matched cohort, and use that as the signal. This controls for panel composition changes.
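A minimal sketch of the matched-panel approach, under the assumption that the data carries a stable `consumer_id` per panel member (column names and figures here are hypothetical):

```python
import pandas as pd

def matched_panel_growth(current: pd.DataFrame, prior: pd.DataFrame) -> float:
    """Year-over-year spending growth computed only on consumers present in
    BOTH periods, so panel entry and exit cannot masquerade as growth.
    Each frame needs columns: 'consumer_id', 'spend'."""
    matched = set(current['consumer_id']) & set(prior['consumer_id'])
    cur = current[current['consumer_id'].isin(matched)]['spend'].sum()
    pri = prior[prior['consumer_id'].isin(matched)]['spend'].sum()
    return cur / pri - 1.0

# Hypothetical July panels: consumer 4 is a big spender who joined this year.
prior = pd.DataFrame({'consumer_id': [1, 2, 3],
                      'spend': [100.0, 100.0, 100.0]})
current = pd.DataFrame({'consumer_id': [1, 2, 3, 4],
                        'spend': [105.0, 105.0, 105.0, 200.0]})

growth = matched_panel_growth(current, prior)
# matched cohort grew 5%; the naive total would show ~72% "growth"
```

The naive year-over-year total (515 vs. 300) is dominated by the new panel member; the matched cohort isolates the real 5% signal.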
Consumer spending has strong seasonal patterns: holiday shopping in November/December, back-to-school in August/September, summer travel in June/July. These patterns vary by merchant and category.
The naive approach—comparing to the same month last year—misses intra-month patterns. July 4th falls on different days of the week each year, shifting spending patterns. Easter moves between March and April, affecting spring spending.
We use classical seasonal decomposition with moving averages to extract seasonal patterns, then adjust for calendar effects (day-of-week, holidays, leap years). This is standard time-series analysis, but it's critical for transaction data.
One lesson learned: don't over-adjust. Some "seasonal" patterns are actually signal—if holiday spending is stronger than usual, that's information, not noise. We adjust for normal seasonality but preserve anomalies.
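The decomposition step can be sketched as follows (multiplicative form with a 12-month centered moving average; the calendar adjustments for day-of-week and holiday shifts mentioned above are omitted, and the series is hypothetical):

```python
import pandas as pd

def seasonally_adjust(monthly: pd.Series) -> pd.Series:
    """Classical multiplicative decomposition: trend from a 12-month
    centered moving average, seasonal factor as the per-calendar-month
    median of value/trend, adjusted series = value / seasonal factor."""
    trend = monthly.rolling(window=12, center=True).mean()
    ratio = monthly / trend
    months = pd.Series(monthly.index.month, index=monthly.index)
    factors = ratio.groupby(months).median()
    factors = factors / factors.mean()  # normalize factors to average 1.0
    return monthly / months.map(factors)

# Hypothetical retailer: flat $100M/month with a December peak and June dip.
idx = pd.date_range('2019-01-01', periods=48, freq='MS')
season = {12: 1.2, 6: 0.8}
raw = pd.Series([100.0 * season.get(m, 1.0) for m in idx.month], index=idx)
adjusted = seasonally_adjust(raw)
# the adjusted series is flat: the December/June pattern was pure seasonality
```

On this synthetic series the adjustment recovers the flat trend exactly; on real data, anything left over after removing the typical factors is the anomaly you want to preserve.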
Transaction data providers categorize merchants (Starbucks = coffee, Target = general merchandise), but categorization is imperfect and changes over time. A transaction at Target might be groceries, clothing, electronics, or home goods—the provider guesses based on transaction patterns.
This creates noise and drift. If a provider improves their categorization algorithm, apparent spending patterns change even if actual spending is constant. We've seen cases where a merchant's category changed overnight, creating a spurious 20% spending jump.
The solution is to track categorization changes and adjust for them. We maintain a database of merchant category mappings over time and flag when they change. For critical merchants (those we're actively trading), we manually verify categorizations.
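The change-flagging step can be sketched like this, assuming periodic snapshots of the provider's merchant-to-category mapping (the function and sample data are hypothetical):

```python
import pandas as pd

def flag_category_changes(snapshots: pd.DataFrame) -> pd.DataFrame:
    """Given periodic snapshots of provider category mappings
    (columns: 'date', 'merchant', 'category'), return the rows where a
    merchant's category differs from its previous snapshot."""
    s = snapshots.sort_values(['merchant', 'date'])
    prev = s.groupby('merchant')['category'].shift()
    changed = s[prev.notna() & (s['category'] != prev)]
    return changed.assign(previous_category=prev[changed.index])

# Hypothetical snapshots: the provider reclassified one merchant overnight.
snaps = pd.DataFrame({
    'date':     ['2023-01', '2023-02', '2023-01', '2023-02'],
    'merchant': ['TGT',     'TGT',     'SBUX',    'SBUX'],
    'category': ['general_merch', 'grocery', 'coffee', 'coffee'],
})
flags = flag_category_changes(snaps)
# one flagged row: TGT moved from general_merch to grocery in 2023-02
```

Flagged merchants then get the manual verification step described above before any apparent spending jump is traded.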
The goal of transaction data is to predict company revenues before they're reported. This requires translating consumer spending (what the data measures) into company revenues (what matters for earnings).
Transaction data doesn't cover all revenue. For most retailers, coverage is 20-40% of total sales—the rest is cash, non-participating credit cards, or channels the provider doesn't track (e.g., in-store purchases for online-focused panels).
This means you're not measuring revenue directly; you're measuring a sample and extrapolating. If the sample is representative, this works. If the sample is biased (e.g., overweight e-commerce for a brick-and-mortar retailer), it doesn't.
We estimate coverage by comparing transaction data to reported revenues over historical periods. For Starbucks, Facteus covers ~35% of revenue. For Amazon, Second Measure covers ~45%. These coverage ratios are stable over time, so we can use them to scale up transaction data to revenue estimates.
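The coverage estimate amounts to a regression through the origin of observed panel spending on reported revenue. A minimal sketch with hypothetical figures (the ~35% ratio below is illustrative, not the actual Starbucks number):

```python
def estimate_coverage(observed_spend, reported_revenue):
    """Least-squares estimate (through the origin) of the coverage ratio:
    the fraction of reported revenue the panel actually observes."""
    num = sum(o * r for o, r in zip(observed_spend, reported_revenue))
    den = sum(r * r for r in reported_revenue)
    return num / den

# Hypothetical history ($bn): the panel consistently sees ~35% of revenue.
observed = [2.45, 2.52, 2.63, 2.80]
reported = [7.00, 7.20, 7.50, 8.00]

coverage = estimate_coverage(observed, reported)   # about 0.35
revenue_estimate = 2.90 / coverage                 # scale panel spend up
```

The stability of the ratio across past quarters is what makes the scaling trustworthy; a drifting ratio is a warning sign that the panel's mix is shifting.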
Transaction data is real-time, but revenues are reported on a fiscal calendar. If a company's Q2 ends June 30th, you need to aggregate transactions from April 1 to June 30—but you're making predictions in mid-July, before the earnings announcement.
The challenge is that transaction data continues to arrive after the quarter ends. Late-arriving transactions (from delayed processing, corrections, etc.) can change the picture. We've seen cases where transaction data showed strong growth in early July, but late-arriving June transactions revealed weakness.
The solution is to use only "settled" transactions—those processed at least 7 days ago. This reduces timeliness but improves accuracy. We make predictions 10-14 days before earnings, using transaction data through 7 days before the prediction date.
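The settled-transaction filter is a one-line cutoff, sketched here under the assumption that the feed carries a processing date alongside the transaction date (column names are hypothetical):

```python
import pandas as pd

def settled_view(txn: pd.DataFrame, as_of: pd.Timestamp,
                 settle_days: int = 7) -> pd.DataFrame:
    """Keep only transactions processed at least `settle_days` before
    `as_of`, so late-arriving corrections can no longer move the totals.
    Requires a 'processed_date' column alongside the transaction 'date'."""
    cutoff = as_of - pd.Timedelta(days=settle_days)
    return txn[txn['processed_date'] <= cutoff]

# Hypothetical end-of-quarter data: one June transaction arrived late.
txn = pd.DataFrame({
    'date': pd.to_datetime(['2023-06-28', '2023-06-29', '2023-06-30']),
    'processed_date': pd.to_datetime(['2023-06-29', '2023-06-30',
                                      '2023-07-08']),
    'amount': [10.0, 12.0, 50.0],
})
view = settled_view(txn, pd.Timestamp('2023-07-10'))
# only the first two transactions are settled as of July 10
```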
We use gradient boosting (XGBoost) to forecast revenues from transaction data. The features fall into three groups:

Transaction features: total and average daily spending, spending volatility, unique customer counts, transaction counts, and average ticket size.
Seasonal features: month, quarter, and days elapsed in the current quarter.
Historical features: trailing 30- and 60-day spending growth and customer growth.
The model trains on historical quarters (2015-2020), validates on held-out quarters (2021-2022), and predicts current quarters. Cross-validation is temporal—we never train on future data.
Here's a simplified implementation:
```python
import pandas as pd
import numpy as np
from xgboost import XGBRegressor
from sklearn.model_selection import TimeSeriesSplit
from sklearn.metrics import mean_absolute_error

class RevenueForecaster:
    """Forecast company revenues from transaction data."""

    def __init__(self, transaction_data, actual_revenues):
        self.transaction_data = transaction_data
        self.actual_revenues = actual_revenues
        self.model = None
        self.feature_names = None

    def create_features(self, ticker, forecast_date, lookback_days=90):
        """Create features for revenue forecasting."""
        end_date = forecast_date
        start_date = forecast_date - pd.Timedelta(days=lookback_days)

        # Transaction data for this ticker in the lookback window
        txn = self.transaction_data[
            (self.transaction_data['ticker'] == ticker) &
            (self.transaction_data['date'] >= start_date) &
            (self.transaction_data['date'] < end_date)
        ]

        if len(txn) == 0:
            return None

        features = {
            # Spending features
            'total_spending_90d': txn['total_sales'].sum(),
            'avg_daily_spending': txn['total_sales'].mean(),
            'spending_volatility': txn['total_sales'].std(),

            # Growth features
            'spending_growth_30d': self._calculate_growth(txn, 30),
            'spending_growth_60d': self._calculate_growth(txn, 60),

            # Customer features
            'avg_customers': txn['unique_customers'].mean(),
            'customer_growth': self._calculate_customer_growth(txn),

            # Transaction features
            'avg_transactions': txn['num_transactions'].mean(),
            'avg_ticket_size': txn['total_sales'].sum() / txn['num_transactions'].sum(),

            # Temporal features
            'month': forecast_date.month,
            'quarter': forecast_date.quarter,
            'days_into_quarter': self._days_into_quarter(forecast_date),
        }

        return pd.Series(features)

    def _calculate_growth(self, txn, days):
        """Spending growth: last `days` rows vs. the `days` rows before
        them (assumes one row per calendar day)."""
        if len(txn) < days * 2:
            return 0.0
        recent = txn.tail(days)['total_sales'].sum()
        prior = txn.iloc[-days * 2:-days]['total_sales'].sum()
        return (recent - prior) / prior if prior > 0 else 0.0

    def _calculate_customer_growth(self, txn):
        """Customer growth: trailing 30 days vs. the first 30 days."""
        if len(txn) < 60:
            return 0.0
        recent = txn.tail(30)['unique_customers'].mean()
        prior = txn.head(30)['unique_customers'].mean()
        return (recent - prior) / prior if prior > 0 else 0.0

    def _days_into_quarter(self, date):
        """Days elapsed since the start of the current quarter."""
        quarter_start = pd.Timestamp(date.year, (date.quarter - 1) * 3 + 1, 1)
        return (date - quarter_start).days

    def train(self, tickers, train_start, train_end):
        """Train the revenue model with temporal cross-validation."""
        X_list, y_list = [], []

        # Earnings dates in the training window
        revenue_dates = self.actual_revenues[
            (self.actual_revenues['date'] >= train_start) &
            (self.actual_revenues['date'] <= train_end)
        ]['date'].unique()

        # Build rows in chronological order so TimeSeriesSplit folds
        # never train on future data
        for date in sorted(revenue_dates):
            for ticker in tickers:
                features = self.create_features(ticker, pd.Timestamp(date))
                if features is None:
                    continue

                actual = self.actual_revenues[
                    (self.actual_revenues['ticker'] == ticker) &
                    (self.actual_revenues['date'] == date)
                ]['revenue']

                if len(actual) > 0:
                    X_list.append(features)
                    y_list.append(actual.iloc[0])

        X = pd.DataFrame(X_list)
        y = np.array(y_list)
        self.feature_names = X.columns.tolist()

        self.model = XGBRegressor(
            n_estimators=100,
            max_depth=6,
            learning_rate=0.1,
            random_state=42,
        )

        # Temporal cross-validation: each fold validates on later rows
        tscv = TimeSeriesSplit(n_splits=5)
        cv_scores = []

        for train_idx, val_idx in tscv.split(X):
            X_train, X_val = X.iloc[train_idx], X.iloc[val_idx]
            y_train, y_val = y[train_idx], y[val_idx]

            self.model.fit(X_train, y_train)
            predictions = self.model.predict(X_val)
            cv_scores.append(mean_absolute_error(y_val, predictions))

        # Final fit on all training data
        self.model.fit(X, y)

        return {
            'cv_mae': np.mean(cv_scores),
            'cv_std': np.std(cv_scores),
            'feature_importance': dict(zip(self.feature_names,
                                           self.model.feature_importances_)),
        }

    def predict(self, ticker, forecast_date):
        """Predict revenue for a ticker as of a given date."""
        features = self.create_features(ticker, forecast_date)
        if features is None:
            return None

        # Keep feature order consistent with training
        X = features[self.feature_names].to_frame().T
        return {
            'ticker': ticker,
            'forecast_date': forecast_date,
            'predicted_revenue': float(self.model.predict(X)[0]),
        }
```

This model captures the relationship between transaction patterns and revenues. In backtests, it achieves MAE (mean absolute error) of 3-5% on quarterly revenue predictions—good enough to identify significant surprises.
Let's make this concrete with a real example: using transaction data to predict restaurant earnings.
We track transaction data for 20 major restaurant chains (McDonald's, Starbucks, Chipotle, etc.) from 2018-2023. For each company, we calculate same-store sales growth from transaction data and compare to reported same-store sales at earnings.
The hypothesis: if transaction data shows 8% same-store sales growth but consensus expects 5%, the company will beat earnings.
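The trading rule implied by this hypothesis can be sketched as a simple threshold on the gap between transaction-implied growth and consensus. The 2-percentage-point threshold below is an illustrative assumption, not a calibrated value:

```python
def surprise_signal(txn_growth: float, consensus_growth: float,
                    threshold: float = 0.02) -> str:
    """Trade only when transaction-implied same-store sales growth
    diverges from consensus by more than `threshold` (assumed 2pp)."""
    gap = txn_growth - consensus_growth
    if gap > threshold:
        return 'long'
    if gap < -threshold:
        return 'short'
    return 'no_trade'

signal = surprise_signal(0.08, 0.05)  # transaction data 8% vs. consensus 5%
```

A dead zone around consensus matters in practice: small gaps are usually within the noise of coverage and panel effects, not tradable information.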
The strategy works, but with caveats. The edge has decayed as more investors use transaction data, though it's still positive. The key is combining transaction data with other signals (web traffic, social media sentiment, management commentary) to improve accuracy.
Transaction data fails in several scenarios:
1. Coverage changes: If the panel composition shifts (e.g., more young consumers join), apparent spending changes even if actual spending is flat.
2. Promotional activity: Heavy discounting increases transaction volume but not revenue. We've seen cases where transaction data showed strong growth, but it was driven by promotions that hurt margins.
3. Mix shifts: If consumers trade down (buying cheaper items), transaction count increases but revenue doesn't. This is particularly tricky for restaurants with diverse menus.
4. Delivery vs. dine-in: Transaction data often undercounts delivery (especially third-party delivery), creating bias for restaurants shifting to delivery.
We address these issues through careful normalization and by tracking multiple metrics (transaction count, average ticket, customer count) rather than just total spending.
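Tracking multiple metrics amounts to decomposing spending into its drivers: spend = customers × transactions per customer × spend per transaction. A minimal sketch with hypothetical numbers showing how a mix shift (point 3 above) hides inside flat-looking growth:

```python
def decompose_spending(period: dict) -> dict:
    """Split total panel spending into its drivers:
    spend = customers * (txns per customer) * (spend per txn)."""
    return {
        'customers': period['customers'],
        'txns_per_customer': period['txns'] / period['customers'],
        'avg_ticket': period['spend'] / period['txns'],
    }

def component_growth(cur: dict, pri: dict) -> dict:
    """Growth rate of each driver between two periods."""
    return {k: cur[k] / pri[k] - 1.0 for k in cur}

# Hypothetical quarter: total spend grew 5%, but the mix shifted.
prior   = {'customers': 1_000_000, 'txns': 3_000_000, 'spend': 24_000_000}
current = {'customers': 1_050_000, 'txns': 3_360_000, 'spend': 25_200_000}

components = component_growth(decompose_spending(current),
                              decompose_spending(prior))
# customers +5%, visit frequency +6.7%, average ticket -6.25%: trade-down
```

Headline spending is up 5%, but the falling average ticket is the trade-down signal that total spending alone would miss.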
Credit card transaction data exists in a complex legal and ethical landscape. Privacy regulations are tightening, and the line between legal alternative data and illegal insider trading is sometimes unclear.
The EU's General Data Protection Regulation (GDPR) and California's Consumer Privacy Act (CCPA) impose strict requirements on personal data collection and use: transaction data providers must obtain valid consent, minimize the personal data they hold, and honor consumer requests to access or delete it.
Providers generally comply by aggregating data (no individual transactions), anonymizing identifiers (no names, addresses, or account numbers), and obtaining consent through banking apps or rewards programs.
But compliance is complex and evolving. In 2023, the EU fined several data brokers for GDPR violations related to transaction data. The risk is real, and it's increasing.
The SEC prohibits trading on material non-public information. The question: is aggregated transaction data MNPI?
The answer is probably no, but it's not settled law. Transaction data is aggregated (no single company's internal records), anonymized, gathered with consumer consent, and available for purchase by anyone willing to pay.
These factors suggest it's not MNPI. But there's a gray area: if transaction data is so comprehensive that it effectively reveals a company's sales before they're reported, is that different from having access to internal sales data?
The SEC hasn't provided clear guidance. Our approach: consult legal counsel, document our data sources and methodologies, and avoid situations that feel like insider trading even if they're technically legal.
Beyond legality, there's ethics. Transaction data involves consumer privacy, even if it's anonymized and aggregated. We've adopted principles: use only aggregated, anonymized data; work only with providers that obtain genuine consumer consent; and never attempt to re-identify individuals.
This isn't just ethics—it's risk management. Public backlash against data practices can lead to regulation, and regulation can eliminate data sources overnight.
Credit card transaction data has matured from a novel edge to a standard tool. The early days of easy alpha are over, but the data remains valuable for firms that use it correctly.
The future belongs to firms that integrate transaction data with other sources (web scraping, satellite imagery, traditional fundamentals), invest in rigorous normalization, and find under-covered niches.
If you're building a transaction data strategy today, focus on niches where coverage is strong and competition is weak: mid-cap retailers, regional chains, emerging categories. Avoid overcrowded trades (large-cap tech, major restaurants) where the edge has disappeared.
And always remember: transaction data measures spending, not revenue. The translation from one to the other is noisy, biased, and time-varying. Treat it as one signal among many, not a crystal ball.
The data is powerful, but it's not magic. The question is: can you extract signal from noise before your competitors do?