In 2016, a quantitative hedge fund scraped job postings from LinkedIn and corporate career pages to predict company growth. They noticed that a mid-cap software company had tripled its engineering headcount over six months—a signal of rapid expansion that hadn't yet appeared in earnings reports. They went long. Two quarters later, the company reported blowout revenue growth and raised guidance. The stock jumped 40%.
The fund wasn't breaking any laws. They weren't accessing private data or circumventing security measures. They were simply collecting publicly available information more systematically than anyone else. This is the promise of web scraping for trading: transforming the vast, unstructured data on the internet into structured, tradeable signals.
But web scraping is a minefield. The legal landscape is murky and evolving. The technical challenges are substantial—anti-bot systems, rate limiting, dynamic JavaScript, CAPTCHAs. And the ethical questions are real: just because you can scrape something doesn't mean you should.
This article covers the complete journey from legal compliance to production deployment. We'll discuss what's legal and what's risky, how to build scrapers that don't get blocked, how to validate data quality, and how to generate trading signals from scraped data. More importantly, we'll discuss what goes wrong—because in web scraping, things go wrong constantly.
Let's start with the uncomfortable truth: web scraping exists in a legal gray area. There's no law that explicitly permits or prohibits scraping public websites. Instead, there's a patchwork of statutes, court cases, and terms of service that create a complex risk landscape.
The CFAA, enacted in 1986 to combat computer hacking, has become the primary legal weapon against web scraping. The statute prohibits "unauthorized access" to computer systems. The key question: what constitutes "unauthorized"?
The Ninth Circuit's 2019 decision in hiQ Labs v. LinkedIn provided some clarity. LinkedIn tried to block hiQ, a startup that scraped public LinkedIn profiles to predict employee turnover. LinkedIn argued that scraping violated the CFAA because they had sent hiQ a cease-and-desist letter explicitly revoking permission. The court disagreed, ruling that publicly accessible data cannot be "unauthorized" under the CFAA, even if the website owner objects.
This was a major victory for scrapers, but it's not universal law. The decision applies only in the Ninth Circuit (California, Oregon, Washington, and other western states). Other circuits might rule differently. And the Supreme Court's 2021 decision in Van Buren v. United States narrowed the CFAA's scope, focusing on "exceeding authorized access" rather than "unauthorized access"—a subtle distinction that might help scrapers, but the implications are still unclear.
Practical takeaway: Scraping publicly accessible data is probably legal under the CFAA, but "probably" isn't "definitely." If you're scraping at scale, consult a lawyer. If you receive a cease-and-desist letter, take it seriously.
Most websites have Terms of Service (ToS) that prohibit scraping. The question is whether violating ToS is illegal or just a breach of contract.
The answer depends on jurisdiction and circumstances. In some cases, courts have held that violating ToS can constitute unauthorized access under the CFAA. In others, they've ruled that ToS violations are purely contractual matters. The hiQ decision suggested that ToS can't convert public data into private data, but that's not settled law everywhere.
Here's the risk calculus: violating ToS probably won't lead to criminal charges, but it could lead to a civil lawsuit. If you're a hedge fund scraping a major website at scale, the website owner might sue for breach of contract, trespass to chattels (interfering with their servers), or copyright infringement (if you're copying substantial content).
The damages in such lawsuits are often minimal—you're not causing real harm by reading public web pages—but the legal fees can be substantial. And some websites have arbitration clauses in their ToS that force disputes into expensive arbitration rather than court.
Practical takeaway: Read the ToS. If it explicitly prohibits scraping, you're taking legal risk by proceeding. Weigh that risk against the value of the data. For high-value signals (predicting earnings, tracking supply chains), the risk might be worth it. For marginal signals, it's probably not.
Robots.txt is a file that websites use to tell automated crawlers which pages they should and shouldn't access. It's not legally binding—it's a convention, like holding the door for someone behind you. But violating robots.txt can strengthen a website's legal case that your access was "unauthorized."
The polite approach: respect robots.txt. If a site says "don't scrape /api/*", don't scrape it. This reduces legal risk and demonstrates good faith. The aggressive approach: ignore robots.txt for public pages, arguing that it's just a suggestion. This increases legal risk but might be necessary for valuable data.
We respect robots.txt for all our scrapers. It's a small constraint that significantly reduces legal exposure. And in practice, most websites allow scraping of public pages—they just block search engine crawlers from indexing certain sections.
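Checking robots.txt before each crawl is easy to automate with Python's standard library. This minimal sketch parses rules offline (the rules and URLs are illustrative):

```python
from urllib.robotparser import RobotFileParser

def make_robots_checker(robots_txt: str, user_agent: str = "*"):
    """Build a can_fetch(url) predicate from raw robots.txt content."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return lambda url: parser.can_fetch(user_agent, url)

# Illustrative rules: block /api/, allow everything else
rules = """User-agent: *
Disallow: /api/
"""
can_fetch = make_robots_checker(rules)
print(can_fetch("https://example.com/products/123"))  # True
print(can_fetch("https://example.com/api/internal"))  # False
```

In production you would fetch the live robots.txt with `RobotFileParser.set_url()` and `read()` instead of parsing a string, and re-check it periodically since sites update their rules.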
Copyright protects creative expression, not facts. You can't copyright the fact that a product costs $19.99, but you can copyright the product description. Scraping prices is safe; scraping entire product descriptions might infringe copyright.
In the EU, database rights add another layer. The Database Directive protects the "investment" in creating databases, even if the individual facts aren't copyrightable. Scraping a substantial portion of a database might violate these rights, even if you're only extracting facts.
Practical takeaway: Extract facts, not creative content. Scrape prices, not product descriptions. Scrape job titles, not job descriptions. And if you're operating in the EU, be extra cautious about scraping entire databases.
Beyond legality, there's ethics. Just because scraping is legal doesn't mean it's right. Scraping consumes server resources someone else pays for, can sweep up personal information, and can harm the business being scraped.
We've adopted a simple ethical framework: scrape only what you need, respect rate limits, and don't scrape personal information. If a website is clearly struggling with our traffic (slow responses, error messages), we back off. The data isn't worth harming someone's business.
Assuming you've navigated the legal minefield, the next challenge is technical: how do you scrape websites that don't want to be scraped?
Modern websites employ sophisticated anti-bot systems: rate limiting, user agent filtering, JavaScript challenges, CAPTCHAs, and behavioral analysis. Beating these systems requires understanding how they work and building scrapers that mimic human behavior.
The simplest anti-bot measure is user agent filtering. Websites check the User-Agent header in HTTP requests to identify bots. If you're using Python's requests library with the default user agent ("python-requests/2.28.0"), you'll get blocked immediately.
The solution is to rotate user agents, pretending to be different browsers. But it's not enough to just set a Chrome user agent—you need to set all the headers that Chrome sends: Accept, Accept-Language, Accept-Encoding, DNT, Connection, Upgrade-Insecure-Requests, etc. Missing or inconsistent headers are a dead giveaway.
We use the fake-useragent library to generate realistic user agents, and we manually craft header sets that match real browsers. This isn't foolproof—sophisticated systems fingerprint browsers based on header order and values—but it defeats simple filters.
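A consistent header set looks roughly like the following sketch, using only the standard library. The user agent strings and header values are examples of what a desktop Chrome build sends, not an exhaustive or current list:

```python
import random
import urllib.request

# A small pool of realistic desktop user agents (example values only)
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36",
]

def browser_headers() -> dict:
    """A header set consistent with the Chrome user agent it claims to be."""
    return {
        "User-Agent": random.choice(USER_AGENTS),
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
        "Accept-Encoding": "gzip, deflate, br",
        "Connection": "keep-alive",
        "Upgrade-Insecure-Requests": "1",
    }

# Attach the full header set to every request, not just the User-Agent
req = urllib.request.Request("https://example.com/", headers=browser_headers())
```

The point is consistency: a Chrome user agent paired with missing Accept-Language or an unusual Accept value is exactly the mismatch that fingerprinting systems look for.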
Many modern websites load content dynamically via JavaScript. If you're using requests to fetch HTML, you'll get an empty page or a loading spinner. The actual content is loaded by JavaScript after the page renders.
The solution is to use a headless browser: Selenium, Playwright, or Puppeteer. These tools run a real browser (Chrome or Firefox) in the background, execute JavaScript, and return the fully rendered HTML. The downside is speed—rendering JavaScript is 10-100x slower than fetching static HTML—and resource usage. Running hundreds of headless browsers requires significant CPU and memory.
We use headless browsers selectively, only for sites that require JavaScript. For static sites, we stick with requests for speed. And we cache rendered pages aggressively to avoid re-rendering the same content.
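The caching layer can be sketched like this. The renderer is injected as a callable (in practice a Playwright- or Selenium-backed function; here it is left abstract), so the cache itself stays simple and testable:

```python
import hashlib
import time

class RenderCache:
    """Cache rendered HTML by URL so the same page isn't re-rendered.

    `render` is any callable mapping a URL to HTML (e.g. a headless-browser
    fetch); it is injected so this sketch works with any backend.
    """
    def __init__(self, render, ttl_seconds: float = 3600):
        self.render = render
        self.ttl = ttl_seconds
        self._store = {}  # url hash -> (timestamp, html)

    def get(self, url: str) -> str:
        key = hashlib.sha256(url.encode()).hexdigest()
        cached = self._store.get(key)
        if cached and time.time() - cached[0] < self.ttl:
            return cached[1]           # cache hit: skip the expensive render
        html = self.render(url)        # cache miss: render and store
        self._store[key] = (time.time(), html)
        return html
```

A real deployment would back `_store` with Redis or disk so the cache survives restarts, but the hit/miss logic is the same.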
CAPTCHAs are designed to distinguish humans from bots. The classic "type these distorted letters" CAPTCHAs are mostly defeated—commercial CAPTCHA-solving services use human labor or machine learning to solve them for $1-3 per thousand CAPTCHAs.
Modern CAPTCHAs like Google's reCAPTCHA v3 are more sophisticated. They analyze mouse movements, typing patterns, browser fingerprints, and behavioral signals to assign a "humanity score." If your score is too low, you get challenged or blocked.
Beating reCAPTCHA v3 requires mimicking human behavior: moving the mouse naturally, pausing before clicking, varying typing speed. Libraries like undetected-chromedriver patch Chromium to hide automation signals. But this is an arms race—Google constantly updates reCAPTCHA to detect new evasion techniques.
Our approach: avoid sites with aggressive CAPTCHAs. If the data is valuable enough to justify the cost, we use commercial CAPTCHA-solving services. But we've found that most sites with strong CAPTCHAs aren't worth scraping—the data is either low-quality or available elsewhere.
Even if you evade detection, you need to avoid overwhelming servers. Sending 1000 requests per second to a small e-commerce site will get you blocked (and might crash their server).
The polite approach is to rate-limit yourself: 1-2 requests per second for small sites, 10-20 for large sites. Add random jitter to avoid perfectly regular intervals (which look robotic). And respect HTTP 429 ("Too Many Requests") responses by backing off exponentially.
We implement rate limiting at multiple levels: per-domain limits (don't overwhelm any single site), global limits (don't overwhelm our network), and adaptive limits (slow down if we see errors or slow responses). This keeps us under the radar and maintains good relationships with websites.
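The per-domain piece of that scheme, with jitter and exponential backoff on HTTP 429, can be sketched as follows (delays and limits are illustrative defaults, not tuned values):

```python
import random
import time

class DomainRateLimiter:
    """Per-domain delay with jitter, plus exponential backoff on HTTP 429."""

    def __init__(self, base_delay: float = 2.0, jitter: float = 0.5,
                 max_backoff: float = 300.0):
        self.base_delay = base_delay
        self.jitter = jitter
        self.max_backoff = max_backoff
        self._backoff = {}    # domain -> current backoff multiplier
        self._last_hit = {}   # domain -> timestamp of last request

    def wait(self, domain: str) -> float:
        """Sleep long enough to respect the domain's current delay."""
        delay = self.base_delay * self._backoff.get(domain, 1)
        delay += random.uniform(0, self.jitter)  # jitter: avoid robotic intervals
        delay = min(delay, self.max_backoff)
        elapsed = time.time() - self._last_hit.get(domain, 0)
        if elapsed < delay:
            time.sleep(delay - elapsed)
        self._last_hit[domain] = time.time()
        return delay

    def record(self, domain: str, status: int) -> None:
        """Double the delay on 429 responses, reset it on success."""
        if status == 429:
            self._backoff[domain] = self._backoff.get(domain, 1) * 2
        elif 200 <= status < 300:
            self._backoff[domain] = 1
```

Call `wait()` before each request and `record()` after it; a burst of 429s quickly pushes the delay toward `max_backoff`, and a clean response snaps it back.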
For one-off scraping tasks, a simple Python script with requests and BeautifulSoup suffices. But for production systems—scraping hundreds of sites continuously, handling failures gracefully, storing data reliably—you need a framework. Scrapy is the industry standard.
Scrapy provides request scheduling and deduplication, configurable concurrency and download delays, automatic retries, downloader middlewares for customizing requests, and item pipelines for validating and storing scraped data.
Here's a production-grade Scrapy spider that scrapes e-commerce product data, with all the anti-detection measures we've discussed:
```python
import re
from datetime import datetime

import scrapy


class EcommerceProductSpider(scrapy.Spider):
    """Production spider for e-commerce product data"""

    name = 'ecommerce_products'

    custom_settings = {
        'CONCURRENT_REQUESTS': 16,
        'DOWNLOAD_DELAY': 2,               # 2 seconds between requests
        'RANDOMIZE_DOWNLOAD_DELAY': True,  # add jitter
        'COOKIES_ENABLED': False,
        'RETRY_TIMES': 3,
        'RETRY_HTTP_CODES': [500, 502, 503, 504, 408, 429],

        # Rotate user agents
        'DOWNLOADER_MIDDLEWARES': {
            'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
            'scrapy_user_agents.middlewares.RandomUserAgentMiddleware': 400,
        },

        # Data pipelines
        'ITEM_PIPELINES': {
            'myproject.pipelines.ValidationPipeline': 300,
            'myproject.pipelines.DeduplicationPipeline': 400,
            'myproject.pipelines.DatabasePipeline': 500,
        },
    }

    def __init__(self, category_urls=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.category_urls = category_urls or []

    def start_requests(self):
        """Generate initial requests"""
        for url in self.category_urls:
            yield scrapy.Request(url, callback=self.parse_category)

    def parse_category(self, response):
        """Parse category page and extract product links"""
        # Extract product links (adjust selectors for the target site)
        product_links = response.css('a.product-link::attr(href)').getall()
        for link in product_links:
            yield response.follow(link, callback=self.parse_product)

        # Follow pagination
        next_page = response.css('a.next-page::attr(href)').get()
        if next_page:
            yield response.follow(next_page, callback=self.parse_category)

    def parse_product(self, response):
        """Parse product page and extract data"""
        yield {
            'url': response.url,
            'scraped_at': datetime.utcnow().isoformat(),
            'title': response.css('h1.product-title::text').get(),
            'price': self._parse_price(response.css('span.price::text').get()),
            'availability': response.css('span.availability::text').get(),
            'rating': self._parse_float(response.css('div.rating::attr(data-rating)').get()),
            'num_reviews': self._parse_int(response.css('span.num-reviews::text').get()),
            'brand': response.css('span.brand::text').get(),
            'category': response.css('span.category::text').get(),
        }

    @staticmethod
    def _parse_price(price_str):
        """Extract numeric price from a string like '$1,299.99'"""
        if not price_str:
            return None
        price_clean = re.sub(r'[^\d.]', '', price_str)
        try:
            return float(price_clean)
        except ValueError:
            return None

    @staticmethod
    def _parse_float(s):
        try:
            return float(s) if s else None
        except ValueError:
            return None

    @staticmethod
    def _parse_int(s):
        if not s:
            return None
        int_clean = re.sub(r'[^\d]', '', s)
        try:
            return int(int_clean)
        except ValueError:
            return None
```

This spider handles the basics: following links, extracting data, parsing messy HTML. But production systems need more: data validation, deduplication, and storage. That's where Scrapy's pipeline system shines.
Scraped data is messy. Prices might be missing, formatted incorrectly, or negative. URLs might be malformed. Dates might be in unexpected formats. Without validation, garbage data flows into your database and corrupts your signals.
We implement validation at multiple stages: field-level checks as items are scraped (types, formats, required fields), pipeline-level checks before storage (ranges, sanity bounds), and aggregate checks after storage (completeness rates, distribution shifts).
Items that fail validation are logged for manual review. We don't silently drop bad data—that hides problems. Instead, we track validation failure rates and alert if they spike, indicating a site redesign or scraper bug.
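A validation check in that spirit might look like this sketch. The field names match the spider example above; the bounds and rules are illustrative and would be tuned per data source:

```python
def validate_product(item: dict) -> list:
    """Return a list of validation failures (empty list means the item is clean)."""
    failures = []
    if not item.get("title"):
        failures.append("missing title")
    price = item.get("price")
    if price is None:
        failures.append("missing price")
    elif not (0 < price < 100_000):  # sanity bounds: catch negatives and parse errors
        failures.append(f"implausible price: {price}")
    url = item.get("url", "")
    if not url.startswith(("http://", "https://")):
        failures.append(f"malformed url: {url!r}")
    return failures

item = {"title": "Widget", "price": -5.0, "url": "https://example.com/p/1"}
print(validate_product(item))  # ['implausible price: -5.0']
```

Returning the failure list, rather than a bare boolean, is what makes the "log for manual review, alert on spikes" workflow possible: every rejection carries its reason.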
One of our most successful scraping strategies tracks pricing across e-commerce sites to identify arbitrage opportunities and predict retail earnings.
The setup is straightforward: scrape prices for 10,000 products across 20 major retailers (Amazon, Walmart, Target, Best Buy, etc.) daily. Calculate price indices by category (electronics, home goods, apparel). Compare to historical averages and competitors.
The signals, price momentum chief among them, predict same-store sales and gross margins, which drive retail earnings. In backtests (2018-2023), our price momentum signal predicted earnings surprises with 65% accuracy: not amazing, but tradeable.
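The index calculation behind the momentum signal reduces to comparing category-level average prices against a historical baseline. A minimal sketch, with made-up numbers:

```python
from statistics import mean

def price_momentum(current: dict, baseline: dict) -> dict:
    """Per-category average price change versus a historical baseline.

    `current` and `baseline` map category -> list of product prices.
    A positive value suggests pricing power; a negative one, discounting.
    """
    momentum = {}
    for category, prices in current.items():
        base = baseline.get(category)
        if not prices or not base:
            continue  # skip categories without both snapshots
        momentum[category] = mean(prices) / mean(base) - 1.0
    return momentum

current = {"electronics": [130.0, 120.0], "apparel": [20.0, 20.0]}
baseline = {"electronics": [100.0, 100.0], "apparel": [20.0, 20.0]}
print(price_momentum(current, baseline))
# {'electronics': 0.25, 'apparel': 0.0}
```

In production the baseline would be a trailing average over matched products (same SKU at both dates) to avoid composition effects, but the core ratio is the same.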
The challenge is scale. Scraping 10,000 products across 20 sites daily means 200,000 requests per day. At 2 seconds per request (to avoid rate limiting), that's 111 hours of scraping, far more than fits in a day. The solution is parallelization: run 10 scrapers concurrently, each handling 20,000 requests (about 11 hours of work). This requires careful coordination to avoid duplicate requests and ensure data consistency.
We use Scrapy Cluster, a distributed scraping framework built on Redis and Kafka. Scrapers pull URLs from a Redis queue, scrape them, and push results to Kafka. A separate consumer process reads from Kafka and writes to our database. This architecture scales horizontally—add more scrapers to increase throughput—and handles failures gracefully.
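Scrapy Cluster wires this pattern through Redis and Kafka; the coordination pattern itself, workers pulling from a shared queue so no URL is scraped twice, can be sketched in-process with standard-library queues:

```python
import queue
import threading

def worker(url_queue: queue.Queue, result_queue: queue.Queue, scrape) -> None:
    """Pull URLs until the queue drains, push results downstream.

    In the real system url_queue is Redis and result_queue is Kafka;
    plain queues here just to show the coordination.
    """
    while True:
        try:
            url = url_queue.get_nowait()
        except queue.Empty:
            return
        result_queue.put(scrape(url))
        url_queue.task_done()

url_queue, result_queue = queue.Queue(), queue.Queue()
for i in range(100):
    url_queue.put(f"https://example.com/product/{i}")

# Ten concurrent workers share one queue: each URL is claimed exactly once
threads = [threading.Thread(target=worker,
                            args=(url_queue, result_queue, lambda u: {"url": u}))
           for _ in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(result_queue.qsize())  # 100
```

The horizontal-scaling property falls out of the queue semantics: adding workers increases throughput without any worker needing to know about the others.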
Job postings are a leading indicator of economic activity. When companies start hiring, they're expecting growth. When they stop posting jobs, they're bracing for contraction. This signal is particularly valuable for sector rotation—identifying which industries are expanding or contracting before it shows up in employment data.
We scrape job postings from corporate career pages and aggregators (Indeed, LinkedIn, Glassdoor). For each posting, we extract: company, job title, location, posting date, and job category (engineering, sales, operations, etc.).
The signals center on hiring momentum and mix: how fast a company is adding postings, and which functions it is growing.
In 2019, we noticed that enterprise software companies were dramatically increasing engineering headcount while reducing sales headcount. This suggested a shift from customer acquisition to product development—a sign of market saturation. We reduced exposure to high-valuation SaaS stocks. Six months later, the sector corrected as growth slowed.
The challenge with job postings is noise. Companies post jobs they don't intend to fill (to build talent pipelines), repost old jobs, and leave expired postings online. We filter noise by tracking posting duration (jobs open >90 days are probably not real), deduplicating reposts (using title and description similarity), and focusing on net changes (new postings minus removals) rather than absolute counts.
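The repost deduplication step can be sketched with standard-library string similarity; the 0.9 threshold and field names are illustrative, and production systems would use something faster than pairwise comparison:

```python
from difflib import SequenceMatcher

def is_repost(posting: dict, seen: list, threshold: float = 0.9) -> bool:
    """Flag a posting whose title+description closely match a prior posting."""
    text = posting["title"] + " " + posting["description"]
    for prior in seen:
        prior_text = prior["title"] + " " + prior["description"]
        if SequenceMatcher(None, text, prior_text).ratio() >= threshold:
            return True
    return False

def net_change(new_postings: int, removed_postings: int) -> int:
    """Net hiring signal: additions minus removals, not absolute counts."""
    return new_postings - removed_postings

seen = [{"title": "Senior Backend Engineer",
         "description": "Build distributed systems in Go."}]
repost = {"title": "Senior Backend Engineer",
          "description": "Build distributed systems in Go!"}
print(is_repost(repost, seen))  # True
```

`SequenceMatcher` is quadratic per pair, so at scale you would bucket postings by company and title hash first and only compare within buckets.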
Real estate listings predict housing market trends before official data. Zillow, Redfin, and Realtor.com publish listings in real-time; government data lags by months. For REITs, homebuilders, and mortgage-related stocks, this edge matters.
We scrape listings daily, extracting: address, price, square footage, bedrooms, bathrooms, listing date, and status (active, pending, sold). We calculate metrics: inventory (active listings), days on market, price per square foot, and sale-to-list price ratio.
The signals center on inventory and days on market: rising inventory and lengthening DOM point to softening demand, and a falling sale-to-list ratio confirms it.
These signals predicted the 2022 housing slowdown months before official data. In early 2022, we saw inventory rising and DOM increasing in key markets (Phoenix, Austin, Boise). We reduced exposure to homebuilders and mortgage REITs. When the Fed raised rates and housing crashed, we avoided significant losses.
The scraping challenge is scale and complexity. Real estate sites use heavy JavaScript, infinite scroll, and aggressive bot detection. We use Playwright (a headless browser) with careful timing: scroll slowly, pause between actions, vary mouse movements. This mimics human behavior well enough to avoid detection.
Building a scraper is one thing; keeping it running is another. Websites change constantly—redesigns break selectors, new anti-bot measures block requests, servers go down. Production scraping requires constant monitoring and rapid response.
We track request success rates, scrape latency, item counts per site, and data completeness (the fraction of items with all expected fields populated). Alerts trigger when success rates drop sharply, item counts fall below historical norms, or completeness degrades: each is a symptom of a block, an outage, or a site redesign.
When a site redesigns, selectors break. The spider returns empty data or errors. We detect this through data completeness monitoring and respond quickly:
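Completeness monitoring reduces to a simple check. In this sketch the baseline and tolerance are illustrative; in practice both would be derived from each site's history:

```python
def completeness_rate(items: list, required_fields: tuple) -> float:
    """Fraction of items with every required field populated."""
    if not items:
        return 0.0
    complete = sum(1 for item in items
                   if all(item.get(f) not in (None, "") for f in required_fields))
    return complete / len(items)

def detect_breakage(rate: float, baseline: float = 0.95,
                    tolerance: float = 0.10) -> bool:
    """Alert when completeness drops well below its historical baseline.

    A sudden drop usually means a redesign broke the selectors.
    """
    return rate < baseline - tolerance

items = [{"title": "A", "price": 9.99}, {"title": "B", "price": None}]
rate = completeness_rate(items, ("title", "price"))
print(rate, detect_breakage(rate))  # 0.5 True
```

The key design choice is alerting on a drop relative to history rather than an absolute threshold, since "normal" completeness varies widely between sites.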
We version our spiders and keep old versions around. If a deployment breaks, we can roll back instantly. And we test changes on a subset of URLs before full deployment.
Even with legal clearance and technical capability, we maintain ethical standards: minimal data collection, conservative rate limits, no personal information, and backing off when a site shows strain.
This isn't just ethics—it's risk management. Maintaining good relationships with websites reduces legal risk and ensures long-term data access.
Web scraping for trading is maturing. The legal landscape is clarifying (slowly), the tools are improving, and the competition is intensifying. The strategies that worked in 2015—simple scraping with no anti-detection measures—no longer work. Websites have gotten smarter, and so must scrapers.
The future belongs to firms that combine scraping with other data sources (APIs, partnerships, purchases), use sophisticated NLP to extract signals from unstructured text, and move faster than competitors. The data is democratizing, but the expertise remains scarce.
If you're building a scraping strategy today, focus on data quality over quantity. Scrape fewer sites better rather than more sites poorly. Invest in monitoring and maintenance—a scraper that breaks silently is worse than no scraper at all. And always, always consult a lawyer before scraping at scale.
The internet is the world's largest database. The question is: can you query it without getting blocked—or sued?