EDGAR is Useless (And That's the Point)
Why most quantitative strategies fail on public data—and how to structure it so they don't

You can download every SEC filing ever published. That doesn’t mean you’ve learned anything. This is the central paradox of event-driven trading in the age of big data. We have more information than ever, yet most quantitative strategies built on public filings collapse under the weight of their own noise.
The Electronic Data Gathering, Analysis, and Retrieval (EDGAR) system is a treasure trove. It’s comprehensive, public, and updated in real time. It contains the ground truth for every publicly traded company in the United States. And for most traders, it’s completely useless.
The prevailing wisdom is that if a strategy fails, the market must be too efficient. But this is a convenient excuse. Most failed “event-driven” strategies don’t fail because the market is efficient; they fail because the data is structurally misused. Before you can find a signal, you have to stop lying to your model. And most models are being fed lies.
This article isn’t another tutorial on how to download filings from EDGAR. It’s a guide on how to think about structuring that data so it doesn’t defeat you before you even begin. It’s about understanding why the most sophisticated traders don’t spend their time building faster parsers—they spend it building better classifiers.
The Illusion of Information
There’s a seductive belief in quantitative finance: more data equals better models. If you can just get your hands on all the SEC filings, all the earnings calls, all the insider transactions, you’ll have an edge. The market will be transparent to you. You’ll see patterns no one else sees.
This belief is wrong. And it’s dangerous.
The problem isn’t the absence of data; it’s the presence of too much data, most of which is noise. A 10-K filing can be 200 pages long. A 10-Q can be 100 pages. An 8-K can be anywhere from 5 to 50 pages. Inside those pages are thousands of words of boilerplate legal language, risk disclaimers, accounting footnotes, and regulatory compliance text. Buried somewhere in that mountain of words might be a single sentence that actually matters: “Our lead drug failed Phase 3 trials.”
Most quantitative strategies treat all of this text equally. They download the filings, run them through a parser, extract some features (word counts, sentiment scores, keyword matches), and feed the result into a model. The model then tries to find a pattern. It almost never does. And the trader concludes that the market is efficient, that there’s no edge to be found in public data.
But the real problem is that the model is being asked to find a signal in a dataset that is 99% noise. It’s like trying to hear a whisper in a stadium full of screaming fans. The whisper is there, but you can’t hear it because you’re not filtering out the crowd.
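To make the anti-pattern concrete, here is roughly what that naive featurization looks like. This is an illustrative sketch; the function name and the tiny sentiment word lists are placeholders of mine, not a reference implementation:

import re
from collections import Counter

def naive_features(filing_text):
    """The anti-pattern: treat the whole filing as one undifferentiated blob."""
    words = re.findall(r"[a-z']+", filing_text.lower())
    counts = Counter(words)
    # Crude sentiment: count hits against tiny hand-picked word lists.
    positive = sum(counts[w] for w in ["approved", "growth", "successful"])
    negative = sum(counts[w] for w in ["failed", "decline", "risk"])
    return {
        "word_count": len(words),
        "sentiment": (positive - negative) / max(len(words), 1),
        "fda_mentions": counts["fda"],
    }

Every word of boilerplate gets the same vote as the one sentence that matters; the “risk” count alone is dominated by the Risk Factors section of every filing ever written.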
Why EDGAR Data is Deceptively Hard
To understand why EDGAR defeats naive approaches, you have to understand what EDGAR actually is. It’s not a database designed for traders. It’s a regulatory filing system designed for compliance. The SEC requires companies to file documents. Companies comply. The documents get posted. That’s the system.
This creates several structural problems:
Filings Mix Signal with Compliance. A filing might contain a single sentence about a failed clinical trial buried within thousands of words of boilerplate legal text. A naive model sees a wall of text; a smart one sees a needle in a haystack. The problem is that most models are naive. They treat the entire filing as equally important, which means the critical information gets drowned out by the noise.
Event Labels are Inconsistent. The same material event can be reported under different item numbers or described in wildly different terms. One company might disclose a clinical trial result under Item 8.01 (Other Events) of an 8-K. Another might bury it in the risk factors section of a 10-Q. A third might mention it only in passing in a 10-K. If your strategy relies on finding filings with a specific item number or form type, you will miss half the events. If your strategy relies on keyword matching, you will find half the noise.
The Same Form Means Radically Different Things. This is the most important insight. An 8-K announcing a new credit agreement is a routine operational update for a pharmaceutical giant like Pfizer. For a pre-revenue biotech startup, it could be a lifeline that signals another year of runway. The form is the same; the meaning is worlds apart. A model trained on both will learn an average effect that represents neither.
Most Filings Are Operational Exhaust. The vast majority of SEC filings contain nothing of interest to a trader. They are routine quarterly reports, routine annual reports, routine proxy statements. They are the administrative overhead of being a public company. Treating them as potential sources of alpha is like trying to find trading signals in a company’s email archive. You might find something, but you’ll have to wade through millions of irrelevant messages to get there.
The first step in any robust EDGAR strategy is not analysis, but aggressive, intelligent filtering. You have to separate the signal from the noise before you can model either one.
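As a sketch of what that first-pass filter might look like: the form whitelist and the “routine” 8-K item codes below are illustrative assumptions to show the shape of the idea, not a vetted taxonomy.

INTERESTING_FORMS = {"8-K", "10-K", "10-Q"}
# 8-K item codes treated as routine here (e.g., 9.01 is "Financial Statements
# and Exhibits"); which items count as routine for your universe is a judgment call.
ROUTINE_8K_ITEMS = {"5.07", "7.01", "9.01"}

def keep_filing(filing):
    """filing: a dict with 'formType' and, for 8-Ks, a list of 'items'."""
    if filing["formType"] not in INTERESTING_FORMS:
        return False
    items = set(filing.get("items", []))
    # Drop 8-Ks whose items are all routine; keep everything else for review.
    if filing["formType"] == "8-K" and items and items <= ROUTINE_8K_ITEMS:
        return False
    return True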
Stage Matters More Than Events
Here’s a question that most EDGAR-based strategies never ask: What does this company actually do?
This might sound obvious, but it’s the most important question you can ask. And the answer determines everything about how you should model the company’s filings.
For a biotech company, the answer falls into one of two categories: either the company sells something, or it doesn’t. This simple binary distinction creates two fundamentally different regimes:
Clinical-Stage (Pre-Revenue). These are R&D-focused firms, often burning through cash in the hope of getting a single product to market. For them, regulatory news (clinical trial results, FDA meetings, patent rulings) is existential. A positive result can create a company overnight; a negative one can bankrupt it. Their stock price is driven by the probability of success of one or a few binary events. The volatility is extreme. The stakes are absolute.
Commercial-Stage (Revenue-Generating). These firms have approved products and established sales channels. While regulatory news still matters, it is often administrative or incremental. Their stock price is more sensitive to earnings, competition, and sales figures. Their volatility is lower. Their future is less dependent on a single binary outcome.
This distinction is not unique to biotech. It applies to any sector where companies move through distinct development stages. But it is most pronounced in biotech, where the difference between a pre-revenue startup and a commercial pharma company is the difference between zero and billions of dollars in annual revenue.
Now here’s the critical insight: treating these two types of companies as a monolith destroys any potential signal. A model that tries to find a consistent pattern in 8-K filings across both regimes will find nothing, because no such consistent pattern exists. The effect of the same event is conditional on the company’s stage. A positive clinical trial result might cause a +30% jump in a clinical-stage stock. A minor regulatory update might cause a -0.5% dip in a commercial-stage one. A model trained on both will learn the equal-weighted average of +14.75%, a number that represents neither regime and is useless for prediction.
This is not a biotech anecdote; it’s a universal principle of financial modeling: apply structure before searching for alpha. The structure comes first. The modeling comes second. If you skip the structure step, you will inevitably conclude there is no alpha, when in fact there is no structure.
Technical Walkthrough: Building a Context Layer
Let’s build a simple, rule-based classifier to separate clinical-stage from commercial-stage companies. We’re not predicting anything yet. We’re just trying to stop lying to ourselves by creating a foundational context layer. Our goal is transparency and robustness, not complexity.
Fetching Filings
First, we need to get the data. While you can parse EDGAR’s RSS feeds manually, using a service like SEC API (sec-api.io) is far more practical. It provides structured JSON data for filings, saving you from the headache of parsing raw HTML and XML.
Here’s a lightweight example of how to fetch the latest 10-K for a given ticker:
import requests

SEC_API_KEY = "YOUR_API_KEY"

def get_latest_10k(ticker):
    """Fetches the latest 10-K filing URL for a given ticker."""
    query = {
        "query": {"query_string": {
            "query": f"ticker:{ticker} AND formType:\"10-K\""
        }},
        "from": "0",
        "size": "1",
        "sort": [{"filedAt": {"order": "desc"}}]
    }
    headers = {"Authorization": SEC_API_KEY}
    response = requests.post("https://api.sec-api.io", json=query, headers=headers)
    response.raise_for_status()
    filings = response.json()["filings"]
    if not filings:
        return None
    return filings[0]["linkToFilingDetails"]

This query searches for all 10-K filings for a given ticker, sorts by filing date (most recent first), and returns the URL to the filing details. The SEC API handles all the parsing of EDGAR’s XML structure for you. In production, you’d fetch the full text of the filing from this URL and pass it to your classifier.
Don’t dwell on the fetching. The important work starts once you have the text.
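If you do need to pull the document body yourself, a minimal version looks like this. Note that EDGAR’s fair-access policy requires a descriptive User-Agent header; the contact string below is a placeholder you must replace:

def fetch_filing_text(url):
    """Downloads a raw filing document from EDGAR."""
    # The SEC asks automated clients to identify themselves with a
    # User-Agent containing a name and a contact address.
    headers = {"User-Agent": "Your Name your.email@example.com"}
    response = requests.get(url, headers=headers)
    response.raise_for_status()
    return response.text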
Extracting Meaning, Not Just Text
For our classifier, we are primarily interested in the “Business” section (Item 1) and “Risk Factors” (Item 1A) of a 10-K filing. This is where the company describes what it does, how it makes money (or plans to), and what could go wrong. These sections are the most reliable indicators of a company’s stage.
Here’s why: A clinical-stage company will explicitly state that it has no revenue. It will describe its pipeline of drugs in development. It will talk about clinical trials, regulatory pathways, and the probability of success. A commercial-stage company will describe its approved products, its sales channels, its market share, and its competitive position. These are fundamentally different narratives.
The key is to extract the relevant sections and look for specific keywords and phrases. We’re not trying to build a sophisticated NLP model here. We’re trying to build a transparent, auditable classifier that anyone can understand and debug.
Here’s a function to extract the Business section from a 10-K filing:
def extract_business_section(filing_text):
    """
    Extracts the Business section (Item 1) from a 10-K filing.
    This is a simplified version; in production, you'd use more robust parsing.
    """
    lowered = filing_text.lower()
    # Find the start of Item 1
    item_1_start = lowered.find("item 1.")
    if item_1_start == -1:
        return None
    # Find the start of Item 1A (Risk Factors), which comes after Item 1
    item_1a_start = lowered.find("item 1a.", item_1_start)
    if item_1a_start == -1:
        # If Item 1A doesn't exist, use Item 2 as the end boundary
        item_2_start = lowered.find("item 2.", item_1_start)
        item_1a_start = item_2_start if item_2_start != -1 else len(filing_text)
    # Extract the text between Item 1 and Item 1A
    business_section = filing_text[item_1_start:item_1a_start]
    return business_section

This function is simple and robust. It finds the Business section by looking for the item numbers. It’s not perfect (some companies format their filings differently, and the item headings also appear in the table of contents), but it works for the vast majority of cases. The key insight is that we’re not trying to parse the entire filing; we’re just trying to extract the section that matters.
A Simple, Rule-Based Classifier
Now, let’s build our classifier. We will define three lists of keywords: explicit negations, commercial-stage indicators, and clinical-stage indicators. Negations are checked first, because a sentence like “we have not generated any revenue from product sales” contains the commercial keyword “revenue from product sales” as a substring and would otherwise be misread as commercial. After that, any commercial keyword classifies the company as commercial; if only clinical keywords (or none) are found, we default to clinical. This conservative ordering prevents us from misclassifying a large pharma company as clinical-stage just because it mentions R&D.
def classify_company_stage(filing_text):
    """
    Classifies a company as 'Commercial-Stage' or 'Clinical-Stage' based on keywords
    found in its 10-K filing text.
    """
    filing_text = filing_text.lower()
    # Explicit negations, checked first: "have not generated any revenue from
    # product sales" contains the commercial keyword "revenue from product sales"
    # as a substring, so checking commercial keywords first would misclassify
    # pre-revenue companies.
    negation_keywords = [
        "have not generated any revenue from product sales",
        "no products approved"
    ]
    # Keywords that strongly indicate a company is generating revenue from product sales.
    commercial_keywords = [
        "revenue from product sales", "net product sales", "commercial sales",
        "product revenue", "we sell products", "our products are sold",
        "fda-approved product", "approved by the fda"
    ]
    # Keywords that indicate a company is primarily in the R&D and clinical trial phase.
    clinical_keywords = [
        "pre-revenue", "clinical-stage", "drug discovery", "preclinical",
        "phase 1", "phase 2", "phase 3", "clinical trial"
    ]
    for keyword in negation_keywords:
        if keyword in filing_text:
            return "Clinical-Stage"
    # Prioritize commercial keywords. If any are found, the company is commercial.
    for keyword in commercial_keywords:
        if keyword in filing_text:
            return "Commercial-Stage"
    # If no commercial keywords are found, check for clinical ones as a confirmation.
    for keyword in clinical_keywords:
        if keyword in filing_text:
            return "Clinical-Stage"
    # Default to Clinical-Stage if no definitive commercial keywords are found.
    # This is a conservative assumption for biotech.
    return "Clinical-Stage"

This classifier is simple, transparent, and auditable. You can look at the text and see exactly why a decision was made. If you feel the need to use a neural network at this stage, something has already gone wrong. The goal here is not to build a black box, but to impose a clear, understandable structure on the data.
Let’s test it with two examples:
# Example 1: A company that explicitly states it has no revenue.
text_clinical = """
We are a clinical-stage biopharmaceutical company focused on drug discovery.
To date, we have not generated any revenue from product sales. Our lead program
is currently in Phase 2 clinical trials.
"""
stage = classify_company_stage(text_clinical)
print(f"Example 1: {stage}")  # Output: Clinical-Stage

# Example 2: A company that reports product revenue.
text_commercial = """
Our total revenue from product sales was $500 million for the fiscal year.
Our lead drug was approved by the FDA in 2022 and has achieved significant
market penetration.
"""
stage = classify_company_stage(text_commercial)
print(f"Example 2: {stage}")  # Output: Commercial-Stage

The first example triggers the clinical-stage branch because it contains the negation phrase “have not generated any revenue from product sales.” The second example triggers the commercial-stage branch because it contains “revenue from product sales” and “approved by the FDA.”
This is not a sophisticated algorithm. But that’s the point. The goal is to create a transparent, reproducible, and maintainable classifier that works for the vast majority of cases. Once you have this baseline, you can refine it. You can add more keywords. You can weight certain keywords more heavily. You can even build a machine learning model on top of it. But the foundation should always be simple and understandable.
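For example, a first refinement might replace the hard-coded precedence rules with a weighted score. The weights below are arbitrary placeholders meant to show the shape of the idea, not calibrated values:

# Hypothetical extension: score keywords instead of returning on the first hit.
# Positive weights argue for Commercial-Stage, negative for Clinical-Stage.
WEIGHTED_KEYWORDS = {
    "net product sales": 3.0,
    "approved by the fda": 2.0,
    "clinical-stage": -2.0,
    "have not generated any revenue from product sales": -3.0,
}

def classify_weighted(filing_text, threshold=0.0):
    """Sums the weights of all matched keywords and thresholds the result."""
    text = filing_text.lower()
    score = sum(w for kw, w in WEIGHTED_KEYWORDS.items() if kw in text)
    return "Commercial-Stage" if score > threshold else "Clinical-Stage"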
Why This Comes Before Modeling
This pre-processing step is not optional. It is the most critical part of the entire workflow. Let me explain why with a concrete example.
Imagine you build a model that tries to predict stock returns based on 8-K filings. You train it on 10 years of historical data, using all 8-K filings from all biotech companies. You get a result: the average 8-K filing predicts a +0.21% return on the day of filing.
This is a useless result. But why? Because you’ve pooled two very different effects. When a clinical-stage company files an 8-K with positive news (e.g., successful trial results), the stock can jump +30%. When a commercial-stage company files an 8-K with routine news (e.g., a new credit agreement), the stock barely moves (+0.1%). Because routine commercial filings vastly outnumber clinical catalysts, the pooled, frequency-weighted average lands near +0.21%. This number represents neither effect and is useless for prediction.
But if you had separated the data into two regimes before modeling, you would have discovered that:
• Clinical-stage 8-Ks predict +4.94% returns (highly significant)
• Commercial-stage 8-Ks predict +0.22% returns (not significant)
Now you have two actionable insights instead of one useless average. You can build separate models for each regime. You can apply different trading strategies to each. You can actually make money.
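Here is a sketch of what that per-regime evaluation looks like in pandas. The event table is hypothetical; the returns and column names are illustrative, not results from a real backtest:

import pandas as pd

# One row per 8-K: the filing-day return and the stage label
# produced by classify_company_stage().
events = pd.DataFrame({
    "stage": ["Clinical-Stage", "Commercial-Stage",
              "Clinical-Stage", "Commercial-Stage"],
    "ret":   [0.31, 0.001, -0.22, 0.002],
})

# Pooled average: the single useless number a regime-blind model learns.
print(events["ret"].mean())

# Per-regime averages: two separate, interpretable effects.
print(events.groupby("stage")["ret"].agg(["mean", "count"]))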
This is why regime separation is foundational. It’s not a nice-to-have; it’s a must-have. Causal inference tools, predictive models, and risk management frameworks only make sense after this fundamental separation has been made. If you skip this step, you will inevitably conclude there is “no alpha” when in fact there is no structure.
The Meta-Lesson: Structure Before Signal
EDGAR doesn’t lack information; it lacks interpretation. The challenge of quantitative finance today is less about finding obscure datasets and more about imposing intelligent structure on the massive public datasets we already have.
Think about it: the data that separates a clinical-stage biotech company from a commercial-stage one is public. It’s in the 10-K. It’s not hidden. It’s not proprietary. It’s right there in plain English. Yet most traders miss it because they’re looking for something more sophisticated. They’re looking for a pattern in the noise instead of first filtering out the noise.
This is a mistake. The most valuable work in quantitative finance is often not in building complex models, but in building better data representations. A simple classifier that separates clinical from commercial is worth more than a sophisticated model trained on undifferentiated data. A transparent rule-based system is worth more than a black-box neural network that no one understands.
Before you ask whether markets are efficient, you must first ask whether your data representation is. More often than not, the failure lies not in the market, but in the map we use to navigate it. Build a better map, and you might just find a path that was there all along.
The next time you download a batch of SEC filings, don’t start by building a model. Start by building a classifier. Separate the signal from the noise. Impose structure on the chaos. Then, and only then, start looking for patterns. You’ll be surprised at what you find.