Email Extractor
Scan text and extract all email addresses using regex pattern matching with automatic deduplication.
About Email Extractor
This email extractor uses regular expressions to scan text and identify email addresses matching standard format patterns. Results are automatically deduplicated, producing a clean list of unique email addresses for analysis, migration, or compliance workflows.
Email Regex Pattern Explained
The extraction pattern balances RFC compliance with practical coverage:
Pattern: /[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}/gi
Component breakdown:
| Component | Meaning |
|---|---|
| [a-zA-Z0-9._%+-]+ | Local part (before @): letters (a-z, A-Z), digits (0-9), dot, underscore, percent, plus, hyphen |
| @ | At symbol (required separator) |
| [a-zA-Z0-9.-]+ | Domain name: letters, digits, dots, hyphens |
| \. | Literal dot before TLD |
| [a-zA-Z]{2,} | Top-level domain (minimum 2 letters) |
Flags:
g = global (find all matches, not just first)
i = case-insensitive (USER@ = user@); redundant here, because the character classes already list both a-z and A-Z
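The effect of each flag can be checked directly. The following is a minimal sketch; the sample text and addresses are placeholders chosen for illustration, not data from this tool.

```javascript
// Hypothetical sample text; the addresses are placeholders.
const sampleText = "Write to Alice@Example.com or bob@example.org.";

// Without g: String.prototype.match returns only the first match.
const firstOnly = sampleText.match(
  /[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}/
);

// With g: match returns an array containing every matched substring.
const allMatches = sampleText.match(
  /[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}/g
);

// With i, lowercase-only character classes still match uppercase letters.
const lowercaseClasses = sampleText.match(
  /[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,}/gi
);

console.log(firstOnly[0]);     // "Alice@Example.com"
console.log(allMatches);       // ["Alice@Example.com", "bob@example.org"]
console.log(lowercaseClasses); // the same two addresses
```

Note that the trailing period after the second address is not captured, because the pattern requires letters after the final dot.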
Email Format Compliance Matrix
| Format Type | Example | RFC 5322 | This Tool |
|---|---|---|---|
| Simple ASCII | [email protected] | ✓ Valid | ✓ Matched |
| With dots | [email protected] | ✓ Valid | ✓ Matched |
| Plus tagging | [email protected] | ✓ Valid | ✓ Matched |
| Hyphenated | [email protected] | ✓ Valid | ✓ Matched |
| Subdomain | [email protected] | ✓ Valid | ✓ Matched |
| Country code TLD | [email protected] | ✓ Valid | ✓ Matched |
| Quoted string | "john.doe"@example.com | ✓ Valid | ✗ Not matched |
| International (UTF-8) | 用户@例子。广告 | ✓ Valid (RFC 6531) | ✗ Not matched |
| Obfuscated | user [at] domain.com | ✗ Invalid | ✗ Not matched |
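The "This Tool" column can be reproduced with a short script. This is a sketch under assumptions: the sample addresses are hypothetical stand-ins for each format type, and `matchesWhole` is an illustrative helper, not part of the tool.

```javascript
const pattern = /[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}/g;

// Hypothetical sample addresses, one per row of the matrix.
const samples = [
  ["Simple ASCII", "user@example.com"],
  ["With dots", "first.last@example.com"],
  ["Plus tagging", "user+news@example.com"],
  ["Hyphenated", "user@my-domain.com"],
  ["Subdomain", "user@mail.example.com"],
  ["Country code TLD", "user@example.co.uk"],
  ["Quoted string", '"john.doe"@example.com'],
  ["International (UTF-8)", "用户@例子.广告"],
  ["Obfuscated", "user [at] domain.com"],
];

// A row counts as matched only if the pattern captures the whole sample.
function matchesWhole(sample) {
  const found = sample.match(pattern) || [];
  return found.length === 1 && found[0] === sample;
}

const results = samples.map(([label, sample]) => [label, matchesWhole(sample)]);
for (const [label, matched] of results) {
  console.log(`${label}: ${matched ? "matched" : "not matched"}`);
}
```

The first six rows report "matched" and the last three report "not matched", which agrees with the table.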
Extraction Algorithm
JavaScript Email Extraction Implementation:
function extractEmails(text) {
  // Step 1: Define regex pattern
  const pattern = /[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}/gi;
  // Step 2: Find all matches (returns array or null)
  const matches = text.match(pattern) || [];
  // Step 3: Deduplicate using Set
  const unique = Array.from(new Set(matches));
  // Step 4: Return results
  return unique;
}
// Usage example:
const text = "Contact [email protected] or [email protected]";
const emails = extractEmails(text);
console.log(emails); // ["[email protected]", "[email protected]"]
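When the position of each address matters (for example, when annotating a log file), a variant built on String.prototype.matchAll can return offsets alongside the matches. This is a sketch, not the tool's implementation; `extractEmailsWithPositions` and the sample addresses are hypothetical.

```javascript
// Hypothetical variant: returns every match with its character offset.
// No deduplication is applied, so repeated addresses keep their positions.
function extractEmailsWithPositions(text) {
  const pattern = /[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}/g;
  return Array.from(text.matchAll(pattern), (m) => ({
    email: m[0],
    index: m.index,
  }));
}

const sample = "Contact alice@example.com or bob@example.org";
const found = extractEmailsWithPositions(sample);
console.log(found);
// [ { email: "alice@example.com", index: 8 },
//   { email: "bob@example.org", index: 29 } ]
```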
Common Use Cases
- Log Analysis: Extract contact emails from support logs, error reports, or audit trails
- Data Migration: Pull email addresses from unstructured text exports or legacy systems
- Lead Generation: Extract contact information from web pages, documents, or PDFs
- GDPR Compliance: Identify personal data (email addresses) in text for data audits and DSAR responses
- Research: Collect author emails from academic papers, articles, or conference proceedings
- Email Cleanup: Find and consolidate duplicate addresses across multiple sources
- Security Analysis: Extract sender/recipient emails from email headers or logs
Email Extraction Example
Input Text:
"Contact our team at [email protected] for assistance. You can also reach [email protected] or [email protected]. For press inquiries, email [email protected] or [email protected] again. Alternative: admin subdomain at [email protected]"
Extraction Process:
Match 1: [email protected]
Match 2: [email protected]
Match 3: [email protected]
Match 4: [email protected]
Match 5: [email protected] (duplicate)
Match 6: [email protected]
Output (deduplicated, one per line):
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
Total: 5 unique emails found
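The same walk-through can be reproduced in code. The sketch below assumes hypothetical example.com addresses in place of the ones in the example, and uses a Map to record how often each address occurs.

```javascript
// Hypothetical input mirroring the structure of the example above.
const inputText =
  "Contact our team at support@example.com for assistance. " +
  "You can also reach sales@example.com or info@example.com. " +
  "For press inquiries, email press@example.com or support@example.com again. " +
  "Alternative: admin subdomain at admin@mail.example.com";

const emailPattern = /[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}/g;
const allFound = inputText.match(emailPattern) || [];

// Count occurrences; Map preserves first-seen order of its keys.
const occurrences = new Map();
for (const email of allFound) {
  occurrences.set(email, (occurrences.get(email) || 0) + 1);
}

console.log(`Matches: ${allFound.length}`);   // Matches: 6
console.log(Array.from(occurrences.keys()).join("\n"));
console.log(`Total: ${occurrences.size} unique emails found`); // Total: 5 ...
```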
Extraction Limitations
- Format-only validation: Matches pattern, does not verify domain existence or mailbox validity
- No DNS/MX checks: Does not confirm the domain has mail servers
- Quoted strings: Does not match RFC 5322 quoted local parts ("john.doe"@example.com)
- International domains: Does not match IDN/UTF-8 characters (RFC 6531)
- Obfuscated emails: Does not detect "user at domain dot com" or similar obfuscation
- Case sensitivity: Deduplication is case-sensitive ([email protected] ≠ [email protected])
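If obfuscated addresses matter for a given workflow, one possible workaround is to normalize the text before extraction. The sketch below is an assumption, not a feature of this tool: `normalizeObfuscation` is a hypothetical helper that rewrites only bracketed markers such as "[at]" and "(dot)". Plain words like "user at domain dot com" are deliberately left alone, because replacing them would corrupt ordinary prose.

```javascript
const EMAIL_PATTERN = /[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}/g;

// Hypothetical helper: rewrites bracketed "[at]" / "(dot)" markers only.
function normalizeObfuscation(text) {
  return text
    .replace(/\s*[\[(]\s*at\s*[\])]\s*/gi, "@")
    .replace(/\s*[\[(]\s*dot\s*[\])]\s*/gi, ".");
}

function extractWithNormalization(text) {
  const matches = normalizeObfuscation(text).match(EMAIL_PATTERN) || [];
  return Array.from(new Set(matches));
}

const deobfuscated = extractWithNormalization(
  "Reach user [at] domain.com or jane (at) example (dot) org"
);
console.log(deobfuscated); // ["user@domain.com", "jane@example.org"]
```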
Email Validation vs Extraction
| Validation Level | What It Checks | This Tool |
|---|---|---|
| Syntax validation | Matches email format pattern | ✓ Yes |
| DNS validation | Checks domain has DNS records | ✗ No |
| MX record check | Confirms domain accepts email | ✗ No |
| SMTP verification | Connects to mail server, checks mailbox | ✗ No |
| Mailbox verification | Sends confirmation email | ✗ No |
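The difference between extracting addresses from text and syntax-checking a single string comes down to anchoring. The following sketch reuses the same pattern source for both; `looksLikeEmail` is a hypothetical helper, and a passing result says nothing about whether the domain or mailbox exists.

```javascript
const EMAIL_SOURCE = "[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}";

// Extraction: unanchored and global, finds addresses anywhere in the text.
const extractPattern = new RegExp(EMAIL_SOURCE, "g");

// Syntax check: anchored with ^ and $, the whole string must be an address.
const validatePattern = new RegExp(`^${EMAIL_SOURCE}$`);

function looksLikeEmail(candidate) {
  return validatePattern.test(candidate);
}

console.log(looksLikeEmail("user@example.com"));            // true
console.log(looksLikeEmail("see user@example.com today"));  // false
console.log("see user@example.com today".match(extractPattern)); // ["user@example.com"]

// Format-only: a well-formed address on a non-existent domain still passes.
console.log(looksLikeEmail("nobody@nonexistent-domain.invalid")); // true
```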
Deduplication with Set
JavaScript's Set data structure deduplicates a list in roughly O(n) time, because each insertion takes constant time on average in typical engines:
// Set automatically removes duplicates
const emails = [
  "[email protected]",
  "[email protected]",
  "[email protected]", // duplicate
  "[email protected]"
];
const uniqueSet = new Set(emails);
// Set(3) { "[email protected]", "[email protected]", "[email protected]" }
const uniqueArray = Array.from(uniqueSet);
// ["[email protected]", "[email protected]", "[email protected]"]
// Note: Set uses SameValueZero comparison
// "[email protected]" and "[email protected]" are different values
// For case-insensitive dedup, convert to lowercase first:
const lowercaseUnique = Array.from(
  new Set(emails.map(e => e.toLowerCase()))
);
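Lowercasing every address discards the original casing. If the first-seen form should be kept, a Map keyed by the lowercased address is one option. This is a sketch; `dedupeIgnoringCase` is a hypothetical helper and the addresses are placeholders.

```javascript
// Hypothetical helper: case-insensitive dedup that keeps the first-seen casing.
function dedupeIgnoringCase(emails) {
  const firstSeen = new Map();
  for (const email of emails) {
    const key = email.toLowerCase();
    if (!firstSeen.has(key)) {
      firstSeen.set(key, email); // remember the original spelling
    }
  }
  return Array.from(firstSeen.values());
}

const deduped = dedupeIgnoringCase([
  "User@Example.com",
  "user@example.com", // same address, different case
  "bob@example.org",
]);
console.log(deduped); // ["User@Example.com", "bob@example.org"]
```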
How to Extract Emails from Text
- Paste text: Enter or paste the text containing email addresses.
- Click Extract: The tool scans for all email patterns using regex.
- Review results: Unique emails appear in the output box, one per line.
- Copy output: Click "Copy Result" to use the email list elsewhere.
Tips
- Works with any text format: HTML, plain text, code, logs, PDF exports
- Duplicates are automatically removed using Set
- Each unique email appears on a separate line
- For case-insensitive deduplication, convert results to lowercase externally
- Clear input before processing new text
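As an illustration of the first three tips, the sketch below runs the pattern over an HTML fragment. The markup and addresses are hypothetical. An address that appears both in a mailto: link and in the visible text is reported once, and the result is joined one address per line.

```javascript
// Hypothetical HTML input; the same address appears in the href and the link text.
const html =
  '<p>Email <a href="mailto:help@example.com">help@example.com</a> ' +
  "or <b>Sales@example.com</b></p>";

const htmlPattern = /[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}/g;
const uniqueFromHtml = Array.from(new Set(html.match(htmlPattern) || []));

// One address per line, as in the output box.
const outputText = uniqueFromHtml.join("\n");
console.log(outputText);
// help@example.com
// Sales@example.com
```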
Frequently Asked Questions
- How does regex-based email extraction work?
- Email extraction uses regular expressions to match a simplified form of the email format defined in RFC 5322: local-part@domain. The regex pattern scans text for a run of allowed characters before @, a domain name made of valid characters, and a TLD of 2+ letters. This tool finds all matches and removes duplicates using a Set.
- What is the RFC 5322 email standard?
- RFC 5322 defines the Internet Message Format, including email address syntax. The local part (before @) can contain alphanumeric characters and special characters like ._%+- without spaces. The domain part (after @) is, in practice, a domain name with at least one dot. Full RFC compliance also allows quoted strings and escaped characters, but most tools use a simplified pattern for practical extraction.
- Why doesn't this extract all valid email formats?
- A regex covering the full RFC 5322 grammar is extremely complex (hundreds of characters) because the grammar allows edge cases like quoted strings ("john.doe"@example.com), comments, and escaped characters. This tool uses a practical pattern that covers the formats most commonly seen in real-world text while remaining readable and fast. Exotic formats are rare outside of test suites.
- Are extracted emails validated for existence?
- No. This tool only validates format, not deliverability. It does not check DNS MX records, send verification emails, or confirm mailboxes exist. Format validation ensures the email looks correct; deliverability requires separate verification services that perform SMTP handshakes or send confirmation messages.
- How does deduplication work?
- The tool uses JavaScript's Set data structure to automatically remove duplicates. A Set stores only unique values, so adding an existing value has no effect. Deduplication runs in roughly O(n) time, and the comparison is case-sensitive ([email protected] and [email protected] are treated as different emails).
- What about internationalized email addresses?
- This tool uses ASCII-only regex patterns. Internationalized Email Addresses (RFC 6531) allow UTF-8 characters in local parts and domains (用户@例子。广告). These require Unicode-aware regex and are not yet widely supported. For most use cases, ASCII emails cover the vast majority of addresses in circulation.
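For reference, a Unicode-aware variant can be sketched with property escapes and the u flag. This is an assumption about how such a pattern might look, not something the tool does. It is far looser than RFC 6531, it expects an ASCII dot in the domain, and it can over-match when an address sits inside running text that has no spaces around it.

```javascript
// Hypothetical Unicode-aware pattern: \p{L} = any letter, \p{N} = any number.
const unicodePattern = /[\p{L}\p{N}._%+-]+@[\p{L}\p{N}.-]+\.\p{L}{2,}/gu;

function extractUnicodeEmails(text) {
  return Array.from(new Set(text.match(unicodePattern) || []));
}

const intlSample = "Write to 用户@例子.广告 or user@example.com";
const intlFound = extractUnicodeEmails(intlSample);
console.log(intlFound); // ["用户@例子.广告", "user@example.com"]
```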