How does regex-based email extraction work?

Email extraction uses a regular expression to match the address format described in RFC 5322: local-part@domain. The pattern scans text for a run of allowed characters before the @, a domain name made of valid characters, and a TLD of two or more letters. This tool finds all matches and removes duplicates with a JavaScript Set.

What is the RFC 5322 email standard?

RFC 5322 defines the Internet Message Format, including email address syntax. The local part (before @) can contain alphanumeric characters and certain special characters such as . _ % + - but no unquoted spaces. The domain part (after @) is a domain name, which in practice contains at least one dot. Full RFC compliance also allows quoted strings and escaped characters, but most tools use a simplified pattern for practical extraction.

Why doesn't this extract all valid email formats?

A regex covering the full RFC 5322 grammar is extremely complex (hundreds of characters or more) because the grammar allows edge cases such as quoted strings ("john.doe"@example.com), comments, and escaped characters. This tool uses a practical pattern that matches the vast majority of real-world addresses while remaining readable and fast. Exotic formats are rare outside of test suites.

Are extracted emails validated for existence?

No. This tool only validates format, not deliverability. It does not check DNS MX records, send verification emails, or confirm mailboxes exist. Format validation ensures the email looks correct; deliverability requires separate verification services that perform SMTP handshakes or send confirmation messages.

How does deduplication work?

The tool uses JavaScript's Set data structure to remove duplicates automatically. A Set stores only unique values; adding an existing value has no effect. Deduplication takes O(n) time on average, and comparison is case-sensitive (User@Example.com and user@example.com are treated as different emails).

What about internationalized email addresses?

This tool uses ASCII-only regex patterns. Internationalized Email Addresses (RFC 6531) allow UTF-8 characters in local parts and domains (用户@例子。广告). These require Unicode-aware regex and are not yet widely supported. For most use cases, ASCII emails cover the vast majority of addresses in circulation.
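As a rough illustration of what "Unicode-aware regex" could mean here, the sketch below swaps the ASCII character classes for Unicode property escapes. It is not the tool's pattern: the name `extractUnicodeEmails` is hypothetical, the sample address uses an ASCII dot as the label separator, and the pattern is far more permissive than RFC 6531 actually allows.

```javascript
// Sketch only: a Unicode-aware variant of the ASCII pattern.
// \p{L} matches any letter and \p{N} any digit; both require the `u` flag.
const unicodeEmailPattern = /[\p{L}\p{N}._%+-]+@[\p{L}\p{N}.-]+\.\p{L}{2,}/gu;

// Hypothetical helper, named for this example only.
function extractUnicodeEmails(text) {
  // match() with the g flag returns all matches, or null when there are none.
  return Array.from(new Set(text.match(unicodeEmailPattern) || []));
}

const unicodeSample = "Write to 用户@例子.广告 or user@example.com today";
console.log(extractUnicodeEmails(unicodeSample));
```

Because `\p{L}` accepts letters from every script, a pattern like this is likely to produce more false positives on mixed-language text than the ASCII version.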

Email Extractor

Scan text and extract all email addresses using regex pattern matching with automatic deduplication.

About Email Extractor

This email extractor uses regular expressions to scan text and identify email addresses matching standard format patterns. Results are automatically deduplicated, producing a clean list of unique email addresses for analysis, migration, or compliance workflows.

Email Regex Pattern Explained

The extraction pattern balances RFC compliance with practical coverage:

Pattern: /[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}/gi

Component breakdown:
┌───────────────────┬──────────────────────────────────────────┐
│ [a-zA-Z0-9._%+-]+ │ Local part (before @)                    │
│                   │ - Letters (a-z, A-Z)                     │
│                   │ - Digits (0-9)                           │
│                   │ - Dot, underscore, percent, plus, hyphen │
│ @                 │ At symbol (required separator)           │
│ [a-zA-Z0-9.-]+    │ Domain name                              │
│                   │ - Letters, digits, dots, hyphens         │
│ \.                │ Literal dot before TLD                   │
│ [a-zA-Z]{2,}      │ Top-level domain (minimum 2 letters)     │
└───────────────────┴──────────────────────────────────────────┘

Flags:
  g = global (find all matches, not just first)
  i = case-insensitive (USER@ is treated like user@); redundant here because the character classes already list both a-z and A-Z
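To make the effect of each flag concrete, here is a small sketch that runs the same pattern with and without them. The sample strings and variable names are made up for illustration.

```javascript
const flagDemoText = "Mail a@example.com or b@example.org";

// Without g: only the first match, plus its position in the input.
const firstOnly = flagDemoText.match(/[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}/);
console.log(firstOnly[0], firstOnly.index);

// With g: every match, returned as a plain array of strings.
const allMatches = flagDemoText.match(/[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}/g);
console.log(allMatches);

// With i: character classes that list only lowercase letters
// still match uppercase input.
const upperMatch = "USER@EXAMPLE.COM".match(/[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,}/gi);
console.log(upperMatch);
```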

Email Format Compliance Matrix

Format Type	Example	RFC 5322	This Tool
Simple ASCII	`user@example.com`	✓ Valid	✓ Matched
With dots	`john.doe@example.com`	✓ Valid	✓ Matched
Plus tagging	`user+tag@example.com`	✓ Valid	✓ Matched
Hyphenated	`user-name@example.com`	✓ Valid	✓ Matched
Subdomain	`user@mail.example.com`	✓ Valid	✓ Matched
Country code TLD	`user@example.co.uk`	✓ Valid	✓ Matched
Quoted string	`"john.doe"@example.com`	✓ Valid	✗ Not matched
International (UTF-8)	`用户@例子。广告`	✓ Valid (RFC 6531)	✗ Not matched
Obfuscated	`user [at] domain.com`	✗ Invalid	✗ Not matched
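The matrix can be spot-checked with a few lines of JavaScript. The sketch below uses illustrative addresses for a handful of rows and hypothetical names (`matrixPattern`, `matrixCases`). The pattern deliberately omits the g flag, because `RegExp.prototype.test` on a global regex carries `lastIndex` from one call to the next and can give misleading results.

```javascript
// Same pattern as the tool, minus the g flag so test() is stateless.
const matrixPattern = /[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}/i;

// Illustrative inputs, one per format type.
const matrixCases = [
  ["Plus tagging", "user+tag@example.com"],
  ["Subdomain", "user@mail.example.com"],
  ["Quoted string", '"john.doe"@example.com'],
  ["International", "用户@例子.广告"],
  ["Obfuscated", "user [at] domain.com"],
];

const matrixResults = matrixCases.map(([label, input]) => ({
  label,
  matched: matrixPattern.test(input),
}));

for (const { label, matched } of matrixResults) {
  console.log(label + ": " + (matched ? "matched" : "not matched"));
}
```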

Extraction Algorithm

JavaScript Email Extraction Implementation:

function extractEmails(text) {
  // Step 1: Define regex pattern
  const pattern = /[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}/gi;

  // Step 2: Find all matches (returns array or null)
  const matches = text.match(pattern) || [];

  // Step 3: Deduplicate using Set
  const unique = Array.from(new Set(matches));

  // Step 4: Return results
  return unique;
}

// Usage example:
const text = "Contact support@example.com or sales@example.com";
const emails = extractEmails(text);
console.log(emails); // ["support@example.com", "sales@example.com"]
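For log analysis it can help to know where each address was found. The variant below is a sketch, not part of the tool: it uses `String.prototype.matchAll` (which requires the g flag) to record the offset of the first occurrence of each unique address. The function name `extractEmailsWithPositions` is hypothetical.

```javascript
// Hypothetical variant: returns each unique email with the index
// of its first occurrence in the input text.
function extractEmailsWithPositions(text) {
  const pattern = /[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}/gi;
  const seen = new Set();
  const results = [];

  // matchAll yields one match object per hit; match.index is its offset.
  for (const match of text.matchAll(pattern)) {
    const email = match[0];
    if (seen.has(email)) continue; // skip later duplicates
    seen.add(email);
    results.push({ email, index: match.index });
  }
  return results;
}

console.log(extractEmailsWithPositions("a@example.com, b@example.org, a@example.com"));
```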

Common Use Cases

Log Analysis: Extract contact emails from support logs, error reports, or audit trails
Data Migration: Pull email addresses from unstructured text exports or legacy systems
Lead Generation: Extract contact information from web pages, documents, or PDFs
GDPR Compliance: Identify personal data (email addresses) in text for data audits and DSAR responses
Research: Collect author emails from academic papers, articles, or conference proceedings
Email Cleanup: Find and consolidate duplicate addresses across multiple sources
Security Analysis: Extract sender/recipient emails from email headers or logs

Email Extraction Example

Input Text:
"Contact our team at support@example.com for assistance.
 You can also reach sales@example.com or info@example.com.
 For press inquiries, email press@example.com or support@example.com again.
 Alternative: admin subdomain at admin@mail.example.com"

Extraction Process:
  Match 1: support@example.com
  Match 2: sales@example.com
  Match 3: info@example.com
  Match 4: press@example.com
  Match 5: support@example.com (duplicate)
  Match 6: admin@mail.example.com

Output (deduplicated, one per line):
support@example.com
sales@example.com
info@example.com
press@example.com
admin@mail.example.com

Total: 5 unique emails found

Extraction Limitations

Format-only validation: Matches pattern, does not verify domain existence or mailbox validity
No DNS/MX checks: Does not confirm the domain has mail servers
Quoted strings: Does not match RFC 5322 quoted local parts ("john.doe"@example.com)
International domains: Does not match IDN/UTF-8 characters (RFC 6531)
Obfuscated emails: Does not detect "user at domain dot com" or similar obfuscation
Case sensitivity: Deduplication is case-sensitive (User@Example.com ≠ user@example.com)
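The obfuscation limitation can be partly worked around by normalizing the text before extraction. The sketch below is one possible pre-processing step, not a feature of this tool: the hypothetical `deobfuscate` function rewrites only bracketed forms such as "[at]" and "(dot)", so that ordinary uses of the words "at" and "dot" are left alone.

```javascript
// Hypothetical pre-processing step: rewrite bracketed "[at]" / "(dot)"
// tokens, with any surrounding whitespace, into "@" and ".".
function deobfuscate(text) {
  return text
    .replace(/\s*[\[(]\s*at\s*[\])]\s*/gi, "@")
    .replace(/\s*[\[(]\s*dot\s*[\])]\s*/gi, ".");
}

const cleaned = deobfuscate("Reach me: user [at] domain [dot] com");
const found = cleaned.match(/[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}/gi) || [];
console.log(found);
```

Unbracketed forms like "user at domain dot com" are intentionally not handled; rewriting every "at" in running text would produce many false matches.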

Email Validation vs Extraction

Validation Level	What It Checks	This Tool
Syntax validation	Matches email format pattern	✓ Yes
DNS validation	Checks domain has DNS records	✗ No
MX record check	Confirms domain accepts email	✗ No
SMTP verification	Connects to mail server, checks mailbox	✗ No
Mailbox verification	Sends confirmation email	✗ No
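The levels this tool does not cover operate on domains rather than full addresses, so a typical first step toward them is to reduce the extracted list to its unique domains. The sketch below shows only that preparation step; `uniqueDomains` is a hypothetical name, and the lookup itself is left out (in Node.js, `dns.promises.resolveMx` is one function that could be called on each domain).

```javascript
// Hypothetical helper: collect the unique, lowercased domains from a
// list of extracted addresses, as input for separate DNS or MX checks.
function uniqueDomains(emails) {
  const domains = emails.map(
    // Everything after the last "@"; domain names are case-insensitive.
    (email) => email.slice(email.lastIndexOf("@") + 1).toLowerCase()
  );
  return Array.from(new Set(domains));
}

console.log(uniqueDomains(["a@Example.com", "b@example.com", "c@mail.example.org"]));
```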

Deduplication with Set

JavaScript's Set data structure provides efficient O(n) deduplication:

// Set automatically removes duplicates
const emails = [
  "alice@example.com",
  "bob@example.com",
  "alice@example.com",  // duplicate
  "carol@example.com"
];

const uniqueSet = new Set(emails);
// Set(3) { "alice@example.com", "bob@example.com", "carol@example.com" }

const uniqueArray = Array.from(uniqueSet);
// ["alice@example.com", "bob@example.com", "carol@example.com"]

// Note: Set uses SameValueZero comparison
// "User@Example.com" and "user@example.com" are different values
// For case-insensitive dedup, convert to lowercase first:
const lowercaseUnique = Array.from(
  new Set(emails.map(e => e.toLowerCase()))
);
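Lowercasing the whole address is the simplest option, but it also changes the local part, which mail standards treat as potentially case-sensitive. A more conservative alternative, sketched below under that assumption, compares addresses with only the domain lowercased and keeps the first spelling seen. The name `dedupeByNormalizedDomain` is hypothetical.

```javascript
// Hypothetical alternative: treat addresses as duplicates when they
// differ only in the case of the domain, and keep the first spelling.
function dedupeByNormalizedDomain(emails) {
  const firstSeen = new Map(); // comparison key -> original address
  for (const email of emails) {
    const at = email.lastIndexOf("@");
    const key = email.slice(0, at) + "@" + email.slice(at + 1).toLowerCase();
    if (!firstSeen.has(key)) firstSeen.set(key, email);
  }
  // Map preserves insertion order, so output order follows the input.
  return Array.from(firstSeen.values());
}

console.log(
  dedupeByNormalizedDomain(["User@Example.com", "User@example.COM", "user@example.com"])
);
```

In practice most mail providers ignore case in the local part too, so full lowercasing is often acceptable; this version simply avoids making that assumption.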

How to Extract Emails from Text

Paste text: Enter or paste the text containing email addresses.
Click Extract: The tool scans for all email patterns using regex.
Review results: Unique emails appear in the output box, one per line.
Copy output: Click "Copy Result" to use the email list elsewhere.

Tips

Works with any text format: HTML, plain text, code, logs, PDF exports
Duplicates are automatically removed using Set
Each unique email appears on a separate line
For case-insensitive deduplication, convert results to lowercase externally
Clear input before processing new text
