URL Extractor
Scan text and extract all HTTP/HTTPS URLs using regex pattern matching with automatic deduplication.
About URL Extractor
This URL extractor uses regular expressions to scan text and identify HTTP/HTTPS links matching standard URI syntax. Results are automatically deduplicated, producing a clean list of unique URLs for analysis, crawling, or content workflows.
URL Regex Pattern Explained
The extraction pattern matches common web URL formats:
Pattern: /https?:\/\/[^\s"'<>()]+/gi

Component breakdown:

┌──────────────┬──────────────────────────────────────────┐
│ https?       │ Scheme: http or https (s is optional)    │
│ ://          │ Scheme separator (literal characters)    │
│ [^\s"'<>()]+ │ URL body: one or more characters that    │
│              │ are NOT whitespace, quotes, or brackets  │
└──────────────┴──────────────────────────────────────────┘

Character class [^\s"'<>()] excludes:

\s - Whitespace (space, tab, newline)
"  - Double quote
'  - Single quote
<  - Less-than (HTML tag delimiter)
>  - Greater-than (HTML tag delimiter)
(  - Opening parenthesis
)  - Closing parenthesis
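As a quick illustration (a minimal sketch, run against a made-up HTML fragment), the excluded delimiters are what let a match end cleanly inside markup:

```javascript
// The closing double quote of the href attribute is excluded by the
// character class, so the match ends exactly at the URL boundary.
const pattern = /https?:\/\/[^\s"'<>()]+/gi;

const html = '<a href="https://example.com/page">link</a>';
console.log(html.match(pattern)); // [ "https://example.com/page" ]
```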
RFC 3986 URL Structure
URI syntax per RFC 3986:
URI = scheme ":" ["//" authority] path ["?" query] ["#" fragment]

Example URL breakdown:

https://user:[email protected]:8080/v1/users?id=123#section

  scheme    → https
  userinfo  → user:pass
  host      → api.site.com
  port      → 8080
  path      → /v1/users
  query     → id=123
  fragment  → section

Components captured by this tool:

✓ scheme (http, https)
✓ host (domain with optional subdomain)
✓ port (optional, e.g., :8080)
✓ path (optional, e.g., /api/users)
✓ query (optional, e.g., ?id=123)
✓ fragment (optional, e.g., #section)

Not captured:

✗ userinfo (user:pass@ - rare in modern URLs)
✗ IPv6 literals in brackets, e.g., [::1]

Percent-encoded characters (e.g., %20) are matched as-is, without decoding.
Supported URL Formats
| Format Type | Example | Extracted |
|---|---|---|
| Basic domain | https://example.com | ✓ Yes |
| With path | https://example.com/page/subpage | ✓ Yes |
| With query string | https://example.com/search?q=test&lang=en | ✓ Yes |
| With fragment | https://example.com/page#section | ✓ Yes |
| With port | http://localhost:8080/api | ✓ Yes |
| Subdomain | https://api.v2.company.com | ✓ Yes |
| Combined | https://api.example.com:8080/v1/users?id=123#top | ✓ Yes |
| Relative path | /page/path | ✗ No |
| Protocol-relative | //cdn.example.com/script.js | ✗ No |
| FTP protocol | ftp://files.example.com/doc.pdf | ✗ No |
| mailto | mailto:[email protected] | ✗ No |
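The yes/no column can be checked directly against the extraction pattern; this sketch runs a few of the table's rows through it:

```javascript
const pattern = /https?:\/\/[^\s"'<>()]+/gi;

const rows = [
  "https://example.com",             // basic domain
  "http://localhost:8080/api",       // with port
  "/page/path",                      // relative path
  "//cdn.example.com/script.js",     // protocol-relative
  "ftp://files.example.com/doc.pdf", // other scheme
];

for (const row of rows) {
  // A /g regex is stateful with .test(), so reset lastIndex each pass
  pattern.lastIndex = 0;
  console.log(`${pattern.test(row) ? "extracted" : "skipped "}  ${row}`);
}
```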
Extraction Algorithm
JavaScript URL Extraction Implementation:
function extractUrls(text) {
// Step 1: Define regex pattern for HTTP/HTTPS URLs
const pattern = /(https?:\/\/[^\s"'<>()]+)/gi;
// Step 2: Find all matches (returns array or null)
const matches = text.match(pattern) || [];
// Step 3: Deduplicate using Set
const unique = Array.from(new Set(matches));
// Step 4: Return results
return unique;
}
// Usage example:
const text = `
Check out https://example.com and https://example.com again.
Also visit http://test.org/page?id=123#section for details.
The API at https://api.company.com:8080/v1/users is documented
at https://docs.company.com/api.
`;
const urls = extractUrls(text);
console.log(urls);
// [
//   "https://example.com",
//   "http://test.org/page?id=123#section",
//   "https://api.company.com:8080/v1/users",
//   "https://docs.company.com/api."
// ]
// Note: the last URL keeps the sentence-ending period, because "."
// is not in the excluded character set.
Common Use Cases
- Log Analysis: Extract visited URLs from browser history, server logs, or proxy logs
- Web Scraping: Pull all links from HTML content, markdown, or plain text documents
- SEO Auditing: Collect internal and external links for link profile analysis
- Redirect Tracking: Find all URLs in redirect chains or HTTP response headers
- Link Verification: Extract URLs before running link checker or crawler tools
- Research: Collect source URLs from articles, papers, or documentation
- Social Media Analysis: Extract links from posts, comments, or messages
- Security Analysis: Identify external domains referenced in documents or logs
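As a sketch of the log-analysis case (the log format and hostnames here are made up), the same pattern pulls referrer URLs out of access-log lines:

```javascript
// Hypothetical access-log lines; only the quoted referrer field
// contains absolute URLs, and duplicates collapse via Set.
const logLines = [
  '192.0.2.1 - [10/Oct/2024] "GET /index HTTP/1.1" 200 "https://referrer.example.com/page"',
  '192.0.2.2 - [10/Oct/2024] "GET /about HTTP/1.1" 200 "https://referrer.example.com/page"',
];

const pattern = /https?:\/\/[^\s"'<>()]+/gi;
const urls = Array.from(new Set(logLines.join("\n").match(pattern) || []));
console.log(urls); // [ "https://referrer.example.com/page" ]
```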
URL Extraction Example
Input Text:

"Check out https://example.com and https://example.com again. Also visit
http://test.org/page?id=123#section for details. The API at
https://api.company.com:8080/v1/users is documented at
https://docs.company.com/api."

Extraction Process:

Match 1: https://example.com
Match 2: https://example.com (duplicate)
Match 3: http://test.org/page?id=123#section
Match 4: https://api.company.com:8080/v1/users
Match 5: https://docs.company.com/api. (the sentence-ending period is captured, since "." is not excluded)

Output (deduplicated, one per line):

https://example.com
http://test.org/page?id=123#section
https://api.company.com:8080/v1/users
https://docs.company.com/api.

Total: 4 unique URLs found
URL Component Reference
| Component | Description | Example |
|---|---|---|
| Scheme | Protocol identifier | https:// |
| Authority | Domain/host with optional port | example.com |
| Port | TCP/IP port number (optional) | :8080 |
| Path | Resource location on server | /api/v1/users |
| Query | Key-value parameters | ?id=123&name=test |
| Fragment | Page section anchor | #section1 |
Extraction Limitations
- HTTP/HTTPS only: FTP, mailto, tel, file://, and other protocols are not extracted
- No relative URLs: Paths like /page or protocol-relative //cdn.example.com are not matched
- No validation: Does not verify if URLs are accessible or return valid responses
- Parentheses handling: ( and ) are excluded from matches, so a URL that legitimately contains parentheses (common in Wikipedia paths) is truncated at the first parenthesis
- Trailing punctuation: a sentence-ending period or comma immediately after a URL is included in the match, since . and , are not excluded
- Encoded URLs: Punycode international domains (xn--...) are matched as-is
- Case sensitivity: Deduplication is case-sensitive (HTTPS://EXAMPLE.COM ≠ https://example.com)
- Whitespace in URLs: Spaces terminate the match; encoded spaces (%20) are preserved
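The parentheses limitation is easy to reproduce (the Wikipedia-style URL here is illustrative):

```javascript
const pattern = /https?:\/\/[^\s"'<>()]+/gi;

// "(" is excluded from the character class, so the match stops there
const text = "See https://en.wikipedia.org/wiki/Foo_(disambiguation) for details.";
console.log(text.match(pattern));
// [ "https://en.wikipedia.org/wiki/Foo_" ]  (truncated at the "(")
```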
Deduplication Behavior
JavaScript Set uses exact string comparison for deduplication:
// Exact string matching
const urls = [
"https://example.com",
"https://EXAMPLE.COM", // Different (case)
"https://example.com/", // Different (trailing slash)
"https://example.com", // Duplicate
"http://example.com" // Different (scheme)
];
const unique = Array.from(new Set(urls));
// Result: 4 URLs retained (only the exact duplicate removed)
// For semantic deduplication, normalize first:
const normalized = urls.map(url => {
  // Caution: this lowercases the whole URL; strictly speaking only
  // the scheme and host are case-insensitive, not the path.
  let n = url.toLowerCase();
  if (n.endsWith('/') && n.length > 9) {
    n = n.slice(0, -1); // Remove trailing slash (guard keeps "http://" intact)
  }
  return n;
});
const semanticUnique = Array.from(new Set(normalized));
// Consolidates case and trailing-slash variations of the same URL
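An alternative sketch: the built-in URL class normalizes only the parts RFC 3986 defines as case-insensitive (scheme and host), leaving path and query case intact:

```javascript
// URL lowercases the protocol and hostname automatically; the path
// keeps its original case, which toLowerCase() on the whole string
// would destroy.
function normalizeUrl(raw) {
  return new URL(raw).href; // throws a TypeError on invalid input
}

console.log(normalizeUrl("HTTPS://EXAMPLE.COM/Path?Q=1"));
// "https://example.com/Path?Q=1"
```

Note that new URL("https://example.com").href yields "https://example.com/" with a trailing slash, so root URLs also get a single canonical form.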
How to Extract URLs from Text
- Paste text: Enter or paste the text containing URLs.
- Click Extract: The tool scans for all HTTP/HTTPS links using regex.
- Review results: Unique URLs appear in the output box, one per line.
- Copy output: Click "Copy Result" to use the URL list elsewhere.
Tips
- Works with HTML, plain text, markdown, code comments, and logs
- Duplicates are automatically removed using Set
- Each unique URL appears on a separate line
- Full URLs with http:// or https:// prefix are required
- For case-insensitive deduplication, normalize results externally
Frequently Asked Questions
- How does regex-based URL extraction work?
- URL extraction uses regular expressions to match the URI syntax defined in RFC 3986. The pattern looks for the scheme (http:// or https://), followed by the authority (domain), and optional path, query, and fragment components. This tool finds all matches using JavaScript's match() method and removes duplicates using Set.
- What is the RFC 3986 URL standard?
- RFC 3986 defines the Uniform Resource Identifier (URI) syntax: scheme://authority/path?query#fragment. The authority includes userinfo@, host, and :port. Valid characters vary by component—alphanumeric plus unreserved characters (-._~) are allowed throughout, while reserved characters (:/?#[]@!$&'()*+,;=) have special meanings and may require percent-encoding.
- Why are relative URLs not extracted?
- Relative URLs (like /page/path or //cdn.example.com) lack a scheme and are resolved relative to a base URL. This tool extracts only absolute URLs with explicit http:// or https:// schemes to ensure results are immediately usable. Relative URLs require context (the base URL) to be meaningful.
- How does deduplication handle URL variations?
- Deduplication uses exact string matching via JavaScript Set. This means https://Example.com and https://example.com are treated as different URLs (case-sensitive), and https://example.com/page is different from https://example.com/page/. For semantic deduplication (normalizing case, trailing slashes), additional processing would be required.
- What URL components are captured?
- The pattern captures the full URL including: scheme (http/https), domain/host, optional port number, path segments, query string (everything after ?), and fragment identifier (everything after #). Example: https://api.example.com:8080/v1/users?id=123&name=test#section captures all components.
- Why doesn't this extract FTP or other protocols?
- This tool focuses on HTTP/HTTPS because they are the most common web protocols. FTP, mailto, tel, file://, and other schemes have different use cases and security considerations. The regex can be extended to other "//"-style schemes by widening the alternation, e.g. (https?|ftp):\/\/; note that mailto: takes no // after the scheme, so it needs a separate pattern rather than joining that alternation.