URL Extractor

Scan text and extract all HTTP/HTTPS URLs using regex pattern matching with automatic deduplication.

About URL Extractor

This URL extractor uses a regular expression to scan text for HTTP/HTTPS links. The pattern approximates standard URI syntax rather than validating against it. Results are automatically deduplicated, producing a clean list of unique URLs for analysis, crawling, or content workflows.

URL Regex Pattern Explained

The extraction pattern matches common web URL formats:

Pattern: /https?:\/\/[^\s"'<>()]+/gi

Component breakdown:
┌──────────────┬──────────────────────────────────────────┐
│ https?       │ Scheme: http or https (s is optional)    │
│ ://          │ Scheme separator (literal characters)    │
│ [^\s"'<>()]+ │ URL body: one or more characters that    │
│              │ are not whitespace, quotes, angle        │
│              │ brackets, or parentheses                 │
│ gi           │ Flags: global, case-insensitive          │
└──────────────┴──────────────────────────────────────────┘

Character class [^\s"'<>()] excludes:
  \s  - Whitespace (space, tab, newline)
  "   - Double quote
  '   - Single quote
  <   - Less-than (HTML tag delimiter)
  >   - Greater-than (HTML tag delimiter)
  (   - Opening parenthesis
  )   - Closing parenthesis
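
To see where the character class ends a match, the sketch below runs the pattern against a few snippets. The sample strings and the `firstMatch` helper are illustrative assumptions, not part of the tool.

```javascript
// Same pattern as described above.
const urlPattern = /https?:\/\/[^\s"'<>()]+/gi;

// Hypothetical helper: return the first URL matched in a snippet, or null.
function firstMatch(snippet) {
  const found = snippet.match(urlPattern);
  return found ? found[0] : null;
}

// A double quote ends the match, so HTML attribute values extract cleanly.
console.log(firstMatch('<a href="https://example.com/a">link</a>'));
// https://example.com/a

// A closing parenthesis ends the match, which suits Markdown-style links...
console.log(firstMatch("[docs](https://example.com/docs)"));
// https://example.com/docs

// ...but it also cuts off URLs that legitimately contain parentheses.
console.log(firstMatch("https://en.wikipedia.org/wiki/Rust_(programming_language)"));
// https://en.wikipedia.org/wiki/Rust_
```

The last case shows the trade-off: excluding parentheses keeps surrounding punctuation out of results, at the cost of truncating URLs that contain them.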

RFC 3986 URL Structure

URI syntax per RFC 3986:

URI = scheme ":" ["//" authority] path ["?" query] ["#" fragment]

Example URL breakdown:
https://user:pass@api.example.com:8080/v1/users?id=123#section
│     │ │         │               │   │         │      │
│     │ │         │               │   │         │      └─ fragment
│     │ │         │               │   │         └──────── query
│     │ │         │               │   └────────────────── path
│     │ │         │               └────────────────────── port
│     │ │         └────────────────────────────────────── host (domain)
│     │ └──────────────────────────────────────────────── userinfo
│     └────────────────────────────────────────────────── authority delimiter (//)
└──────────────────────────────────────────────────────── scheme

Components captured by this tool:
  ✓ scheme (http, https)
  ✓ userinfo (optional, e.g., user:pass@ - rare in modern URLs)
  ✓ host (domain with optional subdomain, or an IP literal such as [::1])
  ✓ port (optional, e.g., :8080)
  ✓ path (optional, e.g., /api/users)
  ✓ query (optional, e.g., ?id=123)
  ✓ fragment (optional, e.g., #section)

Notes:
  - The pattern does not parse or validate components. It captures everything
    after the scheme up to the first excluded character.
  - Percent-encoded characters (e.g., %20) are kept as-is, not decoded.
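
Each match is a single string, so splitting it into the components above is a separate step. One option is the standard `URL` class available in browsers and Node.js. The `splitUrl` helper below is a hypothetical name used for illustration and is not part of the tool.

```javascript
// Hypothetical helper: break an absolute URL into RFC 3986-style components
// using the built-in URL class.
function splitUrl(urlString) {
  const u = new URL(urlString); // throws a TypeError if the string is not a valid absolute URL
  return {
    scheme: u.protocol.replace(/:$/, ""),  // "https:" -> "https"
    userinfo: u.password ? `${u.username}:${u.password}` : u.username,
    host: u.hostname,
    port: u.port,                           // "" when the scheme's default port is used
    path: u.pathname,
    query: u.search.replace(/^\?/, ""),
    fragment: u.hash.replace(/^#/, ""),
  };
}

const parts = splitUrl("https://api.example.com:8080/v1/users?id=123#section");
console.log(parts.host, parts.port, parts.path, parts.query, parts.fragment);
// api.example.com 8080 /v1/users id=123 section
```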

Supported URL Formats

Format Type         Example                                            Extracted
Basic domain        https://example.com                                ✓ Yes
With path           https://example.com/page/subpage                   ✓ Yes
With query string   https://example.com/search?q=test&lang=en          ✓ Yes
With fragment       https://example.com/page#section                   ✓ Yes
With port           http://localhost:8080/api                          ✓ Yes
Subdomain           https://api.v2.company.com                         ✓ Yes
Combined            https://api.example.com:8080/v1/users?id=123#top   ✓ Yes
Relative path       /page/path                                         ✗ No
Protocol-relative   //cdn.example.com/script.js                        ✗ No
FTP protocol        ftp://files.example.com/doc.pdf                    ✗ No
mailto              mailto:[email protected]                           ✗ No
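
The rows above can be checked directly against the pattern. The sketch below is an illustration only: the `isExtracted` helper is a hypothetical name, and the inputs are taken from the table.

```javascript
const httpUrlPattern = /https?:\/\/[^\s"'<>()]+/gi;

// Hypothetical helper: true when the whole input is extracted as a single URL.
function isExtracted(candidate) {
  const found = candidate.match(httpUrlPattern) || [];
  return found.length === 1 && found[0] === candidate;
}

const samples = [
  "https://example.com/search?q=test&lang=en", // query string
  "http://localhost:8080/api",                 // port
  "/page/path",                                // relative path
  "//cdn.example.com/script.js",               // protocol-relative
  "ftp://files.example.com/doc.pdf",           // other scheme
];

for (const sample of samples) {
  console.log(`${isExtracted(sample) ? "✓" : "✗"} ${sample}`);
}
// ✓ for the first two samples, ✗ for the last three
```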

Extraction Algorithm

JavaScript URL Extraction Implementation:

function extractUrls(text) {
  // Step 1: Define regex pattern for HTTP/HTTPS URLs
  // (g = find every match, i = case-insensitive, so HTTP:// also matches)
  const pattern = /https?:\/\/[^\s"'<>()]+/gi;

  // Step 2: Find all matches (returns array or null)
  const matches = text.match(pattern) || [];

  // Step 3: Deduplicate using Set
  const unique = Array.from(new Set(matches));

  // Step 4: Return results
  return unique;
}

// Usage example:
const text = `
  Check out https://example.com and https://example.com again.
  Also visit http://test.org/page?id=123#section for details.
  The API at https://api.company.com:8080/v1/users is documented
  at https://docs.company.com/api.
`;

const urls = extractUrls(text);
console.log(urls);
// [
//   "https://example.com",
//   "http://test.org/page?id=123#section",
//   "https://api.company.com:8080/v1/users",
//   "https://docs.company.com/api."   // the sentence-ending period is captured too
// ]

Common Use Cases

URL Extraction Example

Input Text:
"Check out https://example.com and https://example.com again.
 Also visit http://test.org/page?id=123#section for details.
 The API at https://api.company.com:8080/v1/users is documented
 at https://docs.company.com/api."

Extraction Process:
  Match 1: https://example.com
  Match 2: https://example.com (duplicate)
  Match 3: http://test.org/page?id=123#section
  Match 4: https://api.company.com:8080/v1/users
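  Match 5: https://docs.company.com/api. (includes the sentence-ending period)

Output (deduplicated, one per line):
https://example.com
http://test.org/page?id=123#section
https://api.company.com:8080/v1/users
https://docs.company.com/api.

Total: 4 unique URLs found

Note: a period is not one of the excluded characters, so when one directly follows a URL the pattern captures it as part of the match.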

URL Component Reference

Component   Description                              Example
Scheme      Protocol identifier                      https://
Authority   Domain/host with optional port           example.com
Port        TCP/IP port number (optional)            :8080
Path        Resource location on server              /api/v1/users
Query       Key-value parameters                     ?id=123&name=test
Fragment    Page section anchor                      #section1

Extraction Limitations
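
One limitation follows directly from the pattern: only whitespace, quotes, angle brackets, and parentheses end a match, so punctuation that directly follows a URL in a sentence is captured as part of it. The sketch below shows one possible workaround, trimming common trailing punctuation after matching. The `extractTrimmedUrls` name and the set of trimmed characters are assumptions, not the tool's documented behavior.

```javascript
const linkPattern = /https?:\/\/[^\s"'<>()]+/gi;

// Hypothetical post-processing step: strip punctuation that usually belongs
// to the surrounding sentence rather than to the URL itself.
function extractTrimmedUrls(text) {
  const matches = text.match(linkPattern) || [];
  const trimmed = matches.map(url => url.replace(/[.,;:!?]+$/, ""));
  return Array.from(new Set(trimmed));
}

const sentence = "Docs live at https://docs.company.com/api. See also https://example.com, please.";

console.log(sentence.match(linkPattern));
// [ "https://docs.company.com/api.", "https://example.com," ]

console.log(extractTrimmedUrls(sentence));
// [ "https://docs.company.com/api", "https://example.com" ]
```

The trade-off is that a URL which genuinely ends in one of these characters loses it, so this step is best kept optional.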

Deduplication Behavior

JavaScript Set uses exact string comparison for deduplication:

// Exact string matching
const urls = [
  "https://example.com",
  "https://EXAMPLE.COM",     // Different (case)
  "https://example.com/",    // Different (trailing slash)
  "https://example.com",     // Duplicate
  "http://example.com"       // Different (scheme)
];

const unique = Array.from(new Set(urls));
// Result: 4 URLs retained (only the exact duplicate is removed)

// For semantic deduplication, normalize first:
const normalized = urls.map(url => {
  // Lowercase only the scheme and host; paths and queries are case-sensitive
  let n = url.replace(/^https?:\/\/[^\/?#]+/i, m => m.toLowerCase());
  if (n.endsWith('/') && n.length > 9) {
    n = n.slice(0, -1);                // Remove trailing slash
  }
  return n;
});
const semanticUnique = Array.from(new Set(normalized));
// Result: ["https://example.com", "http://example.com"]

How to Extract URLs from Text

  1. Paste text: Enter or paste the text containing URLs.
  2. Click Extract: The tool scans for all HTTP/HTTPS links using regex.
  3. Review results: Unique URLs appear in the output box, one per line.
  4. Copy output: Click "Copy Result" to use the URL list elsewhere.
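
The extract and review steps can be sketched as a single function that produces the text shown in the output box. This is a hypothetical sketch: the `buildOutput` name is an assumption, and the code is not the tool's actual source.

```javascript
// Hypothetical sketch of the Extract step: scan, deduplicate, and format
// the result as one URL per line plus a summary count.
function buildOutput(inputText) {
  const matches = inputText.match(/https?:\/\/[^\s"'<>()]+/gi) || [];
  const unique = Array.from(new Set(matches));
  return {
    lines: unique.join("\n"),
    summary: `Total: ${unique.length} unique URLs found`,
  };
}

const result = buildOutput("See https://example.com and https://example.com or http://test.org/page");
console.log(result.lines);
// https://example.com
// http://test.org/page
console.log(result.summary);
// Total: 2 unique URLs found
```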

Tips

Frequently Asked Questions

How does regex-based URL extraction work?
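Regex-based extraction scans the text for substrings that look like URLs. It approximates the URI syntax defined in RFC 3986 rather than validating against it. The pattern looks for the scheme (http:// or https://) and then consumes every following character until it reaches whitespace, a quote, an angle bracket, or a parenthesis, which covers the authority (domain) and any path, query, and fragment. This tool finds all matches using JavaScript's match() method and removes duplicates using Set.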
What is the RFC 3986 URL standard?
RFC 3986 defines the Uniform Resource Identifier (URI) syntax: scheme://authority/path?query#fragment. The authority includes userinfo@, host, and :port. Valid characters vary by component: alphanumeric plus unreserved characters (-._~) are allowed throughout, while reserved characters (:/?#[]@!$&'()*+,;=) have special meanings and may require percent-encoding.
Why are relative URLs not extracted?
Relative URLs (like /page/path or //cdn.example.com) lack a scheme and are resolved relative to a base URL. This tool extracts only absolute URLs with explicit http:// or https:// schemes to ensure results are immediately usable. Relative URLs require context (the base URL) to be meaningful.
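
If the base URL of the source page is known, relative references can be resolved to absolute URLs with the standard `URL` constructor. A minimal sketch follows; the base URL is an assumed example, and this tool does not perform this step.

```javascript
// Assumed base URL of the page the text came from.
const base = "https://example.com/blog/post";

// new URL(reference, base) resolves a relative reference against the base.
console.log(new URL("/page/path", base).href);
// https://example.com/page/path

console.log(new URL("//cdn.example.com/script.js", base).href);
// https://cdn.example.com/script.js
```
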
How does deduplication handle URL variations?
Deduplication uses exact string matching via JavaScript Set. This means https://Example.com and https://example.com are treated as different URLs (case-sensitive), and https://example.com/page is different from https://example.com/page/. For semantic deduplication (normalizing case, trailing slashes), additional processing would be required.
What URL components are captured?
The pattern captures the full URL including: scheme (http/https), domain/host, optional port number, path segments, query string (everything after ?), and fragment identifier (everything after #). Example: https://api.example.com:8080/v1/users?id=123&name=test#section captures all components.
Why doesn't this extract FTP or other protocols?
This tool focuses on HTTP/HTTPS as they are the most common web protocols. FTP, mailto, tel, file, and other schemes have different use cases and security considerations. The regex can be extended to other schemes that use the :// separator by changing the scheme part to (?:https?|ftp)://. Schemes such as mailto: and tel: have no // after the colon, so they need a separate alternative in the pattern.
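
A sketch of one such extension, assuming FTP and mailto links are wanted. The pattern, the `extractLinks` name, and the sample address are illustrative assumptions, not part of the tool.

```javascript
// http, https, and ftp use "scheme://"; mailto uses "mailto:" with no slashes,
// so it gets its own alternative.
const extendedPattern = /(?:https?|ftp):\/\/[^\s"'<>()]+|mailto:[^\s"'<>()]+/gi;

function extractLinks(text) {
  return Array.from(new Set(text.match(extendedPattern) || []));
}

console.log(extractLinks("Get ftp://files.example.com/doc.pdf or write to mailto:team@example.com today"));
// [ "ftp://files.example.com/doc.pdf", "mailto:team@example.com" ]
```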