URL Extractor
Scan text and extract all HTTP/HTTPS URLs using regex pattern matching with automatic deduplication.
About URL Extractor
This URL extractor uses regular expressions to scan text and identify HTTP/HTTPS links matching standard URI syntax. Results are automatically deduplicated, producing a clean list of unique URLs for analysis, crawling, or content workflows.
URL Regex Pattern Explained
The extraction pattern matches common web URL formats:
Pattern: /https?:\/\/[^\s"'<>()]+/gi

Component breakdown:

┌──────────────┬──────────────────────────────────────────┐
│ https?       │ Scheme: http or https (s is optional)    │
│ ://          │ Scheme separator (literal characters)    │
│ [^\s"'<>()]+ │ URL body: one or more characters that    │
│              │ are NOT whitespace, quotes, or brackets  │
└──────────────┴──────────────────────────────────────────┘

Character class [^\s"'<>()] excludes:

\s - Whitespace (space, tab, newline)
"  - Double quote
'  - Single quote
<  - Less-than (HTML tag delimiter)
>  - Greater-than (HTML tag delimiter)
(  - Opening parenthesis
)  - Closing parenthesis
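As a quick illustration (a minimal sketch, run against a made-up HTML fragment), the excluded delimiters are what let a match end cleanly inside markup:

```javascript
// The closing double quote of the href attribute is excluded by the
// character class, so the match ends exactly at the URL boundary.
const pattern = /https?:\/\/[^\s"'<>()]+/gi;

const html = '<a href="https://example.com/page">link</a>';
console.log(html.match(pattern)); // [ "https://example.com/page" ]
```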
RFC 3986 URL Structure
URI syntax per RFC 3986:
URI = scheme ":" ["//" authority] path ["?" query] ["#" fragment]

Example URL breakdown:

https://user:[email protected]:8080/v1/users?id=123#section

  scheme    → https
  userinfo  → user:pass
  host      → api.site.com
  port      → 8080
  path      → /v1/users
  query     → id=123
  fragment  → section

Components captured by this tool:

✓ scheme (http, https)
✓ host (domain with optional subdomain)
✓ port (optional, e.g., :8080)
✓ path (optional, e.g., /api/users)
✓ query (optional, e.g., ?id=123)
✓ fragment (optional, e.g., #section)

Not captured:

✗ userinfo (user:pass@ - rare in modern URLs)
✗ IPv6 literals in brackets, e.g., [::1]

Percent-encoded characters (e.g., %20) are matched as-is, without decoding.
Supported URL Formats
| Format Type | Example | Extracted |
|---|---|---|
| Basic domain | https://example.com | ✓ Yes |
| With path | https://example.com/page/subpage | ✓ Yes |
| With query string | https://example.com/search?q=test&lang=en | ✓ Yes |
| With fragment | https://example.com/page#section | ✓ Yes |
| With port | http://localhost:8080/api | ✓ Yes |
| Subdomain | https://api.v2.company.com | ✓ Yes |
| Combined | https://api.example.com:8080/v1/users?id=123#top | ✓ Yes |
| Relative path | /page/path | ✗ No |
| Protocol-relative | //cdn.example.com/script.js | ✗ No |
| FTP protocol | ftp://files.example.com/doc.pdf | ✗ No |
| mailto | mailto:[email protected] | ✗ No |
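The yes/no column can be checked directly against the extraction pattern; this sketch runs a few of the table's rows through it:

```javascript
const pattern = /https?:\/\/[^\s"'<>()]+/gi;

const rows = [
  "https://example.com",             // basic domain
  "http://localhost:8080/api",       // with port
  "/page/path",                      // relative path
  "//cdn.example.com/script.js",     // protocol-relative
  "ftp://files.example.com/doc.pdf", // other scheme
];

for (const row of rows) {
  // A /g regex is stateful with .test(), so reset lastIndex each pass
  pattern.lastIndex = 0;
  console.log(`${pattern.test(row) ? "extracted" : "skipped "}  ${row}`);
}
```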
Extraction Algorithm
JavaScript URL Extraction Implementation:
function extractUrls(text) {
// Step 1: Define regex pattern for HTTP/HTTPS URLs
const pattern = /(https?:\/\/[^\s"'<>()]+)/gi;
// Step 2: Find all matches (returns array or null)
const matches = text.match(pattern) || [];
// Step 3: Deduplicate using Set
const unique = Array.from(new Set(matches));
// Step 4: Return results
return unique;
}
// Usage example:
const text = `
Check out https://example.com and https://example.com again.
Also visit http://test.org/page?id=123#section for details.
The API at https://api.company.com:8080/v1/users is documented
at https://docs.company.com/api.
`;
const urls = extractUrls(text);
console.log(urls);
// [
//   "https://example.com",
//   "http://test.org/page?id=123#section",
//   "https://api.company.com:8080/v1/users",
//   "https://docs.company.com/api."
// ]
// Note: the last URL keeps the sentence-ending period, because "."
// is not in the excluded character set.
Common Use Cases
- Log Analysis: Extract visited URLs from browser history, server logs, or proxy logs
- Web Scraping: Pull all links from HTML content, markdown, or plain text documents
- SEO Auditing: Collect internal and external links for link profile analysis
- Redirect Tracking: Find all URLs in redirect chains or HTTP response headers
- Link Verification: Extract URLs before running link checker or crawler tools
- Research: Collect source URLs from articles, papers, or documentation
- Social Media Analysis: Extract links from posts, comments, or messages
- Security Analysis: Identify external domains referenced in documents or logs
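As a sketch of the log-analysis case (the log format and hostnames here are made up), the same pattern pulls referrer URLs out of access-log lines:

```javascript
// Hypothetical access-log lines; only the quoted referrer field
// contains absolute URLs, and duplicates collapse via Set.
const logLines = [
  '192.0.2.1 - [10/Oct/2024] "GET /index HTTP/1.1" 200 "https://referrer.example.com/page"',
  '192.0.2.2 - [10/Oct/2024] "GET /about HTTP/1.1" 200 "https://referrer.example.com/page"',
];

const pattern = /https?:\/\/[^\s"'<>()]+/gi;
const urls = Array.from(new Set(logLines.join("\n").match(pattern) || []));
console.log(urls); // [ "https://referrer.example.com/page" ]
```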
URL Extraction Example
Input Text:

"Check out https://example.com and https://example.com again. Also visit
http://test.org/page?id=123#section for details. The API at
https://api.company.com:8080/v1/users is documented at
https://docs.company.com/api."

Extraction Process:

Match 1: https://example.com
Match 2: https://example.com (duplicate)
Match 3: http://test.org/page?id=123#section
Match 4: https://api.company.com:8080/v1/users
Match 5: https://docs.company.com/api. (the sentence-ending period is captured, since "." is not excluded)

Output (deduplicated, one per line):

https://example.com
http://test.org/page?id=123#section
https://api.company.com:8080/v1/users
https://docs.company.com/api.

Total: 4 unique URLs found
URL Component Reference
| Component | Description | Example |
|---|---|---|
| Scheme | Protocol identifier | https:// |
| Authority | Domain/host with optional port | example.com |
| Port | TCP/IP port number (optional) | :8080 |
| Path | Resource location on server | /api/v1/users |
| Query | Key-value parameters | ?id=123&name=test |
| Fragment | Page section anchor | #section1 |
Extraction Limitations
- HTTP/HTTPS only: FTP, mailto, tel, file://, and other protocols are not extracted
- No relative URLs: Paths like /page or protocol-relative //cdn.example.com are not matched
- No validation: Does not verify if URLs are accessible or return valid responses
- Parentheses handling: ( and ) are excluded from matches, so a URL that legitimately contains parentheses (common in Wikipedia paths) is truncated at the first parenthesis
- Trailing punctuation: a sentence-ending period or comma immediately after a URL is included in the match, since . and , are not excluded
- Encoded URLs: Punycode international domains (xn--...) are matched as-is
- Case sensitivity: Deduplication is case-sensitive (HTTPS://EXAMPLE.COM ≠ https://example.com)
- Whitespace in URLs: Spaces terminate the match; encoded spaces (%20) are preserved
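The parentheses limitation is easy to reproduce (the Wikipedia-style URL here is illustrative):

```javascript
const pattern = /https?:\/\/[^\s"'<>()]+/gi;

// "(" is excluded from the character class, so the match stops there
const text = "See https://en.wikipedia.org/wiki/Foo_(disambiguation) for details.";
console.log(text.match(pattern));
// [ "https://en.wikipedia.org/wiki/Foo_" ]  (truncated at the "(")
```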
Deduplication Behavior
JavaScript Set uses exact string comparison for deduplication:
// Exact string matching
const urls = [
"https://example.com",
"https://EXAMPLE.COM", // Different (case)
"https://example.com/", // Different (trailing slash)
"https://example.com", // Duplicate
"http://example.com" // Different (scheme)
];
const unique = Array.from(new Set(urls));
// Result: 4 URLs retained (only the exact duplicate removed)
// For semantic deduplication, normalize first:
const normalized = urls.map(url => {
  // Caution: this lowercases the whole URL; strictly speaking only
  // the scheme and host are case-insensitive, not the path.
  let n = url.toLowerCase();
  if (n.endsWith('/') && n.length > 9) {
    n = n.slice(0, -1); // Remove trailing slash (guard keeps "http://" intact)
  }
  return n;
});
const semanticUnique = Array.from(new Set(normalized));
// Consolidates case and trailing-slash variations of the same URL
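An alternative sketch: the built-in URL class normalizes only the parts RFC 3986 defines as case-insensitive (scheme and host), leaving path and query case intact:

```javascript
// URL lowercases the protocol and hostname automatically; the path
// keeps its original case, which toLowerCase() on the whole string
// would destroy.
function normalizeUrl(raw) {
  return new URL(raw).href; // throws a TypeError on invalid input
}

console.log(normalizeUrl("HTTPS://EXAMPLE.COM/Path?Q=1"));
// "https://example.com/Path?Q=1"
```

Note that new URL("https://example.com").href yields "https://example.com/" with a trailing slash, so root URLs also get a single canonical form.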
How to Extract URLs from Text
- Paste text: Enter or paste the text containing URLs.
- Click Extract: The tool scans for all HTTP/HTTPS links using regex.
- Review results: Unique URLs appear in the output box, one per line.
- Copy output: Click "Copy Result" to use the URL list elsewhere.
Tips
- Works with HTML, plain text, markdown, code comments, and logs
- Duplicates are automatically removed using Set
- Each unique URL appears on a separate line
- Full URLs with http:// or https:// prefix are required
- For case-insensitive deduplication, normalize results externally
Frequently Asked Questions
- How does regex-based URL extraction work?
- URL extraction uses regular expressions to match the URI syntax defined in RFC 3986. The pattern looks for the scheme (http:// or https://), followed by the authority (domain), and optional path, query, and fragment components. This tool finds all matches using JavaScript's match() method and removes duplicates using Set.
- What is the RFC 3986 URL standard?
- RFC 3986 defines the Uniform Resource Identifier (URI) syntax: scheme://authority/path?query#fragment. The authority includes userinfo@, host, and :port. Valid characters vary by component—alphanumeric plus unreserved characters (-._~) are allowed throughout, while reserved characters (:/?#[]@!$&'()*+,;=) have special meanings and may require percent-encoding.
- Why are relative URLs not extracted?
- Relative URLs (like /page/path or //cdn.example.com) lack a scheme and are resolved relative to a base URL. This tool extracts only absolute URLs with explicit http:// or https:// schemes to ensure results are immediately usable. Relative URLs require context (the base URL) to be meaningful.
- How does deduplication handle URL variations?
- Deduplication uses exact string matching via JavaScript Set. This means https://Example.com and https://example.com are treated as different URLs (case-sensitive), and https://example.com/page is different from https://example.com/page/. For semantic deduplication (normalizing case, trailing slashes), additional processing would be required.
- What URL components are captured?
- The pattern captures the full URL including: scheme (http/https), domain/host, optional port number, path segments, query string (everything after ?), and fragment identifier (everything after #). Example: https://api.example.com:8080/v1/users?id=123&name=test#section captures all components.
- Why doesn't this extract FTP or other protocols?
- This tool focuses on HTTP/HTTPS because they are the most common web protocols. FTP, mailto, tel, file://, and other schemes have different use cases and security considerations. The regex can be extended to other "//"-style schemes by widening the alternation, e.g. (https?|ftp):\/\/; note that mailto: takes no // after the scheme, so it needs a separate pattern rather than joining that alternation.