File size: 11,209 Bytes
5fae594 |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 |
# tldts - Blazing Fast URL Parsing
`tldts` is a JavaScript library to extract hostnames, domains, public suffixes, top-level domains and subdomains from URLs.
**Features**:
1. Tuned for **performance** (order of 0.1 to 1 μs per input)
2. Handles both URLs and hostnames
3. Full Unicode/IDNA support
4. Support parsing email addresses
5. Detect IPv4 and IPv6 addresses
6. Continuously updated version of the public suffix list
7. **TypeScript**, ships with `umd`, `esm`, `cjs` bundles and _type definitions_
8. Small bundles and small memory footprint
9. Battle tested: full test coverage and production use
# Install
```bash
npm install --save tldts
```
# Usage
Using the command-line interface:
```js
$ npx tldts 'http://www.writethedocs.org/conf/eu/2017/'
{
"domain": "writethedocs.org",
"domainWithoutSuffix": "writethedocs",
"hostname": "www.writethedocs.org",
"isIcann": true,
"isIp": false,
"isPrivate": false,
"publicSuffix": "org",
"subdomain": "www"
}
```
Programmatically:
```js
const { parse } = require('tldts');
// Retrieving hostname related informations of a given URL
parse('http://www.writethedocs.org/conf/eu/2017/');
// { domain: 'writethedocs.org',
// domainWithoutSuffix: 'writethedocs',
// hostname: 'www.writethedocs.org',
// isIcann: true,
// isIp: false,
// isPrivate: false,
// publicSuffix: 'org',
// subdomain: 'www' }
```
Modern _ES6 modules import_ is also supported:
```js
import { parse } from 'tldts';
```
Alternatively, you can try it _directly in your browser_ here: https://npm.runkit.com/tldts
# API
- `tldts.parse(url | hostname, options)`
- `tldts.getHostname(url | hostname, options)`
- `tldts.getDomain(url | hostname, options)`
- `tldts.getPublicSuffix(url | hostname, options)`
- `tldts.getSubdomain(url, | hostname, options)`
- `tldts.getDomainWithoutSuffix(url | hostname, options)`
The behavior of `tldts` can be customized using an `options` argument for all
the functions exposed as part of the public API. This is useful to both change
the behavior of the library as well as fine-tune the performance depending on
your inputs.
```js
{
// Use suffixes from ICANN section (default: true)
allowIcannDomains: boolean;
// Use suffixes from Private section (default: false)
allowPrivateDomains: boolean;
// Extract and validate hostname (default: true)
// When set to `false`, inputs will be considered valid hostnames.
extractHostname: boolean;
// Validate hostnames after parsing (default: true)
// If a hostname is not valid, not further processing is performed. When set
// to `false`, inputs to the library will be considered valid and parsing will
// proceed regardless.
validateHostname: boolean;
// Perform IP address detection (default: true).
detectIp: boolean;
// Assume that both URLs and hostnames can be given as input (default: true)
// If set to `false` we assume only URLs will be given as input, which
// speed-ups processing.
mixedInputs: boolean;
// Specifies extra valid suffixes (default: null)
validHosts: string[] | null;
}
```
The `parse` method returns handy **properties about a URL or a hostname**.
```js
const tldts = require('tldts');
tldts.parse('https://spark-public.s3.amazonaws.com/dataanalysis/loansData.csv');
// { domain: 'amazonaws.com',
// domainWithoutSuffix: 'amazonaws',
// hostname: 'spark-public.s3.amazonaws.com',
// isIcann: true,
// isIp: false,
// isPrivate: false,
// publicSuffix: 'com',
// subdomain: 'spark-public.s3' }
tldts.parse(
'https://spark-public.s3.amazonaws.com/dataanalysis/loansData.csv',
{ allowPrivateDomains: true },
);
// { domain: 'spark-public.s3.amazonaws.com',
// domainWithoutSuffix: 'spark-public',
// hostname: 'spark-public.s3.amazonaws.com',
// isIcann: false,
// isIp: false,
// isPrivate: true,
// publicSuffix: 's3.amazonaws.com',
// subdomain: '' }
tldts.parse('gopher://domain.unknown/');
// { domain: 'domain.unknown',
// domainWithoutSuffix: 'domain',
// hostname: 'domain.unknown',
// isIcann: false,
// isIp: false,
// isPrivate: true,
// publicSuffix: 'unknown',
// subdomain: '' }
tldts.parse('https://192.168.0.0'); // IPv4
// { domain: null,
// domainWithoutSuffix: null,
// hostname: '192.168.0.0',
// isIcann: null,
// isIp: true,
// isPrivate: null,
// publicSuffix: null,
// subdomain: null }
tldts.parse('https://[::1]'); // IPv6
// { domain: null,
// domainWithoutSuffix: null,
// hostname: '::1',
// isIcann: null,
// isIp: true,
// isPrivate: null,
// publicSuffix: null,
// subdomain: null }
tldts.parse('[email protected]'); // email
// { domain: 'emailprovider.co.uk',
// domainWithoutSuffix: 'emailprovider',
// hostname: 'emailprovider.co.uk',
// isIcann: true,
// isIp: false,
// isPrivate: false,
// publicSuffix: 'co.uk',
// subdomain: '' }
```
| Property Name | Type | Description |
| :-------------------- | :----- | :---------------------------------------------- |
| `hostname` | `str` | `hostname` of the input extracted automatically |
| `domain` | `str` | Domain (tld + sld) |
| `domainWithoutSuffix` | `str` | Domain without public suffix |
| `subdomain` | `str` | Sub domain (what comes after `domain`) |
| `publicSuffix` | `str` | Public Suffix (tld) of `hostname` |
| `isIcann` | `bool` | Does TLD come from ICANN part of the list |
| `isPrivate` | `bool` | Does TLD come from Private part of the list |
| `isIP` | `bool` | Is `hostname` an IP address? |
## Single purpose methods
These methods are shorthands if you want to retrieve only a single value (and
will perform better than `parse` because less work will be needed).
### getHostname(url | hostname, options?)
Returns the hostname from a given string.
```javascript
const { getHostname } = require('tldts');
getHostname('google.com'); // returns `google.com`
getHostname('fr.google.com'); // returns `fr.google.com`
getHostname('fr.google.google'); // returns `fr.google.google`
getHostname('foo.google.co.uk'); // returns `foo.google.co.uk`
getHostname('t.co'); // returns `t.co`
getHostname('fr.t.co'); // returns `fr.t.co`
getHostname(
'https://user:[email protected]:8080/some/path?and&query#hash',
); // returns `example.co.uk`
```
### getDomain(url | hostname, options?)
Returns the fully qualified domain from a given string.
```javascript
const { getDomain } = require('tldts');
getDomain('google.com'); // returns `google.com`
getDomain('fr.google.com'); // returns `google.com`
getDomain('fr.google.google'); // returns `google.google`
getDomain('foo.google.co.uk'); // returns `google.co.uk`
getDomain('t.co'); // returns `t.co`
getDomain('fr.t.co'); // returns `t.co`
getDomain('https://user:[email protected]:8080/some/path?and&query#hash'); // returns `example.co.uk`
```
### getDomainWithoutSuffix(url | hostname, options?)
Returns the domain (as returned by `getDomain(...)`) without the public suffix part.
```javascript
const { getDomainWithoutSuffix } = require('tldts');
getDomainWithoutSuffix('google.com'); // returns `google`
getDomainWithoutSuffix('fr.google.com'); // returns `google`
getDomainWithoutSuffix('fr.google.google'); // returns `google`
getDomainWithoutSuffix('foo.google.co.uk'); // returns `google`
getDomainWithoutSuffix('t.co'); // returns `t`
getDomainWithoutSuffix('fr.t.co'); // returns `t`
getDomainWithoutSuffix(
'https://user:[email protected]:8080/some/path?and&query#hash',
); // returns `example`
```
### getSubdomain(url | hostname, options?)
Returns the complete subdomain for a given string.
```javascript
const { getSubdomain } = require('tldts');
getSubdomain('google.com'); // returns ``
getSubdomain('fr.google.com'); // returns `fr`
getSubdomain('google.co.uk'); // returns ``
getSubdomain('foo.google.co.uk'); // returns `foo`
getSubdomain('moar.foo.google.co.uk'); // returns `moar.foo`
getSubdomain('t.co'); // returns ``
getSubdomain('fr.t.co'); // returns `fr`
getSubdomain(
'https://user:[email protected]:443/some/path?and&query#hash',
); // returns `secure`
```
### getPublicSuffix(url | hostname, options?)
Returns the [public suffix][] for a given string.
```javascript
const { getPublicSuffix } = require('tldts');
getPublicSuffix('google.com'); // returns `com`
getPublicSuffix('fr.google.com'); // returns `com`
getPublicSuffix('google.co.uk'); // returns `co.uk`
getPublicSuffix('s3.amazonaws.com'); // returns `com`
getPublicSuffix('s3.amazonaws.com', { allowPrivateDomains: true }); // returns `s3.amazonaws.com`
getPublicSuffix('tld.is.unknown'); // returns `unknown`
```
# Troubleshooting
## Retrieving subdomain of `localhost` and custom hostnames
`tldts` methods `getDomain` and `getSubdomain` are designed to **work only with _known and valid_ TLDs**.
This way, you can trust what a domain is.
`localhost` is a valid hostname but not a TLD. You can pass additional options to each method exposed by `tldts`:
```js
const tldts = require('tldts');
tldts.getDomain('localhost'); // returns null
tldts.getSubdomain('vhost.localhost'); // returns null
tldts.getDomain('localhost', { validHosts: ['localhost'] }); // returns 'localhost'
tldts.getSubdomain('vhost.localhost', { validHosts: ['localhost'] }); // returns 'vhost'
```
## Updating the TLDs List
`tldts` made the opinionated choice of shipping with a list of suffixes directly
in its bundle. There is currently no mechanism to update the lists yourself, but
we make sure that the version shipped is always up-to-date.
If you keep `tldts` updated, the lists should be up-to-date as well!
# Performance
`tldts` is the _fastest JavaScript library_ available for parsing hostnames. It is able to parse _millions of inputs per second_ (typically 2-3M depending on your hardware and inputs). It also offers granular options to fine-tune the behavior and performance of the library depending on the kind of inputs you are dealing with (e.g.: if you know you only manipulate valid hostnames you can disable the hostname extraction step with `{ extractHostname: false }`).
Please see [this detailed comparison](./comparison/comparison.md) with other available libraries.
## Contributors
`tldts` is based upon the excellent `tld.js` library and would not exist without
the many contributors who worked on the project:
<a href="graphs/contributors"><img src="https://opencollective.com/tldjs/contributors.svg?width=890" /></a>
This project would not be possible without the amazing Mozilla's
[public suffix list][]. Thank you for your hard work!
# License
[MIT License](LICENSE).
[badge-ci]: https://secure.travis-ci.org/remusao/tldts.svg?branch=master
[badge-downloads]: https://img.shields.io/npm/dm/tldts.svg
[public suffix list]: https://publicsuffix.org/list/
[list the recent changes]: https://github.com/publicsuffix/list/commits/master
[changes Atom Feed]: https://github.com/publicsuffix/list/commits/master.atom
[public suffix]: https://publicsuffix.org/learn/
|