- <p align="center">
-
- <img src="./logo.png" alt="Hyntax project logo — lego bricks in the shape of a capital letter H" width="150">
-
- </p>
-
- # Hyntax
-
- Straightforward HTML parser for JavaScript. [Live Demo](https://astexplorer.net/#/gist/6bf7f78077333cff124e619aebfb5b42/latest).
-
- - **Simple.** API is straightforward, output is clear.
- - **Forgiving.** Like a browser, it tolerates and parses invalid HTML.
- - **Supports streaming.** Can process HTML while it's still being loaded.
- - **No dependencies.**
-
-
-
- ## Table Of Contents
-
- - [Usage](#usage)
- - [TypeScript Typings](#typescript-typings)
- - [Streaming](#streaming)
- - [Tokens](#tokens)
- - [AST Format](#ast-format)
- - [API Reference](#api-reference)
- - [Types Reference](#types-reference)
-
-
-
- ## Usage
-
- ```bash
- npm install hyntax
- ```
-
- ```javascript
- const { tokenize, constructTree } = require('hyntax')
- const util = require('util')
-
- const inputHTML = `
- <html>
-   <body>
-     <input type="text" placeholder="Don't type">
-     <button>Don't press</button>
-   </body>
- </html>
- `
-
- const { tokens } = tokenize(inputHTML)
- const { ast } = constructTree(tokens)
-
- console.log(JSON.stringify(tokens, null, 2))
- console.log(util.inspect(ast, { showHidden: false, depth: null }))
- ```
-
- ## TypeScript Typings
-
- Hyntax is written in JavaScript but has [integrated TypeScript typings](./index.d.ts) to help you navigate around its data structures. There is also a [Types Reference](#types-reference) that covers the most common types.
-
-
-
- ## Streaming
-
- Use the `StreamTokenizer` and `StreamTreeConstructor` classes to parse HTML chunk by chunk while it's still being loaded from the network or read from disk.
-
- ```javascript
- const { StreamTokenizer, StreamTreeConstructor } = require('hyntax')
- const http = require('http')
- const util = require('util')
-
- http.get('http://info.cern.ch', (res) => {
-   const streamTokenizer = new StreamTokenizer()
-   const streamTreeConstructor = new StreamTreeConstructor()
-
-   let resultTokens = []
-   let resultAst
-
-   // Feed response chunks to the tokenizer, then tokens to the tree constructor
-   res.pipe(streamTokenizer).pipe(streamTreeConstructor)
-
-   streamTokenizer
-     .on('data', (tokens) => {
-       // Accumulate the tokens emitted for each chunk
-       resultTokens = resultTokens.concat(tokens)
-     })
-     .on('end', () => {
-       console.log(JSON.stringify(resultTokens, null, 2))
-     })
-
-   streamTreeConstructor
-     .on('data', (ast) => {
-       // Keep the most recently emitted AST
-       resultAst = ast
-     })
-     .on('end', () => {
-       console.log(util.inspect(resultAst, { showHidden: false, depth: null }))
-     })
- }).on('error', (err) => {
-   throw err
- })
- ```
-
-
-
- ## Tokens
-
- Here are all the kinds of tokens that Hyntax extracts from an HTML string.
-
- 
-
- Each token conforms to the [Tokenizer.Token](#TokenizerToken) interface.
-
-
-
- ## AST Format
-
- The resulting syntax tree always has one top-level [Document Node](#ast-node-types), with optional child nodes nested within.
-
- <!-- You can play around with the [AST Explorer](https://astexplorer.net) to see how AST looks like. -->
-
- ```javascript
- {
-   nodeType: TreeConstructor.NodeTypes.Document,
-   content: {
-     children: [
-       {
-         nodeType: TreeConstructor.NodeTypes.AnyNodeType,
-         content: {…}
-       },
-       {
-         nodeType: TreeConstructor.NodeTypes.AnyNodeType,
-         content: {…}
-       }
-     ]
-   }
- }
- ```
-
- The content of each node is specific to the node's type. All of them are described in the [AST Node Types](#ast-node-types) reference.
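
To illustrate the shape above, here is a minimal sketch of a depth-first walk over such a tree. It relies only on what the example shows: every node has `nodeType` and `content`, and nested nodes live in `content.children`. The `walk` helper, the sample tree, and the string values used for `nodeType` are hypothetical placeholders, not actual Hyntax output.

```javascript
// Depth-first traversal of an AST shaped like the example above.
// `walk` is a hypothetical helper, not part of the Hyntax API.
function walk (node, visit, depth = 0) {
  visit(node, depth)

  // Only nodes that can contain other nodes carry `content.children`.
  const children = (node.content && node.content.children) || []

  for (const child of children) {
    walk(child, visit, depth + 1)
  }
}

// Hand-written sample tree; the nodeType strings are placeholders.
const sampleAst = {
  nodeType: 'document',
  content: {
    children: [
      {
        nodeType: 'tag',
        content: {
          name: 'p',
          children: [{ nodeType: 'text', content: {} }]
        }
      },
      { nodeType: 'comment', content: {} }
    ]
  }
}

// Record each visited node type, indented by its depth in the tree.
const visited = []
walk(sampleAst, (node, depth) => {
  visited.push(`${'  '.repeat(depth)}${node.nodeType}`)
})

console.log(visited.join('\n'))
```

Passing a `visit` callback keeps the traversal separate from whatever is done with each node, so the same walk can back a linter, a pretty-printer, or a search.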
-
-
-
- ## API Reference
-
- ### Tokenizer
-
- Hyntax has its tokenizer as a separate module. You can use the generated tokens on their own or pass them on to the tree constructor to build an AST.
-
- #### Interface
-
- ```typescript
- tokenize(html: string): Tokenizer.Result
- ```
-
- #### Arguments
-
- - `html`
- HTML string to process.
- Required.
- Type: string.
-
- #### Returns [Tokenizer.Result](#TokenizerResult)
-
- ### Tree Constructor
-
- Once you have an array of tokens, you can pass it to the tree constructor to build an AST.
-
- #### Interface
-
- ```typescript
- constructTree(tokens: Tokenizer.AnyToken[]): TreeConstructor.Result
- ```
-
- #### Arguments
-
- - `tokens`
- Array of tokens received from the tokenizer.
- Required.
- Type: [Tokenizer.AnyToken[]](#tokenizeranytoken)
-
- #### Returns [TreeConstructor.Result](#TreeConstructorResult)
-
-
-
- ## Types Reference
-
- #### Tokenizer.Result
-
- ```typescript
- interface Result {
-   state: Tokenizer.State
-   tokens: Tokenizer.AnyToken[]
- }
- ```
-
- - `state`
- The current state of tokenizer. It can be persisted and passed to the next tokenizer call if the input is coming in chunks.
- - `tokens`
- Array of resulting tokens.
- Type: [Tokenizer.AnyToken[]](#tokenizeranytoken)
-
- #### TreeConstructor.Result
-
- ```typescript
- interface Result {
-   state: State
-   ast: AST
- }
- ```
-
- - `state`
- The current state of the tree constructor. It can be persisted and passed to the next tree constructor call when tokens arrive in chunks.
-
- - `ast`
- Resulting AST.
- Type: [TreeConstructor.AST](#treeconstructorast)
-
- #### Tokenizer.Token
-
- Generic Token. Other interfaces use it to create specific Token types.
-
- ```typescript
- interface Token<T extends TokenTypes.AnyTokenType> {
-   type: T
-   content: string
-   startPosition: number
-   endPosition: number
- }
- ```
-
- - `type`
- One of the [Token types](#TokenizerTokenTypesAnyTokenType).
-
- - `content`
- Piece of original HTML string which was recognized as a token.
-
- - `startPosition`
- Index of a character in the input HTML string where the token starts.
-
- - `endPosition`
- Index of a character in the input HTML string where the token ends.
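
As a quick illustration of how `startPosition` and `endPosition` relate to `content`, the sketch below slices a token's text back out of the input. It assumes that both positions are zero-based and that `endPosition` is inclusive (the index of the token's last character), which the descriptions above suggest but do not state outright. The `sliceToken` helper, the sample token, and its `type` string are hypothetical.

```javascript
// Hypothetical helper: recover a token's text from the original input.
// Assumes zero-based positions with an inclusive endPosition.
function sliceToken (html, token) {
  return html.slice(token.startPosition, token.endPosition + 1)
}

const html = '<p>Hi</p>'

// Hand-written token for the text "Hi"; the type value is a placeholder.
const textToken = {
  type: 'text',
  content: 'Hi',
  startPosition: 3,
  endPosition: 4
}

// Under the assumptions above, the slice matches the token's content.
const sliced = sliceToken(html, textToken)
console.log(sliced === textToken.content)
```

If the positions turn out to follow a different convention, only the `+ 1` in `sliceToken` needs to change.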
-
- #### Tokenizer.TokenTypes.AnyTokenType
-
- Shortcut type covering all possible token types.
-
- ```typescript
- type AnyTokenType =
-   | Text
-   | OpenTagStart
-   | AttributeKey
-   | AttributeAssigment
-   | AttributeValueWrapperStart
-   | AttributeValue
-   | AttributeValueWrapperEnd
-   | OpenTagEnd
-   | CloseTag
-   | OpenTagStartScript
-   | ScriptTagContent
-   | OpenTagEndScript
-   | CloseTagScript
-   | OpenTagStartStyle
-   | StyleTagContent
-   | OpenTagEndStyle
-   | CloseTagStyle
-   | DoctypeStart
-   | DoctypeEnd
-   | DoctypeAttributeWrapperStart
-   | DoctypeAttribute
-   | DoctypeAttributeWrapperEnd
-   | CommentStart
-   | CommentContent
-   | CommentEnd
- ```
-
- #### Tokenizer.AnyToken
-
- Shortcut to reference any possible token.
-
- ```typescript
- type AnyToken = Token<TokenTypes.AnyTokenType>
- ```
-
- #### TreeConstructor.AST
-
- An alias for DocumentNode. The AST always has one top-level DocumentNode. See [AST Node Types](#ast-node-types).
-
- ```typescript
- type AST = TreeConstructor.DocumentNode
- ```
-
- ### AST Node Types
-
- There are seven possible types of Node. Each type has its own content.
-
- ```typescript
- type DocumentNode = Node<NodeTypes.Document, NodeContents.Document>
- ```
-
- ```typescript
- type DoctypeNode = Node<NodeTypes.Doctype, NodeContents.Doctype>
- ```
-
- ```typescript
- type TextNode = Node<NodeTypes.Text, NodeContents.Text>
- ```
-
- ```typescript
- type TagNode = Node<NodeTypes.Tag, NodeContents.Tag>
- ```
-
- ```typescript
- type CommentNode = Node<NodeTypes.Comment, NodeContents.Comment>
- ```
-
- ```typescript
- type ScriptNode = Node<NodeTypes.Script, NodeContents.Script>
- ```
-
- ```typescript
- type StyleNode = Node<NodeTypes.Style, NodeContents.Style>
- ```
-
- Interfaces for each content type:
-
- - [Document](#TreeConstructorNodeContentsDocument)
- - [Doctype](#TreeConstructorNodeContentsDoctype)
- - [Text](#TreeConstructorNodeContentsText)
- - [Tag](#TreeConstructorNodeContentsTag)
- - [Comment](#TreeConstructorNodeContentsComment)
- - [Script](#TreeConstructorNodeContentsScript)
- - [Style](#TreeConstructorNodeContentsStyle)
-
- #### TreeConstructor.Node
-
- Generic Node. Other interfaces use it to create specific Nodes by providing the type of the Node and the type of the content inside it.
-
- ```typescript
- interface Node<T extends NodeTypes.AnyNodeType, C extends NodeContents.AnyNodeContent> {
-   nodeType: T
-   content: C
- }
- ```
-
- #### TreeConstructor.NodeTypes.AnyNodeType
-
- Shortcut type of all possible Node types.
-
- ```typescript
- type AnyNodeType =
-   | Document
-   | Doctype
-   | Tag
-   | Text
-   | Comment
-   | Script
-   | Style
- ```
-
- ### Node Content Types
-
- #### TreeConstructor.NodeContents.AnyNodeContent
-
- Shortcut type of all possible types of content inside a Node.
-
- ```typescript
- type AnyNodeContent =
-   | Document
-   | Doctype
-   | Text
-   | Tag
-   | Comment
-   | Script
-   | Style
- ```
-
- #### TreeConstructor.NodeContents.Document
-
- ```typescript
- interface Document {
-   children: AnyNode[]
- }
- ```
-
- #### TreeConstructor.NodeContents.Doctype
-
- ```typescript
- interface Doctype {
-   start: Tokenizer.Token<Tokenizer.TokenTypes.DoctypeStart>
-   attributes?: DoctypeAttribute[]
-   end: Tokenizer.Token<Tokenizer.TokenTypes.DoctypeEnd>
- }
- ```
-
- #### TreeConstructor.NodeContents.Text
-
- ```typescript
- interface Text {
-   value: Tokenizer.Token<Tokenizer.TokenTypes.Text>
- }
- ```
-
- #### TreeConstructor.NodeContents.Tag
-
- ```typescript
- interface Tag {
-   name: string
-   selfClosing: boolean
-   openStart: Tokenizer.Token<Tokenizer.TokenTypes.OpenTagStart>
-   attributes?: TagAttribute[]
-   openEnd: Tokenizer.Token<Tokenizer.TokenTypes.OpenTagEnd>
-   children?: AnyNode[]
-   close: Tokenizer.Token<Tokenizer.TokenTypes.CloseTag>
- }
- ```
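
Combining the Document, Text, and Tag content shapes above, the following sketch gathers the text found under a node. It assumes that text nodes can be recognized by their `nodeType` and that their text lives in `content.value.content`. The `TEXT_NODE_TYPE` value, the `collectText` helper, and the sample tree are hypothetical, and the sample omits token fields that the helper does not read.

```javascript
// Placeholder: substitute the nodeType value Hyntax actually uses for text nodes.
const TEXT_NODE_TYPE = 'text'

// Hypothetical helper: concatenate the content of all text nodes under `node`.
function collectText (node) {
  if (node.nodeType === TEXT_NODE_TYPE) {
    return node.content.value.content
  }

  // Document and Tag contents may carry children; other contents do not.
  const children = node.content.children || []

  return children.map(collectText).join('')
}

// Hand-written sample tree for <button>Don't press</button>.
const tree = {
  nodeType: 'document',
  content: {
    children: [
      {
        nodeType: 'tag',
        content: {
          name: 'button',
          selfClosing: false,
          children: [
            { nodeType: 'text', content: { value: { content: "Don't press" } } }
          ]
        }
      }
    ]
  }
}

const text = collectText(tree)
console.log(text)
```

Because `children` is optional on a Tag, the helper falls back to an empty array instead of assuming the property exists.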
-
- #### TreeConstructor.NodeContents.Comment
-
- ```typescript
- interface Comment {
-   start: Tokenizer.Token<Tokenizer.TokenTypes.CommentStart>
-   value: Tokenizer.Token<Tokenizer.TokenTypes.CommentContent>
-   end: Tokenizer.Token<Tokenizer.TokenTypes.CommentEnd>
- }
- ```
-
- #### TreeConstructor.NodeContents.Script
-
- ```typescript
- interface Script {
-   openStart: Tokenizer.Token<Tokenizer.TokenTypes.OpenTagStartScript>
-   attributes?: TagAttribute[]
-   openEnd: Tokenizer.Token<Tokenizer.TokenTypes.OpenTagEndScript>
-   value: Tokenizer.Token<Tokenizer.TokenTypes.ScriptTagContent>
-   close: Tokenizer.Token<Tokenizer.TokenTypes.CloseTagScript>
- }
- ```
-
- #### TreeConstructor.NodeContents.Style
-
- ```typescript
- interface Style {
-   openStart: Tokenizer.Token<Tokenizer.TokenTypes.OpenTagStartStyle>
-   attributes?: TagAttribute[]
-   openEnd: Tokenizer.Token<Tokenizer.TokenTypes.OpenTagEndStyle>
-   value: Tokenizer.Token<Tokenizer.TokenTypes.StyleTagContent>
-   close: Tokenizer.Token<Tokenizer.TokenTypes.CloseTagStyle>
- }
- ```
-
- #### TreeConstructor.DoctypeAttribute
-
- ```typescript
- interface DoctypeAttribute {
-   startWrapper?: Tokenizer.Token<Tokenizer.TokenTypes.DoctypeAttributeWrapperStart>
-   value: Tokenizer.Token<Tokenizer.TokenTypes.DoctypeAttribute>
-   endWrapper?: Tokenizer.Token<Tokenizer.TokenTypes.DoctypeAttributeWrapperEnd>
- }
- ```
-
- #### TreeConstructor.TagAttribute
-
- ```typescript
- interface TagAttribute {
-   key?: Tokenizer.Token<Tokenizer.TokenTypes.AttributeKey>
-   startWrapper?: Tokenizer.Token<Tokenizer.TokenTypes.AttributeValueWrapperStart>
-   value?: Tokenizer.Token<Tokenizer.TokenTypes.AttributeValue>
-   endWrapper?: Tokenizer.Token<Tokenizer.TokenTypes.AttributeValueWrapperEnd>
- }
- ```
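
Since every field of `TagAttribute` is optional, code that reads attributes has to handle missing keys and values. The sketch below flattens an attribute list into a plain object. The `attributesToObject` helper is hypothetical, and mapping valueless attributes to an empty string is a choice made for this example, not Hyntax behavior. The sample tokens keep only the `content` field for brevity.

```javascript
// Hypothetical helper: flatten TagAttribute[] into a plain { key: value } object.
// Attributes without a key are skipped; attributes without a value map to ''.
function attributesToObject (attributeList = []) {
  const result = {}

  for (const attribute of attributeList) {
    if (!attribute.key) continue

    result[attribute.key.content] = attribute.value
      ? attribute.value.content
      : ''
  }

  return result
}

// Hand-written attributes, as for <input type="text" disabled>.
const sampleAttributes = [
  { key: { content: 'type' }, value: { content: 'text' } },
  { key: { content: 'disabled' } }
]

const attrs = attributesToObject(sampleAttributes)
console.log(attrs)
```

The default parameter lets the helper be called directly with `node.content.attributes`, which may be `undefined` for tags that have no attributes.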