<p align="center">
  <img src="./logo.png" alt="Hyntax project logo: lego bricks in the shape of a capital letter H" width="150">
</p>

# Hyntax

Straightforward HTML parser for JavaScript. [Live Demo](https://astexplorer.net/#/gist/6bf7f78077333cff124e619aebfb5b42/latest).

- **Simple.** The API is straightforward and the output is clear.
- **Forgiving.** Like a browser, it parses invalid HTML without failing.
- **Supports streaming.** Can process HTML while it's still being loaded.
- **No dependencies.**

## Table Of Contents

- [Usage](#usage)
- [TypeScript Typings](#typescript-typings)
- [Streaming](#streaming)
- [Tokens](#tokens)
- [AST Format](#ast-format)
- [API Reference](#api-reference)
- [Types Reference](#types-reference)

## Usage

```bash
npm install hyntax
```

```javascript
const { tokenize, constructTree } = require('hyntax')
const util = require('util')

const inputHTML = `
<html>
  <body>
    <input type="text" placeholder="Don't type">
    <button>Don't press</button>
  </body>
</html>
`

const { tokens } = tokenize(inputHTML)
const { ast } = constructTree(tokens)

console.log(JSON.stringify(tokens, null, 2))
console.log(util.inspect(ast, { showHidden: false, depth: null }))
```

## TypeScript Typings

Hyntax is written in JavaScript but ships with [integrated TypeScript typings](./index.d.ts) to help you navigate its data structures. There is also a [Types Reference](#types-reference) that covers the most common types.

## Streaming

Use the `StreamTokenizer` and `StreamTreeConstructor` classes to parse HTML chunk by chunk while it's still being loaded from the network or read from disk.

```javascript
const { StreamTokenizer, StreamTreeConstructor } = require('hyntax')
const http = require('http')
const util = require('util')

http.get('http://info.cern.ch', (res) => {
  const streamTokenizer = new StreamTokenizer()
  const streamTreeConstructor = new StreamTreeConstructor()

  let resultTokens = []
  let resultAst

  res.pipe(streamTokenizer).pipe(streamTreeConstructor)

  streamTokenizer
    .on('data', (tokens) => {
      resultTokens = resultTokens.concat(tokens)
    })
    .on('end', () => {
      console.log(JSON.stringify(resultTokens, null, 2))
    })

  streamTreeConstructor
    .on('data', (ast) => {
      resultAst = ast
    })
    .on('end', () => {
      console.log(util.inspect(resultAst, { showHidden: false, depth: null }))
    })
}).on('error', (err) => {
  throw err
})
```

## Tokens

These are all the kinds of tokens that Hyntax extracts from an HTML string.

![Overview of all possible tokens](./tokens-list.png)

Each token conforms to the [Tokenizer.Token](#TokenizerToken) interface.
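
Because every token carries a `type` and the matching piece of source in `content`, a flat token array is easy to post-process. The snippet below is a minimal sketch that groups token contents by type. It runs on a hand-written sample instead of real tokenizer output: `groupContentByType` is a hypothetical helper, and the `type` strings and position values are illustrative placeholders, not necessarily the values Hyntax emits.

```javascript
// Hand-written sample shaped like Tokenizer.Token objects.
// The `type` strings and positions are illustrative placeholders.
const sampleTokens = [
  { type: 'open-tag-start', content: '<button', startPosition: 0, endPosition: 6 },
  { type: 'open-tag-end', content: '>', startPosition: 7, endPosition: 7 },
  { type: 'text', content: 'Press', startPosition: 8, endPosition: 12 },
  { type: 'close-tag', content: '</button>', startPosition: 13, endPosition: 21 }
]

// Hypothetical helper: collects the content of each token under its type.
function groupContentByType (tokens) {
  const groups = {}

  for (const token of tokens) {
    if (!groups[token.type]) {
      groups[token.type] = []
    }
    groups[token.type].push(token.content)
  }

  return groups
}

const groups = groupContentByType(sampleTokens)
console.log(groups)
```

With real input, the same helper could be applied to the `tokens` array returned by `tokenize`.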
## AST Format

The resulting syntax tree always has one top-level [Document Node](#ast-node-types), with optional child nodes nested within.

<!-- You can play around with the [AST Explorer](https://astexplorer.net) to see how AST looks like. -->

```javascript
{
  nodeType: TreeConstructor.NodeTypes.Document,
  content: {
    children: [
      {
        nodeType: TreeConstructor.NodeTypes.AnyNodeType,
        content: {…}
      },
      {
        nodeType: TreeConstructor.NodeTypes.AnyNodeType,
        content: {…}
      }
    ]
  }
}
```

The content of each node is specific to the node's type. All of them are described in the [AST Node Types](#ast-node-types) reference.
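
Since child nodes live under `content.children`, the tree can be traversed with a short recursive function. The following is a sketch only: `walk` is a hypothetical helper, the sample tree is written by hand, and its `nodeType` strings are placeholders standing in for the `TreeConstructor.NodeTypes` values.

```javascript
// Hand-written tree in the shape described above. The nodeType strings are
// placeholders; token fields inside `content` are omitted for brevity.
const sampleAst = {
  nodeType: 'document',
  content: {
    children: [
      {
        nodeType: 'tag',
        content: {
          name: 'body',
          children: [
            { nodeType: 'text', content: {} },
            { nodeType: 'tag', content: { name: 'button', children: [] } }
          ]
        }
      }
    ]
  }
}

// Hypothetical helper: depth-first walk that calls `visit` for every node.
// Nodes without `content.children` are treated as leaves.
function walk (node, visit, depth = 0) {
  visit(node, depth)

  const children = node.content.children || []
  for (const child of children) {
    walk(child, visit, depth + 1)
  }
}

const visited = []
walk(sampleAst, (node, depth) => {
  visited.push(`${'  '.repeat(depth)}${node.nodeType}`)
})

console.log(visited.join('\n'))
```

The `|| []` fallback covers node types whose content has no `children` field, such as text or comment nodes.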
## API Reference

### Tokenizer

Hyntax has its tokenizer as a separate module. You can use the generated tokens on their own or pass them on to the tree constructor to build an AST.

#### Interface

```typescript
tokenize(html: string): Tokenizer.Result
```

#### Arguments

- `html`

  HTML string to process.
  Required.
  Type: string.

#### Returns [Tokenizer.Result](#TokenizerResult)

### Tree Constructor

Once you have an array of tokens, you can pass it to the tree constructor to build an AST.

#### Interface

```typescript
constructTree(tokens: Tokenizer.AnyToken[]): TreeConstructor.Result
```

#### Arguments

- `tokens`

  Array of tokens received from the tokenizer.
  Required.
  Type: [Tokenizer.AnyToken[]](#tokenizeranytoken)

#### Returns [TreeConstructor.Result](#TreeConstructorResult)

## Types Reference

#### Tokenizer.Result

```typescript
interface Result {
  state: Tokenizer.State
  tokens: Tokenizer.AnyToken[]
}
```

- `state`

  The current state of the tokenizer. It can be persisted and passed to the next tokenizer call when the input arrives in chunks.

- `tokens`

  Array of resulting tokens.
  Type: [Tokenizer.AnyToken[]](#tokenizeranytoken)

#### TreeConstructor.Result

```typescript
interface Result {
  state: State
  ast: AST
}
```

- `state`

  The current state of the tree constructor. It can be persisted and passed to the next tree constructor call when tokens arrive in chunks.

- `ast`

  Resulting AST.
  Type: [TreeConstructor.AST](#treeconstructorast)

#### Tokenizer.Token

Generic Token. Other interfaces use it to create a specific Token type.

```typescript
interface Token<T extends TokenTypes.AnyTokenType> {
  type: T
  content: string
  startPosition: number
  endPosition: number
}
```

- `type`

  One of the [Token types](#TokenizerTokenTypesAnyTokenType).

- `content`

  Piece of the original HTML string that was recognized as a token.

- `startPosition`

  Index of the character in the input HTML string where the token starts.

- `endPosition`

  Index of the character in the input HTML string where the token ends.
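
Character indexes are compact but not very readable in error messages or editor integrations. As an illustration, the sketch below converts an index such as `startPosition` into a line and column pair. `indexToLineColumn` is a hypothetical helper, not part of Hyntax, and it assumes the index is zero-based.

```javascript
// Hypothetical helper: converts a zero-based character index into a
// one-based { line, column } pair by counting newlines before the index.
function indexToLineColumn (source, index) {
  let line = 1
  let column = 1

  for (let i = 0; i < index && i < source.length; i++) {
    if (source[i] === '\n') {
      line += 1
      column = 1
    } else {
      column += 1
    }
  }

  return { line, column }
}

const html = '<html>\n<body>\n<button>Press</button>\n</body>\n</html>'

// Stand-in for a token's startPosition: the index where "<button" begins.
const startPosition = html.indexOf('<button')

console.log(indexToLineColumn(html, startPosition))
```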
#### Tokenizer.TokenTypes.AnyTokenType

Shortcut type for all possible token types.

```typescript
type AnyTokenType =
  | Text
  | OpenTagStart
  | AttributeKey
  | AttributeAssigment
  | AttributeValueWrapperStart
  | AttributeValue
  | AttributeValueWrapperEnd
  | OpenTagEnd
  | CloseTag
  | OpenTagStartScript
  | ScriptTagContent
  | OpenTagEndScript
  | CloseTagScript
  | OpenTagStartStyle
  | StyleTagContent
  | OpenTagEndStyle
  | CloseTagStyle
  | DoctypeStart
  | DoctypeEnd
  | DoctypeAttributeWrapperStart
  | DoctypeAttribute
  | DoctypeAttributeWrapperEnd
  | CommentStart
  | CommentContent
  | CommentEnd
```

#### Tokenizer.AnyToken

Shortcut to reference any possible token.

```typescript
type AnyToken = Token<TokenTypes.AnyTokenType>
```

#### TreeConstructor.AST

An alias for DocumentNode. The AST always has one top-level DocumentNode. See [AST Node Types](#ast-node-types).

```typescript
type AST = TreeConstructor.DocumentNode
```

### AST Node Types

There are 7 possible types of Node. Each type has specific content.

```typescript
type DocumentNode = Node<NodeTypes.Document, NodeContents.Document>
```

```typescript
type DoctypeNode = Node<NodeTypes.Doctype, NodeContents.Doctype>
```

```typescript
type TextNode = Node<NodeTypes.Text, NodeContents.Text>
```

```typescript
type TagNode = Node<NodeTypes.Tag, NodeContents.Tag>
```

```typescript
type CommentNode = Node<NodeTypes.Comment, NodeContents.Comment>
```

```typescript
type ScriptNode = Node<NodeTypes.Script, NodeContents.Script>
```

```typescript
type StyleNode = Node<NodeTypes.Style, NodeContents.Style>
```

Interfaces for each content type:

- [Document](#TreeConstructorNodeContentsDocument)
- [Doctype](#TreeConstructorNodeContentsDoctype)
- [Text](#TreeConstructorNodeContentsText)
- [Tag](#TreeConstructorNodeContentsTag)
- [Comment](#TreeConstructorNodeContentsComment)
- [Script](#TreeConstructorNodeContentsScript)
- [Style](#TreeConstructorNodeContentsStyle)

#### TreeConstructor.Node

Generic Node. Other interfaces use it to create specific Nodes by providing the type of the Node and the type of the content inside it.

```typescript
interface Node<T extends NodeTypes.AnyNodeType, C extends NodeContents.AnyNodeContent> {
  nodeType: T
  content: C
}
```

#### TreeConstructor.NodeTypes.AnyNodeType

Shortcut type for all possible Node types.

```typescript
type AnyNodeType =
  | Document
  | Doctype
  | Tag
  | Text
  | Comment
  | Script
  | Style
```

### Node Content Types

#### TreeConstructor.NodeContents.AnyNodeContent

Shortcut type for all possible types of content inside a Node.

```typescript
type AnyNodeContent =
  | Document
  | Doctype
  | Text
  | Tag
  | Comment
  | Script
  | Style
```

#### TreeConstructor.NodeContents.Document

```typescript
interface Document {
  children: AnyNode[]
}
```

#### TreeConstructor.NodeContents.Doctype

```typescript
interface Doctype {
  start: Tokenizer.Token<Tokenizer.TokenTypes.DoctypeStart>
  attributes?: DoctypeAttribute[]
  end: Tokenizer.Token<Tokenizer.TokenTypes.DoctypeEnd>
}
```

#### TreeConstructor.NodeContents.Text

```typescript
interface Text {
  value: Tokenizer.Token<Tokenizer.TokenTypes.Text>
}
```

#### TreeConstructor.NodeContents.Tag

```typescript
interface Tag {
  name: string
  selfClosing: boolean
  openStart: Tokenizer.Token<Tokenizer.TokenTypes.OpenTagStart>
  attributes?: TagAttribute[]
  openEnd: Tokenizer.Token<Tokenizer.TokenTypes.OpenTagEnd>
  children?: AnyNode[]
  close: Tokenizer.Token<Tokenizer.TokenTypes.CloseTag>
}
```

#### TreeConstructor.NodeContents.Comment

```typescript
interface Comment {
  start: Tokenizer.Token<Tokenizer.TokenTypes.CommentStart>
  value: Tokenizer.Token<Tokenizer.TokenTypes.CommentContent>
  end: Tokenizer.Token<Tokenizer.TokenTypes.CommentEnd>
}
```

#### TreeConstructor.NodeContents.Script

```typescript
interface Script {
  openStart: Tokenizer.Token<Tokenizer.TokenTypes.OpenTagStartScript>
  attributes?: TagAttribute[]
  openEnd: Tokenizer.Token<Tokenizer.TokenTypes.OpenTagEndScript>
  value: Tokenizer.Token<Tokenizer.TokenTypes.ScriptTagContent>
  close: Tokenizer.Token<Tokenizer.TokenTypes.CloseTagScript>
}
```

#### TreeConstructor.NodeContents.Style

```typescript
interface Style {
  openStart: Tokenizer.Token<Tokenizer.TokenTypes.OpenTagStartStyle>
  attributes?: TagAttribute[]
  openEnd: Tokenizer.Token<Tokenizer.TokenTypes.OpenTagEndStyle>
  value: Tokenizer.Token<Tokenizer.TokenTypes.StyleTagContent>
  close: Tokenizer.Token<Tokenizer.TokenTypes.CloseTagStyle>
}
```

#### TreeConstructor.DoctypeAttribute

```typescript
interface DoctypeAttribute {
  startWrapper?: Tokenizer.Token<Tokenizer.TokenTypes.DoctypeAttributeWrapperStart>
  value: Tokenizer.Token<Tokenizer.TokenTypes.DoctypeAttribute>
  endWrapper?: Tokenizer.Token<Tokenizer.TokenTypes.DoctypeAttributeWrapperEnd>
}
```

#### TreeConstructor.TagAttribute

```typescript
interface TagAttribute {
  key?: Tokenizer.Token<Tokenizer.TokenTypes.AttributeKey>
  startWrapper?: Tokenizer.Token<Tokenizer.TokenTypes.AttributeValueWrapperStart>
  value?: Tokenizer.Token<Tokenizer.TokenTypes.AttributeValue>
  endWrapper?: Tokenizer.Token<Tokenizer.TokenTypes.AttributeValueWrapperEnd>
}
```
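
Every field of `TagAttribute` is optional, so code that reads attributes has to handle a missing `key` or `value`. One possible way to flatten an attribute list into a plain object is sketched below. `attributesToObject` is a hypothetical helper, the sample data is written by hand with only the `content` field of each token filled in, and the choice to skip key-less attributes and map value-less ones to an empty string is an assumption of this example.

```javascript
// Hand-written sample shaped like TreeConstructor.TagAttribute objects.
// Only the `content` field of each token is filled in.
const sampleAttributes = [
  { key: { content: 'type' }, value: { content: 'text' } },
  { key: { content: 'disabled' } },
  { value: { content: 'orphan' } }
]

// Hypothetical helper: builds a plain { key: value } object.
// Attributes without a key are skipped; attributes without a value map to ''.
function attributesToObject (attributes = []) {
  const result = {}

  for (const attribute of attributes) {
    if (!attribute.key) continue

    result[attribute.key.content] = attribute.value
      ? attribute.value.content
      : ''
  }

  return result
}

console.log(attributesToObject(sampleAttributes))
```

The default parameter lets the helper accept the optional `attributes` field of a Tag, Script, or Style content object directly, even when it is absent.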