# parse-latin
[![Build][build-badge]][build]
[![Coverage][coverage-badge]][coverage]
[![Downloads][downloads-badge]][downloads]
[![Size][size-badge]][size]
[![Chat][chat-badge]][chat]
A Latin-script language parser for [**retext**][retext] producing **[nlcst][]**
nodes.

Whether Old-English (“þā gewearþ þǣm hlāforde and þǣm hȳrigmannum wiþ ānum
penninge”), Icelandic (“Hvað er að frétta”), or French (“Où sont les
toilettes?”), `parse-latin` does a good job at tokenizing it.

Note also that `parse-latin` does a decent job at tokenizing Latin-like scripts,
such as Cyrillic (“Добро пожаловать!”), Georgian (“როგორა ხარ?”), and Armenian
(“Շատ հաճելի է”).
## Install
This package is ESM only: Node 12+ is needed to use it and it must be `import`ed
instead of `require`d.
[npm][]:
```sh
npm install parse-latin
```
## Use
```js
import {inspect} from 'unist-util-inspect'
import {ParseLatin} from 'parse-latin'
const tree = new ParseLatin().parse('A simple sentence.')
console.log(inspect(tree))
```
Which, when inspecting, yields:
```txt
RootNode[1] (1:1-1:19, 0-18)
└─0 ParagraphNode[1] (1:1-1:19, 0-18)
    └─0 SentenceNode[6] (1:1-1:19, 0-18)
        ├─0 WordNode[1] (1:1-1:2, 0-1)
        │   └─0 TextNode "A" (1:1-1:2, 0-1)
        ├─1 WhiteSpaceNode " " (1:2-1:3, 1-2)
        ├─2 WordNode[1] (1:3-1:9, 2-8)
        │   └─0 TextNode "simple" (1:3-1:9, 2-8)
        ├─3 WhiteSpaceNode " " (1:9-1:10, 8-9)
        ├─4 WordNode[1] (1:10-1:18, 9-17)
        │   └─0 TextNode "sentence" (1:10-1:18, 9-17)
        └─5 PunctuationNode "." (1:18-1:19, 17-18)
```
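
Once you have such a tree, consuming it is mostly a matter of walking `children`.
The sketch below is illustrative only: `exampleTree` is a hand-written copy of the tree shown above (positions omitted), and `toText` and `collect` are hypothetical helpers, not part of `parse-latin`.

```js
// Hand-written copy of the inspected tree above, without position info.
const exampleTree = {
  type: 'RootNode',
  children: [
    {
      type: 'ParagraphNode',
      children: [
        {
          type: 'SentenceNode',
          children: [
            {type: 'WordNode', children: [{type: 'TextNode', value: 'A'}]},
            {type: 'WhiteSpaceNode', value: ' '},
            {type: 'WordNode', children: [{type: 'TextNode', value: 'simple'}]},
            {type: 'WhiteSpaceNode', value: ' '},
            {type: 'WordNode', children: [{type: 'TextNode', value: 'sentence'}]},
            {type: 'PunctuationNode', value: '.'}
          ]
        }
      ]
    }
  ]
}

// Hypothetical helper: concatenate the values of all leaf nodes under `node`.
function toText(node) {
  if (typeof node.value === 'string') return node.value
  return (node.children || []).map(toText).join('')
}

// Hypothetical helper: collect the text of every node with the given type.
function collect(node, type, found = []) {
  if (node.type === type) found.push(toText(node))
  for (const child of node.children || []) collect(child, type, found)
  return found
}

console.log(toText(exampleTree))
console.log(collect(exampleTree, 'WordNode'))
```

The same two functions should work on a tree returned by `parse()`, since they only rely on `type`, `value`, and `children`.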
## API
This package exports the following identifiers: `ParseLatin`.
There is no default export.
### `ParseLatin(value)`
Exposes the functionality needed to tokenize natural Latin-script languages into
a syntax tree.
If `value` is passed here, it does not need to be given to `#parse()`.
#### `ParseLatin#tokenize(value)`
Tokenize `value` (`string`) into letters and numbers (words), white space, and
everything else (punctuation).
The returned nodes are a flat list without paragraphs or sentences.
###### Returns
[`Array.<Node>`][nlcst] — Nodes.
#### `ParseLatin#parse(value)`
Tokenize `value` (`string`) into an [nlcst][] tree.
The returned node is a `RootNode` containing paragraphs and sentences.
###### Returns
[`Node`][nlcst] — Root node.
## Algorithm
> Note: The easiest way to see **how parse-latin tokenizes and parses** is to
> use the [online parser demo][demo], which
> shows the syntax tree corresponding to the typed text.
`parse-latin` splits text into white space, word, and punctuation tokens.
`parse-latin` starts out with a pretty easy definition, one that most other
tokenizers use:
* A “word” is one or more letter or number characters
* A “white space” is one or more white space characters
* A “punctuation” is one or more of anything else
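
The three definitions above can be sketched with a single regular expression.
This is a simplified illustration, not the code `parse-latin` uses: `naiveTokenize` is a hypothetical name, the token shape (`{type, value}`) is made up for the example, and treating combining marks as word characters is an assumption of this sketch.

```js
// Hypothetical helper, not part of parse-latin: split text into runs of
// word characters, white space, and everything else.
function naiveTokenize(value) {
  // One capture group per token class; the `u` flag enables \p{…} escapes.
  // Letters (L), combining marks (M), and numbers (N) count as word characters.
  const expression = /([\p{L}\p{M}\p{N}]+)|(\s+)|([^\p{L}\p{M}\p{N}\s]+)/gu
  const tokens = []

  for (const match of value.matchAll(expression)) {
    let type = 'punctuation'
    if (match[1] !== undefined) type = 'word'
    else if (match[2] !== undefined) type = 'whiteSpace'
    tokens.push({type, value: match[0]})
  }

  return tokens
}

console.log(naiveTokenize('A simple sentence.'))
console.log(naiveTokenize('non-profit'))
```

With this definition alone, `non-profit` comes out as three tokens (`non`, `-`, `profit`), which is why a later merging step is needed.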
Then, it manipulates and merges those tokens into an [nlcst][] syntax tree,
adding sentences and paragraphs where needed.
That step has to account for exceptions such as:
* Some punctuation marks are part of the word they occur in, such as
  `non-profit`, `she’s`, `G.I.`, `11:00`, `N/A`, `&c`, `nineteenth- and…`
* Some full-stops do not mark a sentence end, such as `1.`, `e.g.`, `id.`
* Although full-stops, question marks, and exclamation marks (sometimes) end a
  sentence, that end might not occur directly after the mark, such as `.)`,
  `."`
* And many more exceptions
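
To give a feel for what such a merging step involves, here is a rough sketch of the first exception only: folding a punctuation mark into a word when it sits directly between two word tokens.
It is not how `parse-latin` is implemented. `splitTokens`, `mergeInnerWordMarks`, and the set of marks in `innerWordMarks` are assumptions made for this example, and cases such as `G.I.` or `&c` are deliberately left out.

```js
// Marks treated as word-internal in this sketch (an assumption, not
// the list parse-latin actually uses).
const innerWordMarks = new Set(['-', "'", '’', '/', ':'])

// Naive first pass: word runs, white space runs, and single punctuation marks.
function splitTokens(value) {
  const expression = /([\p{L}\p{M}\p{N}]+)|(\s+)|([^\p{L}\p{M}\p{N}\s])/gu
  const tokens = []

  for (const match of value.matchAll(expression)) {
    let type = 'punctuation'
    if (match[1] !== undefined) type = 'word'
    else if (match[2] !== undefined) type = 'whiteSpace'
    tokens.push({type, value: match[0]})
  }

  return tokens
}

// Second pass: merge `word mark word` sequences into a single word token.
function mergeInnerWordMarks(tokens) {
  const result = []
  let index = 0

  while (index < tokens.length) {
    const token = tokens[index]
    const previous = result[result.length - 1]
    const next = tokens[index + 1]

    if (
      token.type === 'punctuation' &&
      innerWordMarks.has(token.value) &&
      previous &&
      previous.type === 'word' &&
      next &&
      next.type === 'word'
    ) {
      // Fold the mark and the following word into the preceding word.
      previous.value += token.value + next.value
      index += 2
    } else {
      // Copy the token so the input list is left untouched.
      result.push({...token})
      index++
    }
  }

  return result
}

console.log(mergeInnerWordMarks(splitTokens('A non-profit at 11:00.')))
```

A mark surrounded by white space (as in `wait - what`) is left alone, because the token before it is not a word.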
## License
[MIT][license] © [Titus Wormer][author]
<!-- Definitions -->
[build-badge]: https://github.com/wooorm/parse-latin/workflows/main/badge.svg
[build]: https://github.com/wooorm/parse-latin/actions
[coverage-badge]: https://img.shields.io/codecov/c/github/wooorm/parse-latin.svg
[coverage]: https://codecov.io/github/wooorm/parse-latin
[downloads-badge]: https://img.shields.io/npm/dm/parse-latin.svg
[downloads]: https://www.npmjs.com/package/parse-latin
[size-badge]: https://img.shields.io/bundlephobia/minzip/parse-latin.svg
[size]: https://bundlephobia.com/result?p=parse-latin
[chat-badge]: https://img.shields.io/badge/join%20the%20community-on%20spectrum-7b16ff.svg
[chat]: https://spectrum.chat/unified/retext
[npm]: https://docs.npmjs.com/cli/install
[demo]: https://wooorm.com/parse-latin/
[license]: license
[author]: https://wooorm.com
[retext]: https://github.com/retextjs/retext
[nlcst]: https://github.com/syntax-tree/nlcst