How to Write for Strict Mode

In loosey-goosey mode you write whatever and I try to figure out what you meant. I'm often wrong.

In strict mode, you annotate your text enough for the parser to detect numbers, neologisms, foreign words, prepositions and so on.

Rules

  1. Numbers are prefixed with #, e.g. #luka li nanpa.
  2. Neologisms are prefixed with +, e.g. +nupa li nimi sin.
  3. Foreign words are wrapped in double quotes, e.g. "Cromulent".
  4. Prepositions are preceded by a comma, e.g. mi, lon ma Mewika. jan li, tawa ma Mewika.
  5. Prepositions can also start with ~, e.g. ~lon, ~sama, ~poka.
  6. Compound words are joined by -, e.g. jan-pona.
  7. Direct quotes are in single quotes, e.g. jan li toki e ni: 'ale li pona.'
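
Taken together, the markers make token classification nearly mechanical. Here is a minimal sketch in C#; the category names are illustrative, not the parser's own, and the comma form of rule 4 is assumed to be handled during tokenization (a comma token marking the next word), so only the ~ form appears here:

    static class StrictMode
    {
        // Classify a single strict-mode token by its annotation.
        // Order matters: "#luka-wan" is a number, not a compound.
        public static string Classify(string token) => token switch
        {
            _ when token.StartsWith("#")  => "number",
            _ when token.StartsWith("+")  => "neologism",
            _ when token.StartsWith("\"") => "foreign word",
            _ when token.StartsWith("~")  => "preposition",
            _ when token.StartsWith("'")  => "direct quote",
            _ when token.Contains("-")    => "compound word",
            _                             => "plain word",
        };
        // Classify("#luka") -> "number"; Classify("jan-pona") -> "compound word"
    }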

Numbers

There are lots and lots of community proposals for numbers. I support Stupid, Half-Stupid, Poman, and Body numbers. You can learn more about Half-Stupid numbers here.

The parser works best with explicit numbers, e.g. #wan, #luka, #luka-wan. If you don't use explicit numbers, you can get better results by specifying which number system you are using.
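
As a sketch, evaluating an explicit additive number is just splitting on the dashes and summing word values. The values here (wan = 1, tu = 2, luka = 5, mute = 20, ale = 100) are the additive ones the Stupid and Poman sections below use:

    using System.Collections.Generic;
    using System.Linq;

    static class HalfStupid
    {
        static readonly Dictionary<string, int> Values = new()
        {
            ["wan"] = 1, ["tu"] = 2, ["luka"] = 5,
            ["mute"] = 20, ["ale"] = 100,
        };

        // "#luka-wan" -> 6; "#ale-mute-tu" -> 122.
        // An unknown word throws, which is a feature: it is not a number.
        public static int Evaluate(string token) =>
            token.TrimStart('#')
                 .Split('-')
                 .Sum(word => Values[word]);
    }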

Stupid

Stupid numbers are purely additive: you say the number words and add them up, with wan = 1, tu = 2, and luka = 5, e.g. luka luka tu wan = 13. Half-Stupid extends the same idea with mute = 20 and ale = 100.

Stupid/Poman

Poman numbers use the same words as above, except you capitalize and use only the first letter, e.g. AAMMTW = 100+100+20+20+2+1 = 243.
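
Under that reading, a Poman numeral is the same sum taken over single letters. A minimal sketch, assuming L = luka = 5 also belongs in the letter set (the example above only shows A, M, T, and W):

    using System;
    using System.Linq;

    static class Poman
    {
        // "AAMMTW" -> 100 + 100 + 20 + 20 + 2 + 1 = 243
        public static int Evaluate(string numeral) =>
            numeral.Sum(c => c switch
            {
                'A' => 100, 'M' => 20, 'L' => 5, 'T' => 2, 'W' => 1,
                _ => throw new ArgumentException($"not a Poman letter: {c}"),
            });
    }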

Body

This system has no particular blessing and no evidence of use by anyone but me as of 2014. It was easier to pick a system than to wrestle with the above systems or pretend that numbers don't show up in real-life translation exercises.

At the moment, I have no names for the place holders.

It is still recommended to write numbers as #123 instead of #nena-oko-kute or #wan-tu-kute. Definitely anything over 100 should be written in Arabic numerals. If you use body numbers, you just about have to write them explicitly; there isn't an easy way to infer implicit body numbers. (Even luka, mute, and ale are challenging to identify correctly in Half-Stupid numbers.)
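
For what it's worth, an explicit body number reads positionally rather than additively. A sketch under a big assumption: the digit names below are only the ones the #123 example above implies (nena/wan = 1, oko/tu = 2, kute = 3); the full digit set isn't given here.

    using System.Collections.Generic;

    static class BodyNumbers
    {
        // Digit names inferred from the #123 example above; the rest
        // of the body-part digits are an open question here.
        static readonly Dictionary<string, int> Digits = new()
        {
            ["nena"] = 1, ["wan"] = 1,
            ["oko"] = 2, ["tu"] = 2,
            ["kute"] = 3,
        };

        // Positional: "#nena-oko-kute" -> 1*100 + 2*10 + 3 = 123
        public static int Evaluate(string token)
        {
            int value = 0;
            foreach (var word in token.TrimStart('#').Split('-'))
                value = value * 10 + Digits[word];
            return value;
        }
    }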

Punctuating Input

The status of prepositional phrases as modifiers is undefined, and they make things hard to parse right now. So if you have a prepositional-phrase modifier, especially in the subject phrase, join it with dashes, e.g. jan lon-ma-mi li jo e mani suli. I can sometimes normalize vocatives correctly, but if you can add a period, colon, question mark, or exclamation mark, vocatives will parse better. If sentences don't have terminating punctuation, I assume it is a run-on sentence; I can't easily guess that a sentence has ended just because of white space, a blank line, parentheses, and so on. The problem is especially acute with titles, conversational fragments, and poetry. The parser also has a hard time with ellipses, such as mi wile e ..., which I suppose could happen in conversation, but the parser thinks the string has terminated early.
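
To help the parser along, a crude pre-processing pass can add the missing terminators before the text is ever submitted. This is a writer-side sketch of the advice above, not what the parser itself does:

    using System.Linq;

    static class Punctuate
    {
        // Treat each non-blank line as a sentence and make sure it ends
        // with a terminator the parser understands. Crude: titles and
        // poetry lines get a period whether they want one or not.
        public static string Terminate(string text)
        {
            var lines = text.Split('\n').Select(line =>
            {
                var t = line.TrimEnd();
                if (t.Length == 0) return t;
                return ".:?!".Contains(t[^1]) ? t : t + ".";
            });
            return string.Join("\n", lines);
        }
    }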

Edgy toki pona

I like to write edgy toki pona, but the parser can't deal with it. Examples include: compound prepositions, transformatives, noun phrases in the verb phrase, intentional and accidental subordinate clauses (I'm going to have to get back with an example), and "mixed modifiers" that use en to join modifiers, e.g. kule pi laso en pimeja.

Bracket Help

Phrase Brackets

Part of Speech Markers

What are tags? They are undelimited, i.e. not set off from other phrases by particles like pi, e, li, etc. They have a scope that can be unusual; I don't know how to describe it. When a word is tagged, that word can pop up anywhere that a word can. The perfect examples are anu X and ala in the verb phrase. Not inventing the concept of tagged words would mean the verb paradigm has a slot between each word for an anu X or ala.

Closely related to tagged words are conjunction tags, which sit at the front of a sentence, undelimited from the subject, and the anu seme tag, which sits at the end of a sentence without delimiters from whatever might have gone before.
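
One way to model this, and it is only a sketch of the concept rather than the parser's internals, is to hang the tags off the word instead of giving the paradigm extra slots:

    // Sketch: a word plus whatever tags trail it. "moku ala" is one
    // Word with Negated = true, and "moku anu pali" is one Word with
    // AnuAlternative = "pali", rather than separate paradigm slots
    // between every pair of words for ala and anu X.
    public class TaggedWord
    {
        public string Word { get; set; } = "";
        public bool Negated { get; set; }           // trailing ala
        public string? AnuAlternative { get; set; } // trailing anu X
    }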

What is this?

The parser translates tp text into C# data structures. From these I can run a variety of experiments, such as sentence diagramming, glossing to English, spell checking, and grammar checking. After the text has been converted to an object-oriented data structure, I can convert it back to a toki pona string. When the code is bug-free, the round trip will be nearly identical. Right now there are bugs, and I lose white space and punctuation when converting back to toki pona.
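
A toy version of that round trip, with made-up types (the real class names aren't shown in this post):

    using System;
    using System.Linq;

    // A tiny object model and its ToString round trip. Real sentences
    // are much richer than subject li predicate, but the goal is the
    // same: parse, manipulate, and print back nearly identical text.
    record Phrase(string Head, string[] Modifiers)
    {
        public override string ToString() =>
            string.Join(" ", new[] { Head }.Concat(Modifiers));
    }

    record SimpleSentence(Phrase Subject, Phrase Predicate)
    {
        public override string ToString() => $"{Subject} li {Predicate}.";
    }

    class Demo
    {
        static void Main()
        {
            var s = new SimpleSentence(
                new Phrase("jan", new[] { "pona" }),
                new Phrase("moku", Array.Empty<string>()));
            Console.WriteLine(s); // jan pona li moku.
        }
    }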

Normalization

TP has two design mistakes. First, there is extensive overlap between content and function words. The difference between content and function words is like porn: you know it when you see it. Function words in tp are pi, li, la, e, and the six prepositions. The prepositions can be used as nouns and verbs; the pi, li, la, e particles can't. In normalization, I make best guesses at what is being used as a preposition.
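
A naive sketch of that best guess. Assuming the six prepositions are lon, tawa, tan, sama, kepeken, and poka (only lon, sama, and poka are named above), mark a candidate with ~ once the predicate has started:

    using System.Collections.Generic;

    static class PrepGuess
    {
        static readonly HashSet<string> Candidates =
            new() { "lon", "tawa", "tan", "sama", "kepeken", "poka" };

        // After li we are in the predicate; a candidate word there is
        // probably being used as a preposition. This wrongly flags
        // tawa used as the main verb, which is exactly why it is only
        // a guess.
        public static IEnumerable<string> Normalize(IEnumerable<string> words)
        {
            bool inPredicate = false;
            foreach (var w in words)
            {
                if (w == "li") inPredicate = true;
                yield return inPredicate && Candidates.Contains(w)
                    ? "~" + w
                    : w;
            }
        }
    }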

The other mistake was dropping li after mi/sina. It makes parsing more difficult and the language isn't any more concise. So I also normalize by putting back the missing li. People can do this effortlessly; computers do it poorly.
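
The easy case is one line of logic; the comment notes where the sketch goes wrong, which is the "computers do it poorly" part:

    using System.Linq;

    static class RestoreLi
    {
        // "mi moku" -> "mi li moku". Naive: a subject like "mi mute"
        // in "mi mute li moku" gets mangled into "mi li mute li moku",
        // which is exactly why the real normalizer can't just be this.
        public static string Normalize(string sentence)
        {
            var words = sentence.Split(' ');
            if (words.Length > 1
                && (words[0] == "mi" || words[0] == "sina")
                && words[1] != "li")
                return words[0] + " li " + string.Join(" ", words.Skip(1));
            return sentence;
        }
    }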

Implicit vs Explicit Part of Speech

Some parts of speech are really hard to identify, but if you add some annotation, the task is much easier. This includes neologisms, prepositions, numbers, foreign text, and a few more categories.

Glossing

In glossing, I look up words by part of speech and pick one alternative at random. I'm starting with the jan Sonja classic dictionary, which to my surprise is actually missing a lot of entries. Most content words can plausibly be used as any part of speech, but are only defined for a few. This used to lead to arguments. Right now it leads to [square bracketed] error messages.
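
The lookup itself is small; a sketch with a couple of made-up entries, not actual dictionary data:

    using System;
    using System.Collections.Generic;

    static class Glosser
    {
        // (word, part of speech) -> gloss alternatives. Example
        // entries only; the real data comes from the dictionary.
        static readonly Dictionary<(string, string), string[]> Entries = new()
        {
            [("soweli", "noun")] = new[] { "animal", "beast" },
            [("pona", "modifier")] = new[] { "good", "simple" },
        };

        static readonly Random Rng = new();

        // Pick one alternative at random; a missing entry becomes a
        // square-bracketed error, as described above.
        public static string Gloss(string word, string pos) =>
            Entries.TryGetValue((word, pos), out var alts)
                ? alts[Rng.Next(alts.Length)]
                : $"[{word}]";
    }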

Grammar Checking

I can detect things like an extra li after mi, too few words after pi, missing li, or missing e, but not always. Often bad grammar means missing particles, and the result looks like a long string of modifiers.
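
Two of those checks are easy to sketch; the comments mark the "but not always" part:

    using System.Collections.Generic;

    static class GrammarCheck
    {
        public static IEnumerable<string> Check(string[] words)
        {
            for (int i = 0; i < words.Length; i++)
            {
                // "mi li moku" has an extra li; restricting the check
                // to the subject slot leaves "... e mi li ..." later
                // in a sentence alone.
                if (i == 0 && words[i] == "mi"
                    && i + 1 < words.Length && words[i + 1] == "li")
                    yield return "extra li after mi";

                // pi needs at least two words after it. Checking
                // against the end of the sentence is cruder than
                // checking the end of the phrase, hence "not always".
                if (words[i] == "pi" && words.Length - i - 1 < 2)
                    yield return "too few words after pi";
            }
        }
    }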

Serialization

If you've read this far, surely you want to see what a tp sentence looks like serialized to JSON or XML: xml/json
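
Meanwhile, here is what the mechanics look like with the stock .NET serializers and a toy sentence type; the real output shape is whatever those links show:

    using System;
    using System.IO;
    using System.Text.Json;
    using System.Xml.Serialization;

    public class ToySentence
    {
        public string[] Subject { get; set; } = Array.Empty<string>();
        public string[] Predicate { get; set; } = Array.Empty<string>();
    }

    class Demo
    {
        static void Main()
        {
            var s = new ToySentence
            {
                Subject = new[] { "jan", "pona" },
                Predicate = new[] { "moku" },
            };

            // JSON: {"Subject":["jan","pona"],"Predicate":["moku"]}
            Console.WriteLine(JsonSerializer.Serialize(s));

            // XML: <ToySentence><Subject><string>jan</string>...
            var xml = new XmlSerializer(typeof(ToySentence));
            using var writer = new StringWriter();
            xml.Serialize(writer, s);
            Console.WriteLine(writer);
        }
    }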