lair of the dustbunny: August 2008

Tuesday, August 26, 2008

Puzzle: Determine the smallest integer

Here is a puzzle a friend forwarded to me recently:

Ten (not necessarily distinct) integers have the property that if all but one of them are added the possible results are: 82, 83, 84, 85, 87, 89, 90, 91, 92. What is the smallest of the integers?

(I'll post the solution and my method in a few days.)

Tuesday, August 19, 2008

emacs python mode from scratch: stage 4 - utilities (the rest)

Let's continue from the last entry and finish off the utilities.

The rest of the utilities:


(defun python-comment-line-p ()
(defun python-blank-line-p ()
(defun python-beginning-of-string ()
(defun python-open-block-statement-p (&optional bos)
(defun python-close-block-statement-p (&optional bos)
(defun python-outdent-p ()

python-comment-line-p

Go to the end of the line, if the emacs parser says we we are in a comment, go to the beginning of the line per indentation and check if we are looking at a comment or the end of a line.
This is weird. Seems like there is no way for this to ever be an end of line.
comment-start seems to be a symbol required by rx not necessarily a value set in define-derived-mode (which I'm not setting yet)
As an aside, the regex for start of comment is \s<
I don't think that line-end is a necessary part of that regular expression or at least if i take it out it doesn't seem to change the behavior but I'll just leave it in for now.
Presumably we are just depending on the comment character set in python-mode-syntax-table

python-blank-line-p

This is about as straight forward as you get. Go to the beginning of the line and check if you are looking at 0 or more white space chars followed by an end of line. \\s- is the ever so strange looking way emacs regular expressions represent white space.

python-beginning-of-string

Hey, this stuff is starting to look familiar. First we determine what state a parser would be at the current position. Then if the parser says we are in a string, we go the the starting point (8th item of the state list)

python-open-block-statement-p

So this and python-close-block-statement-p and python-outdent-p call a number of things that aren't already defined which is strange since you'd think some functions described as utilities would be self contained. The additional functions are:
```
python-beginning-of-statement and python-previous-statement
```
So we'll assume these functions do what they claim for now and investigate them more closely down below
This logic is pretty straight forward. Go to the beginning of the statement and then use a regular expression match to see if we are at the beginning of a block.
NOTE: there is an optional parameter that let's you skip the movement to the beginning of the statement. Seems like a simple performance helper.

python-close-block-statement-p

This is the exact analog of python-open-block-statement-p. Except here we check to see if it is something that for sure is the last member of a block.

python-outdent-p

Check if current line should be "outdented". Again pretty straight forward
code. Move to the beginning of the the current indentation and check if all of the following are true: looking at (else, finally, except or elif) and not in a comment or string and check that the previous statement is neither a close block nor an open block.
In other words if on something like an else statement and the previous statement is not something that requires indentation you should outdent.
Seems like this is just sort of a heuristic and not quite accurate.
E.g. for an improperly nested "else" following a return. It will say not to outdent. Of course, what *should* you do when the code is invalid?

python-beginning-of-statement

(this requires python-skip-out so we add that to the list as well)
Here we will move to the beginning of the line and then if on a continuation of some sort we'll either check if we are on a backslash style of continuation and move backward over these or we will move backward over strings and "skip out" (jump up) from nested brackets.

python-previous-statement

(this needs python-next-statement so we add that to the list as well)
If it receives a negative argument they really want to go forward, so pass off to python-next-statement. Otherwise, go to the beginning of the statement and skip over comments and blanks and continue going to the beginnnig of the statement until counter is 0 (or reach beginning of buffer)

python-skip-out

Pop up out of nested brackets to the front by default and to the
end if "forward" is set. Additionally if "syntax" of point is already available, then that can be passed in. For well formed nesting we just call backward-up-list with the depth. For ill-matched brackets we try to go backward up the list over and over until we get an error.

python-next-statement

This is pretty much an exact analogue of python-previous-statement.

There were a surprising number of functions that the utilities them self depended on. I continue to be amazed by the complexity of making a language sensitive mode for emacs.

I also noticed that I'm over 10% of the way done. Making progress.

Thursday, August 7, 2008

emacs python mode from scratch: stage 3 - some utilities

As I mentioned last time my plan was to do tabbing/indentation next but there was a lot of reliance on the utility section of code so the next step becomes getting these helper functions copied over and understood.

There are about 100 lines in the utility section so let's do half at a time. In the first 50 lines we have the following functions/definitions:


(defsubst python-in-string/comment ()
(defconst python-space-backslash-table
(defun python-skip-comments/blanks (&optional backward)
(defun python-backslash-continuation-line-p ()
(defun python-continuation-line-p ()

python-in-string/comment

defsubst is a marcro for inlining a function which seems strange. I wonder if this was just a performance booster. (Is there any other reason to inline code?)

syntax-ppss returns info on what the parser state would be at the current position. Below are the states and its clear why nth 8 is the thing we need for determining if we are in a string or comment


  0. The depth in parentheses, counting from 0. Warning: this can be negative if there are more close parens than open parens between the start of the defun and point.
  1. The character position of the start of the innermost parenthetical grouping containing the stopping point; nil if none.
  2. The character position of the start of the last complete subexpression terminated; nil if none.
  3. Non-nil if inside a string. More precisely, this is the character that will terminate the string, or t if a generic string delimiter character should term inate it.
  4. t if inside a comment (of either style), or the comment nesting level if inside a kind of comment that can be nested.
  5. t if point is just after a quote character.
  6. The minimum parenthesis depth encountered during this scan.
  7. What kind of comment is active: nil for a comment of style â?oaâ?? or when not inside a comment, t for a comment of style â?ob,â?? and syntax-table for comment that should be ended by a generic comment delimiter character.
  8. The string or comment start position. While inside a comment, this is the position where the comment began; while inside a string, this is the position where the string began. When outside of strings and comments, this element is nil.
  9. Internal data for continuing the parsing. The meaning of this data is subject to change; it is used if you pass this list as the state argument to another call.

syntax-ppss seems like an insanely powerful feature. My respect for the sophistication going on behind the scenes when emacs opens a file for a certain language continues to grow.

python-space-backslash-table and python-skip-comments/blanks

This seems to be a "throw away" data structure for overriding the *real* syntax table temporarily
It provides a syntax table that redefines "\" as a whitespace class. Presumably this is for allowing movement over otherwise blank continuation lines as if they were whitespace.
What this meant wasn't immediately obvious, but it seems to be addressing the type of situation below (which honestly I haven't seen in the wild before, but seems to be legal python code).
```
if x and \
        \
        \
        \
        \
  y:
   print z
```
For movement we will use forward-comment which moves over up to X comments. It moves backward if arg is negative. The forward motion is straight forward. The backward motion starts by positioning point so that it is at the *start* of the comment (probably to avoid having it forward-comment move it to the start of the comment itself which would probably seem like a noop to the user)

python-backslash-continuation-line-p

This is pretty straight forward. Check if the last char on the previous line is a \ and make sure not in a comment or string
syntax-ppss-context is a an undocumented function (but trivial) that checks if the current syntax parsing state is string or comment

python-continuation-line-p

is an extension of the above function and checks for the case of a continuation char (ie the above function) or if in a matching paren type context that is allowed to span multiple lines.
The syntax-ppss-depth function tells you how far nested you are in parens, braces, etc. The interesting part of this function is that if you have unmatched parens then it tries to move up the list and assumes that if you succeed then you must be in something close enough to matching lists after all. I wasn't able come up with a set of braces that tickled this case but otherwise the logic makes sense.

So that is the first half of the utility functions.

I'm still overwhelmed by the amount of complexity required to get a language mode working. I haven't even gotten something seemingly simple like tab support working. I guess this is like a lot of things. When you start to understand a new system, many of the hard things turn out to be trivial and many of the seemingly trivial things turn out to be quite complex.

I had never really browsed through the emacs lisp manual before and I'm finding it is quite helpful and not too hard to navigate.

Eventually I'll have to understand a mode for some other language(s) as well. I'm curious how much python's significant whitespace makes makes things harder. Presumably the default language mode behaviors are tuned for something like lisp and/or C.

I also wonder if it would be possible to use pythons own parser in place of emacs for doing some of the syntax checking activities. That would be an interesting project to pursue. As will be looking at the pymacs module.

Settle down there tiger. One thing at a time....

Friday, August 1, 2008

emacs python mode from scratch: stage 2 - comments

So my next plan was to look at indenting/tabbing, but I found almost immediately that (a) it's really bizarrely complicated and (b) it depends (in some of the supporting functions) on comments being recognized correctly. So I'm forced to figure that out now. So, rather than reading the emacs manual (I mean *anyone* could do that) I used the tried and true binary code search technique and started commenting things out (starting with the original code) until font-locking for comments stopped working.

That worked like a charm. The missing piece of the puzzle was this little fella:


(defvar python-mode-syntax-table
  (let ((table (make-syntax-table)))
    ;; Give punctuation syntax to ASCII that normally has symbol
    ;; syntax or has word syntax and isn't a letter.
    (let ((symbol (string-to-syntax "_"))
   (sst (standard-syntax-table)))
      (dotimes (i 128)
 (unless (= i ?_)
   (if (equal symbol (aref sst i))
       (modify-syntax-entry i "." table)))))
    (modify-syntax-entry ?$ "." table)
    (modify-syntax-entry ?% "." table)
    ;; exceptions
    (modify-syntax-entry ?# "<" table)
    (modify-syntax-entry ?\n ">" table)
    (modify-syntax-entry ?' "\"" table)
    (modify-syntax-entry ?` "$" table)
    table))

Strangely you don't have to actually *use* this variable anywhere, it just needs to be defined. If I had to guess off the top of my head, I'd say that define-derived-mode uses this syntax-table if it's available. Otherwise it just falls back to some default behavior.

I was surprised also to see that my friends:


;;   (set (make-local-variable 'parse-sexp-lookup-properties) t)
;;   (set (make-local-variable 'parse-sexp-ignore-comments) t)
;;   (set (make-local-variable 'comment-start) "# ")

still don't seem to be necessary. So they remain commented.

Because I'm now using the block of code above in my "from scratch" python mode, technically I'm supposed to understand it now that I've copied it in. So let's look at what this guy does.

Looking in derived.el we find this little guy:


(defsubst derived-mode-syntax-table-name (mode)
  "Construct a syntax-table name based on a MODE name."
  (intern (concat (symbol-name mode) "-syntax-table")))

So it's not too surprising that it's using the name python-mode-syntax-table somewhat auto-magically.

And to really seal the deal we find in the elisp manual that "<" is the syntax class for comment starter. So that's part of what's getting set when this syntax-table variable is defined.

Also of interest is that initially all symbol constituent chars (except "_") are reassigned to the punctuation character class. This makes sense since python only allows numbers/letters and _ in variable names. Everything else is a type of punctuation.

Additionally we change the syntax class of a few more characters. Notably we tell it that "'" is a quote character and "`" is a paired delimiter.

Cool. Now we have comments that get correctly colored as comments.

lair of the dustbunny