MiniJudge Info Page

What is MiniJudge?

Last updated on September 4, 2008

MiniJudge (currently implemented as MiniJudgeJS 1.1.1 and MiniJudgeJava 0.9.9) is a tool for small-scale experimental syntax. MiniJudge helps you to design and run factorial judgment experiments with multiple speakers and sentences in random order, and then uses statistics run in R to test for main effects, interactions, and satiation, and summarizes it all in a detailed report. Though designed specifically for syntax judgment experiments, MiniJudge can be exploited for more creative purposes, including collecting judgments relating to pragmatics, semantics, morphology or phonology, or for running judgment experiments that are more than fully "small-scale".

Why experimental syntax?

Phillips and Lasnik (2003:61) are entirely right to emphasize that the "[g]athering of native-speaker judgments is a trivially simple kind of experiment, one that makes it possible to obtain large numbers of highly robust empirical results in a short period of time, from a vast array of languages." The experimental syntactician Cowart (1997:1) concurs: "The new data made available to linguists via the inclusion of judgments of sentence acceptability have, in conjunction with other innovations, brought a vast array of syntactic phenomena within the reach of contemporary theory." Even Labov (1996:102), otherwise sharply critical of judgments as a data source, admits that "[f]or the great majority of sentences cited by linguists," native-speaker intuitions "are reliable."

To use a tool wisely, however, it is necessary to recognize its limitations. As Phillips and Lasnik (2003:61) point out, "it is a truism in linguistics, widely acknowledged and taken into account, that acceptability ratings can vary for many reasons independent of grammaticality." Unfortunately, in actual practice linguists don't take the distinction between "acceptability" (a native speaker's sense of how "native" or "natural" a sentence feels) and "grammaticality" (the degree to which a sentence is generable or otherwise valued by one's internal mental grammar) as seriously as they know they should (Chomsky 1965 tried to head off confusion by introducing the term "grammaticalness," but it never caught on). The result is that the "trivially simple" methods of standard syntactic practice, in which "grammaticality judgments" (rather than "acceptability judgments") are treated as "direct" evidence about competence, can make for an insufficiently precise tool.

Almost since the beginning of generative linguistics (see Schütze 1996 for a thorough review), syntacticians have observed that judgments are often subtle, leading to disagreements among native-speaking syntacticians (not to mention between syntax teachers and their students). The most serious disagreements are over the factual status of claimed judgment contrasts, where only some speakers claim to get them, or where even a single speaker wavers in his or her judgments. Particularly worrisome is the possibility that subtle judgments may be easily swayed by bias, especially when "experimenter" and "experimental participant" are embodied in the same person.

As an example of a debate over facts, consider the German sentence in (1). According to Meinunger (2001), it is acceptable, thereby supporting certain theoretical claims. By contrast, Rapp and von Stechow (1999) claim that it is not acceptable, thereby supporting a competing set of theoretical claims. Non-native speakers have no way to choose between the competing empirical claims.

(1)	Gestern traf sie mich fast. yesterday met she me almost "Yesterday she almost met me."

Some apparent data disagreements may actually concern the proper interpretation of facts that are not really open to question, as Newmeyer (1983) has noted. In principle, such debates can be resolved by testing new sentences that pit one putative analysis against another. This is where factorial experimental designs become crucial, as discussed below.

Note that the problem of extracting reliable information about competence from noisy performance data (judgment-making) is precisely the same sort of problem faced by experimental cognitive scientists, who are trying to extract reliable information about processes hidden inside the "black box" of the mind (e.g. memory representations) via noisy behavioral channels (e.g. reaction times in a recall task). Thus a reasonable response to the empirical challenges faced by syntacticians would be to adopt the protocols standard in the rest of the experimental cognitive sciences, painstakingly developed over the course of an almost 200-year history: multiple stimuli and subjects (naive ones rather than the bias-prone experimenters themselves), systematic controls, factorial designs, continuous response measures, filler items, counterbalancing, and statistical analysis.

When judgments are collected with these more careful protocols, they sometimes reconfirm widely accepted phenomena, but they also reveal hitherto unsuspected complexity. Recent examples of the growing experimental syntax literature include Bard et al. 1996, Cowart 1997, McDaniel and Cowart 1999, Keller 2000, Snyder 2000, Sorace and Keller 2005, Featherston 2005a,b, and Clifton et al. 2006. An example from Cowart 1997 is sketched below in some detail. Cowart 1997 is a user-friendly handbook for syntacticians interested in learning proper experimental methodology. Mayo, Corley, and Keller 2005 describes software for designing and running syntactic judgment experiments online.

What is "small-scale" experimental syntax?

While undoubtedly useful, full-fledged experimental syntax is complex and time-consuming. The typical experiment will have over 100 sentences (including many fillers, included merely to hide the purpose of the experiment from the participants), and may involve a large number of speakers (typically around 40). These speakers are typically divided into groups so that they can be presented with counterbalanced sentence lists to ensure that every sentence type is displayed to some speaker without any one speaker seeing overly similar sentences. These speakers, and any assistants needed to create the sentences and run the experiment, must be compensated. Finally, analyzing the results requires statistical skills that syntacticians often have no training in. The consequence of all this is that the researcher must spend a lot of time on work that is not theoretically very interesting, making the time lag between positing a hypothesis about a specific judgment contrast and getting results far longer than what syntacticians are used to (weeks rather than hours or days).

However, the complexity of an experiment should be proportional to the subtlety of the effects it is trying to detect. Very clear judgments are detectable with traditional "trivially simple" methods; very subtle judgments, or research questions involving complex processing issues, may require full-fledged experimental methods. But in the vast area in between, a compromise seems appropriate, where methods are powerful enough to yield statistically valid results, yet are simple and cheap enough to learn and apply quickly: a small-scale experimental syntax.

As currently conceived, small-scale experimental syntax is defined by the following characteristics:

Experimental sentences only (no fillers)
Only as many sentence sets as are needed for statistical validity (around 5)
Only as many speakers as are needed for statistical validity (around 7)
All speakers get all sentences (no counterbalancing of sentence lists)
Binary YES/NO judgments
Maximum of two binary factors
Random sentence order, which is treated as a factor (helps control for some processing effects)

Some syntacticians might find these built-in restrictions to be overly limiting. In particular, the use of binary judgments restricts the potential amount of information. For example, with two speakers judging two sentences there are only 16 (= 2⁴) logically possible outcomes of a binary-judgment experiment, but with a 3-point scale there are 81 (= 3⁴). This means that detecting subtle patterns will, in general, require testing more speakers or sentences than are necessary with scales that are more information-rich. Nevertheless, the restriction to binary judgments is justifiable for a number of reasons:

Binary judgments are trivially simple for naive speakers to make, as long as they are encouraged to guess when they're not sure (what feels like a "random guess" may actually be systematic).

Magnitude estimation, which allows judgments to be made on a continuous scale, is the "gold standard" of judgment measures in terms of its informational richness (see Bard et al. 1996, Featherston 2005b, Sprouse 2007). However, unlike making binary judgments, making magnitude estimation judgments requires special training on the part of the judges.
Multi-point scales, ubiquitous in social science research, can be confusing to novices, as Bard et al. (1996) and Snyder (2000) discovered; the latter (p. 576, fn. 2) reports that one participant deducted only one point for a that-trace violation because only one word (that) was wrong!
A 3-point scale (YES, NO, NOT SURE) seems particularly problematic, since speakers are free to restrict YES and NO responses to those few sentences at the extreme ends of the acceptability scale, leaving NOT SURE as the most common (and useless) response.
In actual practice, binary judgment experiments can detect rather subtle contrasts even with relatively few sentences and speakers (see e.g. Myers 2007).
It is true that there may be systematic differences across speakers in their interpretation of the binary scale (e.g. an overall bias to respond YES when unsure), or a shift in the use of the scale over the course of the experiment. Moreover, Sprouse (2007) has recently provided evidence that such shifts may be associated more with binary scales than with continuous scales. However, biases like these can be factored out automatically in the statistics (see below for a bit more on this issue).
Even gradient degrees of acceptability (see Sorace and Keller 2005 for the theoretic interest of this) are detectable with binary judgments, as Cowart 1997:68 has pointed out. This is because cumulatively across speakers, the distribution of YES/NO judgments reflects gradience of the acceptability "sensation" itself: a sentence that is 70% acceptable, say, will tend to receive YES judgments 70% of the time. As with the detection of subtle contrasts, it turns out that demonstrating the statistical reliability of gradience is possible in a binary judgment experiment even with a very small number of speakers and sentences (see below for an example).

Another objection that can be made about small-scale experimental syntax from a psycholinguistic perspective is the lack of fillers and counterbalancing of sentence lists. Without fillers, speakers may guess the purpose of the task and give judgments partly in accordance with what they expect is wanted (or, if in a contrary mood, with the opposite). Without counterbalancing of sentence lists, speakers will get sets of lexically matched sentences, leading to biased responses, including comparative judgments on sentence pairs. Though comparative judgments are standard in linguistic practice, they are generally avoided in psycholinguistics. This is because comparison may artificially inflate contrasts or have other unknown context effects. It is currently unknown how serious these problems really are. Cowart (1997) reports that the choice of fillers doesn't seem to affect judgment patterns other than raising or lowering the whole scale, but this doesn't mean that the total absence of fillers won't have any serious effect. The lack of counterbalancing in sentence sets may be partially compensated for by counterbalancing order or factoring out order effects, as discussed below.

While conducting a small-scale experiment is much simpler than conducting a full-fledged judgment experiment, some steps may still be overly complex and/or intimidating to the novice experimenter, in particular the design of the experimental sentences and the statistical analysis. The purpose of the MiniJudge software is to automate these steps as much as possible.

Can MiniJudge go beyond small-scale experimental syntax?

Indeed it can. Most obviously, there's no reason why the linguistic items used as experimental materials have to be sentences. If you're interested in pragmatics or semantics, feel free to add long contexts (e.g. paragraphs) and to change the judgment task from acceptability to plausibility or detection of agents or antecedents (as long as the test questions can always be answered with YES or NO). Morphologists and phonologists will find that the automatic material generation algorithm can be used to create lists of nonce words or syllables, since MiniJudge will segment by orthographic unit if the prototype set contains no spaces. Phonologists may even enter the names of sound files as their experimental materials so that the randomly ordered lists generated for each survey can be used to play auditory stimuli (with additional software and hardware).

MiniJudge is restricted to at most two binary factors, but there are tricks that can be used to squeeze in more factors. Tips on modifying the R code are given below.

Similarly, though MiniJudge is designed to analyze binary judgments, you might use it to generate surveys for a judgment experiment using scalar judgments or magnitude estimation. However, the current version of MiniJudge expects only binary judgments, so it cannot help you to prepare the data for analysis.

Why do we have to test multiple speakers and sentences?

The reason why "proper" experiments do better than informal methods is simple. Experiments are designed to rule out as many "boring" explanations for an observation as possible, leaving only the "interesting" ones. Thus unless you pay careful attention to possible influences of these "boring" competitors, you can never be sure you've really ruled them out. As Cowart (1997:47) puts it, "any one person's response to any one sentence is usually massively confounded," so the only way to distinguish interesting causes from the confounds is to test multiple speakers (who vary in numerous uninteresting ways) and multiple sentences (which also vary in numerous uninteresting ways). Linguists intuitively know that they need to test multiple speakers and sentences, so they tend to test more than they explicitly report in their papers. Unfortunately, however, too often they simply assume, without any testing, that a "typical" speaker will agree that their chosen examples are indeed as "typical" as they think.

There is no simple answer to the question of how many speakers and sentences are needed (that is, how big the sample should be) to properly test for a hypothesized effect, since proper sample size depends on how strong the effect is, which you usually don't know until you've tested it. Most psycholinguistics experiments use around 20-30 speakers and 30-40 sets of items, but we've found significant results using MiniJudge on 10 or fewer speakers judging 5 or fewer sets of items. Generally, it's more important to have "enough" speakers than "enough" sentence sets, since speakers will probably be more different from each other than the sentence sets will. This is especially true, if the sentences within each set are very closely matched, that is, differ (almost) only in terms of the experimental factor(s) (and not, for example, length or lexical content).

The absolute smallest experiment, statistically speaking, would have six speakers judging just one pair of sentences (or a single speaker judging six pairs of sentences). If all six show the expected judgment pattern, the probability of this happening by chance alone is less than 0.05, and thus would be considered statistically significant. A future version of MiniJudge will give the option of performing such "super-small-scale" experiments, though it should be obvious that they depend on the rather unrealistic assumptions that a single sentence pair or a single speaker can be "typical" enough to stand for all.

Is it valid to keep adding speakers until the results become statistically significant? This is generally considered to be a bad idea, if not outright "cheating". Suppose, for example, that you hypothesize that Sentence A is better than Sentence B, when really they are the same. You test one person at a time, calculating the statistical significance of the cumulative results each time. The probability that, by "bad" luck alone, you will hit on a sample size where A is "significantly" better than B, is higher than if you had fixed a sample size ahead of time and stuck with it. For this reason, psycholinguists typically decide on their sample sizes before starting the research. They may start their research by running small-scale pilot experiments to get a feel for the plausibility of their hypothesis (MiniJudge is well suited for this), but they won't mix pilot results with the results of the "real" experiment.

The most time-consuming part of preparing a syntactic judgment experiment is the generation of multiple sentence sets. MiniJudge automates this process somewhat, but a native speaker is still needed to confirm that each sentence set, designed to highlight theoretically interesting contrasts, differs from the other sentence sets only in theoretically unimportant ways. To this end, Cowart (1997:50-51) suggests using a thesaurus to come up with a list of semantically related verbs, so that syntactically matched sentences can be generated around a subset of them. Making the process even easier are the Web thesauruses available for many major languages.

Though MiniJudge treats all differences across speakers as random, many linguists assume that speakers within a speech community may also show systematic differences in grammar (idiolects). It is possible in principle to test for idiolects (see Rousseau & Sankoff 1978), and perhaps future versions of MiniJudge will include such a test as an option, but statistical validity requires testing so many speakers and sentences that it's not clear that such an experiment would still be "small-scale." See also Labov 1996 for a skeptical counterargument against the reality of idiolects (as well as important observations about the difficulty of interpreting judgments even when collected in careful experiments.)

What is a factorial experiment?

An experiment is a tool for testing a hypothesis of the form "If X then Y." The basic logic is to input X (e.g. a sentence) into the system (e.g. a speaker's brain) and see if Y (e.g. a judgment of the expected type) comes out. Implicit in the statement "If X then Y" is a comparison: "If [+X] then [+Y], but if [-X] then [-Y]." Here Y is the dependent variable, X is the independent variable or factor, and [+X] and [-X] are the levels of the factor, with [-X] the control.

The informal collection of judgments involve experiments in the sense that sentences are the inputs and judgments are the outputs, but they cannot be considered factorial experiments unless they explicitly compare sentences designed to test specific factors. Unfortunately, syntacticians often elicit judgments of isolated sentences, without any controls, and interpret binary YES/NO acceptability judgments as directly reflecting grammaticality. However, as is well known, acceptability judgments do not in fact reflect grammaticality directly (Chomsky 1965). Hence without a control sentence to compare with, there is no way to know which aspect of any individual sentence is responsible for how it is judged.

In other words, the need for factorial design follows immediately from the fact that many factors play roles in any observation, and we can only figure out which factors are crucial if we design experiments with these factors in mind. Linguists should find this principle entirely familiar, since it underlies the use of minimal pairs.

There is no reason why an experiment must test only one factor at a time, and in fact, the typical psycholinguistic experiment involves two or more independent variables. This not only saves time (a single experiment can test more than one hypothesis) but it also allows the effects of different factors to be distinguished. Distinguishing factors is particularly important when only one of them is theoretically important, with the other a mere nuisance variable that is confounded with the interesting variable. Thus unless the experimenter explicitly acknowledges the nuisance variable, it is impossible to be confident that the observed effects are really due to the interesting variable. A two-factor experiment makes it possible to factor out the effects of the nuisance variable, so the effects of the interesting variable (if any) can stand out clearly. (Of course, the more factors that are used in an experimental design, the more complex the experiment and its interpretation, which is why MiniJudge only allows a maximum of two factors.)

To illustrate the logic of a factorial design, Cowart (1997) discusses his experiments on that-trace effects in English. The basic hypothesis is that a sentence like (2) is ungrammatical.

(2) Who do you think that likes John?

If (2) is truly ungrammatical, we predict that it will receive an "unacceptable" judgment. But since even grammatical sentences can vary in acceptability, we need a comparison sentence that is matched as closely as possible to (2), but which is hypothesized to be grammatical, and thus acceptable. The most obvious candidate for this would be (3), which removes the offending that.

(3) Who do you think likes John?

As Cowart (1997) points out, however, sentences like (2) and (3) actually differ in more than one way: not only does (2) contain a that-trace sequence, but it also contains a that. The presence of that in (2) might, by itself, be enough to give it a lower degree of acceptability than (3), for example by making the sentence longer, by making the complex (subordinate clause) structure more salient, or in any number of unknown ways. Hence we need a way to disentangle the that-trace effect from any possible that effect.

The solution is to cross the [+/-That] factor with a factor defining the position of the extraction site: [+/-Subject]. This results in a set of four sentences:

(2) [+T,+S] Who do you think that _ likes John?
(3) [-T,+S] Who do you think _ likes John?
(4) [+T,-S] Who do you think that John likes _ ?
(5) [-T,-S] Who do you think John likes _ ?

We now predict that sentence (2) (with a that-trace sequence) will be judged worse than sentence (3) (without such a sequence), relative to any judgment difference between sentences (4) and (5) (which act as controls testing for a that effect). Cowart (1997) confirmed these predictions, using full-fledged experimental procedures. In addition, however, Cowart also confirmed the reality of the nuisance that effect: sentences like (4) were indeed judged worse than sentences like (5), despite their both being (presumably) grammatical. Thus a that-trace effect cannot be established by comparison of sentences like (2) and (3) alone.

Note that the resulting sentence set is similar to the sort already often cited in syntax papers: factorial logic is implicit even when "trivially simple" methods are used. The problem is that syntacticians rarely follow this logic systematically enough. Thus it is quite common for them to cite individual sentences with no comparison controls, or pairs of sentences when quartets are actually necessary, which happens whenever the theoretical claim involves two components (e.g. that and trace). Even when sentence quartets are cited, they often involve confounds beyond the two binary factors being tested (an example is shown in the next section).

What is an interaction?

If two factors do not interact, the effect of each factor stays the same even if the value of the other factor is changed. If they do interact, the observed effect depends on the combination of both factors, so neither factor can be studied independently.

The that-trace example discussed above involves an interaction between the two factors [+/-That] and [+/-Subject], where the [That] effect is stronger with [+Subject] sentences (i.e. where it is the subject that is extracted) than with [-Subject] sentences. Equivalently, the [Subject] factor only affects judgments for [+That] sentences; with [-That] sentences, subject and object extraction are equally acceptable.

Syntactic hypotheses often concern interactions. As noted above, this happens whenever the theoretical claim involves the relationship between two elements, such as that and a trace. As another example, consider Li 1998, where one of the points is that in Chinese, the existential morpheme you is allowed with a number if and only if the number is used in an individual-denoting sense (as opposed to a quantity-denoting sense). Now the two factors are [+/-you] and [+/-Individual-denoting]. Four of the sentences cited in Li's squib fit into a quasi-factorial design, as shown below. It is not a perfect factorial design, since the sentences are only matched pairwise. Moreover, the sentence were originally scattered across the squib, obscuring the factorial logic still further.

(6)	[+Ind, +you]	You sange xuesheng zai xuexiao shoushang le. [= (3) in Li] have three-classifier student at school hurt aspect "Three students were hurt at school."
(7)	[+Ind, -you]	*Sange xuesheng zai xuexiao shoushang le. [= (1) in Li] three-classifier student at school hurt aspect "Three students were hurt at school."
(8)	[-Ind, +you]	*You sanzhi gunzi gou ni da ta ma? [= (17a) in Li] have three-classifier stick enough you hit him question "Are three sticks enough for you to hit him (with)?"
(9)	[-Ind, -you]	Sanzhi gunzi gou ni da ta ma? [= (8) in Li] three-classifier stick enough you hit him question "Are three sticks enough for you to hit him (with)?"

Note that the pattern of stars above shows an interaction. We have no reason to think there would any main effects of the [Ind] and [you] factors themselves (e.g. that the average judgment for (6)-(7) is different from that for (8)-(9)). It is only the interaction that is theoretically relevant.

Often the easiest way to understand an interaction is to graph the results. MiniJudge does this automatically.

The logic of interactions is also used in a special way by MiniJudge in order to test for satiation, as explained below.

Why and how is the sentence order randomized?

One of the most powerful nuisance variables in psycholinguistics is order of presentation. Experimental participants change the pattern of their responses over the course of an experiment in complex ways, becoming both more practiced (hence more accurate) and more tired or bored (hence less accurate). If materials in the experiment are similar to each other (as they necessarily will be), they will also prime each other. That is, exposure to an item earlier in the experiment will prepare the brain for processing an item of a similar type later in the experiment. As a matter of fact, acceptability judgments of grammatical sentences are indeed affected by prior exposure to similar sentences, as shown by Luka and Barsalou (2005).

The practical effect of this situation is that experimenters must mix the order of presentation across participants, so that on average, any given item has an equal chance of appearing at any time over the course of the experiment. Otherwise, if all [+X] items come earlier than all [-X] items, the factor [X] is confounded with order, so it's hard to be sure which is really causing the observed effects.

The most common way to neutralize order confounds is to randomize the order of presentation (i.e. the order of sentences in the list). However, if there aren't a lot of items or participants, bad luck could still result in order confounds. The standard solution to this problem is to use partial randomization. The implementation used by MiniJudge is described in Cowart (1997:101). First, random numbers are assigned to every sentence. Then the sentences are sorted by these random numbers within sentence types (e.g. [+F,+G]). Each sentence within a type is associated with a different block (with as many blocks as there are sentences of each type). Finally, sentences within blocks are sorted by the random numbers. The result is that every sentence has an equal chance to appear at any point in the experiment (due to randomization of the blocks), but the sentence types are distributed evenly (and randomly) across the experiment as well (by first sorting within types).

Because MiniJudge experiments involve only a small number of sentences, without fillers or counterbalancing lists across speakers, it is quite likely that even with partial randomization of order, speakers will still be presented with adjacent sentences from the same sentence set. Given the shortness of the overall sentence list, and the fact that all sentences are presented at once anyway (on paper or in an email), it's not clear that complicating the algorithm to block such pairings would be worthwhile.

MiniJudge partly compensates for this sort of shortcoming by including order as an independent variable in the statistical analysis, so that lingering effects of order can be factored out. Such effects include those that are usually handled by counterbalancing sentence lists across speakers. List counterbalancing is used in full-fledged experimental syntax so that speakers don't judge sentences within the same set (like sentences (2)-(5)), which may induce an explicit comparison strategy (undesirable for the reasons explained above.) This relates to order because comparison can only occur when the second sentence of a matched pair is encountered. If roughly half of the speakers get sentence type [+F] first and half get sentence type [-F] first, then on average, judgments for [+F] vs. [-F] are only partially influenced by a comparison strategy. Moreover, the comparison strategy (if any) will be detectable as an order effect: early judgments (when comparison is impossible) will be different from later judgments.

It may also be informative to look at interactions with order, both to factor out order effects even more, and also to test for satiation, as explained next.

What is satiation?

Syntacticians have long noted an annoying phenomenon: for certain types of sentence contrasts, repeated testing dulls the intuitions, so it becomes harder and harder to be confident in one's judgments. Snyder (2000), who was one of the first to study this phenomenon experimentally, called it syntactic satiation.

Snyder suggested that far from being an annoyance, satiation could provide a new window into the nature of grammar and/or processing. This position has been examined in several studies (e.g. Goodall 2004, Hiramatsu 2000, Sprouse 2007). Even when satiation is not considered of theoretical interest, however, it may obscure other effects. An example of this is shown in Myers (2007), where the theoretically interest judgment contrast does not reach statistical significance unless satiation is factored out.

Note that Snyder defined satiation has increasing acceptability. MiniJudge factors out such overall changes in acceptability by default, by including the order of the sentences in the experiment as a factor. However, of greater interest are cases where acceptability increases only for ungrammatical sentences, while that of grammatical ones remains constant. This change in the strength of the factor defining grammaticality would be reflected in an interaction between the factor and the order of sentences. This is the sense in which MiniJudge tests "satiation". The same test can also detect "anti-satiation", where the acceptability contrast increases, rather than decreases, over time. For example, Ko (2007) found that acceptability may increase for grammatical sentences while remaining low for ungrammatical ones (perhaps because speakers became more accustomed to parsing complex structures).

MiniJudge doesn't test for interactions with the order of sentences unless explicitly requested. This is because including interactions with a continuous variable like order can make main effects harder to interpret, even causing them to become nonsignificant or reverse. Thus it's safer to keep the statistical model simple unless you have a specific interest in satiation or in factoring out all traces of order effects in your data.

If your analysis shows a significant interaction with order, the easiest way to see if this represents satiation or anti-satiation is to graph the results. MiniJudge does this automatically.

Why and how does MiniJudge use statistics?

Statistics basics.

Since an experiment is designed to test a hypothesis of the form "If X then Y," it can be thought of as a tool for detecting a systematic correlation between X and Y. The human brain is biased towards seeing patterns, whether or not they actually exist, so the safest course is to calculate the probability that an observed "correlation" could have arisen by pure chance, and then only accept a "correlation" as "real" if this chance probability is lower than some pre-set threshold. The is the standard logic of inferential statistics. Thus statistics is not a rhetorical device you tack on to the end of your analysis, but instead plays an integral role in the design of your research from the very beginning.

Chance probability is symbolized by p; MiniJudge adopts the standard threshold of p < .05 (i.e. p < 1/20). It represents the probability that the null hypothesis is correct. A p value below this threshold is considered statistically significant. Whether or not it is also "significant" in the ordinary sense depends on the size of the judgment difference and the number of observations; a 1% difference in judgments detected with 100 speakers and 500 sentences may turn out to be statistically significant without having any real practical relevance (see Cowart 1997:123 for a real example of such a case). Moreover, a threshold set at p < .05 means that even if your theoretical hypotheses are totally wrong, you will get a "significant" result by pure chance one out of every twenty times you run the same experiment (see Ioannidis 2005 for discussion of the dramatic implications of this simple mathematical fact). Another thing to keep in mind is that if we find p > .05, we cannot claim that the pattern is necessarily "pure chance," only that we have failed to rule out this possibility with a conventional level of confidence. Thus a nonsignificant trend may be worthy of follow-up with improved materials or design, or an increased number of speakers or sentences.

Types of statistical models.

Chance is modeled differently depending on the type of data and the type of hypothesis being tested. The most familiar statistical model used in psycholinguistics is ANOVA (ANalysis Of VAriance), but the data derived in a MiniJudge experiment cannot be modeled using ANOVA. This is because unlike most psycholinguistic data (e.g. reaction times or accuracy scores), the data in a MiniJudge experiment are categorical (specifically, binary): either "acceptable" or "unacceptable."

The most familiar statistical model for categorical data is chi-square, often taught in introductory statistics classes. Yet MiniJudge data are not appropriate for this model either, since they are repeated-measures data. This means the observations come in grouped clusters, rather than being totally independent: the clusters are the speakers, each of whom judges multiple sentences. In psycholinguistics, continuous repeated-measures data are usually handled by repeated-measures ANOVAs across participants, after first averaging across items. If the items are not matched, a separate by-item analysis (averaging across participants) is standardly called for. But MiniJudge data isn't continuous, and categorical data can't be "averaged."

An additional complexity is the treatment of order as a factor, which unlike the main binary factors, is continuous. If the dependent variable were continuous, this could be handled by ANCOVA (ANalysis of COVAriance), or by a linear regression, which extracts the best-fitting equation for a correlation (i.e. by interpreting "If X then Y" as "Y = f(X) + noise"). If our categorical data were not repeated-measures data, we could use logistic regression, which is a generalization of ordinary regression; it is at the core of the sociolinguistic variable-rule analyzing program VARBRUL and its descendants GoldVarb (for Macs) and GOLDVARB 2001 (for PCs) (see Mendoza-Denton et al. 2003). (In fact, this online JavaScript-based logistic regression program, written by John C. Pezzullo, was a key inspiration in the creation of MiniJudgeJS.)

But MiniJudge data are both categorical and repeated-measures. Currently the best known way to handle this kind of data is to use something called mixed-effects logistic regression, a special case of GLMM (Generalized Linear Mixed Modeling) (see e.g. Baayen 2008, Agresti et al. 2000). Like ordinary logistic regression, GLMM extracts the best-fitting equation for the data, with each component of the equation relating to a different factor or an interaction. It also permits including both continuous and categorical factors as independent variables, the latter via effect coding, where [+F] is represented by 1 and [-F] by -1 (this coding makes it easier to test the significance of interactions). For each factor or interaction, GLMM computes not only p values, but also coefficients, which for our purposes are relevant only in their sign. That is, the coefficients show whether a significant effect is positive or negative.

A positive effect for a binary factor means that the probability of getting an "acceptable" judgment is significantly higher for the [+] value of the factor; a negative effect means the same for the [-] value. A positive effect for order means that judgments get better over the course of the experiment; a negative effect means they get worse. An interaction between some factor and order means satiation or anti-satiation: the judgment contrast for that factor changed over the course of the experiment.

The actual sign of an interaction between order and a factor, or between factors, depends partly on how the factors are defined (i.e. whether [+F] is grammatical or ungrammatical). It's much easier to understand the nature of an interaction by looking at the number of "yes" judgments for each combination of factor values (e.g. perhaps [+F+G] and [+F-G] give about the same number of "yes" judgments, whereas [-F+G] gives many more "yes" judgments than [-F-G]).

Handling cross-speaker and cross-sentence variability.

As noted above, in psycholinguistics it is standard to compute two ANOVA models: one that takes cross-subject (speaker) variability into account, and one that takes cross-item (sentence) variability into account. A commonly made justification for including a by-item analysis is that it is required to test for generality across items, just as by-subject analyses test for generality across subjects. However, as pointed out by Raaijmakers et al. 1999, this convention is based on a misinterpretation of Clark 1973.

First, it is wrong to think that by-item analyses check to see if any item behaves atypically (i.e. is an outlier). For models like ANOVA, it is quite possible for a single outlier to cause an illusory significant result. Thus to test for outliers, there's no substitute for checking the individual by-item results yourself. MiniJudge helps with this by reporting the by-sentence rates of "yes" judgments; items with unusually low or high acceptability relative to others of their type should stand out clearly (an example is shown below).

Second, Clark's advice actually only applies to experiments without matched items, for example, an experiment comparing a random set of sentences with transitive verbs ("eat" etc) with a random set of sentences with unrelated intransitive verbs ("sleep" etc). Such sentences will differ in more than just the crucial factor (transitive vs. intransitive), so even if a difference in judgments is found, it may actually relate to uninteresting confounded properties (e.g. the lexical frequency of the verbs). However, if matched items are used, as in the that-trace experiment described above, there is no such confound, since items within each set differ only in terms of the experimental factor(s). In essence, matching means that the factor(s) and sets are crossed, so that variation due to the factor(s) can be clearly distinguished from noise due to non-factor-related sentence differences. Matching of the speakers is not possible, so the analysis must still include speakers explicitly in the model, which is what the by-subject analysis accomplishes.

Many syntax experiments involve sets of matched sentences, but sometimes factors are defined in terms of lexical classes, which are difficult to match perfectly (e.g. transitive/intransitive, unaccusative/unergative, psych-/non-psych verbs, animate/inanimate subjects). Other times the confounding of syntactic and lexical factors is more subtle, as in the demonstration example below showing the results of an experiment on complex NP islands in Chinese: complex NPs are not only syntactically different from simple NPs, but they also involve extra lexical content (e.g. "the boy" vs. "the boy who visited the girl"). Morphologists and phonologists who use MiniJudge to collect judgments on words face this confounding problem even more seriously. If items are sufficiently well matched (which may even happen when factors are defined by lexical classes), taking cross-item variation into account won't make any difference in the analysis (except to make it much more complicated), but if they are not well matched, ignoring the cross-item variation will result in misleadingly low p values.

Nevertheless, if we only computed models that take cross-item variation into account, we might lose useful information. After all, a high p value does not mean that there is no pattern at all, only that we have failed to detect it this time. Thus it may be useful to know if a by-speaker analysis is significant even if the by-sentence analysis is not. Such an outcome could mean that the significant by-speaker result is an illusion due to an uninteresting lexical confound, but it could instead mean that if we do a better job matching the items in our next experiment, we will be able to demonstrate the validity of our theoretically interesting factor. Thus MiniJudge runs both types of analyses, and only includes the by-item analysis in the main report if a statistically significant confound between factors and items is detected. The full results of both analyses are saved in an off-line file, along with the results of the statistical comparison of them.

One final note: Models like GLMM make it possible to take cross-speaker and cross-sentence variation into account at the same time: there is no need to do more than one analysis, contrary to what most psycholinguists still assume. To learn more about how advances in statistics have made some psycholinguistic traditions obsolete, see Baayen 2004.

Limitations.

Though GLMM is the best statistical model currently available for MiniJudge-type data, it does have some limitations.

GLMM tests significance using z scores, which are reliable only if the number of observations is sufficiently large. In practice this is not really a problem, since "sufficiently large" may be as small as 50, and 50 judgments are trivial to collect (e.g. 5 speakers judging 10 sentences each).
Like regression in general, GLMM assumes that the correlation between the dependent and independent variables is not perfect; if it is, the estimation algorithm may not converge on any coefficient value, despite the fact that with a reasonable number of sentence sets and speakers, a perfect correlation must truly be significant.
Like logistic regression (but unlike ANOVA or ordinary regression), it is impossible to calculate GLMM coefficients and p values perfectly; they can only be estimated. Unfortunately, the best way to estimate GLMM values is extremely complicated, and even high-powered statistical programs tend to use "simpler" yet less accurate estimation methods.
Even the "simpler" estimation methods for GLMM are quite complicated to program, so MiniJudge depends on another free program to do this job, as explained next.

What is R?

R is free software available at www.r-project.org. R is by far the best free statistics package available, and its power and flexibility (not to mention its freeness) have made it a worldwide standard. If you do any serious quantitative research, it's well worth owning. You can learn more about it in Crawley 2005, Johnson 2004 (with a chapter on experimental syntax), and Baayen (2008), among many other places.

The main downside to R is that it is not very easy to learn or use. It relies on a command-line interface, not the menus and dialog boxes familiar from most modern programs; this puts a serious burden on memory, especially since its online help can be frustrating. There are currently several ongoing projects to create a simpler interface for it; one of the most advanced is R Commander (Fox 2005).

Since R is difficult to use, MiniJudge does all the communication with R, both generating the necessary R code and translating R's output into an easy-to-read format.

Here are the two key R links:

When you feel psychologically prepared to download R for real, do the following:

Click on the last link above to see a list of downloading locations for R.
On this list, find the location nearest to you and click on it.
In the Download and Install R box, click the link for your operating system.

What happens next depends on your operating system:

Windows: Choose "base," then the setup program (e.g. "R-2.4.1-win32.exe").
Mac: Download the installer appropriate for your computer.
Linux: Any tips would just hurt your hacker pride.

The specific R package used by MiniJudge for GLMM is called "lme4" (and its prerequisite package "Matrix"), authored by Douglas Bates, Martin Maechler and Bin Dai, and maintained by Douglas Bates. The R code generated by MiniJudge will guide you through the downloading and installation of these packages. They are still undergoing modification, so it is possible that in the future MiniJudge will need to be modified to interface properly with it. To get updated versions of your installed R packages, choose the "Update packages..." option in R's "Packages" menu, or paste in the following R code:

update.packages()

What do the statistical results mean?

MiniJudge translates R's output into an easier-to-read format, but it also saves a more detailed report in a file. Here is a line-by-line interpretation of a typical report, where satiation was not tested. The data file is here (taken from an experiment on complex NP islands in Chinese, conducted by Yu-guang Ko as a class exercise).

R OUTPUT	EXPLANATION
Analysis of mjdemo.txt: Factor1 = ComplexNP Factor2 = Topic	Header.
Default model only including cross-speaker variation:	By-subject-only model.
Generalized linear mixed model fit by the Laplace approximation	GLMM estimated using R's currently most powerful GLMM algorithm.
Formula: Judgment ~ Factor1 * Factor2 + Order + (1 \| Speaker)	Judgment: dependent variable (0 or 1). Y~X: Y varies as a function of X. Factor1 * Factor2: test both factors and their interaction. (1\|Speaker): group by Speaker.
Data: minexp	Default name of loaded data.
AIC BIC logLik deviance 74.84 92.49 -31.42 62.84	Measures for fit of model to data.
Random effects: Groups Name Variance Std.Dev. Speaker (Intercept) 3.6564 1.9122	The grouping factor Speaker is treated as a random variable since the experiment wasn't design to test differences across speakers.
Number of obs: 140, groups: Speaker, 7	A total of 140 data points from 7 speakers.
Fixed effects:	These are the variables we are testing for significance.
Estimate Std. Error z value Pr(>\|z\|) (Intercept) 2.79110 1.19183 2.342 0.0192 * Factor1 -1.27336 0.49373 -2.579 0.0099 Factor2 -3.37312 0.65556 -5.145 2.67e-07 * Order -0.07763 0.06535 -1.188 0.2349 Factor1:Factor2 -1.29558 0.49743 -2.605 0.0092 **	Intercept: Measures effect of random factor (here, speakers); ignored by MiniJudge. Factor1:Factor2: interaction. Estimate: GLMM regression coefficient (only the sign is crucial). Pr(>\|z\|): p value (two-tailed). z value: measures how far coefficient is from chance expectation of 0. Std. Error: standard error (used to compute z value).
Signif. codes: 0 '*' 0.001 '' 0.01 '*' 0.05 '.' 0.1 ' ' 1	Classification of p values (MiniJudge only checks if p < .05).
Correlation of Fixed Effects: (Intr) Factr1 Factr2 Order Factor1 -0.025 Factor2 -0.423 0.200 Order -0.635 0.095 0.188 Fctr1:Fctr2 -0.065 -0.332 0.217 0.153	Intr: Intercept. These values tell you more about the relationships among the effects, where 1 (-1) would be a perfect positive (negative) correlation.
More complex model including both cross-speaker and cross-sentence variation:	May be needed if sentences are not matched.
Generalized linear mixed model fit by the Laplace approximation Formula: Judgment ~ Factor1 * Factor2 + Order + (1\|Speaker) + (1\|Item)	Note that "Item" has been added as a grouping factor.
Data: minexp AIC BIC logLik deviance 72.83 93.43 -29.42 58.83 Random effects: Groups Name Variance Std.Dev. Item (Intercept) 5.1858 2.2772 Speaker (Intercept) 10.8234 3.2899 Number of obs: 140, groups: Item, 20; Speaker, 7	Note that by adding Item as a random variable, the variance for Speaker rises somewhat, a hint that this model is noisier than the simpler one.
Fixed effects: Estimate Std. Error z value Pr(>\|z\|) (Intercept) 4.7756 2.0916 2.283 0.0224 * Factor1 -2.0717 0.8814 -2.351 0.0187 * Factor2 -5.5892 1.3436 -4.160 3.19e-05 *** Order -0.1337 0.1069 -1.251 0.2110 Factor1:Factor2 -2.0925 0.8875 -2.358 0.0184 * --- Signif. codes: 0 '*' 0.001 '' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Correlation of Fixed Effects: (Intr) Factr1 Factr2 Order Factor1 -0.034 Factor2 -0.474 0.146 Order -0.594 0.068 0.225 Fctr1:Fctr2 -0.078 -0.181 0.166 0.138	For this data set, including cross-sentence variability in the model makes no real change in the pattern observed in the simpler model. (Note that R's lmer function was recently updated; earlier versions were incapable of detecting significant patterns in the more complex model for this data set.)
Comparison of the two models: Data: minexp Models: glmm1: Judgment ~ Factor1 * Factor2 + Order + (1\|Speaker) glmm2: Judgment ~ Factor1 * Factor2 + Order + (1\|Speaker) + (1\|Item) Df AIC BIC logLik Chisq Chi Df Pr(>Chisq) glmm1 6 74.838 92.488 -31.419 glmm2 7 72.834 93.425 -29.417 4.0041 1 0.04539 * --- Signif. codes: 0 '*' 0.001 '' 0.01 '*' 0.05 '.' 0.1 ' ' 1	glmm1, glmm2: MiniJudge's names of the two models. First "Df": encodes model complexity. AIC, BIC, logLik: measures of model fit; smaller AIC & BIC mean better fit. Chisq & Chi Df: used to compute p value. Pr(>Chisq): p value for model comparison (highlighted in blue). Since here p < .05, the complex model is (just barely) better than the simpler one.
By-item percentages of YES judgments: Factor(s) Item %YES ============================ [+C][+T] 1 0 [+C][+T] 5 14 [+C][+T] 9 14 [+C][+T] 13 0 [+C][+T] 17 0 ---------------------------- [+C][-T] 2 100 [+C][-T] 6 86 [+C][-T] 10 100 [+C][-T] 14 100 [+C][-T] 18 100 ---------------------------- [-C][+T] 3 86 [-C][+T] 7 71 [-C][+T] 11 29 [-C][+T] 15 71 [-C][+T] 19 86 ---------------------------- [-C][-T] 4 100 [-C][-T] 8 86 [-C][-T] 12 100 [-C][-T] 16 100 [-C][-T] 20 100 ----------------------------	This information can help you detect whether any sentences go against the general judgment pattern. Note that in this case, sentence 11 seems somewhat anomalous, with a lower "YES" rate than the other sentences of its type. It's also interesting that even though this experiment was very small and involved binary judgments, the data still revealed gradience: [-T] sentences (nontopicalized) were treated as essentially "perfect," [+C][+T] sentences (topics extracted from complex NPs) were treated as essentially "impossible," but [-C][+T] sentences (topics extracted from simple NPs) fell in between.
Number of YES judgments for each category: [+C] [-C] Total C = ComplexNP [+T] 2 24 26 T = Topic [-T] 34 34 68 Total 36 58 94	Percentages are not given because the raw number of judgments is just as important as the proportions. Note that the interaction between the two factors involves an effect of [Topic] on judgments only in the context of [+ComplexNP], just as we would expect. However, the large effect of [Topic] is suspicious, since such structures should be grammatical in Chinese. There may be something wrong with the sentences or the design.
Significance summary (p < .05): The factor ComplexNP had a significant negative effect. The factor Topic had a significant negative effect. The interaction between ComplexNP and Topic had a significant negative effect. There were no other significant effects. Items and factors were significantly confounded, so the above results take cross-item variability into account.	This summary, generated by MiniJudge, is intended to be self-explanatory. See above for how to interpret positive/negative effects.

To report the above results in the standard way, you would refer to the red values above, which summarize the key values for the preferred model (in this case, the by-subjects-and-items model). You can tell that the by-subjects-and-items analysis is the relevant one because of the blue p value for the comparison above between the default and complex analyses; here, p < .05, so there is a significant advantage in using the complex analysis. For each significant factor and interaction, you should report all four values. For example, for Factor1 (ComplexNP), you could report as follows, using standard notation (note also the rounding): "B = -2.07, SE = 0.88, z = -2.35, p < .05."

A note about p values: A value containing the string "e-" is very very tiny; e.g. 3.19e-05 = 3.19 x 10^-5 = 3.19/100000 = 0.0000319. That's why Factor2 (WhMove) is marked as significant above.

If R encounters a fatal problem while analyzing the data (e.g. a perfect correlation that makes it impossible for the estimation algorithm to converge), it will return "NA" (not available) and "NaN" (not a number) instead of actual values. MiniJudge will still give a basic results summary, but it won't be able to tell you if the pattern is statistically significant.

Statistics beyond MiniJudge

After running MiniJudge's automatically generated R code, you can continue examining the results by entering new R code yourself. By studying the sample R code below, you may be able to figure out how to do what you want. Note that "#" marks explanatory comments in the code; everything after it in a line is ignored by R. Note also that you should replace the underlined portions with the information appropriate to your data. For example, for some analyses, you need to replace the generic names "Factor1" and "Factor2" by your actual factor names, and the names of data files with the actual names of your data files.

Modifying the built-in MiniJudge analyses.

The mathematical "objects" generated by MiniJudge's R code will still be active, so you can continue to manipulate them. The most relevant objects are Factor1, Factor2 (if any), Order, glmm1 (the default by-subjects-only analysis), and glmm2 (the by-subjects-and-items analysis). Note that capitalization and spacing is crucial: R won't recognize "factor1" or "Factor 1"!

For example, suppose you run an analysis of a two-factor experiment, ignoring satiation, where only one factor (e.g. Factor1) is crucial to your theory (the other is just a nuisance variable that you want to control). You don't care about any interaction. The results show, in fact, that there is no significant interaction. Unfortunately, however, you find that your "important" factor is also nonsignificant. This might mean that your hypothesis is wrong, of course, but maybe a real effect is being hidden in an overly complex model. In particular, it could be that by factoring out that nonsignificant interaction, you are also removing information relevant to Factor1 itself. In such a situation, you might want to rerun the analysis without the interaction factor. (NOTE: This only makes sense if the interaction is nonsignificant. If it's significant, then of course it's "cheating" to ignore it.)

To do this in R, simply paste in the following code. Note that in this case, you don't need to replace the generic factor names, since you're updating an analysis that R has just run.

# Subtract interaction from original glmm1 model
  glmm1.noint = update(glmm1, . ~ . - Factor1:Factor2)

# Subtract interaction from original glmm2 model
  glmm2.noint = update(glmm2, . ~ . - Factor1:Factor2)

# Compare the two new models
  anova(glmm1.noint, glmm2.noint)

# Display the preferred model
  glmm1.noint # ... or glmm2.noint, depending on the previous step

Interpretation of the results works the same way as above. Using similar code, you can remove Order as a factor, do a satiation analysis that only looks at the interaction between Order and one of the factors, not both of them, and so on.

Analyzing more than two factors.

MiniJudge restricts the number of factors to two, since interpreting three or more factors becomes quite complex, especially regarding the interactions. However, there are two general types of situations where it might be useful to include an additional factor.

(1) Adding a third factor defined by the materials. For example, you might hypothesize that the that-trace effect (defined by the factors [That] and [Subject]) is stronger with wh-movement than with topicalization. In this case, we need the within-groups factor [WhMove] (so called because the same speakers get both wh-movement and topicalization sentences. Your hypothesis predicts a three-way interaction between [That], [Subject], and [WhMove].

The trick here is to design your sentence sets such that the first half represents one value of the third factor, and the second half represents the other value. For example, if you have 10 sentence sets, the first 5 sets should involve wh-movement and the second 5 should involve topicalization, but otherwise be matched. Thus sets 1 and 6 might be as follows:

Set 1: [+W]

[+T,+S] Who do you think that _ likes John?
[-T,+S] Who do you think _ likes John?
[+T,-S] Who do you think that John likes _ ?
[-T,-S] Who do you think John likes _ ?

Set 6: [-W]

[+T,+S] Mary, you think that _ likes John.
[-T,+S] Mary, you think _ likes John.
[+T,-S] Mary, you think that John likes _ .
[-T,-S] Mary, you think John likes _ .

Once you analyze the whole experiment with MiniJudge, you can use the following code to add the third factor Factor3, where the first half of the sets represents [-Factor3] and the second half represents [+Factor3]. Again, you don't need to replace the generic names of the other factors, since they'll still be active.

# Find dividing point between [-Factor3] and [+Factor3] sets
  DividePoint = max(Set)/2

# Define Factor3
  Factor3 = (Set > DividePoint)*2-1

# Add Factor3 to data set
  data3 = cbind(minexp, Factor3)

# Run by-subjects-only GLMM
  glmm1.3 = update(glmm1, . ~ Factor1 * Factor2 * Factor3 + Order
   + (1|Speaker), data = data3, family = "binomial")

# Run by-subjects-and-items GLMM
  glmm2.3 = update(glmm2, . ~ Factor1 * Factor2 * Factor3 + Order
   + (1|Speaker) + (1|Sentence), data = data3, family = "binomial")

# Compare the two models
  anova(glmm1.3, glmm2.3)

# Display the preferred model
  glmm1.3 # ... or glmm2.3, depending on the previous step

Interpretation of the analysis works the same way as above, except that you can now also see whether there is any main effect of Factor3 or interactions with it. Note that adding this third factor almost doubles the complexity of your model, which may mean that previously significant effects will disappear in all the noise. The default two-factor analysis involves a model with five parameters (Factor1, Factor2, their interaction, Order, and the intercept). With three factors, the model now has nine parameters (all of the above, plus a main effect of Factor3, two-way interactions each with Factor1 and Factor2, and the three-way interaction Factor1:Factor2:Factor3). Moreover, it is likely that at least some of these interactions are theoretically irrelevant. To remove nonsignificant interactions from the analysis, you can adapt the code given above.

(2) Testing the same sentences on two different groups of speakers. For example, you might want to compare children vs. adults, or native speakers vs. second-language learners. The data could be collected, using MiniJudge, simply by creating two sets of surveys (e.g. 10 surveys for children, 10 surveys for adults), and then you could use MiniJudge to generate two data files (e.g. Data1.txt and Data2.txt). To analyze the results, we need to add the between-groups factor of [SpeakerGroup]. The following code does not require that you run MiniJudge's built-in R analyses first. However, this time you do need to replace the generic names Factor1 and Factor2 by the actual factor names, since you'll be creating a new data set.

# Activate lme4 package (if it isn't already)
  library(lme4)

# Read in data; "T" indicates that the data files have headers
  data1 = read.table("Data1.txt",T)
  data2 = read.table("Data2.txt",T)

# Add SpeakerGroup factor (-1 = first group, 1 = second group)
  SpeakerGroup = -1
  data1.g = cbind(SpeakerGroup,data1)
  SpeakerGroup = 1
  data2.g = cbind(SpeakerGroup,data2)

# Combine two groups
  data.all = rbind(data1.g,data2.g)

# Activate variable names
  attach(data.all)

# Run by-subjects-only GLMM
  glmm1.g = lmer(Judgment ~ Factor1 * Factor2 * SpeakerGroup + Order
   + (1|Speaker), data = data.all, family = "binomial")

# Run by-subjects-and-items GLMM
  glmm2.g = lmer(Judgment ~ Factor1 * Factor2 * SpeakerGroup + Order
   + (1|Speaker) + (1|Sentence), data = data.all, family = "binomial")

# Compare the two models
  anova(glmm1.g, glmm2.g)

# Display the preferred model
  glmm1.g # ... or glmm2.g, depending on the previous step

Interpretation of the analyses works the same way as above, except that you can now also see if there is any main effect of SpeakerGroup (e.g. a positive effect would mean that the second group gave more "yes" judgments than the first group). More importantly, you can also test if there is any interaction with SpeakerGroup. Such an interaction would mean that the two groups showed different patterns in their judgments (e.g. one showed an effect of Factor1 while the other didn't). To explore the nature of this cross-group difference, you can use MiniJudge's own R code to analyze each group's data separately.

MiniJudge updates

Currently there are two implementations of MiniJudge. The first is MiniJudgeJS, which runs in the internet browser but requires the user to copy and paste text files by hand. The second is MiniJudgeJava, which runs offline and handles text files a bit more simply. Both hand the statistics over to R, but it is hoped that later versions will be able to run GLMM internally. All versions of MiniJudge will remain free and open-source.

A complementary program for analyzing small corpora, called MiniCorp, is also in the works, though it's primarily designed for phonology and morphology.

References

Agresti, A., Booth, J. G., Hobert, J. P., & Caffo, B. (2000). Random-effects modeling of categorical response data. Sociological Methodology, 30, 27-80.

Baayen, R. H. (2004). Statistics in psycholinguistics: A critique of some current gold standards. Mental Lexicon Working Papers, 1, 1-45. University of Alberta, Canada. Available at www.mpi.nl/world/persons/private/baayen/submitted/statistics.pdf

Baayen, R. H. (2008). Analyzing linguistic data: A practical introduction to statistics using R. Cambridge University Press.

Bard, E. G., Robertson, D., & Sorace, A. (1996). Magnitude estimation of linguistic acceptability. Language, 72 (1), 32-68.

Chomsky, N. (1965). Aspects of the theory of syntax. Cambridge, MA: MIT Press.

Clark, H. (1973). The language-as-fixed-effect fallacy: A critique of language statistics in psychological research. Journal of Verbal Learning and Verbal Behavior, 12, 335-359.

Clifton, C., Fanselow, G., & Frazier, L. (2006). Amnestying superiority violations: Processing multiple questions. Linguistic Inquiry, 37, (1), 51-68.

Cowart, W. (1997). Experimental syntax: Applying objective methods to sentence judgments. London: Sage Publications.

Crawley, M. J. 2005. Statistics: An introduction using R. Wiley.

Featherston, S. (2005a). That-trace in German. Lingua, 115 (9), 1277-1302.

Featherston, S. (2005b). Magnitude estimation and what it can do for your syntax: Some wh-constraints in German. Lingua, 115 (11),1525-1550.