Sunday, 16 September 2012



1. Differentiate measurement from evaluation

Measurement refers to the process by which the attributes or dimensions of some physical object are determined. One exception seems to be in the use of the word measure in determining the IQ of a person. The phrase "this test measures IQ" is commonly used. The same usage extends to measuring such things as attitudes or preferences. However, when we measure, we generally use some standard instrument to determine how big, tall, heavy, voluminous, hot, cold, fast, or straight something actually is. Standard instruments refer to instruments such as rulers, scales, thermometers, pressure gauges, etc. We measure to obtain information about what is. Such information may or may not be useful, depending on the accuracy of the instruments we use, and our skill at using them. There are few such instruments in the social sciences that approach the validity and reliability of, say, a 12" ruler. We measure how big a classroom is in terms of square feet, we measure the temperature of the room by using a thermometer, and we use multimeters to determine the voltage, amperage, and resistance in a circuit. In all of these examples, we are not assessing anything; we are simply collecting information relative to some established rule or standard. Assessment is therefore quite different from measurement, and has uses that suggest very different purposes. When used in a learning objective, the definition provided on the ADPRIMA site for the behavioral verb measure is: To apply a standard scale or measuring device to an object, series of objects, events, or conditions, according to practices accepted by those who are skilled in the use of the device or scale.

Evaluation is perhaps the most complex and least understood of the terms. Inherent in the idea of evaluation is "value." When we evaluate, we are engaging in some process that is designed to provide information that will help us make a judgment about a given situation. Generally, any evaluation process requires information about the situation in question. A situation is an umbrella term that takes into account such ideas as objectives, goals, standards, procedures, and so on. When we evaluate, we are saying that the process will yield information regarding the worthiness, appropriateness, goodness, validity, legality, etc., of something for which a reliable measurement or assessment has been made. For example, I often tell my students that if they wanted to determine the temperature of the classroom, they would need to get a thermometer and take several readings at different spots, and perhaps average the readings. That is simple measuring. The average temperature tells us nothing about whether or not it is appropriate for learning. To determine that, students would have to be polled in some reliable and valid way. That polling process is what evaluation is all about. A classroom average temperature of 75 degrees is simply information. It is the context of the temperature for a particular purpose that provides the criteria for evaluation. A temperature of 75 degrees may not be very good for some students, while for others, it is ideal for learning. We evaluate every day. Teachers, in particular, are constantly evaluating students, and such evaluations are usually done in the context of comparisons between what was intended (learning, progress, behavior) and what was obtained. When used in a learning objective, the definition provided on the ADPRIMA site for the behavioral verb evaluate is: To classify objects, situations, people, conditions, etc., according to defined criteria of quality. Indication of quality must be given in the defined criteria of each class category. Evaluation differs from general classification only in this respect.

2. Based on their goal, tests are classified into formative, summative, and diagnostic

Formative assessment
Assessment becomes formative when the evidence from assessment is used to adapt teaching to improve learning.  It is an integral part of the teaching and learning process. It is sometimes referred to as assessment for learning.
Since 2003 the main focus of the Assessment Resource Banks has been on formative assessment (assessment for learning). Many of the tasks are designed to find out not only a student’s response to a task, but also why the student made that response. The Teachers’ Guide pages provide information to help teachers analyse students’ responses and make decisions about what to do next.
Self- and peer-assessment
Self- and peer-assessment are an integral part of formative assessment. In this context students need to be actively involved in making judgements about their work and their progress towards understanding ideas.
For students to learn from assessment they not only have to gather evidence of their learning, but also:
  • analyse their work in terms of the goal/standard;
  • make decisions about what they need to do to improve;
  • know what to do to close the gap; and
  • monitor their progress towards achieving this.
Selecting ARB resources for self- and peer-assessment
Use keywords "self assessment" or "peer assessment" when searching for resources.
Use keyword "work samples" to search for annotated examples of student work. These can be used by students to
  • set goals for their own work
  • identify features of the task to attend to
  • identify features of exemplary work
  • practise critiquing work
The information in the Teachers’ Guide pages can be used by teachers or students to develop self- and peer-assessment tasks.
Diagnostic assessment
Diagnostic assessment uses assessment to identify the possible strengths and weaknesses of individual students. It may be specific, checking a particular skill or understanding, or broad, indicating at the beginning of a unit of work which areas need attention.
Selecting ARB resources for diagnostic assessment
If the assessment focus is specific, make sure that the assessment focus of the resource matches the area of interest.
It can sometimes be useful to select four or five resources with a similar focus, but with an escalating level of difficulty. Refer to the level of difficulty provided for many resources in the Teachers’ Guide.
Total marks are unimportant. Instead, analyse the student responses to identify patterns of strength and weakness, and plan to cater for these during teaching.

Pre- and post-tests

Teachers may select resources to assess levels of knowledge and understanding before a new phase of teaching. The same resources, or a selection of similar ones, may be administered at the end of the teaching phase to check progress.
Summative assessment
Summative assessment is intended to summarise student achievement at a particular time (Crooks, 2001).
Summative assessment can be used to
  • identify students’ achievement of learning
  • track progress of learning
  • compare against a standard
  • rank students
  • provide evidence of learning to parents
When results of a class, school or group of students are collated, summative assessment data can be used
  • to inform planning and resourcing for broad areas that need attention
  • to show shifts in achievement across the group
  • for accountability.
Summative assessment data can also be formative if it is used to provide feedback to the student that leads to further improvement.
Selecting ARB resources for summative assessment
  • Check that the assessment focus matches the learning outcomes/intentions for the student work.
  • Tasks chosen should accurately reflect the content of the work that has been taught.
  • Check that the level of difficulty fits the range of achievements expected by the class.
  • Consider whether the resource assesses "deep" learning or surface features. Does this match the learning intentions being assessed?
  • You may want to put together several resources that assess a broad range of learning intentions. Use the My Folder facility to sort a range of related assessments.

Think about what you want students to know about their performance.
  • A mark will tell them whether they were right or wrong. 
  • A total score may give them an indication of their overall level of achievement for what is being assessed.
  • Written or oral feedback related to the assessment focus may provide information that leads to further learning.
Teacher-made assessment
Teachers who need to prepare their own assessment materials may wish to adapt some of the approaches and ideas used in the ARB resources.
Student tasks have an MS Word version that enables the resource to be tailored to meet specific needs. A different picture can be inserted, a question deleted, or a word changed, as required.

Note: If you change a resource, the information and difficulty levels in the Teachers’ Guide pages may no longer be appropriate for the modified resource.
Good assessment practice includes providing a range of options for students to show what they know and can do.
National Standards
The ARB resources can be used to contribute to Overall Teacher Judgements (OTJ) about students. They can be used to probe areas of interest in greater depth. Completing an appropriate ARB task may
  • provide additional evidence that an aspect of a standard has been reached
  • confirm that an aspect of a standard has not yet been mastered
  • provide further information for focusing next teaching.
Many ARB tasks include ways to explore why a student gives a particular response. Once it is understood what is going wrong for a student it is much easier to decide what to do next.
Extensive notes for teachers accompany each ARB task to help them analyse student responses. Suggestions are made for possible next learning steps.

3. a. W
    b. W
    c. E

4. Mention the types of tests based on their establishment time

Types of Reliability

You learned in the Theory of Reliability that it's not possible to calculate reliability exactly. Instead, we have to estimate reliability, and this is always an imperfect endeavor. Here, I want to introduce the major reliability estimators and talk about their strengths and weaknesses.
There are four general classes of reliability estimates, each of which estimates reliability in a different way. They are:
  • Inter-Rater or Inter-Observer Reliability
    Used to assess the degree to which different raters/observers give consistent estimates of the same phenomenon.
  • Test-Retest Reliability
    Used to assess the consistency of a measure from one time to another.
  • Parallel-Forms Reliability
    Used to assess the consistency of the results of two tests constructed in the same way from the same content domain.
  • Internal Consistency Reliability
    Used to assess the consistency of results across items within a test.
Let's discuss each of these in turn.

Inter-Rater or Inter-Observer Reliability

Whenever you use humans as a part of your measurement procedure, you have to worry about whether the results you get are reliable or consistent. People are notorious for their inconsistency. We are easily distractible. We get tired of doing repetitive tasks. We daydream. We misinterpret.
So how do we determine whether two observers are being consistent in their observations? You probably should establish inter-rater reliability outside of the context of the measurement in your study. After all, if you use data from your study to establish reliability, and you find that reliability is low, you're kind of stuck. Probably it's best to do this as a side study or pilot study. And, if your study goes on for a long time, you may want to reestablish inter-rater reliability from time to time to assure that your raters aren't changing.
There are two major ways to actually estimate inter-rater reliability. If your measurement consists of categories -- the raters are checking off which category each observation falls in -- you can calculate the percent of agreement between the raters. For instance, let's say you had 100 observations that were being rated by two raters. For each observation, the rater could check one of three categories. Imagine that on 86 of the 100 observations the raters checked the same category. In this case, the percent of agreement would be 86%. OK, it's a crude measure, but it does give an idea of how much agreement exists, and it works no matter how many categories are used for each observation.
The other major way to estimate inter-rater reliability is appropriate when the measure is a continuous one. There, all you need to do is calculate the correlation between the ratings of the two observers. For instance, they might be rating the overall level of activity in a classroom on a 1-to-7 scale. You could have them give their rating at regular time intervals (e.g., every 30 seconds). The correlation between these ratings would give you an estimate of the reliability or consistency between the raters.
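To make the arithmetic concrete, here is a minimal Python sketch of both inter-rater estimators. The ratings are invented for illustration; they are not the data from the examples above.

```python
import numpy as np

# Categorical ratings: two raters assign one of three categories to each
# observation; percent agreement is the share of matching pairs.
rater_1 = ["A", "B", "A", "C", "B", "A", "C", "B", "A", "A"]
rater_2 = ["A", "B", "C", "C", "B", "A", "C", "A", "A", "A"]
agreement = sum(r1 == r2 for r1, r2 in zip(rater_1, rater_2)) / len(rater_1)
print(f"Percent agreement: {agreement:.0%}")  # 8 of 10 -> 80%

# Continuous ratings: two observers rate classroom activity on a 1-to-7
# scale at regular intervals; their Pearson correlation estimates reliability.
observer_1 = np.array([4, 5, 3, 6, 2, 5, 4, 7, 3, 6])
observer_2 = np.array([4, 6, 3, 5, 2, 5, 5, 7, 2, 6])
print(f"Inter-rater correlation: {np.corrcoef(observer_1, observer_2)[0, 1]:.2f}")
```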
You might think of this type of reliability as "calibrating" the observers. There are other things you could do to encourage reliability between observers, even if you don't estimate it. For instance, I used to work in a psychiatric unit where every morning a nurse had to do a ten-item rating of each patient on the unit. Of course, we couldn't count on the same nurse being present every day, so we had to find a way to assure that any of the nurses would give comparable ratings. The way we did it was to hold weekly "calibration" meetings where we would review all of the nurses' ratings for several patients and discuss why they chose the specific values they did. If there were disagreements, the nurses would discuss them and attempt to come up with rules for deciding when they would give a "3" or a "4" for a rating on a specific item. Although this was not an estimate of reliability, it probably went a long way toward improving the reliability between raters.

Test-Retest Reliability

We estimate test-retest reliability when we administer the same test to the same sample on two different occasions. This approach assumes that there is no substantial change in the construct being measured between the two occasions. The amount of time allowed between measures is critical. We know that if we measure the same thing twice, the correlation between the two observations will depend in part on how much time elapses between the two measurement occasions. The shorter the time gap, the higher the correlation; the longer the time gap, the lower the correlation. This is because the two observations are related over time -- the closer in time we get, the more similar the factors that contribute to error. Since this correlation is the test-retest estimate of reliability, you can obtain considerably different estimates depending on the interval.

[Figure: test-retest reliability]
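A minimal sketch of the computation, using invented scores for the same ten people tested on two occasions:

```python
import numpy as np

# Hypothetical scores for the same ten people taking the same test twice.
time_1 = np.array([72, 85, 90, 65, 78, 88, 70, 95, 60, 82])
time_2 = np.array([75, 82, 91, 68, 74, 90, 69, 93, 64, 80])

# The test-retest reliability estimate is the correlation between occasions.
print(f"Test-retest reliability: {np.corrcoef(time_1, time_2)[0, 1]:.2f}")
```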

Parallel-Forms Reliability

In parallel forms reliability you first have to create two parallel forms. One way to accomplish this is to create a large set of questions that address the same construct and then randomly divide the questions into two sets. You administer both instruments to the same sample of people. The correlation between the two parallel forms is the estimate of reliability. One major problem with this approach is that you have to be able to generate lots of items that reflect the same construct. This is often no easy feat. Furthermore, this approach assumes that the randomly divided halves are parallel or equivalent. Even by chance this will sometimes not be the case. The parallel forms approach is very similar to the split-half reliability described below. The major difference is that parallel forms are constructed so that the two forms can be used independently of each other and considered equivalent measures. For instance, we might be concerned about a testing threat to internal validity. If we use Form A for the pretest and Form B for the posttest, we minimize that problem. It would be even better if we randomly assigned individuals to receive Form A or B on the pretest and then switched them on the posttest. With split-half reliability we have an instrument that we wish to use as a single measurement instrument and only develop randomly split halves for purposes of estimating reliability.

[Figure: parallel-forms reliability]
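A sketch of the split-and-correlate procedure on simulated data; the item pool, sample size, and scores are all invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical data: 20 people answering a 10-item pool that measures
# a single construct (one row per person, one column per item).
ability = rng.normal(size=(20, 1))
items = ability + rng.normal(scale=0.8, size=(20, 10))

# Randomly divide the pool into two parallel forms of five items each.
order = rng.permutation(10)
form_a = items[:, order[:5]].sum(axis=1)
form_b = items[:, order[5:]].sum(axis=1)

# The parallel-forms reliability estimate is the correlation of form totals.
print(f"Parallel-forms reliability: {np.corrcoef(form_a, form_b)[0, 1]:.2f}")
```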

Internal Consistency Reliability

In internal consistency reliability estimation we use our single measurement instrument administered to a group of people on one occasion to estimate reliability. In effect we judge the reliability of the instrument by estimating how well the items that reflect the same construct yield similar results. We are looking at how consistent the results are for different items for the same construct within the measure. There are a wide variety of internal consistency measures that can be used.

Average Inter-item Correlation

The average inter-item correlation uses all of the items on our instrument that are designed to measure the same construct. We first compute the correlation between each pair of items, as illustrated in the figure. For example, if we have six items we will have 15 different item pairings (i.e., 15 correlations). The average inter-item correlation is simply the mean of all these correlations. In the example, we find an average inter-item correlation of .90, with the individual correlations ranging from .84 to .95.

[Figure: inter-item correlation matrix for the six-item example]
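A sketch of the calculation on simulated data (not the .90 example above):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 30 people answering six items on the same construct.
construct = rng.normal(size=(30, 1))
items = construct + rng.normal(scale=0.5, size=(30, 6))

corr = np.corrcoef(items, rowvar=False)        # 6 x 6 correlation matrix
pairs = corr[np.triu_indices_from(corr, k=1)]  # the 15 unique item pairings
print(f"Average inter-item correlation: {pairs.mean():.2f}")
```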

Average Item-Total Correlation

This approach also uses the inter-item correlations. In addition, we compute a total score for the six items and use that as a seventh variable in the analysis. The figure shows the six item-to-total correlations at the bottom of the correlation matrix. They range from .82 to .88 in this sample analysis, with the average of these at .85.
[Figure: item-total correlations for the six-item example]
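The same kind of simulated data can illustrate the item-total version (again, not the .85 example above):

```python
import numpy as np

rng = np.random.default_rng(0)
construct = rng.normal(size=(30, 1))
items = construct + rng.normal(scale=0.5, size=(30, 6))

# Add the total score as a seventh variable, then correlate it with each item.
total = items.sum(axis=1, keepdims=True)
corr = np.corrcoef(np.hstack([items, total]), rowvar=False)
item_total = corr[:-1, -1]  # correlation of each item with the total score
print(f"Average item-total correlation: {item_total.mean():.2f}")
```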

Split-Half Reliability

In split-half reliability we randomly divide all items that purport to measure the same construct into two sets. We administer the entire instrument to a sample of people and calculate the total score for each randomly divided half. The split-half reliability estimate, as shown in the figure, is simply the correlation between these two total scores. In the example it is .87.
[Figure: split-half reliability for the six-item example]
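A sketch following the definition above, where the estimate is the simple correlation between the two half totals (many textbooks additionally apply the Spearman-Brown correction to this correlation; the sketch follows the simpler definition given here):

```python
import numpy as np

rng = np.random.default_rng(1)
construct = rng.normal(size=(30, 1))
items = construct + rng.normal(scale=0.5, size=(30, 6))

# Randomly split the six items into two halves and total each half.
order = rng.permutation(6)
half_1 = items[:, order[:3]].sum(axis=1)
half_2 = items[:, order[3:]].sum(axis=1)

# The split-half estimate is the correlation between the two half totals.
print(f"Split-half reliability: {np.corrcoef(half_1, half_2)[0, 1]:.2f}")
```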

Cronbach's Alpha (α)

Imagine that we compute one split-half reliability, then randomly divide the items into another set of split halves and recompute, and keep doing this until we have computed all possible split-half estimates of reliability. Cronbach's Alpha is mathematically equivalent to the average of all possible split-half estimates, although that's not how we compute it. Notice that when I say we compute all possible split-half estimates, I don't mean that each time we go and measure a new sample! That would take forever. Instead, we calculate all split-half estimates from the same sample. Because we measured all of our sample on each of the six items, all we have to do is have the computer analysis do the random subsets of items and compute the resulting correlations. The figure shows several of the split-half estimates for our six-item example and lists them as SH with a subscript. Just keep in mind that although Cronbach's Alpha is equivalent to the average of all possible split-half correlations, we would never actually calculate it that way. Some clever mathematician (Cronbach, I presume!) figured out a way to get the mathematical equivalent a lot more quickly.
[Figure: several split-half estimates (SH with subscripts) for the six-item example]
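In practice alpha is computed directly from the item and total-score variances rather than by averaging split halves; a minimal sketch on simulated data:

```python
import numpy as np

rng = np.random.default_rng(2)
construct = rng.normal(size=(30, 1))
items = construct + rng.normal(scale=0.5, size=(30, 6))

# alpha = k/(k-1) * (1 - sum of item variances / variance of the total score)
k = items.shape[1]
item_var_sum = items.var(axis=0, ddof=1).sum()
total_var = items.sum(axis=1).var(ddof=1)
alpha = (k / (k - 1)) * (1 - item_var_sum / total_var)
print(f"Cronbach's alpha: {alpha:.2f}")
```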

Comparison of Reliability Estimators

Each of the reliability estimators has certain advantages and disadvantages. Inter-rater reliability is one of the best ways to estimate reliability when your measure is an observation. However, it requires multiple raters or observers. As an alternative, you could look at the correlation of ratings of the same single observer repeated on two different occasions. For example, let's say you collected videotapes of child-mother interactions and had a rater code the videos for how often the mother smiled at the child. To establish inter-rater reliability you could take a sample of videos and have two raters code them independently. To estimate test-retest reliability you could have a single rater code the same videos on two different occasions. You might use the inter-rater approach especially if you were interested in using a team of raters and you wanted to establish that they yielded consistent results. If you get a suitably high inter-rater reliability you could then justify allowing them to work independently on coding different videos. You might use the test-retest approach when you only have a single rater and don't want to train any others. On the other hand, in some studies it is reasonable to do both to help establish the reliability of the raters or observers.
The parallel forms estimator is typically only used in situations where you intend to use the two forms as alternate measures of the same thing. Both the parallel forms and all of the internal consistency estimators have one major constraint -- you have to have multiple items designed to measure the same construct. This is relatively easy to achieve in certain contexts like achievement testing (it's easy, for instance, to construct lots of similar addition problems for a math test), but for more complex or subjective constructs this can be a real challenge. If you do have lots of items, Cronbach's Alpha tends to be the most frequently used estimate of internal consistency.
The test-retest estimator is especially feasible in most experimental and quasi-experimental designs that use a no-treatment control group. In these designs you always have a control group that is measured on two occasions (pretest and posttest). The main problem with this approach is that you don't have any information about reliability until you collect the posttest and, if the reliability estimate is low, you're pretty much sunk.
Each of the reliability estimators will give a different value for reliability. In general, the test-retest and inter-rater reliability estimates will be lower in value than the parallel forms and internal consistency ones because they involve measuring at different times or with different raters. Since reliability estimates are often used in statistical analyses of quasi-experimental designs (e.g., the analysis of the nonequivalent group design), the fact that different estimates can differ considerably makes the analysis even more complex.

5. What do these terms mean: TOEFL, EQ test, GMAT, SAT?

The IELTS and TOEFL exams are known and feared by English language students worldwide. Both exams are used by universities to assess the English language ability of applicants. IELTS is widely used in the UK and Australia and also recognised by most American and Canadian universities, including Harvard Business School; TOEFL is used mainly by American universities, though also accepted in the UK and Australia. Next month we will examine the TOEFL exam; this month we will focus on IELTS.

IELTS stands for the International English Language Testing System. It operates on a nine-point band, where a nine indicates that the student has a level of English equivalent to a highly educated native speaker, and it tests all four skills (reading, writing, listening and speaking) in an academic context. Generally speaking, undergraduate students need to obtain a score of 5.5 overall to gain university admission and postgraduate students need a score of 6.5 overall. Some universities will ask for higher grades for all courses, or for specific programmes. The IELTS exam can be taken at centres worldwide and at frequent intervals throughout the year. Candidates pay a fee to take the exam.

For language school owners, preparation courses for the IELTS exam are frequently a major source of income. IELTS is becoming increasingly popular throughout Asia, and students usually need to attend a year-long preparation course to do well enough in the exam to apply to an overseas university. There is a wealth of preparation material to use on these courses, including plenty of mock exam material. Teachers who are asked to teach IELTS should be given a thorough briefing on the demands of the exam and a brief training course.



Emotional intelligence is the innate potential to feel, use, communicate, recognize, remember, describe, identify, learn from, manage, understand and explain emotions. An EQ (emotional quotient) test attempts to measure this capacity.



The GMAT (Graduate Management Admission Test) is a standardized exam used by business schools to assess how well students are likely to do in an MBA Program. The GMAT exam measures basic verbal, mathematical, and analytical writing skills.

The SAT is the most widely used standardized test for college admissions. The exam is created and administered by the College Board. It covers three subject areas: critical reading, mathematics and writing. Students have 3 hours and 45 minutes to complete the exam. Each section is worth 800 points, so the highest possible score is 2400. The exam is offered seven times a year: January, March, May, June, October, November and December. The SAT is designed to measure critical thinking and problem solving skills that are essential for success in college. The average scores for different colleges vary widely. The other standardized test accepted by most colleges is the ACT.
